What the Freakiness of 2025 in AI Tells Us About 2026
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Reasoning-heavy inference in 2025 improved benchmark performance, but it may narrow output diversity and may not create entirely new reasoning paths beyond what base models already contain.
Briefing
Reasoning-heavy AI made major benchmark gains in 2025—but the year also exposed a trade-off: pushing models to “think longer” can improve accuracy while narrowing the variety of answers. Across coding, chart/table analysis, video understanding, and general reasoning, longer inference and more tokens helped models surpass prior test results. Yet the improvement appears less like a brand-new capability and more like better selection among patterns already present in the underlying base model. That skepticism matters because it reframes what “progress” means: beating benchmarks is real, but it doesn’t automatically prove deeper general intelligence.
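The selection-not-new-capability point can be made concrete with a toy simulation (not from the transcript; the answer pool, scores, and best-of-n framing are illustrative assumptions). If "thinking longer" works like drawing more candidates from the base model and keeping the best one, accuracy rises while the spread of distinct answers collapses:

```python
import random
from collections import Counter

random.seed(0)

# Toy stand-in for a base model: answers it already contains, each with
# a fixed quality score. "Thinking longer" is modeled as best-of-n
# selection over more draws from this same pool.
ANSWER_POOL = [("A", 0.9), ("B", 0.7), ("C", 0.6), ("D", 0.4)]

def sample_answer():
    return random.choice(ANSWER_POOL)

def best_of_n(n):
    # Keep the highest-scoring candidate among n draws.
    return max((sample_answer() for _ in range(n)), key=lambda a: a[1])

for n in (1, 4, 16):
    picks = Counter(best_of_n(n)[0] for _ in range(1000))
    print(f"n={n:2d}  answer distribution: {dict(picks)}")
```

With n=1 all four answers appear; by n=16 nearly every run returns "A". Average quality improves, but no answer outside the original pool ever appears, and output diversity narrows, which is the trade-off the briefing describes.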
The same year delivered a second, more tangible shift: generative AI moved toward interactive, persistent worlds. Google DeepMind’s Genie 3, announced in August, can generate dynamic environments from text prompts or images, keeping scenes consistent for minutes at 720p resolution. The implication isn’t just prettier visuals; it’s the start of AI systems that can maintain continuity—so users can return to a world and find changes still there. That sets up a likely 2026 trajectory toward more realistic simulation and richer multimodal experiences, building on releases such as Veo 3.1, Sora 2, and text-to-speech and text-to-music systems.
But 2025 also normalized a darker byproduct: AI-generated “slop” became mainstream, and trust started to erode. The transcript points to viral videos that look authentic enough to fool large audiences—life-lesson clips and politically themed deepfakes—despite being fully synthetic. The concern isn’t only misinformation; it’s the growing difficulty of knowing what to believe when even everyday feeds can be fabricated.
Public sentiment and policy reflected that tension. A summer survey of Americans found a slight net-positive view of AI’s overall impact, but attitudes toward AI art were far less favorable. In the UK, the government’s opt-out plan for artists drew limited support, highlighting how quickly cultural and legal norms are being stress-tested. Even inside top labs, questions are emerging about what it means to “solve” creativity when tools can accelerate prototyping by an order of magnitude while potentially replacing parts of creative skill.
Institutional adoption accelerated as well. Governments and militaries used generative AI for tasks ranging from policy support to efficiency gains, with mixed outcomes. Meanwhile, the transcript argues that headline-grabbing model releases can mislead: the “expert-like” feel of newer systems doesn’t eliminate hallucinations, and intelligence isn’t a single axis. Even as usage climbed—from roughly 400 million weekly ChatGPT users in February to about 900 million later—providers also faced pressure to manage user appeal, pricing, and competitive dynamics.
Looking toward 2026, the core framework is “lateral productivity”: even if frontier models aren’t perfect experts, they can still help non-experts perform domain tasks far better than they could with search alone. The transcript also frames a debate about generality—whether intelligence scales along one main axis (more data/parameters) or whether real-world performance requires optimizing across countless edge cases. The proposed middle ground is that models improve by learning general patterns at internet scale, which is why benchmark performance can rise without collapsing into a never-ending maze of niche evaluations.
Finally, the most forward-looking prediction is that progress will shift from answering questions to automated information discovery. The transcript highlights systems like Alpha Evolve and Alpha Revolve—LLM-driven loops that propose code changes, run evaluations, and keep what works—plus “alpha software” research that uses AI to generate and test new scientific methods. The message is cautious but optimistic: the next leap may come from AI systems that iteratively find better ideas and software, not just from larger models or longer prompts.
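The propose–evaluate–keep loop behind systems like Alpha Evolve can be sketched in a few lines (a minimal sketch, not DeepMind's implementation: the random-perturbation proposer stands in for an LLM, and the toy objective stands in for a real evaluation harness):

```python
import random

random.seed(1)

# AlphaEvolve-style loop in miniature: a proposer suggests an edit to
# the current best candidate, an evaluator scores it, and only
# improvements are kept. Here the "program" is just a coefficient
# vector and the evaluator is a toy squared-error objective.
TARGET = [3.0, -1.0, 2.0]

def evaluate(candidate):
    # Higher is better: negative squared error against the target.
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def propose(candidate):
    # Stand-in for an LLM proposing a code change: perturb one entry.
    i = random.randrange(len(candidate))
    edited = list(candidate)
    edited[i] += random.uniform(-0.5, 0.5)
    return edited

best = [0.0, 0.0, 0.0]
best_score = evaluate(best)
for _ in range(2000):
    candidate = propose(best)
    score = evaluate(candidate)
    if score > best_score:  # keep what works, discard the rest
        best, best_score = candidate, score

print(best, best_score)
```

The design point is that the loop's intelligence lives in the proposer and the evaluator, not the loop itself: swap the perturbation for an LLM that rewrites code and the toy objective for a test suite or benchmark, and the same keep-what-works skeleton becomes the automated discovery pattern the transcript describes.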
Cornell Notes
2025’s biggest AI story was reasoning: models that spend more time and tokens thinking achieved stronger benchmark results, but the gains may come from better use of existing internal patterns rather than entirely new reasoning paths. At the same time, generative systems moved toward interactive, persistent experiences—Genie 3 can turn prompts or images into playable worlds that remain consistent for minutes. Trust, however, took a hit as AI-generated “slop” went mainstream, with synthetic videos fooling millions and complicating public confidence. For 2026, the transcript emphasizes “lateral productivity” (frontier models helping non-experts do real work) and argues that progress likely comes from general patterns learned at scale plus new loops for automated information discovery, such as LLMs paired with evaluation and evolution.
Why did “thinking longer” become the defining 2025 trend, and what limitation surfaced alongside it?
What makes Genie 3’s “playable world” concept different from typical image or video generation?
How did AI-generated content change in 2025 in a way that affects society beyond technology?
What is the “benchmark gaming” concern behind the METR Time Horizons discussion?
What does “lateral productivity” mean for 2026, and why does it matter even if models aren’t perfect?
How does the transcript connect 2026 progress to “automated information discovery”?
Review Questions
- Which 2025 evidence supports the claim that “thinking longer” improves accuracy, and what evidence supports the claim that it may reduce output diversity?
- What assumptions about intelligence does the transcript use to argue for a “middle world” between single-axis scaling and infinite benchmark optimization?
- How does automated information discovery (as described via Alpha Evolve/Alpha Revolve and alpha software) differ from relying on LLMs to answer questions directly?
Key Points
1. Reasoning-heavy inference in 2025 improved benchmark performance, but it may narrow output diversity and may not create entirely new reasoning paths beyond what base models already contain.
2. Genie 3’s key leap is persistent, consistent world generation for minutes, enabling interactive experiences rather than one-off media outputs.
3. AI-generated “slop” becoming mainstream is a trust problem: synthetic videos can fool large audiences and even people familiar with deepfakes.
4. Public attitudes toward AI are mixed—overall sentiment can be slightly positive while AI art and creative concerns face stronger resistance and policy friction.
5. Benchmark results require context: popular benchmarks can be gamed, and long-horizon measurements can have weak signals and large error bars.
6. For 2026, “lateral productivity” emphasizes that even sub-expert models can help non-experts perform real tasks and upskill faster than search alone.
7. The most optimistic 2026 direction centers on automated information discovery—LLMs paired with evaluation loops that generate, test, and retain better code and scientific methods.