What the Freakiness of 2025 in AI Tells Us About 2026

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Reasoning-heavy inference in 2025 improved benchmark performance, but it may narrow output diversity and may not create entirely new reasoning paths beyond what base models already contain.

Briefing

Reasoning-heavy AI made major benchmark gains in 2025—but the year also exposed a trade-off: pushing models to “think longer” can improve accuracy while narrowing the variety of answers. Across coding, chart/table analysis, video understanding, and general reasoning, longer inference and more tokens helped models surpass prior test results. Yet the improvement appears less like a brand-new capability and more like better selection among patterns already present in the underlying base model. That skepticism matters because it reframes what “progress” means: beating benchmarks is real, but it doesn’t automatically prove deeper general intelligence.

The same year delivered a second, more tangible shift: generative AI moved toward interactive, persistent worlds. Google DeepMind’s Genie 3, announced in August, can generate dynamic environments from text prompts or images, keeping scenes consistent for minutes at 720p resolution. The implication isn’t just prettier visuals; it’s the start of AI systems that can maintain continuity—so users can return to a world and find changes still there. That sets up a likely 2026 trajectory toward more realistic simulation and richer multimodal experiences, building on releases such as Veo 3.1, Sora 2, and text-to-speech and text-to-music systems.

But 2025 also normalized a darker byproduct: AI-generated “slop” became mainstream, and trust started to erode. The transcript points to viral videos that look authentic enough to fool large audiences—life-lesson clips and politically themed deepfakes—despite being fully synthetic. The concern isn’t only misinformation; it’s the growing difficulty of knowing what to believe when even everyday feeds can be fabricated.

Public sentiment and policy reflected that tension. A summer survey of Americans found a slight net-positive view of AI’s overall impact, but attitudes toward AI art were far less favorable. In the UK, the government’s opt-out plan for artists drew limited support, highlighting how quickly cultural and legal norms are being stress-tested. Even inside top labs, questions are emerging about what it means to “solve” creativity when tools can accelerate prototyping by an order of magnitude while potentially replacing parts of creative skill.

Institutional adoption accelerated as well. Governments and militaries used generative AI for tasks ranging from policy support to efficiency gains, with mixed outcomes. Meanwhile, the transcript argues that headline-grabbing model releases can mislead: the “expert-like” feel of newer systems doesn’t eliminate hallucinations, and intelligence isn’t a single axis. Even as usage climbed—from roughly 400 million weekly ChatGPT users in February to about 900 million later—providers also faced pressure to manage user appeal, pricing, and competitive dynamics.

Looking toward 2026, the core framework is “lateral productivity”: even if frontier models aren’t perfect experts, they can still help non-experts perform domain tasks far better than they could with search alone. The transcript also frames a debate about generality—whether intelligence scales along one main axis (more data/parameters) or whether real-world performance requires optimizing across countless edge cases. The proposed middle ground is that models improve by learning general patterns at internet scale, which is why benchmark performance can rise without collapsing into a never-ending maze of niche evaluations.

Finally, the most forward-looking prediction is that progress will shift from answering questions to automated information discovery. The transcript highlights systems like AlphaEvolve and Alpha Revolve—LLM-driven loops that propose code changes, run evaluations, and keep what works—plus “alpha software” research that uses AI to generate and test new scientific methods. The message is cautious but optimistic: the next leap may come from AI systems that iteratively find better ideas and software, not just from larger models or longer prompts.

Cornell Notes

2025’s biggest AI story was reasoning: models that spend more time and tokens thinking achieved stronger benchmark results, but the gains may come from better use of existing internal patterns rather than entirely new reasoning paths. At the same time, generative systems moved toward interactive, persistent experiences—Genie 3 can turn prompts or images into playable worlds that remain consistent for minutes. Trust, however, took a hit as AI-generated “slop” went mainstream, with synthetic videos fooling millions and complicating public confidence. For 2026, the transcript emphasizes “lateral productivity” (frontier models helping non-experts do real work) and argues that progress likely comes from general patterns learned at scale plus new loops for automated information discovery, such as LLMs paired with evaluation and evolution.

Why did “thinking longer” become the defining 2025 trend, and what limitation surfaced alongside it?

Reasoning-heavy approaches spent more inference time and tokens to improve accuracy across tasks like coding, chart/table analysis, video understanding, and general knowledge. The transcript credits this with major benchmark wins (including Gemini 3 Pro), but it also flags a trade-off: longer thinking can reduce output diversity. There’s skepticism that the method creates fundamentally new reasoning paths; instead, it may force models to select better answers from patterns already present in the base model.
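
To see why “better selection” can look like a capability gain, consider a toy best-of-N sketch (a minimal illustration; the answer distribution, verifier scores, and sample counts below are invented, not from the video). Every candidate is drawn from the same fixed base distribution, yet accuracy climbs with N while the set of surviving answers collapses:

```python
import random

random.seed(0)

ANSWERS = ["A", "B", "C", "D"]                    # "A" is the correct answer
BASE_P = [0.4, 0.3, 0.2, 0.1]                     # fixed base-model answer distribution
SCORE = {"A": 1.0, "B": 0.6, "C": 0.3, "D": 0.1}  # a verifier's preference ordering

def best_of_n(n):
    # Draw n candidates from the *same* base distribution and keep the one
    # the verifier scores highest -- selection, not new reasoning paths.
    candidates = random.choices(ANSWERS, weights=BASE_P, k=n)
    return max(candidates, key=SCORE.get)

for n in (1, 4, 16):
    picks = [best_of_n(n) for _ in range(2000)]
    accuracy = picks.count("A") / len(picks)
    print(f"n={n:2d}  accuracy={accuracy:.2f}  distinct answers={sorted(set(picks))}")
```

Accuracy rises from roughly the base rate at n=1 toward near-certainty at n=16, while the surviving answers converge on a single mode: the accuracy/diversity trade-off in miniature.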

What makes Genie 3’s “playable world” concept different from typical image or video generation?

Genie 3 generates dynamic worlds from text prompts or images, and the world isn’t purely ephemeral. It retains consistency for a few minutes at 720p resolution, enabling persistent changes—like carving initials into a tree and returning later to see them still there. That continuity is a step toward interactive simulation rather than one-off media generation.
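
As a loose sketch of what that persistence means in data terms (purely illustrative; Genie 3’s actual architecture has not been published, and this class is an assumption for explanation only), the difference from one-off generation is that user edits are stored and replayed on revisit:

```python
class PersistentWorld:
    """Toy model of a persistent generated world: edits survive across visits."""

    def __init__(self, seed_prompt: str):
        self.seed_prompt = seed_prompt
        self.edits: dict[str, str] = {}   # location -> change, kept between visits

    def modify(self, location: str, change: str) -> None:
        # Record a user action, e.g. carving initials into a tree.
        self.edits[location] = change

    def render(self, location: str) -> str:
        # Re-rendering a location replays any stored edit, giving continuity.
        scene = f"scene for '{self.seed_prompt}' at {location}"
        if location in self.edits:
            scene += f", showing {self.edits[location]}"
        return scene

world = PersistentWorld("a forest clearing")
world.modify("old oak", "carved initials")
print(world.render("old oak"))  # revisit later: the initials are still there
```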

How did AI-generated content change in 2025 in a way that affects society beyond technology?

The transcript argues that AI slop became mainstream, with synthetic videos recommended in feeds and accumulating millions of views while still being treated as real. Examples include a fake life-lessons video that drew large numbers of believing comments and a deepfake-style political clip that fooled even someone who regularly discusses deepfakes. The broader implication is a trust crisis: when convincing synthetic media is easy to produce, verification becomes harder for ordinary audiences.

What is the “benchmark gaming” concern behind the METR Time Horizons discussion?

The transcript notes that as a benchmark becomes popular, incentives grow to train or tune specifically for it. For METR Time Horizons, that could mean optimizing for the benchmark’s task structure (e.g., security-style capture-the-flag setups) so performance appears to follow the expected scaling curve. It also stresses that extrapolating from the benchmark is risky because the data become sparser at longer time horizons and the error bars are large.
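
A back-of-the-envelope calculation shows why those error bars matter (the bucket sizes and success rates below are invented for illustration and are not METR’s data): at a fixed observed success rate, the 95% interval for a proportion widens sharply as the number of tasks per horizon bucket shrinks.

```python
import math

# Hypothetical horizon buckets: (label, observed success rate, task count).
buckets = [
    ("<15 min", 0.80, 200),
    ("15m-1h", 0.55, 80),
    ("1h-4h", 0.35, 25),
    (">4h", 0.20, 8),
]

for label, p, n in buckets:
    se = math.sqrt(p * (1 - p) / n)          # standard error of a proportion
    lo, hi = max(p - 1.96 * se, 0.0), min(p + 1.96 * se, 1.0)
    print(f"{label:>8}: {p:.2f}  95% CI [{lo:.2f}, {hi:.2f}]  (n={n})")
```

With only a handful of long-horizon tasks, the interval spans a large fraction of the possible range, which is exactly why extrapolating the curve beyond the well-sampled region is risky.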

What does “lateral productivity” mean for 2026, and why does it matter even if models aren’t perfect?

Lateral productivity focuses on how frontier models can help people outside a domain reach useful competence quickly. The transcript cites a study where non-experts using frontier models to draft experimental protocols for viral recovery were far more likely to produce feasible protocols than a group using the internet alone. The point isn’t that models replace experts; it’s that even imperfect models can dramatically accelerate upskilling and task execution for non-experts.

How does the transcript connect 2026 progress to “automated information discovery”?

It argues that the next paradigm after getting answers is systems that iteratively discover better information and software. Examples include AlphaEvolve and Alpha Revolve: an LLM proposes code patches, automated tests evaluate them, and successful changes are retained while failures are discarded. The transcript claims this approach has produced practical gains—like improved scheduling for data centers and faster training runs on Google TPUs—suggesting progress will come from closed-loop experimentation, not just larger models.
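
The propose-evaluate-retain loop itself is simple to sketch. In the schematic below, a random numeric perturbation stands in for the LLM’s patch proposals and a toy objective stands in for the automated test suite; the names and mechanics are illustrative assumptions, not AlphaEvolve’s actual implementation:

```python
import random

def propose_patch(program: float) -> float:
    # Stand-in for an LLM proposing a code change (here: a random tweak).
    return program + random.uniform(-0.5, 0.5)

def evaluate(program: float) -> float:
    # Stand-in for automated tests/benchmarks; higher scores are better.
    # The optimum (unknown to the loop) sits at 3.0.
    return -(program - 3.0) ** 2

def evolve(generations: int = 300) -> tuple[float, float]:
    best = 0.0
    best_score = evaluate(best)
    for _ in range(generations):
        candidate = propose_patch(best)   # 1. propose a change
        score = evaluate(candidate)       # 2. run the evaluation
        if score > best_score:            # 3. retain only what works
            best, best_score = candidate, score
    return best, best_score

print(evolve())  # converges toward the optimum at 3.0
```

Even in this toy, the design point carries over: the evaluator, not the proposer, supplies the ground truth, so failed proposals are simply discarded rather than trusted.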

Review Questions

  1. Which 2025 evidence supports the claim that “thinking longer” improves accuracy, and what evidence supports the claim that it may reduce output diversity?
  2. What assumptions about intelligence does the transcript use to argue for a “middle world” between single-axis scaling and infinite benchmark optimization?
  3. How does automated information discovery (as described via AlphaEvolve/Alpha Revolve and alpha software) differ from relying on LLMs to answer questions directly?

Key Points

  1. Reasoning-heavy inference in 2025 improved benchmark performance, but it may narrow output diversity and may not create entirely new reasoning paths beyond what base models already contain.

  2. Genie 3’s key leap is persistent, consistent world generation for minutes, enabling interactive experiences rather than one-off media outputs.

  3. AI-generated “slop” becoming mainstream is a trust problem: synthetic videos can fool large audiences and even people familiar with deepfakes.

  4. Public attitudes toward AI are mixed—overall sentiment can be slightly positive while AI art and creative concerns face stronger resistance and policy friction.

  5. Benchmark results require context: popular benchmarks can be gamed, and long-horizon measurements can have weak signals and large error bars.

  6. For 2026, “lateral productivity” emphasizes that even sub-expert models can help non-experts perform real tasks and upskill faster than search alone.

  7. The most optimistic 2026 direction centers on automated information discovery—LLMs paired with evaluation loops that generate, test, and retain better code and scientific methods.

Highlights

Longer thinking boosted accuracy across many tasks in 2025, but the same approach may reduce diversity—suggesting gains can come from better selection, not necessarily new reasoning capabilities.
Genie 3 turns prompts or images into worlds that stay consistent for minutes, enabling persistent changes like carving into a tree and revisiting later.
AI slop went mainstream in 2025, with synthetic videos accumulating millions of views and comments from people treating them as real.
METR Time Horizons performance is hard to extrapolate because longer time ranges rely on fewer samples and can invite benchmark-specific optimization.
The transcript’s central 2026 bet is automated information discovery: LLMs that propose changes, run tests, and iterate toward better solutions.

Topics

  • Reasoning Models
  • Persistent World Generation
  • AI Slop and Trust
  • Benchmark Gaming
  • Lateral Productivity
  • Automated Information Discovery
  • General Intelligence Debate
