The AI Revolution Hiding in Obscure Research
Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
AI progress is hitting a wall for now: recent releases from the major labs look incremental on the surface, and early users describe them as underwhelming even when they sound more “human” in demos. OpenAI’s GPT-4.5 drew praise for feeling like “talking to a thoughtful person,” but many early users reported it as a “shiny new coat of paint on the same old car,” with disappointment over reliability and basic tasks. OpenAI’s o3, meanwhile, is said to hallucinate more than GPT-4.5, to still struggle with simple counting (like the letters in “strawberry”), and to repeat classic large-language-model failure patterns. Similar complaints follow Anthropic’s Claude 3.7 and Google’s Gemini 2.0, while Meta’s Llama update is described as worse than its predecessor, an outcome framed as notable rather than reassuring.
The central claim is that this slowdown reflects diminishing returns from today’s dominant model type, the large language model. With LLMs, gains increasingly come from better prompting, evaluation tweaks, and incremental architectural or training improvements: useful, but not the leap that would produce broadly reliable, economically useful intelligence. A parallel anecdote from Dean Valentine, citing conversations with YC founders, describes a recurring pattern: flashy announcements and strong benchmarks, followed by mediocre real-world performance and little evidence of meaningful improvement since mid-year. Industry leaders echo the idea that the next breakthrough requires a new paradigm rather than more scaling. Ilya Sutskever points to a shift from the “scaling” of the 2010s back to “wonder and discovery,” Google CEO Sundar Pichai predicts a slow 2025, and Yann LeCun adds that the most promising work is currently trapped in “obscure academic papers,” implying the next step is not yet mainstream.
That next step, according to the argument, is emerging from “world models”: systems that learn by interacting with the world and can keep learning after training. In the meantime, companies are trying to “unhobble” current LLMs by giving them better memory, letting them call external tools for math and figures, and adding mechanisms that improve reasoning. Meta’s “Large Concept Models” aim to combine different verbal expressions of the same logical relation, while “Meta Chains-of-Thought” is positioned as a way for models to evaluate multiple reasoning paths.
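To make the tool-use idea concrete, here is a minimal, hypothetical sketch. The names (`CALC`, `safe_eval`, `answer_with_tools`) are illustrative inventions, not any vendor’s API: a harness intercepts a calculator call in the model’s draft and splices in an exact result, sidestepping the unreliable in-weights arithmetic described above.

```python
import ast
import operator
import re

# Hypothetical harness, not any vendor's API: the model drafts text that
# contains a tool call such as CALC(23 * 47); the harness evaluates the
# expression exactly and splices the result back into the answer, instead
# of trusting the model's own arithmetic.

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer_with_tools(model_draft: str) -> str:
    """Replace every CALC(...) in the model's draft with the exact value."""
    return re.sub(r"CALC\(([^)]+)\)",
                  lambda m: str(safe_eval(m.group(1))), model_draft)

# Stand-in for an LLM response; a real system would prompt the model to
# emit CALC(...) whenever arithmetic is required.
print(answer_with_tools("23 * 47 = CALC(23 * 47)"))  # 23 * 47 = 1081
```

The design point is that the model only has to decide *when* to call the tool; the exactness comes from the tool, which is why this counts as “unhobbling” rather than a new paradigm.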
But the real pivot is toward models that build internal representations of reality through interaction. DeepMind’s Genie 2, announced in December, is trained on a large video dataset and generates interactive 3D environments; the idea is to place AI agents in these virtual worlds so they can learn how to learn. NVIDIA’s Cosmos platform, introduced in January, similarly generates physics-consistent 3D models, helping produce more coherent videos and, more importantly, providing training environments that teach other models how reality behaves. The reasoning is evolutionary: human intelligence developed through continuous interaction with the physical world, forming mental models from experience. The forecast is that the next AI revolution will come from systematic upgrades to reasoning and learning dynamics rather than from more training alone, and that it will arrive first in academic work before it becomes widely visible.
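The interaction loop behind this idea fits in a small, self-contained sketch. Everything here is illustrative: `ToyWorld` and `ToyAgent` are invented stand-ins, not the Genie 2 or Cosmos APIs. The agent acts, observes the consequence, and updates a predictive model of the world’s dynamics from its own prediction error.

```python
import random

# Illustrative only: a minimal act-observe-update loop in which an agent
# learns a forward model of its environment from interaction, the pattern
# the briefing attributes to world-model training.

class ToyWorld:
    """Stand-in for a generated environment: state drifts by the action."""
    def __init__(self):
        self.state = 0.0
    def step(self, action: float) -> float:
        self.state += action + random.gauss(0.0, 0.1)  # noisy dynamics
        return self.state

class ToyAgent:
    """Learns a one-parameter forward model: next ≈ state + k * action."""
    def __init__(self):
        self.k = 0.0  # current guess at how actions move the state
    def act(self) -> float:
        return random.choice([-1.0, 1.0])  # pure exploration
    def update(self, state, action, next_state, lr=0.05):
        error = next_state - (state + self.k * action)  # prediction error
        self.k += lr * error * action  # gradient step on squared error

world, agent = ToyWorld(), ToyAgent()
for _ in range(500):
    s = world.state
    a = agent.act()
    s_next = world.step(a)          # interact with the world
    agent.update(s, a, s_next)      # learn from the outcome

print(f"learned action effect k ≈ {agent.k:.2f} (true value 1.0)")
```

Scaled up from one parameter to neural world models and from a scalar state to generated 3D environments, this same act-observe-update pattern is what the briefing frames as the candidate next paradigm.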
Cornell Notes
Recent AI releases from the major labs are drawing criticism for limited real-world gains, even when demos sound more natural. The slowdown is attributed to diminishing returns from large language models: incremental improvements can raise benchmark scores but often fail to translate into economic usefulness or general reliability. The next leap is expected to come from a new paradigm, world models, which learn by interacting with environments and can continue learning after training. DeepMind’s Genie 2 and NVIDIA’s Cosmos are cited as steps in this direction because they generate interactive, physics-consistent 3D worlds for agent training. If world-model approaches deliver reliable “reality learning,” they could enable the next major jump toward human-level capabilities.
Why do recent model updates feel disappointing despite better demos?
What evidence is offered that the industry is hitting diminishing returns?
How do industry leaders frame the timing of the next breakthrough?
What interim fixes are being applied to current LLMs?
What are “world models,” and why are they treated as the next paradigm?
How do world models connect to human intelligence development?
Review Questions
- What specific user-reported failures are cited for GPT-4.5 and o3, and how do they support the claim of diminishing returns?
- Which improvements target current LLM limitations (memory, tools, reasoning), and how do they differ from the proposed world-model paradigm?
- Why are interactive, physics-consistent 3D environments (Genie 2, Cosmos) considered more promising than additional LLM scaling?
Key Points
1. Early user feedback on GPT-4.5, o3, Claude 3.7, and Gemini 2.0 is portrayed as underwhelming, with persistent reliability issues and familiar LLM failure modes.
2. A pattern from AI application startups, strong benchmarks followed by mediocre real-world performance, is used to argue that large language models are nearing diminishing returns.
3. Industry leaders suggest the next major leap depends on a new paradigm rather than incremental upgrades, with 2025 framed as a slower transition period.
4. Interim LLM fixes include better memory, tool use for math and figures, and reasoning-focused mechanisms like Meta’s “Large Concept Models” and “Meta Chains-of-Thought.”
5. World models are positioned as the next paradigm: systems that learn through interaction and can keep learning after training.
6. DeepMind’s Genie 2 and NVIDIA’s Cosmos are cited as concrete steps toward world models because they generate interactive, physics-consistent 3D environments for agent training.
7. The transcript links world-model learning to how human intelligence evolved: building mental models from continuous interaction with the physical world.