
The AI Revolution Hiding in Obscure Research

Sabine Hossenfelder · 5 min read

Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Early user feedback on GPT-4.5, o3, Claude 3.7, and Gemini 2.0 is portrayed as underwhelming, with persistent reliability issues and familiar LLM failure modes.

Briefing

AI progress is hitting a wall for now: recent releases from major labs look incremental on the surface, and early users describe them as underwhelming, even when they sound more “human” in demos. OpenAI’s GPT-4.5 drew praise for feeling like “talking to a thoughtful person,” but many first users dismissed it as a “shiny new coat of paint on the same old car,” reporting disappointment with reliability and basic tasks. OpenAI’s o3, meanwhile, is said to hallucinate more than GPT-4.5, still struggle with simple counting (like the letters in “strawberry”), and repeat classic large-language-model failure patterns. Similar complaints follow Anthropic’s Claude 3.7 and Google’s Gemini 2.0, while Meta’s Llama update is described as worse than its predecessor: an outcome framed as notable rather than reassuring.

The central claim is that this slowdown reflects diminishing returns from today’s dominant model type: large language models. With LLMs, gains increasingly come from better prompting, evaluation tweaks, and incremental architectural or training improvements—useful, but not the leap that would produce broadly reliable, economically useful intelligence. A parallel anecdote from Dean Valentine, citing conversations with YC founders, describes a pattern: flashy announcements and strong benchmarks, followed by mediocre real-world performance and little evidence of meaningful improvement since mid-year. Industry leaders echo the idea that the next breakthrough requires a new paradigm rather than more scaling. Ilya Sutskever points to a shift from “scaling” in the 2010s back to “wonder and discovery,” while Google CEO Sundar Pichai predicts a slow 2025. Yann LeCun adds that the most promising work is currently trapped in “obscure academic papers,” implying the next step is not yet mainstream.

That next step, according to the argument, is emerging from “world models”—systems that learn by interacting with the world and can keep learning after training. In the meantime, companies are trying to “unhobble” current LLMs: giving them better memory, letting them use external tools for math and figures, and adding mechanisms to improve reasoning. Meta’s “Large Concept Models” aim to combine different verbal expressions of the same logical relation, while “Meta Chains-of-Thought” is positioned as a way for models to evaluate multiple reasoning paths.

But the real pivot is toward models that build internal representations of reality through interaction. DeepMind’s Genie 2, announced in December, is trained on a large video dataset and generates interactive 3D environments. The idea is to place AI agents into these virtual worlds so they can learn to learn. NVIDIA’s Cosmos platform, introduced in January, similarly generates physics-consistent 3D models, helping produce more coherent videos and—more importantly—providing training environments that teach other models how reality behaves. The reasoning is evolutionary: human intelligence developed through continuous interaction with the physical world, forming mental models from experience. The forecast is that the next AI revolution will come from systematic upgrades to reasoning and learning dynamics, not just more training—arriving first in academic work before it becomes widely visible.

Cornell Notes

Recent AI releases from major labs are drawing criticism for limited real-world gains, even when demos sound more natural. The slowdown is attributed to diminishing returns from large language models: incremental improvements can raise benchmarks but often fail to translate into economic usefulness or general reliability. The next leap is expected to come from a new paradigm—world models—that learn by interacting with environments and can continue learning after training. DeepMind’s Genie 2 and NVIDIA’s Cosmos are cited as steps toward this direction by generating interactive, physics-consistent 3D worlds for agent training. If world-model approaches deliver reliable “reality learning,” they could enable the next major jump toward human-level capabilities.

Why do recent model updates feel disappointing despite better demos?

The transcript contrasts marketing-style improvements with user-reported behavior. GPT-4.5 is described as more conversational, yet early users call it a “shiny new coat of paint on the same old car” and complain about underwhelming performance. o3 is said to hallucinate more than GPT-4.5, still fail basic counting (e.g., letters in “strawberry”), and fall for familiar LLM riddles. Similar patterns are attributed to Claude 3.7 and Gemini 2.0, while a Llama update is portrayed as worse than its prior version, suggesting that surface-level capability gains aren’t translating into dependable competence.
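The counting failure the transcript cites is striking precisely because the task is trivial for ordinary code. A minimal sketch (the function name is illustrative, not from the transcript):

```python
# The "strawberry" test: count occurrences of a letter in a word.
# Trivial for code, yet cited as a classic LLM failure mode.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # → 3
```

The gap between a one-line string operation and a frontier model's answer is exactly the kind of reliability complaint early users report.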

What evidence is offered that the industry is hitting diminishing returns?

A key anecdote comes from Dean Valentine’s essay, quoting conversations with YC founders running AI application startups. The recurring pattern: model announcements and strong benchmarks, followed by mediocre evaluated performance and little evidence of meaningful improvement since August. The transcript uses this to argue that current LLM categories are approaching a limit where additional effort yields smaller practical returns.

How do industry leaders frame the timing of the next breakthrough?

Ilya Sutskever (via a Reuters interview) characterizes the shift from the “age of scaling” to the “age of wonder and discovery,” implying a search for fundamentally new approaches. Google CEO Sundar Pichai predicts a slow year for AI in 2025. Yann LeCun adds that the most exciting work is currently in “obscure academic papers,” suggesting the next step exists but hasn’t become mainstream yet.

What interim fixes are being applied to current LLMs?

The transcript lists efforts to “unhobble” LLMs: improving memory, enabling models to use external software tools for tasks like math and generating figures, and adding reasoning-oriented mechanisms. Meta’s “Large Concept Models” are described as combining different verbal expressions of the same logical relation, while “Meta Chains-of-Thought” is described as letting a model evaluate multiple reasoning lines, aiming to grasp logical relations faster and better.
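The tool-use idea can be sketched generically: the model emits a structured call, and a thin runtime executes it and returns the result to the model. All names below are illustrative, not any vendor's actual API:

```python
import math

# Illustrative tool registry: a runtime maps tool names to trusted functions,
# so the model delegates math instead of "reasoning" about it in text.
TOOLS = {
    "sqrt": math.sqrt,
    "add": lambda a, b: a + b,
}

def run_tool_call(call: dict) -> float:
    """Execute a structured call like {"tool": "sqrt", "args": [16.0]}."""
    fn = TOOLS[call["tool"]]
    return fn(*call["args"])

# Instead of computing in text, a model can emit this structured request:
print(run_tool_call({"tool": "sqrt", "args": [16.0]}))  # → 4.0
```

The design point is that the model only has to produce a well-formed request; the arithmetic itself runs in deterministic code, which is why tool use is a popular interim fix for reliability.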

What are “world models,” and why are they treated as the next paradigm?

World models are described as systems that learn by interacting with the world and can continue learning after training. The transcript argues this is a qualitative shift from LLMs that mainly learn from static text. DeepMind’s Genie 2 is cited as generating interactive 3D environments from a large video dataset, enabling agents to learn inside simulated worlds. NVIDIA’s Cosmos is cited as generating physics-consistent 3D models that improve video coherence and can train other models to learn how reality works.

How do world models connect to human intelligence development?

The transcript draws an analogy to human evolution: intelligence emerged through interaction with the physical world and the construction of mental models from experience. DeepMind calls Genie 2-style systems “foundation world models,” and the argument is that such models will likely play a major role in the next AI revolution by making learning more grounded in reality.

Review Questions

  1. What specific user-reported failures are cited for GPT-4.5 and o3, and how do they support the claim of diminishing returns?
  2. Which improvements target current LLM limitations (memory, tools, reasoning), and how do they differ from the proposed world-model paradigm?
  3. Why are interactive, physics-consistent 3D environments (Genie 2, Cosmos) considered more promising than additional LLM scaling?

Key Points

  1. Early user feedback on GPT-4.5, o3, Claude 3.7, and Gemini 2.0 is portrayed as underwhelming, with persistent reliability issues and familiar LLM failure modes.

  2. A pattern from AI application startups (strong benchmarks followed by mediocre real-world performance) is used to argue that large language models are nearing diminishing returns.

  3. Industry leaders suggest the next major leap depends on a new paradigm rather than incremental upgrades, with 2025 framed as a slower transition period.

  4. Interim LLM fixes include better memory, tool use for math and figures, and reasoning-focused mechanisms like Meta’s “Large Concept Models” and “Meta Chains-of-Thought.”

  5. World models are positioned as the next paradigm: systems that learn through interaction and can keep learning after training.

  6. DeepMind’s Genie 2 and NVIDIA’s Cosmos are cited as concrete steps toward world models by generating interactive, physics-consistent 3D environments for agent training.

  7. The transcript links world-model learning to how human intelligence evolved: building mental models from continuous interaction with the physical world.

Highlights

GPT-4.5 is described as more conversational, yet early users call it a “shiny new coat of paint on the same old car,” while o3 is criticized for increased hallucinations and basic counting failures.
The diminishing-returns argument is reinforced by YC-founder anecdotes: impressive announcements and benchmarks don’t translate into economic usefulness or generality.
The proposed breakthrough is not more LLM scaling but world models—interactive, physics-consistent environments that enable ongoing learning after training.
Genie 2 and Cosmos are presented as practical routes to reality-grounded learning, aiming to teach models how the world works rather than only how language works.

Topics

  • LLM Diminishing Returns
  • World Models
  • Reasoning Upgrades
  • Interactive 3D Environments
  • AI Benchmarks vs Usefulness