GPT-5: Have We Finally Hit The AI Scaling Wall?
Based on Sabine Hossenfelder's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Revisiting scaling laws suggests error avoidance has a long computational tail, making large reliability gains extremely compute-intensive.
Briefing
GPT-5’s lukewarm reception has reignited a long-running debate: has AI scaling hit a “wall”? Two recent research threads reach different verdicts on whether such a wall exists, but both land on the same practical message: getting reliable performance may demand so much compute that progress will look stalled to real-world users.
One paper revisits scaling laws for large language models and challenges the optimism that drove earlier forecasts of rapid capability jumps. The key issue is not that models stop improving, but that reducing errors becomes dramatically more expensive as reliability targets tighten. The authors argue that earlier scaling analyses undercount the computational “tail” of error avoidance—especially when the goal is to suppress mistakes by orders of magnitude. Their extrapolations suggest that achieving even a single order-of-magnitude reduction in errors could require roughly 10 to 20 times more compute. In their framing, raising model reliability to meet the standards of scientific inquiry becomes “intractable by any reasonable measure.” Even if that bar is debated, the mechanism offers a plausible explanation for a gap seen in practice: researchers may observe steady scaling, while users experience persistent failures because humans notice and amplify the long tail of errors.
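To make the compounding concrete, here is a minimal sketch under a simple power-law assumption (an illustration, not the paper’s exact model): if each tenfold reduction in error rate costs a roughly constant compute multiplier, then suppressing errors by several orders of magnitude multiplies those costs together.

```python
# Illustration only: assumes a power-law error/compute relationship,
# not the paper's exact formulation. If one order-of-magnitude error
# reduction costs a fixed compute multiplier k, then n orders of
# magnitude compound to k**n.

def total_compute_multiplier(per_decade_cost: float, decades: int) -> float:
    """Compute multiplier needed to shrink the error rate by 10**decades."""
    return per_decade_cost ** decades

for k in (10, 20):          # the rough 10-20x per-decade range cited above
    for n in (1, 2, 4):     # 10x, 100x, 10,000x fewer errors
        print(f"k={k:>2}x per decade, {n} decade(s): "
              f"{total_compute_multiplier(k, n):,.0f}x more compute")
```

Read this way, even the modest-sounding 10–20× per-decade figure compounds quickly: four orders of magnitude of error suppression already implies roughly 10,000–160,000 times more compute, which is the sense in which the error tail can look like a practical wall.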
A second paper delivers a more direct blow to the idea that scaling alone yields robust reasoning. Using smaller models trained to produce “reasoning chains,” the study tests whether step-by-step reasoning generalizes to tasks that are out of distribution. The findings suggest that chain-of-thought behavior remains brittle: reasoning steps do not reliably transfer under task transformations. The authors describe large language model reasoning as a “brittle mirage,” in which the model generates reasoning-like text that does not reliably track correct outcomes. In some cases, reasoning steps are internally consistent yet still lead to wrong answers; in others, the result is right for the wrong reasons. Overall, the work argues that these systems behave more like sophisticated simulators of reasoning patterns than like principled reasoners.
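For intuition about what an out-of-distribution “task transformation” looks like in practice, here is a toy evaluation harness (purely illustrative; the tasks, model, and metrics are stand-ins, not taken from the paper): the underlying skill is held fixed while one surface property, chain length, is shifted, and accuracy is compared in and out of distribution.

```python
# Toy sketch of an out-of-distribution (OOD) evaluation protocol.
# Illustrative only: the task, "model", and metric are placeholders.
# Idea: keep the underlying skill fixed, shift a surface property
# (here, chain length), and compare in- vs. out-of-distribution accuracy.

import random

def make_task(n_terms: int) -> tuple[str, int]:
    """Build an addition-chain prompt and its ground-truth answer."""
    terms = [random.randint(0, 9) for _ in range(n_terms)]
    return " + ".join(map(str, terms)), sum(terms)

def toy_model(prompt: str) -> int:
    """Stand-in for a trained model: deliberately brittle heuristic that
    only attends to the first three terms, mimicking a model that learned
    short chains and fails to generalize to longer ones."""
    terms = [int(t) for t in prompt.split(" + ")]
    return sum(terms[:3])

def accuracy(n_terms: int, trials: int = 1000) -> float:
    hits = 0
    for _ in range(trials):
        prompt, answer = make_task(n_terms)
        hits += (toy_model(prompt) == answer)
    return hits / trials

print("in-distribution (3 terms):   ", accuracy(3))  # ~1.0
print("out-of-distribution (6 terms):", accuracy(6))  # collapses
```

The point of the sketch is the protocol, not the toy: performance that looks solid in distribution can collapse under a transformation that leaves the underlying task unchanged.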
Taken together, the research reframes the scaling-wall question. There may not be a hard mathematical wall where improvement stops, but the compute required to eliminate the error tail could make reliability gains effectively unreachable on practical timelines. Meanwhile, reasoning generalization appears limited, undermining expectations that “emergent” logical understanding will simply appear as models get larger.
The discussion then pivots from capability limits to where progress might come from instead: models that can learn and act in interactive environments—often grouped under “world models.” The argument is that language is a weak proxy for how intelligence works, and that true progress likely requires probing and testing in real or virtual worlds. The segment closes with a broader prediction: if AGI arrives, it may not come from scaling language models toward logic, but from systems that can ground learning in interaction and experimentation—an approach associated with DeepMind’s “Genie 3” release.
Finally, the transcript includes a call to action for Alignerr, a platform hiring people to help train next-generation AI systems by providing expertise, judgment, and problem-solving feedback across fields ranging from science and coding to law and business—framing human correction as a practical lever against model brittleness and error tails.
Cornell Notes
New work on large language models suggests that “scaling walls” may be less about a total stop and more about reliability becoming prohibitively expensive. One study re-evaluates scaling laws and argues that avoiding errors has a long computational tail: cutting error rates by an order of magnitude may take 10–20× more compute. Another study tests chain-of-thought reasoning under out-of-distribution task changes and finds brittle generalization—reasoning-like text often fails to align with correct outcomes. Together, the findings imply that scaling alone may not deliver robust reasoning or dependable reliability on practical timelines. The transcript then points toward “world models” and interactive environments as a more promising route to AGI-like capabilities, citing DeepMind’s Genie 3.
- What does the reliability-scaling paper claim is missing from earlier scaling laws?
- How does the compute-cost argument connect to the “wall” idea?
- What does the out-of-distribution reasoning study test, and what does it find?
- Why does “reasoning-like text” not guarantee correct reasoning?
- What alternative path to AGI is proposed after these limitations?
Review Questions
- What computational mechanism makes error reduction potentially look like a wall even if scaling laws still predict improvement?
- How do out-of-distribution task transformations reveal weaknesses in chain-of-thought reasoning?
- Why does the transcript argue that interactive “world models” may be a better route than scaling language models alone?
Key Points
1. Revisiting scaling laws suggests error avoidance has a long computational tail, making large reliability gains extremely compute-intensive.
2. Extrapolations in one study estimate that reducing errors by one order of magnitude may require about 10–20× more compute.
3. Even if improvement continues, the compute required to suppress the error tail could make progress look like a practical wall.
4. Out-of-distribution tests indicate chain-of-thought reasoning is brittle and may not align with correct outcomes.
5. Reasoning-like explanations can reflect learned text patterns rather than principled reasoning.
6. The transcript argues AGI-like progress likely depends on interactive “world models,” not language-model scaling alone.
7. DeepMind’s Genie 3 is cited as a step toward grounding models in environments where they can learn through interaction.