Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
An alleged Orion (GPT-4 successor) reached GPT-4-level capability after ~20% of training, but the final improvement was reportedly much smaller than the GPT-3 to GPT-4 leap.
Briefing
A leaked account of OpenAI’s next-generation language model training suggests AI progress may be slowing in raw “intelligence” gains—at least compared with the dramatic jumps seen between earlier generations. The claim centers on an early-stage successor to GPT-4 (often referred to as “Orion” in the reporting): after only about 20% of training, it was already said to match GPT-4-level capability on tasks and question-answering. But once training finished, the final quality improvement was described as far smaller than the step-change from GPT-3 to GPT-4, with some researchers inside OpenAI reportedly doubting it would be reliably better than its predecessor on coding.
The slowdown narrative is tied to two practical constraints: data and compute economics. Scaling up models by an order of magnitude becomes harder when the “easy” training data—broadly accessible text from the web—has largely been consumed. The reporting also points to cost pressure, quoting the view that training the next generation could run into hundreds of billions of dollars and that the scaling paradigm may eventually break. Yet the same reporting includes counterweights: other quotes from Sam Altman portray continued confidence that the path to AGI is known, that capability trajectories will keep rising for a long time, and that major breakthroughs are still possible—even if details can’t be shared.
To test whether the “plateau” idea holds up, the discussion pivots from leaks to benchmarks, focusing on FrontierMath, a set of roughly 100 difficult, unpublished problems created with input from about 60 mathematicians and writers of International Mathematical Olympiad questions. Results cited from the paper indicate current leading language models solve only about 1–2% of these problems—far below what would be expected if models were close to mastering the kind of multi-step reasoning needed for new proofs. The analysis also notes measurement uncertainty: the benchmark’s own estimated error rate is around 1%, while other benchmarks have been criticized for higher error rates (around 10%).
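The measurement caveat can be made concrete: on a benchmark of only ~100 problems, a 1–2% success rate carries wide statistical uncertainty even before benchmark error is considered. A minimal sketch, using a standard Wilson score interval with illustrative numbers (2 solved out of 100; these are not figures from the paper):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative: a model solving 2 of 100 problems
lo, hi = wilson_interval(2, 100)
print(f"95% CI: {lo:.1%} to {hi:.1%}")
```

The interval spans roughly 0.6% to 7%, so on a benchmark this small, modest score differences between models are hard to distinguish from noise.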
Even so, the conversation doesn’t land on “everything is stuck.” A key alternative explanation is data efficiency rather than brute-force scaling. FrontierMath is hard partly because relevant training material is scarce—only a small number of papers contain the kinds of reasoning steps needed. The argument is that future gains could come from better extraction of useful reasoning at inference time (test-time compute), potentially allowing models to “select” the right answer among many candidate outputs. The discussion links this to OpenAI’s “o1” family and suggests that incremental improvements in base model reasoning could still translate into better outcomes when models generate thousands of possible answers.
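The test-time-compute argument has a simple probabilistic core: if each sampled answer is independently correct with probability p, and a selector can recognize a correct answer among the candidates, the chance of success after n samples is 1 − (1 − p)^n. A minimal sketch (the independence and perfect-selector assumptions are idealizations for illustration, not claims from the discussion):

```python
def best_of_n_success(p, n):
    """Chance that at least one of n independent samples is correct,
    assuming an idealized selector that can pick out a correct answer."""
    return 1 - (1 - p) ** n

# Small per-sample accuracies (illustrative), 1000 candidate answers
for p in (0.001, 0.002):
    print(f"p={p}: best-of-1000 success ≈ {best_of_n_success(p, 1000):.1%}")
```

Under these assumptions, doubling per-sample accuracy from 0.1% to 0.2% lifts best-of-1000 success from about 63% to about 86%, which is why small base-model gains could still translate into large end-to-end improvements.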
Finally, the pace of progress may differ by modality. Runway’s co-founder and CEO is cited saying OpenAI plans to release Sora in about two weeks, and the reasoning is that video and image domains have far more abundant training data than text. The broader takeaway: the evidence points to uneven progress—slower gains in some reasoning benchmarks, but continued momentum driven by new paradigms, better data use, and modality-specific advantages—while even insiders admit the timeline for any long-term plateau remains unknown.
Cornell Notes
A leaked account of OpenAI’s next model (often referred to as “Orion,” successor to GPT-4) suggests training yields diminishing returns: early performance reportedly matched GPT-4 after only ~20% of training, but the final quality gain was much smaller than the earlier GPT-3→GPT-4 leap. The slowdown is attributed to data scarcity and escalating training costs, though counterquotes from Sam Altman emphasize confidence in continued capability growth and future breakthroughs. To ground the debate, the discussion highlights FrontierMath, where leading models solve only about 1–2% of ~100 unpublished, proof-like problems—an indicator of remaining reasoning gaps, with some caveats about benchmark error. The proposed path forward is data efficiency and test-time compute: models may improve by extracting reasoning from limited relevant papers and by generating many candidate outputs to increase the chance of a correct answer. Progress may also remain fast in other modalities like video generation (Sora).
What does the leak claim about the next OpenAI model’s training and final gains?
Why would “diminishing returns” happen when scaling language models?
How does FrontierMath function as a reality check on reasoning progress?
What measurement caveats complicate interpreting FrontierMath results?
What alternative explanation is offered to reconcile slow benchmark gains with continued progress?
Why might progress look different in video or other modalities than in text reasoning?
Review Questions
- What specific training timeline and performance comparison does the leak provide for Orion relative to GPT-4?
- Why does FrontierMath’s “unpublished, proof-like” design make it a tougher test than many common benchmarks?
- How does the test-time compute/data-efficiency argument explain why accuracy could improve even if pre-training gains slow?
Key Points
1. An alleged Orion (GPT-4 successor) reached GPT-4-level capability after ~20% of training, but the final improvement was reportedly much smaller than the GPT-3 to GPT-4 leap.
2. Diminishing returns are attributed to both data scarcity (less new high-quality text) and escalating training costs (potentially hundreds of billions of dollars).
3. FrontierMath results cited in the discussion show only about 1–2% success on ~100 unpublished, proof-like problems, suggesting remaining gaps in frontier reasoning.
4. Benchmark interpretation is complicated by estimated error rates and by whether results come from single or repeated evaluations.
5. A leading counter-strategy is data efficiency plus test-time compute: models may extract scarce reasoning patterns and improve by generating many candidate answers.
6. Progress may remain uneven across modalities, with video generation (Sora) potentially benefiting from abundant image/video training data.
7. Even insiders quoted in the discussion acknowledge uncertainty about how long any plateau or continued progress will last.