Leak: ‘GPT-5 exhibits diminishing returns’, Sam Altman: ‘lol’

AI Explained · 5 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Orion, the alleged GPT-4 successor, reportedly reached GPT-4-level capability after only ~20% of training, but the final improvement was much smaller than the GPT-3 to GPT-4 leap.

Briefing

A leaked account of OpenAI’s next-generation language model training suggests AI progress may be slowing in raw “intelligence” gains—at least compared with the dramatic jumps seen between earlier generations. The claim centers on an early-stage successor to GPT-4 (often referred to as “Orion” in the reporting): after only about 20% of training, it was already said to match GPT-4-level capability on tasks and question-answering. But once training finished, the final quality improvement was described as far smaller than the step-change from GPT-3 to GPT-4, with some researchers inside OpenAI reportedly doubting it would be reliably better than its predecessor on coding.

The slowdown narrative is tied to two practical constraints: data and compute economics. Scaling up models by an order of magnitude becomes harder when the “easy” training data—broadly accessible text from the web—has largely been consumed. The reporting also points to cost pressure, quoting the view that training the next generation could run into hundreds of billions of dollars and that the scaling paradigm may eventually break. Yet the same reporting includes counterweights: other quotes from Sam Altman portray continued confidence that the path to AGI is known, that capability trajectories will keep rising for a long time, and that major breakthroughs are still possible—even if details can’t be shared.

To test whether the “plateau” idea holds up, the discussion pivots from leaks to benchmarks, focusing on Frontier Math, a set of roughly 100 difficult, unpublished problems created with input from about 60 mathematicians and writers of International Mathematical Olympiad questions. Results cited from the paper indicate current leading language models solve only about 1–2% of these problems—far below what would be expected if models were close to mastering the kind of multi-step reasoning needed for new proofs. The analysis also notes measurement uncertainty: the benchmark’s own estimated error rate is around 1%, while other benchmarks have been criticized for higher error rates (around 10%).

Even so, the conversation doesn’t land on “everything is stuck.” A key alternative explanation is data efficiency rather than brute-force scaling. Frontier Math is hard partly because relevant training material is scarce—only a small number of papers contain the kinds of reasoning steps needed. The argument is that future gains could come from better extraction of useful reasoning at inference time (test-time compute), potentially allowing models to “select” the right answer among many candidate outputs. The discussion links this to OpenAI’s “o1” family and suggests that incremental improvements in base model reasoning could still translate into better outcomes when models generate thousands of possible answers.
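
As a concrete (and simplified) illustration of that selection idea, the sketch below implements best-of-N sampling with majority voting, sometimes called self-consistency. It is a minimal toy, not OpenAI's method: generate_answer is a hypothetical stand-in for a stochastic model call, and the fixed pool of wrong answers is an assumption made so the example runs on its own.

```python
import random
from collections import Counter

def generate_answer(correct: str, p_correct: float = 0.3) -> str:
    """Toy stand-in for one stochastic model call: returns the correct
    answer with probability p_correct, otherwise one of several wrong
    answers (assumption: errors are spread across many alternatives)."""
    if random.random() < p_correct:
        return correct
    return random.choice(["wrong_a", "wrong_b", "wrong_c", "wrong_d"])

def best_of_n(correct: str, n: int = 1000) -> str:
    """Sample n candidate answers and return the most frequent one."""
    counts = Counter(generate_answer(correct) for _ in range(n))
    return counts.most_common(1)[0][0]

# A single sample is right only 30% of the time, but no single wrong
# answer appears more than ~17.5% of the time, so majority voting over
# 1000 samples selects the correct answer with very high probability.
print(best_of_n("42"))
```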

Finally, the pace of progress may differ by modality. Runway’s co-founder and CEO is cited as saying OpenAI plans to release Sora in about two weeks, and the reasoning is that video and image domains have far more abundant training data than text. The broader takeaway: the evidence points to uneven progress, with slower gains on some reasoning benchmarks but continued momentum driven by new paradigms, better data use, and modality-specific advantages, while even insiders admit the timeline for any long-term plateau remains unknown.

Cornell Notes

A leaked account of OpenAI’s next model (often referred to as “Orion,” successor to GPT-4) suggests training yields diminishing returns: early performance reportedly matched GPT-4 after only ~20% of training, but the final quality gain was much smaller than the earlier GPT-3→GPT-4 leap. The slowdown is attributed to data scarcity and escalating training costs, though counterquotes from Sam Altman emphasize confidence in continued capability growth and future breakthroughs. To ground the debate, the discussion highlights Frontier Math, where leading models solve only about 1–2% of ~100 unpublished, proof-like problems—an indicator of remaining reasoning gaps, with some caveats about benchmark error. The proposed path forward is data efficiency and test-time compute: models may improve by extracting reasoning from limited relevant papers and by generating many candidate outputs to increase the chance of a correct answer. Progress may also remain fast in other modalities like video generation (Sora).

What does the leak claim about the next OpenAI model’s training and final gains?

The reporting says an early Orion model (successor to GPT-4) looked strong after only about 20% of training—already on par with GPT-4 on intelligence and task performance. After training finished, the final quality increase was described as far smaller than the jump between GPT-3 and GPT-4. The reporting also includes internal skepticism about whether Orion will be reliably better than its predecessor at coding, even if it is stronger on language tasks.

Why would “diminishing returns” happen when scaling language models?

Two constraints are emphasized. First, data: GPT-4 is characterized as having trained on much of the accessible web, making another order-of-magnitude increase harder because there’s less high-quality “new” text to harvest. Second, cost: the discussion cites a claim that training the next generation could cost hundreds of billions of dollars, implying the scaling approach may hit an economic wall.

How does Frontier Math function as a reality check on reasoning progress?

Frontier Math is described as about 100 unpublished, very hard problems created with collaboration from roughly 60 mathematicians and top contest/proof writers. These tasks typically take hours or days for specialists and are not found in training data. The cited results say current leading models solve only about 1–2% of them, which is framed as a “canary in the coal mine” for how far models are from proof-like reasoning.

What measurement caveats complicate interpreting Frontier Math results?

The benchmark’s own estimated error rate is around 1%, but the discussion notes that other benchmarks have been criticized for error rates closer to 10%. It also mentions that some reported performance (e.g., for a model tied to Gemini 1.5 Pro and for o1-preview) may come from a single evaluation run, and that repeated trials can change which model looks best.
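
To see why single-run results can mislead at these score levels, consider a hypothetical 100-problem benchmark: a model's observed score is effectively a binomial draw, so at a true solve rate of 1–2% individual runs scatter over a few problems, enough to reorder a leaderboard. A toy simulation under those assumptions:

```python
import random

def run_eval(true_rate: float, n_problems: int = 100) -> int:
    """One evaluation run: number of problems solved, assuming each
    problem is solved independently with probability true_rate."""
    return sum(random.random() < true_rate for _ in range(n_problems))

# Two hypothetical models with nearly identical true ability.
scores_strong = [run_eval(0.02) for _ in range(10)]
scores_weak = [run_eval(0.015) for _ in range(10)]
print(scores_strong)  # typically 0-5 problems solved, varying run to run
print(scores_weak)    # a single run of the weaker model can beat the stronger one
```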

What alternative explanation is offered to reconcile slow benchmark gains with continued progress?

The discussion argues that progress may come through data efficiency and test-time compute rather than only larger pre-training runs. Frontier Math’s difficulty is linked to scarce relevant training material (only a dozen or so papers with the needed reasoning). If models can better extract signal from limited sources and use inference-time computation to generate many candidate outputs, they can improve accuracy even without massive new pre-training gains.
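
A rough arithmetic check on why candidate generation can matter, under the strong simplifying assumption that samples are independent and each is correct with probability p: the chance that at least one of N candidates is correct is 1 − (1 − p)^N. The hard part, as the discussion notes, is then recognizing that correct candidate via voting or verification.

```python
p, n = 0.02, 1000          # assumed per-sample solve rate and candidate count
p_any = 1 - (1 - p) ** n   # probability at least one candidate is correct
print(p_any)               # ~0.999999998
```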

Why might progress look different in video or other modalities than in text reasoning?

The conversation points to data abundance: video and image domains have far more training data (e.g., YouTube and image corpora) than text proof datasets. Runway’s CEO is cited as saying OpenAI plans to release Sora in about two weeks, and the implication is that modalities with richer data may keep improving faster even if text-based reasoning benchmarks plateau.

Review Questions

  1. What specific training timeline and performance comparison does the leak provide for Orion relative to GPT-4?
  2. Why does Frontier Math’s “unpublished, proof-like” design make it a tougher test than many common benchmarks?
  3. How does the test-time compute/data-efficiency argument explain why accuracy could improve even if pre-training gains slow?

Key Points

  1. Orion, the alleged GPT-4 successor, reportedly reached GPT-4-level capability after only ~20% of training, but the final improvement was much smaller than the GPT-3 to GPT-4 leap.

  2. Diminishing returns are attributed to both data scarcity (less new high-quality text) and escalating training costs (potentially hundreds of billions of dollars).

  3. Frontier Math results cited in the discussion show only about 1–2% success on ~100 unpublished, proof-like problems, suggesting remaining gaps in frontier reasoning.

  4. Benchmark interpretation is complicated by estimated error rates and by whether results come from single or repeated evaluations.

  5. A leading counter-strategy is data efficiency plus test-time compute: models may extract scarce reasoning patterns and improve by generating many candidate answers.

  6. Progress may remain uneven across modalities, with video generation (Sora) potentially benefiting from abundant image/video training data.

  7. Even insiders quoted in the discussion acknowledge uncertainty about how long any plateau or continued progress will last.

Highlights

The leak’s core claim is not that the next model fails, but that the final quality gain after full training is far smaller than earlier generation jumps.
Frontier Math is positioned as a “canary” because it uses unpublished, Olympiad-level proof problems—where leading models reportedly solve only ~1–2%.
The proposed way out of diminishing returns is less about bigger pre-training runs and more about extracting reasoning efficiently at inference time (test-time compute).
Modality matters: video generation progress may stay fast even if text reasoning benchmarks slow, due to richer training data.
