AGI progress, surprising breakthroughs, and the road ahead — the OpenAI Podcast Ep. 5
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
AGI progress is increasingly judged by real-world utility—whether models generate useful artifacts and insights—rather than by narrow benchmark scores alone.
Briefing
AGI progress is increasingly measured less by whether models hit narrow benchmarks and more by whether they can reliably produce real-world value—especially by automating parts of the scientific process. OpenAI’s chief scientist Jakub Pachocki and researcher Szymon Sidor frame the biggest “meaningful” leap as the ability to generalize: not just answering questions well in a test setting, but accelerating discovery, engineering, and even research itself at scale.
Pachocki describes AGI in practical terms rather than as a single technical threshold. As capabilities have matured, abilities that once seemed separate—natural conversation, strong math performance, and research-like problem solving—have started to converge. But pointwise metrics are losing their usefulness. Many benchmarks are now saturating: models can reach human-level performance on constrained tasks, and training methods can become increasingly specialized, making benchmark scores less representative of overall intelligence. The conversation shifts toward “reward and utility”—whether a system can find new insights and produce artifacts people can use, not just perform like a test-taker.
The discussion uses International Mathematical Olympiad (IMO) performance as a concrete example of a benchmark that still feels informative. The IMO is described as difficult yet relatively knowledge-light, emphasizing sustained reasoning over hours and creative problem solving rather than formula application. Pachocki notes that the model achieving gold-level performance did so without calculators or external tools, and that earlier attempts could fail on simpler tasks like multiplying four-digit numbers—highlighting how much reasoning capabilities have improved over time.
Sidor adds that benchmarks can also be misleading because different users want different things. A model can score well while still being unhelpful for a particular job, since “good” depends on the use case. That’s why broader measures matter, including tools and workflows that reflect real adoption. One proposed anchor is “ChatGPT usage” and the ability to produce useful technology artifacts when given enough compute—an approach that better reflects how systems will operate in practice.
The pair also points to reasoning as a key driver of recent surprises. They describe a moment when reasoning models began working well enough that the team worried about whether the organization was ready for fast-paced progress. They emphasize that “longer chain-of-thought” isn’t a trivial tweak; making it work required substantial effort. In competitive settings, the models have shown strong performance across multiple contests, including the IMO and a Japan-based optimization contest, AtCoder, where the model placed second behind Sidor’s friend Psyho.
Looking ahead, the next breakthroughs are expected to come from scaling and extending what models can plan and reason over longer horizons. Pachocki argues that compute budgets for meaningful problems will dwarf what individual users can afford, enabling systems to persist on tasks that matter—such as medical research questions or building the next generation of models. The practical AGI picture is a largely automated “company of researchers and engineers” that can interface with the world: taking inputs, running experiments, and producing code, designs, and other artifacts. At the same time, trust and robustness remain unresolved, especially as models gain access to personal data and become more persistent and humanlike in their interactions.
Finally, the conversation lands on advice for students: learn to code to build structured problem-solving skills, and don’t treat perceived constraints as permanent. The overarching message is that AGI progress is real and accelerating—but the yardstick must shift from test performance to utility, discovery, and trustworthy impact in the world.
Cornell Notes
OpenAI’s chief scientist Jakub Pachocki and researcher Szymon Sidor argue that AGI progress can’t be tracked well with narrow benchmarks once models saturate them. As systems reach human-level performance on constrained tasks like the International Mathematical Olympiad, scores become less informative about general intelligence and real-world usefulness. They push for measures tied to utility: whether models can discover new insights, automate parts of research, and produce usable technology artifacts. Reasoning improvements—built through hard-earned training changes rather than a simple “longer chain-of-thought” trick—help explain recent competitive wins. The practical AGI vision is a persistent, compute-backed “automated research organization” that interfaces with people, runs experiments, and accelerates technical progress while still needing work on robustness and trust.
Why do benchmark scores become less reliable as models improve?
What makes IMO-style competition a useful milestone compared with other tests?
How do the guests distinguish “being a good test taker” from being useful?
What role does reasoning play in recent progress, and why isn’t it just a simple tweak?
What does “automating research” look like in practice?
What limits the near-term trustworthiness of more capable, persistent systems?
Review Questions
- How do saturation and benchmark specialization undermine the usefulness of standardized intelligence tests?
- What characteristics of IMO problems make them a better proxy for reasoning than many other benchmarks?
- Why might a model’s benchmark performance fail to predict its usefulness for a specific real-world task?
Key Points
1. AGI progress is increasingly judged by real-world utility—whether models generate useful artifacts and insights—rather than by narrow benchmark scores alone.
2. Benchmark saturation and training specialization can make test performance less representative of general intelligence.
3. IMO-style competitions remain informative because they emphasize sustained, creative reasoning over hours without relying on calculators or external tools.
4. A model can be a strong test taker yet still be unhelpful for many users; “good” depends on the use case and the kind of output needed.
5. Reasoning improvements have produced major surprises, but making them work required substantial effort beyond simple interface changes.
6. Scaling compute and extending planning/reasoning horizons are expected to drive the next breakthroughs, enabling longer persistence on high-impact problems.
7. Trust and robustness lag capability: more access and persistence increase value but also raise risks that require continued safety work.