o3 breaks (some) records, but AI becomes pay-to-win

AI Explained · 6 min read

Based on AI Explained's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

o3 and Gemini 2.5 Pro are close competitors, but the lead changes by benchmark: o3 is stronger on long-context cross-chapter puzzle assembly, while Gemini often leads on spatial/physics-style reasoning.

Briefing

OpenAI’s o3 has landed, setting records on several benchmarks within days of release, but the bigger shift is economic: top-tier AI performance is increasingly tied to paid access, turning advanced reasoning into a “pay-to-win” market. Early comparisons put o3 and Google’s Gemini 2.5 Pro in a tight race across major tests, yet the newest results show o3 pulling ahead on long-form puzzle assembly while Gemini retains advantages in other domains, especially spatial and physics-style reasoning.

On long-context reasoning, o3 shows a consistent edge when tasks require connecting clues across widely separated sections of fiction, including texts around 100,000 words. The expectation that Gemini—long considered strong at long context—would dominate doesn’t hold up in these newer measurements. In a separate physics and spatial reasoning benchmark released within days, Gemini 2.5 Pro leads, with o3 trailing despite both models falling well short of human experts. The gap isn’t just about “knowing facts”; it reflects limitations in how models interpret real-world spatial relationships from text alone. A concrete example illustrates the problem: following instructions that involve physically looping an arm through a gap is easy for humans but hard for models that can’t truly visualize the motion.

The picture flips again on troubleshooting-style tasks. o3 posts a 94th-percentile score on a text-based evaluation of complex lab-protocol troubleshooting, while Gemini 2.5 Pro performs better on competition-style mathematics. In math, o3 and o4-mini reach roughly 90% on the AIME 2025 high-school competition benchmark without tools, but exceed 99% with tools. On the harder USAMO (the olympiad the AIME qualifies students for), o3 on high settings lands around 22% correct versus about 24% for Gemini 2.5 Pro, again with Gemini described as roughly four times cheaper. Visual benchmarks are more uneven: o3 edges out Gemini on some image-based questions (like directional cues), but Gemini is stronger at geolocation from street-view images and also performs better on certain visual puzzles, where o3 can even underperform older models like o1.

A key technical thread behind o3’s vision gains is the VAR method. The approach addresses a common failure mode: high-resolution images can overwhelm multimodal models. VAR uses a multimodal language model to predict which region of an image is most relevant, crops that region, and feeds it back alongside the original image into the model’s visual working memory, helping it “zoom in” on the right details. The method still isn’t magic; even with VAR, some “Where’s Waldo”-style tasks remain unsolved.
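
As described, this amounts to a two-pass loop: locate, crop, re-prompt. A minimal sketch under that reading, where ask_model and locate_region are hypothetical stand-ins for a multimodal LLM API rather than anything from the video:

```python
# Minimal sketch of the "predict region -> crop -> re-prompt" loop described
# above. ask_model and locate_region are hypothetical stand-ins, not real
# library calls.
from dataclasses import dataclass
from PIL import Image


@dataclass
class Region:
    left: int
    top: int
    right: int
    bottom: int


def ask_model(question: str, images: list) -> str:
    """Placeholder for a multimodal LLM call (assumed interface)."""
    raise NotImplementedError


def locate_region(question: str, image: Image.Image) -> Region:
    """In the real method, the model is asked which area matters for the
    question and its reply is parsed into a box; stubbed as the full frame."""
    w, h = image.size
    return Region(0, 0, w, h)


def answer_with_zoom(question: str, image: Image.Image) -> str:
    # 1. Predict the region of the image most relevant to the question.
    r = locate_region(question, image)
    # 2. Crop it so fine detail survives any downsampling.
    crop = image.crop((r.left, r.top, r.right, r.bottom))
    # 3. Re-prompt with the full image (global context) plus the crop
    #    (local detail) in the model's visual working memory.
    return ask_model(question, [image, crop])
```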

Economically, the stakes are rising fast. OpenAI projects $174 billion in revenue by 2030, up from $4 billion in 2024. The transcript argues that this growth likely depends on compute scaling and product packaging rather than a single breakthrough that instantly delivers cheap, universal intelligence. With companies already moving toward premium tiers (and similar plans emerging across major labs), the incentive structure shifts: if performance improvements require expensive post-training and massive compute, then users may need to pay more to stay competitive. The result is a market where advanced capabilities—especially tool-using, longer reasoning, and deeper “research” modes—could increasingly be gated by cost, even as progress accelerates toward more autonomous systems that can outperform humans on economically valuable work.
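
Taken at face value, those two figures imply a striking growth rate; a quick back-of-the-envelope check:

```python
# Implied compound annual growth rate from $4B (2024) to $174B (2030).
start, end, years = 4e9, 174e9, 2030 - 2024
cagr = (end / start) ** (1 / years) - 1
print(f"{cagr:.1%}")  # -> 87.5% per year, sustained for six years
```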

Cornell Notes

o3 and Gemini 2.5 Pro are trading leads across benchmarks, with o3 showing an edge on long-context puzzle assembly and Gemini often outperforming on spatial/physics-style reasoning. o3’s vision improvements are linked to VAR, which crops the image region a multimodal model predicts is most relevant, keeping high-resolution inputs from overwhelming the model. In competition-style math, both models reach very high scores with tools; without tools the gap is small, and Gemini is repeatedly cited as far cheaper. Despite rapid capability gains, the transcript frames a larger shift: top performance is increasingly tied to expensive compute and premium access tiers, creating a “pay-to-win” dynamic. The long-term question becomes whether progress will come from cheap algorithmic leaps or from scaling that forces users to pay more.

Why does o3’s performance look strong on long-context puzzle tasks, and what does that imply about reasoning?

In newer long-form evaluations (up to ~100,000 words), o3 is reported to outperform Gemini 2.5 Pro at connecting clues across distant chapters—e.g., a clue in chapter 3 mapping to chapter 16. That pattern suggests o3 is better at maintaining and using relevant information over long spans, at least for this style of cross-reference reasoning. It also challenges the expectation that Gemini’s long-context specialization would automatically translate into dominance across every long-context benchmark.
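
Mechanically, an eval item of this kind looks something like the sketch below; the prompt format and grading note are assumptions for illustration, not the benchmark's actual layout:

```python
# Sketch of a cross-chapter, long-context eval item (hypothetical format).
def build_prompt(chapters: list, question: str) -> str:
    # Concatenate the whole work (~100,000 words) so a clue planted in
    # chapter 3 and its payoff in chapter 16 share one context window.
    book = "\n\n".join(
        f"Chapter {i + 1}\n{text}" for i, text in enumerate(chapters)
    )
    return f"{book}\n\nQuestion: {question}\nCite the chapters you used."

# A grader would then score the answer against a key that can only be
# satisfied by linking the two distant passages.
```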

What’s the core limitation highlighted by spatial/physics benchmarks, and how is it illustrated?

The transcript emphasizes that models can struggle with spatial reasoning when the task depends on real-world physical relationships that aren’t naturally “visualized” from text. A specific example describes placing a right palm on a left shoulder and looping the left arm through a gap between the right arm and the chest—humans can follow the motion, but models “have no idea what’s going on” because it isn’t in training data and they can’t truly visualize the movement. In the referenced benchmark, Gemini 2.5 Pro leads over o3, but both remain far below human expert accuracy.

How does VAR change vision behavior, and what problem is it designed to solve?

VAR is framed as a response to a failure mode: high-resolution images can overwhelm a multimodal model. VAR uses a multimodal language model to predict which part of the image is most relevant to the question, crops that region, and adds it to the model’s visual working memory alongside the original image. The transcript gives a “Where’s Waldo” example where the model zooms toward likely locations (like top vantage points or walkways), though it still fails to find Waldo in that instance.
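
The Waldo behavior is naturally an iterative version of the crop loop sketched earlier; here is one way it could look, reusing the same hypothetical ask_model and locate_region stand-ins:

```python
# Iterative variant of the earlier crop sketch: keep zooming into candidate
# regions until the model commits to an answer or the step budget runs out.
def search_with_zoom(question, image, max_steps: int = 4) -> str:
    views = [image]  # the model's growing visual working memory
    for _ in range(max_steps):
        reply = ask_model(question, views)
        if "NOT FOUND" not in reply:  # assumed "still searching" convention
            return reply
        # Zoom into the next most promising region of the latest view
        # (vantage points, walkways, ...) and try again.
        r = locate_region(question, views[-1])
        views.append(views[-1].crop((r.left, r.top, r.right, r.bottom)))
    return reply  # may still fail, as in the Waldo example
```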

Why do tool-augmented math results look dramatically better than no-tool results?

On AIME 2025, both o3 and o4-mini are described as scoring around 90% without tools, but exceeding 99% with tools. The transcript notes the AIME is used to qualify for the USAMO, which is harder and more proof-based. On the USAMO, o3 on high settings is around 22% correct versus about 24% for Gemini 2.5 Pro, with Gemini characterized as about four times cheaper. The implication: tools can dramatically boost performance, but baseline reasoning differences still show up on harder, proof-style exams.
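
One way to picture why tools help so much: the harness lets the model hand arithmetic-heavy steps to an interpreter instead of reasoning unaided. A sketch under assumed conventions (the code-tag format and ask_model interface are made up for illustration):

```python
# Sketch of a "with tools" eval loop (assumed conventions, not a real API):
# the model may answer in prose or emit code, which the harness executes.
import re
import subprocess
import sys
import tempfile


def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call (assumed interface)."""
    raise NotImplementedError


def run_python(code: str) -> str:
    """Execute model-written code in a subprocess, capturing stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    out = subprocess.run([sys.executable, path],
                         capture_output=True, text=True, timeout=30)
    return out.stdout.strip()


def answer_with_tools(problem: str) -> str:
    reply = ask_model(problem + "\nWrap any Python in <code>...</code> tags.")
    match = re.search(r"<code>(.*?)</code>", reply, re.S)
    if match:  # the model chose to compute rather than answer directly
        return run_python(match.group(1))
    return reply
```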

What does the transcript mean by “pay-to-win,” and how is compute economics tied to it?

The argument is that if major performance gains increasingly require expensive scaling—especially post-training/reinforcement learning and large compute budgets—then companies will monetize access through premium tiers. With labs planning higher-priced service levels and with reasoning improvements tied to costly compute, users who pay more get access to stronger models and deeper modes. The transcript contrasts two scenarios: if AGI were a quick algorithmic tweak, companies would push it broadly; if performance requires scaling compute, users effectively pay for the compute.
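
Folding the “four times cheaper” claim into the USAMO accuracy figures makes the economics concrete; a toy cost-per-correct-answer calculation using only numbers from the transcript:

```python
# Toy cost-per-correct-answer comparison using the transcript's figures:
# ~24% vs ~22% on the USAMO, with Gemini roughly four times cheaper.
c = 1.0  # Gemini 2.5 Pro price per attempt, in arbitrary units
models = {"gemini-2.5-pro": (0.24, c), "o3-high": (0.22, 4 * c)}

for name, (accuracy, price) in models.items():
    print(f"{name}: {price / accuracy:.2f}c per correct answer")
# gemini-2.5-pro: 4.17c, o3-high: 18.18c -> roughly 4.4x apart
```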

How does the transcript connect tool use and autonomy to AGI expectations without claiming AGI is here?

A senior OpenAI staff member is quoted defining AGI as a highly autonomous system that can outperform humans at most economically valuable work, while emphasizing that current systems are not there yet. The transcript points to “AGI vibes” from o3’s dynamic tool use as a sign of progress, not proof of completion. It also stresses a pace pattern—“things will go slow until they go fast”—suggesting acceleration ahead even while benchmarks show gaps like spatial reasoning.

Review Questions

  1. Which benchmark domains show o3 leading Gemini 2.5 Pro, and which domains show Gemini leading—according to the transcript’s latest comparisons?
  2. Describe VAR in your own words: what triggers cropping, what gets added to context, and what failure mode it targets?
  3. What economic mechanism does the transcript use to justify a “pay-to-win” future for AI capabilities?

Key Points

  1. o3 and Gemini 2.5 Pro are close competitors, but the lead changes by benchmark: o3 is stronger on long-context cross-chapter puzzle assembly, while Gemini often leads on spatial/physics-style reasoning.
  2. Spatial reasoning gaps persist even when models answer textually plausible instructions; the transcript highlights the inability to truly visualize physical motion from text alone.
  3. o3’s vision gains are linked to VAR, which crops the most relevant image region predicted by a multimodal model to reduce overwhelm from high-resolution inputs.
  4. Tool use can massively boost performance in math competitions, with AIME 2025 scores jumping from ~90% without tools to over 99% with tools.
  5. Gemini is repeatedly described as about four times cheaper than o3 in the comparisons, even when o3 wins certain tasks.
  6. OpenAI’s projected $174 billion revenue by 2030 is used to argue that compute scaling and premium access tiers will increasingly shape who gets the best capabilities.
  7. The transcript frames progress toward autonomy and tool-using systems as real, but insists that AGI—defined as outperforming humans at most economically valuable work—has not arrived yet.

Highlights

o3 takes the lead on long-context puzzle assembly (up to ~100,000 words), connecting distant clues across chapters where Gemini falls behind.
VAR is presented as a practical fix for high-resolution vision overload: predict the relevant region, crop it, and feed it back into the model’s visual working memory.
Even with strong benchmark performance, spatial/physics reasoning remains a weak spot relative to human experts, illustrated by a hands-and-arm motion example.
The “pay-to-win” claim hinges on economics: if post-training and reasoning improvements require expensive compute, premium tiers will gate top performance.

Topics

  • Model Benchmarks
  • Long Context Reasoning
  • Spatial Reasoning
  • VAR Vision Method
  • AI Pricing Economics

Mentioned

  • Sam Altman
  • François Chollet
  • Gray Swan
  • AGI
  • USAMO
  • AIME
  • VAR
  • RL
  • TPU
  • GPUs