o3 breaks (some) records, but AI becomes pay-to-win
Based on AI Explained's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s o3 landed just days ago with record-breaking benchmark results, but the bigger shift is economic: top-tier AI performance is increasingly tied to paid access, turning advanced reasoning into a “pay-to-win” market. Early comparisons put o3 and Google’s Gemini 2.5 Pro in a tight race across major tests, yet the newest results show o3 pulling ahead on long-form puzzle assembly while Gemini retains advantages in other domains, especially spatial and physics-style reasoning.
On long-context reasoning, o3 shows a consistent edge when tasks require connecting clues across widely separated sections of fiction, including texts around 100,000 words. The expectation that Gemini, long considered strong at long context, would dominate doesn’t hold up in these newer measurements. In a separate physics and spatial reasoning benchmark released within days of o3’s launch, Gemini 2.5 Pro leads, with o3 trailing despite both models falling well short of human experts. The gap isn’t just about “knowing facts”; it reflects limitations in how models interpret real-world spatial relationships from text alone. A concrete example illustrates the problem: following instructions that involve physically looping an arm through a gap is easy for humans but hard for models that can’t truly visualize the motion.
The picture flips again on troubleshooting-style tasks. o3 posts a 94th-percentile score on a text-based evaluation of complex lab protocol troubleshooting, while Gemini 2.5 Pro performs better on competition-style mathematics. In math, both models reach roughly 90% on the AIME 2025 high-school competition benchmark without tools, but exceed 99% with tools. On the much harder USAMO, o3 on high settings lands around 22% correct versus about 24% for Gemini 2.5 Pro, with Gemini also described as roughly four times cheaper. Visual benchmarks show more unevenness: o3 edges out Gemini on some image-based questions (like directional cues), but Gemini is stronger at geolocation from street-view images and at certain visual puzzles, where o3 can even underperform older models like o1.
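The jump from roughly 90% to over 99% with tools is easier to see with a toy illustration: a model that delegates exact computation to code cannot make the small arithmetic slips that compound in pure text reasoning. The problem below is made up for illustration, not drawn from any benchmark:

```python
from fractions import Fraction

# Illustrative "tool call": instead of estimating in prose, the model
# emits code that enumerates the sample space exactly.
# Example question: what is P(two fair dice sum to 7)?
def solve_with_tool():
    outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
    favorable = sum(1 for a, b in outcomes if a + b == 7)
    return Fraction(favorable, len(outcomes))

print(solve_with_tool())  # exact answer: 1/6
```

The point is not the difficulty of this particular problem but the error profile: code execution returns an exact value every time, which is why tool-augmented scores saturate near the ceiling.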
A key technical thread behind o3’s vision gains is the VAR method. It addresses a common failure mode: high-resolution images can overwhelm multimodal models. VAR uses a multimodal language model to predict which region of an image is most relevant, crops that region, and feeds the crop back alongside the original image into the model’s visual working memory, helping it “zoom in” on the right details. The method isn’t magic; even with VAR, some “Where’s Waldo”-style tasks remain unsolved.
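The transcript gives no code, but the locate-crop-reattach loop can be sketched as below. `predict_region` and `answer` are hypothetical stand-ins for the multimodal model’s two roles, and the image is simplified to a 2D grid of pixels:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Bounding box in pixel coordinates (left/top inclusive, right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int

def crop(image, box):
    """Crop a 2D grid of pixels (list of rows) without downscaling."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]

def var_zoom(predict_region, answer, image, question):
    """One VAR-style pass: locate the relevant region, crop it at full
    resolution, then answer with both the original image and the crop
    in context."""
    box = predict_region(image, question)    # where should the model look?
    detail = crop(image, box)                # zoomed-in, full-resolution view
    return answer([image, detail], question) # both views in working memory
```

In a real system both callbacks would hit the same multimodal model; the key design idea is that the crop re-enters the context at native resolution, so fine details survive the vision encoder’s downscaling of the full image.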
Economically, the stakes are rising fast. OpenAI projects $174 billion in revenue by 2030, up from $4 billion in 2024. The transcript argues that this growth likely depends on compute scaling and product packaging rather than a single breakthrough that instantly delivers cheap, universal intelligence. With companies already moving toward premium tiers (and similar plans emerging across major labs), the incentive structure shifts: if performance improvements require expensive post-training and massive compute, then users may need to pay more to stay competitive. The result is a market where advanced capabilities—especially tool-using, longer reasoning, and deeper “research” modes—could increasingly be gated by cost, even as progress accelerates toward more autonomous systems that can outperform humans on economically valuable work.
Cornell Notes
o3 and Gemini 2.5 Pro are trading leads across benchmarks, with o3 showing an edge on long-context puzzle assembly and Gemini often outperforming on spatial/physics-style reasoning. o3’s vision improvements are linked to VAR, which crops the most relevant image region using a multimodal model to reduce overwhelm from high-resolution inputs. In math and competition-style tests, both models can reach very high scores with tools, while differences without tools are smaller and sometimes favor Gemini on cost. Despite rapid capability gains, the transcript frames a larger shift: top performance is increasingly tied to expensive compute and premium access tiers, creating a “pay-to-win” dynamic. The long-term question becomes whether progress will come from cheap algorithmic leaps or from scaling that forces users to pay more.
Why does o3’s performance look strong on long-context puzzle tasks, and what does that imply about reasoning?
What’s the core limitation highlighted by spatial/physics benchmarks, and how is it illustrated?
How does VAR change vision behavior, and what problem is it designed to solve?
Why do tool-augmented math results look dramatically better than no-tool results?
What does the transcript mean by “pay-to-win,” and how is compute economics tied to it?
How does the transcript connect tool use and autonomy to AGI expectations without claiming AGI is here?
Review Questions
- Which benchmark domains show o3 leading Gemini 2.5 Pro, and which domains show Gemini leading—according to the transcript’s latest comparisons?
- Describe VAR in your own words: what triggers cropping, what gets added to context, and what failure mode it targets?
- What economic mechanism does the transcript use to justify a “pay-to-win” future for AI capabilities?
Key Points
- 1
o3 and Gemini 2.5 Pro are close competitors, but the lead changes by benchmark: o3 is stronger on long-context cross-chapter puzzle assembly, while Gemini often leads on spatial/physics-style reasoning.
- 2
Spatial reasoning gaps persist even when models answer textually plausible instructions; the transcript highlights the inability to truly visualize physical motion from text alone.
- 3
o3’s vision gains are linked to VAR, which crops the most relevant image region predicted by a multimodal model to reduce overwhelm from high-resolution inputs.
- 4
Tool use can massively boost performance in math competitions, with AIME 2025 scores jumping from ~90% without tools to over 99% with tools.
- 5
Gemini is repeatedly described as about four times cheaper than o3 in the comparisons, even when o3 wins certain tasks.
- 6
OpenAI’s projected $174 billion revenue by 2030 is used to argue that compute scaling and premium access tiers will increasingly shape who gets the best capabilities.
- 7
The transcript frames progress toward autonomy and tool-using systems as real, but insists that AGI—defined as outperforming humans at most economically valuable work—has not arrived yet.