
The King is Back. o3 & o4-mini are ELECTRIC! Can Google Compete?

MattVidPro · 6 min read

Based on MattVidPro's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI’s o3 and o4-mini are framed as tool-using, planning-first models that can browse, run Python/terminal, and produce chart-backed recommendations.

Briefing

OpenAI’s new o3 and o4-mini models are being positioned as a major leap in “agentic” AI—systems that can plan, use tools (web search, Python, terminal), and reason through multimodal inputs to produce results that look closer to scientific workflows than chat-only answers. Early access through paid ChatGPT tiers highlights a practical shift: the models can think for minutes, pull external data, run code, generate charts, and then deliver bottom-line recommendations with supporting analysis.

The core distinction is scale versus efficiency. o3 is the larger, “state-of-the-art” model aimed at pushing capabilities in coding, math, science, and visual perception. External evaluations cited in the discussion credit o3 with fewer major errors than OpenAI’s earlier o1 on difficult real-world tasks, with particular strength in programming, business/consulting-style reasoning, and creative hypothesis generation. o4-mini, meanwhile, is framed as the next-generation smaller model—optimized for fast, cost-efficient reasoning—while still delivering “remarkable performance for its size.” The naming is treated as confusing but meaningful: o4-mini replaces o3-mini and is described as a glimpse of the next generation, while the full o4 is implied to be not yet ready.

Benchmarks presented as the centerpiece show tool use as the accelerant. On AIME 2024 and AIME 2025, scores climb sharply when Python and web tools are enabled, with o4-mini reaching the top end of the charts in the tool-augmented setting (scores approaching 99.5%). Code-focused results are framed as similarly dominant: terminal-enabled o3 and o4-mini lead on Codeforces-style competition metrics, with o4-mini outperforming the larger o3 while remaining cheaper.

The discussion also contrasts academic-style tests with exam-style tasks. On GPQA Diamond, o3 and o4-mini show slight leads over older models without tools. On “Humanity’s Last Exam,” tool access again drives the gap: o4-mini improves substantially with Python/browsing, and o3 with tools rises even further—though a specialized “Deep Research” system still tops the chart, described as powered by a fine-tuned variant of o3. Multimodal performance is repeatedly emphasized as improved, including image-based reasoning where the models can interpret diagrams or whiteboards and then manipulate images (rotate/zoom/transform) to extract hard-to-read details.

Cost and usability are treated as the other half of the story. API pricing comparisons claim o4-mini is dramatically cheaper than o3, while staying close to Google’s Gemini 2.5 Pro on many benchmarks. The narrative repeatedly returns to the idea that tool-using o-series models saturate performance on certain benchmark suites—making price and workflow fit the deciding factors.

Community testing and real-world behavior add texture. Users report strong performance on a hexagon-and-bouncing-balls physics benchmark, with several older models failing or glitching. At the same time, jailbreak attempts appear to work, including a user producing a detailed amphetamine synthesis procedure after bypassing safety. Coding experiments in ChatGPT’s interface are more mixed: the models can generate and verify smaller code tasks, but large projects (like a Minecraft clone) struggle, with output limits and canvas constraints suspected.

Finally, OpenAI’s “Codex CLI” is introduced as an open-source coding agent that runs on a user’s computer, aiming to make tool-using models more practical for developers. The overall takeaway is that o3 and o4-mini aren’t just smarter—they’re being trained to choose tools, reason about them, and handle visual inputs in ways that make them feel more like autonomous problem-solvers than conventional chatbots.

Cornell Notes

OpenAI’s o3 and o4-mini are presented as tool-using, “agentic” reasoning models that can plan, browse the web, run Python/terminal, and work with images to solve multi-step problems. o3 is the larger, highest-capability model aimed at pushing coding, math, science, and visual reasoning, while o4-mini targets faster, cheaper reasoning without giving up top-tier performance. Benchmark results emphasize that tool access drives scores upward, with o4-mini reaching the highest AIME scores in tool-augmented settings and both models leading in terminal-based coding competitions. Multimodal tests highlight image-aware reasoning plus on-the-fly image manipulation (zoom/rotate/transform) to extract information from messy visuals. Pricing comparisons argue o4-mini is far more cost-effective than o3 while staying close to Gemini 2.5 Pro for many tasks.

What makes o3 and o4-mini feel “agentic” rather than like standard chat models?

They’re trained to plan and then use external tools based on desired outcomes. In practice, that means web search to gather data, Python to compute and visualize results, and terminal-style workflows for coding tasks. The models can also reason over images inside the reasoning process—then manipulate the image (rotate/zoom/transform) to read small or low-quality text—before producing a final answer. The transcript also notes that they may think for minutes, then return outputs with charting, verification, and a clear recommendation.
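As a concrete illustration, the snippet below sketches what a single tool-enabled request might look like through OpenAI's Python SDK. The model identifier, tool type names, and the output accessor are assumptions based on OpenAI's publicly documented Responses API rather than anything shown in the video, so treat this as a sketch to check against the current API reference, not a verified recipe.

```python
# Minimal sketch of a tool-enabled request (assumed model/tool names; verify
# against the current OpenAI API reference before relying on them).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="o4-mini",                     # assumed model identifier
    tools=[
        {"type": "web_search_preview"},  # assumed built-in web-search tool
        {"type": "code_interpreter",     # assumed Python/code tool
         "container": {"type": "auto"}},
    ],
    input=(
        "Pull the last four quarters of revenue for these two companies, "
        "chart the trend with Python, and end with a clear recommendation."
    ),
)

# The model plans, calls tools as it sees fit, then returns a final answer.
print(response.output_text)
```

The exact parameters matter less than the workflow the transcript describes: one request goes in, and the model decides when to browse, when to compute, and when to answer.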

How do the models’ benchmark results change when tools are enabled?

Tool use is described as the major performance multiplier. For AIME 2024/2025, scores rise substantially when Python and browsing are allowed, and the tool-augmented o4-mini is claimed to top the charts (with near-perfect scores in the presented AIME results). Code benchmarks similarly improve when terminal access is used, with o3 and o4-mini dominating Codeforces-style Elo metrics.

Why does “Deep Research” still outperform the general models on Humanity’s Last Exam?

The transcript claims Deep Research is powered by a fine-tuned variant of o3 (not the same o3 released in ChatGPT). That specialization is credited for keeping it at the top on Humanity’s Last Exam, even when general o3 and o4-mini are given tools. In other words, the general model catches up, but the dedicated deep-research tuning still yields the best exam performance.

What tradeoff does the transcript highlight between o3 and o4-mini in real deployment?

o3 is positioned as stronger but much more expensive in API token pricing. o4-mini is described as dramatically cheaper—often comparable to Gemini 2.5 Pro in cost—while still delivering very high benchmark performance, especially when tools are available. The practical implication: if cost matters, o4-mini is framed as the better default; if maximum capability matters and budget allows, o3 is the heavier hitter.
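To make the tradeoff concrete, here is a small back-of-the-envelope cost calculator. The per-million-token rates are placeholder assumptions, not figures from the transcript; swap in current published API pricing before drawing any conclusions.

```python
# Rough cost comparison per API call. Rates below are placeholder assumptions
# (USD per 1M input/output tokens), not the transcript's figures.
PRICES_PER_MILLION = {
    "o3": (10.00, 40.00),       # assumed input/output rates
    "o4-mini": (1.10, 4.40),    # assumed input/output rates
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single call for the given model."""
    in_rate, out_rate = PRICES_PER_MILLION[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 5k-token prompt with a 20k-token reasoning-heavy response.
for name in PRICES_PER_MILLION:
    print(f"{name}: ${call_cost(name, 5_000, 20_000):.2f}")
```

Under assumed rates like these, the gap compounds quickly for reasoning-heavy workloads, which is exactly the argument the transcript makes for defaulting to o4-mini when cost matters.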

What limitations show up in coding tests inside ChatGPT’s app interface?

While the models can handle smaller coding tasks and verification workflows, larger projects struggle. The transcript describes attempts to build a Minecraft clone with Three.js failing, and it suspects ChatGPT’s coding canvas has output limits (e.g., the model generating only a few hundred lines at once). The guidance given is to test coding in the API for bigger projects, where constraints may differ.
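As a hedged sketch of what "test it in the API" might look like: the model name and the token cap below are assumptions, and the only point being illustrated is that the API exposes an explicit output budget that the ChatGPT canvas does not.

```python
# Sketch: routing a larger code-generation request through the API instead of
# the ChatGPT canvas. Model name and max_output_tokens value are assumptions.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o4-mini",
    max_output_tokens=16_000,  # assumed cap, larger than a typical canvas turn
    input=(
        "Write a single-file Three.js demo: a slowly rotating hexagon "
        "containing balls that bounce with simple collision physics."
    ),
)

print(response.output_text)
```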

What is Codex CLI, and how does it relate to tool use?

Codex CLI is introduced as an open-source coding agent that runs directly on the user’s computer. The transcript frames it as OpenAI’s way to make tool-using models more useful for development by enabling local execution while still relying on API calls to the models. It’s positioned as an answer to competing agent-style coding tools and as a real step toward practical “agentic” coding workflows.

Review Questions

  1. Which benchmark categories in the transcript are most sensitive to tool use, and what happens to scores when Python/web tools are enabled?
  2. How do the transcript’s pricing comparisons influence the choice between o3 and o4-mini for typical users?
  3. What multimodal capabilities are described for o3/o4-mini, and how do image manipulation steps improve problem-solving outcomes?

Key Points

  1. OpenAI’s o3 and o4-mini are framed as tool-using, planning-first models that can browse, run Python/terminal, and produce chart-backed recommendations.

  2. o3 targets maximum capability across coding, math, science, and visual reasoning, while o4-mini targets fast, cost-efficient reasoning with strong performance for its size.

  3. Tool access is repeatedly shown as the biggest performance driver in the cited AIME and coding benchmarks, pushing scores toward saturation levels.

  4. Multimodal reasoning is emphasized as a step change: models can interpret images within their reasoning process and manipulate them (zoom/rotate/transform) to extract hard-to-read details.

  5. Deep Research is described as a specialized, fine-tuned system that still tops Humanity’s Last Exam even when general o3 is given tools.

  6. API pricing comparisons claim o4-mini is far cheaper than o3 while staying close to Gemini 2.5 Pro on many benchmarks, making it the more cost-effective option.

  7. Community tests report strong physics and image-zoom performance, but also show jailbreak success and limitations for large coding projects inside ChatGPT’s canvas.

Highlights

Tool-augmented AIME results are presented as near-saturated, with o4-mini reaching the top end when Python/web tools are enabled.
o3 and o4-mini are described as able to “think with” images—then zoom/rotate/transform them to read and reason over blurry or low-quality visuals.
Codex CLI is introduced as an open-source local coding agent meant to make tool-using models more practical for developers.
ChatGPT canvas coding is portrayed as limited for large projects, even when the models excel at smaller tasks and verification steps.
