
OpenAI o3 & o4-mini

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI’s o3 and o4-mini are trained to use tools during reasoning, enabling multi-step workflows like web browsing, Python execution, and image manipulation.

Briefing

OpenAI is rolling out two new reasoning models—o3 and o4-mini—positioned as a qualitative jump because they behave like tool-using AI systems rather than standalone text predictors. The core shift is training for tool use inside the reasoning process: the models can call tools repeatedly (including long multi-step runs) while solving hard tasks, using those tool outputs as part of their chain-of-thought workflow. That matters because it turns “reasoning” into something operational—models can browse, run code, manipulate images, and work through multi-stage problems instead of producing one-off answers.

The rollout starts incrementally via the API and ChatGPT, with o3 and o4-mini also described as strong in science, coding, and math. OpenAI ties the improvement to combining its o-series reasoning approach with a full suite of tools, reporting state-of-the-art results across benchmarks including AIME, GPQA, Codeforces, and SWE-bench. Tool use is presented as the enabling layer: like calculators for math or maps for navigation, the models become more capable when paired with the right instruments. A key example is multimodal reasoning with images. Users can upload difficult images—blurry, upside down, or otherwise messy—and the model can use Python-based image manipulation (crop/transform) to extract what it needs.
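
To make that image step concrete, here is a minimal sketch of the kind of Python crop/transform the demo describes, using Pillow. The video does not show the model’s actual tool code, so the file name, coordinates, and operations below are placeholders.

```python
# Illustrative sketch only: the model's real tool calls are not shown in the video.
from PIL import Image

img = Image.open("uploaded_photo.jpg")   # placeholder file name

img = img.rotate(180)                    # fix an upside-down photo

# Crop to a region of interest: (left, upper, right, lower) in pixels.
detail = img.crop((400, 250, 900, 600))

# Upscale the crop so small text becomes legible.
detail = detail.resize((detail.width * 2, detail.height * 2))
detail.save("cropped_detail.png")
```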

A physics demo illustrates the “agentic” workflow. One researcher feeds o3 a 2015 physics poster and asks it to recover a specific quantity from the project and compare it to recent literature. The result isn’t on the poster, so the model zooms around the image, infers what calculations are needed (including slope/extrapolation and normalization), then searches the web for updated estimates by reading multiple papers. The demo emphasizes time savings and uncertainty handling: the model produces a plausible estimate with caveats about precision relative to state-of-the-art.

A second demo shows personalization plus tool-driven research. With memory enabled, o3 reads the news, looks up a topic aligned with the user’s interests (scuba diving and music), and generates a blog-style writeup with plots and citations. The described research thread: playing recordings of healthy coral reefs underwater to accelerate coral and fish settlement—an intersection of environmental restoration and underwater audio.

Beyond demos, OpenAI reports benchmark performance and also shows how tool use emerges. On an AIME math problem, the model first generates a brute-force program, runs it in a Python interpreter, then simplifies into a cleaner solution and double-checks reliability—without being explicitly instructed to follow those exact strategies. In coding, a SWE-bench example uses a containerized environment with a preloaded repository to locate and fix a bug in the SymPy Python package, then confirms the fix by running unit tests.

Training and evaluation are framed as scaling both training and test-time compute within an RL paradigm, with reported gains continuing as RL compute increases. OpenAI also introduces Codex CLI as a safer, lightweight interface for connecting models to users’ computers, including multimodal inputs and an execution mode that restricts network access and limits edits to the working directory. Finally, ChatGPT access is staged: Plus, Pro, and Team subscribers begin receiving o3, o4-mini, and o4-mini-high, while Enterprise/Edu access waits a week; model availability expands to the API, with tool use in the API planned for the coming weeks.

Cornell Notes

OpenAI’s o3 and o4-mini are presented as reasoning models trained to use tools as part of solving problems, not just to generate text. The models combine o-series reasoning with a full tool suite, enabling repeated tool calls (including long runs), web browsing, Python-based computation, and image manipulation. Demos show o3 extracting missing physics results from an old poster, searching recent papers to update estimates, and producing personalized research outputs with plots and citations. Benchmark reporting claims strong performance in math, coding, and science, with tool use demonstrated through workflows like generating and running code, then simplifying and double-checking answers. The rollout also includes Codex CLI for safer local agent execution and staged ChatGPT/API availability.

What’s the biggest practical change behind o3 and o4-mini?

Tool use is trained into the reasoning workflow. Instead of producing an answer directly from text, the models call tools repeatedly while solving—sometimes hundreds of tool calls in sequence for hard tasks. That enables web search for updated literature, Python execution for calculations, and image manipulation (via Python to crop/transform) when inputs are messy.
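
OpenAI has not published the internals of this loop, so the following is a hypothetical, self-contained sketch of how a tool-in-the-loop cycle can be structured. The “model” here is a scripted stand-in, and every name in the snippet is invented for illustration.

```python
# Hypothetical sketch; not OpenAI's internal API or message format.
def fake_model(messages, tools):
    # Scripted stand-in policy: request one calculation, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "python", "args": "2 ** 10"}
    return {"answer": f"The result is {messages[-1]['content']}"}

def solve_with_tools(task, tools, max_calls=300):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_calls):                    # hard tasks may loop many times
        step = fake_model(messages, tools)
        if "answer" in step:                      # model decided it is done
            return step["answer"]
        output = str(tools[step["tool"]](step["args"]))       # run requested tool
        messages.append({"role": "tool", "content": output})  # feed result back
    raise RuntimeError("tool-call budget exhausted")

tools = {"python": eval}                            # toy "Python interpreter" tool
print(solve_with_tools("What is 2 ** 10?", tools))  # -> The result is 1024
```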

How does the physics demo demonstrate “reasoning + tools” together?

A physics poster from 2015 is provided, but the key result isn’t actually printed on the poster. o3 zooms and inspects the image, infers the needed steps (e.g., slope/extrapolation and normalization), then searches the web for recent findings to compare against state-of-the-art estimates. The demo highlights both speed (reading multiple papers quickly) and uncertainty (precision lagging recent results).
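
The poster’s actual numbers are not shown in the summary, but the slope-fit, extrapolation, and normalization steps the model infers are standard. A generic sketch with made-up data:

```python
# Generic sketch; the poster's real values are not given in the video summary.
import numpy as np

years = np.array([2009, 2011, 2013, 2015])   # hypothetical measurement years
values = np.array([1.8, 2.4, 3.1, 3.7])      # hypothetical measured quantity

slope, intercept = np.polyfit(years, values, deg=1)  # least-squares linear fit
estimate_now = slope * 2025 + intercept              # extrapolate to the present

reference = 10.0                             # hypothetical normalization constant
print(f"extrapolated: {estimate_now:.2f}, normalized: {estimate_now / reference:.3f}")
```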

What does the AIME math example reveal about how strategies appear during problem solving?

The model first generates a brute-force program and runs it in a Python interpreter to get the correct answer (82). It then recognizes the approach is inelegant, simplifies the solution, and double-checks the result to improve reliability. OpenAI emphasizes that these behaviors were not hard-coded as explicit instructions; they emerged from training for usefulness.
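
The specific AIME problem is not identified in the video, so the snippet below is a toy analogue of the same pattern: brute force first, then a simplified closed form, then a cross-check of the two answers.

```python
# Toy analogue of the demo's workflow; not the actual AIME problem.

# Step 1: brute force -- count multiples of 3 or 5 below 1000 by looping.
brute = sum(1 for n in range(1, 1000) if n % 3 == 0 or n % 5 == 0)

# Step 2: simplify -- the same count via inclusion-exclusion, no loop.
closed_form = 999 // 3 + 999 // 5 - 999 // 15

# Step 3: double-check -- trust the answer only if both strategies agree.
assert brute == closed_form
print(closed_form)  # 466
```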

How does the SWE-bench coding demo work end-to-end?

The model is given a container with the SymPy repository preloaded and asked to find and fix a bug. It starts by verifying the reported behavior (checking how output formatting differs), then browses the codebase using terminal-like commands, identifies a relevant Python construct (MRO/inheritance), applies a patch, and finally runs unit tests to confirm the fix changes square-bracket rendering as expected. The demo reports a relatively short run (about 22 interactions) with thousands of tokens.
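
The exact bug is not specified beyond “a Python construct (MRO/inheritance)” and square-bracket output, so the following standalone snippet only illustrates how Python’s method resolution order decides which render method a class inherits; it is not the actual SymPy code.

```python
# Standalone MRO illustration; not the real SymPy code or bug.
class BasePrinter:
    def render(self, items):
        return "(" + ", ".join(items) + ")"   # default: parentheses

class BracketMixin:
    def render(self, items):
        return "[" + ", ".join(items) + "]"   # override: square brackets

class ListPrinter(BracketMixin, BasePrinter):  # mixin listed first in the MRO
    pass

# MRO is ListPrinter -> BracketMixin -> BasePrinter -> object,
# so render() resolves to the bracket version.
print([c.__name__ for c in ListPrinter.__mro__])

# A unit-test-style check like the demo's final verification step:
assert ListPrinter().render(["a", "b"]) == "[a, b]"
print(ListPrinter().render(["a", "b"]))
```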

Why is Codex CLI positioned as a step toward “future programming”?

Codex CLI is described as a lightweight interface that connects models to users’ computers via public APIs (including the Responses API) and supports multimodal reasoning. It includes execution modes: a suggest mode that requires approvals for edits/commands, and a full-auto mode that runs commands with network disabled and restricts edits to the directory where it was launched—aiming for safer agent deployment.

What performance and cost tradeoffs are highlighted for o3 vs o4-mini?

OpenAI reports that o4-mini is substantially better than o3-mini at comparable estimated inference cost on an external eval (Humanity’s Last Exam). It also claims o3 can match earlier-model performance at lower inference cost, and that spending the same amount yields a higher score. The message: o4-mini targets speed/cost efficiency, while o3 targets stronger performance.

Review Questions

  1. How does training for tool use change what o3/o4-mini can do compared with earlier reasoning models?
  2. In the physics demo, what two distinct tool-driven steps are required to produce an updated estimate?
  3. What evidence from the AIME and SWE-bench examples suggests the models can iteratively improve their approach rather than just compute once?

Key Points

  1. OpenAI’s o3 and o4-mini are trained to use tools during reasoning, enabling multi-step workflows like web browsing, Python execution, and image manipulation.
  2. Tool use is described as central to performance gains across math, coding, and science benchmarks, including AIME, GPQA, Codeforces, and SWE-bench.
  3. Multimodal capability is demonstrated through Python-based image handling, letting the models work with blurry or rotated inputs.
  4. Demos show iterative problem solving: generating a brute-force solution, running it, then simplifying and double-checking for reliability.
  5. Coding performance is illustrated via containerized repository debugging (SymPy), including patching and unit-test verification.
  6. OpenAI attributes improvements to scaling both training and test-time compute within an RL paradigm, with continued gains as RL compute increases.
  7. Codex CLI is introduced as a safer way to deploy tool-using agents locally, including a suggest mode and a full-auto mode with network disabled and edit restrictions.

Highlights

o3 can treat tools as part of the reasoning loop—web search, Python computation, and image manipulation—rather than producing answers in one pass.
In the physics demo, the model reconstructs missing results from an old poster and then searches recent literature to update estimates and compare precision.
The AIME math example shows an emergent workflow: brute-force program → run in Python → simplify → double-check.
The SymPy SWE-bench demo demonstrates real codebase debugging inside a container, ending with unit tests confirming the fix.
Codex CLI aims to make local code-executing agents practical by combining multimodal inputs with safer execution controls.

Topics

  • Tool-Using Reasoning
  • Multimodal Image Reasoning
  • Benchmark Results
  • RL Scaling
  • Codex CLI
  • ChatGPT Rollout
