OpenAI o3 & o4-mini
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
OpenAI’s o3 and o4-mini are trained to use tools during reasoning, enabling multi-step workflows like web browsing, Python execution, and image manipulation.
Briefing
OpenAI is rolling out two new reasoning models—o3 and o4-mini—positioned as a qualitative jump because they behave like tool-using AI systems rather than standalone text predictors. The core shift is training for tool use inside the reasoning process: the models can call tools repeatedly (including long multi-step runs) while solving hard tasks, using those tool outputs as part of their chain-of-thought workflow. That matters because it turns “reasoning” into something operational—models can browse, run code, manipulate images, and work through multi-stage problems instead of producing one-off answers.
The rollout starts incrementally via the API and ChatGPT, with o3 and o4-mini also described as strong in science, coding, and math. OpenAI ties the improvement to combining its o-series reasoning approach with a full suite of tools, reporting state-of-the-art results across benchmarks including AIME, GPQA, Codeforces, and SWE-bench. Tool use is presented as the enabling layer: like calculators for math or maps for navigation, the models become more capable when paired with the right instruments. A key example is multimodal reasoning with images. Users can upload difficult images—blurry, upside down, or otherwise messy—and the model can use Python-based image manipulation (crop/transform) to extract what it needs.
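The crop/transform workflow above can be sketched in miniature. This is a toy stand-in, not the models' actual tooling (which would use a real imaging library): here an "image" is just a 2D list of pixel values, and the two helpers mimic the kind of operations—righting a rotated input, then cropping the region of interest—described in the demo.

```python
# Toy sketch of Python-based image manipulation: an "image" is a
# 2D list of pixel values rather than a real image object.

def rotate_180(img):
    """Flip an upside-down image right side up
    (reverse the rows, then reverse each row)."""
    return [row[::-1] for row in img[::-1]]

def crop(img, top, left, height, width):
    """Extract just the region of interest before reading it."""
    return [row[left:left + width] for row in img[top:top + height]]

# A small "upside-down" image with content in the middle.
image = [
    [0, 0, 0, 0],
    [0, 7, 8, 0],
    [0, 5, 6, 0],
]
upright = rotate_180(image)          # content is now at the top
region = crop(upright, 0, 1, 2, 2)  # zoom in on the 2x2 patch
```

In practice the model writes and runs code like this against the uploaded image itself, iterating (zoom, rotate, re-crop) until the relevant detail is legible.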
A physics demo illustrates the “agentic” workflow. One researcher feeds o3 a 2015 physics poster and asks it to recover a specific quantity from the project and compare it to recent literature. The result isn’t on the poster, so the model zooms around the image, infers what calculations are needed (including slope/extrapolation and normalization), then searches the web for updated estimates by reading multiple papers. The demo emphasizes time savings and uncertainty handling: the model produces a plausible estimate with caveats about precision relative to state-of-the-art.
A second demo shows personalization plus tool-driven research. With memory enabled, o3 reads the news, looks up a topic aligned with the user’s interests (scuba diving and music), and generates a blog-style writeup with plots and citations. The described research thread: playing recordings of healthy coral reefs underwater to accelerate coral and fish settlement—an intersection of environmental restoration and underwater audio.
Beyond demos, OpenAI reports benchmark performance and also shows how tool use emerges. On an AIME math problem, the model first generates a brute-force program, runs it in a Python interpreter, then simplifies into a cleaner solution and double-checks reliability—without being explicitly instructed to follow those exact strategies. In coding, a SWE-bench example uses a containerized environment with a preloaded repository to locate and fix a bug in the SymPy Python package, then confirms the fix by running unit tests.
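The brute-force-then-simplify pattern can be illustrated on a toy problem (not the actual competition question from the demo): summing the first n odd numbers. Step one is a direct program that just computes the answer; step two is the simplified closed form; step three cross-checks the two, mirroring the double-checking behavior described above.

```python
# Sketch of the "brute force, then simplify, then verify" pattern,
# on a toy problem: the sum of the first n odd numbers.

def brute_force(n):
    """Step 1: a direct program that just computes the answer."""
    return sum(2 * k + 1 for k in range(n))

def closed_form(n):
    """Step 2: the simplified solution — the sum is n squared."""
    return n * n

# Step 3: double-check the clean solution against the brute force.
assert all(brute_force(n) == closed_form(n) for n in range(1, 200))
```

The point of the demo is that the model chose this progression on its own: it used the interpreter to get a trustworthy answer first, then derived and validated a cleaner solution against it.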
Training and evaluation are framed as scaling both training and test-time compute within an RL paradigm, with reported gains continuing as RL compute increases. OpenAI also introduces Codex CLI as a safer, lightweight interface for connecting models to users’ computers, including multimodal inputs and an execution mode that restricts network access and limits edits to the working directory. Finally, ChatGPT access is staged: Plus, Pro, and Team subscribers begin receiving o3, o4-mini, and o4-mini-high, while Enterprise/Edu access follows a week later; model availability expands to the API, with tool support planned for the coming weeks.
Cornell Notes
OpenAI’s o3 and o4-mini are presented as reasoning models trained to use tools as part of solving problems, not just to generate text. The models combine O-series reasoning with a full tool suite, enabling repeated tool calls (including long runs), web browsing, Python-based computation, and image manipulation. Demos show o3 extracting missing physics results from an old poster, searching recent papers to update estimates, and producing personalized research outputs with plots and citations. Benchmark reporting claims strong performance in math, coding, and science, with tool use demonstrated through workflows like generating and running code, then simplifying and double-checking answers. The rollout also includes Codex CLI for safer local agent execution and staged ChatGPT/API availability.
What’s the biggest practical change behind o3 and o4-mini?
How does the physics demo demonstrate “reasoning + tools” together?
What does the AIME math example reveal about how strategies appear during problem solving?
How does the SWE-bench coding demo work end-to-end?
Why is Codex CLI positioned as a step toward “future programming”?
What performance and cost tradeoffs are highlighted for o3 vs o4-mini?
Review Questions
- How does training for tool use change what o3/o4-mini can do compared with earlier reasoning models?
- In the physics demo, what two distinct tool-driven steps are required to produce an updated estimate?
- What evidence from the AIME and SWE-bench examples suggests the models can iteratively improve their approach rather than just compute once?
Key Points
1. OpenAI’s o3 and o4-mini are trained to use tools during reasoning, enabling multi-step workflows like web browsing, Python execution, and image manipulation.
2. Tool use is described as central to performance gains across math, coding, and science benchmarks, including AIME, GPQA, Codeforces, and SWE-bench.
3. Multimodal capability is demonstrated through Python-based image handling, letting the models work with blurry or rotated inputs.
4. Demos show iterative problem solving: generating a brute-force solution, running it, then simplifying and double-checking for reliability.
5. Coding performance is illustrated via containerized repository debugging (SymPy), including patching and unit-test verification.
6. OpenAI attributes improvements to scaling both training and test-time compute within an RL paradigm, with continued gains as RL compute increases.
7. Codex CLI is introduced as a safer way to deploy tool-using agents locally, including a suggest mode and a full-auto mode with network access disabled and edits restricted to the working directory.