OpenAI o3 and o3-mini—12 Days of OpenAI: Day 12

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

OpenAI announced o3 and o3-mini and is gating broad release behind external public safety testing applications.

Briefing

OpenAI is announcing two new reasoning models—o3 and o3-mini—positioned as a step-change in performance on coding, math, and general reasoning benchmarks, while also expanding public safety testing before any broad release. The company is not launching them for general users yet, but it is opening applications immediately for researchers to take part in public safety testing, with external access planned first for o3-mini and eventually for o3.

On the capabilities front, o3 posts strong results across technical benchmarks. On real-world software engineering tasks (SWE-bench Verified), it reaches 71.7% accuracy, reported as more than 20 percentage points above o1. On Codeforces competition coding, o3 is described as reaching an Elo rating near 2727 under aggressive test-time compute. For math, o3 scores 96.7% on the AIME benchmark (vs. 83.3% for o1) and 87.7% on GPQA Diamond, a PhD-level science benchmark (vs. 78% for o1). OpenAI also highlights a “hard mode” math test—Epoch AI’s FrontierMath—where most systems score under 2% accuracy; o3 is reported to exceed 25% under aggressive settings, a notable jump on what’s framed as the toughest available math benchmark.

A centerpiece of the announcement is o3’s performance on ARC Prize’s ARC-AGI benchmark, a long-running test designed to measure learning of new skills from examples rather than memorization. ARC Prize says ARC-AGI has stood unbeaten for five years, and it reports o3 achieving a new state-of-the-art score of 75.7% on a semi-private holdout set under low compute, rising to 87.5% under high compute. The company ties this to a milestone: the high-compute score is described as surpassing a human-performance threshold (around 85%), a bar OpenAI says it has not previously seen a model clear on this benchmark.

Alongside o3, OpenAI is rolling out o3-mini as a cost-efficient reasoning option. With “adaptive thinking time” exposed in the API (low, medium, and high reasoning effort), o3-mini is presented as delivering coding performance that scales with more compute while remaining far cheaper than top-tier reasoning models. In live demos, o3-mini generates and executes Python code via a local code-generation-and-execution workflow, and it runs a self-evaluation script against a hard GPQA dataset quickly using low reasoning effort. OpenAI also reports that o3-mini supports key API features such as function calling and structured outputs, aiming to make the reasoning models practical for developers.
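For developers, adaptive thinking time is just a request parameter. Below is a minimal sketch of what such a call might look like with the OpenAI Python SDK, assuming the model ships under the identifier `o3-mini` and uses the SDK’s reasoning-effort option; the prompt is illustrative, not taken from the video.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask o3-mini for a quick answer with minimal test-time compute;
# reasoning_effort trades cost and latency against reasoning depth.
response = client.chat.completions.create(
    model="o3-mini",            # assumed model identifier
    reasoning_effort="low",     # "low" | "medium" | "high"
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a number is prime.",
        }
    ],
)
print(response.choices[0].message.content)
```

Raising the effort to "high" should buy deeper reasoning at the cost of latency and tokens, which is the tradeoff the demo emphasizes.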

Finally, OpenAI pairs the capability push with safety work. Applications for external safety testing open today, with o3-mini planned for broader availability around the end of January and full o3 shortly after. The company also introduces “deliberative alignment,” a safety technique that uses the model’s own reasoning to decide when to reject or flag prompts, improving performance on refusal-related benchmarks. The overall message is clear: stronger reasoning models are arriving, but the rollout is gated by expanded safety evaluation and tighter alignment methods.

Cornell Notes

OpenAI announced two new reasoning models, o3 and o3-mini, and is opening applications for external safety testing before general release. o3 is reported to deliver major gains on coding and math benchmarks, including SWE-bench Verified (71.7%), AIME (96.7%), GPQA Diamond (87.7%), and Epoch AI’s FrontierMath (over 25% under aggressive settings). A key milestone comes from ARC Prize’s ARC-AGI benchmark: o3 scores 75.7% on a semi-private holdout set under low compute and 87.5% under high compute, surpassing a human-performance threshold. o3-mini targets cost-efficient reasoning with adaptive thinking time (low/medium/high) and supports API features like function calling and structured outputs. OpenAI also highlights “deliberative alignment,” which uses the model’s reasoning to refine the boundary between safe and unsafe prompts.

What benchmarks does o3 outperform, and what do those numbers imply about its capabilities?

OpenAI reports o3 at 71.7% accuracy on SWE-bench Verified (real-world software tasks), described as more than 20 percentage points better than o1. On Codeforces, it reaches an Elo rating near 2727 under aggressive test-time compute. For math, o3 scores 96.7% on AIME (vs. 83.3% for o1) and 87.7% on GPQA Diamond (vs. 78% for o1). Together, these results suggest stronger reasoning-driven performance in both coding and high-level math, not just narrow pattern matching.

Why is Epoch AI’s FrontierMath benchmark treated as a tougher test than earlier math benchmarks?

FrontierMath is framed as the toughest math benchmark available, built from novel, unpublished, very hard problems that can take professional mathematicians hours or days to solve. OpenAI says current offerings score under 2% accuracy on it, while o3 exceeds 25% under aggressive test-time compute. Against such a low baseline, that jump reads as more meaningful evidence of genuine capability.

What makes ARC Prize’s ARC-AGI benchmark different from typical benchmark leaderboards?

ARC-AGI is designed around learning transformation rules from input-output examples, with each task requiring distinct skills. OpenAI emphasizes that tasks are intentionally varied so the model can’t rely on memorizing a single pattern; instead, it must infer rules and generalize to new tasks, as the toy example below illustrates. The benchmark’s long-standing difficulty is underscored by the claim that it had stood unbeaten for five years.
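To make that format concrete, here is a toy, hypothetical ARC-style task, invented for illustration and not drawn from the benchmark: the solver sees a few input-to-output grid pairs, infers the transformation, and applies it to a new grid.

```python
# Toy, hypothetical ARC-style task (not from the actual benchmark).
# Each grid cell holds a color index; the hidden rule in this example
# is "mirror each row horizontally". A solver sees the demonstration
# pairs and must infer the rule, then apply it to a fresh input.

train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5], [6, 7]], [[5, 4], [7, 6]]),
]

def infer_rule(pairs):
    """Test one candidate rule (horizontal mirror) against all demos."""
    mirror = lambda grid: [row[::-1] for row in grid]
    if all(mirror(x) == y for x, y in pairs):
        return mirror
    raise ValueError("no candidate rule matched the demonstrations")

rule = infer_rule(train_pairs)
print(rule([[8, 9], [0, 1]]))  # -> [[9, 8], [1, 0]]
```

Real ARC-AGI tasks hide a different rule in each task, which is why a fixed library of memorized candidates does not carry a solver very far.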

How do o3 and o3-mini differ in rollout and intended use?

OpenAI is not publicly launching either model for general users immediately. Instead, it opens public safety testing applications starting today, with external access focused first on o3-mini. o3-mini is positioned as the cost-efficient reasoning frontier for developers, using adaptive thinking time (low/medium/high) so users can trade latency and cost against reasoning effort. o3 is the higher-capability model, with broader availability planned after o3-mini.

What is “deliberative alignment,” and how does it change safety evaluation?

Deliberative alignment uses the model’s reasoning ability to evaluate prompts against a safety specification, aiming to uncover hidden intent (e.g., attempts to trick the model) even when prompts are obfuscated. OpenAI reports improved performance on a rejection benchmark (reject vs. review decisions), showing better ability both to reject unsafe prompts and to flag borderline prompts for review—two metrics that are often in tension.
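In spirit, the technique has the reasoning model deliberate over the safety policy itself before a request is served. Below is a loose, hypothetical sketch of such a gate; the spec text, the `gate` helper, and the single-word output format are all invented for illustration and are not OpenAI’s implementation.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical safety specification; OpenAI's actual spec and prompt
# format were not published in this video.
SAFETY_SPEC = (
    "You are a safety reviewer. Reason step by step about the user's "
    "true intent, including obfuscated or encoded requests, then reply "
    "with exactly one word: ALLOW, REJECT, or REVIEW."
)

def gate(prompt: str) -> str:
    """Illustrative reject/review/allow gate using a reasoning model."""
    response = client.chat.completions.create(
        model="o3-mini",           # assumed model identifier
        reasoning_effort="high",   # spend extra compute on safety decisions
        messages=[
            {"role": "system", "content": SAFETY_SPEC},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content.strip()

print(gate("How do I pick a lock?"))  # e.g. REJECT or REVIEW
```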

What practical capability does o3-mini demonstrate in the API demo?

In a live demo, o3-mini generates Python code that runs in a local workflow: the model returns code, the system saves it to a file, and a terminal executes it automatically. The demo also includes a self-evaluation workflow in which o3-mini downloads a raw GPQA dataset file, parses the questions, options, and answers, and grades its own results—reported as fast when using low reasoning effort.
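Below is a rough reconstruction of that generate-save-execute loop, assuming a plain chat-completions call; the prompt, file name, and execution step are illustrative, since the demo’s actual harness was not released.

```python
import subprocess
from openai import OpenAI

client = OpenAI()

# 1. Ask o3-mini for a script, requesting raw code only. The task and
#    model identifier are assumptions for illustration.
reply = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="medium",
    messages=[{
        "role": "user",
        "content": "Return only Python code, no prose, that prints "
                   "the first 10 Fibonacci numbers.",
    }],
)
code = reply.choices[0].message.content

# 2. Save the generated code locally, as the demo's server step did.
with open("generated.py", "w") as f:
    f.write(code)

# 3. Execute it in a subprocess, standing in for the demo's terminal.
result = subprocess.run(["python", "generated.py"],
                        capture_output=True, text=True)
print(result.stdout)
```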

Review Questions

  1. Which reported o3 benchmark results best support the claim that it improves reasoning utility rather than only narrow coding performance?
  2. How does ARC-AGI’s “distinct skills per task” design reduce the value of memorization, and why does that matter for interpreting scores?
  3. What tradeoffs does adaptive thinking time introduce for o3-mini, and how does deliberative alignment aim to improve safety decisions under those reasoning capabilities?

Key Points

  1. OpenAI announced o3 and o3-mini and is gating broad release behind external public safety testing applications.

  2. o3 is reported to achieve 71.7% on SWE-bench Verified, an Elo rating near 2727 on Codeforces under aggressive compute, 96.7% on AIME, and 87.7% on GPQA Diamond.

  3. o3’s performance on Epoch AI’s FrontierMath is reported as over 25% accuracy under aggressive test-time compute, contrasted with under 2% for other offerings.

  4. ARC Prize reports o3 scoring 75.7% on ARC-AGI under low compute and 87.5% under high compute, with the high-compute score described as above a human-performance threshold.

  5. o3-mini is positioned as cost-efficient reasoning with adaptive thinking time (low/medium/high) and support for API features like function calling and structured outputs.

  6. OpenAI introduced deliberative alignment, using model reasoning to refine the safety boundary and improve rejection vs. review performance.

Highlights

o3 is reported to exceed 25% on Epoch AI’s FrontierMath—an area where other systems are said to score under 2%.
ARC Prize’s ARC-AGI benchmark shows o3 reaching 75.7% (low compute) and 87.5% (high compute), with the high-compute score described as surpassing a human-performance threshold.
o3-mini’s adaptive thinking time lets developers trade cost and latency against reasoning effort while retaining strong coding performance.
Deliberative alignment aims to improve safety by using the model’s reasoning to detect hidden intent, boosting both rejection and review-related metrics.
