
OpenAI DevDay 2024 | OpenAI Research

OpenAI · 5 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

o1 is trained to reason via reinforcement learning, iteratively refining strategies and correcting mistakes rather than relying on a single attempt.

Briefing

OpenAI’s o1 family is positioned as a reasoning-first shift: the models are trained to “think with reinforcement learning,” iteratively refine strategies, and correct mistakes by trying an approach, learning from failure, and moving toward a better plan. In a concrete example, o1 works through a difficult cipher by repeatedly reassessing whether the current line of reasoning is productive, switching strategies when progress stalls, and eventually arriving at a correct solution. The takeaway is less about faster answers and more about a fundamentally different problem-solving behavior—described as a potential “new paradigm”—where patience and strategy refinement matter for hard tasks.

That paradigm framing quickly turns into practical guidance on when to use o1 versus GPT-4o. For extremely hard math and code, o1-preview and o1 show a clear advantage on benchmarks such as AIME (competition math) and Codeforces (coding). In those results, GPT-4o is reported as barely solving a small fraction of questions, while o1-preview solves more than half and o1 solves the majority of problems in the dataset. The message is that there’s a specific subset of tasks where GPT-4o struggles and o1 models take over.

Broader evaluations reinforce the pattern but also add nuance. On math-heavy benchmarks (MATH from Hendrycks et al., Physics, College Math, and the LSAT), o1-preview delivers large gains over GPT-4o. But the improvement is not universal: tasks such as AP English Language and Literature, the SAT, and Public Relations show little difference, suggesting o1’s edge concentrates in domains where multi-step reasoning and structured problem solving dominate.

Cost and latency become the tradeoff. Because o1-preview and o1 require time to think, they are described as more expensive and slower than GPT-4o, which remains the better default for most API use cases. GPT-4o is framed as lower-latency and lower-cost, but weaker on prompts that demand strong reasoning or coding/math depth.

The briefing also distinguishes between o1-preview and o1-mini. An inference-cost versus performance plot on AIME indicates o1-mini is “strictly better” than o1-preview, attributed to specialization for fast but still capable math and coding. The guidance is straightforward: choose o1-mini when speed and cost matter, especially for math/coding; choose o1-preview when you want stronger performance and can tolerate the added inference cost.

Finally, several API use cases are highlighted for o1-preview and o1-mini: accuracy detection in medical workflows (flagging whether a diagnosis is correct), coding assistance (including tools like Cursor), hard-sciences research, and reasoning-heavy brainstorming in math or legal domains. Overall, the o1 family is presented as a targeted tool for the hardest reasoning problems—where iterative strategy refinement pays off—paired with clear decision rules based on benchmark performance, latency, and cost.

Cornell Notes

OpenAI describes o1 as a reasoning model trained with reinforcement learning to iteratively refine strategies and correct mistakes. In hard problems, it may not find the right approach immediately; instead, it tries a strategy, learns from failure, and then switches to a better plan. Benchmark results emphasize that o1-preview and especially o1 outperform GPT-4o on extremely challenging math and coding tasks (AIME and Codeforces), with large gains on math/physics/LSAT-style evaluations but little improvement on some reading/writing and general tasks. The tradeoff is practical: o1 models take longer and cost more due to “time to think.” o1-mini is positioned as a faster, cheaper option that can be more cost-effective than o1-preview for math and coding.

What makes o1’s behavior different from earlier model generations, and why does that matter for difficult tasks?

o1 is trained to “think with reinforcement learning,” learning to refine its thinking strategies and recognize/correct mistakes. When a problem is very hard, it may not reach a working strategy in one attempt; it tries a strategy, even if unsuccessful, and uses that attempt as a cue for what to try next. This patience-and-refinement loop is presented as a new problem-solving paradigm, particularly relevant when success depends on multi-step reasoning rather than a single-shot guess.

On which benchmarks does o1 outperform GPT-4o most clearly, and what does that imply about task selection?

The strongest contrast is reported on AIME (competition math) and Codeforces (coding). GPT-4o is described as barely solving a few questions, while o1-preview solves more than half and o1 solves the majority of problems in the dataset. The implication is that o1 is best reserved for a subset of tasks where reasoning depth is the bottleneck—tasks that look “extremely hard” under standard prompting.

Why do o1-preview gains vary across domains?

Large improvements are reported on math and reasoning-heavy benchmarks such as MATH (Hendrycks et al.), Physics, College Math, and the LSAT. But the gains are not universal: tasks like AP English Language and Literature, the SAT, and Public Relations show little improvement over GPT-4o. That pattern suggests o1’s advantage concentrates in domains where structured reasoning and problem-solving steps matter more than other skills.

How should developers balance performance against cost and latency when choosing between o1 and GPT-4o?

o1-preview and o1 require time to think, making them more expensive and higher-latency than GPT-4o. GPT-4o remains the recommended default for most API use cases because it’s lower cost and lower latency. The decision rule becomes: use o1 when the task needs strong reasoning/coding/math performance; otherwise, prefer GPT-4o for efficiency.
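The decision rule above can be sketched as a small routing helper. This is an illustrative assumption, not an official API: the `choose_model` function and its parameters are hypothetical, though the returned strings match the model names discussed in the briefing.

```python
# Illustrative sketch of the briefing's decision rule for picking a model.
# choose_model and its parameters are hypothetical helpers; the returned
# strings are the model names discussed above.

def choose_model(needs_deep_reasoning: bool, latency_sensitive: bool) -> str:
    """Route a request per the cost/latency/reasoning tradeoff."""
    if not needs_deep_reasoning:
        # GPT-4o: lower-cost, lower-latency default for most API use cases.
        return "gpt-4o"
    if latency_sensitive:
        # o1-mini: faster and cheaper, specialized for math and coding.
        return "o1-mini"
    # o1-preview: stronger reasoning at higher inference cost and latency.
    return "o1-preview"


print(choose_model(needs_deep_reasoning=False, latency_sensitive=True))   # gpt-4o
print(choose_model(needs_deep_reasoning=True, latency_sensitive=True))    # o1-mini
print(choose_model(needs_deep_reasoning=True, latency_sensitive=False))   # o1-preview
```

In practice such a router could also key off task type (math, coding, open-ended writing), mirroring the domain-specific benchmark pattern described earlier.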

When is o1-mini the better choice than o1-preview?

A plot of inference cost versus performance on AIME indicates o1-mini is strictly better than o1-preview, attributed to specialization for fast but performant math and coding. So o1-mini is recommended when speed and cost matter, while o1-preview is a good choice when you want stronger performance and can accept higher inference cost.

What are concrete API use cases mentioned for o1-preview and o1-mini?

Examples include medical accuracy detection—using the model to judge whether a diagnosis is correct—coding assistance (with tools like Cursor), hard-sciences research, and brainstorming partners for math or legal-domain reasoning. These map to scenarios where multi-step reasoning and verification are valuable.
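As a concrete sketch of the accuracy-detection pattern, the helpers below build a verdict prompt for a reasoning model and parse its free-text reply. The prompt wording, function names, and the "VERDICT:" convention are assumptions for illustration; the actual API call to an o1-family model is elided.

```python
# Hypothetical sketch of the accuracy-detection use case: ask a reasoning
# model (e.g. o1-mini) whether a diagnosis is supported by case notes,
# then parse its reply. The prompt wording and the "VERDICT:" line
# convention are assumptions, not part of any API.
from typing import Optional


def build_verdict_prompt(case_notes: str, diagnosis: str) -> str:
    """Assemble a single user message asking for an explicit verdict line."""
    return (
        "Review the clinical case below.\n\n"
        f"Case notes:\n{case_notes}\n\n"
        f"Proposed diagnosis: {diagnosis}\n\n"
        "State whether the diagnosis is supported by the notes. Reply with "
        "one line 'VERDICT: correct' or 'VERDICT: incorrect', then a brief "
        "justification."
    )


def parse_verdict(reply: str) -> Optional[bool]:
    """Return True/False for the model's verdict, or None if no verdict line."""
    for line in reply.splitlines():
        line = line.strip().lower()
        if line.startswith("verdict:"):
            # Check for 'incorrect' first: 'correct' is a substring of it.
            return "incorrect" not in line
    return None


print(parse_verdict("VERDICT: incorrect\nThe notes suggest bronchitis."))  # False
```

The same prompt-and-parse shape carries over to the other use cases listed above, wherever the model is asked to verify rather than generate.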

Review Questions

  1. Which parts of the reported benchmark results suggest o1’s advantage is domain-specific rather than universal?
  2. What practical tradeoffs (latency and cost) are associated with using o1-preview or o1, and how does that affect product decisions?
  3. How do the AIME and Codeforces results guide when to choose o1-family models over GPT-4o?

Key Points

  1. o1 is trained to reason via reinforcement learning, iteratively refining strategies and correcting mistakes rather than relying on a single attempt.

  2. o1’s advantage is most pronounced on extremely hard math and coding tasks, with reported major gains on AIME and Codeforces versus GPT-4o.

  3. o1-preview shows large improvements on math/physics/LSAT-style benchmarks, but not on every task category (e.g., some SAT/AP English and Public Relations tasks).

  4. Using o1-preview or o1 costs more and adds higher latency because the models require time to think.

  5. GPT-4o remains the efficient default for most API use cases, especially when strong reasoning/coding/math depth is not the main requirement.

  6. o1-mini is positioned as a cost-effective alternative for math and coding, offering faster inference and better cost-performance than o1-preview on AIME.

  7. o1-preview and o1-mini are suggested for applications like diagnosis accuracy detection, coding workflows, hard-sciences research, and legal/math reasoning support.

Highlights

o1 is described as “patient” reasoning: it tries a strategy, learns from failure, and then switches approaches until it finds a workable plan.
On AIME and Codeforces, o1-preview and especially o1 are reported to solve far more problems than GPT-4o, indicating a strong fit for the hardest reasoning tasks.
o1-mini is reported to be strictly better than o1-preview on cost-performance for AIME, reflecting specialization for fast math/coding.
The main tradeoff is operational: o1 models take longer and cost more due to “time to think,” while GPT-4o stays faster and cheaper for general use.
