Explaining OpenAI's o1 Reasoning Models

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

o1 and o1 mini are reasoning models, not GPT-5, and they’re intended to complement the ongoing GPT series.

Briefing

OpenAI’s o1 and o1 mini are reasoning-first models that trade speed for deeper problem solving by spending substantially more compute during inference—often generating long internal reasoning traces and selecting among multiple candidate paths. OpenAI is explicit that these models are not GPT-5, and they’re positioned as a new direction alongside continued development of the GPT series. The practical takeaway is straightforward: for hard tasks like math, coding, and complex analysis, o1-style models can deliver stronger results, but they cost more and may take longer because they “think” longer before answering.

Unlike standard GPT-style chat models, which typically run a single pass from prompt to response, o1 models are trained to reason through prompts and appear to continue that selection-like behavior at inference time. The training approach described centers on large-scale reinforcement learning that teaches productive chain-of-thought behavior over trajectories or trees of reasoning, then uses those outcomes to improve the model. Crucially, the system also appears to spend additional compute during inference—suggesting multiple passes, backtracking, or other multi-step search over reasoning traces—because the time spent generating answers scales with how difficult the question is.
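
OpenAI has not published the exact inference-time mechanism, but the "spend more compute, pick the best path" idea can be illustrated with best-of-n sampling over candidate reasoning traces. The sketch below is purely illustrative, not o1's actual algorithm; `generate_trace` and `score_trace` are hypothetical stand-ins for a sampler and a learned verifier.

```python
import random
from typing import Callable

def best_of_n_answer(
    question: str,
    generate_trace: Callable[[str], str],      # samples one reasoning trace
    score_trace: Callable[[str, str], float],  # verifier: higher = better
    n: int = 8,
) -> str:
    """Spend extra inference compute: sample n candidate reasoning
    traces and return the highest-scoring candidate."""
    candidates = [generate_trace(question) for _ in range(n)]
    return max(candidates, key=lambda trace: score_trace(question, trace))

# Toy stand-ins so the sketch runs; a real system would call a model.
def generate_trace(question: str) -> str:
    return f"step-by-step attempt #{random.randint(1, 1000)} for: {question}"

def score_trace(question: str, trace: str) -> float:
    return random.random()  # a real verifier would judge correctness

print(best_of_n_answer("What is 17 * 24?", generate_trace, score_trace))
```

Raising `n` is one simple way inference-time compute could scale with question difficulty, consistent with the variable answer times noted above.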

A key theme is that chain-of-thought should be learned in the model rather than merely elicited via prompting. OpenAI’s approach appears to include post-training steps that generate many reasoning traces (including self-play-style trees) and then use a checker or evaluator to reinforce the best strategies. That design aims to let the model recognize mistakes midstream, correct course, and even backtrack to an earlier point before continuing. The result is a model that can break down instructions, translate them into structured subproblems, and assemble logic with built-in self-checking.
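
The "generate many traces, reinforce the best" loop resembles published techniques such as rejection sampling or STaR. The sketch below is an interpretation of that idea, not OpenAI's actual pipeline; `collect_reinforcement_data`, `sample_trace`, and `checker` are hypothetical names.

```python
from typing import Callable, List, Tuple

def collect_reinforcement_data(
    problems: List[Tuple[str, str]],      # (question, known answer) pairs
    sample_trace: Callable[[str], str],   # draws one reasoning trace
    checker: Callable[[str, str], bool],  # does the trace reach the answer?
    samples_per_problem: int = 16,
) -> List[Tuple[str, str]]:
    """One round of STaR-style filtering: keep only traces the checker
    verifies, then train on the (question, good_trace) pairs."""
    keep = []
    for question, answer in problems:
        for _ in range(samples_per_problem):
            trace = sample_trace(question)
            if checker(trace, answer):  # reinforce only verified reasoning
                keep.append((question, trace))
    return keep  # feed into a fine-tuning / RL update step

# Toy usage: the checker just looks for the answer string in the trace.
data = collect_reinforcement_data(
    [("2+2?", "4")],
    sample_trace=lambda q: f"2 plus 2 equals 4, so the answer is 4 ({q})",
    checker=lambda trace, ans: ans in trace,
    samples_per_problem=2,
)
print(data)
```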

Public commentary from researchers tied to the work reinforces this direction: hidden chain-of-thought training can improve generalization and data efficiency, and the gains are especially visible on benchmarks that reward long-form reasoning. In evaluation comparisons, the full o1 is assessed under a “maximum test-time compute” setting, meaning it’s allowed to use as much time and compute as possible; under those conditions it can outperform o1 preview, which may be constrained to a smaller inference-time budget. The tradeoff shows up in subjective tasks: outputs for personal writing and other human-judged preferences don’t always beat GPT-4o, while performance improves on data analysis, programming, and math.

OpenAI also declines to show the hidden reasoning traces, citing user experience, competitive advantage, and safety monitoring concerns (including detecting attempts to manipulate users). The smaller o1 mini is optimized for STEM reasoning and tracks o1 preview closely on many tasks, raising the question of whether the gains come from the method itself rather than sheer model size.

Finally, the transcript’s hands-on API testing highlights the cost reality: reasoning tokens can dwarf visible output tokens, meaning users pay for internal deliberation they don’t receive. Pricing for o1 preview is described as far higher than recent GPT-4o options, and o1 mini is still significantly more expensive than GPT-4o mini. For LLM apps and agent-like systems, the implication is clear: o1-style models are best viewed as planners that can spend compute to stay on track, but developers may need routing strategies to reserve the expensive reasoning model for genuinely difficult requests.
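
That routing idea can be made concrete with a small dispatcher that sends only hard-looking requests to the expensive reasoning model. The difficulty heuristic and thresholds below are illustrative assumptions, not a recommendation from the video; the model names are real API identifiers.

```python
REASONING_MODEL = "o1-preview"  # expensive, slow, strong on hard problems
CHEAP_MODEL = "gpt-4o-mini"     # fast, cheap, fine for easy prompts

HARD_HINTS = ("prove", "debug", "optimize", "step by step", "derive")

def pick_model(prompt: str) -> str:
    """Naive router: reserve the reasoning model for prompts that look
    like math/code/analysis; everything else goes to the cheap model.
    A production router might use a classifier or a first-pass LLM call."""
    text = prompt.lower()
    if len(text) > 500 or any(hint in text for hint in HARD_HINTS):
        return REASONING_MODEL
    return CHEAP_MODEL

print(pick_model("Write a friendly birthday message"))           # gpt-4o-mini
print(pick_model("Debug this recursive function step by step"))  # o1-preview
```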

Cornell Notes

OpenAI’s o1 and o1 mini are reasoning models designed to spend more compute during inference, producing stronger results on tasks that benefit from long-form problem solving. Their training emphasizes reinforcement learning over reasoning trajectories/trees, and their inference behavior appears to include additional search-like effort such as multi-pass reasoning and backtracking. The approach aims to “train chain-of-thought into the model” rather than rely on prompting to elicit it. Performance gains are most noticeable in math, coding, and data analysis, while subjective writing tasks may not improve as much. The tradeoff is cost and latency: reasoning tokens can be far larger than the visible output, and pricing is substantially higher than recent GPT-4o models.

What makes o1 different from typical GPT-style chat models in how it answers questions?

o1 is built to reason longer before responding. Instead of a single prompt-to-response pass, it appears to generate and evaluate extended reasoning traces (potentially via multiple passes or backtracking) and then select the best path. The transcript notes that the time spent varies with question complexity, hinting at repeated internal attempts rather than one-shot generation.

How does reinforcement learning fit into o1’s training and inference?

The described approach uses large-scale reinforcement learning that evaluates trajectories or trees of reasoning. Training teaches the model productive chain-of-thought behavior using data-efficient reinforcement learning. Then, at inference time, the system also appears to apply compute to generate and compare candidate reasoning paths, with scaling constraints differing from standard LLM pre-training—likely because long reasoning trajectories are involved.

Why does OpenAI emphasize “hidden” chain-of-thought, and why isn’t it shown to users?

OpenAI does not display the internal reasoning traces. The transcript lists the reasons: user experience, protecting competitive advantage, and safety monitoring—specifically the ability to detect manipulation attempts or other underhanded behavior. Even though users can’t see the traces, the model still uses them internally to improve correctness.

Which kinds of tasks benefit most, and which don’t?

Benchmarks and evaluations described in the transcript suggest strong gains for math, code, and data analysis—areas that reward structured, long-form reasoning. Subjective tasks like personal writing may not always match GPT-4o’s performance, indicating the reasoning-heavy approach isn’t uniformly better across all evaluation styles.

What does the API testing imply about cost and “reasoning tokens”?

Hands-on testing shows that reasoning tokens can run into the thousands while visible output tokens total only a few hundred. The transcript notes that users pay for those extra reasoning tokens even though they never receive the internal trace content. This makes o1-style models much more expensive than the GPT-4o variants and raises questions about whether the added deliberation is worth it for a given application.
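
In the OpenAI API this shows up in the usage object: recent versions of the `openai` Python SDK expose a separate reasoning-token count for o1-series models, and those tokens are billed as output. A minimal check, assuming an `OPENAI_API_KEY` in the environment; the per-token price below is a placeholder, not a current rate:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

usage = resp.usage
reasoning = usage.completion_tokens_details.reasoning_tokens  # hidden "thinking"
visible = usage.completion_tokens - reasoning                 # what you actually see

# Placeholder rate for illustration; check OpenAI's pricing page.
PRICE_PER_OUTPUT_TOKEN = 12.00 / 1_000_000

print(f"visible output tokens:   {visible}")
print(f"hidden reasoning tokens: {reasoning}")
print(f"you pay for both: ${usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN:.4f}")
```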

What is o1 mini, and what does it suggest about scaling?

o1 mini is positioned as optimized for STEM reasoning and is close to o1 preview on many tasks, sometimes ranking higher on data analysis. That pattern raises the possibility that the method (long reasoning traces plus reinforcement-style selection) can improve performance beyond just model size, pointing toward a new scaling paradigm that emphasizes inference-time compute.

Review Questions

  1. How does inference-time compute change the behavior of o1 compared with a single-pass GPT-style response?
  2. What evidence in the transcript suggests o1’s chain-of-thought is trained into the model rather than only prompted out?
  3. Why might reasoning-token pricing matter more for agentic applications than for simple chat use?

Key Points

  1. o1 and o1 mini are reasoning models, not GPT-5, and they’re intended to complement the ongoing GPT series.
  2. o1 spends more compute during inference, with time-to-answer varying by problem difficulty, suggesting multi-step search and/or backtracking.
  3. The training approach uses reinforcement learning over reasoning trajectories/trees, and post-training appears to reinforce high-quality reasoning strategies.
  4. Performance gains are strongest on math, coding, and data analysis, while subjective writing tasks may not improve on GPT-4o.
  5. OpenAI does not reveal the hidden chain-of-thought, citing user experience, competitive advantage, and safety-monitoring needs.
  6. API usage can show reasoning tokens far exceeding visible output tokens, making o1-style models significantly more expensive and slower.
  7. For LLM apps and agents, o1 is best treated as a planner that can spend extra compute on hard tasks, potentially requiring routing to cheaper models for easy prompts.

Highlights

  • o1’s core shift is paying for extra inference compute to generate and evaluate long reasoning traces before committing to an answer.
  • OpenAI’s decision not to show hidden chain-of-thought is tied to safety monitoring and competitive advantage, not just UX.
  • Reasoning tokens can dwarf visible output tokens in API usage, turning “thinking” into a direct cost driver.
  • o1’s biggest wins cluster around STEM-style tasks—math, code, and data analysis—while personal writing doesn’t consistently lead.
  • o1 mini’s performance suggests the reasoning method may generalize beyond just scaling up model size.

Mentioned

  • Lukasz Kaiser
  • Jason Wei
  • Noam Brown
  • GPT-5
  • GPT-4o
  • o1
  • o1 mini
  • RLHF
  • RLAIF
  • DPO
  • CoT
  • API
  • STEM