Explaining OpenAI's o1 Reasoning Models
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s o1 and o1 mini are reasoning-first models that trade speed for deeper problem solving by spending substantially more compute during inference—often generating long internal reasoning traces and selecting among multiple candidate paths. OpenAI is explicit that these models are not GPT-5, and they’re positioned as a new direction alongside continued development of the GPT series. The practical takeaway is straightforward: for hard tasks like math, coding, and complex analysis, o1-style models can deliver stronger results, but they cost more and may take longer because they “think” longer before answering.
Unlike standard GPT-style chat models, which typically run a single pass from prompt to response, o1 models are trained with large-scale reinforcement learning to reason through prompts, and they appear to carry that selection-style behavior into inference. The training approach described centers on reinforcement learning over chain-of-thought trajectories or trees, using the outcomes of those traces to improve the model. Crucially, the system also appears to spend additional compute during inference (suggesting multiple passes, backtracking, or other multi-step search over reasoning traces), because the time spent generating an answer scales with how difficult the question is.
A key theme is that chain-of-thought should be learned in the model rather than merely elicited via prompting. OpenAI’s approach appears to include post-training steps that generate many reasoning traces (including self-play-style trees) and then use a checker or evaluator to reinforce the best strategies. That design aims to let the model recognize mistakes midstream, correct course, and even backtrack to an earlier point before continuing. The result is a model that can break down instructions, translate them into structured subproblems, and assemble logic with built-in self-checking.
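The generate-then-check idea above can be sketched as a toy best-of-N loop: sample several candidate reasoning traces, score each with a checker, and keep the best. Everything here is illustrative; a real system would sample traces from the model and score them with a learned verifier rather than the hand-written arithmetic consistency check used below.

```python
# Toy sketch of best-of-N selection over reasoning traces.
# Each "sampled" candidate is (reasoning trace, final answer). In a real
# system these would be drawn from the model, and the checker would be a
# learned evaluator rather than an arithmetic recomputation.
candidates = [
    ("23*17 = 23*20 - 23*3 = 460 - 69 = 391", 391),
    ("23*17 = 23*10 + 23*7 = 230 + 161 = 391", 391),
    ("23*17 = 390 + 10 = 400", 400),  # sloppy trace the checker should reject
]

def checker_score(trace: str, answer: int) -> float:
    """Score a candidate by recomputing the product named in its first
    step and comparing it to the trace's final answer."""
    a, b = (int(x) for x in trace.split("=")[0].split("*"))
    return 1.0 if a * b == answer else 0.0

# Keep the highest-scoring candidate (ties resolve to the first one).
best_trace, best_answer = max(candidates, key=lambda c: checker_score(*c))
print(best_answer)  # 391
```

The same selection loop generalizes to training: traces that score well under the checker become the behavior that gets reinforced.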
Public commentary from researchers tied to the work reinforces this direction: training on hidden chain-of-thought can improve generalization and data efficiency, and the gains are especially visible on benchmarks that reward long-form reasoning. In evaluation comparisons, the full o1 is assessed under a "maximum test-time compute" setting, meaning it is allowed to spend as much inference time and compute as possible; under those conditions it can outperform o1 preview, which may be constrained to less inference effort. The tradeoff shows up in subjective tasks: outputs for personal writing and other human-judged preferences don't always beat GPT-4o, even as performance improves on data analysis, programming, and math.
OpenAI also declines to show the hidden reasoning traces, citing user experience, competitive advantage, and safety monitoring concerns (including detecting attempts to manipulate users). The smaller o1 mini is optimized for STEM reasoning and tracks o1 preview closely on many tasks, raising the question of whether the gains come from the method itself rather than sheer model size.
Finally, the transcript’s hands-on API testing highlights the cost reality: reasoning tokens can dwarf visible output tokens, meaning users pay for internal deliberation they never see. Pricing for o1 preview is described as far higher than recent GPT-4o options, and o1 mini is still significantly more expensive than GPT-4o mini. For LLM apps and agent-like systems, the implication is clear: o1-style models are best viewed as planners that can spend compute to stay on track, and developers may need routing strategies that reserve the expensive reasoning model for genuinely difficult requests.
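To make the token economics concrete, here is a minimal cost sketch under the billing behavior the transcript describes: reasoning tokens are charged at the output rate even though they are never shown to the user. The per-million-token prices in the example are illustrative placeholders, not actual OpenAI pricing.

```python
def request_cost(prompt_tokens: int, visible_output_tokens: int,
                 reasoning_tokens: int, input_price_per_m: float,
                 output_price_per_m: float) -> float:
    """Estimate the cost of one request in dollars.

    Hidden reasoning tokens are billed at the output rate, so they are
    added to the visible output before pricing."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens / 1e6) * input_price_per_m \
         + (billed_output / 1e6) * output_price_per_m

# Illustrative numbers: a short visible answer preceded by a reasoning
# trace ten times its size dominates the bill.
cost = request_cost(prompt_tokens=500, visible_output_tokens=300,
                    reasoning_tokens=3000, input_price_per_m=15.0,
                    output_price_per_m=60.0)
print(f"${cost:.4f}")  # $0.2055
```

Note that with these placeholder rates, the 3,000 hidden reasoning tokens account for most of the charge, which is exactly the effect the API testing surfaced.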
Cornell Notes
OpenAI’s o1 and o1 mini are reasoning models designed to spend more compute during inference, producing stronger results on tasks that benefit from long-form problem solving. Their training emphasizes reinforcement learning over reasoning trajectories/trees, and their inference behavior appears to include additional search-like effort such as multi-pass reasoning and backtracking. The approach aims to “train chain-of-thought into the model” rather than rely on prompting to elicit it. Performance gains are most noticeable in math, coding, and data analysis, while subjective writing tasks may not improve as much. The tradeoff is cost and latency: reasoning tokens can be far larger than the visible output, and pricing is substantially higher than recent GPT-4o models.
What makes o1 different from typical GPT-style chat models in how it answers questions?
How does reinforcement learning fit into o1’s training and inference?
Why does OpenAI emphasize “hidden” chain-of-thought, and why isn’t it shown to users?
Which kinds of tasks benefit most, and which don’t?
What does the API testing imply about cost and “reasoning tokens”?
What is o1 mini, and what does it suggest about scaling?
Review Questions
- How does inference-time compute change the behavior of o1 compared with a single-pass GPT-style response?
- What evidence in the transcript suggests o1’s chain-of-thought is trained into the model rather than only prompted out?
- Why might reasoning-token pricing matter more for agentic applications than for simple chat use?
Key Points
1. o1 and o1 mini are reasoning models, not GPT-5, and they’re intended to complement the ongoing GPT series.
2. o1 spends more compute during inference, with time-to-answer varying by problem difficulty, suggesting multi-step search and/or backtracking.
3. The training approach uses reinforcement learning over reasoning trajectories/trees, and post-training appears to reinforce high-quality reasoning strategies.
4. Performance gains are strongest on math, coding, and data analysis, while subjective writing tasks may not improve as much as GPT-4o.
5. OpenAI does not reveal hidden chain-of-thought, citing user experience, competitive advantage, and safety monitoring needs.
6. API usage can show reasoning tokens far exceeding visible output tokens, making o1-style models significantly more expensive and slower.
7. For LLM apps and agents, o1 is best treated as a planner that can spend extra compute on hard tasks, potentially requiring routing to cheaper models for easy prompts.
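The routing idea in the last point can be sketched as a simple dispatcher. The difficulty check below is a crude keyword-and-length heuristic and the model names are placeholders; a production router might instead use a small classifier model or past-failure signals to decide when the reasoning model is worth its cost.

```python
# Minimal routing sketch: send only hard-looking requests to the
# expensive reasoning model. HARD_HINTS is an illustrative heuristic,
# not a recommended production classifier.
HARD_HINTS = ("prove", "debug", "optimize", "derive", "step by step", "algorithm")

def route(prompt: str) -> str:
    """Return the model name to use for this prompt.

    Model names are illustrative placeholders for an o1-style reasoning
    model and a cheaper GPT-4o-mini-style chat model."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS) or len(text.split()) > 150:
        return "reasoning-model"   # expensive, slow, strong on hard tasks
    return "fast-chat-model"       # cheap, fast, fine for easy prompts

print(route("What's the capital of France?"))                 # fast-chat-model
print(route("Prove that the algorithm runs in O(n log n).")) # reasoning-model
```

The point of the sketch is the shape of the system, not the heuristic: most traffic stays on the cheap model, and only prompts that look like they need deliberation pay the reasoning-token premium.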