OpenAI’s new “deep-thinking” o1 model crushes coding benchmarks
Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
o1 is claimed to deliver major gains on math, formal logic, and PhD-level physics benchmarks, with especially large improvements on coding benchmarks versus GPT-4.
Briefing
OpenAI’s new o1 model is being pitched as a “deep-thinking” reasoning system that sharply raises performance on math, coding, and high-level science benchmarks (coding above all) while keeping key technical details out of public view. The headline numbers are striking: on PhD-level physics and formal-logic style benchmarks, o1 posts large accuracy gains versus GPT-4, and its coding jump is even more dramatic. In an Olympiad-style setup with 50 submissions per problem, o1 reportedly reached the 49th percentile; when the submission limit was raised to 10,000, it crossed into gold-medal territory. Compared with GPT-4, its Codeforces Elo rating is described as rising from the 11th percentile to the 93rd percentile.
The model’s performance is tied to a different internal workflow than standard chat-style generation. OpenAI describes o1 as relying on reinforcement learning to handle complex reasoning: when given a problem, it generates intermediate “reasoning tokens” (a form of internal step-by-step refinement) before producing the final answer. That mechanism is presented as a way to reduce hallucinations and improve compliance with constraints—at the cost of more compute, time, and money. Users don’t see the full chain-of-thought; the intermediate reasoning is hidden, though OpenAI provides some illustrative examples showing how the system might first inspect input/output structure, then consider programming-language constraints, and only afterward produce code.
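To make that workflow concrete, here is a minimal sketch of what calling such a reasoning model through the OpenAI Python SDK roughly looks like. This is not code from the video: the o1-preview model name, the v1 openai client, and the completion_tokens_details.reasoning_tokens usage field are assumptions based on how the API was documented around o1’s release and may differ in your SDK version; the hidden reasoning itself never appears in the response, only its token count.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="o1-preview",  # assumed model name; "o1-mini" is the cheaper tier
    messages=[
        {"role": "user", "content": "Plan your approach, then write a short "
                                     "function that validates a Sudoku row."}
    ],
)

# Only the final answer comes back; the intermediate reasoning stays server-side.
print(response.choices[0].message.content)

usage = response.usage
print("completion tokens:", usage.completion_tokens)

# Reasoning-token breakdown: field name is an assumption, hence the defensive getattr.
details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    print("reasoning tokens:", getattr(details, "reasoning_tokens", None))
```

The usage fields are the practical point: you are billed for reasoning tokens you never see, which is where the extra compute, latency, and cost described above show up.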
In practice, the transcript contrasts o1 with GPT-4 on a custom coding task: rebuilding a classic DOS-style game (“Drug Wars”) with specific gameplay requirements and “random encounters” with an “Officer Hardass” character. GPT-4’s code reportedly “almost works” but fails to compile and needs multiple follow-ups to reach a working version, with limited game logic. The o1 attempt is described as compiling immediately and matching the requirements more closely, but the resulting game still contains serious bugs, such as an infinite loop and a poor UI, suggesting that stronger reasoning doesn’t automatically guarantee reliable, production-ready software.
The transcript also places o1 in a broader competitive and market context. It notes that OpenAI is offering multiple tiers: o1-mini and o1-preview are available, while the regular o1 model remains locked behind a paywall, with hints of a $2,000 Premium Plus plan. It further claims OpenAI is in discussions to raise money at a $150 billion valuation and that users are shifting between major assistants like ChatGPT and Claude.
Finally, the transcript urges skepticism about hype. o1 is framed as a major leap in benchmark performance, but not as AGI or ASI, and not as a fully transparent system—key details remain closed. Even with improved reasoning, the example game’s failures reinforce a central takeaway: better internal problem-solving can raise success rates, but it still doesn’t eliminate bugs, hallucinations, or the need for careful testing and iteration.
Cornell Notes
OpenAI’s o1 model is presented as a “deep-thinking” reasoning system that boosts results on math, formal logic, PhD-level physics, and especially coding benchmarks compared with GPT-4. Its core mechanism is reinforcement learning that generates intermediate “reasoning tokens” to refine answers before producing a final response; full chain-of-thought remains hidden from users. The trade-off is higher compute and slower responses, plus cost for reasoning tokens. A practical coding test suggests o1 can compile and follow requirements more reliably than GPT-4, yet it can still produce serious bugs and unstable behavior. The overall implication: reasoning-focused models improve performance, but they aren’t guaranteed to produce correct, production-ready software.
What benchmark improvements are claimed for o1, and why do they matter?
How does o1’s “deep thinking” differ from standard chat generation?
What are reasoning tokens, and what do they imply for cost and latency?
What does the coding demonstration suggest about real-world reliability?
What access and product-tier details are mentioned for o1?
How does the transcript balance hype with skepticism?
Review Questions
- What mechanism in o1 is credited with improving accuracy, and what is the main downside of that mechanism?
- In the “Drug Wars” coding test, how did o1’s initial output differ from GPT-4’s, and what failures still occurred?
- Why does the transcript argue that o1 is not equivalent to AGI or GPT-5, despite strong benchmark results?
Key Points
1. o1 is claimed to deliver major gains on math, formal logic, and PhD-level physics benchmarks, with especially large improvements on coding benchmarks versus GPT-4.
2. The model’s reasoning workflow relies on reinforcement learning and intermediate “reasoning tokens,” which refine answers before final output.
3. Full chain-of-thought is hidden from users, but reasoning tokens are still billed, with the transcript citing $60 per 1 million tokens (see the cost sketch after this list).
4. Stronger reasoning can improve compilation and requirement-following in coding tasks, yet it still produces serious bugs that require debugging.
5. o1 is offered in tiers (o1-mini, o1-preview, and regular o1), with regular o1 described as locked behind a high-cost plan.
6. Despite benchmark dominance, the transcript emphasizes that o1 is not AGI/ASI and remains less than fully transparent due to closed technical details.
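As a back-of-the-envelope illustration of key point 3, here is a hedged sketch of what the cited $60-per-1-million-token figure implies per response. The 5,000-token reasoning budget is a made-up example, not a number from the transcript.

```python
# Rough per-response cost of hidden reasoning tokens, using the
# $60-per-1M-token figure cited in the transcript.
PRICE_PER_MILLION_TOKENS = 60.00  # USD

def reasoning_cost(reasoning_tokens: int) -> float:
    """Cost attributable to hidden reasoning tokens for one response."""
    return reasoning_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Hypothetical example: a response that burns 5,000 hidden reasoning tokens.
print(f"${reasoning_cost(5_000):.2f}")  # -> $0.30
```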