
OpenAI’s new “deep-thinking” o1 model crushes coding benchmarks

Fireship · 5 min read

Based on Fireship's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

o1 is claimed to deliver major gains on math, formal logic, and PhD-level physics benchmarks, with especially large improvements on coding benchmarks versus GPT-4.

Briefing

OpenAI’s new o1 model is being pitched as a “deep-thinking” reasoning system that sharply raises performance on math, coding, and high-level science benchmarks—especially coding tasks—while keeping key technical details out of public view. The headline numbers are striking: on PhD-level physics and formal-logic style benchmarks, o1 posts large accuracy gains versus GPT-4, and its coding jump is even more dramatic. In an Olympiad-style setup with 50 submissions per problem, o1 reportedly placed in the 49th percentile; when the submission limit was raised to 10,000, it crossed into gold-medal territory. Compared with GPT-4, its Codeforces Elo is described as rising from the 11th percentile to the 93rd percentile.

The model’s performance is tied to a different internal workflow than standard chat-style generation. OpenAI describes o1 as relying on reinforcement learning to handle complex reasoning: when given a problem, it generates intermediate “reasoning tokens” (a form of internal step-by-step refinement) before producing the final answer. That mechanism is presented as a way to reduce hallucinations and improve compliance with constraints—at the cost of more compute, time, and money. Users don’t see the full chain-of-thought; the intermediate reasoning is hidden, though OpenAI provides some illustrative examples showing how the system might first inspect input/output structure, then consider programming-language constraints, and only afterward produce code.

In practice, the transcript contrasts o1 with GPT-4 using a custom coding task: rebuilding a classic DOS-style game (“Drug Wars”) with specific gameplay requirements and “random encounters” involving an “Officer Hardass” character. GPT-4’s code reportedly “almost works” but fails to compile and needs multiple follow-ups to reach a working version, with limited game logic. The o1 attempt is described as compiling immediately and matching requirements more closely, but the resulting game still contains serious bugs—such as an infinite loop and a poor UI—suggesting that stronger reasoning doesn’t automatically guarantee reliable, production-ready software.

The transcript also places o1 in a broader competitive and market context. It notes that OpenAI is offering multiple tiers—o1 mini and o1 preview are available, while o1 regular remains locked behind a paywall, with hints of a $2,000 Premium Plus plan. It further claims OpenAI is in discussions to raise money at a $150 billion valuation and that users are shifting between major assistants like ChatGPT and Claude.

Finally, the transcript urges skepticism about hype. o1 is framed as a major leap in benchmark performance, but not as AGI or ASI, and not as a fully transparent system—key details remain closed. Even with improved reasoning, the example game’s failures reinforce a central takeaway: better internal problem-solving can raise success rates, but it still doesn’t eliminate bugs, hallucinations, or the need for careful testing and iteration.

Cornell Notes

OpenAI’s o1 model is presented as a “deep-thinking” reasoning system that boosts results on math, formal logic, PhD-level physics, and especially coding benchmarks compared with GPT-4. Its core mechanism is reinforcement learning that generates intermediate “reasoning tokens” to refine answers before producing a final response; full chain-of-thought remains hidden from users. The trade-off is higher compute and slower responses, plus cost for reasoning tokens. A practical coding test suggests o1 can compile and follow requirements more reliably than GPT-4, yet it can still produce serious bugs and unstable behavior. The overall implication: reasoning-focused models improve performance, but they aren’t guaranteed to produce correct, production-ready software.

What benchmark improvements are claimed for o1, and why do they matter?

The transcript highlights large gains on PhD-level physics and on multitask language understanding benchmarks for math and formal logic. The most emphasized leap is coding: in an Olympiad-style setting, o1 reportedly placed in the 49th percentile when allowed 50 submissions per problem, then achieved gold-medal-level performance when allowed 10,000 submissions. Relative to GPT-4, its Codeforces Elo is described as rising from the 11th percentile to the 93rd percentile—signals that the model is better at solving constrained, multi-step programming problems rather than just generating plausible text.
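Why raising the submission limit matters so much follows from simple probability: if each independent attempt solves a problem with probability p, then k attempts succeed with probability 1 − (1 − p)^k. A minimal sketch (the per-attempt rate here is illustrative, not a number from the transcript):

```python
def best_of_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

# Even a low per-attempt solve rate compounds quickly with more submissions,
# which is why 50 vs. 10,000 submissions can span percentile vs. gold-medal results.
for k in (1, 50, 10_000):
    print(f"k={k:>6}: success probability {best_of_k(0.01, k):.3f}")
```

This also explains why high-submission-limit results should be read carefully: they measure search plus verification, not single-shot accuracy.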

How does o1’s “deep thinking” differ from standard chat generation?

o1 is described as using reinforcement learning to perform complex reasoning. Instead of producing an answer in one pass, it generates intermediate “reasoning tokens” that help it refine steps and backtrack when necessary. Those internal steps are not shown as full chain-of-thought to the end user, but the model uses them to improve accuracy and reduce hallucinations. The transcript also notes that this approach increases compute time and cost.
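o1’s internals are closed, so the “propose, verify, backtrack” workflow can only be illustrated by analogy. The toy loop below stands in for that idea—intermediate work accumulates in a hidden trace before a final answer is returned—and is emphatically not OpenAI’s actual mechanism:

```python
def solve_with_refinement(candidates, check, max_steps=100):
    """Toy 'reason then answer' loop: propose a candidate, verify it,
    and move on (backtrack) if it fails. Analogy only -- o1's real
    internal mechanism is not public."""
    trace = []  # stands in for hidden intermediate 'reasoning tokens'
    for step, cand in zip(range(max_steps), candidates):
        trace.append(f"step {step}: trying {cand}")
        if check(cand):
            return cand, trace  # final answer plus the hidden work behind it
    return None, trace

# Example task: find the smallest multiple of 7 greater than 50.
answer, trace = solve_with_refinement(
    (n for n in range(51, 200)), lambda n: n % 7 == 0
)
```

Here the caller sees only `answer`; the `trace` is the part a user of o1 never gets to inspect, even though it drove the result.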

What are reasoning tokens, and what do they imply for cost and latency?

Reasoning tokens are treated as outputs that represent intermediate refinement steps. Because the model must generate and process these tokens before answering, responses take more time and require more compute. The transcript claims users pay for these tokens at a rate of $60 per 1 million, and it emphasizes that the hidden chain-of-thought still comes with a direct pricing impact.

What does the coding demonstration suggest about real-world reliability?

In a test to recreate a DOS-style game (“Drug Wars”) with specific gameplay requirements, GPT-4 produced code that nearly worked but failed to compile and required multiple follow-ups, with limited logic. The o1 attempt reportedly compiled immediately and followed requirements more closely, but the game still had major issues: an infinite loop involving “Officer Hardass” and a poor UI. Follow-up prompts reportedly worsened hallucinations and bugs. The takeaway is that improved reasoning can help, but it doesn’t remove the need for debugging and testing.
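The failure mode described—code that compiles cleanly yet hangs—is typically a loop whose exit condition never changes. A hypothetical sketch (the encounter name and odds are invented for illustration, not taken from the demo’s actual code):

```python
import random

def officer_encounter(rng: random.Random) -> int:
    """Hypothetical 'Officer Hardass' encounter loop.
    The buggy version of this pattern loops on `while not escaped:` but
    never reassigns `escaped`, so it spins forever. Updating the flag each
    pass and bounding the attempt count both guard against that hang."""
    escaped = False
    attempts = 0
    while not escaped and attempts < 10:  # bound prevents an infinite loop
        attempts += 1
        escaped = rng.random() < 0.5      # state actually changes each pass
    return attempts

attempts = officer_encounter(random.Random(42))
```

Bugs like this pass a compiler and a quick glance, which is exactly why the transcript’s point stands: stronger code generation still needs testing.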

What access and product-tier details are mentioned for o1?

The transcript says three models were released: o1 mini, o1 preview, and o1 regular. It claims “us plebs” have access only to mini and preview, while o1 regular remains locked. It also hints at a $2,000 Premium Plus plan to access o1 regular, tying the strongest capabilities to higher-cost tiers.

How does the transcript balance hype with skepticism?

It frames o1 as a major leap in benchmark performance but explicitly denies it is ASI, AGI, or “GPT-5.” It also points out that OpenAI keeps “interesting details” closed off, limiting transparency. The coding example’s bugs reinforce that reasoning improvements don’t guarantee correctness, and the transcript warns that hype can outpace real-world capability.

Review Questions

  1. What mechanism in o1 is credited with improving accuracy, and what is the main downside of that mechanism?
  2. In the “Drug Wars” coding test, how did o1’s initial output differ from GPT-4’s, and what failures still occurred?
  3. Why does the transcript argue that o1 is not equivalent to AGI or GPT-5, despite strong benchmark results?

Key Points

  1. o1 is claimed to deliver major gains on math, formal logic, and PhD-level physics benchmarks, with especially large improvements on coding benchmarks versus GPT-4.

  2. The model’s reasoning workflow relies on reinforcement learning and intermediate “reasoning tokens,” which refine answers before final output.

  3. Full chain-of-thought is hidden from users, but reasoning tokens are billed, with the transcript citing $60 per 1 million tokens.

  4. Stronger reasoning can improve compilation and requirement-following in coding tasks, yet it still produces serious bugs that require debugging.

  5. o1 is offered in tiers (o1 mini, o1 preview, o1 regular), with o1 regular described as locked behind a high-cost plan.

  6. Despite benchmark dominance, the transcript emphasizes that o1 is not AGI/ASI and remains less than fully transparent due to closed technical details.

Highlights

o1’s coding performance is described as leaping from GPT-4’s 11th percentile to the 93rd percentile in Codeforces Elo, with Olympiad-style gold-medal results under high submission limits.
Reasoning tokens let o1 refine and backtrack internally, but they increase compute time and come with direct token costs.
In a game-recreation coding test, o1 compiled immediately and matched requirements better than GPT-4—yet still produced an infinite loop and a buggy UI.
The chain-of-thought process is hidden from users, even though examples show the model can reason about input/output shapes and language constraints before generating code.
