Open Reasoning vs OpenAI
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s “o1” reasoning models may not keep their edge for long: within roughly two to two and a half months, multiple open-weights labs released reasoning-focused models that land close to OpenAI’s published benchmark performance—often using the same core idea of spending more compute at inference time to generate and select longer reasoning traces.
The comparison starts with what OpenAI emphasized when o1 arrived: performance on three benchmark families: math (AIME-style problems at the high-school olympiad level), code (Codeforces-style problems), and PhD-level science questions (GPQA) designed so that even access to Google doesn’t readily yield the answers. OpenAI didn’t fully disclose how o1 was built, but public discussion and prior OpenAI research point to a training approach that differs from the standard “pretrain, then instruction-tune/RLHF” pipeline: reasoning models are trained to produce intermediate reasoning traces (often structured as trees), then verification and reinforcement-style scoring keep the best traces. At inference time, that translates into generating multiple candidate reasoning paths and spending test-time compute to select a better final answer.
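The generate-then-select loop described above can be sketched as a best-of-N procedure. This is a toy illustration, not OpenAI’s actual method: `sample_trace` and `verify` are invented stand-ins for a real model call and a trained verifier/reward model.

```python
import random

# Toy best-of-N sketch of the "generate many traces, keep the best" idea.
# `sample_trace` stands in for an LLM call and `verify` for a trained
# verifier/reward model; both are invented here for illustration.
def sample_trace(prompt: str, rng: random.Random) -> tuple[str, int]:
    # Simulate noisy reasoning: the true answer is 42, but individual
    # samples are sometimes off by one.
    answer = 42 + rng.choice([0, 0, 0, -1, 1])
    trace = f"worked solution ending in {answer}"
    return trace, answer

def verify(trace: str, answer: int) -> float:
    # Stand-in scorer: a real verifier would grade the trace itself;
    # this toy just prefers answers near a known check value.
    return -abs(answer - 42)

def best_of_n(prompt: str, n: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    candidates = [sample_trace(prompt, rng) for _ in range(n)]
    _, best_answer = max(candidates, key=lambda c: verify(*c))
    return best_answer

print(best_of_n("What is 6 * 7?", n=8))
```

The point of the sketch: raising N spends more inference-time compute and raises the chance that at least one sampled trace scores well, which is the core test-time lever the transcript describes.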
Historically, open-source reasoning has lagged proprietary models, with estimates in the transcript ranging from 18 months to three years. The new wave of open releases challenges that timeline: in the past couple of weeks, three open-weights efforts (DeepSeek R1 lite preview, Qwen’s “QwQ” reasoning model, and Alibaba’s Marco-01) arrived in quick succession after o1’s debut, suggesting open development is accelerating.
DeepSeek R1 lite preview is the first test case. On math-style benchmarks, it reportedly scores well relative to the o1 preview numbers that were widely reused by the community. The transcript notes a pattern: DeepSeek can outperform o1 preview on math while lagging on the PhD-level science set, possibly because o1 preview benefits from a larger model and/or more test-time compute. DeepSeek also makes the reasoning process visible, letting users inspect longer “thought token” traces; accuracy rises as the number of reasoning tokens increases. In interactive examples, DeepSeek’s “deep think” mode can catch deliberate traps (like misspelling “strawberry” to change the count of letter “r”s), and it can reason through multi-step logic (like the “Sally’s brothers and sisters” puzzle). The transcript also shows a failure mode, though from a different model: Qwen’s QwQ can fall into looping on the strawberry-counting task, repeatedly re-checking instead of concluding.
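The accuracy-versus-reasoning-tokens curve can be illustrated with a toy simulation. This is not DeepSeek’s data or method: it simply treats each extra block of thought tokens as another independent attempt at a problem and scores under majority voting, which reproduces the qualitative shape (accuracy climbs with budget, with diminishing returns).

```python
import random

# Toy model of "more thinking helps": each unit of reasoning budget is
# one independent attempt with per-attempt accuracy p_correct, and the
# final answer is decided by majority vote. All numbers are invented.
def solve_once(p_correct: float, rng: random.Random) -> bool:
    return rng.random() < p_correct

def accuracy_at_budget(num_attempts: int, p_correct: float,
                       trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = sum(solve_once(p_correct, rng) for _ in range(num_attempts))
        if correct > num_attempts / 2:  # majority of attempts correct
            wins += 1
    return wins / trials

# Accuracy rises monotonically (in expectation) as the budget grows.
for budget in (1, 3, 9, 27):
    print(budget, round(accuracy_at_budget(budget, p_correct=0.6), 3))
```

This is also why the transcript’s caution about benchmarking matters: two models compared at different thinking budgets are effectively being read off different points of this curve.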
Qwen’s QwQ model shows a different benchmark profile. It reportedly beats DeepSeek on GPQA, but sits below DeepSeek on AIME-style math, while still trailing or matching depending on which OpenAI baseline is used (o1 preview vs o1 mini). The transcript stresses that benchmarking is hard unless compute budgets and iteration counts are matched, since reasoning models’ performance depends heavily on how long they’re allowed to think.
Marco-01 from Alibaba takes a more academic angle. Rather than claiming parity with o1-level maturity, it aims to reproduce o1-like behavior using Monte Carlo Tree Search over reasoning trees, built on a smaller Qwen 2 7B Instruct base. It also releases a dataset of chain-of-thought demonstrations, positioning transparency and trace generation as a differentiator.
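For intuition on what “Monte Carlo Tree Search over reasoning trees” means, here is a minimal UCT sketch on a toy problem. It is far simpler than anything Marco-01 actually does, and every name in it is illustrative: each node is a partial sequence of reasoning “steps,” and a completed sequence earns reward 1 if its steps sum to a target.

```python
import math
import random

# Minimal UCT-style MCTS over a toy "reasoning tree" (illustrative only).
STEPS = (1, 2, 3)   # possible reasoning steps at each node
DEPTH = 3           # length of a complete reasoning trace
TARGET = 7          # e.g. the trace (1, 3, 3) is a correct solution

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of chosen steps so far
        self.parent = parent
        self.children = {}      # step -> Node
        self.visits = 0
        self.value = 0.0

def reward(state):
    return 1.0 if len(state) == DEPTH and sum(state) == TARGET else 0.0

def rollout(state, rng):
    # Simulation: complete the trace with random steps, score the result.
    while len(state) < DEPTH:
        state = state + (rng.choice(STEPS),)
    return reward(state)

def select_child(node):
    # UCB1: balance exploitation (mean value) against exploration.
    return max(node.children.values(),
               key=lambda c: c.value / c.visits
               + math.sqrt(2 * math.log(node.visits) / c.visits))

def mcts(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded, non-terminal nodes.
        while len(node.state) < DEPTH and len(node.children) == len(STEPS):
            node = select_child(node)
        # Expansion: add one untried child if non-terminal.
        if len(node.state) < DEPTH:
            step = rng.choice([s for s in STEPS if s not in node.children])
            child = Node(node.state + (step,), parent=node)
            node.children[step] = child
            node = child
        # Simulation + backpropagation.
        value = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # Read off the most-visited path as the "best" reasoning trace.
    path, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda c: c.visits)
        path.append(node.state[-1])
    return tuple(path)

print(mcts())
```

The search concentrates visits on branches whose completions score well, so the most-visited path converges on a valid trace; swapping the toy reward for a learned scorer over model-generated steps is, in spirit, the reasoning-tree setup the transcript attributes to Marco-01.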
The overall takeaway is less about any single model and more about the direction of travel: open-weights labs without OpenAI-scale compute are producing reasoning systems that approach proprietary benchmark claims, largely by scaling inference-time compute and improving trace generation/selection. Even if OpenAI releases the full o1 model soon, the transcript suggests, the ball may quickly be back in OpenAI’s court, because the open ecosystem is closing the gap that fast.
Cornell Notes
OpenAI’s o1 reasoning models rely on generating and verifying intermediate reasoning traces, then using extra test-time compute to improve the final answer. The transcript argues that open-weights labs are catching up faster than expected: DeepSeek R1 lite preview, Qwen’s QwQ, and Alibaba’s Marco-01 arrived within about two to two and a half months of o1’s release. In comparisons, DeepSeek can look strong on math benchmarks and shows how accuracy improves as more reasoning tokens are generated, while Qwen’s results vary by benchmark and can include looping failures. Marco-01 uses an MCTS-style approach over reasoning trees and releases chain-of-thought datasets, emphasizing reproducible reasoning mechanics. The practical implication: performance gaps may shrink quickly as open models scale inference-time thinking rather than only scaling pretraining size.
What makes o1-style “reasoning models” different from a standard LLM pipeline?
Why do benchmark comparisons between open models and o1 get tricky?
What does DeepSeek R1 lite preview reveal about the relationship between reasoning length and accuracy?
What are concrete examples of reasoning behavior (and failure modes) shown in the transcript?
How does Marco-01 differ from the other open reasoning models discussed?
Review Questions
- Which parts of o1-style training and inference are most responsible for performance gains: pretraining scale, instruction tuning, or test-time trace generation and verification?
- How would you design a fair benchmark comparison between two reasoning models if their allowed thinking time differs?
- What failure mode did the transcript highlight for Qwen’s QwQ, and what does it suggest about stopping criteria in reasoning loops?
Key Points
1. OpenAI’s o1-style reasoning performance is tied to generating intermediate reasoning traces and using extra inference-time compute to verify/select better paths.
2. Open-weights reasoning models (DeepSeek R1 lite preview, Qwen QwQ, and Alibaba Marco-01) reached close-to-o1 benchmark territory within roughly two to two and a half months of o1’s release.
3. Benchmark comparisons are unreliable unless compute budgets (iterations and reasoning-token budgets) are matched, since longer thinking can boost scores.
4. DeepSeek’s results emphasize a measurable tradeoff: increasing thought-token length improves accuracy on math-style tasks.
5. Qwen’s QwQ shows that reasoning systems can fail by looping: re-checking the same conclusion without converging.
6. Marco-01 uses Monte Carlo Tree Search over reasoning trees and releases chain-of-thought datasets, aiming for a more reproducible reasoning mechanism.
7. The open ecosystem’s main advantage is accelerating inference-time reasoning techniques, not necessarily matching proprietary training compute scales.