Open Reasoning vs OpenAI
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
OpenAI’s “o1” reasoning models may not keep their edge for long: within roughly two to two and a half months, multiple open-weights labs released reasoning-focused models that land close to OpenAI’s published benchmark performance—often using the same core idea of spending more compute at inference time to generate and select longer reasoning traces.
The comparison starts with what OpenAI emphasized when o1 arrived: performance on three benchmark families: math (AIME-style problems at the high-school olympiad level), code (Codeforces-style problems), and PhD-level science questions (GPQA) designed so that even access to Google doesn’t readily yield the answers. OpenAI didn’t fully disclose how o1 was built, but public discussion and prior OpenAI research point to a training approach that differs from the standard “pretrain, then instruction-tune/RLHF” pipeline: reasoning models are trained to produce intermediate reasoning traces (often structured as trees), then verification and reinforcement-style scoring keep the best traces. At inference time, that translates into generating multiple candidate reasoning paths and spending test-time compute to select a better final answer.
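The generate-then-select loop described above can be sketched as a best-of-N procedure. This is a toy illustration, not OpenAI’s actual method: `sample_trace` and `verify` are invented stand-ins for a real model call and a trained verifier/reward model.

```python
import random

# Toy best-of-N sketch of the "generate many traces, keep the best" idea.
# `sample_trace` stands in for an LLM call and `verify` for a trained
# verifier/reward model; both are invented here for illustration.
def sample_trace(prompt: str, rng: random.Random) -> tuple[str, int]:
    # Simulate noisy reasoning: the true answer is 42, but individual
    # samples are sometimes off by one.
    answer = 42 + rng.choice([0, 0, 0, -1, 1])
    trace = f"worked solution ending in {answer}"
    return trace, answer

def verify(trace: str, answer: int) -> float:
    # Stand-in scorer: a real verifier would grade the trace itself;
    # this toy just prefers answers near a known check value.
    return -abs(answer - 42)

def best_of_n(prompt: str, n: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    candidates = [sample_trace(prompt, rng) for _ in range(n)]
    _, best_answer = max(candidates, key=lambda c: verify(*c))
    return best_answer

print(best_of_n("What is 6 * 7?", n=8))
```

The point of the sketch: raising N spends more inference-time compute and raises the chance that at least one sampled trace scores well, which is the core test-time lever the transcript describes.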
Historically, open-source reasoning has lagged proprietary models, with estimates in the transcript ranging from 18 months to three years. The new wave of open releases challenges that timeline: in the past couple of weeks, three open-weights efforts (DeepSeek R1 lite preview, Qwen’s “QwQ” reasoning model, and Alibaba’s Marco-01) arrived in quick succession after o1’s debut, suggesting open development is accelerating.
DeepSeek R1 lite preview is the first test case. On math-style benchmarks, it reportedly scores well relative to the o1 preview numbers that were widely reused by the community. The transcript notes a pattern: DeepSeek can outperform o1 preview on math while lagging on the PhD-level science set, possibly because o1 preview benefits from a larger model and/or more test-time compute. DeepSeek also makes the reasoning process visible, letting users inspect longer “thought token” traces; accuracy rises as the number of reasoning tokens increases. In interactive examples, DeepSeek’s “deep think” mode can catch deliberate traps (like misspelling “strawberry” to change the count of letter “r”s), and it can reason through multi-step logic (like the “Sally’s brothers and sisters” puzzle). The transcript also shows a failure mode, though from a different model: Qwen’s QwQ can fall into looping on the strawberry-counting task, repeatedly re-checking instead of concluding.
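The accuracy-versus-reasoning-tokens curve can be illustrated with a toy simulation. This is not DeepSeek’s data or method: it simply treats each extra block of thought tokens as another independent attempt at a problem and scores under majority voting, which reproduces the qualitative shape (accuracy climbs with budget, with diminishing returns).

```python
import random

# Toy model of "more thinking helps": each unit of reasoning budget is
# one independent attempt with per-attempt accuracy p_correct, and the
# final answer is decided by majority vote. All numbers are invented.
def solve_once(p_correct: float, rng: random.Random) -> bool:
    return rng.random() < p_correct

def accuracy_at_budget(num_attempts: int, p_correct: float,
                       trials: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = sum(solve_once(p_correct, rng) for _ in range(num_attempts))
        if correct > num_attempts / 2:  # majority of attempts correct
            wins += 1
    return wins / trials

# Accuracy rises monotonically (in expectation) as the budget grows.
for budget in (1, 3, 9, 27):
    print(budget, round(accuracy_at_budget(budget, p_correct=0.6), 3))
```

This is also why the transcript’s caution about benchmarking matters: two models compared at different thinking budgets are effectively being read off different points of this curve.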
Qwen’s QwQ model shows a different benchmark profile. It reportedly beats DeepSeek on GPQA, but sits below DeepSeek on AIME-style math, while still trailing or matching depending on which OpenAI baseline is used (o1 preview vs o1 mini). The transcript stresses that benchmarking is hard unless compute budgets and iteration counts are matched, since reasoning models’ performance depends heavily on how long they’re allowed to think.
Marco-01 from Alibaba takes a more academic angle. Rather than claiming parity with o1-level maturity, it aims to reproduce o1-like behavior using Monte Carlo Tree Search over reasoning trees, built on a smaller Qwen 2 7B Instruct base. It also releases a dataset of chain-of-thought demonstrations, positioning transparency and trace generation as a differentiator.
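For intuition on what “Monte Carlo Tree Search over reasoning trees” means, here is a minimal UCT sketch on a toy problem. It is far simpler than anything Marco-01 actually does, and every name in it is illustrative: each node is a partial sequence of reasoning “steps,” and a completed sequence earns reward 1 if its steps sum to a target.

```python
import math
import random

# Minimal UCT-style MCTS over a toy "reasoning tree" (illustrative only).
STEPS = (1, 2, 3)   # possible reasoning steps at each node
DEPTH = 3           # length of a complete reasoning trace
TARGET = 7          # e.g. the trace (1, 3, 3) is a correct solution

class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of chosen steps so far
        self.parent = parent
        self.children = {}      # step -> Node
        self.visits = 0
        self.value = 0.0

def reward(state):
    return 1.0 if len(state) == DEPTH and sum(state) == TARGET else 0.0

def rollout(state, rng):
    # Simulation: complete the trace with random steps, score the result.
    while len(state) < DEPTH:
        state = state + (rng.choice(STEPS),)
    return reward(state)

def select_child(node):
    # UCB1: balance exploitation (mean value) against exploration.
    return max(node.children.values(),
               key=lambda c: c.value / c.visits
               + math.sqrt(2 * math.log(node.visits) / c.visits))

def mcts(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded, non-terminal nodes.
        while len(node.state) < DEPTH and len(node.children) == len(STEPS):
            node = select_child(node)
        # Expansion: add one untried child if non-terminal.
        if len(node.state) < DEPTH:
            step = rng.choice([s for s in STEPS if s not in node.children])
            child = Node(node.state + (step,), parent=node)
            node.children[step] = child
            node = child
        # Simulation + backpropagation.
        value = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # Read off the most-visited path as the "best" reasoning trace.
    path, node = [], root
    while node.children:
        node = max(node.children.values(), key=lambda c: c.visits)
        path.append(node.state[-1])
    return tuple(path)

print(mcts())
```

The search concentrates visits on branches whose completions score well, so the most-visited path converges on a valid trace; swapping the toy reward for a learned scorer over model-generated steps is, in spirit, the reasoning-tree setup the transcript attributes to Marco-01.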
The overall takeaway is less about any single model and more about the direction of travel: open-weights labs without OpenAI-scale compute are producing reasoning systems that approach proprietary benchmark claims, largely by scaling inference-time compute and improving trace generation/selection. Even if OpenAI releases the full o1 model soon, the transcript suggests, the ball may quickly be back in OpenAI’s court, because the open ecosystem is closing the gap that fast.
Cornell Notes
OpenAI’s o1 reasoning models rely on generating and verifying intermediate reasoning traces, then using extra test-time compute to improve the final answer. The transcript argues that open-weights labs are catching up faster than expected: DeepSeek R1 lite preview, Qwen’s QwQ, and Alibaba’s Marco-01 arrived within about two to two and a half months of o1’s release. In comparisons, DeepSeek can look strong on math benchmarks and shows how accuracy improves as more reasoning tokens are generated, while Qwen’s results vary by benchmark and can include looping failures. Marco-01 uses an MCTS-style approach over reasoning trees and releases chain-of-thought datasets, emphasizing reproducible reasoning mechanics. The practical implication: performance gaps may shrink quickly as open models scale inference-time thinking rather than only scaling pretraining size.
What makes o1-style “reasoning models” different from a standard LLM pipeline?
Why do benchmark comparisons between open models and o1 get tricky?
What does DeepSeek R1 lite preview reveal about the relationship between reasoning length and accuracy?
What are concrete examples of reasoning behavior (and failure modes) shown in the transcript?
How does Marco-01 differ from the other open reasoning models discussed?
Review Questions
- Which parts of o1-style training and inference are most responsible for performance gains: pretraining scale, instruction tuning, or test-time trace generation and verification?
- How would you design a fair benchmark comparison between two reasoning models if their allowed thinking time differs?
- What failure mode did the transcript highlight for Qwen’s QwQ, and what does it suggest about stopping criteria in reasoning loops?
Key Points
1. OpenAI’s o1-style reasoning performance is tied to generating intermediate reasoning traces and using extra inference-time compute to verify/select better paths.
2. Open-weights reasoning models (DeepSeek R1 lite preview, Qwen QwQ, and Alibaba Marco-01) reached close-to-o1 benchmark territory within roughly two to two and a half months of o1’s release.
3. Benchmark comparisons are unreliable unless compute budgets (iterations and reasoning-token budgets) are matched, since longer thinking can boost scores.
4. DeepSeek’s results emphasize a measurable tradeoff: increasing thought-token length improves accuracy on math-style tasks.
5. Qwen’s QwQ shows that reasoning systems can fail by looping: re-checking the same conclusion without converging.
6. Marco-01 uses Monte Carlo Tree Search over reasoning trees and releases chain-of-thought datasets, aiming for a more reproducible reasoning mechanism.
7. The open ecosystem’s main advantage is accelerating inference-time reasoning techniques, not necessarily matching proprietary training compute scales.