
Comparison: DeepSeek vs. OpenAI o1 Preview

4 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

OpenAI’s test-time inference scaling claim centers on spending extra inference compute to generate better answers by exploring multiple reasoning paths and selecting the best output.

Briefing

OpenAI’s claim that “test-time inference” can follow a scaling law—spending extra compute at inference to produce smarter answers—faces a real-world stress test in a head-to-head comparison with DeepSeek. The key takeaway: when the task shifts from familiar math/science benchmarks to a logic-heavy, ambiguity-filled reasoning scenario, OpenAI’s o1 Preview delivers more coherent, tightly argued conclusions, while DeepSeek’s responses appear less logically grounded despite using additional time to think.

The comparison centers on a murder-mystery style detective prompt designed to probe reasoning under uncertainty. The scenario was intentionally chosen to avoid the kinds of domains where DeepSeek has been widely reported as strong (mathematical and scientific problems) and instead target logic, evidence reconciliation, and decision-making when facts conflict. Both models received the exact same prompt, so differences in output could be attributed to reasoning behavior rather than task variation.
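Holding the prompt fixed across models is the core experimental control here. A minimal comparison harness might look like the sketch below; `ask_model` is a hypothetical wrapper standing in for whichever API each vendor exposes, not a real client:

```python
def ask_model(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in; in practice this would call each vendor's API.
    return f"[{model_name}] verdict on: {prompt[:40]}"

def head_to_head(prompt: str, models: list[str]) -> dict[str, str]:
    # Send the exact same prompt, word for word, to every model under test,
    # so output differences reflect reasoning behavior, not task variation.
    return {m: ask_model(m, prompt) for m in models}

results = head_to_head(
    "A murder mystery with conflicting evidence...",
    ["o1-preview", "deepseek"],
)
for model, answer in results.items():
    print(model, "->", answer)
```

The point of the design is that the only variable left is the model itself; anything downstream (judging coherence, naming a suspect) can then be compared on equal footing.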

The results diverged sharply. OpenAI’s o1 Preview produced a response that named a suspect and then laid out reasoning that the narrator found tightly argued and clearly connected to the evidence presented in the prompt. It appeared to “examine all of the evidence,” assemble the pieces, and return a rational conclusion with high internal coherence.

DeepSeek, by contrast, named a different suspect and generated reasoning text that the narrator characterized as less logical and less able to articulate why the chosen conclusion followed from the prompt’s evidence. Instead of a clear chain of justification, the response came across as more fragmented—described as “gray text” with reasoning that did not match the coherence level seen from o1 Preview.

Beyond the model comparison itself, the broader issue raised is evaluation quality. If inference-time scaling claims are meant to be taken seriously, the comparison argues that evaluation suites must include tests outside standard knowledge domains and beyond familiar benchmark formats. Relying only on hypothesis-style reasoning tasks (like physics problems) or memorization/production tasks (like recalling literature) may not measure the specific kind of novel reasoning intelligence that inference-time compute is supposed to improve.

The takeaway isn’t that DeepSeek is weak overall; the comparison acknowledges that math and science performance may still be a genuine strength. But for this particular reasoning-under-ambiguity scenario, the test favored o1 Preview. The conclusion is practical: try DeepSeek and compare it directly to o1 Preview on reasoning tasks that are genuinely new and uncertain, and demand evaluation methods that match the inference-time scaling claims being made.

Cornell Notes

The comparison tests whether spending extra compute at inference time reliably improves reasoning, a claim tied to OpenAI’s test-time inference scaling law. A logic-focused murder-mystery prompt with conflicting evidence was used to avoid domains where DeepSeek has been reported as especially strong (math and science). OpenAI’s o1 Preview produced a more coherent, tightly argued conclusion that clearly connected its choices to the evidence. DeepSeek produced a different suspect and reasoning that was judged less logically articulated under the same prompt. The broader lesson is that inference-time scaling claims need evaluation tasks that measure novel reasoning ability, not just performance on familiar benchmark types.

What specific claim about test-time inference scaling is being evaluated, and why does it matter for model comparisons?

The claim is that allocating more time at inference—running parallel reasoning paths and selecting the best response—should yield smarter answers in a way that follows a scaling law. It matters because it predicts a consistent advantage for models designed to “think longer,” which should show up in fair comparisons across tasks. If the scaling effect doesn’t generalize beyond certain benchmark types, then model comparisons based on standard evaluations may mislead users about real reasoning performance.
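The "run parallel reasoning paths, pick the best" idea is essentially best-of-N sampling. The sketch below illustrates the shape of that loop; `generate` and `score` are hypothetical stand-ins for a sampled model call and a verifier/ranking step, not any specific provider's API:

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for one sampled reasoning path from a model.
    return f"answer-{random.randint(0, 9)} to: {prompt}"

def score(answer: str) -> float:
    # Hypothetical stand-in for a verifier or reward model ranking the answer.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend extra inference compute by sampling n candidate answers
    and returning the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Who committed the murder, given the evidence?"))
```

Under the scaling claim, raising `n` (more compute per query) should systematically raise answer quality; the comparison in this piece is probing whether that holds on genuinely novel reasoning tasks.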

Why was a murder-mystery logic scenario chosen instead of math/science or coding tasks?

DeepSeek has been reported as very strong on mathematical and scientific problems and less strong on language and coding. The test deliberately avoids those domains to target reasoning, logic, and evidence reconciliation under uncertainty. The prompt includes ambiguity and conflicting evidence, aiming to measure whether the system can sort through uncertainty and still produce a logically supported conclusion.

What did the comparison find about OpenAI o1 Preview’s reasoning quality in the scenario?

o1 Preview returned a conclusion that the tester described as tightly argued and rational. It was seen as examining the evidence provided in the prompt and assembling it into a coherent chain of justification for its chosen suspect. The reasoning was described as clearly articulated, with a higher degree of coherence than DeepSeek produced on the same prompt.

How did DeepSeek’s response differ, and what was the critique?

DeepSeek named a different suspect and produced reasoning text that was judged less logical and less able to articulate why its choices followed from the prompt’s evidence. Instead of a clear, evidence-grounded justification comparable to o1 Preview’s, the reasoning was characterized as fragmented and not as coherent.

What evaluation principle is raised as the main takeaway beyond the two-model comparison?

Inference-time scaling claims should be matched with evaluation tasks that test the relevant capability—novel reasoning under uncertainty—rather than relying only on standard knowledge-domain benchmarks. The critique is that evaluations should go beyond familiar formats like physics-style reasoning or literary recall and instead use tasks that genuinely stress reasoning ability on new, ambiguous problems.

Review Questions

  1. How does the test-time inference scaling claim predict model behavior, and what would count as evidence for or against it?
  2. Why might a model that performs well on math/science still fail on reasoning-under-ambiguity tasks?
  3. What characteristics of an evaluation task make it more likely to measure “novel intelligence” rather than benchmark familiarity?

Key Points

  1. OpenAI’s test-time inference scaling claim centers on spending extra inference compute to generate better answers by exploring multiple reasoning paths and selecting the best output.
  2. DeepSeek is positioned as a competitor designed to scale intelligence through test-time inference, but its advantage may depend on the task type.
  3. A logic-heavy murder-mystery prompt with ambiguity and conflicting evidence was used to test reasoning rather than math/science performance.
  4. OpenAI’s o1 Preview produced a more coherent, evidence-linked justification for its conclusion than DeepSeek in the same scenario.
  5. DeepSeek’s response named a different suspect and provided reasoning judged less logically articulated under uncertainty.
  6. The comparison argues that inference-time scaling claims require evaluation tasks outside standard knowledge domains and beyond familiar benchmark formats.

Highlights

A reasoning-under-uncertainty murder-mystery prompt favored OpenAI’s o1 Preview over DeepSeek, despite DeepSeek’s test-time inference focus.
The key difference wasn’t just the final suspect—it was the coherence of the reasoning chain connecting evidence to the conclusion.
The broader warning: inference-time scaling claims need evals that measure novel reasoning, not only math/science or other standard benchmark categories.
