DeepSeek R1 - Full Breakdown

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

DeepSeek released open weights for DeepSeek R1 along with a model family that includes DeepSeek V3 and multiple distilled variants down to 1.5B parameters.

Briefing

DeepSeek has released open weights for its reasoning model family, led by DeepSeek R1, along with a set of distilled smaller models that can outperform several well-known proprietary systems on benchmark tasks. The release matters because it makes “reasoning-style” performance—previously locked behind closed APIs—available to anyone who can run the models locally or via DeepSeek’s chat interface, and it also includes the training approach behind the results.

The biggest headline is that DeepSeek R1 arrives not as a single model, but as an ecosystem: the original R1 weights, a precursor called DeepSeek V3, and multiple distilled variants down to 1.5 billion parameters. In reported benchmark comparisons, DeepSeek R1 (shown as the blue line in the charts) lands on par with OpenAI o1 in some tests and beats it in others, while also surpassing OpenAI o1-mini across multiple benchmarks. The distilled models are positioned as especially striking: smaller parameter counts still deliver strong scores, including cases where a 1.5B model is reported to score higher than Claude 3.5 Sonnet on certain tasks.

A key technical thread ties the family together: DeepSeek R1 uses the same underlying mixture-of-experts (MoE) base as DeepSeek V3. DeepSeek V3 is described as an MoE system with 671 billion total parameters, but only 37 billion active at any moment. DeepSeek R1 keeps that active-parameter profile while changing the post-training and reinforcement-learning pipeline, suggesting the performance jump comes more from how the model is trained after pretraining than from simply scaling raw size.
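
To make the total-versus-active distinction concrete, here is a minimal toy sketch of mixture-of-experts routing (in NumPy, with tiny illustrative sizes rather than DeepSeek's actual configuration): a router picks a few experts per token, so only a fraction of the total weights participate in any single forward pass.

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Toy mixture-of-experts layer: each token is routed to its top_k experts,
    so only a small share of the total parameters is active per token."""
    logits = x @ router_w                                 # (tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :top_k]         # chosen expert indices
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(top[t]):                    # run only the chosen experts
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
x = rng.normal(size=(4, d_model))                         # 4 tokens
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))
print(moe_layer(x, experts, router_w).shape)              # (4, 16)
```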

In live testing through DeepSeek.com’s chat demo, the model’s behavior emphasizes stepwise internal deliberation. It often produces “thinking” content enclosed in tags, then backtracks or clarifies before answering. Example prompts include simple factual reasoning (e.g., a sibling-count question), spelling correction, and scenario analysis such as how geopolitical dynamics might shift after a new abundant clean energy source is discovered. When asked about sentience, it responds with a careful definition-first approach and concludes it is not sentient. Asked how close it is to AGI, it gives a blunt assessment that current LLMs are nowhere near AGI.

The accompanying technical report frames the training pipeline around two main models: DeepSeek R1 Zero and DeepSeek R1. DeepSeek R1 Zero is described as taking the pretrained DeepSeek V3 base and applying reinforcement learning without the usual supervised fine-tuning step. Instead, the system is coaxed into generating chain-of-thought style outputs using a prompt template that forces user/assistant conversation formatting and "think"/"answer" tags. Rewards are rule-based, using deterministic checks on tasks like GSM8K, rather than relying on an external learned reward model.
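
A rough sketch of what that setup can look like in code is below. The template wording, tag names, and example are illustrative approximations of what the report describes, not the exact strings DeepSeek used.

```python
import re

# Hypothetical prompt template in the spirit of the report: the model is asked
# to put its reasoning inside <think> tags and the final result inside <answer> tags.
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "through the problem, then answers.\n"
    "User: {question}\n"
    "Assistant: <think>"
)

def format_reward(completion: str) -> float:
    """Reward the model for following the <think>...</think><answer>...</answer> format."""
    ok = re.search(r"</think>\s*<answer>.*</answer>", completion, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Deterministic correctness check, e.g. for a GSM8K-style math problem:
    extract the <answer> content and compare it to the known answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not m:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

completion = "48 in April and half as many in May, so 48 + 24 = 72.</think> <answer>72</answer>"
print(format_reward(completion), accuracy_reward(completion, "72"))  # 1.0 1.0
```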

DeepSeek R1 then follows a multi-stage process: start with a small “cold start” fine-tuning set (thousands of examples), run reinforcement learning, return to supervised fine-tuning using rejection sampling from the RL checkpoint, and finally apply another RL stage. Distillation is presented as a major lever for the smaller models: DeepSeek reportedly fine-tunes distilled variants using 800,000 curated samples generated from DeepSeek R1, and those distilled models outperform applying RL directly at smaller scales.
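
In outline, the reported stage order looks roughly like the skeleton below. Every function here is a placeholder standing in for real training code, so treat it as a map of the stages rather than an implementation.

```python
# High-level sketch of the reported DeepSeek R1 training stages.
# All function names and bodies are illustrative stubs.

def cold_start_sft(base_model, seed_examples):
    """Stage 1: supervised fine-tuning on a small 'cold start' set
    (thousands of curated long chain-of-thought examples)."""
    return base_model  # placeholder

def reasoning_rl(model, prompts):
    """Stage 2: reinforcement learning with rule-based rewards."""
    return model  # placeholder

def rejection_sampling_sft(model, prompts, n_samples=16):
    """Stage 3: sample many completions from the RL checkpoint, keep only the
    ones that pass the rule-based checks, and fine-tune on that filtered set."""
    return model  # placeholder

def final_rl(model, prompts):
    """Stage 4: a final RL stage applied after the second fine-tuning round."""
    return model  # placeholder

def train_r1(v3_base, seed_examples, prompts):
    m = cold_start_sft(v3_base, seed_examples)
    m = reasoning_rl(m, prompts)
    m = rejection_sampling_sft(m, prompts)
    m = final_rl(m, prompts)
    return m
```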

Finally, the release includes practical guidance for running models. The full-size 671B MoE version is likely out of reach for most local setups, but distilled sizes from 1.5B up to 70B are available, including quantized options. Local experiments highlight strengths in math-style reasoning (GSM8K with LaTeX outputs) while showing limitations in tool use, structured JSON output, and agent-like behavior—areas expected to improve in later iterations.
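
As one concrete way to try a distilled checkpoint locally, a Hugging Face transformers sketch is shown below. The model ID is an assumption about how the distilled weights are published, so check DeepSeek's model pages for the exact repository names and sizes.

```python
# Minimal sketch of running a small distilled checkpoint locally with
# Hugging Face transformers. The model ID below is an assumption -- verify
# the actual repository name before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "If I have 3 brothers and each brother has 2 sisters, how many sisters do I have?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```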

Cornell Notes

DeepSeek released open weights for its reasoning model family, led by DeepSeek R1, plus distilled variants as small as 1.5B parameters. Reported benchmarks show DeepSeek R1 matching or beating OpenAI o1 on multiple tests, while smaller distilled models can outperform several proprietary systems on specific tasks. The performance jump is attributed less to raw scaling and more to post-training: DeepSeek R1 builds on the same mixture-of-experts DeepSeek V3 base (671B total parameters, 37B active) and changes reinforcement-learning and multi-stage training. Training uses rule-based rewards (e.g., deterministic GSM8K checks) and chain-of-thought prompting with “think”/“answer” tags, then distills from DeepSeek R1 using 800,000 curated samples to create smaller models. The release matters because it makes reasoning-focused LLM behavior reproducible locally and via DeepSeek’s chat demo.

What makes DeepSeek R1’s release different from earlier “reasoning model” announcements?

It arrives with open weights and a whole family of models, not just one checkpoint. Alongside DeepSeek R1, DeepSeek releases DeepSeek V3 and multiple distilled models down to 1.5B parameters. The distilled models are reported to perform strongly on benchmark tasks, and the MIT licensing means outputs can be used to train other models.

How do the benchmarks position DeepSeek R1 against OpenAI and Claude?

In the benchmark comparisons described, DeepSeek R1 (blue) is often on par with OpenAI o1 and beats it in multiple benchmarks. It also outperforms OpenAI o1-mini across several tests. The distilled models are highlighted as sometimes exceeding Claude 3.5 Sonnet on certain tasks, including cases involving the 1.5B model.

Why does the training method matter more than parameter count in this family?

DeepSeek R1 is built on the same MoE base as DeepSeek V3: 671B total parameters with 37B active at any time. That means the “effective” active capacity is similar, while the post-training pipeline changes. The reported takeaway is that reinforcement learning and multi-stage post-training drive much of the performance improvement rather than simply scaling size.

What is the role of DeepSeek R1 Zero in the DeepSeek R1 training story?

DeepSeek R1 Zero is described as reinforcement learning applied directly to the pretrained DeepSeek V3 base without the usual supervised fine-tuning step. DeepSeek R1 then builds on lessons from DeepSeek R1 Zero, using a multi-stage pipeline that includes a small cold-start fine-tuning phase, RL, supervised fine-tuning via rejection sampling from the RL checkpoint, and a final RL stage.

How are rewards computed during reinforcement learning, and what’s unusual about it?

Rewards are rule-based rather than coming from an external learned reward model. For example, tasks like GSM8K have deterministic correctness checks, so the system can score outputs based on whether answers match expected results. The pipeline also uses GRPO (Group Relative Policy Optimization), which samples multiple completions per prompt, normalizes their rewards within the group, and updates the policy toward the completions that scored relatively better.
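
A minimal sketch of that group-relative idea is below; it omits the PPO-style clipping and KL terms that a full GRPO objective would also include.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled completion against the
    other samples for the same prompt, instead of against a learned
    value/reward model. (Clipping and KL penalty omitted here.)"""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. 6 completions sampled for one GSM8K prompt, scored by the rule-based checker
rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards).round(2))
# completions that passed the check get positive advantage, the rest negative
```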

Why do distilled models sometimes beat expectations despite being much smaller?

DeepSeek reports that direct distillation from DeepSeek R1 outperforms applying RL directly to smaller models. The smaller models are fine-tuned using 800,000 curated samples generated from DeepSeek R1, capturing long “thinking” traces and reasoning behavior. That distillation approach helps smaller parameter models retain strong task performance.
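
As a sketch of what that data looks like in practice, each distillation example pairs a prompt with the teacher's full reasoning trace; the helper names below are hypothetical, and the real pipeline would scale this up to the reported 800,000 samples.

```python
# Illustrative sketch: the teacher (DeepSeek R1) answers a prompt with its full
# <think> trace, and the small student model is later fine-tuned with plain
# supervised learning on prompt -> full-trace pairs (no RL at this stage).
# build_distillation_example() and fake_teacher() are hypothetical helpers.

def build_distillation_example(prompt: str, teacher_generate) -> dict:
    trace = teacher_generate(prompt)  # includes <think>...</think><answer>...</answer>
    return {"prompt": prompt, "target": trace}

def fake_teacher(prompt):
    return "<think>3 + 4 = 7</think> <answer>7</answer>"

example = build_distillation_example("What is 3 + 4?", fake_teacher)
print(example)
```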

Review Questions

  1. Which parts of DeepSeek’s pipeline are described as rule-based scoring versus learned reward modeling, and why does that distinction matter?
  2. How does the mixture-of-experts design (671B total, 37B active) change the interpretation of “model size” when comparing DeepSeek R1 to other systems?
  3. What limitations show up when running the smaller DeepSeek R1 distilled models locally, especially around tool use and structured outputs?

Key Points

  1. DeepSeek released open weights for DeepSeek R1 along with a model family that includes DeepSeek V3 and multiple distilled variants down to 1.5B parameters.
  2. Benchmark results reported for DeepSeek R1 place it on par with OpenAI o1 in some tests and ahead in others, with distilled models also showing strong task performance.
  3. DeepSeek R1’s gains are tied to post-training and reinforcement learning rather than simply increasing active parameter count, since it shares DeepSeek V3’s MoE structure (671B total, 37B active).
  4. Training uses chain-of-thought prompting with “think”/“answer” tags and rule-based rewards (e.g., deterministic GSM8K checks) instead of external learned reward models.
  5. DeepSeek R1 follows a multi-stage pipeline: cold-start fine-tuning, RL, supervised fine-tuning via rejection sampling, and a final RL stage.
  6. Distillation is central for small models: DeepSeek fine-tunes distilled variants using 800,000 curated DeepSeek R1 samples, and this approach beats applying RL directly at smaller scales.
  7. Local use is feasible mainly through distilled sizes (and quantized options), but tool use and structured JSON-style outputs remain weaker than pure math/reasoning tasks.

Highlights

DeepSeek R1 comes with open weights and a full ecosystem of distilled models, making reasoning-focused performance accessible without closed APIs.
DeepSeek R1 and DeepSeek V3 share the same MoE backbone (671B total parameters, 37B active), pointing to post-training as the main performance driver.
Rule-based reward signals (like deterministic GSM8K correctness) replace learned reward models in the reinforcement-learning setup.
Distilled models are trained from DeepSeek R1 using 800,000 curated samples, and that distillation beats smaller-scale RL.
Local experiments emphasize strengths in GSM8K-style reasoning and LaTeX outputs, while tool/agent behavior and structured outputs lag.

Mentioned

  • MoE (mixture of experts)
  • RL (reinforcement learning)
  • RLHF (reinforcement learning from human feedback)
  • DPO (direct preference optimization)
  • GRPO (group relative policy optimization)
  • AGI (artificial general intelligence)
  • SFT (supervised fine-tuning)
  • LLM (large language model)