DeepSeek R1 - Full Breakdown
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
DeepSeek has released open weights for its reasoning model family, led by DeepSeek R1, along with a set of distilled smaller models that can outperform several well-known proprietary systems on benchmark tasks. The release matters because it makes “reasoning-style” performance—previously locked behind closed APIs—available to anyone who can run the models locally or via DeepSeek’s chat interface, and it also includes the training approach behind the results.
The biggest headline is that DeepSeek R1 arrives not as a single model but as an ecosystem: the original R1 weights, the DeepSeek V3 base it builds on, and multiple distilled variants down to 1.5 billion parameters. In the benchmark charts shown in the video, DeepSeek R1 (the blue line) lands on par with OpenAI o1 in some tests and beats it in others, while also surpassing OpenAI o1-mini across multiple benchmarks. The distilled models are especially striking: despite much smaller parameter counts they still deliver strong scores, including cases where a 1.5B model is reported to score higher than Claude 3.5 Sonnet on certain tasks.
A key technical thread ties the family together: DeepSeek R1 uses the same underlying mixture-of-experts (MoE) base as DeepSeek V3. DeepSeek V3 is described as an MoE system with 671 billion total parameters, but only 37 billion active at any moment. DeepSeek R1 keeps that active-parameter profile while changing the post-training and reinforcement-learning pipeline, suggesting the performance jump comes more from how the model is trained after pretraining than from simply scaling raw size.
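The active-versus-total distinction can be made concrete with a small sketch. The parameter figures below are the ones reported for DeepSeek V3; the routing function is a generic top-k gate for illustration, not DeepSeek's actual implementation.

```python
# Toy illustration of why an MoE model's "active" parameter count is far
# below its total: each token is routed to only a few experts.
# The parameter figures are the reported DeepSeek V3 numbers; the
# top-k gate below is a generic sketch, not DeepSeek's routing code.

TOTAL_PARAMS = 671e9   # all experts combined
ACTIVE_PARAMS = 37e9   # parameters actually used per token

print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")

def top_k_experts(gate_scores, k=2):
    """Pick the k highest-scoring experts for one token (generic top-k routing)."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return ranked[:k]

# Example: 8 hypothetical experts, route one token to the top 2.
scores = [0.1, 0.7, 0.05, 0.3, 0.9, 0.2, 0.15, 0.4]
print("Routed to experts:", top_k_experts(scores))  # -> [4, 1]
```

The point of the sketch is that compute cost tracks the ~5.5% active fraction, not the 671B total, which is why "model size" comparisons against dense models can mislead.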
In live testing through DeepSeek.com’s chat demo, the model’s behavior emphasizes stepwise internal deliberation. It often produces “thinking” content enclosed in tags, then backtracks or clarifies before answering. Example prompts include simple factual reasoning (e.g., a sibling-count question), spelling correction, and scenario analysis such as how geopolitical dynamics might shift after a new abundant clean energy source is discovered. When asked about sentience, it responds with a careful definition-first approach and concludes it is not sentient. Asked how close it is to AGI, it gives a blunt assessment that current LLMs are nowhere near AGI.
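The tag-wrapped "thinking" content is easy to separate from the final answer programmatically. Below is a minimal sketch using the `<think>` tag the released models emit; the sample response text is invented for illustration.

```python
import re

# Minimal sketch of splitting a DeepSeek R1-style response into its
# "thinking" content and the final answer. The <think> tag matches what
# the released models emit; the sample text here is made up.

def split_reasoning(text):
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = m.group(1).strip() if m else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thinking, answer

sample = ("<think>Sally has 3 brothers; each brother has 2 sisters, "
          "so the sisters are Sally plus one other girl.</think>\n"
          "Sally has 1 sister.")
thinking, answer = split_reasoning(sample)
print(answer)  # -> Sally has 1 sister.
```

Separating the two spans like this is also how a chat UI can show or hide the deliberation while only surfacing the answer.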
The accompanying technical report frames the training pipeline around two main models: DeepSeek R1 Zero and DeepSeek R1. DeepSeek R1 Zero is described as taking the pretrained DeepSeek V3 base and applying reinforcement learning without the usual supervised fine-tuning step. Instead, the system is coaxed into generating chain-of-thought style outputs using a prompt template that enforces user/assistant conversation formatting and “think”/“answer” tags. Rewards are rule-based—using deterministic checks on tasks like GSM8K—rather than relying on an external learned reward model.
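A rule-based reward of the kind described can be a few lines of deterministic string checking. The sketch below is illustrative (the extraction heuristic and function names are not DeepSeek's code): it grants reward 1.0 only when the final number in the model's answer matches the reference, the style of check GSM8K permits.

```python
import re

# Hedged sketch of a rule-based accuracy reward: no learned reward model,
# just a deterministic comparison of the model's final number against the
# gold answer. The extraction heuristic here is illustrative only.

def extract_final_number(text):
    """Take the last number in the answer text as the model's final answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def accuracy_reward(model_answer, gold):
    pred = extract_final_number(model_answer)
    return 1.0 if pred is not None and pred == float(gold) else 0.0

print(accuracy_reward("... so the total is 42.", 42))  # -> 1.0
print(accuracy_reward("... the total is 41.", 42))     # -> 0.0
```

Because the check is deterministic, the reward cannot be "gamed" the way a learned reward model can, which is part of why this design choice is notable.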
DeepSeek R1 then follows a multi-stage process: start from a small “cold start” fine-tuning set (thousands of examples), run reinforcement learning, return to supervised fine-tuning on data drawn via rejection sampling from the RL checkpoint, and finally apply another RL stage. Distillation is presented as the major lever for the smaller models: DeepSeek reportedly fine-tunes the distilled variants on 800,000 curated samples generated by DeepSeek R1, and this distillation outperforms applying RL directly at smaller scales.
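The rejection-sampling step in that pipeline can be sketched as a simple filter loop: draw several completions from the RL checkpoint, keep only those that pass a deterministic check, and add the survivors to the supervised fine-tuning set. Everything below is a stand-in (the `generate` function fakes a model that is right about half the time); it shows the shape of the procedure, not DeepSeek's implementation.

```python
# Illustrative sketch of rejection sampling for building an SFT dataset:
# sample n completions per prompt, keep only those whose final answer
# passes a deterministic check. `generate` is a fake stand-in for a
# real call to the RL checkpoint.

def generate(prompt, seed):
    # Fake model: alternates between a correct and an incorrect answer.
    answer = 42 if seed % 2 == 1 else 41
    return f"<think>working it out...</think> The answer is {answer}."

def rejection_sample(prompt, gold, n_samples=8):
    kept = []
    for seed in range(n_samples):
        completion = generate(prompt, seed)
        if f"The answer is {gold}." in completion:   # deterministic check
            kept.append({"prompt": prompt, "completion": completion})
    return kept

sft_examples = rejection_sample("What is 6 * 7?", 42)
print(f"Kept {len(sft_examples)} of 8 candidate completions")  # -> Kept 4 of 8
```

Only the surviving completions (correct answer plus its chain of thought) become training targets, which is how the SFT stage inherits quality from the RL checkpoint without a human labeling every example.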
Finally, the release includes practical guidance for running models. The full-size 671B MoE version is likely out of reach for most local setups, but distilled sizes from 1.5B up to 70B are available, including quantized options. Local experiments highlight strengths in math-style reasoning (GSM8K with LaTeX outputs) while showing limitations in tool use, structured JSON output, and agent-like behavior—areas expected to improve in later iterations.
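A back-of-envelope weight-memory estimate makes the "out of reach locally" claim concrete. The arithmetic below counts weights only (no KV cache or runtime overhead), using bytes-per-parameter figures for FP16 and a typical 4-bit quantization; the listed sizes are the distilled tiers mentioned in the release.

```python
# Rough weight-only memory estimates for the distilled checkpoints.
# Ignores KV cache and runtime overhead; 16-bit = FP16 weights,
# 4-bit = a typical quantized format.

def weight_memory_gb(params_billions, bits_per_param):
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for size in (1.5, 7, 14, 32, 70):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size:>5}B  fp16 ≈ {fp16:6.1f} GB   4-bit ≈ {q4:5.1f} GB")
```

By the same arithmetic the full 671B model needs on the order of 1.3 TB at FP16 just for weights, which is why the distilled and quantized sizes are the practical local options.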
Cornell Notes
DeepSeek released open weights for its reasoning model family, led by DeepSeek R1, plus distilled variants as small as 1.5B parameters. Reported benchmarks show DeepSeek R1 matching or beating OpenAI o1 on multiple tests, while smaller distilled models can outperform several proprietary systems on specific tasks. The performance jump is attributed less to raw scaling and more to post-training: DeepSeek R1 builds on the same mixture-of-experts DeepSeek V3 base (671B total parameters, 37B active) and changes reinforcement-learning and multi-stage training. Training uses rule-based rewards (e.g., deterministic GSM8K checks) and chain-of-thought prompting with “think”/“answer” tags, then distills from DeepSeek R1 using 800,000 curated samples to create smaller models. The release matters because it makes reasoning-focused LLM behavior reproducible locally and via DeepSeek’s chat demo.
- What makes DeepSeek R1’s release different from earlier “reasoning model” announcements?
- How do the benchmarks position DeepSeek R1 against OpenAI and Claude?
- Why does the training method matter more than parameter count in this family?
- What is the role of DeepSeek R1 Zero in the DeepSeek R1 training story?
- How are rewards computed during reinforcement learning, and what’s unusual about it?
- Why do distilled models sometimes beat expectations despite being much smaller?
Review Questions
- Which parts of DeepSeek’s pipeline are described as rule-based scoring versus learned reward modeling, and why does that distinction matter?
- How does the mixture-of-experts design (671B total, 37B active) change the interpretation of “model size” when comparing DeepSeek R1 to other systems?
- What limitations show up when running the smaller DeepSeek R1 distilled models locally, especially around tool use and structured outputs?
Key Points
1. DeepSeek released open weights for DeepSeek R1 along with a model family that includes DeepSeek V3 and multiple distilled variants down to 1.5B parameters.
2. Benchmark results reported for DeepSeek R1 place it on par with OpenAI o1 in some tests and ahead in others, with distilled models also showing strong task performance.
3. DeepSeek R1’s gains are tied to post-training and reinforcement learning rather than simply increasing active parameter count, since it shares DeepSeek V3’s MoE structure (671B total, 37B active).
4. Training uses chain-of-thought prompting with “think”/“answer” tags and rule-based rewards (e.g., deterministic GSM8K checks) instead of external learned reward models.
5. DeepSeek R1 follows a multi-stage pipeline: cold-start fine-tuning, RL, supervised fine-tuning via rejection sampling, and a final RL stage.
6. Distillation is central for small models: DeepSeek fine-tunes distilled variants using 800,000 curated DeepSeek R1 samples, and this approach beats applying RL directly at smaller scales.
7. Local use is feasible mainly through distilled sizes (and quantized options), but tool use and structured JSON-style outputs remain weaker than pure math/reasoning tasks.