DeepSeek R1 Local Test with Ollama: Coding, Data Extraction, Data Labelling, Summarization, RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
DeepSeek R1's training adds a supervised "cold start" step on top of the initial pure-reinforcement-learning approach used for R1-Zero (built on V3) to reduce repetition, improve readability, and reduce language mixing.
Briefing
DeepSeek R1 and R1-Zero are reasoning-focused large language models trained with a multi-stage process that aims to fix early problems like endless repetition, weak readability, and language mixing. R1-Zero is trained with reinforcement learning directly on top of DeepSeek V3; the R1 pipeline then adds a "cold start" step of supervised fine-tuning on higher-quality data, followed by further reinforcement learning. The result is a model family that emphasizes longer, more structured reasoning traces, something the paper's benchmark charts connect to improved performance across tasks.
A key practical detail is how these models are packaged for real-world use. DeepSeek R1 and R1-Zero are released under the MIT license, and the ecosystem includes six distilled variants derived from R1, built on Qwen and Llama baselines. Benchmark tables in the transcript highlight that the 32B distilled model (Qwen-based) can land near the performance of OpenAI's o1-mini on the tested suites, making it a candidate for local inference. The base R1 and R1-Zero models are described as mixture-of-experts systems with 671B total parameters but only 37B active per token, which should help inference speed relative to fully dense models. Context length is listed as 128k, which is positioned as sufficient for most practical workflows.
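As a rough back-of-envelope illustration (my numbers, derived only from the 671B/37B figures above), per-token compute in a mixture-of-experts model tracks the active parameters rather than the total count, which is why the design helps speed even though all weights still have to be stored:

```python
# Back-of-envelope: per-token compute scales with *active* parameters,
# so an MoE forward pass behaves more like a ~37B dense model than a 671B one.
total_params = 671e9   # total parameters reported for R1 / R1-Zero
active_params = 37e9   # parameters active per token

print(f"Active fraction per token: {active_params / total_params:.1%}")  # ~5.5%
# Memory is a different story: all 671B weights still need to be stored,
# which is why the distilled variants are the practical option for local runs.
```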
The transcript also zeroes in on prompting and model behavior. In local testing, DeepSeek R1 reportedly works poorly with a system prompt; the recommended approach is to place all instructions directly in the user prompt. The training prompt template is described as sparse (user question plus assistant solution), with the reasoning process wrapped in <think>…</think> tags and the final answer placed outside those tags. A standout chart tracks average response length within the reasoning text as training progresses, showing a steady increase in reasoning verbosity and an "aha moment" at an intermediate checkpoint where the model begins evaluating its own steps more explicitly.
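A minimal sketch of that prompting style, assuming Ollama is serving the model on its default local port (11434) with the deepseek-r1:14b tag used in the test below; the helper name, the classification prompt, and the text being classified are illustrative, not taken from the transcript:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's local chat endpoint
MODEL = "deepseek-r1:14b"  # assumes the model was pulled, e.g. `ollama pull deepseek-r1:14b`

def ask(user_prompt: str) -> str:
    """Send a single user message; no system prompt, per the recommendation above."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": user_prompt}],
            "stream": False,
        },
        timeout=600,
    )
    resp.raise_for_status()
    content = resp.json()["message"]["content"]
    # The reasoning trace sits inside <think>...</think>; keep only the final answer.
    if "</think>" in content:
        content = content.split("</think>", 1)[1]
    return content.strip()

# All instructions, including the output format, go into the user prompt.
answer = ask(
    "Classify the following text by audience, tone, and complexity. "
    "Respond with JSON only, using the keys audience, tone, complexity.\n\n"
    "Text: Transformers process tokens in parallel using self-attention."
)
print(json.loads(answer))
```

In practice the answer sometimes arrives wrapped in a markdown code fence, so a small cleanup step before json.loads may be needed.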
Local testing then uses Ollama with a DeepSeek R1 14B model (a Qwen-based distill) quantized to Q4, which needs roughly 9GB of VRAM. The tester reports fast streaming token output on an Apple M3 machine and demonstrates that the model can produce structured, JSON-formatted results when asked. In coding, it generates a pandas workflow to create a dataset of the wealthiest people by continent and extract the top entries per continent; the first attempt contains a library-parameter error (an unexpected keyword argument "scale"), but a manual fix yields correct sorting from poorest to richest. For classification tasks (audience, tone, complexity), it produces valid JSON and performs better than several larger models tested.
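For reference, a hand-written sketch of the kind of pandas workflow the task asks for; the data, column names, and the top-2 cutoff are made up for illustration, and this is not the model's generated code:

```python
import pandas as pd

# Tiny made-up dataset standing in for "wealthiest people by continent".
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E", "F"],
        "continent": ["Europe", "Europe", "Asia", "Asia", "N. America", "N. America"],
        "net_worth_bn": [120.0, 95.5, 150.2, 80.1, 230.0, 110.4],
    }
)

# Take the top 2 per continent, then order the result from poorest to richest,
# matching the corrected output described above.
top_per_continent = (
    df.sort_values("net_worth_bn", ascending=False)
      .groupby("continent", group_keys=False)
      .head(2)
      .sort_values("net_worth_bn")
      .reset_index(drop=True)
)
print(top_per_continent)
```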
However, the model’s limits show up in document understanding and summarization. When summarizing a Meta earnings report, it produces an incorrect revenue-growth figure and mixes in some non-English characters. In table extraction, it generally succeeds, but a more complex table that requires calculations becomes a failure case: the model spends several minutes generating a solution that is ultimately incorrect. Overall, the transcript frames DeepSeek R1 as strong for structured generation, extraction, and many coding tasks run locally, while still prone to hallucinations and calculation errors on harder quantitative problems.
Cornell Notes
DeepSeek R1 and R1-Zero are reasoning-focused models built through a staged training process: R1-Zero applies reinforcement learning directly on top of DeepSeek V3, and R1 adds a "cold start" supervised fine-tuning step on higher-quality data followed by more reinforcement learning. The transcript highlights that R1's reasoning traces grow in length and specificity during training, including moments where the model begins evaluating its own steps more carefully. For local use, the models are MIT-licensed and include distilled variants (notably Qwen- and Llama-based) that can perform near smaller top-tier models on benchmark suites. In hands-on Ollama testing with a DeepSeek R1 14B Q4 model, the system performs well at JSON-structured extraction and many coding/classification tasks, but it can still hallucinate in summaries and fail on calculation-heavy table reconstruction.
What training changes distinguish DeepSeek R1 from the earlier pure-reinforcement setup on V3 used for R1-Zero?
Why do mixture-of-experts details matter for running R1 locally?
How does prompting affect DeepSeek R1 behavior in local tests?
Where did the 14B DeepSeek R1 model succeed in coding and data tasks?
What were the main failure modes observed during summarization and table reconstruction?
What hardware and configuration were used for the local Ollama test?
Review Questions
- Which specific training stage is credited with addressing repetition/readability/language-mixing problems seen after the initial V3 reinforcement phase?
- What evidence from the transcript suggests DeepSeek R1’s reasoning traces become more detailed over training, and how is that measured?
- In the local tests, what kinds of tasks tended to work reliably (e.g., JSON extraction, classification), and which ones most often failed (e.g., calculations in tables, numeric summarization)?
Key Points
1. DeepSeek R1's training adds a supervised "cold start" step on top of the initial pure-reinforcement approach used for R1-Zero (built on V3) to reduce repetition, improve readability, and reduce language mixing.
2. R1 and R1-Zero are MIT-licensed and come with distilled variants based on Qwen and Llama baselines, enabling local experimentation.
3. The base models are described as mixture-of-experts with 671B total parameters but 37B active per token, supporting faster inference than fully dense models.
4. Local prompting tests suggest avoiding system prompts and placing instructions in the user prompt; reasoning is expected inside <think>…</think> tags.
5. A DeepSeek R1 14B Q4 Ollama setup (~9GB VRAM) can stream tokens quickly on an Apple M3 machine, making interactive coding and extraction practical.
6. The model performs well at structured JSON outputs and many extraction/classification tasks, but it can hallucinate in numeric summaries and fail on calculation-heavy table reconstruction.