
DeepSeek R1 Local Test with Ollama: Coding, Data Extraction, Data Labelling, Summarization, RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

DeepSeek R1's training adds a supervised “cold start” step after the reinforcement-only phase that produced R1-Zero on top of V3, reducing repetition, improving readability, and reducing language mixing.

Briefing

DeepSeek R1 and R1-Zero are reasoning-focused large language models trained with a multi-stage process that aims to fix early problems like endless repetition, weak readability, and language mixing. R1-Zero comes from an initial reinforcement learning phase on top of DeepSeek V3; for R1, the training pipeline adds a “cold start” step: supervised fine-tuning on higher-quality data, followed by further reinforcement learning. The result is a model family that emphasizes longer, more structured reasoning traces, which the paper’s benchmark charts connect to improved performance across tasks.

A key practical detail is how these models are packaged for real-world use. DeepSeek R1 and R1-Zero are released under the MIT license, and the ecosystem includes six distilled variants derived from R1, built on Qwen and Llama baselines. Benchmark tables in the transcript highlight that a 32B distilled model (based on Qwen) can land near the performance of OpenAI’s o1-mini on the tested suites, making it a candidate for local inference. The base R1 and R1-Zero models are described as mixture-of-experts systems with 671B total parameters but only 37B active per token, which should help inference speed relative to fully dense models. Context length is listed as 128k, which is positioned as sufficient for most practical workflows.

The transcript also zeroes in on prompting and model behavior. DeepSeek R1 is said to be poorly compatible with a system prompt in local testing; instead, the recommended approach is to place instructions in the user prompt. Training prompt templates are described as sparse—user question and assistant solution—while the reasoning process is wrapped in <think>…</think> tags, with the final answer outside those tags. A standout chart tracks average response length within the reasoning text as training progresses, showing a steady increase in reasoning verbosity and an “aha moment” in intermediate versions where the model begins evaluating steps more explicitly.

Local testing then uses Ollama with a DeepSeek R1 14B model (Qwen-based) quantized to Q4, targeting about 9GB of VRAM. The tester reports fast streaming token output on an M3 machine and demonstrates that the model can produce structured, JSON-formatted results when asked. In coding, it generates a pandas workflow to create a dataset of wealthiest people by continent and extract top entries per continent; the first attempt contains a library-parameter error (unexpected keyword argument “scale”), but a manual fix yields correct sorting from poorest to richest. For classification tasks (audience, tone, complexity), it produces valid JSON and performs better than several larger models tested.
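The JSON-formatted classification behavior described above can be sketched with a small validator. The field names (audience, tone, complexity) come from the transcript, but the allowed values and the helper itself are illustrative assumptions, not the exact schema from the video:

```python
import json

# Hypothetical sketch: validate the JSON a local model returns for the
# tweet-classification task (audience, tone, complexity). Values shown
# here are illustrative assumptions.
REQUIRED_FIELDS = {"audience", "tone", "complexity"}

def parse_classification(raw: str) -> dict:
    """Parse model output as JSON and check the expected fields exist."""
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

result = parse_classification(
    '{"audience": "developers", "tone": "informative", "complexity": "medium"}'
)
print(result["tone"])  # informative
```

Wrapping the parse in a validator like this makes it easy to retry the model when it emits malformed or incomplete JSON.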

However, the model’s limits show up in document understanding and summarization. When summarizing a Meta earnings report, it produces an incorrect revenue growth figure and includes some non-English characters. In table extraction, it generally succeeds, but a more complex table that requires calculations becomes a failure case: the model spends several minutes generating a solution that is ultimately incorrect. Overall, the transcript frames DeepSeek R1 as strong for structured generation, extraction, and many coding tasks locally, while still prone to hallucinations and calculation errors on harder quantitative problems.

Cornell Notes

DeepSeek R1 and R1-Zero are reasoning-focused models built through a staged training process: reinforcement learning on top of DeepSeek V3 (yielding R1-Zero), then a “cold start” supervised fine-tuning step on higher-quality data, followed by more reinforcement learning for R1. The transcript highlights that R1’s reasoning traces grow in length and specificity during training, including moments where the model begins evaluating steps more carefully. For local use, the models are MIT-licensed and include distilled variants (notably Qwen- and Llama-based) that can perform near smaller top-tier models on benchmark suites. In hands-on Ollama testing with a DeepSeek R1 14B Q4 model, the system performs well at JSON-structured extraction and many coding/classification tasks, but it can still hallucinate in summaries and fail on calculation-heavy table reconstruction.

What training changes distinguish DeepSeek R1 from the earlier reinforcement-only R1-Zero setup?

After training DeepSeek V3 with reinforcement learning (without supervised fine-tuning), the process produced promising capabilities but also issues like endless repetition, poor readability, and language mixing. The next step introduces a “cold start” phase: supervised fine-tuning on higher-quality data, then continuing reinforcement learning. The transcript links this to improved reasoning behavior in R1.

Why do mixture-of-experts details matter for running R1 locally?

The transcript describes R1/R1-Zero as mixture-of-experts models with 671B total parameters but only 37B active parameters per token. That active-parameter setup is presented as a reason inference can be faster than in fully dense models. It also notes a 128k context window, which helps when working with long documents in local pipelines.
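As a back-of-envelope check on those figures (a rough sketch, not an official benchmark):

```python
# Back-of-envelope check of the mixture-of-experts figures quoted above:
# 671B total parameters, 37B active per token.
total_params = 671e9
active_params = 37e9

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")  # ~5.5%

# Per-token compute roughly scales with active (not total) parameters,
# which is why inference can be faster than a dense model of equal size.
speedup_vs_dense = total_params / active_params
print(f"~{speedup_vs_dense:.0f}x fewer parameter-activations than dense")  # ~18x
```

Note this only approximates compute cost; all 671B weights must still be stored, so memory requirements follow the total, not the active, count.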

How does prompting affect DeepSeek R1 behavior in local tests?

Local testing reports that DeepSeek R1 doesn’t work well with a system prompt. Instead, instructions should be placed in the user prompt. The training template is described as sparse (user question + assistant solution), with reasoning wrapped in <think>…</think> tags and the final answer outside those tags.
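A minimal sketch of separating the reasoning trace from the final answer, assuming responses follow the <think>…</think> convention described above:

```python
import re

def split_reasoning(text: str) -> tuple:
    """Split an R1-style response into the reasoning inside
    <think>...</think> and the final answer outside those tags."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, answer

raw = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print(answer)  # The answer is 4.
```

Stripping the reasoning block like this is useful when only the final answer should be shown to users or passed to downstream parsing.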

Where did the 14B DeepSeek R1 model succeed in coding and data tasks?

In a pandas task to generate a dataset of wealthiest people by continent (≥1,000 examples) and then sort to get the top five per continent, the model produced a long, structured solution and correct sorting logic after a manual fix. It also generated believable names using the Faker library and produced valid JSON for a multi-attribute tweet classification task (audience, tone, complexity), outperforming other tested models in that range.
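The top-five-per-continent step can be sketched in pandas with a toy dataset; the column names and values here are illustrative assumptions, not the video's actual data:

```python
import pandas as pd

# Toy "wealthiest people" dataset; values are illustrative assumptions.
df = pd.DataFrame({
    "name": [f"Person {i}" for i in range(12)],
    "continent": ["Asia", "Asia", "Asia", "Asia", "Asia", "Asia",
                  "Europe", "Europe", "Europe",
                  "Africa", "Africa", "Africa"],
    "net_worth_bn": [12, 8, 60, 15, 40, 2, 45, 30, 22, 5, 9, 3],
})

# Top five per continent, richest first: sort globally by net worth,
# then keep the first five rows of each continent group.
top5 = (
    df.sort_values("net_worth_bn", ascending=False)
      .groupby("continent")
      .head(5)
      .sort_values(["continent", "net_worth_bn"], ascending=[True, False])
)
print(top5[["continent", "name", "net_worth_bn"]])
```

`groupby(...).head(5)` preserves the descending sort within each group, so no per-group re-sorting is needed before taking the top entries.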

What were the main failure modes observed during summarization and table reconstruction?

Summarization of a Meta earnings report included a clearly incorrect revenue growth figure and even some non-English characters. For a calculation-heavy table reconstruction task, the model produced a solution after several minutes, but the resulting table was incorrect, suggesting hallucinated or unreliable intermediate calculations.

What hardware and configuration were used for the local Ollama test?

The transcript uses Ollama with a DeepSeek R1 14B model based on Qwen, quantized to Q4. It estimates roughly 9GB of VRAM and reports running on an M3 machine with fast streaming token output, making interactive use feasible.
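A rough memory estimate consistent with the ~9GB figure, assuming 4-bit weights plus an illustrative ~20% overhead allowance for the KV cache and runtime buffers (the overhead figure is an assumption, not an Ollama-reported number):

```python
# Rough memory estimate for a 14B-parameter model quantized to Q4.
params = 14e9
bits_per_weight = 4

weights_gb = params * bits_per_weight / 8 / 1e9  # 7.0 GB of raw weights
total_gb = weights_gb * 1.2                      # ~8.4 GB with assumed overhead
print(f"weights: {weights_gb:.1f} GB, est. total: {total_gb:.1f} GB")
```

This lines up with the transcript's ~9GB VRAM estimate; longer contexts grow the KV cache, so real usage can exceed this sketch.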

Review Questions

  1. Which specific training stage is credited with addressing repetition/readability/language-mixing problems seen after the initial V3 reinforcement phase?
  2. What evidence from the transcript suggests DeepSeek R1’s reasoning traces become more detailed over training, and how is that measured?
  3. In the local tests, what kinds of tasks tended to work reliably (e.g., JSON extraction, classification), and which ones most often failed (e.g., calculations in tables, numeric summarization)?

Key Points

  1. DeepSeek R1's training adds a supervised “cold start” step after the reinforcement-only phase that produced R1-Zero on V3 to reduce repetition, improve readability, and reduce language mixing.

  2. R1 and R1-Zero are MIT-licensed and come with distilled variants based on Qwen and Llama baselines, enabling local experimentation.

  3. The base models are described as mixture-of-experts with 671B total parameters but 37B active per token, supporting faster inference than fully dense models.

  4. Local prompting tests suggest avoiding system prompts and placing instructions in the user prompt; reasoning is expected inside <think>…</think> tags.

  5. A DeepSeek R1 14B Q4 Ollama setup (~9GB VRAM) can stream tokens quickly on an M3 machine, making interactive coding and extraction practical.

  6. The model performs well at structured JSON outputs and many extraction/classification tasks, but it can hallucinate in numeric summaries and fail on calculation-heavy table reconstruction.

Highlights

R1’s reasoning behavior is tied to training progress: average reasoning-text length increases with training steps, and intermediate versions show an “aha moment” where the model evaluates steps more explicitly.
Despite huge total parameter counts, the mixture-of-experts design (671B total, 37B active) is presented as a practical lever for inference speed.
In local coding tests, the model produced correct pandas sorting logic after a library-parameter mismatch was manually corrected.
Summarization and calculation-heavy table reconstruction exposed the model’s weak spots: incorrect revenue figures and an ultimately wrong computed table.

Topics

  • DeepSeek R1 Training
  • Ollama Local Testing
  • Reasoning Tags
  • JSON Extraction
  • RAG and Summarization
