
DeepSeek Coder: AI Writes Code | Free LLM For Code Generation Beats ChatGPT, ChatDev & Code Llama

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

DeepSeek Coder is trained on about two trillion tokens with roughly 87% code, supporting strong code generation.

Briefing

DeepSeek Coder is an open-source code-focused language model from DeepSeek AI that’s trained heavily on programming data and tuned to follow coding instructions. The core claim behind it is straightforward: it can outperform major coding LLMs on benchmark suites and human evaluation, while remaining practical to use via public model releases and an online interface. That matters because it positions a freely available model as a serious option for real coding workflows—especially when developers want strong Python and algorithmic performance without paying for closed models.

Training details point to why it may work well. DeepSeek Coder is trained on roughly two trillion tokens, with about 87% coming from code and the remainder largely from English and Chinese natural language. Model sizes range from 1 billion parameters up to 33 billion parameters, and the context window is large—around 16k tokens—so it can handle longer prompts and codebases. Beyond the base model, an instruction-tuned variant called DeepSeek Coder Instruct is further fine-tuned with about 2 billion tokens of instruction data, aiming to produce more task-aligned outputs.

Performance comparisons are a major part of the pitch. The transcript cites benchmark results where DeepSeek Coder is reported to land roughly 8–10% better than Code Llama–style baselines on multiple coding benchmarks. It also claims that DeepSeek Coder Instruct performs better than GPT-3.5 Turbo on human evaluation data. In the practical comparisons described, the 33B instruction-tuned model performs very competitively against GPT-3.5 Turbo across most benchmarks, with one benchmark (MBPP) where it’s close rather than clearly ahead. The overall positioning offered is that DeepSeek Coder sits between ChatGPT and GPT-4 in capability—stronger than many open alternatives, and not far from top-tier closed models.

The transcript then tests the model on hands-on tasks. For Python, it generates correct functions quickly for simple prompts like computing the square of a sum and splitting a list into three parts, including input validation via ValueError. A more creative prompt—writing excuses in the voice of Dwight from The Office—produces working code that generates excuses, though the output doesn’t fully match the requested persona.
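The transcript doesn't show the generated code itself, so here is a minimal sketch of what solutions to those two simple prompts typically look like; the function names and the exact validation rule are assumptions, not taken from the video.

```python
def square_of_sum(numbers):
    """Return the square of the sum of a list of numbers."""
    return sum(numbers) ** 2


def split_into_three(items):
    """Split a list into three near-equal contiguous parts.

    Raises ValueError when the list is too short to yield three
    non-empty parts (assumed validation rule).
    """
    if len(items) < 3:
        raise ValueError("Need at least 3 elements to split into three parts.")
    base, remainder = divmod(len(items), 3)
    parts, start = [], 0
    for i in range(3):
        # The first `remainder` parts absorb one extra element each.
        end = start + base + (1 if i < remainder else 0)
        parts.append(items[start:end])
        start = end
    return parts
```

For example, `split_into_three(list(range(7)))` yields parts of lengths 3, 2, and 2.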

Algorithmic tests on LeetCode show a clearer success pattern. The model produces accepted solutions for an easy Two Sum problem and a medium “array game” problem on the first attempt, with runtime and memory rankings reported as strong (e.g., beating large portions of submissions). It also tackles a hard “count of range sum” problem using a prefix-sum approach combined with sorting and Binary Indexed Tree (Fenwick tree) logic, and the solution is accepted—despite earlier attempts with the same prompt reportedly failing.
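For context on the easy end of that spectrum, the standard one-pass hash-map solution to Two Sum looks like the following; this is the well-known idiomatic approach, not necessarily the exact code the model produced.

```python
def two_sum(nums, target):
    """Return indices of the two numbers that add up to target.

    One-pass hash map: O(n) time, O(n) space.
    """
    seen = {}  # value -> index of its first occurrence
    for i, x in enumerate(nums):
        complement = target - x
        if complement in seen:
            return [seen[complement], i]
        seen[x] = i
    return []  # no pair found
```

Solutions of this shape tend to land in the fast runtime percentiles that the transcript reports, since they avoid the O(n²) brute-force scan.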

Finally, the transcript extends beyond algorithms into software building. DeepSeek Coder generates a basic Flask app that stores submitted emails in a SQLite database and provides an admin page to view them. It also creates a Flappy Bird clone in Pygame: the first version has physics issues, but a follow-up prompt fixes gravity and adds scoring and collision-based game over behavior. Across these demos, the recurring theme is fast inference and frequent first-try correctness on structured coding tasks, making DeepSeek Coder a compelling free option for developers and interview-style problem solving.

Cornell Notes

DeepSeek Coder is an open-source, code-heavy LLM trained on about two trillion tokens, with roughly 87% of the data coming from code. It comes in multiple sizes (1B to 33B parameters) and supports a large ~16k token context window. An instruction-tuned variant, DeepSeek Coder Instruct, adds about 2B tokens of instruction fine-tuning to improve task-following. In the transcript’s tests, the model generates correct Python functions, produces accepted LeetCode solutions for easy and medium problems, and even succeeds on a hard range-sum problem using prefix sums plus a Binary Indexed Tree. It also builds small Flask and Pygame projects, often working on the first attempt and improving after targeted feedback.

What training choices are most likely responsible for DeepSeek Coder’s coding performance?

The transcript highlights three training factors: (1) scale—about two trillion tokens; (2) code density—around 87% of tokens are code; and (3) multilingual natural language—English and Chinese make up the remaining portion. It also notes a large context window of about 16k tokens, which helps when prompts include more surrounding code. Finally, the instruction-tuned variant (DeepSeek Coder Instruct) adds about 2B tokens of instruction data to improve alignment with coding tasks.

How does the instruction-tuned model differ from the base model in practical use?

The base model is optimized for general code generation, while DeepSeek Coder Instruct is additionally fine-tuned on instruction data (about 2B tokens). In the transcript, the instruct model is used for prompts that demand specific output formats—like LeetCode-style class/method signatures and multi-step implementations—where instruction following and structure matter as much as raw code correctness.

Why do the LeetCode demos matter more than the simple Python examples?

LeetCode tasks force the model to produce algorithmically correct, performance-aware solutions under strict constraints. The transcript reports accepted solutions for Two Sum (easy) and an “array game” (medium), including strong runtime and memory rankings. It also reports success on a hard “count of range sum” problem, where the model uses a more complex strategy (prefix sums + sorting/deduplication + Binary Indexed Tree) rather than a trivial approach.

What algorithmic technique appears in the hard range-sum solution?

The transcript describes a combination of merge-sort ideas and binary indexing: it computes prefix sums, sorts and removes duplicates, then uses a Binary Indexed Tree (Fenwick tree) to count how many prefix sums fall within the required bounds for each index. It updates the BIT as it iterates, and uses BIT queries to count values ≤ a threshold so it can derive counts in [lower, upper] inclusive ranges.
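The strategy described above can be sketched as follows. This is a standard implementation of the prefix-sum + coordinate-compression + Fenwick-tree approach to "Count of Range Sum", reconstructed from the transcript's description rather than copied from the model's output.

```python
import bisect


class BIT:
    """Fenwick tree over 1-based positions, storing counts."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def update(self, i):
        while i <= self.n:
            self.tree[i] += 1
            i += i & (-i)

    def query(self, i):
        """Sum of counts at positions 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & (-i)
        return total


def count_range_sum(nums, lower, upper):
    # Prefix sums: P[0] = 0, P[k] = nums[0] + ... + nums[k-1].
    prefix = [0]
    for x in nums:
        prefix.append(prefix[-1] + x)

    # Sort and deduplicate the prefix sums for coordinate compression.
    sorted_vals = sorted(set(prefix))
    bit = BIT(len(sorted_vals))

    count = 0
    for p in prefix:
        # Count previously inserted prefix sums q with
        # p - upper <= q <= p - lower (i.e. range sum in [lower, upper]).
        lo = bisect.bisect_left(sorted_vals, p - upper)   # values < p - upper
        hi = bisect.bisect_right(sorted_vals, p - lower)  # values <= p - lower
        count += bit.query(hi) - bit.query(lo)
        # Insert the current prefix sum (1-based compressed rank).
        bit.update(bisect.bisect_left(sorted_vals, p) + 1)
    return count
```

Querying before inserting the current prefix sum ensures only earlier indices are counted, which matches the "update the BIT as it iterates" behavior described above.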

How well does DeepSeek Coder handle building small applications, not just functions?

The transcript shows it generating a Flask app that collects emails via a form, stores them in a SQLite database, and displays them on an admin page. It also generates a Flappy Bird clone in Pygame: the first attempt has physics issues (the bird doesn’t fall), but a revised prompt adds gravity and scoring and improves collision/game-over behavior. This suggests the model can scaffold working projects, then refine them with targeted follow-ups.
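A minimal sketch of the kind of Flask + SQLite app described is shown below. The route names, form field, and schema are assumptions for illustration; the transcript does not show the generated source.

```python
import sqlite3

from flask import Flask, request

app = Flask(__name__)

# In-memory database for the sketch; a real app would use a file path.
# check_same_thread=False lets the dev server's threads share the connection.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE IF NOT EXISTS emails (id INTEGER PRIMARY KEY, address TEXT)")


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        address = request.form.get("email", "").strip()
        if address:
            db.execute("INSERT INTO emails (address) VALUES (?)", (address,))
            db.commit()
        return "Thanks for signing up!"
    return '<form method="post"><input name="email"><button>Submit</button></form>'


@app.route("/admin")
def admin():
    rows = db.execute("SELECT address FROM emails").fetchall()
    return "<br>".join(addr for (addr,) in rows) or "No emails yet."
```

This mirrors the described flow: a public form posts an email, the value is persisted in SQLite, and `/admin` lists everything collected so far.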

What limitation shows up in the creative coding prompt?

When asked to generate excuses using Dwight Schrute’s personality from The Office, the model produces code that generates excuses, but the resulting excuses don’t fully match the requested persona. The code runs and uses random selection, but the prompt-to-output alignment is imperfect—highlighting that creative style constraints may require more careful prompting or additional constraints.

Review Questions

  1. What specific training and tuning details (token mix, instruction fine-tuning, context window) are cited as reasons DeepSeek Coder performs well on coding tasks?
  2. Describe the core approach used to solve the hard “count of range sum” problem as reported in the transcript.
  3. In the Flask and Flappy Bird demos, what kinds of issues were corrected after the initial generation?

Key Points

  1. DeepSeek Coder is trained on about two trillion tokens with roughly 87% code, supporting strong code generation.

  2. Model sizes range from 1B to 33B parameters, and the context window is about 16k tokens.

  3. DeepSeek Coder Instruct adds about 2B tokens of instruction fine-tuning to improve task-following.

  4. The transcript reports benchmark advantages for DeepSeek Coder (about 8–10% on cited coding benchmarks) and competitive results versus GPT-3.5 Turbo on human evaluation.

  5. In hands-on tests, the model produced accepted LeetCode solutions for easy and medium problems and also succeeded on a hard range-sum problem using prefix sums and a Binary Indexed Tree.

  6. Beyond algorithms, it generated a working Flask email-collection/admin-view app and a Pygame Flappy Bird clone, with follow-up prompts fixing gameplay issues.

Highlights

DeepSeek Coder Instruct is positioned as a free, open-source alternative that can match or exceed many open coding LLMs, with reported gains of roughly 8–10% on benchmarks.
Accepted LeetCode solutions were generated on the first try for Two Sum and a medium “array game,” with strong reported runtime and memory rankings.
For the hard “count of range sum,” the model produced an accepted solution using prefix sums plus Binary Indexed Tree counting logic.
The model can scaffold small apps (Flask + SQLite) and games (Pygame), then improve them after targeted prompt feedback.
