
DeepSeek v3 Tested - Coding, Data Extraction, Summarization, Data Labelling, RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

DeepSeek V3 uses a mixture-of-experts setup with 671B total parameters but only 37B active during inference, improving compute efficiency, though handling the full set of weights remains heavy.

Briefing

DeepSeek V3 is positioned as a top-tier open-weight mixture-of-experts (MoE) model—strong on benchmarks and notably effective at real-world information tasks like summarization, classification, and structured extraction—while showing a clear weakness on at least one coding/data-manipulation challenge. The model’s practical appeal comes from its MoE design: although the total parameter count across experts reaches 671B, only 37B parameters are active during inference. That active subset helps keep responses competitive in quality without requiring the full compute footprint of a dense 671B model, though the transcript notes that loading and working with the full set of weights still makes local deployment difficult.
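
To make the "only 37B active" point concrete, here is a minimal, hypothetical top-k routing sketch (toy sizes and made-up names, not DeepSeek V3's actual architecture): the router picks a few experts per token, so only those experts' weights participate in that token's forward pass, even though all expert weights still have to be loaded.

```python
import numpy as np

# Minimal, hypothetical top-k MoE routing sketch (illustrative only;
# not DeepSeek V3's actual implementation). The point: all experts'
# weights exist in memory, but each token only uses the few experts
# the router selects.

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 64, 8, 2  # toy sizes, not DeepSeek's
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """x: (d_model,) token representation -> (d_model,) output."""
    logits = x @ router                       # router score per expert
    top = np.argsort(logits)[-top_k:]         # keep only the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Only the selected experts' parameters are used for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)               # (64,)
```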

On training scale, DeepSeek V3 is described as being trained on nearly 15 trillion high-quality tokens, with training costs reported in the paper as roughly $6 million for pre-training (assuming an H800 GPU-hour cost estimate). The architecture is tied to MoE, and the transcript highlights two training choices that are framed as performance drivers: a multi-token prediction objective (instead of single-token prediction) and training in 8-bit mixed precision (FP8). It also describes a two-stage post-training approach that adds “chain-of-thought” style reasoning capabilities via a smaller post-training/distillation step after the large pre-training run. Context length is listed as 128k, and the model is distributed via an official GitHub repository and weights on Hugging Face.

In hands-on tests using an API workflow, the model delivers mixed results across task types. For creative generation, it produces a technology-themed hip-hop lyric that the tester ranks among the best outputs tried, though generation latency is described as slow (about 1.5 minutes for the lyric request). For coding, the model attempts to generate Python/pandas code to create and sort a dataset of people by continent and other fields, but it fails with a pandas error: “'continent' is both an index level and a column label, which is ambiguous.” That hard failure stands out because smaller models reportedly managed to produce runnable code in similar situations.
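
For context, the failure is a known pandas behavior: sorting or grouping by a label that exists both as an index level and as a column raises a ValueError. A minimal reproduction with made-up data (not the model's actual output) looks like this:

```python
import pandas as pd

# Minimal reproduction of the ambiguity the generated code hit
# (toy data; not the model's actual output).
df = pd.DataFrame(
    {"name": ["Ana", "Bo", "Cy"], "continent": ["Europe", "Asia", "Africa"], "age": [34, 28, 41]}
)
df = df.set_index("continent", drop=False)  # 'continent' is now both an index level and a column

try:
    df.sort_values("continent")
except ValueError as e:
    print(e)  # "'continent' is both an index level and a column label, which is ambiguous."

# One common fix: drop the duplicated label from the index before sorting.
print(df.reset_index(drop=True).sort_values("continent"))
```

Code that sets a column as the index with drop=False and then sorts or groups by the same label will hit exactly this error; removing the duplicated label first avoids it.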

Where DeepSeek V3 performs most strongly is structured language work. It classifies five tweets into audience, tone/sentiment, complexity level, and main themes, producing results that align well with the tester’s expectations (taking about 33 seconds for the batch). It summarizes a multi-page Meta earnings report into 3–4 sentences in roughly 18 seconds, capturing key points including AI development with Llama 3, metaverse investment, and headcount reduction from layoffs. It also rewrites a LinkedIn post from the report markdown in about 30 seconds, with readable formatting and engagement-oriented phrasing.
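
The tweet-classification step could look roughly like the sketch below. The endpoint and model name follow DeepSeek's OpenAI-compatible API, but the exact prompt, client, and parameters used in the video aren't shown, so treat the details here as assumptions.

```python
from openai import OpenAI

# Hypothetical classification call; base_url/model follow DeepSeek's
# OpenAI-compatible API, but the prompt and settings are assumptions,
# not the exact workflow from the video.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

tweets = [
    "Just shipped our new open-source vector DB — benchmarks inside!",
    "Why does every earnings call mention AI at least 40 times now?",
]

prompt = (
    "For each tweet, return JSON with fields: audience, tone, "
    "complexity_level, main_themes.\n\n"
    + "\n".join(f"- {t}" for t in tweets)
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```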

For extraction, it parses a receipt image into store name, purchase date/time, total, tax, payment method, and item list. The transcript reports correct merchant details and pricing, with quantity marked as “na” when not present on the receipt—an example of cautious extraction. It then answers targeted questions about the Meta report (e.g., what Mark Zuckerberg is most proud of, and the expected 2024 tax rate) with direct alignment to the source text. Finally, it generates correctly formatted financial comparison tables from markdown spread across pages, including dollar signs and accurate figures.
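
Since DeepSeek V3 is a text model, a comparable receipt-extraction call would presumably operate on text obtained from the image first; the exact pipeline in the video isn't specified. The sketch below uses a made-up receipt and field names that mirror the fields reported in the test, and explicitly asks for the cautious "na" behavior described above.

```python
import json
from openai import OpenAI

# Hypothetical extraction call; the receipt text, schema, and model
# access (DeepSeek's OpenAI-compatible API) are assumptions, not the
# video's exact setup.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

receipt_text = """ACME MARKET
2024-03-02 14:31
Milk 2L        3.49
Bread          2.10
TOTAL          5.59
TAX            0.45
VISA ****1234"""

schema = {
    "store_name": "", "purchase_datetime": "", "total": "", "tax": "",
    "payment_method": "", "items": [{"name": "", "price": "", "quantity": ""}],
}

prompt = (
    "Extract the receipt below into this JSON structure. "
    "Use 'na' for any field not present on the receipt.\n"
    f"{json.dumps(schema)}\n\nReceipt:\n{receipt_text}"
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```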

Overall, DeepSeek V3 looks like a strong general-purpose tool for summarization, classification, and document/receipt/table extraction, with one notable gap in producing working code for a pandas sorting task—suggesting that reasoning and structured output can be reliable even when code execution details still trip it up.

Cornell Notes

DeepSeek V3 is a mixture-of-experts model with 671B total parameters across experts, but only 37B active during inference, aiming to balance quality and compute. Training is described as spanning nearly 15T high-quality tokens, using an 8-bit mixed-precision (FP8) approach and a multi-token prediction objective, plus a smaller post-training step to add chain-of-thought-style reasoning. In practical tests via an API, it performs especially well on summarization, tweet classification, receipt parsing, and question answering over long markdown (including multi-page financial tables). A standout weakness appears in a pandas coding task: the generated code fails due to an “ambiguous continent” index/column issue. The net result is strong reliability for structured information tasks, with less dependable code correctness.

How does the mixture-of-experts design affect what’s practical about running DeepSeek V3?

The model is described as MoE with 671B total parameters across experts, but only 37B are active during inference. That means inference can avoid the full compute cost of a dense 671B model. However, the transcript also notes a practical downside: weights for all experts still need to be loaded/handled, making local setups difficult even if active computation is smaller.

What training choices are highlighted as likely contributors to performance?

Three choices get emphasis: (1) a multi-token prediction objective, which reportedly improves results versus single-token prediction; (2) FP8/8-bit mixed-precision training, framed as faster and less VRAM/compute intensive than 16-bit training; and (3) a two-stage training approach where chain-of-thought capabilities are added through a smaller post-training/distillation step after large-scale pre-training. Context length is listed as 128k.
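
As a loose illustration of the multi-token prediction idea (not DeepSeek V3's actual MTP modules, which the paper implements with dedicated sequential prediction heads), the toy loss below supervises several future tokens per position instead of only the next one:

```python
import torch
import torch.nn.functional as F

# Toy multi-token prediction loss (illustration only). Instead of
# predicting just token t+1 from position t, the model also predicts
# t+2, ..., t+K, and the per-offset losses are averaged.
vocab, hidden, K = 100, 32, 2                       # toy sizes
hidden_states = torch.randn(8, hidden)              # 8 positions from a toy "trunk"
heads = [torch.nn.Linear(hidden, vocab) for _ in range(K)]  # one head per future offset
tokens = torch.randint(0, vocab, (8 + K,))          # toy target sequence

losses = []
for k, head in enumerate(heads, start=1):
    logits = head(hidden_states)                    # predict the token at offset +k
    targets = tokens[k : k + 8]
    losses.append(F.cross_entropy(logits, targets))

mtp_loss = torch.stack(losses).mean()
print(float(mtp_loss))
```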

Where did DeepSeek V3 succeed in hands-on tasks, and what were the outcomes?

It succeeded strongly on structured language tasks: tweet classification into audience/tone/complexity/themes; summarizing a Meta earnings report into 3–4 sentences while capturing AI (Llama 3), metaverse investment, and layoffs/headcount reduction; and generating a readable LinkedIn post with engagement-oriented phrasing. It also extracted receipt fields (store, date/time, totals, tax, payment method) and answered specific questions from the Meta report with direct alignment to quoted text.

What was the most notable failure in the coding test, and why does it matter?

In a pandas coding challenge to generate a dataset and sort/group it by continent and other fields, the model produced code that crashed with: “'continent' is both an index level and a column label, which is ambiguous.” The transcript treats this as a hard fail because even smaller models reportedly produced runnable code, indicating that code correctness/execution details remain a weak spot despite strong reasoning and structured output elsewhere.

How did the model handle long-context table extraction?

The transcript describes financial table extraction where relevant numbers were spread across different pages in markdown. DeepSeek V3 reportedly pulled the correct figures, formatted tables cleanly (including dollar signs), and produced accurate comparisons for net cash from operating activities and purchases of property/equipment across 2023 and 2024. This is presented as a context-utilization strength.

Review Questions

  1. What trade-off does the MoE design create between active inference compute and the practical burden of loading weights?
  2. Which training objectives and precision choices are named as performance drivers, and how might each influence model behavior?
  3. What specific error occurred in the pandas coding task, and what does it suggest about the model’s reliability for executable code?

Key Points

  1. DeepSeek V3 uses a mixture-of-experts setup with 671B total parameters but only 37B active during inference, improving compute efficiency, though handling the full set of weights remains heavy.

  2. Training is described as spanning nearly 15T tokens, with reported pre-training costs around $6M and an additional, much cheaper post-training step to add chain-of-thought-style reasoning.

  3. A multi-token prediction objective and FP8/8-bit mixed-precision training are highlighted as key differences that may speed training and improve results.

  4. In API-based tests, DeepSeek V3 produced strong summaries, classifications, and structured extractions from markdown and images, including receipts and multi-page financial reports.

  5. The model’s biggest weakness in the tester’s workflow was a pandas code generation task that failed with an index/column ambiguity error for “continent.”

  6. Long-context table extraction worked well when numbers were distributed across pages in markdown, with correct values and formatting (including dollar signs).

Highlights

Only 37B of 671B parameters are active during inference, making DeepSeek V3’s MoE design central to its practical performance.
Receipt extraction included cautious behavior: when quantity wasn’t present, the output used “na” rather than inventing a number.
Summarization of Meta’s earnings report captured multiple specific elements—Llama 3 AI development, metaverse investment, and headcount reduction—into a tight 3–4 sentence output.
A single coding task failed hard due to a pandas ambiguity error (“continent” as both index and column), contrasting with strong structured extraction performance elsewhere.

Topics

  • Mixture of Experts
  • FP8 Training
  • Chain-of-Thought Post-Training
  • Receipt Extraction
  • Long-Context Summarization
