DeepSeek v3 Tested - Coding, Data Extraction, Summarization, Data Labelling, RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
DeepSeek V3 is positioned as a top-tier open-weight mixture-of-experts (MoE) model—strong on benchmarks and notably effective at real-world information tasks like summarization, classification, and structured extraction—while showing a clear weakness on at least one coding/data-manipulation challenge. The model’s practical appeal comes from its MoE design: although the total parameter count across experts reaches 671B, only 37B parameters are active during inference. That active subset helps keep responses competitive in quality without requiring the full compute footprint of a dense 671B model, though the transcript notes that loading and working with the full set of weights still makes local deployment difficult.
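To make the 671B-total/37B-active distinction concrete, here is a minimal sketch of top-k expert routing in a mixture-of-experts layer. This is an illustrative toy (gating scheme, shapes, and expert count are assumptions), not DeepSeek's actual routing code:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, gate_w, top_k=2):
    """Toy MoE layer: each token runs through only its top_k experts."""
    scores = x @ gate_w                             # (tokens, n_experts) gating logits
    top = np.argsort(scores, axis=-1)[:, -top_k:]   # top_k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the selected experts' scores only
        sel = scores[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])  # only top_k experts run per token
    return out

n_experts, d = 8, 16
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(4, d))                         # 4 tokens
y = moe_layer(x, experts, gate_w, top_k=2)
print(y.shape)  # (4, 16): full-width output, but only 2 of 8 experts computed per token
```

Note that all 8 expert matrices still have to be stored, even though each token touches only 2 of them; that is the same asymmetry that makes DeepSeek V3 cheap to run per token yet heavy to load.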
On training scale, DeepSeek V3 is described as being trained on nearly 15 trillion high-quality tokens, with training costs reported in the paper as roughly $6 million for pre-training (assuming an H800 GPU-hour cost estimate). The architecture is tied to MoE, and the transcript highlights two training choices that are framed as performance drivers: a multi-token prediction objective (instead of single-token prediction) and training in 8-bit mixed precision (FP8). It also describes a two-stage post-training approach that adds “chain-of-thought” style reasoning capabilities via a smaller post-training/distillation step after the large pre-training run. Context length is listed as 128k, and the model is distributed via an official GitHub repository and weights on Hugging Face.
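The multi-token prediction objective can be illustrated with a toy target-construction function: instead of one next-token target per position, each position gets the next n tokens as targets. This is illustrative only; DeepSeek's actual objective and module structure are not described in the source:

```python
import numpy as np

def multi_token_targets(token_ids, n_predict=2):
    """For each position, collect the next n_predict token ids as targets
    (toy illustration of a multi-token prediction objective)."""
    T = len(token_ids)
    targets = []
    for t in range(T - n_predict):
        targets.append(token_ids[t + 1 : t + 1 + n_predict])
    return np.array(targets)

seq = [5, 9, 2, 7, 3, 1]
print(multi_token_targets(seq, n_predict=2))
# each row holds the two tokens the model should predict from that position:
# [[9 2]
#  [2 7]
#  [7 3]
#  [3 1]]
```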
In hands-on tests using an API workflow, the model delivers mixed results across task types. For creative generation, it produces a technology-themed hip-hop lyric that the tester ranks among the best outputs tried, though generation latency is described as slow (about 1.5 minutes for the lyric request). For coding, the model attempts to generate Python/pandas code to create and sort a dataset of people by continent and other fields, but the code fails with a pandas error: “'continent' is both an index level and a column label, which is ambiguous.” That hard failure stands out because smaller models reportedly managed to produce runnable code in similar situations.
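The reported failure is a well-known pandas pitfall: a label that exists both as an index level and as a column. A minimal reproduction and fix (the DataFrame contents here are made up for illustration, not the tester's actual data):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chen", "Dara"],
    "continent": ["Europe", "Africa", "Asia", "Asia"],
    "age": [34, 28, 45, 31],
})

# Keeping 'continent' as both an index level and a column triggers the ambiguity:
bad = df.set_index("continent", drop=False)
try:
    bad.sort_values("continent")
except ValueError as e:
    print(e)  # 'continent' is both an index level and a column label, ...

# Fix: keep the name in exactly one place before sorting
good = bad.reset_index(drop=True).sort_values(["continent", "age"])
print(good)
```

A model that generates `set_index(..., drop=False)` followed by a sort or groupby on the same label will hit this error every time, which is likely what happened in the test.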
Where DeepSeek V3 performs most strongly is structured language work. It classifies five tweets into audience, tone/sentiment, complexity level, and main themes, producing results that align well with the tester’s expectations (about 33 seconds for the batch). It summarizes a multi-page Meta earnings report into 3–4 sentences in roughly 18 seconds, capturing key points including AI development with Llama 3, metaverse investment, and headcount reduction from layoffs. It also rewrites a LinkedIn post from the report markdown in about 30 seconds, with readable formatting and engagement-oriented phrasing.
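The exact prompts used in the video are not given in the source. As an illustration, a batched classification request of this shape could be assembled and sent to an OpenAI-compatible endpoint (the prompt wording, JSON keys, and sample tweets below are assumptions):

```python
def build_classification_prompt(tweets):
    """Assemble one batched prompt asking for audience/tone/complexity/theme
    labels per tweet (wording and JSON keys are illustrative)."""
    header = (
        "Classify each tweet. For each one, return JSON with keys: "
        "audience, tone, complexity, themes.\n\n"
    )
    body = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tweets))
    return header + body

prompt = build_classification_prompt([
    "Just shipped our new open-source MoE model!",
    "Why does my pandas sort keep failing?",
])
print(prompt)

# Sending it (assumes DeepSeek's OpenAI-compatible endpoint):
#   from openai import OpenAI
#   client = OpenAI(base_url="https://api.deepseek.com", api_key="...")
#   resp = client.chat.completions.create(
#       model="deepseek-chat",
#       messages=[{"role": "user", "content": prompt}],
#   )
```

Batching all five tweets into one request, as sketched here, matches the ~33-seconds-per-batch timing reported above better than five separate calls would.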
For extraction, it parses a receipt image into store name, purchase date/time, total, tax, payment method, and item list. The transcript reports correct merchant details and pricing, with quantity marked as “na” when not present on the receipt—an example of cautious extraction. It then answers targeted questions about the Meta report (e.g., what Mark Zuckerberg is most proud of, and the expected 2024 tax rate) with direct alignment to the source text. Finally, it generates correctly formatted financial comparison tables from markdown spread across pages, including dollar signs and accurate figures.
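The “na” fallback for absent fields can be mirrored in post-processing. A sketch assuming the model returns JSON (the field names and sample payload below are illustrative, not from the source):

```python
import json

# Illustrative schema for the receipt fields described above
EXPECTED_FIELDS = ["store_name", "purchase_datetime", "total", "tax",
                   "payment_method", "items"]

def normalize_receipt(raw_json):
    """Fill any field the model omitted with 'na', mirroring the cautious
    extraction behaviour described in the tests."""
    data = json.loads(raw_json)
    return {f: data.get(f, "na") for f in EXPECTED_FIELDS}

# e.g. a model response missing tax and payment_method:
raw = '{"store_name": "Acme Market", "total": "12.40", "items": []}'
print(normalize_receipt(raw))
```

Explicitly marking absent fields, rather than letting the model invent values, is what makes this kind of extraction auditable against the source receipt.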
Overall, DeepSeek V3 looks like a strong general-purpose tool for summarization, classification, and document/receipt/table extraction, with one notable gap in producing working code for a pandas sorting task—suggesting that reasoning and structured output can be reliable even when code execution details still trip it up.
Cornell Notes
DeepSeek V3 is a mixture-of-experts model with 671B total parameters across experts, but only 37B active during inference, aiming to balance quality and compute. Training is described as spanning nearly 15T high-quality tokens, using an 8-bit mixed-precision (FP8) approach and a multi-token prediction objective, plus a smaller post-training step to add chain-of-thought-style reasoning. In practical tests via an API, it performs especially well on summarization, tweet classification, receipt parsing, and question answering over long markdown (including multi-page financial tables). A standout weakness appears in a pandas coding task: the generated code fails due to an “ambiguous continent” index/column issue. The net result is strong reliability for structured information tasks, with less dependable code correctness.
- How does the mixture-of-experts design affect what’s practical about running DeepSeek V3?
- What training choices are highlighted as likely contributors to performance?
- Where did DeepSeek V3 succeed in hands-on tasks, and what were the outcomes?
- What was the most notable failure in the coding test, and why does it matter?
- How did the model handle long-context table extraction?
Review Questions
- What trade-off does the MoE design create between active inference compute and the practical burden of loading weights?
- Which training objectives and precision choices are named as performance drivers, and how might each influence model behavior?
- What specific error occurred in the pandas coding task, and what does it suggest about the model’s reliability for executable code?
Key Points
1. DeepSeek V3 uses a mixture-of-experts setup with 671B total parameters but only 37B active during inference, improving compute efficiency while still making full-weight handling heavy.
2. Training is described as spanning nearly 15T tokens, with reported pre-training costs around $6M and an additional, much cheaper post-training step to add chain-of-thought-style reasoning.
3. A multi-token prediction objective and FP8/8-bit mixed-precision training are highlighted as key differences that may speed training and improve results.
4. In API-based tests, DeepSeek V3 produced strong summaries, classifications, and structured extractions from markdown and images, including receipts and multi-page financial reports.
5. The model’s biggest weakness in the tester’s workflow was a pandas code generation task that failed with an index/column ambiguity error for “continent.”
6. Long-context table extraction worked well when numbers were distributed across pages in markdown, with correct values and formatting (including dollar signs).