This model is better than ChatGPT and 10x cheaper
Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A new open-source “frontier” language model, DeepSeek V3, is being positioned as a major cost-and-capability shift: it reportedly cost about $5 million to train—far below the tens of millions to $100 million often associated with top-tier models—while delivering strong performance in everyday uses like English, coding, and math. The practical implication is that high-end model capability is moving out of the “only a few well-funded labs can afford it” category and toward a world where smaller startups can realistically build and iterate on competitive systems.
DeepSeek V3’s training approach is described as unusually selective rather than “grab everything from the internet.” Instead of a broad, low-curation scrape, the model was trained on a focused corpus of high-quality tokens, with explicit attention to performance across English, Chinese, math, and coding. It was then refined with reinforcement from human feedback to improve accuracy. Per the transcript, that post-training step matters because it increases the model’s confidence during generation, and that confidence in turn enables more efficient token prediction at inference time: the model can generate ahead rather than relying strictly on one-token-at-a-time continuation.
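The curation idea can be sketched as a simple filter over a raw corpus. The `curate` helper and the `quality_score` heuristic below are purely illustrative stand-ins, not DeepSeek’s actual data pipeline:

```python
def curate(corpus, quality_score, threshold=0.8):
    """Keep only documents that clear a quality bar,
    rather than training on the whole scrape."""
    return [doc for doc in corpus if quality_score(doc) >= threshold]

def quality_score(doc):
    # Toy heuristic: share of alphanumeric/whitespace characters.
    ok = sum(ch.isalnum() or ch.isspace() for ch in doc)
    return ok / max(len(doc), 1)

corpus = ["clean prose about math", "##$$%% junk @@@@", "def f(x): return x"]
kept = curate(corpus, quality_score)
# The noisy middle document is dropped; the other two are kept.
```

Real curation pipelines combine many such signals (classifiers, deduplication, domain balancing), but the shape is the same: score, threshold, keep.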
The model’s efficiency is framed as a key technical differentiator. DeepSeek V3 is a very large model, roughly 671 billion total parameters, but it uses a mixture-of-experts design and does not activate the full parameter space for every response. Instead, it routes each query through roughly 37 billion parameters, a “sliver” of the total. The claim is that this selective use of model capacity keeps compute and latency manageable without sacrificing quality.
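That selective activation is the core of mixture-of-experts routing: a small router scores many expert sub-networks and only the top few actually run for each token. A minimal sketch with made-up router logits (not DeepSeek’s actual router, which also uses load-balancing tricks):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_to_experts(router_logits, top_k=2):
    """Pick the top_k experts for this token; only their
    parameters participate in the forward pass."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Return (expert index, renormalized mixing weight) pairs.
    return [(i, probs[i] / norm) for i in chosen]

# Toy router scores for 8 experts; only 2 are activated for this token.
weights = route_to_experts([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2])
```

With hundreds of experts per layer, this is how a ~671B-parameter model can touch only ~37B parameters per token.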
Another highlighted capability is multi-token prediction. Rather than predicting only the next token, DeepSeek V3 is said to predict two tokens per step, leveraging the confidence built during training. The transcript also points to a training technique it calls “dual pipe”; in DeepSeek’s own materials this is DualPipe, a pipeline-parallelism scheme that overlaps computation with communication across devices so hardware spends less time idle during training.
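The two-tokens-per-step idea can be illustrated with a toy decoder that accepts a speculative second token only when the model is confident enough. This is a sketch of the general speculative-decoding pattern under a confidence threshold, not DeepSeek’s actual multi-token-prediction heads:

```python
def decode_two_ahead(two_token_heads, prompt, steps, threshold=0.9):
    """Each 'forward pass' proposes the next TWO tokens. The second
    is kept only when its probability clears the threshold; otherwise
    it is discarded and re-predicted on the next pass."""
    seq = list(prompt)
    produced = 0
    while produced < steps:
        (tok1, p1), (tok2, p2) = two_token_heads(seq)  # one forward pass
        seq.append(tok1)
        produced += 1
        if produced < steps and p2 >= threshold and tok2:
            seq.append(tok2)  # accept the speculative second token
            produced += 1
    return seq

# Toy "model": deterministically continues the string "abcdef",
# always with full confidence, so every second token is accepted.
target = "abcdef"
def heads(seq):
    nxt = (target[len(seq)], 1.0)
    i = len(seq) + 1
    following = (target[i], 1.0) if i < len(target) else ("", 0.0)
    return nxt, following

out = decode_two_ahead(heads, list("ab"), steps=4)
```

When the confident case dominates, the decoder needs roughly half as many forward passes for the same output, which is where the latency win comes from.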
Strategically, the emphasis is less on one-off benchmarks and more on what the cost and openness enable. DeepSeek V3 is open-sourced, meaning developers can inspect it, use it directly, and attempt their own improvements. That openness, combined with the much lower training cost, is portrayed as accelerating the spread of competitive “frontier” models.
The transcript contrasts this with a broader industry trend: the cutting edge is shifting toward inference-time compute, systems that explore multiple next-token prediction threads in parallel to find better continuations. In that framing, ChatGPT’s advantage is tied to inference-time compute, and open-source replication of that exact approach may take longer. Still, the immediate takeaway is that a $5 million model is matching or outperforming leading models (including GPT-4 and Claude variants) on many common tasks, suggesting that “freeing” intelligence for business applications is becoming increasingly feasible, especially as replication gets cheaper and faster than the first breakthrough.
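In its simplest form, inference-time compute is best-of-n sampling: spend extra compute drawing several candidate answers in parallel, score them, and keep the best. A toy sketch with a random stand-in “model” and a hypothetical scorer (neither is how production systems actually sample or grade):

```python
import random

def best_of_n(sample_completion, score, prompt, n=8, seed=0):
    """Draw n candidate continuations (parallel in principle),
    score each, and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_completion(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: the "model" appends a random digit; the "scorer"
# prefers larger final digits. Both are illustrative only.
def sample_completion(prompt, rng):
    return prompt + str(rng.randint(0, 9))

def score(text):
    return int(text[-1])

best = best_of_n(sample_completion, score, "answer: ")
```

The tradeoff is the point: quality scales with inference spend (n) rather than with training cost, which is why the transcript treats it as a separate competitive axis.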
Cornell Notes
DeepSeek V3 is presented as a high-performing, open-source “frontier” language model that reportedly cost about $5 million to train, dramatically less than the $70–$100 million often cited for top closed models. Its strength is linked to careful, high-quality training data (not a broad internet scrape), reinforcement from human feedback for accuracy, and more efficient generation that can predict multiple tokens per step. Even though it is a ~671B-parameter mixture-of-experts model, it activates only about 37B parameters per token, improving compute efficiency. The broader impact: smaller startups may be able to build and iterate on competitive models, pushing “frontier-level” intelligence toward lower costs and wider availability.
Why does DeepSeek V3’s training cost matter beyond bragging rights?
What training-data strategy is credited for DeepSeek V3’s quality?
How does the model generate more efficiently if it’s so large?
What does “predicting two tokens ahead” change in practice?
What is “DualPipe,” and why is it mentioned?
How does inference-time compute fit into the competitive landscape?
Review Questions
- What evidence does the transcript give that DeepSeek V3’s performance comes from training choices rather than just scale?
- How do selective parameter usage (activating ~37B of ~671B parameters) and multi-token prediction (two tokens per step) work together to improve efficiency?
- Why does the transcript treat inference-time compute as a separate axis of advantage from training cost?
Key Points
1. DeepSeek V3 is described as an open-source, high-performing model with an estimated ~$5 million training cost, far below commonly cited costs for top closed models.
2. Carefully curated training data (high-quality tokens rather than broad internet scraping) is credited for strong performance across English, Chinese, math, and coding.
3. Reinforcement from human feedback is presented as a key accuracy driver that increases generation confidence during inference.
4. Despite its ~671B total parameters, DeepSeek V3 reportedly activates only ~37B parameters per token, improving compute efficiency.
5. The model is said to predict two tokens per step rather than only one, leveraging training confidence to generate more efficiently.
6. A training technique called DualPipe, a pipeline-parallelism scheme that overlaps computation and communication, is highlighted as an additional innovation in the model build.
7. The competitive edge is framed as shifting toward inference-time compute, where multiple next-token prediction threads can be evaluated in parallel, an advantage that may be harder to replicate quickly in open source.