This model is better than ChatGPT and 10x cheaper

5 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.

TL;DR

DeepSeek V3 is described as an open-source, high-performing model with an estimated ~$5 million training cost, far below commonly cited costs for top closed models.

Briefing

A new open-source “frontier” language model, DeepSeek V3, is being positioned as a major cost-and-capability shift: it reportedly cost about $5 million to train—far below the tens of millions to $100 million often associated with top-tier models—while delivering strong performance in everyday uses like English, coding, and math. The practical implication is that high-end model capability is moving out of the “only a few well-funded labs can afford it” category and toward a world where smaller startups can realistically build and iterate on competitive systems.

DeepSeek V3's training approach is described as unusually selective rather than "grab everything from the internet." Instead of a broad, low-curation scrape, the model was trained on a focused corpus of high-quality tokens, with explicit attention to performance across English, Chinese, math, and coding. It was then reinforced with human responses to improve accuracy, an ingredient that matters because it increases the model's confidence during generation. That confidence, in turn, enables more efficient token prediction at inference time, letting the model commit to tokens further ahead rather than relying strictly on one-token-at-a-time continuation.
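To make the curation idea concrete, here is a minimal Python sketch of quality-filtered corpus selection across those four target domains. The quality_score function, domain tags, and threshold are hypothetical stand-ins; the transcript doesn't describe DeepSeek's actual filtering pipeline.

```python
def curate_corpus(documents, quality_score, threshold=0.8,
                  domains=("en", "zh", "math", "code")):
    """Keep only documents that clear a quality bar, grouped by target
    domain. The scorer, tags, and threshold are illustrative stand-ins,
    not DeepSeek's actual pipeline."""
    kept = {d: [] for d in domains}
    for doc in documents:
        if doc["domain"] in kept and quality_score(doc["text"]) >= threshold:
            kept[doc["domain"]].append(doc["text"])
    return kept

def toy_score(text):
    # Crude proxy: longer, more structured text scores higher (toy only).
    return min(1.0, len(text) / 100)

docs = [
    {"domain": "en", "text": "A long, carefully edited passage. " * 5},
    {"domain": "en", "text": "lol"},
    {"domain": "code", "text": "def add(a, b):\n    return a + b\n" * 4},
]
print({k: len(v) for k, v in curate_corpus(docs, toy_score).items()})
# {'en': 1, 'zh': 0, 'math': 0, 'code': 1} -- the low-quality doc is dropped
```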

The model's efficiency is framed as a key technical differentiator. While DeepSeek V3 is a very large model (about 671 billion total parameters), it does not use the full parameter space for every response. Its mixture-of-experts design activates only roughly 37 billion parameters, a "sliver" of the total, for any given query. The claim is that this selective use of model capacity helps keep compute and latency manageable without sacrificing quality.
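The standard mechanism behind that "sliver" is mixture-of-experts routing: a small gating network scores many expert sub-networks, and only the top-k actually run for each token. A minimal sketch with toy linear experts, not DeepSeek's actual architecture:

```python
import numpy as np

def moe_forward(x, experts, gate_W, k=2):
    """Sparse mixture-of-experts step: score every expert, but run only
    the top-k for this token (toy illustration, not DeepSeek's design)."""
    logits = gate_W @ x                      # one gating logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    top_k = np.argsort(probs)[-k:]           # the only experts that run
    out = sum(probs[i] * experts[i](x) for i in top_k)
    return out / probs[top_k].sum()          # renormalize kept weights

rng = np.random.default_rng(0)
d, n_experts = 16, 8
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: W @ x for W in mats]
gate_W = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), experts, gate_W, k=2)
print(y.shape)  # (16,): full-size output produced by 2 of the 8 experts
```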

Another highlighted capability is multi-token prediction. Rather than predicting only the next token, DeepSeek V3 is said to predict two tokens ahead, leveraging the confidence built during training. The transcript also points to a training technique called "dual pipe" (DualPipe in DeepSeek's own materials), described as a mechanism that allows learning and "regurgitating" (reusing learned patterns) to happen in a coordinated way through a specialized network setup.
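Here is a minimal sketch of what two-token-ahead decoding could look like at inference time: always keep the next token, and keep the speculative second token only when the model's confidence clears a threshold. The predict_two interface and the 0.9 threshold are assumptions for illustration, not DeepSeek's API:

```python
import random

def generate_with_lookahead(predict_two, prompt, steps, threshold=0.9):
    """Greedy decoding that can commit two tokens per step. predict_two
    returns (next_token, lookahead_token, lookahead_confidence); the
    interface and threshold are illustrative assumptions."""
    seq = list(prompt)
    for _ in range(steps):
        tok1, tok2, conf2 = predict_two(seq)
        seq.append(tok1)          # the next token is always kept
        if conf2 >= threshold:
            seq.append(tok2)      # keep the lookahead only when confident
    return seq

random.seed(1)

def toy_predict_two(seq):
    # Stand-in for a model head trained to predict two positions ahead.
    tok1 = (seq[-1] + 1) % 100
    return tok1, (tok1 + 1) % 100, random.uniform(0.5, 1.0)

print(generate_with_lookahead(toy_predict_two, [0], steps=5))
# Yields 5-10 new tokens, depending on how often confidence clears 0.9
```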

Strategically, the emphasis is less on one-off benchmarks and more on what the cost and openness enable. DeepSeek V3 is open-sourced, meaning developers can inspect it, use it directly, and attempt their own improvements. That openness, combined with the much lower training cost, is portrayed as accelerating the spread of competitive “frontier” models.

The transcript contrasts this with a broader industry trend: the cutting edge is shifting toward inference-time compute, systems that run multiple parallel next-token prediction threads to find better continuations. In that framing, ChatGPT's advantage is tied to inference-time compute, and open-source replication of that exact approach may take longer. Still, the immediate takeaway is that a $5 million model is outperforming or matching leading models (including GPT-4 and Claude variants) on many common tasks, suggesting that "freeing" intelligence for business applications is becoming increasingly feasible, especially as replication gets cheaper and faster than the first breakthrough.
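A common concrete form of inference-time compute is best-of-N sampling: draw several candidate continuations in parallel and keep the highest-scoring one. A toy sketch, with sample_continuation standing in for a real decoding thread and scoring model (both are assumptions, since the transcript doesn't specify the mechanism):

```python
import random

random.seed(2)

def sample_continuation(prompt):
    """Stand-in for one decoding thread: returns (candidate, score).
    A real system would decode with a model and score with log-probs
    or a reward model; both are assumed here."""
    score = random.gauss(0, 1)
    return f"{prompt} ... candidate", score

def best_of_n(prompt, n=8):
    # Run n samples "in parallel" and keep the highest-scoring one.
    candidates = [sample_continuation(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: c[1])

text, score = best_of_n("Explain mixture-of-experts", n=8)
print(text, round(score, 2))
```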

Cornell Notes

DeepSeek V3 is presented as a high-performing, open-source "frontier" language model that reportedly cost about $5 million to train, dramatically less than the $70–$100 million often cited for top closed models. Its strength is linked to careful, high-quality training data (not a broad internet scrape), reinforcement with human responses for accuracy, and more efficient generation that can predict multiple tokens ahead. Although it is a ~671B-parameter model, it activates only about 37B parameters per response, improving compute efficiency. The broader impact: smaller startups may be able to build and iterate on competitive models, pushing "frontier-level" intelligence toward lower costs and wider availability.

Why does DeepSeek V3’s training cost matter beyond bragging rights?

The transcript frames training cost as the bottleneck that kept frontier models out of reach for most startups. If a top-tier model can be trained for roughly $5 million (compared with $70–$100 million for models like GPT-4), more teams can afford to experiment, fine-tune, and iterate. That lowers the barrier to entry and accelerates replication, especially when the model is open-sourced.

What training-data strategy is credited for DeepSeek V3’s quality?

Instead of “suck up the whole internet,” DeepSeek V3 is described as trained on a specific corpus of high-quality tokens. The transcript emphasizes deliberate coverage across English, Chinese, math, and coding, followed by reinforcement using human responses to improve accuracy. That accuracy is tied to stronger confidence during generation.

How does the model generate more efficiently if it’s so large?

The transcript claims DeepSeek V3 uses only a portion of its parameter space per response. Although the full model is about 671B parameters, it activates roughly 37B parameters for any given query, a small "sliver" of the total. This selective parameter usage, the hallmark of a mixture-of-experts design, is portrayed as a major reason it can be efficient while still performing well.

What does “predicting two tokens ahead” change in practice?

With higher confidence from training, the model can predict more than one token ahead. The transcript contrasts this with one-token-at-a-time generation and says DeepSeek V3 predicts two tokens ahead. That can reduce wasted computation and improve throughput, since the system can commit to a short continuation more confidently.
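A back-of-envelope model makes the throughput effect visible: each step always yields one token, and the lookahead token is kept with some acceptance probability. The rates below are assumptions for illustration, not published DeepSeek figures:

```python
def expected_tokens_per_step(accept_rate):
    """Simple model: every step yields one guaranteed token, plus the
    lookahead token with probability accept_rate (assumed, not a
    published DeepSeek figure)."""
    return 1 + accept_rate

for p in (0.5, 0.8, 0.95):
    print(f"accept rate {p}: {expected_tokens_per_step(p)} tokens/step")
# Throughput approaches 2x as the lookahead acceptance rate rises.
```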

What is “dual pipe,” and why is it mentioned?

"Dual pipe" is described as a training technique that enables learning and "regurgitating" at the same time, supported by a special network setup. The transcript doesn't fully unpack the mechanism; in DeepSeek's technical report, the corresponding technique (DualPipe) is a pipeline-parallelism schedule that overlaps forward and backward computation with communication so hardware spends less time idle. Either way, it is treated as one of several training-phase innovations that contributed to the model's results.
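The payoff of that kind of overlap can be shown with a toy cost model: when communication is hidden behind compute, it largely drops off the critical path. This is a schematic of the general overlap idea only, not DeepSeek's actual DualPipe schedule:

```python
def pipeline_step_time(n_microbatches, t_fwd=1.0, t_comm=1.0, t_bwd=1.0,
                       overlap=False):
    """Toy cost model for one training step on one pipeline stage.
    With overlap=True, each micro-batch's communication is hidden
    behind compute (a schematic, not DeepSeek's actual schedule)."""
    if not overlap:
        return n_microbatches * (t_fwd + t_comm + t_bwd)
    # Communication runs concurrently with compute, so only compute
    # (plus one final, un-hideable transfer) sits on the critical path.
    return n_microbatches * (t_fwd + t_bwd) + t_comm

print(pipeline_step_time(8))                # 24.0 time units, fully serial
print(pipeline_step_time(8, overlap=True))  # 17.0 time units with overlap
```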

How does inference-time compute fit into the competitive landscape?

The transcript argues that the cutting edge is shifting toward inference-time compute: running multiple parallel next-token prediction threads and selecting the best continuation. It suggests open-source replication of that exact approach may take time, so even if DeepSeek V3 is efficient, ChatGPT's advantage may still come from inference-time compute. Meanwhile, cost pressure is pushing many "frontier" models toward near-zero marginal cost and broad availability.

Review Questions

  1. What evidence does the transcript give that DeepSeek V3’s performance comes from training choices rather than just scale?
  2. How do selective parameter usage (37B out of 671B) and multi-token prediction (two tokens ahead) work together to improve efficiency?
  3. Why does the transcript treat inference-time compute as a separate axis of advantage from training cost?

Key Points

  1. DeepSeek V3 is described as an open-source, high-performing model with an estimated ~$5 million training cost, far below commonly cited costs for top closed models.

  2. Carefully curated training data (high-quality tokens rather than broad internet scraping) is credited for strong performance across English, Chinese, math, and coding.

  3. Human-response reinforcement is presented as a key accuracy driver that increases generation confidence during inference.

  4. Despite being a ~671B-parameter model, DeepSeek V3 reportedly activates only ~37B parameters per response, improving compute efficiency.

  5. The model is said to predict two tokens ahead rather than only one, leveraging training confidence to generate more efficiently.

  6. A training technique called "dual pipe" (DualPipe) is highlighted as an additional innovation used during the model-build process.

  7. The competitive edge is framed as shifting toward inference-time compute, where multiple next-token prediction threads can be evaluated, an advantage that may be harder to replicate quickly in open source.

Highlights

DeepSeek V3 is positioned as a $5 million training-cost model that can still beat or match leading closed models on common tasks like English, coding, and math.
Selective compute is central: the transcript claims the model uses about 37B parameters per response out of a ~671B total.
Multi-token generation is emphasized—predicting two tokens ahead—paired with training-driven confidence.
Open-sourcing is treated as a catalyst for replication, letting startups inspect and improve the approach rather than starting from scratch.
