
Qwen3 Next - Behind the Curtain

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Qwen 3 Next is an 80B MoE model that activates only about 3B parameters during inference, targeting faster generation without full-model compute.

Briefing

Qwen 3 Next is an 80B Mixture of Experts (MoE) model built to run with only 3B active parameters per inference—an efficiency leap that still lands it near the performance of much larger MoE setups. The release matters because it targets two bottlenecks at once: faster training (by scaling tokens without scaling compute proportionally) and faster, smarter inference (by activating a tiny slice of the model and using architectural changes to improve generation efficiency).

At the core is a sparse MoE design with 512 experts, but only about 3.7% of parameters active during inference. That’s a stark contrast to earlier Qwen 3 MoE variants mentioned in the discussion: Qwen 3 235B with 22B active parameters, and a smaller Qwen 3 30B with 3B active parameters. Qwen 3 Next keeps roughly the same active-parameter budget as the 30B model while scaling the total size to 80B, and it reportedly performs comparably to the 235B MoE configuration with 22B active parameters despite being far smaller overall and activating roughly a seventh as many parameters per step. The implication is that adding experts can improve specialization and routing quality, letting different parts of the network handle different aspects of the task without paying the full compute cost every time.
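To make the "512 experts, ~3.7% active" idea concrete, here is a minimal sketch of top-k sparse routing. The layer sizes, the value of k, and all names are illustrative assumptions, not Qwen's published design:

```python
# Minimal top-k sparse MoE routing sketch. Dimensions and k are illustrative
# assumptions; this is not Qwen 3 Next's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=512, k=10):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)
        top_w, top_i = weights.topk(self.k, dim=-1)      # keep k experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):  # loop form for clarity, not speed
            for w, i in zip(top_w[t], top_i[t]):
                out[t] += w * self.experts[int(i)](x[t])
        return out

moe = SparseMoE()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64]); only 10 of 512 experts ran per token
```

Only the selected experts' weights participate in each token's forward pass, which is why the compute per token tracks the active-parameter count rather than the 80B total.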

Several engineering choices are highlighted as likely drivers. One is a “hybrid attention” mechanism, framed as a departure from the full-attention-everywhere approach of older GPT-style designs; it reportedly mixes cheaper linear-attention layers with standard attention layers. Another is multi-token prediction, where the model generates groups of tokens rather than strictly one token at a time. That matters for speed and decoding strategy: multi-token prediction can support speculative decoding and reduce the overhead of step-by-step generation. It also aligns with broader research trends, including earlier work on multi-token pretraining objectives.
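To make the decoding angle concrete, here is a toy draft-and-verify loop in the spirit of speculative decoding. DummyModel and every method name (draft_next_k, accepts, sample_next) are hypothetical stand-ins, not part of Qwen's or OpenRouter's actual stack:

```python
# Hedged sketch of draft-and-verify decoding. A cheap drafter proposes several
# tokens; the full model verifies them, so multiple tokens can land per step.
import random

class DummyModel:
    def draft_next_k(self, ids, k):   # cheap guess of the next k tokens
        return [(ids[-1] + n) % 100 for n in range(1, k + 1)]
    def accepts(self, ids, tok):      # stand-in for a full-model check
        return random.random() < 0.7  # toy: accept ~70% of drafted tokens
    def sample_next(self, ids):       # ordinary one-token fallback
        return (ids[-1] + 1) % 100

def generate(model, prompt_ids, max_new=16, k=4):
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new:
        accepted = []
        for tok in model.draft_next_k(ids, k):  # verify drafts left to right
            if not model.accepts(ids + accepted, tok):
                break
            accepted.append(tok)
        ids += accepted or [model.sample_next(ids)]  # several tokens per step
    return ids

print(generate(DummyModel(), [1, 2, 3]))
```

The key property is that each loop iteration can append several verified tokens at once, which is what multi-token streaming output would look like from the outside.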

Training efficiency is another headline. Qwen 3 Next is trained on 15 trillion tokens derived from Qwen 3’s 36 trillion-token corpus. Even with less than a tenth of the compute cost attributed to the Qwen 3 32B dense model, it achieves strong results on certain tasks—suggesting that token-level scaling and the new training objectives/architecture combination may deliver more learning per unit compute. The discussion also hints that a larger or fully trained variant could be underway, potentially pushing benchmarks further.

Benchmark comparisons are presented in two layers: base-model results and post-training (instruction) results. The “thinking” and “instruct” versions of Qwen 3 Next are said to outperform the Qwen 3 32B dense model and also generally beat a prior “thinking” base model that had been trained on double the tokens. In practice tests via OpenRouter, the “thinking” mode appears to deliberate longer than GPT-style alternatives, while producing cleaner, simpler outputs in instruct mode. Streaming behavior also looks consistent with multi-token generation, showing multiple tokens arriving per step rather than one token at a time.

Finally, early agent/tool-use impressions are positive. The model is recommended to work with Qwen’s own agent framework and seems capable of code-and-execute style workflows involving multiple function calls. The broader takeaway is less about a single benchmark score and more about direction: open labs are sharing experiments that could pressure competitors globally, while other teams—DeepSeek, Z AI, and others—may adopt similar efficiency and decoding ideas.

Cornell Notes

Qwen 3 Next is an 80B Mixture of Experts model designed for efficient inference: only about 3B parameters (roughly 3.7% of the total) are active per token. It uses 512 experts and pairs that sparsity with architectural changes like hybrid attention and multi-token prediction, aiming to improve both speed and intelligence. Training emphasizes efficiency too, using 15T tokens from Qwen 3’s 36T corpus while reportedly spending under 10% of the compute cost of the Qwen 3 32B dense model. Reported benchmarks suggest the model’s base and post-trained “thinking” and “instruct” variants can rival much larger MoE systems and outperform prior Qwen 3 thinking variants. Practical tests via OpenRouter show longer deliberation in “thinking” mode and streaming behavior consistent with multi-token generation.

How does Qwen 3 Next achieve fast inference despite being an 80B model?

It uses a Mixture of Experts setup where only a small fraction of parameters activate per inference step. The model is described as having 80B total parameters, but just 3B are active during inference—about 3.7% active parameters. That’s paired with 512 experts, so routing can select a small subset of experts for each token rather than running the full network every time.
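The quoted percentage is just the ratio of active to total parameters. A quick check with the round numbers used in the discussion (the exact published counts may differ slightly):

```python
# Back-of-envelope check of the quoted sparsity figure.
total_params  = 80e9  # 80B total
active_params = 3e9   # ~3B active per token
print(f"{active_params / total_params:.2%}")  # -> 3.75%, close to the ~3.7% quoted
```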

Why is multi-token prediction a big deal for both speed and decoding quality?

Instead of generating one token at a time, multi-token prediction generates groups of tokens. That can reduce the overhead of step-by-step generation and supports techniques like speculative decoding, where a smaller draft model proposes tokens that the larger model verifies in parallel. In streaming tests, multiple tokens appear to arrive together rather than one token per step, which is consistent with multi-token generation behavior.

What training-efficiency claim stands out, and what data scale is involved?

The model is trained on 15 trillion tokens drawn from Qwen 3’s 36 trillion-token pretraining corpus. Despite that scale, the discussion claims the training compute cost is under 10% of the Qwen 3 32B dense model’s compute, yet performance improves on some tasks. The implication is that the combination of architecture and training setup yields more learning per unit compute.
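One way to sanity-check the under-10% claim is the common training-compute approximation FLOPs ≈ 6 · N · D (parameters times training tokens), counting only active parameters for the MoE. The 36T-token figure for the dense 32B run is an assumption carried over from the Qwen 3 corpus size, so treat this as order-of-magnitude only:

```python
# Rough training-compute comparison using FLOPs ~= 6 * N * D.
# The dense model's 36T-token count is an assumption, not a confirmed figure.
flops_next  = 6 * 3e9  * 15e12   # ~3B active params, 15T tokens
flops_dense = 6 * 32e9 * 36e12   # 32B dense params, 36T tokens (assumed)
print(f"{flops_next / flops_dense:.1%}")  # -> ~3.9%, consistent with the <10% claim
```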

How do the benchmark comparisons frame Qwen 3 Next’s competitiveness?

Base-model benchmarks are compared against other Qwen 3 MoE variants, including a larger 235B model with 22B active parameters. The question raised is whether a similarly trained larger model on the full corpus would beat it. Post-training benchmarks compare “thinking” and “instruct” variants, with Qwen 3 Next described as outperforming the Qwen 3 32B dense model and generally beating a previous “thinking” base model trained on double the tokens.

What did hands-on testing suggest about output style and streaming behavior?

In OpenRouter tests, the “thinking” mode appears to deliberate longer than a GPT-style 120B-class model, while the instruct mode produces simpler outputs. Streaming experiments using a pipe-like setup suggest multiple tokens are returned per step, aligning with the multi-token prediction claim. The tests also note that the model’s internal “thinking” may run in English even when the final answer is requested in Thai or Chinese.
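A streaming probe of the kind described above can be reproduced with OpenRouter's OpenAI-compatible API. The model slug below is an assumption (check openrouter.ai for the exact identifier), and observed chunk sizes also depend on the serving provider's streaming granularity:

```python
# Sketch of a streaming test against OpenRouter's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")
stream = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",  # assumed slug; verify on openrouter.ai
    messages=[{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    stream=True,
)
for chunk in stream:
    piece = chunk.choices[0].delta.content or ""
    # With multi-token generation, individual chunks tend to carry several
    # tokens' worth of text rather than one token each.
    print(repr(piece))
```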

How does Qwen 3 Next perform in tool use or agent-style workflows?

Early impressions are that it handles tool/function calling and codeact-like workflows where it writes code and then runs it to trigger multiple function calls. The discussion notes Qwen’s own agent framework is recommended and seems to work well, but it also flags that broader testing with other agent frameworks is still needed.
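For readers who want to try the tool-use claim themselves, here is a minimal function-calling round trip using OpenRouter's OpenAI-compatible API. The model slug and the run_python tool are assumptions for illustration; Qwen's own agent framework wraps this kind of loop, but this is not its actual API:

```python
# Hedged sketch of a single tool-call round trip. The tool result is faked;
# a real agent would actually execute the code in a sandbox.
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")
MODEL = "qwen/qwen3-next-80b-a3b-instruct"  # assumed slug; verify on openrouter.ai

tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool, not part of any real framework
        "description": "Execute a Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 17**2 + 5? Use the tool."}]
resp = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:  # the model may also answer directly instead of calling the tool
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    # Fake the tool result (17**2 + 5 = 294) and let the model finish the answer.
    messages += [msg, {"role": "tool", "tool_call_id": call.id, "content": "294"}]
    final = client.chat.completions.create(model=MODEL, messages=messages)
    print(final.choices[0].message.content)
```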

Review Questions

  1. What does “3B active parameters” mean in practice, and how does it relate to the number of experts (512) in Qwen 3 Next?
  2. How would multi-token prediction change the way you expect streaming tokens to appear compared with single-token generation?
  3. Why might training on 15T tokens with lower compute still outperform models trained with more data or compute, according to the discussion?

Key Points

  1. Qwen 3 Next is an 80B MoE model that activates only about 3B parameters during inference, targeting faster generation without full-model compute.
  2. The model routes through 512 experts, and the sparsity level is described as roughly 3.7% active parameters per inference step.
  3. Hybrid attention and multi-token prediction are presented as key architectural changes aimed at improving inference efficiency and generation behavior.
  4. Training uses 15T tokens from Qwen 3’s 36T corpus, with claims of under 10% of the compute cost of the Qwen 3 32B dense model while still improving task performance.
  5. Reported benchmarks suggest strong base and post-training results, with “thinking” and “instruct” variants outperforming prior Qwen 3 dense and earlier thinking setups.
  6. Hands-on tests via OpenRouter indicate longer “thinking” in the thinking mode, simpler outputs in instruct mode, and streaming patterns consistent with multi-token generation.
  7. Early agent/tool-use impressions are positive, especially when using Qwen’s recommended agent framework, though comparisons with other frameworks are still pending.

Highlights

Qwen 3 Next keeps roughly the same active-parameter budget as a much smaller Qwen 3 30B model (3B active) while scaling the overall model to 80B.
Multi-token prediction appears to show up in streaming tests as multiple tokens arriving per step, not one token at a time.
Training on 15T tokens from a 36T corpus is paired with a claim of far lower compute than a dense 32B baseline, yet performance remains competitive.
In “thinking” mode, internal deliberation seems to run in English even when the requested final language is Thai or Chinese.
Tool/agent behavior looks promising for code-and-execute workflows, particularly with Qwen’s own agent framework.
