Qwen 3.5 - The next NEXT model

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Qwen 3.5 is a 397B mixture-of-experts model with 17B active parameters, aiming to deliver high capability without activating the full parameter set each token.

Briefing

Qwen 3.5 lands as a major shift in how fast capable AI can be, pairing a large mixture-of-experts model with a reported decoding speed boost of up to 19x while matching or beating much larger competitors. The headline numbers are striking: Qwen 3.5 is a 397B-parameter model with only 17B active at a time, and it is built to decode dramatically faster than Qwen 3 Max thinking at very long contexts (up to 256k). That combination matters because it targets the usual tradeoff between intelligence and latency, making it more plausible to deploy high-end reasoning without the cost of running a full trillion-parameter system.

Under the hood, the model continues the Qwen 3 “Next” direction: it uses a mixture-of-experts design with a larger expert pool than earlier releases. Where Qwen 3 had 128 experts, Qwen 3.5 uses 512 experts, continuing the trend toward more specialization. The practical implication is that the model can scale capacity without activating everything for every token—though the raw size still makes local use demanding. Even with quantization, the transcript estimates roughly 256GB of RAM (possibly 512GB) for comfortable runs, pushing most people toward server-based deployment or GPU clusters.
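
To see why even quantized local runs are demanding, here is a back-of-the-envelope sketch (my own arithmetic, not from the video): in a mixture-of-experts model, every expert must be resident in memory even though only 17B parameters are active per token.

```python
# Back-of-the-envelope memory estimate for serving a 397B-parameter MoE model.
# Assumptions (mine, not from the video): weights dominate memory, quantization
# stores each parameter at the listed bit width, and KV cache/activations are
# ignored. All experts must be loaded even though only ~17B are active per token.

TOTAL_PARAMS = 397e9

for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = TOTAL_PARAMS * bits / 8 / 2**30
    print(f"{name}: ~{gib:,.0f} GiB for weights alone")

# FP16/BF16: ~740 GiB, INT8: ~370 GiB, INT4: ~185 GiB.
# Even at 4-bit, weights approach 200 GiB before KV cache and runtime overhead,
# which is consistent with the transcript's 256GB (possibly 512GB) estimate.
```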

Performance comparisons lean toward “better than expected” rather than “bigger is better.” Without dwelling on benchmark methodology, the transcript says Qwen 3.5 already beats Qwen 3 Max thinking (which the Qwen team described as greater than a trillion parameters) and is competitive with models such as “Gemini 3 Pro,” “Claude Opus 4.5,” and “GID 5.2” (as named in the transcript). Vision results are a second pillar of the pitch. Instead of bolting an image encoder onto a language model, Qwen 3.5 is multimodal from scratch—trained on both text and images—aiming to improve visual question answering and other image-grounded tasks.

Several training and architecture changes are presented as the engine behind the speed and capability gains. The architecture builds on Qwen 3 Next with an attention system designed to reduce RAM pressure at large context lengths. Decoding speed improves further through a move from single-token autoregressive prediction to multi-token prediction, a technique associated with faster learning during pre-training and strong results in proprietary systems. Multilingual coverage also expands sharply: the transcript cites growth from 119 languages to over 200 languages and dialects, alongside a larger tokenizer vocabulary of 250K tokens—positioned as more efficient for non-English languages than smaller tokenizers.
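
As a rough illustration of the multi-token prediction idea, here is a toy PyTorch sketch. It uses the common "extra prediction heads" formulation; the transcript does not describe Qwen 3.5's exact design, so all names and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy sketch of multi-token prediction (MTP): instead of one head predicting
# only the next token, k heads each predict a different future offset.
# Sizes are toy values; this is not Qwen 3.5's actual implementation.

class MTPHead(nn.Module):
    def __init__(self, d_model=64, vocab=1000, n_future=4):
        super().__init__()
        # Head i predicts token t+1+i from the shared trunk representation.
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

    def forward(self, hidden):                    # hidden: (batch, seq, d_model)
        # Returns (n_future, batch, seq, vocab): logits for t+1, t+2, ...
        return torch.stack([head(hidden) for head in self.heads])

hidden = torch.randn(2, 16, 64)                   # stand-in for trunk outputs
logits = MTPHead()(hidden)
print(logits.shape)                               # torch.Size([4, 2, 16, 1000])
# At inference, the extra heads can draft several tokens per forward pass,
# which is one way MTP translates into faster decoding.
```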

Finally, the model’s reasoning push is linked to reinforcement learning (RL) at scale. The transcript notes Qwen’s RL training environments are capped around 15,000 for this model, while another lab (MiniMax) has claimed hundreds of thousands, raising the question of whether those environments are truly unique or mostly variations. Qwen Chat provides access to try Qwen 3.5+ with a “full million token context window,” including thinking/fast/auto modes and demos that involve web search and tool-like behavior. Looking ahead, the transcript anticipates distilled and smaller Qwen 3.5 variants rolling out over the coming weeks, plus continued competition on inference cost and quality depending on which provider serves the model and with what configuration.
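
For programmatic access rather than the Qwen Chat UI, Alibaba's DashScope service exposes an OpenAI-compatible endpoint; the sketch below assumes that route, and the model id is a placeholder since the video does not confirm the served name.

```python
from openai import OpenAI

# Hypothetical sketch: calling a Qwen model through DashScope's
# OpenAI-compatible endpoint. The model id below is a PLACEHOLDER, not
# confirmed by the video; check the provider's model list for the real one.
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.5-plus",  # placeholder identifier
    messages=[{"role": "user", "content": "Summarize multi-token prediction."}],
)
print(resp.choices[0].message.content)
```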

Cornell Notes

Qwen 3.5 is a 397B mixture-of-experts model with only 17B active parameters, designed to deliver high capability without the usual latency and compute burden. The key deployment-facing claim is speed: up to 19x faster decoding than Qwen 3 Max thinking at 256k context, and even 7.2x faster than the smaller Qwen 3 235B model. It also shifts toward multimodal training “from scratch,” pairing text and images rather than bolting on an encoder, which boosts vision question answering. Additional upgrades include multi-token prediction, a larger 250K tokenizer vocabulary, and expanded multilingual coverage (over 200 languages). Qwen Chat offers a Qwen 3.5+ setup for up to a million-token context window, with thinking/fast/auto modes for interactive testing.

What makes Qwen 3.5 different from earlier “bigger model” releases?

It combines a large total parameter count (397B) with a mixture-of-experts design that activates only 17B parameters per token. That lets the model scale capacity while keeping per-token compute lower than a dense 397B system. The transcript also highlights a major speed claim: at 256k decoding, Qwen 3.5 is reported to be up to 19x faster than Qwen 3 Max thinking, and 7.2x faster than the much smaller Qwen 3 235B model.

How does the mixture-of-experts setup change from Qwen 3 to Qwen 3.5?

The number of experts increases substantially. Qwen 3 is described as having 128 experts, while Qwen 3.5 uses 512 experts. The transcript notes the architectures differ slightly, but the direction is consistent: more experts than prior Qwen 3 mixture-of-experts models, continuing the “Next” approach.
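
To make the routing idea concrete, here is a toy top-k router sketch. The expert count matches the transcript, but k and all sizes are my assumptions; the video does not give Qwen 3.5's routing details.

```python
import torch
import torch.nn as nn

# Toy sketch of top-k expert routing, the mechanism behind "512 experts but
# only ~17B active parameters per token". k and dimensions are illustrative.

class TopKRouter(nn.Module):
    def __init__(self, d_model=64, n_experts=512, k=8):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # scores every expert
        self.k = k

    def forward(self, x):                            # x: (tokens, d_model)
        scores = self.gate(x)
        weights, idx = scores.topk(self.k, dim=-1)   # keep only k experts/token
        weights = weights.softmax(dim=-1)            # mixing weights for those k
        return weights, idx                          # only these experts run

router = TopKRouter()
w, idx = router(torch.randn(4, 64))
print(idx)   # each of the 4 tokens routes to just 8 of 512 experts
# Growing the pool from 128 to 512 experts adds capacity (more specialists)
# without increasing k, so per-token compute stays roughly flat.
```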

Why does multimodal training “from scratch” matter for vision tasks?

Instead of training a language model and then adding an image encoder, Qwen 3.5 is trained from scratch on both text and images. The transcript links this to better visual question answering and other image-related tasks, and it positions Qwen 3.5 as competitive with strong multimodal systems (notably Gemini 3) while surpassing some models described as weaker on multimodal performance (e.g., Claude Opus models).

What architectural and training changes are tied to the speed boost?

Two changes are emphasized: an attention system inherited from the Qwen 3 Next lineage that reduces RAM needs at very large context lengths, and a shift from single-token autoregressive prediction to multi-token prediction. The transcript connects multi-token prediction to faster learning during pre-training, which is also associated with strong results in proprietary systems.
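
A quick arithmetic sketch (mine, with made-up architecture numbers) shows why reducing attention memory matters at these context lengths:

```python
# Rough KV-cache arithmetic showing why long-context attention is a memory
# problem. All architecture numbers below are illustrative stand-ins, not
# Qwen 3.5's real configuration.

def kv_cache_gib(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for keys and values, cached at every layer for every position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 2**30

for ctx in (8_192, 65_536, 262_144):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx):.1f} GiB KV cache per sequence")
# The cache grows linearly with context, so a 256k window costs 32x an 8k one;
# attention designs that shrink or compress this cache are the usual remedy.
```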

How does Qwen 3.5 expand multilingual support, and why is tokenizer size relevant?

Multilingual coverage rises from 119 languages to over 200 languages and dialects. The tokenizer vocabulary expands to 250K tokens, which the transcript frames as more efficient for languages outside English/Chinese/Western European scripts than smaller tokenizers (like 32K). The claim is that larger vocabularies reduce inefficiency when tokenizing diverse languages.
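
One way to see why vocabulary size matters is to measure tokens per character ("fertility") across languages. The sketch below uses an existing public Qwen tokenizer as a stand-in, since the 250K Qwen 3.5 tokenizer is not assumed to be available.

```python
from transformers import AutoTokenizer

# Sketch of measuring tokenizer fertility (tokens per character) across
# languages. The model id is an existing Qwen release used as a stand-in;
# the 250K-vocab Qwen 3.5 tokenizer is not assumed to be public.

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Hindi":   "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
    "Swahili": "Mbweha mwepesi anaruka juu ya mbwa mvivu.",
}
for lang, text in samples.items():
    n = len(tok(text)["input_ids"])
    print(f"{lang}: {n} tokens, {n / len(text):.2f} tokens/char")
# Larger vocabularies tend to lower tokens/char for underrepresented scripts,
# meaning fewer tokens (and lower cost and latency) per sentence.
```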

What does the transcript suggest about reasoning improvements and RL training scale?

Reasoning gains are attributed to reinforcement learning in training environments. Qwen’s RL environments are described as maxing out around 15,000 for this model. The transcript contrasts this with MiniMax’s claim of hundreds of thousands of environments and raises a key uncertainty: those may be variations rather than truly unique environments. A forthcoming technical report is expected to provide more detail.

Review Questions

  1. How do mixture-of-experts activation (17B active out of 397B total) and the number of experts (512) jointly affect deployment tradeoffs?
  2. Which two changes are credited with Qwen 3.5’s decoding speed gains at 256k context, and how do they relate to memory usage and prediction style?
  3. Why might a larger tokenizer vocabulary (250K) improve multilingual performance compared with a smaller tokenizer (e.g., 32K) for non-Western languages?

Key Points

  1. Qwen 3.5 is a 397B mixture-of-experts model with 17B active parameters, aiming to deliver high capability without activating the full parameter set each token.

  2. The transcript reports up to a 19x decoding speed boost at 256k context versus Qwen 3 Max thinking, and 7.2x faster versus the smaller Qwen 3 235B model.

  3. Qwen 3.5 uses 512 experts (up from 128 in Qwen 3), continuing the trend toward more specialized expert routing.

  4. Vision performance is positioned as a core strength because the model is multimodal from scratch (trained on text and images), not via a bolted-on encoder.

  5. Speed and long-context efficiency are linked to a Qwen 3 Next attention system plus a shift from single-token autoregressive prediction to multi-token prediction.

  6. Multilingual coverage expands to over 200 languages and dialects, supported by a 250K-token vocabulary designed to reduce tokenization inefficiency.

  7. Qwen Chat provides Qwen 3.5+ with a million-token context window and tool-like demos, but real-world results can depend heavily on how providers serve the model and configuration quality.

Highlights

Qwen 3.5 pairs a 397B total parameter count with only 17B active parameters, using 512 experts to scale capability without fully activating the model each step.
At 256k decoding, the transcript claims Qwen 3.5 is up to 19x faster than Qwen 3 Max thinking—an explicit attempt to break the intelligence/latency tradeoff.
Multimodal training “from scratch” (text + images) replaces the older pattern of bolting on an image encoder, targeting stronger visual question answering.
Multi-token prediction and a long-context attention design are presented as the main technical levers behind the speed gains.
Qwen Chat’s Qwen 3.5+ mode supports a million-token context window, making interactive testing practical even without local hardware.
