Qwen 3.5 - The next NEXT model
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Qwen 3.5 is a 397B mixture-of-experts model with 17B active parameters, aiming to deliver high capability without activating the full parameter set each token.
Briefing
Qwen 3.5 lands as a major shift in how fast capable AI can be: it pairs a large mixture-of-experts model with a reported decoding speed boost of up to 19x while matching or beating much larger competitors. The headline numbers are striking: Qwen 3.5 is a 397B-parameter model with only 17B parameters active at a time, and it is built to decode dramatically faster than Qwen 3 Max thinking at very long contexts (up to 256k). That combination matters because it attacks the usual tradeoff between intelligence and latency, making it more plausible to deploy high-end reasoning without the cost of running a full trillion-parameter system.
Under the hood, the model continues the Qwen 3 “Next” direction: it uses a mixture-of-experts design with a larger expert pool than earlier releases. Where Qwen 3 had 128 experts, Qwen 3.5 uses 512 experts, continuing the trend toward more specialization. The practical implication is that the model can scale capacity without activating everything for every token—though the raw size still makes local use demanding. Even with quantization, the transcript estimates roughly 256GB of RAM (possibly 512GB) for comfortable runs, pushing most people toward server-based deployment or GPU clusters.
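The RAM estimate is straightforward arithmetic over the parameter count; a minimal sketch, where the overhead factor for KV cache and runtime buffers is an illustrative assumption, not a number from the video:

```python
def weight_memory_gb(total_params: float, bits_per_param: int, overhead: float = 1.25) -> float:
    """Approximate RAM to hold the weights at a given quantization width.
    `overhead` pads for KV cache and runtime buffers (assumed, not measured)."""
    return total_params * bits_per_param / 8 * overhead / 1e9

# 397B parameters at common quantization widths
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(397e9, bits):,.0f} GB")
```

Under these assumptions, a 4-bit quantization lands near the transcript's 256GB figure, and an 8-bit run is close to the 512GB one.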
Performance comparisons lean toward “better than expected” rather than “bigger is better.” Without dwelling on benchmark methodology, the transcript says Qwen 3.5 already beats Qwen 3 Max thinking (which the Qwen team described as greater than a trillion parameters) and is competitive with models such as “Gemini 3 Pro,” “Claude Opus 4.5,” and “GID 5.2” (as named in the transcript). Vision results are a second pillar of the pitch. Instead of bolting an image encoder onto a language model, Qwen 3.5 is multimodal from scratch—trained on both text and images—aiming to improve visual question answering and other image-grounded tasks.
Several training and architecture changes are presented as the engine behind the speed and capability gains. The architecture builds on Qwen 3 Next with an attention system designed to reduce RAM pressure at large context lengths. Decoding speed improves further through a move from single-token autoregressive prediction to multi-token prediction, a technique associated with faster learning during pre-training and strong results in proprietary systems. Multilingual coverage also expands sharply: the transcript cites growth from 119 languages to over 200 languages and dialects, alongside a larger tokenizer vocabulary of 250K tokens—positioned as more efficient for non-English languages than smaller tokenizers.
Finally, the model’s reasoning push is linked to reinforcement learning (RL) at scale. The transcript notes that Qwen’s RL training environments number around 15,000 for this model, while another lab (MiniMax) has claimed hundreds of thousands, raising the question of whether those environments are truly unique or mostly variations. Qwen Chat provides access to try Qwen 3.5+ with a “full million token context window,” including thinking/fast/auto modes and demos involving web search and tool-like behavior. Looking ahead, the transcript anticipates distilled and smaller Qwen 3.5 variants rolling out over the coming weeks, plus continued competition on inference cost and quality depending on which provider serves the model and with what configuration.
Cornell Notes
Qwen 3.5 is a 397B mixture-of-experts model with only 17B active parameters, designed to deliver high capability without the usual latency and compute burden. The key deployment-facing claim is speed: up to 19x faster decoding than Qwen 3 Max thinking at 256k context, and 7.2x faster than the smaller Qwen3-235B model. It also shifts toward multimodal training “from scratch,” pairing text and images rather than bolting on an encoder, which boosts vision question answering. Additional upgrades include multi-token prediction, a larger 250K tokenizer vocabulary, and expanded multilingual coverage (over 200 languages). Qwen Chat offers a Qwen 3.5+ setup for up to a million-token context window, with thinking/fast/auto modes for interactive testing.
What makes Qwen 3.5 different from earlier “bigger model” releases?
How does the mixture-of-experts setup change from Qwen 3 to Qwen 3.5?
Why does multimodal training “from scratch” matter for vision tasks?
What architectural and training changes are tied to the speed boost?
How does Qwen 3.5 expand multilingual support, and why is tokenizer size relevant?
What does the transcript suggest about reasoning improvements and RL training scale?
Review Questions
- How do mixture-of-experts activation (17B active out of 397B total) and the number of experts (512) jointly affect deployment tradeoffs?
- Which two changes are credited with Qwen 3.5’s decoding speed gains at 256k context, and how do they relate to memory usage and prediction style?
- Why might a larger tokenizer vocabulary (250K) improve multilingual performance compared with a smaller tokenizer (e.g., 32K) for non-Western languages?
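For the first question above, a top-k gate over a large expert pool can be sketched in a few lines. The 512-expert pool size matches the video; the top-k of 8 and the random gate scores are illustrative assumptions:

```python
import random

NUM_EXPERTS = 512  # expert pool size reported for Qwen 3.5
TOP_K = 8          # experts activated per token (assumed for illustration)

def route(gate_scores):
    """Return the indices of the top-k experts by gate score; only these run
    for this token, which is how a 397B-total model can cost only a fraction
    of its parameters (17B active, per the video) per token."""
    ranked = sorted(range(len(gate_scores)), key=gate_scores.__getitem__, reverse=True)
    return ranked[:TOP_K]

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]
print(route(scores))  # the handful of experts that actually compute
```

The deployment tradeoff follows: per-token compute tracks the active experts, but all 512 experts' weights must still sit in memory, which is why RAM, not FLOPs, dominates local-use cost.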
Key Points
1. Qwen 3.5 is a 397B mixture-of-experts model with 17B active parameters, aiming to deliver high capability without activating the full parameter set each token.
2. The transcript reports up to a 19x decoding speed boost at 256k context versus Qwen 3 Max thinking, and 7.2x faster versus the smaller Qwen3-235B model.
3. Qwen 3.5 uses 512 experts (up from 128 in Qwen 3), continuing the trend toward more specialized expert routing.
4. Vision performance is positioned as a core strength because the model is multimodal from scratch (trained on text and images), not via a bolted-on encoder.
5. Speed and long-context efficiency are linked to the Qwen 3 Next attention system plus a shift from single-token autoregressive prediction to multi-token prediction.
6. Multilingual coverage expands to over 200 languages and dialects, supported by a 250K-token vocabulary designed to reduce tokenization inefficiency.
7. Qwen Chat provides Qwen 3.5+ with a million-token context window and tool-like demos, but real-world results can depend heavily on which provider serves the model and how it is configured.