
Mistral 8x7B Part 1- So What is a Mixture of Experts Model?

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Mistral’s “8x7B” MoE model uses eight experts, each roughly Mistral 7B scale, combined via a gating mechanism that routes inputs to expert(s).

Briefing

Mistral’s newly released “8x7B” model is a Mixture of Experts (MoE) system: eight separate expert networks, each roughly the size of Mistral 7B, are combined under a gating mechanism that routes each input to the most relevant expert(s). The practical payoff is efficiency at scale—MoE architectures can activate only a subset of parameters per token—while enabling models that are effectively much larger than a single dense network. The catch is that MoE inference tends to be slower and more hardware-dependent, making local use difficult for most people.
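The "effectively larger" claim comes down to simple parameter arithmetic. The sketch below assumes eight ~7B experts with top-2 routing and ignores shared (non-expert) parameters such as attention and embeddings; the figures are illustrative, not official Mixtral numbers:

```python
# Back-of-envelope arithmetic; 7B-per-expert and top-2 routing are
# assumptions for illustration, and shared parameters are ignored.
expert_params = 7e9   # parameters per expert (~Mistral 7B scale)
num_experts = 8
top_k = 2             # experts activated per token

total_params = num_experts * expert_params   # capacity stored in memory: ~56B
active_params = top_k * expert_params        # compute touched per token: ~14B

print(f"stored: ~{total_params / 1e9:.0f}B parameters")
print(f"active per token: ~{active_params / 1e9:.0f}B parameters")
```

So the model has the capacity of a ~56B-parameter network while each token pays roughly the compute cost of a ~14B one; all 56B must still sit in memory, which is why local inference stays hardware-heavy.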

At the core of MoE is a gating layer that decides which “expert” should handle a given prompt. Instead of sending every token through one monolithic transformer decoder, the model uses multiple experts (in this case, eight) and activates only the chosen one(s). Training therefore has two moving parts: the experts themselves must learn specialized skills, and the gating network must learn how to map inputs to the right expert. Some MoE variants can route to multiple experts and then combine their outputs, but the key theme remains specialization plus selective computation.
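A minimal sketch of that routing step, with toy callables standing in for full expert networks and a plain linear layer as the gate (all names and shapes here are illustrative assumptions, not Mistral's implementation):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, top_k=2):
    """Route input vector x to the top_k experts picked by the gate.

    experts:      callables mapping a vector to a vector (stand-ins
                  for full expert networks)
    gate_weights: one weight vector per expert for the gate's scores
    """
    # 1. Gating layer: a linear score per expert, then softmax.
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in gate_weights]
    probs = softmax(scores)

    # 2. Keep only the top_k experts; the rest are never run.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    # 3. Combine the selected experts' outputs, weighted by their
    #    renormalized gate probabilities.
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        weight = probs[i] / norm
        out = [o + weight * y_j for o, y_j in zip(out, y)]
    return out
```

With `top_k=1` this collapses to hard routing (one expert per token); the sparsely-gated MoE literature typically uses top-1 or top-2.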

Interest in MoE surged again because of earlier rumors about GPT-4’s architecture. One widely discussed claim was that GPT-4 used an MoE design with eight experts, each on the order of 220B parameters, plus a gating mechanism. That kind of setup would make it easier to dedicate experts to distinct capabilities—function calling is the most commonly cited example—while other experts handle complementary tasks. Similar ideas have also been speculated for Gemini, reflecting how MoE can be framed as a way to modularize model behavior.

MoE is not new. The transcript points to early work from 2014, including papers with authors such as Ilya Sutskever, and later “sparsely-gated” approaches associated with Noam Shazeer (now CEO and co-founder of Character.AI) and collaborators including Geoffrey Hinton and Jeff Dean. Switch Transformers, another milestone, pushed training toward trillion-parameter territory using MoE-style routing, building on T5-like architectures. More recently, open-source efforts such as OpenMoE have tackled the engineering and compute hurdles required to train and run MoE models, including reliance on large-scale infrastructure.

For hands-on testing, the transcript notes several ways to try Mistral’s MoE variant: community uploads on Hugging Face (including a “Mistral-7B-8Expert” version), GPTQ weight releases from TheBloke, and online tooling such as Replicate and the Vercel AI SDK. A key warning is that the MoE model is described as a base model rather than instruction-tuned, so prompts may need adjustment. Benchmarks reported on the model card include MMLU and GSM8K results, but the scores are still far from GPT-4-level performance, and inference speed remains a limitation.

Overall, the release lands in a broader trend: using MoE to scale model capacity, then potentially distilling those larger systems into smaller, faster models later. That distillation path is presented as a likely next step for making MoE benefits more accessible.

Cornell Notes

Mistral’s “8x7B” release is a Mixture of Experts model that combines eight experts under a gating network. Each expert is about the size of Mistral 7B, and the gating layer routes each input token to the most relevant expert(s), so only a subset of parameters may be activated per step. Training must optimize both the experts (specialization) and the gating mechanism (routing accuracy). MoE has resurfaced in mainstream attention due to rumors about GPT-4 and related architectural speculation for other frontier models. The transcript also emphasizes that MoE inference can be slow and hardware-heavy, but it enables larger effective model capacity and may later feed into distillation to smaller, more practical models.

How does a Mixture of Experts model differ from a standard transformer at inference time?

A standard model runs a single forward pass through one transformer decoder stack for each token. In an MoE setup, multiple expert networks exist in parallel, and a gating layer decides which expert(s) should process the input. Only the selected expert(s) contribute to the output, while the other experts remain inactive for that token, enabling selective computation rather than always using all parameters.

What exactly must be learned during MoE training?

Training optimizes two components: (1) the experts, which learn to specialize in different subsets of tasks, and (2) the gating function, which learns to predict which expert should handle a given input. The gating layer’s routing decisions determine which expert outputs are used, so routing quality is as important as expert quality.
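One concrete complication in training the router: if left alone, the gate can collapse onto a few favorite experts. Sparse MoE systems such as Switch Transformers add an auxiliary load-balancing loss to discourage this. A minimal sketch of that style of loss, assuming hard top-1 assignments (the function name and exact form here are an illustrative reconstruction, not the paper's code):

```python
def load_balancing_loss(gate_probs, assignments, num_experts):
    """Auxiliary loss in the style of Switch Transformers: penalize
    routers that send a disproportionate share of tokens to few experts.

    gate_probs:  per-token softmax probabilities over the experts
    assignments: per-token index of the expert actually chosen (top-1)
    """
    n = len(assignments)
    # f_i: fraction of tokens routed to expert i
    f = [assignments.count(i) / n for i in range(num_experts)]
    # P_i: mean gate probability assigned to expert i
    P = [sum(p[i] for p in gate_probs) / n for i in range(num_experts)]
    # Scaled so the minimum (perfectly uniform routing) equals 1.0
    return num_experts * sum(f_i * P_i for f_i, P_i in zip(f, P))
```

Uniform routing yields the minimum value of 1.0; routing every token to one expert yields `num_experts`, so the optimizer is pushed toward balanced expert usage alongside the main language-modeling loss.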

Why did MoE become a hot topic again in recent frontier-model discussions?

MoE attention spiked due to rumors that GPT-4 used an MoE architecture with eight experts and a gating mechanism. The idea is that experts could be specialized—for example, one expert could be tuned for function calling while others handle other capabilities. Similar architectural possibilities have been discussed for Gemini, reflecting how MoE can be framed as modular capability routing.

What historical research milestones are cited as foundations for today’s MoE systems?

The transcript points to early MoE work from 2014 (with Ilya Sutskever among the authors) and later “sparsely-gated” MoE research associated with Noam Shazeer (now CEO/co-founder of Character.AI) and collaborators including Geoffrey Hinton and Jeff Dean. It also highlights Switch Transformers (early 2021, with later revisions) as an important step toward training models beyond a trillion parameters using MoE-style routing built on T5-like ideas.

What practical constraints come with running Mistral’s MoE model, and how can users try it anyway?

MoE inference is described as slow and not well-suited to typical local setups because it generally requires substantial hardware (the transcript suggests at least 2×80GB A100s or 4×40GB A100s for local/server runs). For experimentation, users can try community-hosted versions on Hugging Face, use GPTQ-quantized weights from TheBloke, or run via hosted services like Replicate and Vercel AI SDK. The MoE model is also flagged as a base model rather than instruction-tuned, so prompts may need careful design.
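Those hardware figures line up with simple weight-memory arithmetic. The sketch below assumes ~56B total parameters (8 × ~7B) and counts only weights, ignoring activations and the KV cache, so it is a lower bound rather than a precise requirement:

```python
# Weight-memory estimate only; activations and KV cache are ignored,
# and the 56B total is an assumed figure (8 experts x ~7B each).
total_params = 56e9
fp16_gb = total_params * 2 / 1e9    # 2 bytes per param in fp16
int4_gb = total_params * 0.5 / 1e9  # ~0.5 bytes per param with 4-bit GPTQ

# ~112 GB of fp16 weights explains the 2x80GB / 4x40GB A100 guidance,
# while ~28 GB of 4-bit weights approaches single-large-GPU territory.
print(f"fp16: ~{fp16_gb:.0f} GB, 4-bit: ~{int4_gb:.0f} GB")
```

This is why quantized GPTQ releases matter in practice: they shrink the memory footprint by roughly 4x relative to fp16, at some cost in quality.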

Review Questions

  1. What role does the gating network play in determining which expert(s) process an input token, and why does that matter for efficiency?
  2. Why does MoE training require learning both expert parameters and routing behavior, rather than only training the experts?
  3. Based on the transcript, what are the main reasons MoE models can be difficult to run locally, and what workaround options are suggested?

Key Points

  1. Mistral’s “8x7B” MoE model uses eight experts, each roughly Mistral 7B scale, combined via a gating mechanism that routes inputs to expert(s).
  2. MoE efficiency comes from selective activation: only the chosen expert(s) contribute to each token’s output rather than using all parameters every time.
  3. MoE training optimizes two systems simultaneously: expert specialization and gating accuracy for correct routing decisions.
  4. MoE’s renewed popularity is tied to architectural rumors about GPT-4 and related speculation about other frontier models, including capability specialization like function calling.
  5. MoE is grounded in earlier research (2014 MoE work, sparsely-gated MoE, and Switch Transformers), and modern open-source projects like OpenMoE have tackled training/infrastructure challenges.
  6. Running MoE locally is typically hardware-intensive and slower; hosted inference and quantized/community implementations are practical alternatives.
  7. The transcript suggests a likely next trend: distilling larger MoE systems into smaller models to improve speed and accessibility.

Highlights

MoE routes each input token through a gating layer to activate only the most relevant expert(s), turning “many experts” into “selective computation.”
The transcript links today’s MoE wave to earlier milestones like sparsely-gated MoE and Switch Transformers, showing the idea has been evolving for over a decade.
Mistral’s MoE release is treated as a base model (not instruction-tuned), so prompt design matters when testing it.
Even with strong reported benchmark numbers on the model card, inference speed and local deployability remain major friction points.
