Grok-1 Open Source: 314B Mixture-of-Experts Model by xAI | Blog post, GitHub/Source Code
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
xAI open-sourced Grok-1 with both weights and architecture, under Apache License 2.0.
Briefing
xAI has open-sourced Grok-1, a 314B-parameter mixture-of-experts (MoE) model, releasing not only weights but also the model architecture and training checkpoint details. The release matters because it gives researchers and developers a full path to reproduce and study how a very large MoE system is built, while also raising practical barriers: the published weights are about 320 GB and require substantial GPU memory to run.
The Grok-1 release page says the model was trained from scratch using xAI's own data, with pre-training concluding in October 2023. It is described as a base (pre-trained) checkpoint rather than a chat-tuned model, implying additional fine-tuning would be needed for conversational use on xAI's platforms. A key architectural detail is conditional computation: only a fraction of parameters activate per token. The documentation indicates that 25% of weights are active on a given token in the MoE routing setup, and later repository details specify two active experts out of eight total, yielding roughly 86B active parameters per token.
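To make the routing idea concrete, below is a minimal sketch of top-2 expert selection in JAX (the repository's framework). The toy sizes and parameter names are illustrative assumptions, not Grok-1's actual code, and real MoE systems dispatch tokens to experts sparsely rather than running every expert densely as done here for clarity.

```python
# Minimal top-2 MoE routing sketch in JAX. Toy sizes; not Grok-1's code.
import jax
import jax.numpy as jnp

NUM_EXPERTS = 8    # experts per MoE layer, as reported for Grok-1
NUM_SELECTED = 2   # experts activated per token
HIDDEN = 16        # toy hidden size for illustration

def moe_layer(params, x):
    """x: [tokens, HIDDEN]. Mixes the outputs of each token's top-2 experts."""
    logits = x @ params["router"]                       # [tokens, NUM_EXPERTS]
    weights, idx = jax.lax.top_k(logits, NUM_SELECTED)  # top-2 scores and ids
    weights = jax.nn.softmax(weights, axis=-1)          # normalize the 2 picks
    # For clarity, run all experts densely; sparse dispatch is the real saving.
    expert_out = jnp.einsum("th,ehd->etd", x, params["experts"])
    picked = jnp.take_along_axis(
        expert_out.transpose(1, 0, 2),                  # [tokens, E, HIDDEN]
        idx[..., None], axis=1)                         # [tokens, 2, HIDDEN]
    return jnp.sum(weights[..., None] * picked, axis=1)

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = {
    "router": jax.random.normal(k1, (HIDDEN, NUM_EXPERTS)),
    "experts": jax.random.normal(k2, (NUM_EXPERTS, HIDDEN, HIDDEN)),
}
tokens = jax.random.normal(k3, (4, HIDDEN))
print(moe_layer(params, tokens).shape)  # (4, 16)
```

In a sparse implementation, each token only pays the compute of the two experts the router picks, which is where the per-token savings described above come from.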
Grok-1 is positioned as an MoE system in the same family as other widely discussed sparse models, where a router selects which experts handle each token. The transcript also notes speculation that other major model families (including Google's Gemini models) may use MoE-style designs, though Grok-1's release provides concrete specifics. The model is licensed under Apache License 2.0, meaning downstream use, modification, and redistribution are broadly permitted.
On the implementation side, the open-source code (the grok-1 repository under xAI's xai-org GitHub organization) is JAX-based with CUDA-related requirements, plus SentencePiece for tokenization. The repository contains example code and a checkpoint runner intended to validate and run the model, but it warns that the MoE model's sheer size makes local testing difficult. The author estimates that running the example likely needs more than 50 GB of VRAM.
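As a small illustration of the tokenization setup, this sketch loads a SentencePiece model and round-trips a string. The ./tokenizer.model path is an assumption about the checkout layout; adjust it to wherever the tokenizer file lives in your copy.

```python
# Hedged sketch: load a SentencePiece tokenizer and round-trip a string.
# The ./tokenizer.model path is an assumption about the checkout layout.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="./tokenizer.model")
ids = sp.encode("Hello from Grok-1", out_type=int)  # text -> token ids
print(len(ids), ids[:8])
print(sp.decode(ids))  # token ids -> original text
```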
The code inspection highlights several technical choices. The main model definition is largely self-contained in a single large model.py file, with weights stored in 8-bit quantized form to reduce memory footprint. The implementation includes a router mechanism typical of MoE layers, and the Transformer configuration uses rotary embeddings for positional encoding. The attention stack follows a multi-head Transformer pattern, with the engineering aimed at correctness rather than custom kernels inside the repository.
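For reference on the positional-encoding choice, rotary embeddings rotate each query/key feature pair by a position-dependent angle, so relative offsets show up directly in attention dot products. The sketch below is a generic RoPE formulation with toy sizes, not Grok-1's exact implementation.

```python
# Generic rotary position embedding (RoPE) sketch; not Grok-1's exact code.
import jax.numpy as jnp

def rope(x, base=10000.0):
    """x: [seq, head_dim] with even head_dim. Rotates feature pairs."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-jnp.arange(half) / half)          # one frequency per pair
    angles = jnp.arange(seq)[:, None] * freqs[None, :]  # [seq, half]
    cos, sin = jnp.cos(angles), jnp.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1, x2) feature pair.
    return jnp.concatenate([x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos], axis=-1)

q = jnp.ones((8, 4))  # 8 positions, head_dim 4
print(rope(q).shape)  # (8, 4)
```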
Finally, configuration details in the run.py file give a clearer picture of the model's scale: a vocabulary size of 131,072 (128 × 1024); an 8,192-token (8K) sequence length; 48 query heads with 8 key/value heads; and 64 Transformer layers. The default MoE setup lists eight experts total with two active experts per token. Together, these details portray Grok-1 as a large, sparsely activated Transformer built for scalable training and inference in distributed environments, now available for developers to examine, run, and extend under an open license.
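Gathering those reported values in one place, the sketch below expresses the scale numbers as a plain dataclass. The field names are illustrative, not the repo's exact API; the repository itself defines these through its own config classes in run.py.

```python
# Hedged sketch: scale parameters as reported from run.py, collected into a
# plain dataclass. Field names are illustrative, not the repo's exact API.
from dataclasses import dataclass

@dataclass
class GrokScale:
    vocab_size: int = 128 * 1024   # 131,072 entries
    seq_len: int = 8192            # 8K-token context
    num_q_heads: int = 48          # query heads
    num_kv_heads: int = 8          # key/value heads
    num_layers: int = 64           # Transformer layers
    num_experts: int = 8           # experts per MoE layer
    num_selected_experts: int = 2  # active experts per token

print(GrokScale())
```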
Cornell Notes
xAI open-sourced Grok-1, a 314B-parameter mixture-of-experts (MoE) model, releasing both weights and architecture under Apache License 2.0. The model is a pre-trained base checkpoint (not a chat model), with pre-training ending in October 2023. MoE routing activates only part of the network per token: the release describes 25% of weights active, and the repository config specifies 2 active experts out of 8, for roughly 86B active parameters per token. The code is JAX-based with CUDA support and uses SentencePiece tokenization. Running the model is resource-intensive: weights are ~320 GB and example execution likely needs >50 GB VRAM.
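The two headline numbers in this summary can be connected with back-of-envelope arithmetic: if every token pays for the shared (non-expert) parameters plus 2 of 8 experts, then the 314B total and ~86B active figures imply roughly 38B parameters per expert and ~10B shared. That split is an inference from the quoted numbers, not a value stated in the repo.

```python
# Back-of-envelope sketch: inferring the expert/shared parameter split from
# the quoted figures. The split is an inference, not a repo-stated value.
TOTAL_B = 314      # total parameters, billions
ACTIVE_B = 86      # active parameters per token, billions (as reported)
NUM_EXPERTS = 8
NUM_SELECTED = 2

# total = shared + 8 * expert ; active = shared + 2 * expert
expert_b = (TOTAL_B - ACTIVE_B) / (NUM_EXPERTS - NUM_SELECTED)  # ~38B
shared_b = ACTIVE_B - NUM_SELECTED * expert_b                   # ~10B
print(f"~{expert_b:.0f}B per expert, ~{shared_b:.0f}B shared")
print(f"active fraction: {ACTIVE_B / TOTAL_B:.0%}")             # ~27%
```

The resulting ~27% active fraction is consistent with the release page's "25% of weights active" figure, allowing for rounding in the 86B number.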
What exactly did xAI release for Grok-1, and why is that more than just a weight dump?
How does Grok-1's mixture-of-experts design reduce compute per token?
What does "base checkpoint" imply for how Grok-1 should be used?
What implementation details in the repository affect how hard it is to run Grok-1?
Which model configuration numbers define Grok-1's scale in the repo?
What tokenization and positional-encoding choices show up in the code inspection?
Review Questions
- Why does activating only 2 of 8 experts per token matter for compute, and how does it relate to the “active parameters” figure?
- What practical constraints does the ~320 GB weight size impose, and how does the repository’s quantization mention change the expected hardware requirements?
- Which configuration values (sequence length, vocabulary size, heads, layers) most directly determine the memory and compute profile of Grok-1?
Key Points
1. xAI open-sourced Grok-1 with both weights and architecture, under Apache License 2.0.
2. Grok-1 is a pre-trained base checkpoint (not a chat-tuned model), with pre-training ending in October 2023.
3. The MoE design activates only a subset of parameters per token: 2 active experts out of 8, about 86B active parameters per token.
4. The repository is JAX-based with CUDA support and uses SentencePiece for tokenization.
5. Grok-1 weights are about 320 GB, and running the provided examples likely requires >50 GB VRAM.
6. The Transformer implementation uses rotary embeddings and includes a router mechanism typical of MoE models.
7. Key config values include a 131,072-entry vocabulary (128 × 1024) and an 8K sequence length, with 48 query heads, 8 key/value heads, and 64 Transformer layers.