
Grok-1 Open Source: 314B Mixture-of-Experts Model by xAI | Blog post, GitHub/Source Code

Venelin Valkov · 4 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

xAI open-sourced Grok-1 with both weights and architecture, under the Apache License 2.0.

Briefing

xAI has open-sourced Grok-1, a 314B-parameter mixture-of-experts (MoE) model, releasing not only the weights but also the model architecture and training checkpoint details. The release matters because it gives researchers and developers a full path to reproduce and study how a very large MoE system is built, while also raising practical barriers: the published weights are about 320 GB and require substantial GPU memory to run.

The Grok-1 release page says the model was trained from scratch on xAI’s own data, with pre-training concluding in October 2023. It is described as a base (pre-trained) checkpoint rather than a chat-tuned model, implying additional fine-tuning would be needed for conversational use on xAI’s platforms. A key architectural detail is conditional computation: only a fraction of the parameters activate per token. The documentation indicates that 25% of the weights are active for a given token, and the repository configuration specifies two active experts out of eight total, yielding roughly 86B active parameters per token.
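The arithmetic behind the active-parameter figure can be sketched as follows. The split between shared (always-on) and expert weights below is an illustrative assumption, not a number from the repository; it only shows how the active count can exceed a bare 2/8 of 314B (~79B), since attention and embedding weights run for every token.

```python
# Back-of-envelope sketch of sparse activation in a 2-of-8 MoE model.
# "shared" is a hypothetical figure for always-active weights
# (attention, embeddings); the rest is assumed to sit in expert MLPs.
total_params = 314e9
num_experts, active_experts = 8, 2

shared = 10e9                                     # hypothetical always-active weights
per_expert = (total_params - shared) / num_experts

# Per token: all shared weights plus two experts' worth of MLP weights.
active = shared + active_experts * per_expert
print(f"active per token: ~{active / 1e9:.0f}B")  # in the ballpark of the quoted ~86B
```

Because the shared weights never get routed away, the per-token active fraction (~27%) sits a little above the 2/8 expert ratio, which is consistent with the ~25% figure xAI quotes.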

Grok-1 is positioned as an MoE system in the same family as other widely discussed sparse models, where a router selects which experts handle each token. The transcript also notes speculation that other major model families (including Google’s Gemini models) may use MoE-style designs, though Grok-1’s release provides concrete specifics. The model is licensed under the Apache License 2.0, meaning downstream use, modification, and redistribution are broadly permitted.

On the implementation side, the open-source repository (hosted under xAI’s GitHub organization) includes JAX-based code with CUDA-related requirements, plus SentencePiece for tokenization. The repository contains example code and a checkpoint runner intended to validate and run the model, but it warns that the MoE model’s sheer size makes local testing difficult: the author estimates that running the example likely needs more than 50 GB of VRAM.

The code inspection highlights several technical choices. The main model definition is largely self-contained in a single large model.py file, with weights stored in 8-bit quantized form to reduce memory footprint. The implementation includes the router mechanism typical of MoE layers, and the Transformer configuration uses rotary embeddings for positional encoding. The attention stack follows a standard multi-head Transformer pattern, without custom kernels shipped inside the repository.
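The router mechanism described above can be sketched in a few lines of plain Python. This is an illustrative toy of top-2 routing, not xAI’s JAX implementation: a softmax over per-expert gate logits, keep the two highest-scoring experts, and renormalise their weights.

```python
import math

def top2_route(gate_logits):
    """Pick the two highest-scoring experts for one token.

    Returns (expert_index, weight) pairs with the two weights
    renormalised to sum to 1, as in typical top-k MoE routing.
    """
    probs = [math.exp(g) for g in gate_logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]

# One token, eight experts: only experts 3 and 6 would actually run.
print(top2_route([0.1, -1.2, 0.3, 2.0, 0.0, -0.5, 1.5, 0.2]))
```

Only the selected experts’ MLPs execute for that token, which is where the compute savings of conditional computation come from.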

Finally, configuration details in the run.py file give a clearer picture of the model’s scale: a vocabulary size of 131,072; an 8K-token sequence length; 48 query heads and 8 key/value heads; and 64 Transformer layers. The default MoE setup lists eight experts total with two active experts per token. Together, these details portray Grok-1 as a large, sparsely activated Transformer built for scalable training and inference in distributed environments, now available for developers to examine, run, and extend under an open license.

Cornell Notes

xAI open-sourced Grok-1, a 314B-parameter mixture-of-experts (MoE) model, releasing both weights and architecture under the Apache License 2.0. The model is a pre-trained base checkpoint (not a chat model), with pre-training ending in October 2023. MoE routing activates only part of the network per token: the release describes 25% of the weights as active, and the repository config specifies 2 active experts out of 8, for roughly 86B active parameters per token. The code is JAX-based with CUDA support and uses SentencePiece tokenization. Running the model is resource-intensive: the weights are ~320 GB, and example execution likely needs more than 50 GB of VRAM.

What exactly did xAI release for Grok-1, and why is it more than just a weight dump?

The release provides the model weights and the architecture/checkpoint details, not only the parameters. That means developers can inspect the MoE routing structure, Transformer configuration, and implementation choices (e.g., embeddings, attention, and expert layers) rather than treating Grok-1 as a black box.

How does Grok-1’s mixture-of-experts design reduce compute per token?

Grok-1 uses sparse conditional computation. The release describes about 25% of the weights as active per token, and the repository configuration specifies 8 total experts with 2 active per token, which corresponds to roughly 86B active parameters at a time.

What does “base checkpoint” imply for how Gro-1 should be used?

The model is described as trained from scratch on xAI’s own data, with pre-training concluding in October 2023. It is not presented as a chat model, so conversational behavior would likely require additional fine-tuning beyond the released pre-trained checkpoint.

What implementation details in the repository affect how hard it is to run Grok-1?

The weights are extremely large (~320 GB), and the repository warns that a machine with enough GPU memory is required to test the example code. The transcript estimates that running the examples likely needs more than 50 GB of VRAM. The code also references 8-bit quantized weights, which can reduce memory pressure, but the model remains heavy.
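The 8-bit scheme the repository alludes to can be sketched as symmetric quantization: store int8 values plus one float scale per tensor, and dequantize on the fly. This is an illustrative sketch of the general technique, not the repository’s actual code.

```python
def quantize(weights):
    """Symmetric 8-bit quantization: map floats to int8 via one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

w = [0.02, -0.5, 0.31, 0.127]
q, s = quantize(w)
w_hat = dequantize(q, s)

# Memory drops ~4x vs float32; each value is recovered to within one
# quantization step (the scale).
assert all(abs(a - b) <= s for a, b in zip(w, w_hat))
```

Even at one byte per weight, 314B parameters still occupy on the order of 300 GB, which is why the quantization references do not make the model light to run.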

Which model configuration numbers define Grok-1’s scale in the repo?

In run.py, the configuration lists a vocabulary size of 131,072 and a sequence length of 8,192 tokens, along with 48 query heads, 8 key/value heads, and 64 Transformer layers. The MoE defaults are 8 experts total and 2 active experts per token.

What tokenization and positional-encoding choices show up in the code inspection?

Tokenization uses SentencePiece. For positional encoding, the model uses rotary embeddings (RoPE), with references to the relevant rotary embedding paper included in the code.
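Rotary embeddings encode position by rotating each consecutive pair of vector components through a position- and frequency-dependent angle. A minimal plain-Python sketch of RoPE, not the repository’s JAX implementation:

```python
import math

def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to one head vector (even length).

    Each pair (x, y) is rotated by an angle that shrinks with the pair's
    index, so different dimensions encode position at different scales.
    """
    out = []
    d = len(vec)
    for i in range(0, d, 2):
        theta = position / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

# Position 0 leaves the vector unchanged; later positions rotate it.
assert rope([1.0, 0.0, 0.5, 0.5], 0) == [1.0, 0.0, 0.5, 0.5]
```

Because rotation preserves vector norms, RoPE injects position into attention scores without distorting the magnitude of queries and keys.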

Review Questions

  1. Why does activating only 2 of 8 experts per token matter for compute, and how does it relate to the “active parameters” figure?
  2. What practical constraints does the ~320 GB weight size impose, and how does the repository’s quantization mention change the expected hardware requirements?
  3. Which configuration values (sequence length, vocabulary size, heads, layers) most directly determine the memory and compute profile of Gro-1?

Key Points

  1. xAI open-sourced Grok-1 with both weights and architecture, under the Apache License 2.0.

  2. Grok-1 is a pre-trained base checkpoint (not a chat-tuned model), with pre-training ending in October 2023.

  3. The MoE design activates only a subset of parameters per token: 2 active experts out of 8, about 86B active parameters per token.

  4. The repository is JAX-based with CUDA support and uses SentencePiece for tokenization.

  5. Grok-1 weights are about 320 GB, and running the provided examples likely requires more than 50 GB of VRAM.

  6. The Transformer implementation uses rotary embeddings and includes a router mechanism typical of MoE models.

  7. Key config values include a 131,072-token vocabulary and an ~8K sequence length, with 48 query heads, 8 key/value heads, and 64 layers.

Highlights

Grok-1’s release includes architecture and checkpoint details, enabling real reproduction rather than just inference with opaque weights.
MoE routing is concrete: 8 experts total, 2 active per token, translating to roughly 86B active parameters per token.
Despite 8-bit quantization references, the published weights (~320 GB) make local experimentation hardware-intensive.
The repo’s configuration lists an 8K context window, a 131,072-token vocabulary, 48 query heads, 8 key/value heads, and 64 Transformer layers.

Topics

  • Grok-1 Open Source
  • Mixture of Experts
  • JAX Implementation
  • Quantized Weights
  • Transformer Configuration
