
Introducing Gemma - 2B 7B 6Trillion Tokens

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemma is released as open-weight, text-only LLMs in four variants: 7B base, 7B instruct, 2B base, and 2B instruct.

Briefing

Google’s new Gemma model suite brings open-weight, English text-only large language models in four sizes—2B and 7B, each available in base and instruction-tuned variants—trained on a reported 6 trillion tokens. The headline detail is the scale: compared with earlier open models such as LLaMA (about 1–1.4T tokens) and LLaMA 2 (about 2T), Gemma’s training token count positions it among the largest datasets for an open-style release, which matters because token volume often correlates with stronger generalization and downstream fine-tuning potential.

Gemma is presented as a “family of lightweight state-of-the-art open models” built from the same research and technology used for Google’s Gemini models. It’s explicitly not multimodal: inputs and outputs are text strings, with prompts in English and generated English responses. The model card also describes training on web documents, code, and mathematics, alongside data processing intended to remove sensitive content. Training is reported on TPU V5e, using JAX and ML Pathways, and the release includes benchmark results and evaluations such as toxicity checks.

On licensing, Gemma comes with a terms-of-use document and a prohibited-use policy. The restrictions are described as broadly similar to earlier open LLM licensing patterns, including limits around certain chatbot-style uses. The practical implication is that developers can experiment and build, but they still need to align projects with the stated usage boundaries.

Access to the weights requires an approval step: users fill out a Gemma access request, review the restrictions, and accept them before downloading. Once approved, the weights can be used directly on Kaggle or downloaded for external workflows such as Colab.

For getting started, the transcript walks through a “get started” notebook using Keras NLP (with Keras 3.0). The setup includes both base and instruction models, and for PyTorch it also lists two quantized versions of the 7B base model. A notable technical detail is the tokenizer vocabulary size: 256,000 versus LLaMA’s 32,000. That larger vocabulary is framed as a potential advantage for multilingual fine-tuning, even though the current weights are described as English-only—leaving open the possibility of future multilingual releases.
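
As a rough sketch of that getting-started path (not code from the video), loading a Gemma checkpoint through Keras NLP looks roughly like this; the preset names such as "gemma_2b_en" and "gemma_instruct_2b_en" reflect the Kaggle listing at release time and may differ.

```python
# Minimal sketch, assuming Keras 3 + Keras NLP are installed and Kaggle access
# has been granted. Preset names are assumptions based on the Kaggle listing.
import keras_nlp

# Downloads the weights on first use (requires accepted Gemma terms).
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

gemma_lm.summary()  # prints the architecture and parameter counts
```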

Simple generation is demonstrated via Keras NLP’s generate flow, with options such as top-k sampling and temperature. The transcript also flags that instruction-tuned performance may not be maximized out of the box; community fine-tunes built on the base models—examples mentioned include “Gemma Orca”-style derivatives—are expected to become the most interesting instruction-following systems. Overall, Gemma’s combination of open weights, large training scale, and an accessible Keras NLP path is positioned as a meaningful new entry point for building and fine-tuning open LLM applications.
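
As a hedged illustration of those controls, Keras NLP attaches a sampler at compile time; the k value and temperature below are placeholders rather than settings taken from the video.

```python
import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Use top-k sampling with a temperature instead of the default decoding strategy.
sampler = keras_nlp.samplers.TopKSampler(k=40, temperature=0.7)
gemma_lm.compile(sampler=sampler)

# max_length caps prompt plus continuation, measured in tokens.
print(gemma_lm.generate("Explain what an open-weight model is.", max_length=64))
```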

Cornell Notes

Gemma is Google’s open-weight, English text-only LLM family released in four variants: 7B base, 7B instruct, 2B base, and 2B instruct. The standout figure is training on 6 trillion tokens, far above earlier open releases like LLaMA (≈1–1.4T) and LLaMA 2 (≈2T), which can improve generalization and fine-tuning outcomes. The models are trained on TPU V5e using JAX and ML Pathways, with datasets described as web documents, code, and mathematics plus sensitive-data filtering. Access requires agreeing to Gemma’s terms and prohibited-use policy. A Keras NLP walkthrough shows how to load models, generate text, and adjust sampling (top-k, temperature), with a tokenizer vocab size of 256,000 that may help multilingual fine-tuning later.

What are the four Gemma model variants, and what does “base” vs “instruct” imply for use?

The release lists four models: a 7 billion base model, a 7 billion instruct model, a 2 billion base model, and a 2 billion instruct model. “Base” models are used for general text generation without instruction tuning, while “instruct” variants are tuned to follow prompts more directly. The transcript notes that instruction-tuned results may not be optimal immediately, and expects community fine-tunes built from the base models to produce stronger instruction-following systems.
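
The transcript does not dwell on prompt formatting, but the instruct variants are generally prompted with Gemma's turn markers rather than raw text. A minimal sketch, assuming the instruct preset name used on Kaggle:

```python
import keras_nlp

gemma_it = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")

# Instruct checkpoints expect the conversation wrapped in turn markers.
prompt = (
    "<start_of_turn>user\n"
    "What is the difference between a base and an instruct model?<end_of_turn>\n"
    "<start_of_turn>model\n"
)
print(gemma_it.generate(prompt, max_length=128))
```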

Why does the reported 6 trillion token training matter compared with earlier open LLMs?

The transcript contrasts Gemma’s 6T tokens with LLaMA at roughly 1–1.4T and LLaMA 2 at around 2T, plus mentions some Chinese models reaching about 3T. More training tokens can translate into better generalization and more capable downstream behavior, especially when developers fine-tune for specific tasks. The claim is that Gemma’s training run involves one of the largest token counts of any openly released model.

What training data and infrastructure details are provided in the model card?

Gemma is described as trained on web documents, code, and mathematics. It also includes data processing intended to remove sensitive data. Training is reported to use TPU V5e, with JAX and ML Pathways as the framework/tooling stack.

How does the tokenizer vocabulary size differ from LLaMA, and what potential advantage is suggested?

The tokenizer vocab size is reported as 256,000 for Gemma, compared with 32,000 for LLaMA. The transcript suggests this larger vocabulary could make Gemma more suitable for fine-tuning toward multilingual capabilities, even though the current weights are described as English-only—leaving room for future multilingual variants.
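
A quick way to check the reported vocabulary size yourself, sketched under the assumption that the tokenizer preset shares the model preset name:

```python
import keras_nlp

# Load only the SentencePiece tokenizer, not the full model weights.
tokenizer = keras_nlp.models.GemmaTokenizer.from_preset("gemma_2b_en")

print(tokenizer.vocabulary_size())  # expected to be on the order of 256,000
print(tokenizer("Hello Gemma"))     # token IDs for a short English string
```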

What steps are required to obtain Gemma weights, and what constraints come with them?

Users must submit a Gemma access request, review the definitions and restrictions, and accept them to receive access to the model weights. The transcript also highlights a prohibited-use policy that appears reasonably similar to earlier LLaMA-style licensing, including restrictions around certain chatbot-style uses. That means developers can build, but must stay within the stated terms.
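
In practice the approval step surfaces in code as Kaggle credentials. A minimal sketch of a Colab-style setup, assuming the credentials are stored as Colab secrets named KAGGLE_USERNAME and KAGGLE_KEY:

```python
import os

# In Colab, credentials can be read from the secrets manager; elsewhere,
# set the same environment variables before loading any Gemma preset.
from google.colab import userdata

os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
```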

What does the Keras NLP setup demonstrate for running Gemma, and which generation controls are highlighted?

The walkthrough uses Keras 3.0 and Keras NLP, with a JAX backend. It loads base and instruct models and demonstrates text generation via a generate command. It also shows batch generation and sampling controls such as top-k sampling (e.g., setting top-k around 40) and temperature, which affect randomness and output diversity.
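
A hedged sketch of that flow: the backend is selected with an environment variable before Keras is imported, and generate accepts a list of prompts for batch generation. The prompts and lengths below are placeholders, not values from the video.

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # must be set before Keras is imported

import keras_nlp

gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Passing a list of prompts produces a batch of continuations in one call.
outputs = gemma_lm.generate(
    ["The best thing about JAX is", "Keras 3 lets you"],
    max_length=64,
)
for text in outputs:
    print(text)
```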

Review Questions

  1. Gemma is described as text-only and English-only—what does that mean for input/output formats and likely future releases?
  2. How do top-k sampling and temperature change generation behavior in the Keras NLP workflow described?
  3. Why might a larger tokenizer vocabulary (256,000 vs 32,000) be relevant for fine-tuning outcomes?

Key Points

  1. Gemma is released as open-weight, text-only LLMs in four variants: 7B base, 7B instruct, 2B base, and 2B instruct.
  2. Google reports training on 6 trillion tokens, substantially higher than LLaMA (≈1–1.4T) and LLaMA 2 (≈2T).
  3. The training mix is described as web documents, code, and mathematics, with sensitive-data filtering, trained on TPU V5e using JAX and ML Pathways.
  4. Access to weights requires submitting an access request and accepting Gemma’s terms, including a prohibited-use policy with chatbot-related restrictions.
  5. A Keras NLP quickstart uses Keras 3.0 and demonstrates loading models, generating text, and adjusting sampling via top-k and temperature.
  6. Gemma’s tokenizer vocabulary size is 256,000 (vs LLaMA’s 32,000), suggesting potential advantages for multilingual fine-tuning even though current weights are English-only.
  7. Instruction-tuned models may improve further through community fine-tunes built on the base models, with “Gemma Orca”-style derivatives expected to be notable.

Highlights

Gemma’s most striking figure is the reported 6 trillion tokens used for training, roughly triple LLaMA 2’s ~2T and well above earlier open releases like the original LLaMA.
Gemma is explicitly text-in/text-out and not multimodal, with English prompts and English generations.
The tokenizer vocab size jumps to 256,000 from LLaMA’s 32,000, hinting at better fine-tuning flexibility for languages beyond English.
Getting started is framed around Keras NLP (Keras 3.0 + JAX backend), including practical controls like top-k sampling and temperature.
