Introducing Gemma - 2B, 7B, 6 Trillion Tokens
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Google’s new Gemma model suite brings open-weight, English text-only large language models in four sizes (2B and 7B, each available in base and instruction-tuned variants), trained on a reported 6 trillion tokens. The headline detail is the scale: compared with earlier open models such as LLaMA (about 1–1.4T tokens) and LLaMA 2 (about 2T), Gemma’s training corpus is among the largest reported for an open-weight release, which matters because token volume often correlates with stronger generalization and downstream fine-tuning potential.
Gemma is presented as a “family of lightweight state-of-the-art open models” built from the same research and technology used for Google’s Gemini models. It’s explicitly not multimodal: inputs and outputs are text strings, with prompts in English and generated English responses. The model card also describes training on web documents, code, and mathematics, alongside data processing intended to remove sensitive content. Training is reported on TPU v5e, using JAX and ML Pathways, and the release includes benchmark results and evaluations such as toxicity checks.
On licensing, Gemma comes with a terms-of-use document and a prohibited-use policy. The restrictions are described as broadly similar to earlier open LLM licensing patterns, including limits around certain chatbot-style uses. The practical implication is that developers can experiment and build, but they still need to align projects with the stated usage boundaries.
Access to the weights requires an approval step: users fill out a Gemma access request, review the restrictions, and accept them before downloading. Once approved, weights can be used in Kaggle and also downloaded for external workflows such as Colab.
For getting started, the transcript walks through a “get started” notebook using Keras NLP (with Keras 3.0). The setup includes both base and instruction models, and for PyTorch it also lists two quantized versions of the 7B base model. A notable technical detail is the tokenizer vocabulary size: 256,000 versus LLaMA’s 32,000. That larger vocabulary is framed as a potential advantage for multilingual fine-tuning, even though the current weights are described as English-only—leaving open the possibility of future multilingual releases.
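To see why vocabulary size affects sequence length, consider two toy extremes. These "tokenizers" are hypothetical stand-ins for illustration only, not Gemma's or LLaMA's real SentencePiece models: a tiny vocabulary forces text into many small pieces, while a large vocabulary can represent the same text in far fewer tokens.

```python
# Illustrative only: how vocabulary size trades off against sequence length.
# Neither function is a real tokenizer; they just mark the two extremes.

def char_tokenize(text):
    """Small-vocabulary extreme: one token per character."""
    return list(text)

def word_tokenize(text):
    """Large-vocabulary extreme: one token per whitespace-separated word."""
    return text.split()

sample = "larger vocabularies usually mean fewer tokens per sentence"
print(len(char_tokenize(sample)))  # many tokens
print(len(word_tokenize(sample)))  # far fewer tokens
```

A real subword tokenizer sits between these extremes, but the direction holds: a 256,000-entry vocabulary can pack more text (and more scripts) into each token than a 32,000-entry one, which is why it is framed as useful for future multilingual fine-tuning.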
Simple generation is demonstrated via Keras NLP’s generate flow, with options such as top-k sampling and temperature. The transcript also flags that instruction-tuned performance may not be maximized out of the box; community fine-tunes built on the base models—examples mentioned include “Gemma Orca”-style derivatives—are expected to become the most interesting instruction-following systems. Overall, Gemma’s combination of open weights, large training scale, and an accessible Keras NLP path is positioned as a meaningful new entry point for building and fine-tuning open LLM applications.
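The two sampling controls mentioned above can be sketched in plain Python. This is a toy stand-in, not the keras_nlp API: the function name, the logits dict, and the example values are all hypothetical, but the mechanics (temperature scaling, then top-k filtering, then softmax sampling) match what those options do during generation.

```python
import math
import random

def top_k_sample(logits, k=3, temperature=1.0, rng=random):
    """Toy sketch of top-k sampling with temperature over a {token: logit} dict.
    Illustrative only; not a library API."""
    # 1. Temperature scaling: values below 1.0 sharpen the distribution,
    #    values above 1.0 flatten it.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    # 2. Top-k filtering: keep only the k highest-scoring tokens.
    top = dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:k])
    # 3. Softmax over the survivors (max-subtracted for stability), then sample.
    z = max(top.values())
    exps = {tok: math.exp(l - z) for tok, l in top.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return rng.choices(list(probs), weights=list(probs.values()))[0]

logits = {"the": 4.0, "a": 3.0, "cat": 1.0, "zebra": -2.0}
print(top_k_sample(logits, k=2, temperature=0.7))  # prints "the" or "a" only
```

With k=2, "cat" and "zebra" can never be drawn regardless of temperature, which is the point of top-k: it bounds how far sampling can wander into the tail.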
Cornell Notes
Gemma is Google’s open-weight, English text-only LLM family released in four variants: 7B base, 7B instruct, 2B base, and 2B instruct. The standout figure is training on 6 trillion tokens, far above earlier open releases like LLaMA (≈1–1.4T) and LLaMA 2 (≈2T), which can improve generalization and fine-tuning outcomes. The models are trained on TPU v5e using JAX and ML Pathways, with datasets described as web documents, code, and mathematics plus sensitive-data filtering. Access requires agreeing to Gemma’s terms and prohibited-use policy. A Keras NLP walkthrough shows how to load models, generate text, and adjust sampling (top-k, temperature), with a tokenizer vocab size of 256,000 that may help multilingual fine-tuning later.
What are the four Gemma model variants, and what does “base” vs “instruct” imply for use?
Why does the reported 6 trillion token training matter compared with earlier open LLMs?
What training data and infrastructure details are provided in the model card?
How does the tokenizer vocabulary size differ from LLaMA, and what potential advantage is suggested?
What steps are required to obtain Gemma weights, and what constraints come with them?
What does the Keras NLP setup demonstrate for running Gemma, and which generation controls are highlighted?
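The Keras NLP flow asked about above can be sketched roughly as follows. This is a hedged sketch, not the notebook's exact code: the preset name and calls reflect keras_nlp 0.8's Gemma support, and everything is wrapped in a function (with a lazy import) because `from_preset` downloads several gigabytes of weights and requires an accepted Gemma license.

```python
def gemma_quickstart(prompt="Why is the sky blue?"):
    """Sketch of the quickstart flow: load a preset, set a sampler, generate.
    Assumes keras_nlp >= 0.8 and Gemma access approval; not executed here
    because the first call downloads the 2B weights."""
    import keras_nlp  # imported lazily so defining this function costs nothing

    # Load the 2B base checkpoint by preset name; the instruction-tuned
    # variant uses the "gemma_instruct_2b_en" preset instead.
    gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

    # Configure sampling explicitly: top-k with a temperature, the two
    # generation controls highlighted in the walkthrough.
    gemma_lm.compile(
        sampler=keras_nlp.samplers.TopKSampler(k=5, temperature=0.7)
    )
    return gemma_lm.generate(prompt, max_length=64)
```

The same `generate` call works for both base and instruct presets; only the checkpoint name changes.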
Review Questions
- Gemma is described as text-only and English-only—what does that mean for input/output formats and likely future releases?
- How do top-k sampling and temperature change generation behavior in the Keras NLP workflow described?
- Why might a larger tokenizer vocabulary (256,000 vs 32,000) be relevant for fine-tuning outcomes?
Key Points
1. Gemma is released as open-weight, text-only LLMs in four variants: 7B base, 7B instruct, 2B base, and 2B instruct.
2. Google reports training on 6 trillion tokens, substantially higher than LLaMA (≈1–1.4T) and LLaMA 2 (≈2T).
3. The training mix is described as web documents, code, and mathematics, with sensitive-data filtering, trained on TPU v5e using JAX and ML Pathways.
4. Access to weights requires submitting an access request and accepting Gemma’s terms, including a prohibited-use policy with chatbot-related restrictions.
5. A Keras NLP quickstart uses Keras 3.0 and demonstrates loading models, generating text, and adjusting sampling via top-k and temperature.
6. Gemma’s tokenizer vocabulary size is 256,000 (vs LLaMA’s 32,000), suggesting potential advantages for multilingual fine-tuning even though current weights are English-only.
7. Instruction-tuned models may improve further through community fine-tunes built on the base models, with “Gemma Orca”-style derivatives expected to be notable.