Running Gemma using HuggingFace Transformers or Ollama

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Gemma’s chat prompting relies on explicit turn markers and end-of-turn stopping tokens, so system-style instructions often must be folded into the first user message.

Briefing

Running Gemma locally or in a notebook is straightforward, but the biggest practical takeaway is that Gemma’s chat behavior depends heavily on its prompt formatting, especially because it doesn’t rely on a separate “system prompt” the way many other instruction-tuned models do. For Hugging Face Transformers, gated access requires accepting the terms on Kaggle to download the weights, then using a Hugging Face token in Colab so the notebook can fetch the model. Once set up, the workflow centers on loading the instruction-tuned Gemma 7B with 4-bit quantization (so it fits on hardware like a Tesla T4), using bitsandbytes and device mapping to place the model on the GPU.
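
A minimal sketch of that setup, assuming the instruction-tuned google/gemma-7b-it checkpoint on Hugging Face and a Colab-style environment (the token handling and exact config values are illustrative, not taken from the video):

```python
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

login(token="hf_...")  # your Hugging Face access token (gated model)

model_id = "google/gemma-7b-it"  # the instruction-tuned 7B variant

# 4-bit quantization so the model fits on a 16 GB GPU such as a Tesla T4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the GPU
)
```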

The transcript then drills into how prompts must be structured. Gemma uses a chat template with explicit turn markers, “<start_of_turn>user” and “<start_of_turn>model”, plus corresponding “<end_of_turn>” tokens that also act as stopping conditions. That means a “system prompt” has to be folded into the first user message content for Hugging Face inference, since Gemma doesn’t support a separate system role. The tokenizer is another notable difference: Gemma’s vocabulary is 256,000 tokens, compared with roughly 32,000 for Llama-family models, which the narrator suggests may let more usable information fit into a given token budget.
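
A sketch of that prompting pattern, reusing the tokenizer and model from the snippet above (the message text is a made-up example):

```python
# Gemma has no system role, so the system-style instruction is
# prepended to the first user message.
system_text = "You are a helpful assistant that answers concisely."
user_text = "Write a short email asking for a project update."

messages = [{"role": "user", "content": f"{system_text}\n\n{user_text}"}]

# add_generation_prompt=True appends the <start_of_turn>model marker, so
# generation begins a model turn and halts at its end-of-turn token.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```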

A series of prompt experiments illustrates how Gemma responds differently depending on whether step-by-step reasoning is requested. When asked to “write out your reasoning step-by-step,” the model tends to produce structured chains of thought and often gets cut off early if the maximum generation length is set too low; when step-by-step reasoning is not requested, it can produce more directly usable outputs, like a coherent email. The transcript also compares this behavior to other models (including Mistral-style tendencies toward “first, second, third” formatting) and emphasizes that prompting strategy matters more than expecting uniform behavior across model families.

On the math and coding side, results are mixed: some GSM8K-style problems are solved correctly with clear intermediate steps, while others fail on arithmetic details like rounding or dividing by the wrong number. However, when a problem is reformulated in algebraic form (e.g., using variables and equations), Gemma’s reasoning improves and the correct answer emerges. The transcript also touches on translation limits: Gemma can handle basic translation, but Thai translation is weak, with the model improvising a “translate API”-style explanation rather than actually translating.
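
A hypothetical illustration of that reformulation trick (this problem is invented for illustration, not taken from the video):

```
Word-problem form:  A shirt costs $20 after a 20% discount. What was the
                    original price?

Algebraic form:     Let x be the original price. Solve 0.8x = 20.
                    x = 20 / 0.8 = 25, so the original price was $25.
```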

Finally, the Ollama path is presented as simpler for local use. Ollama already includes Gemma models, offering both 2B and 7B variants. The Ollama model file shows the same turn-based template logic, including where system-like instructions can be inserted into the prompt text. The 2B model runs quickly but is less reliable on factual tasks, making it more suitable for tasks like rewriting, summarizing, or language polishing—potentially alongside RAG—rather than strict question answering. Overall, Gemma is positioned as a capable, locally runnable model whose best results come from correct chat templating, careful quantization choices, and prompt designs tailored to its instruction and reasoning tendencies.
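
The corresponding Ollama commands look like this (model tags as they appear in the Ollama library):

```sh
ollama run gemma:2b                # fast, lighter-weight variant
ollama run gemma:7b                # larger, more capable variant
ollama show gemma:7b --modelfile   # prints the template with its turn markers
```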

Cornell Notes

Gemma can be run for inference either via Hugging Face Transformers (in a notebook) or locally through Ollama. Hugging Face setup requires accepting gated access for Gemma weights on Kaggle and using a Hugging Face token in Colab; practical deployment often uses 4-bit quantization so the 7B model fits on GPUs like a Tesla T4. Prompting is the core technical detail: Gemma’s chat format uses explicit “start of turn user/model” markers and end-of-turn stopping tokens, and it doesn’t use a separate system prompt—so system instructions must be folded into the first user message. Experiments show that asking for step-by-step reasoning changes output style dramatically, and math performance can be sensitive to rounding and problem formulation. Ollama simplifies local execution by packaging Gemma models and templates, with similar turn-based stopping behavior.

Why does Gemma require special handling for “system prompts” in Hugging Face inference?

Gemma’s instruction/chat formatting relies on turn markers (“<start_of_turn>user” / “<start_of_turn>model”) rather than a dedicated system role. In the Hugging Face workflow, the workaround is to embed the system prompt text directly into the first user message content, then apply the model’s chat template. This also aligns with the stopping logic: generation halts at the end-of-turn token for the model’s turn.
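
Concretely, a single exchange renders to the following literal format, with the system text folded into the user turn; generation stops when the model emits its own end-of-turn token:

```
<start_of_turn>user
{system instruction}

{user message}<end_of_turn>
<start_of_turn>model
```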

How does quantization affect which Gemma model sizes can run on common hardware?

The transcript uses 4-bit loading for Gemma 7B because the full-precision weights won’t fit on a Tesla T4. It recommends bitsandbytes quantization with a quantization config (load_in_4bit=True) and uses device mapping to place the model on the GPU. With larger GPUs like an A100, loading at full precision is possible, but the default notebook setup targets constrained hardware.

What tokenizer detail is highlighted, and why might it matter?

Gemma’s tokenizer vocabulary is described as 256,000 tokens, compared with about 32,000 for Llama-family models. The claim is that this changes how text is split into tokens, potentially letting more information fit into a comparable token budget, which can affect output quality under max-length limits.
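
A quick way to check those numbers yourself (both checkpoints are gated on Hugging Face, so the same access token applies; the Llama 2 repo id is used purely for comparison):

```python
from transformers import AutoTokenizer

gemma_tok = AutoTokenizer.from_pretrained("google/gemma-7b-it")
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(gemma_tok.vocab_size)  # 256000
print(llama_tok.vocab_size)  # 32000

# Fewer tokens for the same text means more content fits in a fixed budget.
text = "Gemma uses a much larger sentencepiece vocabulary than Llama."
print(len(gemma_tok(text).input_ids), len(llama_tok(text).input_ids))
```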

How does requesting step-by-step reasoning change Gemma’s output style and usefulness?

When prompted to “write out your reasoning step-by-step,” Gemma produces structured reasoning and often gets truncated if the maximum generation length is too low. Without that instruction, the same user goal can yield more directly usable artifacts, like a coherent email with a subject line and body, rather than a chain-of-thought response. The transcript also notes that Gemma’s reasoning style differs from models that default to “first/second/third” formatting.

What patterns show up in Gemma’s math performance?

Results are mixed. Some problems are solved correctly with intermediate steps (e.g., a simplified arithmetic path to a final answer), but others fail due to arithmetic precision issues (like rounding) or incorrect operations (like dividing by the wrong number). Reformulating into algebraic equations (using variables and isolating terms) improves correctness, suggesting the model handles structured symbolic reasoning better than ambiguous word problems.

How does Gemma handle translation, especially for Thai?

Basic translation is described as possible given Gemma’s training on a very large corpus (6 trillion tokens). But Thai translation is weak: instead of producing a reliable translation, the model tends to respond with generic helper phrasing and sometimes invents an explanation involving a “translate API,” reflecting its limited ability to translate from Thai.

Review Questions

  1. When embedding a system instruction for Gemma in Hugging Face, what exact prompt-structure change is required, and how does the chat template influence stopping tokens?
  2. Why might a 2B Gemma model be better suited for rewriting or RAG pipelines than for factual question answering?
  3. What differences in output style appear when you toggle step-by-step reasoning instructions, and how can max_length settings affect the result?

Key Points

  1. Gemma’s chat prompting relies on explicit turn markers and end-of-turn stopping tokens, so system-style instructions often must be folded into the first user message.

  2. Gated Gemma weights require accepting terms on Kaggle and using a Hugging Face token in Colab to download model files.

  3. Running Gemma 7B on limited GPUs typically requires 4-bit quantization via bitsandbytes and careful device mapping.

  4. Gemma’s tokenizer uses a much larger token vocabulary (256,000) than Llama-family models (~32,000), which can affect how much information fits into a token budget.

  5. Prompting for step-by-step reasoning can drastically change output usefulness, shifting from direct deliverables (like emails) to chain-of-thought style responses.

  6. Math accuracy can hinge on rounding and correct arithmetic operations; algebraic reformulation often improves results.

  7. Ollama packages Gemma models for local inference and uses a similar turn/template structure, making local setup simpler than manual Transformers configuration.

Highlights

Gemma doesn’t use a standalone system prompt in the way many instruction models do; system instructions are best embedded into the first user message when using Hugging Face.
Turn-based tokens like “start of turn user/model” and “end of turn” also function as practical generation boundaries, shaping both formatting and when output stops.
Step-by-step reasoning requests can turn a task like “write an email” into a chain-of-thought output, while removing that instruction can produce a more directly usable email.
Math performance is uneven on word problems but improves when the task is expressed in algebraic form with variables and equations.
Ollama makes local Gemma execution easy by bundling both 2B and 7B models, but the 2B variant is less reliable for factual tasks.
