
Gemma 3 Local Test with Ollama: Coding, Data Extraction, Data Labelling, Summarization, RAG

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemma 3’s officially quantized 12B model can run locally and still produce dependable structured outputs like Markdown and valid JSON.

Briefing

Gemma 3’s biggest practical win in local testing is its ability to deliver reliable, structured outputs—especially for coding, data extraction, and JSON-style labeling—using an officially quantized 12B model running through Ollama. Across several hands-on tasks, the model produced clean Python/pandas code that correctly sorted and grouped data, generated valid JSON with consistent fields, and summarized long text in a mostly usable format. The results matter because they show a general-purpose multimodal model can be made to behave like a dependable “workhorse” for data workflows, not just chat.

The setup centers on Google DeepMind’s Gemma 3 family: it comes in multiple sizes (1B, 4B, 12B, 27B) and introduces vision capabilities “baked in,” with multilingual support across dozens of languages and a large context window extended to 128k tokens via RoPE. The tester highlights that Gemma 3 supports function calling and structured output in official documentation, but the local configuration used here reportedly doesn’t enable tool/function use in the same way. Prompt formatting also matters: conversation turns rely on special tokens (BOS, start/end of turn), and system messages appear not to work as expected—system instructions were effectively prepended into the user prompt.

In coding, the model handled a dataset-generation and analysis task that many other models struggled with: creating at least 1,000 fictional “wealthiest people” entries across continents (name, gender, worth, continent), then building a pandas DataFrame and extracting the top five per continent, sorted from poorest to richest within each continent. The generated solution included multiple functions and basic error checking (e.g., handling empty DataFrames). The output ran without warnings and matched the requested sorting order, which the tester frames as the strongest result among models tested in the 7B–15B range.
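The sorting-and-grouping logic at the heart of this task can be sketched in a few lines of pandas; the column names and the tiny synthetic dataset below are illustrative, not the tester's actual prompt or data (the real task used 1,000+ generated entries).

```python
import pandas as pd

# Tiny illustrative dataset standing in for the 1,000+ generated entries.
df = pd.DataFrame({
    "name": [f"Person {i}" for i in range(12)],
    "continent": ["Europe", "Asia"] * 6,
    "worth": [3, 10, 7, 1, 9, 4, 2, 8, 6, 5, 11, 12],
})

def top_five_per_continent(df: pd.DataFrame) -> pd.DataFrame:
    """Return the five wealthiest people per continent,
    ordered poorest-to-richest within each continent."""
    if df.empty:  # basic guard, as in the generated solution
        return df
    # Sort by continent, then ascending worth within each continent...
    ordered = df.sort_values(["continent", "worth"])
    # ...then keep the last (richest) five rows of each group,
    # which preserves the ascending order within the group.
    return ordered.groupby("continent", sort=False).tail(5)

result = top_five_per_continent(df)
print(result)
```

Using `groupby(...).tail(5)` after an ascending sort is a common idiom here: `tail` preserves row order, so the five richest rows come out already sorted poorest-to-richest.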

For data labeling, Gemma 3 produced structured JSON for five tweets, filling fields like target audience, tone, complexity, and topics. Unlike some alternatives that returned multiple topics despite instructions to choose one, Gemma 3 largely followed the format and instruction constraints.

Summarization was strong in general, with one notable miss: when summarizing a Meta earnings report into only a few sentences, the model failed to include a specific revenue growth percentage, though it captured other key points like capital returns via stock buybacks. LinkedIn post generation was less successful—responses came out poorly formatted and the model drifted into conversational questions about audience and messaging rather than delivering a polished post.

Vision and RAG tests further shaped the picture. On a receipt image, the model extracted key fields (store name, date/time, totals, tax, card amount) accurately when given the image directly, while a text+OCR approach introduced some item-level confusion. In RAG-style Q&A over the earnings report, number extraction worked well, but questions requiring specific narrative facts (e.g., what Mark Zuckerberg is most proud of) were answered incorrectly or incompletely, and some calculation-based requests didn’t clearly ground answers in the provided text.

Overall, the local 12B quantized Gemma 3 model delivered dependable formatting and strong data-workflow performance, with remaining weaknesses tied to grounded reasoning in RAG and occasional factual omissions in tightly constrained summaries.

Cornell Notes

Gemma 3’s locally run, officially quantized 12B model performs best when tasks demand structured outputs and repeatable formatting. It generated working pandas code that correctly sorted and grouped data by continent and wealth rank, and it returned valid JSON labels for tweet attributes (audience, tone, complexity, topics) while largely respecting “single topic” instructions. Summarization was generally strong but missed at least one specific numeric detail (a revenue growth percentage) in a short earnings-report summary. Vision extraction from a receipt was accurate when the image was provided directly, while RAG-style questions sometimes failed when answers required narrative facts or explicit grounding in the retrieved text.

What capabilities of Gemma 3 matter most for local, practical workflows in this test?

The test emphasizes three areas: (1) structured output reliability (Markdown and JSON), (2) coding competence for data pipelines (Python/pandas generation), and (3) multimodal extraction (vision on receipts). It also notes prompt-format constraints: Gemma 3 expects BOS and turn tokens, and system messages appear not to work in the local setup—system content was effectively treated as part of the user prompt.

How did Gemma 3 perform on the pandas coding task, and why was it considered a standout result?

Gemma 3 generated code that (a) created a dataset with at least 1,000 entries including name, gender, worth (wealth), and continent, (b) built a pandas DataFrame, and (c) extracted the top five per continent after sorting by continent first and then by wealth from poorest to richest within each continent. The solution included multiple functions and error checking (e.g., empty-DataFrame handling). The tester reports correct ordering with no warnings and says other models commonly failed this task.

What did the JSON labeling test reveal about instruction-following?

When asked to classify five tweets into fields like target audience, tone, complexity, and topics, Gemma 3 returned properly formatted JSON. The tester highlights that it largely followed the instruction to produce a single topic per text, whereas other models sometimes returned multiple topics (e.g., lists) despite the constraint.
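One way to check this behavior programmatically is to validate each returned label against the expected schema. The field names mirror those in the test; the validator itself is an illustrative sketch, not part of the original setup.

```python
import json

EXPECTED_FIELDS = {"target_audience", "tone", "complexity", "topics"}

def validate_label(raw: str) -> dict:
    """Parse a model response and enforce the labeling constraints:
    all expected fields present, and exactly one topic (a string,
    not a list) per text."""
    label = json.loads(raw)  # raises ValueError on invalid JSON
    missing = EXPECTED_FIELDS - label.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(label["topics"], str):
        raise ValueError("expected a single topic, got a list or other type")
    return label

good = validate_label('{"target_audience": "developers", "tone": "neutral", '
                      '"complexity": "medium", "topics": "machine learning"}')
print(good["topics"])
```

A check like this makes the "single topic" constraint testable: a model that returns `"topics": ["a", "b"]` fails loudly instead of silently polluting downstream data.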

Where did summarization break down, even when the rest of the summary looked plausible?

In a short 2–4 sentence summary of a Meta earnings report, Gemma 3 captured themes like strong financial performance and capital returns via stock buybacks, but it omitted a specific revenue growth percentage. The tester treats that numeric omission as a significant factual miss, even though the qualitative points were aligned with the report.

How did vision extraction differ between providing image-only versus providing extracted text/OCR?

With the receipt image provided directly, the model extracted key fields accurately (store name, date/time, totals, tax, debit card amount) and kept wine items correct. When the model relied on a Markdown/OCR text representation of the receipt, it incorrectly merged or reassigned some wine items—suggesting that OCR/text preprocessing introduced errors that the model then propagated.
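Passing the image directly maps onto Ollama's generate endpoint, which accepts base64-encoded images alongside the prompt. The sketch below only builds the request body (the `gemma3:12b` model tag is an assumption; no HTTP request is sent), illustrating how the image-direct path skips a lossy OCR step.

```python
import base64
import json

def build_receipt_request(image_bytes: bytes) -> dict:
    """Build an Ollama /api/generate request body that sends the
    receipt image directly, avoiding a lossy OCR/text step."""
    return {
        "model": "gemma3:12b",  # assumed tag for the quantized 12B model
        "prompt": ("Extract store name, date/time, totals, tax, and the "
                   "card amount from this receipt. Reply as JSON."),
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

body = build_receipt_request(b"\x89PNG...")  # placeholder bytes, not a real image
print(json.dumps(body)[:80])
```

In the text/OCR variant, the `images` field would be dropped and a Markdown rendering of the receipt would be pasted into the prompt, which is exactly where the item-level confusion crept in.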

What patterns showed up in RAG-style question answering over the earnings report?

Number extraction from tables worked well: the model pulled values correctly when the information was present as plain numbers. However, questions requiring narrative grounding or explicit statements (e.g., what Mark Zuckerberg is most proud of) were answered incorrectly or with missing context. Some calculation-oriented prompts were also only partially successful, with the model not clearly grounding its response in the retrieved text.
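A common mitigation for the grounding failures noted here is to instruct the model to answer only from the retrieved passages and to cite which passage it used. This is a generic prompt-assembly sketch, not the tester's actual RAG pipeline; the example question and passage are illustrative.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Stuff retrieved passages into the prompt and require grounded answers."""
    # Number each passage so the model can cite its source.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the context below. Cite the passage number you "
        'used. If the answer is not in the context, say "not found".\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

rag_prompt = build_rag_prompt(
    "What is Mark Zuckerberg most proud of?",
    ["Example passage: Zuckerberg said he was most proud of the team's progress."],
)
print(rag_prompt)
```

Requiring a passage citation gives a cheap signal for the failure mode described above: answers without a citation, or with a wrong one, can be flagged as ungrounded.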

Review Questions

  1. Which prompt-format details (tokens/turn structure) are critical for getting Gemma 3 to respond correctly in this local setup?
  2. What specific sorting/grouping logic did Gemma 3 implement in the pandas task, and how did the tester verify it was correct?
  3. In the RAG tests, what kinds of questions tended to fail (narrative facts, calculations, or table lookups), and what does that imply about grounding?

Key Points

  1. Gemma 3’s officially quantized 12B model can run locally and still produce dependable structured outputs like Markdown and valid JSON.

  2. Gemma 3 generated working pandas code for a complex “top five per continent” task, including correct continent-first sorting and within-continent wealth ordering.

  3. JSON labeling for tweet attributes worked well, with the model largely respecting constraints such as producing a single topic per text.

  4. Short earnings-report summarization can miss critical numeric details even when the overall themes are correct.

  5. Vision extraction from a receipt was more accurate when the image was provided directly than when relying on OCR/text preprocessing.

  6. RAG-style Q&A over long reports worked better for table/number extraction than for narrative or explicitly grounded questions.

  7. Prompting constraints matter: system messages were not reliably supported in the tested local configuration, affecting how instructions should be formatted.

Highlights

The 12B quantized Gemma 3 model produced pandas code that correctly sorted by continent and then selected the top five per continent from poorest to richest—something many other models failed in the same task.
Gemma 3 returned perfectly formed JSON for tweet labeling (audience, tone, complexity, topics) and largely avoided the “multiple topics” behavior seen in other models.
Receipt understanding was strong when the image was provided directly, with accurate totals/tax/card fields and correct item-level extraction.
In RAG tests, number extraction from the earnings report worked, but narrative questions about Mark Zuckerberg’s stated pride were answered incorrectly or without grounding.
