Gemma 3 Local Test with Ollama: Coding, Data Extraction, Data Labelling, Summarization, RAG
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Gemma 3’s officially quantized 12B model can run locally and still produce dependable structured outputs like Markdown and valid JSON.
Briefing
Gemma 3’s biggest practical win in local testing is its ability to deliver reliable, structured outputs—especially for coding, data extraction, and JSON-style labeling—using an officially quantized 12B model running through Ollama. Across several hands-on tasks, the model produced clean Python/pandas code that correctly sorted and grouped data, generated valid JSON with consistent fields, and summarized long text in a mostly usable format. The results matter because they show a general-purpose multimodal model can be made to behave like a dependable “workhorse” for data workflows, not just chat.
The setup centers on Google DeepMind’s Gemma 3 family: it comes in multiple sizes (1B, 4B, 12B, 27B) and introduces vision capabilities “baked in,” with multilingual support across dozens of languages and a large context window extended to 128k tokens via RoPE. The tester highlights that Gemma 3 supports function calling and structured output in official documentation, but the local configuration used here reportedly doesn’t enable tool/function use in the same way. Prompt formatting also matters: conversation turns rely on special tokens (BOS, start/end of turn), and system messages appear not to work as expected—system instructions were effectively prepended into the user prompt.
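A minimal sketch of what that turn structure can look like when driving Ollama's raw generate endpoint from Python. The model tag, instruction text, and prompt wording are assumptions, and folding the system-style instruction into the user turn mirrors the workaround described above:

```python
import ollama

# Gemma-style turn markers: the exchange is expressed as user/model turns.
# System-level instructions are simply prepended to the user turn, since
# system messages were not honoured in this local setup.
# (Assumptions: model tag "gemma3:12b", instruction/message text.)
system_instruction = "You are a concise assistant that answers in Markdown."
user_message = "List three uses of pandas."

prompt = (
    "<start_of_turn>user\n"
    f"{system_instruction}\n\n{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# raw=True skips Ollama's own chat template so the hand-built turns are used
# as-is; the BOS token is normally added by the runtime, not written by hand.
response = ollama.generate(model="gemma3:12b", prompt=prompt, raw=True)
print(response["response"])
```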
In coding, the model handled a dataset-generation and analysis task that many other models struggled with: creating at least 1,000 fictional “wealthiest people” entries across continents (name, gender, net worth, continent), then building a pandas DataFrame and extracting the top five per continent, sorted from poorest to richest within each continent. The generated solution included multiple functions and basic error checking (e.g., handling empty DataFrames). The output ran without warnings and matched the requested sorting order, which the tester frames as the strongest result among models tested in the 7B–15B range.
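The exact code Gemma 3 produced is not reproduced here, but the grouping and sorting logic it had to get right is roughly the following sketch; the column names ("continent", "net_worth") and function name are assumptions about the generated dataset:

```python
import pandas as pd

def top_five_per_continent(df: pd.DataFrame) -> pd.DataFrame:
    """Return the five wealthiest people per continent, each group ordered
    from poorest to richest (column names are assumed, not from the source)."""
    if df.empty:  # basic guard, similar to the error checking in the generated code
        return df
    top5 = (
        df.sort_values("net_worth", ascending=False)   # richest first overall
          .groupby("continent", group_keys=False)
          .head(5)                                      # keep the 5 richest per continent
    )
    # Re-order so rows stay grouped by continent and ascend from poorest to richest
    return top5.sort_values(["continent", "net_worth"], ascending=[True, True])
```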
For data labeling, Gemma 3 produced structured JSON for five tweets, filling fields like target audience, tone, complexity, and topics. Unlike some alternatives that returned multiple topics despite instructions to choose one, Gemma 3 largely followed the format and instruction constraints.
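A hedged sketch of what such a labeling call can look like through Ollama's Python client; the field names mirror the ones mentioned above, while the model tag, prompt wording, and helper function are illustrative rather than taken from the video:

```python
import json
import ollama

# Hypothetical labeling prompt; keys mirror the attributes used in the test
LABEL_PROMPT = """Label the tweet below. Respond with JSON only, using exactly these keys:
"target_audience", "tone", "complexity", "topic" (choose a single topic).

Tweet: {tweet}"""

def label_tweet(tweet: str) -> dict:
    # format="json" asks Ollama to constrain the reply to valid JSON
    response = ollama.chat(
        model="gemma3:12b",
        messages=[{"role": "user", "content": LABEL_PROMPT.format(tweet=tweet)}],
        format="json",
    )
    return json.loads(response["message"]["content"])

labels = label_tweet("Just shipped our new open-source vector database, check it out!")
print(labels)
```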
Summarization was strong in general, with one notable miss: when summarizing a Meta earnings report into only a few sentences, the model failed to include a specific revenue growth percentage, though it captured other key points like capital returns via stock buybacks. LinkedIn post generation was less successful—responses came out poorly formatted and the model drifted into conversational questions about audience and messaging rather than delivering a polished post.
Vision and RAG tests further shaped the picture. On a receipt image, the model extracted key fields (store name, date/time, totals, tax, card amount) accurately when given the image directly, while a text+OCR approach introduced some item-level confusion. In RAG-style Q&A over the earnings report, number extraction worked well, but questions requiring specific narrative facts (e.g., what Mark Zuckerberg is most proud of) were answered incorrectly or incompletely, and some calculation-based requests didn’t clearly ground answers in the provided text.
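For the image-first receipt test, the direct approach amounts to attaching the image to the chat message rather than passing OCR'd text. A minimal sketch, assuming the same Ollama Python client; the file path and prompt wording are placeholders:

```python
import ollama

# Pass the receipt image directly instead of pre-extracted text;
# "receipt.jpg" is a placeholder path, not a file from the original test.
response = ollama.chat(
    model="gemma3:12b",
    messages=[{
        "role": "user",
        "content": "Extract the store name, date and time, total, tax, "
                   "and card amount from this receipt.",
        "images": ["receipt.jpg"],
    }],
)
print(response["message"]["content"])
```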
Overall, the local 12B quantized Gemma 3 model delivered dependable formatting and strong data-workflow performance, with remaining weaknesses tied to grounded reasoning in RAG and occasional factual omissions in tightly constrained summaries.
Cornell Notes
Gemma 3’s locally run, officially quantized 12B model performs best when tasks demand structured outputs and repeatable formatting. It generated working pandas code that correctly sorted and grouped data by continent and wealth rank, and it returned valid JSON labels for tweet attributes (audience, tone, complexity, topics) while largely respecting “single topic” instructions. Summarization was generally strong but missed at least one specific numeric detail (a revenue growth percentage) in a short earnings-report summary. Vision extraction from a receipt was accurate when the image was provided directly, while RAG-style questions sometimes failed when answers required narrative facts or explicit grounding in the retrieved text.
- What capabilities of Gemma 3 matter most for local, practical workflows in this test?
- How did Gemma 3 perform on the pandas coding task, and why was it considered a standout result?
- What did the JSON labeling test reveal about instruction-following?
- Where did summarization break down, even when the rest of the summary looked plausible?
- How did vision extraction differ between providing the image directly versus providing extracted text/OCR?
- What patterns showed up in RAG-style question answering over the earnings report?
Review Questions
- Which prompt-format details (tokens/turn structure) are critical for getting Gemma 3 to respond correctly in this local setup?
- What specific sorting/grouping logic did Gemma 3 implement in the pandas task, and how did the tester verify it was correct?
- In the RAG tests, what kinds of questions tended to fail (narrative facts, calculations, or table lookups), and what does that imply about grounding?
Key Points
1. Gemma 3’s officially quantized 12B model can run locally and still produce dependable structured outputs like Markdown and valid JSON.
2. Gemma 3 generated working pandas code for a complex “top five per continent” task, including correct continent-first sorting and within-continent wealth ordering.
3. JSON labeling for tweet attributes worked well, with the model largely respecting constraints such as producing a single topic per text.
4. Short earnings-report summarization can miss critical numeric details even when the overall themes are correct.
5. Vision extraction from a receipt was more accurate when the image was provided directly than when relying on OCR/text preprocessing.
6. RAG-style Q&A over long reports worked better for table/number extraction than for narrative or explicitly grounded questions.
7. Prompting constraints matter: system messages were not reliably supported in the tested local configuration, affecting how instructions should be formatted.