Gemma 4 Local Test | New Open LLM King?
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Gemma 4’s open, on-device push is starting to look practical: a 26B mixture-of-experts (MoE) instruction-tuned model running locally via llama.cpp delivered around 40–43 tokens per second on an M4-class machine, while also handling image understanding and structured-extraction tasks that previously tripped up smaller Gemma variants. The takeaway isn’t that benchmarks are suddenly trustworthy; it’s that real-world multimodal workflows (image → reasoning → JSON/HTML) are working well enough to test in a local setup.
The session begins with a hands-on run of the Gemma 4 26B MoE model (with an “active” 4B-parameter configuration) through a llama.cpp instance, using an 8-bit quantized model from Hugging Face. On an M4 with 38 GB RAM, the model’s throughput landed in the low-40s tokens/sec, and the host notes that the speed is consistent enough to keep iterative testing moving. Hardware constraints remain real: a commenter with a 6 GB GPU is told it won’t be enough for the 26B model, and the host suggests that only smaller effective-2B/4B variants (or heavier quantization) are feasible on modest setups.
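A rough memory estimate explains that hardware guidance: at 8-bit quantization each parameter takes about one byte, so the 26B model’s weights alone need on the order of 26 GB, which fits in 38 GB of unified memory but not in 6 GB of VRAM. A minimal sketch of that arithmetic (weights only; real deployments also need room for the KV cache and activations):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB for a quantized model
    (weights only; ignores KV cache and activation overhead)."""
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param  # billions of params * bytes/param ≈ GB

full_26b_at_8bit = weight_memory_gb(26, 8)  # ≈ 26 GB: fits in 38 GB unified memory
small_4b_at_4bit = weight_memory_gb(4, 4)   # ≈ 2 GB: plausible on a 6 GB GPU
```

This is why the host steers the 6 GB GPU commenter toward the smaller effective-2B/4B variants or heavier quantization.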
Under the hood, the discussion leans on Google’s published details: Gemma 4 is licensed under Apache 2.0, making it usable in applications and fine-tunable by developers. Model sizing is framed as “mobile-first,” with effective 2B and effective 4B options plus a dense 31B and the MoE 26B variant. Context windows differ by size: the smaller effective models are described as supporting 128k tokens of context, while the larger MoE/dense variants are put at 25k–26k. The models are also said to be natively trained across 140 languages, though the data distribution is expected to be uneven.
Multimodal capability is a major focus. The user tests image understanding by prompting the model to identify ingredients from a photo of a Bulgarian fridge spread, then asks for a recipe-like output. The model’s responses are described as “quite decent,” with particular praise for image comprehension. Next, the model is asked to extract receipt fields (vendor name, total amount, date, and itemized wine entries) into structured output. The extraction lands correctly on key fields—vendor name “Cut Yong Vic,” total amount “9619,” and date “2025 10th of April”—even though the reasoning text can be long and occasionally “overthinky.”
The session also tests chart digitization: a Tesla monthly chart is converted into CSV rows for import into pandas. Results are mixed, with the earlier Quent 3.5 performing better on that specific chart segment. For longer documents, a PDF technical paper is summarized; the user notes that the llama.cpp UI likely converts PDFs into images before sending them to the model. Finally, the workflow is extended beyond the UI by running the model through an OpenAI-compatible completions endpoint pointed at the local llama.cpp server, including a “thinking level” parameter to control reasoning verbosity.
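Calling the local server through its OpenAI-compatible endpoint looks roughly like the sketch below. The `/v1/chat/completions` path and JSON shape follow the OpenAI convention the server mimics, but the port, model name, and especially the `thinking_level` field are assumptions based on the video and may be named differently in practice:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # adjust to wherever the local server listens

def build_request(prompt: str, thinking_level: str = "low") -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the local server.
    `thinking_level` is a hypothetical extra field to curb verbose reasoning."""
    payload = {
        "model": "gemma-4-26b-moe",  # illustrative; the server reports its own name
        "messages": [{"role": "user", "content": prompt}],
        "thinking_level": thinking_level,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Summarize the attached receipt as JSON.")
# To actually send it (requires the local server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI convention, any OpenAI-compatible client library can be pointed at the same base URL instead of hand-building requests.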
Overall, Gemma 4’s practical edge in this test is less about leaderboard placement and more about usable local pipelines: image understanding, HTML generation, and structured extraction with tool-friendly, JSON-like outputs—provided the hardware and quantization are chosen realistically.
Cornell Notes
Gemma 4 is an Apache 2.0–licensed, instruction-tuned open model family that can run locally through llama.cpp, including multimodal image understanding and structured outputs. In hands-on tests, the 26B MoE variant (active 4B) ran at roughly 40–43 tokens per second on an M4 with 38 GB RAM using an 8-bit quantized model. The model produced strong image-based ingredient/recipe outputs and accurate receipt-field extraction (vendor, total, date, and items), though it sometimes generated long, overthinking reasoning traces. Chart-to-CSV digitization worked but was weaker than Quent 3.5 on at least one chart. The setup also supports PDF summarization (likely via PDF-to-image conversion) and OpenAI-compatible API calls to the local server, with a controllable “thinking level.”
What makes Gemma 4 “open” in a way that matters for developers?
How does the local setup affect performance, and what speeds were observed?
What multimodal tasks worked well in the tests?
Where did the model struggle or underperform?
How do context window and “thinking level” show up in practice?
How were PDFs handled in the local workflow?
Review Questions
- Which Gemma 4 variant was tested locally, and what throughput was reported under the described hardware and quantization settings?
- What evidence from the receipt and image tasks suggests Gemma 4 can produce reliable structured outputs locally?
- In what task did Gemma 4 underperform Quent 3.5, and what was the user’s stated reason for that comparison?
Key Points
1. Gemma 4 is Apache 2.0 licensed, enabling local use and fine-tuning for custom applications.
2. A local llama.cpp setup running the Gemma 4 26B MoE (active 4B) 8-bit quantized model achieved roughly 40–43 tokens per second on an M4 with 38 GB RAM.
3. Gemma 4’s multimodal image understanding performed well on ingredient identification and recipe-style outputs from real photos.
4. Receipt extraction into structured fields (vendor, total, date, and itemized entries) produced correct results even when the reasoning text was verbose.
5. Chart-to-CSV digitization was less consistent; Quent 3.5 produced closer results on at least one Tesla chart test.
6. PDF workflows appear to rely on converting pages into images before model inference, and the model can summarize and extract specific values from the document.
7. OpenAI-compatible API calls can target the local llama.cpp server, with a “thinking level” setting to reduce overthinking.
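The chart-to-CSV workflow in point 5 ends with a pandas import step; that hand-off can be sketched as below, using placeholder values rather than the actual Tesla figures read off the chart in the video:

```python
import io
import pandas as pd

# Placeholder CSV standing in for the model's chart-digitization output;
# the real values extracted in the video are not reproduced here.
model_csv = """month,close
2025-01,100.0
2025-02,110.5
2025-03,95.25
"""

# Load the model-generated rows straight into a DataFrame for analysis.
df = pd.read_csv(io.StringIO(model_csv))
```

Once the rows are in a DataFrame, the mixed accuracy noted above can be checked directly, for example by comparing the digitized values against a trusted price series.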