
Gemma 4 Local Test | New Open LLM King?

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Gemma 4 is Apache 2.0 licensed, enabling local use and fine-tuning for custom applications.

Briefing

Gemma 4’s open, on-device push is starting to look practical: a 26B mixture-of-experts (MoE) instruction-tuned model running locally via llama.cpp delivered around 40–43 tokens per second on an M4-class machine, while also handling image understanding and structured extraction tasks that previously tripped up smaller Gemma variants. The takeaway is not that benchmarks are suddenly trustworthy; it is that real-world multimodal workflows (image → reasoning → JSON/HTML) are working well enough to test in a local setup.

The session begins with a hands-on run of the Gemma 4 26B MoE model (with an “active” 4B parameter configuration) through a llama.cpp instance, using a quantized 8-bit model from Hugging Face. On an M4 with 38 GB RAM, the model’s throughput landed in the low-40s tokens/sec, and the user notes that the speed is consistent enough to keep iterative testing moving. Hardware constraints remain real: a commenter with a 6 GB GPU is told it won’t be enough for the 26B model, and the host suggests that only smaller effective-2B/4B variants (or heavier quantization) are feasible on modest setups.
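To make the setup concrete, here is a minimal sketch of how such a local server might be launched, assuming a llama.cpp build whose `llama-server` binary supports the `-hf` Hugging Face shorthand; the repository name and flags below are illustrative, not the exact command from the video:

```python
# Minimal sketch: start a local llama.cpp server for a quantized GGUF model.
# Assumptions: llama.cpp is installed, `llama-server` is on PATH, and the
# build supports the `-hf` Hugging Face shorthand. The repo name below is
# hypothetical -- substitute the actual Gemma 4 GGUF repo you want to run.
import subprocess

server = subprocess.Popen([
    "llama-server",
    "-hf", "ggml-org/gemma-4-26b-moe-it-GGUF:Q8_0",  # hypothetical repo:quant
    "--port", "8080",  # OpenAI-compatible endpoints are served under /v1
    "-c", "8192",      # context length for this session
])
# The server now accepts requests at http://localhost:8080/v1/...
```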

Under the hood, the discussion leans on Google’s published details: Gemma 4 is licensed under Apache 2.0, making it usable in applications and fine-tunable by developers. Model sizing is framed as “mobile-first,” with effective 2B and effective 4B options plus a dense 31B and the MoE 26B variant. Context windows differ by size: the smaller effective models are described as supporting 128k context, while the larger MoE/dense variants are described as supporting 25k–26k. The models are also described as natively trained across 140 languages, though the data distribution is expected to be uneven.

Multimodal capability is a major focus. The user tests image understanding by prompting the model to identify ingredients from a photo of a Bulgarian fridge spread, then asks for a recipe-like output. The model’s responses are described as “quite decent,” with particular praise for image comprehension. Next, the model is asked to extract receipt fields (vendor name, total amount, date, and itemized wine entries) into structured output. The extraction lands correctly on key fields—vendor name “Cut Yong Vic,” total amount “9619,” and date “2025 10th of April”—even though the reasoning text can be long and occasionally “overthinky.”
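As a rough illustration of this structured-extraction step, the sketch below sends a receipt image to the local server’s OpenAI-compatible chat endpoint and asks for JSON. The prompt, field names, and file path are assumptions, and whether `response_format` is honored depends on the server build, so the prompt also asks for JSON explicitly:

```python
# Sketch of the receipt-extraction step against the local server's
# OpenAI-compatible chat API. Prompt and field names are illustrative.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemma-4",  # llama-server generally serves whatever model it loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Extract vendor_name, total_amount, date, and an items list "
                "from this receipt. Reply with a single JSON object only."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
    response_format={"type": "json_object"},  # may be ignored by some builds
)
print(resp.choices[0].message.content)
```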

The session also tests chart digitization: a Tesla monthly chart is converted into CSV rows for import into pandas. Results are mixed, with the earlier Qwen 3.5 performing better on that specific chart segment. For longer documents, a PDF technical paper is summarized; the user notes that the llama.cpp UI likely converts PDFs into images before sending them to the model. Finally, the workflow is extended beyond the UI by running the model through an OpenAI-compatible completions endpoint pointed at the local llama.cpp server, including a “thinking level” parameter to control reasoning verbosity.
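A minimal sketch of that API call, reusing the client from the receipt example above, might look like the following. The `thinking_level` field name is hypothetical, mirroring the transcript’s wording; the exact parameter a given server build accepts may differ:

```python
# Sketch of the API-level "thinking level" control mentioned in the video.
# `thinking_level` is a hypothetical field name; how (and whether) a given
# server exposes this control is build-specific.
summary = client.chat.completions.create(
    model="gemma-4",
    messages=[{"role": "user",
               "content": "Summarize the attached paper in five bullets."}],
    extra_body={"thinking_level": "minimum"},  # hypothetical: cut verbosity
)
print(summary.choices[0].message.content)
```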

Overall, Gemma 4’s practical edge in this test is less about leaderboard placement and more about usable local pipelines: image understanding, HTML generation, and structured extraction with tool-friendly, JSON-like outputs—provided the hardware and quantization are chosen realistically.

Cornell Notes

Gemma 4 is an Apache 2.0–licensed, instruction-tuned open model family that can run locally through llama.cpp, including multimodal image understanding and structured outputs. In hands-on tests, the 26B MoE variant (active 4B) ran at roughly 40–43 tokens per second on an M4 with 38 GB RAM using an 8-bit quantized model. The model produced strong image-based ingredient/recipe outputs and accurate receipt field extraction (vendor, total, date, and items), though it sometimes generates long, overwrought reasoning. Chart-to-CSV digitization worked but was weaker than Qwen 3.5 on at least one chart. The setup also supports PDF summarization (likely via PDF-to-image conversion) and OpenAI-compatible API calls to the local server, with a controllable “thinking level.”

What makes Gemma 4 “open” in a way that matters for developers?

Gemma 4 is licensed under Apache 2.0, which allows developers to use the model in their own applications and fine-tune variants for custom tasks. The transcript also notes that model variants are available via Hugging Face, and the user specifically runs a llama.cpp build that points to a Hugging Face Gemma 4 GGUF model.

How does the local setup affect performance, and what speeds were observed?

Performance depends heavily on hardware and quantization. The user runs the Gemma 4 26B MoE model using an 8-bit quantized version on an M4 with 38 GB RAM and reports about 40–43 tokens per second. A separate comment warns that a 6 GB GPU (e.g., a 3060) is unlikely to be enough for local model runs, and the host suggests smaller effective-2B/4B models or more aggressive quantization for limited hardware.
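A quick back-of-envelope check makes the warning plausible: weight memory scales roughly as parameters times bits per weight divided by 8, before counting KV cache and activations.

```python
# Back-of-envelope weight-memory estimate: bytes ~= params * bits / 8.
# This ignores KV cache and activations, so real usage is higher.
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(weight_gb(26e9, 8))  # ~26 GB for the 26B model at 8-bit: far beyond 6 GB
print(weight_gb(4e9, 4))   # ~2 GB for an effective-4B model at 4-bit: feasible
```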

What multimodal tasks worked well in the tests?

Image understanding was a standout. The model identified ingredients from a photo and generated a recipe-like output. It also extracted receipt information into structured fields (vendor name, total amount, date, and item entries). The transcript emphasizes that the receipt extraction succeeded even when the reasoning text was lengthy.

Where did the model struggle or underperform?

Chart digitization into CSV was mixed. For a Tesla monthly chart, the Gemma 4 output did not match the original price action as closely as the earlier Qwen 3.5 attempt, which the user says handled that chart segment better. The model also sometimes produced long reasoning loops, which could be a drawback for fast, production-style workflows.
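For context, the pandas import step is straightforward once the model returns CSV text; the snippet below, with an illustrative stand-in for the model’s reply, shows the shape of that step:

```python
# Loading the model's chart-to-CSV answer into pandas, as in the video's
# digitization step. `csv_text` stands in for the model's raw reply.
import io
import pandas as pd

csv_text = "date,open,high,low,close\n2025-01-31,1.0,2.0,0.5,1.5\n"  # illustrative
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.describe())  # quick sanity check against the original chart
```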

How do context window and “thinking level” show up in practice?

The transcript distinguishes context windows by model size: effective 2B/4B models are described as supporting 128k context, while larger MoE/dense variants are described as supporting roughly 25k–26k. For reasoning control, the user mentions an OpenAI-compatible local completions setup where a “thinking level” can be set to minimum to reduce overthinking during tasks like PDF summarization.

How were PDFs handled in the local workflow?

The user tests a technical paper PDF and notes that the llama.cpp UI likely converts PDFs into images before sending them to the model. The model then produces a high-level summary and answers questions about the paper, including extracting an “energy per token” figure from a specific page.
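To replicate that assumed conversion step outside the UI, one option is the pdf2image library (a poppler wrapper); this is a sketch of one plausible approach, not what the UI is confirmed to do internally:

```python
# Replicating the assumed PDF-to-image step outside the UI. pdf2image
# (which wraps poppler) is one common choice; the transcript only
# speculates that the UI does something similar internally.
import base64
import io
from pdf2image import convert_from_path

pages = convert_from_path("paper.pdf", dpi=150)  # one PIL image per page

page_payloads = []
for page in pages:
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    page_payloads.append({"type": "image_url",
                          "image_url": {"url": f"data:image/png;base64,{b64}"}})
# `page_payloads` can then go into the OpenAI-compatible chat request above.
```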

Review Questions

  1. Which Gemma 4 variant was tested locally, and what throughput was reported under the described hardware and quantization settings?
  2. What evidence from the receipt and image tasks suggests Gemma 4 can produce reliable structured outputs locally?
  3. In what task did Gemma 4 underperform Qwen 3.5, and what was the user’s stated reason for that comparison?

Key Points

  1. Gemma 4 is Apache 2.0 licensed, enabling local use and fine-tuning for custom applications.

  2. A local llama.cpp setup running the Gemma 4 26B MoE (active 4B) 8-bit quantized model achieved roughly 40–43 tokens per second on an M4 with 38 GB RAM.

  3. Gemma 4’s multimodal image understanding performed well on ingredient identification and recipe-style outputs from real photos.

  4. Receipt extraction into structured fields (vendor, total, date, and itemized entries) produced correct results even when reasoning text was verbose.

  5. Chart-to-CSV digitization was less consistent; Qwen 3.5 produced closer results on at least one Tesla chart test.

  6. PDF workflows appear to rely on converting pages into images before model inference, and the model can summarize and extract specific values from the document.

  7. OpenAI-compatible API calls can target the local llama.cpp server, with a “thinking level” setting to reduce overthinking.

Highlights

  • Gemma 4’s local multimodal pipeline produced correct receipt fields (including vendor, total, and date) rather than only free-form text.
  • The 26B MoE model ran at about 40–43 tokens per second on an M4 with 38 GB RAM using an 8-bit quantized build.
  • Context support differs by size: effective 2B/4B models are described as 128k, while larger MoE/dense variants are described as ~25k–26k.
  • Chart digitization into CSV was mixed; Qwen 3.5 beat Gemma 4 on at least one chart segment.
  • A local OpenAI-compatible endpoint can wrap the model, including a “thinking level” control for faster, less verbose outputs.

Mentioned

  • MoE
  • OCR
  • API
  • JSON
  • CSV
  • PDF
  • GPU
  • VRAM
  • RAM
  • GGUF
  • HTML
  • SVG
  • MLX