Gemma 4 Local OCR Test with llama.cpp | How Accurate It Is for PDF Document Understanding (đź”´ Live)
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Gemma 4 can be run locally for OCR-style extraction by converting PDFs into per-page images and sending those images with an OCR prompt to a llama.cpp server.
Briefing
Gemma 4 can perform surprisingly strong document understanding for local OCR-style extraction, especially when the goal is to recover layout and structured fields from scanned pages, but it still struggles with strict character-level accuracy and can miss or hallucinate numbers in edge cases. The workflow runs Gemma 4 through llama.cpp with a dedicated Gemma 4 parser: it converts PDFs into page images (via PyMuPDF), base64-encodes each page, and sends the images alongside an OCR prompt to a local llama.cpp server. On receipts and financial filings, the model often returns correct totals, merchant names, and key table values, and it can extract specific tables as Markdown with formatting that preserves headers, columns, and alignment.
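A minimal sketch of the PDF-to-image step, assuming PyMuPDF (the fitz module); the function name and DPI value are illustrative, not taken from the video:

```python
import base64
import fitz  # PyMuPDF (pip install pymupdf)

def pdf_pages_to_base64(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to an in-memory PNG and return one base64 string per page."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the given DPI
            pages.append(base64.b64encode(pix.tobytes("png")).decode("ascii"))
    return pages
```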
The setup hinges on using the correct llama.cpp build and the model’s visual token budget. A specialized Gemma 4 parser was added to the llama.cpp master branch, and the testing approach recommends updating to the latest master/head build (or building from source on macOS) so recent Gemma 4 fixes are included. On the inference side, Gemma 4 supports multiple visual token budgets; the testing uses the maximum allowed budget by setting image_min_tokens and image_max_tokens to that value. The server also requires a specific universal batch size to run reliably.
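As a rough illustration of those settings, a launch might look like the sketch below. All paths and numeric values are placeholders, and the --image-min-tokens/--image-max-tokens flag names (matching the image_min_tokens/image_max_tokens settings above) should be verified against `llama-server --help` on your build, since multimodal flags have changed across versions:

```python
import subprocess

MAX_VISUAL_TOKENS = "1024"  # placeholder: set to Gemma 4's maximum supported budget

subprocess.run([
    "llama-server",
    "-m", "models/gemma-4.gguf",               # placeholder model path
    "--mmproj", "models/mmproj-gemma-4.gguf",  # multimodal projector (placeholder path)
    "--image-min-tokens", MAX_VISUAL_TOKENS,   # pin min == max to force the
    "--image-max-tokens", MAX_VISUAL_TOKENS,   # maximum visual token budget
    "-ub", "2048",  # "universal"/physical batch size; the stream notes a specific value is required
    "--port", "8080",
], check=True)
```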
Prompting and input ordering matter. The OCR prompt used is taken from Google’s Gemma 4 OCR/document-understanding guidance, and the image input is sent first in the multimodal message. There’s also a practical limitation: larger Gemma 4 variants support text and images, while smaller variants additionally support audio. That means audio OCR/transcription workflows won’t map cleanly onto the larger local OCR setup.
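A minimal sketch of that image-first ordering, assuming llama.cpp's OpenAI-compatible /v1/chat/completions endpoint; the prompt string here is a stand-in, not a quote of Google's actual OCR prompt:

```python
import requests

def ocr_page(page_b64: str, prompt: str) -> str:
    """Send one base64-encoded page image plus an OCR prompt to the local server."""
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                # Image first, then text, per the ordering recommendation above.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions",
                         json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```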
In live extraction tests, Gemma 4 frequently captures document structure (tables, rows, and column formatting) and returns coherent JSON when asked for fields like wine items, totals, merchant, and date. When prompts request additional fields (such as tax), the model can still produce correct values in at least some cases. The most consistent weakness is the classic OCR failure mode for vision-language models: characters and punctuation can be replaced, removed, or hallucinated, even when the overall structure looks right. The streamer recommends using a PDF-native pipeline (e.g., extracting text from digital PDFs and using layout-aware parsing) before falling back to vision-language extraction.
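A sketch of a structured-extraction call on top of the hypothetical ocr_page() helper above; the field list mirrors the receipt tests, while the prompt wording and the fence-stripping guard are assumptions (vision-language models often wrap JSON in Markdown fences):

```python
import json

FIELDS_PROMPT = (
    "Extract the following fields from this receipt as JSON: "
    "items (name, quantity, price), total, tax, merchant, date. "
    "Return only the JSON object."
)

def extract_receipt_fields(page_b64: str) -> dict:
    raw = ocr_page(page_b64, FIELDS_PROMPT).strip()
    if raw.startswith("```"):
        # Strip a ```json ... ``` wrapper if the model added one.
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)  # may raise ValueError if the model drifts off-format
```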
Financial documents provide the clearest “structured OCR” win. For Apple quarterly/financial-style PDFs, the model extracts the company name, form type, and ticker correctly from multiple pages passed in a single prompt. More impressively, it can output a specific “note to revenue” table as Markdown while preserving header splits and alignment, and it can reconstruct a condensed consolidated balance sheets table with totals that appear accurate on first verification. The tests also show that model behavior can be inconsistent: some runs return empty responses, and long multi-page prompts can lead to extended “thinking” time that affects whether output appears at all.
A final stress test shows how easily strict numeric extraction can fail. When asked to extract a specific value from a figure/table in a research PDF (e.g., at T_eval = 1.2 and T_train = 0.9), Gemma 4 returns a close-but-wrong number, and even narrowing the input to a single page doesn't reliably fix the error. The takeaway is pragmatic: Gemma 4 is promising for local, table-heavy document understanding and structured extraction, but it's not a drop-in replacement for high-precision OCR when exact values must be guaranteed. For large-scale ingestion, the recommended approach is to run a fast PDF/layout pipeline first and apply Gemma 4 selectively where layout understanding or image-based content makes conventional extraction difficult.
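One way to sketch that routing with PyMuPDF is to check for a usable text layer before invoking the model; the character threshold here is an arbitrary assumption (digital PDFs usually have extractable text, scans usually don't):

```python
import fitz  # PyMuPDF

def needs_vlm(pdf_path: str, min_chars: int = 50) -> bool:
    """Route to Gemma 4 only when the PDF lacks a usable text layer."""
    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    return len(text.strip()) < min_chars
```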
Cornell Notes
Gemma 4 can be used locally as an OCR/document-understanding engine by sending page images (from PDFs converted to images) plus an OCR prompt to a llama.cpp server. With the Gemma 4-specific llama.cpp parser and the model’s maximum visual token budget, the system often recovers document structure—especially tables—and can extract fields into JSON or output specific tables as Markdown. Receipts and financial filings show frequent success on totals, dates, and key table values, with formatting that looks coherent. Still, character-level OCR accuracy remains unreliable: punctuation can be wrong, and numeric extraction from figures can be close but incorrect. For production pipelines, it’s safer to use PDF-native text/layout extraction first and reserve Gemma 4 for cases where the PDF is scanned or layout understanding is the bottleneck.
- Why does the llama.cpp setup matter for Gemma 4 OCR tests?
- How does the visual token budget affect OCR/document understanding quality?
- What prompting and input-ordering practice improves results in multimodal OCR?
- Where does Gemma 4 perform well as an OCR substitute?
- What are the main failure modes when using Gemma 4 for OCR?
Review Questions
- When using Gemma 4 locally for OCR, what settings and build steps are necessary to ensure the model is called correctly and has enough visual capacity?
- Why might a vision-language model return correct table structure but still fail at strict character-level OCR accuracy?
- In a production document pipeline, when should Gemma 4 be used versus PDF-native extraction tools?
Key Points
- 1
Gemma 4 can be run locally for OCR-style extraction by converting PDFs into per-page images and sending those images with an OCR prompt to a llama.cpp server.
- 2
Using the Gemma 4-specific llama.cpp parser (from the updated master/head build) is important for correct model handling and recent fixes.
- 3
Gemma 4’s visual token budget is a key lever; setting image_min_tokens and image_max_tokens to the maximum allowed value improves table-heavy extraction.
- 4
Multimodal prompting works better when images are provided before text, matching recommended input ordering for vision-language models.
- 5
Gemma 4 often recovers document structure and can output JSON fields or specific tables as Markdown, with formatting that can be surprisingly accurate.
- 6
Character-level OCR remains unreliable: punctuation and exact values can be wrong or hallucinated, and figure-based numeric extraction can be close-but-wrong.
- 7
For ingestion at scale, a safer strategy is PDF-native text/layout extraction first, then use Gemma 4 selectively for scanned or layout-difficult documents.