Gemma 4 Local OCR Test with llama.cpp | How Accurate It Is for PDF Document Understanding (đź”´ Live)
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Gemma 4 can be run locally for OCR-style extraction by converting PDFs into per-page images and sending those images with an OCR prompt to a llama.cpp server.
Briefing
Gemma 4 can perform surprisingly strong document understanding for local OCR-style extraction, especially when the goal is to recover layout and structured fields from scanned pages, but it still struggles with strict character-level accuracy and can miss or hallucinate numbers in edge cases. The workflow runs Gemma 4 through llama.cpp with a dedicated Gemma 4 parser: it converts PDFs into page images (via PyMuPDF), base64-encodes each page, and sends the images alongside an OCR prompt to a local llama.cpp server. On receipts and financial filings, the model often returns correct totals, merchant names, and key table values, and it can extract specific tables as Markdown with formatting that preserves headers, columns, and alignment.
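A minimal sketch of the PDF-to-image step, assuming PyMuPDF (the fitz module); the function name and DPI value are illustrative, not taken from the video:

```python
import base64
import fitz  # PyMuPDF (pip install pymupdf)

def pdf_pages_to_base64(pdf_path: str, dpi: int = 200) -> list[str]:
    """Render each PDF page to an in-memory PNG and return one base64 string per page."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the given DPI
            pages.append(base64.b64encode(pix.tobytes("png")).decode("ascii"))
    return pages
```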
The setup hinges on using the correct llama.cpp build and the model’s visual token budget. A specialized Gemma 4 parser was added to the llama.cpp master branch, and the testing approach recommends updating to the latest master/head build (or building from source on macOS) so recent Gemma 4 fixes are included. On the inference side, Gemma 4 supports multiple visual token budgets; the testing uses the maximum allowed budget by setting image_min_tokens and image_max_tokens to that value. The server also requires a specific universal batch size to run reliably.
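As a rough illustration of those settings, a launch might look like the sketch below. All paths and numeric values are placeholders, and the --image-min-tokens/--image-max-tokens flag names (matching the image_min_tokens/image_max_tokens settings above) should be verified against `llama-server --help` on your build, since multimodal flags have changed across versions:

```python
import subprocess

MAX_VISUAL_TOKENS = "1024"  # placeholder: set to Gemma 4's maximum supported budget

subprocess.run([
    "llama-server",
    "-m", "models/gemma-4.gguf",               # placeholder model path
    "--mmproj", "models/mmproj-gemma-4.gguf",  # multimodal projector (placeholder path)
    "--image-min-tokens", MAX_VISUAL_TOKENS,   # pin min == max to force the
    "--image-max-tokens", MAX_VISUAL_TOKENS,   # maximum visual token budget
    "-ub", "2048",  # "universal"/physical batch size; the stream notes a specific value is required
    "--port", "8080",
], check=True)
```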
Prompting and input ordering matter. The OCR prompt used is taken from Google’s Gemma 4 OCR/document-understanding guidance, and the image input is sent first in the multimodal message. There’s also a practical limitation: larger Gemma 4 variants support text and images, while smaller variants additionally support audio. That means audio OCR/transcription workflows won’t map cleanly onto the larger local OCR setup.
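A minimal sketch of that image-first ordering, assuming llama.cpp's OpenAI-compatible /v1/chat/completions endpoint; the prompt string here is a stand-in, not a quote of Google's actual OCR prompt:

```python
import requests

def ocr_page(page_b64: str, prompt: str) -> str:
    """Send one base64-encoded page image plus an OCR prompt to the local server."""
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                # Image first, then text, per the ordering recommendation above.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions",
                         json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```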
In live extraction tests, Gemma 4 frequently captures document structure (tables, rows, and column formatting) and returns coherent JSON when asked for fields like wine items, totals, merchant, and date. When prompts request additional fields (such as tax), the model can still produce correct values in at least some cases. The most consistent weakness is the classic OCR failure mode for vision-language models: characters and punctuation can be replaced, removed, or hallucinated, even when the overall structure looks right. The streamer recommends using a PDF-native pipeline (e.g., extracting text from digital PDFs and using layout-aware parsing) before falling back to vision-language extraction.
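A sketch of a structured-extraction call on top of the hypothetical ocr_page() helper above; the field list mirrors the receipt tests, while the prompt wording and the fence-stripping guard are assumptions (vision-language models often wrap JSON in Markdown fences):

```python
import json

FIELDS_PROMPT = (
    "Extract the following fields from this receipt as JSON: "
    "items (name, quantity, price), total, tax, merchant, date. "
    "Return only the JSON object."
)

def extract_receipt_fields(page_b64: str) -> dict:
    raw = ocr_page(page_b64, FIELDS_PROMPT).strip()
    if raw.startswith("```"):
        # Strip a ```json ... ``` wrapper if the model added one.
        raw = raw.strip("`").removeprefix("json").strip()
    return json.loads(raw)  # may raise ValueError if the model drifts off-format
```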
Financial documents provide the clearest “structured OCR” win. For Apple quarterly/financial-style PDFs, the model extracts the company name, form type, and ticker correctly from multiple pages passed in a single prompt. More impressively, it can output a specific “note to revenue” table as Markdown while preserving header splits and alignment, and it can reconstruct a condensed consolidated balance sheets table with totals that appear accurate on first verification. The tests also show that model behavior can be inconsistent: some runs return empty responses, and long multi-page prompts can lead to extended “thinking” time that affects whether output appears at all.
A final stress test shows how easily strict numeric extraction can fail. When asked to extract a specific value from a figure/table in a research PDF (e.g., at T_eval = 1.2 and T_train = 0.9), Gemma 4 returns a close-but-wrong number, and even narrowing the input to a single page doesn't reliably fix the error. The takeaway is pragmatic: Gemma 4 is promising for local, table-heavy document understanding and structured extraction, but it's not a drop-in replacement for high-precision OCR when exact values must be guaranteed. For large-scale ingestion, the recommended approach is to run a fast PDF/layout pipeline first and apply Gemma 4 selectively where layout understanding or image-based content makes conventional extraction difficult.
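One way to sketch that routing with PyMuPDF is to check for a usable text layer before invoking the model; the character threshold here is an arbitrary assumption (digital PDFs usually have extractable text, scans usually don't):

```python
import fitz  # PyMuPDF

def needs_vlm(pdf_path: str, min_chars: int = 50) -> bool:
    """Route to Gemma 4 only when the PDF lacks a usable text layer."""
    with fitz.open(pdf_path) as doc:
        text = "".join(page.get_text() for page in doc)
    return len(text.strip()) < min_chars
```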
Cornell Notes
Gemma 4 can be used locally as an OCR/document-understanding engine by sending page images (from PDFs converted to images) plus an OCR prompt to a llama.cpp server. With the Gemma 4-specific llama.cpp parser and the model’s maximum visual token budget, the system often recovers document structure—especially tables—and can extract fields into JSON or output specific tables as Markdown. Receipts and financial filings show frequent success on totals, dates, and key table values, with formatting that looks coherent. Still, character-level OCR accuracy remains unreliable: punctuation can be wrong, and numeric extraction from figures can be close but incorrect. For production pipelines, it’s safer to use PDF-native text/layout extraction first and reserve Gemma 4 for cases where the PDF is scanned or layout understanding is the bottleneck.
- Why does the llama.cpp setup matter for Gemma 4 OCR tests?
- How does the visual token budget affect OCR/document understanding quality?
- What prompting and input-ordering practice improves results in multimodal OCR?
- Where does Gemma 4 perform well as an OCR substitute?
- What are the main failure modes when using Gemma 4 for OCR?
Review Questions
- When using Gemma 4 locally for OCR, what settings and build steps are necessary to ensure the model is called correctly and has enough visual capacity?
- Why might a vision-language model return correct table structure but still fail at strict character-level OCR accuracy?
- In a production document pipeline, when should Gemma 4 be used versus PDF-native extraction tools?
Key Points
- 1
Gemma 4 can be run locally for OCR-style extraction by converting PDFs into per-page images and sending those images with an OCR prompt to a llama.cpp server.
- 2
Using the Gemma 4-specific llama.cpp parser (from the updated master/head build) is important for correct model handling and recent fixes.
- 3
Gemma 4’s visual token budget is a key lever; setting image_min_tokens and image_max_tokens to the maximum allowed value improves table-heavy extraction.
- 4
Multimodal prompting works better when images are provided before text, matching recommended input ordering for vision-language models.
- 5
Gemma 4 often recovers document structure and can output JSON fields or specific tables as Markdown, with formatting that can be surprisingly accurate.
- 6
Character-level OCR remains unreliable: punctuation and exact values can be wrong or hallucinated, and figure-based numeric extraction can be close-but-wrong.
- 7
For ingestion at scale, a safer strategy is PDF-native text/layout extraction first, then use Gemma 4 selectively for scanned or layout-difficult documents.