GLM-OCR (0.9B) - Local OCR Test | OCR, Document Extraction, Table Recognition
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
GLM-OCR is a two-stage OCR system that combines document layout analysis with character-level recognition, and it is drawing attention because it delivers strong real-world extraction quality while staying small enough to run locally. The model's practical appeal hinges on its size (about 0.9B parameters) and its MIT license, which make it easier to deploy in private or offline document pipelines. Benchmarks position it as especially effective for complex tables, code-like content, and visual elements such as figures and charts, and the local test in the transcript aims to validate that promise outside of curated benchmark settings.
The testing setup runs GLM-OCR in a Google Colab notebook on a T4 GPU with roughly 16 GB of VRAM. The model weights are about 2.2 GB, and the saved tensors file is reported at around 2.7 GB, small enough to fit comfortably on common consumer GPUs. The workflow uses the Hugging Face Transformers library (the transcript notes it requires a recent release, version 5.1), with an auto processor for image preprocessing and an auto model for image-to-text generation. The model is loaded in full 16-bit floating-point precision and moved entirely onto the GPU for inference. Quantized variants are also mentioned as available via "OA" and "CP" (as the transcript renders the names), which could speed up inference further.
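As a rough sketch of that setup, loading might look like the following. This assumes the standard Transformers auto classes; the checkpoint id is a placeholder, since the transcript doesn't spell out the Hugging Face repo name.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Placeholder id -- substitute the actual GLM-OCR checkpoint on Hugging Face.
MODEL_ID = "GLM-OCR"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    dtype=torch.float16,      # fp16, as in the video ('torch_dtype' in older releases)
    trust_remote_code=True,
).to("cuda")                  # the whole model fits in a T4's ~16 GB of VRAM
```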
GLM-OCR's inference is prompt-driven and supports a limited set of prompt types: general text recognition, table recognition, and custom extraction via a schema that returns results in JSON. A single recognition pass on a document takes about 53 seconds initially, with subsequent runs faster, an important operational detail for anyone building batch extraction.
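A single pass might be wired up as in the sketch below, reusing the `processor` and `model` loaded above. The chat-template call and the prompt strings are assumptions; the video names the three prompt types but doesn't reproduce the exact wording the model expects.

```python
import torch
from PIL import Image

def run_ocr(image_path: str, prompt: str, max_new_tokens: int = 2048) -> str:
    """One recognition pass: an image plus a prompt in, decoded text out."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Illustrative prompt strings -- the exact phrasing the model expects
# is not shown in the transcript.
page_text = run_ocr("report_page.png", "Text Recognition:")
table_out = run_ocr("report_page.png", "Table Recognition:")
```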
On document quality, the results are mixed but often impressive. For a document with a title and formatted text, the extracted text is described as nearly perfect across a manual check of roughly 60–70% of the content, with only minor punctuation issues. Tables, however, don't always come out as clean Markdown tables; instead, the transcript shows table content appearing in less structured text form. In a financial report laid out as one giant table, the extraction is largely accurate for values and spacing, but some column headers (including year labels such as "2025") are missing or misplaced, suggesting table reconstruction still has edge cases.
A receipt test, described as skewed but otherwise well laid out, produces strong results for key fields such as unit, address, telephone number, date, time, and even card numbers. The most compelling capability is custom field extraction: by specifying fields such as address, telephone, date, time, order, and total, the model returns JSON whose values match the earlier plain-text extraction, though the total may omit the currency symbol.
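In code, that custom extraction might look like the sketch below, reusing `run_ocr` from the previous block. The schema wording is an assumption; the video shows only a field list going in and JSON coming back.

```python
import json

# The prompt format below is an assumption about the schema syntax.
fields = ["address", "telephone", "date", "time", "order", "total"]
prompt = "Extract the following fields and return JSON: " + ", ".join(fields)

raw = run_ocr("receipt.jpg", prompt)
record = json.loads(raw)

# The video noted the total can come back without a currency symbol,
# so normalize amounts downstream rather than trusting the formatting.
print(record.get("total"))
```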
Finally, a fake ID card shows mostly correct extraction but includes small errors, such as the date of birth being off by a couple of years (1908 vs. 1906). Overall, the transcript frames GLM-OCR as a strong performer for a relatively small model, with practical deployment options ranging from full-precision local runs to quantized inference for speed.
Cornell Notes
GLM-OCR runs OCR in two stages: first it analyzes page layout (titles, paragraphs, tables, figures), then it performs character recognition within those layout elements. The model is small (about 0.9B parameters) and licensed under MIT, making it feasible for local or private document pipelines. In a Colab test on a T4 GPU, the model produced high-quality text and field extraction for documents like financial reports, receipts, and ID cards. Tables are often extracted accurately but may not render as clean Markdown, and some headers can be missing in large table layouts. Custom JSON extraction works by providing a schema in the prompt, though formatting details like currency symbols may be inconsistent.
- How does GLM-OCR's two-stage pipeline work, and why does that matter for real documents?
- What hardware and software setup was used to test GLM-OCR, and what does it imply for local deployment?
- How did GLM-OCR perform on tables, and what limitation showed up?
- What did the receipt test reveal about OCR robustness?
- How does custom data extraction work, and what was the main caveat?
- What kinds of errors appeared on the ID card test?
Review Questions
- What are the two stages in GLM-OCR, and how does the first stage influence the second?
- Why might table outputs fail to appear as Markdown even when the numeric content is correct?
- When using custom JSON extraction, what field formatting issue was observed with the receipt total?
Key Points
1. GLM-OCR uses a two-stage approach: page layout analysis followed by OCR recognition within detected layout elements.
2. The model is about 0.9B parameters and MIT-licensed, making it practical for local/private document extraction.
3. A T4 GPU test used full 16-bit precision with Transformers 5.1 and Hugging Face auto processor/model components.
4. Custom extraction works via a prompt schema and returns JSON, enabling targeted field extraction beyond plain text.
5. Table recognition can be accurate for values and spacing but may not produce clean Markdown and can miss headers in complex tables.
6. Receipt OCR performed strongly even with skew, including extraction of contact details and card numbers.
7. ID card extraction was mostly correct but showed occasional date errors, indicating remaining edge cases for sensitive fields.