GLM-OCR (0.9B) - Local OCR Test | OCR, Document Extraction, Table Recognition
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
GLM-OCR is a two-stage OCR system that combines document layout analysis with character-level recognition, and it is drawing attention because it delivers strong real-world extraction quality while staying small enough to run locally. The model's practical appeal hinges on its size (about 0.9B parameters) and its MIT license, which make it easier to deploy in private or offline document pipelines. Benchmarks position it as especially effective for complex tables, code-like content, and visual elements such as figures and charts, and the local test in the transcript aims to validate that promise outside of curated benchmark settings.
The testing setup runs GLM-OCR in a Google Colab notebook on a T4 GPU with roughly 16 GB of VRAM. The model weights are about 2.2 GB, and the saved tensors file is reported at around 2.7 GB, small enough to fit comfortably on common consumer GPUs. The workflow uses the Hugging Face Transformers library (the transcript notes it requires a recent release, version 5.1), with an auto processor for image preprocessing and an auto model for image-to-text generation. The model is loaded in full 16-bit floating-point precision and moved entirely onto the GPU for inference. Quantized variants are also mentioned as available via "OA" and "CP" (as the transcript renders the names), which could speed up inference further.
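As a rough sketch of that setup, loading might look like the following. This assumes the standard Transformers auto classes; the checkpoint id is a placeholder, since the transcript doesn't spell out the Hugging Face repo name.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Placeholder id -- substitute the actual GLM-OCR checkpoint on Hugging Face.
MODEL_ID = "GLM-OCR"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    dtype=torch.float16,      # fp16, as in the video ('torch_dtype' in older releases)
    trust_remote_code=True,
).to("cuda")                  # the whole model fits in a T4's ~16 GB of VRAM
```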
GLM-OCR's inference is prompt-driven and supports a limited set of prompt types: general text recognition, table recognition, and custom extraction via a schema that returns results in JSON. A single recognition pass on a document takes about 53 seconds initially, with subsequent runs faster, an important operational detail for anyone building batch extraction.
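A single pass might be wired up as in the sketch below, reusing the `processor` and `model` loaded above. The chat-template call and the prompt strings are assumptions; the video names the three prompt types but doesn't reproduce the exact wording the model expects.

```python
import torch
from PIL import Image

def run_ocr(image_path: str, prompt: str, max_new_tokens: int = 2048) -> str:
    """One recognition pass: an image plus a prompt in, decoded text out."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)

# Illustrative prompt strings -- the exact phrasing the model expects
# is not shown in the transcript.
page_text = run_ocr("report_page.png", "Text Recognition:")
table_out = run_ocr("report_page.png", "Table Recognition:")
```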
On document quality, the results are mixed but often impressive. For a document with a title and formatted text, the extracted text is described as nearly perfect across a manual check of roughly 60–70% of the content, with only minor punctuation issues. Tables, however, don't always come out as clean Markdown tables; instead, the transcript shows table content appearing in less structured text form. In a financial report laid out as one giant table, the extraction is largely accurate for values and spacing, but some column headers (including year labels such as "2025") are missing or misplaced, suggesting table reconstruction still has edge cases.
A receipt test, described as skewed but otherwise well laid out, produces strong results for key fields such as unit, address, telephone number, date, time, and even card numbers. The most compelling capability is custom field extraction: by specifying fields such as address, telephone, date, time, order, and total, the model returns JSON whose values match the earlier plain-text extraction, though the total may omit the currency symbol.
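In code, that custom extraction might look like the sketch below, reusing `run_ocr` from the previous block. The schema wording is an assumption; the video shows only a field list going in and JSON coming back.

```python
import json

# The prompt format below is an assumption about the schema syntax.
fields = ["address", "telephone", "date", "time", "order", "total"]
prompt = "Extract the following fields and return JSON: " + ", ".join(fields)

raw = run_ocr("receipt.jpg", prompt)
record = json.loads(raw)

# The video noted the total can come back without a currency symbol,
# so normalize amounts downstream rather than trusting the formatting.
print(record.get("total"))
```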
Finally, a fake ID card shows mostly correct extraction but includes small errors, such as the date of birth being off by a couple of years (1908 vs. 1906). Overall, the transcript frames GLM-OCR as a strong performer for a relatively small model, with practical deployment options ranging from full-precision local runs to quantized inference for speed.
Cornell Notes
GLM-OCR runs OCR in two stages: first it analyzes page layout (titles, paragraphs, tables, figures), then it performs character recognition within those layout elements. The model is small (about 0.9B parameters) and licensed under MIT, making it feasible for local or private document pipelines. In a Colab test on a T4 GPU, the model produced high-quality text and field extraction for documents like financial reports, receipts, and ID cards. Tables are often extracted accurately but may not render as clean Markdown, and some headers can be missing in large table layouts. Custom JSON extraction works by providing a schema in the prompt, though formatting details like currency symbols may be inconsistent.
- How does GLM-OCR's two-stage pipeline work, and why does that matter for real documents?
- What hardware and software setup was used to test GLM-OCR, and what does it imply for local deployment?
- How did GLM-OCR perform on tables, and what limitation showed up?
- What did the receipt test reveal about OCR robustness?
- How does custom data extraction work, and what was the main caveat?
- What kinds of errors appeared on the ID card test?
Review Questions
- What are the two stages in GLM-OCR, and how does the first stage influence the second?
- Why might table outputs fail to appear as Markdown even when the numeric content is correct?
- When using custom JSON extraction, what field formatting issue was observed with the receipt total?
Key Points
1. GLM-OCR uses a two-stage approach: page layout analysis followed by OCR recognition within detected layout elements.
2. The model is about 0.9B parameters and MIT-licensed, making it practical for local/private document extraction.
3. A T4 GPU test used full 16-bit precision with Transformers 5.1 and Hugging Face auto processor/model components.
4. Custom extraction works via a prompt schema and returns JSON, enabling targeted field extraction beyond plain text.
5. Table recognition can be accurate for values and spacing but may not produce clean Markdown and can miss headers in complex tables.
6. Receipt OCR performed strongly even with skew, including extraction of contact details and card numbers.
7. ID card extraction was mostly correct but showed occasional date errors, indicating remaining edge cases for sensitive fields.