Nanonets OCR-s
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Nanonets OCR Small is a ~3B open-weight OCR model built on Qwen 2.5-VL, aimed at structured document extraction rather than plain text alone.
Briefing
A newly released “OCR Small” model from Nanonets, built on the open-weight Qwen 2.5-VL base, turns a roughly 3B-parameter vision-language model into a multi-task document reader that goes beyond plain text. The standout point isn’t just accuracy on OCR; it’s the ability to extract structured outputs for specialized needs like LaTeX equations, signatures, watermarks, checkboxes, and complex tables, all while staying small enough to run on modest hardware.
The model’s feature set is framed around six capabilities: LaTeX equation recognition, intelligent image description, signature detection, watermark extraction, smart checkbox handling, and table extraction. In practice, that means it can pull equations out of scanned or rendered documents rather than leaving them as images, detect signatures well enough to identify a name (the example shown reads “J Walker”), and flag watermarks so they can be filtered and extracted. It also returns tables as an HTML-like structure, which is useful because downstream systems (including retrieval-augmented generation pipelines) can query that structure instead of treating the table as an opaque image; a small parsing sketch follows below.
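To illustrate why the HTML-like table output is convenient downstream, here is a minimal sketch that flattens each table row into a plain-text chunk that could be embedded or indexed for retrieval. It assumes the OCR output contains standard `<table>`/`<tr>`/`<th>`/`<td>` markup and that `beautifulsoup4` is installed; the sample HTML and the function name are hypothetical, not part of Nanonets’ tooling.

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def table_rows_to_text(model_output_html: str) -> list[str]:
    """Flatten HTML tables from OCR output into row-level text chunks for RAG indexing."""
    soup = BeautifulSoup(model_output_html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        # Treat <th> cells (if any) as column headers for the whole table.
        header = [th.get_text(strip=True) for th in table.find_all("th")]
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if not cells:
                continue  # skip header-only or empty rows
            if header and len(header) == len(cells):
                chunks.append("; ".join(f"{h}: {c}" for h, c in zip(header, cells)))
            else:
                chunks.append("; ".join(cells))
    return chunks


# Hypothetical sample of the HTML-like table structure described above.
sample = "<table><tr><th>Item</th><th>Amount</th></tr><tr><td>Revenue</td><td>$1,200</td></tr></table>"
print(table_rows_to_text(sample))  # ['Item: Revenue; Amount: $1,200']
```

Each chunk pairs cell values with their column headers, so a retriever can match a question like “What was revenue?” against text rather than against an opaque table image.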
A key driver behind the performance is dataset curation. Nanonets trained on a dataset of 250,000 pages selected to represent research papers, financial documents, legal documents, healthcare forms, tax forms, receipts, and invoices, all document types where tables, equations, signatures, and watermarks are common. The training mix is described as both synthetically generated and manually annotated data, with the emphasis on enhancing those specific visual and textual artifacts. That focus helps explain why the model looks strong on structured documents while not being positioned as a general-purpose handwriting OCR system.
The comparison against Mistral highlights the practical difference between “text extraction” and “document understanding.” In examples involving equations and watermarks, the Nanonets OCR Small output is more directly usable: equations are extracted, watermarks are detected and labeled, and tables are returned in structured form. The tradeoff is that some alternative approaches preserve images for later multimodal processing, which can be valuable in certain pipelines but tends to be less convenient for immediate RAG-style retrieval when the goal is to embed text descriptions rather than image blobs.
Beyond the headline capabilities, the model is available on Hugging Face as open weights, enabling organizations to fine-tune or deploy privately without sending sensitive documents to external APIs. Testing described in the transcript suggests it performs reasonably on non-English characters (including umlauts and Japanese text), and it handles financial statement tables (profit and loss) with consistent dollar-sign placement and section alignment. Still, there’s no clear evidence of a dedicated multilingual training claim, so real-world performance for languages like Arabic is presented as an open question.
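Because the weights are public, a minimal local-inference sketch looks roughly like the following. It assumes the Hugging Face repo id `nanonets/Nanonets-OCR-s`, a recent `transformers` release with `AutoModelForImageTextToText`, and a hypothetical input file `invoice_page.png`; the prompt wording and generation settings are illustrative, not the vendor’s recommended usage.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "nanonets/Nanonets-OCR-s"  # assumed Hugging Face repo id

# Load the processor and the ~3B vision-language model (device_map="auto" needs accelerate).
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

image = Image.open("invoice_page.png")  # hypothetical scanned page
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the text from this page. Return tables as HTML and equations as LaTeX."},
    ],
}]

# Build the chat prompt, run generation, and decode only the newly generated tokens.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
text = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(text)
```

At ~3B parameters the model fits in fp16/bf16 on a T4-class GPU, which is consistent with the hardware discussed in the video; no external API call is involved, so sensitive documents stay local.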
Overall, the release reinforces a broader shift: OCR is moving toward smaller, specialized open models that can be run on consumer or cloud GPUs (including a T4-class setup) and then adapted to specific document workflows. With Qwen 2.5-VL aging and newer Qwen generations likely on the horizon, the expectation is that future OCR models will get even smaller and faster, making latency and cost planning a near-term priority for document extraction systems.
Cornell Notes
Nanonets OCR Small is a ~3B vision-language model for document extraction built on the open-weight Qwen 2.5-VL base. Its main value is not just reading text, but producing structured outputs for tasks like LaTeX equation recognition, signature detection, watermark extraction, smart checkbox handling, and complex table extraction (often in an HTML-like layout). That structure matters for downstream RAG and information extraction systems because it enables querying and embedding meaningful text rather than treating tables or equations as images. Performance is attributed to a curated training set of 250,000 pages spanning research, finance, legal, healthcare forms, receipts, and invoices, with emphasis on tables, equations, signatures, and watermarks. The model is available on Hugging Face as open weights, supporting private, on-prem deployments and customization.
What makes Nanonets OCR Small different from “basic OCR” systems?
Why does dataset curation matter so much for this model’s performance?
How does structured table extraction change what downstream systems can do?
What’s the practical difference between extracting text/labels vs returning images for later multimodal processing?
How feasible is deployment given the model size and hardware mentioned?
What uncertainties remain about language coverage and handwriting?
Review Questions
- Which of OCR Small’s six capabilities would be most directly useful for building a RAG system that needs to answer questions about a document’s table contents?
- How does the training dataset composition (250,000 pages across specific document types, plus synthetic and manual annotation) relate to the model’s strengths and likely weaknesses?
- Why might returning an image for later multimodal interpretation be less convenient than extracting structured text/labels for immediate downstream use?
Key Points
1. Nanonets OCR Small is a ~3B open-weight OCR model built on Qwen 2.5-VL, aimed at structured document extraction rather than plain text alone.
2. The model’s core strengths include LaTeX equation recognition, signature detection, watermark extraction, smart checkbox handling, and complex table extraction.
3. A curated 250,000-page training set spanning research, finance, legal, healthcare forms, receipts, and invoices is central to its performance on tables, equations, signatures, and watermarks.
4. Structured outputs (including HTML-like table layouts and labeled artifacts) make it more directly usable for RAG and information extraction pipelines than approaches that only return images.
5. The open-weight release enables private/on-prem deployments and customization without uploading sensitive documents to third-party APIs.
6. The transcript suggests some multilingual character handling (e.g., umlauts and Japanese), but it’s not presented as a guaranteed multilingual OCR system.
7. Handwriting OCR is not the model’s main target; signatures may work, but general handwritten text likely needs separate evaluation.