Nanonets OCR-s
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Nanonets OCR Small is a ~3B open-weight OCR model built on Qwen 2.5-VL, aimed at structured document extraction rather than plain text alone.
Briefing
A newly released “OCR Small” model from Nanonets, built on the open-weight Qwen 2.5-VL base, turns a roughly 3B-parameter vision-language model into a multi-task document reader that goes beyond plain text. The standout point isn’t just accuracy on OCR; it’s the ability to extract structured outputs for specialized needs like LaTeX equations, signatures, watermarks, checkboxes, and complex tables, all while staying small enough to run on modest hardware.
The model’s feature set is framed around six capabilities: LaTeX equation recognition, intelligent image description, signature detection, watermark extraction, smart checkbox handling, and table extraction. In practice, that means it can pull equations out of scanned or rendered documents rather than leaving them as images, detect signatures well enough to identify a name (the example shown reads “J Walker”), and flag watermarks so they can be filtered and extracted. It also returns tables as an HTML-like structure, which is useful because downstream systems (including retrieval-augmented generation pipelines) can query that structure instead of treating the table as an opaque image; a small parsing sketch follows below.
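To illustrate why the HTML-like table output is convenient downstream, here is a minimal sketch that flattens each table row into a plain-text chunk that could be embedded or indexed for retrieval. It assumes the OCR output contains standard `<table>`/`<tr>`/`<th>`/`<td>` markup and that `beautifulsoup4` is installed; the sample HTML and the function name are hypothetical, not part of Nanonets’ tooling.

```python
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed


def table_rows_to_text(model_output_html: str) -> list[str]:
    """Flatten HTML tables from OCR output into row-level text chunks for RAG indexing."""
    soup = BeautifulSoup(model_output_html, "html.parser")
    chunks = []
    for table in soup.find_all("table"):
        # Treat <th> cells (if any) as column headers for the whole table.
        header = [th.get_text(strip=True) for th in table.find_all("th")]
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if not cells:
                continue  # skip header-only or empty rows
            if header and len(header) == len(cells):
                chunks.append("; ".join(f"{h}: {c}" for h, c in zip(header, cells)))
            else:
                chunks.append("; ".join(cells))
    return chunks


# Hypothetical sample of the HTML-like table structure described above.
sample = "<table><tr><th>Item</th><th>Amount</th></tr><tr><td>Revenue</td><td>$1,200</td></tr></table>"
print(table_rows_to_text(sample))  # ['Item: Revenue; Amount: $1,200']
```

Each chunk pairs cell values with their column headers, so a retriever can match a question like “What was revenue?” against text rather than against an opaque table image.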
A key driver behind the performance is dataset curation. Nanonets trained on a dataset of 250,000 pages selected to represent research papers, financial documents, legal documents, healthcare forms, tax forms, receipts, and invoices, all document types where tables, equations, signatures, and watermarks are common. The training mix is described as both synthetically generated and manually annotated data, with the emphasis on enhancing those specific visual and textual artifacts. That focus helps explain why the model looks strong on structured documents while not being positioned as a general-purpose handwriting OCR system.
The comparison against Mistral highlights the practical difference between “text extraction” and “document understanding.” In examples involving equations and watermarks, the Nanonets OCR Small output is more directly usable: equations are extracted, watermarks are detected and labeled, and tables are returned in structured form. The tradeoff is that some alternative approaches preserve images for later multimodal processing, which can be valuable in certain pipelines but tends to be less convenient for immediate RAG-style retrieval when the goal is to embed text descriptions rather than image blobs.
Beyond the headline capabilities, the model is available on Hugging Face as open weights, enabling organizations to fine-tune or deploy privately without sending sensitive documents to external APIs. Testing described in the transcript suggests it performs reasonably on non-English characters (including umlauts and Japanese text), and it handles financial statement tables (profit and loss) with consistent dollar-sign placement and section alignment. Still, there’s no clear evidence of a dedicated multilingual training claim, so real-world performance for languages like Arabic is presented as an open question.
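Because the weights are public, a minimal local-inference sketch looks roughly like the following. It assumes the Hugging Face repo id `nanonets/Nanonets-OCR-s`, a recent `transformers` release with `AutoModelForImageTextToText`, and a hypothetical input file `invoice_page.png`; the prompt wording and generation settings are illustrative, not the vendor’s recommended usage.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "nanonets/Nanonets-OCR-s"  # assumed Hugging Face repo id

# Load the processor and the ~3B vision-language model (device_map="auto" needs accelerate).
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

image = Image.open("invoice_page.png")  # hypothetical scanned page
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the text from this page. Return tables as HTML and equations as LaTeX."},
    ],
}]

# Build the chat prompt, run generation, and decode only the newly generated tokens.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=2048)
text = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(text)
```

At ~3B parameters the model fits in fp16/bf16 on a T4-class GPU, which is consistent with the hardware discussed in the video; no external API call is involved, so sensitive documents stay local.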
Overall, the release reinforces a broader shift: OCR is moving toward smaller, specialized open models that can be run on consumer or cloud GPUs (including a T4-class setup) and then adapted to specific document workflows. With Qwen 2.5-VL aging and newer Qwen generations likely on the horizon, the expectation is that future OCR models will get even smaller and faster, making latency and cost planning a near-term priority for document extraction systems.
Cornell Notes
Nanonets OCR Small is a ~3B vision-language model for document extraction built on the open-weight Qwen 2.5-VL base. Its main value is not just reading text, but producing structured outputs for tasks like LaTeX equation recognition, signature detection, watermark extraction, smart checkbox handling, and complex table extraction (often in an HTML-like layout). That structure matters for downstream RAG and information extraction systems because it enables querying and embedding meaningful text rather than treating tables or equations as images. Performance is attributed to a curated training set of 250,000 pages spanning research, finance, legal, healthcare forms, receipts, and invoices, with emphasis on tables, equations, signatures, and watermarks. The model is available on Hugging Face as open weights, supporting private, on-prem deployments and customization.
What makes Nanonets OCR Small different from “basic OCR” systems?
Why does dataset curation matter so much for this model’s performance?
How does structured table extraction change what downstream systems can do?
What’s the practical difference between extracting text/labels vs returning images for later multimodal processing?
How feasible is deployment given the model size and hardware mentioned?
What uncertainties remain about language coverage and handwriting?
Review Questions
- Which of OCR Small’s six capabilities would be most directly useful for building a RAG system that needs to answer questions about a document’s table contents?
- How does the training dataset composition (250,000 pages across specific document types, plus synthetic and manual annotation) relate to the model’s strengths and likely weaknesses?
- Why might returning an image for later multimodal interpretation be less convenient than extracting structured text/labels for immediate downstream use?
Key Points
1. Nanonets OCR Small is a ~3B open-weight OCR model built on Qwen 2.5-VL, aimed at structured document extraction rather than plain text alone.
2. The model’s core strengths include LaTeX equation recognition, signature detection, watermark extraction, smart checkbox handling, and complex table extraction.
3. A curated 250,000-page training set spanning research, finance, legal, healthcare forms, receipts, and invoices is central to its performance on tables, equations, signatures, and watermarks.
4. Structured outputs (including HTML-like table layouts and labeled artifacts) make it more directly usable for RAG and information extraction pipelines than approaches that only return images.
5. The open-weight release enables private/on-prem deployments and customization without uploading sensitive documents to third-party APIs.
6. The transcript suggests some multilingual character handling (e.g., umlauts and Japanese), but it’s not presented as a guaranteed multilingual OCR system.
7. Handwriting OCR is not the model’s main target; signatures may work, but general handwritten text likely needs separate evaluation.