
Mistral OCR - Multimodal & Multilingual OCR

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Mistral’s OCR API converts PDFs and images into page-by-page multimodal outputs, returning markdown for text/tables and preserved images for figures.

Briefing

Mistral has launched a non-open-source OCR system delivered through an API that turns scanned documents, PDFs, and images into structured, multimodal outputs—text interleaved with extracted tables and preserved images—supporting multiple languages and math. The core value is practical: the service returns content in a form that can be fed directly into downstream LLM workflows like RAG or visual question answering, without forcing users to manually convert layouts, tables, or figures.

The API accepts either full PDFs or single images (including base64-encoded images). For each page, it returns an object containing extracted markdown for text and tables, plus image elements where the layout calls for it. In the example of an academic paper, tables are converted into markdown while figures and other non-text elements remain as images, preserving the document’s structure and positioning. That “interleaving” matters because it keeps context intact when the output is later used for retrieval or for question answering over the document’s content.
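The reassembly step can be sketched as a small helper. This is a hypothetical function, assuming each page object carries a `markdown` field and an `images` list with `id` and `image_base64` keys, as in Mistral's published examples; the exact field names should be checked against the current API reference.

```python
def combine_pages(pages):
    """Join per-page OCR markdown, inlining extracted images as data URIs.

    Each page is assumed to look like:
      {"markdown": "...with image placeholders...",
       "images": [{"id": "img-0.jpeg", "image_base64": "<base64 data>"}]}
    """
    combined = []
    for page in pages:
        md = page["markdown"]
        for img in page.get("images", []):
            # Replace the placeholder link target with an inline data URI
            # so the combined markdown renders standalone.
            md = md.replace(
                "(" + img["id"] + ")",
                "(data:image/jpeg;base64," + img["image_base64"] + ")",
            )
        combined.append(md)
    return "\n\n".join(combined)
```

Keeping the images inline like this preserves the interleaved structure, so a downstream multimodal model still sees figures in their original position relative to the text.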

Mistral’s pitch centers on two capabilities: multilingual OCR and multimodal extraction. The OCR output includes math converted from the page, and it handles scripts beyond English—demonstrations include Hindi and Arabic, with the system still extracting usable text even when the source is messy or misaligned. Benchmarks shown in the release materials place it near the top against other OCR competitors, including results that extend across languages such as Chinese, Hindi, and Russian.

Pricing is set at $1 per thousand pages, with batch inference described as roughly half price (at the cost of slower turnaround). For organizations with privacy or data-upload constraints, Mistral also markets an on-prem option, claiming throughput of up to 2,000 pages per minute on a single node. The monetization model is consistent with that: license the OCR for on-prem deployments rather than distributing the model weights.
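As a back-of-envelope check, the quoted rates work out as follows (assuming exactly the $1 per thousand pages list price and a flat 50% batch discount; actual billing granularity may differ):

```python
def ocr_cost_usd(pages: int, batch: bool = False) -> float:
    """Estimate OCR cost at $1 per 1,000 pages; batch at roughly half price."""
    price_per_1000 = 1.0  # standard API rate in USD
    if batch:
        price_per_1000 /= 2  # batch inference trades speed for cost
    return pages / 1000 * price_per_1000
```

So a 10,000-page archive would run about $10 via the standard API, or about $5 batched if turnaround time is not critical.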

A key usability feature is structured output. Users can supply a document plus a custom prompt and request JSON-like results that can drive automation—such as triggering function calls in a workflow once fields are extracted. In the provided code walkthrough, helper functions assemble the OCR results into readable text or return images for further processing. The workflow also demonstrates how to enforce schemas using Pydantic classes so that repeated document processing yields consistent keys (e.g., file name, topics, language, and OCR contents).
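The schema-enforcement idea can be sketched with a Pydantic model. The field names below mirror the keys mentioned in the walkthrough (file name, topics, language, OCR contents); they are illustrative, not an official schema.

```python
from pydantic import BaseModel

class StructuredOCR(BaseModel):
    """Hypothetical schema so repeated extractions share the same keys."""
    file_name: str
    topics: list[str]
    language: str
    ocr_contents: dict

# Validating the model's JSON reply: missing or malformed keys raise a
# ValidationError instead of silently producing inconsistent records.
raw = (
    '{"file_name": "receipt.jpg", "topics": ["receipt"], '
    '"language": "English", "ocr_contents": {"total": "42.00"}}'
)
record = StructuredOCR.model_validate_json(raw)
```

Running every extraction through the same model is what makes processing thousands of similar documents tractable: each record lands in a database with identical keys.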

The practical experiments go further: a receipt image is processed into a structured JSON response, and another test targets Thai, showing that the system can identify the language and extract structured fields even when the Thai characters are represented in a specific Unicode form. Batch processing notebooks show how to queue large volumes for lower cost, trading speed for efficiency.

Overall, the OCR system is positioned as a production-oriented tool for extracting document content while preserving layout cues—especially tables and figures—at a price that’s framed as reasonable for an API service. The trade-offs are clear: it’s not open-source, and like any LLM-based OCR pipeline, it can still produce occasional structural errors or hallucinations, so accuracy needs validation against specific document types and quality requirements.

Cornell Notes

Mistral’s OCR offering is delivered via an API (not open-source) that converts PDFs and images into multimodal, page-by-page outputs: extracted text and tables in markdown, plus preserved images where layout matters. It supports multilingual OCR (including Hindi, Arabic, Chinese, Russian, and Thai) and can convert math into OCR text. The results are designed to plug into LLM pipelines for RAG and visual question answering, with options for structured JSON outputs. Developers can enforce consistent schemas using prompts and Pydantic models, making it easier to extract fields from many similar documents. Batch inference reduces cost (about half) while slowing turnaround.

What does Mistral’s OCR API return, and why is the “interleaving” of text and images important?

For each page, the API returns an object containing extracted markdown for text and tables, along with image elements for figures or regions where the layout should be preserved. In the paper example, tables become markdown while figures remain images, keeping the document’s structure intact. That structure is useful when the output is later fed into an LLM for retrieval (RAG) or visual question answering, because the context around tables and figures stays aligned rather than being flattened into plain text.

How does the system handle multilingual and messy inputs?

The release materials highlight OCR in multiple scripts, including Hindi and Arabic, and benchmark results across languages such as Chinese, Hindi, and Russian. It’s also shown extracting usable content even when text is not clean or alignment is off, indicating the model is trained for OCR robustness rather than only perfect scans.

What is the pricing model and how does batch inference change it?

The API pricing is $1 per thousand pages. Batch inference is framed as about half that price per page—roughly twice the pages for the same cost—while trading off speed, because jobs are queued and results are returned only after processing completes.

How do developers get structured outputs suitable for automation?

Users can provide the document (PDF or image) along with a custom prompt requesting structured output such as JSON. The walkthrough demonstrates using prompts that instruct the model to return strictly JSON with no extra commentary. It also shows schema enforcement using Pydantic classes so repeated extractions from many similar documents share the same keys (e.g., file name, topics, language, and OCR contents).
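Even with a prompt that demands strictly JSON, models sometimes wrap the payload in a markdown code fence anyway, so a small defensive parser is a common pattern. This is a generic sketch, not part of Mistral's SDK:

```python
import json

def parse_strict_json(text: str) -> dict:
    """Parse a model reply that should be pure JSON, tolerating a
    markdown code fence the model may have added despite instructions."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (e.g. a json language tag)
        # and everything from the closing fence onward.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)
```

Pairing a parser like this with a Pydantic schema gives two layers of safety: the JSON must parse, and the parsed object must carry the expected keys.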

How can the OCR output be used for downstream tasks beyond plain text extraction?

Helper functions can combine OCR results into readable text for direct injection into an LLM for RAG. Another helper retrieves images so a second model can generate descriptions or layout-aware outputs. The workflow also demonstrates converting OCR results into structured data (like receipt fields) that can be stored in a database or used to trigger function calls in a chain.

Review Questions

  1. What specific elements does the OCR output preserve (e.g., tables, figures, page structure), and how does that affect downstream LLM tasks?
  2. How would you design a prompt and schema (e.g., with Pydantic) to extract consistent fields from 1,000 receipts?
  3. What trade-offs come with batch inference versus standard API calls, and how would you decide which to use?

Key Points

  1. Mistral’s OCR API converts PDFs and images into page-by-page multimodal outputs, returning markdown for text/tables and preserved images for figures.

  2. The service is designed for LLM workflows like RAG and visual question answering by keeping document structure rather than flattening everything into plain text.

  3. Multilingual OCR is a core feature, with demonstrations and benchmarks spanning languages such as Hindi, Arabic, Chinese, Russian, and Thai.

  4. Math OCR is included, converting mathematical content into extractable text rather than leaving it as an image-only region.

  5. Structured JSON outputs can be requested via prompts, and schema consistency can be enforced using Pydantic models for large-scale extraction.

  6. Pricing is $1 per thousand pages, while batch inference is positioned as cheaper but slower due to queued processing.

  7. On-prem deployment is marketed via licensing, with claims of up to 2,000 pages per minute on a single node for privacy-sensitive use cases.

Highlights

The API returns extracted content in a multimodal, interleaved format: tables become markdown while figures remain images, preserving layout for downstream reasoning.
Multilingual OCR extends beyond English, with examples in Hindi, Arabic, and Thai and benchmark coverage across multiple languages.
Structured extraction supports automation: prompts can force strict JSON, and Pydantic schemas help keep keys consistent across thousands of documents.

Topics

  • Multimodal OCR
  • Multilingual OCR
  • Structured JSON Extraction
  • On-Prem OCR
  • Batch Inference
