
100% Local PDF OCR with Docling and Ollama | PDF to Markdown with VLM (Nanonets-OCR-s)

Venelin Valkov·
4 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Docling can convert PDFs to Markdown using a visual language model by swapping out the OCR pipeline for a VLM pipeline.

Briefing

A local, fully self-hosted pipeline can convert PDFs into Markdown by swapping out traditional OCR for a visual language model—specifically Docling paired with an Ollama-served Nanonets OCR model. The practical takeaway is that table- and layout-aware extraction can improve when a model is prompted to produce structured output, but accuracy still depends heavily on document type and what “good” means for the task.

The workflow starts with Docling’s document conversion pipeline, configured to run OCR via a VLM (visual language model) rather than a classical OCR engine. The Nanonets OCR model used is a fine-tuned version of Qwen 2.5, loaded in a quantized form and served locally through Ollama. The model is accessed via a local Ollama URL, and the pipeline is configured with parameters such as a prompt, timeout, image scaling, and a response format set to Markdown. A key implementation detail is enabling remote services in the pipeline options and passing the VLM options (model identifier plus the instruction prompt) so Docling can call the model during conversion.
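Under the hood, this amounts to sending each rendered page image plus the instruction prompt to the local Ollama endpoint. The sketch below builds such a request body using only the standard library (it does not use Docling itself, and nothing is sent). The endpoint path, the `nanonets-ocr-s` model tag, and the exact prompt wording are assumptions for illustration, not confirmed by the video.

```python
import json

# Assumed local endpoint and model tag; adjust to your own Ollama setup.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"
MODEL = "nanonets-ocr-s"  # hypothetical tag for the quantized Nanonets model

# The kind of instruction prompt the pipeline passes to the VLM per page.
PROMPT = "Perform OCR on this page and return the result as Markdown."

def build_page_request(image_b64: str) -> dict:
    """Build an OpenAI-compatible chat request body for one rendered page."""
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    }

# Tiny placeholder payload, not a real page image.
body = build_page_request("iVBORw0KGgo=")
serialized = json.dumps(body)  # what would be POSTed to OLLAMA_URL
```

One request of this shape is issued per page, which is why page count and model inference cost dominate total runtime.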

Once the pipeline is wired into a Docling document converter, the system processes a sample PDF—Nvidia financial results—by converting selected pages into Markdown. On an M3 Pro machine, the conversion took about 53 seconds, including the time to load a 3B quantized model into GPU RAM. The runtime is expected to improve with stronger hardware and may change with longer PDFs.

The output quality is mixed in a way that matters for real-world use. On at least one page, the model successfully extracts tables, producing Markdown that includes the table structure. However, header alignment errors appear—for example, the “millions” header is misplaced relative to the intended column. The result is a clear signal that these models are not perfect at precise table geometry, even when they can capture the presence and general structure of tables.

The closing guidance draws a boundary around when to choose which approach. Classical OCR models (including free ones) tend to perform well for extracting complete page text. But when the goal is to extract specific elements—images, tables, and other structured components—prompting a visual language model can yield better results, especially if the pipeline is designed to ask for structured Markdown. The recommendation is to evaluate on the target document set: for general information extraction tasks (such as a RAG pipeline), a classical OCR model may still be the safer default, while VLM-based extraction can be advantageous for structured outputs where prompting and element-level understanding matter most.

Cornell Notes

Docling can convert PDFs to Markdown using a visual language model served locally through Ollama, replacing classical OCR. In this setup, a Nanonets OCR model fine-tuned from Qwen 2.5 (GGUF variant) is loaded in a 3B quantized form and called via Docling’s VLM options pipeline with a prompt and Markdown output format. On an Nvidia financial results PDF, the system extracted tables into Markdown but showed table header misalignment (e.g., the “millions” header landing in the wrong column). Runtime was about 53 seconds on an M3 Pro-class machine, including model load time. The practical lesson: VLMs can improve structured extraction with good prompting, while classical OCR often remains better for full-text accuracy.

How does the Docling pipeline change when replacing classical OCR with a visual language model?

Instead of using an OCR pipeline, the document converter is configured with a VLM pipeline. The VLM pipeline options must enable remote services and include the Ollama-served model plus a prompt instructing the model to perform page OCR and return Markdown. Docling then calls this VLM during conversion, and the final Markdown is retrieved via the export-to-Markdown step.
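Assuming `docling` is installed and an Ollama server is running locally, the wiring described above might look like the sketch below. Class names and import paths follow Docling’s published VLM pipeline examples but can vary across Docling versions, and the URL, model tag, and prompt are placeholders rather than the video’s exact values. The imports are deferred into the function so the sketch can be defined without Docling present.

```python
def convert_pdf_to_markdown(pdf_path: str, ollama_url: str) -> str:
    """Sketch of a Docling VLM-pipeline conversion against a local Ollama
    server. Requires `docling` and a running model server, so the imports
    are deferred to call time; field names may differ between versions."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        ApiVlmOptions, ResponseFormat, VlmPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    # Remote services must be enabled so Docling may call the local endpoint.
    opts = VlmPipelineOptions(enable_remote_services=True)
    opts.vlm_options = ApiVlmOptions(
        url=ollama_url,                        # e.g. an OpenAI-compatible chat endpoint
        params={"model": "nanonets-ocr-s"},    # assumed Ollama model tag
        prompt="Perform OCR on this page and return the result as Markdown.",
        timeout=90,                            # seconds per model call
        scale=1.0,                             # page-image scaling factor
        response_format=ResponseFormat.MARKDOWN,
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=VlmPipeline,      # swap OCR pipeline for the VLM pipeline
                pipeline_options=opts,
            )
        }
    )
    # Convert, then retrieve the final Markdown via the export step.
    return converter.convert(pdf_path).document.export_to_markdown()
```

In practice the only structural change from a classical-OCR setup is the `pipeline_cls` swap plus the VLM options; the conversion and export calls stay the same.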

What model and serving approach are used for local PDF-to-Markdown conversion?

The pipeline uses a Nanonets OCR model that is a fine-tuned version of Qwen 2.5 (GGUF variant). The model is served locally through Ollama, with the pipeline configured using a local Ollama URL and model identifier. The setup includes parameters like timeout, image scaling, and response format set to Markdown.

What performance was observed, and what factors likely affect it?

Conversion took roughly 53 seconds on an M3 Pro machine, including the time to load the 3B quantized model into GPU RAM. Faster GPUs or more capable hardware should reduce runtime, and longer PDFs may change total processing time depending on page count and model inference cost.

How accurate was the Markdown output for tables, and what specific failure mode appeared?

Tables were extracted into Markdown, but header alignment errors occurred. For example, the “millions” header appeared in the wrong position relative to the intended column. That points to imperfect table geometry/structure reconstruction even when the model captures table presence.

When should classical OCR be preferred over a VLM-based approach in this workflow?

Classical OCR models are generally better for extracting complete page text. VLMs are more promising when the task emphasizes structured elements—tables, images, and other layout-dependent components—especially when prompting is used to request structured Markdown. Either way, evaluation on the target document set is necessary.

Review Questions

  1. For a Docling VLM-based pipeline, which configuration choices (e.g., remote services, prompt, response format) are essential to get Markdown output from a locally served model?
  2. What evidence from the table extraction results suggests a limitation of VLM-based OCR compared with classical OCR?
  3. How would you decide between classical OCR and a VLM for a RAG pipeline that needs reliable text chunks versus structured table fields?

Key Points

  1. Docling can convert PDFs to Markdown using a visual language model by swapping out the OCR pipeline for a VLM pipeline.

  2. A locally served Nanonets OCR model (fine-tuned from Qwen 2.5, GGUF variant) can be called through Ollama using Docling’s VLM options.

  3. Pipeline configuration must enable remote services and pass the model identifier plus a prompt that requests page OCR in Markdown format.

  4. On an M3 Pro-class machine, converting an Nvidia financial results PDF took about 53 seconds, including loading a 3B quantized model into GPU RAM.

  5. VLM-based extraction can capture tables into Markdown, but header alignment and column placement can be wrong.

  6. Classical OCR remains a strong default for full-page text extraction, while VLMs may be better for structured elements when prompting is tailored.

  7. Choosing between OCR and VLM should be driven by task requirements and evaluated on the specific document set.

Highlights

Docling’s VLM pipeline can turn PDF pages into Markdown by calling a locally hosted visual language model through Ollama.
Table extraction worked, but header placement errors (like the “millions” header) showed that structure reconstruction is not always precise.
The reported end-to-end time (~53 seconds) included model loading, so hardware and model size strongly influence throughput.
A practical rule emerges: classical OCR for complete text; VLMs for structured elements such as tables and layout-sensitive content.
