100% Local PDF OCR with Docling and Ollama | PDF to Markdown with VLM (Nanonets-OCR-s)
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Docling can convert PDFs to Markdown using a visual language model by swapping out the OCR pipeline for a VLM pipeline.
Briefing
A local, fully self-hosted pipeline can convert PDFs into Markdown by swapping out traditional OCR for a visual language model—specifically Docling paired with an Ollama-served Nanonets OCR model. The practical takeaway is that table- and layout-aware extraction can improve when a model is prompted to produce structured output, but accuracy still depends heavily on document type and what “good” means for the task.
The workflow starts with Docling's document conversion pipeline, configured to run OCR via a VLM (visual language model) rather than a classical OCR engine. The Nanonets OCR model used is a fine-tuned version of Qwen 2.5, loaded in a quantized form and served locally through Ollama. The model is accessed via a local Ollama URL, and the pipeline is configured with parameters such as a prompt, timeout, image scaling, and a response format set to Markdown. A key implementation detail is enabling remote services in the pipeline options and passing the VLM options (model identifier plus the instruction prompt) so Docling can call the model during conversion.
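The configuration described above can be sketched roughly as follows. This is a minimal sketch based on Docling's remote-VLM pipeline options, not the exact code from the video: the Ollama URL, the model tag `nanonets-ocr-s`, the prompt text, and the exact import paths are assumptions and may differ across Docling versions.

```python
# Sketch: Docling PDF conversion with a locally served VLM instead of classical OCR.
# Assumptions: a local Ollama server, a "nanonets-ocr-s" model tag (hypothetical),
# and a recent Docling release; import paths may vary by version.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

vlm_options = ApiVlmOptions(
    url="http://localhost:11434/v1/chat/completions",  # local Ollama endpoint
    params={"model": "nanonets-ocr-s"},                # hypothetical Ollama model tag
    prompt="OCR this page and return the content as Markdown.",
    timeout=120,                                       # VLM calls can be slow locally
    scale=1.0,                                         # page-image scaling sent to the model
    response_format=ResponseFormat.MARKDOWN,
)

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # required so Docling is allowed to call the server
)
pipeline_options.vlm_options = vlm_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
```

The essential pieces match the text: remote services must be explicitly enabled, and the VLM options carry the model identifier plus the instruction prompt that asks for Markdown.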
Once the pipeline is wired into a Docling document converter, the system processes a sample PDF (Nvidia financial results) by converting selected pages into Markdown. On an M3 Pro machine, the conversion took about 53 seconds, including the time to load the 3B quantized model into GPU RAM. The runtime is expected to improve with stronger hardware and may grow with longer PDFs.
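A conversion run like the one described can be timed with a few lines. This is a hypothetical sketch: the filename is made up, the default `DocumentConverter()` is a stand-in for the VLM-configured converter from the text, and the `page_range` keyword assumes a recent Docling version that supports page selection.

```python
# Sketch: convert selected pages and measure wall-clock time.
# "nvidia-financial-results.pdf" is a placeholder filename; the converter
# would be the VLM-configured one rather than this default stand-in.
import time

from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # stand-in; wire in the VLM pipeline options here

start = time.perf_counter()
result = converter.convert(
    "nvidia-financial-results.pdf",
    page_range=(1, 2),  # convert only the selected pages
)
markdown = result.document.export_to_markdown()
print(f"Converted in {time.perf_counter() - start:.1f}s")
print(markdown)
```

Note that the first run includes loading the quantized model into GPU RAM, so timing a second run gives a better picture of steady-state speed.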
The output quality is mixed in a way that matters for real-world use. On at least one page, the model successfully extracts tables, producing Markdown that includes the table structure. However, header alignment errors appear; for example, the “millions” header is misplaced relative to its intended column. The result is a clear signal that these models are not perfect at precise table geometry, even when they capture the presence and general structure of tables.
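Misplaced headers of this kind often surface as rows whose cell count disagrees with the header, so a cheap structural check can flag suspect tables before they reach downstream use. The sketch below is not from the video; the sample table and its figures are invented purely to illustrate the failure mode.

```python
# Sketch: flag rows in a Markdown table whose cell count differs from the
# header row - a common symptom when a VLM drops a header like "(millions)"
# into the wrong column. Sample data below is invented for illustration.

def check_markdown_table(table: str) -> list[str]:
    """Return warnings for rows whose cell count differs from the header."""
    rows = [line.strip() for line in table.strip().splitlines() if line.strip()]
    if not rows:
        return []

    def cells(row: str) -> list[str]:
        # Trim outer pipes, then split the remaining cells.
        return [c.strip() for c in row.strip("|").split("|")]

    width = len(cells(rows[0]))
    warnings = []
    for i, row in enumerate(rows[1:], start=2):
        n = len(cells(row))
        if n != width:
            warnings.append(f"row {i}: expected {width} cells, got {n}")
    return warnings

# Example: "(millions)" has drifted out of the header into its own cell,
# giving the last row one column too many.
bad = """
| Revenue (millions) | Q1 | Q2 |
| --- | --- | --- |
| Data Center | 22563 | 26044 |
| (millions) | Gaming | 2865 | 2880 |
"""
print(check_markdown_table(bad))  # -> ['row 4: expected 3 cells, got 4']
```

A check like this cannot catch a header that lands in the wrong column while keeping the cell count intact, so spot-checking against the source PDF is still advisable.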
The closing guidance draws a boundary around when to choose which approach. Classical OCR models (including free ones) tend to perform well for extracting complete page text. But when the goal is to extract specific elements—images, tables, and other structured components—prompting a visual language model can yield better results, especially if the pipeline is designed to ask for structured Markdown. The recommendation is to evaluate on the target document set: for general information extraction tasks (such as a RAG pipeline), a classical OCR model may still be the safer default, while VLM-based extraction can be advantageous for structured outputs where prompting and element-level understanding matter most.
Cornell Notes
Docling can convert PDFs to Markdown using a visual language model served locally through Ollama, replacing classical OCR. In this setup, a Nanonets OCR model fine-tuned from Qwen 2.5 (GGUF variant) is loaded in a 3B quantized form and called via Docling’s VLM pipeline options with a prompt and Markdown output format. On an Nvidia financial results PDF, the system extracted tables into Markdown but showed table header misalignment (e.g., the “millions” header landing in the wrong column). Runtime was about 53 seconds on an M3 Pro-class machine, including model load time. The practical lesson: VLMs can improve structured extraction with good prompting, while classical OCR often remains better for full-text accuracy.
How does the Docling pipeline change when replacing classical OCR with a visual language model?
What model and serving approach are used for local PDF-to-Markdown conversion?
What performance was observed, and what factors likely affect it?
How accurate was the Markdown output for tables, and what specific failure mode appeared?
When should classical OCR be preferred over a VLM-based approach in this workflow?
Review Questions
- For a Docling VLM-based pipeline, which configuration choices (e.g., remote services, prompt, response format) are essential to get Markdown output from a locally served model?
- What evidence from the table extraction results suggests a limitation of VLM-based OCR compared with classical OCR?
- How would you decide between classical OCR and a VLM for a RAG pipeline that needs reliable text chunks versus structured table fields?
Key Points
1. Docling can convert PDFs to Markdown using a visual language model by swapping out the classical OCR pipeline for a VLM pipeline.
2. A locally served Nanonets OCR model (fine-tuned from Qwen 2.5, GGUF variant) can be called through Ollama using Docling’s VLM options.
3. Pipeline configuration must enable remote services and pass the model identifier plus a prompt that requests page OCR in Markdown format.
4. On an M3 Pro-class machine, converting an Nvidia financial results PDF took about 53 seconds, including loading a 3B quantized model into GPU RAM.
5. VLM-based extraction can capture tables into Markdown, but header alignment and column placement can be wrong.
6. Classical OCR remains a strong default for full-page text extraction, while VLMs may be better for structured elements when prompting is tailored.
7. Choosing between OCR and VLM should be driven by task requirements and evaluated on the specific document set.