Convert Any Document To LLM Knowledge with Docling & Ollama (100% Local) | PDF to Markdown Pipeline
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Building a reliable, fully local knowledge base from PDFs hinges on turning messy layouts, especially tables and charts, into structured Markdown that an LLM can query. The pipeline described here uses Docling to convert digital PDFs into Markdown while preserving table structure and replacing images with model-generated annotations. That matters because RAG-style retrieval and question answering often fail when the source material is trapped in PDF formatting rather than clean text.
The workflow starts with a PDF reader backend (Docling's PyPdfium2) that extracts text directly from digital PDFs. Since the documents are already digital, OCR is intentionally disabled to avoid the errors and hallucinated artifacts that OCR engines can introduce. For financial documents, where tables dominate, Docling's table extraction uses a TableFormer model configured for high-precision table structure recognition.
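A minimal sketch of that configuration with Docling's Python API; the option names reflect current Docling releases, and the exact values here are illustrative rather than confirmed from the video:

```python
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # digital PDFs already have a text layer; skip OCR
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # high-precision TableFormer

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,  # extract text directly from the digital PDF
        )
    }
)
```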
Images are handled as a separate step. Docling crops out images from the PDF and sends them to a visual language model running locally through Ollama. The example uses Qwen2.5-VL (the “2 billion parameter” version) to generate concise descriptions for each chart or figure. Those descriptions are inserted back into the Markdown in place of the original images, using a prompt and token limit configured in the pipeline. The setup also includes page-break placeholders so later chunking for retrieval can respect page boundaries.
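One way to wire up the image step is Docling's API-based picture description pointed at Ollama's OpenAI-compatible endpoint. The model tag, prompt, and token limit below are assumptions for illustration, not values confirmed from the video:

```python
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions

pipeline_options.enable_remote_services = True   # required for API-based enrichment, even on localhost
pipeline_options.do_picture_description = True
pipeline_options.generate_picture_images = True  # keep cropped figures so they can be sent to the VLM
pipeline_options.images_scale = 2.0              # render crops at higher resolution for the VLM
pipeline_options.picture_description_options = PictureDescriptionApiOptions(
    url="http://localhost:11434/v1/chat/completions",     # Ollama's OpenAI-compatible endpoint
    params={"model": "qwen2.5vl:3b", "max_tokens": 200},  # illustrative model tag and token limit
    prompt="Describe this chart or figure in two or three concise sentences.",
    timeout=90,
)
```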
Once the conversion runs, Docling exports a Markdown file that includes the extracted text, correctly parsed tables, and image annotations describing what the charts show. In the demonstrated output, a chart description captures the displayed trend without inventing extra metadata (for example, it does not claim the chart is specifically the S&P 500), because the model only describes what it can infer visually. The conversion completes in roughly 8–9 seconds for the sample document.
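Running the conversion and exporting might then look like the following; the file names are hypothetical, and `page_break_placeholder` is the docling-core export option that emits the page-break markers mentioned above:

```python
result = converter.convert("annual-report.pdf")  # hypothetical input file

markdown = result.document.export_to_markdown(
    page_break_placeholder="<!-- page break -->",  # lets later chunking respect page boundaries
)

with open("annual-report.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```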
To validate usefulness for retrieval-style QA, the resulting Markdown is then fed into a local Ollama chat workflow using a Qwen model (the example mentions a "4 billion parameter model"). Questions like "What is the document about?" and a specific numeric query for 2022 return answers grounded in the converted content—such as an annual return of -9.44%—demonstrating that the Markdown preserves the key facts needed for downstream RAG applications.
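A sketch of that validation step with the `ollama` Python client; the model tag and prompt wording are placeholders:

```python
import ollama

with open("annual-report.md", "r", encoding="utf-8") as f:
    document = f.read()

response = ollama.chat(
    model="qwen3:4b",  # illustrative tag for a ~4B-parameter Qwen model
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document."},
        {"role": "user", "content": f"{document}\n\nWhat was the annual return in 2022?"},
    ],
)
print(response["message"]["content"])
```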
The pipeline is presented as modular: Docling can swap out the PDF reader (for scanned PDFs, an OCR-based approach could be used), and the image understanding component can be replaced with a different visual language model if more compute is available. But for digital PDFs, the recommendation is to stick with direct PDF text extraction to keep the content faithful to the underlying document structure. Overall, the approach turns hard-to-parse financial PDFs into LLM-ready knowledge artifacts entirely on-device, setting up a foundation for later chunking and retrieval improvements in subsequent steps.
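For scanned PDFs, the swap can be as small as re-enabling OCR in the same pipeline options, for example with Docling's EasyOCR integration (a sketch, not the video's configuration):

```python
from docling.datamodel.pipeline_options import EasyOcrOptions

pipeline_options.do_ocr = True  # scanned pages have no embedded text layer
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"])
```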
Cornell Notes
A local Docling + Ollama pipeline converts digital PDFs into Markdown that keeps both tables and image meaning. Text is extracted with Docling's PyPdfium2 backend while OCR is disabled to avoid transcription errors. Tables are recovered using a TableFormer model in high-precision mode, and charts/figures are cropped and described by a local visual language model (Qwen2.5-VL) via Ollama. The output Markdown includes image annotations plus page-break placeholders, making it easier to chunk later for retrieval. In a sample QA flow, questions about the document and a 2022 numeric value (annual return -9.44%) are answered from the converted Markdown, showing the pipeline's usefulness for RAG-style applications.
Why disable OCR when converting PDFs in this pipeline?
How does the pipeline achieve accurate table extraction for financial documents?
What role does the visual language model play, and how are images handled?
What are page-break placeholders used for?
How is the converted Markdown validated for question answering?
What makes the pipeline flexible for different document types and hardware?
Review Questions
- If the input PDF is scanned rather than digital, what change would you make to the pipeline and why?
- Which two extraction steps are most critical for financial PDFs, and what models/backends handle them?
- How do page-break placeholders improve later retrieval chunking compared with plain Markdown output?
Key Points
1. Use Docling's PyPdfium2 backend to extract text from digital PDFs, and disable OCR to avoid transcription errors.
2. Enable high-precision table structure recognition with TableFormer to preserve financial tables for retrieval.
3. Crop charts and figures out of the PDF and describe them with a local visual language model via Ollama (e.g., Qwen2.5-VL).
4. Replace images in the Markdown with model-generated annotations so LLM retrieval can use chart meaning directly.
5. Insert page-break placeholders during conversion to support page-aware chunking later.
6. Run the entire pipeline locally (Docling + Ollama) to keep document processing on-device and reduce dependency on external services.
7. Validate the conversion by asking targeted questions and checking that numeric answers match values present in the Markdown.