Convert Any Document To LLM Knowledge with Docling & Ollama (100% Local) | PDF to Markdown Pipeline
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
Building a reliable, fully local knowledge base from PDFs hinges on turning messy layouts, especially tables and charts, into structured Markdown that an LLM can query. The pipeline described here uses Docling to convert digital PDFs into Markdown while preserving table structure and replacing images with model-generated annotations. That matters because RAG-style retrieval and question answering often fail when the source material is trapped in PDF formatting rather than clean text.
The workflow starts with a PDF reader backend (Docling's PyPdfium2) that extracts text directly from digital PDFs. Since the documents are already digital, OCR is intentionally disabled to avoid the errors and hallucinated artifacts that OCR engines can introduce. For financial documents, where tables dominate, Docling's table extraction uses a TableFormer model configured for high-precision table structure recognition.
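A minimal sketch of that configuration with Docling's Python API; the option names reflect current Docling releases, and the exact values here are illustrative rather than confirmed from the video:

```python
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # digital PDFs already have a text layer; skip OCR
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # high-precision TableFormer

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend,  # extract text directly from the digital PDF
        )
    }
)
```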
Images are handled as a separate step. Docling crops out images from the PDF and sends them to a visual language model running locally through Ollama. The example uses Qwen2.5-VL (the “2 billion parameter” version) to generate concise descriptions for each chart or figure. Those descriptions are inserted back into the Markdown in place of the original images, using a prompt and token limit configured in the pipeline. The setup also includes page-break placeholders so later chunking for retrieval can respect page boundaries.
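One way to wire up the image step is Docling's API-based picture description pointed at Ollama's OpenAI-compatible endpoint. The model tag, prompt, and token limit below are assumptions for illustration, not values confirmed from the video:

```python
from docling.datamodel.pipeline_options import PictureDescriptionApiOptions

pipeline_options.enable_remote_services = True   # required for API-based enrichment, even on localhost
pipeline_options.do_picture_description = True
pipeline_options.generate_picture_images = True  # keep cropped figures so they can be sent to the VLM
pipeline_options.images_scale = 2.0              # render crops at higher resolution for the VLM
pipeline_options.picture_description_options = PictureDescriptionApiOptions(
    url="http://localhost:11434/v1/chat/completions",     # Ollama's OpenAI-compatible endpoint
    params={"model": "qwen2.5vl:3b", "max_tokens": 200},  # illustrative model tag and token limit
    prompt="Describe this chart or figure in two or three concise sentences.",
    timeout=90,
)
```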
Once the conversion runs, Docling exports a Markdown file that includes the extracted text, correctly parsed tables, and image annotations describing what the charts show. In the demonstrated output, a chart description captures the displayed trend without inventing extra metadata (for example, it does not claim the chart is specifically the S&P 500), because the model only describes what it can infer visually. The conversion completes in roughly 8–9 seconds for the sample document.
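Running the conversion and exporting might then look like the following; the file names are hypothetical, and `page_break_placeholder` is the docling-core export option that emits the page-break markers mentioned above:

```python
result = converter.convert("annual-report.pdf")  # hypothetical input file

markdown = result.document.export_to_markdown(
    page_break_placeholder="<!-- page break -->",  # lets later chunking respect page boundaries
)

with open("annual-report.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```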
To validate usefulness for retrieval-style QA, the resulting Markdown is then fed into a local Ollama chat workflow using a Qwen model (the example mentions a "4 billion parameter model"). Questions like "What is the document about?" and a specific numeric query for 2022 return answers grounded in the converted content—such as an annual return of -9.44%—demonstrating that the Markdown preserves the key facts needed for downstream RAG applications.
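A sketch of that validation step with the `ollama` Python client; the model tag and prompt wording are placeholders:

```python
import ollama

with open("annual-report.md", "r", encoding="utf-8") as f:
    document = f.read()

response = ollama.chat(
    model="qwen3:4b",  # illustrative tag for a ~4B-parameter Qwen model
    messages=[
        {"role": "system", "content": "Answer strictly from the provided document."},
        {"role": "user", "content": f"{document}\n\nWhat was the annual return in 2022?"},
    ],
)
print(response["message"]["content"])
```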
The pipeline is presented as modular: Docling can swap out the PDF reader (for scanned PDFs, an OCR-based approach could be used), and the image understanding component can be replaced with a different visual language model if more compute is available. But for digital PDFs, the recommendation is to stick with direct PDF text extraction to keep the content faithful to the underlying document structure. Overall, the approach turns hard-to-parse financial PDFs into LLM-ready knowledge artifacts entirely on-device, setting up a foundation for later chunking and retrieval improvements in subsequent steps.
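For scanned PDFs, the swap can be as small as re-enabling OCR in the same pipeline options, for example with Docling's EasyOCR integration (a sketch, not the video's configuration):

```python
from docling.datamodel.pipeline_options import EasyOcrOptions

pipeline_options.do_ocr = True  # scanned pages have no embedded text layer
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"])
```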
Cornell Notes
A local Docling + Ollama pipeline converts digital PDFs into Markdown that keeps both tables and image meaning. Text is extracted with Docling's PyPdfium2 backend while OCR is disabled to avoid transcription errors. Tables are recovered using a TableFormer model in high-precision mode, and charts/figures are cropped and described by a local visual language model (Qwen2.5-VL) via Ollama. The output Markdown includes image annotations plus page-break placeholders, making it easier to chunk later for retrieval. In a sample QA flow, questions about the document and a 2022 numeric value (annual return -9.44%) are answered from the converted Markdown, showing the pipeline's usefulness for RAG-style applications.
Why disable OCR when converting PDFs in this pipeline?
How does the pipeline achieve accurate table extraction for financial documents?
What role does the visual language model play, and how are images handled?
What are page-break placeholders used for?
How is the converted Markdown validated for question answering?
What makes the pipeline flexible for different document types and hardware?
Review Questions
- If the input PDF is scanned rather than digital, what change would you make to the pipeline and why?
- Which two extraction steps are most critical for financial PDFs, and what models/backends handle them?
- How do page-break placeholders improve later retrieval chunking compared with plain Markdown output?
Key Points
1. Use Docling's PyPdfium2 backend to extract text from digital PDFs, and disable OCR to avoid transcription errors.
2. Enable high-precision table structure recognition with TableFormer to preserve financial tables for retrieval.
3. Crop charts and figures out of the PDF and describe them with a local visual language model via Ollama (e.g., Qwen2.5-VL).
4. Replace images in the Markdown with model-generated annotations so LLM retrieval can use chart meaning directly.
5. Insert page-break placeholders during conversion to support page-aware chunking later.
6. Run the entire pipeline locally (Docling + Ollama) to keep document processing on-device and reduce dependency on external services.
7. Validate the conversion by asking targeted questions and checking that numeric answers match values present in the Markdown.