Build an AI Document (PDF, DOC, XML) Processing Pipeline for RAG | Docling, OCR, Chunking, Images
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Use Docling to convert PDFs into Markdown with OCR and image placeholders, then explicitly replace placeholders with locally generated image descriptions.
Briefing
Turning messy PDFs into reliable knowledge for RAG hinges on more than OCR. The core takeaway is a three-stage, fully local pipeline: convert a multi-page document (text, tables, and images) into structured Markdown; split it into semantically coherent chunks and enrich each with "where it fits" context; then retrieve the right chunks to answer questions grounded in the source.
The pipeline starts with Docling (an IBM open-source library) doing the heavy lifting: each PDF page is rendered to an image, OCR is run, and image placeholders are inserted into the resulting Markdown. OCR is handled by RapidOCR, and image understanding relies on a small local visual language model (described as a 256M-parameter model) that generates short descriptions. A key implementation detail is that image descriptions do not automatically replace the placeholders; the workflow extracts the description annotations, then runs a function that swaps each image-placeholder token for the model's text. The author also recommends a practical feedback loop: visually inspect the produced Markdown, especially the tables, then iterate on the pipeline.
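A minimal sketch of this stage, assuming a recent Docling release: the `RapidOcrOptions` and `smolvlm_picture_description` settings follow Docling's documented pipeline options, `<!-- image -->` is Docling's default Markdown placeholder for pictures, and the input file name is hypothetical.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
    smolvlm_picture_description,  # preset for a ~256M-parameter local VLM
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc.document import PictureDescriptionData

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()   # local OCR engine
pipeline_options.generate_picture_images = True    # keep cropped page images
pipeline_options.do_picture_description = True     # run the local VLM per image
pipeline_options.picture_description_options = smolvlm_picture_description

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")  # hypothetical input file
markdown = result.document.export_to_markdown()  # contains "<!-- image -->" markers

# Descriptions live as annotations on each picture; they are NOT substituted
# automatically, so swap them into the Markdown explicitly, in document order.
for picture in result.document.pictures:
    texts = [
        ann.text
        for ann in picture.annotations
        if isinstance(ann, PictureDescriptionData)
    ]
    description = texts[0] if texts else "(no description generated)"
    markdown = markdown.replace("<!-- image -->", description, 1)
```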
That inspection step matters because tables are where open-source OCR often stumbles. In the example document (NVIDIA's fiscal 2025 financial results), table headers and column alignment degrade into confusing gaps that can mislead downstream language models. The transcript notes a workaround: when OCR output is imperfect, feed the page image alongside the Markdown to a stronger multimodal model (the example given is Gemini 2) to repair the formatting. Even without that upgrade, the pipeline can still work well on nicely formatted PDFs, and the author contrasts this with the still-better accuracy of cloud-based document AI systems.
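The transcript only gestures at this repair step; the sketch below shows one way it might look with the google-generativeai client. The model name, prompt wording, and API-key handling are assumptions, not details from the video.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # assumption: supply your own key
model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model name

def repair_page_markdown(page_image_path: str, page_markdown: str) -> str:
    """Ask a multimodal model to fix OCR table artifacts using the page image."""
    prompt = (
        "The Markdown below was OCR-extracted from the attached page image. "
        "Fix broken table headers, misaligned columns, and missing cells so "
        "the Markdown matches the image. Return only the corrected Markdown.\n\n"
        + page_markdown
    )
    response = model.generate_content([prompt, Image.open(page_image_path)])
    return response.text
```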
Next comes chunking, designed to respect document structure rather than splitting purely by length. The approach is hybrid: first split by Markdown headers, then ask a local LLM (Gemma 3, a 12B-parameter model with a large context window) to decide where additional semantic splits should occur. The model is prompted to return only the chunk IDs where a split should happen, with a rule that at least one split must be proposed. This lets the system keep tables or lists intact when desired; domain rules can be added so certain structures are never broken.
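A sketch of the hybrid splitter under a few assumptions the video leaves open: header splitting is done with a regex rather than a dedicated library, the local model is served through Ollama (the `gemma3:12b` tag is an assumption), and chunks between proposed split points are merged back together.

```python
import re

import ollama  # assumption: local Ollama server with a Gemma model pulled

SPLIT_PROMPT = """You split documents into semantically coherent sections.
Below are numbered chunks in reading order. Reply with ONLY a comma-separated
list of chunk IDs where a NEW section should begin. Propose at least one
split. Never place a split in the middle of a table or a list.

{chunks}"""

def split_by_headers(markdown: str) -> list[str]:
    """First pass: split on Markdown headers, keeping each header with its body."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

def propose_splits(chunks: list[str], model: str = "gemma3:12b") -> list[int]:
    """Second pass: ask the local LLM for split points, returned as chunk IDs."""
    numbered = "\n\n".join(f"[chunk {i}]\n{c}" for i, c in enumerate(chunks))
    reply = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": SPLIT_PROMPT.format(chunks=numbered)}],
    )
    ids = sorted({int(n) for n in re.findall(r"\d+", reply["message"]["content"])})
    valid = [i for i in ids if 0 < i < len(chunks)]
    return valid or [max(1, len(chunks) // 2)]  # enforce the at-least-one rule

def merge_between(chunks: list[str], split_ids: list[int]) -> list[str]:
    """Merge header chunks between consecutive split points into final chunks."""
    bounds = [0, *split_ids, len(chunks)]
    merged = ["\n\n".join(chunks[a:b]) for a, b in zip(bounds, bounds[1:])]
    return [m for m in merged if m]
```

Because the model only names boundaries between whole header-level chunks, a table contained in one chunk can never be cut in half, and domain rules can veto specific IDs before merging.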
Finally, contextual enrichment adds retrieval-ready framing to each chunk. Inspired by Anthropic's "contextual retrieval" and similar ideas from Microsoft, each chunk is paired with a short prompt that asks for three concise sentences situating the chunk within the larger document. That context is then prepended to the chunk, producing a more self-contained unit for retrieval.
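Continuing with the same assumed Ollama setup, an enrichment pass might look like this; the prompt paraphrases the three-sentence instruction described above rather than quoting the video's exact wording.

```python
import ollama  # assumption: local Ollama server with a Gemma model pulled

ENRICH_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Write three concise sentences that situate this chunk within the overall
document, for the purpose of improving search retrieval of the chunk.
Reply with only the three sentences."""

def enrich_chunk(chunk: str, document: str, model: str = "gemma3:12b") -> str:
    """Prepend a short 'where this fits' context to the chunk."""
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": ENRICH_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    return reply["message"]["content"].strip() + "\n\n" + chunk
```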
For retrieval, the example uses a simple TF-IDF setup (via scikit-learn) and then queries the local LLM (Gemma 3) with the retrieved context. In tests, the system answers "What is Blackwell?" with details grounded in the document and correctly extracts a specific table value, reporting fourth-quarter gaming revenue as $2.5 billion. The overall message: there is no perfect universal pipeline, but a local, inspect-and-iterate architecture can produce grounded RAG results for complex PDFs when formatting is reasonably clean and chunking/enrichment are handled carefully.
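A minimal version of the retrieval and QA stage, using scikit-learn's TfidfVectorizer plus the same assumed Ollama setup; the prompt wording and top-k value are illustrative choices, not the video's exact code.

```python
import ollama  # assumption: local Ollama server with a Gemma model pulled
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

QA_PROMPT = """Use ONLY the provided context to answer the question.

<context>
{context}
</context>

Question: {question}"""

class TfidfRetriever:
    """Minimal lexical retriever over the enriched chunks."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        self.vectorizer = TfidfVectorizer()
        self.matrix = self.vectorizer.fit_transform(chunks)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        scores = cosine_similarity(
            self.vectorizer.transform([query]), self.matrix
        ).ravel()
        return [self.chunks[i] for i in scores.argsort()[::-1][:k]]

def answer(question: str, retriever: TfidfRetriever, model: str = "gemma3:12b") -> str:
    """Answer a question grounded in the top retrieved enriched chunks."""
    context = "\n\n---\n\n".join(retriever.retrieve(question))
    reply = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": QA_PROMPT.format(context=context, question=question),
        }],
    )
    return reply["message"]["content"]
```

Because each chunk carries its prepended context, even this lexical retriever can match section-level vocabulary (e.g., "gaming revenue") to the right table chunk.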
Cornell Notes
A practical, fully local document-to-RAG pipeline turns PDFs with text, tables, and images into Markdown, then into retrieval-ready chunks. Docling renders pages to images, runs OCR (RapidOCR), and inserts image placeholders that are later replaced using locally generated image-description annotations. Chunking is hybrid: Markdown header splits plus an LLM (Gemma 3) that proposes additional semantic split points using only chunk IDs, with rules to avoid breaking sensitive structures like tables. Each chunk then gets contextual enrichment (three concise sentences describing where it fits in the document) so retrieval returns more self-contained evidence. A TF-IDF retriever selects enriched chunks, and Gemma 3 answers questions grounded in the retrieved context, including correct table extraction (e.g., $2.5 billion in fourth-quarter gaming revenue).
- Why does the pipeline explicitly replace image placeholders after Docling conversion instead of relying on automatic substitution?
- What goes wrong with tables, and how does the workflow mitigate it?
- How does the chunking strategy balance structure with semantic coherence?
- What is contextual enrichment, and why does it improve retrieval?
- Why use TF-IDF retrieval in the example instead of a more complex vector database?
- How do the local models fit into the pipeline's stages?
Review Questions
- What specific artifacts does the pipeline produce after Docling conversion, and how are image descriptions integrated into the final Markdown?
- Describe the two-step chunking approach and explain how it prevents tables or lists from being split incorrectly.
- How does contextual enrichment change the information content of a chunk before retrieval, and what effect does that have on grounded answers?
Key Points
1. Use Docling to convert PDFs into Markdown with OCR and image placeholders, then explicitly replace placeholders with locally generated image descriptions.
2. Run a visual inspection loop on OCR output; tables are the most error-prone area and often require targeted fixes.
3. Chunk by Markdown structure first (headers), then use an LLM to propose additional semantic split points using only chunk IDs.
4. Add contextual enrichment by generating a short "where this chunk fits" summary for every chunk to make retrieval more self-contained.
5. Keep the pipeline local by pairing RapidOCR, a small visual language model for image descriptions, and Gemma 3 for chunk splitting, enrichment, and answering.
6. Ground answers by retrieving enriched chunks and passing them into a simple QA prompt that instructs the model to use the provided context.
7. Expect no universal pipeline: accuracy depends heavily on document formatting quality and iterative tuning of OCR, prompts, and chunking rules.