
Build an AI Document (PDF, DOC, XML) Processing Pipeline for RAG | Docling, OCR, Chunking, Images

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use Docling to convert PDFs into Markdown with OCR and image placeholders, then explicitly replace placeholders with locally generated image descriptions.

Briefing

Turning messy PDFs into reliable knowledge for RAG hinges on more than OCR. The core takeaway is a three-stage, fully local pipeline that converts a multi-page document (text, tables, and images) into structured Markdown, then splits it into semantically coherent chunks, enriches each chunk with “where it fits” context, and finally retrieves the right chunks to answer questions grounded in the source.

The pipeline starts with Docling (an open-source library from IBM) doing the heavy lifting: each PDF page is rendered to an image, OCR is run, and image placeholders are inserted into the resulting Markdown. OCR is handled by RapidOCR, and image understanding relies on a small local visual language model (described as a 256M-parameter model) that generates short descriptions. A key implementation detail is that image descriptions don’t automatically replace the placeholders; the workflow extracts the annotations, then runs a function that swaps each placeholder token for the model’s text. The author also recommends a practical feedback loop: visually inspect the produced Markdown, especially the tables, then iterate on the pipeline.
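A minimal conversion sketch along these lines, assuming Docling’s `DocumentConverter` with `RapidOcrOptions` and picture-image generation enabled (option names follow Docling’s published API, but verify them against your installed version; the file name is illustrative):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR via RapidOCR and keep rendered picture images around
# so a local vision model can describe them later.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("nvidia-fy2025-results.pdf")  # illustrative file name
# By default the exported Markdown contains "<!-- image -->" placeholders
# wherever pictures appeared; the later replacement step keys on these.
markdown = result.document.export_to_markdown()
```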

That inspection step matters because tables are where open-source OCR often stumbles. In the example NVIDIA fiscal 2025 financial results document, table headers and column alignment degrade into confusing gaps, which can mislead downstream language models. The transcript notes a workaround: when OCR output is imperfect, feed the page image alongside the Markdown into a stronger multimodal model (the example given is Gemini 2) to repair formatting issues. Even without that upgrade, the pipeline can still work well on “nicely formatted” PDFs, and the author contrasts this with the still-better accuracy of cloud-based document AI systems.
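If you want to script that repair step locally rather than through a cloud model, one hedged sketch uses the `ollama` Python client with a locally pulled vision-capable model (the model name below is an assumption; the video’s example used Gemini 2 instead):

```python
import ollama  # assumes a running Ollama server with a vision-capable model pulled


def repair_table_markdown(page_image_path: str, ocr_markdown: str) -> str:
    """Ask a multimodal model to fix table headers/columns using the page image."""
    response = ollama.chat(
        model="gemma3:12b",  # assumption: any locally available vision model
        messages=[{
            "role": "user",
            "content": (
                "The Markdown below was produced by OCR and its tables may have "
                "broken headers or misaligned columns. Using the attached page "
                "image as ground truth, return corrected Markdown only.\n\n"
                + ocr_markdown
            ),
            "images": [page_image_path],
        }],
    )
    return response["message"]["content"]
```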

Next comes chunking, designed to respect document structure rather than splitting purely by length. The approach uses a hybrid strategy: first split by Markdown headers, then ask a local LLM (Gemma 2, described as a 12B-parameter model with a larger context window) to decide where additional semantic splits should occur. The model is prompted to return only chunk IDs where a split should happen, with a rule that at least one split must be proposed. This lets the system keep tables or lists intact when desired—domain rules can be added so certain structures never get broken.
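A sketch of the two passes, assuming a hypothetical `ask_llm()` helper that wraps whichever local LLM you run (the prompt wording is illustrative, not the video’s exact prompt):

```python
import re


def split_by_headers(markdown: str) -> list[str]:
    """Structural pass: start a new chunk at every Markdown header."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def propose_semantic_splits(chunks: list[str], ask_llm) -> list[int]:
    """Semantic pass: the LLM returns only the IDs of chunks to split further."""
    numbered = "\n\n".join(f"[chunk {i}]\n{c}" for i, c in enumerate(chunks))
    prompt = (
        "Below are numbered document chunks. Return only a comma-separated list "
        "of chunk IDs that should be split further into smaller sections. "
        "Propose at least one split. Never split inside a table or list.\n\n"
        + numbered
    )
    reply = ask_llm(prompt)  # hypothetical helper around the local LLM
    return [int(i) for i in re.findall(r"\d+", reply)]
```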

Finally, contextual enrichment adds retrieval-ready framing to each chunk. Inspired by Anthropic’s “contextual retrieval” and similar Microsoft ideas, each chunk is paired with a short prompt that asks for three concise sentences situating the chunk within the larger document. The enriched chunk is then prepended with that context, producing a more self-contained unit for retrieval.
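In code, the enrichment step can look roughly like this, reusing the same hypothetical `ask_llm` helper and assuming a document-level summary is available to situate each chunk:

```python
def enrich_chunk(chunk: str, document_summary: str, ask_llm) -> str:
    """Prepend a short situating context to a chunk, per contextual retrieval."""
    prompt = (
        "Here is a summary of the whole document:\n"
        f"{document_summary}\n\n"
        "Here is one chunk from it:\n"
        f"{chunk}\n\n"
        "In exactly three concise sentences, state the chunk's main topic and "
        "how it fits into the overall document."
    )
    context = ask_llm(prompt)
    # The enriched chunk is the situating context followed by the original text.
    return f"{context}\n\n{chunk}"
```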

For retrieval, the example uses a simple TF-IDF setup (via scikit-learn components) and then queries the local LLM (Gemma 3) with the retrieved context. In tests, the system answers “What is Blackwell?” with details grounded in the document and correctly extracts a specific table value, reporting fourth-quarter gaming revenue as $2.5 billion. The overall message: there’s no perfect universal pipeline, but a local, inspect-and-iterate architecture can produce grounded RAG results for complex PDFs when formatting is reasonably clean and chunking and enrichment are handled carefully.
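A minimal retrieval-and-answer loop with scikit-learn’s `TfidfVectorizer`, again assuming the enriched chunks and the hypothetical `ask_llm` helper from the sketches above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve(query: str, enriched_chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by TF-IDF cosine similarity to the query."""
    vectorizer = TfidfVectorizer()
    chunk_vectors = vectorizer.fit_transform(enriched_chunks)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors).ravel()
    top = scores.argsort()[::-1][:k]
    return [enriched_chunks[i] for i in top]


def answer(query: str, enriched_chunks: list[str], ask_llm) -> str:
    """Ground the answer by instructing the model to use only retrieved context."""
    context = "\n\n---\n\n".join(retrieve(query, enriched_chunks))
    prompt = (
        "Answer the question using only the provided context.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return ask_llm(prompt)
```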

Cornell Notes

A practical, fully local document-to-RAG pipeline turns PDFs with text, tables, and images into Markdown, then into retrieval-ready chunks. Docling renders pages to images, runs OCR (RapidOCR), and inserts image placeholders that are later replaced using local image-description annotations. Chunking is hybrid: Markdown header splits plus an LLM (Gemma 2) that proposes additional semantic split points using only chunk IDs, with rules to avoid breaking sensitive structures like tables. Each chunk then gets contextual enrichment—three concise sentences describing where it fits in the document—so retrieval returns more self-contained evidence. A TF-IDF retriever selects enriched chunks, and Gemma 3 answers questions grounded in the retrieved context, including correct table extraction (e.g., $2.5 billion).

Why does the pipeline explicitly replace image placeholders after Docling conversion instead of relying on automatic substitution?

Docling can generate Markdown with image placeholders even when image descriptions are requested. In the workflow, the image-description model produces annotations (e.g., a short description for the NVIDIA logo), and a custom replace function swaps the placeholder tokens with those annotations. This makes the output deterministic and ensures the downstream chunker and retriever see the same text that the image model generated.
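A sketch of that replace function, assuming Docling’s default `<!-- image -->` placeholder and a list of descriptions in document order (both assumptions worth verifying against your own output):

```python
def replace_image_placeholders(markdown: str, descriptions: list[str]) -> str:
    """Swap each image placeholder for the vision model's description, in order."""
    placeholder = "<!-- image -->"  # Docling's default; adjust if yours differs
    for description in descriptions:
        markdown = markdown.replace(placeholder, f"*Image: {description}*", 1)
    return markdown
```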

What goes wrong with tables, and how does the workflow mitigate it?

OCR output can distort table structure—especially headers and column alignment—turning header values into confusing gaps that can mislead a language model during retrieval or answering. The transcript recommends visual inspection of the generated Markdown and, when needed, using a stronger multimodal model by providing both the page image and the OCR Markdown (example: Gemini 2) to repair formatting errors.

How does the chunking strategy balance structure with semantic coherence?

It starts with structural splitting by Markdown headers, then uses an LLM (Gemma 2) to decide where additional splits should occur. The LLM is instructed to return only the IDs of chunks where splits should happen, and it must propose at least one split. This hybrid method keeps important structures (like tables or lists) from being broken arbitrarily while still producing semantically consistent sections.

What is contextual enrichment, and why does it improve retrieval?

Contextual enrichment adds a short “situating” summary to each chunk. For each chunk, the system prompts a local model (Gemma 3 in the example) to output three concise sentences identifying the main topic and how the chunk relates to the overall document. The enriched chunk becomes more self-contained, so TF-IDF retrieval and the final answer generation rely on clearer, more relevant evidence.

Why use TF-IDF retrieval in the example instead of a more complex vector database?

The example uses a straightforward TF-IDF-based retriever (via scikit-learn-style components) to keep the pipeline minimal and testable. Retrieval quality still improves because the chunks are structurally split and context-enriched, making keyword overlap more reliable. The transcript also notes that the same approach could be implemented with vector-store frameworks like LangChain or LlamaIndex.

How do the local models fit into the pipeline’s stages?

Docling handles parsing and OCR-to-Markdown conversion. RapidOCR performs OCR. A small local visual language model generates image descriptions for placeholders. Gemma 2 proposes semantic split points during chunking. Gemma 3 generates contextual enrichment and answers questions using retrieved enriched chunks.

Review Questions

  1. What specific artifacts does the pipeline produce after Docling conversion, and how are image descriptions integrated into the final Markdown?
  2. Describe the two-step chunking approach and explain how it prevents tables or lists from being split incorrectly.
  3. How does contextual enrichment change the information content of a chunk before retrieval, and what effect does that have on grounded answers?

Key Points

  1. Use Docling to convert PDFs into Markdown with OCR and image placeholders, then explicitly replace placeholders with locally generated image descriptions.
  2. Run a visual inspection loop on OCR output—tables are the most error-prone area and often require targeted fixes.
  3. Chunk by Markdown structure first (headers), then use an LLM to propose additional semantic split points using only chunk IDs.
  4. Add contextual enrichment by generating a short “where this chunk fits” summary for every chunk to make retrieval more self-contained.
  5. Keep the pipeline local by pairing RapidOCR, a small visual language model for image descriptions, Gemma 2 for chunk splitting, and Gemma 3 for enrichment and answering.
  6. Ground answers by retrieving enriched chunks and passing them into a simple QA prompt that instructs the model to use provided context.
  7. Expect no universal pipeline: accuracy depends heavily on document formatting quality and iterative tuning of OCR, prompts, and chunking rules.

Highlights

  • Docling can produce Markdown with image placeholders; a custom replacement step is needed to insert the image model’s descriptions into the text.
  • Table OCR errors often manifest as broken header structure and column gaps—visual inspection plus multimodal correction (e.g., Gemini 2 with page image + Markdown) can help.
  • Chunking works best when it’s hybrid: Markdown-header splits plus an LLM that returns only split locations (chunk IDs), with rules to protect tables/lists.
  • Contextual retrieval improves answers when each chunk is prepended with a short, three-sentence summary of its role in the full document.
  • In the example, the system answers “What is Blackwell?” with document-grounded details and correctly extracts a table value for gaming revenue ($2.5 billion).

Topics

Mentioned

  • OCR
  • RAG
  • TF-IDF
  • GPU
  • CUDA
  • VLM