Advanced RAG Chunking: Contextual & Structural Chunking with LangChain & Ollama (100% Local)

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Split markdown by heading hierarchy (title/section/subsection) while preserving headers so chunk boundaries match the document’s structure.

Briefing

Turning a large, converted PDF markdown into retrieval-ready chunks is often where RAG pipelines lose both speed and accuracy. The core fix here is a chunking pipeline that is simultaneously structure-aware (it respects markdown headings and keeps tables intact) and context-aware (it enriches each chunk with model-generated summaries so retrieval has more semantic footing). The result is a chunk set that carries not just text, but also metadata and a “breadcrumb” hierarchy that later retrieval steps can use to preserve document structure.

The pipeline starts after a PDF has been converted into a single markdown document. Chunking is split into two stages. First comes structural chunking using LangChain components: a markdown header text splitter breaks the document along the first three levels of markdown headers (title, section, subsection) while preserving those headers inside the chunks. Within each header-based segment, a recursive character text splitter then enforces token limits using a tokenizer approach rather than raw character counts. Tables are treated specially: if a segment contains a table, it is kept whole and not split further, preventing retrieval from scattering critical tabular information across multiple chunks.
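
As a minimal sketch of that two-stage split, assuming the langchain-text-splitters package and a Hugging Face tokenizer for the token budget (the header labels, token limits, and table check below are illustrative, not taken from the video):

```python
# pip install langchain-text-splitters transformers
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from transformers import AutoTokenizer

def looks_like_table(text: str) -> bool:
    # Crude check: any line that both starts and ends with a pipe.
    lines = (l.strip() for l in text.splitlines())
    return any(l.startswith("|") and l.endswith("|") for l in lines)

# Stage 1: split on the first three heading levels and keep the headers
# inside each chunk so boundaries follow the document's structure.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section"), ("###", "subsection")],
    strip_headers=False,
)

# Stage 2: enforce token limits with a real tokenizer, not character counts.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed choice
token_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=512, chunk_overlap=64  # illustrative limits
)

def structural_chunks(markdown_text: str):
    segments = header_splitter.split_text(markdown_text)  # Documents with header metadata
    chunks = []
    for seg in segments:
        if looks_like_table(seg.page_content):  # keep tables whole
            chunks.append(seg)
        else:
            chunks.extend(token_splitter.split_documents([seg]))
    return chunks
```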

Second comes contextual enrichment, inspired by Anthropic’s contextual retrieval technique and implemented locally using Ollama with a Gemma model (the transcript’s “geometry the 4 billion parameter version” almost certainly refers to Gemma 3 4B). For each chunk, enrichment generates a short 2–3 sentence context that explains the document’s premise and the chunk’s specific positioning within it. The enrichment prompt is built using the first 5,000 characters of the full markdown document plus the chunk content, and the output is stored alongside the original chunk text.
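
A sketch of that enrichment step, assuming the ollama Python client and the gemma3:4b model tag; the video’s exact prompt wording isn’t shown, so the phrasing here is an approximation:

```python
# pip install ollama  (requires a running Ollama server with the model pulled)
import ollama

DOC_PREFIX_CHARS = 5_000  # the pipeline uses the document's first 5,000 characters

def enrich_chunk(full_markdown: str, chunk_text: str,
                 model: str = "gemma3:4b") -> str:
    """Generate a 2-3 sentence context situating the chunk in the document."""
    prompt = (
        "Here is the beginning of a document:\n\n"
        f"{full_markdown[:DOC_PREFIX_CHARS]}\n\n"
        "Here is a chunk from that document:\n\n"
        f"{chunk_text}\n\n"
        "In 2-3 sentences, state the document's premise and explain how this "
        "chunk fits within it."
    )
    response = ollama.chat(model=model,
                           messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```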

Each produced chunk is represented as a data object containing: (1) the chunk content for answering queries, (2) metadata including company and fiscal year plus breadcrumbs that encode the heading hierarchy, (3) the enriched context text generated by the local model, and (4) a vector text field intended for later embedding and retrieval. Breadcrumbs are derived from the markdown heading levels so downstream retrieval can understand where a chunk sits in the document’s structure.
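
One possible shape for that data object, with field names assumed rather than taken from the video:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str                  # original text, used for answering queries
    breadcrumbs: list[str]        # heading path, e.g. ["Report", "Revenue", "By Segment"]
    metadata: dict = field(default_factory=dict)  # e.g. {"company": "...", "fiscal_year": 2024}
    context: str = ""             # 2-3 sentence summary from the local model
    vector_text: str = ""         # what actually gets embedded later

    def build_vector_text(self) -> str:
        # Prepend the enriched context so the embedding carries both signals.
        self.vector_text = f"{self.context}\n\n{self.content}".strip()
        return self.vector_text
```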

A practical example uses a financial markdown document of over 40,000 characters. Basic chunking yields eight chunks with widely varying token counts; chunks containing tables remain intact, which is why some exceed the max token threshold. No chunk ends up below the minimum token threshold because undersized chunks are merged back into neighboring chunks. When enrichment is enabled, the chunk set retains the same breadcrumbs but gains additional model-generated context per chunk. The transcript notes a measurable cost: enrichment increases the token footprint used for embeddings and retrieval, roughly a 7% inflation in the example (average token usage rises from 8,857 to roughly 9,500). That tradeoff is the central engineering decision: more context can improve retrieval quality, but it raises inference and embedding costs.
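
The merge step for undersized chunks might look like the following sketch, where min_tokens and count_tokens are assumed names for the configured floor and a tokenizer-based counter:

```python
def merge_small_chunks(texts: list[str], min_tokens: int, count_tokens) -> list[str]:
    """Fold any chunk below min_tokens into a neighboring chunk."""
    merged: list[str] = []
    for text in texts:
        if merged and count_tokens(text) < min_tokens:
            merged[-1] += "\n\n" + text          # fold into the previous chunk
        else:
            merged.append(text)
    # An undersized leading chunk has no previous neighbor; fold it forward.
    if len(merged) > 1 and count_tokens(merged[0]) < min_tokens:
        merged[1] = merged[0] + "\n\n" + merged[1]
        merged.pop(0)
    return merged
```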

Overall, the strategy is designed to produce retrieval units that are structurally faithful (headings and tables) and semantically boosted (model summaries), all while running locally with Ollama and a tokenizer-based token budgeting approach.

Cornell Notes

The pipeline turns a large markdown-converted PDF into RAG-ready chunks using two stages: structure-aware splitting and contextual enrichment. LangChain’s markdown header splitter divides text by the first three heading levels while preserving headers, and a recursive token-aware splitter enforces max/min token limits. Tables are detected and kept intact so tabular facts aren’t broken across chunks. After chunking, a local Gemma model served through Ollama generates a 2–3 sentence context for each chunk using the document’s first 5,000 characters plus the chunk content. Chunks store breadcrumbs (heading hierarchy), company and fiscal year metadata, enriched context, and separate fields for embedding, improving retrieval at the cost of higher token counts (roughly 7% in the example).

How does the pipeline keep document structure intact while still enforcing token limits?

It first splits by markdown headings using LangChain’s markdown header text splitter, targeting the title, section, and subsection levels and preserving those headers inside each chunk. Then it applies a recursive character text splitter configured to work with token counts (via a tokenizer) rather than raw character length. This two-step approach means chunks align with the document’s hierarchy, but still get trimmed/merged to respect token budgets.

Why are tables treated differently during chunking?

If a candidate segment contains a table, the pipeline avoids splitting it further, keeping the table as a single chunk. The transcript’s example shows that this behavior leads to some chunks exceeding the max token threshold—because preserving the table is prioritized over strict token caps. The goal is to prevent retrieval from fragmenting critical tabular information.
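
One way to implement that check is a regex for the header-plus-separator pattern of markdown tables; the video’s actual detection logic isn’t specified:

```python
import re

# A markdown table: a pipe-delimited header row followed by a separator
# row made of pipes, dashes, colons, and spaces (e.g. "| --- | :-: |").
TABLE_RE = re.compile(r"^\|.+\|\s*\n\s*\|[\s:\-|]+\|", re.MULTILINE)

def contains_table(segment: str) -> bool:
    return bool(TABLE_RE.search(segment))

# Segments where contains_table() is True bypass the token splitter,
# which is why some chunks can exceed the max-token threshold.
```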

What exactly does contextual enrichment add to each chunk?

For each chunk, contextual enrichment generates a short 2–3 sentence summary that states the document’s premise and explains the chunk’s specific role within it. The prompt is built by injecting the first 5,000 characters of the full markdown document plus the chunk’s content, then storing the model output as the chunk’s enriched context. This enriched context is kept alongside the original chunk text for later embedding and retrieval.

What metadata and fields are stored per chunk, and how are they intended to be used?

Each chunk includes: (1) original content for answering queries, (2) metadata such as company and fiscal year, (3) breadcrumbs that encode the heading hierarchy so retrieval can reconstruct where the chunk belongs, (4) enriched context generated by the local model, and (5) a vector text field used for embeddings in later steps. Token counts are also estimated using a tokenizer approach since OpenAI models aren’t used.
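
A sketch of tokenizer-based counting with Hugging Face’s transformers, where the tokenizer choice is an assumption:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed model

def count_tokens(text: str) -> int:
    # Count content tokens only; special tokens would skew the budget.
    return len(tokenizer.encode(text, add_special_tokens=False))
```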

What cost tradeoff does enrichment introduce?

Enrichment requires additional inference and increases the token volume used for embedding and retrieval. In the financial-document example, average token usage rises from 8,857 to about 9,500, an inflation of roughly 7%. The transcript frames this as a parameter-dependent engineering tradeoff between retrieval quality and compute/embedding cost.
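
The quoted figures are easy to verify:

```python
before, after = 8_857, 9_500
print(f"token inflation: {(after - before) / before:.1%}")  # token inflation: 7.3%
```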

Review Questions

  1. How would you modify the chunking parameters (max tokens, min tokens, overlap) to change the number and size of chunks without breaking table integrity?
  2. Why might breadcrumbs improve retrieval outcomes in a structured financial document compared with plain text chunking?
  3. In what ways does enriching chunks with a 2–3 sentence context change embedding behavior, and what downstream costs does that imply?

Key Points

  1. Split markdown by heading hierarchy (title/section/subsection) while preserving headers so chunk boundaries match the document’s structure.

  2. Use token-aware recursive splitting to enforce max/min token limits rather than relying on character counts.

  3. Detect tables and keep them intact as single chunks to avoid scattering tabular facts across multiple retrieval units.

  4. Generate per-chunk contextual summaries locally with a Gemma model on Ollama, prompting with the document’s first 5,000 characters plus the chunk content.

  5. Store breadcrumbs (heading hierarchy), company, and fiscal year as metadata so retrieval can leverage document structure later.

  6. Expect enrichment to increase embedding/retrieval token volume (roughly 7% in the example), raising inference and compute costs.

  7. Represent each chunk with separate fields for original content, enriched context, metadata, and vector text to support embedding and downstream answering.

Highlights

Chunking is built as a two-stage pipeline: structure-aware splitting first, then contextual enrichment per chunk.
Tables are explicitly preserved as whole chunks, even if that means some chunks exceed the max token threshold.
Breadcrumb metadata encodes heading hierarchy, enabling retrieval to respect where a chunk sits in the document.
Local contextual enrichment with a Gemma model adds 2–3 sentence summaries per chunk, improving retrieval context at the cost of roughly 7% more tokens in the example.

Topics

  • RAG Chunking
  • Contextual Enrichment
  • LangChain Splitters
  • Token-Aware Splitting
  • Ollama Local Models