Advanced RAG Chunking: Contextual & Structural Chunking with LangChain & Ollama (100% Local)
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Turning a large markdown document converted from a PDF into retrieval-ready chunks is often where RAG pipelines lose both speed and accuracy. The core fix here is a chunking pipeline that is simultaneously structure-aware (it respects markdown headings and keeps tables intact) and context-aware (it enriches each chunk with model-generated summaries so retrieval has more semantic footing). The result is a chunk set that carries not just text, but also metadata and a “breadcrumb” hierarchy that later retrieval steps can use to preserve document structure.
The pipeline starts after a PDF has been converted into a single markdown document. Chunking is split into two stages. First comes structural chunking using LangChain components: a markdown header text splitter breaks the document along the first three levels of markdown headers (title, section, subsection) while preserving those headers inside the chunks. Within each header-based segment, a recursive character text splitter then enforces token limits using a tokenizer approach rather than raw character counts. Tables are treated specially: if a segment contains a table, it is kept whole and not split further, preventing retrieval from scattering critical tabular information across multiple chunks.
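The video uses LangChain's `MarkdownHeaderTextSplitter` and `RecursiveCharacterTextSplitter` for this stage; as a dependency-free illustration of the same idea, here is a minimal stdlib-only sketch. The function names, the whitespace token proxy, and the `|`-prefix table heuristic are assumptions for illustration, not the video's exact implementation:

```python
import re

def split_by_headers(md: str, max_levels: int = 3) -> list[str]:
    """Split markdown into segments at heading levels 1..max_levels,
    keeping each heading line inside its own segment (mirroring
    MarkdownHeaderTextSplitter with headers preserved)."""
    pattern = re.compile(r"^(#{1,%d})\s" % max_levels)
    segments, current = [], []
    for line in md.splitlines():
        if pattern.match(line) and current:
            segments.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        segments.append("\n".join(current))
    return segments

def enforce_token_limit(segment: str, max_tokens: int = 100) -> list[str]:
    """Split an oversized segment, but keep any segment containing a
    markdown table ('|'-prefixed rows) whole, even past the budget."""
    has_table = any(l.lstrip().startswith("|") for l in segment.splitlines())
    words = segment.split()  # whitespace count as a stand-in for a real tokenizer
    if has_table or len(words) <= max_tokens:
        return [segment]
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

In the real pipeline the whitespace proxy would be replaced by a proper tokenizer so the max/min limits are enforced in model tokens, as the transcript describes.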
Second comes contextual enrichment, inspired by Anthropic’s contextual retrieval approach and implemented locally using Ollama. The transcript specifies “geometry the 4 billion parameter version”, which most likely refers to Gemma 3 4B. For each chunk, enrichment generates a short 2–3 sentence context that explains the document’s premise and the chunk’s specific position within it. The enrichment prompt combines the first 5,000 characters of the full markdown document with the chunk content, and the output is stored alongside the original chunk text.
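A sketch of the prompt construction for this step. The template wording, constant name, and helper are assumptions, not the video's exact prompt; the model call (commented out, since it needs a running Ollama server) uses the `ollama` Python client, and `gemma3:4b` is the assumed model tag:

```python
DOC_PREFIX_CHARS = 5_000  # per the transcript: first 5,000 chars of the document

ENRICH_TEMPLATE = """<document>
{doc_prefix}
</document>

<chunk>
{chunk}
</chunk>

In 2-3 sentences, state the document's premise and where this chunk fits within it."""

def build_enrichment_prompt(full_markdown: str, chunk: str) -> str:
    """Combine the document prefix with one chunk into an enrichment prompt."""
    return ENRICH_TEMPLATE.format(
        doc_prefix=full_markdown[:DOC_PREFIX_CHARS], chunk=chunk
    )

# Calling the local model (requires `pip install ollama` and a running server):
# import ollama
# resp = ollama.chat(model="gemma3:4b",
#                    messages=[{"role": "user", "content": prompt}])
# context = resp["message"]["content"]  # stored alongside the chunk text
```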
Each produced chunk is represented as a data object containing: (1) the chunk content for answering queries, (2) metadata including company and fiscal year plus breadcrumbs that encode the heading hierarchy, (3) the enriched context text generated by the local model, and (4) a vector text field intended for later embedding and retrieval. Breadcrumbs are derived from the markdown heading levels so downstream retrieval can understand where a chunk sits in the document’s structure.
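The four fields described above can be sketched as a small data object. The class and field names are assumptions for illustration, as is the choice to derive the vector text by prepending the enriched context to the content:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: str                 # (1) original text used to answer queries
    breadcrumbs: list[str]       # part of (2): heading hierarchy, e.g. ["Report", "Revenue"]
    metadata: dict = field(default_factory=dict)  # (2) e.g. {"company": "...", "fiscal_year": 2024}
    context: str = ""            # (3) 2-3 sentence model-generated summary

    @property
    def vector_text(self) -> str:
        """(4) Text intended for embedding: enriched context + content."""
        return f"{self.context}\n\n{self.content}" if self.context else self.content
```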
A practical example uses a financial markdown document of more than 40,000 characters. Basic chunking yields eight chunks with widely varying token counts; chunks containing tables remain intact, which is why some exceed the max-token threshold. No chunk ends up below the minimum token threshold, because undersized chunks are merged back into neighboring chunks. When enrichment is enabled, the chunk set retains the same breadcrumbs but gains additional model-generated context per chunk. The transcript notes a measurable cost: enrichment increases the token footprint used for embeddings and retrieval, roughly 7–8% in the example (from 8,857 tokens before enrichment to about 9,500 after). That tradeoff is the core engineering decision: more context can improve retrieval quality, but it raises inference and embedding costs.
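As a quick sanity check on the example's numbers (8,857 tokens before enrichment, roughly 9,500 after), the inflation works out to just over 7%:

```python
before, after = 8_857, 9_500            # token counts from the example
inflation = (after - before) / before   # relative token-footprint increase
print(f"token inflation: {inflation:.1%}")  # token inflation: 7.3%
```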
Overall, the strategy is designed to produce retrieval units that are structurally faithful (headings and tables) and semantically boosted (model summaries), all while running locally with Ollama and a tokenizer-based token budgeting approach.
Cornell Notes
The pipeline turns a large markdown document (converted from a PDF) into RAG-ready chunks in two stages: structure-aware splitting and contextual enrichment. LangChain’s markdown header splitter divides text by the first three heading levels while preserving headers, and a recursive token-aware splitter enforces max/min token limits. Tables are detected and kept intact so tabular facts aren’t broken across chunks. After chunking, a local model served by Ollama (the transcript’s “geometry”, most likely Gemma 3 4B) generates a 2–3 sentence context for each chunk from the document’s first 5,000 characters plus the chunk content. Chunks store breadcrumbs (heading hierarchy), company and fiscal-year metadata, the enriched context, and a separate vector-text field for embedding, improving retrieval at the cost of higher token counts (roughly 7–8% in the example).
How does the pipeline keep document structure intact while still enforcing token limits?
Why are tables treated differently during chunking?
What exactly does contextual enrichment add to each chunk?
What metadata and fields are stored per chunk, and how are they intended to be used?
What cost tradeoff does enrichment introduce?
Review Questions
- How would you modify the chunking parameters (max tokens, min tokens, overlap) to change the number and size of chunks without breaking table integrity?
- Why might breadcrumbs improve retrieval outcomes in a structured financial document compared with plain text chunking?
- In what ways does enriching chunks with a 2–3 sentence context change embedding behavior, and what downstream costs does that imply?
Key Points
1. Split markdown by heading hierarchy (title/section/subsection) while preserving headers so chunk boundaries match the document’s structure.
2. Use token-aware recursive splitting to enforce max/min token limits rather than relying on character counts.
3. Detect tables and keep them intact as single chunks to avoid scattering tabular facts across multiple retrieval units.
4. Generate per-chunk contextual summaries locally with Ollama (the transcript’s “geometry” model, most likely Gemma 3 4B), using the document’s first 5,000 characters plus the chunk content.
5. Store breadcrumbs (heading hierarchy), company, and fiscal year as metadata so retrieval can leverage document structure later.
6. Expect enrichment to increase embedding/retrieval token volume (roughly 7–8% in the example), raising inference and compute costs.
7. Represent each chunk with separate fields for original content, enriched context, metadata, and vector text to support embedding and downstream answering.
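The briefing notes that no chunk falls below the minimum token threshold because undersized chunks are merged into a neighbor. A minimal sketch of that merge rule, assuming a fold-into-predecessor strategy and a whitespace token proxy (both illustrative choices, not confirmed details of the video's code):

```python
def merge_undersized(chunks: list[str], min_tokens: int = 50) -> list[str]:
    """Fold any chunk below the minimum token budget into the preceding
    chunk so no retrieval unit is too small to be meaningful."""
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```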