
Chunking 101: The Invisible Bottleneck Killing Enterprise AI Projects

6 min read

Based on AI News & Strategy Daily | Nate B Jones's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Chunking errors can produce confident but wrong answers because RAG retrieves only a few chunks; missing meaning across chunk boundaries can’t be recovered reliably.

Briefing

Chunking—how text is cut into retrieval-ready pieces—is a major, often invisible failure point for enterprise AI systems, and it can directly cause wrong, confident answers and wasted spend. A fintech deal nearly collapsed after an AI chatbot answered an indemnification question incorrectly because the relevant contract language was split mid-sentence across fixed-size token chunks. Retrieval pulled only the first chunk, leading the system to claim "party A fully indemnifies party B," even though the contract's meaning depended on language that continued in the next chunk. The fix wasn't a smarter model; it was context engineering—chunking the data so the right meaning lands together when the system retrieves a small set of passages.
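The failure mode is easy to reproduce. Here is a minimal sketch of a naive fixed-size token chunker splitting an indemnification clause mid-sentence; the contract text and chunk size are invented for illustration:

```python
# Hypothetical illustration of how fixed-size token chunking can strand a
# legal condition in a later chunk. Clause text and sizes are made up.

def chunk_by_tokens(text: str, max_tokens: int) -> list[str]:
    """Naive chunker: split on whitespace tokens, ignoring sentence boundaries."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

clause = ("Party A fully indemnifies Party B except in cases of "
          "gross negligence or willful misconduct by Party B")

for i, chunk in enumerate(chunk_by_tokens(clause, max_tokens=8)):
    print(i, repr(chunk))
# Chunk 0 ends at "except in" — retrieved alone, it reads like an
# unconditional indemnity, which is exactly the wrong legal conclusion.
```

If retrieval returns only chunk 0, no amount of model "intelligence" can recover the exception clause sitting in chunk 1.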

That same context problem also drives cost and reliability. In retrieval-augmented generation (RAG), the system typically retrieves only three to five chunks per question, chosen by semantic fit. If the true answer is fragmented across those chunks, the model can’t reconstruct missing terms without guessing—fueling hallucinations. Bad chunking also inflates bills: retrieving extra chunks means more tokens loaded into the context window, which can overwhelm the model with irrelevant material and ironically degrade accuracy. The practical takeaway is blunt: chunking is a first line of defense against hallucinations and a lever for cutting model-provider (API) costs by double-digit percentages when done well.
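The "only 3–5 chunks per query" constraint can be sketched as a top-k selection step. This is a toy stand-in: a word-overlap cosine score replaces real embedding similarity, but the structural point holds—anything outside the top k never reaches the model:

```python
# Minimal top-k retrieval sketch. A production system scores chunks with
# embedding vectors; here a toy word-count cosine stands in for that.
from collections import Counter
import math

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return only the top-k chunks; everything else never reaches the model."""
    return sorted(chunks, key=lambda c: similarity(query, c), reverse=True)[:k]
```

If the answer spans two chunks and one of them falls outside the top k, the model must guess at the missing piece—and larger k is not free, since every extra chunk adds tokens to the context window.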

Agentic search doesn’t eliminate the need for chunking; it changes the trade-off. Agentic search uses iterative reasoning—searching, reading, reasoning, and searching again—so it can help with exploratory questions or multi-step tasks like aggregating the total impact of a marketing campaign across channels. But it can be 10x slower and 10x more expensive than well-targeted RAG retrieval. Even when agentic systems are used, they still rely on semantic selection from retrieved units; messy chunking makes that selection worse. The “no free lunch” message is that businesses must wrestle with their own data structure rather than expecting agentic search to bypass embeddings and chunking decisions.

Five chunking principles emerge as the scalable path for production systems. First is context coherence: never split meaning across chunks (e.g., separating “defendant shall pay damages” from the conditions that follow). Respect natural boundaries such as contract sections, code functions/classes, or conversation speaker turns. Second is controlling the three levers—boundaries, size, and overlap—rather than relying on arbitrary token counts. Overlap (often 10–20%) acts as insurance when meaning spans chunk edges. Third, data type dictates strategy: legal text, source code, financial tables, and spreadsheets each require different chunking logic. For code, dependency graphs and “neighborhood chunking” (including called functions) can help; for messy, coupled code, agentic search may be a pragmatic bridge. Excel and financial data are especially tricky because they encode relationship webs—time windows, categories, formulas, and pivot hierarchies can’t be chunked row-by-row.
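The size and overlap levers can be sketched as a sliding window whose stride shrinks as overlap grows. This is a minimal illustration, not a production chunker—real systems would also respect the boundary lever rather than cutting at arbitrary positions:

```python
# Sketch of the size and overlap levers: a fixed-size window with a
# fractional overlap. The 10-20% figure from the text becomes overlap_frac.

def chunk_with_overlap(tokens: list[str], size: int,
                       overlap_frac: float = 0.15) -> list[list[str]]:
    """Window of `size` tokens; adjacent windows share ~overlap_frac of tokens."""
    step = max(1, int(size * (1 - overlap_frac)))  # stride shrinks as overlap grows
    return [tokens[i:i + size]
            for i in range(0, len(tokens), step)
            if tokens[i:i + size]]
```

Because adjacent windows share their edge tokens, an idea that straddles one boundary still appears whole in at least one chunk—the "insurance" role the principle describes.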

Fourth is “Goldilocks” sizing: chunks too small lose context and lead to “I don’t know,” while chunks too large waste tokens and produce unfocused answers. The right approach is to build an evaluation set and test chunking strategies against it. Fifth is overlap as insurance, tailored to the data’s structure (temporal for time series, categorical for categorical data). The overall argument is that chunking isn’t a minor implementation detail—it’s foundational to retrieval accuracy, hallucination control, and cost efficiency across both RAG and agentic search, especially when corporate data is messy and hard to re-architect later.
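Testing strategies against an evaluation set can be as simple as measuring how often the required evidence lands in the retrieved top-k. The sketch below is illustrative: the scoring rule (phrase containment) and the toy word-overlap retriever are stand-ins for whatever eval harness and retriever a real pipeline uses:

```python
# Hedged sketch: score competing chunking strategies on a shared eval set.
# Each eval pairs a question with a phrase a correct retrieval must contain.
# The toy retriever and containment check are illustrative simplifications.

def toy_retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by words shared with the question (stand-in for embeddings)."""
    qwords = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(qwords & set(c.lower().split())),
                  reverse=True)[:k]

def hit_rate(chunker, docs: list[str],
             evals: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of eval questions whose required phrase lands in the top-k chunks."""
    chunks = [c for d in docs for c in chunker(d)]
    hits = sum(any(phrase in c for c in toy_retrieve(q, chunks, k))
               for q, phrase in evals)
    return hits / len(evals)
```

Running several chunkers through the same `hit_rate` makes the Goldilocks trade-off measurable instead of a matter of taste.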

Cornell Notes

Chunking determines what information an AI can retrieve and therefore what it can answer correctly. In RAG systems, only a small set of chunks (often 3–5) is retrieved; if key contract or technical meaning is split across chunk boundaries, the system returns confident but wrong answers and may “hallucinate” to fill gaps. Chunking also affects cost: retrieving more chunks loads more tokens into the context window, raising spend and sometimes reducing accuracy by adding irrelevant context. Agentic search can help for exploratory or multi-step tasks, but it still depends on semantic retrieval units and is often far slower and more expensive than well-chunked RAG. Effective chunking follows five principles: preserve context coherence, tune boundaries/size/overlap, adapt to data type (contracts, code, spreadsheets), size for Goldilocks outcomes using evals, and use overlap as insurance.

Why did the fintech chatbot give a wrong indemnification answer even though the model sounded confident?

The contract meaning was broken across chunks. Token-based chunking split a sentence so retrieval returned only the first chunk, which contained an incomplete legal condition. With the missing portion absent from the retrieved set, the system produced an incorrect but confident indemnification conclusion—an error rooted in context engineering, not model “intelligence.”

How does chunking influence both hallucinations and cost in RAG?

RAG typically retrieves only a few chunks per query (often 3–5). If the correct answer spans multiple chunks and part of it isn’t retrieved, the model must guess, which manifests as hallucinations. Bad chunking also increases cost because retrieving extra chunks loads more tokens into the context window; too much irrelevant context can further reduce accuracy.

What’s the practical difference between RAG and agentic search, and why doesn’t agentic search remove chunking?

RAG relies on fast, economical retrieval of semantically relevant chunks; chunking makes that retrieval accurate. Agentic search iteratively searches and reasons across steps, which can help when the answer path is unclear or requires multi-step reasoning across scattered data. But agentic systems still need semantic selection of information units, so poor chunking still harms performance. Agentic search also tends to be much slower and more expensive (described as 10x in the transcript).

What does “context coherence” require when chunking contracts, code, or conversations?

Context coherence means chunks must preserve meaning so the AI never has to infer missing conditions. For contracts, that means respecting natural boundaries like sections and subsections. For code, it means cutting along semantic units like functions and classes. For conversations, it often means speaker turns or time windows—avoiding splits that separate claims from their conditions.
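For conversations, boundary-respecting chunking can be sketched as splitting on speaker turns. This assumes a simple `Name:` label at the start of each turn—real transcripts vary, so the pattern is illustrative:

```python
import re

# Illustrative boundary-respecting splitter for transcripts: one chunk per
# speaker turn, so a claim is never separated from who said it.
# Assumes turns begin with a "Name:" label at the start of a line.

def chunk_by_speaker_turns(transcript: str) -> list[str]:
    """Split on newlines that start a new 'Name:' turn; keep turns intact."""
    turns = re.split(r"\n(?=\w+:)", transcript.strip())
    return [t.strip() for t in turns if t.strip()]
```

The same idea transfers to contracts (split at section headings) and code (split at function or class definitions): the boundary pattern changes, but the principle—cut where the source material already cuts—stays the same.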

How should chunking strategy change across data types like legal text, source code, and spreadsheets?

Legal text can often be chunked by clearly labeled structure (sections/subsections). Source code may require dependency-aware chunking: build dependency graphs and use neighborhood chunking that includes a function plus what it calls; for highly coupled code, chunking larger units like entire classes/modules or using agentic search may be necessary. Spreadsheets and financial dashboards preserve relationship webs (time series orientation, categories, formulas, pivot hierarchies), so row-by-row chunking fails; chunk by semantic time windows, dependency-linked calculable units, and hierarchical summaries.
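"Neighborhood chunking" can be sketched as bundling a function with the functions it calls. This assumes you already have each function's source and a call graph (both are toy inputs here; a real pipeline would extract them with a parser):

```python
# Sketch of dependency-aware "neighborhood chunking" for code, assuming a
# precomputed map of function sources and a call graph. Toy data structures.

def neighborhood_chunk(fn: str, source: dict[str, str],
                       calls: dict[str, list[str]]) -> str:
    """Bundle a function with its direct callees so retrieval sees both."""
    neighbors = [fn] + [c for c in calls.get(fn, []) if c in source]
    return "\n\n".join(source[n] for n in neighbors)
```

Retrieving the bundled chunk means the model sees `pay` together with the `tax` it calls, rather than guessing at the callee's behavior—the same "keep dependent meaning together" rule as contract sections, applied to code.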

How do “size for Goldilocks outcomes” and overlap work together in practice?

Chunk size should match the semantic unit needed for a good answer: too small yields vague “I don’t know,” too large wastes tokens and produces unfocused responses. The transcript recommends using an evaluation set to test chunking strategies rather than relying on arbitrary token counts. Overlap (often 10–20%) ensures that if meaning crosses a boundary, at least one chunk contains the complete idea; overlap direction may differ for time series versus categorical data.

Review Questions

  1. What failure mode occurs when a contract sentence is split across chunks, and how does retrieval behavior (e.g., 3–5 chunks) contribute to it?
  2. Which of the three chunking levers (boundaries, size, overlap) most directly affects retrieval accuracy, and why?
  3. How would you design an evaluation set to compare multiple chunking strategies for the same RAG pipeline?

Key Points

  1. Chunking errors can produce confident but wrong answers because RAG retrieves only a few chunks; missing meaning across chunk boundaries can’t be recovered reliably.
  2. Bad chunking increases hallucinations and costs by forcing retrieval of extra chunks and injecting irrelevant context into the model’s context window.
  3. Agentic search can help for exploratory and multi-step tasks, but it still depends on semantic retrieval units and is often far slower and more expensive than well-chunked RAG.
  4. Preserve context coherence by cutting along natural semantic boundaries (contract sections, code functions/classes, conversation turns) and avoiding splits that separate claims from conditions.
  5. Tune chunking using boundaries, size, and overlap; overlap (often 10–20%) acts as insurance when meaning spans chunk edges.
  6. Use data-type-specific chunking strategies: legal text, dependency-aware code chunking, and relationship-preserving spreadsheet/financial chunking require different approaches.
  7. Find the right chunk size with evals (“Goldilocks outcomes”) rather than arbitrary token thresholds, and validate strategies against a shared question set.

Highlights

A fintech indemnification mistake traced back to chunking: token splits broke a sentence so retrieval returned only the first chunk, yielding the wrong legal conclusion.
Chunking affects more than accuracy—retrieving extra chunks loads more tokens, raising costs and sometimes reducing quality by adding irrelevant context.
Agentic search isn’t a bypass: it still relies on semantic units and can be 10x slower and more expensive than chunked RAG.
Five principles drive effective chunking: context coherence; control boundaries/size/overlap; adapt to data type; size via evals; and use overlap as insurance.
Spreadsheets and financial tables can’t be chunked row-by-row because they encode relationship webs—time windows, formulas, and pivot hierarchies must be respected.