
Why Your RAG Gives Wrong Answers (And 4 Chunking Strategies to Fix It) | LangChain Text Splitters

Venelin Valkov · 6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Chunking can break RAG coherence even when embeddings, vector search, and LLM context windows are strong.

Briefing

RAG systems often fail for a surprisingly mundane reason: chunking breaks the information the model needs, even when embeddings, vector search, and the LLM’s context window are strong. The core claim is that retrieval quality depends less on “bigger context” and more on how documents are split during ingestion—bad splits can destroy coherence, cut bullet lists in half, and leave the model without the surrounding definitions or headings that make an answer possible.

The discussion starts by challenging a common assumption: modern LLMs with large context windows and powerful embeddings don’t guarantee perfect recall. Even with improved models, relevant text can still be missed or diluted, and it’s impractical to stuff an entire company knowledge base into a single prompt. Chunking is therefore framed as a critical step in the RAG pipeline: raw documents are split into smaller chunks so retrieval can surface the right passages. When chunking goes wrong, the LLM receives insufficient or confusing context—an example shows a naive fixed-size character split (1,024 characters) that slices through a “business days” bullet list, leaving one chunk starting mid-idea and the next chunk continuing the thought without the missing structure. The result is context destruction rather than context creation.
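The failure mode above can be reproduced in a few lines. This is an illustrative sketch, not the video's exact code: the sample text and the 40-character chunk size are made up (the video uses 1,024 characters), but the effect is the same: a structure-blind split severs a bullet list, here even cutting through the word "business".

```python
def naive_split(text: str, chunk_size: int) -> list[str]:
    """Split text into fixed-size character chunks, ignoring all structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

document = (
    "Shipping policy\n"
    "- Standard delivery: 5 business days\n"
    "- Express delivery: 2 business days\n"
)

chunks = naive_split(document, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
# The first chunk ends mid-bullet ("...: 5 b") and the second begins
# mid-word ("usiness days..."), so neither chunk answers the question alone.
```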

Four chunking strategies are then presented in increasing complexity, each with trade-offs.

First, the recursive character text splitter uses separators in a hierarchy (new lines, paragraphs, sentences, etc.) to build chunks around a target size and overlap. It’s described as a common starting point and “not that bad” if trade-offs are understood, but the example shows it can still split inside structured lists—bullet lists become fragmented across chunks.
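The separator-hierarchy idea can be sketched in plain Python. This is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter` (the real class also supports overlap, custom length functions, and regex separators); the separator list, sample text, and 30-character target are illustrative.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_split(text: str, chunk_size: int, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones
    only for pieces that are still too large."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Piece still too big: flush what we have, recurse with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current = current + sep + piece if current else piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph.\n\nSecond paragraph with more words.\n\nThird."
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Note how the oversized middle paragraph still gets fragmented at a word boundary: the hierarchy avoids mid-word cuts but cannot keep a long structured section whole, which is exactly the weakness the transcript points out.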

Second, the markdown header text splitter leverages document structure. If content is converted to markdown with meaningful headers (H1/H2), chunk boundaries align with human-intended sections. In the example, splitting on H1 and H2 yields more chunks and better preservation of responsibilities and their associated bullet points. The downside is loss of control over chunk size when sections are uneven.
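Header-based splitting reduces to "start a new chunk at every H1/H2 line". The sketch below captures that spirit; LangChain's `MarkdownHeaderTextSplitter` additionally attaches the header path as chunk metadata, which this toy version omits. The sample document is invented.

```python
import re

def split_on_headers(markdown: str) -> list[str]:
    """Begin a new chunk whenever a line starts with '# ' or '## '."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,2} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = (
    "# Role\nIntro text.\n"
    "## Responsibilities\n- Triage tickets\n- Write reports\n"
    "## Requirements\n- Python"
)
chunks = split_on_headers(doc)
for chunk in chunks:
    print(chunk, "\n---")
```

Each responsibilities bullet stays with its section header, but chunk sizes now track section lengths: a ten-page section becomes a ten-page chunk.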

Third, the semantic chunker uses embedding-based similarity to decide where to split. By measuring distances between sentences and applying a threshold, it aims to create chunks that are coherent by meaning rather than by formatting. The trade-off is speed and compute cost: it can be slow and depends heavily on embedding quality and hardware.
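The mechanism can be sketched without a neural model: split wherever similarity between adjacent sentences drops below a threshold. A real implementation such as LangChain's `SemanticChunker` uses embedding vectors; here a bag-of-words cosine similarity stands in so the example runs offline, and the sentences and 0.2 threshold are invented.

```python
import math
import re

def bow(sentence: str) -> dict[str, int]:
    """Bag-of-words counts as a cheap stand-in for an embedding."""
    counts: dict[str, int] = {}
    for word in re.findall(r"[a-z']+", sentence.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk when adjacent-sentence similarity falls below threshold."""
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow(prev), bow(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Refunds are processed in five business days.",
    "Refunds for express orders take two business days.",
    "Our office is located in Sofia.",
]
chunks = semantic_chunks(sentences)
print(chunks)
```

The two refund sentences stay together while the unrelated address sentence gets its own chunk. With real embeddings the same logic requires one model call per sentence, which is where the speed and compute cost come from.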

Fourth, an LLM-driven chunking approach asks a strong model to choose split points. Candidate split locations are proposed (e.g., after each markdown line that begins with a heading), and the LLM returns which indices to split at. The prompt includes document-specific instructions, such as keeping forms, images, and tables in separate chunks with their descriptive text. Using a reasoning model (Qwen 3, the 4-billion-parameter variant) executed locally via Ollama, the method produces cleaner, task-aligned chunks, described as “pretty much the best possible approach” when compute and model strength are available.
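The pipeline around the model call can be sketched as three steps: propose candidate boundaries before each heading, ask a model which candidates to keep, and cut at the chosen indices. The `choose_splits` stub below stands in for the real LLM call (the video serves a local reasoning model via Ollama); its trivial "accept everything" behavior and the sample document are assumptions for illustration.

```python
def candidate_indices(lines: list[str]) -> list[int]:
    """Indices of lines that start a markdown heading -- candidate split points."""
    return [i for i, line in enumerate(lines) if line.startswith("#")]

def choose_splits(lines: list[str], candidates: list[int]) -> list[int]:
    # Placeholder for the LLM call. A real prompt would include the numbered
    # document plus rules such as "keep forms, images, and tables in their
    # own chunks together with their descriptive text".
    return candidates  # pretend the model accepted every candidate

def apply_splits(lines: list[str], indices: list[int]) -> list[str]:
    """Cut the document at the chosen line indices."""
    chunks, start = [], 0
    for i in indices:
        if i > start:
            chunks.append("\n".join(lines[start:i]))
        start = i
    chunks.append("\n".join(lines[start:]))
    return chunks

lines = ["# Policy", "Intro.", "## Refunds", "- 5 business days", "## Contact", "Email us."]
chunks = apply_splits(lines, choose_splits(lines, candidate_indices(lines)))
print(chunks)
```

Restricting the model to pre-computed candidate indices is the key design choice: the LLM only picks from valid boundaries, so it can never invent a split in the middle of a sentence or table.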

A bonus technique adds Anthropic-style contextual retrieval: each chunk is enriched with extra context so the model understands what the chunk represents within the larger document. Using a visual language model (Qwen 2.5 VL) and the first page image, the system generates 2–3 sentences of contextual background and prepends it to the relevant chunk text—especially helpful for standalone sections like a complaint form.
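The enrichment step is just "generate a short description, then prepend it". In the sketch below, `describe_chunk` is a stub for the model call (the video uses the Qwen 2.5 VL vision-language model on the first page image to write 2-3 sentences); its one-line output and the sample chunk are placeholders.

```python
def describe_chunk(chunk: str, document_title: str) -> str:
    # Placeholder for the VLM/LLM call that writes 2-3 sentences situating
    # the chunk within the overall document.
    return f"This section is part of the document '{document_title}'."

def contextualize(chunks: list[str], document_title: str) -> list[str]:
    """Prepend generated document-level context to each chunk before indexing."""
    return [f"{describe_chunk(c, document_title)}\n\n{c}" for c in chunks]

chunks = ["Complaint form: name, order number, description of the issue."]
enriched = contextualize(chunks, "Customer Service Policy")
print(enriched[0])
```

The enriched text, not the bare chunk, is what gets embedded and indexed, so a standalone form can still be retrieved by queries that only mention the surrounding policy.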

The takeaway is pragmatic: choose chunking based on document type. Start with markdown header splitting when markdown structure exists; use recursive character splitting for quick prototypes; try semantic chunking if you can afford compute; and use LLM-based split-point selection for the highest-quality chunks. Contextual retrieval can further improve answers when chunks need “document-level” grounding.

Cornell Notes

Chunking quality is a major driver of RAG failures, often more than embeddings or the vector database. Poor splits can cut through bullet lists and headings, leaving retrieved chunks without the context needed to answer questions. The transcript walks through four LangChain chunking strategies: recursive character splitting (fast but can fragment structured content), markdown header splitting (better alignment with document sections but less control over chunk size), semantic chunking (meaning-based splits using embedding distances but slower and compute-heavy), and LLM-driven split-point selection (highest-quality chunks by asking a model to choose where to split, including special handling for forms/images/tables). A bonus technique adds contextual retrieval by enriching each chunk with background generated from the document’s first page image using a visual language model, improving performance for standalone sections like forms.

Why does chunking cause “wrong answers” in RAG even when embeddings and the LLM are strong?

Chunking determines what information gets retrieved together. If a split breaks coherence—like slicing a bullet list so one chunk contains only part of a definition and the next chunk contains the continuation—the LLM can’t reconstruct the missing structure. The transcript’s example shows a fixed 1,024-character split that separates “business days” content across chunks, forcing the model to answer with incomplete context.

What are the practical strengths and weaknesses of recursive character text splitting?

Recursive character text splitting targets a chunk size (1,024 characters in the example) and uses overlap to preserve continuity. It tries separators in a hierarchy (new lines, paragraphs, sentences) to avoid splitting mid-sentence. Still, structured content like bullet lists can be fragmented: one chunk may end after a heading while the bullet list continues in the next chunk, weakening retrieval usefulness.

When does markdown header text splitting outperform character-based methods?

When documents are available in markdown with meaningful headers (H1/H2), header-based splitting aligns chunk boundaries with human-intended sections. In the example, splitting on H1 and H2 produced 12 chunks and kept responsibilities and associated bullet points together more reliably than character splitting. The trade-off is that chunk size becomes dependent on how long each markdown section is.

How does semantic chunking decide split points, and what’s the cost?

Semantic chunking computes embedding-based distances between sentences and applies a threshold: when similarity drops enough, it creates a new chunk. In the example, the first chunk combined the first four sentences, while the last chunk contained the final sentence. The downside is latency and compute: it can be slow and depends on embedding model quality and available hardware.

What makes LLM-driven split-point selection more accurate than the earlier strategies?

It uses an LLM to choose which candidate boundaries to split, guided by detailed instructions tailored to the document type. The transcript proposes split candidates after markdown lines that start with headings, then asks the model to return chunk indices to split. The prompt also requests special handling—keeping forms, images, and tables in separate chunks along with their descriptive text—resulting in cleaner, more task-aligned chunks.

How does contextual retrieval (Anthropic-style) improve answers for form-like sections?

It enriches each chunk with additional background so the LLM understands what the chunk represents within the full document. The transcript describes generating 2–3 sentences of context using a visual language model (Qwen 2.5 VL) from the document’s first page image, then attaching that context to the relevant chunk text. This is especially helpful when a chunk (like a complaint form) lacks surrounding definitions that exist elsewhere in the document.

Review Questions

  1. If a RAG system retrieves the “right” passage but still answers incorrectly, what chunking failure modes should be checked first (e.g., broken bullet lists, missing headings, or incoherent boundaries)?
  2. Compare markdown header splitting and semantic chunking: which one depends on document formatting, which one depends on embedding similarity, and how do those dependencies affect chunk size control and runtime?
  3. Why might LLM-driven split-point selection be worth the extra compute in documents with forms, tables, and images? What prompt instructions would you include to preserve those relationships?

Key Points

  1. Chunking can break RAG coherence even when embeddings, vector search, and LLM context windows are strong.
  2. Naive fixed-size character splitting can slice through structured elements like bullet lists, removing the context needed to answer.
  3. Recursive character text splitting is a common baseline but can still fragment lists and headings despite overlap.
  4. Markdown header text splitting often works best when documents are converted to markdown with reliable H1/H2 structure.
  5. Semantic chunking creates meaning-based boundaries using embedding distance thresholds, but it can be slow and compute-intensive.
  6. LLM-driven split-point selection can produce higher-quality chunks by choosing boundaries with document-specific rules (e.g., keep forms/images/tables with their descriptions).
  7. Contextual retrieval can further improve results by enriching chunks with document-level background generated from the first page image.

Highlights

Large context windows don’t guarantee perfect recall; chunking still determines whether retrieved text contains the right surrounding structure.
A 1,024-character naive split can destroy meaning by slicing through a bullet list, leaving definitions scattered across chunks.
Markdown header splitting aligns chunk boundaries with document sections, often preserving responsibilities and related bullet points better than character splitting.
Semantic chunking uses embedding distance thresholds to split by meaning, but runtime cost rises with text length.
LLM-driven chunking can enforce special handling—separating forms/images/tables while keeping their descriptions together—yielding the cleanest chunks in the example.
