Build Better RAGs with Contextual Retrieval

Venelin Valkov · 6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Contextual retrieval rewrites each RAG chunk as “generated context + original chunk,” using the full document as input to an LLM.

Briefing

Contextual retrieval boosts retrieval-augmented generation (RAG) accuracy by enriching every text chunk with extra, chunk-specific context derived from the full source document—so the retriever pulls passages that are already “self-contained” instead of relying on the model to infer missing surrounding details. The approach, introduced by Anthropic, is especially effective when plain embedding search or even embedding plus BM25 reranking still misses key information. In Anthropic’s reported tests on the top 20 retrieved chunks, failure rates drop from about 5.7% with plain embeddings to about 5% with BM25 reranking, then further to roughly 1.3% when using contextual embeddings; combining BM25 reranking with contextual embeddings yields an even larger improvement. The tradeoff is practical: contextual retrieval adds extra inference time and cost because an LLM must generate the added context for each chunk during preprocessing.

The implementation starts like a standard RAG pipeline: split a large document into chunks, then compute embeddings for those chunks and store them in a vector database. The key difference comes next. For each chunk, an LLM is prompted with (1) the chunk text and (2) the entire document, and asked to produce a short context snippet that situates the chunk within the document. The resulting “context + chunk” becomes the new chunk text that gets embedded and indexed. In the demo, the workflow uses Llama 3.1 (via the Groq API) and a recursive character text splitter (chunk size set to 248, with guidance to increase it if the model can handle more tokens). The contextualization prompt uses XML tags to bound the chunk within the full document, mirroring Anthropic’s recommended structure.
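
As a rough illustration, the preprocessing loop can be sketched as below. This is a minimal sketch, not the notebook's exact code: it assumes LangChain's RecursiveCharacterTextSplitter, the Groq Python client with an illustrative Llama 3.1 model id, and a hypothetical source file name; the prompt wording paraphrases Anthropic's published contextualization prompt.

```python
from pathlib import Path

from groq import Groq
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = Path("qwen_readme.md").read_text()  # hypothetical source document

# Chunk size mirrors the demo's 248 characters; increase it if your model and
# embedding budget allow longer chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=248, chunk_overlap=0)
chunks = splitter.split_text(document_text)

# XML tags bound the chunk within the full document, following the structure
# Anthropic recommends; the exact wording here is a paraphrase.
CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short, succinct context that situates this chunk within the overall
document, for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

client = Groq()  # reads GROQ_API_KEY from the environment
contextualized_chunks = []
for chunk in chunks:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # illustrative Llama 3.1 model id on Groq
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document_text, chunk=chunk),
        }],
    )
    context = response.choices[0].message.content.strip()
    # The enriched "context + chunk" text is what gets embedded and indexed.
    contextualized_chunks.append(f"{context}\n\n{chunk}")
```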

After preprocessing, retrieval proceeds normally: the user query is embedded, the vector database returns the nearest chunks using the distance metric provided by the pgvector extension, and the top contextualized chunks are inserted into the final LLM prompt. The demo’s question—“how many parameters does Qwen have”—is answered using context retrieved from Qwen-related README content, with the system prompt positioning the model as an ML engineer and instructing it to use the provided context. The notebook also demonstrates practical concerns: contextual retrieval requires generating context for many chunks (the author notes around 15 chunks in one run), which increases preprocessing cost.
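
A minimal sketch of that query-time flow, assuming a pgvector-backed Postgres table named `chunks`, the `sentence-transformers` library for query embeddings, and the Groq client for the final answer; the table name, embedding model, connection string, and model id are assumptions, not the demo's exact setup.

```python
import psycopg
from groq import Groq
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
conn = psycopg.connect("dbname=rag")                # assumed connection string
register_vector(conn)

query = "how many parameters does Qwen have"
query_vec = embedder.encode(query)

# `<=>` is pgvector's cosine-distance operator; the smallest distances are the
# nearest contextualized chunks.
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_vec,),
).fetchall()

# Retrieved chunks go into the prompt separated by three dashes.
context = "\n---\n".join(content for (content,) in rows)

messages = [
    {"role": "system",
     "content": "You are an ML engineer. Answer using only the provided context."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
]
answer = Groq().chat.completions.create(
    model="llama-3.1-8b-instant",  # illustrative Llama 3.1 model id on Groq
    messages=messages,
)
print(answer.choices[0].message.content)
```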

Several follow-up Q&A segments address deployment and integration. For running Qwen on a private server, the demo points to using vLLM to expose an OpenAI-compatible API. For hardware planning, an RTX 4090 with 24 GB of VRAM is used as an example; the model suggests 7B or 14B variants might be feasible, with 14B potentially pushing limits depending on quantization and system constraints. Finally, the demo confirms that Qwen can be used in LangGraph-based applications, including human-in-the-loop interaction patterns, and corrects a mistaken claim about LangGraph UI inputs: the core integration happens through the model node, not a built-in UI system.
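
For the private-deployment point, the idea is that a vLLM server speaks the same protocol as OpenAI's API, so any OpenAI-compatible client can talk to it. A hedged sketch, with an assumed model id, host, and port:

```python
from openai import OpenAI

# vLLM serves Qwen behind an OpenAI-compatible endpoint, e.g. started with
# something like: vllm serve Qwen/Qwen2.5-14B-Instruct --port 8000
# Model id, host, and port here are assumptions, not the demo's exact setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "How many parameters does Qwen have?"}],
)
print(response.choices[0].message.content)
```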

Overall, contextual retrieval is presented as a straightforward preprocessing upgrade to RAG: it costs more upfront, but it can materially reduce retrieval failure by making chunks more informative at search time, not just at generation time.

Cornell Notes

Contextual retrieval improves RAG by rewriting each chunk as “context + chunk,” where the added context is generated from the full document using an LLM. That enriched text is then embedded and stored in a vector database, so similarity search retrieves passages that already contain the surrounding meaning the model would otherwise have to reconstruct. Anthropic’s reported results show large reductions in retrieval failure rates when using contextual embeddings, especially when combined with BM25 reranking. The demo implements the method with Llama 3.1 via the Groq API, using a recursive character splitter and a prompt that bounds the chunk within the full document using XML tags. The main downside is extra preprocessing inference time and cost, so benchmarking with and without contextual retrieval is recommended.

What problem does contextual retrieval solve in RAG, and how does it change the retrieval unit?

It targets the common failure mode where a retrieved chunk is technically relevant but lacks the surrounding details needed to answer accurately. Instead of embedding and searching the raw chunk alone, contextual retrieval generates additional, chunk-specific context from the entire source document. The indexed text becomes “context + chunk,” making each retrieval result more self-contained and reducing the chance that the model must guess missing information.
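
A tiny illustration of how the retrieval unit changes; both strings are invented for the example, not taken from the demo:

```python
# Raw chunk: relevant to the query, but missing the surrounding detail
# needed to answer it reliably on its own.
raw_chunk = "It supports a context length of up to 128K tokens."

# Context generated by the LLM from the full document, then prepended.
generated_context = (
    "This chunk comes from the README of the Qwen model family, in the "
    "section describing the capabilities of the larger model variants."
)

# The retrieval unit that actually gets embedded and indexed: "context + chunk".
contextualized_chunk = f"{generated_context}\n\n{raw_chunk}"
```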

Why do contextual embeddings outperform plain embeddings and BM25 reranking in reported tests?

In Anthropic’s test results on the top 20 retrieved chunks, plain embedding retrieval had about a 5.7% failure rate. Adding BM25 reranking reduced it to about 5%. Contextual embeddings produced a much larger drop to roughly 1.3% failure rate, and combining BM25 reranking with contextual embeddings improved further. The improvement comes from embedding a richer representation that includes document-grounded context, not just the chunk’s surface text.

How is the “context + chunk” text generated during preprocessing?

The pipeline first splits a large document into chunks (using a recursive character text splitter in the demo). For each chunk, an LLM prompt receives both the chunk and the full document, then outputs a short context snippet that belongs with that chunk. The prompt uses XML tags to bound the chunk within the document, and the generated context is appended to the chunk text. That enriched chunk text is what gets embedded and stored.
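
A minimal sketch of the embed-and-store step that follows, assuming a Postgres database with the pgvector extension and the `sentence-transformers` library; `contextualized_chunks` refers to the “context + chunk” strings produced during preprocessing, and the table name, connection string, and embedding model are assumptions:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # assumed embedding model
conn = psycopg.connect("dbname=rag", autocommit=True)  # assumed connection string

conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute(
    "CREATE TABLE IF NOT EXISTS chunks ("
    "id serial PRIMARY KEY, "
    "content text, "
    "embedding vector(384))"  # 384 dimensions matches the assumed embedding model
)

# `contextualized_chunks` holds the "context + chunk" strings built earlier.
for text in contextualized_chunks:
    conn.execute(
        "INSERT INTO chunks (content, embedding) VALUES (%s, %s)",
        (text, embedder.encode(text)),
    )
```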

What does retrieval look like after contextual chunks are indexed?

Retrieval stays standard: embed the user query, search the vector database for the nearest contextualized chunks using pgvector’s distance operator (as in the demo), and pass the top results into the final LLM prompt. The demo’s system prompt positions the model as an ML engineer and instructs it to use the provided context, with the retrieved chunks separated by three dashes, followed by the user’s question.

What practical costs and engineering constraints come with contextual retrieval?

The method adds preprocessing inference time and cost because an LLM must generate context for each chunk before indexing. The demo notes that creating contextual chunks may require calling the model for many chunks (around 15 in one run). It also increases compute needs for embedding and storage, and it’s sensitive to chunking choices (chunk size and how many chunks a document produces).
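
A back-of-envelope way to see how that preprocessing cost scales; all numbers below are illustrative except the roughly 15 chunks mentioned for the demo run:

```python
# Back-of-envelope preprocessing cost: every chunk triggers one LLM call whose
# prompt contains the entire document, so cost grows with num_chunks * doc size.
doc_tokens = 4_000      # illustrative: tokens in the full source document
chunk_tokens = 60       # illustrative: tokens per ~248-character chunk
context_tokens = 50     # illustrative: tokens generated per context snippet
num_chunks = 15         # chunk count mentioned for the demo run

prompt_tokens = num_chunks * (doc_tokens + chunk_tokens)
output_tokens = num_chunks * context_tokens
print(f"~{prompt_tokens:,} prompt tokens and ~{output_tokens:,} output tokens "
      "before a single query is answered")
```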

How can Qwen be deployed and integrated into an application stack?

For private deployment, the demo points to vLLM to run a service with an OpenAI-compatible API. For hardware planning, an example with an RTX 4090 and 24 GB of VRAM suggests 7B or 14B variants may run with good inference speed, with 14B potentially near the limit depending on quantization and system factors. For orchestration, Qwen can be used within LangGraph by defining a LangGraph workflow that includes a Qwen model node as the LLM component, and LangGraph supports human-in-the-loop interaction patterns.
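
A minimal LangGraph sketch along those lines, assuming Qwen is already served by vLLM behind an OpenAI-compatible endpoint; the endpoint URL, model id, and graph shape are assumptions, not the demo's exact code:

```python
from typing import TypedDict

from langchain_openai import ChatOpenAI
from langgraph.graph import END, START, StateGraph

# Qwen served locally by vLLM behind an OpenAI-compatible API,
# e.g. started with: vllm serve Qwen/Qwen2.5-14B-Instruct --port 8000
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM endpoint
    api_key="not-needed",                 # vLLM does not require a real key by default
    model="Qwen/Qwen2.5-14B-Instruct",    # assumed model id
)

class State(TypedDict):
    question: str
    answer: str

def call_model(state: State) -> dict:
    # The Qwen model node is the LLM component of the graph.
    reply = llm.invoke(state["question"])
    return {"answer": reply.content}

graph = StateGraph(State)
graph.add_node("model", call_model)
graph.add_edge(START, "model")
graph.add_edge("model", END)
app = graph.compile()

result = app.invoke({"question": "how many parameters does Qwen have"})
print(result["answer"])
```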

Review Questions

  1. How does contextual retrieval modify the text that gets embedded and indexed, and why does that improve retrieval accuracy?
  2. In the demo’s pipeline, where do the extra LLM calls happen, and what are the main cost implications?
  3. What role do XML tags play in the contextualization prompt, and how does that affect the chunk-to-document relationship?

Key Points

  1. Contextual retrieval rewrites each RAG chunk as “generated context + original chunk,” using the full document as input to an LLM.
  2. Anthropic’s reported tests show retrieval failure rates dropping from ~5.7% (plain embeddings) to ~5% (BM25 reranking) and to ~1.3% (contextual embeddings), with further gains when combining contextual embeddings and BM25.
  3. The preprocessing step is the main cost driver: generating context for every chunk adds inference time and expense before indexing.
  4. After indexing contextualized chunks, query-time retrieval remains standard: embed the query, use vector similarity (distance via pgvector), and feed top chunks into the final prompt.
  5. Chunking strategy matters: the demo uses a recursive character splitter with chunk size 248 and suggests increasing chunk size if the target model supports more tokens.
  6. For deployment, vLLM can host Qwen behind an OpenAI-compatible API, and hardware feasibility depends on model size, VRAM, and quantization choices.
  7. Qwen can be integrated into LangGraph workflows as the LLM component, including human-in-the-loop interaction patterns.

Highlights

Contextual retrieval reduces retrieval failure by embedding “context + chunk” rather than raw chunks, making retrieved passages more self-contained.
Anthropic’s top-20 retrieval results report failure rates falling to about 1.3% with contextual embeddings, versus ~5.7% with plain embeddings.
The method’s downside is straightforward: extra preprocessing LLM calls increase time and cost, so benchmarking is essential.
The demo uses Groq’s API to run Llama 3.1 for contextual chunk generation and pgvector for distance-based retrieval.
vLLM is recommended for private Qwen deployment, and a 24 GB VRAM example suggests 7B is safer than 14B for good inference speed.

Mentioned

  • RAG
  • BM25
  • LLM
  • API
  • VRAM
  • CPU
  • RAM
  • XML