Build Better RAGs with Contextual Retrieval
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Contextual retrieval boosts retrieval-augmented generation (RAG) accuracy by enriching every text chunk with extra, chunk-specific context derived from the full source document, so the retriever pulls passages that are already "self-contained" instead of relying on the model to infer missing surrounding details. The approach, introduced by Anthropic, is especially effective when plain embedding search, or even embedding plus BM25 reranking, still misses key information. In Anthropic's reported tests, the retrieval failure rate (the fraction of queries whose relevant chunk is missing from the top 20 results) drops from about 5.7% with plain embeddings to about 5% with BM25 reranking, and further to roughly 1.3% with contextual embeddings; combining BM25 reranking with contextual embeddings yields a still larger improvement. The tradeoff is practical: contextual retrieval adds inference time and cost because an LLM must generate the added context for each chunk during preprocessing.
The implementation starts like a standard RAG pipeline: split a large document into chunks, then compute embeddings for those chunks and store them in a vector database. The key difference comes next. For each chunk, an LLM is prompted with (1) the chunk text and (2) the entire document, and asked to produce a short context snippet that belongs with that chunk. The resulting “context + chunk” becomes the new chunk text that gets embedded and indexed. In the demo, the workflow uses Llama 3.1 (via the Groq API) and a recursive character text splitter (chunk size set to 248, with guidance to increase if the model can handle more tokens). The contextualization prompt uses XML tags to bound the chunk within the full document, mirroring Anthropic’s recommended structure.
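The preprocessing step can be sketched in a few lines. This is a minimal approximation, not the demo's exact code: a fixed-width character splitter stands in for LangChain's recursive character text splitter, the `llm` argument is any prompt-to-completion callable (e.g. Llama 3.1 via the Groq client), and the XML-tagged prompt paraphrases Anthropic's published template rather than quoting it.

```python
# Hedged sketch of "context + chunk" preprocessing. The prompt wording is an
# approximation of Anthropic's contextual retrieval template.

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short, succinct context that situates this chunk within the overall
document, to improve search retrieval of the chunk. Answer with only the context."""


def split_document(document: str, chunk_size: int = 248) -> list[str]:
    """Naive stand-in for a recursive character splitter: fixed character windows."""
    return [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]


def contextualize(document: str, chunks: list[str], llm) -> list[str]:
    """Return 'generated context + original chunk' for every chunk.

    `llm` is any callable mapping a prompt string to a completion string,
    e.g. a wrapper around a Groq chat-completion call.
    """
    enriched = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = llm(prompt)
        # The enriched text, not the raw chunk, is what gets embedded and indexed.
        enriched.append(f"{context}\n{chunk}")
    return enriched
```

In the real pipeline each `llm(prompt)` call is one extra inference per chunk, which is exactly where the added preprocessing cost comes from.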
After preprocessing, retrieval proceeds normally: the user query is embedded, the vector database returns the nearest chunks using the distance metric provided by the pgvector extension, and the top contextualized chunks are inserted into the final LLM prompt. The demo’s question—“how many parameters does Qwen have”—is answered using context retrieved from Qwen-related README content, with the system prompt positioning the model as an ML engineer and instructing it to use the provided context. The notebook also demonstrates practical concerns: contextual retrieval requires generating context for many chunks (the author notes around 15 chunks in one run), which increases preprocessing cost.
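To make the query-time step concrete, the ranking that pgvector's cosine-distance operator (`<=>`) performs inside Postgres can be reproduced in plain Python. This is an illustrative sketch with stand-in embeddings, not the demo's actual database query:

```python
# What "ORDER BY embedding <=> query LIMIT k" does, spelled out in Python.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """pgvector's cosine distance: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_emb: list[float], indexed: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """indexed: (contextualized_chunk, embedding) pairs; returns the k nearest chunks."""
    ranked = sorted(indexed, key=lambda item: cosine_distance(query_emb, item[1]))
    return [chunk for chunk, _ in ranked[:k]]
```

The returned chunks are then pasted into the final prompt as the model's context, exactly as in a standard RAG pipeline; the only difference is that each chunk already carries its generated context.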
Several follow-up Q&A segments address deployment and integration. For running Qwen on a private server, the demo points to vLLM, which exposes an OpenAI-compatible API. For hardware planning, an RTX 4090 with 24 GB VRAM is used as an example; the model suggests the 7B or 14B variants might be feasible, with 14B potentially pushing the limit depending on quantization and system overhead. Finally, the demo confirms that Qwen can be used in LangGraph-based applications, including human-in-the-loop interaction patterns, and corrects a mistaken claim about LangGraph UI inputs: the integration happens through the model node, not through a built-in UI system.
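The 24 GB feasibility question can be sanity-checked with a back-of-envelope weight-memory estimate. This ignores the KV cache and activation overhead, which add several more GB in practice, so it is a lower bound rather than a sizing guide:

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# fp16 ~ 2 bytes/param, int8 ~ 1 byte/param, 4-bit ~ 0.5 bytes/param.

def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

# 14B at fp16 ~ 26 GB: does not fit in 24 GB on its own.
# 14B at int8 ~ 13 GB and 7B at fp16 ~ 13 GB: both leave headroom.
```

This matches the demo's hedge: 7B fits comfortably at fp16, while 14B only becomes viable with quantization.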
Overall, contextual retrieval is presented as a straightforward preprocessing upgrade to RAG: it costs more upfront, but it can materially reduce retrieval failure by making chunks more informative at search time, not just at generation time.
Cornell Notes
Contextual retrieval improves RAG by rewriting each chunk as “context + chunk,” where the added context is generated from the full document using an LLM. That enriched text is then embedded and stored in a vector database, so similarity search retrieves passages that already contain the surrounding meaning the model would otherwise have to reconstruct. Anthropic’s reported results show large reductions in retrieval failure rates when using contextual embeddings, especially when combined with BM25 reranking. The demo implements the method with Llama 3.1 via the Groq API, using a recursive character splitter and a prompt that bounds the chunk within the full document using XML tags. The main downside is extra preprocessing inference time and cost, so benchmarking with and without contextual retrieval is recommended.
What problem does contextual retrieval solve in RAG, and how does it change the retrieval unit?
Why do contextual embeddings outperform plain embeddings and BM25 reranking in reported tests?
How is the “context + chunk” text generated during preprocessing?
What does retrieval look like after contextual chunks are indexed?
What practical costs and engineering constraints come with contextual retrieval?
How can Qwen be deployed and integrated into an application stack?
Review Questions
- How does contextual retrieval modify the text that gets embedded and indexed, and why does that improve retrieval accuracy?
- In the demo’s pipeline, where do the extra LLM calls happen, and what are the main cost implications?
- What role do XML tags play in the contextualization prompt, and how does that affect the chunk-to-document relationship?
Key Points
1. Contextual retrieval rewrites each RAG chunk as “generated context + original chunk,” using the full document as input to an LLM.
2. Anthropic’s reported tests show retrieval failure rates dropping from ~5.7% (plain embeddings) to ~5% (BM25 reranking) and to ~1.3% (contextual embeddings), with further gains when combining contextual embeddings and BM25.
3. The preprocessing step is the main cost driver: generating context for every chunk adds inference time and expense before indexing.
4. After indexing contextualized chunks, query-time retrieval remains standard: embed the query, use vector similarity (distance via pgvector), and feed the top chunks into the final prompt.
5. Chunking strategy matters: the demo uses a recursive character splitter with chunk size 248 and suggests increasing chunk size if the target model supports more tokens.
6. For deployment, vLLM can host Qwen behind an OpenAI-compatible API, and hardware feasibility depends on model size, VRAM, and quantization choices.
7. Qwen can be integrated into LangGraph workflows as the LLM component, including human-in-the-loop interaction patterns.