Advanced RAG 04 - Contextual Compressors & Filters
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Contextual compression and filters sit between retrieval and generation to remove irrelevant text and extract only query-relevant passages.
Briefing
RAG systems often fail not because retrieval misses everything, but because it brings back too much irrelevant text, or the right facts buried inside long chunks, making it harder for the language model to synthesize an accurate answer. Contextual compression and filtering address that bottleneck by inserting a “cleanup” stage between retrieval and generation: a base retriever first pulls a broad set of documents, then compressor components strip, select, and re-rank only the parts likely to matter for the specific query.
Contextual compression works as a two-step pipeline. A retriever returns multiple contexts (sometimes large chunks where only a small fraction is useful), then document compressors and filters process those contexts to extract just the relevant spans. One example is an LLM chain extractor, which trims each retrieved context down to only the passages that directly support the question, removing irrelevant beginnings and endings and even reducing the number of returned documents. The tradeoff is cost and latency: compression adds an extra LLM call per document, but the payoff is higher-quality context that more closely matches the target answer.
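In LangChain, which the video uses, this pattern is a ContextualCompressionRetriever wrapping a base retriever and an LLMChainExtractor. A minimal sketch, assuming OpenAI models and a FAISS index; import paths shift between LangChain versions, and the sample texts are invented for illustration:

```python
# Minimal sketch: base retriever + LLM-based extractor, assuming LangChain
# with OpenAI models and FAISS. Sample texts are invented for illustration.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

texts = [
    "Announcing LangSmith, a unified platform for debugging, testing, and "
    "evaluating LLM applications. Footer links and unrelated boilerplate...",
    "A post about something else entirely.",
]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# The extractor makes one extra LLM call per retrieved document and keeps
# only the spans that directly support the query; empty results are dropped.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

docs = compression_retriever.invoke("What is LangSmith?")
```

(On older LangChain releases, retriever.get_relevant_documents(query) takes the place of invoke(query).)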
Filtering offers a cheaper, more binary alternative. An LLM chain filter evaluates each context and returns “yes” or “no” depending on whether it is relevant to the question. This can drop entire snippets that don’t contribute, while keeping the most answer-critical sections. In the transcript’s example about LangSmith, the filter removes less useful material such as URLs and titles while retaining the core “announcing LangSmith, a unified platform for debugging, testing, evaluating” content.
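In LangChain this is the LLMChainFilter; a minimal sketch, reusing the base_retriever built in the extractor example above:

```python
# Sketch of the yes/no filter, reusing base_retriever from the sketch above.
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

# One LLM relevance judgment per document: the text is never rewritten,
# whole snippets are simply kept or discarded.
chain_filter = LLMChainFilter.from_llm(ChatOpenAI(temperature=0))
filtered_retriever = ContextualCompressionRetriever(
    base_compressor=chain_filter, base_retriever=base_retriever
)
docs = filtered_retriever.invoke("What is LangSmith?")
```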
Embedding filters add another layer of selectivity by re-checking relevance after compression or other transformations. Even though embeddings are already used during retrieval, the pipeline can compute new embeddings on the compressed output and then rank or threshold the results by similarity to the original query. This matters because compression changes the text: the most relevant fragments after trimming may differ from what the initial retrieval scoring implied.
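LangChain exposes this as an EmbeddingsFilter, usable on its own (as below) or as a later stage after a compressor; a sketch, again assuming the earlier base_retriever, with an illustrative threshold value:

```python
# Sketch of an embeddings-only filter: no extra LLM calls, just fresh
# similarity scores against the query. The 0.76 threshold is illustrative.
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76,  # keep only documents scoring above this
)
retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=base_retriever
)
docs = retriever.invoke("What is LangSmith?")
```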
The most flexible pattern is a document compressor pipeline: retrieve many large chunks, split them again into smaller pieces, run embedding-based redundancy checks, and keep only the chunks whose similarity to the query clears a threshold. The transcript describes a scenario where five large 1000-character chunks become many smaller 300-character chunks (with overlap), after which embeddings select the best subset to send to the final model. Pipelines can also be arranged in different orders (compress first, then embed-filter; or split first, then compress and filter) depending on whether the goal is maximum relevance, reduced redundancy, or better coverage across multiple sources.
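A sketch of such a pipeline, wiring a splitter, a redundancy filter, and a relevance filter into one compressor; the class names follow LangChain's document-compressor API, base_retriever is assumed from the earlier sketch, and the numbers echo the transcript but are tunable:

```python
# Sketch of a compressor pipeline: re-split large retrieved chunks, drop
# near-duplicate pieces, then keep only pieces similar enough to the query.
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    EmbeddingsFilter,
)

embeddings = OpenAIEmbeddings()
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=30, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)

pipeline = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]  # order matters
)
retriever = ContextualCompressionRetriever(
    base_compressor=pipeline, base_retriever=base_retriever
)
docs = retriever.invoke("What is LangSmith?")
```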
Finally, performance and prompt design determine whether these gains are practical. More compressor steps increase latency, so real-time systems may need fewer stages, while summarization tasks can tolerate heavier pipelines. The transcript also emphasizes prompt tailoring: rewriting compressor prompts for a domain-specific use case (e.g., medical-only relevance) can improve extraction and filtering quality. The overall message is that strong RAG isn’t just about retrieval—it’s about controlling what survives retrieval and how that surviving text is shaped for the final answer.
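In LangChain, one way to do this tailoring is to pass a custom prompt to the extractor. A sketch in which the medical wording is a hypothetical example (not from the video), and the {question}/{context} variables follow the default extractor prompt's contract:

```python
# Domain-tailored extraction: swap the extractor's generic default prompt for
# a narrower one. The medical wording below is a hypothetical illustration.
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.document_compressors import LLMChainExtractor

medical_prompt = PromptTemplate.from_template(
    "Given the following question and context, extract verbatim only the "
    "parts of the context that are medically relevant to answering the "
    "question. If no part is relevant, return NO_OUTPUT.\n\n"
    "Question: {question}\n"
    "Context:\n{context}\n\n"
    "Medically relevant extract:"
)
# Note: the stock extractor pairs the NO_OUTPUT marker with an output parser
# that maps it to an empty string; a custom prompt may need the same handling.
compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(temperature=0), prompt=medical_prompt
)
```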
Cornell Notes
Contextual compression and filters improve RAG by cleaning retrieved text before it reaches the final language model. A base retriever first pulls multiple contexts, then compressor components extract only query-relevant passages (LLM chain extractor), keep or discard contexts via yes/no relevance checks (LLM chain filter), and optionally re-rank results using embeddings after transformations (embedding filter). Pipelines can chain these steps: retrieving many chunks, splitting them again, removing redundancy with similarity thresholds, and sending a smaller, higher-signal set to the generator. This matters because long or mixed-relevance chunks can force the model to sift through noise, lowering answer quality and clarity.
Why do contextual compressors matter in RAG, even when retrieval returns “relevant” documents?
How does an LLM chain extractor differ from an LLM chain filter?
What does an embedding filter do after compression, and why recompute embeddings?
What is a document compressor pipeline, and how can it reduce context size?
How should pipeline complexity be chosen for different RAG tasks?
Review Questions
- In what situations would you prefer an LLM chain extractor over an LLM chain filter?
- Explain why recomputing embeddings after compression can improve relevance selection.
- Describe a compressor pipeline that starts with large chunks and ends with a similarity-thresholded set of smaller chunks. What steps occur in between?
Key Points
1. Contextual compression and filters sit between retrieval and generation to remove irrelevant text and extract only query-relevant passages.
2. An LLM chain extractor trims contexts to relevant spans, while an LLM chain filter uses yes/no relevance decisions to keep or discard entire contexts.
3. Embedding filters can re-rank or threshold compressed outputs by similarity to the original query, often using new embeddings computed after transformations.
4. Document compressor pipelines can chain operations like splitting, embedding-based redundancy removal, and filtering to shrink large retrieved contexts into a smaller, higher-signal set.
5. Pipeline ordering is flexible: compression can happen before embedding filtering, or splitting can happen before compression and filtering.
6. More pipeline steps improve context quality but increase latency, so real-time systems may need fewer stages than summarization workflows.
7. Prompt rewriting for the specific domain (e.g., medical relevance) can improve extraction and filtering accuracy compared with generic prompts.