Advanced RAG 04 - Contextual Compressors & Filters
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Contextual compression and filters sit between retrieval and generation to remove irrelevant text and extract only query-relevant passages.
Briefing
RAG systems often fail not because retrieval misses everything, but because it brings back too much irrelevant text, or the right facts buried inside long chunks, making it harder for the language model to synthesize an accurate answer. Contextual compression and filtering address that bottleneck by inserting a “cleanup” stage between retrieval and generation: a base retriever first pulls a broad set of documents, then compressor components strip, select, and re-rank only the parts likely to matter for the specific query.
Contextual compression works as a two-step pipeline. A retriever returns multiple contexts (sometimes large chunks where only a small fraction is useful), then document compressors and filters process those contexts to extract just the relevant spans. One example is an LLM chain extractor, which trims each retrieved context down to only the passages that directly support the question, removing irrelevant beginnings and endings and even reducing the number of returned documents. The tradeoff is cost and latency: compression adds an extra LLM call per document, but the payoff is higher-quality context that more closely matches the target answer.
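In LangChain, which the video uses, this pattern is a ContextualCompressionRetriever wrapping a base retriever and an LLMChainExtractor. A minimal sketch, assuming OpenAI models and a FAISS index; import paths shift between LangChain versions, and the sample texts are invented for illustration:

```python
# Minimal sketch: base retriever + LLM-based extractor, assuming LangChain
# with OpenAI models and FAISS. Sample texts are invented for illustration.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

texts = [
    "Announcing LangSmith, a unified platform for debugging, testing, and "
    "evaluating LLM applications. Footer links and unrelated boilerplate...",
    "A post about something else entirely.",
]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# The extractor makes one extra LLM call per retrieved document and keeps
# only the spans that directly support the query; empty results are dropped.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

docs = compression_retriever.invoke("What is LangSmith?")
```

(On older LangChain releases, retriever.get_relevant_documents(query) takes the place of invoke(query).)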
Filtering offers a cheaper, more binary alternative. An LLM chain filter evaluates each context and returns “yes” or “no” depending on whether it is relevant to the question. This can drop entire snippets that don’t contribute, while keeping the most answer-critical sections. In the transcript’s example about LangSmith, the filter removes less useful material such as URLs and titles while retaining the core “announcing LangSmith, a unified platform for debugging, testing, evaluating” content.
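In LangChain this is the LLMChainFilter; a minimal sketch, reusing the base_retriever built in the extractor example above:

```python
# Sketch of the yes/no filter, reusing base_retriever from the sketch above.
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

# One LLM relevance judgment per document: the text is never rewritten,
# whole snippets are simply kept or discarded.
chain_filter = LLMChainFilter.from_llm(ChatOpenAI(temperature=0))
filtered_retriever = ContextualCompressionRetriever(
    base_compressor=chain_filter, base_retriever=base_retriever
)
docs = filtered_retriever.invoke("What is LangSmith?")
```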
Embedding filters add another layer of selectivity by re-checking relevance after compression or other transformations. Even though embeddings are already used during retrieval, the pipeline can compute new embeddings on the compressed output and then rank or threshold the results by similarity to the original query. This matters because compression changes the text: the most relevant fragments after trimming may differ from what the initial retrieval scoring implied.
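LangChain exposes this as an EmbeddingsFilter, usable on its own (as below) or as a later stage after a compressor; a sketch, again assuming the earlier base_retriever, with an illustrative threshold value:

```python
# Sketch of an embeddings-only filter: no extra LLM calls, just fresh
# similarity scores against the query. The 0.76 threshold is illustrative.
from langchain_openai import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(),
    similarity_threshold=0.76,  # keep only documents scoring above this
)
retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, base_retriever=base_retriever
)
docs = retriever.invoke("What is LangSmith?")
```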
The most flexible pattern is a document compressor pipeline: retrieve many large chunks, split them again into smaller pieces, run embedding-based redundancy checks, and keep only the chunks whose similarity to the query clears a threshold. The transcript describes a scenario where five large 1000-character chunks become many smaller 300-character chunks (with overlap), after which embeddings select the best subset to send to the final model. Pipelines can also be arranged in different orders (compress first, then embed-filter; or split first, then compress and filter) depending on whether the goal is maximum relevance, reduced redundancy, or better coverage across multiple sources.
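A sketch of such a pipeline, wiring a splitter, a redundancy filter, and a relevance filter into one compressor; the class names follow LangChain's document-compressor API, base_retriever is assumed from the earlier sketch, and the numbers echo the transcript but are tunable:

```python
# Sketch of a compressor pipeline: re-split large retrieved chunks, drop
# near-duplicate pieces, then keep only pieces similar enough to the query.
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    EmbeddingsFilter,
)

embeddings = OpenAIEmbeddings()
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=30, separator=". ")
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)

pipeline = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]  # order matters
)
retriever = ContextualCompressionRetriever(
    base_compressor=pipeline, base_retriever=base_retriever
)
docs = retriever.invoke("What is LangSmith?")
```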
Finally, performance and prompt design determine whether these gains are practical. More compressor steps increase latency, so real-time systems may need fewer stages, while summarization tasks can tolerate heavier pipelines. The transcript also emphasizes prompt tailoring: rewriting compressor prompts for a domain-specific use case (e.g., medical-only relevance) can improve extraction and filtering quality. The overall message is that strong RAG isn’t just about retrieval—it’s about controlling what survives retrieval and how that surviving text is shaped for the final answer.
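In LangChain, one way to do this tailoring is to pass a custom prompt to the extractor. A sketch in which the medical wording is a hypothetical example (not from the video), and the {question}/{context} variables follow the default extractor prompt's contract:

```python
# Domain-tailored extraction: swap the extractor's generic default prompt for
# a narrower one. The medical wording below is a hypothetical illustration.
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers.document_compressors import LLMChainExtractor

medical_prompt = PromptTemplate.from_template(
    "Given the following question and context, extract verbatim only the "
    "parts of the context that are medically relevant to answering the "
    "question. If no part is relevant, return NO_OUTPUT.\n\n"
    "Question: {question}\n"
    "Context:\n{context}\n\n"
    "Medically relevant extract:"
)
# Note: the stock extractor pairs the NO_OUTPUT marker with an output parser
# that maps it to an empty string; a custom prompt may need the same handling.
compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(temperature=0), prompt=medical_prompt
)
```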
Cornell Notes
Contextual compression and filters improve RAG by cleaning retrieved text before it reaches the final language model. A base retriever first pulls multiple contexts, then compressor components extract only query-relevant passages (LLM chain extractor), keep or discard contexts via yes/no relevance checks (LLM chain filter), and optionally re-rank results using embeddings after transformations (embedding filter). Pipelines can chain these steps: retrieving many chunks, splitting them again, removing redundancy with similarity thresholds, and sending a smaller, higher-signal set to the generator. This matters because long or mixed-relevance chunks can force the model to sift through noise, lowering answer quality and clarity.
Why do contextual compressors matter in RAG, even when retrieval returns “relevant” documents?
How does an LLM chain extractor differ from an LLM chain filter?
What does an embedding filter do after compression, and why recompute embeddings?
What is a document compressor pipeline, and how can it reduce context size?
How should pipeline complexity be chosen for different RAG tasks?
Review Questions
- In what situations would you prefer an LLM chain extractor over an LLM chain filter?
- Explain why recomputing embeddings after compression can improve relevance selection.
- Describe a compressor pipeline that starts with large chunks and ends with a similarity-thresholded set of smaller chunks. What steps occur in between?
Key Points
1. Contextual compression and filters sit between retrieval and generation to remove irrelevant text and extract only query-relevant passages.
2. An LLM chain extractor trims contexts to relevant spans, while an LLM chain filter uses yes/no relevance decisions to keep or discard entire contexts.
3. Embedding filters can re-rank or threshold compressed outputs by similarity to the original query, often using new embeddings computed after transformations.
4. Document compressor pipelines can chain operations like splitting, embedding-based redundancy removal, and filtering to shrink large retrieved contexts into a smaller, higher-signal set.
5. Pipeline ordering is flexible: compression can happen before embedding filtering, or splitting can happen before compression and filtering.
6. More pipeline steps improve context quality but increase latency, so real-time systems may need fewer stages than summarization workflows.
7. Prompt rewriting for the specific domain (e.g., medical relevance) can improve extraction and filtering accuracy compared with generic prompts.