Advanced RAG 06 - RAG Fusion
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RAG Fusion rewrites one user query into multiple related queries to trigger multiple vector searches and broaden coverage of intent.
Briefing
RAG Fusion aims to narrow the gap between what users type and what they actually mean by turning one user query into several targeted search queries, then merging the best retrieval results before the language model writes an answer. Instead of relying on a single vector search lookup, it generates multiple query “views,” runs each through retrieval, and then uses Reciprocal Rank Fusion to rerank and combine the outputs. The payoff is broader, more reliable context—especially when questions are vague or cover multiple angles.
The core workflow starts with query duplication “with a twist.” A single input question is rewritten into multiple related queries (the example uses five), and each rewritten query triggers its own vector search against the knowledge base. Because each query variant tends to surface different documents, the system collects a richer candidate set than any one lookup would provide. Those candidates are then reranked using Reciprocal Rank Fusion, an algorithm designed to combine ranked lists from multiple retrieval runs into a single ordering that favors items that appear consistently well across the different queries.
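The fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration of Reciprocal Rank Fusion, not the exact code from the video; the constant `k=60` is the value commonly used in RRF implementations, and the document IDs are made up for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists into one ordering.

    A document's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so items that rank consistently well across
    the different query variants rise to the top.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants each returned a ranked list of document IDs.
fused = reciprocal_rank_fusion([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d5", "d1"],
])
# "d2" ranks at or near the top of every list, so it comes out first.
```

Note that a document appearing in only one list can still win if nothing else ranks consistently, but in practice the scoring strongly favors candidates surfaced by several query variants.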
After reranking, the selected passages are treated as the context for generation. The large language model receives the original user question plus the fused, filtered retrieval results, and then produces a final response that reflects multiple perspectives rather than whatever happened to match the first query formulation.
A key practical insight is that the query rewriting step can be steered to pull different angles of the same topic. The transcript’s example frames this as rewriting one question into variants that emphasize, for instance, an economic perspective versus a public-health perspective—useful when users ask broad questions but want a diverse set of evidence in the answer.
To demonstrate the approach in code, the walkthrough reproduces RAG Fusion using LangChain and Google’s PaLM 2 model (with a note that an API key can be obtained via Google’s MakerSuite site and that PaLM 2 is available for free). The setup uses a Chroma vector database containing precomputed embeddings over a small dataset of scraped Singapore tourist-attraction articles. For embeddings, it uses BGE embeddings, and the notebook either loads an existing Chroma DB or outlines how ingestion would split documents into chunks and compute embeddings.
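The ingestion path described above can be sketched roughly as follows. This is a setup fragment under stated assumptions, not the notebook's actual code: the class names follow LangChain's community integrations, and the data directory, persist directory, and BGE model name are illustrative placeholders.

```python
# Sketch of the vector-store setup (assumed LangChain community APIs;
# paths and the BGE model name are illustrative, not from the video).
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en")

# Option 1: load a prebuilt Chroma DB from disk.
db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Option 2: ingest from scratch -- split the scraped articles into
# chunks, embed them, and persist the resulting database.
docs = DirectoryLoader("./singapore_attractions/").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)
db = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

retriever = db.as_retriever()
```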
Retrieval begins with a basic retriever and a simple RAG chain that can already answer questions like “tell me about Universal Studios Singapore,” even when the spelling is off. The RAG Fusion version then adds a query-generation chain: a prompt instructs the model to output multiple search queries from the single user input (the example generates four). Those queries are split out, mapped to the retriever, and their results are merged via Reciprocal Rank Fusion. Debugging traces (via LangSmith) can show the intermediate steps, including the specific rewritten queries.
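The end-to-end shape of that chain can be made concrete with the LLM and retriever stubbed out. This is a runnable sketch of the flow only: `generate_queries` stands in for the query-generation prompt (a real chain would call PaLM 2), `retrieve` stands in for Chroma vector search, and the fake index and query angles are invented for illustration.

```python
from collections import defaultdict

def generate_queries(question):
    # Stand-in for the query-generation prompt: a real chain asks the
    # LLM to emit several search queries from the single user input.
    angles = ["rides", "ticket prices", "best things to do", "best time to visit"]
    return [f"{question} {angle}" for angle in angles]

# Stand-in for vector search: maps a query to a best-first list of doc IDs.
FAKE_INDEX = {
    "rides": ["doc_rides", "doc_overview"],
    "ticket prices": ["doc_tickets", "doc_overview"],
    "best things to do": ["doc_overview", "doc_rides"],
    "best time to visit": ["doc_seasons", "doc_overview"],
}

def retrieve(query):
    for key, docs in FAKE_INDEX.items():
        if key in query:
            return docs
    return []

def rag_fusion_context(question, k=60, top_n=3):
    # 1. One question becomes several search queries.
    queries = generate_queries(question)
    # 2. Each query runs its own retrieval (the "map" step).
    ranked_lists = [retrieve(q) for q in queries]
    # 3. Reciprocal Rank Fusion merges the ranked lists.
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    # 4. The top fused documents become the context for final generation.
    return fused[:top_n]

context = rag_fusion_context("Universal Studios Singapore")
```

In the real pipeline the fused passages would be inserted into a final prompt alongside the original question, which is the step the language model actually sees.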
In the Universal Studios example, the generated queries cover rides, pricing, best things to do, and the best time of year to visit—some highly relevant, others less so. The final generation step uses the fused, filtered context to produce the answer, effectively combining the strongest material retrieved across multiple query formulations. The overall message: RAG Fusion is a retrieval-quality upgrade that mainly works by rewriting and fusing ranked results, not by changing the language model itself.
Cornell Notes
RAG Fusion improves retrieval-augmented generation by rewriting one user question into several related search queries, running vector search for each, and then merging the results with Reciprocal Rank Fusion. This produces a fused set of context passages that better matches the user’s intent, particularly for broad or ambiguous questions. In the LangChain walkthrough, PaLM 2 generates multiple query variants (e.g., rides, pricing, best activities, and best time to visit for Universal Studios Singapore). Each query retrieves from a Chroma vector database built with BGE embeddings, and the reranked outputs are fed into a final prompt for answer generation. The method is mainly about query rewriting plus rank fusion, with the language model used only at the end to synthesize the fused context.
- Why does RAG Fusion rewrite a single query into multiple queries instead of relying on one vector search lookup?
- What role does Reciprocal Rank Fusion play after retrieval?
- How does the final generation step use the fused retrieval results?
- How does the LangChain implementation generate multiple search queries?
- What components are used in the demo’s RAG Fusion pipeline?
- When would RAG Fusion help most, based on the examples shown?
Review Questions
- How does rewriting the user query into multiple variants change the set of documents retrieved from a vector database?
- What is the purpose of Reciprocal Rank Fusion when combining retrieval results from multiple rewritten queries?
- In the LangChain demo, where do the generated queries feed into the pipeline, and how do they affect the final context passed to the language model?
Key Points
1. RAG Fusion rewrites one user query into multiple related queries to trigger multiple vector searches and broaden coverage of intent.
2. Each rewritten query retrieves its own candidate set from the vector database, increasing the chance of capturing diverse relevant evidence.
3. Reciprocal Rank Fusion reranks and merges ranked retrieval lists so consistently high-ranking items rise in the final ordering.
4. The language model generates answers using the original question plus the fused, filtered retrieval context rather than a single retrieval result.
5. In the LangChain walkthrough, PaLM 2 generates multiple search queries, which are then split and mapped to the retriever.
6. The demo uses Chroma for vector storage and BGE embeddings for embedding text chunks, with an option to load a prebuilt database.
7. RAG Fusion is especially useful for broad or ambiguous questions where users want multiple angles reflected in the final response.