
Advanced RAG 06 - RAG Fusion

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

RAG Fusion rewrites one user query into multiple related queries to trigger multiple vector searches and broaden coverage of intent.

Briefing

RAG Fusion aims to narrow the gap between what users type and what they actually mean by turning one user query into several targeted search queries, then merging the best retrieval results before the language model writes an answer. Instead of relying on a single vector search lookup, it generates multiple query “views,” runs each through retrieval, and then uses Reciprocal Rank Fusion to rerank and combine the outputs. The payoff is broader, more reliable context—especially when questions are vague or cover multiple angles.

The core workflow starts with query duplication “with a twist.” A single input question is rewritten into multiple related queries (the example uses five), and each rewritten query triggers its own vector search against the knowledge base. Because each query variant tends to surface different documents, the system collects a richer candidate set than any one lookup would provide. Those candidates are then reranked using Reciprocal Rank Fusion, an algorithm designed to combine ranked lists from multiple retrieval runs into a single ordering that favors items that appear consistently well across the different queries.
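The fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: each retrieval run is assumed to return an ordered list of document IDs, and `k=60` is the conventional constant from the original RRF formulation.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ordering.

    Each document earns 1 / (k + rank) for every list it appears in;
    scores are summed, so documents that rank consistently well across
    query variants rise to the top of the fused list.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants returned overlapping but different candidates:
runs = [
    ["rides", "tickets", "hours"],
    ["tickets", "rides", "food"],
    ["rides", "food", "hotels"],
]
fused = reciprocal_rank_fusion(runs)
```

Here "rides" wins because it appears near the top of all three lists, while documents that surface only once sink toward the bottom.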

After reranking, the selected passages are treated as the context for generation. The large language model receives the original user question plus the fused, filtered retrieval results, and then produces a final response that reflects multiple perspectives rather than whatever happened to match the first query formulation.

A key practical insight is that the query rewriting step can be steered to pull different angles of the same topic. The transcript’s example frames this as rewriting one question into variants that emphasize, for instance, an economic perspective versus a public-health perspective—useful when users ask broad questions but want a diverse set of evidence in the answer.

To demonstrate the approach in code, the walkthrough reproduces RAG Fusion using LangChain and Google’s PaLM 2 model (with a note that an API key can be obtained via Google’s MakerSuite site and that PaLM 2 is available for free). The setup uses a Chroma vector database containing precomputed embeddings over a small dataset of scraped Singapore tourist-attraction articles. Embeddings are computed with BGE, and the notebook either loads an existing Chroma DB or outlines how ingestion would split documents into chunks and compute embeddings.

Retrieval begins with a basic retriever and a simple RAG chain that can already answer questions like “tell me about Universal Studios Singapore,” even when the spelling is off. The RAG Fusion version then adds a query-generation chain: a prompt instructs the model to output multiple search queries from the single user input (the example generates four). Those queries are split out, mapped to the retriever, and their results are merged via Reciprocal Rank Fusion. Debugging traces (via LangSmith) can show the intermediate steps, including the specific rewritten queries.
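The chain structure can be emulated without LangChain using stub functions in place of the model and the retriever. Everything below is hypothetical scaffolding: a real pipeline would prompt PaLM 2 for the rewrites and query a Chroma vector store instead.

```python
def generate_queries(question):
    # Stub for the query-generation chain; a real chain would prompt an
    # LLM for four rewrites and split its output on newlines.
    return [
        f"{question} rides",
        f"{question} ticket prices",
        f"{question} best things to do",
        f"{question} best time to visit",
    ]

def retrieve(query, corpus, top_k=2):
    # Toy substring-match retriever standing in for a vector search.
    hits = [doc for doc in corpus if any(w in doc for w in query.split())]
    return hits[:top_k]

def rrf(ranked_lists, k=60):
    # Reciprocal Rank Fusion over the per-query result lists.
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

corpus = [
    "Universal Studios Singapore rides include Battlestar Galactica",
    "ticket prices start at SGD 83",
    "best time to visit is on weekdays",
]
queries = generate_queries("Universal Studios Singapore")
fused_context = rrf([retrieve(q, corpus) for q in queries])
```

The shape mirrors the demo: one input fans out into four queries, each query retrieves independently, and fusion produces a single context list for the answer chain.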

In the Universal Studios example, the generated queries cover rides, pricing, best things to do, and the best time of year to visit—some highly relevant, others less so. The final generation step uses the fused, filtered context to produce the answer, effectively combining the strongest material retrieved across multiple query formulations. The overall message: RAG Fusion is a retrieval-quality upgrade that mainly works by rewriting and fusing ranked results, not by changing the language model itself.

Cornell Notes

RAG Fusion improves retrieval-augmented generation by rewriting one user question into several related search queries, running vector search for each, and then merging the results with Reciprocal Rank Fusion. This produces a fused set of context passages that better matches the user’s intent, particularly for broad or ambiguous questions. In the LangChain walkthrough, PaLM 2 generates multiple query variants (e.g., rides, pricing, best activities, and best time to visit for Universal Studios Singapore). Each query retrieves from a Chroma vector database built with BGE embeddings, and the reranked outputs are fed into a final prompt for answer generation. The method is mainly about query rewriting plus rank fusion, with the language model used only at the end to synthesize the fused context.

Why does RAG Fusion rewrite a single query into multiple queries instead of relying on one vector search lookup?

A single vector search often returns documents that match only one interpretation of a broad question. RAG Fusion duplicates the user input into several rewritten queries, each nudging retrieval toward a different “angle.” In the example, “Universal Studios Singapore” is expanded into four queries about rides, cost/pricing, best things to do, and the best time of year—so the system gathers evidence that a single lookup might miss.

What role does Reciprocal Rank Fusion play after retrieval?

Reciprocal Rank Fusion combines ranked lists from multiple retrieval runs into one reranked ordering. Instead of trusting any one vector search result list, it favors items that rank well across several query variants. That reranking step is what turns multiple candidate sets into a single, higher-quality context set for generation.

How does the final generation step use the fused retrieval results?

After reranking, the selected passages are treated as the “general context.” The language model then receives the original user question plus this fused context and produces the final answer. The transcript emphasizes that the model synthesizes across the fused evidence rather than answering from a single retrieval slice.
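That final step can be sketched as simple prompt assembly. The template wording here is illustrative, not the exact prompt from the notebook:

```python
def build_final_prompt(question, fused_docs, max_docs=4):
    # Join the top fused passages into one context block and pair it
    # with the original (not the rewritten) user question.
    context = "\n\n".join(fused_docs[:max_docs])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_final_prompt(
    "Tell me about Universal Studios Singapore",
    ["Rides include Battlestar Galactica.", "Tickets start at SGD 83."],
)
```

The key design point is that the rewritten queries exist only for retrieval; generation always sees the user's original phrasing plus the fused evidence.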

How does the LangChain implementation generate multiple search queries?

A dedicated query-generation chain uses a prompt template that instructs the model to generate multiple search queries from the single input question. In the demo, the prompt is set up to output four queries, and the code splits the model output into separate queries before sending each one to the retriever.

What components are used in the demo’s RAG Fusion pipeline?

The pipeline uses LangChain with the Google PaLM chat model (PaLM 2), a Chroma vector database, and BGE embeddings. It either loads a pre-made Chroma DB (with embeddings already computed) or outlines ingestion steps: load documents, split into chunks, embed with BGE, and store in Chroma. Retrieval runs per generated query, then results are fused with Reciprocal Rank Fusion before the final answer chain.
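The ingestion side can be sketched with a simple character-based splitter. This is a stand-in for LangChain's text splitters, and the chunk sizes are illustrative, not the notebook's settings:

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    # Slide a window across the document so adjacent chunks share
    # `overlap` characters of context, as LangChain-style splitters do.
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

doc = "Sentosa island hosts Universal Studios Singapore. " * 20
chunks = split_into_chunks(doc, chunk_size=200, overlap=20)
```

Each chunk would then be embedded (with BGE in the demo) and written to the vector store, so every rewritten query searches over the same chunk index.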

When would RAG Fusion help most, based on the examples shown?

It’s most helpful when the user question is broad or vague and the system needs a wide variety of data in the response. The transcript notes that for a broad query like “tell me about Universal Studios Singapore,” RAG Fusion may not dramatically change results, but it becomes more valuable when the user types a more specific or differently phrased query where multiple interpretations matter.

Review Questions

  1. How does rewriting the user query into multiple variants change the set of documents retrieved from a vector database?
  2. What is the purpose of Reciprocal Rank Fusion when combining retrieval results from multiple rewritten queries?
  3. In the LangChain demo, where do the generated queries feed into the pipeline, and how do they affect the final context passed to the language model?

Key Points

  1. RAG Fusion rewrites one user query into multiple related queries to trigger multiple vector searches and broaden coverage of intent.
  2. Each rewritten query retrieves its own candidate set from the vector database, increasing the chance of capturing diverse relevant evidence.
  3. Reciprocal Rank Fusion reranks and merges ranked retrieval lists so consistently high-ranking items rise in the final ordering.
  4. The language model generates answers using the original question plus the fused, filtered retrieval context rather than a single retrieval result.
  5. In the LangChain walkthrough, PaLM 2 generates multiple search queries, which are then split and mapped to the retriever.
  6. The demo uses Chroma for vector storage and BGE embeddings for embedding text chunks, with an option to load a prebuilt database.
  7. RAG Fusion is especially useful for broad or ambiguous questions where users want multiple angles reflected in the final response.

Highlights

  • RAG Fusion’s “fusion” happens after retrieval: multiple ranked lists are merged using Reciprocal Rank Fusion before generation.
  • A single broad question can be expanded into targeted sub-queries (rides, pricing, best activities, best time to visit) to pull richer context.
  • The final answer is synthesized from fused retrieval context, not from whichever documents a single vector search happens to return.
  • The LangChain implementation demonstrates the full pipeline: query generation → per-query retrieval → rank fusion → final answer chain.

Topics

Mentioned

  • RAG
  • MMR
  • LLM
  • API
  • DB
  • Chroma
  • BGE