Build Production-Ready Retrieval RAG Pipeline in LangChain | Hybrid Search (BM25), Re-ranking & HyDE
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A production-ready RAG pipeline needs more than embeddings: it must reliably fetch the right chunks, even when users ask for exact numbers. A simple retrieval setup can return a convincing but wrong answer—like pulling a “4.5 billion charge” from an unrelated customer complaint policy instead of Nvidia’s financial results—because semantic search prioritizes meaning over exact matches.
The breakdown comes from “keyword blindness.” Embedding-based retrieval captures intent, but it can miss exact figures, product codes, or specific phrases that determine correctness. In the demo knowledge base, the query “Locate the section discussing the 4.5 billion charge” should land on Nvidia’s financial PDF, yet pure semantic search ranks the chunk containing the correct figure down at third place while a chunk from customer complaint text surfaces at second. The result is a retrieval error that downstream generation can amplify into a fluent hallucination.
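For reference, a minimal sketch of that baseline, embedding-only retriever is shown below; the document snippets, file names, and embedding model are illustrative stand-ins rather than the exact ones from the video.

```python
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Toy stand-ins for the chunked PDFs in the demo knowledge base.
chunks = [
    Document(
        page_content="NVIDIA's results for the quarter include a $4.5 billion charge.",
        metadata={"source": "nvidia_financial_results.pdf"},
    ),
    Document(
        page_content="Customers may dispute a charge by filing a complaint with billing support.",
        metadata={"source": "customer_complaint_policy.pdf"},
    ),
]

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embedding_model)

# Pure semantic retrieval: ranks chunks by embedding similarity only.
semantic_retriever = vector_store.as_retriever(search_kwargs={"k": 2})
docs = semantic_retriever.invoke("Locate the section discussing the 4.5 billion charge")
```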
To fix recall, the pipeline switches to hybrid search using LangChain’s ensemble retriever. It combines the semantic retriever with a BM25 retriever (Best Match 25), which scores documents using term frequency and inverse document frequency. BM25 is designed for exact keyword matching and is widely used in search engines, making it a practical way to ensure that queries containing specific terms or numbers don’t get lost in embedding similarity. In the example, BM25 lifts the Nvidia financial chunk into the top results, but noise can still remain.
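Continuing the sketch above, the hybrid retriever can be wired up roughly as follows; the 50/50 weights and k values are assumptions rather than the video’s exact settings, and BM25Retriever requires the rank_bm25 package.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 scores chunks by exact term overlap (term frequency weighted by
# inverse document frequency), so tokens like "4.5 billion" count directly.
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 3

# Hybrid search: blend keyword matching (BM25) with the semantic retriever.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)

candidates = hybrid_retriever.invoke("Locate the section discussing the 4.5 billion charge")
```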
Precision then comes from reranking. Hybrid search casts a wide net to collect candidate chunks; reranking acts like an expert filter that selects the best passages among those candidates. The implementation uses Answer.AI’s ColBERT-style reranker model inside LangChain’s contextual compression retriever. The compressor takes the ensemble retriever’s top candidates (set to three) and removes irrelevant chunks, pushing the truly relevant financial excerpts to the top.
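A sketch of that wiring follows; since the exact ColBERT compressor class isn’t shown here, it substitutes a generic cross-encoder reranker from langchain_community, and the model name and top_n value are assumptions.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Stand-in reranker; the video uses an Answer.AI ColBERT-style model instead.
reranker_model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=reranker_model, top_n=3)

# The compression retriever reranks the hybrid candidates and keeps only the best.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,
)

top_chunks = compression_retriever.invoke(
    "Locate the section discussing the 4.5 billion charge"
)
```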
Even with strong retrieval, real users often ask imperfect questions. The pipeline adds HyDE (hypothetical document embeddings). A smaller language model generates a short hypothetical answer (2–3 sentences) to the user’s question, then appends that text to the original query. This augmentation injects concrete terms, such as “financial statements” and “operating expenses,” so the retriever searches with richer signals. HyDE is especially helpful for vague queries and large document collections.
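A minimal HyDE sketch along these lines, assuming a small local chat model; the model tag and prompt wording are illustrative, not the exact ones from the video.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

# Small model used only to draft the hypothetical answer (model tag is an assumption).
small_llm = ChatOllama(model="qwen2.5:3b", temperature=0)

hyde_prompt = ChatPromptTemplate.from_template(
    "Write a 2-3 sentence hypothetical answer to this question, as if quoting "
    "the underlying documents:\n\n{question}"
)
hyde_chain = hyde_prompt | small_llm | StrOutputParser()

question = "How is the company doing financially?"
hypothetical = hyde_chain.invoke({"question": question})

# Append the hypothetical text so retrieval sees concrete terms
# (e.g. "financial statements", "operating expenses"), not just the vague question.
augmented_query = f"{question}\n{hypothetical}"
top_chunks = compression_retriever.invoke(augmented_query)
```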
Finally, the system is wired into an end-to-end chain that produces answers with citations. A prompt instructs the model to use the retrieved sources, while the chain formats the compressed, reranked chunks and feeds them into a larger chat model (the transcript mentions Qwen3 8B). For the example query about Nvidia’s data center revenue versus the previous quarter, the pipeline returns the correct figure ($39.1 billion, up 10%) and cites the chunk source from Nvidia’s financial results.
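A rough sketch of that answering chain, with each chunk’s source formatted into the prompt so the model can cite it; the Ollama model tag stands in for the Qwen3 8B mentioned in the transcript.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

answer_llm = ChatOllama(model="qwen3:8b", temperature=0)

answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the sources below and cite the source of "
    "each fact you use.\n\nSources:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Prefix each chunk with its source file so citations can point back to it.
    return "\n\n".join(
        f"[{doc.metadata.get('source', 'unknown')}]\n{doc.page_content}" for doc in docs
    )

question = "What was the data center revenue compared to the previous quarter?"
retrieved = compression_retriever.invoke(question)

answer = (answer_prompt | answer_llm | StrOutputParser()).invoke(
    {"context": format_docs(retrieved), "question": question}
)
print(answer)
```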
Together, hybrid search (recall), reranking (precision), and HyDE (query strengthening)—plus source citation—turn a naive RAG setup into a system built to reduce wrong-context failures in production.
Cornell Notes
The transcript shows why embedding-only retrieval can produce wrong RAG answers: semantic search is “keyword blind” and may retrieve a plausible but incorrect chunk (e.g., customer complaint text instead of Nvidia financial results for a “4.5 billion charge”). The fix starts with hybrid search, combining a semantic retriever with a BM25 retriever to improve recall for exact numbers and phrases. Next, reranking with Answer.AI’s ColBERT-style model filters the candidate chunks so only the most relevant passages survive. Finally, HyDE (hypothetical document embeddings) strengthens weak user queries by generating a short hypothetical answer and appending it to the query before retrieval. The pipeline ends with a prompt-driven answer that cites the retrieved chunks used to form the response.
Why does a basic semantic-search RAG pipeline fail on exact-number questions?
How does BM25 address “keyword blindness” in retrieval?
What does hybrid search improve, and what problem does it still leave?
How does reranking increase precision after hybrid retrieval?
How does HyDE (“hypothetical document embeddings”) make retrieval more reliable for imperfect queries?
Why add citations in a production RAG pipeline?
Review Questions
- If semantic search returns the wrong chunk for an exact-number query, which two retrieval upgrades in the transcript would you try first and why?
- Explain the difference between hybrid search and reranking in terms of recall vs precision.
- Describe how HyDE changes the query before retrieval and what kinds of user questions it helps most.
Key Points
1. Embedding-only retrieval can be “keyword blind,” causing wrong-context failures when users request exact figures or specific phrases.
2. BM25 improves exact-match recall by scoring documents with term frequency and inverse document frequency, making it effective for numbers and precise wording.
3. Hybrid search via an ensemble retriever combines semantic intent with BM25 keyword matching to retrieve more relevant candidates.
4. Reranking increases precision by filtering the candidate chunks with a dedicated reranker model inside a contextual compression retriever.
5. HyDE strengthens weak or vague user queries by generating a short hypothetical answer and appending it to the query before retrieval.
6. A production RAG pipeline should include source citations so answers can be traced back to the retrieved chunks used to construct them.