Build Production-Ready Retrieval RAG Pipeline in LangChain | Hybrid Search (BM25), Re-ranking & HyDE
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A production-ready RAG pipeline needs more than embeddings: it must reliably fetch the right chunks, even when users ask for exact numbers. A simple retrieval setup can return a convincing but wrong answer—like pulling a “4.5 billion charge” from an unrelated customer complaint policy instead of Nvidia’s financial results—because semantic search prioritizes meaning over exact matches.
The breakdown comes from “keyword blindness.” Embedding-based retrieval captures intent, but it can miss exact figures, product codes, or specific phrases that determine correctness. In the demo knowledge base, the query “Locate the section discussing the 4.5 billion charge” should land on Nvidia’s financial PDF, yet pure semantic search ranks the chunk containing the correct figure down at third place while a chunk from customer complaint text surfaces at second. The result is a retrieval error that downstream generation can amplify into a fluent hallucination.
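For reference, a minimal sketch of that baseline, embedding-only retriever is shown below; the document snippets, file names, and embedding model are illustrative stand-ins rather than the exact ones from the video.

```python
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Toy stand-ins for the chunked PDFs in the demo knowledge base.
chunks = [
    Document(
        page_content="NVIDIA's results for the quarter include a $4.5 billion charge.",
        metadata={"source": "nvidia_financial_results.pdf"},
    ),
    Document(
        page_content="Customers may dispute a charge by filing a complaint with billing support.",
        metadata={"source": "customer_complaint_policy.pdf"},
    ),
]

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embedding_model)

# Pure semantic retrieval: ranks chunks by embedding similarity only.
semantic_retriever = vector_store.as_retriever(search_kwargs={"k": 2})
docs = semantic_retriever.invoke("Locate the section discussing the 4.5 billion charge")
```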
To fix recall, the pipeline switches to hybrid search using LangChain’s ensemble retriever. It combines the semantic retriever with a BM25 retriever (Best Match 25), which scores documents using term frequency and inverse document frequency. BM25 is designed for exact keyword matching and is widely used in search engines, making it a practical way to ensure that queries containing specific terms or numbers don’t get lost in embedding similarity. In the example, BM25 lifts the Nvidia financial chunk into the top results, but noise can still remain.
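Continuing the sketch above, the hybrid retriever can be wired up roughly as follows; the 50/50 weights and k values are assumptions rather than the video’s exact settings, and BM25Retriever requires the rank_bm25 package.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 scores chunks by exact term overlap (term frequency weighted by
# inverse document frequency), so tokens like "4.5 billion" count directly.
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 3

# Hybrid search: blend keyword matching (BM25) with the semantic retriever.
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)

candidates = hybrid_retriever.invoke("Locate the section discussing the 4.5 billion charge")
```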
Precision then comes from reranking. Hybrid search casts a wide net to collect candidate chunks; reranking acts like an expert filter that selects the best passages among those candidates. The implementation uses Answer.AI’s ColBERT-style reranker model inside LangChain’s contextual compression retriever. The compressor takes the ensemble retriever’s top candidates (set to three) and removes irrelevant chunks, pushing the truly relevant financial excerpts to the top.
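A sketch of that wiring follows; since the exact ColBERT compressor class isn’t shown here, it substitutes a generic cross-encoder reranker from langchain_community, and the model name and top_n value are assumptions.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Stand-in reranker; the video uses an Answer.AI ColBERT-style model instead.
reranker_model = HuggingFaceCrossEncoder(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2")
compressor = CrossEncoderReranker(model=reranker_model, top_n=3)

# The compression retriever reranks the hybrid candidates and keeps only the best.
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=hybrid_retriever,
)

top_chunks = compression_retriever.invoke(
    "Locate the section discussing the 4.5 billion charge"
)
```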
Even with strong retrieval, real users often ask imperfect questions. The pipeline adds HyDE (hypothetical document embeddings). A smaller language model generates a short hypothetical answer (2–3 sentences) to the user’s question, then appends that text to the original query. This augmentation injects concrete terms, such as “financial statements” and “operating expenses,” so the retriever searches with richer signals. HyDE is especially helpful for vague queries and large document collections.
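A minimal HyDE sketch along these lines, assuming a small local chat model; the model tag and prompt wording are illustrative, not the exact ones from the video.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

# Small model used only to draft the hypothetical answer (model tag is an assumption).
small_llm = ChatOllama(model="qwen2.5:3b", temperature=0)

hyde_prompt = ChatPromptTemplate.from_template(
    "Write a 2-3 sentence hypothetical answer to this question, as if quoting "
    "the underlying documents:\n\n{question}"
)
hyde_chain = hyde_prompt | small_llm | StrOutputParser()

question = "How is the company doing financially?"
hypothetical = hyde_chain.invoke({"question": question})

# Append the hypothetical text so retrieval sees concrete terms
# (e.g. "financial statements", "operating expenses"), not just the vague question.
augmented_query = f"{question}\n{hypothetical}"
top_chunks = compression_retriever.invoke(augmented_query)
```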
Finally, the system is wired into an end-to-end chain that produces answers with citations. A prompt instructs the model to use the retrieved sources, while the chain formats the compressed, reranked chunks and feeds them into a larger chat model (the transcript mentions Qwen3 8B). For the example query about Nvidia’s data center revenue versus the previous quarter, the pipeline returns the correct figure ($39.1 billion, up 10%) and cites the chunk source from Nvidia’s financial results.
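A rough sketch of that answering chain, with each chunk’s source formatted into the prompt so the model can cite it; the Ollama model tag stands in for the Qwen3 8B mentioned in the transcript.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama

answer_llm = ChatOllama(model="qwen3:8b", temperature=0)

answer_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the sources below and cite the source of "
    "each fact you use.\n\nSources:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Prefix each chunk with its source file so citations can point back to it.
    return "\n\n".join(
        f"[{doc.metadata.get('source', 'unknown')}]\n{doc.page_content}" for doc in docs
    )

question = "What was the data center revenue compared to the previous quarter?"
retrieved = compression_retriever.invoke(question)

answer = (answer_prompt | answer_llm | StrOutputParser()).invoke(
    {"context": format_docs(retrieved), "question": question}
)
print(answer)
```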
Together, hybrid search (recall), reranking (precision), and HyDE (query strengthening)—plus source citation—turn a naive RAG setup into a system built to reduce wrong-context failures in production.
Cornell Notes
The transcript shows why embedding-only retrieval can produce wrong RAG answers: semantic search is “keyword blind” and may retrieve a plausible but incorrect chunk (e.g., customer complaint text instead of Nvidia financial results for a “4.5 billion charge”). The fix starts with hybrid search, combining a semantic retriever with a BM25 retriever to improve recall for exact numbers and phrases. Next, reranking with Answer.AI’s ColBERT-style model filters the candidate chunks so only the most relevant passages survive. Finally, HyDE (hypothetical document embeddings) strengthens weak user queries by generating a short hypothetical answer and appending it to the query before retrieval. The pipeline ends with a prompt-driven answer that cites the retrieved chunks used to form the response.
Why does a basic semantic-search RAG pipeline fail on exact-number questions?
How does BM25 address “keyword blindness” in retrieval?
What does hybrid search improve, and what problem does it still leave?
How does reranking increase precision after hybrid retrieval?
How does HyDE (“hypothetical document embeddings”) make retrieval more reliable for imperfect queries?
Why add citations in a production RAG pipeline?
Review Questions
- If semantic search returns the wrong chunk for an exact-number query, which two retrieval upgrades in the transcript would you try first and why?
- Explain the difference between hybrid search and reranking in terms of recall vs precision.
- Describe how HyDE changes the query before retrieval and what kinds of user questions it helps most.
Key Points
1. Embedding-only retrieval can be “keyword blind,” causing wrong-context failures when users request exact figures or specific phrases.
2. BM25 improves exact-match recall by scoring documents with term frequency and inverse document frequency, making it effective for numbers and precise wording.
3. Hybrid search via an ensemble retriever combines semantic intent with BM25 keyword matching to retrieve more relevant candidates.
4. Reranking increases precision by filtering the candidate chunks with a dedicated reranker model inside a contextual compression retriever.
5. HyDE strengthens weak or vague user queries by generating a short hypothetical answer and appending it to the query before retrieval.
6. A production RAG pipeline should include source citations so answers can be traced back to the retrieved chunks used to construct them.