Hybrid Search RAG With Langchain And Pinecone Vector DB

TL;DR

Hybrid search improves RAG retrieval by combining dense semantic similarity with sparse keyword matching instead of relying on one method alone.

Briefing Cornell Notes

Briefing

Hybrid search for RAG is built on a simple but powerful idea: retrieve relevant chunks using both semantic similarity (dense vector search) and exact/keyword matching (sparse vector search), then merge the two ranked lists into a single, better set of results. That matters because dense retrieval can miss exact terms, while keyword retrieval can miss meaning; combining both improves recall and relevance before the retrieved text is fed into an LLM for final summarization.

In the usual RAG pipeline, documents get split into chunks, each chunk is converted into an embedding vector, and those vectors are stored in a vector database. When a user asks a question, the query is embedded and a dense similarity search (often cosine similarity) returns the top-K most similar chunks. Those chunks are then combined with a prompt template and passed to an LLM to generate the answer.

Hybrid search keeps the dense-vector path but adds a parallel sparse-vector path. Semantic search corresponds to dense vectors: the text becomes dense embeddings where similarity is computed via vector metrics (the transcript mentions cosine similarity as the conceptual basis). Syntactic search corresponds to sparse representations: text is converted into sparse matrices using methods like one-hot encoding, bag-of-words, or TF-IDF. In that sparse space, only a small subset of terms is non-zero, making it well-suited for exact keyword matching.

Operationally, hybrid search stores both representations for the same documents in a vector store that supports both modes. On query time, the user question is transformed into two forms as well: sparse vectors for keyword search and dense vectors for semantic search. Each mode returns its own top-K results—often with different ordering and overlap. The remaining challenge is how to combine them.

The transcript uses Reciprocal Rank Fusion (RRF) to merge the two ranked lists. Each candidate document receives a score computed as a sum of terms of the form 1 / (C + rank), where C is a constant chosen by the system (the transcript notes values can range roughly from 1 to 60, with examples like 10, 20, or 60). Documents that rank highly in either semantic or keyword retrieval get boosted, and changing weightage between the two retrieval signals can reorder the final top-K output.

After laying out the theory, the walkthrough implements hybrid search with LangChain and Pinecone. Pinecone is used as the vector database, with a hybrid-capable retriever from LangChain that performs both dense and sparse retrieval. The setup includes creating a Pinecone index with vector dimension 384 (matching the Hugging Face sentence-transformer model used), choosing a metric compatible with the sparse/dense setup (the transcript specifies dot product), and deploying on AWS us-east-1.

For dense embeddings, the implementation uses Hugging Face embeddings via the sentence-transformers model all-MiniLM-L6-v2. For sparse retrieval, it uses a BM25 encoder (TF-IDF-based by default in the transcript) to generate sparse representations. A Pinecone hybrid search retriever is then created using both the dense embedding function and the BM25 sparse encoder, tied to the Pinecone index.

Finally, sample sentences about visiting cities across years are inserted into the index. Queries like “with what city did I visit last?” and “city did I visit first” return the correct year-to-city mapping, demonstrating that the retrieval step benefits from both semantic similarity and keyword/sparse matching before the LLM produces the response.

Cornell Notes

Hybrid search for RAG retrieves document chunks using two parallel signals: dense semantic similarity and sparse keyword matching. Text chunks are stored in a vector database in both dense-vector form (embeddings) and sparse-matrix form (e.g., TF-IDF/BM25). At query time, the question is converted into both representations, producing two top-K ranked lists. Reciprocal Rank Fusion (RRF) merges those lists by scoring candidates with terms like 1/(C+rank), boosting documents that rank well in either list. This approach improves relevance compared with using only dense or only keyword retrieval.

The implementation uses LangChain with Pinecone: a Pinecone index is created with dimension 384, dense embeddings come from Hugging Face all-MiniLM-L6-v2, and sparse retrieval uses a BM25 encoder (TF-IDF-based default). A hybrid retriever combines both retrieval modes and successfully answers “first/last visited” queries from inserted city-year sentences.

How does dense (semantic) retrieval differ from sparse (keyword) retrieval in hybrid search?

Dense retrieval converts text into dense embedding vectors and ranks candidates by vector similarity (the transcript references cosine similarity conceptually). Sparse retrieval converts text into sparse matrices using keyword-based methods like TF-IDF or bag-of-words/BM25, where most entries are zero and only terms present in the text contribute non-zero values. Dense retrieval tends to capture meaning; sparse retrieval tends to capture exact term overlap.

What does it mean to store both dense vectors and sparse matrices for the same documents?

For each document chunk, the system generates two representations: (1) a dense embedding vector (e.g., from Hugging Face sentence-transformers) and (2) a sparse representation (e.g., BM25/TF-IDF). Both are stored in the same vector database/index so that the database can run either dense vector search or sparse keyword search (or both) against the same underlying content.

How does Reciprocal Rank Fusion (RRF) combine two ranked lists?

Hybrid search produces two ranked lists: one from dense retrieval and one from sparse retrieval. RRF assigns each candidate a score by summing contributions from each list using a formula like 1/(C + rank). The constant C is database-dependent (examples mentioned include values around 10, 20, or up to 60). Documents that appear near the top of either list get higher scores, and the final ranking is based on these combined scores.

Why does weightage between semantic and keyword results matter?

Because the two retrieval modes can disagree on ordering, changing the emphasis shifts which documents rise to the top. The transcript describes scenarios where a document’s combined score changes depending on whether more weight is given to keyword search or semantic search, which can reorder the final top-K results.

What concrete components are used in the LangChain + Pinecone implementation?

The setup uses a Pinecone hybrid search retriever from LangChain Community. Pinecone stores the index. Dense embeddings come from Hugging Face all-MiniLM-L6-v2 (dimension 384). Sparse retrieval uses a BM25 encoder (TF-IDF-based default in the transcript). The retriever combines both so queries are answered using both dense and sparse retrieval signals.

How does the example query demonstrate hybrid retrieval?

After inserting city-year sentences into the Pinecone index, queries like “with what city did I visit last?” and “city did I visit first” return the correct city associated with the latest or earliest year. The transcript attributes this to the retriever using both BM25/sparse matching and dense embedding similarity internally, improving the chance of selecting the right chunk.

Review Questions

When would sparse keyword retrieval outperform dense semantic retrieval in a RAG system, and why?
Explain how RRF’s scoring formula uses rank positions from two retrieval lists to produce a single final ranking.
In the Pinecone + LangChain setup, what must match between the embedding model and the Pinecone index configuration (e.g., dimension), and what happens if they don’t?

Key Points

1
Hybrid search improves RAG retrieval by combining dense semantic similarity with sparse keyword matching instead of relying on one method alone.
2
Sparse representations are typically built with TF-IDF/BM25-style methods, producing mostly zeros with non-zero weights for terms present in text.
3
Dense representations come from embedding models and support semantic similarity search (e.g., via cosine similarity as the conceptual basis).
4
Hybrid search requires merging two top-K ranked lists; Reciprocal Rank Fusion (RRF) scores candidates using 1/(C+rank) terms from each list.
5
Changing the weightage between semantic and keyword signals can reorder the final top-K results and affect which chunks reach the LLM.
6
In the LangChain + Pinecone implementation, a Pinecone index is created with dimension 384 to match the Hugging Face all-MiniLM-L6-v2 embedding output.
7
A LangChain Pinecone hybrid search retriever combines Hugging Face dense embeddings with a BM25 (TF-IDF-based default) sparse encoder to answer queries from inserted text.

Highlights

Hybrid search runs two retrieval modes in parallel—dense semantic search and sparse keyword search—then merges results for a stronger top-K set.

Reciprocal Rank Fusion (RRF) boosts documents that rank highly in either list using a score shaped like 1/(C+rank).

The practical build uses Pinecone as the index, Hugging Face all-MiniLM-L6-v2 for dense embeddings (384-dim), and a BM25 encoder for sparse retrieval.

Topics

Hybrid Search
RAG Retrieval
Reciprocal Rank Fusion
Pinecone Indexing
LangChain Retrievers

Mentioned

Pinecone
LangChain
Hugging Face
sentence-transformers
Pinecone hybrid search retriever
Krish Naik
RAG
TF-IDF
BM25
LLM
RRF