How RAG Finds Answers in Millions of Documents | Embeddings, Vector Databases, LangChain & Supabase
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Embeddings convert text chunks and user queries into vectors so retrieval can match meaning rather than shared keywords.
Briefing
Retrieval in RAG hinges on one practical step: turning a user question into a vector and then finding the most semantically similar document chunks among millions. Keyword search often returns chunks that share surface terms but miss the meaning. Embeddings fix that by representing text as high-dimensional number vectors where “closeness” corresponds to semantic similarity—so a query about “responsibilities of the management team” can retrieve chunks about management handling customer complaints even when the wording differs.
The transcript walks through the mechanics with a toy example. Words like “king” and “queen” are placed into a simple embedding space where royalty-related terms cluster together, while food-related terms (“apple,” “pizza,” “eat”) form a separate region. A sentence embedding is built by splitting the sentence into words, looking up each word’s vector in a dictionary, averaging those vectors, and ignoring out-of-vocabulary words. Similarity between the query vector and each chunk vector is then computed with cosine similarity: the dot product divided by the product of the vector norms. Cosine similarity formally ranges from -1 to 1, but for these embeddings the scores fall roughly between 0 and 1, with values near 1 indicating strong semantic alignment and values near 0 indicating dissimilarity. In the example, the “related” query produces much higher cosine similarity scores than the unrelated one.
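A minimal sketch of that toy computation in Python; the three-dimensional word vectors and example sentences below are made up for illustration and do not come from the video:

```python
import numpy as np

# Toy word vectors; the values are illustrative, not from any trained model.
word_vectors = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.90, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
    "pizza": np.array([0.05, 0.15, 0.95]),
    "eat":   np.array([0.10, 0.10, 0.80]),
}

def sentence_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = sentence_embedding("the queen and the king")   # "the"/"and" are ignored
royal = sentence_embedding("king queen")
food = sentence_embedding("eat apple pizza")

print(cosine_similarity(query, royal))  # high score: related meanings
print(cosine_similarity(query, food))   # lower score: unrelated meanings
```

Averaging is a crude way to pool word vectors, but it is enough to show why a royalty-related query scores much higher against a royalty-related chunk than against a food-related one.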
From there, the workflow shifts to a more realistic RAG setup using a pre-trained embedding model and LangChain. Chunks drawn from a customer complaint policy plus additional chunks from Nvidia’s financial results are embedded with FastEmbed using the model “BAAI/bge-small-en-v1.5.” The query is embedded the same way, and cosine similarity ranks chunks by relevance. The closest chunk is identified by sorting similarity scores, and the transcript also notes that visualization can treat distance from the query as a proxy for similarity (smaller distance means higher similarity).
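A sketch of that step using the FastEmbed wrapper from langchain_community; the chunk text and query below are placeholders rather than the actual policy and Nvidia excerpts:

```python
import numpy as np
from langchain_community.embeddings import FastEmbedEmbeddings

# Placeholder chunks standing in for the complaint-policy and Nvidia text.
chunks = [
    "The management team is responsible for handling escalated complaints.",
    "Nvidia reported record data center revenue for the quarter.",
]
query = "What are the responsibilities of the management team?"

embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
chunk_vectors = np.array(embeddings.embed_documents(chunks))
query_vector = np.array(embeddings.embed_query(query))

# Cosine similarity between the query and every chunk, ranked descending.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```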
Choosing a better embedding model is framed as a performance lever. The transcript points to the MTEB leaderboard and specifically retrieval-focused leaderboards, noting that stronger open models (including “nomic-embed-text” variants) can improve retrieval quality, with tradeoffs in size and speed.
Finally, embeddings need a place to live at scale. Instead of always adopting a standalone vector database, the transcript recommends using vector extensions inside existing SQL infrastructure—especially PG Vector with Postgres. The implementation uses Supabase locally (via Docker) to provide a UI and API layer. A SQL table named “documents” stores chunk content, metadata (as JSONB), and the embedding vector. LangChain then connects to Supabase through a Supabase vector store, adds embedded chunks, and performs similarity search with relevance scores.
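A sketch of the ingestion side, assuming a local Supabase instance (the URL and key are placeholders), the “documents” table described above, and the match_documents function that LangChain’s Supabase integration expects:

```python
from langchain_core.documents import Document
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from supabase.client import create_client

# Placeholder URL and key for the local Supabase stack started with Docker.
supabase = create_client("http://localhost:54321", "your-service-role-key")
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Placeholder chunks standing in for the complaint policy and Nvidia excerpts.
docs = [
    Document(
        page_content="The management team handles escalated customer complaints.",
        metadata={"source": "Customer complaint policy"},
    ),
    Document(
        page_content="Nvidia reported record data center revenue for the quarter.",
        metadata={"source": "Nvidia financial results"},
    ),
]

# Embeds each chunk and inserts content, metadata (JSONB), and the embedding
# vector into the "documents" table.
store = SupabaseVectorStore.from_documents(
    docs,
    embeddings,
    client=supabase,
    table_name="documents",
    query_name="match_documents",
)
```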
Two key retrieval controls are demonstrated: limiting results with k (e.g., k=2 returns the top two chunks) and filtering by metadata (e.g., restricting results to chunks whose source equals “Nvidia financial results”). The filtered search still returns the most relevant Nvidia chunks, and the transcript suggests that low scores can motivate adding a similarity threshold later. The overall takeaway is a complete, end-to-end path from text chunks to vector storage and meaning-based retrieval—ready for the next step of feeding retrieved context into an LLM for answer generation.
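A sketch of those two retrieval controls, again with placeholder credentials and query text; the metadata filter assumes the standard match_documents function from the LangChain Supabase setup, which accepts a JSONB filter:

```python
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from supabase.client import create_client

# Placeholder credentials for the local Supabase instance.
supabase = create_client("http://localhost:54321", "your-service-role-key")
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

store = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings,
    table_name="documents",
    query_name="match_documents",
)

query = "How did the data center business perform?"

# k limits how many chunks come back; scores indicate relevance.
top_two = store.similarity_search_with_relevance_scores(query, k=2)
for doc, score in top_two:
    print(f"{score:.3f}  {doc.metadata.get('source')}  {doc.page_content[:60]}")

# Metadata filter restricts results to chunks from the Nvidia document.
nvidia_only = store.similarity_search(
    query, k=2, filter={"source": "Nvidia financial results"}
)
for doc in nvidia_only:
    print(doc.metadata.get("source"), doc.page_content[:60])
```

Low relevance scores on the filtered results are what would motivate adding a similarity threshold as a later refinement.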
Cornell Notes
RAG retrieval works by embedding both document chunks and user queries into the same vector space, then ranking chunks by semantic similarity rather than keyword overlap. Cosine similarity (dot product normalized by vector norms) produces scores where values near 1 indicate strong meaning alignment and values near 0 indicate weak or unrelated matches. The transcript demonstrates embedding chunks with FastEmbed using “BAAI/bge-small-en-v1.5,” then using LangChain to compute similarity scores and select the top-k chunks. For storage and fast lookup, it recommends PG Vector inside Postgres, implemented locally via Supabase. Metadata filters (e.g., source = “Nvidia financial results”) let retrieval stay constrained to specific document sets while still using embedding-based similarity.
Why does keyword search fail in RAG, and how do embeddings change the retrieval problem?
How is a sentence embedding constructed in the toy example, and what happens to unknown words?
What does cosine similarity measure here, and how should its score range be interpreted?
Which embedding model and tooling are used for the LangChain + Supabase retrieval demo?
How do k and metadata filters affect retrieval results?
Review Questions
- In the described pipeline, at what exact step does semantic matching replace keyword matching, and what mathematical operation produces the ranking score?
- How would you modify the retrieval behavior if you wanted to return only chunks above a certain relevance threshold rather than always returning top-k?
- What are the practical reasons the transcript gives for using PG Vector inside Postgres (via Supabase) instead of a standalone vector database?
Key Points
1. Embeddings convert text chunks and user queries into vectors so retrieval can match meaning rather than shared keywords.
2. Cosine similarity ranks chunks by semantic closeness using the dot product normalized by the vector norms; for these embeddings the scores fall roughly between 0 and 1, with values near 1 indicating a strong match.
3. A sentence embedding can be built by averaging word embeddings; out-of-vocabulary words can be ignored.
4. Embedding quality matters: retrieval-focused leaderboards like MTEB can guide choosing stronger models than small, fast defaults.
5. Storing embeddings in Postgres with PG Vector (via Supabase) can simplify production by reusing an existing database stack.
6. LangChain can connect to Supabase’s PG Vector-backed table and run similarity search with relevance scores.
7. Metadata filters (e.g., source = “Nvidia financial results”) let retrieval target specific document subsets while still using embedding similarity.