
How RAG Finds Answers in Millions of Documents | Embeddings, Vector Databases, LangChain & Supabase

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Embeddings convert text chunks and user queries into vectors so retrieval can match meaning rather than shared keywords.

Briefing

Retrieval in RAG hinges on one practical step: turning a user question into a vector and then finding the most semantically similar document chunks among millions. Keyword search often returns chunks that share surface terms with the query but miss the meaning. Embeddings fix that by representing text as high-dimensional numeric vectors in which “closeness” corresponds to semantic similarity, so a query about “responsibilities of the management team” can retrieve chunks about management handling customer complaints even when the wording differs.

The transcript walks through the mechanics with a toy example. Words like “king” and “queen” are placed into a simple embedding space where royalty-related terms cluster together, while food-related terms (“apple,” “pizza,” “eat”) form a separate region. A sentence embedding is built by splitting the sentence into words, looking up each word’s vector from a dictionary, averaging those vectors, and ignoring out-of-vocabulary words. Similarity between the query vector and each chunk vector is then computed using cosine similarity: the dot product divided by the product of vector norms. Scores in these examples fall between 0 and 1 (cosine similarity can range from -1 to 1 in general), with values near 1 indicating strong semantic alignment and values near 0 indicating dissimilarity. In the example, the “related” query produces much higher cosine similarity scores than the unrelated one.
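
A minimal sketch of that toy pipeline is below. The word vectors and sentences are invented for illustration, not taken from the video; the point is the averaging of word vectors and the cosine-similarity formula.

```python
import numpy as np

# Illustrative 2-D word vectors: royalty terms cluster in one region, food terms in another.
word_vectors = {
    "king":  np.array([0.90, 0.10]),
    "queen": np.array([0.85, 0.15]),
    "apple": np.array([0.10, 0.90]),
    "pizza": np.array([0.05, 0.95]),
}

def sentence_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of known words; out-of-vocabulary words are simply ignored."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vectors, axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = sentence_embedding("the king and queen")      # "the" and "and" are out of vocabulary
related = sentence_embedding("queen king")
unrelated = sentence_embedding("eat apple pizza")      # "eat" is out of vocabulary

print(cosine_similarity(query, related))    # close to 1
print(cosine_similarity(query, unrelated))  # much lower
```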

From there, the workflow shifts to a more realistic RAG setup using a pre-trained embedding model and LangChain. Chunks drawn from a customer complaint policy plus additional chunks from Nvidia’s financial results are embedded with FastEmbed using the model “BAAI/bge-small-en-v1.5.” The query is embedded the same way, and cosine similarity ranks chunks by relevance. The closest chunk is identified by sorting similarity scores, and the transcript also notes that visualization can treat distance from the query as a proxy for similarity (smaller distance means higher similarity).
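
A hedged sketch of that step using LangChain’s FastEmbed wrapper follows. The class and method names reflect recent langchain_community releases, and the chunk texts are placeholders standing in for the complaint-policy and Nvidia passages.

```python
import numpy as np
from langchain_community.embeddings import FastEmbedEmbeddings

embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Placeholder chunks for illustration only.
chunks = [
    "The management team reviews and resolves escalated customer complaints.",
    "Nvidia reported record data center revenue for the quarter.",
]

chunk_vectors = np.array(embeddings.embed_documents(chunks))
query_vector = np.array(embeddings.embed_query("responsibilities of the management team"))

# Rank chunks by cosine similarity to the query.
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
best = int(np.argmax(scores))
print(chunks[best], scores[best])
```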

Choosing a better embedding model is framed as a performance lever. The transcript points to the MTEB leaderboard and specifically retrieval-focused leaderboards, noting that stronger open models (including “nomic-embed-text” variants) can improve retrieval quality, with tradeoffs in size and speed.

Finally, embeddings need a place to live at scale. Instead of always adopting a standalone vector database, the transcript recommends using vector extensions inside existing SQL infrastructure—especially PG Vector with Postgres. The implementation uses Supabase locally (via Docker) to provide a UI and API layer. A SQL table named “documents” stores chunk content, metadata (as JSONB), and the embedding vector. LangChain then connects to Supabase through a Supabase vector store, adds embedded chunks, and performs similarity search with relevance scores.
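
A minimal sketch of that wiring with LangChain’s Supabase integration is shown below. The table_name and query_name follow LangChain’s documented defaults for this vector store, and the URL, key, and document contents are placeholders for a local Supabase instance.

```python
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import SupabaseVectorStore
from langchain_core.documents import Document
from supabase import create_client

supabase = create_client("http://localhost:54321", "your-local-anon-key")  # placeholders
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")

vector_store = SupabaseVectorStore(
    client=supabase,
    embedding=embeddings,
    table_name="documents",        # table with content, metadata (JSONB), and embedding columns
    query_name="match_documents",  # SQL function used for similarity search
)

vector_store.add_documents([
    Document(page_content="Complaints are handled by the management team.",
             metadata={"source": "Customer complaint policy"}),
    Document(page_content="Nvidia reported record quarterly revenue.",
             metadata={"source": "Nvidia financial results"}),
])

results = vector_store.similarity_search_with_relevance_scores("who handles complaints?")
```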

Two key retrieval controls are demonstrated: limiting results with k (e.g., k=2 returns the top two chunks) and filtering by metadata (e.g., restricting results to chunks whose source equals “Nvidia financial results”). The filtered search still returns the most relevant Nvidia chunks, and the transcript suggests that low scores can motivate adding a similarity threshold later. The overall takeaway is a complete, end-to-end path from text chunks to vector storage and meaning-based retrieval—ready for the next step of feeding retrieved context into an LLM for answer generation.
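
Continuing the sketch above, the two controls look roughly like this; the shape of the filter argument follows LangChain’s Supabase integration, which matches keys against the metadata JSONB column.

```python
# Return only the top two chunks by relevance score.
top_two = vector_store.similarity_search_with_relevance_scores(
    "what were the quarterly results?", k=2
)

# Restrict matches to chunks whose metadata source is the Nvidia document.
nvidia_only = vector_store.similarity_search(
    "what were the quarterly results?",
    k=2,
    filter={"source": "Nvidia financial results"},
)
```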

Cornell Notes

RAG retrieval works by embedding both document chunks and user queries into the same vector space, then ranking chunks by semantic similarity rather than keyword overlap. Cosine similarity (dot product normalized by vector norms) produces scores where values near 1 indicate strong meaning alignment and values near 0 indicate weak or unrelated matches. The transcript demonstrates embedding chunks with FastEmbed using “BAAI/bge-small-en-v1.5,” then using LangChain to compute similarity scores and select the top-k chunks. For storage and fast lookup, it recommends PG Vector inside Postgres, implemented locally via Supabase. Metadata filters (e.g., source = “Nvidia financial results”) let retrieval stay constrained to specific document sets while still using embedding-based similarity.

Why does keyword search fail in RAG, and how do embeddings change the retrieval problem?

Keyword search returns chunks that share surface terms with the query, which often misses semantic intent. Embeddings represent text as vectors in a high-dimensional space where semantic similarity corresponds to geometric closeness. That lets retrieval match meaning: a query about management responsibilities can retrieve chunks about management handling customer complaints even if the exact words differ.

How is a sentence embedding constructed in the toy example, and what happens to unknown words?

The toy setup uses a word-to-embedding dictionary. A sentence is split into words; each word’s vector is looked up and then averaged to form the sentence embedding. If a word isn’t in the dictionary, it’s ignored rather than forcing a guess.

What does cosine similarity measure here, and how should its score range be interpreted?

Cosine similarity measures the angle-based closeness between the query vector and a chunk vector. It’s computed as the dot product divided by the product of the vectors’ norms. Scores are between 0 and 1 in this setup: 1 means very similar, while values near 0 indicate dissimilarity. The related query yields much higher cosine similarity than the unrelated one.
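
Written out, with q as the query vector and c as a chunk vector, the score described above is:

```latex
\text{similarity}(q, c) = \cos\theta = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}
```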

Which embedding model and tooling are used for the LangChain + Supabase retrieval demo?

Embeddings are generated with FastEmbed and the model “BAAI/bge-small-en-v1.5” (the transcript notes the resulting vector dimension is 384). LangChain then embeds the query and ranks chunk embeddings by cosine similarity, while Supabase (with PG Vector) stores the vectors and supports similarity search via a match_documents function.

How do k and metadata filters affect retrieval results?

k limits how many top-matching chunks are returned (e.g., k=2 returns the two highest-scoring chunks). Metadata filters constrain which chunks are eligible—for example, filtering by metadata where source equals “Nvidia financial results” returns only Nvidia chunks, even though similarity scores may still be relatively low if the query doesn’t strongly match those passages.

Review Questions

  1. In the described pipeline, at what exact step does semantic matching replace keyword matching, and what mathematical operation produces the ranking score?
  2. How would you modify the retrieval behavior if you wanted to return only chunks above a certain relevance threshold rather than always returning top-k?
  3. What are the practical reasons the transcript gives for using PG Vector inside Postgres (via Supabase) instead of a standalone vector database?

Key Points

  1. Embeddings convert text chunks and user queries into vectors so retrieval can match meaning rather than shared keywords.
  2. Cosine similarity ranks chunks by semantic closeness using the dot product normalized by vector norms, producing scores between 0 and 1.
  3. A sentence embedding can be built by averaging word embeddings; out-of-vocabulary words can be ignored.
  4. Embedding quality matters: retrieval-focused leaderboards like MTEB can guide choosing stronger models than small, fast defaults.
  5. Storing embeddings in Postgres with PG Vector (via Supabase) can simplify production by reusing an existing database stack.
  6. LangChain can connect to Supabase’s PG Vector-backed table and run similarity search with relevance scores.
  7. Metadata filters (e.g., source = “Nvidia financial results”) let retrieval target specific document subsets while still using embedding similarity.

Highlights

Cosine similarity turns vector closeness into a numeric relevance score, making “related” queries consistently rank higher than unrelated ones.
FastEmbed with “BAAI/bge-small-en-v1.5” produces 384-dimensional embeddings that LangChain can use to rank mixed document chunks.
Using PG Vector inside Supabase enables similarity search through a SQL-backed match_documents function, avoiding a separate vector service.
Metadata filtering keeps retrieval constrained to a chosen corpus (such as Nvidia financial results) without changing the embedding-based ranking method.
