3-Build RAG Pipeline From Scratch-Building Advanced Retrieval Query Pipeline-Part 2
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RAG answers are grounded by retrieving relevant chunks from a vector DB using an embedded user query, then injecting that context into an LLM prompt.
Briefing
Retrieval-Augmented Generation (RAG) becomes practical once the system can (1) pull the right chunks from a vector database and (2) feed that retrieved context into an LLM with a clear prompt—then optionally return evidence, confidence, and richer metadata. After completing the earlier data injection pipeline (chunking documents, embedding them, and persisting vectors in a vector DB), the focus shifts to the query retrieval pipeline, where a new user question is embedded, matched against stored vectors, and combined with prompt instructions to produce an answer.
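The retrieval half of that loop can be sketched with LangChain: re-open the persisted vector store using the same embedding model as ingestion, then search it with the user's question. The store choice (Chroma), embedding model, and persist directory below are illustrative assumptions, not necessarily the transcript's exact values:

```python
# Retrieval sketch: re-open the persisted vector store with the SAME embedding
# model used during ingestion, then search it with the user query.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Assumptions: ingestion used this embedding model and persist directory.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

query = "What does the document say about attention?"
# similarity_search embeds the query internally and returns the closest chunks.
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"), doc.page_content[:80])
```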
The retrieval flow is straightforward: a user query is converted into vectors using the same embedding approach used during ingestion. Those vectors are used to query the vector DB, returning the most relevant context. That context is then inserted into a prompt template—effectively “augmentation”—so the LLM generates an answer grounded in retrieved text rather than relying purely on its internal knowledge. In the implementation, the LLM is set up using Groq via an API key (loaded from environment variables), with a chosen model name (Gemma 2), temperature set to 0.1, and a maximum token limit of 1024. A simple RAG function then ties everything together: it retrieves top-k documents (k=3), concatenates their contents into a single context string, checks for empty retrieval results, formats a prompt using the context and question, and invokes the LLM to return the response content.
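A minimal sketch of that wiring using the langchain_groq client; the exact Groq model ID ("gemma2-9b-it") and the prompt wording are assumptions based on the settings described above:

```python
import os
from langchain_groq import ChatGroq

# LLM setup as described: Groq-hosted Gemma 2, low temperature, 1024-token cap.
# Assumption: the Groq model ID is "gemma2-9b-it".
llm = ChatGroq(
    model="gemma2-9b-it",
    temperature=0.1,
    max_tokens=1024,
    api_key=os.environ["GROQ_API_KEY"],  # loaded from environment variables
)

def simple_rag(question: str, vectorstore, k: int = 3) -> str:
    """Retrieve top-k chunks, build a context block, and ask the LLM."""
    docs = vectorstore.similarity_search(question, k=k)
    if not docs:  # guard against empty retrieval instead of letting the LLM guess
        return "No relevant context found in the knowledge base."
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.invoke(prompt).content

# Usage: print(simple_rag("What is discussed in chapter 2?", vectorstore))
```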
The next step upgrades the “simple” pipeline into an “enhanced” RAG pipeline that prioritizes transparency. Instead of returning only an answer, the system also returns sources, a confidence score, and optional context. The retrieval results include similarity scores and metadata such as source file and page number. The code aggregates these into a structured output: if retrieval returns nothing, it returns a “no relevant context found” style response with empty sources and a confidence of 0.0. When results exist, it builds a context string (truncating each chunk to a preview length, such as the first 300 characters) and derives a confidence value from the similarity score. The prompt still instructs the LLM to answer concisely using the retrieved context, but the user now sees where the answer came from.
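A sketch of that enhanced version, assuming the vector store exposes similarity_search_with_score (as Chroma does through LangChain) and that confidence is obtained by inverting the best distance score; the exact score-to-confidence mapping in the transcript may differ:

```python
def enhanced_rag(question: str, vectorstore, llm, k: int = 3, preview_chars: int = 300) -> dict:
    """Return the answer plus sources, a similarity-based confidence, and context previews."""
    results = vectorstore.similarity_search_with_score(question, k=k)
    if not results:
        return {"answer": "No relevant context found.", "sources": [], "confidence": 0.0}

    context_parts, sources = [], []
    for doc, score in results:
        context_parts.append(doc.page_content[:preview_chars])  # cap each chunk preview
        sources.append({
            "source": doc.metadata.get("source", "unknown"),
            "page": doc.metadata.get("page"),
            "score": float(score),
        })

    # Chroma returns a distance (lower = closer); map the best match into a
    # 0..1 confidence. This mapping is an assumption, not the transcript's formula.
    best_distance = min(score for _, score in results)
    confidence = round(1.0 / (1.0 + best_distance), 2)

    context = "\n\n".join(context_parts)
    prompt = (
        "Answer concisely using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = llm.invoke(prompt).content
    return {"answer": answer, "sources": sources, "confidence": confidence, "context": context}
```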
Finally, a more advanced variant adds streaming citations, history, and summarization behavior. The transcript notes that some queries may fail depending on context size and retrieval thresholds (for example, adjusting minimum score values can change whether relevant chunks are found). The overall takeaway is that RAG pipelines can be built incrementally: start with retrieval + generation, then add evidence (sources and confidence), and then layer in usability features like streaming citations and summarization. The work also sets up a next phase: refactoring the code from a single file into a modular folder structure for maintainability.
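The sensitivity to retrieval thresholds can be made concrete with a minimum-score filter; the cutoff value and the helper below are illustrative assumptions, not the transcript's exact settings:

```python
def retrieve_with_threshold(question: str, vectorstore, k: int = 5, min_score: float = 0.3):
    """Keep only chunks whose relevance score clears the minimum threshold.

    Raising min_score makes answers more precise but can leave some queries with
    no context at all; lowering it trades precision for coverage.
    """
    # similarity_search_with_relevance_scores normalizes scores to [0, 1],
    # where higher means more relevant (unlike raw distance scores).
    scored = vectorstore.similarity_search_with_relevance_scores(question, k=k)
    return [(doc, score) for doc, score in scored if score >= min_score]
```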
Cornell Notes
RAG works by embedding a user query, retrieving the most similar document chunks from a vector DB, and then using an LLM to generate an answer grounded in that retrieved context. The transcript first builds a “simple” RAG function that retrieves top-k results (k=3), concatenates their text into a context block, formats a prompt with the context and question, and invokes Groq (Gemma 2) to produce the answer. It then upgrades to an “enhanced” pipeline that returns not just the answer, but also sources (file + page), similarity-based confidence, and optional context previews. A final advanced version adds streaming citations, history, and summarization, with retrieval thresholds (like minimum score) affecting whether relevant context is found.
How does the query retrieval pipeline in RAG turn a new question into an LLM-grounded answer?
What does the “simple RAG” function do step-by-step in the implementation?
How does the “enhanced RAG” pipeline improve output quality and trust?
What information is extracted from retrieved documents to build sources and confidence?
Why might some queries return “no relevant context found” in the advanced pipeline?
Review Questions
- In a RAG system, what are the distinct roles of retrieval (vector DB lookup) and generation (LLM invocation), and how are they connected via prompt augmentation?
- How would you modify the simple RAG pipeline to return the top-k retrieved chunks verbatim along with the final LLM answer?
- What metadata fields (e.g., file name, page number) and similarity scores are used in the enhanced pipeline to compute confidence and sources?
Key Points
1. RAG answers are grounded by retrieving relevant chunks from a vector DB using an embedded user query, then injecting that context into an LLM prompt.
2. A simple RAG implementation retrieves top-k documents, concatenates their content into a context block, and invokes Groq (Gemma 2) with a context-and-question prompt.
3. Empty retrieval results should be handled explicitly to avoid hallucinated answers; the pipeline returns a “no relevant context” style response when results are empty.
4. An enhanced RAG pipeline improves trust by returning sources (source file and page number) and a confidence score derived from similarity scores.
5. Retrieval thresholds such as minimum score can determine whether context is found; tuning them affects answer availability.
6. More advanced RAG variants can add streaming citations, history, and summarization, but they may still be sensitive to context size and retrieval settings.
7. For maintainability, the next step is refactoring the RAG code from a single file into a modular folder structure.