3-Build RAG Pipeline From Scratch-Building Advanced Retrieval Query Pipeline-Part 2
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RAG answers are grounded by retrieving relevant chunks from a vector DB using an embedded user query, then injecting that context into an LLM prompt.
Briefing
Retrieval-Augmented Generation (RAG) becomes practical once the system can (1) pull the right chunks from a vector database and (2) feed that retrieved context into an LLM with a clear prompt—then optionally return evidence, confidence, and richer metadata. After completing the earlier data injection pipeline (chunking documents, embedding them, and persisting vectors in a vector DB), the focus shifts to the query retrieval pipeline, where a new user question is embedded, matched against stored vectors, and combined with prompt instructions to produce an answer.
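The retrieval half of that loop can be sketched with LangChain: re-open the persisted vector store using the same embedding model as ingestion, then search it with the user's question. The store choice (Chroma), embedding model, and persist directory below are illustrative assumptions, not necessarily the transcript's exact values:

```python
# Retrieval sketch: re-open the persisted vector store with the SAME embedding
# model used during ingestion, then search it with the user query.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Assumptions: ingestion used this embedding model and persist directory.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

query = "What does the document say about attention?"
# similarity_search embeds the query internally and returns the closest chunks.
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.metadata.get("source"), doc.metadata.get("page"), doc.page_content[:80])
```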
The retrieval flow is straightforward: a user query is converted into vectors using the same embedding approach used during ingestion. Those vectors are used to query the vector DB, returning the most relevant context. That context is then inserted into a prompt template—effectively “augmentation”—so the LLM generates an answer grounded in retrieved text rather than relying purely on its internal knowledge. In the implementation, the LLM is set up using Groq via an API key (loaded from environment variables), with a chosen model name (Gemma 2), temperature set to 0.1, and a maximum token limit of 1024. A simple RAG function then ties everything together: it retrieves top-k documents (k=3), concatenates their contents into a single context string, checks for empty retrieval results, formats a prompt using the context and question, and invokes the LLM to return the response content.
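A minimal sketch of that wiring using the langchain_groq client; the exact Groq model ID ("gemma2-9b-it") and the prompt wording are assumptions based on the settings described above:

```python
import os
from langchain_groq import ChatGroq

# LLM setup as described: Groq-hosted Gemma 2, low temperature, 1024-token cap.
# Assumption: the Groq model ID is "gemma2-9b-it".
llm = ChatGroq(
    model="gemma2-9b-it",
    temperature=0.1,
    max_tokens=1024,
    api_key=os.environ["GROQ_API_KEY"],  # loaded from environment variables
)

def simple_rag(question: str, vectorstore, k: int = 3) -> str:
    """Retrieve top-k chunks, build a context block, and ask the LLM."""
    docs = vectorstore.similarity_search(question, k=k)
    if not docs:  # guard against empty retrieval instead of letting the LLM guess
        return "No relevant context found in the knowledge base."
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm.invoke(prompt).content

# Usage: print(simple_rag("What is discussed in chapter 2?", vectorstore))
```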
The next step upgrades the “simple” pipeline into an “enhanced” RAG pipeline that prioritizes transparency. Instead of returning only an answer, the system also returns sources, a confidence score, and optional context. The retrieval results include similarity scores and metadata such as source file and page number. The code aggregates these into a structured output: if retrieval returns nothing, it returns a “no relevant context found” style response with empty sources and a confidence of 0.0. When results exist, it builds a context string (truncating each chunk to a preview length, such as the first 300 characters) and derives a confidence value from the similarity score. The prompt still instructs the LLM to answer concisely using the retrieved context, but the user now sees where the answer came from.
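A sketch of that enhanced version, assuming the vector store exposes similarity_search_with_score (as Chroma does through LangChain) and that confidence is obtained by inverting the best distance score; the exact score-to-confidence mapping in the transcript may differ:

```python
def enhanced_rag(question: str, vectorstore, llm, k: int = 3, preview_chars: int = 300) -> dict:
    """Return the answer plus sources, a similarity-based confidence, and context previews."""
    results = vectorstore.similarity_search_with_score(question, k=k)
    if not results:
        return {"answer": "No relevant context found.", "sources": [], "confidence": 0.0}

    context_parts, sources = [], []
    for doc, score in results:
        context_parts.append(doc.page_content[:preview_chars])  # cap each chunk preview
        sources.append({
            "source": doc.metadata.get("source", "unknown"),
            "page": doc.metadata.get("page"),
            "score": float(score),
        })

    # Chroma returns a distance (lower = closer); map the best match into a
    # 0..1 confidence. This mapping is an assumption, not the transcript's formula.
    best_distance = min(score for _, score in results)
    confidence = round(1.0 / (1.0 + best_distance), 2)

    context = "\n\n".join(context_parts)
    prompt = (
        "Answer concisely using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = llm.invoke(prompt).content
    return {"answer": answer, "sources": sources, "confidence": confidence, "context": context}
```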
Finally, a more advanced variant adds streaming citations, history, and summarization behavior. The transcript notes that some queries may fail depending on context size and retrieval thresholds (for example, adjusting minimum score values can change whether relevant chunks are found). The overall takeaway is that RAG pipelines can be built incrementally: start with retrieval + generation, then add evidence (sources and confidence), and then layer in usability features like streaming citations and summarization. The work also sets up a next phase: refactoring the code from a single file into a modular folder structure for maintainability.
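The sensitivity to retrieval thresholds can be made concrete with a minimum-score filter; the cutoff value and the helper below are illustrative assumptions, not the transcript's exact settings:

```python
def retrieve_with_threshold(question: str, vectorstore, k: int = 5, min_score: float = 0.3):
    """Keep only chunks whose relevance score clears the minimum threshold.

    Raising min_score makes answers more precise but can leave some queries with
    no context at all; lowering it trades precision for coverage.
    """
    # similarity_search_with_relevance_scores normalizes scores to [0, 1],
    # where higher means more relevant (unlike raw distance scores).
    scored = vectorstore.similarity_search_with_relevance_scores(question, k=k)
    return [(doc, score) for doc, score in scored if score >= min_score]
```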
Cornell Notes
RAG works by embedding a user query, retrieving the most similar document chunks from a vector DB, and then using an LLM to generate an answer grounded in that retrieved context. The transcript first builds a “simple” RAG function that retrieves top-k results (k=3), concatenates their text into a context block, formats a prompt with the context and question, and invokes Groq (Gemma 2) to produce the answer. It then upgrades to an “enhanced” pipeline that returns not just the answer, but also sources (file + page), similarity-based confidence, and optional context previews. A final advanced version adds streaming citations, history, and summarization, with retrieval thresholds (like minimum score) affecting whether relevant context is found.
How does the query retrieval pipeline in RAG turn a new question into an LLM-grounded answer?
What does the “simple RAG” function do step-by-step in the implementation?
How does the “enhanced RAG” pipeline improve output quality and trust?
What information is extracted from retrieved documents to build sources and confidence?
Why might some queries return “no relevant context found” in the advanced pipeline?
Review Questions
- In a RAG system, what are the distinct roles of retrieval (vector DB lookup) and generation (LLM invocation), and how are they connected via prompt augmentation?
- How would you modify the simple RAG pipeline to return the top-k retrieved chunks verbatim along with the final LLM answer?
- What metadata fields (e.g., file name, page number) and similarity scores are used in the enhanced pipeline to compute confidence and sources?
Key Points
1. RAG answers are grounded by retrieving relevant chunks from a vector DB using an embedded user query, then injecting that context into an LLM prompt.
2. A simple RAG implementation retrieves top-k documents, concatenates their content into a context block, and invokes Groq (Gemma 2) with a context-and-question prompt.
3. Empty retrieval results should be handled explicitly to avoid hallucinated answers; the pipeline returns a “no relevant context” style response when results are empty.
4. An enhanced RAG pipeline improves trust by returning sources (source file and page number) and a confidence score derived from similarity scores.
5. Retrieval thresholds such as minimum score can determine whether context is found; tuning them affects answer availability.
6. More advanced RAG variants can add streaming citations, history, and summarization, but they may still be sensitive to context size and retrieval settings.
7. For maintainability, the next step is refactoring the RAG code from a single file into a modular folder structure.