What is RAG? The Complete Tutorial - From Scratch to Deployed API on Production | LangChain & Ollama
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Retrieval-Augmented Generation (RAG) is positioned as the practical fix for a core limitation of “just stuff everything into the prompt” approaches: most companies can’t fit large, private datasets into a single context window without exploding cost and latency—and even then, models may not reliably use all provided context. RAG instead grounds answers in a company’s own knowledge by retrieving the most relevant document chunks at question time, then feeding those chunks alongside the user query into a language model. The result is a pipeline that scales internal knowledge use without forcing every document into every request.
At a high level, RAG runs in two stages. The ingestion pipeline takes source material—PDFs, text files, web pages, GitHub repositories, and other textual or multimodal inputs—splits it into chunks, converts chunks into vectors, and stores those vectors in a vector database. The question-answering pipeline then embeds the user query into a query vector, searches the vector store for the top-k most similar chunks, builds an augmented prompt containing the retrieved context plus the original question, and finally generates an answer grounded in that retrieved background.
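The two stages can be sketched end to end in a few lines. This is a minimal illustration, not the video's code: the "embedding" here is a toy bag-of-words counter standing in for a real embedding model, and the in-memory list stands in for a vector database.

```python
from collections import Counter
import math
import re

def embed(text: str) -> Counter:
    # Toy "embedding": bag of lowercase word counts. A real system would
    # call a learned embedding model here; this stand-in is an assumption.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ingest(document: str) -> list[tuple[Counter, str]]:
    # Stage 1: split on blank lines, vectorize, keep in an in-memory store.
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    return [(embed(c), c) for c in chunks]

def build_prompt(store, question: str, k: int = 2) -> str:
    # Stage 2: embed the query, rank chunks by similarity, assemble the
    # augmented prompt; actual generation is handed off to the LLM.
    qv = embed(question)
    ranked = sorted(store, key=lambda e: cosine(qv, e[0]), reverse=True)
    context = "\n".join(chunk for _, chunk in ranked[:k])
    return (
        "Use only this context to answer.\n"
        f"<context>\n{context}\n</context>\nQuestion: {question}"
    )

doc = (
    "Refunds are issued within 14 days.\n\n"
    "Complaints go to the support team.\n\n"
    "Shipping takes 3 business days."
)
store = ingest(doc)
prompt = build_prompt(store, "Where do complaints go?")
```

Note that only the top-k chunks reach the prompt, which is exactly how RAG keeps per-request cost flat while the knowledge base grows.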
The tutorial narrows the implementation to make the moving parts visible. Instead of embeddings-based retrieval, the retriever uses a TF-IDF approach: it tokenizes and vectorizes the knowledge base, computes similarity between a query and each chunk, and returns the top-k chunks along with similarity scores. The approach is tested on a real-world sample PDF (a customer complaint policy from Clarence Valley Conservatorium): the document is split into chunks (initially on double newlines), transformed into a TF-IDF vocabulary, and queried with examples like how to handle an angry customer. Similarity scores sometimes land near zero, signaling weak relevance, which motivates filtering or careful prompt/context construction.
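A from-scratch TF-IDF retriever matching this description can be written with the standard library alone; the class below is a sketch of the mechanics (the video may structure its code differently), including the similarity scores whose near-zero values flag weak matches.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class TfidfRetriever:
    """Minimal TF-IDF retriever: vectorize chunks, score queries by cosine."""

    def __init__(self, chunks: list[str]):
        self.chunks = chunks
        docs = [Counter(tokenize(c)) for c in chunks]
        n = len(chunks)
        df = Counter(t for d in docs for t in d)           # document frequency
        self.idf = {t: math.log(n / df[t]) + 1 for t in df}  # smoothed IDF
        self.vecs = [self._vec(d) for d in docs]

    def _vec(self, counts: Counter) -> dict[str, float]:
        # TF * IDF, then L2-normalize so a dot product is cosine similarity.
        v = {t: c * self.idf.get(t, 0.0) for t, c in counts.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {t: x / norm for t, x in v.items()}

    def search(self, query: str, k: int = 2) -> list[tuple[str, float]]:
        q = self._vec(Counter(tokenize(query)))
        scores = [sum(q.get(t, 0.0) * w for t, w in v.items()) for v in self.vecs]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [(self.chunks[i], scores[i]) for i in top]

chunks = [
    "Stay calm and listen when a customer is angry.",
    "Invoices are sent at the end of each month.",
]
retriever = TfidfRetriever(chunks)
hits = retriever.search("how to handle an angry customer", k=1)
```

Because TF-IDF is purely lexical, a query sharing no vocabulary with a chunk scores exactly zero — which is why the tutorial treats low scores as a signal to filter chunks out of the prompt.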
Next comes a minimal RAG system: retrieved chunks are wrapped into an XML-style context block and inserted into a RAG prompt template, then sent to a reasoning model (Qwen3 4B — the transcript's "Quenry … 3 4 billion parameter" most plausibly refers to the 4-billion-parameter Qwen3 served via Ollama). With streaming enabled, the system outputs both the model's internal "thinking" and the final answer, producing guidance such as staying calm, asking permission before questioning, and explaining actions when handling complaints.
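The prompt-construction step looks roughly like the following; the tag names and template wording are illustrative, not the video's exact strings. Once the string is built, it can be streamed through any chat model (in the tutorial, a reasoning model served locally via Ollama).

```python
# Wrap retrieved chunks in an XML-style context block and slot them into
# a RAG prompt template. Tag names and wording are assumptions.
RAG_TEMPLATE = """Answer the question using only the provided context.

<context>
{context}
</context>

Question: {question}
Answer:"""

def make_rag_prompt(chunks: list[str], question: str) -> str:
    # One <chunk> element per retrieved chunk keeps boundaries explicit,
    # which makes the final prompt easy to eyeball when debugging.
    context = "\n\n".join(f"<chunk>\n{c}\n</chunk>" for c in chunks)
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = make_rag_prompt(
    ["Stay calm and ask permission before questioning the customer."],
    "How should I handle an angry customer?",
)
```

Explicit delimiters matter: a model told to use "the provided context" performs noticeably better when the context is unambiguously fenced off from the instructions and the question.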
The knowledge base then expands to external documents by parsing PDFs with PyMuPDF (garbled in the transcript as "PDF 2"). Chunking shifts to page-level text extraction, and retrieval pulls relevant pages; the model may reorder context when needed and then generate an answer using the complaint policy pages.
Because RAG pipelines can fail in subtle ways—especially through prompt formatting—tracing is added with MLflow (open-source tracing; the transcript renders the name as "ML4"). Spans capture retrieval inputs/outputs and the final prompt sent to the model, letting developers inspect which PDF pages and chunks were retrieved and how the augmented prompt was formed.
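Whatever tracing tool is used, the payload of a span is the important part. This standard-library sketch mimics the idea: each span records a name, its inputs, and its outputs, so both the retrieved chunks and the exact final prompt are inspectable after a bad answer. The span API and field names here are invented for illustration, not any specific library's.

```python
# Hand-rolled tracing sketch: spans capture name, inputs, and outputs.
from contextlib import contextmanager

TRACE: list[dict] = []

@contextmanager
def span(name: str, **inputs):
    # Append a record up front so spans appear in execution order,
    # then let the caller fill in outputs before the block exits.
    record = {"name": name, "inputs": inputs, "outputs": None}
    TRACE.append(record)
    yield record

# Example values below are illustrative, not real pipeline output.
with span("retrieve", query="angry customer") as s:
    s["outputs"] = {"chunks": ["Stay calm and listen."], "scores": [0.42]}

with span("build_prompt", chunks=s["outputs"]["chunks"]) as s2:
    s2["outputs"] = {"prompt": "<context>Stay calm and listen.</context>\nQ: ..."}
```

With records like these, the failure mode the tutorial highlights — a malformed augmented prompt — is caught by reading the `build_prompt` span rather than guessing from the model's answer.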
Finally, the pipeline is deployed as a production-ready API. FastAPI provides two endpoints: an upload endpoint that validates PDFs, extracts pages, chunks them, and rebuilds the in-memory vector store; and an ask endpoint that streams responses via Server-Sent Events. The service is containerized with Docker (multi-stage build, health check, Uvicorn startup) and deployed to Render as a web service. The deployed API is tested end-to-end by uploading the PDF and asking questions like what to do when a customer is angry, with responses grounded in the uploaded policy.
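The streaming side of the ask endpoint rides on the Server-Sent Events wire format: each token goes out as a `data:` line followed by a blank line. The generator below shows just that framing in plain Python; in the tutorial a generator like this would be wrapped in FastAPI's `StreamingResponse` with `media_type="text/event-stream"`. The `[DONE]` sentinel is a common convention, not part of the SSE spec.

```python
# SSE framing for a streaming /ask endpoint.
def sse_events(tokens):
    for tok in tokens:
        yield f"data: {tok}\n\n"      # one SSE event per token
    yield "data: [DONE]\n\n"          # end-of-stream sentinel (convention)

stream = "".join(sse_events(["Stay", " calm", "."]))
```

Streaming token by token lets the client render the answer as it is generated, which matters for reasoning models whose "thinking" phase can take several seconds before the final answer appears.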
Cornell Notes
RAG is presented as a way to answer questions using private company data without stuffing entire datasets into a model’s context window. The pipeline has two parts: ingestion (chunk documents, convert to vectors, store in a vector database) and question answering (embed the query, retrieve top-k similar chunks, build an augmented prompt, then generate an answer grounded in retrieved context). The implementation starts with a TF-IDF retriever to make retrieval mechanics clear, then upgrades to PDF ingestion using PyMuPDF with page-level chunking. MLflow tracing (rendered as "ML4" in the transcript) is used to debug retrieval and prompt formatting by inspecting retrieved pages, the final prompt, and the generated output. The system is then wrapped in a FastAPI REST service, containerized with Docker, and deployed to Render with streaming responses.
Why does RAG beat “put all data into the prompt” for internal company knowledge?
What are the two stages of a typical RAG pipeline?
How does the tutorial’s TF-IDF retriever work, and what do similarity scores mean?
How is the RAG prompt constructed and why does formatting matter?
What changes when moving from a string knowledge base to an external PDF knowledge base?
How does the production API handle documents and responses?
Review Questions
- How do ingestion-time chunking choices (double-newline vs page-level) affect retrieval results in RAG?
- What debugging signals from tracing (retrieved pages/chunks and the final augmented prompt) help pinpoint why an answer is ungrounded?
- In the TF-IDF retriever approach, how would you use similarity scores to decide whether to include retrieved context in the prompt?
Key Points
1. RAG avoids prompt bloat by retrieving top-k relevant chunks from a private knowledge base at question time, then grounding generation in that retrieved context.
2. A complete RAG system has an ingestion pipeline (chunk → vectorize → store) and a question-answering pipeline (embed query → retrieve → augment prompt → generate).
3. TF-IDF retrieval can be used as a simple baseline to understand retrieval mechanics, including how similarity scores reflect lexical relevance.
4. PDF support can be added by extracting page text with PyMuPDF and chunking at the page level, then reusing the same retrieval-and-prompt structure.
5. Prompt formatting errors are a common RAG failure mode; tracing with MLflow (the transcript's "ML4") helps inspect retrieved context and the exact final prompt sent to the model.
6. A production-ready RAG service can be built with FastAPI using an upload endpoint to ingest PDFs and an ask endpoint that streams responses via Server-Sent Events.
7. Docker multi-stage builds and deployment to Render enable running the RAG API as a containerized web service with a health check and streaming chat endpoint.
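For reference, a multi-stage build of the kind described looks roughly like this; paths, port, module names, and the health-check URL are assumptions rather than the video's exact Dockerfile.

```dockerfile
# Stage 1: install dependencies into an isolated prefix.
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: slim runtime image with only the installed packages and app code.
FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
# Assumes the API exposes a /health endpoint.
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

The two-stage split keeps build tooling out of the final image, which shrinks the deployed container and its attack surface.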