
What is RAG? The Complete Tutorial - From Scratch to Deployed API on Production | LangChain & Ollama

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

RAG avoids prompt bloat by retrieving top-k relevant chunks from a private knowledge base at question time, then grounding generation in that retrieved context.

Briefing

Retrieval-Augmented Generation (RAG) is positioned as the practical fix for a core limitation of “just stuff everything into the prompt” approaches: most companies can’t fit large, private datasets into a single context window without exploding cost and latency—and even then, models may not reliably use all provided context. RAG instead grounds answers in a company’s own knowledge by retrieving the most relevant document chunks at question time, then feeding those chunks alongside the user query into a language model. The result is a pipeline that scales internal knowledge use without forcing every document into every request.

At a high level, RAG runs in two stages. The ingestion pipeline takes source material—PDFs, text files, web pages, GitHub repositories, and other textual or multimodal inputs—splits it into chunks, converts chunks into vectors, and stores those vectors in a vector database. The question-answering pipeline then embeds the user query into a query vector, searches the vector store for the top-k most similar chunks, builds an augmented prompt containing the retrieved context plus the original question, and finally generates an answer grounded in that retrieved background.
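As a rough structural sketch (not the tutorial's actual code), the two stages reduce to an ingest step and a retrieve step. Here `embed()` is a hypothetical stand-in for a real embedding model, and the "vector store" is just an in-memory list searched by cosine similarity:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: deterministic pseudo-embedding.
    # A real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

store: list[tuple[np.ndarray, str]] = []  # stand-in "vector database"

def ingest(documents: list[str]) -> None:
    # Ingestion pipeline: chunk, vectorize, store.
    for doc in documents:
        for chunk in doc.split("\n\n"):  # naive double-newline chunking
            store.append((embed(chunk), chunk))

def retrieve(question: str, k: int = 5) -> list[str]:
    # Question-answering pipeline, retrieval half: embed query, rank by
    # cosine similarity, return the top-k chunks.
    q = embed(question)
    sims = [
        float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
        for v, _ in store
    ]
    ranked = sorted(zip(sims, (c for _, c in store)), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```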

The tutorial narrows the implementation to make the moving parts visible. Instead of embedding-based retrieval, the retriever uses a TF-IDF approach: it tokenizes and vectorizes the knowledge base, computes similarity between a query and each chunk, and returns the top-k chunks along with similarity scores. Using a real-world sample PDF (the "customer complaint policy" from Clarence Valley Conservatorium), the document is split into chunks (initially by double newlines), vectorized with a TF-IDF model fitted on those chunks, and tested with example queries such as how to handle an angry customer. Similarity scores sometimes land near zero, signaling weak relevance, which motivates filtering or careful prompt/context construction.
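A minimal version of that retriever, assuming scikit-learn's `TfidfVectorizer` and cosine similarity (the tutorial's exact code may differ, and the sample text below is illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base; in the tutorial this comes from the policy PDF.
raw_text = (
    "Stay calm when a customer raises a complaint.\n\n"
    "Record every complaint in the register within 24 hours.\n\n"
    "Escalate unresolved complaints to the manager."
)
chunks = raw_text.split("\n\n")  # double-newline chunking

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)  # fit vocabulary on chunks

def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    # Transform the query into the same TF-IDF space, rank chunks by
    # cosine similarity, and return the top-k with their scores.
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, chunk_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top]

print(retrieve("how to handle an angry customer"))
```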

Next comes a minimal RAG system: retrieved chunks are wrapped into an XML-style context block and inserted into a RAG prompt template, then sent to a reasoning model (Qwen3, the 4-billion-parameter variant, running locally via Ollama). With streaming enabled, the system outputs both the model's internal "thinking" and the final answer, producing guidance such as staying calm, asking permission before questioning, and explaining actions when handling complaints.
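A hedged sketch of that step using the `ollama` Python client; the exact prompt template, XML tags, and model tag are assumptions, not the tutorial's verbatim code:

```python
import ollama

def build_prompt(chunks: list[str], question: str) -> str:
    # XML-style context block, as described above; tag names are illustrative.
    context = "\n".join(f"<chunk>{c}</chunk>" for c in chunks)
    return (
        "Answer using only the provided context.\n"
        f"<context>\n{context}\n</context>\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    ["Stay calm when a customer raises a complaint."],
    "How should I handle an angry customer?",
)

# Stream tokens (reasoning and answer) as they arrive from the local model.
for chunk in ollama.chat(
    model="qwen3:4b",  # assumed tag for the 4B reasoning model
    messages=[{"role": "user", "content": prompt}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
```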

The knowledge base then expands to external documents by parsing PDFs with PyMuPDF. Chunking shifts to page-level text extraction, and retrieval pulls relevant pages; the model may reorder context when needed and then generate an answer using the complaint policy pages.
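Page-level extraction with PyMuPDF is only a few lines (a minimal sketch; the filename is a placeholder):

```python
import fitz  # PyMuPDF's import name

doc = fitz.open("complaint_policy.pdf")   # placeholder path
pages = [page.get_text() for page in doc]  # one chunk per page
print(f"Extracted {len(pages)} page-level chunks")
```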

Because RAG pipelines can fail in subtle ways, especially through prompt formatting, tracing is added using MLflow (open-source tracing). Spans capture retrieval inputs/outputs and the final prompt sent to the model, letting developers inspect which PDF pages and chunks were retrieved and how the augmented prompt was formed.
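A minimal sketch of span-based tracing with MLflow's fluent API, reusing the `retrieve()` helper from the TF-IDF sketch above; the span name and payload keys are illustrative:

```python
import mlflow

question = "How should I handle an angry customer?"

# Record retrieval inputs/outputs on a span so the retrieved chunks
# (and, in a fuller pipeline, the final prompt) can be inspected later.
with mlflow.start_span(name="retrieval") as span:
    span.set_inputs({"question": question, "k": 3})
    results = retrieve(question, k=3)
    span.set_outputs({"chunks": [c for c, _ in results]})
```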

Finally, the pipeline is deployed as a production-ready API. FastAPI provides two endpoints: an upload endpoint that validates PDFs, extracts pages, chunks them, and rebuilds the in-memory vector store; and an ask endpoint that streams responses via Server-Sent Events. The service is containerized with Docker (multi-stage build, health check, Uvicorn startup) and deployed to Render as a web service. The deployed API is tested end-to-end by uploading the PDF and asking questions like what to do when a customer is angry, with responses grounded in the uploaded policy.

Cornell Notes

RAG is presented as a way to answer questions using private company data without stuffing entire datasets into a model's context window. The pipeline has two parts: ingestion (chunk documents, convert to vectors, store in a vector database) and question answering (embed the query, retrieve top-k similar chunks, build an augmented prompt, then generate an answer grounded in retrieved context). The implementation starts with a TF-IDF retriever to make retrieval mechanics clear, then upgrades to PDF ingestion using PyMuPDF with page-level chunking. MLflow tracing is used to debug retrieval and prompt formatting by inspecting retrieved pages, the final prompt, and the generated output. The system is then wrapped in a FastAPI REST service, containerized with Docker, and deployed to Render with streaming responses.

Why does RAG beat “put all data into the prompt” for internal company knowledge?

Large private datasets usually can’t fit into a single prompt without increasing cost and latency, and even with large context windows, models may not reliably use everything provided. RAG limits what each request includes by retrieving only the most relevant chunks at question time, then grounding generation in that retrieved context.

What are the two stages of a typical RAG pipeline?

Ingestion: source documents (PDFs, text, web pages, GitHub, etc.) are split into chunks, converted into vectors, and stored in a vector database. Question answering: a user query is embedded into a query vector, the vector store returns the top-k most similar chunks, those chunks are inserted into an augmented prompt with the user question, and the language model generates the final answer.

How does the tutorial’s TF-IDF retriever work, and what do similarity scores mean?

The retriever uses TF-IDF instead of embeddings. It fits a vectorizer on the knowledge base chunks, transforms the user query into the same vector space, computes similarity between the query and each chunk, then returns the top-k chunks plus their similarity scores. Scores near 0 indicate weak relevance; higher scores indicate stronger lexical overlap and more likely relevance to the query.
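Building on the TF-IDF sketch above, one simple policy (an assumption, not the tutorial's stated rule) is to drop chunks below a score threshold before building the prompt:

```python
MIN_SCORE = 0.1  # illustrative threshold; tune per corpus

results = retrieve("how to handle an angry customer", k=3)
relevant = [(chunk, score) for chunk, score in results if score >= MIN_SCORE]
if not relevant:
    print("No sufficiently relevant context; consider asking the user "
          "to rephrase rather than answering from weak matches.")
```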

How is the RAG prompt constructed and why does formatting matter?

Retrieved chunks are wrapped into a context block (XML-style in the implementation) and combined with the user query using a RAG prompt template. Formatting mistakes can break grounding—so the tutorial adds tracing to inspect the exact final prompt sent to the model and the retrieved context used.

What changes when moving from a string knowledge base to an external PDF knowledge base?

PDF ingestion uses PyMuPDF to extract page text into a list of strings. Chunking shifts to page-level extraction, and retrieval returns relevant pages (with the model sometimes reordering context). The RAG pipeline remains the same structurally: retrieve top-k chunks/pages, build the augmented prompt, then generate.

How does the production API handle documents and responses?

FastAPI exposes an upload endpoint that validates the file is a PDF, reads bytes, extracts pages, rebuilds the in-memory chunks and TF-IDF vector store, and returns metadata like filename and page count. An ask endpoint streams model output using Server-Sent Events, sending incremental JSON chunks and a final “done” chunk to close the stream.
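A hedged sketch of those two endpoints; the route names, JSON shapes, and the `answer_stream()` generator are assumptions standing in for the tutorial's exact code:

```python
import json

import fitz  # PyMuPDF
from fastapi import FastAPI, HTTPException, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()
state: dict[str, list[str]] = {"chunks": []}  # in-memory knowledge base

def answer_stream(question: str, chunks: list[str]):
    # Hypothetical stand-in: a real implementation would retrieve top-k
    # chunks and stream LLM tokens. Here we just echo for illustration.
    yield f"(stub) You asked: {question}"

@app.post("/upload")
async def upload(file: UploadFile):
    # Validate the file is a PDF, extract pages, rebuild the store.
    if file.content_type != "application/pdf":
        raise HTTPException(status_code=400, detail="Only PDF files accepted")
    data = await file.read()
    doc = fitz.open(stream=data, filetype="pdf")
    state["chunks"] = [page.get_text() for page in doc]
    return {"filename": file.filename, "pages": len(state["chunks"])}

@app.post("/ask")
async def ask(question: str):
    def event_stream():
        # Stream incremental JSON chunks, then a final "done" chunk.
        for token in answer_stream(question, state["chunks"]):
            yield f"data: {json.dumps({'content': token})}\n\n"
        yield f"data: {json.dumps({'done': True})}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```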

Review Questions

  1. How do ingestion-time chunking choices (double-newline vs page-level) affect retrieval results in RAG?
  2. What debugging signals from tracing (retrieved pages/chunks and the final augmented prompt) help pinpoint why an answer is ungrounded?
  3. In the TF-IDF retriever approach, how would you use similarity scores to decide whether to include retrieved context in the prompt?

Key Points

  1. RAG avoids prompt bloat by retrieving top-k relevant chunks from a private knowledge base at question time, then grounding generation in that retrieved context.

  2. A complete RAG system has an ingestion pipeline (chunk → vectorize → store) and a question-answering pipeline (embed query → retrieve → augment prompt → generate).

  3. TF-IDF retrieval can be used as a simple baseline to understand retrieval mechanics, including how similarity scores reflect lexical relevance.

  4. PDF support can be added by extracting page text with PyMuPDF and chunking at the page level, then reusing the same retrieval-and-prompt structure.

  5. Prompt formatting errors are a common RAG failure mode; tracing with MLflow helps inspect retrieved context and the exact final prompt sent to the model.

  6. A production-ready RAG service can be built with FastAPI using an upload endpoint to ingest PDFs and an ask endpoint that streams responses via Server-Sent Events.

  7. Docker multi-stage builds and deployment to Render enable running the RAG API as a containerized web service with a health check and streaming chat endpoint.

Highlights

  • RAG is framed as a scaling strategy for internal data: it prevents cost/latency blowups from stuffing entire private datasets into prompts.
  • The tutorial uses TF-IDF retrieval first to make retrieval behavior measurable, including similarity scores that can be near zero when relevance is weak.
  • MLflow tracing is used to debug the exact augmented prompt and the retrieved PDF pages/chunks that feed the model.
  • The production API streams answers over Server-Sent Events, sending incremental JSON chunks and a final completion message.
  • Deployment is demonstrated end-to-end: Dockerize the FastAPI service, deploy to Render, and test upload plus grounded chat.

Mentioned

  • RAG
  • TF-IDF
  • REST
  • API
  • PDF
  • XML
  • MLflow
  • FastAPI
  • Docker
  • Uvicorn
  • SSE