Introduction To Understanding RAG (Retrieval-Augmented Generation)

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

RAG grounds LLM outputs in an external knowledge base retrieved at answer time, avoiding reliance solely on fixed training data.

Briefing

Retrieval-Augmented Generation (RAG) is positioned as a practical way to make large language models more reliable and more useful for an organization’s specific knowledge—without retraining the model. Instead of relying only on what an LLM learned during training, RAG pulls in information from an external, authoritative knowledge base at answer time. That matters because it directly targets two common failure modes of “LLM-only” systems: outdated knowledge that leads to hallucinations, and the difficulty of keeping private, frequently changing company data (like HR or finance policies) aligned with model outputs.

In an LLM-only setup, a user query goes through a prompt and then straight into the model to generate an answer. The weakness is temporal and factual: an LLM is trained on a fixed dataset window. If a model was trained on data up to, say, 1 August, it may not know what happened between 1 August and 31 August, yet it will still produce an answer when asked about that period. When the model lacks real knowledge, it tends to “hallucinate”, generating plausible-sounding content rather than admitting that it does not actually know.

A second issue appears when a startup or company wants a chatbot grounded in internal documents that aren’t public. One option is fine-tuning, but that is described as expensive and tedious because modern LLMs have billions of parameters. It also doesn’t fit fast-changing policy data: fine-tuning every day (or even every week) is not realistic. RAG offers an alternative by injecting internal knowledge into a retrieval system rather than modifying the model weights.

RAG is built around two pipelines. The first is the data injection pipeline, which ingests company data from many possible formats—PDF, HTML, Excel, or even SQL databases. The process starts with data parsing and chunking, breaking documents into smaller pieces so they can be indexed effectively. Each chunk is then converted into embeddings, meaning text is transformed into numerical vectors. Those vectors are stored in a vector database (vector store), enabling similarity search using techniques like cosine similarity.
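The transcript stays at the conceptual level, but the injection pipeline is straightforward to sketch. Below is a minimal, illustrative Python sketch: `embed` is a hashing toy standing in for a real embedding model (production systems use learned embeddings such as sentence-transformers), `chunk` is a naive fixed-size splitter, and a plain in-memory list stands in for a vector database. The sample policy text is invented.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a learned embedding model: hashes each word
    into a bucket of a fixed-size vector, then L2-normalizes."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-size chunking; real parsers split on document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# In-memory stand-in for a vector database: (vector, chunk_text) pairs.
vector_store: list[tuple[list[float], str]] = []

def ingest(document_text: str) -> None:
    """Data injection: chunk the parsed document, embed each chunk, store it."""
    for piece in chunk(document_text):
        vector_store.append((embed(piece), piece))

# Invented HR-policy snippet, standing in for parsed PDF/HTML/Excel content.
ingest("Employees accrue 1.5 vacation days per month. Unused days roll over.")
```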

The second pipeline is the retrieval pipeline. When a user asks a question, the query is embedded into a vector as well, then used to search the vector database for the most relevant chunks. The retrieved passages become “context,” which is fed into the LLM alongside a prompt instructing the model to use that context to answer. The result is not a complete elimination of hallucination, but a reduction: if the needed information exists in the vector store, the model has grounded material to draw from; if it doesn’t, hallucination can still occur.
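The retrieval side can be sketched against the same toy store. The snippet below continues the injection sketch above (it reuses `embed` and `vector_store`); the "use only the context" instruction is one common prompt phrasing, not the exact prompt from the video, and the final LLM call is left abstract.

```python
def cosine(a: list[float], b: list[float]) -> float:
    # Vectors from embed() are already L2-normalized, so the dot
    # product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query, rank stored chunks by similarity, return the top k."""
    q = embed(query)
    ranked = sorted(vector_store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str) -> str:
    """Assemble retrieved chunks into grounded context for the LLM."""
    context = "\n\n".join(retrieve(query))
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# The resulting prompt would then be sent to whichever LLM you use.
print(build_prompt("How many vacation days do employees accrue per month?"))
```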

The transcript also frames this as “traditional RAG” and notes that other RAG variants exist, with agentic RAG expected later in the series. As a real-world example, Perplexity is mentioned as a RAG-based system that connects retrievers and tools such as web search, then summarizes results using LLMs. The takeaway is that RAG turns private and up-to-date knowledge into retrievable context, improving relevance and accuracy while avoiding the cost and operational burden of continual fine-tuning.

Cornell Notes

RAG (Retrieval-Augmented Generation) improves LLM answers by grounding them in an external knowledge base rather than relying only on the model’s fixed training data. It addresses two major problems: hallucinations caused by outdated knowledge and the challenge of using private, frequently updated company documents without expensive fine-tuning. RAG uses two pipelines: a data injection pipeline that parses and chunks documents, creates embeddings, and stores them in a vector database; and a retrieval pipeline that embeds user queries, performs similarity search to fetch relevant context, then prompts the LLM to answer using that context. Hallucinations may still happen when the needed information isn’t present, but retrieval reduces them when documents are available.

Why do LLM-only systems hallucinate when asked about recent events or facts outside their training window?

The model is trained on data up to a cutoff date, so it may not know what happened after that point. When a user asks about events between the cutoff and the present, the LLM still generates an answer. Because it lacks real knowledge, it produces plausible-sounding content—described as “hallucinating”—to avoid producing an empty or obviously incorrect response.

Why is fine-tuning often a poor fit for private company policies that change over time?

Fine-tuning is expensive and tedious because LLMs contain billions of parameters. It also doesn’t scale operationally when HR, finance, or other policy documents update frequently; continuously retraining the model would be impractical.

What is the data injection pipeline in RAG, and what happens to documents during it?

Documents from sources like PDF, HTML, Excel, or SQL are parsed and chunked into smaller pieces. Each chunk is converted into an embedding (a numerical vector representation of text) and stored in a vector database. This creates an internal knowledge base that can be searched later.
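As a concrete illustration of the chunking step, here is a small character-window chunker with overlap. The window and overlap sizes are arbitrary choices for illustration, and real pipelines often split on structural boundaries (paragraphs, headings) rather than raw character counts.

```python
def chunk_with_overlap(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into `size`-character windows that share `overlap`
    characters with their neighbor, so a fact straddling a boundary
    still appears whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for text parsed out of a PDF, HTML page, or spreadsheet.
parsed = "Section 4.2: Employees accrue 1.5 leave days per month. " * 40
chunks = chunk_with_overlap(parsed)
print(len(chunks), "chunks of up to 500 characters each")
```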

How does the retrieval pipeline turn a user question into grounded context for the LLM?

The user query is embedded into a vector, then used for similarity search (e.g., cosine similarity) against the vector database. The most relevant stored chunks are returned as context. A prompt then instructs the LLM to answer using that retrieved context.
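To make the similarity step concrete, here is a worked cosine-similarity example with NumPy. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||); 1.0 means same direction,
    # 0.0 means orthogonal (unrelated).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.9, 0.1, 0.3])  # made-up query embedding
chunk_a   = np.array([0.8, 0.2, 0.4])  # similar direction -> high score
chunk_b   = np.array([0.1, 0.9, 0.0])  # different direction -> low score

print(cosine_similarity(query_vec, chunk_a))  # ~0.98, retrieved first
print(cosine_similarity(query_vec, chunk_b))  # ~0.21, ranked lower
```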

Does RAG fully eliminate hallucinations?

No. The transcript frames RAG as reducing hallucination when the information needed to answer exists in the vector store. If the vector database lacks that information, the LLM can still hallucinate because it has no grounded context to rely on.

What example system is cited as being built on RAG principles?

Perplexity is mentioned as a RAG-based application that connects multiple retrievers and tools such as web search, then summarizes results using LLMs.

Review Questions

  1. In an LLM-only chatbot, what specific mechanism leads to hallucinations when questions involve information outside the training cutoff?
  2. Walk through the two RAG pipelines: what steps occur in data injection versus retrieval, and where do embeddings fit?
  3. Under what condition does RAG still allow hallucinations, according to the transcript?

Key Points

  1. RAG grounds LLM outputs in an external knowledge base retrieved at answer time, avoiding reliance solely on fixed training data.

  2. Hallucinations are linked to missing knowledge beyond the model’s training cutoff; the model still generates plausible answers.

  3. Fine-tuning is costly and operationally difficult for private, frequently updated company data like HR and finance policies.

  4. RAG’s data injection pipeline parses and chunks documents, converts chunks into embeddings, and stores them in a vector database.

  5. RAG’s retrieval pipeline embeds the user query, runs similarity search to fetch relevant chunks as context, and prompts the LLM to answer using that context.

  6. RAG reduces hallucinations when the needed information exists in the vector store, but it cannot guarantee correctness when retrieval returns no relevant facts.

  7. Traditional RAG relies on retrieval plus prompt-based generation; other RAG variants (including agentic RAG) are expected later.

Highlights

RAG reduces hallucinations by feeding the LLM retrieved, domain-specific context from an external vector database instead of relying only on training data.
The approach avoids continual fine-tuning by injecting internal documents into a searchable embedding index.
RAG is organized into two pipelines: data injection (parse → chunk → embed → store) and retrieval (embed query → similarity search → context → answer).
Hallucination can still occur when the vector store lacks the information needed to answer the question.
