2-Build RAG Pipeline From Scratch-Data Ingestion to Vector DB Pipeline-Part 1
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical RAG pipeline is built end-to-end: raw files get parsed into a structured “document” format, split into chunks that fit model context windows, embedded into vectors, stored in a persistent vector database, and then queried via similarity search to return the most relevant context. The key takeaway is that retrieval quality and reliability start long before any LLM is involved—data ingestion and document structure determine what information the system can later retrieve.
The implementation plan is deliberately modular. Code starts in a Jupyter notebook to establish fundamentals, then evolves toward reusable classes. The first major focus is document structure, because every downstream step—chunking, embeddings, and vector search—assumes a consistent representation. In LangChain terms, a document object holds two core fields: `page_content` (the actual text extracted from a file) and `metadata` (auxiliary details like source filename, page counts, timestamps, and author information). Loaders are the mechanism that convert different file types into this document structure. The walkthrough highlights loaders for text files (`TextLoader`), directories of mixed files (`DirectoryLoader`), and PDFs using `PyMuPDFLoader` (with a comparison to `PyPDFLoader` and a note that `PyMuPDF` is generally preferable).
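A minimal sketch of that document contract and two of the loaders is shown below; it assumes the recent `langchain-community` import paths and hypothetical file names, so adjust both for your setup.

```python
# Sketch of the Document "contract" and two loaders named above.
# Import paths assume a recent langchain / langchain-community split.
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader, PyMuPDFLoader

# Every loader ultimately produces objects shaped like this:
doc = Document(
    page_content="The actual text extracted from the file.",
    metadata={"source": "data/text_files/sample.txt"},  # hypothetical path
)

# TextLoader: one .txt file -> a list with one Document
text_docs = TextLoader("data/text_files/sample.txt", encoding="utf-8").load()

# PyMuPDFLoader: one PDF -> one Document per page, with page-level metadata
pdf_docs = PyMuPDFLoader("data/pdfs/attention_is_all_you_need.pdf").load()
print(pdf_docs[0].metadata)  # e.g. source, page, total_pages, creationDate, author
```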
To demonstrate ingestion, the workflow creates sample text files programmatically, then loads them into LangChain documents. For directory loading, a pattern-based approach is used to read multiple files at once and return a list of documents. For PDFs, the directory loader reads every PDF in a folder and produces documents whose metadata differs by file—such as creation date, total pages, file path, and sometimes author—while `page_content` contains the extracted text. This structured output is treated as the “contract” that later stages rely on.
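The pattern-based directory loading could look roughly like this sketch; the folder names and glob patterns are illustrative rather than copied from the video.

```python
# Sketch of directory-based ingestion: every matching file becomes Documents
# in the same structure, regardless of file type.
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyMuPDFLoader

# Load every .txt file under a (hypothetical) data/text_files folder
txt_loader = DirectoryLoader(
    "data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
text_documents = txt_loader.load()

# Same idea for PDFs: metadata differs per file, page_content holds the extracted text
pdf_loader = DirectoryLoader("data/pdfs", glob="**/*.pdf", loader_cls=PyMuPDFLoader)
pdf_documents = pdf_loader.load()
print(len(pdf_documents), pdf_documents[0].metadata)
```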
Next come the data ingestion pipeline's transformation steps: chunking, embedding, and storage. Chunking is motivated by fixed context size limits in both embedding models and LLMs; large documents must be divided so each chunk can be embedded without exceeding those limits. After chunking, each chunk is embedded into vectors and stored in a vector database. The walkthrough implements embeddings using Hugging Face sentence-transformers with the model `all-MiniLM-L6-v2`, producing 384-dimensional vectors. For storage, it uses ChromaDB with persistence enabled, so the vector collection is saved to disk and can be reloaded later.
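These transform steps could look roughly like the sketch below; it assumes `RecursiveCharacterTextSplitter` for chunking, and the chunk sizes, paths, and collection name are illustrative choices.

```python
# Sketch of the transform steps: split -> embed -> persist.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb

# 1. Chunk: keep each piece small enough for the embedding model's context window
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pdf_documents)  # documents from the loading step

# 2. Embed: all-MiniLM-L6-v2 yields a 384-dimensional vector per chunk
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([c.page_content for c in chunks])
print(embeddings.shape)  # (num_chunks, 384)

# 3. Store: a persistent ChromaDB client writes the collection to disk
client = chromadb.PersistentClient(path="data/vector_store")
collection = client.get_or_create_collection(name="pdf_documents")
```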
A custom `EmbeddingManager` class handles model loading and embedding generation, while a `VectorStore` class manages ChromaDB initialization and insertion. When adding documents, the system generates IDs, passes chunk text as the “document” field, stores embeddings as vectors, and preserves metadata for later filtering. After insertion, the collection contains hundreds of chunk records (359 in the example).
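A rough sketch of how those two helper classes might be structured follows; the method names, defaults, and UUID-based ID scheme are assumptions about the design, not an exact copy of the walkthrough's code.

```python
# Sketch of the two helper classes: embedding generation and ChromaDB insertion.
import uuid
import chromadb
from sentence_transformers import SentenceTransformer


class EmbeddingManager:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def generate_embeddings(self, texts: list[str]):
        # Returns one 384-dimensional vector per input text
        return self.model.encode(texts, show_progress_bar=True)


class VectorStore:
    def __init__(self, persist_dir: str = "data/vector_store", name: str = "pdf_documents"):
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(name=name)

    def add_documents(self, chunks, embeddings):
        # One record per chunk: generated ID, raw text, vector, and the original metadata
        self.collection.add(
            ids=[str(uuid.uuid4()) for _ in chunks],
            documents=[c.page_content for c in chunks],
            embeddings=[e.tolist() for e in embeddings],
            metadatas=[c.metadata for c in chunks],
        )
        return self.collection.count()
```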
Finally, retrieval is implemented as a `RAGRetriever` class. Given a user query, it embeds the query using the same embedding model, runs a similarity search against the ChromaDB collection, and computes a similarity score from the returned distance (`1 - distance`). Results above a configurable threshold are assembled into a context list containing both retrieved text and metadata. Example queries like “what is attention is all you need” and “unified multitask learning framework” return relevant snippets from the ingested PDFs, demonstrating that the pipeline can retrieve context quickly and deterministically—setting up the next step of integrating an LLM to generate answers from that retrieved context in a later part of the series.
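A sketch of that retrieval step is shown below, reusing the `EmbeddingManager` and `VectorStore` sketches above; the default score threshold and `top_k` are illustrative values, not the video's exact settings.

```python
# Sketch of the retriever: embed the query with the same model, run similarity
# search in ChromaDB, convert distance to a similarity score, filter by threshold.
class RAGRetriever:
    def __init__(self, embedding_manager: EmbeddingManager, vector_store: VectorStore,
                 score_threshold: float = 0.2):
        self.embedder = embedding_manager
        self.store = vector_store
        self.score_threshold = score_threshold

    def retrieve(self, query: str, top_k: int = 5):
        query_vec = self.embedder.generate_embeddings([query])[0]
        results = self.store.collection.query(
            query_embeddings=[query_vec.tolist()],
            n_results=top_k,
            include=["documents", "metadatas", "distances"],
        )
        context = []
        for text, meta, dist in zip(results["documents"][0],
                                    results["metadatas"][0],
                                    results["distances"][0]):
            score = 1 - dist  # distance -> similarity, as described above
            if score >= self.score_threshold:
                context.append({"text": text, "metadata": meta, "score": score})
        return context


# Example usage with a query from the walkthrough:
# context = RAGRetriever(embedder, store).retrieve("what is attention is all you need")
```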
Cornell Notes
The pipeline starts by converting raw files (TXT and PDFs) into LangChain `Document` objects with two fields: `page_content` and `metadata`. Chunking then splits long content into smaller pieces so embeddings and later LLM context windows aren’t exceeded. Each chunk is embedded using Hugging Face `all-MiniLM-L6-v2` (384-dimensional vectors) and stored in a persistent ChromaDB collection. Retrieval works by embedding the user query, running similarity search in ChromaDB, converting returned distances into similarity scores (`1 - distance`), and filtering results by a threshold. The output is a context bundle of retrieved chunks plus metadata, ready for LLM integration later.
Why does document structure matter before any chunking or embeddings happen?
How do loaders turn different file types into the same document format?
What problem does chunking solve, and how is it tied to context size?
What does the embedding stage produce, and why is the embedding model choice important?
How does similarity search output become a usable relevance score?
What exactly is returned by the retriever, and how is it used next?
Review Questions
- What two fields make up LangChain’s `Document` structure, and how does each field get used later in RAG?
- Why must documents be chunked before embedding, and what role does context size play?
- During retrieval, how are ChromaDB distance values converted into similarity scores, and where does the threshold filter apply?
Key Points
1. Treat ingestion as a contract: represent every file as a LangChain `Document` with `page_content` plus `metadata` before chunking.
2. Use directory-based loaders to scale ingestion from single files to whole folders while keeping a consistent document format.
3. Chunking is required to fit fixed context-size limits of embedding models and LLMs; it also improves retrieval granularity.
4. Embed chunks with the same model used at query time; the walkthrough uses `all-MiniLM-L6-v2` to produce 384-dimensional vectors.
5. Store chunk embeddings in a persistent ChromaDB collection so the vector index survives across sessions.
6. Implement retrieval by embedding the query, running similarity search in the vector store, converting distance to similarity (`1 - distance`), and filtering by a score threshold.
7. Return retrieved chunks plus metadata as context, then use that context in the next stage to generate answers with an LLM.