2-Build RAG Pipeline From Scratch-Data Ingestion to Vector DB Pipeline-Part 1

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat ingestion as a contract: represent every file as a LangChain `Document` with `page_content` plus `metadata` before chunking.

Briefing

A practical RAG pipeline is built end-to-end: raw files get parsed into a structured “document” format, split into chunks that fit model context windows, embedded into vectors, stored in a persistent vector database, and then queried via similarity search to return the most relevant context. The key takeaway is that retrieval quality and reliability start long before any LLM is involved—data ingestion and document structure determine what information the system can later retrieve.

The implementation plan is deliberately modular. Code starts in a Jupyter notebook to establish fundamentals, then evolves toward reusable classes. The first major focus is document structure, because every downstream step—chunking, embeddings, and vector search—assumes a consistent representation. In LangChain terms, a document object holds two core fields: `page_content` (the actual text extracted from a file) and `metadata` (auxiliary details like source filename, page counts, timestamps, and author information). Loaders are the mechanism that convert different file types into this document structure. The walkthrough highlights loaders for text files (`TextLoader`), directories of mixed files (`DirectoryLoader`), and PDFs using `PyMuPDFLoader` (with a comparison to `PyPDFLoader` and a note that `PyMuPDF` is generally preferable).
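
A minimal sketch of that structure, using real LangChain loader APIs with a placeholder file path:

```python
# Sketch: one text file -> a list of LangChain Document objects.
# Assumes `langchain-community` is installed; the path is a placeholder.
from langchain_community.document_loaders import TextLoader

docs = TextLoader("data/sample.txt", encoding="utf-8").load()

for doc in docs:
    print(doc.page_content[:80])  # the extracted text
    print(doc.metadata)           # e.g. {"source": "data/sample.txt"}
```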

To demonstrate ingestion, the workflow creates sample text files programmatically, then loads them into LangChain documents. For directory loading, a pattern-based approach is used to read multiple files at once and return a list of documents. For PDFs, the directory loader reads every PDF in a folder and produces documents whose metadata differs by file—such as creation date, total pages, file path, and sometimes author—while `page_content` contains the extracted text. This structured output is treated as the “contract” that later stages rely on.
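
A sketch of that pattern-based directory ingestion, assuming a hypothetical `data/pdfs/` folder (requires the `pymupdf` package):

```python
# Sketch: load every PDF under a folder using a glob pattern.
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

loader = DirectoryLoader("data/pdfs/", glob="**/*.pdf", loader_cls=PyMuPDFLoader)
pdf_docs = loader.load()

print(len(pdf_docs))         # PyMuPDFLoader emits one Document per page
print(pdf_docs[0].metadata)  # file path, page number, total pages, ...
```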

Next comes the data ingestion pipeline’s transformation steps: chunking, embedding, and storage. Chunking is motivated by the fixed context-size limits of both embedding models and LLMs; large documents must be divided so each chunk can be embedded without exceeding those limits. After chunking, each chunk is embedded into a vector and stored in a vector database. The walkthrough implements embeddings using Hugging Face sentence-transformers with the model `all-MiniLM-L6-v2`, producing 384-dimensional vectors. For storage, it uses ChromaDB with persistence enabled, so the vector collection is saved to disk and can be reloaded later.
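
A condensed sketch of those two steps, continuing from the `pdf_docs` list above (the chunk sizes are illustrative, not taken from the video):

```python
# Sketch: split documents into chunks, then embed each chunk.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pdf_docs)  # smaller Documents, metadata kept

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([c.page_content for c in chunks])
print(embeddings.shape)  # (num_chunks, 384)
```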

A custom `EmbeddingManager` class handles model loading and embedding generation, while a `VectorStore` class manages ChromaDB initialization and insertion. When adding documents, the system generates IDs, passes chunk text as the “document” field, stores embeddings as vectors, and preserves metadata for later filtering. After insertion, the collection contains hundreds of chunk records (359 in the example).
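
The storage step might look roughly like this; the collection name and persistence path are assumptions, and `chunks`/`embeddings` come from the sketch above:

```python
# Sketch: persist chunk text, vectors, and metadata in ChromaDB.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")  # saved to disk
collection = client.get_or_create_collection(name="pdf_chunks")

collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],  # generated IDs
    documents=[c.page_content for c in chunks],      # chunk text
    embeddings=embeddings.tolist(),                  # 384-dim vectors
    metadatas=[c.metadata for c in chunks],          # kept for filtering
)
print(collection.count())
```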

Finally, retrieval is implemented as a `RAGRetriever` class. Given a user query, it embeds the query using the same embedding model, runs a similarity search against the ChromaDB collection, and computes a similarity score from the returned distance (`1 - distance`). Results above a configurable threshold are assembled into a context list containing both retrieved text and metadata. Example queries like “what is attention is all you need” and “unified multitask learning framework” return relevant snippets from the ingested PDFs, demonstrating that the pipeline can retrieve context quickly and deterministically—setting up the next step of integrating an LLM to generate answers from that retrieved context in a later part of the series.
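
A sketch of that retrieval logic, reusing the `model` and `collection` objects from above (the function name and defaults are illustrative, not the video’s exact class):

```python
# Sketch: embed the query, search ChromaDB, convert distance to similarity.
def retrieve(query: str, top_k: int = 5, score_threshold: float = 0.0):
    query_embedding = model.encode([query]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=top_k)

    context = []
    for text, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        similarity = 1 - dist  # ChromaDB returns a distance, not a score
        if similarity >= score_threshold:
            context.append({"text": text, "metadata": meta, "score": similarity})
    return context

print(retrieve("what is attention is all you need"))
```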

Cornell Notes

The pipeline starts by converting raw files (TXT and PDFs) into LangChain `Document` objects with two fields: `page_content` and `metadata`. Chunking then splits long content into smaller pieces so embeddings and later LLM context windows aren’t exceeded. Each chunk is embedded using Hugging Face `all-MiniLM-L6-v2` (384-dimensional vectors) and stored in a persistent ChromaDB collection. Retrieval works by embedding the user query, running similarity search in ChromaDB, converting returned distances into similarity scores (`1 - distance`), and filtering results by a threshold. The output is a context bundle of retrieved chunks plus metadata, ready for LLM integration later.

Why does document structure matter before any chunking or embeddings happen?

Document structure defines the “shape” of information that later steps depend on. In LangChain, each `Document` contains `page_content` (the extracted text) and `metadata` (source details like filename, page counts, timestamps, and sometimes author). When chunks are embedded and stored, metadata travels with them, enabling retrieval-time filtering and better traceability of results. Without consistent `page_content`/`metadata`, similarity search outputs become harder to interpret and less controllable.
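
The same contract can be seen by constructing a `Document` by hand; the field names are real LangChain API, while the content here is made up:

```python
from langchain_core.documents import Document

doc = Document(
    page_content="Attention lets a model weigh relationships between tokens...",
    metadata={"source": "notes.txt", "author": "unknown"},
)
print(doc.page_content, doc.metadata)
```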

How do loaders turn different file types into the same document format?

Loaders provide a uniform conversion layer. A `TextLoader` reads a single text file and returns a list of `Document` objects with `page_content` filled from the file and `metadata` populated (e.g., source). A `DirectoryLoader` can load many files from a folder using a pattern and a loader class. For PDFs, `PyMuPDFLoader` (preferred over `PyPDFLoader` in the walkthrough) parses PDFs and returns documents whose metadata includes fields like creation date, total pages, file path, and extracted text in `page_content`.

What problem does chunking solve, and how is it tied to context size?

Embedding models and LLMs have fixed context size limits. If a 100-page PDF is embedded as one block, it can exceed the maximum input length and fail. Chunking divides documents into smaller segments (chunk 1, chunk 2, etc.) so each chunk fits within the model’s context window. This makes embedding feasible and improves retrieval granularity by allowing similarity search over smaller, more relevant text spans.

What does the embedding stage produce, and why is the embedding model choice important?

The embedding stage converts each chunk’s text into a numeric vector. The walkthrough uses sentence-transformers with `all-MiniLM-L6-v2`, which yields 384-dimensional embeddings. These vectors are what ChromaDB indexes for similarity search. Using the same embedding model for both ingestion (chunk embeddings) and retrieval (query embeddings) is essential; otherwise, query vectors won’t align with stored vectors.

How does similarity search output become a usable relevance score?

ChromaDB returns results including documents, metadata, IDs, and a distance metric. The walkthrough computes similarity as `1 - distance`. It then applies a `score_threshold` filter (default 0.0) so only sufficiently similar chunks are included in the final `context_documents` list returned by the retriever.

What exactly is returned by the retriever, and how is it used next?

The retriever returns a list of dictionaries containing retrieved chunk text (from `page_content`) and associated metadata. This list is treated as the “context” for the next step: feeding the retrieved snippets into an LLM to generate an answer grounded in the most relevant ingested material. In this part, the pipeline stops at context retrieval, not generation.

Review Questions

  1. What two fields make up LangChain’s `Document` structure, and how does each field get used later in RAG?
  2. Why must documents be chunked before embedding, and what role does context size play?
  3. During retrieval, how are ChromaDB distance values converted into similarity scores, and where does the threshold filter apply?

Key Points

  1. Treat ingestion as a contract: represent every file as a LangChain `Document` with `page_content` plus `metadata` before chunking.
  2. Use directory-based loaders to scale ingestion from single files to whole folders while keeping a consistent document format.
  3. Chunking is required to fit fixed context-size limits of embedding models and LLMs; it also improves retrieval granularity.
  4. Embed chunks with the same model used at query time; the walkthrough uses `all-MiniLM-L6-v2` to produce 384-dimensional vectors.
  5. Store chunk embeddings in a persistent ChromaDB collection so the vector index survives across sessions.
  6. Implement retrieval by embedding the query, running similarity search in the vector store, converting distance to similarity (`1 - distance`), and filtering by a score threshold.
  7. Return retrieved chunks plus metadata as context, then use that context in the next stage to generate answers with an LLM.

Highlights

LangChain `Document` objects unify ingestion across file types using `page_content` (text) and `metadata` (source details).
Chunking exists to respect fixed context-size limits; it prevents embedding failures and improves relevance.
The pipeline uses `all-MiniLM-L6-v2` embeddings (384 dimensions) and stores them in persistent ChromaDB collections.
Retrieval computes relevance as `1 - distance` from ChromaDB’s distance output and filters results by a threshold.
The retriever returns context chunks with metadata—ready to be fed into an LLM in the next part of the series.
