
Loaders, Indexes & Vectorstores in LangChain: Question Answering on PDF files with ChatGPT

Venelin Valkov·
6 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Use LangChain loaders to ingest different document types (text, YouTube transcripts, PDFs) while keeping the downstream retrieval pattern consistent.

Briefing

A practical LangChain pipeline for turning PDFs, YouTube transcripts, and plain text into question-answering over embeddings is the core takeaway—and the workflow is built to show every moving part, from loaders to vector stores to retrieval-based QA. The setup starts with simple document ingestion (a text file), then scales to richer sources (YouTube transcripts and an unstructured PDF), and finally wires everything into a single retrieval + ChatGPT-style answering chain. The result is a repeatable method for asking targeted questions about a document’s contents without manually summarizing or searching through it.

The walkthrough begins with loading a custom essay from a local text file using LangChain’s text loader, then indexing it with a vector store index creator backed by Chroma DB. A short query—“Why someone in today’s world would read”—is answered using text-davinci-003, with the response returned alongside source attribution pointing back to the essay. The emphasis is on how quickly LangChain can turn raw text into a searchable embedding space, enabling concise answers grounded in the indexed content.
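A minimal sketch of this first flow, written against the legacy `langchain` 0.0.x API from the video’s era; the file name and query are illustrative, and an OpenAI API key is assumed to be set in the environment:

```python
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

# Load a local essay and index it; VectorstoreIndexCreator
# uses Chroma as its default vector store.
loader = TextLoader("essay.txt")  # placeholder path
index = VectorstoreIndexCreator().from_loaders([loader])

# query_with_sources returns the answer plus source attribution
result = index.query_with_sources("Why would someone in today's world read?")
print(result["answer"], result["sources"])
```

The index creator bundles loading, embedding, and storage into one call, which is why so little glue code is needed at this stage.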

Next, the pipeline demonstrates how swapping loaders changes the input type while keeping the rest of the retrieval approach intact. A YouTube loader pulls a transcript (via the YouTube transcript API) for a specific video, producing document objects with page content and metadata. Those transcript documents can then be indexed the same way and queried for information—showing how video knowledge can be converted into the same embedding-based question-answering flow.
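Swapping the loader looks like this in the same legacy API; the video URL is a placeholder, and the loader relies on the youtube-transcript-api package under the hood:

```python
from langchain.document_loaders import YoutubeLoader

# Fetch the transcript for a video and wrap it as Document objects
loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=VIDEO_ID")
docs = loader.load()  # each Document has page_content and metadata
```

The resulting `docs` list can be fed into the same indexing flow as the text file above, which is the point of the loader abstraction.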

The most important scaling step is handling PDFs. Using an unstructured PDF loader, the pipeline extracts page content from an older resume PDF (Andrej Karpathy’s CV). Because long documents exceed model context limits, the extracted text is split into overlapping chunks using a recursive character text splitter. The chunk size and overlap are chosen to preserve continuity across boundaries—overlap helps prevent the loss of context when a relevant idea spans two chunks. This chunking produces multiple smaller document segments suitable for embedding.
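A sketch of the PDF branch, again using the legacy 0.0.x API; the filename is a placeholder, and the chunk size of 1024 is an assumption consistent with the roughly 1,000-character chunks and overlap of 64 described in the demo:

```python
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract page content from the PDF
loader = UnstructuredPDFLoader("karpathy-cv.pdf")  # placeholder filename
pages = loader.load()

# Split into overlapping chunks that fit the model's context window
splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
chunks = splitter.split_documents(pages)
```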

Embeddings then convert each chunk into vectors. The walkthrough compares sentence-transformers embeddings from Hugging Face (768-dimensional vectors) with OpenAI embeddings (larger dimensionality, roughly twice the sentence-transformers size). After embeddings are created, they’re stored in a vector database. Chroma DB is used both in a non-persistent mode and with persistence to disk (via a persist directory and DuckDB-backed storage), so the indexed representation can be reused later.
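The embedding-and-store step might look like the following; the sentence-transformers model name is an assumption (any 768-dimensional model fits the description), and `chunks` stands in for the documents produced by the splitting step:

```python
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

chunks = [Document(page_content="resume chunk text")]  # stand-in for real chunks

# Hugging Face sentence-transformers embeddings (768-dimensional)
hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
# from langchain.embeddings import OpenAIEmbeddings  # alternative; needs OPENAI_API_KEY

# Persistent mode: embeddings and index files are written under ./DB
db = Chroma.from_documents(chunks, hf, persist_directory="DB")
db.persist()
```

Omitting `persist_directory` gives the non-persistent mode, where the index lives only for the current process.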

Finally, a RetrievalQA chain combines the Chroma retriever with a ChatGPT model (temperature set to 0) to answer questions directly from the indexed PDF chunks. Example prompts include extracting the candidate’s work experience in no more than two sentences, generating a background summary in up to three sentences, and assigning a 0–10 likelihood rating for becoming a top-tier deep learning researcher two years later with an explanation. Across these examples, the pipeline demonstrates a consistent pattern: load → split (for PDFs) → embed → store (vector DB) → retrieve relevant chunks → generate grounded answers.
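The final chain might be wired up as below, assuming the persisted “DB” directory from the previous step and the same placeholder embedding model; the example question is taken from the demo:

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Reopen the persisted index with the same embedding function
db = Chroma(
    persist_directory="DB",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"),
)

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),  # deterministic answers
    chain_type="stuff",             # stuff retrieved chunks into one prompt
    retriever=db.as_retriever(),
)
print(qa.run("What is the work experience of the candidate? "
             "Answer in no more than two sentences."))
```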

Overall, the workflow matters because it turns unstructured personal data (PDFs, transcripts, text) into a queryable knowledge base with minimal glue code, while explicitly addressing the two practical bottlenecks that usually break naive approaches: document heterogeneity and context-length limits.

Cornell Notes

The pipeline builds a LangChain retrieval-and-QA system that can answer questions over custom documents by converting them into embeddings and storing them in a vector database. It starts with a simple text loader and Chroma DB indexing, then swaps in a YouTube transcript loader to treat video content as searchable text. For PDFs, it extracts page content with an unstructured PDF loader and uses a recursive character text splitter with chunk overlap to avoid losing context and to fit model context limits. It compares Hugging Face sentence-transformers embeddings with OpenAI embeddings, then persists the Chroma index to disk for reuse. A RetrievalQA chain uses the Chroma retriever plus a ChatGPT model (temperature 0) to generate answers grounded in the most similar chunks.

Why does the pipeline split PDF text into chunks, and what role does overlap play?

PDFs can exceed the context window of models like text-davinci-003 or ChatGPT-style models. The workflow uses a recursive character text splitter to break extracted PDF pages into smaller chunks. Overlap (set to a small value like 64 in the example) ensures that ideas spanning a boundary aren’t cut in half—so context from the end of one chunk can carry into the start of the next. The result is multiple chunks (roughly 1,000 characters for the first and about 500 for the second in the demo) where the second chunk begins within the overlap region rather than abruptly restarting.
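The overlap mechanic can be illustrated with a toy splitter (plain Python, not LangChain’s recursive splitter, which additionally prefers to break on separators like paragraphs and sentences): each chunk after the first starts `overlap` characters before the previous chunk ended, so boundary-spanning text appears in both.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size chunking with overlap: step forward by
    chunk_size - overlap so consecutive chunks share a region."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 90 + "b" * 60, chunk_size=100, overlap=20)
# The last 20 characters of chunk 0 reappear as the first 20 of chunk 1,
# so the a/b boundary at position 90 is intact inside chunk 1.
```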

How do loaders and vector indexing stay consistent even when the input source changes?

The approach keeps the “index and retrieve” pattern stable while swapping the loader. A text loader ingests a plain essay; a YouTube loader fetches a transcript and wraps it as document objects; an unstructured PDF loader extracts page content from a PDF. After each loader produces documents, the pipeline feeds those documents into the same embedding + vector store indexing flow (Chroma DB in the examples), enabling the same style of similarity search and question answering across different data types.

What’s the practical difference between Hugging Face sentence-transformers embeddings and OpenAI embeddings in the workflow?

Both embedding types produce vectors for similarity search, but they differ in model choice and vector dimensionality. The demo uses Hugging Face sentence-transformers embeddings with a specified model, yielding 768-dimensional vectors (768 being the hidden size of the BERT-family base model it wraps). OpenAI embeddings are used via LangChain’s OpenAI embeddings interface (the walkthrough describes the default model as text-davinci-003), producing vectors of a larger dimensionality, roughly twice the sentence-transformers size. The interface remains similar in both cases: create embeddings, then store them in the vector DB.

How does persistence change the vector store workflow in Chroma DB?

Without persistence, the index lives only for the current run. With persistence enabled, the pipeline specifies a persist directory (e.g., “DB”), so Chroma stores index files and embedding data on disk (including files like bin and parquet outputs). The demo also mentions using DuckDB as the embedded database backend for persistence. After persisting, similarity search can be run again later and return essentially the same chunk matches, even when documents aren’t re-supplied in the same way.
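Reuse after persistence might look like this sketch (legacy 0.0.x API; the embedding model name is a placeholder, and it must match the model used when the index was built so queries are embedded consistently):

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Reopen the index previously persisted under ./DB
db = Chroma(
    persist_directory="DB",
    embedding_function=HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"),
)

# Similarity search returns essentially the same chunk matches
# as before, without re-supplying the original documents.
hits = db.similarity_search("deep learning research experience", k=4)
for doc in hits:
    print(doc.page_content[:80])
```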

What does RetrievalQA do differently from simply asking a model a question?

RetrievalQA first retrieves the most relevant chunks from the vector store using similarity search, then feeds those retrieved chunks into the language model to generate an answer. In the examples, questions like “What is the work experience of the candidate?” and “Give a background summary…” are answered based on the resume text chunks stored in Chroma. The chain is configured with a ChatGPT model and temperature set to 0, and it avoids extra summarization steps beyond answering from the retrieved context.

Why is temperature set to 0 for the QA chain?

Setting temperature to 0 makes outputs more deterministic. In the demo, that choice supports consistent, document-grounded answers when the retriever selects the same relevant chunks from the vector store. That matters when prompts ask for constrained formats like “no more than two sentences” or “no more than three sentences,” where variability could otherwise produce inconsistent lengths or phrasing.

Review Questions

  1. When working with a long PDF, what two mechanisms in the pipeline prevent context-limit failures, and how do they interact?
  2. Describe the end-to-end flow from raw document to answer generation, naming the roles of the loader, text splitter (if used), embeddings, vector store, and RetrievalQA chain.
  3. What changes in the pipeline when switching from a text file to a YouTube transcript, and what stays the same?

Key Points

  1. Use LangChain loaders to ingest different document types (text, YouTube transcripts, PDFs) while keeping the downstream retrieval pattern consistent.

  2. For PDFs, extract page content with an unstructured PDF loader and apply a recursive character text splitter to fit model context limits.

  3. Add chunk overlap so cross-boundary context isn’t lost when splitting long documents into embedding-ready segments.

  4. Create embeddings for each chunk and store them in a vector database (Chroma DB), then use similarity search to retrieve the most relevant chunks for a query.

  5. Persist Chroma indexes to disk (e.g., using a persist directory like “DB”) so the embedding store can be reused without re-indexing.

  6. Use a RetrievalQA chain to ground answers in retrieved chunks, and set temperature to 0 for more deterministic, constraint-friendly responses.

Highlights

Chunk overlap is the practical fix for context fragmentation when splitting PDFs: the next chunk starts inside the overlap region so ideas aren’t severed at boundaries.
Chroma DB can run in both ephemeral and persistent modes; persistence writes index and embedding artifacts to a directory (e.g., “DB”) for later reuse.
The same retrieval-and-QA structure works across text files, YouTube transcripts, and PDFs—only the loader and (for PDFs) the splitting step change.
RetrievalQA answers are produced from the most similar embedded chunks, not from the model’s general knowledge alone.
Embedding dimensionality differs by model family (Hugging Face sentence-transformers at 768 vs OpenAI embeddings with a different size), but the storage and retrieval interface stays largely the same.
