Loaders, Indexes & Vectorstores in LangChain: Question Answering on PDF files with ChatGPT
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Use LangChain loaders to ingest different document types (text, YouTube transcripts, PDFs) while keeping the downstream retrieval pattern consistent.
Briefing
A practical LangChain pipeline for turning PDFs, YouTube transcripts, and plain text into question-answering over embeddings is the core takeaway—and the workflow is built to show every moving part, from loaders to vector stores to retrieval-based QA. The setup starts with simple document ingestion (a text file), then scales to richer sources (YouTube transcripts and an unstructured PDF), and finally wires everything into a single retrieval + ChatGPT-style answering chain. The result is a repeatable method for asking targeted questions about a document’s contents without manually summarizing or searching through it.
The walkthrough begins with loading a custom essay from a local text file using LangChain’s text loader, then indexing it with a vector store index creator backed by Chroma DB. A short query ("Why would someone in today’s world read?") is answered using text-davinci-003, with the response returned alongside source attribution pointing back to the essay. The emphasis is on how quickly LangChain can turn raw text into a searchable embedding space, enabling concise answers grounded in the indexed content.
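A minimal sketch of this first step, assuming the classic (pre-0.1) LangChain imports used around the time of the video (newer releases moved these into langchain_community and langchain_openai); the file name is a placeholder and an OPENAI_API_KEY is expected in the environment:

```python
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator

loader = TextLoader("essay.txt")  # hypothetical file name

# VectorstoreIndexCreator embeds the documents and stores them in Chroma by default.
index = VectorstoreIndexCreator().from_loaders([loader])

# One call handles retrieval, answer generation, and source attribution.
print(index.query_with_sources("Why would someone in today's world read?"))
```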
Next, the pipeline demonstrates how swapping loaders changes the input type while keeping the rest of the retrieval approach intact. A YouTube loader pulls a transcript (via the YouTube transcript API) for a specific video, producing document objects with page content and metadata. Those transcript documents can then be indexed the same way and queried for information—showing how video knowledge can be converted into the same embedding-based question-answering flow.
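A sketch of the loader swap, assuming LangChain's YoutubeLoader (which relies on the youtube-transcript-api package, plus pytube if video metadata is requested); the video URL is a placeholder:

```python
from langchain.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    add_video_info=True,
)
docs = loader.load()                # list of Document objects

print(docs[0].page_content[:200])   # transcript text
print(docs[0].metadata)             # e.g., title, author, publish date
```

The returned documents have the same shape as those from the text loader, so the indexing and querying steps stay unchanged.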
The most important scaling step is handling PDFs. Using an unstructured PDF loader, the pipeline extracts page content from an older resume PDF (Andrej Karpathy’s CV). Because long documents exceed model context limits, the extracted text is split into overlapping chunks using a recursive character text splitter. The chunk size and overlap are chosen to preserve continuity across boundaries: overlap prevents the loss of context when a relevant idea spans two chunks. This chunking produces multiple smaller document segments suitable for embedding.
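A sketch of the PDF extraction and chunking step; the file path and the chunk size/overlap values are illustrative, and the unstructured package must be installed for UnstructuredPDFLoader to work:

```python
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = UnstructuredPDFLoader("karpathy-cv.pdf")  # hypothetical path
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,    # characters per chunk (illustrative value)
    chunk_overlap=64,   # overlap preserves context across chunk boundaries
)
chunks = splitter.split_documents(pages)
print(f"{len(pages)} document(s) split into {len(chunks)} chunks")
```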
Embeddings then convert each chunk into vectors. The walkthrough compares sentence-transformers embeddings from Hugging Face (768-dimensional vectors) with OpenAI embeddings (roughly twice that size, at 1536 dimensions for text-embedding-ada-002). After embeddings are created, they’re stored in a vector database. Chroma DB is used both in a non-persistent mode and with persistence to disk (via a persist directory and DuckDB-backed storage), so the indexed representation can be reused later.
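A sketch of the embedding comparison and persistence step, continuing from the chunks produced above. The sentence-transformers model name and the "db" persist directory are illustrative; the DuckDB + Parquet backend was Chroma's on-disk default at the time of the video and has since been replaced in newer releases:

```python
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Hugging Face sentence-transformers: 768-dimensional vectors, runs locally.
hf_emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
print(len(hf_emb.embed_query("test")))  # -> 768

# OpenAI embeddings: 1536-dimensional with text-embedding-ada-002, via API.
openai_emb = OpenAIEmbeddings()

# Build the Chroma store and write it to disk so it can be reused later.
db = Chroma.from_documents(chunks, openai_emb, persist_directory="db")
db.persist()

# Later: reload the persisted index without re-embedding anything.
db = Chroma(persist_directory="db", embedding_function=openai_emb)
```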
Finally, a RetrievalQA chain combines the Chroma retriever with a ChatGPT model (temperature set to 0) to answer questions directly from the indexed PDF chunks. Example prompts include extracting the candidate’s work experience in no more than two sentences, generating a background summary in up to three sentences, and assigning, with an explanation, a 0–10 rating of the likelihood of becoming a top-tier deep learning researcher two years later. Across these examples, the pipeline demonstrates a consistent pattern: load → split (for PDFs) → embed → store (vector DB) → retrieve relevant chunks → generate grounded answers.
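A sketch of the final wiring, reusing the persisted Chroma store from the previous step. The model name, chain type, and retriever k value are illustrative defaults rather than settings confirmed by the original video:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Temperature 0 keeps answers deterministic and constraint-friendly.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # stuff the retrieved chunks directly into the prompt
    retriever=db.as_retriever(search_kwargs={"k": 3}),
)

print(qa.run("Summarize the candidate's work experience in no more than two sentences."))
```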
Overall, the workflow matters because it turns unstructured personal data (PDFs, transcripts, text) into a queryable knowledge base with minimal glue code, while explicitly addressing the two practical bottlenecks that usually break naive approaches: document heterogeneity and context-length limits.
Cornell Notes
The pipeline builds a LangChain retrieval-and-QA system that can answer questions over custom documents by converting them into embeddings and storing them in a vector database. It starts with a simple text loader and Chroma DB indexing, then swaps in a YouTube transcript loader to treat video content as searchable text. For PDFs, it extracts page content with an unstructured PDF loader and uses a recursive character text splitter with chunk overlap to avoid losing context and to fit model context limits. It compares Hugging Face sentence-transformers embeddings with OpenAI embeddings, then persists the Chroma index to disk for reuse. A RetrievalQA chain uses the Chroma retriever plus a ChatGPT model (temperature 0) to generate answers grounded in the most similar chunks.
Why does the pipeline split PDF text into chunks, and what role does overlap play?
How do loaders and vector indexing stay consistent even when the input source changes?
What’s the practical difference between Hugging Face sentence-transformers embeddings and OpenAI embeddings in the workflow?
How does persistence change the vector store workflow in Chroma DB?
What does RetrievalQA do differently from simply asking a model a question?
Why is temperature set to 0 for the QA chain?
Review Questions
- When working with a long PDF, what two mechanisms in the pipeline prevent context-limit failures, and how do they interact?
- Describe the end-to-end flow from raw document to answer generation, naming the roles of the loader, text splitter (if used), embeddings, vector store, and RetrievalQA chain.
- What changes in the pipeline when switching from a text file to a YouTube transcript, and what stays the same?
Key Points
1. Use LangChain loaders to ingest different document types (text, YouTube transcripts, PDFs) while keeping the downstream retrieval pattern consistent.
2. For PDFs, extract page content with an unstructured PDF loader and apply a recursive character text splitter to fit model context limits.
3. Add chunk overlap so cross-boundary context isn’t lost when splitting long documents into embedding-ready segments.
4. Create embeddings for each chunk and store them in a vector database (Chroma DB), then use similarity search to retrieve the most relevant chunks for a query.
5. Persist Chroma indexes to disk (e.g., using a persist directory like “DB”) so the embedding store can be reused without re-indexing.
6. Use a RetrievalQA chain to ground answers in retrieved chunks, and set temperature to 0 for more deterministic, constraint-friendly responses.