LangChain Retrieval QA Over Multiple Files with ChromaDB
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
LangChain retrieval QA becomes practical at scale once the document embeddings live in a persistent ChromaDB vector store on disk. Instead of rebuilding an in-memory index every time the app starts, the workflow loads many text files from a folder, chunks them, embeds the chunks, and writes the resulting vector index into a persisted directory (named “DB”). That persistence matters most when the corpus grows to hundreds or thousands of long documents, where re-embedding on every launch would be slow and expensive.
The setup starts by pulling in multiple files with a simple folder-based loader pattern: a directory is scanned with a glob like “*.txt” (and the loader can be swapped for other formats such as PDFs or Markdown by changing the file extension and loader type). After loading, the documents are split into chunks: small units that fit within embedding and LLM context limits. With the chunks ready, the code initializes OpenAI as both the language model and the embedding model. Those embeddings are then used to create a Chroma vector store from the chunked documents, with a persistence directory so the index is saved to disk. Once persisted, the embedding step can be skipped on subsequent runs by reloading the vector store from the same directory.
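A minimal sketch of that load, split, embed, and persist flow, assuming a hypothetical new_articles/ folder and illustrative chunk sizes (classic LangChain import paths are used here):

```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load every .txt file in the folder; swap TextLoader (and the glob) for a PDF
# or Markdown loader to handle other formats.
loader = DirectoryLoader("new_articles/", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split long articles into chunks that fit embedding and LLM context limits.
# chunk_size / chunk_overlap are illustrative values.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = splitter.split_documents(documents)

# Embed the chunks and write the index into the persisted "DB" directory.
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(
    documents=texts,
    embedding=embedding,
    persist_directory="DB",
)
vectordb.persist()
```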
After the vector database is in place, the system turns it into a retriever configured for similarity search. When a query is issued, the retriever returns the top matching chunks (the example uses k=2, with guidance that k≈5 often works well when answering from multiple sources). Those retrieved chunks become the “context” fed into a RetrievalQA chain, which combines the context with the user’s question and asks the LLM to answer strictly from the provided material. The chain is configured to return the source documents as well, enabling citation-style outputs rather than ungrounded answers.
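Wiring the persisted store into a retriever and a RetrievalQA chain might look like the sketch below; k=2 mirrors the example, and the query string is illustrative:

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Top-2 similarity search over the persisted Chroma index.
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",            # stuff the retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,  # keep the chunks used for the answer
)

response = qa_chain("How much money did Pando raise?")
print(response["result"])
```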
Concrete examples show the retrieval-and-answer loop working: a question about Pando’s fundraising returns a “$30 million” figure along with the relevant source chunks; a query about “news about Pando” yields both an answer and the underlying documents, including details like the round type and how the funding will be used; and questions like “what is generative AI?” or “who is CMA?” pull answers from different articles based on semantic similarity. The prompt template reinforces the guardrail: if the answer isn’t in the retrieved context, the model should say it doesn’t know.
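A small helper in the spirit of the notebook’s output loop can print the answer followed by the files it came from; the function name and queries are illustrative:

```python
def process_llm_response(llm_response):
    # Print the generated answer, then the source file of each retrieved chunk.
    print(llm_response["result"])
    print("\nSources:")
    for doc in llm_response["source_documents"]:
        print(doc.metadata["source"])

process_llm_response(qa_chain("What is the news about Pando?"))
process_llm_response(qa_chain("What is generative AI?"))
```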
Finally, the notebook demonstrates swapping the LLM backend to the GPT-3.5-turbo API (via LangChain’s chat prompt structure with system and human messages). The retrieval and ChromaDB wiring stays the same; only the prompt formatting changes to match the chat model interface. The result is a reusable, disk-backed retrieval QA pipeline with grounded answers and source traceability—positioning it for future upgrades like Pinecone deployment or local embeddings for the lookup stage.
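A sketch of the chat-model swap, assuming the classic LangChain chat prompt classes; the system-message wording encodes the “don’t make up an answer” guardrail and is illustrative rather than copied from the notebook:

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

system_template = (
    "Use the following pieces of context to answer the user's question. "
    "If you don't know the answer, just say that you don't know; "
    "don't try to make up an answer.\n----------------\n{context}"
)
chat_prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
])

# Same retriever and ChromaDB wiring as before; only the LLM and prompt change.
chat_qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": chat_prompt},
)
```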
Cornell Notes
The core workflow builds a disk-persisted ChromaDB vector store for multiple documents, then uses LangChain RetrievalQA to answer questions grounded in retrieved chunks. Documents are loaded from a folder, chunked, embedded with OpenAI embeddings, and written to a persistent directory (“DB”). On later runs, the system reloads the existing vector store instead of re-embedding everything, which is crucial for large corpora. A similarity-search retriever selects the top-k chunks (k is configurable), and the LLM answers using those chunks as context while returning the source documents for citation-style verification. The example also swaps the LLM to GPT-3.5-turbo via chat prompts without changing the retrieval logic.
- How does the pipeline avoid re-embedding documents every time it starts?
- What determines which document chunks get used to answer a question?
- How does the system produce answers with traceable sources?
- What role does the prompt template play in grounding answers?
- What changes when switching from a basic LLM setup to the GPT-3.5-turbo API?
- Why does chunking matter in this design?
Review Questions
- When and why would you choose a persistent vector store over an in-memory index in a RetrievalQA pipeline?
- How do k and similarity search affect the quality and coverage of answers in this setup?
- What prompt instructions are used to prevent the LLM from guessing when the retrieved context doesn’t contain the answer?
Key Points
1. Load multiple documents from a folder using a file glob pattern, and swap loaders based on file type (text, PDF, Markdown).
2. Split documents into chunks before embedding so retrieval can operate on relevant segments rather than entire files.
3. Create a ChromaDB vector store with OpenAI embeddings and persist it to disk (persist_directory set to “DB”) to reuse embeddings across runs.
4. Use a similarity-search retriever with a configurable k (top-k chunks) and feed those retrieved chunks as context into a RetrievalQA chain.
5. Enable grounded outputs by returning the retrieved source documents (return_source_documents=True) alongside the generated answer.
6. When switching to GPT-3.5-turbo, keep retrieval the same but adjust prompt formatting to match chat-style system and human messages.
7. Reload the persisted ChromaDB index on startup to avoid re-embedding large corpora every time the app launches (see the sketch after this list).
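For that last point, a minimal sketch of reloading the persisted index on a later run, assuming the same “DB” directory as above:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Reload the on-disk index instead of re-embedding the corpus.
vectordb = Chroma(
    persist_directory="DB",
    embedding_function=OpenAIEmbeddings(),
)
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
```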