
LangChain Retrieval QA Over Multiple Files with ChromaDB

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Load multiple documents from a folder using a file glob pattern, and swap loaders based on file type (text, PDF, Markdown).

Briefing

LangChain retrieval QA becomes practical at scale once the document embeddings live in a persistent ChromaDB vector store on disk. Instead of rebuilding an in-memory index every time the app starts, the workflow loads many text files from a folder, chunks them, embeds the chunks, and writes the resulting vector index into a persisted directory (named “DB”). That persistence matters most when the corpus grows to hundreds or thousands of long documents, where re-embedding on every launch would be slow and expensive.

The setup starts by pulling in multiple files with a simple folder-based loader: a directory is scanned with a glob such as “*.txt” (and the loader can be swapped for other formats such as PDFs or Markdown by changing the file extension and loader type). After loading, the documents are split into chunks: small units that fit within embedding and LLM context limits. With chunks ready, the code initializes OpenAI as both the language model and the embedding model. Those embeddings are then used to create a Chroma vector store from the chunked documents, with a persistence directory so the index is saved to disk. Once persisted, the embedding step can be skipped on subsequent runs by reloading the vector store from the same directory.
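The steps above can be sketched with the classic LangChain API used at the time of the video; the folder name new_articles/, the chunk size, and the overlap are illustrative assumptions, not values confirmed by the source:

```python
# Sketch: load -> chunk -> embed -> persist (classic LangChain API).
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Load every .txt file in the folder; swap loader_cls and the glob
# (e.g. a PDF loader with "*.pdf") to handle other formats.
loader = DirectoryLoader("new_articles/", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split long articles into ~1000-character chunks with some overlap
# so content near a boundary still appears whole in one chunk.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = splitter.split_documents(documents)

# Embed the chunks and write the index into the "DB" folder on disk.
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding,
                                 persist_directory="DB")
vectordb.persist()
```

Running this requires the langchain and chromadb packages plus an OpenAI API key in the environment.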

After the vector database is in place, the system turns it into a retriever configured for similarity search. A query is issued, and the retriever returns the top matching chunks (the example uses k=2, with guidance that k≈5 often works well when answering from multiple sources). Those retrieved chunks become the “context” fed into a RetrievalQA chain, which combines the context with the user’s question and asks the LLM to answer strictly using the provided material. The chain is configured to return the source documents as well, enabling citation-style outputs rather than ungrounded answers.
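A hedged sketch of that wiring in the classic LangChain API, assuming a vectordb object built from the persisted Chroma store as in the previous step (variable names are illustrative):

```python
# Sketch: retriever + RetrievalQA chain with source documents returned.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Similarity-search retriever returning the top-k chunks
# (k=2 in the example; k around 5 is suggested for multi-source answers).
retriever = vectordb.as_retriever(search_type="similarity",
                                  search_kwargs={"k": 2})

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",            # stuff retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True,  # return the chunks used, for citations
)

result = qa_chain("How much money did Pando raise?")
print(result["result"])            # the grounded answer
print(result["source_documents"])  # the retrieved chunks behind it
```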

Concrete examples show the retrieval-and-answer loop working: questions about Pando’s fundraising return a “30 million” figure with the relevant source chunks; a query about “news about Pando” yields both an answer and the underlying documents, including details like the round type and how the funds will be used; and questions like “what is generative AI?” or “who is CMA?” pull answers from different articles based on semantic similarity. The prompt template reinforces the guardrail: if the answer isn’t in the retrieved context, the model should say it doesn’t know.

Finally, the notebook demonstrates swapping the LLM backend to the GPT-3.5-turbo API (via LangChain’s chat prompt structure with system and human messages). The retrieval and ChromaDB wiring stays the same; only the prompt formatting changes to match the chat model interface. The result is a reusable, disk-backed retrieval QA pipeline with grounded answers and source traceability—positioning it for future upgrades like Pinecone deployment or local embeddings for the lookup stage.

Cornell Notes

The core workflow builds a disk-persisted ChromaDB vector store for multiple documents, then uses LangChain RetrievalQA to answer questions grounded in retrieved chunks. Documents are loaded from a folder, chunked, embedded with OpenAI embeddings, and written to a persistent directory (“DB”). On later runs, the system reloads the existing vector store instead of re-embedding everything, which is crucial for large corpora. A similarity-search retriever selects the top-k chunks (k is configurable), and the LLM answers using those chunks as context while returning the source documents for citation-style verification. The example also swaps the LLM to GPT-3.5-turbo via chat prompts without changing the retrieval logic.

How does the pipeline avoid re-embedding documents every time it starts?

It persists the ChromaDB vector store to disk. After chunking and embedding the loaded documents, the code creates a Chroma vector store with a persist_directory set to a folder named “DB”. The index is saved there (including an index structure and the stored vectors). On the next run, the code can reload the vector store by pointing at the same persist directory, skipping the embedding step entirely.
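A minimal sketch of that reload path (same classic LangChain API; the “DB” folder name comes from the example, and k=2 mirrors the retriever setting):

```python
# Sketch: on a later run, reload the persisted index instead of re-embedding.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# The embedding function must match the one used to build the index,
# since queries are embedded with it at lookup time.
embedding = OpenAIEmbeddings()
vectordb = Chroma(persist_directory="DB", embedding_function=embedding)
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
```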

What determines which document chunks get used to answer a question?

A retriever configured for similarity search. The retriever runs a semantic similarity lookup against the ChromaDB vectors and returns the top-k matching chunks. In the example, k=2 is used, and the chain feeds those two retrieved chunks into the LLM as context. The retriever’s search type is explicitly set to similarity, and the chain uses the retrieved documents as its context inputs.
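Under the hood, the similarity lookup amounts to ranking stored chunk vectors by closeness to the query vector. A dependency-free illustration, with toy 3-dimensional vectors standing in for real OpenAI embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

# Toy corpus: in the real pipeline these vectors come from OpenAI embeddings.
chunks = [
    {"text": "Pando raised funding", "vector": [0.9, 0.1, 0.0]},
    {"text": "Generative AI explainer", "vector": [0.0, 0.2, 0.9]},
    {"text": "Pando logistics round", "vector": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.0, 0.0]  # pretend embedding of "news about Pando"
print(top_k(query, chunks, k=2))
# → ['Pando raised funding', 'Pando logistics round']
```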

How does the system produce answers with traceable sources?

The RetrievalQA chain is configured with return_source_documents=True. That means the chain returns both the generated answer and the specific retrieved source documents (the chunks) that were used as context. Since the original files can be HTML or text, those sources can be used to provide citation links back to the original pages if the metadata is preserved.

What role does the prompt template play in grounding answers?

The prompt template instructs the LLM to use the provided context pieces to answer the question. It also includes a fallback instruction: if the answer isn’t in the context, the model should say it doesn’t know rather than guessing. The template inserts the retrieved context chunks and then appends the user’s question.
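The guardrail can be illustrated as a plain format string; the wording below paraphrases LangChain’s default QA prompt rather than quoting the exact template from the video:

```python
# Illustrative grounding prompt (wording is an assumption, not the
# verbatim template used in the video).
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know; "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_prompt(chunks, question):
    """Insert the retrieved chunks and the user's question into the template."""
    return TEMPLATE.format(context="\n\n".join(chunks), question=question)

prompt = build_prompt(["Pando raised $30 million.", "The round was a Series B."],
                      "How much did Pando raise?")
print(prompt)
```

The retrieved chunks fill the context slot, and the refusal instruction keeps the model from guessing when the answer is absent.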

What changes when switching from a basic LLM setup to GPT-3.5-turbo API?

The retrieval and ChromaDB wiring stays the same, but the chat model interface requires different prompt formatting. With GPT-3.5-turbo, the system prompt and human prompt are separated: the system message includes the instruction to use the context to answer, and the human message supplies the context and the question. The example notes that printing the same prompt as before can cause issues, so the chat prompt structure must be used.
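A sketch of that chat-prompt split using LangChain’s classic chat prompt classes; the exact wording of the system message here is illustrative, not quoted from the video:

```python
# Sketch: system/human message split for the gpt-3.5-turbo chat interface.
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

# System message carries the grounding instruction and the retrieved
# context; the human message carries the user's question.
system_template = (
    "Use the following pieces of context to answer the user's question. "
    "If you don't know the answer, just say that you don't know."
    "\n----------------\n{context}"
)
messages = [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template("{question}"),
]
chat_prompt = ChatPromptTemplate.from_messages(messages)

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# This prompt replaces the completion-style template in the QA chain;
# the retriever and the Chroma store are unchanged.
```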

Why does chunking matter in this design?

Chunking turns long documents (like scraped TechCrunch articles) into smaller text segments that can be embedded and retrieved effectively. Those chunks become the atomic units stored in the vector database. During retrieval, the system returns the most relevant chunks rather than entire documents, which improves relevance and keeps the context passed to the LLM within length limits (the example mentions stuffing two contexts of about a thousand characters each).
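The mechanics can be illustrated with a dependency-free sliding-window splitter; real splitters such as RecursiveCharacterTextSplitter additionally prefer breaking on separators like paragraphs and sentences:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows with overlap, so
    content near a boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

article = "A" * 2500  # stand-in for a scraped TechCrunch article
chunks = chunk_text(article, chunk_size=1000, overlap=200)
print(len(chunks), [len(c) for c in chunks])
# → 4 [1000, 1000, 900, 100]
```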

Review Questions

  1. When and why would you choose a persistent vector store over an in-memory index in a RetrievalQA pipeline?
  2. How do k and similarity search affect the quality and coverage of answers in this setup?
  3. What prompt instructions are used to prevent the LLM from guessing when the retrieved context doesn’t contain the answer?

Key Points

  1. Load multiple documents from a folder using a file glob pattern, and swap loaders based on file type (text, PDF, Markdown).

  2. Split documents into chunks before embedding so retrieval can operate on relevant segments rather than entire files.

  3. Create a ChromaDB vector store with OpenAI embeddings and persist it to disk (persist_directory set to “DB”) to reuse embeddings across runs.

  4. Use a similarity-search retriever with a configurable k (top-k chunks) and feed those retrieved chunks as context into a RetrievalQA chain.

  5. Enable grounded outputs by returning the retrieved source documents (return_source_documents=True) alongside the generated answer.

  6. When switching to GPT-3.5-turbo, keep retrieval the same but adjust prompt formatting to match chat-style system and human messages.

  7. Reload the persisted ChromaDB index on startup to avoid re-embedding large corpora every time the app launches.

Highlights

Persisting ChromaDB to a “DB” folder turns a one-off embedding step into a reusable index, eliminating repeated embedding work on every launch.
RetrievalQA is grounded by feeding the LLM only the top-k similarity-matched chunks from ChromaDB as context.
return_source_documents=True enables citation-style verification by returning the exact source chunks used to generate each answer.
Switching to GPT-3.5-turbo mainly changes prompt structure (system vs human messages), not the retrieval logic or vector store setup.
