RetrievalQA with LLaMA 2 70b & Chroma DB
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Build a local Chroma vector store by ingesting multiple PDFs, chunking them with overlap, embedding with Instructor-XL, and persisting the database to disk.
Briefing
Retrieval-augmented QA with LLaMA-2 70B works cleanly when answers are grounded in a local Chroma vector database built from a set of research PDFs. The setup pairs a Together-hosted LLaMA-2 70B chat model with LangChain, ingests multiple papers into Chroma, and retrieves the top matching chunks (k=5) so the model can answer questions with traceable sources—often pulling the correct context from the intended document.
The workflow starts by loading a batch of PDFs (including papers on Flash Attention, LLaMA-2, Toolformer, and ReAct, plus other LLM augmentation material) and splitting them into chunks using a character-based splitter with chunk overlap. Those chunks are embedded with instructor embeddings, specifically the Instructor-XL model from the University of Hong Kong NLP group (hkunlp/instructor-xl), then stored in a local Chroma DB on disk. Although the model is accessed via an API (so not strictly “fully local” compute), the vector store and document ingestion happen locally, avoiding the need to upload the corpus to a cloud vector service.
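A minimal ingestion sketch follows; the loader classes, splitter choice, and chunk sizes are assumptions based on common LangChain defaults, not necessarily the exact values used in the video.

```python
# Minimal ingestion sketch (assumed loader, splitter, and chunk sizes;
# the video's exact values may differ).
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import Chroma

# Load every PDF in a local folder of papers.
loader = DirectoryLoader("papers/", glob="*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()

# Character-based splitting with overlap so a definition isn't cut mid-thought.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Instructor-XL embeddings run locally (needs the InstructorEmbedding package).
embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")

# Build the Chroma store and persist it to disk so it can be reloaded later.
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="db",
)
vectordb.persist()
```

Persisting to a local directory means the store can be reopened later without re-embedding the corpus.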
Once the Chroma DB is built, a retriever is configured to return five relevant contexts per query. A retrieval chain then feeds those contexts into the LLaMA-2 70B chat model (temperature 0.1, max tokens 1024), using LangChain’s “stuff” approach to pack retrieved text into LLaMA-2’s 4096-token context window. The system also enables source-document reporting, so each answer can be attributed to the PDF(s) that contributed the retrieved chunks.
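A hedged sketch of that retrieval chain is below; the Together wrapper and the model identifier string are assumptions about how the API-hosted model is wired in, and the video may construct the client differently.

```python
# Sketch of the retrieval chain; the Together wrapper and model string are
# assumptions about how the API-hosted LLaMA-2 70B chat model is wired in.
from langchain_community.llms import Together
from langchain.chains import RetrievalQA

# Reads TOGETHER_API_KEY from the environment.
llm = Together(
    model="togethercomputer/llama-2-70b-chat",  # assumed model identifier
    temperature=0.1,
    max_tokens=1024,
)

# Return the top 5 matching chunks per query.
retriever = vectordb.as_retriever(search_kwargs={"k": 5})

# "stuff" packs the retrieved chunks straight into the prompt, and
# return_source_documents exposes which PDFs contributed to each answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
```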
Testing shows the retrieval layer behaving as intended. Asking “what is flash attention?” returns an explanation aligned with the Flash Attention paper and cites that source; a follow-up question about what “IO-aware” means is also answered from the same document. Queries about LLaMA-2’s context window correctly return 4096 tokens, and questions about training scale return “2 trillion tokens.” When asked “when is LLaMA-3 coming,” the system doesn’t fabricate a claim from the corpus; instead, it returns the closest relevant material, forcing the language model to reason without a direct citation.
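Assuming the chain sketched above, a query and its source attribution can be inspected like this (the query string mirrors the example from the video; the "source" metadata key is what LangChain's PDF loaders set by default):

```python
# Ask a question and list the PDFs that supplied the retrieved chunks.
response = qa_chain.invoke({"query": "what is flash attention?"})

print(response["result"])
for doc in response["source_documents"]:
    print(doc.metadata.get("source"))
```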
A more adversarial prompt—“what is the new model from Meta called?”—produces a mixed but explainable result: the retrieved sources include LLaMA-2-related material and also the “augmenting LLMs” survey, which mentions multiple models. The answer then references RLHF components used in LLaMA-2 chat (such as the safety reward model and helpfulness reward model), illustrating both the strengths and limits of retrieval grounding when the question is ambiguous or lacks temporal cues.
The system also succeeds on direct definitions: “what is toolformer” yields a description of tool use via APIs (search engines, calculators, translation systems) and touches on the number of examples used per tool; “what is react” returns a definition of ReAct as a prompt-based paradigm combining reasoning and acting. The overall takeaway is that LLaMA-2 70B can serve as a strong retrieval-QA backbone for RAG, but real-world performance likely improves with fine-tuning targeted at retrieval-augmented generation rather than relying on the base chat model as-is.
Cornell Notes
A RAG pipeline pairs a LLaMA-2 70B chat model with a locally built Chroma vector database to answer questions using retrieved PDF chunks. PDFs are ingested with LangChain, split into overlapping character chunks, embedded using Instructor-XL embeddings, and stored in Chroma. At query time, a retriever returns the top k=5 contexts, which are packed into LLaMA-2’s 4096-token context window via a “stuff” chain. Source-document metadata is used to show which PDFs the answer came from. Tests on Flash Attention, LLaMA-2, Toolformer, and ReAct show strong retrieval grounding, while ambiguous questions can pull from multiple related papers and lead to partially mixed answers.
How does the system ground LLaMA-2 answers in the PDF corpus?
What embedding and chunking choices affect retrieval quality here?
Why does k=5 matter in this setup?
What evidence suggests the retrieval step is working correctly?
What happens when the question is ambiguous or not directly present in the corpus?
How do direct factual questions about Toolformer and ReAct perform?
Review Questions
- If you increased k from 5 to a larger value, what risks would you expect regarding prompt length and answer specificity in a 4096-token context window?
- Which parts of the pipeline are responsible for answer attribution to specific PDFs, and how does that attribution get surfaced to the user?
- How would changing the chunking strategy (e.g., replacing the character splitter) likely affect retrieval performance for definitions that span multiple sections of a paper? (A splitter sketch follows below.)
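As a starting point for that last question, here is an illustrative comparison of a single-separator character splitter and a recursive splitter; the parameter values are examples only, not the video's settings.

```python
# Illustrative comparison only; parameter values are examples, not the video's.
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Splits on a single separator, so long sections can yield oversized chunks.
char_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=1000, chunk_overlap=200)

# Falls back through several separators, keeping chunks closer to the target
# size and more likely to contain a complete definition.
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```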
Key Points
1. Build a local Chroma vector store by ingesting multiple PDFs, chunking them with overlap, embedding with Instructor-XL, and persisting the database to disk.
2. Use LangChain to create a retriever that returns the top k=5 contexts and feed those contexts into LLaMA-2 70B with a “stuff” chain strategy.
3. Set LLaMA-2 generation parameters conservatively (temperature 0.1) to reduce randomness when answers depend on retrieved text.
4. Enable source-document tracking so each answer can be tied back to the PDF(s) that supplied the retrieved chunks.
5. Expect strong performance on questions that match content in the indexed papers (e.g., Flash Attention, LLaMA-2 context window, Toolformer, ReAct).
6. Ambiguous or time-sensitive questions can produce mixed answers when multiple related models appear across retrieved sources.
7. For production use, fine-tune a retrieval-augmented generation model for the specific question types rather than relying on a base chat model alone.