
RetrievalQA with LLaMA 2 70b & Chroma DB

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Build a local Chroma vector store by ingesting multiple PDFs, chunking them with overlap, embedding with Instructor-XL, and persisting the database to disk.

Briefing

Retrieval-augmented QA with LLaMA-2 70B works cleanly when answers are grounded in a local Chroma vector database built from a set of research PDFs. The setup pairs a Together-hosted LLaMA-2 70B chat model with LangChain, ingests multiple papers into Chroma, and retrieves the top matching chunks (k=5) so the model can answer questions with traceable sources—often pulling the correct context from the intended document.

The workflow starts by loading a batch of PDFs (including papers on Flash Attention, LLaMA-2, Toolformer, and ReAct, plus other LLM augmentation material) and splitting them into chunks using a character-based splitter with chunk overlap. Those chunks are embedded with Instructor embeddings (specifically Instructor-XL from the HKU NLP group), then stored in a local Chroma DB on disk. Although the model is accessed via an API (so the compute is not strictly local), the vector store and document ingestion happen locally, avoiding the need to upload the corpus to a cloud vector service.
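The chunk-overlap step can be sketched in plain Python. This is a simplified stand-in for LangChain's character splitter, not its actual implementation, and the chunk_size/chunk_overlap values are illustrative rather than taken from the video:

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Slide a window of chunk_size characters over the text, stepping by
    chunk_size - chunk_overlap so consecutive chunks share some text."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

page = "A" * 1500  # stand-in for one extracted PDF page
chunks = split_with_overlap(page, chunk_size=1000, chunk_overlap=200)
# Consecutive chunks share 200 characters, so an idea that straddles a
# boundary still appears intact in at least one chunk.
```

The overlap is what keeps sentences that span a chunk boundary retrievable as a coherent unit.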

Once the Chroma DB is built, a retriever is configured to return five relevant contexts per query. A retrieval chain then feeds those contexts into the LLaMA-2 70B chat model (temperature 0.1, max tokens 1024), using LangChain’s “stuff” approach to pack retrieved text into LLaMA-2’s 4096-token context window. The system also enables source-document reporting, so each answer can be attributed to the PDF(s) that contributed the retrieved chunks.
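The "stuff" approach is essentially concatenation under a token budget. A rough sketch follows; the prompt template and the 4-characters-per-token heuristic are illustrative assumptions, not LangChain's internals:

```python
def stuff_prompt(question, contexts, context_window=4096, max_new_tokens=1024):
    """Pack retrieved contexts and the question into one prompt,
    leaving room in the context window for the generated answer."""
    prompt_budget = context_window - max_new_tokens  # tokens left for the prompt
    body = "\n\n".join(contexts)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{body}\n\nQuestion: {question}\nAnswer:"
    )
    # Crude ~4 chars/token estimate; a real chain would use the tokenizer.
    est_tokens = len(prompt) // 4
    if est_tokens > prompt_budget:
        raise ValueError(f"~{est_tokens} tokens exceeds budget of {prompt_budget}")
    return prompt

contexts = [f"chunk {i}: retrieved text..." for i in range(5)]  # k=5 chunks
prompt = stuff_prompt("What is flash attention?", contexts)
```

If the five retrieved chunks are too long to stuff, the chain fails rather than silently truncating, which is why chunk size and k have to be chosen together.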

Testing shows the retrieval layer behaving as intended. Asking "what is flash attention?" returns an explanation aligned with the Flash Attention paper and cites that source; follow-up questions like "what does IO-aware mean" also come back from the same document. Queries about LLaMA-2's context window correctly return 4096 tokens, and questions about training scale return "2 trillion tokens." When asked "when is LLaMA-3 coming," the system doesn't fabricate a claim from the corpus; instead, it returns the closest relevant material, leaving the language model to reason without a direct citation.

A more adversarial prompt—“what is the new model from Meta called?”—produces a mixed but explainable result: the retrieved sources include LLaMA-2-related material and also the “augmenting LLMs” survey, which mentions multiple models. The answer then references RLHF components used in LLaMA-2 chat (such as the safety reward model and helpfulness reward model), illustrating both the strengths and limits of retrieval grounding when the question is ambiguous or lacks temporal cues.

The system also succeeds on direct definitions: "what is Toolformer" yields a description of tool use via APIs (search engines, calculators, translation systems) and notes the number of examples needed per tool; "what is ReAct" returns a definition of ReAct as a prompt-based paradigm combining reasoning and acting. The overall takeaway is that LLaMA-2 70B can serve as a strong retrieval QA backbone for RAG, but real-world performance likely improves with fine-tuning targeted to retrieval-augmented generation rather than relying on the base chat model as-is.

Cornell Notes

A RAG pipeline pairs a LLaMA-2 70B chat model with a locally built Chroma vector database to answer questions using retrieved PDF chunks. PDFs are ingested with LangChain, split into overlapping character chunks, embedded using Instructor-XL embeddings, and stored in Chroma. At query time, a retriever returns the top k=5 contexts, which are packed into LLaMA-2's 4096-token context window via a "stuff" chain. Source-document metadata is used to show which PDFs the answer came from. Tests on Flash Attention, LLaMA-2, Toolformer, and ReAct show strong retrieval grounding, while ambiguous questions can pull from multiple related papers and lead to partially mixed answers.

How does the system ground LLaMA-2 answers in the PDF corpus?

It embeds chunked PDF text into a local Chroma DB, then retrieves the top k=5 most similar chunks for each question. Those retrieved contexts are passed into a LangChain retrieval chain that uses LLaMA-2 70B (temperature 0.1, max tokens 1024) with a “stuff” strategy, so the model answers using the retrieved text rather than relying purely on parametric knowledge. Source-document reporting is enabled so the output can be attributed to the originating PDF(s).
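The similarity search at the heart of this grounding step can be sketched without Chroma. The toy 2-D vectors below stand in for real Instructor-XL embeddings, and the ranking is plain cosine similarity (vector DBs use approximate variants of this at scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, store, k=5):
    """store: list of (chunk_text, embedding) pairs, like a vector DB
    collection. Returns the k chunk texts most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy embeddings; a real pipeline would embed both chunks and the query
# with the same Instructor-XL model.
store = [
    ("Flash Attention is an IO-aware exact attention algorithm...", [0.95, 0.05]),
    ("Toolformer teaches LLMs to call external APIs...", [0.10, 0.90]),
    ("ReAct interleaves reasoning traces and actions...", [0.30, 0.70]),
]
hits = retrieve([1.0, 0.0], store, k=2)
```

Because the chunk text travels with its embedding (and, in Chroma, with source metadata), the top hits can be both stuffed into the prompt and cited back to their PDFs.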

What embedding and chunking choices affect retrieval quality here?

The pipeline uses Instructor embeddings (specifically Instructor-XL from the HKU NLP group) to convert text chunks into vectors for Chroma. Documents are loaded with a PyPDF loader, split into pages, and then into chunks using a character splitter. Chunk overlap ensures that ideas spanning chunk boundaries appear in at least one complete chunk, which helps retrieval match a question to coherent context.

Why does k=5 matter in this setup?

With k=5, the retriever returns five separate context chunks per query. That increases coverage—especially for definitions or multi-part answers—because the model can draw from multiple relevant snippets. The tradeoff is that more contexts also means more text packed into the prompt, so it must still fit within LLaMA-2’s 4096-token context window.
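The tradeoff can be made concrete with back-of-envelope arithmetic. All numbers below are illustrative assumptions (a ~4-characters-per-token heuristic and a nominal chunk size), not measurements from the video:

```python
# Back-of-envelope sizing for k under LLaMA-2's context window.
chunk_chars = 1000                    # assumed characters per retrieved chunk
tokens_per_chunk = chunk_chars // 4   # ~250 tokens at ~4 chars/token
context_window = 4096                 # LLaMA-2 context size
max_new_tokens = 1024                 # reserved for the generated answer
prompt_budget = context_window - max_new_tokens  # 3072 tokens for the prompt

def fits(k, template_tokens=100):
    """Do k retrieved chunks plus the prompt template fit the budget?"""
    return k * tokens_per_chunk + template_tokens <= prompt_budget

print(fits(5))   # 5 * 250 + 100 = 1350 tokens -> fits comfortably
print(fits(12))  # 12 * 250 + 100 = 3100 tokens -> over budget
```

Under these assumptions k=5 leaves ample headroom, while pushing k much higher would crowd out the answer's share of the window.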

What evidence suggests the retrieval step is working correctly?

When asked “what is flash attention?” the system returns an explanation consistent with the Flash Attention paper and cites that PDF; follow-up queries like “what does IO-aware mean” also come back from the same source. For LLaMA-2-specific questions, it correctly returns the context window (4096 tokens) and training scale (2 trillion tokens), again with the relevant PDF sources appearing among the retrieved contexts.

What happens when the question is ambiguous or not directly present in the corpus?

For “when is LLaMA-3 coming,” the corpus only discusses LLaMA-2 and variants, so there’s no direct mention of LLaMA-3. The system still returns something based on closest matches, but it can’t cite a definitive LLaMA-3 reference. For “what is the new model from Meta called?” the answer can become mixed because multiple related models are mentioned across retrieved papers (e.g., LLaMA-2 and the augmenting LLMs survey), leading the model to pull in adjacent details like RLHF reward models used in LLaMA-2 chat.

How do direct factual questions about Toolformer and ReAct perform?

The system retrieves and summarizes definitions from the corresponding papers. “What is Toolformer” returns a description of tool use via APIs (search engines, calculators, translation systems) and discusses how many examples are needed per tool. “What is ReAct” yields a definition of ReAct as a prompt-based paradigm combining reasoning and acting in language models for general task solving.

Review Questions

  1. If you increased k from 5 to a larger value, what risks would you expect regarding prompt length and answer specificity in a 4096-token context window?
  2. Which parts of the pipeline are responsible for answer attribution to specific PDFs, and how does that attribution get surfaced to the user?
  3. How would changing the chunking strategy (e.g., replacing the character splitter) likely affect retrieval performance for definitions that span multiple sections of a paper?

Key Points

  1. Build a local Chroma vector store by ingesting multiple PDFs, chunking them with overlap, embedding with Instructor-XL, and persisting the database to disk.
  2. Use LangChain to create a retriever that returns the top k=5 contexts and feed those contexts into LLaMA-2 70B with a “stuff” chain strategy.
  3. Set LLaMA-2 generation parameters conservatively (temperature 0.1) to reduce randomness when answers depend on retrieved text.
  4. Enable source-document tracking so each answer can be tied back to the PDF(s) that supplied the retrieved chunks.
  5. Expect strong performance on questions that match content in the indexed papers (e.g., Flash Attention, LLaMA-2 context window, Toolformer, ReAct).
  6. Ambiguous or time-sensitive questions can produce mixed answers when multiple related models appear across retrieved sources.
  7. For production use, fine-tune a retrieval-augmented generation model for the specific question types rather than relying on a base chat model alone.

Highlights

The pipeline retrieves exactly five relevant chunks (k=5) from a locally stored Chroma DB and uses those contexts to ground LLaMA-2 70B answers.
Instructor-XL embeddings paired with overlapping chunking produce correct citations for Flash Attention and LLaMA-2 facts like the 4096-token context window and 2-trillion-token training scale.
When the corpus lacks a direct reference (e.g., LLaMA-3), the system can’t cite it and instead returns the closest retrieved material, illustrating the limits of retrieval grounding.
A trick question about Meta’s “new model” pulls from multiple papers, demonstrating how retrieval ambiguity can lead to partially mixed, but still source-linked, answers.

Topics

Mentioned

  • RAG
  • QA
  • API
  • LLM
  • RLHF
  • GPU
  • T4
  • PDF
  • NLP