
How to use Custom Prompts for RetrievalQA on LLaMA-2 7B

Sam Witteveen · 5 min read

Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Keep the retrieval pipeline constant when debugging RAG; it helps isolate whether failures come from embeddings/retrieval or from the language model’s generation behavior.

Briefing

RetrievalQA with LLaMA-2 can produce “correct-but-then-junk” outputs—answers that start right and then trail off into unhelpful or incorrect text. The fix isn’t changing the retrieval side (the BGE embeddings and returned contexts), but tightening the LLaMA-2 prompt so the model is forced to use only the provided context and stop after delivering a single answer.

The walkthrough starts by testing smaller LLaMA-2 variants while keeping the retrieval setup constant: BGE embeddings feeding a ChromaDB retriever that returns five contexts. With LLaMA-2 13B, answers remain generally decent, but some queries still come back incomplete or inconsistently correct—suggesting the model size affects how reliably it uses retrieved evidence. Switching down to LLaMA-2 7B-Chat makes the issue more visible: the system often produces the right idea at first, then adds extra, unhelpful content, and occasionally returns clearly wrong statements (for example, on questions about LLaMA-2’s context window). Importantly, the retrieved context itself appears to be comparable to the larger-model runs, pointing away from embedding/retrieval quality as the main culprit.
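For reference, the constant retrieval side can be sketched roughly as below in LangChain. The exact BGE checkpoint, persist directory, and the pre-split document chunks are assumptions for illustration, not the notebook’s literal code.

```python
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma

# Assumed: `docs` holds pre-split chunks of the source material (papers, posts, etc.)
docs = [Document(page_content="FlashAttention is an IO-aware exact attention algorithm...")]

# BGE embeddings; the specific checkpoint here is an assumption
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    encode_kwargs={"normalize_embeddings": True},  # BGE embeddings are typically normalized
)

# ChromaDB vector store built from the chunks
vectordb = Chroma.from_documents(docs, embeddings, persist_directory="db")

# Retriever that returns five contexts per query, as in the walkthrough
retriever = vectordb.as_retriever(search_kwargs={"k": 5})
```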

That diagnosis leads to a prompt redesign while staying on the 7B model. The default LLaMA-2 system prompt includes broad safety and politeness constraints and encourages the assistant to avoid harmful content. While those guardrails are useful, they can also influence generation behavior in ways that don’t align with strict RetrievalQA requirements. The new system prompt keeps the “helpful, respectful and honest assistant” language but adds a hard retrieval constraint: answers must use only the context text provided. The instruction prompt further tightens output structure: the assistant should answer the question once, avoid any text after the answer is complete, and if the question is incoherent or not factually coherent, explain why rather than guessing. It also preserves the “don’t share false information” rule.
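Paraphrased as code, the stricter system and instruction prompts could look roughly like the strings below; the exact wording in the video’s notebook may differ, and the variable names are illustrative.

```python
# Assumed paraphrase of the stricter prompts: the system prompt keeps the
# "helpful, respectful and honest assistant" framing and adds the hard
# context-only constraint; the instruction prompt controls output structure.
sys_prompt = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible using the context text provided. "
    "Your answers should only use the context that is provided."
)

instruction = (
    "CONTEXT:\n{context}\n\n"
    "Only answer the question once and do not add any text after the answer is done. "
    "If a question does not make any sense, or is not factually coherent, explain why "
    "instead of answering something not correct. If you don't know the answer, please "
    "don't share false information.\n\n"
    "Question: {question}"
)
```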

Technically, the solution is implemented by building a prompt template that injects the retrieved context and the user question into the LLaMA-2 chat-style instruction format, then passing that prompt into the RetrievalQA chain. Temperature is left unchanged at first to isolate the prompt’s effect.
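A minimal sketch of that wiring, assuming LangChain’s RetrievalQA chain and Meta’s documented [INST]/<<SYS>> delimiters for LLaMA-2 chat models; `sys_prompt`, `instruction`, and `retriever` refer to the pieces sketched above, and `llm` stands in for whichever LLaMA-2 7B-Chat endpoint is in use.

```python
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# LLaMA-2 chat-style delimiters (Meta's documented format)
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# Wrap the system and instruction prompts into a single chat-formatted template
template = B_INST + B_SYS + sys_prompt + E_SYS + instruction + E_INST
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# Pass the custom prompt into the RetrievalQA chain; "stuff" places all five
# retrieved contexts into the single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # assumed: a LLaMA-2 7B-Chat wrapper, e.g. via the Together API
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain({"query": "What is flash attention?"})
print(result["result"])
```

With this in place, the same queries that previously trailed off should stop after the single grounded answer.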

After the prompt change, the “extra junk” largely disappears. Examples like “What is flash attention?” return a clean, context-grounded response without trailing irrelevant material. Claims that previously went wrong—such as the context window and training token count—now match the expected values from the supplied sources. Even when asked about LLaMA-3 release timing, the model responds that the provided texts do not contain that information, rather than inventing details. The same improvement shows up for other retrieval-augmented questions, including cases where answers draw from multiple retrieved contexts.

The practical takeaway is straightforward: for production-grade RAG, prompt customization can dramatically improve answer quality and reliability, even when the retrieval pipeline stays the same. The notebooks are provided so builders can test the prompt variations directly using Together API and a free Colab GPU.
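If you want to reproduce the setup against a hosted model, the LLM handle might be created along these lines; the class, model identifier, and parameter values shown are assumptions about a typical LangChain + Together configuration rather than the notebook’s exact code.

```python
from langchain_community.llms import Together

# Assumed Together API wrapper for a hosted LLaMA-2 7B-Chat model;
# the model string and generation parameters are illustrative
llm = Together(
    model="togethercomputer/llama-2-7b-chat",
    temperature=0.1,
    max_tokens=512,
    together_api_key="...",  # supply your own key
)
```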

Cornell Notes

Smaller LLaMA-2 models (13B and especially 7B-Chat) can still retrieve the right evidence via BGE embeddings, but they may generate “correct answer + extra junk” or even incorrect follow-on text. Keeping retrieval constant, the main lever becomes the LLaMA-2 prompt. A revised system/instruction prompt forces the model to use only the provided context, answer the question once, and stop without adding any text afterward. After swapping in this stricter prompt template inside the RetrievalQA chain, outputs become cleaner and more truthful—e.g., the model refuses to invent missing facts like LLaMA-3 release dates when those details aren’t present in the retrieved context. This highlights prompt tuning as a key reliability tool for RAG systems.

Why did the “junk after the answer” problem surface when moving from LLaMA-2 70B down to 13B and then 7B-Chat, even though retrieval stayed the same?

The retrieval side stayed the same: BGE embeddings, ChromaDB, and a retriever returning five contexts. With LLaMA-2 13B, answers were often decent but still sometimes missed correctness. With LLaMA-2 7B-Chat, the model more frequently produced unhelpful extra text and occasional clearly wrong answers. Since the retrieved context quality was effectively unchanged, the failure mode pointed to how the language model consumed and continued generation, not to the embedding/retrieval pipeline.

What evidence suggested the problem wasn’t the retrieved context?

The walkthrough notes that the context returned from the BGE embeddings was “the same as what we were getting” in the larger-model runs. The model still had the right information available but then generated additional, irrelevant or incorrect content afterward. That pattern implies the retrieved passages were sufficient, while the generation constraints were not.

How did the prompt change reduce hallucinated or trailing output?

The new system prompt instructs the assistant to use only the provided context text. The instruction prompt adds output control: answer the question once, and produce no text after the answer is done. It also keeps the rules to explain why a question is incoherent or not factually coherent rather than guessing at it, and to avoid sharing false information. Together, these constraints reduce the model’s tendency to continue generating beyond the grounded response.

What does “answer once” and “no text after the answer” practically enforce in RetrievalQA?

It limits the model’s completion behavior. Instead of allowing the assistant to keep elaborating (which previously produced the “weird stuff after it”), the prompt tells it to stop after delivering the single grounded response. This is why previously observed artifacts largely disappear in the improved runs.

How did the improved prompt handle questions where the retrieved context lacked the answer?

For “When is LLaMA-3 coming out?”, the model responds that the release date isn’t available in the provided texts. That behavior reflects the instruction to rely only on context and not invent missing facts—contrasting with earlier runs that could produce extra or incorrect follow-on content.

Why might prompt tuning outperform retrieval changes in this case?

The retrieval pipeline already returned relevant passages (BGE embeddings and five retrieved contexts). The remaining failure mode was generation discipline: the model needed stricter instructions to constrain its output to the retrieved evidence and to terminate cleanly. Prompt tuning directly targets that behavior without requiring changes to embeddings, retrievers, or database setup.

Review Questions

  1. When moving from LLaMA-2 70B to 7B-Chat, what stayed constant in the RetrievalQA setup, and why does that matter for diagnosing the error source?
  2. Which specific prompt constraints were added to stop the model from generating trailing or unhelpful text, and how do they relate to hallucination risk?
  3. How should a well-behaved RetrievalQA system respond when the retrieved context does not contain the requested fact?

Key Points

  1. Keep the retrieval pipeline constant when debugging RAG; it helps isolate whether failures come from embeddings/retrieval or from the language model’s generation behavior.
  2. Smaller LLaMA-2 models can retrieve relevant evidence yet still produce trailing unhelpful or incorrect text if the prompt doesn’t enforce output boundaries.
  3. Constrain the system prompt to “use only the context text provided” to reduce hallucinations and irrelevant elaboration.
  4. Add instruction-level output control: answer the question once and emit no text after the answer is complete.
  5. Preserve “don’t share false information” and require explanations for incoherent or factually incoherent questions instead of guessing.
  6. Implement prompt changes via a prompt template that injects retrieved context and the user question into the LLaMA-2 chat-style format before running the RetrievalQA chain.
  7. Treat prompt tuning as a practical reliability step for production RAG—often more impactful than tweaking retrieval when the evidence is already present.

Highlights

The retrieved context quality (BGE + five contexts) looked sufficient across model sizes; the main reliability gap came from how LLaMA-2 continued generating after the grounded answer.
A stricter prompt—“use only the context,” “answer once,” and “no text after the answer”—largely eliminated the “correct answer followed by junk” pattern.
When asked about LLaMA-3 release timing, the improved setup correctly refused to invent details not present in the retrieved passages.
Prompt tuning can make RetrievalQA outputs more succinct and truthful without changing embeddings, retrievers, or temperature.
