How to use Custom Prompts for RetrievalQA on LLaMA-2 7B
Based on Sam Witteveen's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
RetrievalQA with LLaMA-2 can produce “correct-but-then-junk” outputs—answers that start right and then trail off into unhelpful or incorrect text. The fix isn’t changing the retrieval side (the BGE embeddings and returned contexts), but tightening the LLaMA-2 prompt so the model is forced to use only the provided context and stop after delivering a single answer.
The walkthrough starts by testing smaller LLaMA-2 variants while keeping the retrieval setup constant: BGE embeddings feeding a ChromaDB retriever that returns five contexts. With LLaMA-2 13B, answers remain generally decent, but some queries still show missing or inconsistent correctness, suggesting that model size affects how reliably the model uses retrieved evidence. Switching down to LLaMA-2 7B-Chat makes the issue more visible: the system often produces the right idea at first, then adds extra, unhelpful content, and occasionally returns clearly wrong statements, including on questions about LLaMA-2’s context window. Importantly, the retrieved context itself appears comparable to the larger model runs, pointing away from embedding/retrieval quality as the main culprit.
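The debugging strategy above can be sketched in plain Python. This is a minimal stand-in, not the notebook's code: the naive term-overlap scorer substitutes for the real BGE-embeddings-plus-ChromaDB retriever, and the function names (`make_fixed_retriever`, `run_rag`) are hypothetical. The point it illustrates is structural: the retrieval component stays identical while the generation component is swapped.

```python
# Sketch of holding retrieval fixed while swapping generators.
# The term-overlap scorer is a stand-in for the video's actual
# pipeline (BGE embeddings + a ChromaDB retriever returning k=5).
from typing import Callable, List

def make_fixed_retriever(corpus: List[str], k: int = 5) -> Callable[[str], List[str]]:
    """Return a retriever that ranks documents by naive term overlap.
    In the real pipeline this would be BGE + Chroma; what matters is
    that this component stays identical across model runs."""
    def retrieve(query: str) -> List[str]:
        q_terms = set(query.lower().split())
        ranked = sorted(corpus, key=lambda d: -len(q_terms & set(d.lower().split())))
        return ranked[:k]
    return retrieve

def run_rag(query: str,
            retrieve: Callable[[str], List[str]],
            generate: Callable[[str, List[str]], str]) -> str:
    """Same retrieval every time; only `generate` varies.
    Swap `generate` between 13B and 7B-Chat (or between prompt
    variants) to isolate generation-side failures."""
    return generate(query, retrieve(query))
```

Because `retrieve` is constructed once and reused, any change in output quality between two `run_rag` calls with different `generate` functions can be attributed to the model or prompt, not the evidence.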
That diagnosis leads to a prompt redesign while staying on the 7B model. The default LLaMA-2 system prompt includes broad safety and politeness constraints and encourages the assistant to avoid harmful content. While those guardrails are useful, they can also influence generation behavior in ways that don’t align with strict RetrievalQA requirements. The new system prompt keeps the “helpful, respectful and honest assistant” language but adds a hard retrieval constraint: answers must use only the context text provided. The instruction prompt further tightens output structure: the assistant should answer the question once, avoid any text after the answer is complete, and if the question is incoherent or not factually coherent, explain why rather than guessing. It also preserves the “don’t share false information” rule.
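A sketch of that revised prompt in LLaMA-2's chat-style instruction format follows. The wording is a paraphrase of the constraints described above, not the verbatim prompt from the video's notebook, and `build_llama2_prompt` is a hypothetical helper.

```python
# Paraphrase of the tightened system prompt: keeps the "helpful,
# respectful and honest" framing, adds the context-only constraint,
# the answer-once rule, and the no-false-information rule.
SYSTEM_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, using only the context text provided. "
    "Answer the question once, and do not add any text after the answer is complete. "
    "If a question does not make sense, or is not factually coherent, "
    "explain why instead of answering something incorrect. "
    "If you don't know the answer, do not share false information."
)

def build_llama2_prompt(system_prompt: str, instruction: str) -> str:
    """Wrap a system prompt and instruction in LLaMA-2's [INST]/<<SYS>> markers."""
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST]"
```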
Technically, the solution is implemented by building a prompt template that injects the retrieved context and the user question into the LLaMA-2 chat-style instruction format, then passing that prompt into the RetrievalQA chain. Temperature is left unchanged at first to isolate the prompt’s effect.
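The injection step can be sketched as below. This is an illustration under stated assumptions, not the notebook's exact code: in LangChain, a template string like this would become a `PromptTemplate` with `input_variables=["context", "question"]`, passed to `RetrievalQA.from_chain_type` via `chain_type_kwargs={"prompt": ...}`; the chain's "stuff" strategy then fills `{context}` with the concatenated retrieved documents and `{question}` with the user query.

```python
# Template with the two variables RetrievalQA's "stuff" chain expects.
# The system-prompt wording is a paraphrase of the video's constraints.
template = (
    "[INST] <<SYS>>\n"
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, using only the context text provided.\n"
    "<</SYS>>\n\n"
    "CONTEXT:\n{context}\n\n"
    "Answer the following question once, with no text after the answer:\n"
    "{question} [/INST]"
)

def fill_prompt(contexts, question):
    """Manually perform the substitution the chain does internally
    (LangChain's stuff chain joins documents with a blank line)."""
    return template.format(context="\n\n".join(contexts), question=question)
```

Because temperature is held constant, any change in output between runs with the default and revised templates reflects the prompt alone.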
After the prompt change, the “extra junk” largely disappears. Examples like “What is flash attention?” return a clean, context-grounded response without trailing irrelevant material. Claims that previously went wrong—such as the context window and training token count—now match the expected values from the supplied sources. Even when asked about LLaMA-3 release timing, the model responds that the provided texts do not contain that information, rather than inventing details. The same improvement shows up for other retrieval-augmented questions, including cases where answers draw from multiple retrieved contexts.
The practical takeaway is straightforward: for production-grade RAG, prompt customization can dramatically improve answer quality and reliability, even when the retrieval pipeline stays the same. The notebooks are provided so builders can test the prompt variations directly using Together API and a free Colab GPU.
Cornell Notes
Smaller LLaMA-2 models (13B and especially 7B-Chat) can still retrieve the right evidence via BGE embeddings, but they may generate “correct answer + extra junk” or even incorrect follow-on text. Keeping retrieval constant, the main lever becomes the LLaMA-2 prompt. A revised system/instruction prompt forces the model to use only the provided context, answer the question once, and stop without adding any text afterward. After swapping in this stricter prompt template inside the RetrievalQA chain, outputs become cleaner and more truthful—e.g., the model refuses to invent missing facts like LLaMA-3 release dates when those details aren’t present in the retrieved context. This highlights prompt tuning as a key reliability tool for RAG systems.
- Why did switching from LLaMA-2 70B to 13B and then 7B-Chat not fully solve the “junk after the answer” problem?
- What evidence suggested the problem wasn’t the retrieved context?
- How did the prompt change reduce hallucinated or trailing output?
- What do “answer once” and “no text after the answer” practically enforce in RetrievalQA?
- How did the improved prompt handle questions where the retrieved context lacked the answer?
- Why might prompt tuning outperform retrieval changes in this case?
Review Questions
- When moving from LLaMA-2 70B to 7B-Chat, what stayed constant in the RetrievalQA setup, and why does that matter for diagnosing the error source?
- Which specific prompt constraints were added to stop the model from generating trailing or unhelpful text, and how do they relate to hallucination risk?
- How should a well-behaved RetrievalQA system respond when the retrieved context does not contain the requested fact?
Key Points
1. Keep the retrieval pipeline constant when debugging RAG; it helps isolate whether failures come from embeddings/retrieval or from the language model’s generation behavior.
2. Smaller LLaMA-2 models can retrieve relevant evidence yet still produce trailing unhelpful or incorrect text if the prompt doesn’t enforce output boundaries.
3. Constrain the system prompt to “use only the context text provided” to reduce hallucinations and irrelevant elaboration.
4. Add instruction-level output control: answer the question once and emit no text after the answer is complete.
5. Preserve “don’t share false information” and require explanations for incoherent or factually incoherent questions instead of guessing.
6. Implement prompt changes via a prompt template that injects retrieved context and the user question into the LLaMA-2 chat-style format before running the RetrievalQA chain.
7. Treat prompt tuning as a practical reliability step for production RAG—often more impactful than tweaking retrieval when the evidence is already present.