37% Better Output with 15 Lines of Code - Llama 3 8B (Ollama) & 70B (Groq)
Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Add a query-rewriting step for follow-up questions so retrieval uses a clearer, context-aware query instead of a vague prompt.
Briefing
A simple query-rewriting step inside a local RAG (retrieval-augmented generation) pipeline can materially improve answers, by roughly 37% in the creator’s own evaluation, even when the user’s original question is vague. The core fix: before retrieving documents, the system asks a local Llama 3 model to rewrite the user’s prompt by explicitly incorporating relevant conversation context and clarifying what the question is really asking, without changing its intent. That rewritten query then drives document retrieval, leading to more useful context being pulled and better downstream responses.
The workflow starts with a RAG setup where questions are answered using retrieved snippets from a document collection. When the user asks something broad like “what does that mean” after asking about Llama 3’s training tokens, retrieval can fail because the query provides too little information for the retriever to find relevant passages. In the baseline run, the system retrieves no useful context for the vague follow-up, so the model has little to ground its answer.
The improved version keeps the first query unchanged, but from the second user message onward it generates a “Rewritten query.” A dedicated prompt instructs the model to (1) preserve the original intent, (2) expand and clarify the query using conversation history, (3) avoid introducing new topics, and (4) output only the rewritten query text. The implementation uses structured JSON output so the rewritten text is extracted deterministically. In code terms, the function receives the user’s original query as JSON, parses it into a Python dictionary, builds a prompt that includes the prior conversation messages, calls a local Llama 3 model (running via Ollama), extracts the rewritten query from the model response, and returns it as JSON.
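A minimal sketch of what that rewriting function might look like, assuming the `ollama` Python client and a local `llama3` model; the JSON field names (`Query`, `Rewritten Query`), the prompt wording, and the helper name are illustrative rather than the creator’s exact code:

```python
import json
import ollama  # assumes the `ollama` Python client and a locally pulled llama3 model


def rewrite_query(user_input_json: str, conversation_history: list, model: str = "llama3") -> str:
    """Rewrite a vague follow-up into a retrieval-friendly query (hypothetical helper)."""
    # Parse the incoming JSON payload into a Python dict and pull out the raw query
    user_query = json.loads(user_input_json)["Query"]

    # Keep only the most recent turns so the prompt stays short
    # (assumes OpenAI-style message dicts with "role" and "content" keys)
    context = "\n".join(
        f"{msg['role']}: {msg['content']}" for msg in conversation_history[-4:]
    )

    prompt = f"""Rewrite the following query by incorporating relevant context from the conversation history.
- Preserve the core intent and meaning of the original query
- Expand and clarify the query so it is more specific for retrieving relevant context
- DO NOT introduce new topics or queries that deviate from the original query
- Return ONLY the rewritten query text, without any additional formatting or explanations

Conversation History:
{context}

Original query: [{user_query}]

Rewritten query:
"""

    # Call the local model via Ollama and take its text output as the rewritten query
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    rewritten_query = response["message"]["content"].strip()

    # Return structured JSON so the caller can extract the field deterministically
    return json.dumps({"Rewritten Query": rewritten_query})
```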
Crucially, the pipeline then feeds only the rewritten query into the “get relevant context” retrieval function—skipping the original vague query for retrieval. That change is what increases the chance that the retriever finds the right document chunks.
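In pipeline terms, the change amounts to a few lines. The sketch below assumes the hypothetical `rewrite_query` helper above and a `get_relevant_context` retrieval function like the one sketched further down; `conversation_history`, `user_input`, `vault_embeddings`, and `vault_content` are placeholder names:

```python
import json

# Only follow-up turns get rewritten; the first query is used as-is
if len(conversation_history) > 1:
    query_json = json.dumps({"Query": user_input})
    rewritten = json.loads(rewrite_query(query_json, conversation_history))["Rewritten Query"]
    print("Rewritten query:", rewritten)
else:
    rewritten = user_input

# Retrieval sees only the rewritten, context-aware query, never the vague original
relevant_context = get_relevant_context(rewritten, vault_embeddings, vault_content)
```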
In tests with both an 8B local Llama 3 model and a larger 70B model served by Groq, the rewritten-query approach consistently produced better retrieval and more coherent answers. For example, the follow-up “what does that mean” becomes a much more specific retrieval query describing what “15 trillion tokens” means, including token definitions and implications. The creator also reports a rough quantitative estimate: comparing pairs of responses (one without rewriting, one with) using GPT-4 across multiple trials, the rewritten version typically scored about 30–50% better, landing around 37% in aggregate.
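For the hosted 70B run, a hedged sketch using the `groq` Python SDK and its OpenAI-style chat interface; the model name, prompt layout, and helper name are assumptions, not the creator’s exact setup:

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment


def generate_answer(system_prompt, user_message, context_chunks, model="llama3-70b-8192"):
    """Answer using retrieved context; the same call shape could serve the rewriting step."""
    context = "\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_message}"},
        ],
    )
    return response.choices[0].message.content
```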
Beyond the rewriting trick, the project includes practical local RAG updates: switching to the Dolphin fine-tune of Llama 3 (dolphin-llama3 in Ollama), using an Ollama embeddings model for vectorization, and allowing model selection from the terminal. The takeaway is less about any single model size and more about improving retrieval quality by rewriting ambiguous follow-ups into retrieval-friendly queries using the model itself, then letting embeddings and the retriever do the heavy lifting.
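A sketch of the Ollama-based embedding and retrieval side, assuming the `ollama` client, PyTorch for the similarity math, and an embedding model such as `mxbai-embed-large`; the model name and the `vault_content` chunk list are assumptions:

```python
import ollama
import torch

# Embed each document chunk once, up front (vault_content: list of chunk strings)
vault_embeddings = torch.tensor([
    ollama.embeddings(model="mxbai-embed-large", prompt=chunk)["embedding"]
    for chunk in vault_content
])


def get_relevant_context(query, vault_embeddings, vault_content, top_k=3):
    """Return the top_k chunks most similar to the (rewritten) query."""
    query_embedding = torch.tensor(
        ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
    )
    # Cosine similarity between the query and every stored chunk embedding
    scores = torch.cosine_similarity(query_embedding.unsqueeze(0), vault_embeddings)
    top_indices = torch.topk(scores, k=min(top_k, len(vault_content))).indices.tolist()
    return [vault_content[i] for i in top_indices]
```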
Cornell Notes
The pipeline improves RAG performance by rewriting vague follow-up questions into clearer, context-aware retrieval queries. Instead of sending the user’s original ambiguous prompt to the “get relevant context” function, the system asks a local Llama 3 model to produce a rewritten query that preserves intent, expands details using conversation history, and avoids new topics. The rewritten query is generated with a structured JSON prompt so the output can be extracted reliably. Feeding this rewritten query into document retrieval leads to more relevant context being pulled, which in turn produces better answers. Reported comparisons using GPT-4 suggest the rewritten-query approach often yields roughly 30–50% better responses, averaging around 37%.
- Why does retrieval fail for vague follow-up questions like “what does that mean”?
- What exactly changes in the improved pipeline?
- How does the query rewriting prompt constrain the model’s output?
- Why use JSON in the rewriting step?
- What models and infrastructure are used in the tests?
- How was the “37% better” figure estimated?
Review Questions
- In a RAG pipeline, which component is most directly affected by switching from the original query to a rewritten query, and why?
- What constraints in the rewriting prompt help prevent the model from drifting into new topics during query expansion?
- How does structured JSON output change the reliability of integrating an LLM-generated rewrite into downstream retrieval code?
Key Points
1. Add a query-rewriting step for follow-up questions so retrieval uses a clearer, context-aware query instead of a vague prompt.
2. Preserve the user’s original intent while expanding details using conversation history; avoid introducing new topics.
3. Generate the rewritten query with a constrained prompt that returns only the rewritten text, not an answer.
4. Use JSON for deterministic parsing: extract the rewritten query field reliably before retrieval.
5. Feed only the rewritten query into the “get relevant context” retrieval function; skip the original vague query for retrieval.
6. Expect larger gains when the retriever would otherwise return little or no relevant context due to under-specified questions.
7. Quantify improvements by side-by-side comparisons (e.g., GPT-4 judging) between baseline and rewritten-query responses, as in the sketch below.
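A rough sketch of that side-by-side judging step, assuming the `openai` Python client with an API key in the environment; the judging prompt and output convention are illustrative, not the creator’s exact evaluation script:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge_responses(question, baseline_answer, rewritten_answer, model="gpt-4"):
    """Ask GPT-4 to compare two answers and estimate how much better the rewritten one is."""
    prompt = (
        f"Question: {question}\n\n"
        f"Response A (original query used for retrieval):\n{baseline_answer}\n\n"
        f"Response B (rewritten query used for retrieval):\n{rewritten_answer}\n\n"
        "How much better is Response B than Response A at answering the question, "
        "as a percentage? Reply with a single number followed by a one-sentence justification."
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```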