
37% Better Output with 15 Lines of Code - Llama 3 8B (Ollama) & 70B (Groq)

All About AI · 5 min read

Based on All About AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Add a query-rewriting step for follow-up questions so retrieval uses a clearer, context-aware query instead of a vague prompt.

Briefing

A simple query-rewriting step inside a local RAG (retrieval-augmented generation) pipeline can materially improve answers—often by roughly 37%—even when the user’s original question is vague. The core fix: before retrieving documents, the system asks a local Llama 3 model to rewrite the user’s prompt by explicitly incorporating relevant conversation context and clarifying what the question is really asking, without changing its intent. That rewritten query then drives the document retrieval, leading to more useful context being pulled and better downstream responses.

The workflow starts with a RAG setup where questions are answered using retrieved snippets from a document collection. When the user asks something broad like “what does that mean” after asking about Llama 3’s training tokens, retrieval can fail because the query provides too little information for the retriever to find relevant passages. In the baseline run, the system retrieves no useful context for the vague follow-up, so the model has little to ground its answer.

The improved version keeps the first query unchanged, but from the second user message onward it generates a “Rewritten query.” A dedicated prompt instructs the model to (1) preserve the original intent, (2) expand and clarify the query using conversation history, (3) avoid introducing new topics, and (4) output only the rewritten query text. The implementation uses structured JSON output so the rewritten text is extracted deterministically. In code terms, the function receives the user’s original query as JSON, parses it into a Python dictionary, builds a prompt that includes the prior conversation messages, calls a local Llama 3 model (running via Ollama), extracts the rewritten query from the model response, and returns it as JSON.
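The rewriting step described above can be sketched roughly as follows. This is a hedged reconstruction, not the video's exact code: chat_fn stands in for the Ollama model call so the logic can be shown without a running model server, and all names (rewrite_query, REWRITE_SYSTEM_PROMPT, the "Query" field) are illustrative.

```python
import json

# Hedged sketch of the rewriting step. `chat_fn` stands in for the call to a
# local Llama 3 model (e.g. via the ollama client); it is injected here so the
# logic can be shown without a model server. All names are illustrative.

REWRITE_SYSTEM_PROMPT = (
    "Rewrite the user's query using the conversation history so it is "
    "self-contained and specific for retrieval. Preserve the original intent, "
    "do not introduce new topics, and do not answer the question. "
    'Return JSON of the form {"Rewritten query": "..."}.'
)

def rewrite_query(user_input_json: str, conversation_history: list, chat_fn) -> str:
    """Return a JSON string containing the context-aware rewritten query."""
    user_query = json.loads(user_input_json)["Query"]        # JSON -> dict
    history = "\n".join(f"{m['role']}: {m['content']}" for m in conversation_history)
    prompt = f"Conversation history:\n{history}\n\nOriginal query: {user_query}"
    raw = chat_fn(system=REWRITE_SYSTEM_PROMPT, prompt=prompt)
    rewritten = json.loads(raw)["Rewritten query"]           # structured output
    return json.dumps({"Rewritten query": rewritten})
```

With the real ollama Python client, chat_fn would wrap something like ollama.chat(model="llama3", messages=[...]) and return response["message"]["content"].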

Crucially, the pipeline then feeds only the rewritten query into the “get relevant context” retrieval function—skipping the original vague query for retrieval. That change is what increases the chance that the retriever finds the right document chunks.
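The wiring change itself is small. A minimal sketch, assuming a rewrite function like the one described above (rewrite_fn and the "Rewritten query" field are placeholders for the pipeline's own functions):

```python
import json

# Minimal sketch of the retrieval wiring: the first turn retrieves on the
# original query; every later turn retrieves on the rewritten one.
# `rewrite_fn` is a placeholder for the query-rewriting function; its output
# would then be passed to the "get relevant context" retrieval function.

def retrieval_query_for_turn(user_query: str, conversation_history: list, rewrite_fn) -> str:
    if not conversation_history:               # first message: send as-is
        return user_query
    rewritten_json = rewrite_fn(json.dumps({"Query": user_query}), conversation_history)
    return json.loads(rewritten_json)["Rewritten query"]   # follow-up: use rewrite
```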

Testing on both an 8B local Llama 3 model and a larger 70B model served by Groq, the rewritten-query approach consistently produced better retrieval and more coherent answers. For example, the follow-up “what does that mean” becomes a much more specific retrieval query describing what “15 trillion tokens” means, including token definitions and implications. The creator also reports a rough quantitative estimate: comparing two responses (one without rewriting and one with rewriting) using GPT-4 across multiple trials, the rewritten version typically scored about 30–50% better, landing around 37% in aggregate.

Beyond the rewriting trick, the project includes practical local RAG updates: switching to a Dolphin variant of Llama 3, using an Ollama embeddings model for vectorization, and allowing model selection from the terminal. The takeaway is less about any single model size and more about improving retrieval quality by rewriting ambiguous follow-ups into retrieval-friendly queries using the model itself—then letting embeddings and the retriever do the heavy lifting.

Cornell Notes

The pipeline improves RAG performance by rewriting vague follow-up questions into clearer, context-aware retrieval queries. Instead of sending the user’s original ambiguous prompt to the “get relevant context” function, the system asks a local Llama 3 model to produce a rewritten query that preserves intent, expands details using conversation history, and avoids new topics. The rewritten query is generated with a structured JSON prompt so the output can be extracted reliably. Feeding this rewritten query into document retrieval leads to more relevant context being pulled, which in turn produces better answers. Reported comparisons using GPT-4 suggest the rewritten-query approach often yields roughly 30–50% better responses, averaging around 37%.

Why does retrieval fail for vague follow-up questions like “what does that mean”?

In a RAG system, the retriever depends on the query text to find matching document chunks. A vague follow-up like “what does that mean” doesn’t specify the concept (e.g., “15 trillion tokens”), so the retriever may return no relevant context. With no retrieved snippets, the model has less grounding and the answer quality drops.
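A toy word-overlap score (standing in for real embedding similarity) illustrates this failure mode; the document chunk below is hypothetical:

```python
# Toy illustration of the failure mode, using word overlap as a stand-in for
# embedding similarity. The document chunk below is hypothetical.

def overlap_score(query: str, chunk: str) -> int:
    """Count words shared between the query and a document chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunk = "llama 3 was pretrained on 15 trillion tokens of publicly available data"
vague = "what does that mean"
rewritten = "what does 15 trillion tokens mean for llama 3 pretraining data"

# The vague follow-up shares no words with the chunk, so nothing is retrieved;
# the rewritten query shares many, so the chunk scores highly.
```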

What exactly changes in the improved pipeline?

Only the retrieval query changes. The first user question is sent as-is. From the second message onward, the system generates a “Rewritten query” that incorporates relevant conversation history (the prior messages) and clarifies what the user is asking. The rewritten query—not the original vague query—is then passed into the “get relevant context” function to retrieve document context.

How does the query rewriting prompt constrain the model’s output?

The prompt instructs the model to preserve the original intent and meaning, expand and clarify the query to make it more specific for retrieval, avoid introducing new topics or deviating from the original question, and never answer the original question directly. It also requires returning only the rewritten query text (no extra formatting or explanation).
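Collected into a single instruction string, those constraints might look like this. This is a reconstruction from the description above; the video's exact prompt wording may differ.

```python
# Hedged reconstruction of the rewriting prompt's constraints; the exact
# wording used in the video may differ.
REWRITE_PROMPT = """\
Rewrite the user's query by incorporating relevant context from the conversation history.
- Preserve the original intent and meaning of the query.
- Expand and clarify the query to make it more specific for retrieval.
- Do NOT introduce new topics or deviate from the original query.
- Do NOT answer the original query.
Return ONLY the rewritten query text, with no additional formatting or explanation.
"""
```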

Why use JSON in the rewriting step?

JSON makes the output more deterministic and easier to parse. The function receives a JSON string containing the user’s original query, parses it with json.loads into a Python dictionary, then calls the Llama model with a prompt that returns a structured response. The code extracts the “Rewritten query” field from the model output and returns it as JSON for the retrieval step.

What models and infrastructure are used in the tests?

The creator runs an 8B Llama 3 model locally via Ollama for the first demonstration. Later, the same rewritten-query approach is tested with a 70B model served through Groq. In both cases, the rewritten query improves the retrieval and the resulting answer quality.

How was the “37% better” figure estimated?

The creator takes one response generated without the rewrite-query step and a second response generated with the rewrite-query step. Then GPT-4 is used to compare the two responses repeatedly. Most comparisons land in the 30–50% better range for the rewritten-query response, with an average reported around 37% (and similarly around 30–40% on a separate comparison set).
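The aggregation itself is just an average over per-trial judgments. A sketch with hypothetical scores (the judge model, GPT-4 in the video, is not called here, and these numbers are illustrative, not the video's data):

```python
# Sketch of the aggregation: a judge model scores how much better the
# rewritten-query response is in each trial, and trials are averaged.
# The scores below are hypothetical, not the video's actual data.

def average_improvement(scores: list) -> float:
    """Mean of per-trial percentage-improvement judgments."""
    return sum(scores) / len(scores)

trial_scores = [30.0, 35.0, 40.0, 45.0, 35.0]  # hypothetical per-trial % judgments
```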

Review Questions

  1. In a RAG pipeline, which component is most directly affected by switching from the original query to a rewritten query, and why?
  2. What constraints in the rewriting prompt help prevent the model from drifting into new topics during query expansion?
  3. How does structured JSON output change the reliability of integrating an LLM-generated rewrite into downstream retrieval code?

Key Points

  1. Add a query-rewriting step for follow-up questions so retrieval uses a clearer, context-aware query instead of a vague prompt.
  2. Preserve the user’s original intent while expanding details using conversation history; avoid introducing new topics.
  3. Generate the rewritten query with a constrained prompt that returns only the rewritten text, not an answer.
  4. Use JSON for deterministic parsing: extract the rewritten query field reliably before retrieval.
  5. Feed only the rewritten query into the “get relevant context” retrieval function; skip the original vague query for retrieval.
  6. Expect larger gains when the retriever would otherwise return little or no relevant context due to under-specified questions.
  7. Quantify improvements by side-by-side comparisons (e.g., GPT-4 judging) between baseline and rewritten-query responses.

Highlights

  • A vague follow-up like “what does that mean” can cause retrieval to pull zero relevant context; rewriting fixes that by making the retrieval query explicit.
  • The rewritten-query prompt is designed to clarify for retrieval while preserving intent and forbidding direct answering.
  • Feeding only the rewritten query into document retrieval—not the original—drives the quality jump.
  • Reported comparisons using GPT-4 place the rewritten-query approach in the ~30–50% better range, averaging around 37%.

Mentioned

  • RAG
  • GPT-4
  • JSON
  • LLM
  • Ollama