Evaluate LLM Systems & RAGs: Choose the Best LLM Using Automatic Metrics on Your Dataset
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Choosing an LLM for a real project often goes wrong when teams rely on classical ML metrics like accuracy, F1, or regression error. Those metrics assume outputs fit a fixed label or numeric target, but question answering, summarization, and RAG systems produce free-form text where “correctness” depends on alignment with a question and with retrieved context. The practical takeaway: evaluation has to measure what matters—relevance, faithfulness to context, and hallucination risk—rather than treating generation as a simple classification problem.
A common workaround is human review, but it’s slow and expensive at scale. Another approach uses one model to critique another, an “AI evaluating AI” method that can scale faster than humans. Yet this introduces a bias: if the same model family is used for both generation and critique, the evaluator tends to favor its own style and outputs. The transcript emphasizes this self-evaluation pitfall and recommends using a strong, separate critique model to reduce favoritism.
The workflow presented starts with a dataset structured for RAG-style evaluation—each example includes a question, a context passage, and a ground-truth answer. The system generates answers using candidate models, then sends each generated answer (along with the question and context) to a critic model for scoring. A naive baseline can be implemented by prompting the critic to rate correctness on a 0–10 integer scale. That method can rank models, but it can miss where failures come from—especially when a RAG pipeline’s retriever is weak and the context itself is wrong.
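A minimal sketch of what that naive baseline could look like, assuming the google-genai Python SDK (the client mentioned later in the summary) and a GEMINI_API_KEY environment variable; the prompt wording, the number-parsing logic, and the critic model name are illustrative rather than taken from the video.

```python
# Naive 0-10 critique baseline: ask a critic model for a single integer score.
# Assumptions: google-genai SDK installed, GEMINI_API_KEY set; prompt text and
# the critic model name below are illustrative, not from the original video.
import os
import re

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

CRITIC_PROMPT = """You are grading an answer to a question given a context passage.
Question: {question}
Context: {context}
Ground-truth answer: {ground_truth}
Candidate answer: {answer}
Reply with a single integer from 0 (completely wrong) to 10 (fully correct)."""


def naive_correctness_score(question: str, context: str, ground_truth: str, answer: str) -> int:
    response = client.models.generate_content(
        model="gemini-1.5-pro",  # placeholder critic model id
        contents=CRITIC_PROMPT.format(
            question=question, context=context, ground_truth=ground_truth, answer=answer
        ),
    )
    match = re.search(r"\d+", response.text)  # tolerate extra words around the number
    return int(match.group()) if match else 0
```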
To get more diagnostic power, the approach uses metrics from the DeepEval framework that target different failure modes. Three metrics are highlighted: hallucination (whether the answer contradicts or introduces information not supported by the context), faithfulness (whether the retrieved context actually backs the claims the answer makes), and answer relevance (how directly the answer addresses the question with correct information). The evaluation flow is straightforward: user question + context → model response → critic model + metrics → scores and reasons.
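A sketch of how those three metrics can be run through DeepEval's Python API. Note that DeepEval's metrics use an OpenAI judge by default (OPENAI_API_KEY must be set); pointing them at a Gemini critic requires a custom DeepEvalBaseLLM wrapper, which is omitted here, and the example texts are invented for illustration.

```python
# Context-aware scoring with DeepEval: hallucination, faithfulness, answer relevancy.
# By default these metrics call an OpenAI judge; wiring in a Gemini critic needs a
# custom DeepEvalBaseLLM subclass, which is left out of this sketch.
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When was the company founded?",                  # the user question
    actual_output="It was founded in 1998 in Menlo Park.",  # candidate model's answer
    context=["The company was founded in 1998."],           # ground-truth context (hallucination)
    retrieval_context=["The company was founded in 1998."], # retrieved passages (faithfulness)
)

for metric in (HallucinationMetric(), FaithfulnessMetric(), AnswerRelevancyMetric()):
    metric.measure(test_case)
    # Each metric exposes a numeric score plus a natural-language reason for it.
    print(type(metric).__name__, metric.score, metric.reason)
```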
Implementation details include using Google’s Gemini API through the Google GenAI client, with a critique model such as gemini-1.5-pro-370b; the transcript notes that smaller critique models may struggle to produce the structured outputs that the metric tooling requires. Candidate generation models include Gemma 7B, gemini-1.5-flash-8b, and Mixtral (as provided by the API). The transcript reports that a small sample (about 20 examples) can still show clear differences: Gemma 7B scored highest under the simple 0–10 critique, and the deeper metric evaluation found it performing strongly on relevance and faithfulness with minimal hallucination, whereas the other models showed issues such as missing relevant information, incomplete answers, omissions of crucial details, and contradictions.
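A hedged sketch of candidate-answer generation through the google-genai client; the model identifiers echo the names mentioned in the transcript but may need adjusting to whatever your API actually exposes, and the prompt wording is an assumption.

```python
# Generate candidate answers for one dataset example with the google-genai SDK.
# Model ids below follow the names mentioned in the video; adjust them to the
# models your API account actually exposes.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

CANDIDATE_MODELS = ["gemma-7b", "gemini-1.5-flash-8b"]  # illustrative identifiers


def generate_answer(model: str, question: str, context: str) -> str:
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text


example = {
    "question": "When was the company founded?",
    "context": "The company was founded in 1998.",
}
answers = {m: generate_answer(m, example["question"], example["context"]) for m in CANDIDATE_MODELS}
```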
The end result is a repeatable “test suite” for LLM and RAG systems on custom data: generate with candidate models, score with context-aware metrics, and use the returned reasons to decide not just which model is best, but what kinds of errors each model makes—information that supports targeted improvements to prompting, retrieval, or model choice.
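To make the returned reasons actionable, the per-example outputs can be rolled up into a per-model report. The sketch below assumes a flat list of result rows (model, metric, score, reason) produced by a generation-and-scoring loop like the ones above; the rows shown are placeholders, not real measurements.

```python
# Roll per-example metric results up into a per-model report.
# The `results` rows are placeholders standing in for real evaluation output.
from collections import defaultdict
from statistics import mean

results = [
    {"model": "gemma-7b", "metric": "faithfulness", "score": 0.9, "reason": "Claims are supported by the context."},
    {"model": "gemma-7b", "metric": "hallucination", "score": 0.1, "reason": "No contradictions with the context."},
    {"model": "gemini-1.5-flash-8b", "metric": "faithfulness", "score": 0.6, "reason": "Omits a crucial detail."},
]

scores = defaultdict(list)
reasons = defaultdict(list)
for row in results:
    key = (row["model"], row["metric"])
    scores[key].append(row["score"])
    reasons[key].append(row["reason"])

for (model, metric), values in sorted(scores.items()):
    print(f"{model:<24} {metric:<14} avg={mean(values):.2f}")
    for reason in reasons[(model, metric)]:
        print(f"    reason: {reason}")
```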
Cornell Notes
Classical ML metrics don’t fit LLM generation and RAG because outputs are free-form text and correctness depends on both the question and the retrieved context. A scalable alternative is “AI evaluating AI,” but using the same model for generation and critique can bias results toward that model’s own style. The workflow uses a critic model (e.g., gemini-1.5-pro-370b) to score candidate answers on a dataset structured as question + context + ground-truth answer. It then moves beyond a simple 0–10 rating to context-aware DeepEval metrics: hallucination, faithfulness, and answer relevance, which also provide reasons for low scores. This produces a diagnostic evaluation report that helps identify whether failures come from missing details, irrelevance, or unsupported claims.
- Why do accuracy/F1-style metrics usually break down for LLM and RAG evaluation?
- What bias can appear when the same model is used to generate and critique answers?
- How does the naive “0–10 correctness” critique differ from metric-based evaluation?
- What does hallucination vs. faithfulness measure in a RAG setting?
- Why can retriever quality dominate RAG evaluation outcomes?
- What practical constraints affect evaluation speed when using API-based critique?
Review Questions
- Design an evaluation plan for a RAG system: which dataset fields would you require, and which metrics would you prioritize (and why)?
- Explain how you would detect whether a low score is caused by retrieval errors versus generation errors using hallucination/faithfulness/relevance.
- What steps would you take to reduce bias in AI-as-judge evaluation when comparing multiple candidate LLMs?
Key Points
1. Classical ML metrics like accuracy and F1 don’t map cleanly onto LLM generation quality because outputs are free-form text rather than fixed labels.
2. Human evaluation is accurate but slow and costly, making it hard to scale across many models and datasets.
3. AI-as-judge evaluation can scale, but using the same model for generation and critique can bias scores toward that model’s own outputs.
4. A RAG evaluation dataset should include question, retrieved context, and ground-truth answer so metrics can check both relevance and context alignment.
5. Simple 0–10 critique prompts can rank models, but context-aware metrics provide diagnostic detail about why answers fail.
6. DeepEval metrics such as hallucination, faithfulness, and answer relevance help separate unsupported claims, context misalignment, and question mismatch.
7. API-based evaluation requires batching/subsetting and throttling awareness because critique calls can be slow (see the sketch after this list).
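Because each critique call is an API round trip, runs over large datasets are slow and can hit rate limits. A minimal sketch of subsetting plus backoff follows; the sample size, sleep interval, and retry count are illustrative choices, not from the video.

```python
# Evaluate a random subset and back off on rate-limit errors.
# Sample size, sleep interval, and retry count are illustrative defaults.
import random
import time


def evaluate_subset(dataset, score_fn, sample_size=20, pause_s=1.0, max_retries=3):
    subset = random.sample(dataset, min(sample_size, len(dataset)))
    results = []
    for example in subset:
        for attempt in range(max_retries):
            try:
                results.append(score_fn(example))
                break
            except Exception:  # e.g. a 429 / rate-limit error from the critic API
                time.sleep(pause_s * 2 ** attempt)  # exponential backoff before retrying
        time.sleep(pause_s)  # throttle between critique calls
    return results
```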