Evaluate LLM Systems & RAGs: Choose the Best LLM Using Automatic Metrics on Your Dataset

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Classical ML metrics like accuracy and F1 don’t map cleanly onto LLM generation quality because outputs are free-form text rather than fixed labels.

Briefing

Choosing an LLM for a real project often goes wrong when teams rely on classical ML metrics like accuracy, F1, or regression error. Those metrics assume outputs fit a fixed label or numeric target, but question answering, summarization, and RAG systems produce free-form text where "correctness" depends on alignment with a question and with retrieved context. The practical takeaway: evaluation has to measure what matters (relevance, faithfulness to context, and hallucination risk) rather than treating generation as a simple classification problem.

A common workaround is human review, but it’s slow and expensive at scale. Another approach uses one model to critique another, an “AI evaluating AI” method that can scale faster than humans. Yet this introduces a bias: if the same model family is used for both generation and critique, the evaluator tends to favor its own style and outputs. The transcript emphasizes this self-evaluation pitfall and recommends using a strong, separate critique model to reduce favoritism.

The workflow presented starts with a dataset structured for RAG-style evaluation—each example includes a question, a context passage, and a ground-truth answer. The system generates answers using candidate models, then sends each generated answer (along with the question and context) to a critic model for scoring. A naive baseline can be implemented by prompting the critic to rate correctness on a 0–10 integer scale. That method can rank models, but it can miss where failures come from—especially when a RAG pipeline’s retriever is weak and the context itself is wrong.
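
As a concrete illustration, the naive baseline might look like the sketch below. This assumes the google-genai Python client; the prompt wording, critic model name, and output parsing are placeholders rather than the video's exact code.

```python
# Minimal sketch of the naive 0–10 critique. Assumptions: the google-genai
# client is installed and an API key is available in the environment.
from google import genai

client = genai.Client()  # picks up the API key from the environment

CRITIC_PROMPT = """You are grading a generated answer.
Question: {question}
Context: {context}
Ground-truth answer: {reference}
Generated answer: {answer}

Rate the generated answer's correctness as a single integer from 0 to 10.
Reply with the integer only."""

def critique(question: str, context: str, reference: str, answer: str) -> int:
    response = client.models.generate_content(
        model="gemini-1.5-pro",  # placeholder critic model name
        contents=CRITIC_PROMPT.format(
            question=question, context=context, reference=reference, answer=answer
        ),
    )
    # Assumes the critic complies with "integer only"; real code would
    # validate the reply and retry on malformed output.
    return int(response.text.strip())
```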

To get more diagnostic power, the approach uses metrics from the DeepEval framework that score different failure modes. Three metrics are highlighted: hallucination (whether the answer contradicts or introduces information not supported by the context), faithfulness (whether the context supports what the answer claims), and answer relevance (how well the answer addresses the question with correct information). The evaluation diagram is straightforward: user question + context → model response → critic model + metrics → scores and reasons.
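
A minimal sketch of scoring one example with these three metrics is shown below. The field values are placeholders, and DeepEval's built-in metrics default to an OpenAI judge unless a custom model wrapper (for example, one around Gemini) is supplied.

```python
# Sketch only: scores a single placeholder example with DeepEval metrics.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)

test_case = LLMTestCase(
    input="What is the capital of Bulgaria?",           # user question
    actual_output="Sofia is the capital of Bulgaria.",  # candidate's answer
    context=["Sofia is the capital and largest city of Bulgaria."],
    retrieval_context=["Sofia is the capital and largest city of Bulgaria."],
)

# Each metric returns both a score and a natural-language reason,
# which is what makes the evaluation diagnostic rather than a single number.
for metric in (
    HallucinationMetric(threshold=0.5),    # checked against `context`
    FaithfulnessMetric(threshold=0.5),     # checked against `retrieval_context`
    AnswerRelevancyMetric(threshold=0.5),  # question vs. answer
):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score, metric.reason)
```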

Implementation details include using Google's Gemini API through the Google GenAI client, with a critique model such as gemini-1.5-pro-370b; the transcript notes that smaller models may struggle to produce the structured outputs required by the metric tooling. Candidate generation models include Gemma 7B, gemini-1.5-flash-8b, and Mixtral (as provided by the API). The transcript reports that a small sample (about 20 examples) can still show clear differences: Gemma 7B scored highest under the simple 0–10 critique, and the deeper metric evaluation found it performing strongly on relevance and faithfulness with minimal hallucination, whereas the other models showed issues such as missing relevant information, incomplete answers, omission of crucial details, and contradictions.
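
For the generation step itself, a rough sketch follows; the model identifiers are illustrative rather than guaranteed API names, and `client` is the google-genai client from the earlier sketch.

```python
# Sketch: generate answers from several candidate models over the same rows.
# Model IDs are illustrative, not guaranteed API identifiers.
CANDIDATE_MODELS = ["gemma-7b", "gemini-1.5-flash-8b", "mixtral"]

def generate_answers(rows: list[dict]) -> dict[str, list[str]]:
    answers: dict[str, list[str]] = {}
    for model in CANDIDATE_MODELS:
        answers[model] = [
            client.models.generate_content(
                model=model,
                contents=(
                    "Answer the question using only the context below.\n"
                    f"Context: {row['context']}\n"
                    f"Question: {row['question']}"
                ),
            ).text
            for row in rows
        ]
    return answers
```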

The end result is a repeatable “test suite” for LLM and RAG systems on custom data: generate with candidate models, score with context-aware metrics, and use the returned reasons to decide not just which model is best, but what kinds of errors each model makes—information that supports targeted improvements to prompting, retrieval, or model choice.

Cornell Notes

Classical ML metrics don’t fit LLM generation and RAG because outputs are free-form text and correctness depends on both the question and the retrieved context. A scalable alternative is “AI evaluating AI,” but using the same model for generation and critique can bias results toward that model’s own style. The workflow uses a critic model (e.g., gemini-1.5-pro-370b) to score candidate answers on a dataset structured as question + context + ground-truth answer. It then moves beyond a simple 0–10 rating to context-aware DeepEval metrics: hallucination, faithfulness, and answer relevance, which also provide reasons for low scores. This produces a diagnostic evaluation report that helps identify whether failures come from missing details, irrelevance, or unsupported claims.

Why do accuracy/F1-style metrics usually break down for LLM and RAG evaluation?

Those metrics assume a fixed label or numeric target (e.g., class membership or regression error). LLM outputs are natural-language responses, so “correctness” isn’t just a single category—it depends on whether the answer addresses the question and whether it stays consistent with the provided context. For tasks like question answering and summarization, treating generation as classification leads to misleading scores.
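
A two-line example makes the point: exact-match style scoring marks a correct paraphrase as wrong.

```python
reference = "Sofia is the capital of Bulgaria."
generated = "Bulgaria's capital city is Sofia."
print(generated == reference)  # False, yet the answer is semantically correct
```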

What bias can appear when the same model is used to generate and critique answers?

When generation and critique use the same model family, the evaluator tends to recognize and favor its own generation patterns. The transcript highlights the risk that evaluators “recognize and favor their own generations,” meaning scores can be inflated for the model being evaluated. Using a different, stronger critique model helps reduce this self-favoring effect.

How does the naive “0–10 correctness” critique differ from metric-based evaluation?

The naive method prompts a critic model to rate an answer’s correctness on a 0–10 integer scale. It can rank models, but it often doesn’t reveal failure sources. Metric-based evaluation separates issues: hallucination checks for unsupported or contradictory claims, faithfulness checks whether the context supports the answer, and answer relevance checks whether the response actually matches the question.

What does hallucination vs. faithfulness measure in a RAG setting?

Hallucination focuses on whether the answer introduces information not present in the context or contradicts it. Faithfulness checks whether the context contains the information needed for the answer claims—essentially whether the response aligns with what retrieval provided. Together they help distinguish “unsupported claims” from “context not containing the needed facts.”
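
A toy example (made up here purely for illustration) shows how the two concerns pull apart:

```python
# Toy illustration only; not an example from the video.
context = ["The Eiffel Tower is 330 m tall and stands in Paris."]

answer_a = "The Eiffel Tower is 330 m tall and was built in 1889."
# Hallucination concern: "built in 1889" is not supported by the context,
# even though it happens to be true in the real world.

answer_b = "The Eiffel Tower stands in Paris."
# Faithful and non-hallucinated: every claim is backed by the context.
```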

Why can retriever quality dominate RAG evaluation outcomes?

If the retriever fetches irrelevant or incomplete context, even a strong generator can’t reliably produce a correct answer. The transcript notes that simple scoring can fail to diagnose this: a low score might reflect retrieval failure rather than generation quality. Context-aware metrics make it easier to see when the answer is unfaithful or hallucinated relative to the retrieved passage.
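
One way to operationalize this is a simple heuristic over the three metric scores, as in the sketch below; the thresholds and decision logic are placeholders of my own, not values from the video.

```python
# Heuristic sketch: attribute a failure to retrieval vs. generation.
# Assumed convention (as in DeepEval): higher hallucination = worse,
# higher faithfulness/relevancy = better. Thresholds are placeholders.
def diagnose(hallucination: float, faithfulness: float, relevancy: float) -> str:
    if faithfulness >= 0.7 and relevancy < 0.5:
        # The answer sticks to the context but misses the question:
        # the retrieved passage was probably irrelevant (retrieval failure).
        return "likely retrieval failure"
    if hallucination >= 0.5:
        # The answer invents or contradicts information: a generation problem.
        return "likely generation failure (hallucination)"
    return "inconclusive - inspect the metric reasons"
```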

What practical constraints affect evaluation speed when using API-based critique?

Critic scoring can be slow because each example requires API calls and the service may require throttling. The transcript reports that even around 100 examples can involve substantial waiting, so it subsets the dataset (about 20 examples) for feasibility. It also notes that faster requests may be possible with higher-throughput models (GPT-4o and Claude 3 are mentioned as examples).
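
A throttled evaluation loop might look like this sketch; the delay value is a guess and should be tuned to the API's actual rate limits.

```python
import time

# Sketch: score rows sequentially with a fixed pause between critic calls.
# `score_fn` is any per-example scorer, e.g. the `critique` sketch above.
def evaluate_all(rows, score_fn, delay_s: float = 4.0):
    scores = []
    for row in rows:
        scores.append(score_fn(row))
        time.sleep(delay_s)  # crude rate limiting; real code would back off on errors
    return scores

# As in the video, run on a small subset for feasibility:
# scores = evaluate_all(dataset[:20], score_fn=my_scorer)
```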

Review Questions

  1. Design an evaluation plan for a RAG system: which dataset fields would you require, and which metrics would you prioritize (and why)?
  2. Explain how you would detect whether a low score is caused by retrieval errors versus generation errors using hallucination/faithfulness/relevance.
  3. What steps would you take to reduce bias in AI-as-judge evaluation when comparing multiple candidate LLMs?

Key Points

  1. Classical ML metrics like accuracy and F1 don’t map cleanly onto LLM generation quality because outputs are free-form text rather than fixed labels.

  2. Human evaluation is accurate but slow and costly, making it hard to scale across many models and datasets.

  3. AI-as-judge evaluation can scale, but using the same model for generation and critique can bias scores toward that model’s own outputs.

  4. A RAG evaluation dataset should include question, retrieved context, and ground-truth answer so metrics can check both relevance and context alignment.

  5. Simple 0–10 critique prompts can rank models, but context-aware metrics provide diagnostic detail about why answers fail.

  6. DeepEval metrics such as hallucination, faithfulness, and answer relevance help separate unsupported claims, context misalignment, and question mismatch.

  7. API-based evaluation requires batching/subsetting and throttling awareness because critique calls can be slow.

Highlights

Using one LLM to critique another can scale evaluation, but self-critique bias is real—separating the critique model from the candidate model reduces favoritism.
Hallucination, faithfulness, and answer relevance offer a more actionable breakdown than a single “correctness” score, especially for RAG where retrieval quality matters.
A small sample (around 20 examples) can still reveal consistent patterns: Gemma 7B scored highest in the simple critique and also performed strongly on relevance/faithfulness with minimal hallucination in metric-based scoring.
