
Mastering LLM Chatbots And RAG Evaluation Crash Course

Krish Naik · 5 min read

Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Create evaluation datasets as question/expected-answer pairs so correctness metrics have a ground truth reference.

Briefing

LLM chatbot and RAG quality can be measured systematically by combining three ingredients: curated test data (inputs plus ground-truth outputs), evaluation metrics that compare model outputs to that ground truth (or to retrieved context), and an “LLM-as-a-judge” layer that produces consistent, structured scores. The practical takeaway is that model choice stops being guesswork: different LLMs can be run against the same dataset, then ranked using the same metrics inside LangSmith.

The crash course starts with the core problem behind evaluation: selecting an LLM, defining what “good” looks like for a specific use case, and building the data needed to compare outputs. For chatbots, the workflow begins by creating a dataset of question/expected-answer pairs. Those examples become the ground truth. Next comes the evaluation step, where an LLM is prompted to grade a generated answer against the reference. In the implementation, LangSmith is used for observability and experiment tracking, while OpenAI is used both to generate chatbot responses and to act as the judge.
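The dataset shape can be sketched in plain Python. The field names follow LangSmith's `inputs`/`outputs` convention, but the example questions below are illustrative, not taken from the course:

```python
# Minimal sketch of the evaluation dataset: each example pairs an input
# question with a ground-truth answer. In LangSmith these would be uploaded
# with the SDK; here only the structure is shown.

examples = [
    {
        "inputs": {"question": "What is LangSmith used for?"},
        "outputs": {"answer": "Observability and experiment tracking for LLM apps."},
    },
    {
        "inputs": {"question": "What does an LLM-as-a-judge evaluator do?"},
        "outputs": {"answer": "It grades a model's output against a reference using an LLM."},
    },
]

# Every example carries a reference answer, so correctness can be graded later.
assert all("answer" in ex["outputs"] for ex in examples)
```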

A concrete example defines two custom metrics. “Correctness” uses an LLM prompt that grades whether the predicted answer is factually correct relative to the ground truth, returning a boolean. “Concision” applies a simple length-based rule: the response must be less than twice the length of the expected answer. An evaluation function then runs the chatbot for each dataset question, collects the model output, and feeds the question, reference output, and predicted output into the judge-based metrics. LangSmith records the results under an experiment name, including per-example comparisons and aggregate scores.
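A minimal sketch of the two metrics, with a `call_llm` callable standing in for the OpenAI judge so the logic stays testable without an API key (the course wires these up as LangSmith evaluators):

```python
def correctness(question, reference, predicted, call_llm):
    """LLM-as-a-judge: True if the predicted answer is factually correct
    relative to the reference answer, per the judge model's grade."""
    prompt = (
        "You are grading a chatbot answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {predicted}\n"
        "Reply with exactly 'true' or 'false'."
    )
    return call_llm(prompt).strip().lower() == "true"


def concision(reference, predicted):
    """Length rule from the course: the response must be shorter than
    twice the length of the expected answer."""
    return len(predicted) < 2 * len(reference)
```

Only the judge transport is stubbed; the grading prompt and the length rule follow the definitions above.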

To show how model selection becomes data-driven, the same evaluation is repeated with different OpenAI models. The course demonstrates running experiments with a smaller model first (GPT-4o mini) and then with a larger one (GPT-4 turbo), using the same dataset and evaluators. The resulting correctness and concision scores determine which model performs better under the defined criteria.
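The comparison loop can be sketched as follows. The `generate` callable and the evaluator signatures are illustrative stand-ins for the course's LangSmith experiment runs; only the model name changes between runs:

```python
def run_experiment(model_name, dataset, generate, evaluators):
    """Run one model over the whole dataset and score it with every
    evaluator. Returns the fraction of examples passing each metric."""
    scores = {name: 0 for name in evaluators}
    for ex in dataset:
        predicted = generate(model_name, ex["question"])
        for name, fn in evaluators.items():
            scores[name] += fn(ex, predicted)  # bool counts as 0/1
    return {name: total / len(dataset) for name, total in scores.items()}
```

Because the dataset and evaluators are fixed, the per-experiment score dicts are directly comparable across model names, which is what makes the ranking fair.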

The focus then shifts to RAG evaluation, where “quality” depends on more than final answer correctness. A typical RAG pipeline has a retriever that selects documents and a generator that answers using those documents as context. The course lays out four key evaluation dimensions: retrieval relevance (are the retrieved documents relevant to the question), groundedness (is the answer supported by the retrieved documents, avoiding hallucinations), correctness (does the answer match the ground truth), and answer relevance (does the answer address the question). The workflow again uses LangSmith for dataset and experiment management.

In the RAG setup, web pages are loaded, split into chunks, embedded, and stored in an in-memory vector store to create a retriever. A traced “rag_bot” function retrieves context, injects it into a prompt, and generates an answer via an OpenAI chat model. Test data is created from the source material as question/answer ground truth pairs. Finally, custom evaluators are implemented using structured outputs: correctness is graded against the reference answer, relevance is graded against the question, groundedness is graded against retrieved context, and retrieval relevance is graded by comparing retrieved facts to the question. Running the evaluation sends scores back to LangSmith, producing a dashboard with correctness, groundedness, relevance, and retrieval metrics along with latency and cost signals—turning RAG tuning into an evidence-based process rather than intuition.
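The retrieve-then-generate control flow can be sketched with a toy keyword-overlap “retriever” in place of the embedding-backed vector store, and a `generate` callable in place of the OpenAI chat model. The course's real pipeline uses LangChain loaders, a text splitter, embeddings, and an in-memory vector store, but the shape of `rag_bot` is the same:

```python
def retrieve(question, chunks, k=2):
    """Toy retriever: rank pre-chunked text by word overlap with the
    question and return the top k chunks."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rag_bot(question, chunks, generate):
    """Retrieve context, inject it into the prompt, generate an answer,
    and return both so evaluators can grade groundedness later."""
    context = retrieve(question, chunks)
    joined = "\n".join(context)
    prompt = f"Answer using only this context:\n{joined}\n\nQuestion: {question}"
    return {"answer": generate(prompt), "context": context}
```

Returning the retrieved context alongside the answer is what lets the downstream groundedness and retrieval-relevance evaluators see the intermediate step, not just the final text.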

Cornell Notes

The crash course presents an evaluation workflow for both LLM chatbots and RAG systems built around three steps: (1) create datasets of inputs with ground-truth outputs, (2) run models to generate predictions, and (3) score those predictions using evaluators—often implemented as “LLM-as-a-judge.” For chatbots, it demonstrates custom metrics like correctness (LLM grades predicted vs reference answer) and concision (a length-based constraint). It then compares multiple OpenAI models on the same dataset using LangSmith experiments to choose the better performer. For RAG, it expands evaluation beyond final accuracy to include retrieval relevance, groundedness (answer supported by retrieved documents), correctness, and answer relevance, again using LLM judge evaluators with structured outputs.

Why is dataset construction (inputs + ground truth) the first non-negotiable step in LLM chatbot evaluation?

Evaluation needs something to compare against. For chatbots, each dataset example pairs a question (input) with an expected answer (reference output). That reference becomes the ground truth used by metrics like correctness. Without ground truth, the system can only score weaker properties (e.g., relevance) rather than factual accuracy.

How does “LLM-as-a-judge” turn qualitative answer quality into measurable scores?

An LLM is prompted with grading instructions and given the question, the predicted answer, and (when available) the reference answer or retrieved context. The judge returns structured outputs—such as a boolean correctness grade (“true” if the predicted answer is factually accurate, otherwise “false”). This makes scoring repeatable across many examples and models.
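One way the structured part can be enforced is to ask the judge for a small JSON object and parse the grade into a boolean. This is a sketch, not the course's exact implementation (which uses OpenAI structured outputs); `call_llm` is again a stand-in for the judge model:

```python
import json


def judge_correctness(question, reference, predicted, call_llm):
    """Ask the judge for a JSON grade and parse it into a boolean,
    so scores aggregate cleanly across examples and models."""
    prompt = (
        "Grade the predicted answer against the reference.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Predicted: {predicted}\n"
        'Respond with JSON only: {"correct": true|false, "reason": "..."}'
    )
    result = json.loads(call_llm(prompt))
    return bool(result["correct"])
```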

What are the two example metrics used for chatbot evaluation, and how are they computed?

The course uses (1) correctness, where the judge compares the predicted answer to the reference answer and returns a boolean, and (2) concision, where the response must be shorter than twice the length of the expected answer. The evaluation function runs the chatbot for each question, then applies both metrics to each generated response.

How does the workflow support comparing multiple LLM models fairly?

The same dataset and the same evaluators are reused across experiments. The chatbot generation function is parameterized by the model name (e.g., GPT-4o mini vs GPT-4 turbo). LangSmith then records aggregate correctness and concision scores per experiment, enabling a direct model ranking under identical evaluation rules.

Why does RAG evaluation require more metrics than chatbot correctness?

RAG quality depends on intermediate steps. Even if a final answer sounds plausible, it may be unsupported by retrieved documents (hallucination) or may not address the question. The course’s RAG metrics include retrieval relevance (are retrieved docs relevant), groundedness (is the answer grounded in retrieved docs), correctness (matches ground truth), and answer relevance (addresses the question).

What does “groundedness” measure in RAG, and what inputs does the judge use?

Groundedness checks whether the generated answer stays within the facts provided by the retrieved documents. The judge compares the response against the retrieved context (the document snippets returned by the retriever) and flags unsupported or hallucinated claims as incorrect.
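To make the inputs concrete, here is a toy groundedness check in the spirit of the course's judge: it compares the answer only against the retrieved context, with no reference answer involved. A real judge is an LLM; this keyword version is purely illustrative:

```python
def grounded(answer, context_docs, stopwords=frozenset({"the", "a", "is", "in", "of"})):
    """Toy groundedness: every content word in the answer must appear
    somewhere in the retrieved context documents."""
    context_words = set(" ".join(context_docs).lower().split())
    claim_words = set(answer.lower().split()) - stopwords
    return claim_words <= context_words
```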

Review Questions

  1. In a chatbot evaluation setup, what exact elements must each dataset record contain to compute correctness?
  2. How would you modify the evaluators if you wanted to penalize overly verbose answers more strongly than the provided concision rule?
  3. For RAG, which metric would you use to detect hallucinations, and what two artifacts does the judge compare?

Key Points

  1. Create evaluation datasets as question/expected-answer pairs so correctness metrics have a ground truth reference.
  2. Use LangSmith datasets and experiments to store test cases, run evaluations, and track results across model variants.
  3. Implement “LLM-as-a-judge” evaluators that return structured outputs (e.g., boolean correctness) for consistency and automation.
  4. For chatbot evaluation, combine factual correctness with additional constraints like concision to reflect real product requirements.
  5. For RAG, evaluate both retrieval and generation by scoring retrieval relevance, groundedness (answer supported by retrieved docs), correctness, and answer relevance.
  6. Run the same RAG pipeline and the same evaluators across multiple model choices to make model selection evidence-based.
  7. Use the LangSmith experiment outputs (scores plus latency/cost signals) to guide iteration on prompts, retrievers, and model selection.

Highlights

Chatbot evaluation becomes repeatable when each question has a reference answer and an LLM judge grades predicted outputs against that ground truth.
The concision metric in the example is intentionally simple—response length must be under 2× the reference length—showing how custom metrics can be lightweight but effective.
RAG evaluation is framed as a multi-stage problem: retrieval relevance and groundedness can fail even when final correctness looks acceptable.
Custom RAG evaluators can be built with structured outputs so LangSmith can aggregate correctness, groundedness, relevance, and retrieval relevance in one place.

Topics

  • LLM Chatbot Evaluation
  • LLM-as-a-Judge
  • LangSmith Experiments
  • RAG Evaluation Metrics
  • Groundedness Scoring
