Mastering LLM Chatbots And RAG Evaluation Crash Course
Based on Krish Naik's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
LLM chatbot and RAG quality can be measured systematically by combining three ingredients: curated test data (inputs plus ground-truth outputs), evaluation metrics that compare model outputs to that ground truth (or to retrieved context), and an “LLM-as-a-judge” layer that produces consistent, structured scores. The practical takeaway is that model choice stops being guesswork: different LLMs can be run against the same dataset, then ranked using the same metrics inside LangSmith.
The crash course starts with the core problem behind evaluation: selecting an LLM, defining what “good” looks like for a specific use case, and building the data needed to compare outputs. For chatbots, the workflow begins by creating a dataset of question/expected-answer pairs. Those examples become the ground truth. Next comes the evaluation step, where an LLM is prompted to grade a generated answer against the reference. In the implementation, LangSmith is used for observability and experiment tracking, while OpenAI is used both to generate chatbot responses and to act as the judge.
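As a minimal sketch of that first step, assuming the LangSmith Python SDK (the dataset name and examples below are illustrative, not taken from the video):

```python
from langsmith import Client

client = Client()  # expects a LangSmith API key in the environment

# Illustrative dataset name and examples -- replace with your own test cases.
dataset = client.create_dataset("chatbot-eval-demo")
client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"question": "What is LangSmith used for?"},
        {"question": "What does RAG stand for?"},
    ],
    outputs=[
        {"answer": "LangSmith is a platform for tracing, monitoring, and evaluating LLM applications."},
        {"answer": "Retrieval-Augmented Generation."},
    ],
)
```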
A concrete example defines two custom metrics. “Correctness” uses an LLM prompt that grades whether the predicted answer is factually correct relative to the ground truth, returning a boolean. “Concision” applies a simple length-based rule: the response must be less than twice the length of the expected answer. An evaluation function then runs the chatbot for each dataset question, collects the model output, and feeds the question, reference output, and predicted output into the judge-based metrics. LangSmith records the results under an experiment name, including per-example comparisons and aggregate scores.
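A sketch of how those two evaluators and the evaluation run could look with LangSmith's `evaluate` API and an OpenAI judge; the prompt wording, model names, and dataset name are assumptions, not the exact code from the course:

```python
from langsmith import evaluate
from openai import OpenAI

openai_client = OpenAI()

def chatbot(inputs: dict) -> dict:
    """Target under test: answer the dataset question with an OpenAI model."""
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return {"answer": resp.choices[0].message.content}

def correctness(run, example) -> dict:
    """LLM-as-a-judge: grade the predicted answer against the ground truth."""
    prompt = (
        f"Question: {example.inputs['question']}\n"
        f"Reference answer: {example.outputs['answer']}\n"
        f"Predicted answer: {run.outputs['answer']}\n"
        "Is the predicted answer factually consistent with the reference? "
        "Reply with exactly CORRECT or INCORRECT."
    )
    verdict = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    return {"key": "correctness", "score": verdict == "CORRECT"}

def concision(run, example) -> dict:
    """Rule-based metric: answer must be shorter than twice the reference answer."""
    short_enough = len(run.outputs["answer"]) < 2 * len(example.outputs["answer"])
    return {"key": "concision", "score": short_enough}

results = evaluate(
    chatbot,
    data="chatbot-eval-demo",            # dataset created above
    evaluators=[correctness, concision],
    experiment_prefix="chatbot-gpt-4o-mini",
)
```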
To show how model selection becomes data-driven, the same evaluation is repeated with different OpenAI models. The course demonstrates running experiments with a smaller model first (GPT-4o mini) and then with a larger one (GPT-4 Turbo), using the same dataset and evaluators. The resulting correctness and concision scores determine which model performs better under the defined criteria.
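Reusing the same dataset and evaluators, the comparison reduces to a loop over model names (the identifiers below are illustrative; swap in whichever models you want to rank):

```python
from functools import partial

def chatbot_with_model(inputs: dict, model: str) -> dict:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": inputs["question"]}],
    )
    return {"answer": resp.choices[0].message.content}

# Same dataset, same evaluators -- only the model under test changes,
# so the resulting LangSmith experiments are directly comparable.
for model in ["gpt-4o-mini", "gpt-4-turbo"]:
    evaluate(
        partial(chatbot_with_model, model=model),
        data="chatbot-eval-demo",
        evaluators=[correctness, concision],
        experiment_prefix=f"chatbot-{model}",
    )
```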
The focus then shifts to RAG evaluation, where “quality” depends on more than final answer correctness. A typical RAG pipeline has a retriever that selects documents and a generator that answers using those documents as context. The course lays out four key evaluation dimensions: retrieval relevance (are the retrieved documents relevant to the question), groundedness (is the answer supported by the retrieved documents, avoiding hallucinations), correctness (does the answer match the ground truth), and answer relevance (does the answer address the question). The workflow again uses LangSmith for dataset and experiment management.
In the RAG setup, web pages are loaded, split into chunks, embedded, and stored in an in-memory vector store to create a retriever. A traced “rag_bot” function retrieves context, injects it into a prompt, and generates an answer via an OpenAI chat model. Test data is created from the source material as question/answer ground truth pairs. Finally, custom evaluators are implemented using structured outputs: correctness is graded against the reference answer, relevance is graded against the question, groundedness is graded against retrieved context, and retrieval relevance is graded by comparing retrieved facts to the question. Running the evaluation sends scores back to LangSmith, producing a dashboard with correctness, groundedness, relevance, and retrieval metrics along with latency and cost signals—turning RAG tuning into an evidence-based process rather than intuition.
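A condensed sketch of those pieces, assuming LangChain's web loader, text splitter, and in-memory vector store, plus a structured-output judge for groundedness; the URL, prompts, chunk sizes, and schema are illustrative, and the other three evaluators follow the same pattern:

```python
from pydantic import BaseModel, Field
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langsmith import traceable

# Retriever: load pages, split into chunks, embed, store in memory.
docs = WebBaseLoader("https://docs.smith.langchain.com/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
retriever = InMemoryVectorStore.from_documents(chunks, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 4}
)

llm = ChatOpenAI(model="gpt-4o-mini")

@traceable  # trace retrieval + generation as one run in LangSmith
def rag_bot(question: str) -> dict:
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    answer = llm.invoke(
        f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    ).content
    return {"answer": answer, "context": context}

class GroundednessGrade(BaseModel):
    """Structured verdict returned by the judge."""
    grounded: bool = Field(description="True only if every claim in the answer is supported by the context")

judge = ChatOpenAI(model="gpt-4o").with_structured_output(GroundednessGrade)

def groundedness(run, example) -> dict:
    """Judge compares the generated answer against the retrieved context only."""
    grade = judge.invoke(
        f"Context:\n{run.outputs['context']}\n\nAnswer:\n{run.outputs['answer']}\n\n"
        "Is the answer fully supported by the context?"
    )
    return {"key": "groundedness", "score": grade.grounded}
```

When wired into `evaluate`, the target would be a thin wrapper that passes the dataset's input dict into `rag_bot` (e.g. `lambda inputs: rag_bot(inputs["question"])`), so the judge can read the answer and retrieved context from the run outputs.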
Cornell Notes
The crash course presents an evaluation workflow for both LLM chatbots and RAG systems built around three steps: (1) create datasets of inputs with ground-truth outputs, (2) run models to generate predictions, and (3) score those predictions using evaluators—often implemented as “LLM-as-a-judge.” For chatbots, it demonstrates custom metrics like correctness (LLM grades predicted vs reference answer) and concision (a length-based constraint). It then compares multiple OpenAI models on the same dataset using LangSmith experiments to choose the better performer. For RAG, it expands evaluation beyond final accuracy to include retrieval relevance, groundedness (answer supported by retrieved documents), correctness, and answer relevance, again using LLM judge evaluators with structured outputs.
Why is dataset construction (inputs + ground truth) the first non-negotiable step in LLM chatbot evaluation?
How does “LLM-as-a-judge” turn qualitative answer quality into measurable scores?
What are the two example metrics used for chatbot evaluation, and how are they computed?
How does the workflow support comparing multiple LLM models fairly?
Why does RAG evaluation require more metrics than chatbot correctness?
What does “groundedness” measure in RAG, and what inputs does the judge use?
Review Questions
- In a chatbot evaluation setup, what exact elements must each dataset record contain to compute correctness?
- How would you modify the evaluators if you wanted to penalize overly verbose answers more strongly than the provided concision rule?
- For RAG, which metric would you use to detect hallucinations, and what two artifacts does the judge compare?
Key Points
1. Create evaluation datasets as question/expected-answer pairs so correctness metrics have a ground truth reference.
2. Use LangSmith datasets and experiments to store test cases, run evaluations, and track results across model variants.
3. Implement “LLM-as-a-judge” evaluators that return structured outputs (e.g., boolean correctness) for consistency and automation.
4. For chatbot evaluation, combine factual correctness with additional constraints like concision to reflect real product requirements.
5. For RAG, evaluate both retrieval and generation by scoring retrieval relevance, groundedness (answer supported by retrieved docs), correctness, and answer relevance.
6. Run the same RAG pipeline and the same evaluators across multiple model choices to make model selection evidence-based.
7. Use the LangSmith experiment outputs (scores plus latency/cost signals) to guide iteration on prompts, retrievers, and model selection.