LLM Evaluation with MLflow and DagsHub for Generative AI Applications
Based on Krish Naik's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
Briefing
LLM evaluation becomes manageable when experiment tracking and metric scoring are centralized in MLflow and pushed to a shared dashboard via DagsHub. Instead of manually comparing outputs from large language models, the workflow logs a model run, generates answers for a test set, scores them with built-in LLM metrics, and stores the results so teams can compare runs side by side.
The setup starts with MLflow as an open-source platform for the end-to-end machine learning lifecycle—tracking experiments, visualizing results, and supporting model management. For generative AI, the key challenge is that LLM performance involves many dimensions (quality, readability, latency, toxicity, and semantic alignment), and those metrics are hard to track consistently across iterations. The approach here uses MLflow’s LLM evaluation tooling to automate that scoring.
A small test dataset is created using a pandas DataFrame with two columns: inputs (questions like “What is mlflow” and “What is spark”) and ground truth answers. The ground truth is produced ahead of time (in the transcript, it’s generated using ChatGPT and pasted in), so evaluation can compare the model’s generated output against a reference.
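A minimal sketch of that dataset, assuming the two questions quoted above; the ground-truth strings here are illustrative stand-ins for the ChatGPT-generated answers mentioned in the transcript:

```python
import pandas as pd

# Evaluation set: one row per test question. The ground-truth answers
# below are illustrative placeholders, not the transcript's exact text.
eval_df = pd.DataFrame(
    {
        "inputs": [
            "What is mlflow?",
            "What is spark?",
        ],
        "ground_truth": [
            "MLflow is an open-source platform for managing the end-to-end "
            "machine learning lifecycle, including experiment tracking and "
            "model management.",
            "Apache Spark is an open-source distributed computing engine "
            "built for large-scale data processing and analytics.",
        ],
    }
)
```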
An MLflow experiment is then started with mlflow.start_run, and the LLM is wrapped as an MLflow model using mlflow.openai.log_model. The example uses GPT-4 with an OpenAI chat-completion task. The logged model includes a system prompt (“answer the following question in two sentences”) and a user message template that injects each question from the evaluation dataset. After the run executes, the evaluation step produces a results table, saved as a CSV, containing the input, the ground-truth target, the model output, and additional scoring fields.
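A sketch of that logging step using MLflow's OpenAI flavor; it assumes mlflow and openai (v1+) are installed and that OPENAI_API_KEY is set in the environment, and it reuses the system prompt from the transcript:

```python
import mlflow
import openai

with mlflow.start_run():
    # Wrap GPT-4 as an MLflow model. The "{question}" placeholder in the
    # user message is filled from the evaluation dataset's inputs column.
    logged_model_info = mlflow.openai.log_model(
        model="gpt-4",
        task=openai.chat.completions,
        artifact_path="model",
        messages=[
            {
                "role": "system",
                "content": "Answer the following question in two sentences",
            },
            {"role": "user", "content": "{question}"},
        ],
    )
```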
Evaluation uses MLflow’s predefined metric suite for question answering. The transcript highlights metrics such as answer similarity, toxicity, latency, and readability-related scores (e.g., Flesch-Kincaid grade level and related indices). MLflow’s evaluation step compares generated answers to the ground truth and aggregates the results into a table, which is then saved as a CSV for inspection.
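A sketch of the evaluation call, assuming logged_model_info and eval_df from the snippets above; passing model_type="question-answering" selects MLflow's predefined QA metric suite:

```python
import mlflow

# Assumes logged_model_info and eval_df from the earlier sketches.
results = mlflow.evaluate(
    logged_model_info.model_uri,
    eval_df,
    targets="ground_truth",
    model_type="question-answering",  # predefined QA metrics
)

# Aggregated scores across the dataset (means, variances, etc.).
print(results.metrics)

# Per-row table with input, target, model output, and per-row scores,
# saved as a CSV for inspection.
eval_table = results.tables["eval_results_table"]
eval_table.to_csv("evaluation_results.csv", index=False)
```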
The workflow then moves from local experimentation to remote collaboration. By configuring a DagsHub-backed MLflow tracking URI, the same evaluation run is logged to a remote repository. Once pushed, the experiments appear in the DagsHub UI, where metrics like answer similarity, grade level, and variance can be reviewed through dashboards. That remote view enables quick comparisons across multiple runs and makes it easier to identify which model settings perform best.
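A sketch of the remote configuration with placeholder DagsHub credentials; the tracking URI follows DagsHub's https://dagshub.com/&lt;user&gt;/&lt;repo&gt;.mlflow convention:

```python
import os
import mlflow

# Placeholders: substitute your DagsHub username, repo name, and token.
os.environ["MLFLOW_TRACKING_USERNAME"] = "<dagshub-username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<dagshub-token>"
mlflow.set_tracking_uri(
    "https://dagshub.com/<dagshub-username>/<repo-name>.mlflow"
)

# Re-running the same evaluation now logs the run to the DagsHub repo,
# where it appears in the shared MLflow UI for side-by-side comparison.
```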
Overall, the core takeaway is a repeatable pipeline: wrap an LLM in MLflow, evaluate it against a labeled set with automated metrics, save the outputs, and publish the experiment results to DagsHub so performance comparisons are transparent and team-accessible.
Cornell Notes
MLflow can turn LLM testing into a structured experiment: log an OpenAI chat model (example: GPT-4), run it on a labeled dataset, and score outputs with built-in question-answering metrics. The workflow builds a pandas DataFrame with inputs and ground-truth answers, starts an MLflow run, and uses mlflow.openai.log_model to store the model artifact and prompts. Then mlflow.evaluate compares generated responses to targets and outputs an aggregated results table plus an evaluation CSV. Finally, configuring a DagsHub MLflow tracking URI pushes experiments to a shared UI, where metrics like answer similarity, toxicity, latency, and readability/grade-level scores can be compared across runs. This matters because it makes LLM evaluation repeatable and auditable rather than ad hoc.
How does the workflow convert an LLM into something MLflow can track and evaluate?
What does “ground truth” mean in this evaluation setup, and how is it produced?
Which metrics are used for question answering, and what do they measure?
How does mlflow.evaluate use the dataset during scoring?
How are local evaluation results published for team review?
Review Questions
- When wrapping GPT-4 with mlflow.openai.log_model, which fields control the prompt behavior and how are questions injected into the user message?
- What is the role of the target column during mlflow.evaluate, and how does it affect metrics like answer similarity?
- How does switching from local MLflow tracking to a DagsHub tracking URI change where evaluation results can be viewed and compared?
Key Points
1. MLflow provides experiment tracking, evaluation, and model artifact logging that can be applied to LLMs, not just traditional ML.
2. A labeled evaluation dataset (inputs plus ground-truth targets) is the foundation for automated LLM scoring.
3. Wrapping an OpenAI chat model with mlflow.openai.log_model (including system and user message templates) makes it evaluable via MLflow.
4. mlflow.evaluate can score question-answering outputs using metrics such as answer similarity, toxicity, latency, and readability/grade-level indices.
5. Evaluation outputs are aggregated into a table and saved as an evaluation CSV for inspection and record-keeping.
6. Configuring a DagsHub-backed MLflow tracking URI pushes experiments to a shared dashboard for run-to-run comparison.
7. Remote dashboards make it easier to identify the best-performing LLM settings using consistent metrics across iterations.