Monitoring ML Models & Data in Production
Based on a WhyLabs workshop video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Monitor ML systems by logging input data profiles and model outputs continuously, not only accuracy after labels arrive.
Briefing
ML monitoring in production hinges on catching distribution and quality problems early—before they quietly degrade model performance. The session lays out a practical workflow: profile incoming data, log model inputs and predictions over time, and trigger alerts when new data diverges from training (or from recent history). It also shows how to validate those alerts with ground truth once labels arrive, so teams can distinguish “data drift” from “data quality” issues and decide whether retraining is needed.
The core monitoring targets are fourfold. First is **data drift**, where the incoming feature distribution shifts away from what the model saw during training—often after the model is deployed and the environment changes. Second is **data quality**, including missing values, schema or pipeline issues, sensor/camera problems, or manual entry errors. Third is **concept drift**, where the real-world relationship the model learned becomes stale (e.g., a housing-price model trained years ago). Fourth is **model performance**, measured with metrics like accuracy, precision/recall, F1, and confusion matrices once ground truth is available.
To make those ideas concrete, the workshop uses a simple baseline model trained on the **Iris dataset** with **scikit-learn’s k-nearest neighbors classifier** (k=5). After training reaches roughly 97–98% accuracy on a held-out test split, the notebook simulates a week of production by generating multiple daily batches of input features and running predictions. The key move is logging lightweight statistical “profiles” of the input data using **whylogs** (an open-source library). Those profiles include cardinality estimates, counts of nulls/invalids, and distribution summaries (mean/median/min/max and quantiles). By writing these profiles to **WhyLabs**, the system can visualize how feature distributions evolve day by day.
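A minimal sketch of that setup, assuming the whylogs v1 Python API and illustrative variable names rather than the workshop’s exact notebook code:

```python
# Sketch: train the baseline KNN model and profile one simulated "day" of inputs.
# Assumes whylogs v1 (`pip install whylogs scikit-learn pandas`); uploading
# profiles to the WhyLabs platform additionally needs org/dataset IDs and an
# API key, which are not shown here.
import pandas as pd
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # the session reports ~97-98%

# Simulate one daily production batch (here simply a bootstrap resample of the
# test features; the workshop generates a week of such batches).
daily_batch = X_test.sample(frac=1.0, replace=True, random_state=0)

# Profile the batch: whylogs records counts, null/type stats, cardinality
# estimates, and distribution summaries without keeping the raw rows.
results = why.log(daily_batch)
summary = results.view().to_pandas()
print(summary.filter(like="distribution/"))  # mean/stddev/quantiles per feature

# With credentials configured, the same result can be pushed to WhyLabs, e.g.:
# results.writer("whylabs").write()
```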
The session then demonstrates how anomalies appear in practice. Some simulated days show dramatic changes—such as a feature distribution jumping sharply, or a batch where values collapse to all zeros—patterns consistent with drift or quality failures. Instead of manually inspecting plots, the workshop configures **WhyLabs Monitor Manager** presets for data drift using a drift-distance threshold (based on **Hellinger distance**). It shows two alerting strategies: comparing against a rolling window of recent days, and comparing against a **reference profile** built from training data. When the drift distance crosses the configured sensitivity threshold, alerts are previewed immediately.
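The monitors themselves are configured inside the WhyLabs platform rather than in notebook code, but the underlying mechanic can be illustrated standalone: bin a reference sample and a new batch on shared edges, compute a Hellinger distance, and alert when it exceeds a sensitivity threshold. The threshold of 0.7 below is an assumption for illustration, not necessarily the platform’s preset value.

```python
# Standalone illustration of drift-distance alerting (not the WhyLabs internals).
import numpy as np

def hellinger_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions: 0 = identical, 1 = disjoint."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def drift_alert(reference: np.ndarray, batch: np.ndarray, bins: int = 20,
                threshold: float = 0.7) -> bool:
    # Bin both samples on shared edges spanning the union of their ranges,
    # then compare the binned distributions. 1e-9 smooths empty bins.
    edges = np.histogram_bin_edges(np.concatenate([reference, batch]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    batch_hist, _ = np.histogram(batch, bins=edges)
    return hellinger_distance(ref_hist + 1e-9, batch_hist + 1e-9) >= threshold

# A "normal" day versus an obviously shifted day for a single feature
rng = np.random.default_rng(0)
reference = rng.normal(5.8, 0.8, size=1000)   # e.g. sepal length as seen in training
normal_day = rng.normal(5.8, 0.8, size=200)
drifted_day = rng.normal(9.0, 0.3, size=200)

print(drift_alert(reference, normal_day))   # False -- within sampling noise
print(drift_alert(reference, drifted_day))  # True  -- distance near 1.0
```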
Monitoring doesn’t stop at inputs. The workshop logs **model outputs** by sending class predictions and probability outputs to the WhyLabs “output” tab (using a naming convention that requires “output” in feature names). This reveals why certain predictions look suspicious: with KNN, out-of-distribution or invalid inputs still get classified based on the nearest neighbors—even if the input should not exist in the real data. The “all zeros” batch, for example, produces confident predictions because the nearest-neighbor set is internally consistent.
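Continuing the earlier sketch (reusing `model` and `daily_batch`), output logging might look like the following. The column names are illustrative; the only requirement taken from the session is that output columns contain “output” so they land in the outputs view.

```python
# Sketch: log class predictions and confidence alongside the day's inputs.
import pandas as pd
import whylogs as why

proba = model.predict_proba(daily_batch)           # shape: (n_samples, n_classes)
outputs = pd.DataFrame({
    "output_prediction": model.predict(daily_batch),
    "output_max_probability": proba.max(axis=1),   # confidence of the chosen class
})

# Log inputs and outputs in one profile so every daily snapshot carries both;
# columns containing "output" are routed to the outputs view in WhyLabs.
combined = pd.concat([daily_batch.reset_index(drop=True), outputs], axis=1)
results = why.log(combined)
```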
Finally, ground truth is appended to the logged data so the system can compute **performance metrics** over time. Accuracy and other measures remain high for normal days, then drop sharply on the drifted/invalid batch. The session also highlights a useful operational signal: discrepancies between logged inputs and logged outputs can indicate pipeline or logging failures, not just model issues.
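A brief sketch of that step with scikit-learn, again reusing names from the earlier snippets; because the simulated batch here is resampled from the labeled test split, its ground truth can be looked up directly:

```python
# Sketch: score a logged daily batch once its ground-truth labels are available.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred_day = model.predict(daily_batch)
y_true_day = y_test.loc[daily_batch.index]   # labels aligned to the resampled batch

print("accuracy:", accuracy_score(y_true_day, y_pred_day))
print(confusion_matrix(y_true_day, y_pred_day))
print(classification_report(y_true_day, y_pred_day))  # precision/recall/F1 per class
```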
Overall, the workflow emphasizes a layered defense: monitor inputs and predictions continuously, alert on drift/quality anomalies, and validate impact with labeled performance—so teams can respond with targeted fixes or retraining rather than guessing after the damage is done.
Cornell Notes
The workshop presents a production-ready approach to ML monitoring: profile incoming data, log model inputs and predictions over time, and trigger alerts when distributions shift. Using whylogs and WhyLabs, it simulates a week of Iris-model traffic and shows how drift and data-quality anomalies surface as changes in feature profiles (e.g., sharp median/max jumps or all-zero batches). It then logs prediction outputs (class and probability) to explain suspicious behavior—especially how KNN can still classify out-of-distribution inputs based on nearest neighbors. Once ground truth is added, performance metrics (accuracy, precision/recall, confusion matrix) confirm which anomalies actually hurt model quality, enabling informed retraining decisions.
What monitoring signals matter most in an ML system once it’s deployed?
How does the workshop detect data drift in practice?
Why can model outputs look confident even when inputs are clearly wrong?
What role does ground truth play in turning alerts into actionable decisions?
How can monitoring expose pipeline or logging failures, not just model failures?
Review Questions
- When would a reference-profile drift monitor be preferable to a rolling-window monitor?
- What specific input-profile patterns in the workshop correspond to data drift versus data quality issues?
- How does KNN’s nearest-neighbor behavior influence the interpretation of confident probability outputs on invalid inputs?
Key Points
1. Monitor ML systems by logging input data profiles and model outputs continuously, not only accuracy after labels arrive.
2. Use drift detection against either a rolling window or a training-based reference profile; tune sensitivity via drift-distance thresholds (e.g., Hellinger distance).
3. Treat data quality failures (like all-zero batches or invalid ranges) as first-class monitoring events, since they can trigger misleadingly confident predictions.
4. Log predictions with clear naming so outputs land in the correct WhyLabs output tab (the workshop uses an “output” naming convention).
5. Validate drift alerts with ground truth performance metrics (accuracy, precision/recall, F1, confusion matrix) to decide whether retraining is needed.
6. Watch for operational anomalies such as missing outputs despite logged inputs, which can indicate pipeline or logging failures.
7. Build a layered response: alert on drift/quality, investigate inputs/outputs, then label and measure impact before taking corrective action.