Monitoring ML Models & Data in Production
Based on a WhyLabs workshop video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Monitor ML systems by logging input data profiles and model outputs continuously, not only accuracy after labels arrive.
Briefing
ML monitoring in production hinges on catching distribution and quality problems early—before they quietly degrade model performance. The session lays out a practical workflow: profile incoming data, log model inputs and predictions over time, and trigger alerts when new data diverges from training (or from recent history). It also shows how to validate those alerts with ground truth once labels arrive, so teams can distinguish “data drift” from “data quality” issues and decide whether retraining is needed.
The core monitoring targets are fourfold. First is **data drift**, where the incoming feature distribution shifts away from what the model saw during training—often after the model is deployed and the environment changes. Second is **data quality**, including missing values, schema or pipeline issues, sensor/camera problems, or manual entry errors. Third is **concept drift**, where the real-world relationship the model learned becomes stale (e.g., a housing-price model trained years ago). Fourth is **model performance**, measured with metrics like accuracy, precision/recall, F1, and confusion matrices once ground truth is available.
To make those ideas concrete, the workshop uses a simple baseline model trained on the **Iris dataset** with **scikit-learn’s k-nearest neighbors classifier** (k=5). After training reaches roughly 97–98% accuracy on a held-out test split, the notebook simulates a week of production by generating multiple daily batches of input features and running predictions. The key move is logging lightweight statistical “profiles” of the input data using **whylogs** (an open-source library). Those profiles include cardinality estimates, counts of nulls/invalids, and distribution summaries (mean/median/min/max and quantiles). By writing these profiles to **WhyLabs**, the system can visualize how feature distributions evolve day by day.
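A minimal sketch of that setup, assuming the whylogs v1 Python API and illustrative variable names rather than the workshop’s exact notebook code:

```python
# Sketch: train the baseline KNN model and profile one simulated "day" of inputs.
# Assumes whylogs v1 (`pip install whylogs scikit-learn pandas`); uploading
# profiles to the WhyLabs platform additionally needs org/dataset IDs and an
# API key, which are not shown here.
import pandas as pd
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # the session reports ~97-98%

# Simulate one daily production batch (here simply a bootstrap resample of the
# test features; the workshop generates a week of such batches).
daily_batch = X_test.sample(frac=1.0, replace=True, random_state=0)

# Profile the batch: whylogs records counts, null/type stats, cardinality
# estimates, and distribution summaries without keeping the raw rows.
results = why.log(daily_batch)
summary = results.view().to_pandas()
print(summary.filter(like="distribution/"))  # mean/stddev/quantiles per feature

# With credentials configured, the same result can be pushed to WhyLabs, e.g.:
# results.writer("whylabs").write()
```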
The session then demonstrates how anomalies appear in practice. Some simulated days show dramatic changes—such as a feature distribution jumping sharply, or a batch where values collapse to all zeros—patterns consistent with drift or quality failures. Instead of manually inspecting plots, the workshop configures **WhyLabs Monitor Manager** presets for data drift using a drift-distance threshold (based on **Hellinger distance**). It shows two alerting strategies: comparing against a rolling window of recent days, and comparing against a **reference profile** built from training data. When the drift distance crosses the configured sensitivity threshold, alerts are previewed immediately.
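The monitors themselves are configured inside the WhyLabs platform rather than in notebook code, but the underlying mechanic can be illustrated standalone: bin a reference sample and a new batch on shared edges, compute a Hellinger distance, and alert when it exceeds a sensitivity threshold. The threshold of 0.7 below is an assumption for illustration, not necessarily the platform’s preset value.

```python
# Standalone illustration of drift-distance alerting (not the WhyLabs internals).
import numpy as np

def hellinger_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Hellinger distance between two discrete distributions: 0 = identical, 1 = disjoint."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def drift_alert(reference: np.ndarray, batch: np.ndarray, bins: int = 20,
                threshold: float = 0.7) -> bool:
    # Bin both samples on shared edges spanning the union of their ranges,
    # then compare the binned distributions. 1e-9 smooths empty bins.
    edges = np.histogram_bin_edges(np.concatenate([reference, batch]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    batch_hist, _ = np.histogram(batch, bins=edges)
    return hellinger_distance(ref_hist + 1e-9, batch_hist + 1e-9) >= threshold

# A "normal" day versus an obviously shifted day for a single feature
rng = np.random.default_rng(0)
reference = rng.normal(5.8, 0.8, size=1000)   # e.g. sepal length as seen in training
normal_day = rng.normal(5.8, 0.8, size=200)
drifted_day = rng.normal(9.0, 0.3, size=200)

print(drift_alert(reference, normal_day))   # False -- within sampling noise
print(drift_alert(reference, drifted_day))  # True  -- distance near 1.0
```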
Monitoring doesn’t stop at inputs. The workshop logs **model outputs** by sending class predictions and probability outputs to the WhyLabs “output” tab (using a naming convention that requires “output” in feature names). This reveals why certain predictions look suspicious: with KNN, out-of-distribution or invalid inputs still get classified based on the nearest neighbors—even if the input should not exist in the real data. The “all zeros” batch, for example, produces confident predictions because the nearest-neighbor set is internally consistent.
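Continuing the earlier sketch (reusing `model` and `daily_batch`), output logging might look like the following. The column names are illustrative; the only requirement taken from the session is that output columns contain “output” so they land in the outputs view.

```python
# Sketch: log class predictions and confidence alongside the day's inputs.
import pandas as pd
import whylogs as why

proba = model.predict_proba(daily_batch)           # shape: (n_samples, n_classes)
outputs = pd.DataFrame({
    "output_prediction": model.predict(daily_batch),
    "output_max_probability": proba.max(axis=1),   # confidence of the chosen class
})

# Log inputs and outputs in one profile so every daily snapshot carries both;
# columns containing "output" are routed to the outputs view in WhyLabs.
combined = pd.concat([daily_batch.reset_index(drop=True), outputs], axis=1)
results = why.log(combined)
```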
Finally, ground truth is appended to the logged data so the system can compute **performance metrics** over time. Accuracy and other measures remain high for normal days, then drop sharply on the drifted/invalid batch. The session also highlights a useful operational signal: discrepancies between logged inputs and logged outputs can indicate pipeline or logging failures, not just model issues.
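A brief sketch of that step with scikit-learn, again reusing names from the earlier snippets; because the simulated batch here is resampled from the labeled test split, its ground truth can be looked up directly:

```python
# Sketch: score a logged daily batch once its ground-truth labels are available.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred_day = model.predict(daily_batch)
y_true_day = y_test.loc[daily_batch.index]   # labels aligned to the resampled batch

print("accuracy:", accuracy_score(y_true_day, y_pred_day))
print(confusion_matrix(y_true_day, y_pred_day))
print(classification_report(y_true_day, y_pred_day))  # precision/recall/F1 per class
```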
Overall, the workflow emphasizes a layered defense: monitor inputs and predictions continuously, alert on drift/quality anomalies, and validate impact with labeled performance—so teams can respond with targeted fixes or retraining rather than guessing after the damage is done.
Cornell Notes
The workshop presents a production-ready approach to ML monitoring: profile incoming data, log model inputs and predictions over time, and trigger alerts when distributions shift. Using whylogs and WhyLabs, it simulates a week of Iris-model traffic and shows how drift and data-quality anomalies surface as changes in feature profiles (e.g., sharp median/max jumps or all-zero batches). It then logs prediction outputs (class and probability) to explain suspicious behavior—especially how KNN can still classify out-of-distribution inputs based on nearest neighbors. Once ground truth is added, performance metrics (accuracy, precision/recall, confusion matrix) confirm which anomalies actually hurt model quality, enabling informed retraining decisions.
What monitoring signals matter most in an ML system once it’s deployed?
How does the workshop detect data drift in practice?
Why can model outputs look confident even when inputs are clearly wrong?
What role does ground truth play in turning alerts into actionable decisions?
How can monitoring expose pipeline or logging failures, not just model failures?
Review Questions
- When would a reference-profile drift monitor be preferable to a rolling-window monitor?
- What specific input-profile patterns in the workshop correspond to data drift versus data quality issues?
- How does KNN’s nearest-neighbor behavior influence the interpretation of confident probability outputs on invalid inputs?
Key Points
1. Monitor ML systems by logging input data profiles and model outputs continuously, not only accuracy after labels arrive.
2. Use drift detection against either a rolling window or a training-based reference profile; tune sensitivity via drift-distance thresholds (e.g., Hellinger distance).
3. Treat data quality failures (like all-zero batches or invalid ranges) as first-class monitoring events, since they can trigger misleadingly confident predictions.
4. Log predictions with clear naming so outputs land in the correct WhyLabs output tab (the workshop uses an “output” naming convention).
5. Validate drift alerts with ground truth performance metrics (accuracy, precision/recall, F1, confusion matrix) to decide whether retraining is needed.
6. Watch for operational anomalies such as missing outputs despite logged inputs, which can indicate pipeline or logging failures.
7. Build a layered response: alert on drift/quality, investigate inputs/outputs, then label and measure impact before taking corrective action.