Intro to ML Monitoring: Data Drift, Quality, Bias and Explainability
Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
ML monitoring is positioned as the practical way to catch “bad data” and model failures early—by tracking data drift, data quality, bias across segments, and explainability signals in production—then triggering alerts or workflows before users feel the impact. The workshop frames monitoring as an end-to-end loop: collect AI telemetry from inputs/outputs, compare it to training references (or rolling windows), measure drift and quality constraints, and use those findings to investigate, retrain, or mitigate.
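To make that loop concrete before the tool-specific steps, here is a runnable toy sketch. The helpers below (a mean/stddev summary and a scaled mean-shift score) are simple stand-ins rather than any particular library’s API, and the threshold is an assumed example value.

```python
import numpy as np

DRIFT_THRESHOLD = 0.5  # assumed example value; tune per feature and use case

def profile_batch(batch: np.ndarray) -> dict:
    """Collect lightweight telemetry: summary statistics instead of raw rows."""
    return {"mean": float(batch.mean()), "std": float(batch.std())}

def drift_score(reference: dict, current: dict) -> float:
    """Toy drift measure: shift of the batch mean, scaled by the reference spread."""
    return abs(current["mean"] - reference["mean"]) / (reference["std"] + 1e-9)

rng = np.random.default_rng(0)
reference_profile = profile_batch(rng.normal(5.0, 0.5, size=1000))  # training reference

# Simulated production days: the last batch drifts badly.
days = [rng.normal(5.0, 0.5, 200), rng.normal(5.1, 0.5, 200), rng.normal(9.0, 0.5, 200)]
for day, batch in enumerate(days):
    score = drift_score(reference_profile, profile_batch(batch))
    if score > DRIFT_THRESHOLD:
        print(f"day {day}: drift score {score:.2f} exceeds threshold; investigate or retrain")
```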
A core theme is that production models rarely fail in a single dramatic way; instead, they degrade as inputs shift or pipelines break. Common failure modes include data drift (input distributions changing over time), data quality problems (sensor issues, manual entry errors, schema changes), and concept drift (the relationship between inputs and the real-world target changes). The session emphasizes that even strong offline evaluation can’t prevent these issues, because real environments introduce new conditions—seasonal changes, shifting user behavior, or upstream library updates that subtly alter schemas.
To make monitoring concrete, the workshop uses Google Colab with scikit-learn’s Iris dataset and a K-nearest neighbors classifier as a stand-in production model. whylogs, the open-source library WhyLabs maintains, generates compact “profiles” (summary statistics rather than raw data) for each batch of inputs. Those profiles are logged to the WhyLabs platform to build time series views, detect drift, and support alerting. In the hands-on portion, the Iris data is split into seven simulated “days” of production batches, and the platform reveals anomalies like batches with extreme feature values, batches of all zeros, and distribution shifts that aren’t obvious from a single snapshot.
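A compact sketch of that setup, assuming whylogs v1 and scikit-learn are installed in the Colab environment (pip install whylogs scikit-learn); the seven-day slicing below is illustrative rather than the workshop’s exact notebook code.

```python
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Stand-in "production" model used throughout the workshop.
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Split the rows into seven simulated daily batches and profile each one.
daily_batches = [X.iloc[day::7] for day in range(7)]
profile_views = []
for day, batch in enumerate(daily_batches):
    result = why.log(batch)  # builds a statistical profile, not a copy of the raw rows
    profile_views.append(result.profile().view())
    print(f"day {day}: profiled {len(batch)} rows")

# Each profile view summarizes counts, types, and distributions per column.
print(profile_views[0].to_pandas().head())
```

In the session, each day’s profile is then uploaded to the WhyLabs platform so the per-day statistics line up as a time series and can feed drift monitors and alerts.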
Monitoring is demonstrated in layers. First comes input monitoring: a reference profile is created from training data, then a drift monitor compares new batches against that reference using Hellinger distance. Alerts trigger when drift exceeds a chosen threshold, catching cases where medians may look similar but distribution tails shift dramatically. Next comes output monitoring: predicted class distributions and confidence/probability outputs are logged and segmented into an “outputs” view. Even without ground truth, output drift can signal trouble—such as the model collapsing toward a single class when inputs are out-of-range.
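The sketch below illustrates the Hellinger-distance comparison itself in plain NumPy rather than the platform’s monitor implementation; the 0.7 threshold is an assumed example value, not necessarily the one used in the session.

```python
import numpy as np

def hellinger(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    """Hellinger distance between two histograms on shared bins (0 = identical, ~1 = disjoint)."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def drift_alert(reference: np.ndarray, batch: np.ndarray, bins: int = 30, threshold: float = 0.7) -> bool:
    # Bin both samples over their combined range so out-of-range values still count.
    edges = np.histogram_bin_edges(np.concatenate([reference, batch]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    batch_hist, _ = np.histogram(batch, bins=edges)
    distance = hellinger(ref_hist + 1e-9, batch_hist + 1e-9)  # tiny smoothing avoids 0/0
    print(f"Hellinger distance = {distance:.3f}")
    return distance > threshold

rng = np.random.default_rng(42)
reference = rng.normal(loc=5.0, scale=0.5, size=1000)   # e.g., petal length in training
all_zeros_day = np.zeros(150)                           # the "all zeros" anomaly from the demo
print("alert:", drift_alert(reference, all_zeros_day))  # near-disjoint supports -> distance near 1
```

The same comparison works on outputs: logging the predicted class distribution and confidence scores per batch surfaces a collapse toward a single class even before ground truth arrives.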
When ground truth is added, performance monitoring becomes possible. The platform logs classification metrics (accuracy and related scores) per batch, showing sharp drops on the anomalous days. The workshop also highlights segment-level monitoring for fairness: using an additional “state” attribute (where flowers were grown), it compares metrics like accuracy and false positive rate across segments (Florida, Missouri, Washington). Tracing views make it easier to see that one segment can underperform even when overall metrics look acceptable.
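A small sketch of the segment-level check using scikit-learn metrics; the data frame below is a hypothetical stand-in for one production batch with late-arriving labels and the workshop’s “state” segment column.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical batch joined with ground truth; values are made up for illustration.
batch = pd.DataFrame({
    "state":      ["FL", "FL", "FL", "MO", "MO", "MO", "WA", "WA", "WA"],
    "prediction": [0, 1, 2, 0, 2, 2, 1, 1, 1],
    "label":      [0, 1, 2, 0, 1, 2, 1, 0, 2],
})

# Overall metrics can look fine while a single segment quietly underperforms.
print("overall accuracy:", round(accuracy_score(batch["label"], batch["prediction"]), 2))

for state, rows in batch.groupby("state"):
    acc = accuracy_score(rows["label"], rows["prediction"])
    f1 = f1_score(rows["label"], rows["prediction"], average="macro", zero_division=0)
    print(f"{state}: accuracy={acc:.2f}  macro-F1={f1:.2f}  n={len(rows)}")
```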
Finally, explainability is introduced through SHAP. Feature importance plots identify which inputs drive predictions (in this simplified setup, petal length and petal width dominate), and those importance signals are stored in the platform’s explainability tab. The session closes with local drift and data validation tooling using whylogs constraints, effectively unit tests for data quality (e.g., feature ranges like petal width must be >0 and <15). If constraints fail, the workflow can block predictions or trigger alerts, turning monitoring into an actionable guardrail rather than a passive dashboard.
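A minimal data-validation sketch, assuming whylogs v1’s constraints builder and its greater_than_number/smaller_than_number factory helpers (names may differ across whylogs versions); the 0-to-15 bounds mirror the petal-width rule above.

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, smaller_than_number

batch = pd.DataFrame({"petal width (cm)": [0.2, 1.3, 2.5, 0.4]})
profile_view = why.log(batch).profile().view()

# "Unit tests for data": petal width must stay inside a plausible physical range.
builder = ConstraintsBuilder(profile_view)
builder.add_constraint(greater_than_number(column_name="petal width (cm)", number=0))
builder.add_constraint(smaller_than_number(column_name="petal width (cm)", number=15))
constraints = builder.build()

for result in constraints.generate_constraints_report():
    print(result)

# Turn the check into a guardrail: block predictions or alert when a batch fails.
if not constraints.validate():
    raise ValueError("Data quality constraints failed; hold predictions for this batch.")
```

On a failing batch (say, a negative or 20 cm petal width), validate() returns False and the guard raises, which is the “block predictions or trigger alerts” behavior described above.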
Cornell Notes
The workshop argues that reliable ML monitoring requires more than tracking accuracy. It combines input drift detection, output monitoring, performance metrics (when ground truth exists), segment-level fairness checks, and explainability signals. Using whylogs, the session generates lightweight “profiles” (summary statistics) for each batch of data and logs them to WhyLabs to visualize changes over time and trigger drift alerts. A K-nearest neighbors model trained on scikit-learn’s Iris dataset is then stress-tested with seven simulated production batches, revealing anomalies like extreme values and all-zero inputs. Adding ground truth enables accuracy/F1-style monitoring, while SHAP and constraints support investigation and automated data quality gating.
Why does monitoring need to track both inputs and outputs, not just model accuracy?
How does the workshop detect data drift in practice?
What’s the difference between data drift and concept drift, and how are they illustrated?
How does segment-level monitoring support bias and fairness checks?
What role does SHAP play in explainability monitoring?
How do constraints turn monitoring into automated data quality enforcement?
Review Questions
- In the demo, what specific drift mechanism and threshold were used to trigger alerts, and what does changing the threshold do?
- How would you design a monitoring plan for a model where ground truth arrives late or not at all?
- What kinds of segment definitions (columns) would you choose to evaluate fairness in a credit, healthcare, or recommender system?
Key Points
1. Treat ML monitoring as an operational loop: collect telemetry, compare against training references or rolling windows, measure drift/quality/performance, then trigger investigation or workflows.
2. Expect data drift and data quality failures in production even after strong offline evaluation; schema changes and upstream sensor/input issues are common.
3. Use whylogs profiles to log lightweight summary statistics (not raw data) so drift and quality checks remain feasible in constrained environments like healthcare or fintech.
4. Set up input monitoring first (direct model inputs), then add output monitoring to catch issues when ground truth isn’t immediately available.
5. When ground truth is available, log performance metrics per batch and alert on accuracy/F1-style drops to quantify impact.
6. Add segment-level tracing (e.g., demographic or categorical group columns) to detect fairness problems that overall metrics can mask.
7. Use SHAP feature importance and constraints to prioritize what to monitor and to enforce automated data quality gates before predictions.