Intro to ML Monitoring: Data Drift, Quality, Bias and Explainability
Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
ML monitoring is positioned as the practical way to catch “bad data” and model failures early—by tracking data drift, data quality, bias across segments, and explainability signals in production—then triggering alerts or workflows before users feel the impact. The workshop frames monitoring as an end-to-end loop: collect AI telemetry from inputs/outputs, compare it to training references (or rolling windows), measure drift and quality constraints, and use those findings to investigate, retrain, or mitigate.
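To make that loop concrete before the tool-specific steps, here is a runnable toy sketch. The helpers below (a mean/stddev summary and a scaled mean-shift score) are simple stand-ins rather than any particular library’s API, and the threshold is an assumed example value.

```python
import numpy as np

DRIFT_THRESHOLD = 0.5  # assumed example value; tune per feature and use case

def profile_batch(batch: np.ndarray) -> dict:
    """Collect lightweight telemetry: summary statistics instead of raw rows."""
    return {"mean": float(batch.mean()), "std": float(batch.std())}

def drift_score(reference: dict, current: dict) -> float:
    """Toy drift measure: shift of the batch mean, scaled by the reference spread."""
    return abs(current["mean"] - reference["mean"]) / (reference["std"] + 1e-9)

rng = np.random.default_rng(0)
reference_profile = profile_batch(rng.normal(5.0, 0.5, size=1000))  # training reference

# Simulated production days: the last batch drifts badly.
days = [rng.normal(5.0, 0.5, 200), rng.normal(5.1, 0.5, 200), rng.normal(9.0, 0.5, 200)]
for day, batch in enumerate(days):
    score = drift_score(reference_profile, profile_batch(batch))
    if score > DRIFT_THRESHOLD:
        print(f"day {day}: drift score {score:.2f} exceeds threshold; investigate or retrain")
```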
A core theme is that production models rarely fail in a single dramatic way; instead, they degrade as inputs shift or pipelines break. Common failure modes include data drift (input distributions changing over time), data quality problems (sensor issues, manual entry errors, schema changes), and concept drift (the relationship between inputs and the real-world target changes). The session emphasizes that even strong offline evaluation can’t prevent these issues, because real environments introduce new conditions—seasonal changes, shifting user behavior, or upstream library updates that subtly alter schemas.
To make monitoring concrete, the workshop uses Google Colab with scikit-learn’s Iris dataset and a K-nearest neighbors classifier as a stand-in production model. whylogs, the open-source library WhyLabs maintains, generates compact “profiles” (summary statistics rather than raw data) for each batch of inputs. Those profiles are logged to the WhyLabs platform to build time series views, detect drift, and support alerting. In the hands-on portion, the Iris data is split into seven simulated “days” of production batches, and the platform reveals anomalies like batches with extreme feature values, batches of all zeros, and distribution shifts that aren’t obvious from a single snapshot.
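A compact sketch of that setup, assuming whylogs v1 and scikit-learn are installed in the Colab environment (pip install whylogs scikit-learn); the seven-day slicing below is illustrative rather than the workshop’s exact notebook code.

```python
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Stand-in "production" model used throughout the workshop.
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Split the rows into seven simulated daily batches and profile each one.
daily_batches = [X.iloc[day::7] for day in range(7)]
profile_views = []
for day, batch in enumerate(daily_batches):
    result = why.log(batch)  # builds a statistical profile, not a copy of the raw rows
    profile_views.append(result.profile().view())
    print(f"day {day}: profiled {len(batch)} rows")

# Each profile view summarizes counts, types, and distributions per column.
print(profile_views[0].to_pandas().head())
```

In the session, each day’s profile is then uploaded to the WhyLabs platform so the per-day statistics line up as a time series and can feed drift monitors and alerts.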
Monitoring is demonstrated in layers. First comes input monitoring: a reference profile is created from training data, then a drift monitor compares new batches against that reference using Hellinger distance. Alerts trigger when drift exceeds a chosen threshold, catching cases where medians may look similar but distribution tails shift dramatically. Next comes output monitoring: predicted class distributions and confidence/probability outputs are logged and segmented into an “outputs” view. Even without ground truth, output drift can signal trouble—such as the model collapsing toward a single class when inputs are out-of-range.
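The sketch below illustrates the Hellinger-distance comparison itself in plain NumPy rather than the platform’s monitor implementation; the 0.7 threshold is an assumed example value, not necessarily the one used in the session.

```python
import numpy as np

def hellinger(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    """Hellinger distance between two histograms on shared bins (0 = identical, ~1 = disjoint)."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def drift_alert(reference: np.ndarray, batch: np.ndarray, bins: int = 30, threshold: float = 0.7) -> bool:
    # Bin both samples over their combined range so out-of-range values still count.
    edges = np.histogram_bin_edges(np.concatenate([reference, batch]), bins=bins)
    ref_hist, _ = np.histogram(reference, bins=edges)
    batch_hist, _ = np.histogram(batch, bins=edges)
    distance = hellinger(ref_hist + 1e-9, batch_hist + 1e-9)  # tiny smoothing avoids 0/0
    print(f"Hellinger distance = {distance:.3f}")
    return distance > threshold

rng = np.random.default_rng(42)
reference = rng.normal(loc=5.0, scale=0.5, size=1000)   # e.g., petal length in training
all_zeros_day = np.zeros(150)                           # the "all zeros" anomaly from the demo
print("alert:", drift_alert(reference, all_zeros_day))  # near-disjoint supports -> distance near 1
```

The same comparison works on outputs: logging the predicted class distribution and confidence scores per batch surfaces a collapse toward a single class even before ground truth arrives.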
When ground truth is added, performance monitoring becomes possible. The platform logs classification metrics (accuracy and related scores) per batch, showing sharp drops on the anomalous days. The workshop also highlights segment-level monitoring for fairness: using an additional “state” attribute (where flowers were grown), it compares metrics like accuracy and false positive rate across segments (Florida, Missouri, Washington). Tracing views make it easier to see that one segment can underperform even when overall metrics look acceptable.
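A small sketch of the segment-level check using scikit-learn metrics; the data frame below is a hypothetical stand-in for one production batch with late-arriving labels and the workshop’s “state” segment column.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical batch joined with ground truth; values are made up for illustration.
batch = pd.DataFrame({
    "state":      ["FL", "FL", "FL", "MO", "MO", "MO", "WA", "WA", "WA"],
    "prediction": [0, 1, 2, 0, 2, 2, 1, 1, 1],
    "label":      [0, 1, 2, 0, 1, 2, 1, 0, 2],
})

# Overall metrics can look fine while a single segment quietly underperforms.
print("overall accuracy:", round(accuracy_score(batch["label"], batch["prediction"]), 2))

for state, rows in batch.groupby("state"):
    acc = accuracy_score(rows["label"], rows["prediction"])
    f1 = f1_score(rows["label"], rows["prediction"], average="macro", zero_division=0)
    print(f"{state}: accuracy={acc:.2f}  macro-F1={f1:.2f}  n={len(rows)}")
```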
Finally, explainability is introduced through SHAP. Feature importance plots identify which inputs drive predictions (in this simplified setup, petal length and petal width dominate), and those importance signals are stored in the platform’s explainability tab. The session closes with local drift and data validation tooling using whylogs constraints, effectively unit tests for data quality (e.g., feature ranges like petal width must be >0 and <15). If constraints fail, the workflow can block predictions or trigger alerts, turning monitoring into an actionable guardrail rather than a passive dashboard.
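A minimal data-validation sketch, assuming whylogs v1’s constraints builder and its greater_than_number/smaller_than_number factory helpers (names may differ across whylogs versions); the 0-to-15 bounds mirror the petal-width rule above.

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, smaller_than_number

batch = pd.DataFrame({"petal width (cm)": [0.2, 1.3, 2.5, 0.4]})
profile_view = why.log(batch).profile().view()

# "Unit tests for data": petal width must stay inside a plausible physical range.
builder = ConstraintsBuilder(profile_view)
builder.add_constraint(greater_than_number(column_name="petal width (cm)", number=0))
builder.add_constraint(smaller_than_number(column_name="petal width (cm)", number=15))
constraints = builder.build()

for result in constraints.generate_constraints_report():
    print(result)

# Turn the check into a guardrail: block predictions or alert when a batch fails.
if not constraints.validate():
    raise ValueError("Data quality constraints failed; hold predictions for this batch.")
```

On a failing batch (say, a negative or 20 cm petal width), validate() returns False and the guard raises, which is the “block predictions or trigger alerts” behavior described above.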
Cornell Notes
The workshop argues that reliable ML monitoring requires more than tracking accuracy. It combines input drift detection, output monitoring, performance metrics (when ground truth exists), segment-level fairness checks, and explainability signals. Using whylogs, the session generates lightweight “profiles” (summary statistics) for each batch of data and logs them to WhyLabs to visualize changes over time and trigger drift alerts. A K-nearest neighbors model trained on scikit-learn’s Iris dataset is then stress-tested with seven simulated production batches, revealing anomalies like extreme values and all-zero inputs. Adding ground truth enables accuracy/F1-style monitoring, while SHAP and constraints support investigation and automated data quality gating.
Why does monitoring need to track both inputs and outputs, not just model accuracy?
How does the workshop detect data drift in practice?
What’s the difference between data drift and concept drift, and how are they illustrated?
How does segment-level monitoring support bias and fairness checks?
What role does SHAP play in explainability monitoring?
How do constraints turn monitoring into automated data quality enforcement?
Review Questions
- In the demo, what specific drift mechanism and threshold were used to trigger alerts, and what does changing the threshold do?
- How would you design a monitoring plan for a model where ground truth arrives late or not at all?
- What kinds of segment definitions (columns) would you choose to evaluate fairness in a credit, healthcare, or recommender system?
Key Points
1. Treat ML monitoring as an operational loop: collect telemetry, compare against training references or rolling windows, measure drift/quality/performance, then trigger investigation or workflows.
2. Expect data drift and data quality failures in production even after strong offline evaluation; schema changes and upstream sensor/input issues are common.
3. Use whylogs profiles to log lightweight summary statistics (not raw data) so drift and quality checks remain feasible in constrained environments like healthcare or fintech.
4. Set up input monitoring first (direct model inputs), then add output monitoring to catch issues when ground truth isn’t immediately available.
5. When ground truth is available, log performance metrics per batch and alert on accuracy/F1-style drops to quantify impact.
6. Add segment-level tracing (e.g., demographic or categorical group columns) to detect fairness problems that overall metrics can mask.
7. Use SHAP feature importance and constraints to prioritize what to monitor and to enforce automated data quality gates before predictions.