Intro to AI Observability: Monitoring ML Models & Data in Production
Based on a WhyLabs video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing.
Briefing
AI observability for machine learning boils down to one practical goal: keep models from silently degrading after they ship. In a hands-on workshop, WhyLabs walks through how to monitor both incoming data and model outputs in production so teams can detect issues like data drift, data quality failures, and performance drops—then trigger alerts or workflows before users feel the impact.
The session starts by framing ML monitoring as a pipeline problem. Predictions flow out continuously, while a telemetry layer collects statistics (not necessarily raw data) and sends them to a monitoring system. “Monitoring” focuses on catching when something goes wrong (or sometimes when it improves), while “observability” adds the breadcrumbs needed to trace why. The workshop emphasizes that once a model is live, the world changes: input distributions shift (data drift), sensors or formats break (data quality), and the mapping between inputs and outcomes can become outdated (concept drift). It also highlights bias and fairness as a monitoring target, typically via segmentation—checking performance across groups rather than only overall metrics.
To make this concrete, the workshop uses a simple Iris classification model built with scikit-learn (K-Nearest Neighbors). It then simulates “production” for a week by splitting the dataset into daily batches and intentionally introducing anomalies—such as feature values that become all zeros or jump to extreme ranges. The key move is logging statistical profiles of inputs and outputs using whylogs, an open-source library that creates lightweight dataset summaries (cardinality, quantiles, min/max, distribution metrics, etc.). Those profiles are privacy-preserving because they summarize distributions rather than storing raw records.
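A minimal sketch of that setup, assuming whylogs' `why.log()` API and a pandas-backed Iris dataset; the batching scheme and anomaly injection here are illustrative, not the workshop's exact code:

```python
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Train a simple KNN classifier on Iris (the workshop's model family).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Simulate a week of "production" by slicing the data into daily batches.
daily_batches = [X.iloc[i::7].copy() for i in range(7)]
daily_batches[3].iloc[:, 0] = 0.0  # inject an anomaly: one feature goes all-zero

# Log a statistical profile per batch -- summaries, not raw rows.
for day, batch in enumerate(daily_batches):
    profile_view = why.log(batch).view()
    stats = profile_view.to_pandas()  # counts, cardinality, quantiles, min/max
    print(f"day {day}:", stats["distribution/min"].to_dict())
```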
The workflow then connects those profiles to the WhyLabs platform. First, the training dataset is logged as a reference profile, establishing a baseline. Next, daily input profiles are written over time. A data drift monitor compares new data against the reference using distribution-distance measures (the configured monitor uses Hellinger distance). When drift crosses a threshold, the system can alert via email/Slack/PagerDuty and support follow-on actions.
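A hedged sketch of that flow, continuing the variables from the sketch above and assuming whylogs' WhyLabs writer with its standard environment variables; the credentials and IDs are placeholders, and the drift monitor itself is configured in the platform rather than in code:

```python
import os
import datetime
import whylogs as why

# Placeholder credentials; real values come from the WhyLabs account.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-xxxx"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"
os.environ["WHYLABS_API_KEY"] = "..."

# Log the training set once as the static reference (baseline) profile.
why.log(X).writer("whylabs").option(reference_profile_name="training").write()

# Write one profile per "day", backdated so the platform shows a timeline.
for day, batch in enumerate(daily_batches):
    results = why.log(batch)
    ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=6 - day)
    results.profile().set_dataset_timestamp(ts)
    results.writer("whylabs").write()
```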
The workshop shows why this matters by moving from inputs to outputs. When the input batch contains suspicious values (like all zeros), the KNN model still produces predictions—often collapsing into a single class because it always selects the “closest” neighbors even if they are far outside the training distribution. Logging output distributions (predicted class and probability) reveals those shifts quickly, even without ground truth.
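A sketch of output logging under the same assumptions; the column names (`output_prediction`, `output_probability`) are illustrative:

```python
import pandas as pd
import whylogs as why

# Log what the model emits: predicted class plus its top probability.
# Distribution shifts here (e.g., collapse into one class) are visible
# even before any ground-truth labels arrive.
batch = daily_batches[3]  # the anomalous all-zeros day
outputs = pd.DataFrame({
    "output_prediction": model.predict(batch),
    "output_probability": model.predict_proba(batch).max(axis=1),
})
why.log(outputs).writer("whylabs").write()
```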
Finally, the session adds ground truth by attaching labels and logging classification metrics. Performance drops sharply on the anomalous days (accuracy falls from the high 90s to roughly the 30s), confirming that the drift and data quality problems translate into real model failure. The workshop closes by demonstrating open-source capabilities directly in the notebook: computing drift scores and histograms, and defining constraints (unit-test style checks) such as “min must be > 0” and “max must be < 15.” If constraints fail, the pipeline can block predictions or trigger remediation.
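A hedged sketch of both closing steps, assuming whylogs' `log_classification_metrics` helper and its constraints API; the column and feature names are illustrative, and `true_labels` stands in for ground truth joined in after the fact:

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, smaller_than_number

# 1) Once ground truth arrives, log performance metrics for the day.
scored = pd.DataFrame({
    "pred": model.predict(batch),
    "prob": model.predict_proba(batch).max(axis=1),
    "label": true_labels,  # assumed: delayed ground-truth labels
})
why.log_classification_metrics(
    scored, target_column="label", prediction_column="pred", score_column="prob"
).writer("whylabs").write()

# 2) Unit-test-style constraints on the input profile ("min > 0", "max < 15").
profile_view = why.log(batch).view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="sepal length (cm)", number=0))
builder.add_constraint(smaller_than_number(column_name="sepal length (cm)", number=15))
constraints = builder.build()

if not constraints.validate():
    # Fail closed: block predictions or trigger remediation instead of serving.
    print(constraints.generate_constraints_report())
```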
Overall, the core takeaway is operational: statistical profiling plus drift/performance monitoring creates an early-warning system for ML systems that change after deployment, turning silent degradation into actionable alerts and automated safeguards.
Cornell Notes
ML monitoring is presented as an operational safety net for models after deployment, where data drift and data quality issues can quietly degrade performance. The workshop logs privacy-preserving statistical profiles of inputs (feature distributions) and outputs (predicted class/probability) using whylogs, then compares new profiles against a training reference profile in WhyLabs. A drift monitor (configured with Hellinger distance and a threshold) flags days where incoming data diverges, and alerts can trigger workflows such as data annotation. Adding ground truth later enables performance monitoring (accuracy/F1-style metrics), confirming that drift and anomalies correspond to real failures. The open-source notebook also demonstrates drift scoring/visualization and constraint-based “unit tests” to block bad data before predictions.
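For the in-notebook drift scoring and histograms mentioned above, a minimal sketch assuming whylogs' `NotebookProfileVisualizer` and the variables from the earlier sketches (feature name illustrative):

```python
import whylogs as why
from whylogs.viz import NotebookProfileVisualizer

# Compare an anomalous "day" against the training baseline, entirely offline.
reference_view = why.log(X).view()
target_view = why.log(daily_batches[3]).view()

viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=target_view, reference_profile_view=reference_view)
viz.summary_drift_report()  # per-feature drift scores (renders in a notebook)
viz.double_histogram(feature_name="sepal length (cm)")  # overlaid distributions
```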
- What’s the difference between ML monitoring and AI observability in practice?
- How does the workshop detect data drift without storing raw data?
- Why can a KNN model keep producing confident predictions even when inputs are clearly wrong?
- How does the workshop connect drift detection to real model failure?
- What are “constraints” in whylogs, and how do they function like unit tests for data?
- How can monitoring be made more actionable beyond whole-dataset drift?
Review Questions
- What specific statistical profile elements (e.g., min/max/quantiles/cardinality) are used to detect drift, and why are they sufficient without raw data?
- Describe the sequence of monitoring steps used in the workshop from reference profile creation to drift monitoring to performance validation with ground truth.
- Give an example of a constraint rule and explain how it would prevent the model from producing misleading predictions on anomalous inputs.
Key Points
1. ML monitoring should track both input distribution changes and output behavior after deployment, because models can degrade silently when the world shifts.
2. Data drift, data quality failures, and concept drift are distinct failure modes; monitoring should be designed to catch each one with appropriate signals.
3. whylogs creates lightweight, privacy-preserving statistical profiles (not raw data) that enable drift and quality checks using distribution metrics.
4. A reference profile from training data provides a baseline; drift monitors compare new profiles against that baseline and can alert when thresholds are exceeded.
5. Output logging (predicted class and probability distributions) can reveal model collapse into a single class even before ground truth labels arrive.
6. Adding ground truth later allows performance monitoring (accuracy/F1-style metrics), validating that drift signals correspond to real predictive harm.
7. Constraint-based checks act like unit tests for data, enabling pipelines to block predictions when profile statistics violate expected ranges.