Intro to AI Observability: Monitoring ML Models & Data in Production
Based on a WhyLabs video on YouTube. If you find this content useful, support the original creators by watching, liking, and subscribing.
Briefing
AI observability for machine learning boils down to one practical goal: keep models from silently degrading after they ship. In a hands-on workshop, WhyLabs walks through how to monitor both incoming data and model outputs in production so teams can detect issues like data drift, data quality failures, and performance drops—then trigger alerts or workflows before users feel the impact.
The session starts by framing ML monitoring as a pipeline problem. Predictions flow out continuously, while a telemetry layer collects statistics (not necessarily raw data) and sends them to a monitoring system. “Monitoring” focuses on catching when something goes wrong (or sometimes when it improves), while “observability” adds the breadcrumbs needed to trace why. The workshop emphasizes that once a model is live, the world changes: input distributions shift (data drift), sensors or formats break (data quality), and the mapping between inputs and outcomes can become outdated (concept drift). It also highlights bias and fairness as a monitoring target, typically via segmentation—checking performance across groups rather than only overall metrics.
To make this concrete, the workshop uses a simple Iris classification model built with scikit-learn (K-Nearest Neighbors). It then simulates “production” for a week by splitting the dataset into daily batches and intentionally introducing anomalies—such as feature values that become all zeros or jump to extreme ranges. The key move is logging statistical profiles of inputs and outputs using whylogs, an open-source library that creates lightweight dataset summaries (cardinality, quantiles, min/max, distribution metrics, etc.). Those profiles are privacy-preserving because they summarize distributions rather than storing raw records.
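A minimal sketch of that setup, assuming whylogs' `why.log()` API and a pandas-backed Iris dataset; the batching scheme and anomaly injection here are illustrative, not the workshop's exact code:

```python
import whylogs as why
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Train a simple KNN classifier on Iris (the workshop's model family).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Simulate a week of "production" by slicing the data into daily batches.
daily_batches = [X.iloc[i::7].copy() for i in range(7)]
daily_batches[3].iloc[:, 0] = 0.0  # inject an anomaly: one feature goes all-zero

# Log a statistical profile per batch -- summaries, not raw rows.
for day, batch in enumerate(daily_batches):
    profile_view = why.log(batch).view()
    stats = profile_view.to_pandas()  # counts, cardinality, quantiles, min/max
    print(f"day {day}:", stats["distribution/min"].to_dict())
```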
The workflow then connects those profiles to the WhyLabs platform. First, the training dataset is logged as a reference profile, establishing a baseline. Next, daily input profiles are written over time. A data drift monitor compares new data against the reference using distribution-distance measures (the configured monitor uses Hellinger distance). When drift crosses a threshold, the system can alert via email/Slack/PagerDuty and support follow-on actions.
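A hedged sketch of that flow, continuing the variables from the sketch above and assuming whylogs' WhyLabs writer with its standard environment variables; the credentials and IDs are placeholders, and the drift monitor itself is configured in the platform rather than in code:

```python
import os
import datetime
import whylogs as why

# Placeholder credentials; real values come from the WhyLabs account.
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "org-xxxx"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "model-1"
os.environ["WHYLABS_API_KEY"] = "..."

# Log the training set once as the static reference (baseline) profile.
why.log(X).writer("whylabs").option(reference_profile_name="training").write()

# Write one profile per "day", backdated so the platform shows a timeline.
for day, batch in enumerate(daily_batches):
    results = why.log(batch)
    ts = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=6 - day)
    results.profile().set_dataset_timestamp(ts)
    results.writer("whylabs").write()
```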
The workshop shows why this matters by moving from inputs to outputs. When the input batch contains suspicious values (like all zeros), the KNN model still produces predictions—often collapsing into a single class because it always selects the “closest” neighbors even if they are far outside the training distribution. Logging output distributions (predicted class and probability) reveals those shifts quickly, even without ground truth.
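A sketch of output logging under the same assumptions; the column names (`output_prediction`, `output_probability`) are illustrative:

```python
import pandas as pd
import whylogs as why

# Log what the model emits: predicted class plus its top probability.
# Distribution shifts here (e.g., collapse into one class) are visible
# even before any ground-truth labels arrive.
batch = daily_batches[3]  # the anomalous all-zeros day
outputs = pd.DataFrame({
    "output_prediction": model.predict(batch),
    "output_probability": model.predict_proba(batch).max(axis=1),
})
why.log(outputs).writer("whylabs").write()
```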
Finally, the session adds ground truth by attaching labels and logging classification metrics. Performance drops sharply on the anomalous days (accuracy falls from the high 90s to roughly the 30s), confirming that the drift and data quality problems translate into real model failure. The workshop closes by demonstrating open-source capabilities directly in the notebook: computing drift scores and histograms, and defining constraints (unit-test style checks) such as “min must be > 0” and “max must be < 15.” If constraints fail, the pipeline can block predictions or trigger remediation.
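A hedged sketch of both closing steps, assuming whylogs' `log_classification_metrics` helper and its constraints API; the column and feature names are illustrative, and `true_labels` stands in for ground truth joined in after the fact:

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, smaller_than_number

# 1) Once ground truth arrives, log performance metrics for the day.
scored = pd.DataFrame({
    "pred": model.predict(batch),
    "prob": model.predict_proba(batch).max(axis=1),
    "label": true_labels,  # assumed: delayed ground-truth labels
})
why.log_classification_metrics(
    scored, target_column="label", prediction_column="pred", score_column="prob"
).writer("whylabs").write()

# 2) Unit-test-style constraints on the input profile ("min > 0", "max < 15").
profile_view = why.log(batch).view()
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="sepal length (cm)", number=0))
builder.add_constraint(smaller_than_number(column_name="sepal length (cm)", number=15))
constraints = builder.build()

if not constraints.validate():
    # Fail closed: block predictions or trigger remediation instead of serving.
    print(constraints.generate_constraints_report())
```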
Overall, the core takeaway is operational: statistical profiling plus drift/performance monitoring creates an early-warning system for ML systems that change after deployment, turning silent degradation into actionable alerts and automated safeguards.
Cornell Notes
ML monitoring is presented as an operational safety net for models after deployment, where data drift and data quality issues can quietly degrade performance. The workshop logs privacy-preserving statistical profiles of inputs (feature distributions) and outputs (predicted class/probability) using whylogs, then compares new profiles against a training reference profile in WhyLabs. A drift monitor (configured with Hellinger distance and a threshold) flags days where incoming data diverges, and alerts can trigger workflows such as data annotation. Adding ground truth later enables performance monitoring (accuracy/F1-style metrics), confirming that drift and anomalies correspond to real failures. The open-source notebook also demonstrates drift scoring/visualization and constraint-based “unit tests” to block bad data before predictions.
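For the in-notebook drift scoring and histograms mentioned above, a minimal sketch assuming whylogs' `NotebookProfileVisualizer` and the variables from the earlier sketches (feature name illustrative):

```python
import whylogs as why
from whylogs.viz import NotebookProfileVisualizer

# Compare an anomalous "day" against the training baseline, entirely offline.
reference_view = why.log(X).view()
target_view = why.log(daily_batches[3]).view()

viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=target_view, reference_profile_view=reference_view)
viz.summary_drift_report()  # per-feature drift scores (renders in a notebook)
viz.double_histogram(feature_name="sepal length (cm)")  # overlaid distributions
```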
- What’s the difference between ML monitoring and AI observability in practice?
- How does the workshop detect data drift without storing raw data?
- Why can a KNN model keep producing confident predictions even when inputs are clearly wrong?
- How does the workshop connect drift detection to real model failure?
- What are “constraints” in whylogs, and how do they function like unit tests for data?
- How can monitoring be made more actionable beyond whole-dataset drift?
Review Questions
- What specific statistical profile elements (e.g., min/max/quantiles/cardinality) are used to detect drift, and why are they sufficient without raw data?
- Describe the sequence of monitoring steps used in the workshop from reference profile creation to drift monitoring to performance validation with ground truth.
- Give an example of a constraint rule and explain how it would prevent the model from producing misleading predictions on anomalous inputs.
Key Points
1. ML monitoring should track both input distribution changes and output behavior after deployment, because models can degrade silently when the world shifts.
2. Data drift, data quality failures, and concept drift are distinct failure modes; monitoring should be designed to catch each one with appropriate signals.
3. whylogs creates lightweight, privacy-preserving statistical profiles (not raw data) that enable drift and quality checks using distribution metrics.
4. A reference profile from training data provides a baseline; drift monitors compare new profiles against that baseline and can alert when thresholds are exceeded.
5. Output logging (predicted class and probability distributions) can reveal model collapse into a single class even before ground truth labels arrive.
6. Adding ground truth later allows performance monitoring (accuracy/F1-style metrics), validating that drift signals correspond to real predictive harm.
7. Constraint-based checks act like unit tests for data, enabling pipelines to block predictions when profile statistics violate expected ranges.