Intro to AI Observability: Monitoring ML Models & Data in Production

WhyLabs · 5 min read

Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

ML monitoring should track both input distribution changes and output behavior after deployment, because models can degrade silently when the world shifts.

Briefing

AI observability for machine learning boils down to one practical goal: keep models from silently degrading after they ship. In a hands-on workshop, WhyLabs walks through how to monitor both incoming data and model outputs in production so teams can detect issues like data drift, data quality failures, and performance drops—then trigger alerts or workflows before users feel the impact.

The session starts by framing ML monitoring as a pipeline problem. Predictions flow out continuously, while a telemetry layer collects statistics (not necessarily raw data) and sends them to a monitoring system. “Monitoring” focuses on catching when something goes wrong (or sometimes when it improves), while “observability” adds the breadcrumbs needed to trace why. The workshop emphasizes that once a model is live, the world changes: input distributions shift (data drift), sensors or formats break (data quality), and the mapping between inputs and outcomes can become outdated (concept drift). It also highlights bias and fairness as a monitoring target, typically via segmentation—checking performance across groups rather than only overall metrics.

To make this concrete, the workshop uses a simple Iris classification model built with scikit-learn (K-Nearest Neighbors). It then simulates “production” for a week by splitting the dataset into daily batches and intentionally introducing anomalies—such as feature values that become all zeros or jump to extreme ranges. The key move is logging statistical profiles of inputs and outputs using whylogs, an open-source library that creates lightweight dataset summaries (cardinality, quantiles, min/max, distribution metrics, etc.). Those profiles are privacy-preserving because they summarize distributions rather than storing raw records.
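The batching-and-profiling step can be sketched in plain Python. This is an illustrative stand-in, not the library's actual API; the feature values and day labels below are invented for the example:

```python
import statistics

def profile(batch):
    """Summarize a batch of feature values without storing raw rows."""
    return {
        "count": len(batch),
        "min": min(batch),
        "max": max(batch),
        "mean": statistics.mean(batch),
    }

# Simulated week of one feature's values; day3 is the injected anomaly
daily_batches = {
    "day1": [5.1, 4.9, 6.2, 5.8],
    "day2": [5.5, 6.1, 5.0, 6.4],
    "day3": [0.0, 0.0, 0.0, 0.0],  # data-quality failure: all zeros
}

profiles = {day: profile(vals) for day, vals in daily_batches.items()}
print(profiles["day3"])  # min/max/mean all 0.0 flags the bad batch
```

Only the summaries need to leave the pipeline, which is what makes the approach privacy-preserving.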

The workflow then connects those profiles to the WhyLabs platform. First, the training dataset is logged as a reference profile, establishing a baseline. Next, daily input profiles are written over time. A data drift monitor compares new data against the reference using distribution-distance measures (the platform uses Hellinger distance in the configured monitor). When drift crosses a threshold, the system can alert via email/Slack/PagerDuty and support follow-on actions.
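The Hellinger-distance comparison the monitor performs can be sketched in plain Python. The binned histograms and the 0.3 alert threshold below are illustrative assumptions, not values from the workshop:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)

# Binned feature histograms, normalized to sum to 1
reference = [0.10, 0.40, 0.40, 0.10]     # training-time baseline
normal_day = [0.12, 0.38, 0.41, 0.09]    # small day-to-day wobble
drifted_day = [0.70, 0.20, 0.05, 0.05]   # mass shifted into the first bin

THRESHOLD = 0.3  # hypothetical alert threshold
for name, day in [("normal", normal_day), ("drifted", drifted_day)]:
    d = hellinger(reference, day)
    print(name, round(d, 3), "ALERT" if d > THRESHOLD else "ok")
```

The distance is 0 for identical distributions and 1 for disjoint ones, which makes it convenient to pair with a fixed threshold.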

The workshop shows why this matters by moving from inputs to outputs. When the input batch contains suspicious values (like all zeros), the KNN model still produces predictions—often collapsing into a single class because it always selects the “closest” neighbors even if they are far outside the training distribution. Logging output distributions (predicted class and probability) reveals those shifts quickly, even without ground truth.
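The collapse behavior can be reproduced with a toy nearest-neighbor classifier. The training points and class names below are invented for illustration, not the workshop's actual Iris split:

```python
import math

# Tiny stand-in training set: two classes in 2-D feature space
train = [((5.0, 3.5), "setosa"), ((6.5, 3.0), "virginica"),
         ((5.1, 3.4), "setosa"), ((6.7, 3.1), "virginica")]

def knn_predict(x, k=3):
    """Predict by majority vote among the k nearest training points."""
    by_dist = sorted(train, key=lambda t: math.dist(x, t[0]))
    votes = [label for _, label in by_dist[:k]]
    return max(set(votes), key=votes.count)

# An all-zeros batch is far outside the training region, but KNN still
# finds "nearest" neighbors, so the whole batch collapses to one class.
bad_batch = [(0.0, 0.0)] * 4
preds = [knn_predict(x) for x in bad_batch]
print(preds)
```

Logging the predicted-class distribution would immediately show all probability mass on one class, with no ground truth required.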

Finally, the session adds ground truth by attaching labels and logging classification metrics. Performance drops sharply on the anomalous days (accuracy falls from the high 90s to roughly the 30s), confirming that the drift and data quality problems translate into real model failure. The workshop closes by demonstrating open-source capabilities directly in the notebook: computing drift scores and histograms, and defining constraints (unit-test style checks) such as “min must be > 0” and “max must be < 15.” If constraints fail, the pipeline can block predictions or trigger remediation.
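The constraint idea can be sketched as plain unit-test style checks over profile statistics. These are illustrative stand-ins for the library's constraints API, using the same "min > 0" and "max < 15" rules described above:

```python
def build_constraints():
    """Unit-test style rules over a feature's profile statistics."""
    return [
        ("min must be > 0",  lambda prof: prof["min"] > 0),
        ("max must be < 15", lambda prof: prof["max"] < 15),
    ]

def validate(prof, constraints):
    """Return (passed, list of failed rule names) for one profile."""
    failures = [name for name, check in constraints if not check(prof)]
    return len(failures) == 0, failures

good = {"min": 4.3, "max": 7.9}   # a plausible sepal-length range
bad = {"min": 0.0, "max": 0.0}    # the all-zeros anomaly

ok, _ = validate(good, build_constraints())
blocked, why_failed = validate(bad, build_constraints())
print(ok, blocked, why_failed)
```

A pipeline can gate inference on `validate` returning true, blocking predictions on batches that violate the expected ranges.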

Overall, the core takeaway is operational: statistical profiling plus drift/performance monitoring creates an early-warning system for ML systems that change after deployment, turning silent degradation into actionable alerts and automated safeguards.

Cornell Notes

ML monitoring is presented as an operational safety net for models after deployment, where data drift and data quality issues can quietly degrade performance. The workshop logs privacy-preserving statistical profiles of inputs (feature distributions) and outputs (predicted class/probability) using whylogs, then compares new profiles against a training reference profile in WhyLabs. A drift monitor (configured with Hellinger distance and a threshold) flags days where incoming data diverges, and alerts can trigger workflows such as data annotation. Adding ground truth later enables performance monitoring (accuracy/F1-style metrics), confirming that drift and anomalies correspond to real failures. The open-source notebook also demonstrates drift scoring/visualization and constraint-based “unit tests” to block bad data before predictions.

What’s the difference between ML monitoring and AI observability in practice?

Monitoring is the alerting layer: detect when something goes wrong (e.g., drift, data quality violations, performance drops) and notify the team or trigger a workflow. Observability is the investigation layer: once an alert fires, the system provides the evidence needed to trace the cause—such as which input feature distributions shifted, whether output distributions collapsed, and whether performance metrics fell on the same time window.

How does the workshop detect data drift without storing raw data?

It uses whylogs to compute statistical profiles from incoming tabular data. Those profiles include distribution summaries like min/max, mean/median, quantiles, and cardinality estimates. By comparing these summaries over time (and against a reference profile from training data), the system can measure distribution distance (e.g., Hellinger distance in the configured monitor) and flag when new data no longer matches the training distribution.

Why can a KNN model keep producing confident predictions even when inputs are clearly wrong?

K-Nearest Neighbors always selects the k closest training points in feature space. If the input batch contains extreme or invalid values (like all zeros or very large measurements), the model still finds “closest” neighbors—even if they are far outside the normal training region—and then outputs a class based on those neighbors. That can cause output distributions to collapse into a single class, revealing the issue even before ground truth is available.

How does the workshop connect drift detection to real model failure?

After identifying anomalous days via input drift and output distribution shifts, it attaches ground truth labels and logs classification metrics using whylogs’ classification-metrics logging. The performance tab then shows accuracy dropping dramatically on the same days where drift/data quality issues appeared, confirming that the monitoring signals correspond to degraded predictive quality.
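The per-day performance check reduces to computing a metric once labels arrive. The predictions and labels below are simulated to mirror a normal day and a collapsed-output day; they are not real workshop data:

```python
def accuracy(preds, labels):
    """Fraction of predictions that match ground truth."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Simulated (predictions, ground-truth labels) per day
days = {
    "day1": (["a", "b", "a", "c", "b"], ["a", "b", "a", "c", "b"]),
    "day3": (["a", "a", "a", "a", "a"], ["a", "b", "c", "b", "c"]),  # collapsed outputs
}

for day, (preds, labels) in days.items():
    print(day, accuracy(preds, labels))
```

The anomalous day scores far below the baseline, which is exactly the pattern that confirms the earlier drift alerts were real failures rather than false alarms.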

What are “constraints” in whylogs, and how do they function like unit tests for data?

Constraints define pass/fail rules over profile statistics (e.g., feature min must be > 0 and max must be < 15). The notebook builds a set of constraint checks per feature, then validates a new profile against them. If any constraint fails, the validation returns false—allowing a pipeline to block predictions or trigger remediation before the model runs on bad inputs.

How can monitoring be made more actionable beyond whole-dataset drift?

The workshop notes segmentation and tracing: monitors can be configured for specific segments (e.g., groups/classes), and tracing helps drill down performance differences across those segments over time. This supports bias/fairness monitoring and targeted debugging rather than treating the model as a single undifferentiated system.
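Segment-level monitoring amounts to computing metrics per group rather than over the whole dataset. A minimal sketch, with hypothetical segment names:

```python
from collections import defaultdict

def segmented_accuracy(rows):
    """rows: iterable of (segment, prediction, label) tuples -> accuracy per segment."""
    hits, totals = defaultdict(int), defaultdict(int)
    for seg, pred, label in rows:
        totals[seg] += 1
        hits[seg] += int(pred == label)
    return {seg: hits[seg] / totals[seg] for seg in totals}

# Hypothetical segments: overall accuracy is 62.5%, but group_b lags badly
rows = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 1), ("group_b", 1, 1), ("group_b", 0, 1),
]
print(segmented_accuracy(rows))
```

An aggregate metric would hide the disparity here, which is why the workshop frames segmentation as the mechanism for bias/fairness monitoring.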

Review Questions

  1. What specific statistical profile elements (e.g., min/max/quantiles/cardinality) are used to detect drift, and why are they sufficient without raw data?
  2. Describe the sequence of monitoring steps used in the workshop from reference profile creation to drift monitoring to performance validation with ground truth.
  3. Give an example of a constraint rule and explain how it would prevent the model from producing misleading predictions on anomalous inputs.

Key Points

  1. ML monitoring should track both input distribution changes and output behavior after deployment, because models can degrade silently when the world shifts.
  2. Data drift, data quality failures, and concept drift are distinct failure modes; monitoring should be designed to catch each one with appropriate signals.
  3. whylogs creates lightweight, privacy-preserving statistical profiles (not raw data) that enable drift and quality checks using distribution metrics.
  4. A reference profile from training data provides a baseline; drift monitors compare new profiles against that baseline and can alert when thresholds are exceeded.
  5. Output logging (predicted class and probability distributions) can reveal model collapse into a single class even before ground truth labels arrive.
  6. Adding ground truth later allows performance monitoring (accuracy/F1-style metrics), validating that drift signals correspond to real predictive harm.
  7. Constraint-based checks act like unit tests for data, enabling pipelines to block predictions when profile statistics violate expected ranges.

Highlights

A single anomalous input batch (e.g., all zeros) can cause a KNN model to collapse into one predicted class, even though the model still “runs” and produces outputs.
Logging only statistical profiles of inputs and outputs can detect drift and data quality problems without storing raw records.
Reference-profile drift monitoring ties alerts to measurable distribution distance (Hellinger distance in the configured example).
Constraints in whylogs provide a practical guardrail: fail fast on invalid feature ranges before predictions are made.
Ground truth added after the fact turns monitoring signals into confirmed performance drops—accuracy falling from the high 90s to the 30s on anomalous days.
