ML Monitoring, CS329S Machine Learning Systems Design (Stanford), guest lecture by Alessya Visnjic (WhyLabs)
Based on WhyLabs's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Machine learning observability hinges on one practical bottleneck: telemetry. Alessya Visnjic argues that if teams don’t capture the right “vitals” quickly, extensibly, and cheaply enough, everything downstream, from drift detection and dashboards to alerts and root-cause workflows, is constrained by what was (or wasn’t) recorded. In production, those vitals must also fit into a larger system of upstream data sources and downstream consumers, while meeting business realities like latency, scale, operating cost, and long-term maintenance.
The talk frames monitoring as a three-step loop: collect pipeline vitals, detect changes, and involve humans (or automate) when something shifts. That loop becomes harder once an ML service sits inside a “breathing” software ecosystem where upstream inputs and downstream outputs interact. Monitoring tools must therefore understand cross-system dependencies, and they can’t be more expensive to run than the ML product they protect. Visnjic also draws a sharp distinction between monitoring and observability: monitoring often stops at detection and notification, while observability pulls relevant context together and correlates signals to help teams root-cause faster—especially when data drift or data quality issues appear.
To make that distinction concrete, she uses an observability architecture built around logs, dashboards, and alerts. Telemetry feeds the system; processed outputs become dashboards for rapid investigation and alerts for timely intervention. Alerts can also trigger automation—such as kicking off retraining when distribution drift crosses a threshold, or halting predictions when data quality degrades (for example, when a large fraction of features become null). A recurring operational theme is alert fatigue: systems that cry wolf lose credibility, while systems that stay silent fail when it matters. The balance requires careful tuning by the people who built and operate the system.
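To make the automation hook concrete, here is a minimal, hypothetical sketch of an alert handler. None of it is a WhyLabs API: the thresholds and the `trigger_retraining`/`halt_predictions` hooks are placeholders a real deployment would tune and wire into its own pipeline.

```python
# Hypothetical alert handler: thresholds and hooks below are placeholders,
# not part of any WhyLabs API.

DRIFT_THRESHOLD = 0.25      # distribution-distance cutoff, tuned by operators
NULL_FRACTION_LIMIT = 0.5   # e.g., halt if half of a feature's values are null

def trigger_retraining() -> None:
    print("drift threshold crossed: kicking off retraining")

def halt_predictions() -> None:
    print("data quality degraded: halting predictions")

def handle_alert(stats: dict) -> None:
    """Route a telemetry alert to automation, or fall back to a human."""
    if stats["null_fraction"] > NULL_FRACTION_LIMIT:
        halt_predictions()
    elif stats["distribution_distance"] > DRIFT_THRESHOLD:
        trigger_retraining()
    else:
        print("below automation thresholds: notify the on-call instead")

handle_alert({"distribution_distance": 0.31, "null_fraction": 0.02})
```

Checking data quality before drift is a deliberate ordering in this sketch: retraining on largely-null inputs would bake the problem into the next model.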
From there, the talk zooms into what telemetry should include for ML. Visnjic defines telemetry broadly as all information that can describe the state of an ML application, organized into “profiles,” where each profile summarizes a unit of time (e.g., an hour or a day). Profiles must be comparable across time windows to support learning-based monitoring. Key telemetry categories include lineage metadata (model versions, hyperparameters, and artifacts), schema (data shapes across the feature pipeline), counts and statistics (missing values, run counts, prediction counts), and distribution summaries of inputs and outputs to detect distribution drift. Resource telemetry such as CPU and memory usage during training is also useful for diagnosing bottlenecks and failures.
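To make the profile idea concrete, here is a minimal sketch of what one time-window profile might carry, with the telemetry categories as fields. The field names are illustrative assumptions, not a schema from the talk.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProfileRecord:
    """Illustrative per-window telemetry profile; field names are assumptions."""
    window_start: datetime             # the unit of time, e.g., one hour or day
    model_version: str                 # lineage metadata
    schema: dict[str, str]             # feature name -> dtype across the pipeline
    missing_counts: dict[str, int]     # counts/statistics, e.g., nulls per feature
    prediction_count: int
    quantiles: dict[str, list[float]]  # distribution summaries of inputs/outputs
```

Because every window carries the same fields, a baseline window and a test window can be compared directly, which is exactly the comparability requirement the talk emphasizes.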
To address the tedious, error-prone work of defining telemetry schemas, Visnjic introduces WhyLabs’ open-source library, whylogs, designed to capture ML telemetry profiles efficiently. The library emphasizes simplicity of integration (e.g., logging a dataframe in one line), lightweight production performance (built on streaming-friendly data structures from Apache DataSketches), and mergeable, map-reducible statistics, so distributed or micro-batched computation can be combined into day- or hour-level profiles. A key claim is that it can process 100% of the data in a streaming fashion rather than sampling, which matters for avoiding false alarms from misestimated long-tail distributions.
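A minimal sketch of the one-line logging and merge workflow, using the whylogs v1 Python API (the lecture predates this API, so call names may differ from what is shown on screen); the file paths are illustrative.

```python
import pandas as pd
import whylogs as why

# Profile a dataframe in one line; the result wraps a mergeable profile.
view_a = why.log(pd.read_csv("batch_a.csv")).view()
view_b = why.log(pd.read_csv("batch_b.csv")).view()

# Profiles from separate workers or micro-batches merge into one window,
# so no raw data has to move to a central place.
daily_view = view_a.merge(view_b)
print(daily_view.to_pandas())  # per-column summary statistics
```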
The final segment demonstrates the workflow: profiles are generated locally from a public Lending Club dataset segmented into seven days, then uploaded to the WhyLabs enterprise platform for dashboarding and monitoring. A sample monitor computes distribution distance with the Hellinger metric and tracks statistics like medians and “busy values,” producing an alert when missing values spike on a later day, without any hand-tuned per-feature thresholds. The platform supports configurable baselines and alert behavior, and it can be operated via API for enterprise-scale ingestion across many models.
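For reference, the Hellinger distance between two discrete distributions P and Q is H(P, Q) = (1/√2)·‖√P − √Q‖₂, bounded in [0, 1]. The sketch below computes it from raw histogram counts; it is a from-scratch illustration of the metric, not WhyLabs’ implementation, and the histograms are made up.

```python
import numpy as np

def hellinger(p_counts: np.ndarray, q_counts: np.ndarray) -> float:
    """Hellinger distance between two histograms over the same bins."""
    p = p_counts / p_counts.sum()  # normalize counts to probabilities
    q = q_counts / q_counts.sum()
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

baseline = np.array([40.0, 35.0, 20.0, 5.0])  # baseline window histogram
today = np.array([10.0, 25.0, 40.0, 25.0])    # later day's histogram
print(hellinger(baseline, today))  # larger value = stronger drift signal
```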
Cornell Notes
The core message is that ML observability depends on high-quality telemetry: if the system doesn’t capture the right ML “vitals” as time-based profiles, drift detection and root-cause workflows will be limited. Visnjic frames monitoring as collecting vitals, detecting changes, and escalating or automating responses, while observability goes further by correlating signals to help teams understand why changes happen. For ML telemetry, she highlights lineage metadata, schema across the feature pipeline, counts/statistics (e.g., missing values), and distribution summaries of inputs/outputs, organized into comparable time windows (“profiles”). She then presents whylogs, an open-source library that builds mergeable streaming statistics (based on Apache DataSketches) so production pipelines can log telemetry cheaply and accurately without sampling long-tail data. A demo shows profiles uploaded to WhyLabs to power dashboards and alerts, including drift detection based on Hellinger distance.
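To see why mergeability matters, the sketch below drives the Apache DataSketches Python bindings (the `datasketches` package) directly, assuming a KLL quantile sketch; whylogs builds on structures like these, though its internals are not shown here.

```python
import random
from datasketches import kll_floats_sketch

# Each worker feeds every value it sees into its sketch (no pre-sampling).
worker_a = kll_floats_sketch(200)  # k trades accuracy against memory
worker_b = kll_floats_sketch(200)
for _ in range(100_000):
    worker_a.update(random.gauss(0.0, 1.0))
    worker_b.update(random.gauss(0.0, 1.0))

# Sketches merge into a day-level profile without moving the raw data.
daily = kll_floats_sketch(200)
daily.merge(worker_a)
daily.merge(worker_b)
print(daily.get_quantile(0.5))  # approximate median of all 200k values
```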
- Why does telemetry quality determine how well an ML observability system works in practice?
- How does the talk distinguish monitoring from observability at the system level?
- What is a “profile” in ML telemetry, and why must profiles be comparable over time?
- What categories of telemetry are recommended for ML systems?
- What design choices make whylogs suitable for production-scale telemetry capture?
- How does the demo detect drift and generate an alert?
Review Questions
- What production constraints (cost, latency, scale, maintenance) does the talk say telemetry must satisfy, and why do they matter for observability outcomes?
- How do lineage metadata, schema, and distribution summaries each contribute to diagnosing drift or data quality problems?
- Why is avoiding sampling emphasized for long-tail distributions, and how do mergeable streaming statistics help?
Key Points
1. Treat telemetry as the limiting factor for ML observability: missing or poorly designed telemetry constrains drift detection and root-cause capabilities.
2. Model monitoring as a loop: collect vitals, detect change, then escalate to humans or automate actions when thresholds are crossed.
3. Design telemetry around ML-specific context: lineage metadata, schema across the feature pipeline, counts/statistics, and distribution summaries of inputs/outputs.
4. Organize telemetry into comparable time-based profiles so baselines and test windows can be meaningfully compared over time.
5. Use streaming, mergeable statistics to keep production overhead low and to support distributed computation without moving large datasets.
6. Avoid sampling when estimating distributions with long tails to reduce false alarms from misestimated rare events.
7. Tune alerting carefully to prevent alert fatigue; alerts must match the needs of the people who operate the system.