
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Offline evaluation can fail when training/test/production distributions diverge due to drift, adversarial inputs, or long-tail underrepresentation.

Briefing

Machine-learning systems fail in ways that offline test scores can’t fully predict, so teams need a broader testing mindset: validate not just a model’s accuracy, but the entire production pipeline, across data slices, metrics, and time. The lecture frames this as a shift away from dismissing models as untestable “black boxes” toward a more practical question: what assumptions must hold for a good test score to translate into reliable real-world performance, and what breaks those assumptions?

When offline evaluation looks strong, it often relies on a key assumption: training, test, and production data come from the same distribution. In practice, that assumption frequently fails. Data drift can occur naturally, malicious users can intentionally shift inputs, and long-tail distributions mean a test set may underrepresent rare but critical cases. Even when the distributions match, a single aggregate metric (like accuracy) can hide weak performance on important subgroups (“slices”) such as regions, languages, or other categorical partitions. The lecture also stresses that building a machine learning system isn’t only about model quality: preprocessing, deployment code, labeling, and feedback loops can introduce failure modes that never show up in model-only testing.
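
To make slice-based evaluation concrete, here is a minimal sketch (not code from the lecture) that computes accuracy per categorical slice with pandas and scikit-learn; the `region` column and the toy data are illustrative assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_slice(df: pd.DataFrame, slice_col: str) -> pd.Series:
    """Compute accuracy separately for each value of a categorical slice column.

    Expects 'label' and 'prediction' columns; the slice column (e.g., region
    or language) is a hypothetical example field.
    """
    return df.groupby(slice_col).apply(
        lambda g: accuracy_score(g["label"], g["prediction"])
    )

# Example: aggregate accuracy can look acceptable while one slice lags badly.
df = pd.DataFrame({
    "label":      [1, 0, 1, 1, 0, 1, 0, 0],
    "prediction": [1, 0, 1, 1, 0, 0, 1, 1],
    "region":     ["US", "US", "US", "US", "EU", "EU", "EU", "EU"],
})
print(accuracy_score(df["label"], df["prediction"]))  # 0.625 overall
print(accuracy_by_slice(df, "region"))                # US: 1.0, EU: 0.25
```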

To address these gaps, the lecture introduces software testing fundamentals—unit, integration, and end-to-end tests—then adapts them to machine learning. The core takeaway is that “testing” must cover the full system: training infrastructure, data pipelines, the prediction wrapper, serving infrastructure, labeling, and data storage/preprocessing. For training, infrastructure tests should catch regressions quickly using short runs (e.g., a single gradient step or a single epoch). For training-data integration, teams should ensure reproducibility by rerunning abbreviated training jobs on fixed or sliding-window datasets and checking performance consistency against reference runs.
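
As one way to implement the short-run idea, here is a hedged sketch of a “memorize a single batch” smoke test, assuming PyTorch; the tiny network and random batch are placeholders for a project’s real model and data loader.

```python
import torch
import torch.nn as nn

def test_memorize_single_batch():
    """Smoke test: a few optimizer steps on one fixed batch should drive the
    loss down. The two-layer net and random batch stand in for the project's
    real model and data; any model the codebase supports could be swapped in.
    """
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(8, 10)
    y = torch.randint(0, 2, (8,))

    initial_loss = loss_fn(model(x), y).item()
    for _ in range(20):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    assert loss.item() < initial_loss, "training on a fixed batch should reduce loss"
```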

For model readiness, the lecture emphasizes evaluation tests that go beyond a single validation score. Teams should compare candidate models to baselines across multiple datasets, multiple metrics, and multiple slices, aiming to map a “performance envelope”—where the model is expected to work and where it should fail. It lists categories of evaluation beyond standard metrics: behavioral tests (invariance, directional, and minimum-functionality checks), robustness tests (feature importance, sensitivity to staleness, and drift sensitivity), privacy and fairness checks, and simulation tests for systems that affect the world (notably autonomous vehicles).
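
In the spirit of those behavioral categories, the sketch below shows what invariance, directional, and minimum-functionality checks might look like for a sentiment model; `predict_sentiment` is a hypothetical callable returning a `(label, positive_score)` pair, not part of the lecture’s code.

```python
def check_invariance(predict_sentiment):
    """Invariance: changing an irrelevant detail (a name) must not flip the label."""
    assert predict_sentiment("Mark's flight was great.")[0] == \
           predict_sentiment("Anna's flight was great.")[0]

def check_directional(predict_sentiment):
    """Directional: appending negative content should not raise the positive score."""
    _, base_score = predict_sentiment("The food was good.")
    _, worse_score = predict_sentiment("The food was good, but the service was terrible.")
    assert worse_score <= base_score

def check_minimum_functionality(predict_sentiment):
    """Minimum functionality: a trivially negative sentence must be labeled negative."""
    assert predict_sentiment("This was an awful experience.")[0] == "negative"

# Demo with a trivial keyword-based stub standing in for a real model:
def toy_model(text):
    score = 0.1 if ("terrible" in text or "awful" in text) else 0.9
    return ("positive" if score > 0.5 else "negative", score)

check_invariance(toy_model)
check_directional(toy_model)
check_minimum_functionality(toy_model)
```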

Finally, the lecture covers online verification through shadow tests and A/B testing. Shadow tests run the new model in production alongside the old one without exposing predictions to users, catching offline/online discrepancies such as preprocessing mismatches or deployment translation bugs. A/B tests then measure user and business impact, often using canary rollouts first. The lecture closes by arguing that explainability for deep learning is often unreliable as a faithful “why,” and that the more realistic goal is domain predictability—knowing the performance envelope and reducing unknown unknowns—using interpretable model families when true explanations are required, while treating tools like SHAP and LIME more as debugging and intuition aids than guaranteed causal truth.
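
A minimal sketch of the shadow pattern might look like the following, assuming model objects with a `predict` method. The key properties are that the candidate’s output is logged but never returned to the user, and that a failure in the shadow path cannot break the live response.

```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(request, live_model, shadow_model):
    """Serve the live model's prediction; run the candidate in shadow mode.

    The shadow model sees the same production input, but its output is only
    logged for offline comparison. Exceptions in the shadow path are caught
    so they cannot affect the live response.
    """
    live_pred = live_model.predict(request)
    try:
        shadow_pred = shadow_model.predict(request)
        logger.info("shadow_compare live=%s shadow=%s", live_pred, shadow_pred)
    except Exception:
        logger.exception("shadow model failed; live path unaffected")
    return live_pred
```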

Cornell Notes

The lecture argues that strong offline test scores don’t reliably guarantee production performance because key assumptions—like matching data distributions—often fail due to drift, long-tail gaps, adversarial inputs, and hidden slice-level weaknesses. It proposes testing the entire machine learning system, not just the model: training infrastructure, data pipelines, prediction wrappers, serving, labeling, and data storage/preprocessing. Evaluation tests should compare candidate models to baselines across multiple metrics, datasets, and slices to build a “performance envelope” (where the model works and where it doesn’t). Online checks such as shadow testing and A/B testing help catch deployment and preprocessing inconsistencies that offline tests miss. The lecture also critiques “explainability” as a faithful causal explanation for deep learning, positioning domain predictability as the more actionable goal.

Why can a high offline validation score still mislead teams once a model reaches production?

Offline evaluation typically assumes training, test, and production data come from the same distribution. The lecture lists common ways this breaks: data drift/shift, malicious users intentionally changing input distributions, and long-tail data where rare cases are underrepresented in the test set. It also notes that aggregate metrics (e.g., accuracy) can mask poor performance on important subsets (“slices”), such as demographic groups, regions, or other partitions. Finally, model-only testing ignores system-level failures in preprocessing, deployment code, labeling, and feedback loops.
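
One common way to check the same-distribution assumption on a numeric feature (a general technique, not one the lecture prescribes) is a two-sample Kolmogorov–Smirnov test, sketched here with SciPy on synthetic data.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature  = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted distribution

# Compare the training-time and production distributions of one feature.
stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"possible drift detected (KS={stat:.3f}, p={p_value:.2e})")
```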

What does it mean to “test the entire machine learning system,” and how does that differ from testing a model?

The system includes more than the trained model artifact. The lecture describes components like a training system (which produces the model), a prediction system (which preprocesses inputs, loads weights, calls predict, and post-processes outputs), a serving system (which handles requests and scaling), a labeling system (which creates ground-truth labels), and storage/preprocessing pipelines. Testing should cover each component and the boundaries between them: infrastructure tests for training code, integration tests for data+training reproducibility, evaluation tests for model readiness, and shadow and A/B tests for online behavior.
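
As an illustration of testing the prediction-system boundary, here is a hedged sketch with a hypothetical `PredictionService` wrapper; a stub model lets the test exercise preprocessing and post-processing without real weights.

```python
import numpy as np

class PredictionService:
    """Hypothetical prediction wrapper: preprocess, predict, post-process."""

    def __init__(self, model):
        self.model = model

    def preprocess(self, raw: list) -> np.ndarray:
        return np.asarray(raw, dtype=np.float32).reshape(1, -1)

    def postprocess(self, scores: np.ndarray) -> dict:
        return {"label": int(scores.argmax()), "score": float(scores.max())}

    def predict(self, raw: list) -> dict:
        return self.postprocess(self.model(self.preprocess(raw)))

def test_prediction_wrapper_end_to_end():
    # A stub model exercises the full wrapper path without loading weights.
    stub_model = lambda x: np.array([0.2, 0.8])
    result = PredictionService(stub_model).predict([1.0, 2.0, 3.0])
    assert result == {"label": 1, "score": 0.8}
```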

How should teams structure tests for the training pipeline?

For training infrastructure, the lecture recommends unit-style tests that catch bugs in the training code itself, plus “single batch” or “single epoch” tests that run a short training step on small datasets for each model the codebase supports. These are designed to be fast and to catch obvious regressions early. For data+training integration, the focus is reproducibility: rerun abbreviated training jobs on fixed datasets or sliding windows (e.g., data from a recent time range) and verify that performance matches reference runs. These tests are typically run periodically (e.g., nightly) rather than on every commit.
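
A reproducibility check of this kind might be sketched as follows; `run_abbreviated_training`, the dataset snapshot name, and the reference accuracy are all hypothetical stand-ins.

```python
REFERENCE_ACCURACY = 0.91   # stored from a known-good reference run (hypothetical value)
TOLERANCE = 0.02            # short runs fluctuate; allow some slack

def test_training_reproducibility(run_abbreviated_training):
    """Nightly-style check: an abbreviated run on a fixed dataset should land
    close to the reference. `run_abbreviated_training` is a hypothetical
    callable returning validation accuracy for a pinned dataset snapshot.
    """
    accuracy = run_abbreviated_training(dataset="fixed_snapshot_v3", epochs=1)
    assert abs(accuracy - REFERENCE_ACCURACY) <= TOLERANCE
```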

What makes evaluation tests for ML different from traditional pass/fail software tests?

Evaluation tests aim to determine whether a candidate model is production-ready by comparing it to baselines across multiple datasets, metrics, and slices—not just a single validation score. The lecture emphasizes building a performance envelope: understanding where the model performs well and where it fails, including robustness, fairness, and privacy concerns. It also highlights the need to set thresholds carefully because some slices will naturally fluctuate; teams should compare against both the previous model and a fixed older reference model to prevent gradual degradation.
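
A promotion gate along these lines might be sketched as below, with per-slice scores as plain dicts and hypothetical numbers; the fixed reference model acts as an anchor so that small regressions cannot accumulate across releases.

```python
def promote_candidate(candidate: dict, previous: dict, reference: dict,
                      slack: float = 0.01) -> bool:
    """Gate a model promotion on per-slice metrics (dicts mapping slice -> score).

    The candidate must not fall more than `slack` below the previous model on
    any slice (slices naturally fluctuate), and must also stay above a fixed
    older reference model to prevent gradual degradation.
    """
    for slice_name, prev_score in previous.items():
        cand_score = candidate[slice_name]
        if cand_score < prev_score - slack:
            return False                      # regression vs. the last model
        if cand_score < reference[slice_name]:
            return False                      # degradation vs. the fixed anchor
    return True

# Example with hypothetical per-region accuracies:
print(promote_candidate(
    candidate={"US": 0.93, "EU": 0.90},
    previous={"US": 0.92, "EU": 0.905},
    reference={"US": 0.90, "EU": 0.88},
))  # True: within slack of the previous model and above the reference
```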

How do shadow tests and A/B tests help catch ML-specific production bugs?

Shadow tests run the new model in the production environment alongside the old one but don’t return its predictions to users. This helps detect issues that only appear with real production data or with deployment differences, such as preprocessing mismatches or translation bugs between offline training and online serving. A/B tests then measure how users and business metrics respond to the new model’s predictions, often starting with canary rollouts (small traffic fractions) and then using statistical comparisons between cohorts.
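
For the statistical comparison between cohorts, one standard option (not specific to the lecture) is a two-proportion z-test, sketched here with statsmodels on hypothetical conversion counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical cohort outcomes: conversions out of users per cohort.
conversions = [312, 368]       # [control (old model), treatment (new model)]
users       = [10_000, 10_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z={z_stat:.2f}, p={p_value:.3f}")
if p_value < 0.05:
    print("cohorts differ; check whether the new model moved the metric the right way")
```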

Why does the lecture treat “explainability” as a limited goal for deep learning?

It argues that many explanation methods aren’t reliably faithful to the model’s true decision process and can be fragile—small input changes can produce different explanations. The lecture also criticizes attention maps as incomplete and sometimes unreliable “reasons.” Instead of demanding faithful causal explanations, it frames the more practical goal as domain predictability: knowing the model’s performance envelope and reducing unknown unknowns. Interpretable model families (e.g., linear models, decision trees) are positioned as the best route when true explanation is required.
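
The briefing notes that tools like SHAP are better treated as debugging and intuition aids; used in that spirit, a session might look like the following sketch on a scikit-learn tree model. The dataset and model choice are illustrative, and the attributions should be read as hints about what the model leans on, not as causal explanations.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Per-feature attributions for individual predictions: useful for spotting
# features the model relies on unexpectedly, not a faithful "why".
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])
```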

Review Questions

  1. What assumptions must hold for offline test metrics to transfer to production, and which failure modes break those assumptions?
  2. Describe a testing strategy that covers training, prediction, serving, labeling, and data preprocessing—what test types apply to each?
  3. How do slice-based evaluation and performance envelopes change the way teams decide whether to promote a model?

Key Points

  1. Offline evaluation can fail when training/test/production distributions diverge due to drift, adversarial inputs, or long-tail underrepresentation.
  2. Testing must extend beyond model accuracy to cover the full ML system: training, prediction wrappers, serving, labeling, and data storage/preprocessing.
  3. Infrastructure tests for training should be fast (single-batch/epoch) to catch regressions early, while integration tests should focus on reproducibility using fixed datasets or sliding windows.
  4. Evaluation tests should compare candidate models to baselines across multiple metrics, datasets, and slices to map a performance envelope rather than rely on one aggregate score.
  5. Shadow testing helps detect offline/online inconsistencies (e.g., preprocessing or deployment translation bugs) before user exposure.
  6. A/B testing and canary rollouts measure user and business impact, catching failures that only appear when real users interact with predictions.
  7. Deep-learning “explainability” is often unreliable as a faithful causal account; domain predictability and interpretable model families are more dependable when true explanations are required.

Highlights

A good offline score depends on distributional assumptions that often fail in production through drift, adversarial shifts, and long-tail gaps.
The lecture’s ML testing blueprint treats the system as a pipeline of components—training, prediction, serving, labeling, and data preprocessing—each needing targeted tests.
Shadow tests run the new model in production without returning predictions, catching deployment and preprocessing mismatches early.
Evaluation should be slice-based and multi-metric to reveal a model’s performance envelope, not just an aggregate validation number.
Explainability tools may help with debugging and intuition, but the lecture doubts they deliver faithful causal explanations for deep learning in production settings.

Topics

  • ML Testing
  • Performance Envelope
  • Shadow Testing
  • Evaluation Slices
  • Explainable AI

Mentioned

  • CI/CD
  • A/B testing
  • NLP
  • SHAP
  • LIME
  • GPU