Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Machine-learning systems fail in ways that offline test scores can’t fully predict, so teams need a broader testing mindset: validate not just a model’s accuracy, but the entire production pipeline, across data slices, metrics, and time. The lecture frames this as a shift from “black-box” dismissal to a more practical question—what assumptions must hold for a good test score to translate into reliable real-world performance, and what breaks those assumptions.
When offline evaluation looks strong, it often relies on a key assumption: training, test, and production data come from the same distribution. In practice, that assumption frequently fails. Data drift can occur naturally, malicious users can intentionally shift inputs, and long-tail distributions mean a test set may underrepresent rare but critical cases. Even when the distributions match, a single aggregate metric (like accuracy) can hide weak performance on important subgroups ("slices") such as regions, languages, or other categorical partitions. The lecture also stresses that building a machine learning system isn’t only about model quality: preprocessing, deployment code, labeling, and feedback loops can introduce failure modes that never show up in model-only testing.
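To make the slice-level point concrete, here is a minimal sketch of per-slice evaluation that a promotion gate could run alongside the aggregate metric. The DataFrame columns (`region`, `label`, `prediction`), the toy data, and the 0.5 threshold are illustrative assumptions, not from the lecture.

```python
import pandas as pd

# Hypothetical evaluation results: one row per example, with a categorical
# "region" column used as the slicing dimension.
results = pd.DataFrame({
    "region":     ["us", "us", "us", "eu", "eu", "apac", "apac", "apac"],
    "label":      [1, 0, 1, 1, 0, 1, 0, 1],
    "prediction": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Aggregate accuracy looks passable...
overall_acc = (results["label"] == results["prediction"]).mean()
print(f"overall accuracy: {overall_acc:.2f}")

# ...but grouping by slice reveals a subgroup where the model fails badly.
per_slice = (
    results.assign(correct=results["label"] == results["prediction"])
           .groupby("region")["correct"]
           .mean()
)
print(per_slice)

# A promotion gate could require every slice to clear a minimum bar,
# not just the aggregate (threshold chosen for illustration only).
failing = per_slice[per_slice < 0.5]
if not failing.empty:
    print("slices below threshold:")
    print(failing)
```

In this toy data the aggregate accuracy is 0.62, yet the "apac" slice scores 0.0, which is exactly the kind of hidden weakness a single number conceals.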
To address these gaps, the lecture introduces software testing fundamentals—unit, integration, and end-to-end tests—then adapts them to machine learning. The core takeaway is that “testing” must cover the full system: training infrastructure, data pipelines, the prediction wrapper, serving infrastructure, labeling, and data storage/preprocessing. For training, infrastructure tests should catch regressions quickly using short runs (e.g., a single gradient step or a single epoch). For training-data integration, teams should ensure reproducibility by rerunning abbreviated training jobs on fixed or sliding-window datasets and checking performance consistency against reference runs.
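One way to realize the "short run" idea is a pytest-style infrastructure test that takes a single optimizer step on a tiny fixed batch and checks that the loss goes down. A minimal sketch follows; the model architecture, synthetic data, and learning rate are illustrative assumptions rather than the lecture's code.

```python
import torch
import torch.nn as nn


def test_single_gradient_step_reduces_loss():
    """Fast training-infrastructure test: one optimizer step on a tiny fixed
    batch should lower the loss. The seed makes the check deterministic; the
    model and data here are stand-ins for the real training code."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(8, 16)          # tiny synthetic batch
    y = torch.randint(0, 2, (8,))   # fake labels

    loss_before = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss_before.backward()
    optimizer.step()
    loss_after = loss_fn(model(x), y)

    assert loss_after.item() < loss_before.item(), (
        f"loss did not decrease: {loss_before.item():.4f} -> {loss_after.item():.4f}"
    )


if __name__ == "__main__":
    test_single_gradient_step_reduces_loss()
    print("single-step training test passed")
```

The same pattern extends to the integration tests the lecture describes: rerun an abbreviated training job on a fixed dataset and assert that the resulting metrics stay within a tolerance of a stored reference run.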
For model readiness, the lecture emphasizes evaluation tests that go beyond a single validation score. Teams should compare candidate models to baselines across multiple datasets, multiple metrics, and multiple slices, aiming to map a "performance envelope": where the model is expected to work and where it should fail. The lecture lists categories of evaluation beyond standard metrics: behavioral tests (invariance, directional, and minimum-functionality checks), robustness tests (feature importance, sensitivity to staleness, and drift sensitivity), privacy and fairness checks, and simulation tests for systems that affect the world (notably autonomous vehicles).
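As a concrete illustration of the behavioral-test categories, here is a minimal sketch of an invariance check and a minimum-functionality check for a sentiment model. The `predict_sentiment` function and the example sentences are hypothetical placeholders; a real test would call the actual model or prediction service.

```python
def predict_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real model's predict call; in practice
    this would wrap the deployed model or its prediction wrapper."""
    return "positive" if "great" in text.lower() else "negative"


def test_invariance_to_named_entity():
    """Invariance behavioral test: swapping a neutral detail (a city name)
    should not change the predicted label."""
    original = "The service in Berlin was great."
    perturbed = "The service in Lisbon was great."
    assert predict_sentiment(original) == predict_sentiment(perturbed)


def test_minimum_functionality():
    """Minimum-functionality test: unambiguous inputs the model must get right."""
    assert predict_sentiment("This was great, I loved it.") == "positive"
    assert predict_sentiment("Terrible experience, never again.") == "negative"


if __name__ == "__main__":
    test_invariance_to_named_entity()
    test_minimum_functionality()
    print("behavioral tests passed")
```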
Finally, the lecture covers online verification through shadow tests and A/B testing. Shadow tests run the new model in production alongside the old one without exposing its predictions to users, catching offline/online discrepancies such as preprocessing mismatches or deployment translation bugs. A/B tests then measure user and business impact, often preceded by canary rollouts. The lecture closes by arguing that explainability for deep learning is often unreliable as a faithful "why": the more realistic goal is domain predictability, meaning knowing the performance envelope and reducing unknown unknowns. When true explanations are required, teams should prefer interpretable model families, treating tools like SHAP and LIME more as debugging and intuition aids than as guaranteed causal truth.
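A shadow test can be as simple as calling the candidate model on the same live inputs as the production model, logging both outputs, and comparing them offline. The request handler, model interfaces, and log format below are assumptions for illustration, not the lecture's implementation.

```python
import json
import logging

logger = logging.getLogger("shadow")


def handle_request(features, production_model, shadow_model):
    """Serve the production model's prediction to the user while logging the
    shadow (candidate) model's prediction on the same input for offline
    comparison. Model objects and feature format are hypothetical."""
    prod_pred = production_model.predict(features)
    try:
        # The shadow path must never affect the user-facing response.
        shadow_pred = shadow_model.predict(features)
        logger.info(json.dumps({
            "features": features,
            "production": prod_pred,
            "shadow": shadow_pred,
        }))
    except Exception:
        logger.exception("shadow model failed; serving production prediction anyway")
    return prod_pred  # only the production prediction reaches the user


def disagreement_rate(log_records):
    """Offline analysis over logged records: how often do the two models
    disagree? A spike often points at preprocessing or deployment mismatches."""
    records = [json.loads(r) for r in log_records]
    if not records:
        return 0.0
    disagreements = sum(r["production"] != r["shadow"] for r in records)
    return disagreements / len(records)
```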
Cornell Notes
The lecture argues that strong offline test scores don’t reliably guarantee production performance because key assumptions—like matching data distributions—often fail due to drift, long-tail gaps, adversarial inputs, and hidden slice-level weaknesses. It proposes testing the entire machine learning system, not just the model: training infrastructure, data pipelines, prediction wrappers, serving, labeling, and data storage/preprocessing. Evaluation tests should compare candidate models to baselines across multiple metrics, datasets, and slices to build a “performance envelope” (where the model works and where it doesn’t). Online checks such as shadow testing and A/B testing help catch deployment and preprocessing inconsistencies that offline tests miss. The lecture also critiques “explainability” as a faithful causal explanation for deep learning, positioning domain predictability as the more actionable goal.
- Why can a high offline validation score still mislead teams once a model reaches production?
- What does it mean to “test the entire machine learning system,” and how does that differ from testing a model?
- How should teams structure tests for the training pipeline?
- What makes pass/fail criteria for ML evaluation tests different from those of traditional software tests?
- How do shadow tests and A/B tests help catch ML-specific production bugs?
- Why does the lecture treat “explainability” as a limited goal for deep learning?
Review Questions
- What assumptions must hold for offline test metrics to transfer to production, and which failure modes break those assumptions?
- Describe a testing strategy that covers training, prediction, serving, labeling, and data preprocessing—what test types apply to each?
- How do slice-based evaluation and performance envelopes change the way teams decide whether to promote a model?
Key Points
1. Offline evaluation can fail when training/test/production distributions diverge due to drift, adversarial inputs, or long-tail underrepresentation.
2. Testing must extend beyond model accuracy to cover the full ML system: training, prediction wrappers, serving, labeling, and data storage/preprocessing.
3. Infrastructure tests for training should be fast (single-batch/epoch) to catch regressions early, while integration tests should focus on reproducibility using fixed datasets or sliding windows.
4. Evaluation tests should compare candidate models to baselines across multiple metrics, datasets, and slices to map a performance envelope rather than rely on one aggregate score.
5. Shadow testing helps detect offline/online inconsistencies (e.g., preprocessing or deployment translation bugs) before user exposure.
6. A/B testing and canary rollouts measure user and business impact, catching failures that only appear when real users interact with predictions (see the traffic-splitting sketch after this list).
7. Deep-learning “explainability” is often unreliable as a faithful causal account; domain predictability and interpretable model families are more dependable when true explanations are required.
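For the canary/A-B rollout idea in point 6, a minimal sketch of deterministic traffic splitting follows: hash each user id into a bucket and send a small fraction of users to the candidate model. The hashing scheme and the 5% canary fraction are illustrative assumptions, not from the lecture.

```python
import hashlib


def assign_variant(user_id: str, treatment_fraction: float = 0.05) -> str:
    """Deterministically assign a user to the candidate model ("treatment")
    or the current model ("control"). Hashing the user id keeps assignments
    stable across requests; a small fraction acts as a canary before a wider
    A/B rollout. The 5% default is an illustrative choice."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"


if __name__ == "__main__":
    # Sanity check: roughly treatment_fraction of users land in the canary.
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(assign_variant(u) == "treatment" for u in users) / len(users)
    print(f"treatment share: {share:.3f}")
```

Keeping the assignment a pure function of the user id means the same user always sees the same model, which makes downstream measurement of user and business metrics per variant straightforward.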