Lecture 03: Troubleshooting & Testing (FSDL 2022)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Troubleshooting and testing in software is about risk reduction, but testing never becomes a guarantee of correctness—so the practical goal is to catch the most likely failures early, with minimal friction. Tests are written to fail clearly when bugs exist, yet they can’t prove correctness (especially in loosely typed, fast-moving environments like Python). A useful way to think about test suites is as classifiers: each test pass/fail acts like a prediction about whether a commit introduced a bug. That framing shifts attention from “maximum coverage” to smarter tradeoffs—especially the balance between detection and false alarms, where a false alarm looks like a failing test followed by a commit that fixes the test rather than the underlying code.
Before adding a test, engineers should list the real-world bugs the test would catch (plausible ways the system could break) and then list the legitimate changes that would cause the test to fail anyway. If the false alarms outnumber the real bug detections, the test is likely to waste time. In high-stakes domains (cardiac diagnostics, self-driving cars, and regulated or soon-to-be-regulated finance) teams may need higher confidence, but even then the approach remains pragmatic: use tools like pytest for Python testing, doctest for keeping docstring examples aligned with code, and coverage tooling (like codecov) to understand what is being exercised rather than to chase arbitrary coverage targets. The guidance is to follow an 80/20 mindset: invest in a small set of high-impact tests rather than writing low-value tests just to meet a metric.
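As a sketch of this mindset, the file below shows the pytest + doctest combination on an invented helper (`parse_date` and its inputs are hypothetical, not from the lecture). The docstring example doubles as a doctest, and each test targets a real bug class rather than line coverage.

```python
# test_dates.py -- a minimal, hypothetical pytest + doctest example.
# The docstring example is executable documentation; the tests below
# each name the real-world bug they would catch.

def parse_date(s: str) -> tuple:
    """Parse an ISO-style YYYY-MM-DD string into (year, month, day).

    >>> parse_date("2022-08-08")
    (2022, 8, 8)
    """
    year, month, day = s.split("-")
    return (int(year), int(month), int(day))


def test_parse_date_valid():
    # Real bug this catches: swapped fields or string-vs-int results.
    assert parse_date("2022-08-08") == (2022, 8, 8)


def test_parse_date_rejects_garbage():
    # Real bug this catches: silently accepting malformed input.
    # (pytest.raises is the idiomatic form; plain try/except keeps the
    # sketch dependency-free.)
    try:
        parse_date("not-a-date")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for malformed input")
```

Running `pytest --doctest-modules` executes both the test functions and the docstring example, so the documentation stays aligned with the code.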
Quality assurance also includes linting and formatting to keep code review focused on substance. Uniform style reduces bikeshedding and makes version control diffs smaller and easier to scan. In Python, black handles automated formatting, while flake8 covers non-automatable style issues and can be extended for type hints, docstring conventions, and common bug patterns. For shell scripts, shellcheck helps catch sharp-edge bash behaviors quickly. But strict enforcement can become self-defeating; the recommendation is to filter lint rules down to what achieves the intended goals and to apply rules in an opt-in way so existing codebases can adopt improvements gradually.
Automation is the force multiplier that makes testing and linting sustainable. The workflow should run in the cloud via GitHub Actions (fast checks on every push/pull request, heavier suites on schedules), while pre-commit enables quick local runs in isolated environments. Automation reduces context switching, improves reproducibility, and encodes the "how" of quality checks as versioned, executable artifacts rather than tribal knowledge.
Machine learning testing adds extra difficulty because data, training, and models behave differently from compiled software: data is heavier and more inscrutable, training is complex, and debugging is harder. The recommended starting point is “smoke tests” that detect when something is on fire. For data pipelines, expectation testing checks basic properties (e.g., no nulls, valid date ordering) using Great Expectations, with expectations loosened to avoid false positives. For training, memorization tests verify that a model can overfit a tiny subset; if it can’t, something is broken in gradients, labels, or numerical stability. For model behavior, regression testing treats models like functions: build suites from hard examples (high loss) and group failures into named categories (e.g., pedestrian detection misses due to shadows/reflections/night scenes). Production testing then means monitoring and fast remediation, not simply shipping and hoping.
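The expectation-testing idea can be sketched without any library: the field names (`user_id`, `start`, `end`) and records below are invented, and Great Expectations provides a far richer, declarative version of these checks, but the shape is the same: assert a basic property of a data batch and report the violating rows.

```python
# Hand-rolled expectation testing on a batch of records, illustrating
# the two properties mentioned above: no nulls, valid date ordering.
# Field names and data are hypothetical.

def expect_no_nulls(records, field):
    """Every record must have a non-None value for `field`."""
    violations = [r for r in records if r.get(field) is None]
    return {"success": not violations, "violations": violations}


def expect_ordered(records, earlier, later):
    """Every record must satisfy record[earlier] <= record[later]."""
    violations = [r for r in records if r[earlier] > r[later]]
    return {"success": not violations, "violations": violations}


batch = [
    {"user_id": 1, "start": "2022-01-01", "end": "2022-01-05"},
    {"user_id": 2, "start": "2022-02-01", "end": "2022-02-03"},
]

assert expect_no_nulls(batch, "user_id")["success"]
assert expect_ordered(batch, "start", "end")["success"]
```

Returning the violating rows (not just a boolean) matters in practice: it is what lets a team decide whether a failure is a real data bug or an expectation that should be loosened.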
When models fail tests, troubleshooting follows a three-step ladder: make the model run (fix shape mismatches, out-of-memory, NaNs/Infs), make it fast (profile and remove bottlenecks), then make it correct (reduce loss on validation/test/production data). Because models are never perfectly correct, scale often resolves many issues—overfitting, underfitting, and distribution shift—though fine-tuning and using foundation models become necessary when compute budgets are limited. The overall message is to start with low-hanging fruit tests, automate them, and then iterate toward deeper, ML-specific quality checks and targeted troubleshooting workflows.
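The "make it run" rung can be made concrete with a fail-fast guard on numerics. This sketch uses plain floats to stay dependency-free; in a real training loop the same idea would inspect tensors (e.g. with a framework's finiteness check) on losses and gradients.

```python
import math


def check_finite(name, values):
    """Raise immediately if any value is NaN or Inf.

    A 'make it run' guard: surfacing non-finite numbers at the point
    they appear is far easier to debug than a loss curve that silently
    diverges. `name` identifies the offending quantity in the error.
    """
    bad = [v for v in values if not math.isfinite(v)]
    if bad:
        raise ValueError(f"{name} contains non-finite values: {bad[:5]}")
    return values


# Passes silently on a healthy loss history:
check_finite("loss_history", [0.9, 0.5, 0.31])
```

Only after such guards pass cleanly is it worth moving to the next rungs: profiling for speed, then reducing loss for correctness.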
Cornell Notes
Testing and troubleshooting aim to reduce shipping risk, but tests are not certificates of correctness. A practical mindset treats test suites like classifiers: each pass/fail is a prediction about whether a commit introduced a bug, so teams should design tests to catch real failures while limiting false alarms. For software, pytest, doctest, linting (black, flake8, shellcheck), and coverage tooling (codecov) support fast feedback, while automation via GitHub Actions and pre-commit keeps quality checks consistent and reproducible. For ML systems, “smoke tests” start with expectation testing on data (Great Expectations), memorization tests for training, and regression suites built from hard examples and production failures. When models still fail, troubleshooting proceeds in three steps: make it run (shapes/OOM/numerics), make it fast (profiling), then make it correct (loss reduction, often via scale).
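The memorization test can be sketched in pure Python: fit a two-parameter linear model on three points with gradient descent and assert the loss collapses. In practice the same check runs the real training loop on a handful of real examples (any framework substitutes here); the model, data, and thresholds below are illustrative only. If the loss does not collapse, suspect gradients, labels, or numerical stability.

```python
# Memorization ("can it overfit?") test: a model that cannot drive the
# loss to ~0 on a tiny dataset has a broken training loop.

def memorization_test(xs, ys, steps=2000, lr=0.05):
    """Fit y = w*x + b by gradient descent; return final mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE for the linear model w*x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


# Three points on the line y = 2x + 1: the model must memorize them.
loss = memorization_test([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
assert loss < 1e-3, f"model failed to memorize 3 points (loss={loss})"
```

Keeping the dataset tiny and the step count small is what makes this cheap enough to run on every commit.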
- Why should teams avoid chasing 100% test coverage or strict coverage thresholds?
- How does the "false alarm" concept change how engineers design tests?
- What are "smoke tests" for ML, and why are they emphasized?
- How do memorization tests work, and how can they be made fast enough for frequent runs?
- What does ML regression testing look like when outputs are complex or when failures are recurring?
- What is the three-step troubleshooting ladder for failing models?
Review Questions
- When adding a new test, what two lists should be compared to reduce false alarms, and what does it mean if the false-alarm list is longer?
- How do expectation tests, memorization tests, and regression suites each target a different failure mode in an ML pipeline?
- In troubleshooting, why is “make it run” separated from “make it fast” and “make it correct,” and what kinds of issues belong in each step?
Key Points
1. Treat test suites as classifiers that predict whether a commit introduced a bug, and design tests to maximize real bug detection while minimizing false alarms.
2. Avoid coverage-chasing: use coverage tooling to understand test health, but prioritize high-impact tests over meeting arbitrary coverage targets.
3. Use pytest for Python tests and doctest to keep docstring examples synchronized with code; for notebooks, run them and add assertions as a pragmatic "cheap and dirty" approach.
4. Standardize formatting and linting with black and flake8 (plus shellcheck for bash) while keeping an escape valve: filter rules to essentials and apply them gradually via opt-in strategies.
5. Automate quality checks with GitHub Actions (fast checks on push/PR, heavier suites on schedules) and pre-commit for quick local runs with isolated environments.
6. For ML, start with smoke tests: expectation testing for data properties (Great Expectations), memorization tests for training sanity, and regression suites built from hard examples and production failures.
7. Troubleshoot failing models in order: fix run-stoppers (shapes/OOM/numerics), then profile for speed, then improve correctness by reducing loss, often via scale when feasible.