Lecture 03: Troubleshooting & Testing (FSDL 2022)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Troubleshooting and testing in software is about risk reduction, but testing never becomes a guarantee of correctness—so the practical goal is to catch the most likely failures early, with minimal friction. Tests are written to fail clearly when bugs exist, yet they can’t prove correctness (especially in loosely typed, fast-moving environments like Python). A useful way to think about test suites is as classifiers: each test pass/fail acts like a prediction about whether a commit introduced a bug. That framing shifts attention from “maximum coverage” to smarter tradeoffs—especially the balance between detection and false alarms, where a false alarm looks like a failing test followed by a commit that fixes the test rather than the underlying code.
Before adding a test, engineers should list the real-world bugs the test would catch (plausible ways the system could break) and then list the legitimate changes that would cause the test to fail anyway. If the false alarms outnumber the real bug detections, the test is likely to waste time. In high-stakes domains (cardiac diagnostics, self-driving cars, and regulated or soon-to-be-regulated finance) teams may need higher confidence, but even then the approach remains pragmatic: use tools like pytest for Python testing, doctest for keeping docstring examples aligned with code, and coverage tooling (like codecov) to understand what is being exercised rather than to chase arbitrary coverage targets. The guidance is to follow an 80/20 mindset: invest in a small set of high-impact tests rather than writing low-value tests just to meet a metric.
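As a sketch of this mindset, the file below shows the pytest + doctest combination on an invented helper (`parse_date` and its inputs are hypothetical, not from the lecture). The docstring example doubles as a doctest, and each test targets a real bug class rather than line coverage.

```python
# test_dates.py -- a minimal, hypothetical pytest + doctest example.
# The docstring example is executable documentation; the tests below
# each name the real-world bug they would catch.

def parse_date(s: str) -> tuple:
    """Parse an ISO-style YYYY-MM-DD string into (year, month, day).

    >>> parse_date("2022-08-08")
    (2022, 8, 8)
    """
    year, month, day = s.split("-")
    return (int(year), int(month), int(day))


def test_parse_date_valid():
    # Real bug this catches: swapped fields or string-vs-int results.
    assert parse_date("2022-08-08") == (2022, 8, 8)


def test_parse_date_rejects_garbage():
    # Real bug this catches: silently accepting malformed input.
    # (pytest.raises is the idiomatic form; plain try/except keeps the
    # sketch dependency-free.)
    try:
        parse_date("not-a-date")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for malformed input")
```

Running `pytest --doctest-modules` executes both the test functions and the docstring example, so the documentation stays aligned with the code.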
Quality assurance also includes linting and formatting to keep code review focused on substance. Uniform style reduces bikeshedding and makes version control diffs smaller and easier to scan. In Python, black handles automated formatting, while flake8 covers non-automatable style issues and can be extended for type hints, docstring conventions, and common bug patterns. For shell scripts, shellcheck helps catch sharp-edge bash behaviors quickly. But strict enforcement can become self-defeating; the recommendation is to filter lint rules down to what achieves the intended goals and to apply rules in an opt-in way so existing codebases can adopt improvements gradually.
Automation is the force multiplier that makes testing and linting sustainable. The workflow should run in the cloud via GitHub Actions (fast checks on every push/pull request, heavier suites on schedules), while pre-commit enables quick local runs in isolated environments. Automation reduces context switching, improves reproducibility, and encodes the "how" of quality checks as versioned, executable artifacts rather than tribal knowledge.
Machine learning testing adds extra difficulty because data, training, and models behave differently from compiled software: data is heavier and more inscrutable, training is complex, and debugging is harder. The recommended starting point is “smoke tests” that detect when something is on fire. For data pipelines, expectation testing checks basic properties (e.g., no nulls, valid date ordering) using Great Expectations, with expectations loosened to avoid false positives. For training, memorization tests verify that a model can overfit a tiny subset; if it can’t, something is broken in gradients, labels, or numerical stability. For model behavior, regression testing treats models like functions: build suites from hard examples (high loss) and group failures into named categories (e.g., pedestrian detection misses due to shadows/reflections/night scenes). Production testing then means monitoring and fast remediation, not simply shipping and hoping.
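The expectation-testing idea can be sketched without any library: the field names (`user_id`, `start`, `end`) and records below are invented, and Great Expectations provides a far richer, declarative version of these checks, but the shape is the same: assert a basic property of a data batch and report the violating rows.

```python
# Hand-rolled expectation testing on a batch of records, illustrating
# the two properties mentioned above: no nulls, valid date ordering.
# Field names and data are hypothetical.

def expect_no_nulls(records, field):
    """Every record must have a non-None value for `field`."""
    violations = [r for r in records if r.get(field) is None]
    return {"success": not violations, "violations": violations}


def expect_ordered(records, earlier, later):
    """Every record must satisfy record[earlier] <= record[later]."""
    violations = [r for r in records if r[earlier] > r[later]]
    return {"success": not violations, "violations": violations}


batch = [
    {"user_id": 1, "start": "2022-01-01", "end": "2022-01-05"},
    {"user_id": 2, "start": "2022-02-01", "end": "2022-02-03"},
]

assert expect_no_nulls(batch, "user_id")["success"]
assert expect_ordered(batch, "start", "end")["success"]
```

Returning the violating rows (not just a boolean) matters in practice: it is what lets a team decide whether a failure is a real data bug or an expectation that should be loosened.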
When models fail tests, troubleshooting follows a three-step ladder: make the model run (fix shape mismatches, out-of-memory, NaNs/Infs), make it fast (profile and remove bottlenecks), then make it correct (reduce loss on validation/test/production data). Because models are never perfectly correct, scale often resolves many issues—overfitting, underfitting, and distribution shift—though fine-tuning and using foundation models become necessary when compute budgets are limited. The overall message is to start with low-hanging fruit tests, automate them, and then iterate toward deeper, ML-specific quality checks and targeted troubleshooting workflows.
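The "make it run" rung can be made concrete with a fail-fast guard on numerics. This sketch uses plain floats to stay dependency-free; in a real training loop the same idea would inspect tensors (e.g. with a framework's finiteness check) on losses and gradients.

```python
import math


def check_finite(name, values):
    """Raise immediately if any value is NaN or Inf.

    A 'make it run' guard: surfacing non-finite numbers at the point
    they appear is far easier to debug than a loss curve that silently
    diverges. `name` identifies the offending quantity in the error.
    """
    bad = [v for v in values if not math.isfinite(v)]
    if bad:
        raise ValueError(f"{name} contains non-finite values: {bad[:5]}")
    return values


# Passes silently on a healthy loss history:
check_finite("loss_history", [0.9, 0.5, 0.31])
```

Only after such guards pass cleanly is it worth moving to the next rungs: profiling for speed, then reducing loss for correctness.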
Cornell Notes
Testing and troubleshooting aim to reduce shipping risk, but tests are not certificates of correctness. A practical mindset treats test suites like classifiers: each pass/fail is a prediction about whether a commit introduced a bug, so teams should design tests to catch real failures while limiting false alarms. For software, pytest, doctest, linting (black, flake8, shellcheck), and coverage tooling (codecov) support fast feedback, while automation via GitHub Actions and pre-commit keeps quality checks consistent and reproducible. For ML systems, “smoke tests” start with expectation testing on data (Great Expectations), memorization tests for training, and regression suites built from hard examples and production failures. When models still fail, troubleshooting proceeds in three steps: make it run (shapes/OOM/numerics), make it fast (profiling), then make it correct (loss reduction, often via scale).
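The memorization test can be sketched in pure Python: fit a two-parameter linear model on three points with gradient descent and assert the loss collapses. In practice the same check runs the real training loop on a handful of real examples (any framework substitutes here); the model, data, and thresholds below are illustrative only. If the loss does not collapse, suspect gradients, labels, or numerical stability.

```python
# Memorization ("can it overfit?") test: a model that cannot drive the
# loss to ~0 on a tiny dataset has a broken training loop.

def memorization_test(xs, ys, steps=2000, lr=0.05):
    """Fit y = w*x + b by gradient descent; return final mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE for the linear model w*x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


# Three points on the line y = 2x + 1: the model must memorize them.
loss = memorization_test([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
assert loss < 1e-3, f"model failed to memorize 3 points (loss={loss})"
```

Keeping the dataset tiny and the step count small is what makes this cheap enough to run on every commit.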
- Why should teams avoid chasing 100% test coverage or strict coverage thresholds?
- How does the "false alarm" concept change how engineers design tests?
- What are "smoke tests" for ML, and why are they emphasized?
- How do memorization tests work, and how can they be made fast enough for frequent runs?
- What does ML regression testing look like when outputs are complex or when failures are recurring?
- What is the three-step troubleshooting ladder for failing models?
Review Questions
- When adding a new test, what two lists should be compared to reduce false alarms, and what does it mean if the false-alarm list is longer?
- How do expectation tests, memorization tests, and regression suites each target a different failure mode in an ML pipeline?
- In troubleshooting, why is “make it run” separated from “make it fast” and “make it correct,” and what kinds of issues belong in each step?
Key Points
1. Treat test suites as classifiers that predict whether a commit introduced a bug, and design tests to maximize real bug detection while minimizing false alarms.
2. Avoid coverage-chasing: use coverage tooling to understand test health, but prioritize high-impact tests over meeting arbitrary coverage targets.
3. Use pytest for Python tests and doctest to keep docstring examples synchronized with code; for notebooks, run them and add assertions as a pragmatic "cheap and dirty" approach.
4. Standardize formatting and linting with black and flake8 (plus shellcheck for bash) while keeping an escape valve: filter rules to essentials and apply them gradually via opt-in strategies.
5. Automate quality checks with GitHub Actions (fast checks on push/PR, heavier suites on schedules) and pre-commit for quick local runs with isolated environments.
6. For ML, start with smoke tests: expectation testing for data properties (Great Expectations), memorization tests for training sanity, and regression suites built from hard examples and production failures.
7. Troubleshoot failing models in order: fix run-stoppers (shapes/OOM/numerics), then profile for speed, then improve correctness by reducing loss, often via scale when feasible.