Lecture 9: Testing and Deployment - Full Stack Deep Learning - March 2019
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Machine learning systems need a different testing and deployment playbook than traditional software because the “running system” depends on both code and trained weights—and it will see new, shifting production data. The core framework presented splits work into three conceptual pipelines: a training system that turns raw data into model weights, a prediction system that uses code plus weights to generate outputs, and a serving system that exposes predictions to users at scale. That separation matters because each stage fails in different ways, so each stage needs different safeguards.
Functionality tests target the prediction system on a small set of critical examples to catch obvious regressions quickly—like a code typo that breaks inference entirely. Validation (evaluation) tests run on a held-out validation set to detect model regressions after changes to code or weights; they’re designed to run fast enough for continuous integration (CI), typically within about an hour, and to enforce minimum accuracy and runtime constraints. Training system tests focus on the full training pipeline—from downloading raw data through preprocessing and training—so upstream changes (such as database schema updates, missing images, or dependency/version shifts) get caught early on a schedule rather than months later when retraining finally fails.
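The split between functionality and validation tests can be sketched in pytest style. Everything here is hypothetical: `predict`, the critical-example list, and the thresholds stand in for a real prediction system and its held-out data.

```python
import time

# Hypothetical stand-in for the prediction system: in a real project this
# would load code plus trained weights.
def predict(x):
    return "cat" if x >= 0 else "dog"

# Functionality test: a handful of critical examples that must never break,
# catching obvious regressions (e.g., a typo that kills inference) quickly.
CRITICAL_EXAMPLES = [(1.0, "cat"), (-1.0, "dog")]

def test_functionality():
    for x, expected in CRITICAL_EXAMPLES:
        assert predict(x) == expected

# Validation test: a held-out set with a minimum-accuracy floor and a runtime
# ceiling, fast enough to run in CI on every change to code or weights.
VALIDATION_SET = [(0.5, "cat"), (2.0, "cat"), (-0.3, "dog"), (-2.0, "dog")]

def test_validation():
    start = time.perf_counter()
    correct = sum(predict(x) == y for x, y in VALIDATION_SET)
    elapsed = time.perf_counter() - start
    assert correct / len(VALIDATION_SET) >= 0.9  # accuracy floor
    assert elapsed < 1.0                         # runtime ceiling
```

Training system tests would wrap the same assertion style around the full pipeline (download, preprocess, train) and run on a schedule rather than per commit.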
Once the model is deployed, the serving system can’t be “tested” the same way because production inputs are effectively unknown in advance. Instead, it relies on monitoring: confirming the service is up, tracking error rates and latency, and—crucially for ML—watching for data distribution shifts between training/validation and real user traffic. Examples include changes in image resolution (e.g., 256×256 to 1024×1024) or shifts in pixel intensity histograms or color/grayscale composition. The discussion also frames monitoring as an alerting mechanism rather than an automated retraining trigger.
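One way to operationalize the intensity-histogram check is to compare training and production histograms with a divergence score and alert above a threshold. This is a minimal sketch, not the lecture's implementation; the 16-bin histogram, symmetric KL divergence, and threshold value are all illustrative choices.

```python
import numpy as np

def intensity_histogram(pixels, bins=16):
    """Normalized pixel-intensity histogram over [0, 255]."""
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 255))
    return hist / hist.sum()

def distribution_shift_score(train_pixels, prod_pixels, bins=16, eps=1e-8):
    """Symmetric KL divergence between training and production histograms.
    Larger values suggest serving traffic has drifted from training data."""
    p = intensity_histogram(train_pixels, bins) + eps
    q = intensity_histogram(prod_pixels, bins) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def check_for_shift(train_pixels, prod_pixels, threshold=0.5):
    # Alerting, not automatic retraining: just flag when drift exceeds
    # the (illustrative) threshold so a human can investigate.
    return distribution_shift_score(train_pixels, prod_pixels) > threshold
```

The same pattern extends to other monitored signals from the lecture, such as image resolution or the grayscale/color mix of incoming requests.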
To ground these ideas, the lecture draws on Google’s “ML test score” rubric for production readiness and technical debt reduction. It argues that ML production quality is only as strong as the weakest link across four categories: model specs review, ML infrastructure reproducibility and integration testing, data and schema/feature expectations, and monitoring that notifies on dependency changes, input invariants, distribution skew, and stale models. The rubric is scored with partial automation and documentation, and it emphasizes that even mature organizations often land around low scores on average—making it an aspirational target.
The tooling section connects the framework to practical CI/CD and deployment. CI services such as CircleCI, Travis CI, Jenkins, and Buildkite run linting, unit/integration tests, and validation tests on every commit, while Docker containerization standardizes dependencies so tests run consistently on shared infrastructure. For deployment, the lecture compares cloud VM scaling behind load balancers, container orchestration (with Kubernetes as the “clear winner” for distributing multi-container apps), and serverless functions (e.g., AWS Lambda) that scale automatically per request and bill by compute time rather than requiring always-on instances. For ML inference, CPU-based serving is often sufficient for single-request workloads, while GPU-optimized serving (e.g., TensorFlow Serving or Clipper) becomes relevant when throughput or batching makes GPUs worthwhile.
Finally, the labs operationalize the concepts: Lab 8 adds linting and validation tests plus CircleCI automation; Lab 9 wraps a text recognition model in a Flask API, containerizes it with Docker, and deploys it to AWS Lambda using Serverless Framework, then inspects runtime metrics and logs (including confidence and input intensity) to support ongoing monitoring and distribution-shift detection.
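The Lab 9 pattern of a serverless prediction endpoint that also emits monitoring signals can be sketched as a Lambda-style handler. The model wrapper and the logged field names here are hypothetical placeholders, not the lab's actual text recognizer, but the shape (parse, predict, log confidence and input intensity, respond) mirrors the described workflow.

```python
import json

# Hypothetical model wrapper; in the lab this would be the text recognition
# model behind the Flask API. Returns a label and a confidence score.
def predict_with_confidence(pixels):
    mean = sum(pixels) / len(pixels)
    label = "dark" if mean < 128 else "light"
    confidence = abs(mean - 128) / 128
    return label, confidence

def handler(event, context):
    """AWS Lambda-style entry point: parse the request body, predict, and
    log the metrics (confidence, mean input intensity) that downstream
    monitoring uses for distribution-shift detection."""
    pixels = json.loads(event["body"])["pixels"]
    label, confidence = predict_with_confidence(pixels)
    mean_intensity = sum(pixels) / len(pixels)
    # Structured log line for CloudWatch-style metric extraction.
    print(json.dumps({"confidence": confidence,
                      "mean_intensity": mean_intensity}))
    return {"statusCode": 200,
            "body": json.dumps({"label": label, "confidence": confidence})}
```

Because the metrics are emitted as structured logs, the distribution-shift checks described earlier can run against the log stream without touching the serving path.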
Cornell Notes
The lecture lays out an ML-specific testing and deployment blueprint built around three distinct stages: training, prediction, and serving. Functionality tests quickly validate the prediction system on a few high-stakes examples, while validation tests run in CI on a held-out dataset to catch model regressions in both accuracy and latency. Training system tests run on a schedule to detect upstream pipeline breakages like schema changes, missing data, or dependency/version issues. Serving can’t be “fully tested” ahead of time, so it relies on monitoring—especially for data distribution shifts—plus standard uptime, error-rate, and latency checks. The Google ML test score rubric provides a structured way to score production readiness across code, data, infrastructure, and monitoring.
- Why split ML codebases into training, prediction, and serving systems instead of treating everything as one pipeline?
- What’s the difference between functionality tests and validation tests for ML models?
- Why do training system tests exist if prediction already has tests?
- Why does serving rely on monitoring instead of the same kind of testing used for prediction?
- How does the ML test score rubric translate into actionable engineering work?
- What role do Docker and CI tools play in ML testing and deployment?
Review Questions
- How would you design a testing strategy that catches both code regressions and upstream data pipeline breakages, and where would each test type run (local, CI, nightly)?
- What monitoring signals would you set for an image model to detect distribution shift, and how would you distinguish “service health” problems from “ML quality” problems?
- In the ML test score rubric, which category would you prioritize if your biggest risk is stale models, and what would the corresponding alerting/automation look like?
Key Points
1. Treat training, prediction, and serving as separate systems because failures originate in different places and require different test types.
2. Use functionality tests for fast, high-stakes checks on a few critical examples to catch immediate inference breakages.
3. Run validation tests in CI on a held-out dataset to detect model regressions in accuracy and enforce inference latency/runtime thresholds.
4. Schedule training system tests to catch upstream pipeline regressions such as schema changes, missing data, and dependency/version issues.
5. Rely on monitoring for serving: track uptime, error rates, latency, and ML-specific distribution shift signals (e.g., resolution, intensity histograms, grayscale/color changes).
6. Score production readiness using the ML test score rubric across code, data, infrastructure, and monitoring, and focus on the weakest category.
7. Use CI plus Docker to make test environments reproducible, and choose deployment targets (VMs, containers, serverless, or specialized model servers) based on scaling needs and whether CPU or GPU inference is required.