
Lecture 9: Testing and Deployment - Full Stack Deep Learning - March 2019

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this summary, support the original creators by watching, liking, and subscribing.

TL;DR

Treat training, prediction, and serving as separate systems because failures originate in different places and require different test types.

Briefing

Machine learning systems need a different testing and deployment playbook than traditional software because the “running system” depends on both code and trained weights—and it will see new, shifting production data. The core framework presented splits work into three conceptual pipelines: a training system that turns raw data into model weights, a prediction system that uses code plus weights to generate outputs, and a serving system that exposes predictions to users at scale. That separation matters because each stage fails in different ways, so each stage needs different safeguards.

Functionality tests target the prediction system on a small set of critical examples to catch obvious regressions quickly—like a code typo that breaks inference entirely. Validation (evaluation) tests run on a held-out validation set to detect model regressions after changes to code or weights; they’re designed to run fast enough for continuous integration (CI), typically within about an hour, and to enforce minimum accuracy and runtime constraints. Training system tests focus on the full training pipeline—from downloading raw data through preprocessing and training—so upstream changes (such as database schema updates, missing images, or dependency/version shifts) get caught early on a schedule rather than months later when retraining finally fails.
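As a concrete illustration, here is a minimal pytest-style sketch of those two test tiers. The `text_recognizer` helpers and both threshold values are assumptions for illustration, not code from the lecture.

```python
import time

# Hypothetical project helpers -- stand-ins, not the lecture's actual code.
from text_recognizer import (load_model, predict, load_critical_examples,
                             load_validation_set)

MIN_ACCURACY = 0.95            # assumed floor, e.g. the current model's baseline
MAX_SECONDS_PER_EXAMPLE = 1.0  # assumed runtime ceiling

def test_functionality_on_critical_examples():
    """Fast smoke test: a few must-pass examples catch broken inference."""
    model = load_model()
    for image, expected in load_critical_examples():
        assert predict(model, image) == expected

def test_validation_accuracy_and_runtime():
    """Slower CI test: a held-out set guards accuracy and a latency budget."""
    model = load_model()
    images, labels = load_validation_set()
    start = time.time()
    predictions = [predict(model, image) for image in images]
    per_example = (time.time() - start) / len(images)
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    assert accuracy >= MIN_ACCURACY
    assert per_example <= MAX_SECONDS_PER_EXAMPLE
```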

Once the model is deployed, the serving system can’t be “tested” the same way because production inputs are effectively unknown in advance. Instead, it relies on monitoring: confirming the service is up, tracking error rates and latency, and—crucially for ML—watching for data distribution shifts between training/validation and real user traffic. Examples include changes in image resolution (e.g., 256×256 to 1024×1024) or shifts in pixel intensity histograms or color/grayscale composition. The discussion also frames monitoring as an alerting mechanism rather than an automated retraining trigger.
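A minimal sketch of what a histogram-based shift check might look like, assuming grayscale images and a KL-divergence statistic. The lecture names intensity histograms as a signal but not a specific statistic, so the divergence choice, helper names, and threshold are all illustrative.

```python
import numpy as np

def intensity_histogram(images, bins=32):
    """Normalized pixel-intensity histogram over a batch of grayscale images."""
    values = np.concatenate([img.ravel() for img in images])
    hist, _ = np.histogram(values, bins=bins, range=(0, 255))
    return hist / hist.sum()

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with smoothing so empty bins don't blow up."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def check_distribution_shift(reference_images, production_images, threshold=0.1):
    """Alert (don't retrain) when production intensities drift from training."""
    drift = kl_divergence(intensity_histogram(reference_images),
                          intensity_histogram(production_images))
    if drift > threshold:  # threshold is an assumed, tunable value
        print(f"ALERT: input intensity distribution shifted (KL={drift:.3f})")
    return drift
```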

To ground these ideas, the lecture draws on Google’s “ML test score” rubric for production readiness and technical debt reduction. It argues that ML production quality is only as strong as the weakest link across four categories: model specs review, ML infrastructure reproducibility and integration testing, data and schema/feature expectations, and monitoring that notifies on dependency changes, input invariants, distribution skew, and stale models. The rubric is scored with partial automation and documentation, and it emphasizes that even mature organizations often land around low scores on average—making it an aspirational target.
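The scoring rule is easy to make concrete: documented manual processes earn partial credit, automated ones full credit, and the overall score is the minimum across sections. The section values in this sketch are made-up examples, not scores from the lecture.

```python
# Illustrative only: each section's points follow the rubric's rule of partial
# credit for a documented manual process and full credit for automation. The
# values below are invented examples.
section_scores = {
    "data_tests": 3.0,
    "model_tests": 2.5,
    "infrastructure_tests": 1.0,
    "monitoring_tests": 2.0,
}

# The final ML test score is the MINIMUM section score, so the weakest
# category caps overall production readiness.
weakest = min(section_scores, key=section_scores.get)
print(f"ML test score: {section_scores[weakest]} (capped by {weakest})")
```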

The tooling section connects the framework to practical CI/CD and deployment. CI services such as CircleCI, Travis CI, Jenkins, and Buildkite run linting, unit/integration tests, and validation tests on every commit, while Docker containerization standardizes dependencies so tests run consistently on shared infrastructure. For deployment, the lecture compares cloud VM scaling behind load balancers, container orchestration (with Kubernetes as the “clear winner” for distributing multi-container apps), and serverless functions (e.g., AWS Lambda) that scale by compute time rather than always-on instances. For ML inference, CPU-based serving is often sufficient for single-request workloads, while GPU-optimized serving (e.g., TensorFlow Serving or Clipper) becomes relevant when throughput or batching makes GPUs worthwhile.
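For the serverless option, a minimal sketch of an inference entry point in the AWS Lambda handler style is below. The `text_recognizer` helpers are assumed; loading the model at module import time is a common pattern so warm invocations skip the cold-start load.

```python
import base64
import json

# Hypothetical project helpers -- stand-ins, not the lecture's actual code.
from text_recognizer import load_model, predict

# Loaded once at import time; reused across warm invocations.
MODEL = load_model()

def handler(event, context):
    """Decode a base64 image from the request body and return the prediction."""
    body = json.loads(event["body"])
    image_bytes = base64.b64decode(body["image"])
    prediction = predict(MODEL, image_bytes)
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction}),
    }
```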

Finally, the labs operationalize the concepts: Lab 8 adds linting and validation tests plus CircleCI automation; Lab 9 wraps a text recognition model in a Flask API, containerizes it with Docker, and deploys it to AWS Lambda using the Serverless Framework, then inspects runtime metrics and logs (including prediction confidence and input pixel intensity) to support ongoing monitoring and distribution-shift detection.
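A hedged sketch of what such a Flask wrapper might look like, with the monitoring signals (confidence, input intensity) written to logs. The route and helper names are illustrative, not the lab's actual code.

```python
import logging

import numpy as np
from flask import Flask, jsonify, request

# Hypothetical helper -- assumed to return (text, confidence as a float).
from text_recognizer import load_model, predict_with_confidence

app = Flask(__name__)
model = load_model()
logging.basicConfig(level=logging.INFO)

@app.route("/v1/predict", methods=["POST"])
def predict_route():
    image = np.array(request.json["image"], dtype=np.uint8)
    text, confidence = predict_with_confidence(model, image)
    # Log the signals the lab inspects: prediction confidence and mean input
    # intensity, which later feed distribution-shift detection.
    logging.info("confidence=%.3f mean_intensity=%.1f", confidence, image.mean())
    return jsonify({"prediction": text, "confidence": confidence})
```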

Cornell Notes

The lecture lays out an ML-specific testing and deployment blueprint built around three distinct stages: training, prediction, and serving. Functionality tests quickly validate the prediction system on a few high-stakes examples, while validation tests run in CI on a held-out dataset to catch model regressions in both accuracy and latency. Training system tests run on a schedule to detect upstream pipeline breakages like schema changes, missing data, or dependency/version issues. Serving can’t be “fully tested” ahead of time, so it relies on monitoring—especially for data distribution shifts—plus standard uptime, error-rate, and latency checks. The Google ML test score rubric provides a structured way to score production readiness across code, data, infrastructure, and monitoring.

Why split ML codebases into training, prediction, and serving systems instead of treating everything as one pipeline?

Training processes raw data, runs experiments, and produces weights; prediction combines code and weights to generate outputs; serving wraps prediction behind an interface (often an HTTP REST API) and must scale to variable demand. Each stage fails differently: training breaks when upstream data formats or dependencies change; prediction breaks when inference code regresses; serving breaks when production traffic, latency, or input distributions differ from what was seen in validation.

What’s the difference between functionality tests and validation tests for ML models?

Functionality tests run quickly on a small set of important examples to catch obvious breakages, such as a code regression (e.g., a typo that prevents inference). Validation tests run on a held-out validation set after changes to code or weights, typically in CI, to detect model regressions like accuracy drops. The lecture also recommends enforcing runtime ceilings (e.g., “must run in under X seconds”) so changes that keep accuracy but slow inference still get flagged.

Why do training system tests exist if prediction already has tests?

Training system tests target upstream regressions in the training pipeline—issues that won’t necessarily affect already-trained prediction code. Examples include database schema changes that alter training data format, deleted images, or dependency updates that prevent training from running. Running these tests on a regular schedule helps catch breakages immediately after changes, rather than discovering them when retraining is attempted months later.
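A hedged sketch of such a scheduled (e.g., nightly) pipeline smoke test; all `training_pipeline` helpers are assumptions. The point is to exercise every upstream stage on a small sample so breakages surface within a day instead of at the next real retrain.

```python
import math

# Hypothetical pipeline helpers -- stand-ins, not the lecture's actual code.
from training_pipeline import download_sample, preprocess, train

def test_training_pipeline_end_to_end():
    """Nightly smoke test: run every upstream stage on a tiny data sample."""
    raw = download_sample(n=100)        # fails fast on schema or missing-file changes
    dataset = preprocess(raw)           # fails on format or dependency drift
    history = train(dataset, epochs=2)  # fails on library/version breakage
    # Not a model-quality gate: just confirm training ran and the loss behaved.
    assert math.isfinite(history["loss"][-1])
    assert history["loss"][-1] <= history["loss"][0]
```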

Why does serving rely on monitoring instead of the same kind of testing used for prediction?

Production data is effectively unknown ahead of time, so the system can’t be fully validated on a fixed dataset. Monitoring therefore checks service health (is it up, are errors rising) and ML-specific signals like distribution shift. The lecture gives concrete shift examples: image resolution changes (256×256 to 1024×1024), grayscale-to-color changes, or shifts in pixel intensity histograms that could degrade downstream performance.

How does the ML test score rubric translate into actionable engineering work?

It scores production readiness across categories such as model specs review and submission, ML infrastructure reproducibility and pipeline integration testing, data tests (feature expectations captured in schema, privacy considerations, and invariants), and monitoring tests (dependency changes trigger notifications, input invariants hold, training/serving aren’t skewed, and models aren’t too stale). The rubric uses partial credit for manual processes with documentation and full credit for automation, then takes the minimum score across sections—so the weakest area limits overall readiness.

What role do Docker and CI tools play in ML testing and deployment?

CI tools (CircleCI, Travis CI, Jenkins, Buildkite) automate running linting, prediction tests, and validation tests on each commit without deploying broken code. Docker standardizes dependencies and versions by packaging binaries/libraries and application code into reproducible images, enabling consistent test environments on shared infrastructure. The lecture emphasizes that training tests may require GPU access, so they’re often run on dedicated hardware or nightly jobs rather than on free-tier CI.

Review Questions

  1. How would you design a testing strategy that catches both code regressions and upstream data pipeline breakages, and where would each test type run (local, CI, nightly)?
  2. What monitoring signals would you set for an image model to detect distribution shift, and how would you distinguish “service health” problems from “ML quality” problems?
  3. In the ML test score rubric, which category would you prioritize if your biggest risk is stale models, and what would the corresponding alerting/automation look like?

Key Points

  1. Treat training, prediction, and serving as separate systems because failures originate in different places and require different test types.
  2. Use functionality tests for fast, high-stakes checks on a few critical examples to catch immediate inference breakages.
  3. Run validation tests in CI on a held-out dataset to detect model regressions in accuracy and enforce inference latency/runtime thresholds.
  4. Schedule training system tests to catch upstream pipeline regressions such as schema changes, missing data, and dependency/version issues.
  5. Rely on monitoring for serving: track uptime, error rates, latency, and ML-specific distribution shift signals (e.g., resolution, intensity histograms, grayscale/color changes).
  6. Score production readiness using the ML test score rubric across code, data, infrastructure, and monitoring, and focus on the weakest category.
  7. Use CI plus Docker to make test environments reproducible, and choose deployment targets (VMs, containers, serverless, or specialized model servers) based on scaling needs and whether CPU or GPU inference is required.

Highlights

ML serving can’t be “tested” on a fixed dataset because production inputs are unknown; monitoring must detect distribution shifts and service health in real time.
The ML test score rubric treats production quality as the minimum across multiple readiness categories—so one weak area can cap overall maturity.
Validation tests should guard both accuracy and runtime, since regressions can appear as slower inference even when accuracy stays stable.
Docker containerization makes CI testing reproducible by packaging dependencies and versions, reducing environment drift.
Serverless inference (e.g., AWS Lambda) scales by compute time and supports canary/rollback patterns, but cold-start latency can matter for latency-sensitive workloads.

Topics

  • ML Testing Strategy
  • Production Monitoring
  • CI With Docker
  • Model Serving
  • Serverless Deployment
