Project Structure (1) - Testing & Deployment - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A practical full-stack deep learning setup hinges on separating three systems—prediction, training, and serving—and then testing each with the right kind of data at the right speed. The prediction system is more than TensorFlow or PyTorch code: it includes input-processing logic, the trained network weights, and any decision layer such as confidence thresholds that turn model outputs into final predictions. The training system then takes raw data, runs experiments, and produces the prediction system, while the serving system deploys it to production where it can scale with demand.
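To make that separation concrete, here is a minimal sketch of a prediction system in PyTorch. Everything beyond the wrapped model is an assumption for illustration: the raw-bytes input format, the 28x28 grayscale shape, and the 0.8 confidence threshold are hypothetical stand-ins, not details from the video.

```python
import torch
import torch.nn.functional as F

class PredictionSystem:
    """More than the network: input processing + trained weights + decision logic."""

    def __init__(self, model: torch.nn.Module, threshold: float = 0.8):
        self.model = model.eval()   # trained weights, frozen for inference
        self.threshold = threshold  # decision layer: confidence cutoff (assumed value)

    def preprocess(self, raw: bytes) -> torch.Tensor:
        # Input processing: decode the raw production input into model form.
        # Assumption: a 28x28 grayscale image serialized as 784 raw bytes.
        pixels = torch.frombuffer(bytearray(raw), dtype=torch.uint8).float() / 255.0
        return pixels.view(1, 1, 28, 28)

    @torch.no_grad()
    def predict(self, raw: bytes) -> dict:
        probs = F.softmax(self.model(self.preprocess(raw)), dim=-1)
        conf, label = probs.max(dim=-1)
        # Decision logic: abstain when the model is not confident enough.
        if conf.item() < self.threshold:
            return {"label": None, "confidence": conf.item(), "abstained": True}
        return {"label": label.item(), "confidence": conf.item(), "abstained": False}
```

The point of bundling it this way is that preprocessing and the confidence cutoff live inside the prediction system, so tests exercise them together with the weights rather than testing the network in isolation.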
That separation matters because the data used during development is not the same as the data the model will face in production. Training and validation data feed the training system to generate the prediction system, but production time brings a fresh dataset the team cannot see in advance. Serving therefore becomes the place to test the end-to-end arrangement against “production-like” inputs—at least indirectly—through a testing and monitoring strategy.
Testing is organized into three layers with different goals and time budgets. Functionality tests target the prediction system's correctness on a small set of high-importance examples, ideally running in under five minutes so regressions get caught quickly as code changes. These tests should start from the same form of data the system will actually see in production: for an image model, that means real images rather than preprocessed placeholders. Validation tests broaden coverage by running the prediction system on a much larger validation set (potentially millions of examples), using a form of data that may be preloaded, such as a data frame with images already attached. The aim is to detect performance regressions every time code is pushed, with an expected runtime of minutes to hours.
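A pytest sketch of both layers might look like the following. Everything project-specific is hypothetical: the `my_project.prediction` module, the fixture paths, the dataframe columns, and the 0.95 accuracy bar are illustrative, and the `slow` marker is just a common pytest convention for separating the two time budgets.

```python
from pathlib import Path

import pytest

from my_project.prediction import load_prediction_system  # hypothetical loader

SYSTEM = load_prediction_system()

# Small set of high-importance, production-form examples: real image files,
# not preprocessed tensors, so the test sees what serving will see.
CRITICAL_EXAMPLES = [
    ("tests/fixtures/stop_sign.png", "stop_sign"),
    ("tests/fixtures/yield.png", "yield"),
]

@pytest.mark.parametrize("path,expected", CRITICAL_EXAMPLES)
def test_functionality(path, expected):
    # Functionality layer: must finish in well under 5 minutes so it can
    # run on every code change.
    result = SYSTEM.predict(Path(path).read_bytes())
    assert result["label"] == expected

@pytest.mark.slow  # validation layer: run on push, budgeted for minutes to hours
def test_validation_set_accuracy():
    import pandas as pd

    # Preloaded form: a dataframe with images already attached.
    # (Hypothetical path and columns.)
    df = pd.read_pickle("data/validation_with_images.pkl")
    correct = sum(
        SYSTEM.predict(row.image_bytes)["label"] == row.label
        for row in df.itertuples()
    )
    assert correct / len(df) >= 0.95  # fail on a performance regression
```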
Training-system tests focus on the entire pipeline, from the rawest possible data through preprocessing and training, because upstream changes can silently break results. These tests may take longer, up to a day, so they run on a daily schedule. They are designed to catch issues such as data-source changes or dependency upgrades that alter data formatting. For example, even if the TensorFlow code is untouched, a pandas upgrade can change how the data is formatted, so the next training run quietly uses subtly different data; only a full training-pipeline test starting from raw, production-like inputs would reliably surface that.
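As a sketch, with every helper name hypothetical, a daily training-system test might run the whole pipeline end to end on a small slice of raw data and assert only a loose metric bound:

```python
# Hypothetical nightly pipeline test; download_raw_sample, preprocess, train,
# and evaluate are stand-ins for the project's real pipeline stages.
from my_project.pipeline import download_raw_sample, evaluate, preprocess, train

def test_full_pipeline_from_raw_data():
    # Start from the rawest form of the data so that data-source changes and
    # dependency upgrades (e.g., pandas altering data formatting) are exercised.
    raw = download_raw_sample(n=1_000)
    dataset = preprocess(raw)
    # A short, seeded training run: this layer has a time budget of up to a
    # day, so it runs on a schedule rather than on every push.
    model = train(dataset, epochs=1, seed=0)
    metrics = evaluate(model, dataset)
    # Loose bound: the goal is to catch silent pipeline regressions,
    # not to tune final accuracy.
    assert metrics["accuracy"] > 0.7
```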
Finally, the workflow distinguishes testing from monitoring. Monitoring aims to detect failures in production—service downtime, runtime errors, and shifts in the distribution of incoming data—so the team can respond when the real world diverges from training assumptions. Together, fast functionality tests, broader validation tests, full-pipeline training tests, and ongoing monitoring form a cohesive defense against both code regressions and service/data regressions.
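The video does not prescribe a specific drift detector. As one illustration, assuming a single scalar input feature, a two-sample Kolmogorov-Smirnov test from SciPy can flag when a recent production window diverges from a reference sample of training data:

```python
import numpy as np
from scipy.stats import ks_2samp

def input_drift_detected(training_sample: np.ndarray,
                         production_window: np.ndarray,
                         alpha: float = 0.01) -> bool:
    """Return True if recent production inputs look unlike the training data.

    Uses a two-sample Kolmogorov-Smirnov test on one scalar feature; real
    monitoring would also track many features, uptime, and runtime errors.
    """
    _, p_value = ks_2samp(training_sample, production_window)
    return p_value < alpha

# Example: simulate a shift in a feature's distribution at serving time.
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5_000)
prod_feature = rng.normal(0.5, 1.0, size=1_000)   # shifted mean
print(input_drift_detected(train_feature, prod_feature))  # True for this shift
```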
Cornell Notes
The system design for full-stack deep learning splits responsibilities across a prediction system, a training system, and a serving system. The prediction system includes input processing, trained weights, and decision logic like confidence thresholds. Training and validation data generate the prediction system, but serving must handle production data the team cannot pre-check. Testing is staged: functionality tests run quickly (under ~5 minutes) on a small set of production-form examples to catch code regressions; validation tests run in minutes to hours on much larger datasets to detect performance drops; training-system tests run daily (up to ~1 day) to catch upstream regressions from data-source or dependency changes. Monitoring then covers what tests can’t: production failures and data distribution shifts.
What exactly counts as the “prediction system,” beyond the neural network code?
Why can’t production data be treated the same way as training/validation data?
How do functionality tests differ from validation tests in both data and runtime goals?
What is the purpose of training-system tests, and why do they take longer?
Why does monitoring complement testing instead of replacing it?
Review Questions
- How would you choose the data format for functionality tests so they best reflect what serving will actually receive?
- What kinds of changes are most likely to slip past functionality/validation tests but get caught by full training-system tests?
- What signals in monitoring would indicate a service regression versus a data regression?
Key Points
1. Separate the prediction system (input processing + trained weights + decision logic) from the training system (experiments and pipeline execution) and the serving system (scalable deployment of predictions).
2. Use training and validation data to build the prediction system, but assume production will bring unseen data that must be handled by serving and monitored in production.
3. Run functionality tests on a small set of high-importance examples using production-form inputs, aiming for runtimes under about five minutes to catch code regressions quickly.
4. Run validation tests on much larger datasets (potentially millions of examples) to detect performance regressions, targeting runtimes of minutes to hours.
5. Test the training pipeline end-to-end with raw data in daily training-system tests (up to about a day) to catch upstream regressions from data-source changes or dependency upgrades.
6. Treat monitoring as a distinct layer that detects production failures (downtime, errors) and data distribution shifts that tests can't fully anticipate.
7. Design the overall workflow so each test layer has a clear purpose and a realistic time budget aligned with how often it can run.