Lecture 11A: Deploying ML Models (Full Stack Deep Learning - Spring 2021)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
Machine learning models don’t become “production-ready” just because they work in a notebook; they need a deployment path that fits the latency, scaling, and maintenance realities of real traffic. The lecture lays out several deployment architectures—batch prediction, model-in-web-service, and a dedicated model service—then drills into how to run that model service reliably (REST APIs, dependency management, performance tuning, and horizontal scaling). It closes by covering edge prediction, where models run on the client device, and the mindset shifts required to make that work.
A first option is batch prediction: run the model offline on new data, store results in a database, and serve them later as fast lookups. This can be simple and low-latency for cases with limited input variety (for example, one prediction per user per day). The trade-off is staleness—predictions lag behind the latest data or fixes, and stale outputs can be hard to detect.
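To make the pattern concrete, here is a minimal sketch of batch prediction in Python, assuming a hypothetical `model` object with a `predict` method and using SQLite as a stand-in for the lookup store; the lecture does not prescribe specific tools.

```python
import sqlite3

# Hypothetical offline job: score every user once per cycle and cache the results.
def run_batch_job(model, user_features, db_path="predictions.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (user_id TEXT PRIMARY KEY, score REAL)"
    )
    for user_id, features in user_features.items():
        score = model.predict(features)  # expensive inference happens offline
        conn.execute(
            "INSERT OR REPLACE INTO predictions VALUES (?, ?)", (user_id, float(score))
        )
    conn.commit()
    conn.close()

# Serving is just a fast lookup; results may be up to one scoring cycle stale.
def get_prediction(user_id, db_path="predictions.db"):
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    conn.close()
    return row[0] if row else None
```

Serving reduces to a single indexed read, which is why latency stays low even for a heavy model, at the cost of predictions that lag behind the most recent data.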
Another approach is to embed the model inside the existing web application. That reuses infrastructure, but it creates friction when model code and web code evolve at different rates. If the model retrains frequently while the web server changes rarely, the entire server may need redeployment. It also forces the web server to scale in lockstep with the model, even though their scaling patterns often differ, and it can be inefficient when the model needs specialized hardware like GPUs.
The most common pattern is a dedicated model service: package the model as its own server, expose it via an API, and let the web app or client call it over the network. This isolates failures (a model bug is less likely to crash the main web app), enables scaling tailored to the model, and supports reuse across multiple applications. The cost is added latency and infrastructure complexity, since the model service itself must be deployed, monitored, and maintained.
To make a model service practical, the lecture emphasizes REST APIs for prediction requests and JSON responses, while noting alternatives like gRPC and GraphQL. It highlights a major operational challenge: dependency management. Model weights are usually easy to ship, but runtime libraries (e.g., TensorFlow versions) can break behavior and complicate rollbacks. Two strategies are proposed: constrain dependencies using standardized model formats (with ONNX as an example) or use containers (Docker) to freeze environments. Containers are presented as lightweight compared to full virtual machines because the OS lives in the Docker engine rather than inside each container. For coordinating many containers, Kubernetes is positioned as the dominant orchestration layer, with Docker Compose as a simpler on-ramp.
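As an illustration of the REST-plus-JSON interface, here is a minimal Flask sketch of a prediction endpoint; the route name, payload shape, and placeholder `load_model` are assumptions for the example, not details from the lecture.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder standing in for real model loading (e.g., torch.load or
# tf.saved_model.load); loaded once at startup so every request reuses the weights.
def load_model():
    return lambda features: sum(features) / max(len(features), 1)

model = load_model()

@app.route("/predict", methods=["POST"])
def predict():
    # Clients POST JSON like {"features": [0.1, 0.2, 0.3]} and receive JSON back.
    payload = request.get_json(force=True)
    features = payload.get("features", [])
    return jsonify({"prediction": model(features)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A client would call it with something like `curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"features": [0.1, 0.2, 0.3]}'`; freezing this service's Python dependencies into a Docker image is what makes the environment reproducible across deploys.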
Performance tuning then becomes a set of engineering levers: choose CPU vs GPU inference, increase throughput via concurrency (multiple model replicas), and consider techniques like model distillation (train a smaller model to imitate a larger one) and quantization (compress weights from 32-bit floats to 8-bit integers, sometimes with quantization-aware training). Caching and batching are also key—caching avoids recomputation for frequent inputs, while batching trades off throughput against per-request latency. The lecture recommends using serving frameworks such as TensorFlow Serving and TorchServe rather than building batching and GPU-sharing logic from scratch.
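As one concrete example of these levers, the sketch below applies PyTorch's post-training dynamic quantization to a toy model; the lecture discusses quantization in general terms, so tying it to this particular API and an 8-bit integer target is an illustrative choice.

```python
import torch
import torch.nn as nn

# A small example network; in practice this would be the trained production model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

# Post-training dynamic quantization: Linear-layer weights are stored as int8 and
# dequantized on the fly, shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

example = torch.randn(1, 128)
with torch.no_grad():
    print("fp32:", model(example).item(), "int8:", quantized(example).item())
```

The compressed model usually trades a small amount of accuracy for size and speed; quantization-aware training, mentioned in the lecture, is one way to recover part of that accuracy.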
When traffic grows beyond a single host, horizontal scaling duplicates the model service and distributes requests via load balancing. Kubernetes-based serving (including Kubeflow’s KFServing) is contrasted with serverless options like AWS Lambda, which can be cost-effective for spiky demand but come with limits on package size, CPU-only execution, execution time, state management, and deployment tooling.
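A serverless deployment can be as small as a single handler function; the sketch below shows the shape of a Python AWS Lambda handler behind an HTTP trigger, with a placeholder in place of real inference, since the model, payload format, and routing are assumptions for illustration.

```python
import json

# Assumption: a small, CPU-only model would be loaded outside the handler so warm
# invocations reuse it, e.g. MODEL = load_model("model.pkl").

def handler(event, context):
    # With an API Gateway-style trigger, the request body arrives as a JSON string.
    body = json.loads(event.get("body", "{}"))
    features = body.get("features", [])
    prediction = sum(features) / max(len(features), 1)  # placeholder inference
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction}),
    }
```

The constraints the lecture lists show up directly here: the packaged code, libraries, and weights must fit within Lambda's size limits, inference is CPU-only, and any state has to live outside the function.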
Finally, edge prediction pushes inference to the client: model weights ship to browsers, phones, or robots, enabling the lowest latency and improved privacy because user data need not leave the device. The drawbacks are constrained hardware, limited framework support (e.g., TensorFlow Lite, PyTorch Mobile, TensorRT for NVIDIA devices, Apache TVM for cross-runtime compilation, TensorFlow.js for browsers), harder updates, and limited observability. The lecture’s closing advice is to treat hardware constraints as a first-class design input, test on production-like devices after compilation, and build fallbacks for slow or failing models—especially in safety-critical edge scenarios.
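To show what shipping a model to the edge can look like, here is a minimal TensorFlow Lite conversion sketch using a toy Keras model; the architecture and the choice of default optimizations are assumptions for illustration.

```python
import tensorflow as tf

# A toy Keras model standing in for the trained model that will ship to the device.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(1),
])

# Convert to TensorFlow Lite; the default optimizations enable weight quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting `.tflite` file is the artifact that actually runs on the device, which is why the lecture stresses testing after compilation on production-like hardware rather than trusting the pre-conversion model's behavior.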
Cornell Notes
The lecture maps out how machine learning models move from “works in a notebook” to “serves real predictions reliably.” It contrasts batch prediction (offline scoring with cached results) with embedding models inside a web app and with the more common dedicated model service pattern, where the model runs behind an API. Dedicated services isolate failures, scale independently, and enable reuse, but add latency and operational overhead. Making a model service production-grade hinges on REST API design, dependency management (often via Docker containers and Kubernetes orchestration), and performance techniques like concurrency, caching, batching, quantization, and distillation. When traffic or cost constraints change, horizontal scaling and serverless options (e.g., AWS Lambda) offer alternatives; for extreme latency and privacy needs, edge prediction runs inference on the client using tools like TensorFlow Lite or TensorFlow.js.
Why does batch prediction work for some ML use cases, and why does it fail as complexity grows?
What trade-offs come from embedding a model inside an existing web server versus running it as a separate model service?
How does dependency management become a production risk, and what are the two main mitigation strategies discussed?
Which performance techniques target throughput versus latency, and how do batching and caching fit in?
When should teams consider serverless inference instead of managing model servers directly?
What makes edge prediction appealing, and what operational problems does it introduce?
Review Questions
- Which deployment pattern best fits a scenario where inputs change frequently and users require near-real-time predictions, and why?
- How do Docker containers and Kubernetes orchestration jointly address dependency management and scaling for model services?
- What are the key latency/throughput trade-offs introduced by batching, and how does caching change the performance profile?
Key Points
1. Batch prediction is simplest when inputs are limited and predictions can tolerate staleness; it becomes risky when freshness matters.
2. Embedding a model inside a web server can force redeployments and couples scaling of the web app to the model, often inefficiently.
3. A dedicated model service isolates failures and enables model-specific scaling and reuse, but adds network latency and operational overhead.
4. Dependency management is a core production risk; containers (Docker) help freeze runtime environments, while Kubernetes helps coordinate many services.
5. Performance tuning spans CPU vs GPU inference, concurrency (multiple replicas), and efficiency techniques like caching, batching, quantization, and distillation.
6. Horizontal scaling duplicates the model service and distributes traffic; Kubernetes-based serving and serverless options (e.g., AWS Lambda) offer different cost/complexity trade-offs.
7. Edge prediction can minimize latency and improve privacy, but it demands hardware-aware model design, careful compilation/testing, and robust fallbacks for slow or failing inference.