Lecture 11A: Deploying ML Models (Full Stack Deep Learning - Spring 2021)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
Machine learning models don’t become “production-ready” just because they work in a notebook; they need a deployment path that fits the latency, scaling, and maintenance realities of real traffic. The lecture lays out several deployment architectures—batch prediction, model-in-web-service, and a dedicated model service—then drills into how to run that model service reliably (REST APIs, dependency management, performance tuning, and horizontal scaling). It closes by covering edge prediction, where models run on the client device, and the mindset shifts required to make that work.
A first option is batch prediction: run the model offline on new data, store results in a database, and serve them later as fast lookups. This can be simple and low-latency for cases with limited input variety (for example, one prediction per user per day). The trade-off is staleness—predictions lag behind the latest data or fixes, and stale outputs can be hard to detect.
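To make the pattern concrete, here is a minimal sketch of batch prediction in Python, assuming a hypothetical `model` object with a `predict` method and using SQLite as a stand-in for the lookup store; the lecture does not prescribe specific tools.

```python
import sqlite3

# Hypothetical offline job: score every user once per cycle and cache the results.
def run_batch_job(model, user_features, db_path="predictions.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (user_id TEXT PRIMARY KEY, score REAL)"
    )
    for user_id, features in user_features.items():
        score = model.predict(features)  # expensive inference happens offline
        conn.execute(
            "INSERT OR REPLACE INTO predictions VALUES (?, ?)", (user_id, float(score))
        )
    conn.commit()
    conn.close()

# Serving is just a fast lookup; results may be up to one scoring cycle stale.
def get_prediction(user_id, db_path="predictions.db"):
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    conn.close()
    return row[0] if row else None
```

Serving reduces to a single indexed read, which is why latency stays low even for a heavy model, at the cost of predictions that lag behind the most recent data.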
Another approach is to embed the model inside the existing web application. That reuses infrastructure, but it creates friction when model code and web code evolve at different rates. If the model retrains frequently while the web server changes rarely, the entire server may need redeployment. It also forces the web server to scale in lockstep with the model, even though their scaling patterns often differ, and it can be inefficient when the model needs specialized hardware like GPUs.
The most common pattern is a dedicated model service: package the model as its own server, expose it via an API, and let the web app or client call it over the network. This isolates failures (a model bug is less likely to crash the main web app), enables scaling tailored to the model, and supports reuse across multiple applications. The cost is added latency and infrastructure complexity, since the model service itself must be deployed, monitored, and maintained.
To make a model service practical, the lecture emphasizes REST APIs for prediction requests and JSON responses, while noting alternatives like gRPC and GraphQL. It highlights a major operational challenge: dependency management. Model weights are usually easy to ship, but runtime libraries (e.g., TensorFlow versions) can break behavior and complicate rollbacks. Two strategies are proposed: constrain dependencies using standardized model formats (with ONNX as an example) or use containers (Docker) to freeze environments. Containers are presented as lightweight compared to full virtual machines because the OS lives in the Docker engine rather than inside each container. For coordinating many containers, Kubernetes is positioned as the dominant orchestration layer, with Docker Compose as a simpler on-ramp.
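As an illustration of the REST-plus-JSON interface, here is a minimal Flask sketch of a prediction endpoint; the route name, payload shape, and placeholder `load_model` are assumptions for the example, not details from the lecture.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder standing in for real model loading (e.g., torch.load or
# tf.saved_model.load); loaded once at startup so every request reuses the weights.
def load_model():
    return lambda features: sum(features) / max(len(features), 1)

model = load_model()

@app.route("/predict", methods=["POST"])
def predict():
    # Clients POST JSON like {"features": [0.1, 0.2, 0.3]} and receive JSON back.
    payload = request.get_json(force=True)
    features = payload.get("features", [])
    return jsonify({"prediction": model(features)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A client would call it with something like `curl -X POST http://localhost:8080/predict -H "Content-Type: application/json" -d '{"features": [0.1, 0.2, 0.3]}'`; freezing this service's Python dependencies into a Docker image is what makes the environment reproducible across deploys.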
Performance tuning then becomes a set of engineering levers: choose CPU vs GPU inference, increase throughput via concurrency (multiple model replicas), and consider techniques like model distillation (train a smaller model to imitate a larger one) and quantization (compress weights from 32-bit floats to 8-bit integers, sometimes with quantization-aware training). Caching and batching are also key—caching avoids recomputation for frequent inputs, while batching trades off throughput against per-request latency. The lecture recommends using serving frameworks such as TensorFlow Serving and TorchServe rather than building batching and GPU-sharing logic from scratch.
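As one concrete example of these levers, the sketch below applies PyTorch's post-training dynamic quantization to a toy model; the lecture discusses quantization in general terms, so tying it to this particular API and an 8-bit integer target is an illustrative choice.

```python
import torch
import torch.nn as nn

# A small example network; in practice this would be the trained production model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
model.eval()

# Post-training dynamic quantization: Linear-layer weights are stored as int8 and
# dequantized on the fly, shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

example = torch.randn(1, 128)
with torch.no_grad():
    print("fp32:", model(example).item(), "int8:", quantized(example).item())
```

The compressed model usually trades a small amount of accuracy for size and speed; quantization-aware training, mentioned in the lecture, is one way to recover part of that accuracy.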
When traffic grows beyond a single host, horizontal scaling duplicates the model service and distributes requests via load balancing. Kubernetes-based serving (including Kubeflow’s KFServing) is contrasted with serverless options like AWS Lambda, which can be cost-effective for spiky demand but come with limits on package size, CPU-only execution, execution time, state management, and deployment tooling.
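A serverless deployment can be as small as a single handler function; the sketch below shows the shape of a Python AWS Lambda handler behind an HTTP trigger, with a placeholder in place of real inference, since the model, payload format, and routing are assumptions for illustration.

```python
import json

# Assumption: a small, CPU-only model would be loaded outside the handler so warm
# invocations reuse it, e.g. MODEL = load_model("model.pkl").

def handler(event, context):
    # With an API Gateway-style trigger, the request body arrives as a JSON string.
    body = json.loads(event.get("body", "{}"))
    features = body.get("features", [])
    prediction = sum(features) / max(len(features), 1)  # placeholder inference
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction}),
    }
```

The constraints the lecture lists show up directly here: the packaged code, libraries, and weights must fit within Lambda's size limits, inference is CPU-only, and any state has to live outside the function.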
Finally, edge prediction pushes inference to the client: model weights ship to browsers, phones, or robots, enabling the lowest latency and improved privacy because user data need not leave the device. The drawbacks are constrained hardware, limited framework support (e.g., TensorFlow Lite, PyTorch Mobile, TensorRT for NVIDIA devices, Apache TVM for cross-runtime compilation, TensorFlow.js for browsers), harder updates, and limited observability. The lecture’s closing advice is to treat hardware constraints as a first-class design input, test on production-like devices after compilation, and build fallbacks for slow or failing models—especially in safety-critical edge scenarios.
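To show what shipping a model to the edge can look like, here is a minimal TensorFlow Lite conversion sketch using a toy Keras model; the architecture and the choice of default optimizations are assumptions for illustration.

```python
import tensorflow as tf

# A toy Keras model standing in for the trained model that will ship to the device.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
    tf.keras.layers.Dense(1),
])

# Convert to TensorFlow Lite; the default optimizations enable weight quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting `.tflite` file is the artifact that actually runs on the device, which is why the lecture stresses testing after compilation on production-like hardware rather than trusting the pre-conversion model's behavior.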
Cornell Notes
The lecture maps out how machine learning models move from “works in a notebook” to “serves real predictions reliably.” It contrasts batch prediction (offline scoring with cached results) with embedding models inside a web app and with the more common dedicated model service pattern, where the model runs behind an API. Dedicated services isolate failures, scale independently, and enable reuse, but add latency and operational overhead. Making a model service production-grade hinges on REST API design, dependency management (often via Docker containers and Kubernetes orchestration), and performance techniques like concurrency, caching, batching, quantization, and distillation. When traffic or cost constraints change, horizontal scaling and serverless options (e.g., AWS Lambda) offer alternatives; for extreme latency and privacy needs, edge prediction runs inference on the client using tools like TensorFlow Lite or TensorFlow.js.
Why does batch prediction work for some ML use cases, and why does it fail as complexity grows?
What trade-offs come from embedding a model inside an existing web server versus running it as a separate model service?
How does dependency management become a production risk, and what are the two main mitigation strategies discussed?
Which performance techniques target throughput versus latency, and how do batching and caching fit in?
When should teams consider serverless inference instead of managing model servers directly?
What makes edge prediction appealing, and what operational problems does it introduce?
Review Questions
- Which deployment pattern best fits a scenario where inputs change frequently and users require near-real-time predictions, and why?
- How do Docker containers and Kubernetes orchestration jointly address dependency management and scaling for model services?
- What are the key latency/throughput trade-offs introduced by batching, and how does caching change the performance profile?
Key Points
1. Batch prediction is simplest when inputs are limited and predictions can tolerate staleness; it becomes risky when freshness matters.
2. Embedding a model inside a web server can force redeployments and couples scaling of the web app to the model, often inefficiently.
3. A dedicated model service isolates failures and enables model-specific scaling and reuse, but adds network latency and operational overhead.
4. Dependency management is a core production risk; containers (Docker) help freeze runtime environments, while Kubernetes helps coordinate many services.
5. Performance tuning spans CPU vs GPU inference, concurrency (multiple replicas), and efficiency techniques like caching, batching, quantization, and distillation.
6. Horizontal scaling duplicates the model service and distributes traffic; Kubernetes-based serving and serverless options (e.g., AWS Lambda) offer different cost/complexity trade-offs.
7. Edge prediction can minimize latency and improve privacy, but it demands hardware-aware model design, careful compilation/testing, and robust fallbacks for slow or failing inference.