Lab 9: Web Deployment (Full Stack Deep Learning - Spring 2021)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Lab 9 turns a trained paragraph text recognizer into something that can be called over HTTP and packaged for deployment. The core move is speeding up inference with TorchScript, then exposing the model through two deployment paths: a local Flask web server running in Docker, and an AWS Lambda-style serverless function running locally in a container. The practical payoff is clear—once the model is scripted and wrapped behind an API, the same prediction logic can serve requests from curl (or a client app) and be shipped to production-like environments.
The lab starts by upgrading the Lab 8 PyTorch model for faster inference using TorchScript. TorchScript converts dynamically defined PyTorch code into a statically compiled form, leveraging optimizations that typically reduce inference latency. The changes are intentionally small: the model is set to eval mode, scripted via torch.jit.script, and then inference calls use the scripted model exactly like the original. The lab notes that scripting takes a few seconds up front, but repeated inference becomes faster—especially valuable when a service will handle many requests.
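A minimal sketch of the scripting step is below. The network here is a stand-in for the trained Lab 8 recognizer, not the lab's actual model class; the eval / script / infer sequence is the part that matters.

```python
import torch
from torch import nn

# Stand-in network; the lab scripts the trained Lab 8 recognizer instead.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 26 * 26, 10),
)

# 1. Switch to eval mode so layers like dropout and batch norm behave deterministically.
model.eval()

# 2. Script once; this takes a few seconds but is paid only at load time.
scripted_model = torch.jit.script(model)

# 3. Call the scripted module exactly like the original for inference.
with torch.no_grad():
    batch = torch.randn(1, 1, 28, 28)
    logits = scripted_model(batch)
```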
Next comes the web wrapper. A Flask app initializes the model at startup, sets up logging, and defines routes. A simple root route returns “hello world,” while the key endpoint is /v1/predict. That endpoint supports both GET and POST. For POST requests, the server expects a JSON body containing a base64-encoded image; it decodes the image, runs prediction, computes summary statistics, logs the request, and returns the prediction as a string. For GET requests, the server reads an image URL from a query parameter (after the ? in the request URL), fetches the image, and runs the same prediction flow. Supporting both patterns mirrors common real-world API usage.
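A minimal sketch of such a server is shown below, assuming Flask, Pillow, and requests are installed. The route names match the lab, but the JSON field names and the prediction stand-in are illustrative, not the lab's exact code.

```python
import base64
from io import BytesIO

import requests
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

# In the lab, the (scripted) recognizer is constructed once here at startup.
def predict(image: Image.Image) -> str:
    """Stand-in for model.predict(image)."""
    return "predicted text"

@app.route("/")
def index():
    return "hello world"

@app.route("/v1/predict", methods=["GET", "POST"])
def predict_endpoint():
    if request.method == "POST":
        # POST: JSON body containing a base64-encoded image.
        data = request.get_json()
        image = Image.open(BytesIO(base64.b64decode(data["image"])))
    else:
        # GET: image URL passed as a query parameter after the '?'.
        image_url = request.args.get("image_url")
        image = Image.open(BytesIO(requests.get(image_url).content))
    pred = predict(image)
    return jsonify({"pred": str(pred)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```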
To keep the service testable, the lab adds a test that sends a request to the API and asserts the response matches expectations, runnable with pytest (e.g., python -m pytest). The point emphasized is that web servers aren't special code: standard unit and integration testing still applies.
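A sketch of such a test using Flask's built-in test client follows; the import path, payload field, and in-memory image are assumptions standing in for the lab's actual fixtures.

```python
import base64
from io import BytesIO

from PIL import Image

from api.app import app  # hypothetical import path for the Flask app above


def test_index_returns_greeting():
    client = app.test_client()
    response = client.get("/")
    assert response.status_code == 200


def test_predict_accepts_base64_post():
    client = app.test_client()
    # Build a tiny blank image in memory and base64-encode it, standing in
    # for a real paragraph image from the test fixtures.
    buffer = BytesIO()
    Image.new("L", (64, 64), color=255).save(buffer, format="PNG")
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    response = client.post("/v1/predict", json={"image": encoded})
    assert response.status_code == 200
    assert "pred" in response.get_json()
```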
Deployment readiness then shifts to packaging. The lab builds a Docker image that includes only production dependencies. The Dockerfile uses a Python 3.6 base image, installs requirements from a production-only requirements.txt, swaps GPU-oriented PyTorch packages for CPU versions, copies the text recognizer and API server code, exposes port 8000, and launches the Flask app. Docker layer caching is highlighted: placing dependency installation steps before copying frequently changing code speeds up rebuilds.
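An illustrative sketch of such a Dockerfile is below; the base tag, package versions, and paths are assumptions rather than the lab's exact file, but the layer ordering shows the caching idea.

```dockerfile
# Illustrative production image; versions and paths are assumptions.
FROM python:3.6-buster

WORKDIR /app

# Install dependencies first so this layer is cached across code-only rebuilds.
COPY requirements/prod.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt && \
    # Replace GPU PyTorch wheels with the smaller CPU-only builds.
    pip install --no-cache-dir torch==1.6.0+cpu torchvision==0.7.0+cpu \
        -f https://download.pytorch.org/whl/torch_stable.html

# Application code changes more often, so copy it after the dependency layers.
COPY text_recognizer/ ./text_recognizer
COPY api/ ./api

EXPOSE 8000
CMD ["python", "api/app.py"]
```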
Finally, the lab prepares a serverless version using AWS Lambda conventions. A separate app.py defines a handler that loads the model, reads an image URL from the incoming event, performs prediction, and returns the result. A Lambda-compatible Docker image runs the handler locally (using the canonical localhost:9000 invocation path). The lab reports timing and billing behavior consistent with Lambda's execution model, and notes that this approach can integrate with S3 triggers, sit behind a lightweight API gateway, and gain monitoring in a later lab.
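A minimal sketch of the handler pattern is below; the event field name and the prediction stand-in are illustrative, not the lab's exact app.py.

```python
from io import BytesIO

import requests
from PIL import Image

# In the lab, the (scripted) recognizer is loaded here at module level, so a
# warm container reuses it instead of reloading weights on every invocation.


def handler(event, _context):
    # The incoming event carries the URL of the image to transcribe.
    image_url = event["image_url"]
    image = Image.open(BytesIO(requests.get(image_url).content))
    pred = "predicted text"  # stand-in for model.predict(image)
    return {"pred": str(pred)}
```

With a Lambda-compatible base image running the runtime interface emulator, the handler can be exercised locally by POSTing a JSON event such as {"image_url": "..."} to http://localhost:9000/2015-03-31/functions/function/invocations, which is the canonical invocation path the lab refers to.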
In short: TorchScript reduces inference cost, Flask provides an HTTP interface with GET/POST image inputs, Docker makes the environment reproducible, and the Lambda handler pattern sets up serverless deployment with minimal additional code.
Cornell Notes
The lab upgrades a trained paragraph text recognizer for production-style serving by combining TorchScript, an HTTP API, and deployment packaging. TorchScript compiles the PyTorch model into a faster, statically defined form, with minimal code changes: set eval mode, script the model, and run inference through the scripted version. A Flask server then exposes predictions via /v1/predict, accepting either base64 images in a JSON POST body or image URLs via GET query parameters. The service is tested with standard pytest-style checks. Finally, the same prediction logic is adapted into an AWS Lambda handler and run locally in a Lambda-compatible Docker container, using an event payload that includes an image URL.
How does TorchScript speed up inference, and what code changes are required?
What does the Flask API endpoint /v1/predict accept, and how do GET and POST differ?
Why is it useful to version the API as /v1/predict, and what does the lab implement?
How does Docker make the deployment environment reproducible for the Flask service?
What changes when moving from a Flask server to an AWS Lambda-style function?
How is local testing performed for both the Flask service and the Lambda container?
Review Questions
- What specific steps are required to convert the recognizer model to TorchScript, and how does that affect repeated inference calls?
- Describe the request formats supported by /v1/predict and explain how the server obtains the image in each case.
- When adapting the service to AWS Lambda, what does the handler receive, what does it return, and how is it invoked locally?
Key Points
1. TorchScript converts the Lab 8 PyTorch recognizer into a statically compiled model to reduce per-inference latency after an initial scripting step.
2. The Flask API endpoint /v1/predict supports both POST (base64 image in JSON) and GET (image URL in query parameters).
3. Loading the model at app startup (Flask) or at module level (Lambda) avoids reloading weights on every request.
4. Standard pytest-style tests can validate API behavior by asserting expected responses from HTTP requests.
5. Docker packaging uses production-only requirements and CPU PyTorch packages to create a reproducible runtime and speed rebuilds via layer caching.
6. A Lambda handler adapts the same prediction logic to an event-driven interface that expects image_url and returns the prediction result.
7. Local Lambda testing uses a Lambda-compatible container and the canonical localhost:9000 invocation pattern, mirroring AWS execution behavior.