Lab 9: Web Deployment (Full Stack Deep Learning - Spring 2021)

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

TorchScript converts the Lab 8 PyTorch recognizer into a statically compiled model to reduce per-inference latency after an initial scripting step.

Briefing

Lab 9 turns a trained paragraph text recognizer into something that can be called over HTTP and packaged for deployment. The core move is speeding up inference with TorchScript, then exposing the model through two deployment paths: a local Flask web server running in Docker, and an AWS Lambda-style serverless function running locally in a container. The practical payoff is clear—once the model is scripted and wrapped behind an API, the same prediction logic can serve requests from curl (or a client app) and be shipped to production-like environments.

The lab starts by upgrading the Lab 8 PyTorch model for faster inference using TorchScript. TorchScript converts dynamically defined PyTorch code into a statically compiled form, leveraging optimizations that typically reduce inference latency. The changes are intentionally small: the model is set to eval mode, scripted via torch.jit.script, and then inference calls use the scripted model exactly like the original. The lab notes that scripting takes a few seconds up front, but repeated inference becomes faster—especially valuable when a service will handle many requests.
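
The conversion amounts to only a few lines. Below is a minimal, self-contained sketch of the pattern; TinyRecognizer is a stand-in for the Lab 8 paragraph text recognizer, which in the lab is a trained PyTorch Lightning model restored from a checkpoint.

```python
import torch
import torch.nn as nn

# Stand-in for the Lab 8 paragraph text recognizer; the real model is a
# trained PyTorch Lightning module loaded from a checkpoint.
class TinyRecognizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.conv(x)
        pooled = features.mean(dim=3).mean(dim=2)  # global average pool
        return self.head(pooled)

model = TinyRecognizer()
model.eval()  # switch off training-only behavior before scripting

# One-time conversion of the dynamically defined model into TorchScript.
scripted_model = torch.jit.script(model)

# Repeated inference calls use the scripted model exactly like the original.
with torch.no_grad():
    dummy_image = torch.randn(1, 1, 28, 28)
    prediction = scripted_model(dummy_image)
```

The call to torch.jit.script pays the compilation cost once; every later call to scripted_model reuses the compiled representation.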

Next comes the web wrapper. A Flask app initializes the model at startup, sets up logging, and defines routes. A simple root route returns “hello world,” while the key endpoint is /v1/predict. That endpoint supports both GET and POST. For POST requests, the server expects a JSON body containing a base64-encoded image; it decodes the image, runs prediction, computes summary statistics, logs the request, and returns the prediction as a string. For GET requests, the server reads an image URL from a query parameter (after the ? in the request URL), fetches the image, and runs the same prediction flow. Supporting both patterns mirrors common real-world API usage.
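
A minimal Flask sketch of that routing follows; the /v1/predict path and the GET/POST behavior come from the lab, while the predict_on_image helper, the JSON field names, and the model placeholder are illustrative assumptions rather than the lab's exact code.

```python
import base64
from io import BytesIO

import requests
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)
# In the lab the scripted recognizer is loaded once at startup; a stand-in here.
model = None


def predict_on_image(image: Image.Image) -> str:
    """Placeholder for the recognizer call; returns the predicted text."""
    return "predicted text"


@app.route("/")
def index():
    return "hello world"


@app.route("/v1/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        # POST: JSON body carrying a base64-encoded image.
        data = request.get_json()
        image_bytes = base64.b64decode(data["image"])
        image = Image.open(BytesIO(image_bytes))
    else:
        # GET: image URL passed as a query parameter after the '?'.
        image_url = request.args.get("image_url")
        image = Image.open(BytesIO(requests.get(image_url).content))
    prediction = predict_on_image(image)
    app.logger.info("prediction: %s", prediction)
    return jsonify({"pred": str(prediction)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```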

To keep the service testable, the lab adds a test that sends a request and asserts the response matches expectations; it can be run with pytest (invoked as python -m pytest or simply pytest). The lab emphasizes that web servers are not special as code: standard unit and integration testing still applies.
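
A sketch of such a test against the Flask sketch above, using Flask's built-in test client; the module name app and the pred response field are assumptions carried over from that sketch.

```python
# test_app.py -- run with `python -m pytest` or `pytest`
import base64
from io import BytesIO

from PIL import Image

from app import app  # the Flask module sketched above (assumed filename app.py)


def _encode_blank_image() -> str:
    """Create a small blank image and base64-encode it for the POST body."""
    buffer = BytesIO()
    Image.new("L", (32, 32), color=255).save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


def test_predict_post_returns_a_prediction():
    client = app.test_client()
    response = client.post("/v1/predict", json={"image": _encode_blank_image()})
    assert response.status_code == 200
    assert "pred" in response.get_json()
```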

Deployment readiness then shifts to packaging. The lab builds a Docker image that includes only production dependencies. The Dockerfile uses a Python 3.6 base image, installs requirements from a production-only requirements.txt, swaps GPU-oriented PyTorch packages for CPU versions, copies the text recognizer and API server code, exposes port 8000, and launches the Flask app. Docker layer caching is highlighted: placing dependency installation steps before copying frequently changing code speeds up rebuilds.
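
A Dockerfile along those lines might look like the sketch below; the base image tag, file paths, and requirements filename are illustrative, not the lab's exact file.

```dockerfile
# Sketch of the production image described above; paths are illustrative.
FROM python:3.6-buster

# Install dependencies first so Docker's layer cache survives code changes.
# The production requirements pin CPU-only PyTorch wheels in place of the
# GPU (CUDA) builds used for training.
COPY requirements/prod.txt ./requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the frequently changing application code last.
COPY text_recognizer/ ./text_recognizer/
COPY api/ ./api/

EXPOSE 8000
CMD ["python", "api/app.py"]
```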

Finally, the lab prepares a serverless version using AWS Lambda conventions. A separate app.py defines a handler that loads the model, reads an image URL from the incoming event, performs prediction, and returns the result. A Lambda-compatible Docker image runs the handler locally (using the canonical localhost:9000 invocation path). The lab reports timing and billing behavior consistent with Lambda’s execution model, and notes that this approach can integrate with S3 triggers, add a lightweight API gateway in front, and support monitoring in a later lab.
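
A minimal sketch of the handler pattern; the (event, context) signature is the standard AWS Lambda convention, while the helper names and the response shape are assumptions.

```python
# Serverless entrypoint sketch (app.py), not the lab's exact code.
from io import BytesIO

import requests
from PIL import Image

# Load the (scripted) recognizer once at module import time, so a warm Lambda
# instance reuses it across invocations; a stand-in here.
model = None


def predict_on_image(image: Image.Image) -> str:
    """Placeholder for the recognizer call."""
    return "predicted text"


def handler(event, _context):
    """AWS Lambda entrypoint: reads an image URL from the event and predicts."""
    image_url = event["image_url"]
    image = Image.open(BytesIO(requests.get(image_url).content))
    prediction = predict_on_image(image)
    return {"pred": str(prediction)}
```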

In short: TorchScript reduces inference cost, Flask provides an HTTP interface with GET/POST image inputs, Docker makes the environment reproducible, and the Lambda handler pattern sets up serverless deployment with minimal additional code.

Cornell Notes

The lab upgrades a trained paragraph text recognizer for production-style serving by combining TorchScript, an HTTP API, and deployment packaging. TorchScript compiles the PyTorch model into a faster, statically defined form, with minimal code changes: set eval mode, script the model, and run inference through the scripted version. A Flask server then exposes predictions via /v1/predict, accepting either base64 images in a JSON POST body or image URLs via GET query parameters. The service is tested with standard pytest-style checks. Finally, the same prediction logic is adapted into an AWS Lambda handler and run locally in a Lambda-compatible Docker container, using an event payload that includes an image URL.

How does TorchScript speed up inference, and what code changes are required?

TorchScript takes the model’s dynamically defined PyTorch code and compiles it into a statically defined representation, enabling faster execution paths (the lab describes it as leveraging optimizations similar to lower-level compiled code). The required workflow is small: put the model in eval mode, convert it using torch.jit.script (on the Lightning-based model used in the recognizer), and then use the scripted model for inference calls. Scripting costs time once up front, but each subsequent inference call runs faster.

What does the Flask API endpoint /v1/predict accept, and how do GET and POST differ?

The endpoint /v1/predict supports both GET and POST. For POST, the server reads a base64-encoded image from a JSON request body, decodes it, runs the paragraph text recognizer, computes prediction statistics, logs the request, and returns the prediction as a string. For GET, the server reads an image URL from a query parameter (the part after ? in the URL), fetches the image, runs the same prediction flow, and returns the prediction.

Why is it useful to version the API as /v1/predict, and what does the lab implement?

Versioning helps keep API behavior stable as changes accumulate. The lab implements this by placing the prediction route under /v1/predict, rather than exposing an unversioned endpoint. The server responds to both GET and POST at that versioned path.

How does Docker make the deployment environment reproducible for the Flask service?

Docker packages the app with a controlled runtime and dependencies. The lab builds an image using a Python 3.6 base, installs production-only requirements from requirements.txt, and copies only the text recognizer code plus the API server code. It exposes port 8000 and starts the server with python app.py. Layer caching speeds rebuilds when code changes by keeping dependency installation steps earlier in the Dockerfile.

What changes when moving from a Flask server to an AWS Lambda-style function?

Instead of HTTP routes, the Lambda version defines a handler function that receives an event and context. The handler loads the model at module level (so it’s ready when invoked), expects an image URL in the event payload, loads the image, runs prediction, and returns the result. A Lambda-compatible Docker image runs the handler using the AWS Lambda container entrypoint conventions, and local testing uses the canonical invocation URL pattern on localhost:9000.

How is local testing performed for both the Flask service and the Lambda container?

For Flask, the lab runs the server locally and uses curl to send requests to /v1/predict, either with base64 image data in a JSON POST or with an image URL in a GET query parameter; it also adds a test that asserts expected responses and runs it with pytest. For Lambda, the lab runs a Lambda-compatible container and tests it by sending a curl request to the local invocation endpoint (localhost:9000) with an event containing image_url; the container logs execution time and returns the prediction.
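
Concretely, local smoke tests of both containers could be driven from Python with requests, as sketched below; the image file, example URLs, and field names are placeholders, and the full invocation path is the standard one exposed by AWS's Lambda Runtime Interface Emulator.

```python
import base64

import requests

# Flask container: POST a base64-encoded image to the versioned endpoint.
with open("paragraph.png", "rb") as f:  # any local test image
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}
print(requests.post("http://localhost:8000/v1/predict", json=payload).json())

# Flask container: GET with an image URL in the query string.
print(requests.get(
    "http://localhost:8000/v1/predict",
    params={"image_url": "https://example.com/paragraph.png"},
).json())

# Lambda container: the Runtime Interface Emulator listens on localhost:9000
# and is invoked at the standard path below, with image_url in the event.
print(requests.post(
    "http://localhost:9000/2015-03-31/functions/function/invocations",
    json={"image_url": "https://example.com/paragraph.png"},
).json())
```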

Review Questions

  1. What specific steps are required to convert the recognizer model to TorchScript, and how does that affect repeated inference calls?
  2. Describe the request formats supported by /v1/predict and explain how the server obtains the image in each case.
  3. When adapting the service to AWS Lambda, what does the handler receive, what does it return, and how is it invoked locally?

Key Points

  1. TorchScript converts the Lab 8 PyTorch recognizer into a statically compiled model to reduce per-inference latency after an initial scripting step.

  2. The Flask API endpoint /v1/predict supports both POST (base64 image in JSON) and GET (image URL in query parameters).

  3. Loading the model at app startup (Flask) or at module level (Lambda) avoids reloading weights on every request.

  4. Standard pytest-style tests can validate API behavior by asserting expected responses from HTTP requests.

  5. Docker packaging uses production-only requirements and CPU PyTorch packages to create a reproducible runtime and speed rebuilds via layer caching.

  6. A Lambda handler adapts the same prediction logic to an event-driven interface that expects image_url and returns the prediction result.

  7. Local Lambda testing uses a Lambda-compatible container and the canonical localhost:9000 invocation pattern, mirroring AWS execution behavior.

Highlights

TorchScript scripting is a one-time conversion that makes every subsequent inference call faster, with only a few lines of model-conversion code required.
The API supports two common client patterns: POST base64 payloads for direct uploads and GET image URLs for clients that can provide a reachable link.
Docker layer caching is leveraged by installing dependencies before copying frequently changing code, making rebuilds much faster.
The Lambda version keeps the prediction logic but swaps HTTP routing for an event handler that reads image_url from the invocation payload.

Topics

Mentioned

  • API
  • GET
  • POST
  • JSON
  • HTTP
  • CPU
  • CUDA
  • S3
  • AWS
  • GCP
  • TorchScript
  • PyTorch
  • Docker
  • Lambda