
Lab 07: Web Deployment (FSDL 2022)

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert training checkpoints into TorchScript to make inference artifacts lighter and more portable across environments.

Briefing

A practical deployment pipeline turns a trained PyTorch text recognizer into a portable, shareable model service—first by compiling it to TorchScript, then by wrapping it in a Gradio interface, and finally by splitting the UI from model execution so it can run behind a stable public URL. The payoff is a workflow that moves beyond “just training” into an application people can actually use: upload an image, submit it, and receive handwritten text output through a web link.

The process starts with model weights saved during training. Checkpoints produced for restarting training are repurposed for inference by loading them into a Lightning module and converting the resulting model to TorchScript via the model’s TorchScript conversion method. That conversion yields a lighter artifact that can run without a Python runtime and without the heavy development dependencies typically required for training stacks. Once compiled, the TorchScript binary is stored in Weights & Biases (W&B) cloud storage, and a small “stage model” script automates the handoff from training to production.
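The conversion described above can be sketched with plain `torch.jit`; in the lab, the Lightning module's `to_torchscript()` method wraps this same machinery. The `TinyRecognizer` class below is a toy stand-in for the real text recognizer, used only to show the compile/save/load round trip.

```python
import io

import torch

# Toy stand-in for the text recognizer; the real lab converts a trained
# Lightning module via its to_torchscript() method, which wraps torch.jit.
class TinyRecognizer(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.backbone = torch.nn.Linear(8, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x).softmax(dim=-1)

model = TinyRecognizer().eval()
scripted = torch.jit.script(model)  # compile to TorchScript

# Save and reload without touching the original Python class definition:
# this is what makes the artifact portable beyond the training environment.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
restored = torch.jit.load(buffer)

out = restored(torch.zeros(1, 8))
```

Note that `restored` runs without the `TinyRecognizer` class being importable at inference time; the compiled graph carries its own definition, which is why the production service needs only `torch` rather than the full training stack.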

A key operational advantage of using W&B for both training artifacts and production artifacts is traceability. The compiled TorchScript model is linked back to the specific training run that produced the checkpoint, along with logged metrics and experiment metadata. Early on, manual tracking is manageable, but as teams and model versions grow, programmatic access to that lineage becomes essential for debugging, auditing, and iterating.

With the model packaged, the lab introduces a dedicated Python module—ParagraphTextRecognizer—designed to load the TorchScript artifact with a single call (torch.jit.load). Inputs are formatted to match what the model expects, and outputs are converted back into strings, enabling an image-to-text workflow. To keep iteration safe, an end-to-end test script runs the full chain: data ingestion, gradient updates, checkpoint saving, TorchScript conversion, uploading to W&B, and pulling the compiled model back down. Any failure in that pipeline causes the test to fail.
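A minimal wrapper in the spirit of `ParagraphTextRecognizer` might look like the sketch below. Only `torch.jit.load` comes from the lab; the class name suffix, the artifact path, and the `_decode` step are placeholders for the real preprocessing and token-to-string logic.

```python
import torch

class ParagraphTextRecognizerSketch:
    """Minimal wrapper in the spirit of the lab's ParagraphTextRecognizer.

    The artifact path and the decode step are placeholders; the real module
    also resizes and pads the image to the input shape the model expects.
    """

    def __init__(self, artifact_path: str) -> None:
        # Load the compiled TorchScript binary pulled down from W&B storage.
        self.model = torch.jit.load(artifact_path)

    def predict(self, image_tensor: torch.Tensor) -> str:
        with torch.no_grad():
            logits = self.model(image_tensor)
        tokens = logits.argmax(dim=-1)
        return self._decode(tokens)

    def _decode(self, tokens: torch.Tensor) -> str:
        # Placeholder: map token ids back to characters.
        return "".join(chr(97 + int(t) % 26) for t in tokens.flatten())
```

The important design point is that the wrapper's only heavy dependency is `torch` itself, keeping the production image small relative to the training environment.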

For deployment performance, the lab emphasizes that GPUs are not always necessary for inference. In the provided notebook environment, inference runs without a GPU, and the discussion highlights why batching matters when GPUs are used: training can batch efficiently because it controls data flow, while production requests arrive independently. Profiling comparisons show that batch size 16 maintains high GPU utilization, while batch size 1 drops utilization and shifts the bottleneck to CPU-side work.

Next comes the user experience. A Gradio app wraps the recognizer’s predict function, automatically generating an interface with image input widgets and text output. Gradio also provides an API endpoint and a REST-style workflow for programmatic access, including base64 encoding for image payloads. A minimal web test verifies that the UI and API respond without errors.
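The Gradio wiring is only a few lines. In the sketch below, `predict` is a hypothetical stand-in for the recognizer's method, and the import is guarded so the snippet degrades gracefully where Gradio is not installed.

```python
# Hypothetical predict function standing in for the recognizer's method;
# the real one takes a PIL image and returns the recognized paragraph text.
def predict(image) -> str:
    if image is None:
        return ""
    return "recognized text goes here"  # placeholder output

try:
    import gradio as gr

    # Gradio infers the widgets from the declared input/output types.
    demo = gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs="text")
    # demo.launch(share=True)  # uncomment to serve; share=True prints a temporary public URL
except ImportError:
    demo = None  # gradio is not installed in this environment
```

Calling `launch()` starts a local web server and also exposes the API endpoint discussed below, so the same function backs both the UI and programmatic access.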

Finally, the lab moves toward a model-as-a-service architecture. The Gradio frontend can call an AWS Lambda function that loads the TorchScript model and returns predictions via JSON over HTTP. This separation allows the UI to run locally while model inference runs on AWS infrastructure. To make the app shareable beyond Gradio’s temporary URLs, the lab uses ngrok to expose the locally running Gradio service over HTTPS, then outlines production deployment on an EC2 instance and an optional Docker-based approach to automate environment setup and execution.

Overall, the workflow—compile to TorchScript, package and trace with W&B, wrap with Gradio, serve via serverless or separated backends, and expose with stable public tunneling—creates a realistic path from research artifacts to an application that users can interact with immediately.

Cornell Notes

The lab builds a complete path from a trained PyTorch text recognizer to a shareable web application. It converts training checkpoints into a TorchScript binary so inference can run with fewer dependencies and greater portability, then stores the compiled artifact in W&B while linking it back to the originating training run. A ParagraphTextRecognizer module loads the TorchScript model and turns image inputs into text outputs. Gradio wraps the predict function to provide both a UI and an API, and the model can be separated from the frontend by serving inference through AWS Lambda. To share the app reliably, ngrok exposes the local Gradio service over HTTPS, and the lab also sketches EC2 and Docker for longer-term deployment and automation.

Why compile a trained PyTorch model into TorchScript before deployment?

TorchScript conversion produces a lighter, portable artifact that can run without requiring the full Python training stack. After loading checkpoints into the Lightning module, the model is converted to TorchScript and saved as a binary. That binary can be loaded in production with torch.jit.load, avoiding heavy dependencies like PyTorch Lightning, W&B, and other data-science tooling needed during development.

How does W&B help once the model is in production, not just during training?

W&B is used to store both the training checkpoints and the compiled TorchScript model. The compiled artifact is linked to the specific W&B training run that produced the checkpoint, including metrics and experiment metadata. This lineage becomes increasingly important as teams run more experiments and manage multiple model versions, because it enables programmatic traceability for debugging and iteration.

What does the ParagraphTextRecognizer module do in the deployment workflow?

ParagraphTextRecognizer is a production-friendly wrapper around the TorchScript model. It loads the compiled model with torch.jit.load, formats inputs into the structure the model expects (using the provided preprocessing/stem component), and converts model outputs into strings. It supports both notebook-style usage and a command-line interface that accepts an image path (local, URL, or cloud storage) and returns recognized handwritten text.
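The command-line entry point can be approximated with `argparse`; the flag names below are illustrative, not the lab's exact interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Sketch of a CLI like the recognizer's entry point; a real implementation
    # would pass args.image through the wrapper's predict method.
    parser = argparse.ArgumentParser(
        description="Recognize handwritten text in an image."
    )
    parser.add_argument("image", help="Path or URL of the image to transcribe")
    parser.add_argument(
        "--model", default="model.pt", help="TorchScript artifact to load"
    )
    return parser

args = build_parser().parse_args(["page.png", "--model", "paragraph.pt"])
```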

Why does batching matter for GPU efficiency in production?

During training, batching is controlled and efficient because the pipeline can assemble batches. In production, requests arrive independently, so running inference one input at a time (batch size 1) underutilizes the GPU. Profiling comparisons show batch size 16 keeps GPU utilization high (often above 90%), while batch size 1 drops GPU utilization (e.g., to around 38%) and shifts the bottleneck toward CPU-side work such as kernel selection and orchestration.
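The per-call overhead can be illustrated even on CPU with a tiny model: one batched call versus sixteen single-item calls over the same data. The lab's utilization figures come from GPU profiling, so the absolute numbers here are not comparable, but the structure of the comparison is the same.

```python
import time

import torch

model = torch.nn.Linear(256, 256).eval()
x16 = torch.randn(16, 256)

def run_batched() -> float:
    # One forward pass over all 16 inputs at once.
    start = time.perf_counter()
    with torch.no_grad():
        model(x16)
    return time.perf_counter() - start

def run_one_at_a_time() -> float:
    # 16 separate forward passes with batch size 1: same math,
    # but the per-call dispatch overhead is paid 16 times.
    start = time.perf_counter()
    with torch.no_grad():
        for i in range(16):
            model(x16[i : i + 1])
    return time.perf_counter() - start

batched, single = run_batched(), run_one_at_a_time()
```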

How do Gradio and its API change how users interact with the model?

Gradio wraps the model’s predict function and automatically generates a UI with image input widgets and text output display. Running launch starts a local web server and prints a URL that can be opened from other devices. Gradio also exposes an API endpoint; clients can send requests (often via curl) using JSON, with images typically encoded in base64 for HTTP transport.
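Building such a request body needs only the standard library. The data-URL prefix and `"data"` field follow Gradio's convention for image inputs; the endpoint path in the comment is the typical default, so check the URL your app prints.

```python
import base64
import json

# Base64-encode the image so its binary bytes survive text-based HTTP transport.
image_bytes = b"\x89PNG\r\n\x1a\n..."  # placeholder bytes, not a real PNG
encoded = base64.b64encode(image_bytes).decode("ascii")
payload = json.dumps({"data": ["data:image/png;base64," + encoded]})

# A client would POST this, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        -d @payload.json http://localhost:7860/api/predict

# The server reverses the transformation losslessly.
decoded = base64.b64decode(json.loads(payload)["data"][0].split(",", 1)[1])
```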

What does separating the frontend from the model backend enable?

Instead of running the model and UI on the same machine, the UI can call a separate inference service. The lab demonstrates using AWS Lambda: the frontend sends JSON requests to a Lambda handler, which loads the TorchScript model and returns predictions. This separation lets the UI run anywhere while inference runs on AWS infrastructure, making scaling and independent development easier.
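The Lambda side can be sketched as a handler that parses JSON, decodes the image, and returns a JSON response. The `_predict` stub stands in for loading the TorchScript model with `torch.jit.load` and running inference, and the field names are illustrative.

```python
import base64
import json

def _predict(image_bytes: bytes) -> str:
    # Placeholder for torch.jit.load(...) plus inference on the TorchScript model.
    return f"recognized text from {len(image_bytes)} bytes"

def handler(event, context):
    """Sketch of an AWS Lambda handler: JSON in, JSON out over HTTP."""
    body = json.loads(event["body"])
    image_bytes = base64.b64decode(body["image"])
    return {
        "statusCode": 200,
        "body": json.dumps({"pred": _predict(image_bytes)}),
    }

# Example invocation the frontend might trigger (context is unused here):
event = {"body": json.dumps({"image": base64.b64encode(b"abc").decode()})}
response = handler(event, None)
```

Because the handler speaks plain JSON over HTTP, the Gradio frontend, a curl script, or any other client can call it interchangeably.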

Review Questions

  1. What specific steps convert a training checkpoint into a production-ready TorchScript artifact, and where is the compiled output stored?
  2. How do batching and request patterns in production affect GPU utilization, and what profiling signals indicate a CPU bottleneck?
  3. In what ways do Gradio’s UI and API endpoints differ, and why is base64 encoding needed for image requests?

Key Points

  1. Convert training checkpoints into TorchScript to make inference artifacts lighter and more portable across environments.
  2. Store compiled TorchScript binaries in W&B and link them back to the originating training runs for traceability.
  3. Use a dedicated recognizer wrapper (ParagraphTextRecognizer) to load TorchScript, format inputs, and return text outputs.
  4. Add an end-to-end test that validates the full pipeline from training artifacts through TorchScript conversion and W&B upload/download.
  5. Treat batching as a lever for GPU efficiency in production; batch size 16 can maintain high GPU utilization while batch size 1 often shifts work to the CPU.
  6. Wrap the model with Gradio to deliver both an interactive UI and a REST-style API, including base64 image encoding for requests.
  7. Separate the frontend from inference by serving the model through AWS Lambda, then expose the app publicly using ngrok (HTTPS) or longer-term hosting on EC2/Docker.

Highlights

TorchScript conversion removes the need for a Python runtime and many training-time dependencies, making deployment artifacts far more portable.
W&B can function as a bridge between experiments and production by linking compiled inference binaries back to the exact training run and metrics that generated them.
Gradio provides both a drag-and-drop UI and an API endpoint; images sent over HTTP typically require base64 encoding.
Batch size changes GPU utilization dramatically in deployment: batch size 16 can keep GPUs busy, while batch size 1 often bottlenecks on CPU-side work.
AWS Lambda enables a model-as-a-service setup where the UI and inference run on different infrastructure, simplifying scaling and iteration.

Topics

  • TorchScript Conversion
  • W&B Artifacts
  • Gradio UI And API
  • Batching For Inference
  • AWS Lambda Model Service
  • ngrok Public URLs
  • Docker Deployment

Mentioned

  • W&B
  • FSDL 2022
  • PyTorch
  • HTTP
  • REST
  • JSON
  • API
  • curl
  • AWS
  • EC2
  • ngrok
  • HTTPS
  • CPU
  • GPU
  • TLS
  • Docker
  • UI
  • MVP
  • ML
  • AWS Lambda