Hardware/Mobile (7) - Testing & Deployment - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Deploying deep learning models on mobile and embedded hardware is less about model design in the abstract and more about surviving the constraints of smaller runtimes, tighter memory budgets, and slower processors. Training frameworks like PyTorch and TensorFlow often support a wide range of layers and dynamic Python workflows, but mobile/embedded targets typically offer fewer supported operations and require models to be packaged into formats that can run efficiently on-device. That mismatch forces careful architecture choices early—especially if the model must be portable across platforms.
A central theme is the need for interchange formats and platform-specific runtimes. For mobile, TensorFlow Lite is positioned as a smoother successor to the older, more error-prone TensorFlow Mobile workflow. The process starts by converting a TensorFlow model with the TensorFlow Lite converter, producing a .tflite file that can be loaded by the TensorFlow Lite runtime on the phone. TensorFlow Lite can also quantize weights, shrinking them from 32-bit floats down to 16-bit floats or even 8-bit integers to reduce compute and memory demands.
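As a rough sketch (assuming TensorFlow 2.x; the Keras model, file name, and float16 option here are illustrative, not taken from the lecture), the conversion step might look like this:

```python
# Sketch: convert a trained Keras model to TensorFlow Lite and quantize weights to float16.
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # stand-in for a trained model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]       # enable post-training quantization
converter.target_spec.supported_types = [tf.float16]       # store weights as float16

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # this file is what the on-device TFLite runtime loads
```

On-device, the resulting .tflite file is loaded by the TensorFlow Lite interpreter rather than by the full TensorFlow runtime.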
On the PyTorch side, the workflow avoids running Python on-device. Weights can be quantized, and TorchScript (via torch.jit scripting) compiles the flexible Python model into a static execution graph saved in the TorchScript format for deployment. PyTorch also provides platform libraries such as LibTorch iOS and PyTorch Android, which load the compiled model and run inference. A key limitation remains: not every PyTorch model can be converted to TorchScript, so portability depends on staying within what the target format can represent.
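A minimal sketch of that flow, assuming a small, scriptable model (the TinyClassifier class and file name are made-up placeholders):

```python
# Sketch: dynamic int8 quantization of Linear layers, then TorchScript compilation.
# Models using unsupported constructs may fail to script.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyClassifier().eval()

# Quantize the Linear weights to int8 (dynamic quantization).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Compile to a static TorchScript graph and save; LibTorch (iOS) or the
# PyTorch Android library loads this file and runs inference without Python.
scripted = torch.jit.script(quantized)
scripted.save("tiny_classifier.pt")
```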
For cross-framework portability, ONNX (Open Neural Network Exchange) is presented as the closest thing to a “global interchange format.” ONNX is open source and backed by multiple companies, with converters that translate models from frameworks like PyTorch and TensorFlow into an ONNX representation. Deployment then happens through ONNX-compatible runtimes on different devices. The tradeoff is strict: ONNX can only store operations and constructs defined in its specification. If a model uses “super weird” PyTorch features that can’t be mapped into ONNX, it may not deploy elsewhere.
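For example, exporting a simple PyTorch model to ONNX might look roughly like this (the model, shapes, names, and opset are assumptions for illustration; export raises an error if an operation has no ONNX equivalent):

```python
# Sketch: export a small PyTorch model to ONNX. Only operations defined in the
# ONNX spec can be serialized; unsupported ops cause the export to fail.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 10),  # 30x30 spatial size after a 3x3 conv on 32x32 input
).eval()

dummy_input = torch.randn(1, 3, 32, 32)  # example input used to record the graph

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=13,
    input_names=["input"],
    output_names=["logits"],
)
```

The exported file can then be executed by an ONNX-compatible runtime (ONNX Runtime is a common choice) on the target platform.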
Beyond interchange formats, ecosystems matter. Apple’s Core ML and Google’s ML Kit are described as more ecosystem-oriented deployment paths, with Google’s side generally better aligned with TensorFlow Lite. Startups such as Fritz AI are mentioned as aiming to let teams convert a model once and deploy it to both Core ML and ML Kit. The reason to lean on these platform primitives is that vendors optimize their operators for speed on their own hardware.
Embedded systems add another layer of constraint: low-power GPUs and limited memory. The typical approach is aggressive weight quantization (often to float16 or even int8) to shrink model size, paired with NVIDIA’s TensorRT to optimize execution on embedded GPUs. Upstream engineering can also reduce the model’s footprint without sacrificing accuracy: MobileNet is cited as an example architecture that separates spatial and channel-wise computation using depthwise separable convolutions, cutting operations and memory. Finally, when a smaller model still needs to learn from a larger one, knowledge distillation is highlighted—training a “student” network on the teacher network’s predictions rather than directly on the raw labels, with Hugging Face’s work on distilling large models like BERT offered as a recent case study.
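To make the MobileNet idea concrete, here is a rough PyTorch sketch of a depthwise separable convolution, which replaces one standard convolution with a per-channel spatial filter followed by a 1x1 channel-mixing convolution (layer sizes are illustrative):

```python
# Sketch of a depthwise separable convolution (the MobileNet building block).
# groups=in_channels makes the first conv purely spatial (one filter per channel);
# the 1x1 pointwise conv then mixes information across channels.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

And a hedged sketch of a distillation loss, assuming the common formulation in which the student matches the teacher's temperature-softened predictions alongside the hard labels (the temperature and weighting values are illustrative hyperparameters, not numbers from the lecture):

```python
# Sketch: distillation loss = KL(student vs. teacher soft targets) + hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T*T rescaling keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```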
Cornell Notes
Mobile and embedded deployment demands more than exporting a trained network: on-device runtimes support fewer operations, memory is limited, and processors are slower. That drives choices like quantization (e.g., 32-bit floats down to 16-bit floats or 8-bit integers) and sometimes retraining via knowledge distillation so a smaller “student” model matches a larger “teacher.” For TensorFlow, TensorFlow Lite converts models into a .tflite file and can quantize weights for efficient inference. For PyTorch, TorchScript compiles Python models into a static execution graph, avoiding Python execution on-device, with LibTorch iOS and PyTorch Android libraries to run the result. For portability across frameworks, ONNX acts as an interchange format, but only operations representable in ONNX can be deployed elsewhere.
Why do mobile and embedded deployments require different engineering than training on servers?
How does TensorFlow Lite change the deployment workflow for TensorFlow models?
What role does TorchScript play in PyTorch mobile deployment?
What is ONNX, and what limitation comes with using it?
How do quantization and knowledge distillation work together on constrained devices?
Review Questions
- What specific constraints (memory, supported layers, runtime capabilities) most directly influence whether a trained model can be deployed to mobile or embedded hardware?
- Compare TensorFlow Lite and TorchScript in terms of how they package a model for on-device inference and what kinds of compatibility issues can arise.
- Why does ONNX portability break down for models that use operations outside the ONNX specification?
Key Points
1. Mobile and embedded deployment often fails when training-time model constructs don’t map cleanly to on-device runtime capabilities, so architecture choices must anticipate export constraints.
2. TensorFlow Lite deployment centers on converting a TensorFlow model into a .tflite file and optionally quantizing weights to float16 or 8-bit integers for faster, smaller inference.
3. PyTorch mobile deployment typically avoids running Python on-device by compiling models into TorchScript static execution graphs, then loading them via LibTorch iOS or PyTorch Android.
4. ONNX provides cross-framework portability, but only operations representable in the ONNX specification can be exported and deployed to other targets.
5. Embedded GPU deployment commonly relies on aggressive weight quantization and NVIDIA TensorRT to optimize execution on low-power hardware.
6. Model size and compute can be reduced upstream by using efficient architectures like MobileNet’s depthwise separable convolutions, which separate spatial and channel-wise computation.
7. Knowledge distillation helps smaller models recover accuracy by training a student network on teacher predictions rather than only on ground-truth labels.