
Lecture 02: Development Infrastructure & Tooling (FSDL 2022)

The Full Stack

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Machine learning development is a loop: data preparation and labeling, model/weights selection, iterative debugging and experiments, deployment, monitoring, and then feeding new user data back into training.

Briefing

Machine learning development runs on a “data flywheel,” but getting from an idea to a reliable system at scale depends on disciplined software engineering—especially around tooling, reproducibility, and training infrastructure. The core workflow starts long before model code: teams must aggregate and clean data, label it, choose or train an architecture and pre-trained weights, iterate on model code through debugging and experiments, then deploy and monitor predictions. After deployment, user activity generates fresh data that must be fed back into the dataset, closing the loop.

That reality splits development into three practical layers: data preparation, model development, and deployment. The lecture argues that most friction happens in the middle layer, where code, experiments, and environments must stay consistent while models evolve. Python is treated as the default language for ML because its ecosystem dominates scientific and data computing. From there, the tooling stack matters: editors like VS Code (recommended in the course) or PyCharm, plus notebook environments such as Jupyter/JupyterLab that enable fast feedback.

Notebooks are praised as a “first draft” environment because they provide immediate output and encourage rapid iteration. But they also create problems: limited refactoring support, weak versioning of cell outputs, out-of-order execution artifacts, and poor fit for unit testing. The lecture points to nbdev as a way to unify documentation, code, and tests inside notebooks, while still leveraging VS Code’s notebook support, remote editing via SSH, and debugging features like breakpoints.
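
A tiny sketch of the nbdev idea, with the cells of one notebook shown as a single listing (the module and function names are illustrative, not from the lecture):

```python
# Cell 1 — nbdev directive naming the module this notebook exports to:
#| default_exp core

# Cell 2 — tagged cells are exported into the package by `nbdev_export`:
#| export
def greet(name: str) -> str:
    "The docstring is rendered into the project documentation."
    return f"Hello, {name}"

# Cell 3 — a plain cell doubles as a test, executed by `nbdev_test`:
assert greet("FSDL") == "Hello, FSDL"
```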

For interactive sharing, Streamlit is highlighted as a way to turn Python scripts into lightweight web apps with widgets and efficient reruns. The lecture then shifts to environment management, emphasizing that deep learning setups are fragile: CUDA versions, Python versions, and library versions (e.g., PyTorch, NumPy) must align. A reproducible approach uses environment.yaml plus conda for Python/CUDA, then pip-tools to resolve compatible package versions and lock them so experiments can be recreated later. Makefiles can streamline common commands.
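
To make the Streamlit pattern concrete, a minimal sketch of a script-turned-app (the widget and function here are illustrative, not from the lecture):

```python
# app.py — launch with: streamlit run app.py
import streamlit as st

# Caching memoizes expensive work so widget-triggered reruns stay fast
# (st.cache_data is the API in recent Streamlit releases).
@st.cache_data
def scores_above(threshold: float) -> list[float]:
    # Stand-in for a real model call or data load.
    return [s for s in (0.2, 0.5, 0.8, 0.95) if s >= threshold]

st.title("Prediction explorer")
threshold = st.slider("Score threshold", 0.0, 1.0, 0.5)
st.write(scores_above(threshold))
```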

On the framework side, the lecture frames the choice as ecosystem and engineering fit. PyTorch is positioned as the full-stack default: it’s widely used in research and industry, supports CPU/GPU/TPU/mobile via an optimized execution graph, and has a strong distributed training ecosystem. PyTorch Lightning is recommended for structuring model code, optimizers, training loops, evaluation, and data loaders—making it easier to switch between single- and multi-device training with minimal code changes. Alternatives are acknowledged: TensorFlow (including Keras for layer composition), JAX for research and general vectorization/auto-differentiation, and meta-frameworks like Flax/Haiku.
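
A minimal sketch of that Lightning structure, with model, training step, and optimizer each in a designated method (the toy network is illustrative):

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        return self.net(x.flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Moving from one device to several is mostly a Trainer argument change,
# e.g. pl.Trainer(accelerator="gpu", devices=4, strategy="ddp").
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
# trainer.fit(LitClassifier(), train_dataloaders=...)  # supply a DataLoader
```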

Finally, scaling and compute management are treated as part of development infrastructure rather than an afterthought. Distributed training ranges from data parallelism (replicating the model across GPUs and averaging gradients) to sharded approaches for models that don’t fit on one GPU. Techniques like ZeRO-style optimizer/model sharding and Fully Sharded Data Parallel (FSDP) can cut memory usage dramatically, enabling much larger batch sizes and parameter counts. The lecture also surveys compute choices—NVIDIA GPUs, TPUs on GCP—and stresses that cost comparisons must consider time-to-train, not just hourly rates. For experiment and model management, it highlights tools such as TensorBoard, MLflow, Weights & Biases, and hyperparameter sweeps (e.g., Hyperband), plus all-in-one platforms that combine training, tracking, and deployment—while deferring deeper data and deployment topics to later weeks.
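
As a flavor of the experiment-tracking tools mentioned, a minimal Weights & Biases logging sketch (the project name and metric are hypothetical):

```python
import wandb

# One run per experiment; config records the hyperparameters being tracked.
run = wandb.init(project="fsdl-demo", config={"lr": 1e-3, "batch_size": 32})

for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss}, step=step)

run.finish()
```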

Cornell Notes

Machine learning development depends on more than model code: teams must build a reproducible pipeline that spans data preparation, experiment iteration, deployment, and monitoring—then feed new user-generated data back into training. Python is the default language, with notebooks enabling fast iteration but requiring extra discipline for versioning, testing, and execution order. Reproducible environments are achieved by pinning Python/CUDA versions and using pip-tools to resolve and lock compatible library versions. For training at scale, PyTorch Lightning structures training cleanly, while distributed strategies range from data parallelism to sharded methods like ZeRO/FSDP that reduce memory so larger models fit and train faster. Cost and performance comparisons should be based on total experiment time, not hourly GPU pricing.

Why does the “data flywheel” matter for development infrastructure, not just model accuracy?

The workflow described starts with a project spec and sample data, but quickly expands into data aggregation, cleaning, labeling, and continuous improvement. After deployment, monitoring turns predictions into new data—user activity generates fresh examples that must be added back into the dataset. That feedback loop is what keeps the prediction system improving over time, so tooling must support both iteration during training and ongoing data updates after release.

What are the main strengths and weaknesses of notebook-first development?

Notebooks provide a fast feedback cycle: code runs with immediate output, making them ideal for early prototyping. The tradeoffs include limited refactoring support, weak documentation navigation, lack of robust unit testing integration, and versioning challenges because cell outputs and artifacts can change. Out-of-order execution can also produce inconsistent results, so teams need practices or tools (e.g., nbdev, module-based code imported into notebooks, and VS Code notebook support) to keep experiments reliable.
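
One way to apply the module-based practice mentioned above is to keep logic in an importable, testable package and use the notebook only as a driver; a sketch (the package name is hypothetical):

```python
# First cell: auto-reload edited modules so cells never run stale code.
%load_ext autoreload
%autoreload 2

# Logic lives in a versioned module that unit tests can also import.
from my_project.model import build_model  # hypothetical package

model = build_model(hidden_size=128)
```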

How does the lecture recommend making ML environments reproducible?

It recommends specifying Python and CUDA versions in environment.yaml and using conda to install those pinned components. For the rest of the dependencies, pip-tools is used to compute mutually compatible versions from constraints (e.g., “torch version > 1.7” or unconstrained NumPy), then lock them so the same versions can be recreated later. This reduces the common failure mode where a training run works once but can’t be repeated.
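
A sketch of that locking workflow under the constraints quoted above (file names follow common pip-tools convention; contents are illustrative):

```
# requirements.in — hand-written, loosely constrained
torch>=1.7
numpy

# Resolve and pin every transitive dependency, then install the lockfile:
#   pip-compile requirements.in --output-file requirements.txt
#   pip install -r requirements.txt
```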

When does distributed training move beyond simple data parallelism?

Data parallelism is used when the model fits on a single GPU but the batch/data doesn’t, or when speedups are needed by splitting batches across GPUs and averaging gradients. The lecture then describes the harder case: models with billions of parameters that don’t fit in one GPU. That triggers sharded approaches such as ZeRO-style optimizer/model sharding and PyTorch’s Fully Sharded Data Parallel (FSDP), which can dramatically reduce memory by ensuring each GPU holds only the parameters it needs for the current computation.
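
A minimal sketch of wrapping a model in PyTorch's FSDP, assuming the script is launched with torchrun so a process group can be initialized (the toy model is illustrative):

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`, which
# sets the environment variables init_process_group reads.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

# Each rank stores only a shard of parameters, gradients, and optimizer
# state, gathering full weights just-in-time for each unit's compute.
model = FSDP(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```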

Why are cost comparisons based on total runtime rather than hourly GPU price?

The lecture gives counter-intuitive examples where a cheaper-per-hour setup costs more overall because it takes much longer to finish training. It contrasts scenarios where an 8x A100 machine finishes in hours and costs far less than a slower configuration that runs for days. The takeaway is to compare total cost (time-to-train multiplied by the hourly rate), using benchmark throughput data and measured experiment durations.
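
The arithmetic behind that takeaway is simple; with hypothetical prices and durations (not the lecture's figures):

```python
# Total cost = hourly rate × hours the job actually takes.
fast = {"rate_per_hr": 33.0, "hours": 4}   # e.g., an 8x A100 machine (hypothetical)
slow = {"rate_per_hr": 8.0, "hours": 72}   # cheaper per hour, far slower (hypothetical)

for name, m in (("fast", fast), ("slow", slow)):
    print(f"{name}: ${m['rate_per_hr'] * m['hours']:.0f} total")
# fast: $132 total
# slow: $576 total  -> the pricier-per-hour machine wins overall
```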

Review Questions

  1. What specific notebook problems (execution order, versioning, testing) does the lecture identify, and what tooling approaches are suggested to mitigate them?
  2. Describe the progression from trivial parallelism to data parallelism to sharded model training. What problem triggers each step?
  3. How does pip-tools contribute to reproducibility compared with only pinning Python and CUDA versions?

Key Points

  1. Machine learning development is a loop: data preparation and labeling, model/weights selection, iterative debugging and experiments, deployment, monitoring, and then feeding new user data back into training.

  2. Python is treated as the practical default for ML because its library ecosystem dominates scientific and data computing.

  3. Notebook workflows accelerate early iteration, but reproducibility requires addressing refactoring limits, cell output/versioning issues, and out-of-order execution artifacts.

  4. Reproducible environments come from pinning Python/CUDA in environment.yaml (conda) and using pip-tools to resolve and lock compatible dependency versions.

  5. PyTorch Lightning improves training maintainability by standardizing where model, optimizer, training, evaluation, and data loader code live while enabling multi-device runs with minimal changes.

  6. Distributed training starts with data parallelism but requires sharded strategies (ZeRO/FSDP) when model parameters don’t fit on a single GPU.

  7. Cloud cost should be evaluated by total experiment time and benchmarked throughput, not just hourly GPU pricing.

Highlights

  - The lecture frames ML development as a closed-loop “data flywheel,” where post-deployment monitoring generates new training data.
  - Notebooks deliver fast feedback but can undermine reliability through out-of-order execution and weak versioning of cell artifacts.
  - ZeRO-style sharding and PyTorch Fully Sharded Data Parallel can cut memory usage by distributing optimizer/model state so GPUs only hold what they need at each step.
  - Cost comparisons can flip expectations: a higher hourly rate can be cheaper overall if it drastically reduces training time.
  - PyTorch Lightning is positioned as a practical structure layer for training code, making CPU/GPU/multi-GPU scaling mostly a configuration change.
