
Lab 04: Experiment Management (FSDL 2022)

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Training metrics printed to the console are not enough; without persistent logs and metadata, experiments can’t be reliably reproduced or compared.

Briefing

Experiment management is the difference between “useful training output” and “lost knowledge.” During model training, metrics like loss and validation accuracy stream to the command line, but without structured logging those numbers get overwritten, vanish when the notebook restarts, and make it impossible to reconstruct which hyperparameters, git state, or timestamps produced a given result. The lab frames this as a core engineering need: preserve metrics over time and attach the surrounding context—code version, system details, and other metadata—so experiments can be compared, debugged, and improved.
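To make the contrast concrete, here is a minimal sketch of structured logging with the W&B Python client; the project name and hyperparameters are hypothetical, but the pattern — record the config at init time, log metrics step by step — is what replaces transient console output.

```python
import wandb

# Hypothetical hyperparameters for illustration.
config = {"lr": 1e-3, "batch_size": 128, "max_epochs": 10}

# wandb.init records start time, system details, git state, and the config
# dict alongside everything logged during the run.
run = wandb.init(project="fsdl-text-recognizer", config=config)

for step in range(100):
    train_loss = 1.0 / (step + 1)          # placeholder metric
    wandb.log({"train/loss": train_loss})  # persisted and charted over time

run.finish()
```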

TensorBoard is introduced as the traditional baseline for tracking training curves. It runs as a separate service and visualizes logged metrics as charts over time, which works well for inspecting a single run. But once multiple experiments enter the picture, the workflow becomes harder: comparing across runs and managing the lifecycle of the logging service adds friction, and the setup can distract many ML engineers from building models and products.
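For reference, single-run logging with TensorBoard might look like the sketch below (assuming PyTorch's bundled SummaryWriter); the extra step of launching and managing the separate TensorBoard service is part of the friction described above.

```python
from torch.utils.tensorboard import SummaryWriter

# Scalars land under ./runs/ by default; viewing them requires starting a
# separate service, e.g. `tensorboard --logdir runs`.
writer = SummaryWriter()

for step in range(100):
    val_loss = 1.0 / (step + 1)  # placeholder metric
    writer.add_scalar("validation/loss", val_loss, step)

writer.close()
```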

Weights & Biases (W&B) is presented as the recommended upgrade for experiment management, largely because it integrates smoothly with popular deep learning stacks (including PyTorch Lightning, Keras, and TensorBoard) and improves both the logging experience and the interface for reviewing results. The integration can be added to an experiment script with only a handful of lines, and the lab walks through what those additions unlock in the W&B web UI while training is still running. Metrics update live, and the run page is organized into tabs that separate concerns: an Overview tab for run metadata (start time, duration, OS/Python version, git repository state, command-line arguments, and final metrics); a System tab for CPU/GPU utilization and memory (useful for catching performance regressions); a Model tab enabled by a single watch call; a Logs tab capturing terminal output; and a Files tab that includes environment snapshots and diffs against version control.
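A plausible version of that handful of lines, using the WandbLogger that ships with PyTorch Lightning, is sketched below; LitModel and data_module stand in for the lab's LightningModule and data module and are not its actual class names.

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="fsdl-text-recognizer", log_model=True)

model = LitModel()   # placeholder for the lab's LightningModule
logger.watch(model)  # the single call that enables the Model tab

trainer = pl.Trainer(max_epochs=10, logger=logger)
trainer.fit(model, datamodule=data_module)  # data_module is a placeholder
```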

W&B also adds higher-value artifacts and reproducibility features. The Artifacts tab stores generated binaries such as model checkpoints and input/output media from the text recognizer, versioned like directories with lineage tracking. The lab highlights that these artifacts can be tagged (for example, “best” checkpoints via PyTorch Lightning callbacks), and that W&B can track which runs created or consumed which artifacts—critical for debugging and for tracing model behavior back to data and code.
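One way to get that "best" tagging is Lightning's ModelCheckpoint callback combined with the WandbLogger's checkpoint upload; the monitored metric key below is illustrative rather than the lab's exact name.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger

# With log_model enabled, saved checkpoints are uploaded as versioned
# artifacts, and the top-scoring checkpoint is typically aliased "best".
logger = WandbLogger(project="fsdl-text-recognizer", log_model=True)

checkpoint_callback = ModelCheckpoint(
    monitor="validation/loss",  # hypothetical metric key
    mode="min",
    save_top_k=1,
)

trainer = pl.Trainer(logger=logger, callbacks=[checkpoint_callback])
```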

Beyond browsing, the lab emphasizes programmatic access via the W&B API, enabling workflows like pulling logged tables into pandas, analyzing augmented data samples, and traversing the run→artifact→run graph to diagnose issues in data processing pipelines. Because raw logs can become overwhelming, W&B Reports are introduced as a way to package results into shareable, structured dashboards and PR-friendly summaries, linking metrics back to git changes.
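A small example of that programmatic access, assuming a hypothetical entity/project/run path: run.history() returns the logged metrics as a pandas DataFrame, ready for ad-hoc analysis or plotting.

```python
import wandb

api = wandb.Api()

# Path format is "entity/project/run_id"; all names here are hypothetical.
run = api.run("my-team/fsdl-text-recognizer/abc123")

history = run.history()  # logged metrics as a pandas DataFrame
print(history.columns.tolist())
print(history.describe())
```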

Finally, the lab scales from single experiments to large projects and hyperparameter optimization. It demonstrates filtering and custom charts across many runs (including derived metrics like generalization gap) and shows how W&B can orchestrate hyperparameter sweeps using a YAML-configured controller with lightweight agents. The exercises encourage contributing to a massive sweep, exploring the W&B SDK for manual logging, searching for better hyperparameters for a line CNN transformer, and extending metrics logging using torchmetrics—turning experiment tracking into a repeatable, collaborative development loop.
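As a sketch of what such a sweep looks like driven from Python (the same configuration can live in the YAML file the lab uses), with illustrative parameter names, ranges, and project name:

```python
import wandb

# Illustrative search space; in the lab this lives in a YAML file.
sweep_config = {
    "method": "random",
    "metric": {"name": "validation/loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

def train():
    # Each agent call receives a sampled config from the controller.
    with wandb.init() as run:
        lr = run.config.lr
        batch_size = run.config.batch_size
        ...  # train with these hyperparameters and log metrics

sweep_id = wandb.sweep(sweep_config, project="fsdl-text-recognizer")
wandb.agent(sweep_id, function=train, count=5)  # one lightweight agent
```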

Cornell Notes

Experiment management preserves training knowledge that would otherwise disappear—metrics, hyperparameters, git state, system details, and generated artifacts—so results can be compared and reproduced. TensorBoard can visualize metrics for a single run, but it becomes cumbersome when many experiments must be sliced, filtered, and shared. Weights & Biases (W&B) improves the workflow with tight integrations (notably PyTorch Lightning), live-updating dashboards, and structured tabs for run metadata, system utilization, logs, files, and versioned artifacts like checkpoints and model I/O. W&B also supports programmatic access through an API and uses Reports to package results for PRs and stakeholders. The lab extends these ideas to large-scale projects and hyperparameter sweeps using W&B agents and YAML configuration.

Why does experiment management matter even when training already prints metrics to the console?

Console output is transient: metric values (like loss) get overwritten during training, and restarting the notebook or closing a terminal window removes the output. Without saved logs and metadata, it becomes difficult to reconstruct which arguments, git repository state, and timestamps produced a particular run—making debugging and comparison unreliable. Experiment management tools store metrics over time and attach context so experiments remain auditable and reproducible.

What specific limitations of TensorBoard show up once multiple experiments need comparison?

TensorBoard runs as a separate service and works best for one experiment’s charts. When comparing across many runs, the workflow gets harder: managing multiple runs and the independent service adds overhead, and the experience becomes less convenient for filtering, grouping, and collaboration. The lab also notes the need for extra management like cleaning up prior TensorBoard processes.

What does W&B add beyond metric charts, and how do the run tabs map to different debugging needs?

W&B organizes information into tabs: Overview for run metadata (start time, duration, OS/Python version, git state, command-line arguments, and final metrics); System for CPU/GPU utilization and memory (including GPU temperature and allocation for multi-GPU setups); Model via a watch call; Logs for terminal output like warnings/errors; Files for environment snapshots (requirements.txt, conda env YAML) and git diffs; and Artifacts for versioned binary outputs such as checkpoints and model inputs/outputs. This structure supports both performance debugging and reproducibility.
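The Model tab's single watch call could look like this minimal sketch (the model here is a stand-in, not the lab's text recognizer):

```python
import torch.nn as nn
import wandb

run = wandb.init(project="fsdl-text-recognizer")
model = nn.Linear(10, 2)  # stand-in for the lab's model

# One call populates the Model tab with gradient and parameter histograms.
wandb.watch(model, log="all", log_freq=100)
```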

How do W&B artifacts and lineage help track model checkpoints and their provenance?

Artifacts behave like versioned directories: checkpoints and associated media are stored as artifact versions (starting at version 0). With PyTorch Lightning integration, checkpoints can be tagged (e.g., “best” based on tracked metrics). Each artifact includes metadata like who created it and which experiment it belongs to, and lineage views show which runs created artifacts and which runs used them—supporting traceability from model behavior back to training runs and data.
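A minimal sketch of logging a checkpoint as a versioned artifact; the artifact name, file path, and aliases are hypothetical.

```python
import wandb

run = wandb.init(project="fsdl-text-recognizer", job_type="train")

# Re-logging an artifact with the same name creates a new version (v0, v1, ...).
artifact = wandb.Artifact("text-recognizer-model", type="model")
artifact.add_file("checkpoints/last.ckpt")  # hypothetical checkpoint path
run.log_artifact(artifact, aliases=["latest", "best"])

run.finish()
```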

How does W&B’s API change what teams can do with logged experiment data?

The API enables programmatic retrieval of logged data, such as pulling tables into pandas for analysis or plotting. It also exposes the run→artifact→run graph, letting teams trace which runs created or consumed specific artifacts. Because training-time data transformations (like augmented samples) can be logged, teams can detect issues in data processing pipelines and resolve them faster.
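Traversing that graph from the public API might look like the following, assuming a hypothetical artifact path: logged_by() returns the producing run, and used_by() returns the consuming runs.

```python
import wandb

api = wandb.Api()

# Path format is "entity/project/artifact:alias"; names are hypothetical.
artifact = api.artifact("my-team/fsdl-text-recognizer/text-recognizer-model:best")

producer = artifact.logged_by()  # run that created this checkpoint
consumers = artifact.used_by()   # runs that loaded it (e.g., evaluation jobs)

print("created by:", producer.name)
print("used by:", [run.name for run in consumers])
```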

What workflow does W&B support for hyperparameter optimization at scale?

Hyperparameter sweeps are configured via a YAML file that defines the command to run and the parameter search space. A lightweight controller runs on W&B servers and dispatches work to agents that execute training runs with different hyperparameter sets. Agents can be launched per GPU by setting environment variables like CUDA_VISIBLE_DEVICES, enabling parallel sweeps. Results appear in W&B dashboards with charts (e.g., validation loss vs. influential parameters) and tools like parallel coordinates for filtering runs.
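One way to launch one agent per GPU from Python, assuming a hypothetical sweep path (the same effect can be achieved by running wandb agent in separate shells with different CUDA_VISIBLE_DEVICES values):

```python
import os
import subprocess

sweep_path = "my-team/fsdl-text-recognizer/abc123"  # hypothetical sweep ID

# One agent per GPU, each restricted to a single device.
agents = []
for gpu in range(2):
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    agents.append(subprocess.Popen(["wandb", "agent", sweep_path], env=env))

for agent in agents:
    agent.wait()
```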

Review Questions

  1. How would you reconstruct the exact conditions of a past training run if the notebook output was lost—what metadata must have been logged?
  2. Compare TensorBoard and W&B in terms of multi-experiment comparison, collaboration, and reproducibility features.
  3. In a W&B project with many runs, how would you use derived metrics (like generalization gap) and filtering to identify overfitting patterns?

Key Points

  1. Training metrics printed to the console are not enough; without persistent logs and metadata, experiments can’t be reliably reproduced or compared.

  2. TensorBoard is effective for single-run visualization but becomes less practical when many experiments must be grouped, filtered, and managed.

  3. Weights & Biases improves experiment management with live dashboards, structured run tabs (Overview/System/Logs/Files/Artifacts), and strong integrations with PyTorch Lightning and other frameworks.

  4. Versioned artifacts (checkpoints and model I/O) plus lineage tracking make it possible to trace model outputs back to the exact training runs and code states.

  5. The W&B API enables deeper analysis by pulling logged tables into tools like pandas and traversing the run/artifact dependency graph.

  6. W&B Reports help turn raw experiment logs into shareable dashboards and PR-linked summaries that reduce confusion for reviewers and stakeholders.

  7. W&B hyperparameter sweeps use YAML-configured controllers and agent workers, making large-scale search and multi-GPU parallelization straightforward.

Highlights

Without logging, key training context disappears—restarting a notebook or closing a terminal can erase metrics and make it impossible to know which hyperparameters produced a result.
W&B’s run page separates concerns: metadata (Overview), compute health (System), terminal evidence (Logs), reproducibility inputs (Files), and versioned outputs (Artifacts).
Artifacts aren’t just stored—they’re versioned and linked through lineage, showing which runs created and which runs consumed each checkpoint.
Derived metrics like generalization gap help spot overfitting by comparing training loss and validation loss over time.
Hyperparameter optimization can be parallelized per GPU by launching multiple W&B agents with different CUDA_VISIBLE_DEVICES settings.
