Tune hyper-parameters (6) - Troubleshooting - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Hyper-parameter tuning is the last major lever after training and validation curves look "reasonably close," but it's hard because there are many possible knobs, ranging from the network architecture (e.g., choosing ResNet) to layer depth, initialization, convolution kernel sizes, optimizer settings (like Adam's beta1, beta2, epsilon), batch size, learning rate, learning rate schedules, and regularization. The practical challenge isn't just the number of knobs; it's that different models and datasets react to different hyper-parameters, and the only reliable way to learn which ones matter is to test multiple settings and build intuition about sensitivity.
A key rule of thumb is that many models are especially sensitive to learning rate and the learning rate schedule. That sensitivity often shows up more strongly than changes to the “shape” of the model itself, so it’s frequently more productive to tune the loss function and layer sizing (for example, whether layers use 64 vs. 256 units) than to start by swapping architectures. Another nuance: sensitivity is measured relative to default values. If weights are initialized in a pathological way (like all zeros), changing initialization can cause a dramatic jump; with sensible defaults, initialization may matter less.
Once the set of hyper-parameters to tune is chosen, the transcript lays out several optimization strategies, each with trade-offs in compute cost, ease of implementation, and how quickly the search narrows.
Manual tuning is the starting point for skilled practitioners: understand how changes affect training dynamics (e.g., higher learning rates can speed convergence but reduce stability), run a small number of experiments, and adjust based on learning curves. It can be compute-efficient, but it demands deep algorithm knowledge and is time-consuming.
Grid search is straightforward but inefficient: it evaluates every combination of values across predefined ranges, which becomes expensive as soon as more than two hyper-parameters are involved. Random search samples points within those ranges instead of exhaustively enumerating the grid; it's easy to implement and often finds better results than grid search, but it can feel messy and still depends on choosing sensible ranges.
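As a minimal sketch (the ranges and helper names here are illustrative, not taken from the video), random search just samples each hyper-parameter independently, with scale-sensitive ones like the learning rate drawn log-uniformly:

```python
import random

def sample_config():
    """Draw one random hyper-parameter configuration (illustrative ranges)."""
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),    # log-uniform in [1e-5, 1e-1]
        "batch_size": random.choice([32, 64, 128, 256]),
        "hidden_units": random.choice([64, 128, 256]),
    }

def train_and_evaluate(config):
    """Placeholder for a real training run; returns a validation error."""
    return random.random()  # stand-in so the sketch runs end to end

trials = [(cfg, train_and_evaluate(cfg)) for cfg in (sample_config() for _ in range(20))]
best_config, best_error = min(trials, key=lambda t: t[1])
print(best_config, best_error)
```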
To improve random search, “coarse-to-fine” random search repeats the process: sample broadly, keep the best-performing region, then resample within that narrower range. This approach is described as widely used in practice because it quickly homes in on good settings while staying relatively simple.
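A minimal sketch of that loop might look like the following; the ranges, the shrink factor, and the fake `run_trial` objective are assumptions for illustration, not values from the video:

```python
import math
import random

def run_trial(lr):
    """Placeholder training run: returns a fake validation error for a learning rate."""
    return abs(math.log10(lr) + 3) + 0.1 * random.random()  # pretend the optimum sits near 1e-3

def coarse_to_fine(lo=1e-5, hi=1e-1, rounds=3, trials_per_round=10, shrink=4.0):
    best_lr = lo
    for _ in range(rounds):
        # Sample log-uniformly within the current range and keep the best trial.
        lrs = [10 ** random.uniform(math.log10(lo), math.log10(hi)) for _ in range(trials_per_round)]
        best_lr = min(lrs, key=run_trial)
        # Narrow the range around the best point before the next, finer round.
        half_span = (math.log10(hi) - math.log10(lo)) / (2 * shrink)
        lo, hi = 10 ** (math.log10(best_lr) - half_span), 10 ** (math.log10(best_lr) + half_span)
    return best_lr

print(coarse_to_fine())
```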
For more hands-off tuning, Bayesian hyper-parameter optimization maintains a probabilistic model linking hyper-parameter choices to performance, then iteratively selects new trials expected to improve results. It can be powerful but is harder to implement from scratch and may be difficult to integrate, so the recommended workflow is to start with coarse-to-fine random search and consider Bayesian methods later as the codebase matures.
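For reference, a library such as Optuna (not named in the video, just one common choice) hides the probabilistic model behind a simple objective function; the ranges and the stub objective below are illustrative:

```python
import optuna  # third-party package: pip install optuna

def objective(trial):
    # Optuna models how these suggestions map to the returned score and
    # proposes increasingly promising trials; ranges here are illustrative.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    # Replace this stub with a real training run returning validation error.
    return (lr - 1e-3) ** 2 + 1e-6 * batch_size

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```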
In the Q&A, practical guidance emphasizes using prior knowledge to set ranges (e.g., borrowing learning rate ranges from tutorials or papers on similar tasks), starting wide on a log scale when uncertain, and using warm-up learning rate schedules especially when scaling to large batch sizes or distributed training. Differential learning rates are mentioned as useful mainly in fine-tuning scenarios. Cross-validation is generally discouraged in deep learning when data is abundant and training is expensive, while population-based training is highlighted as a genetic-algorithm-like alternative that has shown strong results. Noise in validation can mislead hyper-parameter selection, and the pragmatic response offered is to largely ignore it in practice, relying on robust search strategies.
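To make the warm-up idea concrete, here is a small sketch of a schedule that ramps the learning rate up over the first steps and then decays it; the linear decay and the specific step counts are assumptions, not prescriptions from the Q&A:

```python
def warmup_then_decay(step, base_lr=0.1, warmup_steps=1000, total_steps=100_000):
    """Linearly ramp the learning rate during warm-up, then decay it linearly to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # After warm-up, decay linearly toward zero over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - progress)

# Learning rate at a few points in training
for s in (0, 500, 1000, 50_000, 100_000):
    print(s, round(warmup_then_decay(s, base_lr=0.1), 4))
```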
Cornell Notes
Hyper-parameter tuning matters most after training and validation errors are already close, but the search is difficult because many knobs interact: architecture choices, layer depth, initialization, convolution kernel sizes, optimizer settings (e.g., Adam’s beta1/beta2/epsilon), batch size, learning rate, schedules, and regularization. Many models are especially sensitive to learning rate and its schedule, so tuning those often yields larger gains than swapping architectures. A practical workflow starts with coarse-to-fine random search: sample broadly, keep the best region, then resample within it. As projects mature, Bayesian hyper-parameter optimization can reduce manual effort by using a probabilistic model to guide new trials, though it’s harder to implement and integrate. Range selection benefits from prior knowledge; when uncertain, start wide on a log scale and then zoom in.
Why is hyper-parameter tuning so challenging even when training/validation curves look good?
Which hyper-parameters tend to drive the biggest performance changes, and why does that matter?
How do manual tuning, grid search, and random search differ in practice?
What is coarse-to-fine random search, and why is it popular?
When does Bayesian hyper-parameter optimization become a good next step?
What practical tactics help set learning-rate ranges and stabilize training at scale?
Review Questions
- What makes hyper-parameter sensitivity difficult to predict, and how does the recommended approach address that uncertainty?
- Compare grid search and random search in terms of compute cost and how they scale with the number of hyper-parameters.
- Why is warm-up particularly important when scaling to very large batch sizes in distributed training?
Key Points
1. Hyper-parameter tuning is necessary once training and validation errors are close, but it’s difficult because many interacting knobs exist across architecture, optimizer, and training settings.
2. Learning rate and learning rate schedules are often the most sensitive hyper-parameters, so they frequently deserve early tuning attention.
3. Sensitivity depends on the baseline: changes from pathological defaults (like all-zero initialization) can dominate, while sensible defaults can reduce sensitivity to some choices.
4. Coarse-to-fine random search is a practical default: sample broadly, keep the best region, then resample within it repeatedly.
5. Grid search becomes inefficient as the number of tuned hyper-parameters grows due to the cross-product of combinations.
6. Bayesian hyper-parameter optimization can reduce manual effort later, but it’s harder to implement and integrate, so it’s best introduced after the workflow stabilizes.
7. Range selection should use prior knowledge when available; otherwise, start wide on a log scale and narrow based on results, and use learning-rate warm-up for large-batch/distributed training.