Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - Spring 2021)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Troubleshooting deep neural networks is hard because the same drop in performance can come from many different causes, and many bugs don't announce themselves loudly. A model that underperforms a reference learning curve might be suffering from an implementation mistake, but the cause could just as well be hyperparameter sensitivity (learning rate, weight initialization), a mismatch between the training data and the data used in the original results, or data construction problems such as class imbalance and label noise. Even when a bug is found, isolating which factor caused the degradation is often difficult because neural networks can react sharply to small changes.
A concrete example illustrates the problem: a model that "wasn't training at all" turned out to be the victim of nondeterministic file ordering from Python's glob, so the training pipeline silently fed data in an unintended order. The lecture then broadens the lens: learning rate choices can make training stall or diverge; weight initialization can be the difference between total failure and state-of-the-art results; and using a different dataset, or one constructed differently from the reference, can degrade performance even if the model code is correct. Data issues are especially common in industry, where time spent on data collection, labeling, and dataset construction often outweighs time spent on model algorithms.
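The glob failure mode is easy to reproduce. The sketch below is a minimal illustration rather than code from the lecture, and the `data/train/*` paths are hypothetical; it shows how relying on glob's arbitrary ordering can silently mis-pair inputs and labels, and how sorting makes the pipeline deterministic.

```python
import glob

# glob.glob makes no ordering guarantee: results come back in whatever order
# the filesystem returns directory entries, which can differ across machines
# and even across runs.
image_paths = glob.glob("data/train/*.png")    # hypothetical image files
label_paths = glob.glob("data/train/*.json")   # hypothetical per-image labels

# If the two lists are zipped together as-is, images can end up paired with
# the wrong labels. Sorting both lists restores a deterministic correspondence.
image_paths = sorted(image_paths)
label_paths = sorted(label_paths)
pairs = list(zip(image_paths, label_paths))
```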
To make debugging less like random stirring and more like controlled diagnosis, the lecture pushes a single mindset: pessimism. Instead of assuming the first attempt is close to correct, it recommends a staged strategy that starts with the simplest possible setup and increases complexity gradually. The workflow is: pick the simplest model and dataset, implement it, get it to run, then overfit a single batch until the loss can be driven arbitrarily close to zero. If that fails, the problem is almost certainly in the implementation or data pipeline (common culprits include corrupted data/labels, silent broadcasting or shape issues, incorrect preprocessing, or over-regularization). Once a single batch can be memorized, compare results against a known reference—ideally an official implementation on a similar dataset, otherwise a benchmark like MNIST, or even simple baselines such as predicting the mean.
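The single-batch check is simple to express in code. The PyTorch sketch below is illustrative, not taken from the lecture: the random ten-feature batch, the two-layer classifier, the 3e-4 learning rate, and the step count are all assumptions, but the structure (fit the same fixed batch repeatedly and watch the loss fall toward zero) is the check itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical fixed batch: 32 examples, 10 features, 2 classes.
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Train on the *same* batch over and over. A correctly wired pipeline should
# memorize it and drive the loss toward zero; if the loss plateaus, suspect
# shapes, preprocessing, the loss setup, corrupted labels, or regularization.
for step in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.5f}")
```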
After the model is “bug-free enough,” the lecture shifts to deciding what to improve next using bias-variance decomposition. Training error, validation error, and test error are treated as signals: gaps indicate avoidable bias (underfitting), variance (overfitting to training), and validation/test overfitting. When train and test come from different distributions, it recommends using two validation sets—one sampled from the training distribution and one from the test distribution—to quantify distribution shift as an additional error term.
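A back-of-the-envelope version of this decomposition, with made-up error rates and hypothetical variable names, shows how the gaps map to separate signals:

```python
# Hypothetical error rates (fractions); the numbers are illustrative only.
best_achievable_error = 0.01   # proxy for irreducible/human-level error
train_error           = 0.04
train_val_error       = 0.09   # validation set drawn from the training distribution
test_val_error        = 0.15   # validation set drawn from the test distribution
test_error            = 0.16

avoidable_bias     = train_error - best_achievable_error   # underfitting
variance           = train_val_error - train_error         # overfitting to training data
distribution_shift = test_val_error - train_val_error      # train/test mismatch
val_overfitting    = test_error - test_val_error           # overfitting to the validation set

for name, value in [("avoidable bias", avoidable_bias),
                    ("variance", variance),
                    ("distribution shift", distribution_shift),
                    ("validation overfitting", val_overfitting)]:
    print(f"{name:22s} {value:.2f}")
```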
With those diagnostics, improvement priorities follow a sequence: reduce underfitting first (often by increasing model capacity, adjusting regularization, tuning hyperparameters, or adding features), then address overfitting (typically by adding data, using normalization/augmentation, and tuning), then tackle distribution shift via error analysis and targeted data collection or synthetic data, and finally rebalance the validation strategy if hyperparameter search has overfit to the validation set. For hyperparameter tuning itself, the lecture recommends starting with sensible defaults (e.g., Adam, a "magic" learning rate, and leaving regularization/normalization out initially to avoid extra bugs), then tuning the learning rate first and using coarse-to-fine random search to efficiently explore the space before considering Bayesian optimization as projects mature.
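As a sketch of coarse-to-fine random search (the ranges, trial counts, and narrowed interval are assumptions, not values from the lecture), learning rates are sampled log-uniformly over a wide range first, then resampled in a tighter range around whatever the coarse pass suggested:

```python
import random

random.seed(0)

def sample_lr(low_exp, high_exp):
    # Sample on a log scale so trials spread evenly across orders of magnitude
    # instead of clustering near the top of the range.
    return 10 ** random.uniform(low_exp, high_exp)

# Coarse pass: wide range, few trials, typically with short training runs.
coarse = sorted(sample_lr(-5, -2) for _ in range(8))
print("coarse candidates:", [f"{lr:.1e}" for lr in coarse])

# Fine pass: after evaluating the coarse candidates, resample in a narrow
# window around the best region (here assumed, purely for illustration, to
# sit near 3e-4).
fine = sorted(sample_lr(-3.8, -3.2) for _ in range(8))
print("fine candidates:  ", [f"{lr:.1e}" for lr in fine])
```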
Overall, the central message is operational: build in layers, verify each layer with strict sanity checks, and let measured error patterns dictate the next engineering move rather than guessing.
Cornell Notes
Deep neural networks fail for many overlapping reasons: implementation bugs, hyperparameter sensitivity, and dataset mismatch or dataset construction errors can all produce the same learning-curve degradation. The lecture’s core strategy is to adopt pessimism and debug in stages—start with the simplest model and dataset, get the model running, then overfit a single batch until loss approaches zero. If that sanity check fails, the likely culprit is a bug in shapes, preprocessing, loss setup, corrupted labels, or excessive regularization. Once single-batch overfitting works, bias-variance decomposition (with two validation sets when distributions shift) guides whether to increase capacity, add data/regularization, or address distribution shift through targeted error analysis and data augmentation/synthesis.
Why can the same performance drop have multiple causes in deep learning?
What does “overfit a single batch” prove, and what failures usually mean?
How does bias-variance decomposition guide next steps after the model is bug-free?
What is the recommended order of improvement priorities?
Which hyperparameters should be tuned first, and why?
Review Questions
- When a model underperforms a reference learning curve, what are at least three distinct categories of causes that could produce the same symptom?
- Why is overfitting a single batch considered a strong sanity check before expanding the dataset or model complexity?
- How do two validation sets help separate distribution shift from bias/variance effects?
Key Points
1. Treat performance degradation as ambiguous: implementation bugs, hyperparameter sensitivity, and dataset mismatch/construction issues can all look similar.
2. Adopt pessimism and debug incrementally: start simple, verify each stage, and only add complexity after passing strict checks.
3. Get the model to run, then overfit a single batch until loss approaches zero; failure usually indicates wiring/data/pipeline bugs rather than "hard learning."
4. Use known results (official implementations, benchmarks like MNIST, or simple baselines) to confirm the implementation is behaving as expected.
5. Apply bias-variance decomposition to decide whether to increase capacity (underfitting), add data/regularization (overfitting), or address distribution shift (via error analysis and targeted data).
6. When train and test distributions differ, use two validation sets to quantify distribution shift separately from bias and variance.
7. Tune hyperparameters efficiently: start with sensible defaults, tune the learning rate first, and use coarse-to-fine random search before considering Bayesian optimization as the project matures.