Lecture 8: Troubleshooting Deep Neural Networks - Full Stack Deep Learning - March 2019

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat performance drops as ambiguous symptoms: implementation bugs, hyperparameters, model-data mismatch, and data construction errors can all look the same from learning curves.

Briefing

Troubleshooting deep neural networks is hard not because training is mysterious, but because the same drop in performance can come from many different failures—often with bugs that fail silently. The core message is that top practitioners treat debugging like a decision tree: start with the simplest, most controllable setup, verify it works end-to-end, then gradually add complexity while using validation signals to decide whether to improve the model, the data, or the training setup.

The lecture frames why debugging dominates real-world deep learning work. Even advanced teams can spend most of their time on “invisible” issues: implementation mistakes, unstable numerics, mismatched tensor shapes, and subtle preprocessing errors. A concrete example comes from a training pipeline where file ordering was nondeterministic—using Python glob without sorting mixed up features and labels, causing training to fail while looking superficially plausible. Hyperparameters can also derail learning entirely: learning rate choices that are too small or too large can prevent convergence, and certain initialization schemes can be essential for networks like ResNet to train at all. Model choice matters too; copying an architecture that worked on ImageNet doesn’t guarantee it will work on very different domains such as self-driving car imagery. Finally, data construction is a frequent culprit: insufficient data, class imbalance, noisy labels, and distribution shift between training and testing can all produce the same symptom—worse-than-expected accuracy.
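
A minimal Python sketch of the glob pitfall described above (the directory layout and pairing-by-position scheme are hypothetical):

```python
import glob

# Bug: glob.glob returns files in arbitrary filesystem order, which can
# differ across machines and runs. If features and labels are paired by
# list position, the pairs can be silently misaligned.
feature_files = glob.glob("data/features/*.npy")  # order not guaranteed
label_files = glob.glob("data/labels/*.npy")      # possibly a different order

# Fix: sort both listings so the i-th feature matches the i-th label.
feature_files = sorted(glob.glob("data/features/*.npy"))
label_files = sorted(glob.glob("data/labels/*.npy"))
```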

Because so many causes can look identical from the outside, the lecture argues for a pessimistic mindset and a disciplined workflow. The recommended strategy is a loop: begin with the simplest possible architecture and dataset, implement and debug it, evaluate performance, then make a targeted decision—improve the model, improve the data, or tune hyperparameters—before ramping up complexity. “Start simple” isn’t just philosophical; it’s operationalized through steps like overfitting a single batch. If a model can’t drive training error near zero on a tiny slice of data, the problem is likely a bug in the pipeline (shapes, casting, loss inputs, preprocessing) rather than a lack of model capacity.
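
A minimal single-batch overfitting loop, sketched here in PyTorch (the lecture is framework-agnostic; the toy model, data, and step count below are placeholders):

```python
import torch
import torch.nn.functional as F

# Any small model will do for this check; this one is a placeholder.
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# One tiny, fixed batch, reused on every step.
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))

for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # expects raw logits, not softmax output
    loss.backward()
    optimizer.step()

# Should be near zero; if it is not, suspect a pipeline bug, not capacity.
print(f"final single-batch loss: {loss.item():.4f}")
```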

The lecture then lays out practical defaults for early debugging: use Adam with a learning rate of 3e-4, start with ReLU for fully connected or convolutional networks and tanh for LSTMs (noting typical initialization choices), avoid regularization and batch normalization at first because they can introduce additional failure modes, and normalize inputs by subtracting the mean and dividing by the variance (while avoiding common mistakes like dividing by 255 twice). It also recommends simplifying the dataset—fewer classes, smaller images, or synthetic data—so the team can validate the training loop quickly.
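
As a sketch of careful normalization (assuming NumPy image arrays; the lecture says "divide by variance", which in common practice is implemented as dividing by the standard deviation, as here):

```python
import numpy as np

def normalize(images: np.ndarray) -> np.ndarray:
    """Zero-center and rescale inputs, guarding against double-scaling."""
    x = images.astype(np.float32)
    # Guard against the classic bug of dividing by 255 twice (once in the
    # loader, once here): only rescale if values still look like raw pixels.
    if x.max() > 1.0:
        x = x / 255.0
    return (x - x.mean()) / (x.std() + 1e-8)  # epsilon avoids division by zero
```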

Once training runs and a single batch can be overfit, the next step is comparing against known results: official implementations from paper authors are best, otherwise benchmark datasets like MNIST can reveal pipeline bugs, and random GitHub reimplementations should be treated cautiously because many contain serious errors. After that, performance decisions should be guided by bias-variance decomposition: training error indicates underfitting, the gap between training and validation indicates variance/overfitting to training, and validation-to-test gaps reveal distribution shift or overfitting to the validation set. The order of fixes follows that decomposition—address underfitting first, then overfitting, then distribution shift, and finally validation overfitting.

Hyperparameter tuning is treated as an empirical, often manual process (“graduate student descent”), with random search and coarse-to-fine strategies often outperforming naive grid search. Bayesian optimization is presented as a later-stage option once the codebase and experimentation pipeline are mature. The overall takeaway is a method for turning an overwhelming debugging space into a sequence of verifiable checkpoints, so complexity is added only after earlier assumptions are proven safe.

Cornell Notes

Deep neural network debugging is difficult because many distinct problems—silent implementation bugs, hyperparameter sensitivity, wrong model-data fit, and data construction errors—can all produce the same symptom: degraded performance. The lecture recommends a pessimistic, decision-tree workflow: start with the simplest architecture and dataset, use sensible training defaults, normalize inputs correctly, and verify the pipeline by making the model run and overfit a single batch. After that, compare against known results (ideally official implementations or strong benchmarks) to build confidence. Finally, use bias-variance decomposition and validation/test comparisons to decide whether to improve the model, the data, or training settings, fixing underfitting before overfitting and then addressing distribution shift.

Why can deep learning performance degrade for reasons that look identical from the outside?

Multiple failure modes can produce the same learning-curve symptom. Implementation bugs can be “invisible,” such as mixing up features and labels due to nondeterministic file ordering (e.g., using Python glob without sorting). Hyperparameters can prevent convergence entirely (learning rate too small/large; specific initialization schemes required for ResNet-like models to train). Model choice can fail when the data domain differs from what the architecture was validated on (ImageNet-trained models may not transfer cleanly to self-driving car imagery). Data construction can also be wrong: insufficient data, class imbalance, noisy labels, or distribution shift between training and testing.

What is the purpose of overfitting a single batch, and what does it reveal?

Overfitting a single batch means driving training error arbitrarily close to zero on a tiny fixed subset by repeatedly training on the same data. If training error cannot be reduced near zero, the issue is likely a bug in the pipeline rather than insufficient model capacity. The lecture notes common causes: tensor shape mistakes that fail silently via broadcasting, incorrect preprocessing/normalization or excessive augmentation, passing the wrong inputs to the loss function (e.g., softmaxing before a loss that expects logits), batch norm running in the wrong mode (train vs eval), and numerical instability producing NaNs.
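
To make the loss-input pitfall concrete, a small PyTorch sketch (PyTorch is an assumption here; the point applies to any framework whose cross-entropy expects logits):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)            # raw model outputs
targets = torch.randint(0, 10, (4,))

# Bug: F.cross_entropy applies log-softmax internally, so softmaxing first
# double-squashes the outputs. Training still runs, but gradients are wrong
# and the loss plateaus well above zero.
buggy_loss = F.cross_entropy(F.softmax(logits, dim=1), targets)

# Correct: pass raw logits straight to the loss.
correct_loss = F.cross_entropy(logits, targets)
```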

What early-stage defaults help reduce debugging noise?

For early debugging, the lecture recommends starting with sensible defaults and minimal extra moving parts: use Adam with learning rate 3e-4; use ReLU activations for fully connected and convolutional models, and tanh for LSTMs; start with no regularization and avoid batch normalization initially because it can introduce additional implementation pitfalls (train/eval mode handling, distribution shift sensitivity). It also recommends normalizing inputs by subtracting the mean and dividing by the variance, while avoiding double-scaling errors such as dividing by 255 twice.
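
The train/eval pitfall can be shown in a few lines of PyTorch (the tiny model below is a placeholder):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.BatchNorm1d(16),  # behaves differently in train vs eval mode
    torch.nn.ReLU(),
)
x = torch.randn(8, 16)

model.train()                 # uses batch statistics, updates running averages
train_out = model(x)

model.eval()                  # uses stored running statistics instead
with torch.no_grad():
    eval_out = model(x)

# Forgetting model.eval() at inference (or model.train() when resuming
# training) yields silently different activations for the same input.
print(torch.allclose(train_out, eval_out))  # typically False
```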

How should a team decide whether to improve the model, the data, or hyperparameters after training runs?

Use bias-variance decomposition guided by training/validation/test gaps. Underfitting shows up when training error stays high relative to the best achievable (baseline/human/paper target). Variance/overfitting to training appears when validation error is much higher than training error. Distribution shift is suggested when validation sets drawn from different distributions (e.g., daytime vs nighttime) show a gap, or when validation-to-test error diverges. The lecture’s fix order is: address underfitting first (often by making the model bigger or reducing regularization), then address overfitting (more data, normalization, augmentation, and sometimes weight decay), then address distribution shift (error analysis, collecting/synthesizing targeted data, or domain adaptation methods).
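
As an illustration of that fix order, a tiny helper that maps error gaps to next steps (the function, the 5-point thresholds, and the sample error values are invented for the example):

```python
def diagnose(train_err, val_err, test_err, goal_err, gap=0.05):
    """Map error gaps to the lecture's fix order; thresholds are illustrative."""
    if train_err - goal_err > gap:
        return "underfitting: bigger model, less regularization"
    if val_err - train_err > gap:
        return "overfitting: more data, normalization, augmentation, weight decay"
    if test_err - val_err > gap:
        return "distribution shift or val overfitting: error analysis, targeted data"
    return "near goal: tune hyperparameters or raise the target"

print(diagnose(train_err=0.02, val_err=0.15, test_err=0.16, goal_err=0.01))
# -> overfitting: more data, normalization, augmentation, weight decay
```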

What are the recommended steps for making sure a model implementation is correct before scaling up?

The workflow is: (1) ensure the model runs (debug shape mismatches, casting issues, and out-of-memory by stepping through creation/inference and reducing batch size or memory-heavy ops), (2) overfit a single batch to catch pipeline bugs, and (3) compare against known results—prefer official implementations from paper authors, otherwise use strong benchmarks like MNIST. If a model can’t reach expected benchmark performance, the pipeline likely has a bug.
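
A possible "does it run" smoke test, sketched in PyTorch (the helper name, shapes, and checks are assumptions, not the lecture's code):

```python
import torch

def smoke_test(model, input_shape=(2, 3, 32, 32), num_classes=10):
    """One tiny forward pass to surface shape, dtype, and memory issues early."""
    x = torch.randn(*input_shape)  # small batch keeps memory use low
    out = model(x)
    assert out.shape == (input_shape[0], num_classes), (
        f"unexpected output shape {tuple(out.shape)}"
    )
    assert torch.isfinite(out).all(), "NaN/Inf in model outputs"
    return out
```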

How does hyperparameter tuning fit into the overall debugging strategy?

Hyperparameter tuning is treated as a later lever once the pipeline is verified. The lecture emphasizes that learning rate is highly sensitive and worth tuning early (including learning rate schedules). It also recommends practical search methods: manual tuning (“graduate student descent”) is common but requires intuition; grid search can be inefficient in high dimensions; random search often performs better; and coarse-to-fine random search can be effective. Bayesian optimization is presented as a more hands-off approach that becomes worthwhile when the project and experimentation pipeline are mature.
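
A sketch of coarse-to-fine random search in plain Python (the ranges and config keys are illustrative; note the learning rate is sampled on a log scale, since its effect spans orders of magnitude):

```python
import math
import random

def sample_config(lr_lo=1e-5, lr_hi=1e-1):
    """Draw one random config; learning rate is log-uniform."""
    return {
        "lr": 10 ** random.uniform(math.log10(lr_lo), math.log10(lr_hi)),
        "weight_decay": random.choice([0.0, 1e-5, 1e-4]),
    }

# Coarse pass: wide ranges, short training runs.
coarse_trials = [sample_config() for _ in range(20)]
# ...evaluate each briefly, find the best-performing region, then narrow:
fine_trials = [sample_config(lr_lo=1e-4, lr_hi=1e-3) for _ in range(20)]
```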

Review Questions

  1. When would overfitting a single batch fail, and what specific categories of bugs should be checked first?
  2. How do training/validation/test gaps map to underfitting, overfitting, and distribution shift in the lecture’s bias-variance framework?
  3. Which early-stage training defaults does the lecture recommend avoiding (and why) to reduce debugging complexity?

Key Points

  1. Treat performance drops as ambiguous symptoms: implementation bugs, hyperparameters, model-data mismatch, and data construction errors can all look the same from learning curves.

  2. Adopt a pessimistic workflow: start with the simplest architecture and dataset, then add complexity only after each checkpoint is proven.

  3. Verify correctness by ensuring the model runs, then overfit a single batch to near-zero training error; failure usually indicates a pipeline bug.

  4. Use sensible early defaults (Adam with learning rate 3e-4, ReLU/LSTM-appropriate activations, no regularization, avoid batch norm initially) and normalize inputs carefully without double-scaling.

  5. Ground the debugging decision in bias-variance decomposition: training error diagnoses underfitting, training-to-validation gaps diagnose variance/overfitting, and validation-to-test gaps can indicate distribution shift or validation overfitting.

  6. Compare against known results early (official implementations or strong benchmarks like MNIST) because many random reimplementations can contain serious bugs.

  7. Tune hyperparameters after the pipeline is stable; prioritize learning rate and learning rate schedules, and use random/coarse-to-fine search before considering Bayesian optimization.

Highlights

A nondeterministic file order (e.g., using glob without sorting) can silently swap features and labels and waste days of debugging time.
If a model can’t overfit a single batch, the problem is usually a bug (shapes, preprocessing, loss inputs, batch norm mode, or numerical instability), not “insufficient model capacity.”
Bias-variance decomposition turns vague “it’s not working” results into actionable next steps: fix underfitting first, then overfitting, then distribution shift, then validation overfitting.
The lecture recommends avoiding batch normalization and regularization at the earliest debugging stages because they add failure modes and can hide root causes.
Known-good comparisons matter: official implementations and benchmark datasets like MNIST can quickly expose pipeline errors that learning curves alone won’t reveal.

Topics

Mentioned

  • Andrej Karpathy
  • Sergei
  • Lucas
  • ResNet
  • CTC loss
  • LSTM
  • ReLU
  • Adam
  • TF data API
  • IPDB
  • MNIST
  • GPU
  • l2
  • l1
  • NaN