Improve (5) - Troubleshooting - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Model improvement starts with a simple priority order: fix underfitting first, then tackle overfitting, and only after both training and validation performance look acceptable should attention shift to distribution shift. The key practical rule is to make sure the model can reach the target training-set performance before worrying about generalization. If training error is far above the goal—an example given is 20% training error when the target is 1%—the fastest path is usually to increase model capacity by widening layers or adding depth. Other underfitting remedies include reducing regularization and swapping in a more modern architecture closer to state of the art, followed by tuning hyperparameters and adding features only when necessary (since deep nets are expected to learn useful representations themselves).
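To make the capacity lever concrete, here is a minimal PyTorch sketch (the framework, layer widths, and depth are illustrative assumptions, not prescribed by the lecture): an underfitting baseline is swapped for a wider, deeper network.

```python
import torch.nn as nn

# Hypothetical baseline that underfits: one small hidden layer.
baseline = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# First remedy: more capacity, via wider layers and extra depth.
bigger = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
```

If the bigger model still cannot hit the training target, the same logic points to a stronger architecture family (e.g., a ResNet for images) rather than more of the same layers.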
Once training error is in line with expectations, a large gap between training and validation error signals overfitting. The most effective remedy, when feasible, is adding more training data. When data collection isn’t possible, the toolbox expands: normalization layers such as batch normalization or layer normalization can act as regularizers; data augmentation (like flipping and rotating images) often improves robustness; and classic regularization methods such as dropout or L2 weight decay remain viable. The guidance also cautions against over-relying on early stopping as the main anti-overfitting strategy—early stopping can save compute when validation loss plateaus, but more principled approaches (data, augmentation, regularization, architecture changes) typically yield more “juice.”
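This toolbox maps directly onto a few lines of code. A hedged PyTorch/torchvision sketch, where every specific value (rotation angle, dropout rate, weight-decay strength) is an assumption chosen for illustration:

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation: the flips and rotations mentioned above.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])

# Normalization (batch norm) and dropout inside the model.
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.BatchNorm1d(512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

# Classic L2 regularization, applied as weight decay in the optimizer.
optimizer = torch.optim.SGD(
    model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4
)
```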
A worked example ties these ideas together. With a tiny dataset of 10,000 examples, validation error can stay high even if training error improves. Increasing the dataset size (to something like 250,000) can reduce validation error, even if training error rises. From there, adding weight decay and data augmentation can bring both training and validation errors into a desirable range. At that point, the process shifts to a broad hyperparameter optimization sweep to fine-tune performance.
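In practice the sweep step is often a random search over a handful of hyperparameters. A minimal sketch, assuming a `train_and_eval(config)` function that trains a model and returns validation error (the search space, ranges, and trial count below are all hypothetical):

```python
import random

# Hypothetical search space; log-uniform for scale-sensitive parameters.
SPACE = {
    "lr":           lambda: 10 ** random.uniform(-5, -2),
    "weight_decay": lambda: 10 ** random.uniform(-6, -3),
    "dropout":      lambda: random.uniform(0.0, 0.5),
}

def random_search(train_and_eval, n_trials=20):
    """Sample configs, keep the one with the lowest validation error."""
    best_config, best_err = None, float("inf")
    for _ in range(n_trials):
        config = {name: sample() for name, sample in SPACE.items()}
        err = train_and_eval(config)
        if err < best_err:
            best_config, best_err = config, err
    return best_config, best_err
```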
After the model is performing well on both training and validation, the next problem becomes distribution shift: the model may fail on systematic cases it rarely sees in training. The recommended approach is error analysis on the validation/test set, then categorizing mistakes to identify which gaps matter most. A concrete example uses pedestrian detection errors: some failures come from pedestrians being hard to see, some from windshield reflections, and some only from nighttime scenes. The prioritization logic is straightforward—estimate each error type’s contribution to total error and weigh the difficulty of intervention. Nighttime errors may be a small fraction of training mistakes but a large driver of test failures, so the priority becomes collecting more nighttime data; if that’s impossible, synthetic darkening or nighttime simulation, or domain adaptation, can help.
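The bookkeeping behind error analysis can be very simple: hand-label a sample of validation/test mistakes with a category, then tally. A small sketch with made-up category labels:

```python
from collections import Counter

# Hypothetical hand-assigned categories, one per analyzed mistake.
error_categories = [
    "nighttime", "hard_to_see", "nighttime", "reflection",
    "nighttime", "nighttime", "hard_to_see", "reflection",
]

counts = Counter(error_categories)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({n / total:.0%} of analyzed errors)")
```

The resulting shares feed the prioritization step above: weigh each category's contribution to total error against how hard the fix is (collecting more data versus synthesizing it).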
Domain adaptation is framed as a way to transfer from a source distribution with labels to a target distribution with limited or no labels, often using unlabeled target data. Supervised domain adaptation can involve fine-tuning a pretrained model or mixing in labeled target data; unsupervised domain adaptation uses methods such as correlation alignment, domain confusion, or cycle-based techniques. Finally, there’s a meta-step: periodically rebalance validation and test splits if validation performance looks suspiciously better than the held-out test set—an issue that can arise after extensive hyperparameter tuning on the same validation set. The session closes with practical Q&A: fixed random seeds help reproducibility (especially in reinforcement learning), uncertainty and “hard examples” can guide labeling, and the right objective is best verified by tracking the metric that actually matters in deployment.
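The supervised route described above, fine-tuning a pretrained model on a small labeled target set, might look like the following PyTorch/torchvision sketch (the backbone choice, frozen layers, and two-class head are assumptions for illustration; requires a recent torchvision):

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a model pretrained on the labeled source distribution.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so only the new head trains on the small
# labeled target set.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g., pedestrian vs. not

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```

Unfreezing more layers as target labels accumulate, or mixing labeled target data into training, are variations on the same supervised recipe.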
Cornell Notes
Improvement follows a bias-variance-style priority: first eliminate underfitting by ensuring the model can reach the target training performance; then address overfitting when training is good but validation lags; finally handle distribution shift once both training and validation are in a good range. Underfitting is often fixed by increasing capacity (more layers or wider layers), reducing regularization, adopting a stronger architecture (e.g., ResNet), and tuning hyperparameters. Overfitting is best reduced with more training data; alternatives include normalization (batch/layer norm), data augmentation, and regularization like dropout or L2 weight decay. Distribution shift is tackled through systematic error analysis—categorize failures (e.g., hard-to-see pedestrians, reflections, nighttime) and prioritize interventions based on contribution and feasibility. Domain adaptation helps when target labels are scarce, using labeled source data plus unlabeled or limited labeled target data.
How should a team decide what to fix first when a model’s performance is poor?
What are the main levers for underfitting, and why is model capacity the first choice?
When training error is low but validation error is high, what’s the recommended overfitting playbook?
How does the error-analysis workflow prioritize which data to collect or synthesize?
What is domain adaptation, and when does it make sense?
Why might validation and test splits need rebalancing, and how is that detected?
Review Questions
- If training error is far above the target, what specific diagnostic step prevents wasting time on overfitting fixes?
- In the pedestrian-error example, what evidence justifies prioritizing nighttime data over reflections or hard-to-see pedestrians?
- What signals suggest that validation performance has become unreliable due to repeated tuning, and what corrective action is recommended?
Key Points
1. Fix underfitting first by verifying the model can reach target training-set performance before addressing generalization gaps.
2. Increase capacity (wider layers or deeper networks) and reduce regularization when training error is high relative to the goal.
3. Treat a large training–validation gap as overfitting and prioritize more training data; use normalization, data augmentation, dropout, or L2 weight decay when data is limited.
4. Use systematic error analysis to categorize failures, estimate each category's contribution, and prioritize interventions based on both impact and feasibility.
5. Handle distribution shift after training/validation are aligned by collecting targeted data, synthesizing data, or applying domain adaptation when target labels are scarce.
6. Periodically check whether validation error is unrealistically better than held-out test error after extensive tuning, and reshuffle/resample splits if needed (see the sketch below).
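For the rebalancing step in key point 6, a hedged sketch assuming scikit-learn and pooled held-out data (all names, sizes, and the 50/50 split are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical pooled held-out examples and labels.
rng = np.random.default_rng(0)
held_out_x = rng.normal(size=(1000, 16))
held_out_y = rng.integers(0, 2, size=1000)

# Re-draw fresh validation/test splits so the validation set no longer
# encodes choices tuned against the old one.
val_x, test_x, val_y, test_y = train_test_split(
    held_out_x, held_out_y, test_size=0.5, shuffle=True, random_state=42,
)
```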