Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Adam (Adaptive Moment Estimation) has become a default optimizer in deep learning because it blends two older ideas, momentum and RMSProp-style adaptive learning rates, into a single update rule that adapts per parameter. It’s especially common when training neural networks such as CNNs and RNNs, where getting stable convergence without heavy manual tuning can make a practical difference.
The core motivation starts with the limitations of earlier optimizers. Mini-batch gradient descent can move toward minima, but it often does so slowly. Momentum speeds things up by using a “velocity” term that accumulates past gradients, letting the optimizer descend faster and reducing the number of steps needed to reach the valley of the loss landscape. Yet momentum can introduce oscillations—overshooting back and forth around the minimum—before gradually settling.
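As a minimal sketch (not from the video; the toy quadratic loss and the hyperparameter values below are illustrative assumptions), the velocity-based momentum update looks like this in plain Python:

```python
# Minimal momentum sketch on a toy 1-D quadratic loss L(w) = w**2.
# The loss, learning rate, and beta are illustrative choices, not values from the video.
def grad(w):
    return 2.0 * w       # dL/dw for L(w) = w**2

w, v = 5.0, 0.0          # start away from the minimum at w = 0
lr, beta = 0.1, 0.9      # learning rate and momentum coefficient

for _ in range(100):
    g = grad(w)
    v = beta * v + g     # velocity accumulates past gradients
    w = w - lr * v       # step along the accumulated direction

print(round(w, 4))       # approaches 0, typically overshooting (oscillating) on the way
```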
RMSProp addresses a different failure mode: when the learning rate decays too aggressively, the optimizer stalls or stops making meaningful progress before reaching the minimum. In the transcript’s framing, RMSProp damps this behavior by keeping an exponentially decayed average of squared gradients and scaling each update by the square root of that running estimate, helping the method reach the bottom more reliably than plain learning-rate schedules.
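A matching RMSProp sketch, again with an illustrative toy loss and made-up hyperparameters rather than values from the video, shows the squared-gradient scaling:

```python
import math

# Minimal RMSProp sketch on the same toy loss L(w) = w**2; values are illustrative.
def grad(w):
    return 2.0 * w

w = 5.0
s = 0.0                                     # running average of squared gradients
lr, beta2, eps = 0.1, 0.99, 1e-8

for _ in range(100):
    g = grad(w)
    s = beta2 * s + (1 - beta2) * g ** 2    # exponentially decayed squared-gradient average
    w = w - lr * g / (math.sqrt(s) + eps)   # per-parameter scaling keeps step sizes usable

print(round(w, 4))                          # moves toward the minimum at w = 0
```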
Adam merges these two streams. Momentum contributes the first moment (an exponentially decayed average of gradients), while RMSProp contributes the second moment (an exponentially decayed average of squared gradients). The combined update divides the bias-corrected first-moment estimate by the square root of the bias-corrected second-moment estimate plus a small constant, producing a step size that adapts to both direction (via momentum) and scale (via RMS-style normalization). Bias correction is needed because both running averages start at zero; without it, the early moment estimates are biased toward zero and the first steps are systematically mis-scaled. Typical defaults are β1 = 0.9 and β2 = 0.999, with the small constant ε preventing division by zero.
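Putting the pieces together, a minimal Adam sketch (assuming the common defaults and the same toy loss, neither taken from the video) looks like this:

```python
import math

# Minimal Adam sketch combining both moments with bias correction.
# Hyperparameters follow the common defaults; the toy loss L(w) = w**2 is an illustrative assumption.
def grad(w):
    return 2.0 * w

w = 5.0
m, v = 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # first moment: decayed average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment: decayed average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction: both averages start at zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 4))                            # moves close to the minimum at w = 0
```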
A key intuition is how Adam behaves on different loss landscapes. The transcript describes animations where momentum-like behavior can drift from the center and oscillate, while Adam’s combined normalization helps it move more directly toward the optimum, particularly in settings that are not strictly convex, where curvature changes across dimensions. When gradient scales differ sharply across features or parameters (the transcript contrasts moving from one side of the feature space to the other), Adam’s per-parameter adaptation helps manage the directional shifts that can otherwise cause instability.
In practice, the guidance is straightforward: Adam is a strong starting point for most deep learning tasks. If results aren’t satisfactory, alternatives like RMSProp or momentum-based methods can be tried—especially when a specific optimizer matches the problem’s dynamics better. Hyperparameter tuning remains the final lever: even with Adam, adjusting settings and testing on the dataset can determine whether it outperforms other choices. The overall takeaway is that Adam’s popularity comes from its reliable convergence behavior across many neural network training scenarios, with manageable tuning requirements.
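As a usage sketch, assuming PyTorch with a placeholder model and dummy data (none of which come from the video), switching between Adam and the fallback optimizers is typically a one-line change:

```python
import torch

# Placeholder model and data; learning rates are illustrative assumptions.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
# Drop-in alternatives to try if Adam's results disappoint:
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```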
Cornell Notes
Adam (Adaptive Moment Estimation) is a widely used optimizer that combines momentum and RMSProp-style normalization into one adaptive update rule. It keeps two exponentially decayed running averages: the first moment (average of gradients) and the second moment (average of squared gradients). Updates scale the direction from the first moment by the square root of the second moment (plus ε), producing parameter-wise learning rates. Because both averages start at zero, bias correction is applied early in training to avoid systematically wrong step sizes. Adam is often a strong default choice for training deep networks like CNNs and RNNs, with RMSProp or momentum as fallback options if performance lags.
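To see why bias correction matters, here is a quick first-iteration check with illustrative numbers (not from the video): the raw first moment starts far too small, and dividing by (1 - β1^t) restores the gradient's scale.

```python
# First-iteration check of bias correction (illustrative numbers).
beta1, g = 0.9, 4.0
m = beta1 * 0.0 + (1 - beta1) * g    # = 0.4, far smaller than the actual gradient
m_hat = m / (1 - beta1 ** 1)         # = 4.0, bias correction restores the gradient's scale
print(m, m_hat)
```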
Why does momentum speed up convergence, and what new problem does it introduce?
How does RMSProp-style learning-rate decay relate to missing the minimum?
What are the two “moments” inside Adam, and how do they shape the update?
Why is bias correction needed in Adam’s early iterations?
What typical hyperparameter values are mentioned for Adam?
When should someone consider switching away from Adam?
Review Questions
- Explain how Adam’s first and second moments differ, and describe how each affects the direction and magnitude of parameter updates.
- What problem arises from initializing Adam’s moment estimates at zero, and how does bias correction address it?
- Give a scenario where momentum might oscillate and explain why Adam’s normalization can help in that case.
Key Points
1. Adam (Adaptive Moment Estimation) is popular because it combines momentum (first moment) with RMSProp-style normalization (second moment) to adapt updates per parameter.
2. Momentum speeds up convergence by accumulating past gradients into a velocity term, but it can cause oscillations around minima.
3. RMSProp-style normalization helps prevent overly aggressive learning-rate decay from causing the optimizer to stall before reaching the minimum.
4. Adam’s update divides the first-moment estimate by the square root of the second-moment estimate plus ε, yielding adaptive step sizes.
5. Bias correction is required because both running averages start at zero, which otherwise biases early updates.
6. Common defaults are β1 = 0.9 and β2 = 0.999, though they can be tuned.
7. Adam is recommended as a strong starting point for deep learning; if results disappoint, try RMSProp or momentum and tune hyperparameters.