Problems with RNN | 100 Days of Deep Learning

CampusX · 4 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Vanilla RNNs struggle on long sequences because gradients from far-back time steps shrink (vanishing gradients), making learning rely mostly on recent inputs.

Briefing

RNNs struggle with two training failures that get worse as sequences get longer: long-term dependency learning breaks down, and gradients can become unstable during backpropagation through time. These twin problems explain why gated architectures like LSTMs became necessary, and why vanilla RNNs aren't widely used for tasks that require remembering information across many time steps.

The first issue is the long-term dependency problem. In sequential data, later outputs depend on earlier inputs: think of language, where the next word depends on context from far back. But when an RNN is trained with backpropagation through time, the influence of distant time steps on the loss shrinks. The mechanism is tied to vanishing gradients: during gradient computation, the contribution from far-back dependencies becomes extremely small, so learning mostly reflects recent inputs. The practical example given is next-word prediction: predicting a word like "Marathi" at the end of a long sentence depends on context mentioned much earlier (a region described as a "beautiful place"), yet an RNN can effectively "forget" that older information because it can't propagate useful error signals across many steps.
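To make the "forgetting" concrete, here is a minimal PyTorch sketch (not from the transcript; the shapes, seed, and sequence length are arbitrary) that measures how strongly a vanilla RNN's final output depends on early versus recent inputs:

```python
# Sketch: compare gradient norms of a vanilla RNN's last output with respect
# to the first and the last input of a 100-step sequence.
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len = 100
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)  # tanh by default

x = torch.randn(1, seq_len, 8, requires_grad=True)
out, _ = rnn(x)                      # out: (1, seq_len, hidden)
out[0, -1].sum().backward()          # backprop from the final time step only

grad_per_step = x.grad[0].norm(dim=1)          # gradient norm at each input step
print("grad wrt first input:", grad_per_step[0].item())
print("grad wrt last input :", grad_per_step[-1].item())
# Typically the first value is orders of magnitude smaller (often ~0):
# the error signal has vanished before reaching the early time steps.
```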

The transcript also frames this mathematically. When gradients are traced back through many time steps, the long-range terms are built from repeated multiplication by per-step factors smaller than one. As the number of time steps grows, these long-term derivative terms shrink toward zero, leaving the short-term terms to dominate the gradient. The result is that the model's updates don't meaningfully encode information from far back in the sequence.
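In symbols (a standard BPTT identity, slightly more explicit than the transcript's verbal argument, assuming the usual vanilla-RNN update h_t = tanh(W_hh h_{t-1} + W_xh x_t)):

```latex
% Gradient of the loss at step T with respect to an earlier hidden state h_k
\frac{\partial L_T}{\partial h_k}
  = \frac{\partial L_T}{\partial h_T}
    \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}

% If each factor has norm at most \gamma < 1, the product shrinks geometrically:
\left\lVert \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \right\rVert
  \le \gamma^{\,T-k} \;\longrightarrow\; 0 \quad \text{as } T-k \text{ grows.}
```

When the per-step norm instead exceeds one, the same product grows rather than shrinks, which is the exploding case described next.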

The second issue is unstable training caused by exploding gradients. In some setups, the repeated multiplication during backpropagation through time grows rapidly instead of shrinking. If the per-step derivatives are large (for example, because of large weight values) or the learning rate is overly aggressive, gradients can blow up, pushing weights toward extremely large values and preventing the model from converging.
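Whether the product vanishes or explodes comes down to whether the repeated per-step factor sits below or above one; a quick numeric check (illustrative numbers, not from the transcript):

```python
# Repeated multiplication over 100 time steps: a factor slightly below 1
# collapses toward zero (vanishing), slightly above 1 blows up (exploding).
steps = 100
print(f"0.9 ** {steps} = {0.9 ** steps:.2e}")   # ~2.66e-05 -> vanishing
print(f"1.1 ** {steps} = {1.1 ** steps:.2e}")   # ~1.38e+04 -> exploding
```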

To reduce these problems, the transcript lists three broad mitigation strategies for vanishing gradients: using different activation functions so derivatives don't collapse toward zero; initializing weights in a way that avoids problematic dynamics (for instance, initializing recurrent weights with identity matrices or using better initialization schemes); and architectural tweaks such as adding skip connections (mentioned as a direction to look up). For exploding gradients, the key fix is gradient clipping: capping gradient values at a maximum threshold so updates don't run away.
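Gradient clipping is a one-line addition to a standard training step. A minimal PyTorch sketch, with dummy data and an arbitrary clipping threshold of 1.0 (none of these specifics come from the transcript):

```python
# Illustrative training step with gradient clipping (PyTorch).
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 50, 8)        # dummy batch: (batch, time, features)
target = torch.randn(4, 50, 16)  # dummy targets matching the hidden size

out, _ = model(x)
loss = nn.functional.mse_loss(out, target)

optimizer.zero_grad()
loss.backward()
# Cap the total gradient norm at 1.0 before the update, so a single
# exploding batch cannot push the weights to extreme values.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```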

Finally, the discussion connects these failure modes to why LSTMs are widely adopted. LSTMs introduce gating mechanisms designed to preserve information over long horizons and stabilize gradient flow, making them far more reliable than plain RNNs for real-world sequence tasks like language and time-series prediction.
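In code, this motivation shows up as a near drop-in swap; a PyTorch sketch (interface only, training details omitted, shapes match the earlier examples):

```python
# Replacing the vanilla RNN with an LSTM keeps the interface nearly identical;
# the gates inside nn.LSTM are what let it carry information across long spans.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
x = torch.randn(1, 100, 8)
out, (h_n, c_n) = lstm(x)   # the LSTM also returns a cell state c_n
```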

Cornell Notes

Vanilla RNNs face two major training obstacles on long sequences: vanishing gradients and exploding gradients. Vanishing gradients make the loss signal from far-back time steps shrink toward zero, so learning becomes dominated by recent inputs and long-term context is effectively lost. Exploding gradients can instead grow rapidly through repeated backpropagation through time, driving weights to very large values and stopping convergence. The transcript outlines mitigations such as changing activation functions, using better weight initialization (including identity-like approaches), and applying gradient clipping. These issues motivate gated architectures like LSTMs, which are built to keep information and gradients stable across many time steps.

What exactly is the long-term dependency problem in RNNs, and why does it happen during training?

Long-term dependency means later outputs depend on much earlier inputs (e.g., next-word prediction depends on distant context). During backpropagation through time, gradients are propagated backward across many time steps. The transcript links this to vanishing gradients: the derivative terms associated with far-back time steps shrink toward zero as the number of steps increases. As a result, the gradient signal that would teach the model to use distant context becomes negligible, and updates are driven mainly by short-term dependencies.

How does the transcript justify that distant time steps contribute less to the gradient?

It uses a chain-rule style argument: the gradient with respect to earlier weights involves products of derivative terms across many time steps. As the “long-term” path length grows (e.g., 100 steps), the multiplicative factors make the long-range derivative term extremely small—described as “almost close to zero.” The short-term terms remain comparatively larger, so the gradient computation effectively becomes dominated by recent inputs rather than distant ones.

What conditions lead to exploding gradients, and what does it do to training?

Exploding gradients occur when the repeated multiplication in backpropagation through time grows instead of shrinking. The transcript gives two causes: (1) the per-step derivatives are large (for example, when the weights or activations push them well above one), so repeated multiplication yields very large numbers; and (2) the learning rate is too high, amplifying each update. The outcome is weights becoming extremely large, training failing to converge, and the model not learning correctly.

What are the practical fixes for vanishing gradients mentioned in the transcript?

Three directions are listed: (1) use different activation functions so derivatives don’t collapse toward zero (the transcript contrasts a problematic range near 0–1 with alternatives); (2) improve weight initialization—one idea mentioned is using identity-matrix-like initialization so multiplying by it doesn’t degrade signals; and (3) architectural changes such as skip connections (not detailed, but suggested as a known approach to look up).
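A sketch of the first two ideas in PyTorch; combining a ReLU nonlinearity with identity-initialized recurrent weights is the classic "IRNN" recipe, which the transcript only gestures at and does not spell out in code:

```python
# Vanishing-gradient mitigations in code: ReLU instead of tanh, and an
# identity matrix as the initial hidden-to-hidden weight, so repeated
# multiplication by it does not shrink the signal at the start of training.
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=32, nonlinearity='relu', batch_first=True)

with torch.no_grad():
    nn.init.eye_(rnn.weight_hh_l0)    # identity recurrent weight matrix
    nn.init.zeros_(rnn.bias_hh_l0)    # no recurrent bias at initialization
```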

How does gradient clipping address exploding gradients?

Gradient clipping caps gradients at a maximum value. Instead of letting gradients grow without bound, the training process limits the update magnitude. This prevents runaway weight updates in situations where gradients would otherwise explode.

Review Questions

  1. In your own words, how do vanishing gradients change what an RNN learns from a long sequence?
  2. Why can the same backpropagation-through-time mechanism lead to either vanishing or exploding gradients?
  3. Which two interventions in the transcript target gradient stability, and how do they differ (vanishing vs exploding)?

Key Points

  1. Vanilla RNNs struggle on long sequences because gradients from far-back time steps shrink (vanishing gradients), making learning rely mostly on recent inputs.

  2. Long-term dependency learning fails when the gradient contribution from distant steps becomes nearly zero as the number of time steps grows.

  3. Exploding gradients can occur when repeated gradient multiplications grow rapidly, often due to large effective derivatives or an overly high learning rate.

  4. Gradient clipping stabilizes training for exploding gradients by capping gradient magnitudes at a preset maximum.

  5. Mitigating vanishing gradients can involve changing activation functions, using better weight initialization (including identity-like ideas), and architectural tweaks such as skip connections.

  6. LSTMs are adopted because gating mechanisms are designed to preserve information and stabilize gradient flow over many time steps.

Highlights

Long-term dependency learning breaks because backpropagation through time multiplies many derivative terms, causing distant contributions to shrink toward zero.
As sequence length increases, the gradient becomes dominated by short-term dependencies, so the model effectively “forgets” far context.
Exploding gradients can drive weights to extremely large values and stop convergence, especially with large derivatives or high learning rates.
Gradient clipping prevents runaway updates by capping gradients at a maximum value.
The two gradient pathologies—vanishing and exploding—are the core reasons gated architectures like LSTMs became standard for sequence modeling.

Topics

  • RNN Limitations
  • Long-Term Dependencies
  • Vanishing Gradients
  • Exploding Gradients
  • LSTM Motivation

Mentioned

  • RNN
  • LSTM