Problems with RNN | 100 Days of Deep Learning
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Vanilla RNNs struggle on long sequences because gradients from far-back time steps shrink (vanishing gradients), making learning rely mostly on recent inputs.
Briefing
RNNs struggle with two training failures that get worse as sequences get longer: long-term dependency learning breaks down, and gradients can become unstable during backpropagation through time. Those twin problems explain why gated architectures like LSTMs became necessary—and why vanilla RNNs aren’t widely used for tasks that require remembering information across many time steps.
The first issue is the long-term dependency problem. In sequential data, later outputs depend on earlier inputs—think of language, where the next word depends on context from far back. But when an RNN is trained with backpropagation through time, the influence of distant time steps on the loss shrinks. The mechanism is tied to vanishing gradients: during gradient computation, the contribution from far-back dependencies becomes extremely small, so learning mostly reflects recent inputs. A practical example given is next-word prediction in Marathi: the phrase “beautiful place” depends on earlier context, yet an RNN can effectively “forget” older information because it can’t propagate useful error signals across many steps.
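To make the recurrence concrete, here is a minimal vanilla RNN forward pass in NumPy (illustrative shapes and random weights, not code from the transcript). The same weight matrices are reused at every time step, so the final hidden state has to compress the entire sequence:

```python
import numpy as np

# Minimal vanilla RNN forward pass (illustrative shapes, not the transcript's code).
# h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t): each step folds the new input into
# whatever is already stored in h_{t-1}, using the SAME weights every step.
rng = np.random.default_rng(0)
hidden, features, steps = 4, 3, 10

W_hh = rng.normal(scale=0.5, size=(hidden, hidden))    # recurrent weights
W_xh = rng.normal(scale=0.5, size=(hidden, features))  # input-to-hidden weights

h = np.zeros(hidden)
for t in range(steps):
    x_t = rng.normal(size=features)        # one input vector per time step
    h = np.tanh(W_hh @ h + W_xh @ x_t)     # recurrence: new state from old state + input

print(h.shape)  # final state must summarize the whole 10-step sequence
```

Because information from step 0 only survives by being passed through this update nine more times, anything the later updates overwrite is lost—which is exactly the forgetting behavior described above.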
The transcript also frames this mathematically. When gradients are traced back through many time steps, each backward step multiplies by the activation derivative (at most 1 for tanh or sigmoid) and by the recurrent weights, so the long-range derivative terms are products of many factors that end up near zero. As the number of time steps grows, these long-term terms compress toward zero, leaving short-term terms to dominate the gradient. The result is that the model’s updates don’t meaningfully encode information from far in the sequence.
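This shrinking product can be checked numerically. The sketch below (my own illustration, assuming small random recurrent weights) accumulates the per-step Jacobian dh_t/dh_{t-1} = diag(1 − h_t²) · W_hh over 30 steps and watches its norm collapse:

```python
import numpy as np

# Why distant terms vanish: the gradient of the loss w.r.t. h_0 contains the
# product of per-step Jacobians diag(1 - h_t**2) @ W_hh. With tanh derivatives
# <= 1 and smallish recurrent weights, that product decays geometrically.
rng = np.random.default_rng(1)
n = 8
W_hh = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)  # small recurrent weights (assumed)

h = np.zeros(n)
J = np.eye(n)          # running product of Jacobians
norms = []
for t in range(30):
    h = np.tanh(W_hh @ h + rng.normal(size=n))
    J = np.diag(1 - h**2) @ W_hh @ J   # one more factor in the chain rule product
    norms.append(np.linalg.norm(J))

print(norms[0], norms[-1])  # the norm collapses toward zero over 30 steps
```

With these weights the norm after 30 steps is orders of magnitude below the first-step norm, so a gradient signal from step 0 contributes essentially nothing to the update.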
The second issue is unstable training caused by exploding gradients. In some setups, the repeated multiplication during backpropagation through time can instead grow rapidly. If the effective derivatives are large (for example, due to certain weight values or an overly aggressive learning rate), gradients can blow up, pushing weights toward extremely large values and preventing the model from converging.
To reduce these problems, the transcript lists three broad mitigation strategies for vanishing gradients: using different activation functions so derivatives don’t collapse toward zero; initializing weights in a way that avoids problematic dynamics (for instance, identity-like behavior via identity matrices, or better initialization schemes); and architectural tweaks such as adding skip connections (flagged as a direction for further reading). For exploding gradients, the key fix is gradient clipping—capping gradient magnitudes at a maximum threshold so updates don’t run away.
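Gradient clipping by global norm is a standard recipe and can be sketched in a few lines (the threshold value here is an example I chose, not one from the transcript): if the gradient’s norm exceeds the cap, rescale the whole gradient so its norm equals the cap, leaving its direction unchanged.

```python
import numpy as np

# Gradient clipping by global norm: if ||g|| > max_norm, rescale g so that
# ||g|| == max_norm. The update direction is preserved; only its size is capped.
def clip_by_norm(grad, max_norm):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])               # "exploded" gradient, norm = 50
clipped = clip_by_norm(g, max_norm=5.0)  # rescaled to [3, 4], norm = 5
print(clipped, np.linalg.norm(clipped))
```

Gradients already under the threshold pass through untouched, which is why clipping stabilizes the exploding case without distorting normal training steps.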
Finally, the discussion connects these failure modes to why LSTMs are widely adopted. LSTMs introduce gating mechanisms designed to preserve information over long horizons and stabilize gradient flow, making them far more reliable than plain RNNs for real-world sequence tasks like language and time-series prediction.
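For intuition about the gating the LSTM adds, here is a single LSTM cell step using the standard equations (my own illustrative weights and shapes, not code from the transcript). The forget, input, and output gates control what the cell state keeps, adds, and exposes; the additive update `c = f*c + i*tanh(g)` is what lets information and gradients survive many steps:

```python
import numpy as np

# One step of a standard LSTM cell. The gates (f, i, o) are sigmoids in (0, 1);
# the cell state c is updated ADDITIVELY (f*c + i*tanh(g)), unlike the vanilla
# RNN's fully-overwriting tanh update, which is the key to long-range memory.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    z = W @ x + U @ h + b                 # all four gate pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c = f * c + i * np.tanh(g)            # forget some old memory, write some new
    h = o * np.tanh(c)                    # expose a gated view of the memory
    return h, c

rng = np.random.default_rng(2)
H, X = 4, 3                               # hidden and input sizes (illustrative)
W = rng.normal(scale=0.1, size=(4 * H, X))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=X), h, c, W, U, b)
print(h.shape, c.shape)
```

When the forget gate sits near 1, the cell state passes through nearly unchanged, so the backward signal along `c` avoids the repeated shrinking multiplications that cripple the vanilla RNN.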
Cornell Notes
Vanilla RNNs face two major training obstacles on long sequences: vanishing gradients and exploding gradients. Vanishing gradients make the loss signal from far-back time steps shrink toward zero, so learning becomes dominated by recent inputs and long-term context is effectively lost. Exploding gradients can instead grow rapidly through repeated backpropagation through time, driving weights to very large values and stopping convergence. The transcript outlines mitigations such as changing activation functions, using better weight initialization (including identity-like approaches), and applying gradient clipping. These issues motivate gated architectures like LSTMs, which are built to keep information and gradients stable across many time steps.
- What exactly is the long-term dependency problem in RNNs, and why does it happen during training?
- How does the transcript justify that distant time steps contribute less to the gradient?
- What conditions lead to exploding gradients, and what do they do to training?
- What are the practical fixes for vanishing gradients mentioned in the transcript?
- How does gradient clipping address exploding gradients?
Review Questions
- In your own words, how does vanishing gradients change what an RNN learns from a long sequence?
- Why can the same backpropagation-through-time mechanism lead to either vanishing or exploding gradients?
- Which two interventions in the transcript target gradient stability, and how do they differ (vanishing vs exploding)?
Key Points
1. Vanilla RNNs struggle on long sequences because gradients from far-back time steps shrink (vanishing gradients), making learning rely mostly on recent inputs.
2. Long-term dependency learning fails when the gradient contribution from distant steps becomes nearly zero as the number of time steps grows.
3. Exploding gradients can occur when repeated gradient multiplications grow rapidly, often due to large effective derivatives or an overly high learning rate.
4. Gradient clipping stabilizes training for exploding gradients by capping gradient magnitudes at a preset maximum.
5. Mitigating vanishing gradients can involve changing activation functions, using better weight initialization (including identity-like ideas), and architectural tweaks such as skip connections.
6. LSTMs are adopted because gating mechanisms are designed to preserve information and stabilize gradient flow over many time steps.