
All RNNs Come From This One Idea

Artem Kirsanov · 5 min read

Based on Artem Kirsanov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Frame-by-frame processing of video with an image classifier fails because predictions are independent across frames and no temporal context is retained.

Briefing

Most neural networks are “amnesic”: they process each input as if it were new, with no built-in sense of time. That limitation becomes obvious when an image classifier like AlexNet is repurposed for video—shuffling frames leaves its predictions unchanged because each frame is handled independently. Human perception works differently. Watching a movie means the current moment is shaped by what came right before, with context and an “arrow of time” built into how information is interpreted. Recurrent neural networks (RNNs) were created to bake sequence and memory into the computation itself.

The core technical move starts with the standard feed-forward network. In a feed-forward layer, a vector of neuron states is transformed by applying an activation function (the “fire” step) and then mixing signals through a weight matrix (the “project” step). Stacking layers yields a static, one-way mapping: the next state depends only on the previous layer’s current signals, not on anything earlier. In this setup, time is absent because nothing carries forward an echo of past states.
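
As a concrete illustration, here is a minimal NumPy sketch of that fire-then-project step; the function name, shapes, and weights are assumptions for illustration, not notation from the video.

```python
import numpy as np

def feedforward_step(x, W, b, sigma=np.tanh):
    """One feed-forward layer: 'fire' (apply the nonlinearity), then 'project'
    (mix the fired signals through a weight matrix into the next layer)."""
    fired = sigma(x)        # fire: squash each neuron's current state
    return W @ fired + b    # project: linear mixing; no memory of earlier inputs

rng = np.random.default_rng(0)
x = rng.normal(size=4)                                        # states of a 4-neuron layer
h = feedforward_step(x, rng.normal(size=(3, 4)), np.zeros(3))
# Stacking such steps gives a static, one-way mapping with no notion of time.
```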

To introduce time, the layer update equation is modified so each time step includes an additional term: an “echo” of the neuron’s previous state. A general memory function M determines how that echo propagates. If M is chosen to mirror the feed-forward transformation—again using activation and a learned projection—then the resulting “vanilla” RNN can retain information only briefly. The reason is structural: every time step repeatedly applies a nonlinear squashing (sigma) and a linear mixing (weights). Over many steps, the original signal gets progressively transformed, like a telephone game where the message is paraphrased each round. The operation that feed-forward networks use for compression—collapsing many variations of an input into a stable representation—gets repurposed to preserve information across long sequences, which it is not designed to do.
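
A minimal sketch of the vanilla recurrent update in the same fire-and-project spirit; textbook formulations usually fold it into a single h_t = tanh(W_x x_t + W_h h_{t-1}), which expresses the same structure. Names and shapes here are illustrative.

```python
import numpy as np

def vanilla_rnn_step(h_prev, x, W_in, W_rec, sigma=np.tanh):
    """Vanilla RNN step: the feed-forward drive from the current input plus an
    'echo' of the previous state, itself passed through fire-and-project."""
    drive = W_in @ x               # contribution of the current input
    echo = W_rec @ sigma(h_prev)   # memory term M: the same squash-and-mix as a layer
    return drive + echo            # applied every step, so old signals keep being re-processed
```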

A key insight from deeper network design helps fix this. Residual connections let information bypass transformations across layers, preserving important signals while still allowing selective updates. For RNNs, the analogous requirement is a memory pathway where information can flow forward largely intact, with only controlled modifications. The simplest echo that does this keeps a fraction α of the previous state and adds new input. With α=0, the model forgets completely and reverts to the frame-independent behavior. With α=1, it preserves everything, but the state becomes an unstructured running sum—older information never truly disappears, yet it also becomes hard to retrieve.
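
A sketch of that simplest echo, with α as one hand-set scalar shared by every neuron (names are illustrative):

```python
def leaky_memory_step(h_prev, new_input, alpha):
    """Residual-style memory: keep a fraction alpha of the old state, add the new input.

    alpha = 0.0 -> total forgetting: the state is just the current input (frame-independent).
    alpha = 1.0 -> total retention: the state becomes an unstructured running sum.
    """
    return alpha * h_prev + new_input
```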

The practical solution is to make forgetting selective and context-dependent. Instead of a single scalar α shared by all neurons, each neuron gets its own retention gate f_t computed at every time step. A learned “forget gate” (typically produced by a small neural network with a sigmoid output) decides, per neuron, how much of the old state to keep and how much to erase to make room for new information. This gated retention matches the leaky-integrator behavior of neurons in computational neuroscience, where charge builds up and leaks away.
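
A hedged sketch of per-neuron gated retention, assuming a GRU-style coupling in which whatever is forgotten makes room for the new content; the weight names W_f, U_f, b_f are illustrative, and LSTMs instead use a separate input gate for the write term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_memory_step(h_prev, x, W_f, U_f, b_f, new_input):
    """Selective forgetting: a learned gate f_t in (0, 1), computed per neuron from
    the current input and the stored state, replaces the fixed scalar alpha."""
    f_t = sigmoid(W_f @ x + U_f @ h_prev + b_f)    # per-neuron retention values
    # Keep a fraction of the old state; what is erased makes room for new content
    # (a GRU-style coupling -- LSTMs gate the write term separately).
    return f_t * h_prev + (1.0 - f_t) * new_input
```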

This mechanism underpins the gated RNN families—especially GRUs and LSTMs. GRUs fold forgetting and updating into a single update gate (paired with a reset gate), while LSTMs maintain a separate cell state with dedicated gates to control what is stored versus what is passed along. Underneath the engineering differences is the same principle: adaptive valves that choose what to remember and what to forget. That selective, learned control is what finally enables recurrent networks to learn long-range dependencies, turning “echoes” into usable memory rather than either rapid amnesia or indiscriminate hoarding.
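
For concreteness, a sketch of a standard GRU step; this follows the common textbook formulation (conventions differ on whether z or 1 − z keeps the old state) and is not necessarily the exact notation used in the video. Biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: the update gate z is the adaptive 'keep vs. overwrite' valve,
    the reset gate r controls how much old state informs the proposed new content."""
    z = sigmoid(Wz @ x + Uz @ h_prev)              # per-neuron: how much old state to keep
    r = sigmoid(Wr @ x + Ur @ h_prev)              # how much old state feeds the candidate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # proposed new content
    return z * h_prev + (1.0 - z) * h_cand         # blend old and new, neuron by neuron
```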

Cornell Notes

Neural networks built as feed-forward systems have no built-in notion of time, so they treat each input independently. When time is added through a vanilla recurrent “echo” term that mirrors feed-forward transformations, information degrades quickly because nonlinear squashing and linear mixing repeat at every step. A better memory mechanism preserves information through a mostly unprocessed pathway, then selectively updates it. Replacing a fixed decay rate with learned, per-neuron forget gates yields leaky, context-dependent retention—older information fades while relevant recent details persist. This gating is the central idea behind GRUs and LSTMs and is what allows RNNs to learn long-range dependencies.

Why does an image classifier like AlexNet fail on video when frames are processed independently?

AlexNet outputs a probability distribution for a single image, with no mechanism to carry context from one frame to the next. If a movie is analyzed frame-by-frame, each prediction depends only on that frame’s pixels. Shuffling frames doesn’t change the model’s behavior because there’s no memory of what came earlier—each frame is treated like a fresh, unrelated input.

What goes wrong with vanilla RNNs when trying to learn long-range dependencies?

Vanilla RNNs add an echo of the previous state, but the echo is processed through the same kind of “fire and project” operations used in feed-forward layers: an activation function sigma followed by multiplication with recurrent weights. As information travels through many time steps, it gets repeatedly squished and re-mixed. After many steps, the original signal has been transformed so many times that it becomes hard to recover, limiting the memory horizon.
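
A small numerical illustration of this effect (constructed here, not taken from the video): applying the same squash-and-mix echo many times to a random initial state scrambles it, as measured by its cosine similarity with the starting signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
W_rec = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))  # illustrative recurrent weights

h0 = rng.normal(size=n)       # the "message" present at time 0
h = h0.copy()
for _ in range(50):
    h = W_rec @ np.tanh(h)    # the same fire-and-project echo, step after step

# Cosine similarity between the evolved state and the original signal; with random
# weights it tends toward ~0, i.e. the message has been paraphrased beyond recognition.
cos = h @ h0 / (np.linalg.norm(h) * np.linalg.norm(h0))
print(f"similarity to original after 50 steps: {cos:.3f}")
```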

How does the residual-connection idea translate into recurrent memory?

Residual connections in deep networks allow signals to bypass transformations, preserving important information. For RNNs, the analogous need is a pathway where information can flow forward largely intact across time. A simple way is to keep a fraction α of the previous state and add new input, creating a forward path that doesn’t repeatedly reprocess everything at each step.

Why is a fixed scalar retention rate α not enough for real sequences like movies?

A single α applies to all neurons and all time points. Some information must persist for long stretches (e.g., a character’s name introduced early), while other details are only useful briefly (e.g., the exact framing of a particular shot). If α is high, stale details accumulate; if α is low, critical early information fades too quickly. The retention behavior needs to vary with context and differ across neurons.

What is the role of the forget gate in gated RNNs like GRUs and LSTMs?

Gated RNNs replace the scalar α with a vector of retention values f_t, one per neuron, computed at each time step. A learned forget gate (often a sigmoid over the current input and stored state) outputs values between 0 and 1 that determine how much of the previous state to keep. When the gate is near 1, the old state passes through almost unchanged; when near 0, it is erased to make room for new information. GRUs fold this retention into their update gate, while LSTMs keep a separate cell state with dedicated gates, but both rely on learned, context-dependent selective forgetting.

Review Questions

  1. How does the feed-forward layer update equation differ from the recurrent update equation that includes an echo term?
  2. Explain, using the “repeated squashing and projection” mechanism, why vanilla RNNs struggle with long-range dependencies.
  3. What changes when α is replaced by a learned, per-neuron forget gate vector f_t, and why does that matter for sequences of varying relevance over time?

Key Points

  1. Frame-by-frame processing of video with an image classifier fails because predictions are independent across frames and no temporal context is retained.

  2. Feed-forward networks compute a static transformation: the next state depends only on the previous layer’s current signals, not on earlier history.

  3. Vanilla RNNs can remember only briefly because repeated application of activation (sigma) and recurrent weight mixing progressively distorts older information.

  4. A residual-connection-like memory pathway is needed: preserving a fraction of the previous state while adding new input prevents constant reprocessing.

  5. A fixed scalar decay rate α is too rigid for real sequences; retention must vary by neuron and by time step.

  6. Learned forget gates implement context-dependent selective forgetting, enabling RNNs to keep relevant details and discard irrelevant ones.

  7. GRUs and LSTMs differ in their internal wiring, but both rely on the same core idea: adaptive gating that controls what to remember versus what to forget.

Highlights

AlexNet-style frame independence means shuffling video frames doesn’t change predictions—an amnesia problem for sequence tasks.
Vanilla RNN memory degrades because the same nonlinear-and-linear transformation is applied at every time step, repeatedly paraphrasing the signal.
The simplest “memory” echo keeps a fraction α of the previous state; α=0 causes total forgetting, while α=1 turns memory into an unstructured running sum.
Gated RNNs solve the retention problem by replacing α with per-neuron, time-varying forget gates f_t computed from both current input and stored state.
GRUs and LSTMs owe their long-range capability to learned, context-dependent selective forgetting rather than brute-force preservation.

Topics

  • Amnesia in Neural Nets
  • Feed-Forward vs Recurrent
  • Vanilla RNN Limitations
  • Residual-Style Memory
  • Gated RNNs
  • GRUs and LSTMs

Mentioned

  • AlexNet
  • RNNs
  • GRUs
  • LSTMs