Lecture 3: Recurrent Neural Networks (Full Stack Deep Learning - Spring 2021)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

RNNs handle variable-length sequence tasks (many-to-one, one-to-many, many-to-many) by reusing the same weights across time and carrying information forward in a hidden state; LSTMs add gated cell states to counter vanishing gradients, attention and bidirectional encoders relieve the encoder-decoder bottleneck in translation, and CTC loss handles sequences whose inputs and outputs don't align one-to-one.

Briefing

Recurrent neural networks (RNNs) were built to handle sequence data efficiently by reusing the same weights across time and carrying information forward in a hidden state—an approach meant to exploit patterns that repeat over a sequence rather than treating every position as unrelated. The lecture frames sequence modeling as a set of problem types—time-series forecasting, translation, speech, image captioning, and question answering—then argues that feedforward networks struggle with variable-length sequences and become data-inefficient when they must learn position-specific mappings for repeated patterns.

In the core RNN setup, each time step takes the current input x_t and the previous hidden state h_{t-1}, computes a new hidden state h_t, and produces an output y_t. The hidden-state update is described as two matrix multiplications—one for h_{t-1} and one for x_t—summed and passed through an activation function (the basic form uses tanh). This weight sharing across time replaces the “single massive matrix” approach and makes the architecture naturally suited to sequences of varying length. For tasks where only one output is needed at the end (many-to-one), the last hidden state is fed into a classifier; for tasks where an output sequence must be generated from a single input (one-to-many), an encoder-decoder pattern is used: an encoder compresses the input into an initial state, and a decoder generates outputs step-by-step.
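Written out, the update described above takes the following form; the weight names (W_{hh}, W_{xh}, W_{hy}) and bias terms are a common convention assumed here, not necessarily the lecture's exact symbols:

    h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)   % hidden-state update: two matrix multiplies, summed
    y_t = W_{hy} h_t + b_y                            % per-step output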

The lecture then zooms in on why vanilla RNNs break down on long sequences: backpropagation through time multiplies gradients across many steps, and with common activations like sigmoid and tanh, derivatives shrink toward zero, leading to vanishing gradients. The result is that long-term dependencies—like remembering a character name introduced early in a long sentence—get lost. LSTMs (Long Short-Term Memory networks) address this by adding a separate cell state plus gating mechanisms. A forget gate decides what to remove from the cell state, an input gate controls what new information to add, and the output computation uses a gated version of the cell state to produce the next hidden state. The lecture notes that GRUs and other variants exist, and cites empirical findings suggesting regular LSTMs are hard to beat across datasets, while GRUs can sometimes train more easily or perform slightly better.
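The gating is commonly written as below, with \sigma the sigmoid and \odot elementwise multiplication; applying each gate's weights to the concatenated [h_{t-1}, x_t] is a standard convention assumed here rather than the lecture's exact notation:

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            % forget gate
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            % input gate
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)     % candidate cell update
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   % new cell state
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            % output gate
    h_t = o_t \odot \tanh(c_t)                        % new hidden state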

Machine translation serves as a case study for how RNN encoder-decoder systems were made practical. The baseline idea—encode the source sentence with an RNN and initialize the decoder with the final encoder state—creates an information bottleneck for long sentences. Improvements include stacking LSTM layers (with residual connections to make deeper stacks trainable), adding attention so the decoder can access a weighted summary of all encoder hidden states instead of only the last one, and using bidirectionality so the encoder reads the sentence in both forward and backward directions. Attention is described via relevance weights over encoder states, producing attention maps that show which source words influence each target word.
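One standard way to write those relevance weights scores each encoder state h_j against the decoder state s_t and normalizes with a softmax (dot-product scoring is an assumption here; the lecture describes the weights only generically):

    e_{t,j} = \mathrm{score}(s_t, h_j)                    % e.g. s_t^\top h_j
    \alpha_{t,j} = \exp(e_{t,j}) / \sum_k \exp(e_{t,k})   % relevance weights
    c_t = \sum_j \alpha_{t,j} h_j                         % weighted summary of all encoder states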

Finally, the lecture introduces CTC loss (Connectionist Temporal Classification) for sequence tasks with misalignment, using handwriting recognition as the motivating example. CTC allows the model to emit characters or a special blank (epsilon) token at each time step, then merges repeated characters and removes blanks to produce the final transcription—handling cases where input and output lengths differ. The lecture closes by weighing RNN trade-offs: flexible architectures and strong historical performance versus slower, less parallelizable training, which helped open the door for transformers. It previews non-recurrent alternatives such as WaveNet-style convolutional sequence models, using causal and dilated convolutions to capture long-range context while enabling parallel training, at the cost of more expensive inference.

Cornell Notes

Sequence problems come in many-to-one, one-to-many, and many-to-many forms, and RNNs handle them by carrying a hidden state forward while reusing the same weights at every time step. Vanilla RNNs struggle with long-term dependencies because backpropagation through time multiplies gradients across many steps, causing vanishing gradients (or sometimes exploding gradients). LSTMs fix this with a cell state and gates—forget, input, and output—that regulate what information persists and what gets updated. For machine translation, encoder-decoder RNNs were improved with stacked LSTMs, residual connections, attention (so the decoder can use all encoder states via relevance weights), and bidirectional encoding. When alignment between input and output is unclear, CTC loss supports transcription by emitting characters or blanks and then merging repeats while removing blanks.

Why do feedforward networks become awkward for sequence tasks with variable length and repeating patterns?

A feedforward approach typically concatenates inputs and uses a large mapping matrix to produce outputs, which doesn’t naturally handle arbitrary sequence lengths. Padding to a maximum length forces computation to scale with the chosen max length, increasing the size of the matrix multiplications. More importantly, the mapping learns position-specific weights: if a word like “cat” is predictive, the model must learn that pattern independently at every position, which is data-inefficient when the same patterns recur over time.
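A rough way to see the cost is to count weights for a hypothetical setup with 128-dimensional inputs, a maximum length of 100 steps, and a 256-dimensional hidden layer; all the numbers below are illustrative, not from the lecture:

    # Illustrative parameter counts; dimensions are assumed, not from the lecture.
    d_in, max_len, d_hidden = 128, 100, 256

    # Feedforward baseline: pad/concatenate every time step, map with one big matrix.
    flat_params = (d_in * max_len) * d_hidden             # position-specific weights
    # RNN: one input-to-hidden and one hidden-to-hidden matrix, reused at every step.
    rnn_params = d_in * d_hidden + d_hidden * d_hidden

    print(f"flattened feedforward: {flat_params:,} weights")   # 3,276,800
    print(f"shared-weight RNN:     {rnn_params:,} weights")    # 98,304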

How does a basic RNN compute outputs across time, and what makes it more efficient than a single flattened matrix?

At each time step t, the RNN takes the current input x_t and the previous hidden state h_{t-1}, computes a new hidden state h_t, and produces an output y_t. The hidden update is described as two matrix multiplications—one using h_{t-1} and one using x_t—summed and passed through an activation (tanh in the basic form). Because the same weights are reused at every time step, the model captures temporal structure without learning separate parameters for each position.
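A minimal NumPy sketch of that loop, assuming a tanh activation and illustrative weight shapes (none of the names below come from the lecture):

    import numpy as np

    def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
        """Run a vanilla RNN over a list of input vectors, reusing the same weights each step."""
        h = np.zeros(W_hh.shape[0])                      # initial hidden state h_0
        ys = []
        for x_t in xs:                                   # one step per element: any length works
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)     # new hidden state from x_t and h_{t-1}
            ys.append(W_hy @ h + b_y)                    # per-step output y_t
        return ys, h                                     # final h feeds a classifier in many-to-one tasks

    # Illustrative dimensions: input 8, hidden 16, output 4.
    rng = np.random.default_rng(0)
    d_in, d_h, d_out = 8, 16, 4
    params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
              rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
    outputs, h_last = rnn_forward([rng.normal(size=d_in) for _ in range(5)], *params)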

What exactly breaks down in vanilla RNNs on long sequences, and how do LSTMs address it?

Backpropagation through time unrolls the network across time and multiplies gradient terms across many steps. With activations like sigmoid and tanh, derivatives can be near zero when inputs saturate, so repeated multiplication drives gradients toward zero (vanishing gradients), making early information hard to influence later outcomes. LSTMs introduce a cell state plus gates: the forget gate uses a sigmoid to scale the old cell state, the input gate controls how new information enters, and the output computation gates the cell state to form the next hidden state—helping preserve useful information over long spans.
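A single LSTM step sketched in NumPy under the common formulation (feeding the concatenated [h_{t-1}, x_t] to every gate is an assumption of this sketch, not the lecture's exact notation):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
        """One LSTM time step: gates decide what to forget, what to add, and what to expose."""
        z = np.concatenate([h_prev, x_t])        # combined [h_{t-1}, x_t] fed to every gate
        f = sigmoid(W_f @ z + b_f)               # forget gate scales the old cell state
        i = sigmoid(W_i @ z + b_i)               # input gate controls how much new info enters
        c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell update
        c = f * c_prev + i * c_tilde             # new cell state: an additive path across time
        o = sigmoid(W_o @ z + b_o)               # output gate
        h = o * np.tanh(c)                       # new hidden state is a gated view of the cell state
        return h, c

The additive update of the cell state is what lets useful information (and gradients) persist across many steps instead of being squashed through a saturating nonlinearity at every step.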

Why did early RNN encoder-decoder translation systems need attention?

The baseline passes only the encoder’s final hidden state to the decoder, forcing all source-sentence meaning into a fixed-size bottleneck. For long sentences, that compression loses details needed for accurate word-by-word translation. Attention replaces the bottleneck by letting the decoder compute relevance weights over all encoder hidden states and form a weighted sum, so each target word can draw on the specific source words that matter.
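A sketch of dot-product attention for one decoder step (the scoring function is an assumption; the lecture only describes relevance weights in general terms):

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """Weighted summary of all encoder hidden states for a single decoder step."""
        scores = encoder_states @ decoder_state          # relevance score per source position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                         # softmax -> attention weights
        context = weights @ encoder_states               # weighted sum of encoder states
        return context, weights                          # weights form one row of an attention map

    # Illustrative: 6 source positions, hidden size 16.
    rng = np.random.default_rng(0)
    context, weights = attention_context(rng.normal(size=16), rng.normal(size=(6, 16)))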

How does CTC loss handle misalignment between input time steps and output characters?

CTC allows the model to output either a character or a blank (epsilon) at each time step. Decoding merges consecutive repeated characters and removes blanks. If the true transcription needs two identical letters in a row, the model can emit the character, then a blank, then the character; the merging rules then produce the correct double-letter output.
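A greedy-decoding sketch of that collapse rule, assuming the model has already picked one most-likely token per time step (the CTC loss itself, which sums over all valid alignments, is more involved):

    BLANK = "ε"

    def ctc_collapse(tokens):
        """Merge consecutive repeats, then drop blanks, per the CTC decoding rule."""
        collapsed, prev = [], None
        for tok in tokens:
            if tok != prev:                  # merge runs of the same token
                collapsed.append(tok)
            prev = tok
        return "".join(t for t in collapsed if t != BLANK)    # remove blanks last

    # The blank between the two l's keeps them from being merged into one.
    print(ctc_collapse(["h", "h", "e", "l", "ε", "l", "l", "o"]))   # -> hello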

What trade-off does WaveNet-style convolutional sequence modeling make compared with RNNs?

Convolutional sequence models using causal (and often dilated) convolutions can be trained in parallel, making training faster and less dependent on sequential computation. The trade-off is inference: generating one output time step can require information from many past steps, so online generation is more expensive than in simpler feedforward settings.
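A minimal NumPy sketch of a causal, dilated 1-D convolution over a scalar signal (kernel size and dilation are illustrative; a real WaveNet stacks many such layers with gating and residual connections):

    import numpy as np

    def causal_dilated_conv(x, kernel, dilation):
        """Output t sees only x[t], x[t - dilation], x[t - 2*dilation], ... never the future."""
        k = len(kernel)
        pad = (k - 1) * dilation                         # left-pad so every output stays causal
        x_pad = np.concatenate([np.zeros(pad), x])
        y = np.zeros_like(x)
        for t in range(len(x)):
            taps = x_pad[t : t + pad + 1 : dilation]     # k past samples spaced `dilation` apart
            y[t] = taps @ kernel
        return y

    # Stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially,
    # and every y[t] can be computed in parallel at training time.
    x = np.arange(16, dtype=float)
    y = causal_dilated_conv(x, kernel=np.array([0.25, 0.5, 0.25]), dilation=2)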

Review Questions

  1. How does backpropagation through time lead to vanishing gradients in vanilla RNNs, and why do tanh/sigmoid derivatives matter?
  2. In an encoder-decoder translation system, what bottleneck does attention remove, and how do relevance weights determine what the decoder uses?
  3. During CTC decoding, what role do blank (epsilon) tokens play in producing repeated characters correctly?

Key Points

  1. Sequence modeling tasks can be categorized by whether inputs and outputs are single values or sequences, including many-to-one, one-to-many, and many-to-many settings.
  2. Vanilla RNNs reuse weights across time by updating a hidden state from (x_t, h_{t-1}), but they struggle with long-term dependencies due to vanishing gradients.
  3. LSTMs add a cell state and gating (forget, input, output) to regulate information flow and preserve long-range context.
  4. Early neural machine translation with RNN encoder-decoders improved performance using stacked LSTMs with residual connections, attention over all encoder states, and bidirectional encoding.
  5. Attention replaces the fixed-size encoder bottleneck by computing relevance weights and forming weighted sums of encoder hidden states, often visualized as attention maps.
  6. CTC loss supports sequence tasks with unknown alignment by emitting characters or blanks at each time step and then merging repeats while removing blanks.
  7. RNNs are flexible but slower to train because they require sequential computation; convolutional alternatives like causal/dilated models improve training parallelism but can make inference more costly.

Highlights

Vanilla RNNs lose long-term information because gradients shrink as they’re multiplied across many time steps, especially with tanh/sigmoid saturation.
LSTMs’ forget/input/output gates turn the hidden-state update into controlled memory management via a separate cell state.
Attention fixes translation’s encoder bottleneck by letting the decoder form a weighted sum over all encoder hidden states rather than relying only on the final one.
CTC loss handles misalignment by allowing blanks and then merging repeated characters while removing blanks during decoding.
WaveNet-style causal and dilated convolutions capture long-range context while enabling parallel training, trading off against more expensive step-by-step inference.

Topics

Mentioned

  • RNN
  • LSTM
  • GRU
  • CTC
  • BPTT
  • CNN