Recurrent Neural Networks (RNN) - Deep Learning w/ Python, TensorFlow & Keras p.7
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
RNNs are designed for tasks where the order of inputs carries meaning, such as time series and natural language.
Briefing
Recurrent neural networks (RNNs) are built for problems where order matters, especially time series and natural language, because the meaning of a sequence depends on what came before. Instead of treating inputs as independent features, an RNN feeds each step of the sequence into a recurrent cell and carries information forward to the next step. That “memory” is what lets the model distinguish between sentences that share the same words but differ in order, such as “some people made a neural network” versus “a neural network made some people.”
The core mechanism is the recurrent cell, most commonly an LSTM (long short-term memory) unit. At each time step, an LSTM cell receives (1) the current input and (2) state carried over from the previous time step, then decides what to forget, what new information to add, and what to output. Each step’s output can feed forward to the next layer and/or onward to the next recurrent step. The transcript also notes that architectures can be extended with bidirectional recurrent layers, which process the sequence both forward and backward.
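A minimal sketch (not code from the video) of how these ideas surface in the Keras API: return_sequences=True makes a recurrent layer emit one output per time step rather than only its final state, and the Bidirectional wrapper runs the sequence in both directions. Unit counts and shapes here are illustrative.

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Bidirectional

inputs = Input(shape=(28, 28))                      # 28 time steps, 28 features each
per_step = LSTM(64, return_sequences=True)(inputs)  # one 64-dim output per time step
bidir_out = Bidirectional(LSTM(64))(per_step)       # reads the sequence forward and backward
model = tf.keras.Model(inputs, bidir_out)           # final output shape: (batch, 128)
```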
After laying out the intuition, the tutorial shifts to a practical goal: building a simple RNN from scratch using TensorFlow/Keras. The main challenge isn’t the model code—it’s shaping data into sequences with targets. To keep things easy, the example uses the MNIST dataset and treats each 28×28 image as a sequence of 28 rows, where each row is a time step. That means the model sees 28 steps per sample, each step containing 28 pixel values.
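Conveniently, no reshaping is needed: mnist.load_data() already returns arrays shaped (samples, 28, 28), which Keras recurrent layers read as (batch, time steps, features). A quick check along the lines of the tutorial’s setup:

```python
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)     # (60000, 28, 28): 60,000 samples, 28 time steps, 28 features each
print(x_train[0].shape)  # (28, 28): one image = 28 rows, each row one time step
```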
The model is assembled as a Keras Sequential network with stacked LSTM layers (128 units in the first LSTM, then another LSTM with 128 units). Because the second recurrent layer needs the full sequence, the first LSTM uses return_sequences=True. Dropout layers are added to reduce overfitting, followed by dense layers: a 32-unit dense layer and a final dense layer with 10 outputs using softmax for digit classification.
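A sketch of that architecture. The layer sizes match the description above; the dropout rates and the default tanh activation on the LSTM layers are my assumptions, not confirmed by the summary.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential([
    # The second LSTM consumes a sequence, so the first must emit one output per step
    LSTM(128, input_shape=(28, 28), return_sequences=True),
    Dropout(0.2),
    LSTM(128),                        # returns only its final output for the dense head
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='softmax'),  # one probability per digit class
])
```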
Training initially runs poorly and slowly, with accuracy barely improving and loss not trending downward. The transcript identifies the key issue: the input images weren’t normalized to the 0–1 range. After scaling X_train and X_test by dividing by 255, learning accelerates dramatically and accuracy climbs quickly. The tutorial also highlights performance differences between LSTM implementations: switching to the GPU-optimized CuDNNLSTM layer, which fixes the cell activation to tanh, yields much faster epochs and reaches strong validation performance within the same number of epochs.
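A hedged sketch of the fix and the training call, continuing from the model above; the optimizer settings and epoch count are placeholders, not necessarily the video’s exact values.

```python
import tensorflow as tf

# The fix: scale pixel intensities from 0-255 into the 0-1 range
x_train = x_train / 255.0
x_test = x_test / 255.0

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
```

A note on CuDNNLSTM: the dedicated CuDNNLSTM layer belongs to the TF 1.x / standalone-Keras era. In TF 2.x, tf.keras.layers.LSTM dispatches to the fused cuDNN kernel automatically when run on a GPU with the default settings (activation='tanh', recurrent_activation='sigmoid', and so on), so the sketch above already gets the speedup the video obtains by swapping layers.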
By the end, validation accuracy reaches about 98% while training accuracy is slightly lower, which is attributed to how Keras reports metrics (training accuracy averaged across an epoch versus validation measured at the end). The takeaway is that RNNs can be straightforward to run when data is already sequential, but preprocessing—especially normalization—can make the difference between a model that barely learns and one that converges quickly. The next step promised is a more realistic time-series example that will require heavier preprocessing.
Cornell Notes
Order-sensitive tasks are where recurrent neural networks shine: each time step’s output depends on earlier steps. LSTM cells implement this “memory” using gates that decide what to forget, what to add, and what to output, and they can pass information both forward to later layers and onward to the next time step. The tutorial demonstrates a simple MNIST setup by treating each 28×28 image as a sequence of 28 rows (each row is one time step). A stacked LSTM model with dropout and dense layers is trained with sparse categorical cross-entropy and softmax output. The biggest practical lesson is normalization: dividing pixel values by 255 turns a slow, non-learning run into fast convergence; using CuDNNLSTM further speeds training on GPU.
- Why does word order or time order matter to an RNN, and how is that different from a standard feedforward network?
- What does an LSTM cell do at each time step?
- How did the tutorial convert MNIST images into a sequence suitable for an RNN?
- Why did training accuracy and loss look wrong at first, and what fixed it?
- What is the practical difference between LSTM and CuDNNLSTM in this setup?
- Why can validation accuracy be higher than training accuracy in Keras logs?
Review Questions
- When building an RNN for sequence data, what does return_sequences=True control, and why would a second recurrent layer require it?
- How does dividing inputs by 255 change the training dynamics, and what symptoms in the logs would suggest normalization is missing?
- In the MNIST-as-sequence approach, what exactly is the “time step,” and what is the feature vector at each step?
Key Points
1. RNNs are designed for tasks where the order of inputs carries meaning, such as time series and natural language.
2. LSTM cells maintain sequence information using gates that decide what to forget, what to add, and what to output at each time step.
3. A simple RNN example can be built by reshaping MNIST images into sequences of 28 rows, treating each row as one time step.
4. Normalization is critical: scaling pixel values to the 0–1 range (divide by 255) can turn a non-learning run into fast convergence.
5. Stacked LSTM layers require return_sequences=True in earlier LSTM layers so later recurrent layers receive the full sequence.
6. Using CuDNNLSTM can dramatically speed up training on GPU compared with a standard LSTM configuration.
7. Validation accuracy can exceed training accuracy when training metrics are averaged over an epoch but validation is computed at the epoch’s end.