
Lab 3: RNNs (Full Stack Deep Learning - Spring 2021)

The Full Stack · 6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

With window width/stride set to 28, sliding-window CNN predictions align with non-overlapping synthetic characters, enabling cross-entropy training to reach ~89–90% validation accuracy.

Briefing

Sequence models for handwritten text recognition take a practical turn in Lab 3: a sliding-window CNN baseline quickly works when characters don’t overlap, but it breaks down once realistic overlap and variable spacing enter the synthetic data. The fix is to switch from plain cross-entropy training to CTC loss, which can learn alignments between the CNN’s per-window character predictions and the final variable-length label sequence. Adding a bidirectional LSTM on top of the CNN further improves accuracy by injecting context across the predicted character sequence.

The lab starts from earlier work where MNIST-derived characters and line images are generated, then focuses on recognizing entire lines. It introduces special tokens in the label mapping—most notably a padding token used to extend targets to a fixed maximum length. The first model, “line cnn simple,” applies the same character-level CNN from the previous lab across a line image by sliding a window horizontally. The model computes how many windows fit using window width and window stride, runs each window through the CNN, and produces logits shaped as batch × classes × sequence length. With window width and stride set to 28 (effectively sampling one character at a time), training uses the default cross-entropy loss and reaches validation accuracy around the high 80s to ~90%.
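
As a rough illustration, here is a minimal PyTorch sketch of that sliding-window idea (class and argument names are assumptions, not the lab's exact code): the line image is cut into windows of width window_width spaced window_stride apart, each window goes through the character CNN, and the per-window logits are stacked into the batch × classes × sequence-length shape that cross-entropy expects.

```python
import torch
import torch.nn as nn

class LineCNNSimple(nn.Module):
    """Sketch: slide a character CNN across a line image."""
    def __init__(self, char_cnn: nn.Module, window_width: int = 28, window_stride: int = 28):
        super().__init__()
        self.char_cnn = char_cnn          # per-window CNN that outputs (B, num_classes)
        self.window_width = window_width
        self.window_stride = window_stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, H, W) line image
        _, _, _, W = x.shape
        num_windows = (W - self.window_width) // self.window_stride + 1
        logits = []
        for i in range(num_windows):
            start = i * self.window_stride
            window = x[:, :, :, start:start + self.window_width]   # (B, 1, H, window_width)
            logits.append(self.char_cnn(window))                   # (B, num_classes)
        # Stack to (B, num_classes, S), the shape nn.CrossEntropyLoss accepts for sequences
        return torch.stack(logits, dim=-1)
```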

Trouble appears when stride is reduced to create overlapping windows. The ground-truth label length stays fixed to the dataset’s maximum character count, but the model’s output sequence length grows because more windows are produced. After adjusting output length to match the dataset, accuracy still collapses (roughly to the low 50s) because the CNN predictions no longer align cleanly with the underlying non-overlapping character generation process. Making the synthetic data overlap more like the model’s sampling (increasing overlap to about a quarter) recovers performance into the 70s and can reach the 80s, but a fully realistic setting with variable overlap across characters and writers remains too hard for the simple cross-entropy approach. In that more chaotic regime, accuracy tops out around 60%.

The lab then introduces “line cnn,” a more efficient fully convolutional variant that avoids recomputing convolutions on heavily overlapping windows. It replaces explicit window-by-window CNN calls with a convolutional downsampling stack that yields a sequence of character logits in one pass.

The key upgrade is CTC loss via a dedicated “ctc lit model.” Instead of forcing a one-to-one mapping between windows and target characters, CTC learns the alignment and handles repeated characters and blanks through its collapsing behavior. Training with CTC drives validation character error rate down substantially (from very high values to around 70), while accuracy improves to roughly 62% in the reported experiment. Finally, a bidirectional LSTM is stacked on top of the CNN logits (the two directions are summed before a final fully connected layer), pushing character error rate down further, to roughly 16–18 in the reported run. The lab closes by assigning experimentation: tune window width/stride and LSTM dimensions/layers, and inspect how the greedy decode and character error rate metrics work under CTC.

Cornell Notes

Lab 3 builds a line text recognizer by applying a character CNN across a line image using sliding windows, then progressively fixes the mismatch between window-level predictions and the final character sequence. With non-overlapping sampling (window width/stride = 28), cross-entropy training works well, reaching ~89–90% validation accuracy. Reducing stride to create overlapping windows breaks alignment and drops accuracy to roughly 52%, and variable overlap makes it harder still (best around 60%). Switching the loss to CTC resolves the alignment problem by learning when to emit characters, handling blanks and repeated predictions via CTC’s collapsing behavior. Adding a bidirectional LSTM on top of the CNN sequence further reduces character error rate to about 16–18.

Why does “line cnn simple” perform well when window stride equals window width, and what changes when stride is smaller?

When window width and stride are both 28, each window effectively captures one character region at a time, matching how the synthetic line data is generated with no overlap. The model outputs logits per window (batch × classes × sequence length) and cross-entropy can align each window prediction to a single target character position. When stride is reduced (e.g., stride 20), windows overlap and the output sequence length increases (e.g., 44 windows vs. a ground-truth target length of 32). Even after limiting output length, the CNN’s window-level predictions no longer correspond cleanly to the underlying character positions, so cross-entropy training struggles and accuracy drops sharply.
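
The arithmetic behind that mismatch is easy to check. Assuming, for illustration, a line image 896 pixels wide (32 characters at 28 pixels each; these numbers are illustrative, not taken from the lab’s config):

```python
def num_windows(image_width: int, window_width: int, stride: int) -> int:
    # Count of full windows that fit when sliding with the given stride
    return (image_width - window_width) // stride + 1

print(num_windows(896, 28, 28))  # 32 windows: one per character, matches a 32-character target
print(num_windows(896, 28, 20))  # 44 windows: more timesteps than targets, alignment breaks
```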

How does the lab’s “line cnn” improve efficiency compared with “line cnn simple”?

“line cnn simple” runs the CNN separately on each sliding window, which repeats convolution computation on overlapping regions. “line cnn” turns the approach into a fully convolutional network: it uses convolutional downsampling (instead of max pooling) and a convolution layer that behaves like a large fully connected layer over the spatial extent, producing the same kind of sequence logits but in one forward pass. The result is similar outputs with faster execution, especially when windows overlap heavily.
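
A hedged sketch of that fully convolutional version (layer widths and kernel sizes are assumptions chosen so the shapes work out for 32-pixel-tall lines; the lab’s actual architecture may differ): strided convolutions replace pooling, and a final convolution spanning the remaining height acts like a per-column fully connected layer.

```python
import torch
import torch.nn as nn

class LineCNN(nn.Module):
    """Sketch: produce per-position character logits in one forward pass."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # downsample by striding, not pooling
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Kernel spans the full remaining height (4), so it behaves like an FC layer per column
        self.head = nn.Conv2d(128, num_classes, kernel_size=(4, 4), stride=(4, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1, 32, W); three stride-2 convs leave a (B, 128, 4, ~W/8) feature map
        features = self.backbone(x)
        logits = self.head(features)   # (B, num_classes, 1, S)
        return logits.squeeze(2)       # (B, num_classes, S)
```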

What problem does CTC loss solve in this setup, and how is it implemented here?

CTC loss removes the need for a strict one-to-one alignment between the model’s per-window sequence and the target character sequence. It learns alignments internally while accounting for blanks and repeated characters through its collapsing mechanism. In the lab, the CTC model permutes logits into the order CTC expects (sequence length × batch × classes), computes input lengths from the model’s sequence length, computes target lengths excluding padding, and uses CTC loss instead of cross-entropy. Greedy decoding is used in validation/test to collapse outputs (remove blanks and repeated runs) before computing character error rate.
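
A minimal sketch of that training step, assuming the blank token is index 0 and targets are padded to shape (batch, max_length) (tensor names and the padding convention are assumptions):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_training_step(logits: torch.Tensor, targets: torch.Tensor, padding_index: int) -> torch.Tensor:
    # logits: (B, num_classes, S) from the line CNN; targets: (B, max_target_length)
    log_probs = torch.log_softmax(logits, dim=1)    # normalize over the class dimension
    log_probs = log_probs.permute(2, 0, 1)          # (S, B, num_classes), the layout nn.CTCLoss expects
    B, S = logits.shape[0], logits.shape[2]
    input_lengths = torch.full((B,), S, dtype=torch.long)      # every example uses all S timesteps
    target_lengths = (targets != padding_index).sum(dim=1)     # true label lengths, excluding padding
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```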

Why does variable overlap in the synthetic handwriting data make cross-entropy training fail even when overlap is introduced?

Cross-entropy assumes a consistent mapping between output positions (windows) and target characters. Variable overlap means character widths and spacing vary across the line and across samples, so no fixed window width/stride reliably matches every character boundary. As a result, the model’s window sequence cannot stay aligned with the target sequence, and cross-entropy training plateaus around 60% accuracy in the lab’s reported attempt.

How does adding a bidirectional LSTM change the model’s behavior on top of CTC-trained CNN logits?

The CNN produces per-position character logits, but it lacks context about neighboring characters. A bidirectional LSTM processes the sequence in both directions, allowing each position’s prediction to use information from surrounding windows. The lab sums the two LSTM directions, applies a fully connected layer, and returns logits in the same format expected by the CTC pipeline. This contextual modeling reduces character error rate further (to roughly ~16–18 in the reported run).
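
A sketch of that stacking, assuming the CNN’s (batch × classes × sequence) logits are fed straight into the LSTM (hidden size and layer count are placeholders, not the lab’s settings):

```python
import torch
import torch.nn as nn

class LineCNNLSTM(nn.Module):
    """Sketch: bidirectional LSTM over CNN sequence logits, directions summed."""
    def __init__(self, line_cnn: nn.Module, num_classes: int, lstm_dim: int = 256, lstm_layers: int = 1):
        super().__init__()
        self.line_cnn = line_cnn
        self.lstm = nn.LSTM(input_size=num_classes, hidden_size=lstm_dim,
                            num_layers=lstm_layers, bidirectional=True)
        self.fc = nn.Linear(lstm_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.line_cnn(x)                  # (B, num_classes, S)
        seq = logits.permute(2, 0, 1)              # (S, B, num_classes) for the LSTM
        out, _ = self.lstm(seq)                    # (S, B, 2 * lstm_dim)
        S, B, _ = out.shape
        out = out.view(S, B, 2, -1).sum(dim=2)     # sum the forward and backward directions
        out = self.fc(out)                         # (S, B, num_classes)
        return out.permute(1, 2, 0)                # back to (B, num_classes, S) for the CTC pipeline
```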

How do character error rate (CER) and greedy decoding relate to each other in CTC systems?

CER measures how many character edits (insertions, deletions, substitutions) are needed to transform the predicted collapsed sequence into the ground-truth sequence, normalized by the target length. Greedy decode performs the CTC-specific collapsing: it removes the blank token and collapses consecutive repeated character predictions into a single character. This decoding step is necessary because CTC outputs per-timestep distributions, not directly the final character string.
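
A compact sketch of both pieces (blank index 0 is an assumption; predictions and targets are lists of class indices):

```python
import torch

def greedy_decode(logits: torch.Tensor, blank: int = 0) -> list[int]:
    # logits: (num_classes, S) for one example
    best = logits.argmax(dim=0).tolist()            # most likely class at each timestep
    decoded, prev = [], None
    for c in best:
        if c != blank and c != prev:                # drop blanks, collapse repeated runs
            decoded.append(c)
        prev = c
    return decoded

def character_error_rate(pred: list[int], target: list[int]) -> float:
    # Edit distance (insertions, deletions, substitutions), normalized by target length
    d = [[i if j == 0 else (j if i == 0 else 0) for j in range(len(target) + 1)]
         for i in range(len(pred) + 1)]
    for i in range(1, len(pred) + 1):
        for j in range(1, len(target) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (pred[i - 1] != target[j - 1]))
    return d[len(pred)][len(target)] / max(len(target), 1)
```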

Review Questions

  1. In what way does reducing window stride increase the model’s output sequence length, and why does that disrupt cross-entropy alignment?
  2. Describe the role of CTC in handling blanks and repeated characters. How does greedy decode relate to CER computation?
  3. Why might a bidirectional LSTM improve results even when the CNN already produces per-window logits?

Key Points

  1. With window width/stride set to 28, sliding-window CNN predictions align with non-overlapping synthetic characters, enabling cross-entropy training to reach ~89–90% validation accuracy.

  2. Overlapping windows (smaller stride) increase the number of output timesteps and break the one-to-one alignment assumption behind cross-entropy, dropping accuracy to around 52% in the lab’s example.

  3. Increasing overlap in the synthetic data can partially recover performance, but variable overlap across characters and writers remains too misaligned for cross-entropy to converge well (best around 60%).

  4. “line cnn” replaces repeated window-by-window CNN calls with a fully convolutional architecture to avoid recomputing convolutions on overlapping regions.

  5. CTC loss fixes the alignment problem by learning when to emit characters and handling blanks/repeats via collapsing; validation uses greedy decode to produce collapsed predictions for CER.

  6. Stacking a bidirectional LSTM on top of CNN logits adds sequence context and reduces character error rate further (to roughly 16–18 in the reported run).

Highlights

Cross-entropy works when windows effectively sample one character at a time, but it collapses once overlapping windows destroy positional alignment.
CTC loss is the turning point: it learns alignments between per-window logits and variable-length targets without requiring manual matching.
A bidirectional LSTM on top of CNN+CTC logits improves character-level accuracy by injecting left-to-right and right-to-left context.
“line cnn” speeds up overlapping-window inference by converting the sliding-window approach into a single fully convolutional forward pass.

Topics

  • Sliding Windows
  • CTC Loss
  • Fully Convolutional CNN
  • Bidirectional LSTM
  • Character Error Rate

Mentioned

  • RNN
  • CNN
  • LSTM
  • CTC
  • CER
  • MLP
  • GPU
  • CTC loss
  • PyTorch