
Labs 4-5: Tracking Experiments - Full Stack Deep Learning - March 2019

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The line recognizer scans a fixed-width (952-pixel) line image by extracting overlapping 28×28 windows and processing each with the same convolutional network, sharing weights across windows via Keras's TimeDistributed wrapper.

Briefing

Handwriting line recognition is built from two linked pieces: a convolutional network that scans an input line image window by window, and a sequence model trained with CTC to map those visual features into character strings. The lab’s core workflow treats each line image as a fixed-height (28-pixel) strip with a fixed width (952 pixels). From that wide image, the system extracts many overlapping 28×28 windows and runs the same convolutional model on every window using Keras's TimeDistributed wrapper for weight sharing, so each window is processed independently while all windows share identical parameters. Training then uses backpropagation-through-time mechanics over the sequence of window outputs, still sharing weights across windows, and produces a feature sequence that can be decoded into characters.
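The weight-sharing idea above can be sketched framework-free. This is a minimal illustration, not the lab's code: a single shared weight matrix stands in for the per-window convnet, and the window count, feature size, and ReLU choice are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes for illustration: 67 overlapping 28x28 windows, 128 features.
num_windows, window_h, window_w, feature_dim = 67, 28, 28, 128
windows = rng.standard_normal((num_windows, window_h, window_w))

# ONE shared parameter set, reused at every position in the sequence.
W = rng.standard_normal((window_h * window_w, feature_dim)) * 0.01
b = np.zeros(feature_dim)

def conv_stand_in(window):
    """Stand-in for the per-window convnet: the same W and b for every call."""
    return np.maximum(window.reshape(-1) @ W + b, 0.0)  # linear map + ReLU

# Applying the identical function to each window yields the feature sequence.
features = np.stack([conv_stand_in(w) for w in windows])  # shape (67, 128)
```

In Keras this pattern is what `TimeDistributed(conv_model)` does when applied to a tensor of shape `(batch, num_windows, 28, 28, 1)`: one set of convolutional weights, evaluated once per window.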

The next major step is data: the lab introduces a synthetic dataset called “EMNIST Lines,” generated from the Brown corpus. For each training example, a sentence is sampled from Brown, then each character in that sentence is replaced by a randomly sampled EMNIST character image, and the character images are placed side-by-side with random overlap. This yields 10,000 training examples with images shaped 28×952 and labels padded to a fixed maximum length of 34 characters, using blanks when shorter. Labels are drawn from 80 character classes (digits, uppercase/lowercase letters, and symbols). The lab emphasizes why synthetic data is used early: collecting large, labeled real handwriting datasets is expensive, while synthetic generation enables rapid prototyping and lets teams test whether the model pipeline works before scaling up.
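The generation recipe can be sketched as follows. This is a hedged illustration, not the lab's generator: random arrays stand in for real EMNIST glyphs, and the overlap range, padding token, and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

IMAGE_H, IMAGE_W = 28, 952   # fixed line-image geometry from the lab
MAX_LABEL_LEN = 34           # fixed maximum label length

# Hypothetical stand-ins: in the lab these would be real EMNIST glyph images,
# several samples per character class.
GLYPHS = {c: [rng.random((28, 28)) for _ in range(3)] for c in "abcdefghij "}

def make_line(text, max_overlap=8):
    """Paste one randomly chosen glyph per character, left to right,
    shifting each glyph left by a random overlap of 0..max_overlap pixels."""
    canvas = np.zeros((IMAGE_H, IMAGE_W))
    x = 0
    for ch in text:
        glyph = GLYPHS[ch][rng.integers(len(GLYPHS[ch]))]
        x = max(0, x - int(rng.integers(0, max_overlap + 1)))  # random overlap
        if x + 28 > IMAGE_W:
            break  # line is full
        canvas[:, x:x + 28] = np.maximum(canvas[:, x:x + 28], glyph)
        x += 28
    label = list(text[:MAX_LABEL_LEN])
    label += ["<blank>"] * (MAX_LABEL_LEN - len(label))  # pad short labels
    return canvas, label

image, label = make_line("hi abc")
```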

A key practical constraint is the fixed input geometry: the baseline assumes line images are 952 pixels wide and labels are at most 34 characters long. Shorter labels are fine via padding, but longer sentences would require retraining or a redesigned setup. The lab also addresses realism tradeoffs. Synthetic spacing isn’t uniform and includes overlap, but it may miss handwriting-specific correlations seen in real writing (like slanted baselines that shift character positions together). To bridge toward reality, the lab later compares against a real handwriting dataset, IAM Lines, with different characteristics: 7,000 training examples, 2,000 test examples, the same image size, and a larger maximum label length (97). The real dataset includes cursive writing and tighter character separation, which makes it harder and is one reason the synthetic version serves as a simpler starting point.

After setting up the data and model, the lab demonstrates training on the EMNIST Lines dataset using an LSTM with CTC loss (the “line LSTM CTC” network) and shows how to launch training by swapping the dataset and network arguments in the run command. It then shifts to experiment management with Weights & Biases: logging configs, metrics, and run metadata automatically via a callback integrated into the Keras training loop. Live dashboards make it easier to compare runs (e.g., different batch sizes), track loss/accuracy curves over time, and restart failed runs from checkpoints. The session ends by encouraging hands-on experimentation: tuning the sliding window width and stride, changing LSTM depth or direction (including bidirectionality), trying alternative architectures, adding normalization such as batch norm, and sweeping learning rate and batch size, so improvements can be validated through logged experiments rather than guesswork.
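The CTC loss that pairs with the LSTM can be computed with the classic forward algorithm. The sketch below is a minimal numpy version for intuition, not the Keras implementation (real frameworks work in log space for numerical stability); the function name and the blank-index convention are assumptions.

```python
import numpy as np

def ctc_loss(probs, label, blank=0):
    """Negative log-probability of `label` given per-timestep class
    distributions `probs` (shape T x num_classes), summed over all
    blank-padded alignments that collapse to `label`."""
    # Extend the label with blanks: [b, l1, b, l2, ..., b]
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]  # start with blank...
    alpha[0, 1] = probs[0, ext[1]]  # ...or the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                       # stay on same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]              # advance one symbol
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]              # skip a blank
            alpha[t, s] = a * probs[t, ext[s]]
    return -np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])

# Two timesteps, two classes (blank=0, 'a'=1), uniform predictions:
# paths (a,a), (a,blank), (blank,a) all collapse to "a", so P = 3/4.
uniform = np.full((2, 2), 0.5)
loss = ctc_loss(uniform, [1])  # -log(0.75)
```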

Cornell Notes

The lab builds a handwriting line recognizer by scanning a fixed-size line image with a convolutional network over many overlapping windows, sharing weights across windows via Keras's TimeDistributed wrapper. Each window produces features that form a sequence, which is then decoded into characters using an LSTM trained with CTC loss. Training labels come from a synthetic “EMNIST Lines” dataset generated from the Brown corpus, producing 10,000 examples with 28×952 images and labels padded to a maximum length of 34 characters across 80 classes. The synthetic setup accelerates prototyping, but it assumes a fixed input width and limited label length. A second, real dataset, IAM Lines, raises difficulty with cursive writing and a larger max label length (97), motivating the synthetic-to-real progression.
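Turning the per-window predictions back into text can be done with best-path (greedy) CTC decoding: take the most likely class at each step, collapse repeated classes, then drop blanks. A minimal sketch of that rule (not the lab's decoder; names and the blank index are assumptions):

```python
def greedy_ctc_decode(class_ids, blank=0):
    """Best-path CTC decoding: collapse runs of repeated ids, drop blanks."""
    out, prev = [], None
    for c in class_ids:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return out

# A blank between two identical ids keeps them as distinct characters,
# e.g. the double letter in "hello" below (h=8, e=5, l=12, o=15, blank=0).
decoded = greedy_ctc_decode([8, 8, 5, 5, 0, 12, 12, 0, 12, 15])  # [8, 5, 12, 12, 15]
```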

How does the model turn a wide line image into a character sequence input for later decoding?

It extracts many overlapping 28×28 windows from a fixed-height (28-pixel) line image that is assumed to be 952 pixels wide. Each window is processed by the same convolutional network via Keras's TimeDistributed wrapper, which applies identical weights at every position in the sequence. The forward pass computes outputs for every window, and training propagates gradients through the sequence of window outputs while keeping the weights shared across all windows.
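The window extraction itself is simple slicing. In this sketch the 14-pixel stride is a hypothetical choice (window width and stride are tunable hyperparameters in the lab); with a 28-pixel window it gives (952 − 28) // 14 + 1 = 67 windows per line.

```python
import numpy as np

IMAGE_H, IMAGE_W = 28, 952
WINDOW_W, STRIDE = 28, 14  # assumed values; both are tunable in the lab

def extract_windows(line_image, window_w=WINDOW_W, stride=STRIDE):
    """Slice overlapping fixed-width windows left to right across the line."""
    starts = range(0, line_image.shape[1] - window_w + 1, stride)
    return np.stack([line_image[:, x:x + window_w] for x in starts])

line = np.zeros((IMAGE_H, IMAGE_W))
windows = extract_windows(line)  # shape (67, 28, 28)
```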

What exactly is “EMNIST Lines” in this lab, and how are labels shaped?

“EMNIST Lines” is a synthetic dataset created by sampling sentences from the Brown corpus and then sampling an EMNIST character image for each character position. Characters are placed side-by-side with random overlap, producing line images that look like squeezed character sequences. The dataset provides 10,000 training examples with images shaped 28×952. Labels are 10,000 sequences padded to a maximum length of 34 characters, with the final dimension representing 80 character classes (digits, uppercase/lowercase letters, and symbols).

Why use synthetic data first instead of collecting real handwriting labels immediately?

Synthetic data avoids the expensive process of labeling large real handwriting datasets before the pipeline is proven. It also allows generating as much data as needed for prototyping and helps determine whether the model architecture and training approach can learn useful mappings. The lab treats the synthetic dataset as a stepping stone toward a more comprehensive real dataset.

What constraints does the baseline impose on input width and label length?

The baseline assumes line images are always 952 pixels wide. For labels, sequences can be shorter than the maximum (34 for the synthetic dataset), but they are padded to that fixed maximum length. If longer sentences are introduced later, the system would need retraining or a redesigned approach to handle longer sequences.
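The fixed-length label constraint can be made concrete with a small encoding helper. This is a hypothetical sketch: the character-to-index mapping, the blank/padding id, and the error handling are assumptions, not the lab's code; only the 34-character maximum comes from the source.

```python
MAX_LABEL_LEN = 34  # fixed maximum from the synthetic dataset
BLANK = 0           # assumed padding/blank class id

# Hypothetical mapping; the lab's real mapping covers 80 classes.
CHAR_TO_ID = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}

def encode_label(text, max_len=MAX_LABEL_LEN):
    """Map text to class ids and pad with blanks up to the fixed maximum.
    Longer labels cannot be represented without redesigning the setup."""
    if len(text) > max_len:
        raise ValueError(f"label longer than {max_len}: model needs retraining/redesign")
    ids = [CHAR_TO_ID[c] for c in text]
    return ids + [BLANK] * (max_len - len(ids))

padded = encode_label("hello world")  # 11 real ids + 23 blanks
```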

How does the lab compare synthetic EMNIST Lines to the real IAM Lines handwriting dataset?

The real dataset has 7,000 training examples and 2,000 test examples, with the same image size but a larger max label length of 97. It also includes cursive and characters that aren’t as clearly separated, making it harder. The lab uses this contrast to justify starting with the simpler synthetic version.

What role does Weights & Biases play during training, and how is it integrated?

Weights & Biases is used for experiment management: logging hyperparameters, metrics (like loss and accuracy), and run configuration so results can be compared later. Integration happens by initializing a W&B run for the experiment and adding a W&B callback to the Keras training loop. With the callback enabled via a flag in the training script, training metrics stream to the W&B dashboard in real time, enabling side-by-side comparisons across runs (e.g., different batch sizes) and easier restart/reproduction workflows.
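The callback pattern behind this integration can be sketched without any framework. The class below is an illustration of the pattern the W&B callback automates (record the run config once, append metrics each epoch); it is not the wandb API, and all names here are invented for the sketch.

```python
class RunLogger:
    """Minimal stand-in for an experiment tracker: stores the run's config
    up front, then appends per-epoch metrics, mirroring what a Keras
    callback's on_epoch_end hook would send to a dashboard."""

    def __init__(self, config):
        self.config = dict(config)  # hyperparameters, logged once
        self.history = []           # one dict of metrics per epoch

    def on_epoch_end(self, epoch, metrics):
        self.history.append({"epoch": epoch, **metrics})

# Simulated training loop; in the lab, Keras invokes the callback instead.
logger = RunLogger({"batch_size": 32, "lr": 1e-3})
for epoch, loss in enumerate([2.1, 1.4, 0.9]):
    logger.on_epoch_end(epoch, {"loss": loss})
```

The real integration replaces this class with W&B's Keras callback, which additionally streams the logged values to a hosted dashboard for cross-run comparison.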

Review Questions

  1. What assumptions about image width and maximum label length does the baseline system make, and what would break if those assumptions change?
  2. Describe how shared convolutional weights are applied across multiple windows in the line image and why that matters for training.
  3. Why does CTC loss pair naturally with variable-length character sequences in handwriting recognition?

Key Points

  1. The line recognizer scans a fixed-width (952-pixel) line image by extracting overlapping 28×28 windows and processing each with the same convolutional network, sharing weights via Keras's TimeDistributed wrapper.

  2. Window outputs form a sequence that is decoded into characters using an LSTM trained with CTC loss, enabling alignment-free mapping from pixels to text.

  3. The synthetic “EMNIST Lines” dataset is generated by sampling sentences from the Brown corpus and then sampling an EMNIST character image for each character position, with random overlap between adjacent characters.

  4. Training labels are padded to a fixed maximum length (34 for synthetic EMNIST Lines) across 80 character classes; shorter labels are allowed, while longer ones require changes or retraining.

  5. Synthetic data accelerates prototyping and lets teams validate the training pipeline before investing in expensive real handwriting labeling.

  6. The real IAM Lines dataset increases difficulty with cursive writing and a larger max label length (97), providing a more realistic target for the same overall modeling approach.

  7. Weights & Biases is integrated through a callback in the training loop to log configs and metrics, making hyperparameter comparisons and run tracking practical.

Highlights

Shared convolutional weights across all sliding windows are implemented through Keras's TimeDistributed wrapper, turning a single line image into a sequence of learned features.
The synthetic EMNIST Lines dataset uses the Brown corpus to generate text, then replaces each character with an EMNIST character image and places them with random overlap, which is fast enough to generate large training sets.
The baseline assumes a fixed input width (952 pixels) and a fixed maximum label length (34 for synthetic), so scaling to longer sentences isn’t automatic.
Weights & Biases logging is wired into the Keras callback system, enabling real-time dashboards and direct comparison of experiments like different batch sizes.
The real IAM Lines dataset is harder due to cursive writing and less-separated characters, with a max label length of 97.

Topics

  • Handwriting Line Recognition
  • Sliding Window CNN
  • CTC and LSTM
  • Synthetic Data Generation
  • Experiment Tracking with Weights & Biases

Mentioned

  • CTC
  • LSTM
  • Weights & Biases (W&B)