
Labs 1-3: Introduction to the Text Recognizer Project - Full Stack Deep Learning - March 2019

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The system is organized as a web backend that decodes an uploaded image, a deployed prediction model that runs inference, and a response that returns transcribed text.

Briefing

Handwritten-text recognition is built as a full pipeline: a web backend accepts an encoded image, a deployed “compiled prediction model” runs inference, and the system returns transcribed text. The core architecture splits recognition into two learned stages—first detecting each text line, then recognizing the characters within that line—so the model can focus on smaller, more learnable visual units before assembling the final output.

At the system level, the web backend receives a POST request containing an encoded image, decodes it into a format the model can process, and sends it to a prediction model packaged for deployment. That prediction model is produced by separate training code that learns weights and wraps them in application-ready logic. In this project’s design, the deployed model is treated like a compiled artifact: weights live alongside inference code so production can serve predictions without dragging in the full training toolchain.
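The request-handling flow described above can be sketched in plain Python. This is a minimal illustration with framework details (Flask routing, HTTP parsing) omitted and all names (`handle_predict`, `StubModel`) hypothetical — the point is only the decode → infer → respond chain:

```python
import base64
import json

def decode_image(b64_payload: str) -> bytes:
    """Decode the base64-encoded image from the request body."""
    return base64.b64decode(b64_payload)

def handle_predict(request_body: str, model) -> dict:
    """Hypothetical handler: decode the uploaded image, run inference
    with the deployed model, and return a JSON-serializable response."""
    payload = json.loads(request_body)
    image_bytes = decode_image(payload["image"])
    text = model.predict(image_bytes)  # the "compiled" prediction model
    return {"text": text}

class StubModel:
    """Stand-in for the deployed artifact (weights + inference code)."""
    def predict(self, image_bytes: bytes) -> str:
        return "hello world"

body = json.dumps({"image": base64.b64encode(b"fake-png-bytes").decode()})
response = handle_predict(body, StubModel())
```

Because the model object exposes only `predict`, the backend never needs the training toolchain — exactly the "compiled artifact" separation the labs advocate.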

The labs then zoom in on what a “mature” machine learning codebase looks like from training through deployment. The repository is organized into clear layers: an API folder for serving predictions (including a Flask server, tests, Docker configuration, and an AWS Lambda deployment descriptor), a data folder that stores download specifications rather than large datasets, and an evaluation area for quick model checks and unit-test-like scripts. Model and training logic are separated so that deployment can ship only what’s needed for inference. Within training, networks are defined as relatively “dumb” architectures, while models wrap the network with dataset formatting, loss functions, optimization, and training loops. Weights for production models are stored in a dedicated location, while other experiment weights can live elsewhere (e.g., object storage).
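The layered layout described above might look roughly like the following. Directory names here are illustrative of the structure the labs describe, not the repository's exact tree:

```
api/              # Flask server, tests, Dockerfile, AWS Lambda descriptor
data/             # download specifications only; datasets fetched on demand
evaluation/       # quick model checks and unit-test-like scripts
text_recognizer/
  networks/       # "dumb" architectures (MLP, CNN, LSTM+CTC)
  models/         # wrap networks with data formatting, loss, training loops
  weights/        # weights for production models
training/         # experiment scripts, W&B integration
```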

To make experimentation repeatable, the labs introduce experiment management with Weights & Biases (W&B) and emphasize dependency control using pipenv so training environments stay consistent even when libraries like TensorFlow change. A web-based Jupyter Lab environment provides shared access to the same requirements and GPUs, enabling everyone to run the same training and tests.

Lab 2 starts with a simpler task: predicting a single handwritten character using the EMNIST dataset (handwritten digits and letters). It demonstrates baseline network options such as an MLP that flattens 28×28 images into vectors and a LeNet-style convolutional network. A key software pattern appears again: networks define layers, while model classes provide a uniform interface (including a fit method) that compiles the network with an optimizer and loss, streams data via generators, and hides framework-specific details from training scripts.

Once character prediction works, the labs scale up to line recognition. The proposed architecture uses a sliding window over the line image, applies a convolutional network to each window patch, feeds the resulting sequence of features into an LSTM, and trains with CTC (Connectionist Temporal Classification) loss. CTC handles the mismatch between the fixed-length sequence produced by the model and the variable-length target text by allowing a special blank (epsilon) label and using dynamic programming to compute the probability of the correct transcription across many alignment paths. The result is a system that can learn from unsegmented handwriting—no per-character bounding boxes required—while still producing coherent text strings.

Cornell Notes

The project builds handwritten-text recognition as a full pipeline: a web backend sends an image to a deployed prediction model, which returns transcribed text. The codebase is structured for maturity—separating inference serving (API, Docker, AWS Lambda), data download specs (not large datasets), evaluation/testing scripts, and training code (networks vs models, weights storage). Training begins with a single-character task on EMNIST using baseline networks like an MLP and a LeNet-style CNN, wrapped by model classes that provide a consistent fit/predict interface. Scaling to full line recognition uses a sliding-window CNN feature extractor, an LSTM sequence model, and CTC loss to align variable-length text without explicit character segmentation. This matters because CTC enables end-to-end learning from images to text where character boundaries are unknown.

Why split recognition into line detection and line text recognition instead of doing everything at once?

The architecture is designed around smaller, learnable visual units. A line detector isolates each text line, producing an image containing only a single line. That output then feeds a line text recognizer that converts the line image into characters and assembles the transcription. This reduces the complexity of the visual input the recognizer must handle and makes training and inference more modular.

What does the codebase separation between “networks” and “models” buy you?

Networks are treated as architectures that transform inputs into raw outputs (e.g., a CNN or MLP). Models wrap that network with the rest of the training/inference contract: how raw data becomes the right input format, how outputs become interpretable predictions, and which loss/optimization logic is used. This separation helps keep deployment self-contained (ship inference-relevant code) while training-specific components remain isolated.
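A toy sketch of that separation, with deliberately trivial stand-ins for the real computation (the actual labs use Keras networks and a `fit` method that compiles with a loss and optimizer):

```python
class Network:
    """'Dumb' architecture: maps inputs to raw outputs, nothing else."""
    def forward(self, x):
        return [v * 2.0 for v in x]  # stand-in for CNN/MLP computation

class Model:
    """Wraps a network with the rest of the training/inference contract:
    input formatting, output interpretation, and (in the real labs)
    loss, optimizer, and training loops."""
    def __init__(self, network):
        self.network = network

    def preprocess(self, raw):
        # raw data -> the input format the network expects
        return [float(v) for v in raw]

    def predict(self, raw):
        # raw outputs -> an interpretable prediction
        outputs = self.network.forward(self.preprocess(raw))
        return max(outputs)

    def fit(self, dataset):
        """Uniform training entry point; the real version compiles the
        network with a loss and optimizer and streams data via generators."""
        raise NotImplementedError

model = Model(Network())
result = model.predict([1, 2, 3])
```

Deployment can then ship `Network` plus `predict` while `fit` and its dependencies stay in the training codebase.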

How does the single-character training setup on EMNIST work at a high level?

EMNIST provides 28×28 handwritten character images with labels across 80 character classes. An MLP flattens the image and uses dense layers, while a LeNet-style CNN uses convolution/pooling before classification. A character model class overrides prediction behavior for raw images (casting unsigned integers to floats, running the network, and taking the most likely class). Training uses a consistent interface (a fit method) that compiles the network with a loss and optimizer and streams data via generators.
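The prediction-on-raw-images behavior can be sketched with NumPy. The network here is a stub (the labs use an MLP or LeNet-style CNN), and the helper name is hypothetical; the cast-scale-argmax steps are the part being illustrated:

```python
import numpy as np

NUM_CLASSES = 80  # character classes in the labs' EMNIST setup

def predict_on_image(network, image: np.ndarray) -> int:
    """Hypothetical model-level predict: cast raw uint8 pixels to floats,
    scale to [0, 1], run the network, and take the most likely class."""
    x = image.astype(np.float32) / 255.0
    probs = network(x.reshape(1, -1))  # network returns per-class probabilities
    return int(np.argmax(probs, axis=-1)[0])

def stub_network(batch: np.ndarray) -> np.ndarray:
    """Stub that always favors class 3, just to exercise the wrapper."""
    probs = np.full((batch.shape[0], NUM_CLASSES), 1.0 / NUM_CLASSES)
    probs[:, 3] = 1.0
    return probs

image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)
pred = predict_on_image(stub_network, image)
```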

Why does line recognition use a sliding window plus an LSTM rather than a single left-to-right classifier?

Handwriting is both sequential and spatial. A sliding window converts the line image into a sequence of overlapping patches; the CNN extracts features per patch. An LSTM then models context across those time steps, helping resolve ambiguity where a patch might contain parts of multiple characters. This context is crucial when the same visual pattern could correspond to different characters depending on what came before.
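The windowing step can be sketched as below (window width and stride are arbitrary illustrative values, not the labs' exact hyperparameters). Note that the number of patches is also the "input length" that CTC loss later needs:

```python
import numpy as np

def sliding_windows(line_image: np.ndarray, width: int, stride: int) -> np.ndarray:
    """Cut an (H, W) line image into overlapping (H, width) patches;
    each patch becomes one time step after CNN feature extraction."""
    h, w = line_image.shape
    starts = range(0, w - width + 1, stride)
    return np.stack([line_image[:, s:s + width] for s in starts])

line = np.zeros((28, 256))                       # a 256-pixel-wide line image
patches = sliding_windows(line, width=28, stride=14)
input_length = patches.shape[0]                  # time steps seen by the LSTM/CTC
```

A stride smaller than the window width gives overlapping patches, so characters split across a window boundary still appear whole in a neighboring one.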

How does CTC loss solve the alignment problem for handwriting without character segmentation?

CTC introduces a blank (epsilon) label and allows the model to output a fixed-length sequence of per-time-step character probabilities. During decoding, adjacent duplicate predictions are collapsed, epsilon blanks are removed, and the remaining characters form the transcription. Training uses CTC’s dynamic programming to sum probabilities over all alignments that could produce the target string, so the model learns to place characters in the right order even when character boundaries are unknown.
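The decoding rule — collapse adjacent duplicates, then drop blanks — is simple enough to sketch directly (this is greedy decoding over an already-chosen label per time step, not the full dynamic program used in training):

```python
BLANK = "ε"  # the special blank label CTC introduces

def ctc_greedy_decode(per_step_labels):
    """Collapse adjacent duplicate labels, then remove blanks, turning a
    fixed-length alignment into a variable-length transcription."""
    out = []
    prev = None
    for label in per_step_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# an alignment like h h ε e ε l l l ε l o collapses to "hello";
# the blank between the two l-runs is what keeps the double letter
decoded = ctc_greedy_decode(list("hhεeεlllεlo"))
```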

What inputs does the CTC-based line model need that differ from the simpler character model?

The CTC network function needs not only the image-derived input sequence but also the input sequence length and the label sequence length. Both lengths are required to compute CTC loss correctly: input length says how many time steps the model produced from the image, while label length gives the length of the target transcription, so the dynamic program sums only over alignments consistent with both.

Review Questions

  1. What specific responsibilities belong in “networks” versus “models,” and how does that affect deployment packaging?
  2. Describe how sliding windows, a CNN feature extractor, and an LSTM combine to turn a line image into a sequence suitable for CTC.
  3. Why does CTC require dynamic programming, and what role does the blank (epsilon) label play in training and decoding?

Key Points

  1. The system is organized as a web backend that decodes an uploaded image, a deployed prediction model that runs inference, and a response that returns transcribed text.
  2. Line-level recognition is modular: a line detector isolates each line, and a line text recognizer converts that line image into characters.
  3. A production-ready ML codebase separates concerns: API serving, data download specs, evaluation/testing, inference packaging, and training logic.
  4. Networks define architectures, while models wrap networks with dataset formatting, loss functions, optimization, and a consistent fit/predict interface.
  5. Training starts with single-character prediction on EMNIST using baseline CNN/MLP architectures before scaling to line recognition.
  6. Line recognition uses a sliding-window CNN feature extractor feeding an LSTM, trained end-to-end with CTC loss to handle unknown character boundaries.
  7. CTC’s blank label and decoding rules (collapse duplicates, remove blanks) enable variable-length text transcription from fixed-length model outputs.

Highlights

  • The pipeline returns text by chaining a web backend (image decode + API) to a compiled prediction model (weights + inference logic) and back to the user.
  • CTC loss lets the model learn alignments without character segmentation by summing over many possible paths using dynamic programming.
  • Line recognition is implemented as a sequence model: sliding-window CNN features become LSTM time steps, then a dense layer produces per-step character probabilities.
  • The repository structure is intentionally deployment-aware: inference code is isolated so production can ship only what’s needed.
  • Dependency and experiment reproducibility are treated as first-class concerns via pipenv and Weights & Biases.

Topics

  • Text Recognizer Architecture
  • EMNIST Character Training
  • CTC Line Recognition
  • ML Codebase Organization
  • Deployment with AWS Lambda

Mentioned

  • W&B
  • ML
  • CTC
  • LSTM
  • CNN
  • MLP
  • AWS
  • GPU
  • Jupyter
  • API
  • Docker
  • EMNIST