
Lab 1 - Introduction - Full Stack Deep Learning

The Full Stack
5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The target app pipeline is modular: a web backend decodes images, a compiled prediction model performs inference, and training code produces the weights used for serving.

Briefing

The lab setup centers on building a production-minded deep learning pipeline for a text-recognition app—turning an uploaded page image into a clean transcription—while giving participants a structured codebase to train and evaluate models. The workflow is split into three interacting modules: a web backend receives an image, a compiled prediction model performs inference, and training code produces the model weights used for serving. Inference itself is staged: a line detector first extracts individual text lines from the full image, then a line text recognizer converts each line into text, and the results are returned to the backend for API-ready output.
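The staged flow can be sketched in a few lines of Python. The helper names below (detect_lines, recognize_line, transcribe_page) are illustrative stand-ins, not the repository's actual API:

```python
import numpy as np

def detect_lines(page: np.ndarray) -> list:
    # Stage 1: a real line detector would segment the page into line crops;
    # splitting into horizontal strips is a stand-in for illustration.
    return np.array_split(page, 3, axis=0)

def recognize_line(line: np.ndarray) -> str:
    # Stage 2: a real recognizer would run a trained model on each crop.
    return "<recognized text>"

def transcribe_page(page: np.ndarray) -> str:
    # Backend-facing entry point: detect lines, read each, join results.
    return "\n".join(recognize_line(line) for line in detect_lines(page))

print(transcribe_page(np.zeros((256, 512))))
```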

Before any modeling begins, the course walks through getting access to a managed compute environment using Weights & Biases (W&B). Participants sign up at app.wandb.ai, then redeem a token to unlock a JupyterHub environment. Each user receives access to two GPUs, and the lab notebooks run in a web-based JupyterLab interface already configured for the class. If setup stalls, help is routed through a Slack channel, and the intent is to minimize local laptop friction so learners can focus on understanding and improving the codebase over the following month.

Lab 1 then lays out the architecture and the learning path. The first lab trains a “simple character predictor” rather than tackling full page transcription immediately. The system starts with predicting a single handwritten character from an image, using the EMNIST dataset (an extension of MNIST with 80 classes including digits, uppercase letters, lowercase letters, and a few symbols). Participants inspect dataset statistics and image shapes (700,000 training images of size 28×28) and visualize samples to build intuition about what the model is learning.
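As a rough illustration of the kind of inspection the notebook performs, the snippet below checks shapes and labels on small stand-in arrays with EMNIST's dimensions; the real lab loads the data through the repo's own dataset class:

```python
import numpy as np
import matplotlib.pyplot as plt

# Small stand-in arrays; the real training set has ~700,000 such images.
x_train = np.random.randint(0, 256, size=(1_000, 28, 28), dtype=np.uint8)
y_train = np.random.randint(0, 80, size=(1_000,))

print(x_train.shape)             # (1000, 28, 28) -- same 28x28 shape as MNIST
print(np.unique(y_train).size)   # up to 80 classes: digits, letters, symbols

# Visualize a few samples to build intuition, as the lab notebook does.
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, img, label in zip(axes, x_train[:5], y_train[:5]):
    ax.imshow(img, cmap="gray")
    ax.set_title(int(label))
    ax.axis("off")
plt.show()
```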

A major portion of the session is devoted to how the repository is organized. At the top level, notebooks support data exploration; tasks provide reusable scripts for common commands; training contains the scalable training entry points; and the text recognizer package holds the core logic. Inside that package, datasets manage downloading, preprocessing, augmentation, and loading into TensorFlow (while keeping raw data external to version control). Networks define “dumb” input-output architectures (e.g., an MLP that flattens 28×28 images into vectors and stacks dense layers ending in softmax, or a LeNet-like CNN option built from convolutional layers). Models wrap those networks with training functionality—loss functions, optimizers, metrics, weight saving/loading, evaluation, and a fit loop.
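In the Keras style the codebase uses, such a "dumb" MLP network might look like the sketch below; the layer sizes and defaults are illustrative, not the repo's exact values:

```python
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

def mlp(input_shape=(28, 28), num_classes=80, layer_sizes=(128, 128)):
    # Pure input-to-output architecture: no loss, optimizer, or training logic.
    model = Sequential()
    model.add(Flatten(input_shape=input_shape))            # 28x28 -> 784 vector
    for size in layer_sizes:
        model.add(Dense(size, activation="relu"))          # stacked dense layers
    model.add(Dense(num_classes, activation="softmax"))    # class probabilities
    return model

mlp().summary()
```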

The training pipeline is driven by a run-experiment configuration (a JSON/dictionary specifying dataset, model, network, and training arguments). A run-experiment script imports the requested dataset and model, downloads data if needed, trains via a model.fit call, and then evaluates on a test set. The session also emphasizes modularity as a bug-prevention strategy: isolating data generation, swapping datasets easily, and avoiding monolithic scripts that duplicate logic across architectures. Participants then begin an actual training run in the Jupyter environment, installing dependencies as needed and watching loss decrease, with encouragement to iterate on hyperparameters and improvements on their own time.
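A config in that spirit might look like the following; the exact keys are assumptions based on the lab's description rather than the repo's actual schema:

```python
experiment_config = {
    "dataset": "EmnistDataset",            # which dataset class to instantiate
    "dataset_args": {},                    # arguments passed to that dataset
    "model": "CharacterModel",             # training wrapper around the network
    "network": "mlp",                      # architecture to use
    "network_args": {"layer_sizes": [128, 128]},
    "train_args": {"batch_size": 256, "epochs": 8},
}
```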

Cornell Notes

The labs build a production-style deep learning system for text recognition, where a web backend sends an image to a compiled prediction model. Inference is staged: a line detector extracts text lines, and a line text recognizer converts each line into text for API output. Lab 1 starts small by training a single-character predictor using EMNIST, an MNIST-like dataset with 80 classes (digits, letters, and symbols). The repository is organized so notebooks explore data, datasets handle download/preprocessing/augmentation, networks define architectures (MLP/CNN), and models add training/evaluation logic (loss, optimizer, fit/evaluate, weight I/O). Training runs are configured via a JSON/dictionary and executed through a run-experiment script that trains and then evaluates on a test set.

How does the system turn an uploaded page image into transcription-ready text?

A web backend receives a POST request containing an image (as bytes), decodes it into a readable image format, and passes it to a compiled prediction model. The prediction model first runs a line detector to extract individual lines from the full page image. Each extracted line is then processed by a line text recognizer that converts the line’s content into text. The line-level outputs are sent back to the web backend, which packages them into a final prediction served as an API response.
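A minimal sketch of that decode-and-predict step, using Flask and Pillow as stand-ins for whatever the course actually deploys:

```python
import io

import numpy as np
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

def transcribe_page(image: np.ndarray) -> str:
    # Stand-in for the compiled prediction model (see the earlier sketch).
    return "<recognized text>"

@app.route("/predict", methods=["POST"])
def predict():
    image_bytes = request.get_data()  # raw image bytes from the POST body
    # Decode the bytes into a readable (grayscale) image array.
    image = np.array(Image.open(io.BytesIO(image_bytes)).convert("L"))
    return jsonify({"prediction": transcribe_page(image)})  # API-ready output
```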

Why does Lab 1 train a character predictor before attempting full line or page recognition?

The course intentionally starts with the simplest end of the pipeline: predicting a single character from an image. This reduces complexity so learners can understand the codebase structure and training mechanics using a straightforward task. Once character-level training is working, later labs scale up to reading entire lines, then address experiment management, additional pipeline components, testing/CI, and deployment.

What does EMNIST add compared with MNIST, and what are the dataset basics used here?

EMNIST extends MNIST by including not only digits but also letters and symbols. In this lab, the EMNIST dataset has 80 classes: numbers, uppercase letters, lowercase letters, and a few basic symbols. The training set contains about 700,000 images, each sized 28×28, matching the MNIST image shape.

What is the division of labor between the repository’s Networks and Models folders?

Networks define the architecture as a “dumb” input-output mapping—how tensors flow through layers. For example, the MLP network flattens a 28×28 image into a 784-dimensional vector, adds dense layers according to specified layer sizes, and ends with softmax. Models wrap those networks with training and usability features: selecting loss/optimizer/metrics (fixed in the base class), compiling the Keras model, converting datasets into Keras-friendly sequences, running fit/evaluate, and handling saving/loading weights. A character model further adds a predict-on-image function that preprocesses inputs (e.g., scaling pixel values to floats in [0,1] by dividing by 255).
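A compact sketch of such a model wrapper; the class and method names are illustrative, not the repo's exact API:

```python
import numpy as np
from tensorflow.keras.optimizers import Adam

class CharacterModel:
    def __init__(self, network):
        self.network = network
        # Loss, optimizer, and metrics are fixed in the base class per the lab.
        self.network.compile(loss="categorical_crossentropy",
                             optimizer=Adam(), metrics=["accuracy"])

    def fit(self, x, y, **train_args):
        return self.network.fit(x, y, **train_args)

    def evaluate(self, x, y):
        return self.network.evaluate(x, y)

    def save_weights(self, path):
        self.network.save_weights(path)

    def load_weights(self, path):
        self.network.load_weights(path)

    def predict_on_image(self, image: np.ndarray):
        # Preprocess: scale uint8 pixels to floats in [0, 1] by dividing by 255.
        x = image.astype("float32") / 255.0
        probs = self.network.predict(x[np.newaxis, ...])[0]
        return int(np.argmax(probs)), float(np.max(probs))
```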

How does the run-experiment configuration drive training?

Training is controlled by an experiment config (JSON/dictionary) that specifies which dataset to use, dataset arguments, which model and network to use, and training hyperparameters. The run experiment script uses the config to import and instantiate the dataset (downloading/generating data if missing), instantiate the model with the provided arguments, train via a model.fit call (optionally with callbacks), and then score performance by running model.evaluate on the test set.
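Putting those steps together, a run-experiment function might look like this sketch; the text_recognizer module paths and attribute names are assumptions about the repo layout:

```python
import importlib

def run_experiment(config: dict):
    # Module paths below are assumptions, not confirmed by the source.
    datasets = importlib.import_module("text_recognizer.datasets")
    models = importlib.import_module("text_recognizer.models")
    networks = importlib.import_module("text_recognizer.networks")

    # 1. Instantiate the dataset and download/generate data if missing.
    dataset = getattr(datasets, config["dataset"])(**config.get("dataset_args", {}))
    dataset.load_or_generate_data()

    # 2. Instantiate the model around the requested network.
    network_fn = getattr(networks, config["network"])
    model = getattr(models, config["model"])(network_fn(**config.get("network_args", {})))

    # 3. Train via model.fit, then score on the test set.
    model.fit(dataset.x_train, dataset.y_train, **config.get("train_args", {}))
    print("Test evaluation:", model.evaluate(dataset.x_test, dataset.y_test))
```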

What modularity principle is emphasized to reduce bugs in ML codebases?

The lab stresses isolating potential bug sources—especially data generation and preprocessing. If augmentation or preprocessing is wrong (e.g., over-regularizing images), training can become impossible. By keeping dataset logic in dedicated dataset files with easy toggles, learners can swap in a known-good dataset definition and quickly test whether the failure comes from data handling versus the model architecture or training loop.
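The toggle idea can be illustrated with trivial stand-in classes: because dataset logic lives behind a common interface, flipping one config key swaps a suspect augmented pipeline for a known-good baseline.

```python
class EmnistDataset:
    """Known-good baseline: plain images, no augmentation (stand-in class)."""
    def load_or_generate_data(self):
        print("loading plain EMNIST")

class EmnistDatasetWithAugmentation(EmnistDataset):
    """Suspect pipeline: augmentation/preprocessing under test (stand-in)."""
    def load_or_generate_data(self):
        print("loading EMNIST with augmentation")

DATASETS = {
    "emnist_plain": EmnistDataset,
    "emnist_augmented": EmnistDatasetWithAugmentation,
}

# If training works with "emnist_plain" but fails with "emnist_augmented",
# the bug is in data handling, not the model architecture or training loop.
DATASETS["emnist_plain"]().load_or_generate_data()
```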

Review Questions

  1. Map the full inference path from a submitted image to the final API output. Where do line detection and line recognition fit?
  2. Describe how the codebase separates architecture definition from training mechanics. What belongs in Networks vs Models?
  3. Given an experiment config, what steps does the run-experiment script perform from dataset instantiation through evaluation?

Key Points

  1. The target app pipeline is modular: a web backend decodes images, a compiled prediction model performs inference, and training code produces the weights used for serving.

  2. Inference is staged for scalability: a line detector extracts text lines, and a line text recognizer converts each line into text before returning results to the backend.

  3. Lab 1 starts with a character-level task using EMNIST (80 classes) to build understanding before scaling to line and page recognition.

  4. Repository structure separates concerns: notebooks for exploration, tasks for reusable scripts, training for scalable entry points, and text recognizer for core ML logic.

  5. Networks define architectures (e.g., MLP flattens 28×28 inputs and uses dense layers with softmax), while Models wrap networks with training/evaluation, weight I/O, and convenience prediction methods.

  6. Training runs are driven by a JSON/dictionary experiment config that selects dataset, model/network, and hyperparameters, then trains and evaluates automatically.

  7. Modularity is treated as a reliability strategy: isolating dataset preprocessing/augmentation helps pinpoint failures and prevents duplicated logic across scripts.

Highlights

Inference is explicitly decomposed into line detection followed by line text recognition, with the backend assembling the final transcription.
The codebase draws a hard line between Networks (architecture-only) and Models (loss/optimizer/fit/evaluate/weights and prediction wrappers).
EMNIST is used as the first training target, expanding MNIST into 80 classes including digits, letters, and symbols.
