Lab 1 - Introduction - Full Stack Deep Learning
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their content.
Briefing
The lab setup centers on building a production-minded deep learning pipeline for a text-recognition app—turning an uploaded page image into a clean transcription—while giving participants a structured codebase to train and evaluate models. The workflow is split into three interacting modules: a web backend receives an image, a compiled prediction model performs inference, and training code produces the model weights used for serving. Inference itself is staged: a line detector first extracts individual text lines from the full image, then a line text recognizer converts each line into text, and the results are returned to the backend for API-ready output.
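The staged inference path can be sketched in a few lines. This is a hypothetical stand-in, not the lab's actual API: `detect_lines` and `recognize_line` are stubs (here a "page" is faked as a list of line identifiers and recognition is a dictionary lookup), but the control flow mirrors the description above: detect lines, recognize each line, and return an API-ready payload to the backend.

```python
# Hypothetical sketch of the staged inference path. detect_lines and
# recognize_line are illustrative stubs; in the lab they are trained models.

FAKE_TRANSCRIPTIONS = {0: "hello", 1: "world"}  # stand-in for a trained recognizer


def detect_lines(page_image):
    """Stub line detector: pretend each element of the page is one line crop."""
    return list(page_image)


def recognize_line(line_crop):
    """Stub line recognizer: map a line crop to its text."""
    return FAKE_TRANSCRIPTIONS[line_crop]


def predict(page_image):
    """Staged inference: detect lines, recognize each, return an API payload."""
    lines = detect_lines(page_image)
    texts = [recognize_line(line) for line in lines]
    return {"text": "\n".join(texts)}


print(predict([0, 1]))  # {'text': 'hello\nworld'}
```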
Before any modeling begins, the course walks through getting access to a managed compute environment using Weights & Biases (W&B). Participants sign up at app.wandb.ai, then redeem a token to unlock a JupyterHub environment. Each user receives access to two GPUs, and the lab notebooks run in a web-based JupyterLab interface already configured for the class. If setup stalls, help is routed through a Slack channel; the intent is to minimize local laptop friction so learners can focus on understanding and improving the codebase over the following month.
Lab 1 then lays out the architecture and the learning path. Rather than tackling full-page transcription immediately, the first lab trains a “simple character predictor”: the system predicts a single handwritten character from an image, using the EMNIST dataset (an extension of MNIST with 80 classes covering digits, uppercase letters, lowercase letters, and a few symbols). Participants inspect dataset statistics and image shapes (700,000 training images of size 28×28) and visualize samples to build intuition about what the model is learning.
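The dataset-inspection step looks roughly like the following. This is an illustrative stand-in: the arrays here are random rather than downloaded EMNIST data (the real lab loads them through its dataset class), but the shapes mirror the lab's numbers: 28×28 grayscale images and 80 output classes.

```python
import numpy as np

# Stand-in for the EMNIST inspection step: random data with EMNIST-like
# shapes (28x28 images, 80 classes). The real set has ~700,000 training images.
num_classes = 80
x_train = np.random.rand(1000, 28, 28)
y_train = np.random.randint(0, num_classes, size=1000)

print("num images:", x_train.shape[0])
print("image shape:", x_train.shape[1:])       # (28, 28)
print("label range:", y_train.min(), "-", y_train.max())
```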
A major portion of the session is devoted to how the repository is organized. At the top level, notebooks support data exploration; tasks provide reusable scripts for common commands; training contains the scalable training entry points; and the text recognizer package holds the core logic. Inside that package, datasets manage downloading, preprocessing, augmentation, and loading into TensorFlow (while keeping raw data external to version control). Networks define “dumb” input-output architectures (e.g., an MLP that flattens 28×28 images into vectors and stacks dense layers with softmax, or a CNN-style LeNet-like option using convolutional layers). Models wrap those networks with training functionality—loss functions, optimizers, metrics, weight saving/loading, evaluation, and a fit loop.
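The "dumb" MLP network described above can be sketched as a plain forward pass. This is a minimal NumPy illustration, not the lab's framework code: weights are random and untrained, but the structure matches the description of flattening 28×28 images into vectors, stacking dense layers, and ending in a softmax over the 80 classes.

```python
import numpy as np

# Minimal numpy sketch of the MLP network: flatten 28x28 -> dense + ReLU ->
# dense -> softmax over 80 classes. Weights are random; the lab trains this
# inside its Models wrapper with a proper loss and optimizer.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, (28 * 28, 128))
b1 = np.zeros(128)
W2 = rng.normal(0, 0.01, (128, 80))
b2 = np.zeros(80)


def mlp_forward(images):
    x = images.reshape(len(images), -1)           # flatten 28x28 -> 784
    h = np.maximum(0.0, x @ W1 + b1)              # dense layer with ReLU
    logits = h @ W2 + b2                          # dense layer to 80 classes
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)       # softmax probabilities


probs = mlp_forward(rng.random((4, 28, 28)))
print(probs.shape)  # (4, 80)
```

Each row of `probs` sums to 1, which is what makes the output interpretable as class probabilities for the 80 EMNIST classes.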
The training pipeline is driven by a run-experiment configuration (a JSON/dictionary specifying dataset, model, network, and training arguments). A run experiment script imports the requested dataset and model, downloads data if needed, trains via a model.fit call, and then evaluates on a test set. The session also emphasizes modularity as a bug-prevention strategy: isolating data generation, swapping datasets easily, and avoiding monolithic scripts that duplicate logic across architectures. Participants then begin an actual training run in the Jupyter environment, installing dependencies as needed and watching loss decrease, with encouragement to iterate on hyperparameters and improvements on their own time.
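The config-driven flow above can be sketched as a small driver. The names here (`run_experiment`, `registry`, the config keys) are hypothetical stand-ins for the lab's actual script, but they show the shape of the idea: a dictionary names the dataset, model, network, and training arguments, and a single entry point wires them together, trains, and evaluates.

```python
# Hypothetical sketch of the run-experiment flow; names and keys are
# illustrative, not the lab's exact API.

experiment_config = {
    "dataset": "EmnistDataset",
    "model": "CharacterModel",
    "network": "mlp",
    "train_args": {"epochs": 3, "batch_size": 128},
}


def run_experiment(config, registry):
    """Instantiate the named dataset and model, train, then evaluate."""
    dataset = registry[config["dataset"]]()          # downloads data if needed
    model = registry[config["model"]](config["network"])
    model.fit(dataset, **config["train_args"])       # train via model.fit
    return model.evaluate(dataset)                   # evaluate on the test set
```

Because the config is just data, swapping the dataset or network is a one-line change, which is exactly the modularity the session emphasizes as a bug-prevention strategy.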
Cornell Notes
The labs build a production-style deep learning system for text recognition, where a web backend sends an image to a compiled prediction model. Inference is staged: a line detector extracts text lines, and a line text recognizer converts each line into text for API output. Lab 1 starts small by training a single-character predictor using EMNIST, an MNIST-like dataset with 80 classes (digits, letters, and symbols). The repository is organized so notebooks explore data, datasets handle download/preprocessing/augmentation, networks define architectures (MLP/CNN), and models add training/evaluation logic (loss, optimizer, fit/evaluate, weight I/O). Training runs are configured via a JSON/dictionary and executed through a run-experiment script that trains and then evaluates on a test set.
How does the system turn an uploaded page image into transcription-ready text?
Why does Lab 1 train a character predictor before attempting full line or page recognition?
What does EMNIST add compared with MNIST, and what are the dataset basics used here?
What is the division of labor between the repository’s Networks and Models folders?
How does the run-experiment configuration drive training?
What modularity principle is emphasized to reduce bugs in ML codebases?
Review Questions
- Map the full inference path from a submitted image to the final API output. Where do line detection and line recognition fit?
- Describe how the codebase separates architecture definition from training mechanics. What belongs in Networks vs Models?
- Given an experiment config, what steps does the run-experiment script perform from dataset instantiation through evaluation?
Key Points
1. The target app pipeline is modular: a web backend decodes images, a compiled prediction model performs inference, and training code produces the weights used for serving.
2. Inference is staged for scalability: a line detector extracts text lines, and a line text recognizer converts each line into text before returning results to the backend.
3. Lab 1 starts with a character-level task using EMNIST (80 classes) to build understanding before scaling to line and page recognition.
4. Repository structure separates concerns: notebooks for exploration, tasks for reusable scripts, training for scalable entry points, and text recognizer for core ML logic.
5. Networks define architectures (e.g., MLP flattens 28×28 inputs and uses dense layers with softmax), while Models wrap networks with training/evaluation, weight I/O, and convenience prediction methods.
6. Training runs are driven by a JSON/dictionary experiment config that selects dataset, model/network, and hyperparameters, then trains and evaluates automatically.
7. Modularity is treated as a reliability strategy: isolating dataset preprocessing/augmentation helps pinpoint failures and prevents duplicated logic across scripts.