Lab 06: Data Annotation (FSDL 2022)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.
Briefing
Data annotation is treated as a make-or-break step in the full machine-learning pipeline: rich, carefully structured labels—often at finer granularity than the final task—are what turn raw handwritten inputs into training data that neural networks can actually learn from. The lab’s first half focuses on how handwritten-text datasets are represented and persisted for PyTorch/PyTorch Lightning, emphasizing that the “flavor” of the raw data matters less than the principle of capturing detailed annotations. Even when the end goal is a whole paragraph, the workflow benefits from labeling at the line, word, and character levels. That detail enables synthetic data generation, such as recombining lines from different paragraphs into new synthetic paragraphs to stretch limited datasets. The lab frames data synthesis as an underrated early-stage strategy for bootstrapping ML systems, especially when data is scarce, and notes that modern image/text synthesis advances (including approaches associated with Stable Diffusion–style image generation and GPT-3–style text generation) make this kind of augmentation increasingly important.
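To make the line-recombination idea concrete, here is a minimal sketch, not the lab's actual code, of stitching labeled line crops into a new synthetic paragraph; function and variable names are illustrative:

```python
import random
from PIL import Image

def synthesize_paragraph(line_crops, line_texts, n_lines=3, margin=16):
    """Stack randomly chosen labeled line crops into one synthetic paragraph.

    line_crops: list of PIL.Image line images (cut out via line-level regions)
    line_texts: matching list of transcription strings
    Returns (paragraph_image, paragraph_text).
    """
    idxs = random.sample(range(len(line_crops)), k=n_lines)
    chosen = [line_crops[i] for i in idxs]

    width = max(c.width for c in chosen) + 2 * margin
    height = sum(c.height for c in chosen) + margin * (n_lines + 1)
    paragraph = Image.new("L", (width, height), color=255)  # blank white page

    y = margin
    for crop in chosen:
        paragraph.paste(crop, (margin, y))
        y += crop.height + margin

    # Because each line carries its own transcription, the synthetic
    # paragraph's label is just the concatenation of the chosen line texts.
    text = "\n".join(line_texts[i] for i in idxs)
    return paragraph, text
```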
The second half shifts from dataset structure to the practical mechanics of producing annotations from the real world. Raw handwritten pages are easy to collect by scanning and digitizing, but the labels must be created manually. For that, the lab uses Label Studio, a secure web-based annotation tool. Because Label Studio runs as a local web service during the exercise, the setup includes creating a username/password, installing Label Studio, and using ngrok to expose the local service to the public internet without wrestling with firewalls or port forwarding. The lab also uses a publicly accessible FSDL handwriting dataset stored on S3, but in an unannotated form; instead of uploading images directly, Label Studio ingests a manifest (a CSV of URLs). In local development, the manifest can be uploaded directly; in Colab-style workflows, the manifest must be downloaded to the machine running the browser.
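Concretely, the manifest is just a CSV in which each row points at one image URL. A hedged sketch of producing one and optionally opening an ngrok tunnel follows; the bucket path, file names, column name, and use of the pyngrok wrapper are assumptions rather than the lab's exact commands:

```python
import csv

# Hypothetical URLs for the unannotated handwriting pages; the real FSDL
# bucket layout and file names may differ.
image_urls = [
    f"https://fsdl-public-assets.s3.amazonaws.com/handwriting/page_{i:04d}.png"
    for i in range(200)
]

# Each row of the manifest becomes one Label Studio task. The column name
# ("image") is whatever the labeling config later references as $image.
with open("manifest.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image"])
    writer.writerows([url] for url in image_urls)

# Optional: expose the locally running Label Studio (default port 8080) at a
# public URL. This uses the pyngrok wrapper; the lab may call the ngrok CLI
# directly instead.
# from pyngrok import ngrok
# public_url = ngrok.connect(8080)
# print(public_url)
```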
Once Label Studio is running, each row in the CSV becomes a “task” to annotate. The lab then walks through building an annotation interface with Label Studio’s domain-specific language, starting from an OCR template and customizing it to match the desired output: annotators mark each line of handwritten text. The interface supports zooming, rotating, and precise region placement via draggable controls. A key part of the workflow is UI debugging and ambiguity resolution: spending time annotating a few forms end-to-end surfaces edge cases such as whether bounding boxes should tightly enclose individual letters (even if that means overlapping neighboring lines), how much rotation to apply to follow the text baseline, and whether to correct misspellings. The lab’s guidance is explicit: annotators should make a best effort to capture the letters actually present in the handwriting, not “correct” spelling or substitute the printed prompt for uncertain handwriting.
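Label Studio’s labeling interface is declared in an XML-like tagging language. Below is a hedged sketch of an OCR-style config adapted for line-level handwriting annotation, written as a Python string so it can be saved or pasted into the project settings; the tag and attribute names follow Label Studio’s documented OCR template, but the lab’s exact config may differ:

```python
# A hedged sketch of a Label Studio labeling config for line-level handwriting
# annotation, adapted from the stock OCR template; the lab's real config may differ.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$image" zoom="true" zoomControl="true" rotateControl="true"/>
  <Labels name="label" toName="image">
    <Label value="handwriting" background="green"/>
  </Labels>
  <!-- Rectangles only: the Polygon tag from the stock OCR template is omitted
       because downstream preprocessing assumes rectangular regions. -->
  <Rectangle name="bbox" toName="image" strokeWidth="3"/>
  <TextArea name="transcription" toName="image"
            perRegion="true" editable="true" required="true"
            placeholder="Best-effort transcription of this line"
            displayMode="region-list"/>
</View>
"""

# Save it so it can be pasted or uploaded in the project's Labeling Interface settings.
with open("label_config.xml", "w") as f:
    f.write(LABEL_CONFIG)
```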
Finally, the lab highlights a practical constraint: if downstream data handling expects rectangular regions, polygon selectors should be removed from the UI even if polygon precision seems useful. It also recommends writing clear instructions inside Label Studio and testing them with real annotation passes. The exercise ends with a teardown step that shuts down the Label Studio service and returns the environment to the model-development setup for the next lab, where trained models are deployed into production.
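One reason the rectangle-only constraint matters downstream: Label Studio exports rectangle regions as percentage coordinates (x, y, width, height, plus rotation), which map directly onto a simple crop in preprocessing. A hedged sketch of that conversion, ignoring rotation for brevity:

```python
from PIL import Image

def crop_region(page: Image.Image, value: dict) -> Image.Image:
    """Cut one exported rectangle region out of a full page image.

    `value` is the `value` dict of a Label Studio rectangle result, where
    x, y, width, and height are percentages of the image size. Rotation is
    ignored in this sketch.
    """
    w, h = page.size
    left = value["x"] / 100 * w
    top = value["y"] / 100 * h
    right = left + value["width"] / 100 * w
    bottom = top + value["height"] / 100 * h
    return page.crop((int(left), int(top), int(right), int(bottom)))
```

A polygon region, by contrast, would arrive as an arbitrary list of points and could not be passed straight to a crop like this, which is why the lab drops the polygon selector from the UI.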
Cornell Notes
The lab treats data annotation as an end-to-end pipeline requirement, not a clerical step. It argues that labeling should be rich—often at line/word/character level—even if the final model target is coarser (like full paragraphs), because that detail enables synthetic paragraph generation by recombining labeled lines. For producing those labels from raw scans, it uses Label Studio running as a local web service, exposed via ngrok, and fed by a CSV manifest of image URLs. Annotators are guided to mark handwritten text lines with best-effort letter capture, avoiding spelling correction or substituting the printed prompt. The setup also stresses UI debugging and alignment with downstream assumptions, such as removing polygon region tools when the training pipeline expects rectangles.
- Why does the lab emphasize labeling at line/word/character granularity when the final task may be a paragraph-level output?
- What is the practical difference between collecting handwritten scans and producing the training-ready dataset?
- How does Label Studio get the data it needs to annotate in this lab setup?
- What kinds of annotation ambiguities does the lab ask annotators (and developers) to resolve?
- Why remove polygon region selection even if it seems more precise?
- Why use ngrok in this workflow?
Review Questions
- What labeling granularity choices enable synthetic data generation in this lab, and how does that affect model training when data is scarce?
- How do the lab’s annotation instructions prevent annotators from “correcting” handwriting using the printed prompt?
- What mismatch can occur if the annotation UI allows polygon regions but the preprocessing/training code expects rectangles?
Key Points
1. Rich annotations at line/word/character level support synthetic paragraph generation, even when the final target is paragraph-level.
2. Synthetic data is positioned as a practical early-stage strategy for bootstrapping ML systems under data scarcity.
3. Raw scanned images are easy to collect; high-quality annotations require a dedicated workflow and careful UI configuration.
4. Label Studio is used to create structured annotation tasks from a CSV manifest of image URLs, with each row treated as a task.
5. ngrok enables access to a locally running Label Studio service from a public URL without manual port-forwarding.
6. Annotation quality depends on resolving ambiguities (bounding precision, rotation, and transcription rules) through clear instructions and test passes.
7. Annotation tools must match downstream preprocessing assumptions, such as using rectangular regions when training code expects rectangles.