Lecture 04: Data Management (FSDL 2022)
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
Briefing
Data management is the hidden driver of machine-learning performance: spending far more time on data than on models—especially on dataset quality, labeling, and repeatable preprocessing—often yields the biggest gains. The core message is practical: explore your data aggressively (roughly an order of magnitude more time than you’d spend on model tinkering), then improve outcomes by fixing, adding, or augmenting the training data rather than chasing architecture changes.
The lecture breaks data work into a pipeline of decisions, starting with where data comes from and how it lands near the GPU. Sources range from images and text files to logs and database records, but training typically requires copying data onto a local file system or fast storage close to the compute. The fundamentals are file systems, object storage, and databases. A file system treats data as unversioned files that can be overwritten or deleted, with performance varying dramatically—from slow spinning disks to fast NVMe SSDs. Object storage (e.g., S3) shifts the abstraction from “files” to “objects,” typically adding versioning and redundancy at the service level, trading some speed for durability and scalability. Databases handle structured metadata and relationships; binary payloads (like images or audio) should live in object storage, while databases store URLs or references to those binaries.
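A minimal sketch of that split, assuming boto3 with AWS credentials already configured, a hypothetical `training-images` bucket, and SQLite standing in for the metadata database: the image bytes go to object storage, and the database row stores only the label and the object URL.

```python
import sqlite3

import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
BUCKET = "training-images"  # hypothetical bucket name

db = sqlite3.connect("metadata.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, label TEXT, s3_url TEXT)"
)

def ingest(path: str, label: str) -> None:
    """Store the binary payload in object storage; keep only a reference in the DB."""
    key = f"raw/{path.rsplit('/', 1)[-1]}"
    s3.upload_file(path, BUCKET, key)        # the image bytes live in S3...
    db.execute(
        "INSERT INTO images (label, s3_url) VALUES (?, ?)",
        (label, f"s3://{BUCKET}/{key}"),     # ...the database stores only the URL
    )
    db.commit()

ingest("cat_001.jpg", "cat")  # hypothetical local file
```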
The talk emphasizes that storage choices should match access patterns. For analytics, data warehouses use OLAP-style processing: column-oriented layouts that compress well and speed up queries like “average comment length over the last 30 days.” For transactional workloads, OLTP systems are row-oriented. Data lakes sit alongside these systems for unstructured aggregation, often using ELT-style flows—extract and load first, transform later—while the industry trend moves toward unifying structured and unstructured data.
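To make the column-oriented idea concrete, here is a small pandas sketch with made-up comment data (assumes pyarrow is installed for Parquet support): writing to Parquet lays each column out contiguously, and the example query only ever touches the `created_at` and `body` columns, never the rest of the row.

```python
import pandas as pd

# Toy comments table; a warehouse would hold millions of rows like these.
comments = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [10, 11, 10],
    "created_at": pd.to_datetime(["2022-08-01", "2022-08-15", "2022-05-01"]),
    "body": ["great video", "thanks!", "old comment"],
})
comments.to_parquet("comments.parquet")  # column-oriented on disk, compresses well

# "Average comment length over the last 30 days" (fixed reference date for toy data):
cutoff = pd.Timestamp("2022-08-31") - pd.Timedelta(days=30)
recent = comments[comments["created_at"] >= cutoff]
print(recent["body"].str.len().mean())
```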
Once data is stored, the “language of data” is mostly SQL, with Python data frames (especially pandas) as the workhorse for code-based manipulation. When performance matters, the lecture points to accelerated alternatives: parallelized data-frame libraries and GPU-focused tools like RAPIDS.
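As a sketch of the data-frame workhorse role (reusing the toy Parquet file from above): the same transformation code often ports to RAPIDS cuDF with little change, since cuDF mirrors much of the pandas API, though that port is an assumption to verify per version.

```python
import pandas as pd

df = pd.read_parquet("comments.parquet")
per_user = (
    df.assign(length=df["body"].str.len())   # derived feature: comment length
      .groupby("user_id")["length"]
      .mean()
)
print(per_user)

# On a CUDA GPU with RAPIDS installed, lines like these typically run
# under cuDF by swapping the import (version-dependent; verify):
# import cudf as pd
```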
Operationally, data processing becomes a scheduling problem. A motivating example describes a nightly training job for a photo popularity predictor, where outputs depend on database queries, log-derived feature computation, and running classifiers. Workflow managers like Airflow can express these dependencies as DAGs, restart failed tasks, and distribute work across machines; Prefect and Dagster are presented as modern alternatives. The lecture also warns against over-engineering: sometimes a well-written parallel Unix pipeline beats a heavyweight distributed framework.
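A minimal sketch of that nightly job as an Airflow DAG, assuming Airflow 2.4+ with the TaskFlow API and placeholder task bodies; the task names mirror the lecture's example, and the `retries` setting lets a failed step restart without rerunning the whole graph.

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def nightly_photo_popularity():
    @task
    def query_database() -> str:
        return "photos.parquet"        # placeholder: dump recent photos from the DB

    @task
    def compute_log_features(photos_path: str) -> str:
        return "features.parquet"      # placeholder: derive features from service logs

    @task(retries=2)                   # a failed step restarts, not the whole DAG
    def run_classifiers(features_path: str) -> None:
        ...                            # placeholder: score photos, write training set

    run_classifiers(compute_log_features(query_database()))

nightly_photo_popularity()
```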
For training consistency and efficiency, feature stores help ensure that the same feature engineering logic used offline during training matches what’s served online during inference, while avoiding recomputation. The lecture then moves to concrete dataset sources (notably Hugging Face datasets and common formats like Parquet, plus image-text datasets and speech corpora) and to labeling strategies.
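Pulling a ready-made dataset takes a few lines; a sketch using the public `imdb` dataset from the Hugging Face hub (many hub datasets are stored as Parquet under the hood):

```python
from datasets import load_dataset  # pip install datasets

# Download (and cache) a public labeled dataset from the Hugging Face hub.
ds = load_dataset("imdb", split="train")
print(len(ds), ds[0]["label"], ds[0]["text"][:60])

# For very large corpora, streaming avoids downloading everything up front.
stream = load_dataset("imdb", split="train", streaming=True)
first = next(iter(stream))
```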
Labeling is treated as a spectrum: self-supervised learning can reduce or eliminate manual labels; data augmentation can sometimes substitute for labels; synthetic data can provide “free” ground truth; and user feedback can create a data flywheel. When labels are necessary, quality depends on clear rulebooks, careful quality assurance, and choosing labeling tools or services (Scale, Labelbox, Label Studio, and others). Finally, data versioning is framed as essential for reproducibility: unversioned data makes deployed models effectively unversioned, while snapshotting, git-style versioning with Git LFS, and DVC provide increasing levels of rigor.
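As one concrete point on that spectrum, a short augmentation sketch (assuming Pillow and torchvision, and reusing the hypothetical `cat_001.jpg` from the storage example): each call yields a new random view of the same labeled image, effectively multiplying a small labeled set.

```python
from PIL import Image
import torchvision.transforms as T

# Random views of one labeled image: crop, flip, and color jitter.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

img = Image.open("cat_001.jpg")      # hypothetical file from the storage sketch
x1, x2 = augment(img), augment(img)  # two different tensors, same label
```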
The lecture closes by noting that privacy-preserving training—federated learning, differential privacy, and learning on encrypted data—remains an active research area without widely reliable off-the-shelf solutions. The overall takeaway is straightforward: treat data as a first-class engineering asset, and performance improvements will follow.
Cornell Notes
Machine-learning gains often come less from model tweaks and more from data work: deeper exploration, better dataset construction, and consistent preprocessing. Data should be stored and accessed according to workload—binary payloads in object storage, metadata in databases, and analytics in column-oriented warehouses or lakes for unstructured aggregation. After storage comes data manipulation (SQL and pandas, with accelerated options when needed) and reliable orchestration (DAG-based workflow tools) so dependent preprocessing steps run correctly at scale. For training efficiency and consistency, feature stores align offline training features with online inference. Labeling and data versioning complete the loop: use self-supervision and augmentation when possible, apply clear labeling rules with quality control when not, and version datasets so model performance can be reproduced and rolled back.
- Why does the lecture treat data exploration as a performance lever, not a preliminary chore?
- How should binary data, metadata, and relationships be stored for training?
- What distinguishes OLAP from OLTP, and why does that matter for choosing storage systems?
- What role do workflow managers play in data preprocessing for training?
- When is a feature store worth using?
- How does the lecture recommend approaching labeling and data versioning together?
Review Questions
- What storage choice would you make for large binary files versus labels and how would you connect them during training?
- Describe how a DAG-based workflow manager helps ensure correctness in a multi-step dataset build for nightly training.
- What problems arise when training data and labels are not versioned, and what tools or approaches can mitigate them?
Key Points
1. Treat dataset work—exploration, cleaning, augmentation, and labeling—as the primary path to performance gains, not just model architecture changes.
2. Plan storage around access patterns: keep binary payloads in object storage and store references/metadata in a database.
3. Use SQL and data-frame workflows for structured data, and switch to accelerated options (parallel or GPU-based) when pandas becomes a bottleneck.
4. Make preprocessing repeatable with workflow orchestration (DAGs) so dependent steps run in the right order and failures are recoverable.
5. Use feature stores when offline training features must match online inference and when recomputation costs are high.
6. Adopt a labeling strategy that starts with self-supervision and augmentation, then adds manual labeling with strict rulebooks and quality assurance when necessary.
7. Version datasets (not just code) so model performance can be reproduced and rolled back as data changes.