Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)

The Full Stack
6 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Deep learning engineering time often concentrates on dataset construction and data flow, not GPU training, so pipeline reliability matters as much as model design.

Briefing

Data management is where most deep learning projects quietly win or fail: getting messy, distributed inputs into a GPU-ready training pipeline, and keeping that pipeline reliable over time, often consumes more engineering effort than model design or GPU time. The lecture frames data work as both a performance lever and a complexity trap. Cleaning, moving, labeling, and versioning data can take up the majority of a machine learning team’s time, and the “data flow” between systems is frequently the hardest part.

A central message is that adding data (or augmenting existing data) is often the most cost-effective way to improve model performance. Instead of chasing new architectures or running exhaustive hyperparameter searches, teams should look for ways to expand the dataset, then use augmentation as baseline “table stakes.” Augmentation is treated as a practical engineering loop: CPU-side workers generate transformed samples (cropping, masking, pixelation, rotations for images; temporal cropping and speed changes for video; noise injection and masking for sequences) while the GPU trains in parallel on the augmented stream.
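To make that loop concrete, here is a minimal PyTorch-style sketch of the pattern described above, assuming torchvision transforms and a stand-in dataset; the lecture does not prescribe a specific framework.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Random transforms run per sample on CPU worker processes.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),      # cropping
    transforms.RandomRotation(degrees=15),  # rotation
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),        # masking/blanking a patch
])

# FakeData stands in for a real image dataset here.
dataset = datasets.FakeData(size=1000, image_size=(3, 256, 256), transform=augment)

# num_workers > 0 means augmentation happens in parallel CPU processes
# while the GPU trains on previously prepared batches.
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, labels in loader:
    images = images.to(device, non_blocking=True)
    # ... forward/backward pass on the augmented batch ...
```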

The lecture then breaks down where data comes from and why proprietary datasets usually matter. Deep learning often relies on proprietary sources—images, text, logs, and database records—because public datasets rarely provide lasting competitive advantage. Even when public data helps as a starting point, real differentiation typically comes from unique labels, domain-specific data collection, and product-driven feedback. The “data flywheel” concept highlights how deploying an early model can improve future training by turning user interactions into cleaner or more informative data.

To reduce labeling costs, the lecture emphasizes semi-supervised learning: reformulate tasks so the data supervises itself (predict future words, reconstruct sentence segments, or learn whether two sentences belong together). It cites Facebook AI Research’s SEER model, which trained on a billion random unlabeled images to reach state-of-the-art top-1 accuracy on ImageNet, using publicly released loss functions and training libraries.

Next comes the storage and systems layer. The lecture distinguishes file systems (fastest, foundational), object storage like Amazon S3 (API-based, durable, versionable, good for cloud workflows), and databases for structured, repeatedly accessed metadata. It contrasts OLTP databases (fast because data is effectively held in memory with disk persistence) with OLAP-style data warehouses for analysis, and introduces ETL as the classic extract-transform-load pipeline. Data lakes and “lakehouse” approaches (e.g., Delta Lake within Databricks’ vision) aim to store raw and processed data together—structured, semi-structured, and unstructured—then transform it later for training or analytics.

Finally, the lecture treats orchestration, labeling, and versioning as the glue that makes data pipelines dependable. Workflow managers like Airflow define DAGs of dependent tasks across SQL, Python, and other steps; newer alternatives include Prefect, dbt, and Dask/Beam-style processing. Labeling is framed as a quality problem: clear guidelines, training annotators, and choosing between hiring, crowdsourcing, or specialized labeling companies. Tools such as Label Studio, plus approaches like weak supervision (Snorkel), help scale annotation while controlling quality.

The closing sections stress versioning and privacy. Without data versioning, deployed models can’t be reliably reproduced or rolled back because models are partly “code” and partly “data.” The lecture outlines a progression from snapshotting to Git LFS-based dataset manifests, then to specialized tools like DVC and Dolt. It ends with privacy concerns—especially in healthcare—pointing to federated learning, differential privacy, and encrypted-data training as active research areas. The overall takeaway: deep learning success depends on engineering data pipelines that are fast, simple enough to maintain, and rigorous enough to reproduce.

Cornell Notes

Deep learning performance and reliability often hinge less on model architecture than on data management: sourcing data, transforming it into GPU-ready formats, storing it efficiently, labeling it consistently, and versioning it so results can be reproduced. The lecture argues that teams should prioritize data flow and spend far more time “becoming one with the data” than over-optimizing training code. It recommends using augmentation and semi-supervised/self-supervised formulations to reduce labeling costs, and it highlights data flywheels where deployed models improve future training data. On the systems side, it maps file systems, Amazon S3 object storage, databases, data warehouses, and data lakes/lakehouses to different access patterns. Finally, it emphasizes orchestration (Airflow/Prefect), annotation tooling (Label Studio, Snorkel), dataset versioning (Git LFS, DVC), and privacy approaches like federated learning and differential privacy.

Why does data flow often dominate engineering effort in deep learning projects?

The lecture cites common industry observations: junior ML failures frequently come from not investing in dataset construction, and teams often spend most time cleaning and moving data rather than running GPU training. A key reason is that deep learning requires data to be in a trainable format near GPUs, even when sources are scattered across systems—S3 URLs for images, file-system text, distributed processing outputs (e.g., Spark data frames), or logs/records stored in warehouses like Snowflake. Each project/company ends up with a unique path to move and transform data into the final pipeline.
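As a rough illustration of that “unique path,” the sketch below joins label metadata from a warehouse-style CSV export with image bytes fetched from S3 at access time; the bucket name, column names, and file layout are hypothetical.

```python
import io

import boto3
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class S3ImageDataset(Dataset):
    """Joins warehouse metadata (labels) with image bytes stored in S3."""

    def __init__(self, manifest_csv, transform=None):
        # e.g. a Snowflake/Spark export with columns: s3_key, label
        self.rows = pd.read_csv(manifest_csv)
        self.s3 = boto3.client("s3")
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows.iloc[idx]
        obj = self.s3.get_object(Bucket="my-training-data",  # hypothetical bucket
                                 Key=row["s3_key"])
        image = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, row["label"]
```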

What are the most practical ways to improve model performance without changing the model architecture?

The lecture’s main performance lever is data: add more data when possible, and treat data augmentation as baseline practice. Augmentation is implemented so that CPU workers generate transformed samples while GPUs train in parallel. For images, examples include cropping, masking/blanking patches, pixelation, inversion, and rotation; for tabular data, missingness can be simulated by blanking cells (see the sketch after this paragraph); for text, synonym substitution and word-order changes; for speech/video, temporal cropping and speed changes plus noise injection and masking. It also highlights semi-supervised learning, where the data supervises itself (e.g., predicting future words from past words, or whether two sentences belong together).
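A minimal numpy sketch of the tabular case, assuming the goal is simply to simulate missingness by blanking random cells:

```python
import numpy as np

def blank_cells(batch, blank_prob=0.1, rng=None):
    """Randomly set a fraction of cells to NaN to simulate missing data."""
    rng = rng or np.random.default_rng()
    mask = rng.random(batch.shape) < blank_prob
    augmented = batch.astype(float).copy()
    augmented[mask] = np.nan
    return augmented

batch = np.arange(12.0).reshape(3, 4)  # toy 3x4 tabular batch
print(blank_cells(batch, blank_prob=0.25))
```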

How do semi-supervised/self-supervised approaches reduce the need for manual labeling?

Instead of labeling, the task is reformulated so parts of the same dataset provide supervision. Text examples include predicting future words from past words (sentence completion), predicting the beginning from the end, or predicting a middle word from surrounding context; another example is learning whether two sentences occur in the same paragraph. The lecture notes that this idea extends to vision, citing Facebook AI Research’s SEER model, trained on a billion random unlabeled images (contrasted with ImageNet’s roughly one million labeled images) to reach state-of-the-art top-1 accuracy on ImageNet, using publicly released self-supervised loss functions and training libraries.
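The sentence-pair task needs no human labels at all; the sketch below derives (sentence, sentence, label) examples directly from raw paragraphs. The pairing scheme is illustrative, not the lecture’s exact recipe.

```python
import random

def make_pairs(paragraphs, n_pairs=4, seed=0):
    """Build self-supervised pairs: label 1 if both sentences share a paragraph."""
    rng = random.Random(seed)
    sentences = [(i, s) for i, para in enumerate(paragraphs) for s in para]
    pairs = []
    for _ in range(n_pairs):
        (i, a), (j, b) = rng.sample(sentences, 2)
        pairs.append((a, b, int(i == j)))  # the label comes from the data itself
    return pairs

paragraphs = [
    ["The pipeline failed overnight.", "Retries fixed it by morning."],
    ["S3 stores the raw images.", "Metadata lives in the database."],
]
for a, b, label in make_pairs(paragraphs):
    print(label, "|", a, "|", b)
```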

Which storage layer should hold which kind of data, and why?

The lecture gives a rule-of-thumb mapping. File systems are the fastest and foundational for local training data. Object storage like Amazon S3 is an API layer over files, offering durability, elasticity, and versioning; it’s suitable for binary assets and cloud workflows. Databases store structured metadata that must be accessed repeatedly (e.g., labels, user attributes), while logs are treated more like structured data that’s stored “just in case.” For analytics, data warehouses support OLAP-style querying, and ETL pipelines move data into a common schema. Data lakes and lakehouses (e.g., Delta Lake in Databricks’ vision) store raw and processed data together so transformations can happen later for both analytics and ML.
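As a sketch of the object-storage pattern, here is how S3 versioning looks through boto3; the bucket and object names are hypothetical, and versioning must be enabled before versions are tracked.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-ml-datasets"  # hypothetical bucket

# Enable versioning so overwritten objects retain prior versions.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Store a binary asset; S3 returns a VersionId once versioning is on.
with open("cat_001.jpg", "rb") as f:
    resp = s3.put_object(Bucket=bucket, Key="images/cat_001.jpg", Body=f)
print("stored version:", resp.get("VersionId"))

# Later, fetch that exact version for reproducibility.
obj = s3.get_object(Bucket=bucket, Key="images/cat_001.jpg",
                    VersionId=resp["VersionId"])
```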

How do modern pipelines coordinate many dependent data-processing steps?

The lecture emphasizes orchestration via DAGs. Airflow (from Airbnb) is presented as a workflow manager that defines a directed acyclic graph of tasks with dependencies, where tasks can be SQL operations, Python programs, or other steps. A workflow manager assigns tasks to workers, handles failures/retries, and triggers downstream work when upstream tasks complete. Alternatives mentioned include Prefect (Python-first orchestration with hosted execution), dbt (SQL-centric analytics engineering), and other processing frameworks like Apache Beam and TensorFlow Datasets pipelines on Google Cloud Dataflow.
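A minimal Airflow sketch of the idea, with placeholder tasks; the DAG id, schedule, and task bodies are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull new rows from the warehouse")

def transform():
    print("clean and reshape into training format")

def load():
    print("write the training-ready dataset to S3")

with DAG(
    dag_id="daily_training_data",  # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependency edges: downstream tasks run only after upstream ones succeed.
    t_extract >> t_transform >> t_load
```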

What does “data versioning” mean for ML, and why is it more than code versioning?

Because models depend on both code and the data they were trained on, versioning only the code makes deployments irreproducible and prevents reliable rollback. The lecture outlines levels: (1) snapshotting everything (works but feels hacky), (2) versioning data as assets plus code—store large files in S3 with unique IDs and keep dataset manifests/metadata in the code repo (optionally using Git LFS), and (3) specialized tools like DVC for provenance tracking and Dolt as “git for SQL.” The goal is to ensure a deployed model can be traced back to the exact dataset version used.
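A rough Python sketch of the “level 2” approach, assuming content-addressed uploads to S3 and a small JSON manifest committed with the code; bucket and file names are hypothetical, and tools like DVC automate this pattern.

```python
import hashlib
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-dataset-store"  # hypothetical bucket

def snapshot(paths, manifest_path="dataset.manifest.json"):
    """Upload files under content-derived IDs and write a small manifest."""
    manifest = {}
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha256(data).hexdigest()
        key = f"datasets/{digest}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)  # unique ID in S3
        manifest[path] = {"sha256": digest, "s3_key": key}
    # Commit this small file alongside the training code (optionally via Git LFS).
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

snapshot(["train.csv", "labels.csv"])
```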

Review Questions

  1. If a team can’t afford large-scale manual labeling, what combination of techniques from the lecture could reduce labeling needs while preserving accuracy?
  2. Match each storage option (file system, Amazon S3, database, data warehouse, data lake/lakehouse) to a typical data type and access pattern described in the lecture.
  3. Why can two models with identical code still behave differently after deployment, and how does dataset versioning address that?

Key Points

  1. Deep learning engineering time often concentrates on dataset construction and data flow, not GPU training, so pipeline reliability matters as much as model design.

  2. Adding data and using augmentation are usually higher-leverage than changing architectures or running large hyperparameter searches.

  3. Semi-supervised/self-supervised learning can replace manual labels by reformulating tasks so the dataset supervises itself (e.g., predicting masked/next tokens or sentence relationships).

  4. Storage choices should follow access patterns: file systems for fast local training, Amazon S3 for durable cloud assets, databases for repeatedly accessed metadata, and data warehouses/lakehouses for analytics and later transformations.

  5. Orchestrate multi-step pipelines with DAG-based workflow managers (e.g., Airflow/Prefect) so dependent tasks run automatically and failures are handled systematically.

  6. Annotation quality depends on guidelines, annotator training, and quality assurance; tools like Label Studio and methods like weak supervision (Snorkel) can scale labeling while controlling consistency.

  7. Model reproducibility requires dataset versioning; code-only versioning is insufficient because deployed models implicitly depend on the exact training data snapshot.

Highlights

A recurring theme is that “data flow” complexity—moving and transforming inputs into GPU-ready formats—often outweighs the complexity of the training step itself.
Semi-supervised learning is treated as a practical cost reducer, with SEER cited as training on a billion unlabeled random images to achieve strong ImageNet top-1 results.
Amazon S3 is positioned as a durable, versionable object layer that fits cloud workflows, while databases are reserved for structured metadata accessed repeatedly.
Airflow-style DAG orchestration turns dependent data-processing steps into an automated system that triggers downstream work when upstream outputs finish.
Dataset versioning is framed as essential for rollback and reproducibility because models are partly “data,” not just code.

Topics

Mentioned

  • Josh Tobin
  • S3
  • GPU
  • GANs
  • GPT-3
  • SEER
  • TF
  • CPU
  • OLTP
  • ETL
  • OLAP
  • SQL
  • DAG
  • NFS
  • HDFS
  • OCR
  • tf.data
  • DVC
  • LFS
  • JSON
  • T5
  • Beam
  • Dask
  • CUDA