Lecture 6: Data Management - Full Stack Deep Learning - March 2019
Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Data management in deep learning is less about model math and more about building a reliable pipeline for labels, storage, versioning, and retraining—because most real-world complexity comes from inputs, not training. Supervised learning remains the workhorse for most companies, and competitive advantage often hinges on proprietary labeled data. Public datasets can get a product working, but they rarely create lasting differentiation; the usual path is to ship quickly, then gather and label your own data at scale—often through a “data flywheel” where user interactions generate new training examples.
Labeling is portrayed as both unavoidable and strategically important. Many applications require large volumes of labeled data, and the bottleneck is often tedious preprocessing and input preparation, which can saturate CPUs even before the GPUs are fully utilized. The lecture argues that companies should treat labeling as a system: start with a small labeled set, release a product, and use user feedback to improve labels over time. Google Photos is used as an example of how user-provided signals (like face-matching suggestions and optional confirmations) can expand labeled datasets by orders of magnitude without relying on large internal annotation teams. When user labeling isn't feasible early on, synthetic training data is presented as an underrated accelerator: Dropbox's OCR progress is cited as an example where generating realistic images of fake text (with transformations like rotation and font variation) helped bootstrap an accurate OCR system before the user-driven flywheel took over.
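As a rough illustration of the synthetic-data idea, the sketch below renders random strings onto images with varying fonts and rotations, yielding labeled examples essentially for free. It assumes Pillow is installed and that the listed font files exist locally; all names and parameter ranges are illustrative, not Dropbox's actual pipeline.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Sketch of synthetic OCR data generation: render fake text onto an image,
# then apply simple perturbations (rotation, font/size variation).
# Font paths and parameter ranges are assumptions, not the lecture's values.

FONTS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]  # assumed available on this machine
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

def make_sample(text_len=8, size=(200, 50)):
    text = "".join(random.choice(CHARS) for _ in range(text_len))
    img = Image.new("L", size, color=255)              # white grayscale canvas
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(18, 28))
    draw.text((10, 10), text, fill=0, font=font)       # black text
    img = img.rotate(random.uniform(-5, 5), expand=False, fillcolor=255)
    return img, text                                   # the image plus its label, for free

if __name__ == "__main__":
    image, label = make_sample()
    image.save(f"sample_{label}.png")
```

Because the generator knows the text it drew, every image arrives pre-labeled, which is exactly what makes synthetic data attractive before a flywheel exists.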
Once labeling is treated as a continuous process, the next challenge becomes how to store and manage the resulting data. Heavy binary assets (images, audio, compressed text) belong in object storage such as Amazon S3, while structured, frequently queried information fits databases. The lecture distinguishes storage layers: file systems (including distributed variants like HDFS-style setups), object storage (API-based, parallel-friendly, versionable), databases (fast structured lookups, with rows and columns; Postgres is recommended as a default for most cases), and data lakes (schema-on-read aggregation of logs and outputs from multiple sources). For deep learning training, data should be as local as possible—ideally cached in memory or on local/distributed file systems—because reading from remote sources can bottleneck GPU training.
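To make the storage split concrete, here is a minimal sketch (assuming boto3 and psycopg2 are installed; the bucket name, table schema, and connection details are placeholders) that sends a binary asset to S3 and its queryable metadata to Postgres:

```python
import boto3
import psycopg2

# Sketch of the storage split described above: heavy binary assets go to
# object storage (S3), while small, frequently queried metadata goes to a
# relational database (Postgres). All names and credentials are illustrative.

s3 = boto3.client("s3")
BUCKET = "my-training-data"  # assumed bucket

def store_example(image_path: str, label: str, conn) -> str:
    key = f"images/{image_path.split('/')[-1]}"
    s3.upload_file(image_path, BUCKET, key)            # binary asset -> object storage
    with conn.cursor() as cur:                         # structured metadata -> database
        cur.execute(
            "INSERT INTO examples (s3_key, label) VALUES (%s, %s)",
            (key, label),
        )
    conn.commit()
    return key

if __name__ == "__main__":
    conn = psycopg2.connect(dbname="ml", user="ml", password="...", host="localhost")
    store_example("sample_abc.png", "abc", conn)
```

The database row stays tiny and fast to query, while the bytes themselves live behind an API that parallel training workers can read from (or, better, pre-fetch to local disk).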
Versioning is framed as a prerequisite for trustworthy deployment. Models are part code and part data, so retraining and rollback require knowing exactly which dataset version produced a given model. The lecture outlines a progression from no versioning (dangerous when deployments need rollback) to snapshotting datasets, to a more robust approach where code commits reference immutable data identifiers stored in object storage. Git LFS is mentioned as a way to handle large dataset metadata files without bloating version control. Specialized data versioning tools (DVC, Pachyderm, and Quilt, among others) are presented as options once simpler approaches aren't enough.
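A minimal sketch of the identifier-based approach, assuming the dataset lives as files on disk: hash every file's contents and commit the resulting manifest alongside the training code, so a deployed model can be traced back to the exact bytes it saw. The manifest name and layout here are assumptions, not any tool's prescribed format.

```python
import hashlib
import json
from pathlib import Path

# Sketch of identifier-based data versioning: record an immutable content
# hash for every dataset file. Committing this manifest with the code ties
# each model to an exact, reproducible data version.

def dataset_manifest(data_dir: str) -> dict:
    manifest = {}
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            # For very large files you would hash in chunks; read_bytes keeps the sketch short.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path)] = digest               # immutable content identifier
    return manifest

if __name__ == "__main__":
    manifest = dataset_manifest("data/train")
    Path("data_manifest.json").write_text(json.dumps(manifest, indent=2))
    # Commit data_manifest.json with the training code; the raw files can live
    # in object storage under keys derived from these hashes.
```

Rolling back a model then means checking out the old commit and re-fetching the files named in its manifest, rather than hoping the data on disk still matches.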
Finally, data workflows tie everything together. Training pipelines often depend on upstream tasks—extracting features from logs, running microservices to compute classifier outputs, and assembling metadata—forming a directed acyclic graph of dependencies. Tools like Luigi and Airflow are introduced as workflow managers that orchestrate tasks and distribute work via queues (e.g., RabbitMQ), ensuring the right computations happen in the right order before nightly retraining.
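As a sketch of what such a DAG looks like in Luigi, the example below declares a training task that depends on an upstream feature-extraction task; Luigi runs TrainModel only once the required output exists. Task names, file paths, and the computations themselves are placeholders, not the lecture's pipeline.

```python
import luigi

# Sketch of a two-node Luigi DAG mirroring the dependency structure described:
# feature extraction must complete before training runs.

class ExtractFeatures(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"features/{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("user_id,feature\n")               # placeholder for real extraction

class TrainModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractFeatures(date=self.date)          # upstream edge in the DAG

    def output(self):
        return luigi.LocalTarget(f"models/{self.date}.txt")

    def run(self):
        with self.input().open() as f:
            features = f.read()                         # consume upstream output
        with self.output().open("w") as f:
            f.write("trained-model-placeholder")

# Run with: luigi --module this_module TrainModel --date 2019-03-01 --local-scheduler
```

Because each task declares its inputs and outputs, the scheduler can skip work that is already done and rerun only the stale parts of the graph before the nightly training job.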
Cornell Notes
Supervised deep learning still dominates most real applications, and lasting advantage usually comes from proprietary labeled data. A practical “data flywheel” approach—ship a product, then use user interactions to generate new labels—can scale labeling far beyond what internal annotators alone can achieve. To make retraining reliable, data must be stored in the right layer (object storage for binaries like Amazon S3, databases for structured metadata, data lakes for aggregated logs with schema-on-read) and kept local enough to avoid bottlenecks during training. Versioning is essential because models depend on both code and data; dataset snapshots or identifier-based references (often with Git LFS for large files) enable rollback and reproducibility. Workflow managers like Luigi and Airflow orchestrate dependent tasks as DAGs so nightly training can assemble features and classifier outputs in the correct order.
Why does supervised learning—and labeled data—remain the practical default in deep learning companies?
What is the “data flywheel,” and how does it reduce dependence on manual annotation?
How can synthetic data help when labeling is too slow or expensive at the beginning?
How should different kinds of data be stored for deep learning training and product operations?
Why is dataset versioning necessary for deploying and rolling back ML models?
What does a “data workflow” mean, and why does it often require DAG orchestration?
Review Questions
- What makes proprietary labeled data more strategically valuable than relying solely on public datasets?
- How do object storage, databases, and data lakes differ in what they store and how they’re queried?
- What are the risks of training and deploying without dataset versioning, and how does identifier-based versioning mitigate them?
Key Points
1. Supervised learning remains the practical workhorse, and proprietary labeled data is often the main competitive advantage.
2. A data flywheel—shipping a product and using user interactions to generate labels—can scale labeling without proportional growth in internal annotators.
3. Synthetic training data can bootstrap early model quality when manual labeling is too slow or expensive.
4. Store binary assets in object storage (e.g., Amazon S3), structured metadata in databases (e.g., Postgres), and aggregated raw inputs in data lakes with schema-on-read.
5. Keep training inputs as local as possible (memory/local/distributed file systems) to avoid remote I/O bottlenecks.
6. Version datasets so each deployed model can be traced to the exact data version used for training, enabling reliable retraining and rollback.
7. Use workflow managers (e.g., Luigi, Airflow) to orchestrate dependent data-processing tasks as DAGs, often with queues like RabbitMQ for distribution.