
Versioning (5) - Data Management - Full Stack Deep Learning

The Full Stack · 5 min read

Based on The Full Stack's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Treat training data as a first-class dependency: deployed ML artifacts depend on both code and the exact dataset used for training.

Briefing

Versioning in machine learning isn’t just about saving model code—it’s about making the trained artifact reproducible by tracking the exact data used to train it. Without data versioning, teams eventually lose the ability to recreate a previously working model: data changes after deployment, but the system only remembers the code hash, not the dataset state that produced the model. The result is a slow drift into irreproducible performance, where retraining may no longer recover the earlier “known-good” model because the original training inputs can’t be reconstructed.

A practical way to think about versioning is as four levels. Level 0 is unversioned data: datasets live only in a file system, S3, or a database, and training runs on whatever happens to be there at the time. Level 1 adds a snapshot: store a hash or snapshot of the training data alongside the code version so the deployed artifact can be tied to a specific dataset state. Done crudely, though, a "snapshot everything" approach becomes hacky, especially if it doesn't capture the full code bundle or is too heavy to manage.
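
As a minimal sketch of the snapshot idea, the helper below (names are hypothetical) hashes every file under a data directory into a single digest that can be recorded next to the code's commit hash:

```python
import hashlib
from pathlib import Path

def snapshot_hash(data_dir: str) -> str:
    """Hash every file under data_dir (relative path + contents) into one digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())  # stream in chunks for very large files
    return digest.hexdigest()

# Record the result next to the code version, e.g. in a training log:
# {"code": "<git commit sha>", "data": snapshot_hash("data/train")}
```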

Level 2 is more systematic: version data as a combination of assets and code. For example, a speech dataset might store WAV files in S3 under unique IDs, with labels in a JSON file mapping those IDs to annotations. The label files can be tracked in version control, but they may balloon into multi-gigabyte blobs once there are millions of labeled rows. Tools like Git Large File Storage (Git LFS) address this by keeping large files out of the Git repository itself: commits store hashes and pointers while the actual content is uploaded to S3. A further optimization, lazy data loading, defers downloading a repository's large files until the data is actually needed.
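
For concreteness, a hypothetical labels file for this layout might look like the sketch below; the IDs, S3 keys, and transcripts are made up for illustration:

```python
import json

# Hypothetical labels file tracked in Git (via LFS once it grows large):
# each key is the unique ID of a WAV file in S3, the value its annotation.
labels = {
    "utt-000001": {"s3_key": "audio/utt-000001.wav", "text": "turn on the lights"},
    "utt-000002": {"s3_key": "audio/utt-000002.wav", "text": "set a timer for ten minutes"},
}

with open("labels.json", "w") as f:
    json.dump(labels, f, indent=2, sort_keys=True)  # sorted keys keep diffs and hashes stable
```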

In this setup, the dataset version becomes the hash of the underlying raw data and labels—so any change to labels, added rows, or modified waveform files produces a new dataset hash and therefore a new dataset “version.” Adding a timestamp can make it easier to correlate versions with training runs.
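
A minimal sketch of that idea, assuming per-asset content hashes have already been computed (function and argument names are hypothetical):

```python
import hashlib
import time

def dataset_version(labels_path: str, asset_hashes: dict[str, str]) -> str:
    """Derive a dataset version from the label file plus per-asset content hashes.

    asset_hashes maps each asset ID to a hash of its raw bytes (e.g. the WAV
    file in S3), so a changed waveform also changes the dataset version.
    """
    digest = hashlib.sha256()
    with open(labels_path, "rb") as f:
        digest.update(f.read())                # any label edit changes the hash
    for asset_id in sorted(asset_hashes):      # any added or modified asset does too
        digest.update(asset_id.encode())
        digest.update(asset_hashes[asset_id].encode())
    return digest.hexdigest()

version = dataset_version("labels.json", {"utt-000001": "ab12..."})  # placeholder hash
print(f"{version[:12]} @ {time.strftime('%Y-%m-%dT%H:%M:%S')}")      # pair with a timestamp
```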

Level 3 points to specialized machine-learning data versioning systems. The transcript flags a key caution: adopt these tools only after understanding the specific problem they solve, because some quickly expand into managing infrastructure and pipelines. Examples mentioned include DVC, which version-controls both data and the transformation steps that produce it via pipelines (useful in tabular/CSV-style workflows), and Pachyderm, which emphasizes language-agnostic data versioning plus automated triggering, parallelism, and resource management. Another option highlighted is Dolt from Liquidata, which focuses on database versioning: it provides Git-like branching and merging for databases, supports conflict detection and resolution, and enables performance comparisons across dataset versions by running models against each one.
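
As a hedged illustration of the DVC side, its Python API can read a file as it existed at a pinned Git revision; the repository URL, path, and tag below are placeholders:

```python
import dvc.api

# Read a specific version of a DVC-tracked file by pinning a Git revision.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/speech-project",  # hypothetical repo
    rev="v1.2",                                        # Git tag/commit = dataset version
) as f:
    header = f.readline()
```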

The core takeaway is straightforward: reliable deployment requires treating training data as a first-class, versioned dependency—ideally with tooling that can reproduce both the dataset state and the transformations that produced the training inputs.

Cornell Notes

Machine learning deployments need reproducible artifacts, which requires versioning not only code but also the exact training data. Without tracking dataset state, later retraining may never recreate a previously working model because the data has changed and the original inputs weren't recorded. A four-level framework starts with unversioned data (Level 0), moves to snapshot-based tracking (Level 1), then to asset-plus-code dataset versioning (Level 2, often using Git LFS and lazy loading for large label files). Specialized tools like DVC, Pachyderm, and Dolt (Level 3) offer deeper data-versioning workflows, but adoption should match the team's specific needs, especially since some tools also take on infrastructure management.

Why does missing data versioning eventually break the ability to reproduce a working model?

When a model is deployed, the deployed artifact depends on both the code and the dataset used for training. If only the code is versioned (e.g., via a code hash) while the dataset changes later, the original training inputs can’t be reconstructed. Over time, retraining produces models tied to new data states, and teams lose the ability to roll back to the earlier “known-good” model because the dataset version that created it was never captured.
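
One lightweight guard, sketched below with hypothetical names, is to write a manifest at training time that captures both dependencies of the artifact:

```python
import json
import subprocess

def write_training_manifest(dataset_hash: str, path: str = "manifest.json") -> None:
    """Record both dependencies of the trained artifact: code and data."""
    code_hash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    with open(path, "w") as f:
        json.dump({"code": code_hash, "data": dataset_hash}, f, indent=2)

# Shipped alongside the model, this is enough to identify the training inputs
# later, provided the dataset state behind dataset_hash is still retrievable.
```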

What’s the difference between snapshotting data and versioning data as assets plus code?

Snapshotting stores a specific dataset snapshot (often via a hash) alongside the code version so the artifact can be tied to a dataset state. It can become heavy or “hacky” if it’s just dumping large blobs without a clean workflow. Asset-plus-code versioning treats datasets as a mix of large assets (e.g., WAV files in S3) and smaller metadata/labels (e.g., a JSON mapping IDs to labels) tracked in version control. The dataset version can then be derived from hashes of the raw assets and label mappings.

How does Git LFS help when label files become multi-gigabyte?

Git LFS prevents large files from being stored directly inside the Git repository. Instead, when a large file (like a big JSON label mapping) is committed, Git LFS uploads the content to S3 behind the scenes and stores a pointer containing the file’s hash and location. This keeps the repository manageable while still making the dataset content versionable through content hashes.
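
The pointer itself is tiny and derived from a SHA-256 of the file's contents. The sketch below is a simplified reconstruction, not Git LFS's actual implementation; it builds pointer text in the format LFS commits:

```python
import hashlib
import os

def lfs_pointer(path: str) -> str:
    """Build the small pointer text Git LFS commits in place of a large file.

    The oid is the SHA-256 of the file's contents; the content itself is
    pushed to the LFS backing store (e.g. S3) separately.
    """
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            sha.update(chunk)
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha.hexdigest()}\n"
        f"size {os.path.getsize(path)}\n"
    )
```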

What does “lazy data” add to dataset versioning workflows?

Lazy data loading lets a team check out a large repository containing many large files without downloading everything locally immediately. Data is fetched only when it’s needed for a specific step (e.g., training or evaluation). Combined with hashing, this supports efficient workflows while still ensuring dataset versions correspond to specific raw data and label states.
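
A minimal sketch of fetch-on-demand, assuming the WAV-in-S3 layout from earlier (the bucket, prefix, and helper names are hypothetical):

```python
from pathlib import Path

import boto3

_s3 = boto3.client("s3")
_cache = Path(".cache/audio")

def fetch_wav(asset_id: str, bucket: str = "example-speech-data") -> Path:
    """Download a WAV from S3 only on first use; later calls hit the local cache."""
    local = _cache / f"{asset_id}.wav"
    if not local.exists():  # lazy: nothing is fetched at checkout time
        local.parent.mkdir(parents=True, exist_ok=True)
        _s3.download_file(bucket, f"audio/{asset_id}.wav", str(local))
    return local
```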

How do specialized tools differ in their approach to data versioning?

DVC is positioned as open-source ML project version control that versions data and also the transformation steps via pipelines, which is useful when changing part of a data-processing workflow should trigger recomputation. Pachyderm is described as language-agnostic and focused on automated triggering, parallelism, and resource management, which can blur into infrastructure management. Dolt (from Liquidata) is highlighted for database versioning: it supports branching and merging changes to a database, conflict resolution, and comparing performance across versions by running models against each dataset state.
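
As a rough illustration of the Dolt workflow: Dolt serves the MySQL wire protocol and extends SQL with history queries, so versions can be compared from ordinary client code. The database, table, and branch names below are hypothetical:

```python
import mysql.connector

# Connect to a running `dolt sql-server`; connection details are illustrative.
conn = mysql.connector.connect(
    host="127.0.0.1", port=3306, user="root", database="labels_db"
)
cur = conn.cursor()

# Dolt's AS OF reads a table at a specific branch or commit, which lets you
# evaluate a model against each dataset version side by side.
for rev in ("main", "relabeling-experiment"):
    cur.execute(f"SELECT COUNT(*) FROM annotations AS OF '{rev}'")
    print(rev, cur.fetchone()[0])
```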

What caution should guide choosing a specialized data versioning tool?

Specialized systems should be adopted only after clearly identifying the problem they solve and being able to explain how the tool addresses it. A sticking point raised is that some tools quickly move into managing infrastructure for the user, which may not align with a team’s preferences or existing platform.

Review Questions

  1. What two components must be versioned to make a deployed ML artifact reproducible, and what failure mode occurs if one is missing?
  2. How does hashing raw data and labels create a dataset version, and what kinds of changes would alter that hash?
  3. Compare DVC, Pachyderm, and Dolt in terms of what they version and how much infrastructure management they appear to assume.

Key Points

  1. Treat training data as a first-class dependency: deployed ML artifacts depend on both code and the exact dataset used for training.
  2. Unversioned datasets lead to irreproducibility because later data changes prevent recreating earlier “known-good” models.
  3. Snapshot-based tracking improves reproducibility by tying a code hash to a dataset snapshot, but can become unwieldy if handled as a blunt blob dump.
  4. Asset-plus-code dataset versioning works well when large assets (e.g., WAV files in S3) and label metadata (e.g., JSON mappings) can be tracked together via content hashes.
  5. Git LFS keeps large label files out of the Git repository by uploading content to S3 and storing hashes/pointers in Git.
  6. Lazy loading reduces the cost of working with large versioned datasets by downloading only what’s needed.
  7. Specialized tools (DVC, Pachyderm, Dolt) can automate deeper workflows, but should be chosen only when their benefits match the team’s specific problem and infrastructure preferences.

Highlights

Missing data versioning means a previously working model can become unrecoverable once the dataset changes, even if the code is still versioned.
Git LFS turns huge label files into versionable content hashes stored with pointers, while the actual blobs live in S3.
Dataset versions can be defined as hashes of raw assets plus label mappings, so any label change or added row automatically creates a new dataset version.
Dolt’s database versioning enables branching, merging, conflict resolution, and performance comparisons across dataset versions.
DVC emphasizes versioning both data and transformation pipelines, while Pachyderm emphasizes automated parallelism and resource management.

Topics

Mentioned

  • Git Large File Storage
  • Dolt
  • Liquidata
  • DVC
  • Pachyderm
  • S3
  • ML
  • JSON
  • CSV