Versioning Data for Machine Learning

TL;DR

Model rollback requires a reliable mapping from deployed code to the exact training dataset used, not just a Git commit.

Briefing Cornell Notes

Briefing

Data versioning is treated as a ladder of increasing sophistication—because deployed machine learning models depend on both code and the exact data they were trained on, and teams eventually need to roll back when things break. At the simplest level, data sits on a file system or in S3 without versioning; training on Tuesday uses whatever data exists then, and training on Friday uses whatever exists then. The trouble shows up during deployment: if the deployed model fails and the team needs to revert, the lack of data history makes it hard to reproduce the prior training set, even though the model itself is versioned.

A more reliable approach is to snapshot the dataset and store it alongside the code version. In practice, that means packaging the full training data at a given time, storing it somewhere durable, and recording a pointer to that snapshot when training and deploying. With this method, a deployment can reference both the Git commit for the code and a specific data snapshot date (or identifier), enabling a rollback to a known-good combination. The workflow still has friction—manual snapshotting can feel “super hacking”—which motivates a better pattern: version data as a mix of assets and code.

Level 2 centers on keeping large raw assets in object storage while versioning the *references* and metadata in Git. For image processing or speech recognition, the actual audio files or images can live in S3 under unique IDs. The training dataset then becomes a structured list of those IDs plus labels and other metadata stored in JSON or similar formats that Git can track. As long as the underlying S3 objects aren’t deleted, checking out an earlier Git revision effectively reconstructs the dataset “as of” that point in time. Because JSON manifests can grow huge, the transcript flags Git LFS (Git Large File Storage) as a way to handle large files similarly to code.

This scheme raises a future-proofing question: what happens when systems evolve—database schemas change, or data gets deleted? The answer depends on application constraints. In education-focused use cases with strict privacy rules, deletion requests must be honored, so retaining old data for future training isn’t allowed; in those environments, the “right place” to enforce correctness is upstream via contracts, tests, and alerts, or by using S3 versioning so files aren’t removed—only new versions are added. That shifts the burden from downstream rollback to earlier safeguards.

Level 3 moves beyond Git-based reference tracking into specialized data versioning systems. The transcript names DBC, Pachyderm, and Quill, with DBC highlighted as especially promising. DBC automates the linkage between data additions and stored artifacts in S3, and it records transformations (for example, converting an XML dataset into a training-ready CSV) so the pipeline and provenance can be retrieved later. The takeaway is pragmatic: try Level 2 first, then evaluate Level 3 if the operational overhead or provenance needs justify it.

Cornell Notes

Machine learning deployments require the exact training data used alongside the code version, or rollback becomes unreliable. Level 0 skips data versioning entirely, which breaks reproducibility when a bad model must be reverted. Level 1 uses dataset snapshots stored externally and records pointers to those snapshots with the code version, enabling rollback to a known-good code+data pair. Level 2 improves this by storing large raw assets in S3 (using stable IDs) while versioning the dataset manifest—IDs plus labels/metadata—in Git, often with Git LFS for large files. Level 3 introduces specialized tools like DBC to automatically track data provenance and transformation pipelines, but it’s recommended to adopt Level 2 before moving to Level 3.

Why does skipping data versioning (Level 0) make model rollback difficult?

If training uses whatever data exists at the time (e.g., “Tuesday’s files” vs “Friday’s files”) while the deployed model is versioned, the team still lacks a reliable mapping from a deployed model back to the exact dataset version. When a deployment fails and the team needs to revert, the code rollback alone isn’t enough—without versioned data, there’s no dependable way to reconstruct the earlier training set.

How does Level 1 make deployments reproducible?

Level 1 snapshots the dataset at training time: the team downloads all training data, packages it, stores it somewhere durable, and records a reference to that snapshot when deploying. The deployment can then pair a Git commit (code) with a specific data snapshot identifier or date, so reverting means restoring both the code and the corresponding data state.

What does Level 2 change about the way datasets are versioned?

Level 2 keeps raw assets in S3 and version-controls the *manifest* that defines the dataset. For example, audio or image files live in S3 under unique IDs; the training dataset is represented as a structured list of those IDs plus labels/metadata stored in Git (often JSON). Checking out an earlier Git revision reconstructs the dataset as long as the referenced S3 objects remain available.

Why is Git LFS relevant in Level 2?

Dataset manifests and related artifacts can become large—millions of examples can make JSON manifests sizable. Git LFS (Git Large File Storage) helps store large files without treating them like ordinary Git blobs, making it more practical to version large dataset-related files alongside code.

How do privacy and retention rules affect the “future-proofing” of Level 2?

If the application must delete data on request (strict privacy), retaining old data for future training isn’t allowed, so the system can’t rely on keeping historical S3 objects forever. In such cases, correctness needs to be enforced upstream through contracts, schema-change tests, and alerts. In other settings, enabling S3 versioning can prevent deletion by keeping prior versions and only adding new ones.

What extra capability does Level 3 add, and what tool example is given?

Level 3 automates data provenance and transformation tracking. DBC is highlighted: adding a data file uses a DBC command (e.g., “DBC add data XML”), which uploads the file to S3 and creates a metadata file (e.g., “data XML be seen”) that links to the stored asset. When transforming data (e.g., XML to a training-ready CSV), DBC records the transformation pipeline (e.g., “DBC run Python transform …”), so the processed dataset lineage can be retrieved later.

Review Questions

What specific failure mode occurs during rollback when code is versioned but the training data is not?
In Level 2, what exactly is versioned in Git, and what is stored in S3?
How do privacy deletion requirements change the recommended strategy for data versioning?

Key Points

1
Model rollback requires a reliable mapping from deployed code to the exact training dataset used, not just a Git commit.
2
Level 0 (no data versioning) makes reproducibility fragile because training depends on whatever data happens to exist at the time.
3
Level 1 improves rollback by snapshotting the dataset and recording a snapshot reference alongside the code version.
4
Level 2 version-controls dataset manifests (S3 object IDs plus labels/metadata) in Git, while storing raw assets in S3.
5
Git LFS helps manage large dataset-related files that would otherwise be unwieldy in standard Git.
6
Future-proofing depends on retention and schema-change constraints; upstream contracts/tests or S3 versioning can prevent silent drift.
7
Level 3 tools like DBC automate provenance and transformation pipeline tracking, but Level 2 is the recommended starting point.

Highlights

Without data versioning, reverting a bad deployment can’t reliably reconstruct the earlier training set—even if the model code is versioned.

Level 2’s core trick is separating raw assets (S3) from the dataset definition (Git-tracked IDs and metadata).

S3 versioning and upstream schema-change tests address different risks: deletion vs drift from evolving systems.

DBC is presented as an automation layer that links data additions to S3 and records transformation pipelines for provenance.

Topics

Data Versioning
Machine Learning Deployment
Git LFS
S3 Versioning
Data Provenance

Mentioned

S3
Git LFS
ML