
Reproducible Machine Learning & Experiment Tracking Pipeline with Python and DVC

Venelin Valkov · 5 min read

Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

DVC enables reproducible ML by versioning datasets, features, models, and metrics through Git-compatible metadata while storing large artifacts outside Git.

Briefing

Data and model reproducibility hinges on tracking not just code, but the exact datasets, derived features, trained artifacts, and evaluation outputs that produced a result. DVC (Data Version Control) delivers that by storing lightweight metadata in Git while keeping large files (datasets, models, intermediate outputs) in external storage, enabling repeatable machine-learning pipelines and side-by-side experiment comparisons.

The workflow starts with DVC’s Git compatibility: commands run alongside Git, but DVC tracks file hashes and dependencies rather than committing bulky artifacts. Instead of saving the data/model files directly in the repository, DVC writes small “.dvc” files that describe where artifacts live and what they depend on. This design matters because ML artifacts can be huge, yet reproducibility requires knowing exactly which version of each artifact fed each training run.

To demonstrate the approach, a small end-to-end project is built around a Kaggle dataset of Udemy courses (3,682 courses across four categories). The goal is to predict the number of subscribers using features such as course price, number of lectures, course level, content duration, subject, and the course’s publication timing. The project is organized into pipeline stages: create a dataset (download and split into train/test), extract features (including a “days since published” feature engineered from the published timestamp), train a model, and evaluate it.

A configuration file defines paths under an assets directory for original data, features, models, and metrics. The dataset step downloads the CSV from Google Drive, then uses a fixed random seed to create train/test splits (with a 20% test set). The feature step reads the split files, computes engineered inputs, and writes separate CSVs for training/test features and labels (labels are the subscriber counts). The training step fits a baseline Linear Regression model and serializes it with pickle. The evaluation step computes metrics—R² and root mean squared error—then writes them to a JSON metrics file.
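The shape of the training step can be sketched without scikit-learn (the video itself fits scikit-learn's LinearRegression; the closed-form one-feature fit and the names below are illustrative stand-ins):

```python
import pickle

def fit_linear(xs, ys):
    # Closed-form least squares for a single feature: a dependency-free
    # stand-in for the scikit-learn LinearRegression fit used in the demo.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return {"slope": slope, "intercept": my - slope * mx}

model = fit_linear([1, 2, 3, 4], [2, 4, 6, 8])

# The train stage's final move: serialize the fitted model with pickle so the
# evaluation stage (and DVC) can treat it as a tracked artifact.
blob = pickle.dumps(model)
assert pickle.loads(blob) == model
```

The important part for DVC is not the estimator but the handoff: the stage reads feature CSVs as declared inputs and writes a pickle file as a declared output.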

After initializing Git and DVC (with DVC's anonymized usage analytics turned off), each pipeline stage is registered using `dvc run`. Those stage definitions capture dependencies (scripts and upstream artifacts) and outputs (data/features/models/metrics). Once the pipeline is wired, DVC can rerun only the necessary steps when something changes. The first experiment tags the baseline linear regression run and records its metrics.
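Stage registration might look roughly like this (script names and the exact `assets/` paths are assumptions based on the layout described above; in modern DVC, `dvc stage add` plays the same role as `dvc run`):

```shell
# Illustrative stage definitions: -d declares a dependency, -o a cached
# output, -M a small metrics file kept in Git.
dvc run -d src/create_dataset.py \
        -o assets/data \
        python src/create_dataset.py
dvc run -d src/extract_features.py -d assets/data \
        -o assets/features \
        python src/extract_features.py
dvc run -d src/train_model.py -d assets/features \
        -o assets/models/model.pkl \
        python src/train_model.py
dvc run -d src/evaluate_model.py -d assets/models/model.pkl -d assets/features \
        -M assets/metrics/metrics.json \
        python src/evaluate_model.py
```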

A second experiment swaps in a RandomForestRegressor (tuning parameters like number of estimators, max depth, and random state). DVC detects the changed training code and automatically rebuilds the downstream artifacts: it reruns training and evaluation, then stores the new metrics. Using DVC’s metrics comparison, the random forest run produces better results than the linear regression baseline, and both experiments remain traceable via tags.

The takeaway is practical: DVC turns ML pipelines into dependency graphs where datasets, features, models, and evaluation outputs are versioned and reproducible—while still keeping repositories manageable and enabling quick, reliable experiment comparisons.

Cornell Notes

DVC (Data Version Control) makes machine-learning experiments reproducible by tracking datasets, derived features, trained models, and evaluation metrics through Git-compatible metadata while storing large artifacts elsewhere. In the demo project, the pipeline is split into stages: download/split data, extract engineered features (including “days since published”), train a model, and evaluate it with R² and RMSE. Each stage is registered with `dvc run`, which records dependencies and outputs so only the necessary steps rerun when changes occur. After tagging a baseline Linear Regression experiment, the project switches to a RandomForestRegressor; DVC automatically retrains and reevaluates, then stores comparable metrics for both runs. This enables reliable experiment tracking and side-by-side metric comparison.

How does DVC achieve reproducibility without committing huge datasets and model files to Git?

DVC stores small metadata files (the “.dvc” files) in the Git repository instead of the large artifacts themselves. Those metadata files record what each artifact depends on and where the actual content is stored (in the demo, local filesystem storage is used). Because the metadata captures dependencies and hashes, rerunning the pipeline reconstructs the same dataset/features/models/metrics from the same inputs, even when the underlying files are too large for Git.
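The change detection behind this rests on content hashes (classic DVC records an MD5 per tracked file). A minimal illustration of the idea, not DVC's actual API:

```python
import hashlib

def md5_of(data: bytes) -> str:
    # DVC records a hash per tracked file in the .dvc metadata; a stage is
    # stale when a dependency's current hash differs from the recorded one.
    return hashlib.md5(data).hexdigest()

recorded = md5_of(b"price,lectures\n19.99,12\n")
assert md5_of(b"price,lectures\n19.99,12\n") == recorded  # unchanged: up to date
assert md5_of(b"price,lectures\n24.99,12\n") != recorded  # edited: stage reruns
```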

What are the key stages in the example ML pipeline, and what artifacts does each stage produce?

The pipeline is organized into: (1) dataset creation—downloads the Kaggle/Udemy courses CSV and splits it into train/test CSVs; (2) feature extraction—reads the splits and writes train/test feature CSVs plus train/test label CSVs, including engineered “days since published”; (3) model training—fits a Linear Regression baseline (then later RandomForestRegressor) and serializes the model to a pickle file; (4) evaluation—loads the model and test data, computes R² and RMSE, and writes metrics to a JSON file (metrics.json).
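The evaluation math is simple enough to sketch without scikit-learn (which the demo presumably uses for these metrics; the function name and sample numbers below are illustrative):

```python
import json
import math

def evaluate(y_true, y_pred):
    # R² = 1 - SS_res / SS_tot; RMSE = sqrt(mean squared error).
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return {"r2": 1 - ss_res / ss_tot, "rmse": math.sqrt(ss_res / n)}

# The stage ends by dumping a small JSON metrics file that DVC tracks:
metrics = evaluate([120, 340, 95], [110, 300, 130])
print(json.dumps(metrics, indent=2))
```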

Why does the demo emphasize fixed randomness during dataset splitting?

Reproducibility depends on using the same train/test partition every time. The dataset step uses a random seed from a config file when calling `train_test_split`, ensuring the 20% test set remains consistent across runs. Without that, even identical code and model settings could yield different metrics because the evaluation data would change.
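The same guarantee can be shown with a dependency-free stand-in for `train_test_split` (the demo passes a seed from its config file as `random_state`; the helper below is illustrative):

```python
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    # Shuffle row indices with a seeded RNG, then carve off the test share;
    # the same seed always yields the same partition.
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = int(n_rows * test_size)
    return idx[n_test:], idx[:n_test]  # (train, test)

train_a, test_a = split_indices(100)
train_b, test_b = split_indices(100)
assert (train_a, test_a) == (train_b, test_b)  # identical on every run
```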

How does DVC decide which parts of the pipeline to rerun after a change?

DVC builds a dependency graph from the `dvc run` definitions. When the training code or upstream inputs change, DVC checks which declared dependencies and outputs are affected. In the demo, changing the model from Linear Regression to RandomForestRegressor triggers rerunning the training stage and the downstream evaluation stage, while earlier stages (like dataset creation and feature extraction) can remain unchanged if their dependencies didn’t change.
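A toy model of that staleness propagation (not DVC's implementation; stage and file names are illustrative) makes the behavior concrete:

```python
# Stages in topological order: each must rerun if any declared dependency
# changed, and its outputs then count as changed for downstream stages.
stages = {
    "dataset":  {"deps": ["create_dataset.py"],             "outs": ["data/"]},
    "features": {"deps": ["extract_features.py", "data/"],  "outs": ["features/"]},
    "train":    {"deps": ["train.py", "features/"],         "outs": ["model.pkl"]},
    "evaluate": {"deps": ["evaluate.py", "model.pkl"],      "outs": ["metrics.json"]},
}

def stages_to_rerun(changed):
    changed = set(changed)
    rerun = []
    for name, stage in stages.items():  # dicts preserve the topological order
        if changed & set(stage["deps"]):
            rerun.append(name)
            changed |= set(stage["outs"])  # outputs propagate the change
    return rerun

# Editing only the training script skips dataset creation and feature extraction:
print(stages_to_rerun(["train.py"]))  # ['train', 'evaluate']
```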

What does experiment tagging buy you in DVC’s metrics workflow?

Tags label specific pipeline runs so metrics can be compared across experiments. After tagging the baseline linear regression run, the random forest run is tagged separately. Running `dvc metrics show` then lists metrics per tagged experiment, making it straightforward to see that the random forest achieved better R²/RMSE than the baseline in this setup.
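In CLI terms the workflow looks roughly like this (tag names are illustrative, and the flag spellings follow the classic DVC releases used in the video):

```shell
git tag -a linear-regression -m "baseline: linear regression"
# ...swap in the random forest, rerun the affected stages, commit...
git tag -a random-forest -m "experiment: random forest"
dvc metrics show -T   # -T/--all-tags lists the metrics file per tagged run
```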

Review Questions

  1. In what ways do DVC’s “.dvc” metadata files differ from committing the actual dataset/model files to Git?
  2. If you change only feature-engineering code, which pipeline stages should rerun and why, based on declared DVC dependencies?
  3. How would you extend the demo to track additional metrics (e.g., MAE) while keeping experiment comparisons reproducible?

Key Points

  1. DVC enables reproducible ML by versioning datasets, features, models, and metrics through Git-compatible metadata while storing large artifacts outside Git.
  2. DVC’s `.dvc` files capture dependencies and outputs, allowing the pipeline to be reconstructed reliably from the same inputs.
  3. Breaking an ML project into explicit stages (data split, feature extraction, training, evaluation) makes dependency tracking precise and reruns efficient.
  4. Using a fixed random seed for train/test splitting is essential; otherwise metrics can drift even when code and models match.
  5. `dvc run` registers each pipeline step with declared dependencies and outputs, so DVC can rerun only the affected downstream stages after changes.
  6. Experiment tags provide a clean way to compare metrics across different model choices (e.g., Linear Regression vs. RandomForestRegressor).
  7. DVC metrics stored as JSON support straightforward metric inspection and comparison across tagged runs.

Highlights

DVC keeps repositories lightweight by committing metadata instead of large datasets and model binaries, while still enabling exact reconstruction of artifacts.
A dependency-graph approach means changing the model code reruns training and evaluation automatically, without forcing a full pipeline rebuild.
Tagging experiments turns metric comparison into a first-class workflow, making it easy to see improvements from model changes (random forest outperforming linear regression in the demo).
Feature engineering can be treated as a versioned, reproducible pipeline stage—engineered inputs like “days since published” become tracked artifacts too.