Reproducible Machine Learning & Experiment Tracking Pipeline with Python and DVC
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.
DVC enables reproducible ML by versioning datasets, features, models, and metrics through Git-compatible metadata while storing large artifacts outside Git.
Briefing
Data and model reproducibility hinges on tracking not just code, but the exact datasets, derived features, trained artifacts, and evaluation outputs that produced a result. DVC (Data Version Control) delivers that by storing lightweight metadata in Git while keeping large files (datasets, models, intermediate outputs) in external storage, enabling repeatable machine-learning pipelines and side-by-side experiment comparisons.
The workflow starts with DVC’s Git compatibility: commands run alongside Git, but DVC tracks file hashes and dependencies rather than committing bulky artifacts. Instead of saving the data/model files directly in the repository, DVC writes small “.dvc” files that describe where artifacts live and what they depend on. This design matters because ML artifacts can be huge, yet reproducibility requires knowing exactly which version of each artifact fed each training run.
To demonstrate the approach, a small end-to-end project is built around a Kaggle dataset of Udemy courses (3,682 courses across four categories). The goal is to predict the number of subscribers using features such as course price, number of lectures, course level, content duration, subject, and the course’s publication timing. The project is organized into pipeline stages: create a dataset (download and split into train/test), extract features (including a “days since published” feature engineered from the published timestamp), train a model, and evaluate it.
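As a rough sketch of the feature-extraction step, here is how the "days since published" feature might be computed with pandas. Column names such as `published_timestamp`, `level`, and `subject` follow the public Kaggle CSV, but the exact names, the reference date, and the encoding choices are assumptions for illustration, not details taken verbatim from the video:

```python
import pandas as pd

def extract_features(df: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Build model inputs, including an engineered 'days since published' feature."""
    features = df[["price", "num_lectures", "content_duration"]].copy()

    # Engineer "days since published" from the publication timestamp.
    published = pd.to_datetime(df["published_timestamp"], utc=True).dt.tz_localize(None)
    features["days_since_published"] = (reference_date - published).dt.days

    # One-hot encode the categorical inputs (course level and subject).
    features = features.join(pd.get_dummies(df[["level", "subject"]]))
    return features

# Example usage on the training split (path and reference date are placeholders):
# train_df = pd.read_csv("assets/original/train.csv")
# X_train = extract_features(train_df, reference_date=pd.Timestamp("2017-07-01"))
```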
A configuration file defines paths under an assets directory for original data, features, models, and metrics. The dataset step downloads the CSV from Google Drive, then uses a fixed random seed to create train/test splits (with a 20% test set). The feature step reads the split files, computes engineered inputs, and writes separate CSVs for training/test features and labels (labels are the subscriber counts). The training step fits a baseline Linear Regression model and serializes it with pickle. The evaluation step computes metrics—R² and root mean squared error—then writes them to a JSON metrics file.
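A minimal sketch of the training and evaluation steps under those conventions (file paths are illustrative assumptions; only the model class, pickle serialization, and the R²/RMSE-to-JSON flow come from the description above):

```python
import json
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Training step: fit the baseline linear regression and serialize it with pickle.
X_train = pd.read_csv("assets/features/train_features.csv")
y_train = pd.read_csv("assets/features/train_labels.csv")
model = LinearRegression()
model.fit(X_train, y_train)
with open("assets/models/model.pkl", "wb") as f:
    pickle.dump(model, f)

# Evaluation step: compute R^2 and RMSE on the test split, write them to JSON.
X_test = pd.read_csv("assets/features/test_features.csv")
y_test = pd.read_csv("assets/features/test_labels.csv")
predictions = model.predict(X_test)
metrics = {
    "r2": float(r2_score(y_test, predictions)),
    "rmse": float(mean_squared_error(y_test, predictions) ** 0.5),
}
with open("assets/metrics/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```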
After initializing Git and DVC (with DVC’s anonymous usage analytics disabled), each pipeline stage is registered using `dvc run`. Those stage definitions capture dependencies (scripts and upstream artifacts) and outputs (data, features, models, metrics). Once the pipeline is wired, DVC can rerun only the stages affected by a change. The first experiment tags the baseline linear regression run and records its metrics.
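Stage registration might look roughly like the following (script names, stage ordering, and asset paths are placeholders; `-d` declares a dependency, `-o` a cached output, and `-M` a small metrics file that stays in Git):

```bash
# Register each stage with its dependencies (-d) and outputs (-o / -M for metrics).
# Script and path names below are placeholders, not the project's actual files.
dvc run -d create_dataset.py \
        -o assets/data \
        python create_dataset.py

dvc run -d extract_features.py -d assets/data \
        -o assets/features \
        python extract_features.py

dvc run -d train_model.py -d assets/features \
        -o assets/models/model.pkl \
        python train_model.py

dvc run -d evaluate.py -d assets/models/model.pkl -d assets/features \
        -M assets/metrics/metrics.json \
        python evaluate.py
```

With the stages declared this way, `dvc repro` walks the dependency graph and reruns only the stages whose declared dependencies have changed.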
A second experiment swaps in a RandomForestRegressor (setting parameters such as the number of estimators, max depth, and random state). Because the training script is a declared dependency, DVC detects the change and rebuilds the downstream artifacts: it reruns training and evaluation, then stores the new metrics. Using DVC’s metrics comparison, the random forest run produces better results than the linear regression baseline, and both experiments remain traceable via their tags.
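Only the training script needs to change for the second experiment; a sketch of the swap (hyperparameter values and paths are illustrative, not necessarily those used in the video):

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Same training stage as before, with the model swapped for a random forest.
X_train = pd.read_csv("assets/features/train_features.csv")
y_train = pd.read_csv("assets/features/train_labels.csv")

model = RandomForestRegressor(
    n_estimators=100,   # number of trees
    max_depth=10,       # depth limit
    random_state=42,    # fixed seed so the run is repeatable
)
model.fit(X_train, y_train.values.ravel())

with open("assets/models/model.pkl", "wb") as f:
    pickle.dump(model, f)
```

After committing and tagging this change, DVC’s metrics comparison reads the JSON metrics recorded for each tagged run, which is what makes the side-by-side comparison possible.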
The takeaway is practical: DVC turns ML pipelines into dependency graphs where datasets, features, models, and evaluation outputs are versioned and reproducible—while still keeping repositories manageable and enabling quick, reliable experiment comparisons.
Cornell Notes
DVC (Data Version Control) makes machine-learning experiments reproducible by tracking datasets, derived features, trained models, and evaluation metrics through Git-compatible metadata while storing large artifacts elsewhere. In the demo project, the pipeline is split into stages: download/split data, extract engineered features (including “days since published”), train a model, and evaluate it with R² and RMSE. Each stage is registered with `dvc run`, which records dependencies and outputs so only the necessary steps rerun when changes occur. After tagging a baseline Linear Regression experiment, the project switches to a RandomForestRegressor; DVC automatically retrains and reevaluates, then stores comparable metrics for both runs. This enables reliable experiment tracking and side-by-side metric comparison.
- How does DVC achieve reproducibility without committing huge datasets and model files to Git?
- What are the key stages in the example ML pipeline, and what artifacts does each stage produce?
- Why does the demo emphasize a fixed random seed when splitting the dataset?
- How does DVC decide which parts of the pipeline to rerun after a change?
- What does experiment tagging buy you in DVC’s metrics workflow?
Review Questions
- In what ways do DVC’s “.dvc” metadata files differ from committing the actual dataset/model files to Git?
- If you change only feature-engineering code, which pipeline stages should rerun and why, based on declared DVC dependencies?
- How would you extend the demo to track additional metrics (e.g., MAE) while keeping experiment comparisons reproducible?
Key Points
1. DVC enables reproducible ML by versioning datasets, features, models, and metrics through Git-compatible metadata while storing large artifacts outside Git.
2. DVC’s `.dvc` files capture dependencies and outputs, allowing the pipeline to be reconstructed reliably from the same inputs.
3. Breaking an ML project into explicit stages (data split, feature extraction, training, evaluation) makes dependency tracking precise and reruns efficient.
4. Using a fixed random seed for train/test splitting is essential; otherwise metrics can drift even when code and models match.
5. `dvc run` registers each pipeline step with declared dependencies and outputs, so DVC can rerun only the affected downstream stages after changes.
6. Experiment tags provide a clean way to compare metrics across different model choices (e.g., Linear Regression vs RandomForestRegressor).
7. DVC metrics stored as JSON support straightforward metric inspection and comparison across tagged runs.