ML Project Template for 2025 - Build ML Pipelines with Python, uv, DVC, FastAPI, Docker
Based on Venelin Valkov's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
A ready-to-use machine learning project template is positioned as a 2025 blueprint for taking a model from dataset creation to a production-style REST API and a Dockerized deployment. The core payoff is a structured pipeline that combines reproducible data and training steps (via DVC), a FastAPI service layer for inference, and a Docker workflow that packages the whole system so it can run on a VPS or other hosting environment.
Setup starts with cloning a public GitHub repository and using uv to manage Python and dependencies. The workflow creates a virtual environment pinned to a specified Python version, installs project dependencies quickly, and installs the project itself in editable mode so local code changes are importable during development. A pre-commit hook is then installed to enforce formatting and checks before changes land in Git—using tools like Ruff for formatting/linting and pytest for testing.
The template’s structure is organized around standard ML engineering directories: artifacts for outputs, bin for scripts, a profit package for the application code, and tests for verification, plus a Dockerfile and app.py for the serving layer. Listed dependencies include DVC for data versioning, FastAPI and Uvicorn for the API server, and pandas for data handling together with fastparquet for Parquet I/O. The template also ships configuration for development hygiene tooling (pre-commit, Ruff, pytest).
At the heart of the pipeline is a REST API endpoint that serves predictions from a text classifier. The API reads a config file, loads or references the model used for inference, and exposes a single predict endpoint. A client sends text via POST, and the service returns a JSON response containing the predicted label. In the example shown, the classifier returns the label “just do it” for the provided input text.
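A minimal sketch of what such an app.py endpoint could look like is shown below; the request/response field names, the config handling, and the rule-based stand-in classifier are assumptions for illustration, not the template’s actual code.

```python
# app.py - illustrative sketch of a FastAPI prediction service
# (field names and the dummy classifier are assumptions, not the template's code)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="ML template API")


class PredictRequest(BaseModel):
    text: str


class PredictResponse(BaseModel):
    label: str


def load_model():
    # The real template would read its config file and load the trained model
    # artifact here; a trivial rule-based stand-in keeps the sketch runnable.
    class DummyClassifier:
        def predict(self, text: str) -> str:
            return "just do it" if "do" in text.lower() else "stop doing it"

    return DummyClassifier()


model = load_model()


@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    # The client POSTs text and receives the predicted label as JSON.
    return PredictResponse(label=model.predict(request.text))
```

Started with `uvicorn app:app`, the service accepts a POST body like {"text": "..."} on /predict and responds with a JSON object such as {"label": "just do it"}.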
Reproducibility and workflow automation are handled through a DVC pipeline defined in a dvc.yaml file. The pipeline consists of three stages: build the dataset, train the model, and evaluate the model. The dataset stage runs a bash script that seeds and configures the environment, then calls a dataset builder that constructs labeled examples (e.g., mapping text to labels like “just do it” versus “stop doing it”). The builder splits the data into training and testing subsets using pandas operations and writes them out as Parquet files (train.parquet and test.parquet) at the paths specified in the config. DVC then tracks these artifacts and allows the steps to be reproduced consistently.
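A hedged sketch of what the dataset-building step might do follows; the example texts, column names, output paths, and split ratio are assumptions rather than the template’s actual values.

```python
# build_dataset.py - illustrative sketch of the dataset stage
# (paths, columns, labels, and split ratio are assumptions; the template's builder may differ)
from pathlib import Path

import pandas as pd

CONFIG = {
    "train_path": "artifacts/data/train.parquet",
    "test_path": "artifacts/data/test.parquet",
    "test_fraction": 0.25,
    "seed": 42,
}


def build_dataset(config: dict = CONFIG) -> None:
    # Construct labeled examples mapping text to one of two labels.
    examples = pd.DataFrame(
        {
            "text": [
                "ship the model to production",
                "add one more experiment before deciding",
                "write the test and merge it",
                "refactor everything again first",
            ],
            "label": ["just do it", "stop doing it", "just do it", "stop doing it"],
        }
    )

    # Split into train/test subsets with plain pandas operations.
    test = examples.sample(frac=config["test_fraction"], random_state=config["seed"])
    train = examples.drop(test.index)

    # Write Parquet artifacts to the config-specified paths so DVC can track them.
    for df, path in ((train, config["train_path"]), (test, config["test_path"])):
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        df.to_parquet(path, index=False)


if __name__ == "__main__":
    build_dataset()
```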
Training and evaluation run as subsequent DVC stages. The example uses a dummy model but still produces an evaluation metric (an accuracy of 1.0), demonstrating the end-to-end flow from data artifacts to measurable performance.
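The evaluation stage could look roughly like the sketch below, assuming the test Parquet artifact is scored and the metric is written to a small JSON file that DVC can track; the artifact paths, metrics file name, and the always-one-label dummy model are assumptions.

```python
# evaluate.py - illustrative sketch of the evaluation stage
# (artifact paths and the metrics file name are assumptions)
import json

import pandas as pd


class DummyModel:
    """Stand-in for the trained model: always predicts the same label."""

    def predict(self, texts: pd.Series) -> pd.Series:
        return pd.Series(["just do it"] * len(texts), index=texts.index)


def evaluate(
    test_path: str = "artifacts/data/test.parquet",
    metrics_path: str = "artifacts/metrics.json",
) -> float:
    test = pd.read_parquet(test_path)
    predictions = DummyModel().predict(test["text"])
    accuracy = float((predictions == test["label"]).mean())

    # Write the metric so the DVC stage has a tracked output.
    with open(metrics_path, "w") as f:
        json.dump({"accuracy": accuracy}, f)
    return accuracy


if __name__ == "__main__":
    print(f"accuracy: {evaluate():.2f}")
```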
Finally, deployment readiness comes from Docker. A Dockerfile is included so the image can be built and run locally; when started, the container exposes the FastAPI documentation and the same predict endpoint on the expected localhost URL. Overall, the template is presented as a fast way to bootstrap an ML pipeline that is reproducible, testable, serves inference over HTTP, and packages cleanly for deployment.
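Once the container (or a local uvicorn process) is running, a quick way to sanity-check the endpoint from Python might look like this; the host, port, and /predict path are assumptions based on common FastAPI defaults.

```python
# smoke_test.py - hedged example of calling the running prediction service
# (host, port, and the /predict path are assumptions)
import requests

response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"text": "ship the model to production"},
)
response.raise_for_status()
print(response.json())  # e.g. {"label": "just do it"}
```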
Cornell Notes
The template provides an end-to-end ML engineering scaffold: build a labeled text dataset, train and evaluate a model with reproducible DVC stages, serve predictions through a FastAPI REST endpoint, and package everything into a Docker image. It uses uv to create a virtual environment, install dependencies, and install the project in editable mode for development. DVC’s dvc.yaml defines three pipeline stages—dataset building, training, and evaluation—producing Parquet artifacts for train/test splits. Inference is exposed via app.py with a predict endpoint that accepts text and returns JSON labels. Dockerfile support lets the API run in a container suitable for VPS-style deployment.
How does the template ensure reproducible ML workflow across dataset, training, and evaluation?
What does the dataset-building stage actually produce, and in what format?
How is model inference exposed to users, and what does the client send and receive?
Why install the project in editable mode, and how does that affect development?
What role do pre-commit, Ruff, and pytest play in the workflow?
How does Docker fit into the deployment story?
Review Questions
- What three DVC stages are defined in the template’s pipeline, and what artifacts does the dataset stage generate?
- Describe the request/response pattern for the FastAPI predict endpoint, including the format of the response.
- How do uv, editable installation, and pre-commit hooks work together to support fast iteration in this project template?
Key Points
1. Clone the GitHub repository and use uv to install Python (if needed), create a virtual environment, and install dependencies quickly.
2. Install the project in editable mode so local code changes are immediately importable during development.
3. Use pre-commit with Ruff and pytest to enforce formatting and run tests before committing changes.
4. Define reproducible ML steps in DVC with a dvc.yaml pipeline consisting of dataset build, training, and evaluation stages.
5. Build the dataset into tracked Parquet artifacts (train.parquet and test.parquet) using pandas-based splitting and labeled examples.
6. Serve inference through FastAPI (app.py) with a predict endpoint that accepts text via POST and returns JSON predictions.
7. Package the service with Docker using the provided Dockerfile so the API can run in a container on a VPS or similar host.