Data Version Control: Git-compatible tool for versioning large datasets and models. Track dataset changes, reproduce experiments, and build ML pipelines with content-addressable storage.
Git is great for code but not for large files — pushing a 10GB training dataset to GitHub causes rate limits, slow clones, and repository bloat. DVC (Data Version Control) solves this by storing large files in a content-addressable cache (locally or in cloud storage like S3/GCS) and tracking them via small metadata files (.dvc) that go into Git. Teammates can reproduce your exact dataset and model versions by running dvc pull, which fetches the correct files from the remote storage based on the committed metadata. This gives you reproducibility without bloating your git repo.
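For illustration, the metadata stub DVC commits to Git is a small YAML file; the hash and size below are made-up placeholders, not real values:

```yaml
# data/train.jsonl.dvc — committed to Git; the real file lives in the DVC cache
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d   # placeholder content hash
  size: 10737418240                       # placeholder size in bytes
  hash: md5
  path: train.jsonl
```

Because the stub is a few bytes, Git history stays small no matter how large the tracked file is.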
```bash
# Install DVC with the storage backend you need
pip install dvc dvc-s3   # or dvc-gs, dvc-azure

# Initialise in an existing Git repo
git init my-llm-project && cd my-llm-project
dvc init
git add .dvc .dvcignore
git commit -m "Initialise DVC"

# Add remote storage (S3 example)
dvc remote add -d myremote s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"

# Track a large training dataset
dvc add data/train.jsonl
# Creates data/train.jsonl.dvc (tiny metadata file, goes in Git);
# data/train.jsonl itself is gitignored automatically
git add data/train.jsonl.dvc data/.gitignore
git commit -m "Add training dataset v1"
dvc push   # upload to remote storage

# Later, update the dataset
cp new_train.jsonl data/train.jsonl
dvc add data/train.jsonl   # updates the .dvc file with the new hash
git add data/train.jsonl.dvc
git commit -m "Update training dataset v2"
dvc push

# On another machine / CI:
git pull
dvc pull   # downloads the exact dataset version for this commit
```
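Under the hood, DVC identifies each file version by a content hash (MD5 by default). A minimal sketch of that content addressing using only the standard library — illustrative, not DVC's actual implementation:

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks, the way a content-addressable store keys it."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two byte-identical files always map to the same cache key,
# so unchanged data is never stored or uploaded twice.
```

Because the key depends only on file contents, `dvc push` can skip any file the remote already has.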
```yaml
# dvc.yaml — define reproducible pipeline stages
stages:
  preprocess:
    cmd: python preprocess.py --input data/raw.jsonl --output data/processed.jsonl
    deps:
      - preprocess.py
      - data/raw.jsonl
    outs:
      - data/processed.jsonl
    params:
      - params.yaml:
          - preprocessing.chunk_size
  train:
    cmd: python train.py --data data/processed.jsonl --output models/checkpoint
    deps:
      - train.py
      - data/processed.jsonl
    outs:
      - models/checkpoint
    metrics:
      - metrics/train_metrics.json:
          cache: false  # track in Git, not DVC
  evaluate:
    cmd: python evaluate.py --model models/checkpoint --output metrics/eval.json
    deps:
      - evaluate.py
      - models/checkpoint
    metrics:
      - metrics/eval.json:
          cache: false
```
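The `params` entry reads from a params.yaml file in the repo root. A hypothetical example matching the `preprocessing.chunk_size` key (all values here are illustrative):

```yaml
# params.yaml — hyperparameters tracked by DVC
preprocessing:
  chunk_size: 512
train:
  learning_rate: 2.0e-5
  epochs: 3
```

Changing a tracked parameter invalidates the stage that depends on it, so the next `dvc repro` re-runs that stage.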
```bash
dvc repro          # run only the stages whose dependencies changed
dvc dag            # visualise the pipeline DAG
dvc metrics show   # show current metrics (dvc metrics diff compares across commits)
```
```python
import subprocess

# Configure an S3 remote programmatically
subprocess.run(
    ["dvc", "remote", "add", "-d", "s3remote", "s3://my-models-bucket/dvc"],
    check=True,  # raise if the dvc command fails
)

# Set credentials via environment variables:
#   AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# Or use GCS:
#   dvc remote add -d gcsremote gs://my-bucket/dvc
#   GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account.json
# Or use SSH:
#   dvc remote add -d sshremote ssh://user@server/path/to/dvc
```
DVC is particularly useful in LLM workflows for: (1) Dataset versioning — track which version of your fine-tuning dataset produced which model checkpoint; (2) Model artifact versioning — version LoRA adapters and merged checkpoints alongside the code that produced them; (3) Experiment comparison — `dvc metrics diff` compares eval metrics across Git commits, showing you exactly how a dataset change affected model quality; (4) Reproducibility — any collaborator or CI job can reproduce the exact training run with `git checkout <commit> && dvc checkout` (running `dvc pull` first if the local cache is cold).
Use `dvc pull --run-cache` to also restore stage run caches. Beyond simple linear pipelines, dvc.yaml describes a directed acyclic graph (DAG) that captures the dependencies between preprocessing, training, and evaluation steps across multiple data sources. `dvc repro` walks this DAG in dependency order and re-executes only the stages whose inputs changed, which keeps iteration fast even in complex multi-stage projects.
DVC pipelines also compose with existing tooling: you can trigger `dvc repro` from a scheduler or CI job when data changes, and because the dvc.yaml file captures the entire ML workflow in version-controllable form, the pipeline itself can be reviewed, shared, and reproduced across team members and environments.
```yaml
# dvc.yaml — a pipeline using directories as dependencies and outputs
stages:
  preprocess:
    cmd: python scripts/preprocess.py
    deps:
      - data/raw/
    outs:
      - data/processed/
  train:
    cmd: python scripts/train.py
    deps:
      - data/processed/
      - src/model.py
    outs:
      - models/model.pkl
    plots:
      - metrics.json
  evaluate:
    cmd: python scripts/evaluate.py
    deps:
      - models/model.pkl
      - data/test/
    metrics:
      - eval_metrics.json
```
| Feature | Benefit | Use Case |
|---|---|---|
| Dataset Versioning | Track data changes | ML projects with evolving data |
| Pipeline Automation | Reproducible workflows | Complex multi-stage training |
| Remote Storage | Collaborative development | Team-based ML projects |
| Experiment Tracking | Hyperparameter comparison | Model selection and tuning |
DVC pairs naturally with Git: Git versions the code plus the small `.dvc` and `dvc.yaml` metadata files, while DVC's content-addressable cache handles the large binary artifacts that Git handles poorly. This separation of concerns keeps repositories lightweight while preserving the ability to reproduce any historical version of the full project, and it becomes more valuable as datasets grow and change frequently.
Remote storage configuration enables team collaboration at scale. By storing large files on shared S3 buckets, NFS mounts, or other remote backends, teams avoid storing gigabytes of data in each repository. Only the metadata remains in Git, keeping repositories lightweight while enabling full reproducibility. This pattern has become standard practice in professional ML teams working with substantial datasets.
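After `dvc remote add -d`, the remote lands in `.dvc/config`, a plain INI file that is safe to commit (credentials belong in `.dvc/config.local` or environment variables, not here); the bucket name below is illustrative:

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-bucket/dvc-store
```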
DVC pipelines scale from simple single-machine workflows to larger setups. `dvc repro` executes stages in dependency order and skips any stage whose inputs are unchanged, which keeps iteration fast; for scale-out, teams typically run pipelines from CI runners or cloud machines against shared remote storage rather than local hardware.
DVC's visualisation commands (`dvc dag`, `dvc plots`) help teams understand pipeline structure and data lineage. Seeing how data flows through preprocessing, training, and evaluation stages often proves as valuable as the core versioning capabilities, especially when onboarding new team members to complex ML projects.
Integrating DVC with CI/CD pipelines enables automated training and deployment workflows. When data changes, pipelines automatically retrain models, run evaluations, and update deployments. This automation reduces manual work and ensures models stay current with the latest data. Modern MLOps increasingly treats model training pipelines as first-class CI/CD workflows, bringing software engineering best practices to machine learning development.
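As a sketch of that CI pattern, a hypothetical GitHub Actions job (workflow name, secret names, and backend are placeholders, not a drop-in config) might pull the data and reproduce the pipeline on every push:

```yaml
# .github/workflows/train.yml — illustrative example
name: reproduce-pipeline
on: [push]
jobs:
  repro:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dvc dvc-s3
      - run: dvc pull          # fetch the exact data version for this commit
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - run: dvc repro         # re-run only stages whose inputs changed
      - run: dvc metrics show
```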
The cost of storage and computation for large pipelines can become substantial. DVC helps manage these costs by avoiding redundant computation—unchanged data and code skip reprocessing. Selective execution of only stages affected by changes saves both time and money. For teams working with large datasets and expensive models, these efficiency gains compound significantly.
DVC's comparison commands (`dvc metrics diff`, `dvc params diff`) enable systematic evaluation of different approaches: run pipeline variants with different hyperparameters, compare the resulting metrics across commits, and make evidence-based model-selection decisions rather than relying on intuition.
The learning curve is moderate for users already familiar with Git, though mastering the full feature set takes time. Teams that fully adopt DVC — including those managing terabytes of data with complex pipelines — tend to build more robust, reproducible, and scalable ML systems than those versioning data manually, and the investment pays off in productivity and reliability.
The integration of DVC with popular cloud platforms including AWS, Google Cloud, and Azure enables seamless scaling of ML workflows. Data can be stored in cloud object storage while computation happens in cloud environments. This cloud-native approach accommodates growth from laptop experiments to large-scale production systems without fundamental architectural changes. The flexibility to work locally during development and scale to cloud resources during production deployment represents a major advantage of DVC-based workflows.
Understanding and leveraging these cloud integrations unlocks enterprise-scale ML operations. For teams that prefer graphical workflows, DVC Studio adds a visual interface for monitoring and managing DVC projects.