DVC (Data Version Control)

Git-style version control for data and ML models, enabling reproducible experiments.

by Iterative (DVC) · Marketing Analytics

Executive Summary

DVC (Data Version Control) is an open-source command-line tool and VS Code extension designed to bring engineering best practices like version control to machine learning and data science projects. It allows users to version large datasets and machine learning models, track experiments, and manage ML pipelines in a Git-like fashion. DVC stores metadata in Git repositories while managing the actual data blobs in various remote storage solutions. The tool facilitates reproducible ML workflows by codifying data and machine learning models, enabling users to capture pipelines and access data anywhere. It integrates seamlessly with Git, extending its capabilities to handle large files and directories that are typically problematic for Git alone. DVC also offers a Python API for direct data and model access within source code, and integrates with Iterative Studio for web-based data/model management and experiment tracking.

Use Cases

  • Version control for large datasets and ML models
  • Experiment tracking and management for machine learning projects
  • Building reproducible ML pipelines and workflows
  • Managing data registries for ML artifacts
  • Facilitating scalable and distributed ML workflows

Features

Visibility

  • Experiment Visualization: Visualize experiment parameters, metrics, and results, often through integration with Iterative Studio.
  • Data and Model Lineage: Track the lineage of data and models through ML pipelines, ensuring reproducibility and auditability.

Technical Specifications

Architecture
DVC extends Git's capabilities to version large files and directories by storing pointers in Git and the actual data in remote storage (e.g., S3, GCS, Azure Blob Storage, local filesystem). It uses a `.dvc` file to track data versions and pipelines, ensuring reproducibility.
Deployment
Self-Hosted
API Available
Yes

Infrastructure

  • AWS
  • GCP
  • Azure

Integrations

  • Git
  • VS Code
  • Iterative Studio
  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage
  • Ray
  • CML

Security & Compliance

Encryption: Relies on encryption capabilities of the underlying cloud storage providers (e.g., S3 encryption, GCS encryption).

Pricing

Model
Open-source (free)
Starting Price
Free
Target Customer
SMB,Mid-Market,Enterprise
Free Trial
No (no credit card required)