Data Version Control · DVC
Data Version Control (DVC) offers tools and services for managing machine learning projects, including experiment tracking, version control, and dataset management.
Services
DVC's services are designed to streamline machine learning (ML) operations. Key services include DVC Studio, a tool for tracking experiments and sharing insights from ML projects, and DVC Platinum Services, which provide expertise in data engineering, data science, and project management. The company assists with project planning and execution, and with aligning design and architecture. It also offers solutions for end-to-end data version control, project architecture, data pipelines, and data curation, with the aim of establishing common standards and frameworks for scalable operations.
Products
DVC's products are centered on the ML workflow. The core product is an open-source version control system for managing ML projects. DVC Studio tracks and shares ML experiment insights. The VS Code Extension supports local ML model development and experiment tracking. Dataset Factory creates generative computer vision datasets. DVC also provides a model registry for managing model lifecycles, along with tools that integrate remote storage services to manage and version image, audio, video, and text files.
Features
DVC provides several features for ML projects. Key capabilities include managing and versioning different types of files, organizing reproducible ML modeling workflows, and using pipelines to connect versioned data sources and code. According to DVC, users can quickly filter datasets of up to a billion samples and create datasets from queries for ML model training. The platform also integrates with ML platforms and tools, including Amazon SageMaker, Databricks, and Hugging Face. Following GitOps principles, DVC lets users track experiments, compare results, and restore experiment states across teams.
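The idea behind a reproducible pipeline stage can be sketched in a few lines of Python. This is a minimal illustration, not DVC's implementation: the names `fingerprint`, `run_stage`, and the JSON lock file format are hypothetical, and SHA-256 is chosen only for convenience. A stage reruns only when the hashes of its dependencies differ from those recorded on the previous run.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


def fingerprint(paths: List[Path]) -> Dict[str, str]:
    """Map each dependency path to the SHA-256 hash of its contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def run_stage(
    name: str,
    deps: List[Path],
    command: Callable[[], None],
    lock_file: Path,
) -> bool:
    """Run `command` only if the dependency hashes changed since the last run.

    Returns True if the command was executed, False if the stage was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = fingerprint(deps)
    if lock.get(name) == current:
        return False  # dependencies unchanged: nothing to reproduce
    command()
    lock[name] = current  # record the hashes this run was based on
    lock_file.write_text(json.dumps(lock, indent=2, sort_keys=True))
    return True
```

Calling `run_stage` twice in a row with unchanged inputs executes the command once; editing any dependency file causes the next call to run it again.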
Integration with Remote Storage and ML Frameworks
DVC supports a variety of remote storage types, including Amazon S3, NFS, SSH, Google Drive, Azure Blob Storage, and HDFS, for managing and versioning large files. The platform is compatible with numerous ML frameworks, such as Catalyst, Fast.ai, Hugging Face Accelerate, Keras, LightGBM, PyTorch, Scikit-learn, TensorFlow, and XGBoost. These integrations support model development, experiment tracking, and data management, and help keep diverse ML projects scalable and reproducible.
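Remote storage locations are typically identified by URL. As a rough sketch, the helper below maps a remote URL to one of the storage types listed above. The function name `storage_type`, the example URLs, and the treatment of a scheme-less path as a local directory (such as an NFS mount) are assumptions made for illustration; this is not DVC's own resolution logic.

```python
from urllib.parse import urlparse

# Assumed mapping from URL scheme to the storage types named in the text.
SCHEME_TO_STORAGE = {
    "s3": "Amazon S3",
    "ssh": "SSH",
    "gdrive": "Google Drive",
    "azure": "Azure Blob Storage",
    "hdfs": "HDFS",
}


def storage_type(url: str) -> str:
    """Return a human-readable storage type for a remote URL."""
    scheme = urlparse(url).scheme
    if not scheme:
        # No scheme: treat it as a filesystem path, e.g. a mounted NFS share.
        return "local path (e.g. NFS mount)"
    return SCHEME_TO_STORAGE.get(scheme, "unknown")


# Hypothetical remotes, for illustration only.
for remote in ("s3://example-bucket/dvc-store", "/mnt/nfs/dvc-store"):
    print(remote, "->", storage_type(remote))
```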
Version Control System for Machine Learning
DVC offers an open-source version control system tailored to machine learning projects. It organizes ML workflows into reproducible steps and connects storage to code repositories, so that large data and model files are versioned alongside code while the files themselves are kept in separate storage. Users can track experiments in Git, compare results, and restore entire experiment states across teams. A Python API provides functions for handling metrics, parameters, and plots, and self-hosting options for DVC Studio are available via AWS AMI and Kubernetes (Helm).
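Versioning a large file alongside code can be sketched as follows: the file is copied into a content-addressed cache, and only a small pointer file recording its hash is kept next to the code. This is a simplified illustration under stated assumptions, not DVC's actual on-disk format: the `track` and `restore` helpers, the `.pointer.json` suffix, and the use of SHA-256 are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into the cache and write a small pointer file beside it."""
    checksum = file_sha256(path)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(path, cached)
    pointer = path.with_name(path.name + ".pointer.json")
    meta = {"sha256": checksum, "size": path.stat().st_size, "path": path.name}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file described by `pointer` from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target
```

In this sketch, only the pointer file would be committed to Git. Checking out an older commit and calling `restore` on its pointer brings back the matching version of the data, provided that version is still in the cache.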