Data Versioning

Updated May 13, 2023 ·

Overview

Tracking data and model versions is essential in machine learning to ensure reproducibility and traceability.

Versioning is categorized into two types:

Major Versioning:
- Represents significant changes
- Example: Adding new features or methods
Minor Versioning
- Indicates smaller updates
- Bug fixes or slight adjustments.

We can version data by labeling it with unique identifiers. For example:

A feature store manages the versions of features across experiments, ensures consistent data usage in future tests, and avoids duplication.

For more information, please see Feature Store.

Model versioning allows tracking of model changes over time. For example:

Here’s how you can log a model version with MLflow:

import mlflow

# Set version to 1.0
mlflow.set_tag("model_version", "1.0")

# Log model
with mlflow.start_run():
    mlflow.log_model(model, "model")

This logs the model version, and you can track changes across experiments.

A model store is like a feature store but for models. It ensures models can be reused and reproduced across experiments.