Profiling and Versioning

Updated May 14, 2023 ·

Overview

Data profiling automates the analysis of data to create summaries that help us validate and monitor data in production.

Data profiling – Creates summaries of data to identify issues.
Data profiles – Validate user inputs and track model performance.

Data Profiles

Data profiles help detect invalid inputs and monitor when retraining is needed. They are used as feedbacks given to the user if they are providing wrong inputs.

When we don't use data profiles, we risk getting errors and performance issues.

Erroneous inputs
- Users may provide incorrect data
- Leads to wrong predictions
Data drift
- Without monitoring, we may miss when data changes
- Model becomes invalid, which affects performance

ML Pipeline Checklist

This checklist helps ensure that all key aspects of the ML pipeline are covered.

☑ Ensure the code is versioned ☑ Ensure the data is versioned ☑ Train the model ☑ Save the model ☑ Create the data profile ☑ Record exact version of training data

As part of the checklist, we also need to ensure we have the exact version of the training data. This is called data versioning.

Data Versioning

Data versioning tracks the version of data used for training to ensure models are reproducible.

The full training dataset is not stored in the package.
The data is stored in a centralized data store.
A pointer is created to reference the stored data.
A dataset fingerprint verifies data consistency.

The training pipeline also records metadata to recreate the exact train-test split used for performance evaluation.

Overview​

Data Profiles​

ML Pipeline Checklist​

Data Versioning​

Overview

Data Profiles

ML Pipeline Checklist

Data Versioning