Testing Data

Updated May 13, 2023

Overview

Testing data ensures it is accurate and consistent before it enters an ML pipeline. There are two common ways to test data:

  • Data validation tests look for missing values, inconsistencies, or abnormal data.
  • Schema tests verify that data formats meet expectations.

Example: If "timeToValue" is measured in minutes when the pipeline expects seconds, downstream calculations can be thrown off.
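A minimal sketch of both kinds of test, assuming a pandas DataFrame with hypothetical user_id and timeToValue (seconds) columns:

import pandas as pd

def validate_events(df: pd.DataFrame) -> None:
    # Schema test: the expected columns must be present
    missing = {"user_id", "timeToValue"} - set(df.columns)
    assert not missing, f"Missing columns: {missing}"
    # Schema test: timeToValue must be numeric (seconds, not a string like "2m")
    assert pd.api.types.is_numeric_dtype(df["timeToValue"]), "timeToValue must be numeric"
    # Data validation test: no missing or negative durations
    assert df["timeToValue"].notna().all(), "timeToValue has missing values"
    assert (df["timeToValue"] >= 0).all(), "timeToValue has negative values"

validate_events(pd.DataFrame({"user_id": [1, 2], "timeToValue": [120.0, 300.0]}))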

Beyond Simple Testing

In addition to basic validation, we need more advanced tests to check for complex data issues.

  • Check whether values fall within expected ranges.
  • Verify trends or patterns in the data.
  • Example: average time spent on a website should be around 4 minutes.

Expectation tests can be used to ensure data values align with business expectations, like user behavior trends.
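A minimal sketch of such a test, assuming a hypothetical visit_time_seconds column and treating the 4-minute figure and the tolerance band as business-defined thresholds:

import pandas as pd

def check_visit_time(df: pd.DataFrame) -> None:
    # Expectation test: mean visit time should be roughly 4 minutes (240 s);
    # the 180-300 s band is an assumed business tolerance
    mean_visit = df["visit_time_seconds"].mean()
    assert 180 <= mean_visit <= 300, f"Mean visit time {mean_visit:.0f}s is outside the expected band"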

Expectation Tests

These tests validate whether data conforms to predefined expectations, ensuring it fits the patterns the business anticipates.

  • Check whether data meets criteria defined by business logic.
  • Test for consistency over time.

Example: Check that every date in a patient's visit history falls before today. This prevents future dates from appearing in historical data.
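A minimal sketch of that check, assuming a hypothetical visit_date column:

import pandas as pd

def check_no_future_visits(df: pd.DataFrame) -> None:
    # Expectation test: historical visit dates must not lie in the future
    today = pd.Timestamp.today().normalize()
    future = df[pd.to_datetime(df["visit_date"]) > today]
    assert future.empty, f"{len(future)} visit records are dated after today"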

Feature Importance Tests

Feature Importance Tests evaluate which data features most affect model predictions.

  • Test model sensitivity to individual features.
  • Identify which features contribute the most to performance.

Permutation importance randomly shuffles each feature's values in turn and checks whether performance drops. This measures how sensitive the model is to each feature and helps decide whether retraining with new data is needed.

Permutation Importance Example

In the example below, Python and scikit-learn are used to implement permutation importance. We train a random forest classifier and then apply the permutation_importance function to evaluate feature importance. This method works with many model types, including regression models.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out 20% of it for evaluation
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Train a random forest classifier on the training split
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and measure the accuracy drop
results = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# Display the mean and standard deviation of importance for each feature
for i in range(X_train.shape[1]):
    print(f"Feature {i}: Mean importance = {results.importances_mean[i]}, Std = {results.importances_std[i]}")

The output displays the mean and standard deviation of the importance score for each feature in the dataset. The higher the mean importance, the more the model's performance depends on that feature.

Looking for Data Drift

Monitoring for data drift helps identify changes in input data or labels that may affect model performance.

  • Detect changes in feature distributions (data drift).
  • Identify shifts in label distributions (label drift).

Example: If user behavior changes, the model might underperform on new data. Monitoring helps address these issues early.
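One illustrative way to detect feature drift (among several) is a two-sample Kolmogorov-Smirnov test from scipy; the simulated data and the 0.05 threshold below are assumptions for the sketch:

import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, current, alpha=0.05):
    # A small p-value means the two samples are unlikely to come from the
    # same distribution, suggesting the feature has drifted
    _, p_value = ks_2samp(reference, current)
    return p_value < alpha

# Simulated example: visit times whose mean shifts from 240 s to 300 s
rng = np.random.default_rng(42)
reference = rng.normal(loc=240, scale=30, size=1000)  # training-time data
current = rng.normal(loc=300, scale=30, size=1000)    # incoming data
print("Drift detected:", detect_drift(reference, current))

The same comparison applied to label distributions can flag label drift.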