Testing Models

Updated May 13, 2023 ·

Overview

Testing ML models is as important as testing the data. It ensures that the model is reliable and performs well in different situations.

Individual vs Group Fairness

Testing for fairness in ML models ensures they don’t unfairly favor certain individuals or groups.

Individual fairness
- Similar individuals should receive similar predictions.
- Example: People with similar qualifications should have equal job opportunities.
Group fairness
- Different groups (e.g., based on race or gender) should be treated equally.
- Example: A loan approval model should not discriminate based on race.

Holdout Testing

Holdout testing involves evaluating the model on a separate dataset that wasn’t used during training.

It helps identify issues like overfitting or underfitting.
This ensures the model is reliable and can generalize to new data.

The goal is to assess a model's performance based on data the model did not train on.

Model Drifts

Models can change over time, affecting their performance. It's important to monitor this drift.

Concept drift:
- Changes in the relationship between features and output.
- Features may lose relevance over time.
Prediction drift:
- Shift in the model’s prediction distribution.
- Results may become inconsistent over time.
Label drift
- The distribution of the actual labels changes.
- Affects the model's ability to predict accurately.

Both prediction drift and label drift happen when the underlying data changes and are measured at the model level.

Example: Adding Noises

This example shows how to simulate concept drift by adding noise to the test data.

The additional noise simulates data drift, testing how well the model adapts to new or unexpected patterns.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Adding Gaussian noise simulate concept drift
noise = np.random.normal(0, 0.1, X_test.shape)  
X_test_noisy = X_test + noise

# Evaluate model performance on noisy data
score = model.score(X_test_noisy, y_test)
print(f"Model accuracy with concept drift: {score:.4f}")

Complex Models vs Baseline Models

Complex models often have higher costs in terms of training time and latency compared to simpler baseline models.

Latency
- Time taken to make a prediction.
- Crucial for real-time applications.
Throughput
- The number of predictions made in a given time.
- Metric used by applications with high-volume processing.
- Important for large-scale or batch processing applications.

By testing a model's latency and throughput, we can determine the maximum complexity the model can handle while maintaining good performance. This helps in optimizing the model for both accuracy and efficiency.

Overview​

Individual vs Group Fairness​

Holdout Testing​

Model Drifts​

Example: Adding Noises​

Complex Models vs Baseline Models​