Effective Documentation

Updated May 13, 2023

Overview

Good documentation makes machine learning projects easier to use, understand, and improve. It helps track data, models, and decisions for future reference.

Key Documentation Areas

ML documentation should cover six main areas:

Data sources – Where the data comes from and how to access it
Data schemas – Structure and organization of data
Labeling methods – How data is labeled for training
Model pseudocode – Steps involved in building and running the model
Model experiments – Testing, selection, and hyperparameters
Training environments – Software and settings used for training

Data Sources

Tracking data sources supports quality control and long-term access.

Helps compare different datasets
Identifies inconsistencies or errors
Makes it easier to update or replace data
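
As a rough illustration of what a data-source entry might record, the sketch below stores a location, access notes, and a checksum of the snapshot so that later changes to the data can be detected. The class name, field names, file path, and date are all hypothetical placeholders, not part of any standard.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class DataSourceRecord:
    """One entry in a data-source log (all field names are illustrative)."""
    name: str
    location: str       # where the data lives (path, bucket, or URL)
    access_notes: str   # how to obtain access
    retrieved_on: str   # ISO date of the snapshot
    sha256: str         # checksum of the snapshot, to detect silent changes


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of a data snapshot."""
    return hashlib.sha256(data).hexdigest()


# A tiny in-memory snapshot stands in for a real file.
snapshot = b"customer_id,purchase_amount\n1,19.99\n2,5.00\n"
record = DataSourceRecord(
    name="purchases",
    location="data/purchases.csv",            # hypothetical path
    access_notes="read access via the data team",
    retrieved_on="2023-01-01",                # placeholder date
    sha256=checksum(snapshot),
)
print(json.dumps(asdict(record), indent=2))
```

Comparing the stored checksum with a freshly computed one is a cheap way to notice that a dataset was updated or replaced.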

Data Schemas

Data schemas describe the structure of datasets to maintain consistency.

Defines fields, types, and relationships
Helps organize unstructured data
Ensures models learn from properly formatted inputs

Example schema, shown as a table:

Field Name	Data Type	Data Order	Description
`customer_id`	Integer	Nominal	Unique identifier for each customer
`purchase_amount`	Float	Ordinal	Amount spent in the transaction
`purchase_date`	Datetime	Ordinal	Date and time of the purchase

Where:

Nominal: Categories without inherent order (e.g., customer IDs)
Ordinal: Data with a meaningful sequence (e.g., dates, amounts)
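
A documented schema can also be checked mechanically. The sketch below expresses the three fields from the table above as a Python mapping and reports records that do not conform. The `validate` helper and the sample records are assumptions made for illustration.

```python
from datetime import datetime

# Schema from the table above, expressed as field name -> expected Python type.
SCHEMA = {
    "customer_id": int,
    "purchase_amount": float,
    "purchase_date": datetime,
}


def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty if it conforms)."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    for name in record:
        if name not in SCHEMA:
            problems.append(f"unexpected field: {name}")
    return problems


good = {
    "customer_id": 42,
    "purchase_amount": 19.99,
    "purchase_date": datetime(2023, 1, 1, 12, 30),
}
bad = {"customer_id": "42", "purchase_amount": 19.99}

print(validate(good))  # []
print(validate(bad))   # one type problem and one missing field
```

Running a check like this before training is one way to catch badly formatted inputs early.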

Labeling Methods

For classification tasks, clear labeling methods improve accuracy and reproducibility.

Explains how data is categorized
Tracks label changes over time
Helps refine labels for better model performance

Example process:

Raw images collected from website logs  
Labeled using pre-trained model + manual verification  
Labels reviewed and corrected by domain expert  
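
One possible way to track label changes over time is an append-only history per item, so that each stage of the process above leaves a trace. The class, item ID, and label values below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LabelHistory:
    """Label history for one item (names here are illustrative)."""
    item_id: str
    events: list = field(default_factory=list)

    def set_label(self, label: str, source: str) -> None:
        """Append a label event instead of overwriting the previous one."""
        self.events.append({"label": label, "source": source})

    @property
    def current(self) -> Optional[str]:
        """Most recent label, or None if the item is still unlabeled."""
        return self.events[-1]["label"] if self.events else None


history = LabelHistory("img_0001")
history.set_label("cat", source="pre-trained model")
history.set_label("cat", source="manual verification")
history.set_label("lynx", source="domain expert review")

print(history.current)      # lynx
print(len(history.events))  # 3
```

Keeping every event, not just the latest label, makes it possible to see who or what changed a label and when in the process it happened.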

Model Pseudocode

Simplified steps outline the model’s structure and logic.

Helps understand data flow and transformations
Provides a reference for debugging
Acts as a blueprint for future improvements

Example pseudocode:

Load dataset
Preprocess data
Split into training and test sets
Train model using logistic regression
Evaluate performance
Save trained model
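
To show how the steps above can map onto code, here is a minimal sketch that uses only the Python standard library. The synthetic dataset, the learning rate, and the iteration count are assumptions chosen for illustration. A real project would more likely use a library such as scikit-learn.

```python
import json
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# 1. Load dataset (synthetic here: the label is 1 when x > 5)
raw = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 1 if x > 5 else 0) for x in raw]

# 2. Preprocess data: standardize the single feature
mean = sum(raw) / len(raw)
std = math.sqrt(sum((x - mean) ** 2 for x in raw) / len(raw))
rows = [((x - mean) / std, y) for x, y in data]

# 3. Split into training and test sets (80/20)
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]


# 4. Train model using logistic regression (batch gradient descent)
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in train) / len(train)
    grad_b = sum(sigmoid(w * x + b) - y for x, y in train) / len(train)
    w -= lr * grad_w
    b -= lr * grad_b

# 5. Evaluate performance (accuracy on the held-out test set)
correct = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in test)
accuracy = correct / len(test)

# 6. Save trained model (as a JSON string here; write it to a file in practice)
saved_model = json.dumps({"w": w, "b": b, "mean": mean, "std": std})
print(f"test accuracy: {accuracy:.2f}")
```

The numbered comments mirror the pseudocode line by line, which keeps the two easy to compare when the implementation changes.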

Model Experiments

After collecting and labeling data, we document how we tested and chose our ML models. This helps track progress and allows others to improve the process.

Model choices: List tested architectures and selection criteria.
Performance metrics: Explain how the best model was chosen.
Hyperparameters: Record different settings tested during training.
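
An experiment log can be as simple as one record per run. In the sketch below, the class name, the metric name, and all numbers are placeholders for illustration, not real results.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRun:
    """One tested configuration and its outcome (illustrative structure)."""
    model: str
    hyperparameters: dict
    metrics: dict


def best_run(runs, metric):
    """Return the run with the highest value of the chosen metric."""
    return max(runs, key=lambda run: run.metrics[metric])


# Placeholder values only; these are not measured results.
runs = [
    ExperimentRun("logistic_regression", {"C": 1.0}, {"val_accuracy": 0.81}),
    ExperimentRun("logistic_regression", {"C": 0.1}, {"val_accuracy": 0.84}),
    ExperimentRun("random_forest", {"n_estimators": 100}, {"val_accuracy": 0.83}),
]

chosen = best_run(runs, "val_accuracy")
print(chosen.model, chosen.hyperparameters)
```

Recording the selection metric alongside each run documents not only which model was chosen but also why.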

Training Environments

Along with model selection, we should document the training environment, as reproducibility depends on capturing the exact setup used.

Lists dependencies (e.g., TensorFlow, Scikit-learn)
Specifies hardware details (CPU, GPU, RAM)
Logs random seeds so results can be repeated

For example, a change in data processing or in the random seed can shift model performance, and if the change was not recorded, the earlier result cannot be reproduced or explained.
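
Part of the environment can be captured automatically. The sketch below uses only the Python standard library to record the interpreter version, platform, seed, and installed package versions. The function name and the package list are assumptions, and hardware details such as GPU model or RAM would need additional tooling.

```python
import json
import platform
import random
from importlib import metadata


def capture_environment(seed, packages):
    """Collect a snapshot of the training environment as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "seed": seed,
        "packages": versions,
    }


SEED = 42  # arbitrary example value
random.seed(SEED)
env = capture_environment(SEED, ["tensorflow", "scikit-learn"])
print(json.dumps(env, indent=2))
```

Saving this dictionary next to each trained model ties the model to the setup that produced it.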
