Skip to main content

Effective Documentation

Updated May 13, 2023 ·

Overview

Good documentation makes machine learning projects easier to use, understand, and improve. It helps track data, models, and decisions for future reference.

Key Documentation Areas

ML documentation should cover six main areas:

  • Data sources – Where the data comes from and how to access it
  • Data schemas – Structure and organization of data
  • Labeling methods – How data is labeled for training
  • Model pseudocode – Steps involved in building and running the model
  • Model experiments – Testing, selection, and hyperparameters
  • Training environments – Software and settings used for training

Data Sources

Tracking data sources ensures quality control and long-term access.

  • Helps compare different datasets
  • Identifies inconsistencies or errors
  • Makes it easier to update or replace data

Data Schemas

Data schemas describe the structure of datasets to maintain consistency.

  • Defines fields, types, and relationships
  • Helps organize unstructured data
  • Ensures models learn from properly formatted inputs

Example schema in JSON format:

Field NameData TypeData OrderDescription
customer_idIntegerNominalUnique identifier for each customer
purchase_amountFloatOrdinalAmount spent in the transaction
purchase_dateDatetimeOrdinalDate and time of the purchase

Where:

  • Nominal: Categories without inherent order (e.g., customer IDs)
  • Ordinal: Data with a meaningful sequence (e.g., dates, amounts)

Labeling Methods

For classification tasks, clear labeling methods improve accuracy and reproducibility.

  • Explains how data is categorized
  • Tracks label changes over time
  • Helps refine labels for better model performance

Example process:

1. Raw images collected from website logs  
2. Labeled using pre-trained model + manual verification
3. Labels reviewed and corrected by domain expert

Model Pseudocode

Simplified steps outline the model’s structure and logic.

  • Helps understand data flow and transformations
  • Provides a reference for debugging
  • Acts as a blueprint for future improvements

Example pseudocode:

1. Load dataset
2. Preprocess data
3. Split into training and test sets
4. Train model using logistic regression
5. Evaluate performance
6. Save trained model

Model Experiments

After collecting and labeling data, we document how we tested and chose our ML models. This helps track progress and allows others to improve the process.

  • Model choices: List tested architectures and selection criteria.
  • Performance metrics: Explain how the best model was chosen.
  • Hyperparameters: Record different settings tested during training.

Training Environments

Along with model selection, we should document the training environment, as reproducibility depends on capturing the exact setup used.

  • Lists dependencies (e.g., TensorFlow, Scikit-learn)
  • Specifies hardware details (CPU, GPU, RAM)
  • Logs random seeds to ensure consistent results

For example, changes in data processing or random seeds can impact model performance if not properly recorded.