Effective Documentation
Overview
Good documentation makes machine learning projects easier to use, understand, and improve. It helps track data, models, and decisions for future reference.
Key Documentation Areas
ML documentation should cover six main areas:
- Data sources – Where the data comes from and how to access it
- Data schemas – Structure and organization of data
- Labeling methods – How data is labeled for training
- Model pseudocode – Steps involved in building and running the model
- Model experiments – Testing, selection, and hyperparameters
- Training environments – Software and settings used for training
Data Sources
Tracking data sources ensures quality control and long-term access.
- Helps compare different datasets
- Identifies inconsistencies or errors
- Makes it easier to update or replace data
Data Schemas
Data schemas describe the structure of datasets to maintain consistency.
- Defines fields, types, and relationships
- Helps organize unstructured data
- Ensures models learn from properly formatted inputs
Example schema in JSON format:
Field Name | Data Type | Data Order | Description |
---|---|---|---|
customer_id | Integer | Nominal | Unique identifier for each customer |
purchase_amount | Float | Ordinal | Amount spent in the transaction |
purchase_date | Datetime | Ordinal | Date and time of the purchase |
Where:
- Nominal: Categories without inherent order (e.g., customer IDs)
- Ordinal: Data with a meaningful sequence (e.g., dates, amounts)
Labeling Methods
For classification tasks, clear labeling methods improve accuracy and reproducibility.
- Explains how data is categorized
- Tracks label changes over time
- Helps refine labels for better model performance
Example process:
1. Raw images collected from website logs
2. Labeled using pre-trained model + manual verification
3. Labels reviewed and corrected by domain expert
Model Pseudocode
Simplified steps outline the model’s structure and logic.
- Helps understand data flow and transformations
- Provides a reference for debugging
- Acts as a blueprint for future improvements
Example pseudocode:
1. Load dataset
2. Preprocess data
3. Split into training and test sets
4. Train model using logistic regression
5. Evaluate performance
6. Save trained model
Model Experiments
After collecting and labeling data, we document how we tested and chose our ML models. This helps track progress and allows others to improve the process.
- Model choices: List tested architectures and selection criteria.
- Performance metrics: Explain how the best model was chosen.
- Hyperparameters: Record different settings tested during training.
Training Environments
Along with model selection, we should document the training environment, as reproducibility depends on capturing the exact setup used.
- Lists dependencies (e.g., TensorFlow, Scikit-learn)
- Specifies hardware details (CPU, GPU, RAM)
- Logs random seeds to ensure consistent results
For example, changes in data processing or random seeds can impact model performance if not properly recorded.