# Reproducibility in Machine Learning
Learn how to make your ML experiments reproducible for yourself and others.
"It worked yesterday!" If you can't reproduce your results, you can't trust them. Reproducibility isn't optional - it's essential.
## Why Reproducibility Matters

- Debug issues (you need to recreate the problem)
- Compare experiments fairly
- Share with colleagues
- Deploy to production with confidence
- Maintain scientific integrity
## Level 1: Random Seeds
Set seeds everywhere:
```python
import numpy as np
import random
import os

def set_all_seeds(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    # Note: PYTHONHASHSEED must be set before the interpreter starts
    # to affect str hashing; setting it here only helps subprocesses
    os.environ['PYTHONHASHSEED'] = str(seed)

    # For TensorFlow
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass

    # For PyTorch
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass

# Call at start of every experiment
set_all_seeds(42)
```
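Seeds alone don't cover everything in PyTorch: `DataLoader` worker processes draw their own seeds, so shuffling and augmentation can still vary between runs. Below is a sketch following PyTorch's documented recipe for deterministic data loading; `dataset` stands in for your own `Dataset`:

```python
import numpy as np
import random
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a per-worker seed from the torch seed so numpy/random
    # inside each worker process are reproducible too
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,                     # placeholder: your torch Dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds each worker deterministically
    generator=g,                 # fixes the shuffle order
)
```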
Also set seeds in model construction and data splitting:
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Always specify random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
```
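The same applies to cross-validation: if the splitter isn't seeded, each run evaluates on different folds and scores aren't comparable. A minimal sketch, reusing `X` and `y` from above:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Seed the splitter as well as the model so every run
# evaluates on exactly the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```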
## Level 2: Environment Management
### requirements.txt with Versions
```
# Save exact versions
pip freeze > requirements.txt

# requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
xgboost==1.7.0
```
### Better: Use Poetry or Conda
```yaml
# environment.yml
name: ml-project
dependencies:
  - python=3.10
  - numpy=1.24.0
  - pandas=2.0.0
  - scikit-learn=1.3.0
  - pip:
      - xgboost==1.7.0
```
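Pinned files only help if the running environment actually matches them. A sketch that checks installed versions at startup via `importlib.metadata`; the `EXPECTED` dict is hypothetical and should mirror your pinned file:

```python
from importlib.metadata import version

# Hypothetical pins; keep in sync with requirements.txt / environment.yml
EXPECTED = {
    'numpy': '1.24.0',
    'pandas': '2.0.0',
    'scikit-learn': '1.3.0',
}

for package, expected in EXPECTED.items():
    installed = version(package)
    if installed != expected:
        print(f"WARNING: {package} is {installed}, expected {expected}")
```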
## Level 3: Track Experiments
### Simple: Log Everything
```python
import json
from datetime import datetime

def log_experiment(params, metrics, notes=""):
    experiment = {
        'timestamp': datetime.now().isoformat(),
        'params': params,
        'metrics': metrics,
        'notes': notes
    }
    with open('experiments.jsonl', 'a') as f:
        f.write(json.dumps(experiment) + '\n')

# Usage
log_experiment(
    params={'n_estimators': 100, 'max_depth': 10},
    metrics={'accuracy': 0.85, 'f1': 0.82},
    notes='First baseline'
)
```
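One advantage of the JSONL format is that it's trivial to load back for comparison. A sketch, assuming the `experiments.jsonl` written above:

```python
import pandas as pd

# Each line is one JSON record, so lines=True yields one row per run
runs = pd.read_json('experiments.jsonl', lines=True)

# Flatten the nested params/metrics dicts into columns
params = pd.json_normalize(runs['params'])
metrics = pd.json_normalize(runs['metrics'])
summary = pd.concat([runs['timestamp'], params, metrics], axis=1)

print(summary.sort_values('accuracy', ascending=False).head())
```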
### Better: Use MLflow
```python
import mlflow

mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1", f1)

    # Log model
    mlflow.sklearn.log_model(model, "model")
```
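For scikit-learn (and several other frameworks), MLflow can also capture parameters, metrics, and the fitted model automatically; a sketch using autologging in place of the explicit `log_*` calls:

```python
import mlflow

mlflow.set_experiment("my_experiment")
mlflow.sklearn.autolog()  # records params, metrics, and the model on fit()

with mlflow.start_run():
    model.fit(X_train, y_train)
```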
## Level 4: Version Data
Data changes can break reproducibility:
```python
import hashlib
import pandas as pd

def hash_dataframe(df):
    return hashlib.md5(
        pd.util.hash_pandas_object(df).values
    ).hexdigest()

# Log data hash
data_hash = hash_dataframe(df)
print(f"Data hash: {data_hash}")

# Or use DVC for data versioning:
#   dvc add data/training_data.csv
```
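To make the hash actively protect you rather than just sit in a log, you can verify it at load time. A sketch with a hypothetical `load_verified` helper; `EXPECTED_HASH` is a placeholder you'd record from a trusted run:

```python
import hashlib
import pandas as pd

def load_verified(path, expected_hash):
    """Load a CSV and fail fast if its contents have changed."""
    df = pd.read_csv(path)
    actual = hashlib.md5(
        pd.util.hash_pandas_object(df).values
    ).hexdigest()
    if actual != expected_hash:
        raise ValueError(
            f"Data hash mismatch for {path}: "
            f"expected {expected_hash}, got {actual}"
        )
    return df

EXPECTED_HASH = "0" * 32  # placeholder: record the real hash from a trusted run
df = load_verified('data/train.csv', EXPECTED_HASH)
```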
## Level 5: Code Version Control
Always commit before running experiments:
```python
import subprocess

def get_git_commit():
    try:
        commit = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD']
        ).decode().strip()
        return commit
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

# Log with the experiment (extend log_experiment from Level 3
# to accept and store extra fields like git_commit)
log_experiment(
    params={...},
    metrics={...},
    git_commit=get_git_commit()
)
```
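A logged commit is only trustworthy if the working tree was clean when the experiment ran. One way to enforce the "commit before running" rule, sketched with `git status --porcelain`:

```python
import subprocess

def assert_clean_worktree():
    """Refuse to run if there are uncommitted changes, so the
    logged commit actually matches the code that ran."""
    status = subprocess.check_output(
        ['git', 'status', '--porcelain']
    ).decode().strip()
    if status:
        raise RuntimeError(
            "Uncommitted changes detected; commit before running:\n" + status
        )

assert_clean_worktree()
```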
## Reproducibility Checklist
```
□ Random seeds set (numpy, python, framework)
□ Specific package versions documented
□ Data versioned or hashed
□ Code committed to git
□ Experiment parameters logged
□ Results logged with timestamp
□ Hardware/environment noted (GPU, OS)
```
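The last checklist item is easy to automate. A sketch that records the hardware/software environment; the GPU lookup assumes PyTorch and is skipped if it isn't installed:

```python
import platform
import sys

env_info = {
    'python': sys.version,
    'os': platform.platform(),
    'machine': platform.machine(),
}

# GPU name via PyTorch, if available (assumption: torch is the framework)
try:
    import torch
    if torch.cuda.is_available():
        env_info['gpu'] = torch.cuda.get_device_name(0)
except ImportError:
    pass

print(env_info)
```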
## Quick Template
```python
import random
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 1. Set seeds
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# 2. Log experiment config
config = {
    'seed': SEED,
    'data_path': 'data/train.csv',
    'model': 'RandomForest',
    'params': {'n_estimators': 100, 'max_depth': 10},
    'timestamp': datetime.now().isoformat()
}
print(f"Config: {config}")

# 3. Load data (hash it for verification, as in Level 4)
df = pd.read_csv(config['data_path'])
print(f"Data shape: {df.shape}")

# 4. Train with fixed seed
model = RandomForestClassifier(
    **config['params'],
    random_state=SEED
)

# 5. Log results (accuracy/f1 computed after evaluation)
results = {
    'config': config,
    'metrics': {'accuracy': accuracy, 'f1': f1}
}
```
## Key Takeaway
Reproducibility is a habit, not an afterthought. Set random seeds everywhere, version your environment and data, log all experiments, and commit code before runs. Future you (and your colleagues) will thank you. Start simple with seeds and requirements.txt, then add experiment tracking as needed.