# Reproducibility in Machine Learning
Learn how to make your ML experiments reproducible for yourself and others.
"It worked yesterday!" If you can't reproduce your results, you can't trust them. Reproducibility isn't optional - it's essential.
## Why Reproducibility Matters

- Debug issues (you need to recreate the problem)
- Compare experiments fairly
- Share with colleagues
- Deploy to production with confidence
- Maintain scientific integrity
## Level 1: Random Seeds
Set seeds everywhere:
```python
import numpy as np
import random
import os

def set_all_seeds(seed=42):
    np.random.seed(seed)
    random.seed(seed)
    # Note: PYTHONHASHSEED must be set before the interpreter starts
    # to affect str hashing; setting it here only helps subprocesses
    os.environ['PYTHONHASHSEED'] = str(seed)

    # For TensorFlow
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass

    # For PyTorch
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass

# Call at start of every experiment
set_all_seeds(42)
```
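Seeds alone don't cover everything in PyTorch: `DataLoader` worker processes draw their own seeds, so shuffling and augmentation can still vary between runs. Below is a sketch following PyTorch's documented recipe for deterministic data loading; `dataset` stands in for your own `Dataset`:

```python
import numpy as np
import random
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Derive a per-worker seed from the torch seed so numpy/random
    # inside each worker process are reproducible too
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)

loader = DataLoader(
    dataset,                     # placeholder: your torch Dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,
    worker_init_fn=seed_worker,  # seeds each worker deterministically
    generator=g,                 # fixes the shuffle order
)
```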
Also set seeds in model construction and data splitting:
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Always specify random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
```
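The same applies to cross-validation: if the splitter isn't seeded, each run evaluates on different folds and scores aren't comparable. A minimal sketch, reusing `X` and `y` from above:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Seed the splitter as well as the model so every run
# evaluates on exactly the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```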
## Level 2: Environment Management
### requirements.txt with Versions
```
# Save exact versions
pip freeze > requirements.txt

# requirements.txt
numpy==1.24.0
pandas==2.0.0
scikit-learn==1.3.0
xgboost==1.7.0
```
### Better: Use Poetry or Conda
```yaml
# environment.yml
name: ml-project
dependencies:
  - python=3.10
  - numpy=1.24.0
  - pandas=2.0.0
  - scikit-learn=1.3.0
  - pip:
      - xgboost==1.7.0
```
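Pinned files only help if the running environment actually matches them. A sketch that checks installed versions at startup via `importlib.metadata`; the `EXPECTED` dict is hypothetical and should mirror your pinned file:

```python
from importlib.metadata import version

# Hypothetical pins; keep in sync with requirements.txt / environment.yml
EXPECTED = {
    'numpy': '1.24.0',
    'pandas': '2.0.0',
    'scikit-learn': '1.3.0',
}

for package, expected in EXPECTED.items():
    installed = version(package)
    if installed != expected:
        print(f"WARNING: {package} is {installed}, expected {expected}")
```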
## Level 3: Track Experiments
### Simple: Log Everything
```python
import json
from datetime import datetime

def log_experiment(params, metrics, notes=""):
    experiment = {
        'timestamp': datetime.now().isoformat(),
        'params': params,
        'metrics': metrics,
        'notes': notes
    }
    with open('experiments.jsonl', 'a') as f:
        f.write(json.dumps(experiment) + '\n')

# Usage
log_experiment(
    params={'n_estimators': 100, 'max_depth': 10},
    metrics={'accuracy': 0.85, 'f1': 0.82},
    notes='First baseline'
)
```
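One advantage of the JSONL format is that it's trivial to load back for comparison. A sketch, assuming the `experiments.jsonl` written above:

```python
import pandas as pd

# Each line is one JSON record, so lines=True yields one row per run
runs = pd.read_json('experiments.jsonl', lines=True)

# Flatten the nested params/metrics dicts into columns
params = pd.json_normalize(runs['params'])
metrics = pd.json_normalize(runs['metrics'])
summary = pd.concat([runs['timestamp'], params, metrics], axis=1)

print(summary.sort_values('accuracy', ascending=False).head())
```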
### Better: Use MLflow
```python
import mlflow

mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1", f1)

    # Log model
    mlflow.sklearn.log_model(model, "model")
```
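For scikit-learn (and several other frameworks), MLflow can also capture parameters, metrics, and the fitted model automatically; a sketch using autologging in place of the explicit `log_*` calls:

```python
import mlflow

mlflow.set_experiment("my_experiment")
mlflow.sklearn.autolog()  # records params, metrics, and the model on fit()

with mlflow.start_run():
    model.fit(X_train, y_train)
```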
## Level 4: Version Data
Data changes can break reproducibility:
```python
import hashlib
import pandas as pd

def hash_dataframe(df):
    return hashlib.md5(
        pd.util.hash_pandas_object(df).values
    ).hexdigest()

# Log data hash
data_hash = hash_dataframe(df)
print(f"Data hash: {data_hash}")

# Or use DVC for data versioning:
#   dvc add data/training_data.csv
```
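To make the hash actively protect you rather than just sit in a log, you can verify it at load time. A sketch with a hypothetical `load_verified` helper; `EXPECTED_HASH` is a placeholder you'd record from a trusted run:

```python
import hashlib
import pandas as pd

def load_verified(path, expected_hash):
    """Load a CSV and fail fast if its contents have changed."""
    df = pd.read_csv(path)
    actual = hashlib.md5(
        pd.util.hash_pandas_object(df).values
    ).hexdigest()
    if actual != expected_hash:
        raise ValueError(
            f"Data hash mismatch for {path}: "
            f"expected {expected_hash}, got {actual}"
        )
    return df

EXPECTED_HASH = "0" * 32  # placeholder: record the real hash from a trusted run
df = load_verified('data/train.csv', EXPECTED_HASH)
```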
## Level 5: Code Version Control
Always commit before running experiments:
```python
import subprocess

def get_git_commit():
    try:
        commit = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD']
        ).decode().strip()
        return commit
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

# Log with the experiment (extend log_experiment from Level 3
# to accept and store extra fields like git_commit)
log_experiment(
    params={...},
    metrics={...},
    git_commit=get_git_commit()
)
```
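A logged commit is only trustworthy if the working tree was clean when the experiment ran. One way to enforce the "commit before running" rule, sketched with `git status --porcelain`:

```python
import subprocess

def assert_clean_worktree():
    """Refuse to run if there are uncommitted changes, so the
    logged commit actually matches the code that ran."""
    status = subprocess.check_output(
        ['git', 'status', '--porcelain']
    ).decode().strip()
    if status:
        raise RuntimeError(
            "Uncommitted changes detected; commit before running:\n" + status
        )

assert_clean_worktree()
```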
## Reproducibility Checklist
```
□ Random seeds set (numpy, python, framework)
□ Specific package versions documented
□ Data versioned or hashed
□ Code committed to git
□ Experiment parameters logged
□ Results logged with timestamp
□ Hardware/environment noted (GPU, OS)
```
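The last checklist item is easy to automate. A sketch that records the hardware/software environment; the GPU lookup assumes PyTorch and is skipped if it isn't installed:

```python
import platform
import sys

env_info = {
    'python': sys.version,
    'os': platform.platform(),
    'machine': platform.machine(),
}

# GPU name via PyTorch, if available (assumption: torch is the framework)
try:
    import torch
    if torch.cuda.is_available():
        env_info['gpu'] = torch.cuda.get_device_name(0)
except ImportError:
    pass

print(env_info)
```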
## Quick Template
```python
import random
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 1. Set seeds
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# 2. Log experiment config
config = {
    'seed': SEED,
    'data_path': 'data/train.csv',
    'model': 'RandomForest',
    'params': {'n_estimators': 100, 'max_depth': 10},
    'timestamp': datetime.now().isoformat()
}
print(f"Config: {config}")

# 3. Load data (hash it for verification, as in Level 4)
df = pd.read_csv(config['data_path'])
print(f"Data shape: {df.shape}")

# 4. Train with fixed seed
model = RandomForestClassifier(
    **config['params'],
    random_state=SEED
)

# 5. Log results (accuracy/f1 computed after evaluation)
results = {
    'config': config,
    'metrics': {'accuracy': accuracy, 'f1': f1}
}
```
## Key Takeaway
Reproducibility is a habit, not an afterthought. Set random seeds everywhere, version your environment and data, log all experiments, and commit code before runs. Future you (and your colleagues) will thank you. Start simple with seeds and requirements.txt, then add experiment tracking as needed.