
Debugging Machine Learning Models

Learn systematic approaches to debugging when your ML model isn't performing as expected.

Sarah Chen
December 19, 2025


Your model isn't working. Before trying random things, follow a systematic debugging approach.

The Debugging Checklist

1. Check your data
2. Start simple
3. Verify the pipeline
4. Analyze errors
5. Check for common bugs

Step 1: Check Your Data

Most ML problems are data problems.

Look at Your Data

# Basic stats
print(df.describe())
print(df.info())

# Target distribution
print(df['target'].value_counts(normalize=True))

# Missing values
print(df.isnull().sum())

# Actually look at examples
print(df.head(20))

Check for Data Issues

# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Outliers: compare the 1st and 99th percentiles of each numeric column
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    q99 = df[col].quantile(0.99)
    q01 = df[col].quantile(0.01)
    print(f"{col}: 1%={q01:.2f}, 99%={q99:.2f}")

# Class imbalance
print(df['target'].value_counts())

# Leakage check - suspiciously strong correlations with the target
correlations = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlations.head(10))
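
Correlation only catches linear leakage in numeric columns. A blunter but more general probe is to fit a tiny model on one feature at a time: any single column that nearly solves the task on held-out data deserves scrutiny. A minimal sketch, assuming numeric features and an existing train/test split:

from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree on each feature alone and score it on held-out data
for col in X_train.columns:
    probe = DecisionTreeClassifier(max_depth=3)
    probe.fit(X_train[[col]], y_train)
    score = probe.score(X_test[[col]], y_test)
    if score > 0.9:  # threshold is arbitrary; adjust for your task
        print(f"Possible leakage: '{col}' alone scores {score:.3f}")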

Step 2: Start Simple

If a complex model doesn't work, try something simpler first.

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Baseline: How good is random/majority?
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Dummy baseline: {dummy.score(X_test, y_test):.3f}")

# Simple model
lr = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
lr.fit(X_train, y_train)
print(f"Logistic Regression: {lr.score(X_test, y_test):.3f}")

# If LR doesn't beat dummy, you have a data problem
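
A single train/test split can be noisy, so it's worth cross-validating the comparison before drawing conclusions. A minimal sketch, reusing the dummy and lr estimators from above:

from sklearn.model_selection import cross_val_score

# Average over 5 folds instead of trusting one split
for name, clf in [("dummy", dummy), ("logreg", lr)]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")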

Step 3: Verify the Pipeline

Can the Model Overfit?

A model should be able to memorize training data:

# Train on small subset
X_small = X_train[:100]
y_small = y_train[:100]

model.fit(X_small, y_small)
train_score = model.score(X_small, y_small)

print(f"Train score on small data: {train_score:.3f}")
# Should be close to 1.0! If not, model can't learn at all.

Check Preprocessing

import numpy as np

# Verify shapes at each step
print(f"Original: {X.shape}")
print(f"After preprocessing: {X_processed.shape}")
print(f"After train/test split: {X_train.shape}, {X_test.shape}")

# Check for NaN after preprocessing
print(f"NaN in processed: {np.isnan(X_processed).sum()}")

# Check for inf
print(f"Inf in processed: {np.isinf(X_processed).sum()}")

Verify Labels

# Are labels correct?
print(f"Unique labels: {np.unique(y)}")
print(f"Label types: {y.dtype}")

# Check alignment
assert len(X) == len(y), "X and y length mismatch!"

Step 4: Analyze Errors

Confusion Matrix

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()

# What's it getting wrong?

Look at Misclassified Examples

# Find misclassified examples
mask = y_test != y_pred
errors = X_test[mask]
error_true = y_test[mask]
error_pred = y_pred[mask]

# Inspect a handful of them
for i in range(min(10, len(errors))):
    print(f"True: {error_true.iloc[i]}, Pred: {error_pred[i]}")
    print(errors.iloc[i])
    print("---")

Error Analysis by Subgroup

from sklearn.metrics import accuracy_score

# Performance by category (assumes X_test retains a 'category' column)
for category in X_test['category'].unique():
    mask = X_test['category'] == category
    score = accuracy_score(y_test[mask], y_pred[mask])
    print(f"{category}: {score:.3f}")

Step 5: Common Bugs

Bug: Using Test Data for Training

# WRONG: the scaler's statistics are computed on train + test
scaler.fit(X)  # includes test data!
model.fit(X_train, y_train)

# RIGHT: fit preprocessing on training data only, then transform both
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
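
Safer still is to let scikit-learn manage the fit/transform ordering for you. A minimal sketch using Pipeline, which refits the scaler on each training fold during cross-validation so test folds never influence the scaling statistics:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold fits the scaler on its own training portion only
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")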

Bug: Shuffled Labels

# Check if labels match data
# After any reshaping, verify alignment
print(f"First 5 X rows: {X.head()}")
print(f"First 5 y values: {y.head()}")

Bug: Wrong Metric

from sklearn.metrics import accuracy_score, f1_score

# For imbalanced data, accuracy is misleading
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1: {f1_score(y_test, y_pred):.3f}")  # much more informative

Debugging Workflow

Model not learning?
├── Check: Can it overfit small data?
│   └── No → Bug in code or data
├── Check: Does simple model work?
│   └── No → Data problem
└── Check: Is training loss decreasing?
    └── No → Learning rate / architecture issue

Model overfitting?
├── Try regularization
├── Try simpler model
├── Get more data
└── Check for data leakage

Model underfitting?
├── Try more complex model
├── Add features
├── Reduce regularization
└── Check preprocessing
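
The overfitting and underfitting branches both come down to the gap between training and validation scores. A minimal sketch using scikit-learn's learning_curve (any estimator works in place of model):

import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n}: train={tr:.3f}, val={va:.3f}")

# High train score with a large gap -> overfitting
# Both scores low                   -> underfitting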

Key Takeaway

Debug systematically, not randomly. Start with data checks, verify you can overfit, use simple baselines, and analyze actual errors. Most problems are data problems or bugs in preprocessing. When stuck, simplify until something works, then add complexity back gradually.

Tags: Machine Learning, Debugging, Troubleshooting, Intermediate