Debugging Machine Learning Models
Learn systematic approaches to debugging when your ML model isn't performing as expected.
Sarah Chen
December 19, 2025
Your model isn't working. Before trying random things, follow a systematic debugging approach.
The Debugging Checklist
1. Check your data
2. Start simple
3. Verify the pipeline
4. Analyze errors
5. Check for bugs in code
Step 1: Check Your Data
Most ML problems are data problems.
Look at Your Data
# Basic stats
print(df.describe())
print(df.info())
# Target distribution
print(df['target'].value_counts(normalize=True))
# Missing values
print(df.isnull().sum())
# Actually look at examples
print(df.head(20))
Check for Data Issues
# Duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# Outliers: compare extreme percentiles for each numeric column
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    q01 = df[col].quantile(0.01)
    q99 = df[col].quantile(0.99)
    print(f"{col}: 1%={q01:.2f}, 99%={q99:.2f}")
# Class imbalance
print(df['target'].value_counts())
# Leakage check - correlations with target
correlations = df.corr(numeric_only=True)['target'].sort_values(ascending=False)
print(correlations.head(10))
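A feature that correlates almost perfectly with the target is usually leakage, not a miracle feature. As a rough follow-up check (the 0.95 cutoff below is an arbitrary assumption, adjust it for your data):
# Flag suspiciously strong correlations as leakage candidates
suspicious = correlations.drop('target')
suspicious = suspicious[suspicious.abs() > 0.95]
print(f"Possible leakage: {list(suspicious.index)}")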
Step 2: Start Simple
If a complex model doesn't work, try something simpler first.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
# Baseline: How good is random/majority?
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(f"Dummy baseline: {dummy.score(X_test, y_test):.3f}")
# Simple model
lr = LogisticRegression(max_iter=1000)  # bump max_iter to avoid convergence warnings
lr.fit(X_train, y_train)
print(f"Logistic Regression: {lr.score(X_test, y_test):.3f}")
# If LR doesn't beat the dummy, you likely have a data problem
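A single train/test split can be noisy, so the comparison is more trustworthy across cross-validation folds. A minimal sketch, reusing the same X_train and y_train:
from sklearn.model_selection import cross_val_score
# Compare the baseline and the simple model across 5 folds
for name, clf in [('dummy', DummyClassifier(strategy='most_frequent')),
                  ('logreg', LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")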
Step 3: Verify the Pipeline
Can the Model Overfit?
A model should be able to memorize training data:
# Train on a tiny subset
X_small = X_train[:100]
y_small = y_train[:100]
model.fit(X_small, y_small)  # model = whatever estimator you're debugging
train_score = model.score(X_small, y_small)
print(f"Train score on small data: {train_score:.3f}")
# Should be close to 1.0! If not, the model can't learn at all.
Check Preprocessing
# Verify shapes at each step
print(f"Original: {X.shape}")
print(f"After preprocessing: {X_processed.shape}")
print(f"After train/test split: {X_train.shape}, {X_test.shape}")
# Check for NaN after preprocessing
import numpy as np
print(f"NaN in processed: {np.isnan(X_processed).sum()}")
# Check for inf
print(f"Inf in processed: {np.isinf(X_processed).sum()}")
Verify Labels
# Are labels correct?
print(f"Unique labels: {np.unique(y)}")
print(f"Label types: {y.dtype}")
# Check alignment
assert len(X) == len(y), "X and y length mismatch!"
Step 4: Analyze Errors
Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()
# What's it getting wrong?
Look at Misclassified Examples
# Find errors (assumes y_test is a pandas Series aligned with X_test)
mask = y_test != y_pred
errors = X_test[mask]
error_true = y_test[mask]
error_pred = y_pred[mask]
# Look at them
for i in range(min(10, len(errors))):
    print(f"True: {error_true.iloc[i]}, Pred: {error_pred[i]}")
    print(errors.iloc[i])
    print("---")
Error Analysis by Subgroup
# Performance by category (assumes X_test kept a 'category' column)
from sklearn.metrics import accuracy_score
for category in X_test['category'].unique():
    mask = (X_test['category'] == category).values
    score = accuracy_score(y_test[mask], y_pred[mask])
    print(f"{category}: {score:.3f}")
Step 5: Common Bugs
Bug: Using Test Data for Training
# WRONG
scaler.fit(X) # Includes test data!
model.fit(X_train, y_train)
# RIGHT
scaler.fit(X_train)
model.fit(X_train, y_train)
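An even better fix is structural: wrap preprocessing and model in a scikit-learn Pipeline, so the scaler can only ever be fit on the data the model trains on. A minimal sketch, assuming the same training data as above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The pipeline re-fits the scaler on each training fold only,
# so test folds never influence the scaling statistics
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(f"CV score: {cross_val_score(pipe, X_train, y_train, cv=5).mean():.3f}")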
Bug: Shuffled Labels
# After any filtering, shuffling, or reshaping, verify X and y still align
print(f"First 5 X rows:\n{X.head()}")
print(f"First 5 y values:\n{y.head()}")
# If both are pandas objects, check the indices directly
assert (X.index == y.index).all(), "X and y indices are misaligned!"
Bug: Wrong Metric
# For imbalanced data, accuracy is misleading
from sklearn.metrics import accuracy_score, f1_score
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1: {f1_score(y_test, y_pred):.3f}")  # Much more informative
Debugging Workflow
Model not learning?
├── Check: Can it overfit small data?
│   └── No → Bug in code or data
├── Check: Does a simple model work?
│   └── No → Data problem
└── Check: Is training loss decreasing?
    └── No → Learning rate / architecture issue
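For that last branch, don't guess whether the loss is decreasing - plot it. A quick sketch using scikit-learn's MLPClassifier, whose loss_curve_ attribute records the training loss per iteration (the hyperparameters here are placeholders, not recommendations):
from sklearn.neural_network import MLPClassifier
import matplotlib.pyplot as plt
# If this curve is flat, suspect the learning rate or the architecture
mlp = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=1e-3, max_iter=200)
mlp.fit(X_train, y_train)
plt.plot(mlp.loss_curve_)
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.show()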
Model overfitting?
├── Try regularization
├── Try simpler model
├── Get more data
└── Check for data leakage
Model underfitting?
├── Try more complex model
├── Add features
├── Reduce regularization
└── Check preprocessing
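The last two trees both come down to one measurement: the gap between training and test scores. A minimal check, assuming model is whatever fitted estimator you're debugging:
# Large gap → overfitting; both scores low → underfitting
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train: {train_score:.3f}, Test: {test_score:.3f}, Gap: {train_score - test_score:.3f}")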
Key Takeaway
Debug systematically, not randomly. Start with data checks, verify you can overfit, use simple baselines, and analyze actual errors. Most problems are data problems or bugs in preprocessing. When stuck, simplify until something works, then add complexity back gradually.
#Machine Learning #Debugging #Troubleshooting #Intermediate