Model Evaluation Metrics You Must Know
Learn the essential metrics for evaluating ML models - accuracy, precision, recall, F1, ROC, and more.
"My model is 95% accurate!" Sounds great, right? But what if 95% of your data is one class? Let's learn the RIGHT metrics.
Classification Metrics
The Confusion Matrix
The foundation of all classification metrics:
|            | Predicted Neg | Predicted Pos |
|---|---|---|
| Actual Neg | TN | FP |
| Actual Pos | FN | TP |
- True Negative (TN): Correctly predicted negative
- True Positive (TP): Correctly predicted positive
- False Positive (FP): Said positive, was negative (Type I error)
- False Negative (FN): Said negative, was positive (Type II error)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[TN, FP],
# [FN, TP]]
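If you'd rather see this as a plot, scikit-learn can draw the matrix directly. A minimal sketch, assuming scikit-learn >= 1.0 (for from_predictions) and that y_true and y_pred already exist:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Render the confusion matrix as a colored grid with counts in each cell
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()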
Accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)
Percentage of correct predictions.
When to use: Balanced classes
Problem: Misleading with imbalanced data
Dataset: 950 non-fraud, 50 fraud
Model: Always predicts "non-fraud"
Accuracy: 95% 🎉 (but useless!)
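Here's that failure mode in code - a minimal sketch with synthetic labels and a "model" that always predicts the majority class:
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
# 950 non-fraud (0) and 50 fraud (1) labels
y_true = np.array([0] * 950 + [1] * 50)
# A degenerate model that always predicts non-fraud
y_pred = np.zeros(1000, dtype=int)
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.95
print(f"Recall:   {recall_score(y_true, y_pred):.2f}")    # 0.00 -- every fraud case missed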
Precision
precision = TP / (TP + FP)
Of all positive predictions, how many were correct?
Use when: False positives are costly
- Spam filter (don't mark good email as spam)
- Recommender systems (don't waste user's time)
Recall (Sensitivity)
recall = TP / (TP + FN)
Of all actual positives, how many did we find?
Use when: False negatives are costly
- Disease detection (don't miss sick patients!)
- Fraud detection (don't miss fraudsters!)
Precision vs Recall Tradeoff
↑ Precision → ↓ Recall (more conservative predictions)
↑ Recall → ↓ Precision (more liberal predictions)
You can't maximize both at once. Choose based on which error hurts more (the threshold sketch after this list makes the tradeoff concrete):
- Missing a positive (FN) → Focus on Recall
- False alarm (FP) → Focus on Precision
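The tradeoff is controlled by the decision threshold. A minimal sketch, assuming a fitted binary classifier model with predict_proba, a test set X_test, and true labels y_true (scikit-learn >= 0.22 for zero_division):
from sklearn.metrics import precision_score, recall_score
# Probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
# Sweep the threshold: higher = more conservative, lower = more liberal
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
Raising the threshold typically pushes precision up and recall down; lowering it does the opposite.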
F1 Score
f1 = 2 * (precision * recall) / (precision + recall)
Harmonic mean of precision and recall. Balanced metric.
Use when: You care about both precision and recall equally.
from sklearn.metrics import precision_score, recall_score, f1_score
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")
Classification Report
Get all metrics at once:
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
              precision    recall  f1-score   support
     class 0       0.95      0.98      0.96       950
     class 1       0.72      0.56      0.63        50
    accuracy                           0.94      1000
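If you need these numbers programmatically rather than as printed text, classification_report can also return a dictionary (the output_dict=True option in modern scikit-learn; the keys depend on your class labels):
from sklearn.metrics import classification_report
# Dict keyed by class label, plus 'accuracy', 'macro avg', 'weighted avg'
report = classification_report(y_true, y_pred, output_dict=True)
print(report["1"]["recall"])  # recall for the positive class, assuming labels 0/1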
ROC Curve and AUC
ROC Curve: Plot of True Positive Rate vs False Positive Rate at different thresholds.
AUC (Area Under Curve): Single number summarizing ROC.
- AUC = 1.0: Perfect model
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Get probabilities, not predictions
y_proba = model.predict_proba(X_test)[:, 1]
# AUC score
auc = roc_auc_score(y_true, y_proba)
print(f"AUC: {auc:.3f}")
# Plot ROC curve
fpr, tpr, _ = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()
Regression Metrics
Mean Absolute Error (MAE)
mae = mean(|predicted - actual|)
Average absolute difference. Easy to interpret.
Mean Squared Error (MSE)
mse = mean((predicted - actual)²)
Penalizes large errors more heavily.
Root Mean Squared Error (RMSE)
rmse = sqrt(mse)
Same units as target variable. Most commonly used.
R² Score
r2 = 1 - (sum of squared residuals) / (total sum of squares)
- R² = 1: Perfect predictions
- R² = 0: As good as predicting the mean
- R² < 0: Worse than predicting the mean!
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.3f}")
Which Metric to Use?
| Problem | Recommended Metric |
|---|---|
| Balanced classification | Accuracy, F1 |
| Imbalanced classification | F1, AUC, Precision/Recall |
| Ranking | AUC |
| Regression (normal errors) | RMSE, R² |
| Regression (with outliers) | MAE |
Quick Reference
from sklearn.metrics import (
accuracy_score,
precision_score,
recall_score,
f1_score,
roc_auc_score,
confusion_matrix,
classification_report,
mean_absolute_error,
mean_squared_error,
r2_score
)
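Putting several of these together - a minimal end-to-end sketch on synthetic, imbalanced data with a logistic regression (names and numbers here are illustrative, not a recipe):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
# Synthetic binary dataset, roughly 95% class 0 / 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"AUC: {roc_auc_score(y_test, y_proba):.3f}")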
Key Takeaway
Don't just report accuracy. Understand your problem:
- What's the cost of different errors?
- Is your data balanced?
- Do you need probabilities or hard predictions?
Choose metrics that match your real-world goals!