Deep Dive into Classification Metrics
Master classification metrics beyond accuracy - precision, recall, F1, ROC-AUC, and when to use each.
Accuracy can lie. If only 1% of your email is spam, a filter that lets everything through scores 99% accuracy while catching zero spam; one that blocks everything lets 0% of spam through but also blocks all your real email. Let's understand the metrics properly.
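A quick sketch of the problem, using made-up labels where only 1% of emails are spam: a "filter" that never flags anything still scores 99% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 1,000 emails, only 10 of them spam (label 1)
y_true = np.array([1] * 10 + [0] * 990)

# A "filter" that never flags anything as spam
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero spam
```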
The Confusion Matrix
Everything starts here:
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP (Type I error: "false alarm") |
| Actual Positive | FN (Type II error: "missed it") | TP |
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.show()
Precision vs Recall
Precision: Of predicted positives, how many are correct?
Precision = TP / (TP + FP)
"When I say yes, am I right?"
Recall (Sensitivity): Of actual positives, how many did we find?
Recall = TP / (TP + FN)
"Of all the positives, how many did I catch?"
The Tradeoff
- High threshold: "Only predict positive when very confident"
  - High precision (fewer false alarms)
  - Low recall (miss more positives)
- Low threshold: "Predict positive more liberally"
  - Low precision (more false alarms)
  - High recall (catch more positives)
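You can watch this tradeoff happen by sweeping the decision threshold yourself. A minimal sketch, assuming a fitted binary classifier `model` with `predict_proba` and the same `X_test`/`y_test` used in the snippets below:

```python
from sklearn.metrics import precision_score, recall_score

y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in [0.3, 0.5, 0.7, 0.9]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises, precision typically climbs while recall falls.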
When to Prioritize What
| Scenario | Prioritize | Why |
|---|---|---|
| Spam filter | Precision | Don't lose important emails |
| Cancer screening | Recall | Don't miss any cancer cases |
| Fraud detection | Usually recall | Don't let fraud through |
| Recommendation | Precision | Don't annoy users |
F1 Score: The Balance
Harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use when you need both precision and recall to be good.
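Because it is a harmonic mean, F1 stays low whenever either side is low; an arithmetic mean would hide that. A quick illustration with made-up numbers:

```python
p, r = 0.9, 0.1  # high precision, terrible recall

arithmetic_mean = (p + r) / 2  # 0.50 -- looks acceptable
f1 = 2 * p * r / (p + r)       # 0.18 -- exposes the weak recall
print(f"Mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")
```

With scikit-learn you get all three directly from your predictions: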
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
ROC Curve and AUC
The ROC curve plots the true positive rate against the false positive rate at every possible threshold:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
# False positive rate and true positive rate at every threshold
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Plot against the diagonal (a random classifier)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
AUC Interpretation (the area under the curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one):
- 1.0: Perfect
- 0.5: Random guessing
- <0.5: Worse than random (flip predictions!)
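That "ranked higher" reading can be checked empirically by comparing every positive's score against every negative's score (ties count half). A sketch reusing `y_test` and `y_proba` from the ROC snippet above; it forms all positive-negative pairs, so it is only meant for small test sets:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_test = np.asarray(y_test)
pos = y_proba[y_test == 1]
neg = y_proba[y_test == 0]

# Fraction of (positive, negative) pairs where the positive is scored higher
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
print(wins + 0.5 * ties)               # matches...
print(roc_auc_score(y_test, y_proba))  # ...the library's AUC
```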
Precision-Recall Curve
More informative than ROC when the positive class is rare, because it never uses the (huge) count of true negatives:
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'AP = {average_precision_score(y_test, y_proba):.3f}')
plt.show()
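The PR curve is also a handy way to pick an operating threshold. One common heuristic (a sketch, not the only choice) is to take the threshold that maximizes F1, using the arrays returned above; note that `precision_recall_curve` returns one more precision/recall point than thresholds:

```python
import numpy as np

# Drop the final (precision=1, recall=0) point, which has no threshold
p, r = precision[:-1], recall[:-1]
f1_scores = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best]:.3f} "
      f"(precision={p[best]:.2f}, recall={r[best]:.2f}, F1={f1_scores[best]:.2f})")
```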
Multi-class Metrics
from sklearn.metrics import classification_report
# Get detailed report per class
print(classification_report(y_test, y_pred))
# Averaging strategies for multi-class scores
f1_score(y_test, y_pred, average='macro')     # Unweighted mean of per-class F1: every class counts equally
f1_score(y_test, y_pred, average='weighted')  # Per-class F1 weighted by class size
f1_score(y_test, y_pred, average='micro')     # Pool TP/FP/FN across classes (equals accuracy for single-label problems)
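To see how the averages differ, here is a toy example with made-up labels: two common classes predicted perfectly and one rare class that is always missed. Macro averaging exposes the failure; weighted and micro largely hide it:

```python
from sklearn.metrics import f1_score

y_true = [0] * 45 + [1] * 45 + [2] * 10
y_pred = [0] * 45 + [1] * 45 + [0] * 10  # the rare class 2 is never predicted

print(f1_score(y_true, y_pred, average='macro', zero_division=0))     # ~0.63
print(f1_score(y_true, y_pred, average='weighted', zero_division=0))  # ~0.86
print(f1_score(y_true, y_pred, average='micro', zero_division=0))     # 0.90
```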
Choosing Your Metric
Balanced classes?
├── Yes → Accuracy, F1, AUC all reasonable
└── No (imbalanced)
    ├── Care more about finding positives → Recall, PR-AUC
    ├── Care more about being right when positive → Precision
    └── Need ranking ability → ROC-AUC
Key Takeaway
No single metric tells the whole story. Understand the confusion matrix, know the precision-recall tradeoff, and choose metrics based on business impact. For imbalanced data, avoid accuracy and use F1, PR-AUC, or precision and recall directly. Always ask: "What's worse, false positives or false negatives?" The answer guides your metric choice.