Deep Dive into Classification Metrics
Master classification metrics beyond accuracy - precision, recall, F1, ROC-AUC, and when to use each.
Accuracy can lie. A spam filter that blocks every email lets 0% of spam through - but it also blocks all your real mail. A single number can look great while the model is useless. Let's understand the metrics properly.
The Confusion Matrix
Everything starts here:
```
                  Predicted
                  Neg      Pos
Actual    Neg     TN       FP   ← False Positive: "False alarm" (Type I error)
          Pos     FN       TP
                  ↑
                  False Negative: "Missed it" (Type II error)
```
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
```
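For a binary problem you can also unpack the four counts directly; with 0/1 labels, `confusion_matrix` lays them out exactly as in the diagram above:

```python
# Row order matches the diagram: (TN, FP) then (FN, TP)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```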
Precision vs Recall
**Precision:** Of predicted positives, how many are correct?

```
Precision = TP / (TP + FP)
```

"When I say yes, am I right?"
**Recall (Sensitivity):** Of actual positives, how many did we find?

```
Recall = TP / (TP + FN)
```

"Of all the positives, how many did I catch?"
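A quick sanity check of both formulas with made-up counts (the numbers below are illustrative only):

```python
# Illustrative counts only - not from a real model
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)   # 80 / 100 = 0.80
recall = tp / (tp + fn)      # 80 / 120 ≈ 0.67
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```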
The Tradeoff
```
High threshold: "Only predict positive when very confident"
  → High precision (fewer false alarms)
  → Low recall (miss more positives)

Low threshold: "Predict positive more liberally"
  → Low precision (more false alarms)
  → High recall (catch more positives)
```
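You can see the tradeoff directly by sweeping the threshold yourself instead of using the default 0.5 cutoff. A minimal sketch, assuming `model` is a fitted binary classifier with `predict_proba` and `X_test`, `y_test` are held-out data (the same assumption the ROC example below makes):

```python
from sklearn.metrics import precision_score, recall_score

y_proba = model.predict_proba(X_test)[:, 1]

for threshold in [0.3, 0.5, 0.7, 0.9]:
    # Predict positive only when the probability clears the threshold
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```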
When to Prioritize What
| Scenario | Prioritize | Why |
|----------|-----------|-----|
| Spam filter | Precision | Don't lose important emails |
| Cancer screening | Recall | Don't miss any cancer cases |
| Fraud detection | Usually recall | Don't let fraud through |
| Recommendation | Precision | Don't annoy users |
F1 Score: The Balance
Harmonic mean of precision and recall:
```
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
Use when you need both precision and recall to be good.
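The harmonic mean punishes imbalance: if either precision or recall is poor, F1 drops toward the poor one rather than averaging it away. A quick check with illustrative numbers:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))  # 0.90 - both strong, F1 strong
print(f1(0.9, 0.1))  # 0.18 - far below the arithmetic mean of 0.5
```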
```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
```
ROC Curve and AUC
The ROC curve plots the true positive rate against the false positive rate across all classification thresholds:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Get predicted probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Plot the curve with AUC in the legend
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
**AUC Interpretation:**
- 1.0: Perfect
- 0.5: Random guessing
- <0.5: Worse than random (flip predictions!)
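The `thresholds` array returned by `roc_curve` also lets you pick an operating point. One common heuristic (not part of the snippet above, shown here only as an illustration) is Youden's J statistic, which selects the threshold maximizing TPR minus FPR:

```python
import numpy as np

# Continues from the ROC snippet above: fpr, tpr, thresholds from roc_curve
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
print(f"Best threshold by Youden's J: {thresholds[best_idx]:.3f} "
      f"(TPR={tpr[best_idx]:.3f}, FPR={fpr[best_idx]:.3f})")
```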
Precision-Recall Curve
Often more informative than ROC when the classes are imbalanced:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'AP = {average_precision_score(y_test, y_proba):.3f}')
```
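One reason the PR curve is more honest on imbalanced data: its no-skill baseline is the positive-class prevalence rather than the fixed diagonal of the ROC plot. A small addition to the plot above (assuming `y_test` is a 0/1 array):

```python
import numpy as np

# No-skill baseline for a PR curve: a random classifier's precision
# equals the fraction of positives in the data
baseline = np.mean(y_test)
plt.axhline(baseline, color='k', linestyle='--',
            label=f'No-skill baseline = {baseline:.3f}')
plt.legend()
```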
Multi-class Metrics
```python
from sklearn.metrics import classification_report, f1_score

# Detailed per-class report
print(classification_report(y_test, y_pred))

# Averaging strategies
f1_score(y_test, y_pred, average='macro')     # Equal weight per class
f1_score(y_test, y_pred, average='weighted')  # Weight by class size
f1_score(y_test, y_pred, average='micro')     # Global calculation
```
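To see how the averaging strategies differ, here is a tiny made-up multi-class example where one class dominates and another is never predicted correctly (the labels are illustrative only):

```python
from sklearn.metrics import f1_score

# Class 0 dominates; class 2 is never predicted correctly
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

print(f1_score(y_true, y_pred, average='macro', zero_division=0))     # pulled down by class 2's F1 of 0
print(f1_score(y_true, y_pred, average='weighted', zero_division=0))  # dominated by the large class 0
print(f1_score(y_true, y_pred, average='micro', zero_division=0))     # equals overall accuracy here (0.8)
```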
Choosing Your Metric
```
Balanced classes?
├── Yes → Accuracy, F1, AUC all reasonable
└── No (Imbalanced)
    ├── Care more about finding positives → Recall, PR-AUC
    ├── Care more about being right when positive → Precision
    └── Need ranking ability → ROC-AUC
```
Key Takeaway
No single metric tells the whole story. Understand the confusion matrix, know the precision-recall tradeoff, and choose metrics based on business impact. For imbalanced data, avoid accuracy and use F1, AUC, or precision/recall. Always ask: "What's worse - false positives or false negatives?" That guides your metric choice.