ML · 9 min read

Deep Dive into Classification Metrics

Master classification metrics beyond accuracy - precision, recall, F1, ROC-AUC, and when to use each.

Sarah Chen
December 19, 2025


Accuracy can lie. A spam filter that blocks everything lets 0% of spam through - but it also blocks all of your real email. One number can look great while the model is useless. Let's understand the metrics properly.

The Confusion Matrix

Everything starts here:

                    Predicted
                  Neg    Pos
Actual   Neg      TN     FP    ← False Positive: "False alarm" (Type I error)
         Pos      FN     TP    ← False Negative: "Missed it" (Type II error)

In scikit-learn:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.show()
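
For a binary problem you can also unpack the four counts directly (reusing y_test and y_pred from above) - ravel() returns them in TN, FP, FN, TP order:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")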

Precision vs Recall

Precision: Of predicted positives, how many are correct?

Precision = TP / (TP + FP)

"When I say yes, am I right?"

Recall (Sensitivity): Of actual positives, how many did we find?

Recall = TP / (TP + FN)

"Of all the positives, how many did I catch?"

The Tradeoff

High threshold: "Only predict positive when very confident"
  → High precision (fewer false alarms)
  → Low recall (miss more positives)

Low threshold: "Predict positive more liberally"
  → Low precision (more false alarms)
  → High recall (catch more positives)
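
The threshold is something you choose. A minimal sketch, assuming a fitted scikit-learn classifier named model with a predict_proba method (as used later in this post), showing how moving it trades precision against recall:

from sklearn.metrics import precision_score, recall_score

y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in [0.3, 0.5, 0.7]:
    y_pred_t = (y_proba >= threshold).astype(int)  # apply a custom decision threshold
    p = precision_score(y_test, y_pred_t)
    r = recall_score(y_test, y_pred_t)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")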

When to Prioritize What

Scenario            Prioritize       Why
Spam filter         Precision        Don't lose important emails
Cancer screening    Recall           Don't miss any cancer cases
Fraud detection     Usually recall   Don't let fraud through
Recommendation      Precision        Don't annoy users

F1 Score: The Balance

Harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Use when you need both precision and recall to be good.
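
For example, precision = 0.9 and recall = 0.5 give F1 = 2 × (0.9 × 0.5) / (0.9 + 0.5) ≈ 0.64, well below the arithmetic mean of 0.7 - the harmonic mean punishes whichever side is weaker.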

from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")

ROC Curve and AUC

The ROC curve plots the true positive rate against the false positive rate, showing the tradeoff across all thresholds:

from sklearn.metrics import roc_curve, roc_auc_score

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Plot
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC Interpretation:

  • 1.0: Perfect
  • 0.5: Random guessing
  • <0.5: Worse than random (flip predictions!)
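
AUC measures ranking quality: it is the probability that a randomly chosen positive is scored higher than a randomly chosen negative. Deployment still needs a single threshold, though. One common heuristic (a sketch using the fpr, tpr, and thresholds arrays from above) is Youden's J statistic - pick the threshold that maximizes TPR minus FPR:

import numpy as np

best_idx = np.argmax(tpr - fpr)          # Youden's J = TPR - FPR
best_threshold = thresholds[best_idx]
print(f"Best threshold by Youden's J: {best_threshold:.3f} "
      f"(TPR={tpr[best_idx]:.3f}, FPR={fpr[best_idx]:.3f})")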

Precision-Recall Curve

Often more informative than ROC for imbalanced data, because precision and recall ignore the abundant true negatives:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'AP = {average_precision_score(y_test, y_proba):.3f}')
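
A useful reference line: for a no-skill classifier, precision equals the positive-class prevalence, so the baseline of a PR curve is not the 0.5 diagonal that ROC uses. Assuming y_test is a 0/1 array, you can draw it like this:

baseline = y_test.mean()  # fraction of positives in the test set
plt.axhline(baseline, color='k', linestyle='--', label=f'No skill = {baseline:.2f}')
plt.legend()
plt.show()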

Multi-class Metrics

from sklearn.metrics import classification_report

# Get detailed report per class
print(classification_report(y_test, y_pred))

# Averaging strategies
f1_score(y_test, y_pred, average='macro')   # Equal weight per class
f1_score(y_test, y_pred, average='weighted') # Weight by class size
f1_score(y_test, y_pred, average='micro')   # Global calculation
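
To make the averaging choice concrete with made-up numbers: if per-class F1 scores are 0.9, 0.5, and 0.7 with supports of 100, 10, and 10, macro-F1 is the plain mean (0.9 + 0.5 + 0.7) / 3 = 0.70, while weighted-F1 is (0.9 × 100 + 0.5 × 10 + 0.7 × 10) / 120 ≈ 0.85 - large classes dominate. Micro averaging pools all the individual TP/FP/FN counts first, which for single-label multi-class problems works out to overall accuracy.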

Choosing Your Metric

Balanced classes?
├── Yes → Accuracy, F1, AUC all reasonable
└── No (Imbalanced)
    ├── Care more about finding positives → Recall, PR-AUC
    ├── Care more about being right when positive → Precision
    └── Need ranking ability → ROC-AUC

Key Takeaway

No single metric tells the whole story. Understand the confusion matrix, know the precision-recall tradeoff, and choose metrics based on business impact. For imbalanced data, avoid accuracy and use F1, AUC, or precision/recall. Always ask: "What's worse - false positives or false negatives?" That guides your metric choice.

Tags: Machine Learning, Metrics, Classification, Evaluation, Intermediate