
Deep Dive into Classification Metrics

Master classification metrics beyond accuracy - precision, recall, F1, ROC-AUC, and when to use each.

Sarah Chen
December 19, 2025

Accuracy can lie. A spam filter that blocks everything lets 0% of spam through - but it also blocks all of your real emails. Let's understand classification metrics properly.
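To see how misleading accuracy gets on imbalanced data, here is a minimal sketch using synthetic labels and scikit-learn's `DummyClassifier` as a do-nothing baseline (the 99:1 split is made up for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily imbalanced labels: ~99% negative, ~1% positive
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = np.zeros((len(y), 1))  # features don't matter for this baseline

# A "model" that always predicts the majority class (negative)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")  # ~0.99, looks great
print(f"Recall:   {recall_score(y, y_pred):.3f}")    # 0.0, catches no positives at all
```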

The Confusion Matrix

Everything starts here:

```
              Predicted
              Neg   Pos
Actual  Neg   TN    FP   ← False Positive: "False alarm"
        Pos   FN    TP   ← False Negative: "Missed it"
               ↑
          Type II error
```

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
```

Precision vs Recall

**Precision:** Of predicted positives, how many are correct?

```
Precision = TP / (TP + FP)
```

"When I say yes, am I right?"

**Recall (Sensitivity):** Of actual positives, how many did we find?

```
Recall = TP / (TP + FN)
```

"Of all the positives, how many did I catch?"
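To make the formulas concrete, here is a tiny hand computation from confusion-matrix counts (the numbers are made up for illustration):

```python
# Illustrative counts (made up): 80 true positives, 20 false positives, 40 false negatives
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)   # 80 / 100 = 0.80 → when we say "positive", we're right 80% of the time
recall    = TP / (TP + FN)   # 80 / 120 ≈ 0.67 → we catch two thirds of the actual positives

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```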

The Tradeoff

```
High threshold: "Only predict positive when very confident"
  → High precision (fewer false alarms)
  → Low recall (miss more positives)

Low threshold: "Predict positive more liberally"
  → Low precision (more false alarms)
  → High recall (catch more positives)
```
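You can act on this tradeoff directly by thresholding predicted probabilities yourself instead of relying on the default 0.5 cutoff of `predict`. A sketch, assuming `model` is an already-fitted classifier and `X_test`/`y_test` are your test split (the cutoffs are just illustrative):

```python
from sklearn.metrics import precision_score, recall_score

# Probability of the positive class from the fitted classifier
y_proba = model.predict_proba(X_test)[:, 1]

for threshold in [0.2, 0.5, 0.9]:  # illustrative cutoffs, low to high
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```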

When to Prioritize What

| Scenario | Prioritize | Why |
|----------|-----------|-----|
| Spam filter | Precision | Don't lose important emails |
| Cancer screening | Recall | Don't miss any cancer cases |
| Fraud detection | Usually recall | Don't let fraud through |
| Recommendation | Precision | Don't annoy users |
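Once you know which metric matters, make model selection optimize for it. As a sketch, scikit-learn's `scoring` parameter accepts names like `'precision'` and `'recall'`; the logistic regression and the `X_train`/`y_train` arrays here are placeholder assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pick the scoring string that matches your priority from the table above
model = LogisticRegression(max_iter=1000)
recall_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='recall')
precision_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='precision')

print(f"Mean recall:    {recall_scores.mean():.3f}")
print(f"Mean precision: {precision_scores.mean():.3f}")
```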

F1 Score: The Balance

Harmonic mean of precision and recall:

``` F1 = 2 × (Precision × Recall) / (Precision + Recall) ```

Use when you need both precision and recall to be good.
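The harmonic mean punishes imbalance between the two. For example, with precision 0.9 and recall 0.1, the arithmetic mean would be a flattering 0.5, but:

```
F1 = 2 × (0.9 × 0.1) / (0.9 + 0.1) = 0.18
```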

```python
from sklearn.metrics import precision_score, recall_score, f1_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
```

ROC Curve and AUC

The ROC curve plots the true positive rate against the false positive rate across all classification thresholds:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Get probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

# Plot
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
```

**AUC Interpretation:**
- 1.0: Perfect
- 0.5: Random guessing
- <0.5: Worse than random (flip predictions!)
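AUC also has a handy probabilistic reading: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal sketch checking this against `roc_auc_score` on made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, purely for illustration
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3])

# Compare every (positive, negative) pair: how often does the positive score higher?
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()  # ties would count as 0.5 in general

print(pairwise)                        # ≈ 0.667: 4 of 6 pairs ranked correctly
print(roc_auc_score(y_true, y_score))  # same value
```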

Precision-Recall Curve

Often more informative than ROC for imbalanced data, because it is not inflated by the large number of true negatives:

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'AP = {average_precision_score(y_test, y_proba):.3f}')
```
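One practical use of the curve is picking an operating threshold. Continuing from the arrays returned above, here is a sketch that selects the threshold maximizing F1 (one reasonable choice among many, not the only one):

```python
import numpy as np

# precision_recall_curve returns one more precision/recall value than thresholds,
# so drop the final (precision=1, recall=0) point before combining them
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_idx = np.argmax(f1_scores)

print(f"Best threshold: {thresholds[best_idx]:.3f}")
print(f"Precision: {precision[best_idx]:.3f}, "
      f"Recall: {recall[best_idx]:.3f}, F1: {f1_scores[best_idx]:.3f}")
```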

Multi-class Metrics

```python
from sklearn.metrics import classification_report, f1_score

# Detailed per-class report (precision, recall, F1, support)
print(classification_report(y_test, y_pred))

# Averaging strategies
f1_score(y_test, y_pred, average='macro')     # Equal weight per class
f1_score(y_test, y_pred, average='weighted')  # Weight by class size
f1_score(y_test, y_pred, average='micro')     # Global calculation over all samples
```
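To see how the averaging choices differ, here is a tiny made-up 3-class example where one class dominates and the rare class is always missed:

```python
from sklearn.metrics import f1_score

# Made-up labels: class 0 dominates, class 2 is rare and never predicted
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0]

print(f1_score(y_true, y_pred, average='macro'))     # ≈ 0.51, punished hard by the failing rare class
print(f1_score(y_true, y_pred, average='weighted'))  # ≈ 0.72, dominated by the big class
print(f1_score(y_true, y_pred, average='micro'))     # ≈ 0.78, equals accuracy for multi-class
```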

Choosing Your Metric

```
Balanced classes?
├── Yes → Accuracy, F1, AUC all reasonable
└── No (Imbalanced)
    ├── Care more about finding positives → Recall, PR-AUC
    ├── Care more about being right when positive → Precision
    └── Need ranking ability → ROC-AUC
```

Key Takeaway

No single metric tells the whole story. Understand the confusion matrix, know the precision-recall tradeoff, and choose metrics based on business impact. For imbalanced data, avoid accuracy and use F1, AUC, or precision/recall. Always ask: "What's worse - false positives or false negatives?" That guides your metric choice.

#Machine Learning#Metrics#Classification#Evaluation#Intermediate