Deep Dive into Classification Metrics
Master classification metrics beyond accuracy - precision, recall, F1, ROC-AUC, and when to use each.
Accuracy can lie. If only 1% of your email is spam, a filter that lets everything through scores 99% accuracy while catching zero spam; one that blocks everything lets 0% of spam through but also blocks all your real email. Let's understand the metrics properly.
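A quick sketch of the problem, using made-up labels where only 1% of emails are spam: a "filter" that never flags anything still scores 99% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 1,000 emails, only 10 of them spam (label 1)
y_true = np.array([1] * 10 + [0] * 990)

# A "filter" that never flags anything as spam
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero spam
```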
The Confusion Matrix
Everything starts here:
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP (Type I error: "false alarm") |
| Actual Positive | FN (Type II error: "missed it") | TP |
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.show()
Precision vs Recall
Precision: Of predicted positives, how many are correct?
Precision = TP / (TP + FP)
"When I say yes, am I right?"
Recall (Sensitivity): Of actual positives, how many did we find?
Recall = TP / (TP + FN)
"Of all the positives, how many did I catch?"
The Tradeoff
- High threshold: "Only predict positive when very confident"
  - High precision (fewer false alarms)
  - Low recall (miss more positives)
- Low threshold: "Predict positive more liberally"
  - Low precision (more false alarms)
  - High recall (catch more positives)
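You can watch this tradeoff happen by sweeping the decision threshold yourself. A minimal sketch, assuming a fitted binary classifier `model` with `predict_proba` and the same `X_test`/`y_test` used in the snippets below:

```python
from sklearn.metrics import precision_score, recall_score

y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in [0.3, 0.5, 0.7, 0.9]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold rises, precision typically climbs while recall falls.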
When to Prioritize What
| Scenario | Prioritize | Why |
|---|---|---|
| Spam filter | Precision | Don't lose important emails |
| Cancer screening | Recall | Don't miss any cancer cases |
| Fraud detection | Usually recall | Don't let fraud through |
| Recommendation | Precision | Don't annoy users |
F1 Score: The Balance
Harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use when you need both precision and recall to be good.
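Because it is a harmonic mean, F1 stays low whenever either side is low; an arithmetic mean would hide that. A quick illustration with made-up numbers:

```python
p, r = 0.9, 0.1  # high precision, terrible recall

arithmetic_mean = (p + r) / 2  # 0.50 -- looks acceptable
f1 = 2 * p * r / (p + r)       # 0.18 -- exposes the weak recall
print(f"Mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")
```

With scikit-learn you get all three directly from your predictions: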
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
ROC Curve and AUC
The ROC curve plots the true positive rate against the false positive rate at every possible threshold:
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# Predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
# False positive rate and true positive rate at every threshold
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Plot against the diagonal (a random classifier)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
AUC Interpretation (the area under the curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one):
- 1.0: Perfect
- 0.5: Random guessing
- <0.5: Worse than random (flip predictions!)
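That "ranked higher" reading can be checked empirically by comparing every positive's score against every negative's score (ties count half). A sketch reusing `y_test` and `y_proba` from the ROC snippet above; it forms all positive-negative pairs, so it is only meant for small test sets:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_test = np.asarray(y_test)
pos = y_proba[y_test == 1]
neg = y_proba[y_test == 0]

# Fraction of (positive, negative) pairs where the positive is scored higher
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
print(wins + 0.5 * ties)               # matches...
print(roc_auc_score(y_test, y_proba))  # ...the library's AUC
```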
Precision-Recall Curve
More informative than ROC when the positive class is rare, because it never uses the (huge) count of true negatives:
from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'AP = {average_precision_score(y_test, y_proba):.3f}')
plt.show()
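The PR curve is also a handy way to pick an operating threshold. One common heuristic (a sketch, not the only choice) is to take the threshold that maximizes F1, using the arrays returned above; note that `precision_recall_curve` returns one more precision/recall point than thresholds:

```python
import numpy as np

# Drop the final (precision=1, recall=0) point, which has no threshold
p, r = precision[:-1], recall[:-1]
f1_scores = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best]:.3f} "
      f"(precision={p[best]:.2f}, recall={r[best]:.2f}, F1={f1_scores[best]:.2f})")
```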
Multi-class Metrics
from sklearn.metrics import classification_report
# Get detailed report per class
print(classification_report(y_test, y_pred))
# Averaging strategies for multi-class scores
f1_score(y_test, y_pred, average='macro')     # Unweighted mean of per-class F1: every class counts equally
f1_score(y_test, y_pred, average='weighted')  # Per-class F1 weighted by class size
f1_score(y_test, y_pred, average='micro')     # Pool TP/FP/FN across classes (equals accuracy for single-label problems)
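To see how the averages differ, here is a toy example with made-up labels: two common classes predicted perfectly and one rare class that is always missed. Macro averaging exposes the failure; weighted and micro largely hide it:

```python
from sklearn.metrics import f1_score

y_true = [0] * 45 + [1] * 45 + [2] * 10
y_pred = [0] * 45 + [1] * 45 + [0] * 10  # the rare class 2 is never predicted

print(f1_score(y_true, y_pred, average='macro', zero_division=0))     # ~0.63
print(f1_score(y_true, y_pred, average='weighted', zero_division=0))  # ~0.86
print(f1_score(y_true, y_pred, average='micro', zero_division=0))     # 0.90
```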
Choosing Your Metric
Balanced classes?
├── Yes → Accuracy, F1, AUC all reasonable
└── No (imbalanced)
    ├── Care more about finding positives → Recall, PR-AUC
    ├── Care more about being right when positive → Precision
    └── Need ranking ability → ROC-AUC
Key Takeaway
No single metric tells the whole story. Understand the confusion matrix, know the precision-recall tradeoff, and choose metrics based on business impact. For imbalanced data, avoid accuracy and use F1, PR-AUC, or precision and recall directly. Always ask: "What's worse, false positives or false negatives?" The answer guides your metric choice.