# Model Evaluation Metrics You Must Know
Learn the essential metrics for evaluating ML models - accuracy, precision, recall, F1, ROC, and more.
"My model is 95% accurate!" Sounds great, right? But what if 95% of your data is one class? Let's learn the RIGHT metrics.
## Classification Metrics
### The Confusion Matrix
The foundation of all classification metrics:
```
              Predicted
              Neg   Pos
Actual  Neg  [ TN    FP ]
        Pos  [ FN    TP ]
```
- **True Negative (TN):** Correctly predicted negative
- **True Positive (TP):** Correctly predicted positive
- **False Positive (FP):** Said positive, was negative (Type I error)
- **False Negative (FN):** Said negative, was positive (Type II error)
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]
```
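For binary problems it's often handy to unpack the four cells by name. A minimal sketch with made-up labels, using the fact that `ravel()` flattens the 2×2 matrix in TN, FP, FN, TP order:

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels, just to show the unpacking
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```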
### Accuracy
```python
accuracy = (TP + TN) / (TP + TN + FP + FN)
```
Percentage of correct predictions.
**When to use:** Balanced classes
**Problem:** Misleading with imbalanced data
```
Dataset:  950 non-fraud, 50 fraud
Model:    Always predicts "non-fraud"
Accuracy: 95% 🎉 (but useless!)
```
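You can reproduce that trap in a few lines. A toy sketch with synthetic labels, where the "model" is just a constant prediction:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic imbalanced labels: 950 non-fraud (0), 50 fraud (1)
y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros_like(y_true)  # a "model" that always predicts non-fraud

print(accuracy_score(y_true, y_pred))    # 0.95 -- looks impressive
print((y_pred[y_true == 1] == 1).sum())  # 0    -- fraud cases actually caught
```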
### Precision
```python
precision = TP / (TP + FP)
```
Of all positive predictions, how many were correct?
**Use when:** False positives are costly
- Spam filter (don't mark good email as spam)
- Recommender systems (don't waste the user's time)
### Recall (Sensitivity)
```python
recall = TP / (TP + FN)
```
Of all actual positives, how many did we find?
**Use when:** False negatives are costly
- Disease detection (don't miss sick patients!)
- Fraud detection (don't miss fraudsters!)
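To see how the two differ on the same predictions, here's a toy calculation with made-up counts (say a fraud model that catches 40 of 50 real fraud cases but also flags 60 legitimate transactions):

```python
# Hypothetical counts for a fraud model
tp = 40   # fraud cases correctly flagged
fn = 10   # fraud cases missed
fp = 60   # legitimate transactions wrongly flagged

precision = tp / (tp + fp)   # 40 / 100 = 0.40
recall = tp / (tp + fn)      # 40 / 50  = 0.80
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

High recall but mediocre precision: great if missing fraud is the bigger risk, painful if every alert triggers a manual review.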
### Precision vs Recall Tradeoff
```
↑ Precision → ↓ Recall   (more conservative predictions)
↑ Recall    → ↓ Precision (more liberal predictions)
```
You can't have both perfect. Choose based on what's worse:
- Missing a positive (FN) → Focus on recall
- A false alarm (FP) → Focus on precision
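In practice you move along this tradeoff by changing the decision threshold on predicted probabilities. A minimal sketch on synthetic data (logistic regression here is just a stand-in for whatever classifier you're using):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data, purely for illustration
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# Higher threshold -> more conservative -> precision tends up, recall tends down
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y, pred):.2f}, "
          f"recall={recall_score(y, pred):.2f}")
```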
### F1 Score
```python
f1 = 2 * (precision * recall) / (precision + recall)
```
Harmonic mean of precision and recall. Balanced metric.
**Use when:** You care about both precision and recall equally.
```python
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(f"F1:        {f1_score(y_true, y_pred):.3f}")
```
### Classification Report
Get all metrics at once:
```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))
```
```
              precision    recall  f1-score   support

     class 0       0.95      0.98      0.96       950
     class 1       0.72      0.56      0.63        50

    accuracy                           0.94      1000
```
### ROC Curve and AUC
**ROC Curve:** Plot of True Positive Rate vs False Positive Rate at different thresholds.
**AUC (Area Under the Curve):** A single number summarizing the ROC curve.
- AUC = 1.0: Perfect model
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random
```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not hard predictions
y_proba = model.predict_proba(X_test)[:, 1]

# AUC score
auc = roc_auc_score(y_true, y_proba)
print(f"AUC: {auc:.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()
```
## Regression Metrics
### Mean Absolute Error (MAE)
```python
mae = np.mean(np.abs(predicted - actual))
```
Average absolute difference. Easy to interpret.
### Mean Squared Error (MSE)
```python
mse = np.mean((predicted - actual) ** 2)
```
Penalizes large errors more heavily.
### Root Mean Squared Error (RMSE)
```python
rmse = np.sqrt(mse)
```
Same units as target variable. Most commonly used.
### R² Score
```python
r2 = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - np.mean(actual)) ** 2)
```
- R² = 1: Perfect predictions
- R² = 0: As good as predicting the mean
- R² < 0: Worse than predicting the mean!
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```
## Which Metric to Use?
| Problem | Recommended Metric |
|---------|--------------------|
| Balanced classification | Accuracy, F1 |
| Imbalanced classification | F1, AUC, Precision/Recall |
| Ranking | AUC |
| Regression (normal errors) | RMSE, R² |
| Regression (with outliers) | MAE |
## Quick Reference
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score
)
```
## Key Takeaway
Don't just report accuracy. Understand your problem:
- What's the cost of different errors?
- Is your data balanced?
- Do you need probabilities or hard predictions?
Choose metrics that match your real-world goals!