
Model Evaluation Metrics You Must Know

Learn the essential metrics for evaluating ML models - accuracy, precision, recall, F1, ROC, and more.

Sarah Chen
December 19, 2025


"My model is 95% accurate!" Sounds great, right? But what if 95% of your data is one class? Let's learn the RIGHT metrics.

## Classification Metrics

### The Confusion Matrix

The foundation of all classification metrics:

```
              Predicted
              Neg   Pos
Actual  Neg [ TN    FP ]
        Pos [ FN    TP ]
```

- **True Negative (TN):** Correctly predicted negative
- **True Positive (TP):** Correctly predicted positive
- **False Positive (FP):** Said positive, was negative (Type I error)
- **False Negative (FN):** Said negative, was positive (Type II error)

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]
```

### Accuracy

```python
accuracy = (TP + TN) / (TP + TN + FP + FN)
```

Percentage of correct predictions.

**When to use:** Balanced classes

**Problem:** Misleading with imbalanced data

```
Dataset:  950 non-fraud, 50 fraud
Model:    Always predicts "non-fraud"
Accuracy: 95% 🎉 (but useless!)
```
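
To see this failure mode in code, here's a minimal sketch: the dataset and the always-majority baseline (scikit-learn's `DummyClassifier`) are made up purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 950 non-fraud (0) and 50 fraud (1) examples; features don't matter for this baseline
X = np.zeros((1000, 1))
y = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.95 -- looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")    # 0.00 -- misses every fraud case
```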

### Precision

```python
precision = TP / (TP + FP)
```

Of all positive predictions, how many were correct?

**Use when:** False positives are costly
- Spam filter (don't mark good email as spam)
- Recommender systems (don't waste user's time)

### Recall (Sensitivity)

```python
recall = TP / (TP + FN)
```

Of all actual positives, how many did we find?

**Use when:** False negatives are costly
- Disease detection (don't miss sick patients!)
- Fraud detection (don't miss fraudsters!)

### Precision vs Recall Tradeoff

```
↑ Precision → ↓ Recall (more conservative predictions)
↑ Recall → ↓ Precision (more liberal predictions)
```

You can't maximize both at once. Choose based on which mistake is worse (the threshold sketch below makes this concrete):
- Missing a positive (FN) → Focus on Recall
- False alarm (FP) → Focus on Precision
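
One way to see the tradeoff is to sweep the decision threshold yourself. A minimal sketch, assuming a fitted binary classifier `model` and held-out `X_test` / `y_true`, as in the ROC example further down:

```python
from sklearn.metrics import precision_score, recall_score

# Assumes a fitted binary classifier `model` and held-out X_test / y_true
y_proba = model.predict_proba(X_test)[:, 1]

# Raising the threshold makes the model more conservative: precision tends up, recall down
for threshold in (0.3, 0.5, 0.7):
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred_t, zero_division=0)
    r = recall_score(y_true, y_pred_t, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```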

### F1 Score

```python
f1 = 2 * (precision * recall) / (precision + recall)
```

Harmonic mean of precision and recall. Balanced metric.

**Use when:** You care about both precision and recall equally.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")
```

### Classification Report

Get all metrics at once:

```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))
```

```
              precision    recall  f1-score   support

     class 0       0.95      0.98      0.96       950
     class 1       0.72      0.56      0.63        50

    accuracy                           0.94      1000
```

### ROC Curve and AUC

**ROC Curve:** Plot of True Positive Rate vs False Positive Rate at different thresholds.

**AUC (Area Under Curve):** Single number summarizing the ROC curve.
- AUC = 1.0: Perfect model
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random

```python
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not predictions
y_proba = model.predict_proba(X_test)[:, 1]

# AUC score
auc = roc_auc_score(y_true, y_proba)
print(f"AUC: {auc:.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()
```

## Regression Metrics

### Mean Absolute Error (MAE)

```python
mae = np.mean(np.abs(y_pred - y_true))
```

Average absolute difference. Easy to interpret.

### Mean Squared Error (MSE)

```python
mse = np.mean((y_pred - y_true) ** 2)
```

Penalizes large errors more heavily.

### Root Mean Squared Error (RMSE)

```python
rmse = np.sqrt(mse)
```

Same units as target variable. Most commonly used.

### R² Score

```python
r2 = 1 - ss_res / ss_tot  # ss_res: sum of squared errors, ss_tot: total variance around the mean
```

- R² = 1: Perfect predictions
- R² = 0: As good as predicting the mean
- R² < 0: Worse than predicting the mean!
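
To see where the number comes from, here's a minimal sketch computing R² straight from its definition (it assumes `y_true` and `y_pred` are NumPy arrays of the same shape):

```python
import numpy as np

ss_res = np.sum((y_true - y_pred) ** 2)           # sum of squared errors
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variance around the mean
r2 = 1 - ss_res / ss_tot

print(f"R² (manual): {r2:.3f}")  # matches sklearn's r2_score
```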

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.3f}")
```

## Which Metric to Use?

| Problem | Recommended Metric |
|---------|--------------------|
| Balanced classification | Accuracy, F1 |
| Imbalanced classification | F1, AUC, Precision/Recall |
| Ranking | AUC |
| Regression (normal errors) | RMSE, R² |
| Regression (with outliers) | MAE |
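
Whichever metric you pick, you can usually pass it straight to scikit-learn's model-selection tools via the `scoring` argument. A minimal sketch, assuming a binary classification dataset `X`, `y` (the `LogisticRegression` estimator is just a placeholder):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(max_iter=1000)  # placeholder estimator

# "f1" and "roc_auc" are built-in scikit-learn scorer names
f1_scores = cross_val_score(model, X, y, cv=5, scoring="f1")
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print(f"F1:  {f1_scores.mean():.3f} ± {f1_scores.std():.3f}")
print(f"AUC: {auc_scores.mean():.3f} ± {auc_scores.std():.3f}")
```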

## Quick Reference

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score
)
```

## Key Takeaway

Don't just report accuracy. Understand your problem:
- What's the cost of different errors?
- Is your data balanced?
- Do you need probabilities or hard predictions?

Choose metrics that match your real-world goals!

Tags: Machine Learning, Metrics, Model Evaluation, Beginner