
Model Evaluation Metrics You Must Know

Learn the essential metrics for evaluating ML models - accuracy, precision, recall, F1, ROC, and more.

Sarah Chen
December 19, 2025


"My model is 95% accurate!" Sounds great, right? But what if 95% of your data is one class? Let's learn the RIGHT metrics.

Classification Metrics

The Confusion Matrix

The foundation of all classification metrics:

                 Predicted
                 Neg    Pos
Actual    Neg   [TN     FP]
          Pos   [FN     TP]
  • True Negative (TN): Correctly predicted negative
  • True Positive (TP): Correctly predicted positive
  • False Positive (FP): Said positive, was negative (Type I error)
  • False Negative (FN): Said negative, was positive (Type II error)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[TN, FP],
#  [FN, TP]]
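
For binary problems you can unpack the four cells directly, continuing the snippet above (a minimal sketch; assumes y_true and y_pred are 0/1 label arrays):

# Unpack the four cells; the order matches the matrix layout above
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")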

Accuracy

accuracy = (TP + TN) / (TP + TN + FP + FN)

Percentage of correct predictions.

When to use: Balanced classes
Problem: Misleading with imbalanced data

Dataset: 950 non-fraud, 50 fraud
Model: Always predicts "non-fraud"
Accuracy: 95% 🎉 (but useless!)
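
Here's a quick sketch of that trap, using made-up labels for the fraud example above (the arrays are illustrative, not a real dataset):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 950 non-fraud (0) and 50 fraud (1)
y_true = np.array([0] * 950 + [1] * 50)
# A "model" that always predicts non-fraud
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero fraud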

Precision

precision = TP / (TP + FP)

Of all positive predictions, how many were correct?

Use when: False positives are costly

  • Spam filter (don't mark good email as spam)
  • Recommender systems (don't waste user's time)

Recall (Sensitivity)

recall = TP / (TP + FN)

Of all actual positives, how many did we find?

Use when: False negatives are costly

  • Disease detection (don't miss sick patients!)
  • Fraud detection (don't miss fraudsters!)

Precision vs Recall Tradeoff

↑ Precision → ↓ Recall (more conservative predictions)
↑ Recall → ↓ Precision (more liberal predictions)

You usually can't maximize both at once, so choose based on which mistake hurts more:

  • Missing a positive (FN) → Focus on Recall
  • False alarm (FP) → Focus on Precision
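
One way to see the tradeoff is to sweep the decision threshold on a classifier's predicted probabilities. This is a sketch, assuming a fitted binary classifier model, features X_test, and labels y_true, as in the ROC example later in this post:

from sklearn.metrics import precision_score, recall_score

# Assumes `model`, `X_test`, and `y_true` already exist (see ROC section)
y_proba = model.predict_proba(X_test)[:, 1]

for threshold in [0.3, 0.5, 0.7]:
    # Higher threshold = more conservative positive predictions
    y_pred = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")

Raising the threshold typically pushes precision up and recall down; lowering it does the opposite.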

F1 Score

f1 = 2 * (precision * recall) / (precision + recall)

Harmonic mean of precision and recall. Balanced metric.

Use when: You care about both precision and recall equally.

from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")

Classification Report

Get all metrics at once:

from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

     class 0       0.95      0.98      0.96       950
     class 1       0.72      0.56      0.63        50

    accuracy                           0.94      1000

ROC Curve and AUC

ROC Curve: Plot of True Positive Rate vs False Positive Rate at different thresholds.

AUC (Area Under Curve): Single number summarizing ROC.

  • AUC = 1.0: Perfect model
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random

from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Get probabilities, not hard predictions
# (model is a fitted classifier; y_true holds the labels for X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# AUC score
auc = roc_auc_score(y_true, y_proba)
print(f"AUC: {auc:.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_true, y_proba)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve (AUC = {auc:.3f})')
plt.show()

Regression Metrics

Mean Absolute Error (MAE)

mae = mean(|predicted - actual|)

Average absolute difference. Easy to interpret.

Mean Squared Error (MSE)

mse = mean((predicted - actual)²)

Penalizes large errors more heavily.
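
A tiny worked example of that penalty, with made-up numbers: one prediction that is off by 10 blows up the MSE far more than the MAE.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values: four errors of 1 and one large error of 10
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y_pred = np.array([11.0, 21.0, 29.0, 41.0, 60.0])

print(mean_absolute_error(y_true, y_pred))  # (1+1+1+1+10)/5 = 2.8
print(mean_squared_error(y_true, y_pred))   # (1+1+1+1+100)/5 = 20.8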

Root Mean Squared Error (RMSE)

rmse = sqrt(mse)

Same units as target variable. Most commonly used.

R² Score

r2 = 1 - (sum of squared residuals) / (total sum of squares)

  • R² = 1: Perfect predictions
  • R² = 0: As good as predicting the mean
  • R² < 0: Worse than predicting the mean!

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.3f}")

Which Metric to Use?

Problem                       Recommended Metric
Balanced classification       Accuracy, F1
Imbalanced classification     F1, AUC, Precision/Recall
Ranking                       AUC
Regression (normal errors)    RMSE, R²
Regression (with outliers)    MAE

Quick Reference

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
    mean_absolute_error,
    mean_squared_error,
    r2_score
)
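
And, if it helps, a small sketch that ties the classification imports above into one report for a binary problem (the evaluate_binary helper and its arguments are my own naming, not a library function):

def evaluate_binary(y_true, y_pred, y_proba=None):
    """Print the common binary classification metrics in one go."""
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
    print(f"F1:        {f1_score(y_true, y_pred):.3f}")
    if y_proba is not None:  # pass predict_proba(...)[:, 1] to include AUC
        print(f"AUC:       {roc_auc_score(y_true, y_proba):.3f}")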

Key Takeaway

Don't just report accuracy. Understand your problem:

  • What's the cost of different errors?
  • Is your data balanced?
  • Do you need probabilities or hard predictions?

Choose metrics that match your real-world goals!

#Machine Learning #Metrics #Model Evaluation #Beginner