Model Calibration: Reliable Probability Estimates
Learn how to calibrate your model so predicted probabilities reflect actual likelihoods.
When your model says "80% probability of rain," it should actually rain 80% of the time. That's calibration - making probability estimates reliable.
Why Calibration Matters
An uncalibrated model that says 90% confidence may be:
- Right 90% of the time ✓
- Right only 60% of the time ✗
A calibrated model that says 90% confidence is:
- Right 90% of the time ✓
Important for:
- Medical diagnosis
- Risk assessment
- Decision making with costs/benefits (see the sketch below)
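As a concrete example of the last point, an expected-cost decision rule only makes sense when the probabilities can be trusted. A minimal sketch with made-up costs (none of these numbers come from a real application):
import numpy as np
# Illustrative costs: suppose a missed positive is 10x more costly than a false alarm.
cost_fp, cost_fn = 1.0, 10.0
# Acting is worthwhile when p * cost_fn > (1 - p) * cost_fp,
# i.e. when p exceeds cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)  # ~0.09 with these costs
# Example probabilities; the rule is only trustworthy if they are calibrated.
y_prob_example = np.array([0.05, 0.12, 0.40, 0.85])
decisions = y_prob_example >= threshold
print(threshold, decisions)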
Checking Calibration: Reliability Diagram
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# Get predicted probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Create calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 's-', label='Model')
plt.plot([0, 1], [0, 1], '--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.legend()
plt.show()
Interpreting the Plot
- Perfectly calibrated: points lie on the diagonal.
- Under-confident: points above the diagonal (the model says 60% when the true rate is 80%).
- Over-confident: points below the diagonal (the model says 80% when the true rate is only 60%).
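If you want to see what `calibration_curve` is computing, the binning can be done by hand. A minimal sketch, assuming `y_test` and `y_prob` from the snippet above and the default equal-width bins:
import numpy as np
# Bin predictions into 10 equal-width probability bins and compare the
# average predicted probability with the observed positive rate per bin.
bins = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(y_prob, bins[1:-1])  # bin index 0..9 for each sample
for b in range(10):
    mask = bin_ids == b
    if mask.sum() == 0:
        continue
    mean_pred = y_prob[mask].mean()   # "mean predicted probability" (x-axis)
    frac_pos = y_test[mask].mean()    # "fraction of positives" (y-axis)
    print(f"bin {b}: predicted {mean_pred:.2f}, observed {frac_pos:.2f}")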
Brier Score
Measures the quality of probability estimates, including calibration (lower is better):
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_test, y_prob)
print(f"Brier Score: {brier:.4f}") # 0 = perfect, 0.25 = random
Which Models Need Calibration?
| Model | Typically Calibrated? |
|---|---|
| Logistic Regression | Usually good |
| Naive Bayes | Often extreme (needs calibration) |
| Random Forest | Usually under-confident |
| SVM | Often needs calibration |
| Neural Networks | Varies |
| XGBoost | Usually good |
Calibration Methods
1. Platt Scaling (Sigmoid)
Fits a logistic (sigmoid) function to the model's scores, learned on held-out data:
from sklearn.calibration import CalibratedClassifierCV
# Wrap your model with calibration
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)
# Now probabilities are calibrated
y_prob_calibrated = calibrated.predict_proba(X_test)[:, 1]
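To build intuition for what the sigmoid method does, here is a rough hand-rolled version of the idea: hold out part of the training data, then fit a one-feature logistic regression that maps the model's raw scores to calibrated probabilities. This is only a sketch of the concept, not how `CalibratedClassifierCV` is implemented (it reuses `model`, `X_train`, `y_train`, and `X_test` from earlier and skips the cross-validated averaging):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Hold out part of the training data purely for calibration.
X_fit, X_cal, y_fit, y_cal = train_test_split(X_train, y_train, test_size=0.25)
model.fit(X_fit, y_fit)
scores_cal = model.predict_proba(X_cal)[:, 1]
# Fit a sigmoid (one-feature logistic regression) mapping raw scores to probabilities.
platt = LogisticRegression()
platt.fit(scores_cal.reshape(-1, 1), y_cal)
scores_test = model.predict_proba(X_test)[:, 1]
y_prob_platt = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]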
2. Isotonic Regression
Non-parametric, more flexible:
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
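The isotonic version of the same sketch replaces the logistic function with a monotonically increasing, piecewise-constant mapping (reusing `model`, `scores_cal`, `y_cal`, and `X_test` from the Platt sketch above; again just an illustration of the idea, not the library internals):
from sklearn.isotonic import IsotonicRegression
# Monotonic mapping from raw scores to calibrated probabilities,
# fit on the same held-out calibration split.
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(scores_cal, y_cal)
y_prob_iso = iso.predict(model.predict_proba(X_test)[:, 1])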
Comparing Methods
# Original model
model.fit(X_train, y_train)
prob_original = model.predict_proba(X_test)[:, 1]
# Sigmoid calibration
cal_sigmoid = CalibratedClassifierCV(model, method='sigmoid', cv=5)
cal_sigmoid.fit(X_train, y_train)
prob_sigmoid = cal_sigmoid.predict_proba(X_test)[:, 1]
# Isotonic calibration
cal_isotonic = CalibratedClassifierCV(model, method='isotonic', cv=5)
cal_isotonic.fit(X_train, y_train)
prob_isotonic = cal_isotonic.predict_proba(X_test)[:, 1]
# Compare
print(f"Original Brier: {brier_score_loss(y_test, prob_original):.4f}")
print(f"Sigmoid Brier: {brier_score_loss(y_test, prob_sigmoid):.4f}")
print(f"Isotonic Brier: {brier_score_loss(y_test, prob_isotonic):.4f}")
Choosing Calibration Method
| Method | Best When |
|---|---|
| Sigmoid | Enough data, S-shaped miscalibration |
| Isotonic | Lots of data, arbitrary miscalibration |
Sigmoid is safer with less data (isotonic can overfit).
Full Pipeline Example
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Train the base (uncalibrated) model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
# Create a calibrated version (cv=5 refits clones of rf internally)
rf_calibrated = CalibratedClassifierCV(rf, method='sigmoid', cv=5)
rf_calibrated.fit(X_train, y_train)
# Compare calibration curves side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (name, clf) in zip(axes, [('Uncalibrated', rf), ('Calibrated', rf_calibrated)]):
    prob = clf.predict_proba(X_test)[:, 1]
    prob_true, prob_pred = calibration_curve(y_test, prob, n_bins=10)
    ax.plot(prob_pred, prob_true, 's-')
    ax.plot([0, 1], [0, 1], '--')
    ax.set_title(f'{name}\nBrier: {brier_score_loss(y_test, prob):.4f}')
plt.tight_layout()
plt.show()
Key Takeaway
Calibration makes probability predictions reliable. Always check calibration with reliability diagrams, especially when probabilities drive decisions. Random forests and Naive Bayes often need calibration, while logistic regression is usually well-calibrated. Use sigmoid calibration by default, isotonic if you have lots of data.