# Learning Curves: Diagnosing Model Performance
Use learning curves to diagnose overfitting and underfitting, and to determine whether more data will help.
Should you get more data? Use a more complex model? Learning curves answer these questions visually.
## What Learning Curves Show
Plot model performance as training set size increases:
```
Performance
│
│        ╭─────── Training score
│      ╱
│    ╱     ╭───── Validation score
│  ╱     ╱
│╱     ╱
└───────────────── Training set size
```
## Creating Learning Curves
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv,
        scoring='accuracy'
    )

    # Calculate mean and std across CV folds
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std,
                     train_mean + train_std, alpha=0.1)
    plt.plot(train_sizes, val_mean, label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std,
                     val_mean + val_std, alpha=0.1)
    plt.xlabel('Training set size')
    plt.ylabel('Score')
    plt.title('Learning Curve')
    plt.legend()
    plt.grid(True)
    plt.show()

# Usage
plot_learning_curve(RandomForestClassifier(), X, y)
```
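The same helper works for regression if you swap in a regression scorer. A minimal sketch, assuming a `RandomForestRegressor` and scikit-learn's built-in `'neg_mean_squared_error'` scorer (both are illustrative choices, not part of the recipe above):

```python
# Learning curve for a regression model: higher neg-MSE (closer to 0) is better.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',  # scikit-learn scorers are "higher is better"
    n_jobs=-1                          # run the CV fits in parallel
)
```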
## Pattern 1: High Bias (Underfitting)
```
Score
│      ___________ Training
│    ╱
│___╱____________ Validation
│
└───────────────── Training size
```

Both scores are low and have converged. More data won't help!
**Diagnosis:** Model too simple

**Solutions** (sketched in code below):
- More complex model
- Add features
- Reduce regularization
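As a sketch of what "more complex model / add features / reduce regularization" can look like, assuming a logistic-regression baseline (the polynomial degree and `C` value are illustrative, not tuned):

```python
# Add capacity when the learning curve signals high bias:
# engineer extra features and weaken the L2 penalty (larger C).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

richer_model = make_pipeline(
    PolynomialFeatures(degree=2),              # add interaction/polynomial features
    StandardScaler(),
    LogisticRegression(C=10.0, max_iter=5000)  # less regularization than the default C=1.0
)

plot_learning_curve(richer_model, X, y)        # both curves should move up if bias was the problem
```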
## Pattern 2: High Variance (Overfitting)
```
Score
│──────────────── Training (high)
│
│
│    ╱─────────── Validation (gap!)
│___╱
└───────────────── Training size
```

There is a big gap between the training and validation scores.
**Diagnosis:** Model memorizing the training data

**Solutions** (sketched in code below):
- More training data
- Simpler model
- More regularization
- Feature selection
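And a sketch of the opposite move, "simpler model / more regularization", here by constraining a random forest (`max_depth=5` and `min_samples_leaf=10` are arbitrary illustrative values):

```python
# Rein in an overfitting random forest by limiting how far each tree can grow.
from sklearn.ensemble import RandomForestClassifier

regularized_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,            # shallower trees memorize less
    min_samples_leaf=10     # each leaf must cover at least 10 samples
)

plot_learning_curve(regularized_model, X, y)   # the train/validation gap should shrink
```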
## Pattern 3: Good Fit
```
Score
│     ╭─────────── Training
│   ╱    ╭───────── Validation
│  ╱    ╱
│ ╱    ╱   Small gap, both high
│╱    ╱
└───────────────── Training size
```
Both scores are high and close together.
## Will More Data Help?
```
Gap still present at max training size?
├── Yes → More data likely helps
│         (curves haven't converged)
└── No  → More data won't help
          (need better model/features)
```
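If you prefer to check this numerically rather than by eye, here is a minimal sketch; `model`, `X`, and `y` stand for whatever estimator and data you are diagnosing, and the 0.02 gap threshold is an arbitrary assumption:

```python
# Measure the train/validation gap and the validation trend at the largest training sizes.
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)

val_mean = val_scores.mean(axis=1)
gap = train_scores.mean(axis=1)[-1] - val_mean[-1]   # gap at max training size
trend = val_mean[-1] - val_mean[-3]                  # is validation still improving?

if gap > 0.02 and trend > 0:
    print("Curves haven't converged; more data is likely to help.")
else:
    print("Curves look converged; improve the model or features instead.")
```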
## Comparing Models
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, model) in zip(axes, models.items()):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5
    )
    ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
    ax.plot(train_sizes, val_scores.mean(axis=1), label='Val')
    ax.set_title(name)
    ax.legend()

plt.tight_layout()
plt.show()
```
## Validation Curves (Hyperparameter Effect)
Validation curves are different from learning curves: instead of varying the training set size, they show the effect of a single hyperparameter:
```python
from sklearn.model_selection import validation_curve

param_range = [1, 10, 50, 100, 200, 500]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(), X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5
)

plt.plot(param_range, train_scores.mean(axis=1), label='Train')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation')
plt.xlabel('n_estimators')
plt.ylabel('Score')
plt.legend()
```
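To read a choice off the validation curve programmatically, one simple option (reusing the `param_range` and `val_scores` arrays from the block above) is to take the value with the best mean validation score:

```python
# Pick the hyperparameter value whose mean validation score is highest.
best_idx = val_scores.mean(axis=1).argmax()
print(f"Best n_estimators: {param_range[best_idx]} "
      f"(mean validation score {val_scores.mean(axis=1)[best_idx]:.3f})")
```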
## Key Takeaway
Learning curves are diagnostic tools that save time and money. Before collecting more data, check if curves have converged - if yes, more data won't help. Before trying complex models, check if you're underfitting (need more complexity) or overfitting (need less). Let the data guide your decisions!