ML · 7 min read

Learning Curves: Diagnosing Model Performance

Use learning curves to diagnose overfitting and underfitting, and to decide whether more data will help.

Sarah Chen
December 19, 2025

Should you get more data? Use a more complex model? Learning curves answer these questions visually.

What Learning Curves Show

Plot model performance as training set size increases:

Performance
    │──╮
    │  ╰──────────── Training score
    │       ╭─────── Validation score
    │    ╭──╯
    │╭───╯
    └─────────────────
       Training set size

As the training set grows, the training score typically drifts down toward an asymptote while the validation score rises toward it; the gap between them is what you read.

Creating Learning Curves

from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv,
        scoring='accuracy'
    )
    
    # Calculate mean and std
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std, 
                     train_mean + train_std, alpha=0.1)
    plt.plot(train_sizes, val_mean, label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std,
                     val_mean + val_std, alpha=0.1)
    
    plt.xlabel('Training set size')
    plt.ylabel('Score')
    plt.title('Learning Curve')
    plt.legend()
    plt.grid(True)
    plt.show()

# Usage (X, y are your feature matrix and labels)
from sklearn.ensemble import RandomForestClassifier

plot_learning_curve(RandomForestClassifier(), X, y)
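
X and y above stand for your own feature matrix and labels. If you just want to see the plot, a synthetic dataset works as a stand-in; a minimal sketch with arbitrary sizes:

from sklearn.datasets import make_classification

# Synthetic binary classification data, purely to exercise the helper
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=42)

plot_learning_curve(RandomForestClassifier(n_estimators=100), X, y)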

Pattern 1: High Bias (Underfitting)

Score
    │     ___________  Training
    │    ╱
    │___╱____________  Validation
    │
    └─────────────────
       Training size

Both scores low and converged.
More data won't help!

Diagnosis: the model is too simple.
Solutions (a sketch follows the list):

  • More complex model
  • Add features
  • Reduce regularization
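
As a concrete illustration of the last two fixes, here is a minimal sketch; the pipeline, polynomial degree, and C value are illustrative choices, not part of the original example. It adds polynomial features and weakens regularization on a logistic regression, then re-plots the learning curve with the helper defined above:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Add capacity: polynomial features plus a larger C (weaker L2 regularization)
more_complex = make_pipeline(
    PolynomialFeatures(degree=2),               # interaction and squared features
    StandardScaler(),
    LogisticRegression(C=10.0, max_iter=1000)   # larger C means less regularization
)

plot_learning_curve(more_complex, X, y)

If high bias really was the problem, both curves should move up together.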

Pattern 2: High Variance (Overfitting)

Score
    │────────────────  Training (high)
    │
    │
    │    ╱───────────  Validation (gap!)
    │___╱
    └─────────────────
       Training size

Big gap between training and validation.

Diagnosis: the model is memorizing the training data.
Solutions (a sketch follows the list):

  • More training data
  • Simpler model
  • More regularization
  • Feature selection
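
For the model-side fixes, here is a minimal sketch, again with illustrative hyperparameter values, that constrains a random forest so it can no longer memorize the training set:

from sklearn.ensemble import RandomForestClassifier

# Rein in capacity: shallower trees and larger leaves act as regularization
constrained_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=6,             # limit how deep each tree can grow
    min_samples_leaf=20,     # require more samples in every leaf
    max_features='sqrt',     # consider fewer features at each split
    random_state=42
)

plot_learning_curve(constrained_rf, X, y)

If overfitting was the issue, the gap between the two curves should shrink, even if the training score drops a little.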

Pattern 3: Good Fit

Score
    │    ╭───────────  Training
    │   ╱  ╭─────────  Validation
    │  ╱  ╱
    │ ╱  ╱  Small gap, both high
    │╱  ╱
    └─────────────────
       Training size

Both scores high and close together.

Will More Data Help?

Gap still present at max training size?
├── Yes → More data likely helps
│         (curves haven't converged)
└── No → More data won't help
         (need better model/features)
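
To turn this check into numbers instead of eyeballing the plot, here is a small sketch; model stands in for whatever estimator you are diagnosing, and the thresholds are arbitrary assumptions to tune for your metric:

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

gap = train_mean[-1] - val_mean[-1]        # gap at the largest training size
recent_gain = val_mean[-1] - val_mean[-3]  # how much validation is still improving

if gap > 0.05 and recent_gain > 0.01:      # arbitrary thresholds
    print("Curves haven't converged - more data is likely to help")
else:
    print("Curves have converged - invest in the model or features instead")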

Comparing Models

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),  # raise max_iter so the solver converges
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, model) in zip(axes, models.items()):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
    )
    
    ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
    ax.plot(train_sizes, val_scores.mean(axis=1), label='Val')
    ax.set_title(name)
    ax.legend()

plt.tight_layout()
plt.show()

Validation Curves (Hyperparameter Effect)

Unlike learning curves, which vary the training set size, validation curves show the effect of a single hyperparameter:

from sklearn.model_selection import validation_curve

param_range = [1, 10, 50, 100, 200, 500]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5
)

plt.plot(param_range, train_scores.mean(axis=1), label='Train')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation')
plt.xlabel('n_estimators')
plt.ylabel('Score')
plt.legend()
plt.show()
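
Once the curves are plotted, you can also read off the value with the best mean validation score; a minimal sketch using the arrays computed above:

val_mean = val_scores.mean(axis=1)
best_idx = int(np.argmax(val_mean))

print(f"Best n_estimators: {param_range[best_idx]} "
      f"(validation score: {val_mean[best_idx]:.3f})")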

Key Takeaway

Learning curves are diagnostic tools that save time and money. Before collecting more data, check whether the curves have converged; if they have, more data won't help. Before reaching for a more complex model, check whether you're underfitting (you need more capacity) or overfitting (you need less). Let the data guide your decisions!

#Machine Learning · #Learning Curves · #Model Diagnosis · #Intermediate