# Learning Curves: Diagnosing Model Performance
Use learning curves to diagnose overfitting and underfitting, and to determine whether more data will help.
Should you get more data? Use a more complex model? Learning curves answer these questions visually.
## What Learning Curves Show
Plot model performance as training set size increases:
```
Performance
│
│        ╭─────── Training score
│      ╱
│    ╱     ╭───── Validation score
│  ╱     ╱
│╱     ╱
└───────────────── Training set size
```
## Creating Learning Curves
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(model, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=cv,
        scoring='accuracy'
    )

    # Calculate mean and std across CV folds
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score')
    plt.fill_between(train_sizes, train_mean - train_std,
                     train_mean + train_std, alpha=0.1)
    plt.plot(train_sizes, val_mean, label='Validation score')
    plt.fill_between(train_sizes, val_mean - val_std,
                     val_mean + val_std, alpha=0.1)
    plt.xlabel('Training set size')
    plt.ylabel('Score')
    plt.title('Learning Curve')
    plt.legend()
    plt.grid(True)
    plt.show()

# Usage
plot_learning_curve(RandomForestClassifier(), X, y)
```
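The same helper works for regression if you swap in a regression scorer. A minimal sketch, assuming a `RandomForestRegressor` and scikit-learn's built-in `'neg_mean_squared_error'` scorer (both are illustrative choices, not part of the recipe above):

```python
# Learning curve for a regression model: higher neg-MSE (closer to 0) is better.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve
import numpy as np

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',  # scikit-learn scorers are "higher is better"
    n_jobs=-1                          # run the CV fits in parallel
)
```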
## Pattern 1: High Bias (Underfitting)
```
Score
│      ___________ Training
│    ╱
│___╱____________ Validation
│
└───────────────── Training size
```

Both scores are low and have converged. More data won't help!
**Diagnosis:** Model too simple

**Solutions** (sketched in code below):
- More complex model
- Add features
- Reduce regularization
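As a sketch of what "more complex model / add features / reduce regularization" can look like, assuming a logistic-regression baseline (the polynomial degree and `C` value are illustrative, not tuned):

```python
# Add capacity when the learning curve signals high bias:
# engineer extra features and weaken the L2 penalty (larger C).
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

richer_model = make_pipeline(
    PolynomialFeatures(degree=2),              # add interaction/polynomial features
    StandardScaler(),
    LogisticRegression(C=10.0, max_iter=5000)  # less regularization than the default C=1.0
)

plot_learning_curve(richer_model, X, y)        # both curves should move up if bias was the problem
```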
## Pattern 2: High Variance (Overfitting)
```
Score
│──────────────── Training (high)
│
│
│    ╱─────────── Validation (gap!)
│___╱
└───────────────── Training size
```

There is a big gap between the training and validation scores.
**Diagnosis:** Model memorizing the training data

**Solutions** (sketched in code below):
- More training data
- Simpler model
- More regularization
- Feature selection
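And a sketch of the opposite move, "simpler model / more regularization", here by constraining a random forest (`max_depth=5` and `min_samples_leaf=10` are arbitrary illustrative values):

```python
# Rein in an overfitting random forest by limiting how far each tree can grow.
from sklearn.ensemble import RandomForestClassifier

regularized_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,            # shallower trees memorize less
    min_samples_leaf=10     # each leaf must cover at least 10 samples
)

plot_learning_curve(regularized_model, X, y)   # the train/validation gap should shrink
```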
## Pattern 3: Good Fit
```
Score
│     ╭─────────── Training
│   ╱    ╭───────── Validation
│  ╱    ╱
│ ╱    ╱   Small gap, both high
│╱    ╱
└───────────────── Training size
```
Both scores are high and close together.
## Will More Data Help?
```
Gap still present at max training size?
├── Yes → More data likely helps
│         (curves haven't converged)
└── No  → More data won't help
          (need better model/features)
```
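If you prefer to check this numerically rather than by eye, here is a minimal sketch; `model`, `X`, and `y` stand for whatever estimator and data you are diagnosing, and the 0.02 gap threshold is an arbitrary assumption:

```python
# Measure the train/validation gap and the validation trend at the largest training sizes.
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)

val_mean = val_scores.mean(axis=1)
gap = train_scores.mean(axis=1)[-1] - val_mean[-1]   # gap at max training size
trend = val_mean[-1] - val_mean[-3]                  # is validation still improving?

if gap > 0.02 and trend > 0:
    print("Curves haven't converged; more data is likely to help.")
else:
    print("Curves look converged; improve the model or features instead.")
```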
## Comparing Models
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, model) in zip(axes, models.items()):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        cv=5
    )
    ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
    ax.plot(train_sizes, val_scores.mean(axis=1), label='Val')
    ax.set_title(name)
    ax.legend()

plt.tight_layout()
plt.show()
```
## Validation Curves (Hyperparameter Effect)
Validation curves are different from learning curves: instead of varying the training set size, they show the effect of a single hyperparameter:
```python
from sklearn.model_selection import validation_curve

param_range = [1, 10, 50, 100, 200, 500]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(), X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5
)

plt.plot(param_range, train_scores.mean(axis=1), label='Train')
plt.plot(param_range, val_scores.mean(axis=1), label='Validation')
plt.xlabel('n_estimators')
plt.ylabel('Score')
plt.legend()
```
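To read a choice off the validation curve programmatically, one simple option (reusing the `param_range` and `val_scores` arrays from the block above) is to take the value with the best mean validation score:

```python
# Pick the hyperparameter value whose mean validation score is highest.
best_idx = val_scores.mean(axis=1).argmax()
print(f"Best n_estimators: {param_range[best_idx]} "
      f"(mean validation score {val_scores.mean(axis=1)[best_idx]:.3f})")
```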
## Key Takeaway
Learning curves are diagnostic tools that save time and money. Before collecting more data, check if curves have converged - if yes, more data won't help. Before trying complex models, check if you're underfitting (need more complexity) or overfitting (need less). Let the data guide your decisions!