
Overfitting and Underfitting Explained

Understand the critical concepts of overfitting and underfitting - the key to building models that actually work.

Sarah Chen
December 19, 2025

Overfitting and Underfitting

These two concepts will make or break your ML models. Understanding them is crucial.

The Core Problem

We want models that work on new, unseen data—not just the training data.

Overfitting: The Memorizer

Definition: Model learns the training data TOO well, including noise and random patterns.

Think of a student who memorizes every answer word-for-word but can't handle slightly different questions.

Training accuracy: 99%
Test accuracy: 65%
← Huge gap = Overfitting!

Visual

Data points: ∙

Overfit model tries to hit EVERY point:
  │    ∙
  │  ╱╲  ╱∙
  │ ∙  ╲╱  ╲
  │╱         ∙
  └───────────
  (Too wiggly!)

Signs of Overfitting

  • Training accuracy >> Test accuracy
  • Model is very complex
  • Performance varies wildly on different test sets
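The gap described above is easy to demonstrate. In this sketch (dataset and hyperparameters are illustrative, not from the article), an unpruned decision tree memorizes a small noisy dataset, scoring perfectly on training data but much worse on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small, noisy dataset: flip_y adds label noise the model should NOT learn
X, y = make_classification(n_samples=200, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# No depth limit: the tree can grow until it fits every training point
tree = DecisionTreeClassifier(max_depth=None, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train={train_acc:.2f}  test={test_acc:.2f}")  # large gap = overfitting
```

Setting max_depth=None lets the tree split until every training example is classified correctly, which is exactly the "memorizer" behavior.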

Underfitting: The Oversimplifier

Definition: Model is too simple to capture the real patterns.

Think of a student who only learned "the answer is always C."

Training accuracy: 60%
Test accuracy: 58%
← Both low = Underfitting!

Visual

Data follows a curve, but model is a straight line:
  │      ∙  ∙
  │    ∙ ────────
  │  ∙     (Too simple!)
  │∙
  └───────────

Signs of Underfitting

  • Both training and test accuracy are low
  • Model is too simple for the data
  • High bias
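A minimal sketch of the underfitting pattern (synthetic data, chosen for illustration): fitting a straight line to a curved relationship gives a low score on training and test data alike.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.05, size=200)  # curved relationship

# A straight line cannot capture y = x^2 on a symmetric interval
line = LinearRegression().fit(x[:100], y[:100])
train_r2 = line.score(x[:100], y[:100])
test_r2 = line.score(x[100:], y[100:])
print(f"train R^2={train_r2:.2f}  test R^2={test_r2:.2f}")  # both low = underfitting
```

Note there is no train/test gap here; both scores are bad because the model family is too simple, not because it memorized anything.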

The Sweet Spot

       │
Error  │╲
       │ ╲    ╱  Test Error
       │  ╲__╱   
       │   ╲
       │    ╲____  Training Error
       │
       └────────────────
          Model Complexity →
       
       Underfit │ Just Right │ Overfit

Goal: Find the complexity where test error is minimized.
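One way to search for that sweet spot is scikit-learn's validation_curve, which scores a model across a range of complexity settings. Here (an illustrative setup, not from the article) tree depth plays the role of model complexity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, flip_y=0.2, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Training score keeps climbing with depth; validation score peaks, then drops
for d, tr, va in zip(depths, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.2f}  val={va:.2f}")
```

Pick the depth where the validation score is highest; pushing complexity past that point only widens the train/validation gap.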

Bias vs Variance

This is the technical way to describe the tradeoff:

                 High Bias         High Variance
Meaning          Oversimplified    Overfit
Training error   High              Low
Test error       High              High
Model            Too simple        Too complex
Fix              More complexity   Less complexity

Ideal: Low bias AND low variance (hard to achieve!)

How to Fix Overfitting

1. Get More Data

More examples = harder to memorize = must learn real patterns

2. Simplify the Model

# Reduce complexity
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

tree = DecisionTreeClassifier(max_depth=3)  # limit tree depth
model = LinearRegression()  # or switch to a simpler model class

3. Regularization

Add penalty for complexity:

from sklearn.linear_model import Ridge, Lasso

# Ridge (L2) - shrinks coefficients
model = Ridge(alpha=1.0)

# Lasso (L1) - can zero out features
model = Lasso(alpha=1.0)
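The "can zero out features" behavior of Lasso is easy to see on synthetic data (this setup is illustrative: only the first two of ten features carry signal):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # 2 real features

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # coefficients for the 8 noise features driven to exactly 0
print("nonzero features:", np.flatnonzero(lasso.coef_))
```

Ridge would instead shrink all ten coefficients toward zero without eliminating any, which is why Lasso doubles as a feature-selection tool.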

4. Dropout (Neural Networks)

Randomly ignore neurons during training.
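The mechanism can be sketched in plain NumPy without any deep learning framework. This is the "inverted dropout" variant: surviving activations are rescaled so their expected value is unchanged, and nothing is dropped at inference time.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Zero each unit with probability p during training and scale
    survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training:
        return activations  # at inference, dropout is a no-op
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1 - p)

rng = np.random.default_rng(0)
a = np.ones(10)
print(dropout(a, p=0.5, rng=rng))  # roughly half zeros, survivors scaled to 2.0
```

Because each forward pass sees a different random subnetwork, no single neuron can be relied on to memorize a training example.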

5. Early Stopping

Stop training before the model starts overfitting.

# Monitor validation loss, stop when it starts increasing
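The monitoring logic is usually implemented with a "patience" counter. In this sketch, the val_losses list stands in for the validation loss you would measure after each epoch of a real training loop:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: when validation loss
    has not improved for `patience` consecutive epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0  # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1  # ran out of epochs without triggering

# Loss falls, bottoms out at epoch 3, then starts climbing
losses = [0.9, 0.7, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
print(early_stop_epoch(losses))  # → 6
```

In practice you would also restore the weights saved at the best epoch, not the final one.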

6. Cross-Validation

Test on multiple splits to get reliable performance estimate.
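With scikit-learn this is one call to cross_val_score (the iris dataset here is just a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# cv=5: train and score on 5 different train/validation splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)        # one accuracy per fold
print(scores.mean(), scores.std())  # a large std flags unstable performance
```

A model whose fold scores vary wildly is showing the third overfitting sign from the list above.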

How to Fix Underfitting

1. Add More Features

Give the model more information to work with.

2. Use a More Complex Model

# From linear to polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)  # adds squared, cubed, and interaction terms

# From a single decision tree to a random forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)

3. Reduce Regularization

from sklearn.linear_model import Ridge
model = Ridge(alpha=0.01)  # smaller alpha = weaker penalty

4. Train Longer

Give the model more time to learn.

Practical Checklist

□ Compare training vs validation accuracy
  - Big gap → Overfitting
  - Both low → Underfitting
  - Small gap, both good → Just right!

□ Use cross-validation for reliable estimates

□ Start simple, add complexity gradually

□ Always have a held-out test set for final evaluation

Code Example: Detecting Overfitting

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Any estimator and dataset work here; a decision tree on synthetic data
# makes the gap easy to see
X, y = make_classification(n_samples=500, random_state=0)
model = DecisionTreeClassifier(random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)

plt.plot(train_sizes, train_scores.mean(axis=1), label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()

If the curves converge at a high score = Good fit
If a large gap persists = Overfitting
If both curves plateau at a low score = Underfitting

Key Takeaway

Overfitting: "I memorized the textbook but can't answer new questions"
Underfitting: "I barely studied and can't answer any questions"
Good fit: "I understood the concepts and can apply them"

Always validate on unseen data. Training accuracy lies!

#MachineLearning #Overfitting #Underfitting #Beginner