Overfitting and Underfitting Explained
Understand the critical concepts of overfitting and underfitting - the key to building models that actually work.
Overfitting and Underfitting
These two concepts will make or break your ML models. Understanding them is crucial.
The Core Problem
We want models that work on new, unseen data—not just the training data.
Overfitting: The Memorizer
Definition: Model learns the training data TOO well, including noise and random patterns.
Think of a student who memorizes every answer word-for-word but can't handle slightly different questions.
Training accuracy: 99%
Test accuracy: 65%
← Huge gap = Overfitting!
Visual
Data points: ∙
Overfit model tries to hit EVERY point:
│    ∙
│   ╱╲  ╱∙
│  ∙ ╲╱  ╲
│ ╱       ∙
└───────────
  (Too wiggly!)
Signs of Overfitting
- Training accuracy >> Test accuracy
- Model is very complex
- Performance varies wildly on different test sets
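A quick way to see this in code (a minimal sketch on synthetic data; the dataset and the unlimited-depth tree are purely illustrative):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Noisy synthetic data; an unrestricted tree will happily memorize it
X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier().fit(X_train, y_train)   # no depth limit
print("Train:", tree.score(X_train, y_train))   # close to 1.0
print("Test: ", tree.score(X_test, y_test))     # much lower → overfitting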
Underfitting: The Oversimplifier
Definition: Model is too simple to capture the real patterns.
Think of a student who only learned "the answer is always C."
Training accuracy: 60%
Test accuracy: 58%
← Both low = Underfitting!
Visual
Data follows a curve, but model is a straight line:
│        ∙  ∙
│    ∙ ────────
│  ∙    (Too simple!)
│∙
└───────────
Signs of Underfitting
- Both training and test accuracy are low
- Model is too simple for the data
- High bias
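The same thing in code (a sketch with made-up curved data; a straight line can't follow the curve, so even the training score stays low):
from sklearn.linear_model import LinearRegression
import numpy as np
# Data that follows a curve (y ≈ x²) plus noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=200)
line = LinearRegression().fit(X, y)
print("R² on the training data:", line.score(X, y))   # low even on training data → underfitting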
The Sweet Spot
        │
  Error │╲
        │ ╲     ╱  Test Error
        │  ╲__╱
        │   ╲
        │    ╲____  Training Error
        │
        └────────────────
          Model Complexity →
        Underfit │ Just Right │ Overfit
Goal: Find the complexity where test error is minimized.
Bias vs Variance
This is the technical way to describe the tradeoff:
| | High Bias | High Variance |
|---|---|---|
| Meaning | Oversimplified | Overfit |
| Training error | High | Low |
| Test error | High | High |
| Model | Too simple | Too complex |
| Fix | More complexity | Less complexity |
Ideal: Low bias AND low variance (hard to achieve!)
How to Fix Overfitting
1. Get More Data
More examples = harder to memorize = must learn real patterns
2. Simplify the Model
# Reduce complexity
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
tree = DecisionTreeClassifier(max_depth=3)   # Limit tree depth
model = LinearRegression()                   # Or switch to a simpler model family
3. Regularization
Add penalty for complexity:
from sklearn.linear_model import Ridge, Lasso
# Ridge (L2) - shrinks coefficients
model = Ridge(alpha=1.0)
# Lasso (L1) - can zero out features
model = Lasso(alpha=1.0)
4. Dropout (Neural Networks)
Randomly ignore neurons during training.
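For example, in Keras a dropout layer sits between dense layers (a minimal sketch; the layer sizes and dropout rate here are arbitrary):
from tensorflow.keras import layers, models
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(20,)),
    layers.Dropout(0.5),   # randomly zero out 50% of activations each training step
    layers.Dense(1, activation='sigmoid'),
])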
5. Early Stopping
Stop training before the model starts overfitting.
# Monitor validation loss, stop when it starts increasing
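Some scikit-learn estimators support this directly (a sketch; the parameter values are just illustrative):
from sklearn.linear_model import SGDClassifier
# Hold out 10% of the training data internally and stop once the
# validation score hasn't improved for 5 consecutive epochs
model = SGDClassifier(early_stopping=True,
                      validation_fraction=0.1,
                      n_iter_no_change=5)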
6. Cross-Validation
Test on multiple splits to get a reliable performance estimate.
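A sketch with scikit-learn (the classifier and synthetic data are just placeholders):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores.mean(), scores.std())   # mean score and its spread across the 5 splits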
How to Fix Underfitting
1. Add More Features
Give the model more information to work with.
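For instance, a derived feature can expose a pattern the raw columns don't show on their own (a sketch; the tiny matrix and the interaction term are hypothetical):
import numpy as np
# X is a 2-D feature matrix; add col0 × col1 as a new interaction feature
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_more = np.column_stack([X, X[:, 0] * X[:, 1]])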
2. Use a More Complex Model
# From linear to polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
# From a single decision tree to a random forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
3. Reduce Regularization
model = Ridge(alpha=0.01)  # Smaller alpha = weaker penalty
4. Train Longer
Give the model more time to learn.
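With iterative models this usually means raising the iteration or epoch budget (a sketch; MLPClassifier is just one example of such a model):
from sklearn.neural_network import MLPClassifier
# Default max_iter is 200; allow more passes over the data to converge
model = MLPClassifier(max_iter=1000)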
Practical Checklist
□ Compare training vs validation accuracy
- Big gap → Overfitting
- Both low → Underfitting
- Small gap, both good → Just right!
□ Use cross-validation for reliable estimates
□ Start simple, add complexity gradually
□ Always have a held-out test set for final evaluation
Code Example: Detecting Overfitting
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import numpy as np
# Example model and data (any estimator and dataset work here)
X, y = make_classification(n_samples=500, random_state=0)
model = DecisionTreeClassifier()
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10)
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
If the curves converge at a high score = Good fit
If a large gap persists between them = Overfitting
If both curves stay low = Underfitting
Key Takeaway
Overfitting: "I memorized the textbook but can't answer new questions"
Underfitting: "I barely studied and can't answer any questions"
Good fit: "I understood the concepts and can apply them"
Always validate on unseen data. Training accuracy lies!