Introduction to Bias and Variance
Understand the bias-variance tradeoff, a fundamental concept in machine learning.
The bias-variance tradeoff is at the heart of machine learning. Understanding it helps you diagnose and fix model problems.
Simple Explanation
Imagine throwing darts at a target:
```
High Bias, Low Variance:        Low Bias, High Variance:
┌─────────┐                     ┌─────────┐
│         │                     │  x   x  │
│   xxx   │                     │    ◎    │
│    ◎    │                     │  x   x  │
│         │                     │         │
└─────────┘                     └─────────┘
Consistent but wrong            Scattered around target

Low Bias, Low Variance:         High Bias, High Variance:
┌─────────┐                     ┌─────────┐
│         │                     │ x       │
│   x◎x   │                     │       x │
│    x    │                     │  ◎      │
│         │                     │    x  x │
└─────────┘                     └─────────┘
What we want!                   Worst case
```
What is Bias?
Bias = Error from overly simple assumptions.
A model with high bias:
- Misses relevant patterns
- Underfits the data
- Is "too rigid"
Example: Fitting a straight line to curved data.
Data: curved pattern
Model: straight line

```
          ∙
     ∙         ∙
──────────────────── ← Line misses the curve
  ∙               ∙
```
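A minimal sketch of this underfit, assuming synthetic quadratic data and scikit-learn's `LinearRegression` (the data and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curved (quadratic) data -- illustrative assumption
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# A straight line can't follow the curve: high bias, and the fit score stays low
line = LinearRegression().fit(X, y)
print("R^2 of a straight line on curved data:", round(line.score(X, y), 3))  # typically near 0
```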
What is Variance?
Variance = Error from being too sensitive to training data.
A model with high variance:
- Learns noise as if it were signal
- Overfits the data
- Changes dramatically with different training data
Example: A very complex polynomial that hits every point but generalizes poorly.
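A minimal sketch of that failure mode, assuming a small synthetic dataset and a degree-15 polynomial fit with scikit-learn (the exact degree and data are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small noisy dataset -- illustrative assumption
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A degree-15 polynomial chases every training point
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)
print("train R^2:", round(overfit.score(X_train, y_train), 3))  # typically near 1
print("test  R^2:", round(overfit.score(X_test, y_test), 3))    # typically much lower
```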
The Tradeoff
```
Error
│╲                        ╱
│ ╲      Total Error     ╱
│  ╲___               __╱
│      ╲___        __╱
│          ╲______╱
│                  ___╱  Variance
│          _______╱
│  _______╱
│ ╲
│  ╲______
│         ╲___________   Bias
└───────────────────────
      Simple → Complex
```
- Simple models: High bias, low variance
- Complex models: Low bias, high variance
- Goal: Find the sweet spot where total error is minimized
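One way to look for that sweet spot is to sweep model complexity and score each setting with cross-validation. The sketch below assumes synthetic sine data and uses polynomial degree as the complexity knob (all values illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic sine data -- illustrative assumption
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

# Cross-validated score for each polynomial degree (our complexity knob)
for degree in (1, 3, 5, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree:2d}: mean CV R^2 = {score:.3f}")
# The score typically peaks at a moderate degree, then drops as variance takes over
```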
Mathematical View
Total Error = Bias² + Variance + Irreducible Noise
- Bias²: How far off predictions are on average
- Variance: How much predictions vary with different training data
- Irreducible Noise: Random error you can't eliminate
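You can estimate the first two terms empirically by retraining the same model on many freshly drawn training sets, provided you know the true function. The sketch below assumes a sine ground truth and a decision tree; the data and numbers are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Known true function plus noise -- illustrative assumption
def f(x):
    return np.sin(x)

rng = np.random.default_rng(3)
x_eval = np.linspace(-3, 3, 50).reshape(-1, 1)
preds = []

# Retrain the same model on many independently drawn training sets
for _ in range(200):
    X = rng.uniform(-3, 3, 40).reshape(-1, 1)
    y = f(X).ravel() + rng.normal(scale=0.3, size=40)
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x_eval))

preds = np.array(preds)  # shape: (200 models, 50 evaluation points)
bias_sq = ((preds.mean(axis=0) - f(x_eval).ravel()) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}")
# Expected total error ~ bias^2 + variance + noise variance (here 0.3^2 = 0.09)
```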
Examples by Model
| Model | Bias | Variance |
|---|---|---|
| Linear Regression | High | Low |
| Polynomial Regression (high degree) | Low | High |
| Decision Tree (no pruning) | Low | High |
| Decision Tree (limited depth) | Medium | Medium |
| k-NN (k=1) | Low | High |
| k-NN (k=n) | High | Low |
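A quick check of the two k-NN rows, assuming a synthetic classification dataset (illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data -- illustrative assumption
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, len(X_tr)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train={knn.score(X_tr, y_tr):.2f}  test={knn.score(X_te, y_te):.2f}")
# k=1: perfect training accuracy but a larger train/test gap (high variance)
# k=n: the model predicts the majority class everywhere (high bias)
```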
Diagnosing Your Model
High Bias (Underfitting)
- Training error is high
- Training and test error are similar (both bad)
- Model is too simple
Fix: Increase complexity
- Add features
- Use a more complex model
- Reduce regularization
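A minimal sketch of the first two fixes, adding a squared feature so a linear model can follow curved data (synthetic data, illustrative numbers):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Curved data that a plain line underfits -- illustrative assumption
rng = np.random.default_rng(4)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)

plain = LinearRegression().fit(X, y)
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("plain line     R^2:", round(plain.score(X, y), 3))   # low: the model is too rigid
print("with x^2 term  R^2:", round(richer.score(X, y), 3))  # typically much higher
```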
High Variance (Overfitting)
- Training error is low
- Test error is much higher than training
- Model is too complex
Fix: Decrease complexity
- Remove features
- Use simpler model
- Add regularization
- Get more data
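A minimal sketch of the regularization fix, comparing the train/test gap of an unregularized polynomial model against a Ridge version (synthetic data; the degree and alpha are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Small noisy dataset that invites overfitting -- illustrative assumption
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, 40).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)

def train_test_gap(model):
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr) - model.score(X_te, y_te)

unregularized = make_pipeline(PolynomialFeatures(10), StandardScaler(), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(10), StandardScaler(), Ridge(alpha=1.0))
print("gap without regularization:", round(train_test_gap(unregularized), 3))
print("gap with Ridge (alpha=1)  :", round(train_test_gap(regularized), 3))
# The Ridge model usually shows a smaller gap: less variance, slightly more bias
```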
Learning Curves
Plot training and validation error vs. training size:
High Bias Pattern
```
Error
│
│ ──────────────   Validation (high)
│ ──────────────   Training (high)
│
└──────────────────
     Training Size
```
Both errors are high and converge.
High Variance Pattern
```
Error
│ ──────────────   Validation (high)
│
│
│ ______________   Training (low)
│
└──────────────────
     Training Size
```
Big gap between training and validation.
Code: Plotting Learning Curves
```python
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np

# Example model and data (illustrative assumption -- substitute your own)
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
model = DecisionTreeRegressor(max_depth=5, random_state=0)

# Score the model on growing fractions of the data with 5-fold CV
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
)

# Average scores across the CV folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(train_sizes, train_mean, label='Training')
plt.plot(train_sizes, val_mean, label='Validation')
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
```
Controlling Bias and Variance
Model Complexity
More complex → Less bias, more variance
Regularization
More regularization → More bias, less variance
Training Data
More data → Helps reduce variance (not bias!)
Feature Selection
Fewer features → More bias, less variance
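These knobs can be explored directly with scikit-learn's `validation_curve`. The sketch below varies Ridge's regularization strength `alpha` on synthetic data (all values illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

# Synthetic regression data with many features -- illustrative assumption
X, y = make_regression(n_samples=100, n_features=50, noise=20, random_state=0)

alphas = np.logspace(-3, 3, 7)
train_scores, val_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=alphas, cv=5
)
for a, tr, va in zip(alphas, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"alpha={a:9.3f}  train={tr:.3f}  val={va:.3f}")
# Small alpha: large train/val gap (variance). Large alpha: both scores drop (bias).
```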
Practical Tips
- Start simple: Begin with a simple model, add complexity as needed
- Use cross-validation: Don't rely on single train-test split
- Plot learning curves: Visualize the bias-variance situation
- Regularize: When in doubt, add regularization
- Get more data: Often the best solution for high variance
Key Takeaway
You can't minimize both bias and variance simultaneously. The art of machine learning is finding the right balance for your specific problem.
- High training error? → Reduce bias
- High gap between train/test? → Reduce variance
Use learning curves to diagnose, then adjust accordingly!