# Introduction to Bias and Variance
Understand the bias-variance tradeoff - a fundamental concept in machine learning.
The bias-variance tradeoff is at the heart of machine learning. Understanding it helps you diagnose and fix model problems.
## Simple Explanation
Imagine throwing darts at a target:
```
High Bias, Low Variance:      Low Bias, High Variance:
┌─────────┐                   ┌─────────┐
│         │                   │  x   x  │
│   xxx   │                   │    ◎    │
│    ◎    │                   │  x   x  │
│         │                   │         │
└─────────┘                   └─────────┘
Consistent but wrong          Scattered around target

Low Bias, Low Variance:       High Bias, High Variance:
┌─────────┐                   ┌─────────┐
│         │                   │ x       │
│   x◎x   │                   │       x │
│    x    │                   │    ◎    │
│         │                   │  x   x  │
└─────────┘                   └─────────┘
What we want!                 Worst case
```
## What is Bias?
**Bias** = Error from overly simple assumptions.
A model with high bias:

- Misses relevant patterns
- Underfits the data
- Is "too rigid"
Example: Fitting a straight line to curved data.
```
Data: curved pattern    Model: straight line

     ∙  ∙
   ∙      ∙
  ────────────  ← Line misses the curve
 ∙          ∙
```
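To see high bias in code, here's a minimal sketch using scikit-learn and NumPy (the same libraries as the learning-curve example later). The quadratic dataset is synthetic and purely illustrative:

```python
# Minimal sketch of high bias: a straight line fit to curved (quadratic) data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=200)  # quadratic pattern + noise

line = LinearRegression().fit(X, y)
print("Training MSE:", mean_squared_error(y, line.predict(X)))
# The training error stays high no matter how much data you add:
# a straight line simply cannot represent the curve.
```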
## What is Variance?
**Variance** = Error from being too sensitive to training data.
A model with high variance:

- Learns noise as if it were signal
- Overfits the data
- Changes dramatically with different training data
Example: A very complex polynomial that hits every point but generalizes poorly.
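Here's a sketch of that failure mode: a degree-12 polynomial fit to just 15 noisy points. The degree, sample size, and synthetic data are illustrative choices, not prescriptions:

```python
# Sketch of high variance: a high-degree polynomial fit to a tiny sample.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(15, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(0, 0.5, size=15)
X_test = rng.uniform(-3, 3, size=(200, 1))
y_test = X_test[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

poly = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
poly.fit(X_train, y_train)
print("Train MSE:", mean_squared_error(y_train, poly.predict(X_train)))  # near zero
print("Test MSE: ", mean_squared_error(y_test, poly.predict(X_test)))    # much larger
```

A near-zero training error paired with a far larger test error is the signature of overfitting.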
## The Tradeoff
```
Error
│╲
│ ╲      ╱╱  Total Error
│  ╲__ ╱╱
│    ╲╱      Variance
│   ╱ ╲____
│  ╱         Bias
└──────────────── Simple → Complex
```
- **Simple models:** High bias, low variance
- **Complex models:** Low bias, high variance
- **Goal:** Find the sweet spot where total error is minimized (see the sketch below)
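One way to hunt for that sweet spot in practice is to sweep a complexity knob and cross-validate at each setting. A minimal sketch, assuming polynomial degree as the knob and a synthetic sine dataset:

```python
# Sweep model complexity and watch cross-validated error trace the U-shape.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=100)

for degree in [1, 3, 5, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: CV MSE = {mse:.3f}")
# Expect the CV error to fall, bottom out, then rise again as degree grows.
```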
## Mathematical View
```
Total Error = Bias² + Variance + Irreducible Noise
```
- **Bias²**: How far off predictions are on average
- **Variance**: How much predictions vary with different training data
- **Irreducible Noise**: Random error you can't eliminate
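When you control the data-generating process, you can estimate these terms numerically: refit the same model on many independent training sets, then compare the average prediction to the truth (bias²) and the spread of predictions around that average (variance). A sketch, with an unpruned decision tree and synthetic data as illustrative choices:

```python
# Empirical bias^2 / variance estimate via repeated refits on fresh samples.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
true_f = np.sin                          # the "true" function, known here
x_test = np.linspace(-3, 3, 50)

preds = []
for _ in range(200):                     # 200 independent training sets
    X = rng.uniform(-3, 3, size=(40, 1))
    y = true_f(X[:, 0]) + rng.normal(0, 0.3, size=40)
    model = DecisionTreeRegressor().fit(X, y)
    preds.append(model.predict(x_test.reshape(-1, 1)))

preds = np.array(preds)                  # shape: (200, 50)
bias_sq = ((preds.mean(axis=0) - true_f(x_test)) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"bias^2 ≈ {bias_sq:.4f}, variance ≈ {variance:.4f}")
# For an unpruned tree, expect bias^2 small and variance comparatively large.
```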
## Examples by Model
| Model | Bias | Variance |
|-------|------|----------|
| Linear Regression | High | Low |
| Polynomial Regression (high degree) | Low | High |
| Decision Tree (no pruning) | Low | High |
| Decision Tree (limited depth) | Medium | Medium |
| k-NN (k=1) | Low | High |
| k-NN (k=n) | High | Low |
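The k-NN rows are easy to check empirically: with k=1 the model memorizes the training set, while k=n predicts the majority class everywhere. A small sketch (the make_moons dataset and the middle k value are illustrative):

```python
# k-NN at both extremes of the bias-variance table.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in [1, 15, len(X_tr)]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}: train acc = {knn.score(X_tr, y_tr):.2f}, "
          f"test acc = {knn.score(X_te, y_te):.2f}")
# k=1: perfect train accuracy, weaker test accuracy (high variance).
# k=n: both accuracies collapse toward the majority-class rate (high bias).
```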
## Diagnosing Your Model
### High Bias (Underfitting)

- Training error is high
- Training and test error are similar (both bad)
- Model is too simple
**Fix:** Increase complexity

- Add features
- Use a more complex model
- Reduce regularization
### High Variance (Overfitting)

- Training error is low
- Test error is much higher than training
- Model is too complex
**Fix:** Decrease complexity

- Remove features
- Use a simpler model
- Add regularization (see the sketch below)
- Get more data
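As a sketch of the regularization fix, here's the same degree-12 polynomial setup from the variance example above, refit with a Ridge penalty (alpha=1.0 is an arbitrary illustrative value):

```python
# Taming the overfit polynomial with a Ridge penalty.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(15, 1))
y_train = X_train[:, 0] ** 2 + rng.normal(0, 0.5, size=15)
X_test = rng.uniform(-3, 3, size=(200, 1))
y_test = X_test[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

model = make_pipeline(PolynomialFeatures(degree=12),
                      StandardScaler(),  # keep the penalty comparable across terms
                      Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("Train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("Test MSE: ", mean_squared_error(y_test, model.predict(X_test)))
# Training error rises slightly (more bias); test error drops (less variance).
```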
## Learning Curves
Plot training and validation error vs. training size:
### High Bias Pattern

```
Error
│  ──────────  Validation (high)
│  ──────────  Training (high)
└──────────────────  Training Size

Both errors are high and converge
```
### High Variance Pattern

```
Error
│  ──────  Validation (high)
│
│  ______  Training (low)
└──────────────────  Training Size

Big gap between training and validation
```
## Code: Plotting Learning Curves
```python
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
import numpy as np

# Example model and data so the snippet runs on its own -- substitute your own.
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
model = DecisionTreeRegressor(max_depth=4, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.plot(train_sizes, train_mean, label='Training')
plt.plot(train_sizes, val_mean, label='Validation')
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
```
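Note that learning_curve reports scores (higher is better) rather than errors, so the patterns above flip: two low curves that converge suggest high bias, while a training curve sitting well above the validation curve suggests high variance.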
## Controlling Bias and Variance
### Model Complexity

More complex → less bias, more variance.

### Regularization

More regularization → more bias, less variance.

### Training Data

More data → helps reduce variance (not bias!).

### Feature Selection

Fewer features → more bias, less variance.
## Practical Tips
1. **Start simple**: Begin with a simple model, add complexity as needed
2. **Use cross-validation**: Don't rely on a single train-test split
3. **Plot learning curves**: Visualize the bias-variance situation
4. **Regularize**: When in doubt, add regularization
5. **Get more data**: Often the best solution for high variance
## Key Takeaway
You can't minimize both bias and variance simultaneously. The art of machine learning is finding the right balance for your specific problem.
- High training error? → Reduce bias
- High gap between train/test? → Reduce variance
Use learning curves to diagnose, then adjust accordingly!