# Regularization: Preventing Overfitting
Learn how regularization techniques prevent overfitting and improve model generalization.
Regularization is like telling your model: "Keep it simple!" It's one of the most important techniques to prevent overfitting.
## The Problem
Without regularization, models can become too complex:
```python
# An overfitted linear regression might have huge coefficients
weights = [23456.7, -12345.6, 98765.4, ...]

# Small input changes → huge output changes = unstable!
```
The model is "working too hard" to fit training data perfectly.
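To see this concretely, here is a minimal sketch (synthetic data and a deliberately over-flexible degree-15 polynomial; the specifics are illustrative, not from the text above) showing how an unregularized fit can blow up its coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative setup: 20 noisy points, far too few for a degree-15 polynomial
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, size=(20, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.1, size=20)

X_poly = PolynomialFeatures(degree=15).fit_transform(X)
model = LinearRegression().fit(X_poly, y)

# Usually a very large value: the model is bending itself to memorize noise
print("Largest coefficient magnitude:", np.abs(model.coef_).max())
```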
## The Solution: Add a Penalty
```
Original objective:  Minimize error

New objective:       Minimize error + λ × complexity penalty
```
λ (lambda) controls how much we penalize complexity.
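To make the formula concrete, here is a tiny NumPy sketch (the data, weights, and λ value are made up for illustration) that computes the penalized objective by hand, using the squared-weight penalty introduced in the next section:

```python
import numpy as np

# Illustrative data and weights (not from the text)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
lam = 0.1  # λ: how much we penalize complexity

error = np.mean((X @ w - y) ** 2)   # original objective: just the error
penalty = lam * np.sum(w ** 2)      # complexity penalty (squared weights)
loss = error + penalty              # new objective

print(f"error={error:.3f}, penalty={penalty:.3f}, loss={loss:.3f}")
```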
## L2 Regularization (Ridge)
Penalizes the **sum of squared weights**:
```
Loss = Error + λ × (w₁² + w₂² + w₃² + ...)
```
**Effect:** Shrinks all weights toward zero, but doesn't make them exactly zero.
```python
from sklearn.linear_model import Ridge

# alpha is the regularization strength (λ)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Smaller than unregularized
```
## L1 Regularization (Lasso)
Penalizes the **sum of absolute weights**:
```
Loss = Error + λ × (|w₁| + |w₂| + |w₃| + ...)
```
**Effect:** Shrinks weights AND makes some exactly zero (feature selection!).
```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Some are exactly 0
```
## Elastic Net (L1 + L2)
Combines both:
```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
```
**Use when:** Many correlated features
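To illustrate why it helps with correlated features, here is a hedged sketch (the two nearly-duplicate columns below are made up for illustration) comparing how Lasso and Elastic Net distribute weight:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

print("Lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_)
print("Elastic Net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
# Lasso tends to pile the weight onto one of the two correlated columns;
# Elastic Net tends to spread it across both
```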
## Comparing L1 and L2
| Aspect | L1 (Lasso) | L2 (Ridge) |
|--------|------------|------------|
| Penalty | \|w\| | w² |
| Effect on weights | Some become 0 | All shrink |
| Feature selection | Yes | No |
| Correlated features | Picks one (arbitrarily) | Keeps all |
| Computation | Harder | Easier |
Visual comparison:
```
L1 penalty region:  ◇   (diamond, corners lie on the axes)
L2 penalty region:  ○   (circle, no corners, so the loss contour
                         touches it on an axis less often)
```
L1's corners make weights hit exactly zero more often.
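This is easy to check empirically. A short sketch (synthetic data via `make_regression`, with parameters chosen purely for illustration) counts how many coefficients each model sets exactly to zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 of which actually matter (illustrative setup)
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))   # usually 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))   # usually many
```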
## Choosing λ (Alpha)
```
λ = 0:  No regularization (might overfit)
λ = ∞:  All weights → 0   (will underfit)
```
Use cross-validation to find the best value:
```python
from sklearn.linear_model import RidgeCV, LassoCV

# Automatically finds the best alpha using cross-validation
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.alpha_}")

lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5)
lasso.fit(X_train, y_train)
print(f"Best alpha: {lasso.alpha_}")
```
## Regularization in Logistic Regression
Same concept applies:
```python
from sklearn.linear_model import LogisticRegression

# C = 1/λ (smaller C = more regularization)
model = LogisticRegression(C=0.1, penalty='l2')                  # Strong L2
model = LogisticRegression(C=10, penalty='l1', solver='saga')    # Weak L1
```
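As a quick illustration of the C = 1/λ relationship (the synthetic classification problem below is an assumption, not from the text), you can watch coefficient magnitudes shrink as C decreases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic classification problem
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, penalty='l2', max_iter=1000).fit(X, y)
    print(f"C={C:6.2f}  coefficient norm={np.linalg.norm(clf.coef_):.3f}")
# Smaller C → stronger regularization → smaller coefficients
```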
## Regularization in Other Models
### Decision Trees

```python
from sklearn.tree import DecisionTreeClassifier

# Regularization via complexity constraints
tree = DecisionTreeClassifier(
    max_depth=5,            # Limit tree depth
    min_samples_split=10,   # Min samples required to split a node
    min_samples_leaf=5      # Min samples required at a leaf
)
```
### Neural Networks

```python
from keras.layers import Dense, Dropout
from keras.regularizers import l2

# L2 (weight decay) on a layer's weights
model.add(Dense(64, kernel_regularizer=l2(0.01)))

# Dropout (randomly zeros neurons during training)
model.add(Dropout(0.5))
```
## When to Use Regularization
**Always consider it for:**
- Linear/Logistic Regression
- Neural Networks
- Any model prone to overfitting

**Signs you need MORE regularization:**
- Training accuracy >> test accuracy (see the quick check below)
- Large coefficients/weights
- Unstable predictions

**Signs you need LESS regularization:**
- Training accuracy is low
- The model is too simple
- Important features have zero weight
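One way to spot the "training accuracy >> test accuracy" symptom is simply to measure both. A minimal sketch (assuming `X` and `y` already exist, as in the other examples):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hold out a test set so the two scores can be compared
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)
print("Train R²:", model.score(X_train, y_train))
print("Test  R²:", model.score(X_test, y_test))
# A large gap between the two scores is the classic sign of overfitting;
# increasing alpha will often shrink it
```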
## Practical Example
```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Compare different regularization strengths
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X, y, cv=5)
    print(f"α={alpha:7.3f}: {scores.mean():.3f} ± {scores.std():.3f}")
```
Output:

```
α=  0.001: 0.812 ± 0.045   (slight overfitting)
α=  0.010: 0.845 ± 0.032
α=  0.100: 0.867 ± 0.028   ← Best!
α=  1.000: 0.854 ± 0.031
α= 10.000: 0.798 ± 0.042   (underfitting)
α=100.000: 0.721 ± 0.058   (severe underfitting)
```
## Key Takeaways
1. **Regularization adds a penalty for complexity**
2. **L2 (Ridge):** Shrinks all weights
3. **L1 (Lasso):** Can zero out weights (feature selection)
4. **Always tune the regularization strength (λ/α)**
5. **Use cross-validation to find the best value**
Think of regularization as Occam's Razor for ML: simpler models often generalize better!