Regularization: Preventing Overfitting
Learn how regularization techniques prevent overfitting and improve model generalization.
Regularization is like telling your model: "Keep it simple!" It's one of the most important techniques to prevent overfitting.
The Problem
Without regularization, models can become too complex:
# Overfitted linear regression might have huge coefficients
weights = [23456.7, -12345.6, 98765.4, ...]
# Small input changes → Huge output changes = Unstable!
The model is "working too hard" to fit training data perfectly.
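To make this concrete, here is a minimal sketch (with made-up toy data, not taken from any example in this section) of how an unregularized high-degree polynomial fit blows up its coefficients:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Toy data: 10 noisy points from a sine curve (illustrative only)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 10)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 10)
# Fit a degree-9 polynomial with no regularization
X_poly = PolynomialFeatures(degree=9).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.coef_)  # coefficients typically reach huge magnitudes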
The Solution: Add a Penalty
Original objective: Minimize error
New objective: Minimize error + λ × complexity penalty
λ (lambda) controls how much we penalize complexity.
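A tiny worked sketch of this arithmetic, using made-up numbers and the sum of squared weights (the L2 penalty introduced next) as the complexity measure:
import numpy as np
weights = np.array([2.0, -3.0, 0.5])  # hypothetical model weights
error = 1.25                          # hypothetical training error
lam = 0.1                             # regularization strength λ
penalty = np.sum(weights ** 2)        # 4.0 + 9.0 + 0.25 = 13.25
loss = error + lam * penalty          # 1.25 + 0.1 * 13.25 = 2.575
print(loss)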
L2 Regularization (Ridge)
Penalizes the sum of squared weights:
Loss = Error + λ × (w₁² + w₂² + w₃² + ...)
Effect: Shrinks all weights toward zero, but doesn't make them exactly zero.
from sklearn.linear_model import Ridge
# α is the regularization strength (λ)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
print(f"Coefficients: {model.coef_}") # Smaller than unregularized
L1 Regularization (Lasso)
Penalizes the sum of absolute weights:
Loss = Error + λ × (|w₁| + |w₂| + |w₃| + ...)
Effect: Shrinks weights AND makes some exactly zero (feature selection!).
from sklearn.linear_model import Lasso
model = Lasso(alpha=1.0)
model.fit(X_train, y_train)
print(f"Coefficients: {model.coef_}") # Some are exactly 0
Elastic Net (L1 + L2)
Combines both:
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=1.0, l1_ratio=0.5) # 50% L1, 50% L2
Use when: Many correlated features
Comparing L1 and L2
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty term | Sum of absolute weights | Sum of squared weights |
| Effect on weights | Some become exactly 0 | All shrink toward 0 |
| Feature selection | Yes | No |
| Correlated features | Tends to pick one arbitrarily | Keeps all, shrunk together |
| Computation | Harder (no closed-form solution) | Easier (closed-form solution) |
Visual comparison:
- The L1 penalty region is a diamond (◇) whose corners lie on the axes.
- The L2 penalty region is a circle (○) with no corners.
The loss contours often touch the diamond at one of its corners, where some coordinates are exactly zero, so L1 drives weights to exactly zero far more often than L2 does.
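To see this difference in practice, here is a minimal sketch on synthetic data (the dataset and parameter choices are illustrative):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
# Synthetic data: 20 features, only 5 of which are informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))  # typically > 0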
Choosing λ (Alpha)
λ = 0: No regularization (might overfit)
λ = ∞: All weights → 0 (will underfit)
Use cross-validation to find the best value:
from sklearn.linear_model import RidgeCV, LassoCV
# Automatically finds best alpha using CV
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.alpha_}")
lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5)
lasso.fit(X_train, y_train)
print(f"Best alpha: {lasso.alpha_}")
Regularization in Logistic Regression
Same concept applies:
from sklearn.linear_model import LogisticRegression
# C = 1/λ (smaller C = more regularization)
model = LogisticRegression(C=0.1, penalty='l2') # Strong L2
model = LogisticRegression(C=10, penalty='l1', solver='saga') # Weak L1
Regularization in Other Models
Decision Trees
from sklearn.tree import DecisionTreeClassifier
# Regularization via constraints
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit depth
    min_samples_split=10,  # Min samples to split
    min_samples_leaf=5     # Min samples per leaf
)
Neural Networks
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2
model = Sequential()
# L2 (weight decay) on this layer's weights
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))
# Dropout (randomly zeros neurons during training)
model.add(Dropout(0.5))
When to Use Regularization
Always consider it for:
- Linear/Logistic Regression
- Neural Networks
- Any model prone to overfitting
Signs you need MORE regularization:
- Training accuracy >> Test accuracy (see the quick check after these lists)
- Large coefficients/weights
- Unstable predictions
Signs you need LESS regularization:
- Training accuracy is low
- Model is too simple
- Important features have zero weight
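One quick way to check the first of those signs is to compare training and test scores directly. A minimal sketch, assuming X and y hold your features and target:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
# Assumes X, y are your data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"Train R²: {model.score(X_train, y_train):.3f}")
print(f"Test  R²: {model.score(X_test, y_test):.3f}")
# Train >> test suggests more regularization; both low suggests less.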
Practical Example
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np
# Compare different regularization strengths
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X, y, cv=5)
    print(f"α={alpha:6.3f}: {scores.mean():.3f} ± {scores.std():.3f}")
Output:
α= 0.001: 0.812 ± 0.045 (too little regularization → overfitting)
α= 0.010: 0.845 ± 0.032
α= 0.100: 0.867 ± 0.028 ← Best!
α= 1.000: 0.854 ± 0.031
α=10.000: 0.798 ± 0.042 (too much regularization → underfitting)
α=100.000: 0.721 ± 0.058 (severe underfitting)
Key Takeaways
- Regularization adds a penalty for complexity
- L2 (Ridge): Shrinks all weights
- L1 (Lasso): Can zero out weights (feature selection)
- Always tune the regularization strength (λ/α)
- Use cross-validation to find the best value
Think of regularization as Occam's Razor for ML: simpler models often generalize better!