ML · 8 min read

Regularization: Preventing Overfitting

Learn how regularization techniques prevent overfitting and improve model generalization.

Sarah Chen
December 19, 2025


Regularization is like telling your model: "Keep it simple!" It's one of the most important techniques to prevent overfitting.

The Problem

Without regularization, models can become too complex:

# Overfitted linear regression might have huge coefficients
weights = [23456.7, -12345.6, 98765.4, ...]

# Small input changes → Huge output changes = Unstable!

The model is "working too hard" to fit training data perfectly.

The Solution: Add a Penalty

Original objective: Minimize error

New objective: Minimize error + λ × complexity penalty

λ (lambda) controls how much we penalize complexity.
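
As a minimal sketch (plain NumPy, with illustrative names X, y, w, and lam), the new objective for a linear model looks like this, using squared weights as the penalty (that's the L2 version, covered next):

import numpy as np

def regularized_loss(w, X, y, lam):
    # Mean squared error: how badly the weights w fit the data
    error = np.mean((X @ w - y) ** 2)
    # Complexity penalty: how large the weights are (squared, as in L2)
    penalty = np.sum(w ** 2)
    # lam trades off fitting the data vs. keeping the weights small
    return error + lam * penalty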

L2 Regularization (Ridge)

Penalizes the sum of squared weights:

Loss = Error + λ × (w₁² + w₂² + w₃² + ...)

Effect: Shrinks all weights toward zero, but doesn't make them exactly zero.

from sklearn.linear_model import Ridge

# α is the regularization strength (λ)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Smaller than unregularized

L1 Regularization (Lasso)

Penalizes the sum of absolute weights:

Loss = Error + λ × (|w₁| + |w₂| + |w₃| + ...)

Effect: Shrinks weights AND makes some exactly zero (feature selection!).

from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Some are exactly 0

Elastic Net (L1 + L2)

Combines both:

from sklearn.linear_model import ElasticNet

# l1_ratio mixes the two penalties: 0.5 means an even split between L1 and L2
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_train, y_train)

Use when: you have many correlated features but still want some coefficients pushed to zero.
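
Here is a minimal sketch of that situation with synthetic data (the names and numbers are illustrative): two nearly identical features, fit with Lasso and Elastic Net.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)              # nearly a copy of x1
X_corr = np.column_stack([x1, x2])
y_corr = 3 * x1 + rng.normal(scale=0.1, size=200)

print(Lasso(alpha=0.1).fit(X_corr, y_corr).coef_)                     # typically keeps one feature, zeros the other
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_corr, y_corr).coef_)  # tends to spread weight across both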

Comparing L1 and L2

Aspect               L1 (Lasso)               L2 (Ridge)
Penalty              |w₁| + |w₂| + ...        w₁² + w₂² + ...
Effect on weights    Some become exactly 0    All shrink toward 0
Feature selection    Yes                      No
Correlated features  Picks one (arbitrarily)  Keeps all (shrunk)
Computation          Harder (non-smooth)      Easier (closed form)

Visual comparison:

L1 penalty region: a diamond (|w₁| + |w₂| ≤ t) with sharp corners on the axes
L2 penalty region: a circle (w₁² + w₂² ≤ t) that is smooth, with no corners

L1's corners make weights hit exactly zero more often.
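
A tiny numerical sketch of the two shrinkage rules (these closed forms apply to a single weight under a squared-error term; the values are illustrative):

import numpy as np

w = np.array([3.0, 0.4, -0.2])   # illustrative weights
lam = 0.5

# L2-style shrinkage: scale every weight toward zero; none becomes exactly zero
w_l2 = w / (1 + lam)

# L1-style soft-thresholding: subtract lam from each magnitude and clip at zero
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

print(w_l2)  # all shrunk, none exactly zero
print(w_l1)  # the small weights become exactly 0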

Choosing λ (Alpha)

λ = 0: No regularization (might overfit)
λ = ∞: All weights → 0 (will underfit)

Use cross-validation to find the best value:

from sklearn.linear_model import RidgeCV, LassoCV

# Automatically finds best alpha using CV
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.alpha_}")

lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5)
lasso.fit(X_train, y_train)
print(f"Best alpha: {lasso.alpha_}")

Regularization in Logistic Regression

Same concept applies:

from sklearn.linear_model import LogisticRegression

# C = 1/λ, so a SMALLER C means MORE regularization
strong_l2 = LogisticRegression(C=0.1, penalty='l2')              # strong L2 penalty
weak_l1 = LogisticRegression(C=10, penalty='l1', solver='saga')  # weak L1 penalty (saga supports l1)

Regularization in Other Models

Decision Trees

from sklearn.tree import DecisionTreeClassifier

# Regularization via constraints
tree = DecisionTreeClassifier(
    max_depth=5,            # Limit depth
    min_samples_split=10,   # Min samples to split
    min_samples_leaf=5      # Min samples per leaf
)

Neural Networks

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2

model = Sequential()

# L2 (weight decay) on the layer's weights
model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01)))

# Dropout (randomly zeros 50% of activations during training)
model.add(Dropout(0.5))

When to Use Regularization

Always consider it for:

  • Linear/Logistic Regression
  • Neural Networks
  • Any model prone to overfitting

Signs you need MORE regularization:

  • Training accuracy >> Test accuracy
  • Large coefficients/weights
  • Unstable predictions

Signs you need LESS regularization:

  • Training accuracy is low
  • Model is too simple
  • Important features have zero weight
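
One quick way to check for these symptoms (a minimal sketch, assuming a fitted model and the usual X_train, y_train, X_test, y_test splits):

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train={train_score:.3f}  test={test_score:.3f}  gap={train_score - test_score:.3f}")

# Large positive gap  → overfitting  → try MORE regularization
# Both scores low     → underfitting → try LESS regularization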

Practical Example

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

# Compare different regularization strengths (X, y are your feature matrix and target)
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X, y, cv=5)
    print(f"α={alpha:6.3f}: {scores.mean():.3f} ± {scores.std():.3f}")

Output:

α= 0.001: 0.812 ± 0.045  (barely any regularization → overfits slightly)
α= 0.010: 0.845 ± 0.032
α= 0.100: 0.867 ± 0.028  ← Best!
α= 1.000: 0.854 ± 0.031
α=10.000: 0.798 ± 0.042  (too much regularization → underfits)
α=100.000: 0.721 ± 0.058  (severe underfitting)

Key Takeaways

  1. Regularization adds a penalty for complexity
  2. L2 (Ridge): Shrinks all weights
  3. L1 (Lasso): Can zero out weights (feature selection)
  4. Always tune the regularization strength (λ/α)
  5. Use cross-validation to find the best value

Think of regularization as Occam's Razor for ML: simpler models often generalize better!

#MachineLearning #Regularization #Overfitting #Beginner