
Regularization: Preventing Overfitting

Learn how regularization techniques prevent overfitting and improve model generalization.

Sarah Chen
December 19, 2025


Regularization is like telling your model: "Keep it simple!" It's one of the most important techniques to prevent overfitting.

The Problem

Without regularization, models can become too complex:

```python
# An overfitted linear regression might have huge coefficients
weights = [23456.7, -12345.6, 98765.4, ...]

# Small input changes → huge output changes = unstable!
```

The model is "working too hard" to fit training data perfectly.
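
To see this concretely, here is a minimal sketch (synthetic data, scikit-learn assumed) where two nearly identical input columns push an unregularized linear regression toward huge, unstable coefficients:

```python
# Minimal sketch: two nearly collinear features make unregularized
# coefficients blow up. Synthetic data; exact values vary by seed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 1))
X = np.hstack([x, x + rng.normal(scale=1e-3, size=(50, 1))])  # near-duplicate column
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_)  # typically large, unstable coefficients of opposite sign
```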

The Solution: Add a Penalty

```
Original objective:  Minimize error
New objective:       Minimize error + λ × complexity penalty
```

λ (lambda) controls how much we penalize complexity.
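
To make the objective concrete, here is a tiny worked computation in plain Python, using the squared-weight penalty introduced in the next section (the numbers are made up):

```python
# Tiny worked example of the penalized objective (hypothetical numbers).
error = 4.0                    # training error of the unregularized fit
weights = [2.0, -1.0, 0.5]
lam = 0.1                      # λ, the regularization strength

penalty = sum(w ** 2 for w in weights)   # 4 + 1 + 0.25 = 5.25
loss = error + lam * penalty             # 4.0 + 0.1 × 5.25 = 4.525
print(f"{loss:.3f}")                     # 4.525
```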

L2 Regularization (Ridge)

Penalizes the **sum of squared weights**:

```
Loss = Error + λ × (w₁² + w₂² + w₃² + ...)
```

**Effect:** Shrinks all weights toward zero, but doesn't make them exactly zero.

```python
from sklearn.linear_model import Ridge

# alpha is the regularization strength (λ)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Smaller than unregularized
```

L1 Regularization (Lasso)

Penalizes the **sum of absolute weights**:

```
Loss = Error + λ × (|w₁| + |w₂| + |w₃| + ...)
```

**Effect:** Shrinks weights AND makes some exactly zero (feature selection!).

```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)

print(f"Coefficients: {model.coef_}")  # Some are exactly 0
```

Elastic Net (L1 + L2)

Combines both:

```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5)  # 50% L1, 50% L2
```

**Use when:** Many correlated features
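
To illustrate why, here is a small sketch (synthetic data, arbitrary `alpha`/`l1_ratio`) comparing how Lasso and Elastic Net treat a pair of highly correlated features:

```python
# Sketch: with two nearly identical features, Lasso tends to keep only one,
# while Elastic Net spreads the weight across both. Synthetic data;
# alpha and l1_ratio are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x + rng.normal(scale=0.01, size=(200, 1))])  # correlated pair
y = x[:, 0] + rng.normal(scale=0.1, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)                      # often one weight near 0
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # weight shared more evenly
```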

Comparing L1 and L2

| Aspect | L1 (Lasso) | L2 (Ridge) |
|--------|------------|------------|
| Penalty | \|w\| | w² |
| Effect on weights | Some become exactly 0 | All shrink toward 0 |
| Feature selection | Yes | No |
| Correlated features | Tends to pick one arbitrarily | Keeps all, sharing weight |
| Computation | Harder (no closed-form solution) | Easier (closed-form solution) |

Visual comparison:

```
L1 penalty region:                 L2 penalty region:

          ◇                                 ○
 (diamond - corners lie on          (circle - the loss contours
  the coordinate axes)               touch the axes less often)
```

L1's corners make weights hit exactly zero more often.
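
You can see the difference directly by counting exact zeros (a sketch using scikit-learn's `make_regression`; the settings are arbitrary):

```python
# Sketch: count exact zero coefficients for Lasso vs Ridge on a synthetic
# regression problem where only 5 of 20 features are informative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # usually many
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # usually none
```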

Choosing λ (Alpha)

```
λ = 0:  No regularization (might overfit)
λ = ∞:  All weights → 0 (will underfit)
```

Use cross-validation to find the best value:

```python
from sklearn.linear_model import RidgeCV, LassoCV

# Automatically finds the best alpha via cross-validation
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.alpha_}")

lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5)
lasso.fit(X_train, y_train)
print(f"Best alpha: {lasso.alpha_}")
```

Regularization in Logistic Regression

Same concept applies:

```python
from sklearn.linear_model import LogisticRegression

# C = 1/λ (smaller C = more regularization)
model = LogisticRegression(C=0.1, penalty='l2')                # Strong L2
model = LogisticRegression(C=10, penalty='l1', solver='saga')  # Weak L1
```
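
As a sanity check, here is a sketch (synthetic classification data, arbitrary C values) showing that smaller C shrinks the coefficients:

```python
# Sketch: smaller C means stronger regularization, so the coefficient
# norm shrinks. Synthetic data; C values are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, penalty='l2', max_iter=1000).fit(X, y)
    print(f"C={C:5}: ||w|| = {np.linalg.norm(clf.coef_):.3f}")
```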

Regularization in Other Models

### Decision Trees

```python
from sklearn.tree import DecisionTreeClassifier

# Regularization via structural constraints
tree = DecisionTreeClassifier(
    max_depth=5,            # Limit tree depth
    min_samples_split=10,   # Min samples required to split a node
    min_samples_leaf=5      # Min samples required at a leaf
)
```

### Neural Networks

```python
from keras.layers import Dense, Dropout
from keras.regularizers import l2

# L2 regularization (weight decay)
model.add(Dense(64, kernel_regularizer=l2(0.01)))

# Dropout (randomly zeros neurons during training)
model.add(Dropout(0.5))
```
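
For context, here is one way these pieces might fit into a complete model. This is a sketch only: the layer sizes, dropout rate, and input dimension are arbitrary choices, and it assumes the standalone Keras API used above.

```python
# Sketch of a small Keras model combining L2 weight decay and dropout.
# Layer sizes, dropout rate, and input dimension are arbitrary choices.
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout
from keras.regularizers import l2

model = Sequential([
    Input(shape=(20,)),
    Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```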

When to Use Regularization

**Always consider it for:**
- Linear/Logistic Regression
- Neural Networks
- Any model prone to overfitting

**Signs you need MORE regularization:**
- Training accuracy >> test accuracy (a quick check is sketched below)
- Large coefficients/weights
- Unstable predictions

**Signs you need LESS regularization:**
- Training accuracy is low
- The model is too simple
- Important features have zero weight
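
One quick diagnostic for that first sign is to compare training and test scores side by side (a sketch with synthetic data; the alpha values are arbitrary):

```python
# Sketch: compare train vs. test R² to judge whether more or less
# regularization is needed. Synthetic data; alpha values are arbitrary.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=50, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"α={alpha:6}: train R²={model.score(X_train, y_train):.3f}, "
          f"test R²={model.score(X_test, y_test):.3f}")
```

A big gap between the two scores points toward more regularization; a low training score points toward less.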

Practical Example

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Compare different regularization strengths
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    scores = cross_val_score(ridge, X, y, cv=5)
    print(f"α={alpha:7.3f}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Output:

```
α=  0.001: 0.812 ± 0.045   (too little regularization → overfits)
α=  0.010: 0.845 ± 0.032
α=  0.100: 0.867 ± 0.028   ← Best!
α=  1.000: 0.854 ± 0.031
α= 10.000: 0.798 ± 0.042   (too much regularization → underfits)
α=100.000: 0.721 ± 0.058   (severe underfitting)
```

Key Takeaways

1. **Regularization adds a penalty for complexity**
2. **L2 (Ridge):** Shrinks all weights
3. **L1 (Lasso):** Can zero out weights (feature selection)
4. **Always tune the regularization strength (λ/α)**
5. **Use cross-validation to find the best value**

Think of regularization as Occam's Razor for ML: simpler models often generalize better!

#Machine Learning  #Regularization  #Overfitting  #Beginner