# Feature Scaling: Normalization vs Standardization
Learn when and how to scale your features for better ML model performance.
Different features live on very different scales: a house's square footage (1,000-5,000) and its number of bedrooms (1-5) shouldn't be compared directly. Scaling fixes this.
## Why Scale Features?
### Problem Without Scaling
```
Feature 1 (Income): 30,000 - 200,000
Feature 2 (Age):    18 - 80
```
Distance- and gradient-based algorithms (KNN, SVM, neural networks) will be dominated by income, simply because its numeric range dwarfs age's!
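To see the domination concretely, here is a minimal sketch (with made-up income/age values) comparing Euclidean distances before and after standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up people: [income, age]
X = np.array([[50_000, 20],   # person A
              [51_000, 70],   # person B: nearly the same income, 50 years older
              [60_000, 20]])  # person C: different income, same age as A

def dist(u, v):
    return np.linalg.norm(u - v)

# Raw features: income swamps age, so B looks far "closer" to A than C does
print(dist(X[0], X[1]))  # ~1001  (the 50-year age gap barely registers)
print(dist(X[0], X[2]))  # 10000

# After standardization, both features contribute on comparable scales
X_std = StandardScaler().fit_transform(X)
print(dist(X_std[0], X_std[1]))  # ~2.1
print(dist(X_std[0], X_std[2]))  # ~2.2
```

With raw features a KNN model would treat B as A's nearest neighbor purely because of income; after scaling, the 50-year age gap counts too.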
### Algorithms That NEED Scaling

- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Neural Networks
- Linear/Logistic Regression (for convergence)
- PCA
### Algorithms That DON'T Need Scaling

- Decision Trees
- Random Forest
- Gradient Boosting (XGBoost, LightGBM)
- Naive Bayes
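A quick way to convince yourself about the tree-based entries: a sketch (on a toy synthetic dataset, assumed here purely for illustration) showing that a decision tree fitted on raw and on standardized features produces the same predictions, because its split thresholds simply shift with the data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy synthetic data: [income, age] plus a label that depends on income
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [40_000, 10] + [80_000, 40]
y = (X[:, 0] > 80_000).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds move with the rescaled data, so the predictions match
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())  # expected: True
```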
## Two Main Techniques
### 1. Normalization (Min-Max Scaling)
Scales to a fixed range, usually [0, 1]:
```python
x_normalized = (x - min) / (max - min)
```
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# All values are now between 0 and 1
```
**Use when:**

- You need bounded values
- Data doesn't have outliers (see the sketch below)
- Neural networks (especially image data)
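The outlier caveat is worth seeing once: with one extreme value, MinMaxScaler squeezes all the ordinary values into a sliver of the [0, 1] range (a small sketch with made-up incomes):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up incomes with a single extreme outlier
income = np.array([[30_000], [45_000], [52_000], [60_000], [1_000_000]])

print(MinMaxScaler().fit_transform(income).ravel())
# [0.     0.0155 0.0227 0.0309 1.    ]
# The outlier defines the top of the range, crushing the typical values toward 0
```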
### 2. Standardization (Z-Score Scaling)
Centers around 0 with unit variance:
```python
x_standardized = (x - mean) / std
```
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Mean ≈ 0, Std ≈ 1
```
**Use when:**

- Data has outliers
- The algorithm assumes normally distributed data
- Default choice for most cases
## Quick Comparison
| Aspect | Normalization | Standardization |
|--------|---------------|-----------------|
| Range | [0, 1] fixed | No fixed range |
| Outlier handling | Poor | Better |
| Mean | Not centered | Centered at 0 |
| Use case | Image pixels, bounded data | Most other cases |
## Code Example
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data: [income, age]
X = np.array([[30000, 25],
              [80000, 45],
              [50000, 35],
              [120000, 50]])

# Standardization
std_scaler = StandardScaler()
X_std = std_scaler.fit_transform(X)
print("Standardized:")
print(X_std)
print(f"Mean: {X_std.mean(axis=0)}")  # Should be ~0
print(f"Std: {X_std.std(axis=0)}")    # Should be ~1

# Normalization
minmax_scaler = MinMaxScaler()
X_norm = minmax_scaler.fit_transform(X)
print("\nNormalized:")
print(X_norm)
print(f"Min: {X_norm.min(axis=0)}")  # Should be 0
print(f"Max: {X_norm.max(axis=0)}")  # Should be 1
```
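One handy follow-up, shown here as a small self-contained sketch: the fitted scaler keeps its parameters, so `inverse_transform` can map scaled values back to the original units (useful when reporting predictions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[30000, 25], [80000, 45], [50000, 35], [120000, 50]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# inverse_transform undoes the scaling using the stored mean and std
print(np.allclose(scaler.inverse_transform(X_std), X))  # True
```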
## Critical: Fit on Train, Transform Both
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG - Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Learns from ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled, test_size=0.2)

# RIGHT
X_train, X_test = train_test_split(X, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Learn from train only
X_test_scaled = scaler.transform(X_test)        # Apply the same transformation
```
The test set shouldn't influence the scaling parameters!
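A convenient way to make this automatic is to wrap the scaler and the model in a Pipeline, so cross-validation refits the scaler on each training fold only. A minimal sketch, with KNN picked arbitrarily as a scale-sensitive model and a built-in dataset standing in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit inside every CV fold, never on the held-out fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(cross_val_score(model, X, y, cv=5).mean())
```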
## Other Scaling Methods
### Robust Scaler (for outliers)

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # Uses the median and IQR, robust to outliers
```
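To see the difference, a sketch reusing made-up incomes with one extreme outlier: StandardScaler's mean and std get dragged toward the outlier, while RobustScaler centers on the median and scales by the IQR, so the typical values keep their spread:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

income = np.array([[30_000], [45_000], [52_000], [60_000], [1_000_000]])

print(StandardScaler().fit_transform(income).ravel())
# ~[-0.54 -0.50 -0.49 -0.47  2.00]  -> typical values bunched together

print(RobustScaler().fit_transform(income).ravel())
# ~[-1.47 -0.47  0.    0.53 63.2 ]  -> typical values spread out, outlier pushed far away
```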
### Max Abs Scaler (sparse data)

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()  # Scales by the max absolute value, keeps sparsity
```
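A brief sketch of why this matters: MaxAbsScaler only divides each column by its maximum absolute value and never shifts the data, so zeros stay zero and a scipy sparse matrix stays sparse (the tiny matrix below is made up):

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

# Tiny made-up sparse matrix (think bag-of-words counts, mostly zeros)
X = sp.csr_matrix([[0, 3, 0],
                   [0, 0, 5],
                   [2, 0, 0]])

X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]   -> each column divided by its max |value|; the result is still sparse
```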
## When NOT to Scale
1. **Tree-based models** - They split on thresholds, so scale doesn't matter
2. **Categorical features** - One-hot encoded (0/1) features shouldn't be scaled (see the sketch below)
3. **Already scaled data** - Don't scale twice!
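For point 2, a sketch of scaling only the numeric columns while passing the 0/1 columns through untouched, using ColumnTransformer (the column names here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: two numeric columns plus already one-hot-encoded city flags
df = pd.DataFrame({
    "income":   [30_000, 80_000, 50_000],
    "age":      [25, 45, 35],
    "city_nyc": [1, 0, 0],
    "city_la":  [0, 1, 1],
})

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["income", "age"])],
    remainder="passthrough",  # leave the one-hot columns exactly as they are
)
print(preprocess.fit_transform(df))
```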
## Summary
| Situation | Recommendation |
|-----------|----------------|
| Default choice | StandardScaler |
| Bounded output needed | MinMaxScaler |
| Data has outliers | RobustScaler |
| Tree-based models | No scaling needed |
| Neural networks | MinMaxScaler or StandardScaler |
Remember: Always fit on training data only, then transform both train and test sets!