# Feature Scaling: Normalization vs Standardization
Learn when and how to scale your features for better ML model performance.
Different features live on very different scales: a house's square footage (1,000-5,000) and its number of bedrooms (1-5) shouldn't be compared directly. Scaling fixes this.
## Why Scale Features?
### Problem Without Scaling
```
Feature 1 (Income): 30,000 - 200,000
Feature 2 (Age):    18 - 80
```
Distance- and gradient-based algorithms (KNN, SVM, neural networks) will be dominated by income, simply because its numeric range dwarfs age's!
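To see the domination concretely, here is a minimal sketch (with made-up income/age values) comparing Euclidean distances before and after standardization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up people: [income, age]
X = np.array([[50_000, 20],   # person A
              [51_000, 70],   # person B: nearly the same income, 50 years older
              [60_000, 20]])  # person C: different income, same age as A

def dist(u, v):
    return np.linalg.norm(u - v)

# Raw features: income swamps age, so B looks far "closer" to A than C does
print(dist(X[0], X[1]))  # ~1001  (the 50-year age gap barely registers)
print(dist(X[0], X[2]))  # 10000

# After standardization, both features contribute on comparable scales
X_std = StandardScaler().fit_transform(X)
print(dist(X_std[0], X_std[1]))  # ~2.1
print(dist(X_std[0], X_std[2]))  # ~2.2
```

With raw features a KNN model would treat B as A's nearest neighbor purely because of income; after scaling, the 50-year age gap counts too.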
### Algorithms That NEED Scaling

- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Neural Networks
- Linear/Logistic Regression (for convergence)
- PCA
### Algorithms That DON'T Need Scaling

- Decision Trees
- Random Forest
- Gradient Boosting (XGBoost, LightGBM)
- Naive Bayes
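A quick way to convince yourself about the tree-based entries: a sketch (on a toy synthetic dataset, assumed here purely for illustration) showing that a decision tree fitted on raw and on standardized features produces the same predictions, because its split thresholds simply shift with the data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy synthetic data: [income, age] plus a label that depends on income
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * [40_000, 10] + [80_000, 40]
y = (X[:, 0] > 80_000).astype(int)

X_scaled = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds move with the rescaled data, so the predictions match
print((tree_raw.predict(X) == tree_scaled.predict(X_scaled)).all())  # expected: True
```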
## Two Main Techniques
### 1. Normalization (Min-Max Scaling)
Scales to a fixed range, usually [0, 1]:
```python
x_normalized = (x - min) / (max - min)
```
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

# All values are now between 0 and 1
```
**Use when:**

- You need bounded values
- Data doesn't have outliers (see the sketch below)
- Neural networks (especially image data)
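The outlier caveat is worth seeing once: with one extreme value, MinMaxScaler squeezes all the ordinary values into a sliver of the [0, 1] range (a small sketch with made-up incomes):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up incomes with a single extreme outlier
income = np.array([[30_000], [45_000], [52_000], [60_000], [1_000_000]])

print(MinMaxScaler().fit_transform(income).ravel())
# [0.     0.0155 0.0227 0.0309 1.    ]
# The outlier defines the top of the range, crushing the typical values toward 0
```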
### 2. Standardization (Z-Score Scaling)
Centers around 0 with unit variance:
```python
x_standardized = (x - mean) / std
```
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

# Mean ≈ 0, Std ≈ 1
```
**Use when:**

- Data has outliers
- The algorithm assumes normally distributed data
- Default choice for most cases
## Quick Comparison
| Aspect | Normalization | Standardization |
|--------|---------------|-----------------|
| Range | [0, 1] fixed | No fixed range |
| Outlier handling | Poor | Better |
| Mean | Not centered | Centered at 0 |
| Use case | Image pixels, bounded data | Most other cases |
## Code Example
```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data: [income, age]
X = np.array([[30000, 25],
              [80000, 45],
              [50000, 35],
              [120000, 50]])

# Standardization
std_scaler = StandardScaler()
X_std = std_scaler.fit_transform(X)
print("Standardized:")
print(X_std)
print(f"Mean: {X_std.mean(axis=0)}")  # Should be ~0
print(f"Std: {X_std.std(axis=0)}")    # Should be ~1

# Normalization
minmax_scaler = MinMaxScaler()
X_norm = minmax_scaler.fit_transform(X)
print("\nNormalized:")
print(X_norm)
print(f"Min: {X_norm.min(axis=0)}")  # Should be 0
print(f"Max: {X_norm.max(axis=0)}")  # Should be 1
```
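One handy follow-up, shown here as a small self-contained sketch: the fitted scaler keeps its parameters, so `inverse_transform` can map scaled values back to the original units (useful when reporting predictions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[30000, 25], [80000, 45], [50000, 35], [120000, 50]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# inverse_transform undoes the scaling using the stored mean and std
print(np.allclose(scaler.inverse_transform(X_std), X))  # True
```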
## Critical: Fit on Train, Transform Both
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG - Data leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Learns from ALL data, including future test rows
X_train, X_test = train_test_split(X_scaled, test_size=0.2)

# RIGHT
X_train, X_test = train_test_split(X, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Learn from train only
X_test_scaled = scaler.transform(X_test)        # Apply the same transformation
```
The test set shouldn't influence the scaling parameters!
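A convenient way to make this automatic is to wrap the scaler and the model in a Pipeline, so cross-validation refits the scaler on each training fold only. A minimal sketch, with KNN picked arbitrarily as a scale-sensitive model and a built-in dataset standing in for your own data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit inside every CV fold, never on the held-out fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
print(cross_val_score(model, X, y, cv=5).mean())
```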
## Other Scaling Methods
### Robust Scaler (for outliers)

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()  # Uses the median and IQR, robust to outliers
```
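To see the difference, a sketch reusing made-up incomes with one extreme outlier: StandardScaler's mean and std get dragged toward the outlier, while RobustScaler centers on the median and scales by the IQR, so the typical values keep their spread:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

income = np.array([[30_000], [45_000], [52_000], [60_000], [1_000_000]])

print(StandardScaler().fit_transform(income).ravel())
# ~[-0.54 -0.50 -0.49 -0.47  2.00]  -> typical values bunched together

print(RobustScaler().fit_transform(income).ravel())
# ~[-1.47 -0.47  0.    0.53 63.2 ]  -> typical values spread out, outlier pushed far away
```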
### Max Abs Scaler (sparse data)

```python
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()  # Scales by the max absolute value, keeps sparsity
```
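A brief sketch of why this matters: MaxAbsScaler only divides each column by its maximum absolute value and never shifts the data, so zeros stay zero and a scipy sparse matrix stays sparse (the tiny matrix below is made up):

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler

# Tiny made-up sparse matrix (think bag-of-words counts, mostly zeros)
X = sp.csr_matrix([[0, 3, 0],
                   [0, 0, 5],
                   [2, 0, 0]])

X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]   -> each column divided by its max |value|; the result is still sparse
```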
## When NOT to Scale
1. **Tree-based models** - They split on thresholds, so scale doesn't matter
2. **Categorical features** - One-hot encoded (0/1) features shouldn't be scaled (see the sketch below)
3. **Already scaled data** - Don't scale twice!
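For point 2, a sketch of scaling only the numeric columns while passing the 0/1 columns through untouched, using ColumnTransformer (the column names here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: two numeric columns plus already one-hot-encoded city flags
df = pd.DataFrame({
    "income":   [30_000, 80_000, 50_000],
    "age":      [25, 45, 35],
    "city_nyc": [1, 0, 0],
    "city_la":  [0, 1, 1],
})

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["income", "age"])],
    remainder="passthrough",  # leave the one-hot columns exactly as they are
)
print(preprocess.fit_transform(df))
```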
## Summary
| Situation | Recommendation |
|-----------|----------------|
| Default choice | StandardScaler |
| Bounded output needed | MinMaxScaler |
| Data has outliers | RobustScaler |
| Tree-based models | No scaling needed |
| Neural networks | MinMaxScaler or StandardScaler |
Remember: Always fit on training data only, then transform both train and test sets!