# Principal Component Analysis

Reduce dimensions while keeping the important information. Simplify complex data.
## What is PCA?

PCA reduces the number of features in a dataset while keeping as much of the important information as possible.
Like summarizing a long book into key points!
## Why Use PCA?

- Too many features slow down training
- Visualization (you can't plot 100 dimensions; see the sketch below)
- Remove noise
- Reduce storage
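For the visualization bullet, here is a minimal sketch (assuming matplotlib is installed) that projects the classic 4-feature Iris dataset down to 2D so it fits on a scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has 4 features; project to 2 so it can be plotted
X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```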
## Real Example

Customer data with 50 features → reduce to 5 key features.

Maybe those 5 capture 95% of the important information!
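A runnable sketch of that scenario. The customer matrix here is synthetic: it is built from 5 hidden factors, so about 5 components should capture most of the variance, which mirrors the 50 → 5 story above (real data would behave similarly only if its features are correlated):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the customer table: 200 customers, 50 features,
# generated from 5 hidden factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 50))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, 5)
print(pca.explained_variance_ratio_.sum())   # close to 1.0 for this construction
```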
## How It Works

1. Find the directions with the most variation
2. Project the data onto those directions
3. Keep the top N components
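Under the hood, those "directions with most variation" are the eigenvectors of the data's covariance matrix. A minimal NumPy-only sketch of the three steps (the 2-feature matrix is made up for illustration):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])

# 1. Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors of the covariance matrix are the principal directions;
#    eigh returns eigenvalues in ascending order, so sort descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Keep the top component and project the data onto it
X_projected = X_centered @ eigvecs[:, :1]
print(X_projected)
```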
## Python Code

```python
from sklearn.decomposition import PCA
import numpy as np

# Data with many features
X = np.array([
    [2.5, 2.4, 1.5, 3.2],
    [0.5, 0.7, 0.9, 1.1],
    [2.2, 2.9, 1.8, 3.5],
    [1.9, 2.2, 1.6, 2.8],
])

# Reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")

# Check variance explained
print(f"Variance explained: {pca.explained_variance_ratio_}")
```
## Choosing Components

Keep enough components to explain 95% of the variance:

```python
pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X)
```
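To see how many components that threshold actually selects, you can fit with all components and inspect the cumulative explained variance. A short sketch (the feature matrix here is a random placeholder; substitute your own data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Fit with all components and look at the cumulative variance curve
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# First component count whose cumulative variance reaches 95%
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_needed}")
```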
## Applications

- Image compression (see the sketch below)
- Noise reduction
- Feature extraction
- Data visualization
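As one concrete application, here is a sketch of PCA-based image compression that treats each row of pixels as a sample. A synthetic gradient image stands in for a real one, so the numbers are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 64x64 grayscale "image": a smooth gradient plus noise
rng = np.random.default_rng(0)
image = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
image += 0.05 * rng.normal(size=(64, 64))

# Keep 10 components: each of the 64 rows is compressed to 10 numbers
pca = PCA(n_components=10)
compressed = pca.fit_transform(image)
reconstructed = pca.inverse_transform(compressed)

stored = compressed.size + pca.components_.size + pca.mean_.size
error = np.mean((image - reconstructed) ** 2)
print(f"Stored values: {stored} vs original {image.size}")
print(f"Reconstruction MSE: {error:.6f}")
```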
## Advantages

- Fast
- Few parameters to tune (mainly the number of components)
- Easy to quantify what is kept (explained variance ratio)
## Disadvantages

- Loses some information
- Assumes linear relationships
- Components can be hard to interpret
## Remember

- Apply PCA before training models (fit it on the training data only; see the pipeline sketch below)
- Great for visualization
- Try keeping 95% of the variance
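"Use before training models" is usually done inside a pipeline, so PCA is fit on the training split only and features are scaled first (PCA is sensitive to feature scale). A minimal sketch with a made-up classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 500 samples, 30 features
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> PCA (keep 95% variance) -> classifier;
# the pipeline fits PCA on the training split only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```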