# Principal Component Analysis

Reduce dimensions while keeping the important information. Simplify complex data.
## What is PCA?

PCA reduces the number of features in a dataset while keeping as much of the important information as possible.
Like summarizing a long book into key points!
## Why Use PCA?

- Too many features slow down training
- Visualization (you can't plot 100 dimensions; see the sketch below)
- Remove noise
- Reduce storage
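For the visualization bullet, here is a minimal sketch (assuming matplotlib is installed) that projects the classic 4-feature Iris dataset down to 2D so it fits on a scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has 4 features; project to 2 so it can be plotted
X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```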
## Real Example

Customer data with 50 features → reduce to 5 key features.

Maybe those 5 capture 95% of the important information!
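A runnable sketch of that scenario. The customer matrix here is synthetic: it is built from 5 hidden factors, so about 5 components should capture most of the variance, which mirrors the 50 → 5 story above (real data would behave similarly only if its features are correlated):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the customer table: 200 customers, 50 features,
# generated from 5 hidden factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 50))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 50))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, 5)
print(pca.explained_variance_ratio_.sum())   # close to 1.0 for this construction
```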
## How It Works

1. Find the directions with the most variation
2. Project the data onto those directions
3. Keep the top N components
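Under the hood, those "directions with most variation" are the eigenvectors of the data's covariance matrix. A minimal NumPy-only sketch of the three steps (the 2-feature matrix is made up for illustration):

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])

# 1. Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors of the covariance matrix are the principal directions;
#    eigh returns eigenvalues in ascending order, so sort descending
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Keep the top component and project the data onto it
X_projected = X_centered @ eigvecs[:, :1]
print(X_projected)
```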
## Python Code

```python
from sklearn.decomposition import PCA
import numpy as np

# Data with many features
X = np.array([
    [2.5, 2.4, 1.5, 3.2],
    [0.5, 0.7, 0.9, 1.1],
    [2.2, 2.9, 1.8, 3.5],
    [1.9, 2.2, 1.6, 2.8],
])

# Reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Reduced shape: {X_reduced.shape}")

# Check variance explained
print(f"Variance explained: {pca.explained_variance_ratio_}")
```
## Choosing Components

Keep enough components to explain 95% of the variance:

```python
pca = PCA(n_components=0.95)  # keep 95% of the variance
X_reduced = pca.fit_transform(X)
```
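To see how many components that threshold actually selects, you can fit with all components and inspect the cumulative explained variance. A short sketch (the feature matrix here is a random placeholder; substitute your own data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Fit with all components and look at the cumulative variance curve
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# First component count whose cumulative variance reaches 95%
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_needed}")
```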
## Applications

- Image compression (see the sketch below)
- Noise reduction
- Feature extraction
- Data visualization
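As one concrete application, here is a sketch of PCA-based image compression that treats each row of pixels as a sample. A synthetic gradient image stands in for a real one, so the numbers are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 64x64 grayscale "image": a smooth gradient plus noise
rng = np.random.default_rng(0)
image = np.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
image += 0.05 * rng.normal(size=(64, 64))

# Keep 10 components: each of the 64 rows is compressed to 10 numbers
pca = PCA(n_components=10)
compressed = pca.fit_transform(image)
reconstructed = pca.inverse_transform(compressed)

stored = compressed.size + pca.components_.size + pca.mean_.size
error = np.mean((image - reconstructed) ** 2)
print(f"Stored values: {stored} vs original {image.size}")
print(f"Reconstruction MSE: {error:.6f}")
```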
## Advantages

- Fast
- Few parameters to tune (mainly the number of components)
- Easy to quantify what is kept (explained variance ratio)
## Disadvantages

- Loses some information
- Assumes linear relationships
- Components can be hard to interpret
## Remember

- Apply PCA before training models (fit it on the training data only; see the pipeline sketch below)
- Great for visualization
- Try keeping 95% of the variance
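"Use before training models" is usually done inside a pipeline, so PCA is fit on the training split only and features are scaled first (PCA is sensitive to feature scale). A minimal sketch with a made-up classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 500 samples, 30 features
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> PCA (keep 95% variance) -> classifier;
# the pipeline fits PCA on the training split only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```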