# Dimensionality Reduction with PCA
Learn how Principal Component Analysis reduces features while preserving important information.
Too many features? PCA (Principal Component Analysis) finds the most important directions in your data and lets you keep just those.
## Why Reduce Dimensions?

**Problems with high dimensions:**

- Curse of dimensionality (you need exponentially more data)
- Overfitting
- Slow training
- Hard to visualize

**Benefits of PCA:**

- Fewer features, faster training
- Removes multicollinearity
- Enables visualization (reduce to 2-3D)
- Often improves performance
## How PCA Works (Intuition)

1. Find the direction of maximum variance (PC1)
2. Find the direction with the most remaining variance, perpendicular to PC1 (PC2)
3. Continue for all dimensions
4. Keep only the top components
Think of it as finding the "natural axes" of your data.
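To make that intuition concrete, here is a minimal NumPy sketch (not scikit-learn's implementation) that finds the principal directions as eigenvectors of the covariance matrix of some toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # toy data: 200 samples, 3 features

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Eigendecompose the covariance matrix
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort directions by explained variance (largest first)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the top components and project the data onto them
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]

print("Explained variance ratio:", eigenvalues[:k] / eigenvalues.sum())
```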
## Implementation

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# PCA requires scaled data!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {X_pca.shape[1]}")
```
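After fitting, it is worth checking how much variance the kept components actually explain; `explained_variance_ratio_` is the standard scikit-learn attribute for this:

```python
# Variance captured by each kept component, and the total
print(pca.explained_variance_ratio_)
print(f"Total variance kept: {pca.explained_variance_ratio_.sum():.2%}")
```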
## Choosing Number of Components
### Method 1: Variance Threshold
```python
# Keep 95% of total variance
pca = PCA(n_components=0.95)
```
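With a fractional `n_components`, scikit-learn keeps the smallest number of components whose cumulative variance reaches the threshold. You can see how many that turned out to be via the fitted estimator's `n_components_` attribute:

```python
pca = PCA(n_components=0.95)
pca.fit(X_scaled)

# How many components were needed to reach 95% variance
print(pca.n_components_)
```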
### Method 2: Elbow Plot
```python
import matplotlib.pyplot as plt

pca = PCA()
pca.fit(X_scaled)

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.show()
```
### Method 3: Fixed Number
```python
# For visualization: 2 or 3 components
pca = PCA(n_components=2)
```
## Visualization Example

```python
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Reduce to 2D
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))

# Plot
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset - PCA')
plt.show()
```
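If you want to show how faithful the 2-D picture is, you can read the retained variance off the fitted object (the exact values depend on the data; this just prints them):

```python
pc1_var, pc2_var = pca.explained_variance_ratio_
print(f"PC1 explains {pc1_var:.1%}, PC2 explains {pc2_var:.1%} of the variance")
```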
## Important Considerations
### 1. Always Scale First
PCA is driven by variance, so features on larger scales dominate the components. Standardize your data first:

```python
# Always do this before PCA
X_scaled = StandardScaler().fit_transform(X)
```
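As a quick illustration of why this matters, here is a small sketch on made-up data: when one feature lives on a much larger scale, it dominates the first component unless you standardize first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=500)
# Two correlated features, but the second is measured in much larger units
X_toy = np.column_stack([base + 0.1 * rng.normal(size=500),
                         1000 * (base + 0.1 * rng.normal(size=500))])

pc1_raw = PCA(n_components=1).fit(X_toy).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X_toy)).components_[0]

print("PC1 without scaling:", pc1_raw)  # dominated by the large-unit feature
print("PC1 with scaling:   ", pc1_std)  # both features contribute roughly equally
```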
### 2. Fit Only on Training Data
```python
# Correct way
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # Only transform!
```
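One convenient way to guarantee this, and to keep scaling and PCA from leaking into cross-validation, is to wrap everything in a scikit-learn `Pipeline`. The classifier and the `X_train`/`y_train` split below are illustrative assumptions:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Scaling and PCA are fit on the training data only, then applied downstream
model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.95),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```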
### 3. Interpretability is Lost
After PCA, each component is a linear combination of the original features, so individual features are harder to interpret. You can still check how much each feature contributes:

```python
import pandas as pd

# Feature loadings: one row per component, one column per original feature
components_df = pd.DataFrame(
    pca.components_,
    columns=feature_names  # e.g. iris.feature_names from the example above
)
```
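From that table you can, for example, list which original features load most heavily on the first component (this continues from the `components_df` built above):

```python
# Features with the largest absolute loadings on PC1
top_features = components_df.iloc[0].abs().sort_values(ascending=False).head(5)
print(top_features)
```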
## When to Use PCA

**Good use cases:**

- High-dimensional data (100+ features)
- Visualization
- Preprocessing for other algorithms
- When features are correlated

**Not ideal when:**

- Interpretability is important
- Features are already independent
- Relationships are non-linear (consider t-SNE or UMAP; see the sketch below)
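For non-linear structure, a minimal t-SNE sketch with scikit-learn looks like this (parameters are illustrative, and note that t-SNE is intended for visualization rather than general-purpose feature reduction):

```python
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Non-linear 2-D embedding for visualization
X_embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    StandardScaler().fit_transform(X)
)
print(X_embedded.shape)  # (n_samples, 2)
```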
## Key Takeaway
PCA finds the most important directions in your data. Always scale first, use explained variance to choose components, and remember you're trading interpretability for dimensionality reduction. It's a powerful preprocessing step for high-dimensional data!