
Dimensionality Reduction with PCA

Learn how Principal Component Analysis reduces features while preserving important information.

Sarah Chen
December 19, 2025


Too many features? PCA (Principal Component Analysis) finds the most important directions in your data and lets you keep just those.

Why Reduce Dimensions?

Problems with high dimensions:

  • Curse of dimensionality (you need exponentially more data to cover the feature space)
  • Overfitting
  • Slow training
  • Hard to visualize

Benefits of PCA:

  • Fewer features, faster training
  • Removes multicollinearity
  • Enables visualization (reduce to 2-3D)
  • Can reduce noise and improve downstream model performance

How PCA Works (Intuition)

  1. Find the direction of maximum variance (PC1)
  2. Find the next direction, perpendicular to PC1 (PC2)
  3. Continue for all dimensions
  4. Keep only the top components

Think of it as finding the "natural axes" of your data.
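
If you're curious, here is the same idea as a rough from-scratch sketch in NumPy, using the covariance-matrix eigendecomposition (scikit-learn actually uses an SVD under the hood, and X is assumed to be a NumPy array of shape (n_samples, n_features)):

import numpy as np

# PCA from scratch (sketch): the principal components are the eigenvectors
# of the covariance matrix, ordered by how much variance they explain.
X_centered = X - X.mean(axis=0)                  # center each feature
cov = np.cov(X_centered, rowvar=False)           # covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices

order = np.argsort(eigenvalues)[::-1]            # largest variance first
components = eigenvectors[:, order]              # columns are PC1, PC2, ...

X_top2 = X_centered @ components[:, :2]          # project onto the top 2 "natural axes"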

Implementation

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X is your (n_samples, n_features) feature matrix; PCA is scale-sensitive, so standardize first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {X_pca.shape[1]}")

Choosing Number of Components

Method 1: Variance Threshold

# Keep 95% of total variance
pca = PCA(n_components=0.95)

Method 2: Elbow Plot

import matplotlib.pyplot as plt

pca = PCA()
pca.fit(X_scaled)

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.show()

Method 3: Fixed Number

# For visualization: 2 or 3 components
pca = PCA(n_components=2)

Visualization Example

from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Reduce to 2D
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))

# Plot
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset - PCA')
plt.show()

Important Considerations

1. Always Scale First

PCA maximizes variance, so features with larger numeric ranges dominate the components. Standardize your data first:

# Always do this before PCA
X_scaled = StandardScaler().fit_transform(X)
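
To see why this matters, here is a small synthetic illustration (the two features, their names, and their scales are made up purely for demonstration; PCA and StandardScaler are imported as in the Implementation section):

import numpy as np

# Synthetic example: two correlated features on very different scales
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, 500)                                # values around 170
income_usd = 40_000 + 200 * height_cm + rng.normal(0, 5_000, 500)   # values around 75,000
X_demo = np.column_stack([height_cm, income_usd])

# Without scaling, PC1 is almost entirely the large-scale feature
print(PCA(n_components=2).fit(X_demo).explained_variance_ratio_)

# After standardizing, both features contribute to the components
X_demo_scaled = StandardScaler().fit_transform(X_demo)
print(PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_)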

2. Fit Only on Training Data

Fitting PCA on the full dataset leaks information from the test set into the components. Fit on the training data, then apply the same transformation to the test data:

# Correct way: learn the components from the training set only
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # Only transform - reuse the fitted components

3. Interpretability is Lost

Each principal component is a linear combination of the original features, so the new features no longer have a direct real-world meaning. You can still check how much each original feature contributes to each component:

import pandas as pd

# Loadings: each row is a component, each column an original feature
components_df = pd.DataFrame(
    pca.components_,
    columns=feature_names,  # e.g. iris.feature_names, or your own column names
    index=[f"PC{i+1}" for i in range(pca.n_components_)]
)
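
For example, sorting the absolute loadings of the first component shows which original features drive it most (a small usage sketch building on components_df above):

# Original features with the largest absolute loading on PC1
print(components_df.loc["PC1"].abs().sort_values(ascending=False))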

When to Use PCA

Good use cases:

  • High-dimensional data (100+ features)
  • Visualization
  • Preprocessing for other algorithms (see the pipeline sketch after these lists)
  • When features are correlated

Not ideal when:

  • Interpretability is important
  • Features are already independent
  • Non-linear relationships (consider t-SNE or UMAP)
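
Here is a minimal sketch of the "preprocessing for other algorithms" case using scikit-learn's Pipeline, so scaling and PCA are fit only on the training portion of each fold. The logistic regression classifier and 5-fold CV are illustrative choices, and X, y are assumed to be your feature matrix and labels (e.g. the iris arrays from the visualization example):

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling and PCA are fit inside each training fold, so nothing leaks from the held-out fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")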

Key Takeaway

PCA finds the most important directions in your data. Always scale first, use explained variance to choose components, and remember you're trading interpretability for dimensionality reduction. It's a powerful preprocessing step for high-dimensional data!

#MachineLearning #PCA #DimensionalityReduction #Intermediate