Dimensionality Reduction with PCA
Learn how Principal Component Analysis reduces features while preserving important information.
Sarah Chen
December 19, 2025
Too many features? PCA (Principal Component Analysis) finds the most important directions in your data and lets you keep just those.
Why Reduce Dimensions?
Problems with high dimensions:
- Curse of dimensionality (need exponentially more data)
- Overfitting
- Slow training
- Hard to visualize
Benefits of PCA:
- Fewer features, faster training
- Removes multicollinearity (principal components are uncorrelated; see the sketch after this list)
- Enables visualization (reduce to 2-3D)
- Often improves model performance by discarding noisy, low-variance directions
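To make the multicollinearity point concrete, here is a minimal sketch on a synthetic dataset (names like X_demo are illustrative, not from the original post): even when two features are almost copies of each other, the resulting principal components are uncorrelated.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
X_demo = np.column_stack([x1, x2])
print(np.corrcoef(X_demo.T)[0, 1])   # close to 1: strong multicollinearity
X_pcs = PCA().fit_transform(StandardScaler().fit_transform(X_demo))
print(np.corrcoef(X_pcs.T)[0, 1])    # ~0: the components are uncorrelated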
How PCA Works (Intuition)
- Find the direction of maximum variance (PC1)
- Find the next direction, perpendicular to PC1 (PC2)
- Continue for all dimensions
- Keep only the top components
Think of it as finding the "natural axes" of your data.
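If you want to see the mechanics, here is a small sketch (not from the original post; X_demo is a made-up dataset) that reproduces PCA with NumPy: center the data, take the eigenvectors of the covariance matrix, and sort them by eigenvalue. The result matches scikit-learn's components, possibly up to a sign flip.
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated features
# PCA "by hand": eigen-decomposition of the covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # sort by variance, descending
components = eigvecs[:, order].T         # rows are PC1, PC2, PC3
print(np.round(components, 3))
print(np.round(PCA().fit(X_demo).components_, 3))  # same directions, up to sign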
Implementation
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# PCA is sensitive to feature scale, so standardize X (your feature matrix) first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=0.95) # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)
print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {X_pca.shape[1]}")
Choosing Number of Components
Method 1: Variance Threshold
# Keep 95% of total variance
pca = PCA(n_components=0.95)
Method 2: Cumulative Variance (Elbow) Plot
import matplotlib.pyplot as plt
pca = PCA()
pca.fit(X_scaled)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.show()
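Instead of reading the threshold off the plot, you can compute the cutoff directly from the cumulative curve. A small sketch, assuming the fitted pca from the plot above:
import numpy as np
# Smallest number of components whose cumulative variance reaches 95%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95)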
Method 3: Fixed Number
# For visualization: 2 or 3 components
pca = PCA(n_components=2)
Visualization Example
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Reduce to 2D
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))
# Plot
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('Iris Dataset - PCA')
plt.show()
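It is also worth checking how faithful the 2D picture is, that is, how much of the original variance the two plotted components retain:
# Fraction of total variance captured by the two plotted components
print(pca.explained_variance_ratio_)        # roughly [0.73, 0.23] for scaled iris
print(pca.explained_variance_ratio_.sum())  # ~0.96, so the 2D plot is quite faithful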
Important Considerations
1. Always Scale First
PCA is driven by variance, so features on larger scales dominate the components. Standardize your data first:
# Always do this before PCA
X_scaled = StandardScaler().fit_transform(X)
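To see why this matters, here is a small synthetic demonstration (X_demo is just an illustrative name): when one feature is measured on a much larger scale, it dominates the unscaled PCA regardless of how informative it is.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(scale=1.0, size=300),     # feature in small units
    rng.normal(scale=1000.0, size=300),  # same kind of noise, huge units
])
# Without scaling, PC1 is essentially just the large-scale feature
print(PCA().fit(X_demo).explained_variance_ratio_)   # ~[1.0, 0.0]
# After standardizing, both features contribute about equally
print(PCA().fit(StandardScaler().fit_transform(X_demo)).explained_variance_ratio_)  # ~[0.5, 0.5]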
2. Fit Only on Training Data
# Correct way
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test) # Only transform!
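A convenient way to enforce this (and the scaling step) is to wrap everything in a scikit-learn Pipeline, so fitting only ever happens on training data. A sketch, assuming X_train, X_test, y_train, y_test come from your own train/test split and using LogisticRegression as a stand-in classifier:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# Scaler and PCA are fit on the training data only; the test set is just transformed
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))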
3. Interpretability is Lost
Each principal component is a linear combination of the original features, so the new axes no longer have a direct meaning. You can still inspect each feature's contribution (loading):
import pandas as pd
# Loadings: each row shows how strongly each original feature contributes to a component
# (feature_names is the list of your original column names, e.g. iris.feature_names)
components_df = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=[f"PC{i+1}" for i in range(pca.n_components_)]
)
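From that DataFrame you can read off which original features load most heavily on each component. For example, the top contributors to the first component:
# Largest absolute loadings on the first principal component
print(components_df.iloc[0].abs().sort_values(ascending=False).head())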
When to Use PCA
Good use cases:
- High-dimensional data (100+ features)
- Visualization
- Preprocessing for other algorithms
- When features are correlated
Not ideal when:
- Interpretability is important
- Features are already independent
- Non-linear relationships (consider t-SNE or UMAP)
Key Takeaway
PCA finds the most important directions in your data. Always scale first, use explained variance to choose components, and remember you're trading interpretability for dimensionality reduction. It's a powerful preprocessing step for high-dimensional data!
#Machine Learning #PCA #Dimensionality Reduction #Intermediate