Feature Selection Techniques
Learn how to select the most important features to improve model performance and reduce complexity.
More features isn't always better. Feature selection removes noise, reduces overfitting, speeds up training, and improves interpretability.
Why Select Features?
Problems with too many features:
- Curse of dimensionality: data becomes sparse as the number of dimensions grows
- Overfitting: the model fits noise in irrelevant features
- Slower training and inference
- Harder-to-interpret models
Method 1: Filter Methods (Before Training)
Select features based on statistical tests, independent of any model.
Variance Threshold
Remove features with low variance (near-constant):
from sklearn.feature_selection import VarianceThreshold
# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
# See which features were kept (X must be a pandas DataFrame for .columns)
kept_features = X.columns[selector.get_support()]
print(f"Kept {len(kept_features)} of {X.shape[1]} features")
Correlation with Target
import pandas as pd
import numpy as np
# For classification: ANOVA F-test (chi2 is an alternative for non-negative features)
from sklearn.feature_selection import f_classif, chi2
# Calculate F-scores and p-values for each feature
f_scores, p_values = f_classif(X, y)
# Rank features
feature_scores = pd.DataFrame({
    'feature': X.columns,
    'f_score': f_scores,
    'p_value': p_values
}).sort_values('f_score', ascending=False)
# Select top K
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
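fit_transform returns a NumPy array, so column names are lost. If X is a pandas DataFrame (as assumed in the earlier snippets), the names of the K selected features can be read back from the fitted selector:
# Recover the names of the selected features from the fitted SelectKBest
selected_features = X.columns[selector.get_support()]
print(list(selected_features))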
Remove Correlated Features
def remove_correlated(df, threshold=0.95):
    # Upper triangle of the absolute correlation matrix (each pair counted once)
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # Drop one feature from every pair correlated above the threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)
X_reduced = remove_correlated(X)
Method 2: Wrapper Methods (Use Model Performance)
Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# Create model and RFE
model = RandomForestClassifier(n_estimators=100)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)
# Get selected features
selected = X.columns[rfe.support_]
print(f"Selected features: {list(selected)}")
# Feature ranking (1 = selected)
ranking = pd.DataFrame({
    'feature': X.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')
RFE with Cross-Validation
from sklearn.feature_selection import RFECV
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    scoring='accuracy'
)
rfecv.fit(X, y)
print(f"Optimal number of features: {rfecv.n_features_}")
# Plot CV score against the number of features
import matplotlib.pyplot as plt
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of features')
plt.ylabel('CV score')
plt.show()
Method 3: Embedded Methods (During Training)
L1 Regularization (Lasso)
L1 drives some coefficients to exactly zero:
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
# For regression (scale features first so the L1 penalty treats coefficients comparably)
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)
# For classification ('saga' and 'liblinear' both support the L1 penalty)
lr = LogisticRegression(penalty='l1', solver='saga', C=1.0)
lr.fit(X, y)
# Select non-zero features
selector = SelectFromModel(lr, prefit=True)
X_selected = selector.transform(X)
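The same SelectFromModel pattern works for the regression path. A minimal sketch, assuming the lasso fitted above (the variable names here are illustrative):
# Keep features whose Lasso coefficient was not shrunk to exactly zero
lasso_selector = SelectFromModel(lasso, prefit=True)
X_reg_selected = lasso_selector.transform(X)
print(f"Kept {X_reg_selected.shape[1]} of {X.shape[1]} features")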
Tree-Based Feature Importance
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Train model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)
# Keep features with importance above the median (fit_transform refits the model;
# pass prefit=True and call transform() to reuse the already-fitted rf)
selector = SelectFromModel(rf, threshold='median')
X_selected = selector.fit_transform(X, y)
# Or manually select top features
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
top_features = importance.head(10)['feature'].tolist()
Quick Comparison
| Method | Speed | Considers Model | Best For |
|---|---|---|---|
| Variance | Fast | No | Quick filtering |
| Correlation | Fast | No | Removing redundancy |
| SelectKBest | Fast | No | Simple selection |
| RFE | Slow | Yes | Optimal subset |
| L1/Lasso | Medium | Yes | Sparse models |
| Tree Importance | Medium | Yes | Tree models |
Practical Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
# Selection runs inside the pipeline, so it is re-fit on each CV training fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(f_classif, k=20)),
    ('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
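Because the selector is a named pipeline step, the number of features can be tuned like any other hyperparameter, and the selection is re-fit on each training fold so the validation folds stay untouched. A minimal sketch (the k values are arbitrary and assume X_train has at least 40 columns):
from sklearn.model_selection import GridSearchCV
# 'selector__k' targets the k parameter of the SelectKBest step named 'selector'
param_grid = {'selector__k': [10, 20, 40, 'all']}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)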
Key Takeaway
Start with filter methods (fast, good for removing obvious noise), then use embedded methods (L1 or tree importance) for automatic selection. Reach for wrapper methods (RFE) when you want the model to search for the best subset and can afford the compute. Always validate that feature selection actually improves your cross-validation score; sometimes all features are useful!
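One way to run that check: compare cross-validated scores with and without the selection step. A minimal sketch, reusing the pipeline, imports, and X_train/y_train from above:
from sklearn.model_selection import cross_val_score
# Baseline pipeline: the same scaler and model, but no feature selection
baseline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
with_selection = cross_val_score(pipeline, X_train, y_train, cv=5).mean()
without_selection = cross_val_score(baseline, X_train, y_train, cv=5).mean()
print(f"With selection: {with_selection:.3f}  Without: {without_selection:.3f}")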