# Feature Selection Techniques
Learn how to select the most important features to improve model performance and reduce complexity.
More features isn't always better. Feature selection removes noise, reduces overfitting, speeds up training, and improves interpretability.
## Why Select Features?
**Problems with too many features:**

- Curse of dimensionality
- Overfitting
- Slower training
- Harder to interpret
## Method 1: Filter Methods (Before Training)
Select features based on statistical tests, independent of any model.
### Variance Threshold
Remove features with low variance (near-constant):
```python
from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# See which features were kept
kept_features = X.columns[selector.get_support()]
print(f"Kept {len(kept_features)} of {X.shape[1]} features")
```
### Correlation with Target
```python
import pandas as pd
import numpy as np

# For classification
from sklearn.feature_selection import f_classif, chi2

# Calculate F-scores
f_scores, p_values = f_classif(X, y)

# Rank features
feature_scores = pd.DataFrame({
    'feature': X.columns,
    'f_score': f_scores,
    'p_value': p_values
}).sort_values('f_score', ascending=False)

# Select top K
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
```
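The `chi2` import above is the alternative scoring function for non-negative features such as counts, frequencies, or one-hot columns. A minimal sketch, assuming `X` contains only non-negative values:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (e.g., counts or one-hot encodings)
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Features kept by the chi-squared test
kept = X.columns[selector.get_support()]
```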
### Remove Correlated Features
```python
def remove_correlated(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # Find columns with correlation above threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)

X_reduced = remove_correlated(X)
```
## Method 2: Wrapper Methods (Use Model Performance)
### Recursive Feature Elimination (RFE)
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create model and RFE
model = RandomForestClassifier(n_estimators=100)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)

# Get selected features
selected = X.columns[rfe.support_]
print(f"Selected features: {list(selected)}")

# Feature ranking (1 = selected)
ranking = pd.DataFrame({
    'feature': X.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')
```
### RFE with Cross-Validation
```python
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    scoring='accuracy'
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")

# Plot CV score against the number of features
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of features')
plt.ylabel('CV Score')
```
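Once fitted, the selector also exposes which features survived at the optimum. A small follow-up sketch using the same `rfecv` object:

```python
# Boolean mask of the features RFECV kept at the optimal subset size
selected = X.columns[rfecv.support_]
print(f"Selected features: {list(selected)}")

# Reduce X to that subset
X_selected = rfecv.transform(X)
```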
## Method 3: Embedded Methods (During Training)
### L1 Regularization (Lasso)
L1 drives some coefficients to exactly zero:
```python
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

# For regression
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)

# For classification
lr = LogisticRegression(penalty='l1', solver='saga', C=1.0)
lr.fit(X, y)

# Select non-zero features
selector = SelectFromModel(lr, prefit=True)
X_selected = selector.transform(X)
```
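To see the sparsity in action, you can count how many coefficients were driven to exactly zero; a quick sketch using the `lasso` and `lr` objects fitted above:

```python
import numpy as np

# Features the Lasso kept (non-zero coefficients)
print(f"Lasso kept {np.sum(lasso.coef_ != 0)} of {len(lasso.coef_)} features")

# Same idea for L1 logistic regression (coef_ is 2D: one row per class)
print(f"LogisticRegression kept {np.sum(np.any(lr.coef_ != 0, axis=0))} features")
```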
### Tree-Based Feature Importance
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

# Select important features
selector = SelectFromModel(rf, threshold='median')
X_selected = selector.fit_transform(X, y)

# Or manually select top features
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

top_features = importance.head(10)['feature'].tolist()
```
## Quick Comparison
| Method | Speed | Considers Model | Best For |
|--------|-------|-----------------|----------|
| Variance | Fast | No | Quick filtering |
| Correlation | Fast | No | Removing redundancy |
| SelectKBest | Fast | No | Simple selection |
| RFE | Slow | Yes | Optimal subset |
| L1/Lasso | Medium | Yes | Sparse models |
| Tree Importance | Medium | Yes | Tree models |
## Practical Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(f_classif, k=20)),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
```
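Because the selector lives inside the pipeline, the number of features can also be tuned with cross-validation instead of being fixed at 20. A sketch using `GridSearchCV` (the candidate values for `k` are arbitrary):

```python
from sklearn.model_selection import GridSearchCV

# Search over the number of features kept by the 'selector' step
param_grid = {'selector__k': [10, 20, 30, 50]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['selector__k']}")
```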
## Key Takeaway
Start with filter methods (fast, remove obvious noise), then use embedded methods (L1 or tree importance) for automatic selection. Use wrapper methods (RFE) when you need the optimal subset and have time. Always validate that feature selection actually improves your cross-validation score - sometimes all features are useful!
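One way to run that check, a minimal sketch comparing cross-validated scores with and without a selection step (keeping selection inside a pipeline so it is refit on each fold and doesn't leak information):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Baseline: all features
baseline = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)

# With selection: keep the top 20 features, chosen inside each CV fold
with_selection = cross_val_score(
    Pipeline([
        ('selector', SelectKBest(f_classif, k=20)),
        ('classifier', RandomForestClassifier(n_estimators=100)),
    ]),
    X, y, cv=5
)

print(f"All features: {baseline.mean():.3f}, selected: {with_selection.mean():.3f}")
```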