ML · 8 min read

Feature Selection Techniques

Learn how to select the most important features to improve model performance and reduce complexity.

Sarah Chen
December 19, 2025

More features isn't always better. Feature selection removes noise, reduces overfitting, speeds up training, and improves interpretability.

Why Select Features?

Problems with too many features (see the sketch after this list):

  • Curse of dimensionality
  • Overfitting
  • Slower training
  • Harder to interpret
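
To make the overfitting point concrete, here is a minimal sketch on synthetic data (X_demo, y_demo, and the noise columns are all hypothetical): the same classifier is cross-validated on a handful of informative features, then again with hundreds of pure-noise columns appended.

# Sketch: appending pure-noise features typically lowers cross-validated accuracy.
# All data here is synthetic and for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=10, n_redundant=0,
                                     random_state=0)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X_demo, rng.normal(size=(300, 500))])  # add 500 noise columns

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("10 informative features:", cross_val_score(clf, X_demo, y_demo, cv=5).mean())
print("plus 500 noise features:", cross_val_score(clf, X_noisy, y_demo, cv=5).mean())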

Method 1: Filter Methods (Before Training)

Select features based on statistical tests, independent of any model.

Variance Threshold

Remove features with low variance (near-constant):

from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# See which features were kept
kept_features = X.columns[selector.get_support()]
print(f"Kept {len(kept_features)} of {X.shape[1]} features")

Correlation with Target

Score each feature's association with the target using a univariate test (here the ANOVA F-test) and keep the top scorers:

import pandas as pd
import numpy as np

# For classification: ANOVA F-test (chi2 is an alternative for non-negative features)
from sklearn.feature_selection import f_classif, chi2

# Calculate F-scores
f_scores, p_values = f_classif(X, y)

# Rank features
feature_scores = pd.DataFrame({
    'feature': X.columns,
    'f_score': f_scores,
    'p_value': p_values
}).sort_values('f_score', ascending=False)

# Select top K
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

Remove Correlated Features

def remove_correlated(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # Find columns with correlation above threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)

X_reduced = remove_correlated(X)

Method 2: Wrapper Methods (Use Model Performance)

Train the model repeatedly and score candidate feature subsets, keeping the subset that performs best.

Recursive Feature Elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create model and RFE
model = RandomForestClassifier(n_estimators=100)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)

# Get selected features
selected = X.columns[rfe.support_]
print(f"Selected features: {list(selected)}")

# Feature ranking (1 = selected)
ranking = pd.DataFrame({
    'feature': X.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')

RFE with Cross-Validation

from sklearn.feature_selection import RFECV

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    scoring='accuracy'
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")

# Plot CV score against the number of features
import matplotlib.pyplot as plt

plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of features')
plt.ylabel('CV Score')
plt.show()

Method 3: Embedded Methods (During Training)

The model selects features as a side effect of training, through regularization or built-in importances.

L1 Regularization (Lasso)

L1 drives some coefficients to exactly zero:

from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

# For regression (scale features first; L1 penalties are scale-sensitive)
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)

# For classification (saga supports the L1 penalty; raise max_iter so it converges)
lr = LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=5000)
lr.fit(X, y)

# Select non-zero features
selector = SelectFromModel(lr, prefit=True)
X_selected = selector.transform(X)
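
How many features survive depends on the regularization strength. A minimal sketch on synthetic data (X_demo and y_demo are hypothetical, not the article's dataset) showing that a smaller C, i.e. a stronger L1 penalty, zeroes out more coefficients:

# Sketch: stronger L1 regularization (smaller C) leaves fewer non-zero coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=30,
                                     n_informative=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

for C in [0.01, 0.1, 1.0]:
    lr_demo = LogisticRegression(penalty='l1', solver='saga', C=C, max_iter=5000)
    lr_demo.fit(X_demo, y_demo)
    n_kept = int(np.sum(np.abs(lr_demo.coef_) > 1e-6))
    print(f"C={C}: {n_kept} non-zero coefficients")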

Tree-Based Feature Importance

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

# Select important features
selector = SelectFromModel(rf, threshold='median')
X_selected = selector.fit_transform(X, y)

# Or manually select top features
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

top_features = importance.head(10)['feature'].tolist()

Quick Comparison

Method            Speed    Considers Model?   Best For
Variance          Fast     No                 Quick filtering
Correlation       Fast     No                 Removing redundancy
SelectKBest       Fast     No                 Simple selection
RFE               Slow     Yes                Optimal subset
L1/Lasso          Medium   Yes                Sparse models
Tree importance   Medium   Yes                Tree models
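
If you are unsure which family suits your data, a quick cross-validated comparison settles it empirically. A minimal sketch (synthetic X_demo/y_demo; the selector settings are illustrative, not tuned):

# Sketch: compare one selector from each family by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=50,
                                     n_informative=10, random_state=42)

selectors = {
    'SelectKBest (filter)': SelectKBest(f_classif, k=10),
    'RFE (wrapper)': RFE(RandomForestClassifier(n_estimators=50, random_state=42),
                         n_features_to_select=10, step=5),
    'L1 (embedded)': SelectFromModel(
        LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=5000)),
}

for name, selector in selectors.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('selector', selector),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")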

Practical Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(f_classif, k=20)),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
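
Rather than hard-coding k=20, you can let cross-validation choose it. A minimal sketch, assuming the pipeline above and your own X_train/y_train split (keep the candidate k values no larger than your feature count):

# Sketch: tune the number of selected features with GridSearchCV.
from sklearn.model_selection import GridSearchCV

param_grid = {'selector__k': [5, 10, 20, 40]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['selector__k']}")
print(f"Best CV accuracy: {search.best_score_:.3f}")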

Key Takeaway

Start with filter methods (fast, remove obvious noise), then use embedded methods (L1 or tree importance) for automatic selection. Use wrapper methods (RFE) when you need the optimal subset and have the time. Always validate that feature selection actually improves your cross-validation score; sometimes all features are useful!
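
That sanity check is a one-liner per variant. A minimal sketch, assuming X and y are the same (undeclared) feature matrix and target used throughout the article:

# Sketch: confirm that selecting features actually beats using all of them.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

all_features = RandomForestClassifier(n_estimators=100, random_state=42)
top_k_only = Pipeline([
    ('selector', SelectKBest(f_classif, k=10)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42)),
])

print("All features:", cross_val_score(all_features, X, y, cv=5).mean())
print("Top 10 only: ", cross_val_score(top_k_only, X, y, cv=5).mean())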

#Machine Learning  #Feature Selection  #Dimensionality Reduction  #Intermediate