# Feature Selection Techniques
Learn how to select the most important features to improve model performance and reduce complexity.
More features isn't always better. Feature selection removes noise, reduces overfitting, speeds up training, and improves interpretability.
## Why Select Features?
**Problems with too many features:**

- Curse of dimensionality
- Overfitting
- Slower training
- Harder to interpret
## Method 1: Filter Methods (Before Training)
Select features based on statistical tests, independent of any model.
### Variance Threshold
Remove features with low variance (near-constant):
```python
from sklearn.feature_selection import VarianceThreshold

# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# See which features were kept
kept_features = X.columns[selector.get_support()]
print(f"Kept {len(kept_features)} of {X.shape[1]} features")
```
### Correlation with Target
```python
import pandas as pd
import numpy as np

# For classification
from sklearn.feature_selection import f_classif, chi2

# Calculate F-scores
f_scores, p_values = f_classif(X, y)

# Rank features
feature_scores = pd.DataFrame({
    'feature': X.columns,
    'f_score': f_scores,
    'p_value': p_values
}).sort_values('f_score', ascending=False)

# Select top K
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
```
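The `chi2` import above is the alternative scoring function for non-negative features such as counts, frequencies, or one-hot columns. A minimal sketch, assuming `X` contains only non-negative values:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (e.g., counts or one-hot encodings)
selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Features kept by the chi-squared test
kept = X.columns[selector.get_support()]
```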
### Remove Correlated Features
```python
def remove_correlated(df, threshold=0.95):
    corr_matrix = df.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # Find columns with correlation above threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return df.drop(columns=to_drop)

X_reduced = remove_correlated(X)
```
## Method 2: Wrapper Methods (Use Model Performance)
### Recursive Feature Elimination (RFE)
```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create model and RFE
model = RandomForestClassifier(n_estimators=100)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)

# Get selected features
selected = X.columns[rfe.support_]
print(f"Selected features: {list(selected)}")

# Feature ranking (1 = selected)
ranking = pd.DataFrame({
    'feature': X.columns,
    'ranking': rfe.ranking_
}).sort_values('ranking')
```
### RFE with Cross-Validation
```python
from sklearn.feature_selection import RFECV
import matplotlib.pyplot as plt

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    scoring='accuracy'
)
rfecv.fit(X, y)

print(f"Optimal number of features: {rfecv.n_features_}")

# Plot CV score against the number of features
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1),
         rfecv.cv_results_['mean_test_score'])
plt.xlabel('Number of features')
plt.ylabel('CV Score')
```
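Once fitted, the selector also exposes which features survived at the optimum. A small follow-up sketch using the same `rfecv` object:

```python
# Boolean mask of the features RFECV kept at the optimal subset size
selected = X.columns[rfecv.support_]
print(f"Selected features: {list(selected)}")

# Reduce X to that subset
X_selected = rfecv.transform(X)
```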
## Method 3: Embedded Methods (During Training)
### L1 Regularization (Lasso)
L1 drives some coefficients to exactly zero:
```python
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel

# For regression
lasso = Lasso(alpha=0.01)
lasso.fit(X, y)

# For classification
lr = LogisticRegression(penalty='l1', solver='saga', C=1.0)
lr.fit(X, y)

# Select non-zero features
selector = SelectFromModel(lr, prefit=True)
X_selected = selector.transform(X)
```
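To see the sparsity in action, you can count how many coefficients were driven to exactly zero; a quick sketch using the `lasso` and `lr` objects fitted above:

```python
import numpy as np

# Features the Lasso kept (non-zero coefficients)
print(f"Lasso kept {np.sum(lasso.coef_ != 0)} of {len(lasso.coef_)} features")

# Same idea for L1 logistic regression (coef_ is 2D: one row per class)
print(f"LogisticRegression kept {np.sum(np.any(lr.coef_ != 0, axis=0))} features")
```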
### Tree-Based Feature Importance
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Train model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

# Select important features
selector = SelectFromModel(rf, threshold='median')
X_selected = selector.fit_transform(X, y)

# Or manually select top features
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

top_features = importance.head(10)['feature'].tolist()
```
## Quick Comparison
| Method | Speed | Considers Model | Best For |
|--------|-------|-----------------|----------|
| Variance | Fast | No | Quick filtering |
| Correlation | Fast | No | Removing redundancy |
| SelectKBest | Fast | No | Simple selection |
| RFE | Slow | Yes | Optimal subset |
| L1/Lasso | Medium | Yes | Sparse models |
| Tree Importance | Medium | Yes | Tree models |
## Practical Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(f_classif, k=20)),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
```
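Because the selector lives inside the pipeline, the number of features can also be tuned with cross-validation instead of being fixed at 20. A sketch using `GridSearchCV` (the candidate values for `k` are arbitrary):

```python
from sklearn.model_selection import GridSearchCV

# Search over the number of features kept by the 'selector' step
param_grid = {'selector__k': [10, 20, 30, 50]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"Best k: {search.best_params_['selector__k']}")
```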
## Key Takeaway
Start with filter methods (fast, remove obvious noise), then use embedded methods (L1 or tree importance) for automatic selection. Use wrapper methods (RFE) when you need the optimal subset and have time. Always validate that feature selection actually improves your cross-validation score - sometimes all features are useful!
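One way to run that check, a minimal sketch comparing cross-validated scores with and without a selection step (keeping selection inside a pipeline so it is refit on each fold and doesn't leak information):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Baseline: all features
baseline = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)

# With selection: keep the top 20 features, chosen inside each CV fold
with_selection = cross_val_score(
    Pipeline([
        ('selector', SelectKBest(f_classif, k=20)),
        ('classifier', RandomForestClassifier(n_estimators=100)),
    ]),
    X, y, cv=5
)

print(f"All features: {baseline.mean():.3f}, selected: {with_selection.mean():.3f}")
```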