# Feature Engineering: The Art of Creating Better Features
Learn practical feature engineering techniques that can dramatically improve your model performance.
Good features can make a simple model outperform a complex one with bad features. Feature engineering is often the difference between a working model and a great one.
## What is Feature Engineering?
Creating new features from existing data to help models learn better.
Raw data → Transformed features → Better predictions
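As a tiny sketch of that pipeline (the data and column names here are made up for illustration), a single raw timestamp can be expanded into model-ready numeric columns:

```python
import pandas as pd

# Toy data: raw signup timestamps
df = pd.DataFrame({"signup": pd.to_datetime(["2024-01-06 09:15", "2024-01-08 22:40"])})

# Transformed features a model can actually use
df["signup_hour"] = df["signup"].dt.hour
df["signup_is_weekend"] = (df["signup"].dt.dayofweek >= 5).astype(int)

print(df[["signup_hour", "signup_is_weekend"]])
```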
## Numerical Feature Techniques
### 1. Binning

Convert continuous values to categorical:
```python
# Age bins
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 65, 100],
                         labels=['child', 'young', 'middle', 'senior', 'elderly'])
```
### 2. Log Transform

Handles skewed data:
```python
import numpy as np

# For right-skewed data (income, prices)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros
```
### 3. Polynomial Features

Capture non-linear relationships:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # Creates: x1, x2, x1², x1*x2, x2²
```
## Date/Time Features
Dates are goldmines for features:
```python
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
```
## Text Features
```python
# Length features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Contains patterns
df['has_question'] = df['text'].str.contains(r'\?').astype(int)
df['has_url'] = df['text'].str.contains('http').astype(int)
```
## Aggregation Features
For grouped data (like customer transactions):
```python
# Customer-level features from transactions
customer_features = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'max', 'count'],
    'date': ['min', 'max']
}).reset_index()

# Flatten column names
customer_features.columns = ['customer_id', 'total_spent', 'avg_spent',
                             'max_purchase', 'num_purchases',
                             'first_purchase', 'last_purchase']
```
## Interaction Features
Combine features that might have joint effects:
```python
# Ratios
df['price_per_sqft'] = df['price'] / df['sqft']

# Products
df['total_value'] = df['quantity'] * df['unit_price']

# Combinations
df['age_income'] = df['age'] * df['income']
```
## Domain-Specific Features
**E-commerce:**
- Days since last purchase
- Average order value
- Purchase frequency

**Finance:**
- Debt-to-income ratio
- Moving averages
- Volatility measures

**Healthcare:**
- BMI from height/weight
- Age at diagnosis
- Time between visits
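Two of these domain features can be computed in one vectorized line each. The column names below are hypothetical; the formulas (BMI = weight in kg divided by height in metres squared, debt-to-income = monthly debt over monthly income) are standard:

```python
import pandas as pd

# Hypothetical records; column names are illustrative only
df = pd.DataFrame({
    "weight_kg": [70.0, 95.0],
    "height_m": [1.75, 1.80],
    "monthly_debt": [500.0, 1200.0],
    "monthly_income": [4000.0, 3000.0],
})

# Healthcare: BMI = weight (kg) / height (m)^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Finance: debt-to-income ratio
df["dti"] = df["monthly_debt"] / df["monthly_income"]
```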
## Feature Selection After Engineering
More features isn't always better. Remove:
- Features with zero variance
- Highly correlated features (keep one)
- Features with little predictive power
```python
# Remove low variance
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Check correlations
corr_matrix = df.corr()
high_corr = (corr_matrix.abs() > 0.95) & (corr_matrix != 1.0)
```
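One common way to act on such a correlation check (a sketch, using made-up data where column `b` is a perfect copy of `a` scaled by two) is to scan only the upper triangle of the matrix, so each pair is counted once, and drop one column from each highly correlated pair:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is perfectly correlated with 'a'
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

corr = df.corr().abs()
# Keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```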
## Key Takeaway
Feature engineering is often more valuable than trying fancier models. Understand your data and domain, create features that capture meaningful patterns, and always validate that new features actually help. The best features often come from domain knowledge, not just mathematical transformations.