Feature Engineering: The Art of Creating Better Features
Learn practical feature engineering techniques that can dramatically improve your model performance.
Good features can make a simple model outperform a complex one with bad features. Feature engineering is often the difference between a working model and a great one.
What is Feature Engineering?
Feature engineering is the process of creating new features from existing data so that models can learn better.
Raw data → Transformed features → Better predictions
Numerical Feature Techniques
1. Binning
Convert continuous values into categories:

```python
import pandas as pd

# Age bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 65, 100],
    labels=['child', 'young', 'middle', 'senior', 'elderly'],
)
```
2. Log Transform
Handles skewed data:
```python
import numpy as np

# For right-skewed data (income, prices)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros
```
3. Polynomial Features
Capture non-linear relationships:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# For two inputs, creates: x1, x2, x1², x1·x2, x2²
```
Date/Time Features
Dates are goldmines for features:
```python
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
```
Text Features
```python
# Length features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Contains patterns ('?' is a regex metacharacter, so escape it)
df['has_question'] = df['text'].str.contains(r'\?').astype(int)
df['has_url'] = df['text'].str.contains('http', regex=False).astype(int)
```
Aggregation Features
For grouped data (like customer transactions):
```python
# Customer-level features from transactions
customer_features = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'max', 'count'],
    'date': ['min', 'max'],
}).reset_index()

# Flatten the MultiIndex column names
customer_features.columns = ['customer_id', 'total_spent', 'avg_spent',
                             'max_purchase', 'num_purchases',
                             'first_purchase', 'last_purchase']
```
Interaction Features
Combine features that might have joint effects:
```python
# Ratios
df['price_per_sqft'] = df['price'] / df['sqft']

# Products
df['total_value'] = df['quantity'] * df['unit_price']

# Combinations
df['age_income'] = df['age'] * df['income']
```
Domain-Specific Features
E-commerce:
- Days since last purchase
- Average order value
- Purchase frequency
Finance:
- Debt-to-income ratio
- Moving averages
- Volatility measures
Healthcare:
- BMI from height/weight
- Age at diagnosis
- Time between visits
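As a concrete sketch, the e-commerce bullets above can all be computed with a single groupby. The table and names here (`orders`, `order_date`, the snapshot date) are invented purely for illustration:

```python
import pandas as pd

# Hypothetical transactions table
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount': [20.0, 40.0, 10.0, 10.0, 40.0],
    'order_date': pd.to_datetime(['2024-01-01', '2024-03-01',
                                  '2024-01-15', '2024-02-15', '2024-03-15']),
})

# Fixed reference date for "days since last purchase"
snapshot = pd.Timestamp('2024-04-01')

features = orders.groupby('customer_id').agg(
    last_order=('order_date', 'max'),
    avg_order_value=('amount', 'mean'),   # average order value
    num_orders=('order_date', 'count'),   # purchase frequency
)
features['days_since_last_purchase'] = (snapshot - features['last_order']).dt.days
```

Using a fixed snapshot date instead of `pd.Timestamp.now()` keeps recency features reproducible and avoids leaking "future" information when building training sets.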
Feature Selection After Engineering
More features isn't always better. Remove:
- Features with zero variance
- Highly correlated features (keep one)
- Features with little predictive power
```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Remove near-zero-variance features
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Find highly correlated pairs and drop one column from each
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```
Key Takeaway
Feature engineering is often more valuable than trying fancier models. Understand your data and domain, create features that capture meaningful patterns, and always validate that new features actually help. The best features often come from domain knowledge, not just mathematical transformations.
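One way to validate that a new feature actually helps is to compare cross-validated scores with and without it. A rough sketch on synthetic data (the target here is deliberately driven by a ratio, so the engineered feature should win; everything is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
data = pd.DataFrame({'price': rng.uniform(1e5, 5e5, 200),
                     'sqft': rng.uniform(500, 3000, 200)})
# Synthetic target that depends on the price/sqft ratio
y = data['price'] / data['sqft'] + rng.normal(0, 0.5, 200)

# Score without the engineered feature
baseline = cross_val_score(LinearRegression(), data[['price', 'sqft']], y, cv=5).mean()

# Score with it
data['price_per_sqft'] = data['price'] / data['sqft']
with_feature = cross_val_score(LinearRegression(), data, y, cv=5).mean()

# Keep the feature only if with_feature beats baseline
```

The same with/without comparison works for any model and metric; just be sure the feature is computed inside the cross-validation folds if it uses target information.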