
Feature Engineering: The Art of Creating Better Features

Learn practical feature engineering techniques that can dramatically improve your model performance.

Sarah Chen
December 19, 2025


Good features can make a simple model outperform a complex one with bad features. Feature engineering is often the difference between a working model and a great one.

What is Feature Engineering?

Feature engineering means creating new features from existing data so your model can learn patterns more easily.

Raw data → Transformed features → Better predictions

Numerical Feature Techniques

1. Binning

Convert a continuous variable into categorical ranges:

# Age bins
df['age_group'] = pd.cut(df['age'], 
    bins=[0, 18, 35, 50, 65, 100],
    labels=['child', 'young', 'middle', 'senior', 'elderly'])
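
When there are no natural cut points, pd.qcut chooses quantile-based edges so each bin holds roughly the same number of rows (the income column here is illustrative):

# Quartile bins: roughly equal row counts per bin
df['income_quartile'] = pd.qcut(df['income'], q=4,
    labels=['low', 'mid_low', 'mid_high', 'high'])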

2. Log Transform

Compresses right-skewed distributions so a few extreme values don't dominate:

import numpy as np

# For right-skewed data (income, prices)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros
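
If the target itself was log-transformed, remember to invert model output before reporting it. A tiny sketch with made-up numbers:

# expm1 is the exact inverse of log1p
log_preds = np.array([9.2, 10.1, 11.5])  # hypothetical predictions on the log scale
income_preds = np.expm1(log_preds)       # back on the original scale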

3. Polynomial Features

Capture non-linear relationships:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# Creates: x1, x2, x1², x1·x2, x2²
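
Feature counts grow fast with the degree, so it's worth checking what was actually created. get_feature_names_out (scikit-learn 1.0+) maps the output columns back to expressions:

print(poly.get_feature_names_out(['x1', 'x2']))
# ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']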

Date/Time Features

Dates are goldmines for features:

df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
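
One caveat: a plain integer month tells the model that December (12) and January (1) are far apart. A common fix, sketched here rather than part of the snippet above, is cyclical sin/cos encoding:

import numpy as np

# Map month onto a circle so December and January end up adjacent
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)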

Text Features

# Length features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Contains patterns (raw string escapes the regex '?'; na=False guards missing text)
df['has_question'] = df['text'].str.contains(r'\?', na=False).astype(int)
df['has_url'] = df['text'].str.contains('http', na=False).astype(int)

Aggregation Features

For grouped data (like customer transactions):

# Customer-level features from transactions
customer_features = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'max', 'count'],
    'date': ['min', 'max']
}).reset_index()

# Flatten column names
customer_features.columns = ['customer_id', 'total_spent', 'avg_spent', 
                             'max_purchase', 'num_purchases', 
                             'first_purchase', 'last_purchase']
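
The two purchase dates combine into a tenure feature, and the customer-level rows can then be joined back onto the transactions:

# Tenure: span between first and last purchase
customer_features['tenure_days'] = (
    customer_features['last_purchase'] - customer_features['first_purchase']
).dt.days

# Attach the aggregates to each transaction row
df = df.merge(customer_features, on='customer_id', how='left')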

Interaction Features

Combine features that might have joint effects:

# Ratios
df['price_per_sqft'] = df['price'] / df['sqft']

# Products
df['total_value'] = df['quantity'] * df['unit_price']

# Combinations
df['age_income'] = df['age'] * df['income']
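
Interactions aren't only numeric: crossing two categoricals into one joint column can help too (the column names here are illustrative):

# Categorical cross: one column encoding the joint value
df['region_segment'] = df['region'].astype(str) + '_' + df['segment'].astype(str)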

Domain-Specific Features

E-commerce:

  • Days since last purchase
  • Average order value
  • Purchase frequency

Finance:

  • Debt-to-income ratio
  • Moving averages
  • Volatility measures

Healthcare:

  • BMI from height/weight
  • Age at diagnosis
  • Time between visits
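
As a quick sketch, here's what two of these look like in code (column names are hypothetical):

import pandas as pd

# E-commerce: recency, from a 'last_purchase' datetime column
df['days_since_last_purchase'] = (pd.Timestamp.now() - df['last_purchase']).dt.days

# Healthcare: BMI from weight in kg and height in meters
df['bmi'] = df['weight_kg'] / df['height_m'] ** 2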

Feature Selection After Engineering

More features aren't always better. Remove:

  • Features with zero variance
  • Highly correlated features (keep one)
  • Features with little predictive power

# Remove low variance
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Check correlations
corr_matrix = df.corr(numeric_only=True)  # skip non-numeric columns
high_corr = (corr_matrix.abs() > 0.95) & (corr_matrix != 1.0)
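
That mask flags pairs but doesn't pick a survivor. One common recipe, using the same 0.95 threshold, scans the upper triangle so each pair is counted once and drops one column per pair:

import numpy as np

# Upper triangle only, so each correlated pair appears exactly once
upper = corr_matrix.abs().where(
    np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)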

Key Takeaway

Feature engineering is often more valuable than trying fancier models. Understand your data and domain, create features that capture meaningful patterns, and always validate that new features actually help. The best features often come from domain knowledge, not just mathematical transformations.

#Machine Learning#Feature Engineering#Data Preprocessing#Intermediate