# Feature Engineering: The Art of Creating Better Features
Learn practical feature engineering techniques that can dramatically improve your model performance.
Good features can make a simple model outperform a complex one with bad features. Feature engineering is often the difference between a working model and a great one.
## What is Feature Engineering?
Creating new features from existing data to help models learn better.
Raw data → Transformed features → Better predictions
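As a tiny sketch of that pipeline (the data and column names here are made up for illustration), a single raw timestamp can be expanded into model-ready numeric columns:

```python
import pandas as pd

# Toy data: raw signup timestamps
df = pd.DataFrame({"signup": pd.to_datetime(["2024-01-06 09:15", "2024-01-08 22:40"])})

# Transformed features a model can actually use
df["signup_hour"] = df["signup"].dt.hour
df["signup_is_weekend"] = (df["signup"].dt.dayofweek >= 5).astype(int)

print(df[["signup_hour", "signup_is_weekend"]])
```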
## Numerical Feature Techniques
### 1. Binning

Convert continuous values to categorical:
```python
# Age bins
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 35, 50, 65, 100],
                         labels=['child', 'young', 'middle', 'senior', 'elderly'])
```
### 2. Log Transform

Handles skewed data:
```python
import numpy as np

# For right-skewed data (income, prices)
df['log_income'] = np.log1p(df['income'])  # log1p handles zeros
```
### 3. Polynomial Features

Capture non-linear relationships:
```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # Creates: x1, x2, x1², x1*x2, x2²
```
## Date/Time Features
Dates are goldmines for features:
```python
df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

# Time since event
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
```
## Text Features
```python
# Length features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()

# Contains patterns
df['has_question'] = df['text'].str.contains(r'\?').astype(int)
df['has_url'] = df['text'].str.contains('http').astype(int)
```
## Aggregation Features
For grouped data (like customer transactions):
```python
# Customer-level features from transactions
customer_features = df.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'max', 'count'],
    'date': ['min', 'max']
}).reset_index()

# Flatten column names
customer_features.columns = ['customer_id', 'total_spent', 'avg_spent',
                             'max_purchase', 'num_purchases',
                             'first_purchase', 'last_purchase']
```
## Interaction Features
Combine features that might have joint effects:
```python
# Ratios
df['price_per_sqft'] = df['price'] / df['sqft']

# Products
df['total_value'] = df['quantity'] * df['unit_price']

# Combinations
df['age_income'] = df['age'] * df['income']
```
## Domain-Specific Features
**E-commerce:**
- Days since last purchase
- Average order value
- Purchase frequency

**Finance:**
- Debt-to-income ratio
- Moving averages
- Volatility measures

**Healthcare:**
- BMI from height/weight
- Age at diagnosis
- Time between visits
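Two of these domain features can be computed in one vectorized line each. The column names below are hypothetical; the formulas (BMI = weight in kg divided by height in metres squared, debt-to-income = monthly debt over monthly income) are standard:

```python
import pandas as pd

# Hypothetical records; column names are illustrative only
df = pd.DataFrame({
    "weight_kg": [70.0, 95.0],
    "height_m": [1.75, 1.80],
    "monthly_debt": [500.0, 1200.0],
    "monthly_income": [4000.0, 3000.0],
})

# Healthcare: BMI = weight (kg) / height (m)^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Finance: debt-to-income ratio
df["dti"] = df["monthly_debt"] / df["monthly_income"]
```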
## Feature Selection After Engineering
More features isn't always better. Remove:
- Features with zero variance
- Highly correlated features (keep one)
- Features with little predictive power
```python
# Remove low variance
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Check correlations
corr_matrix = df.corr()
high_corr = (corr_matrix.abs() > 0.95) & (corr_matrix != 1.0)
```
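One common way to act on such a correlation check (a sketch, using made-up data where column `b` is a perfect copy of `a` scaled by two) is to scan only the upper triangle of the matrix, so each pair is counted once, and drop one column from each highly correlated pair:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is perfectly correlated with 'a'
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})

corr = df.corr().abs()
# Keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```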
## Key Takeaway
Feature engineering is often more valuable than trying fancier models. Understand your data and domain, create features that capture meaningful patterns, and always validate that new features actually help. The best features often come from domain knowledge, not just mathematical transformations.