# Handling Missing Data in Machine Learning
Learn practical techniques to handle missing values in your datasets effectively.
Real-world data is messy. Missing values are everywhere. Here's how to deal with them.
## Why Data Goes Missing
- User didn't fill a form field
- Sensor malfunction
- Data corruption
- Feature didn't exist before a certain date
- Intentionally not collected
## First: Understand Your Missing Data
```python
import pandas as pd
import seaborn as sns

# Count missing values per column
df.isnull().sum()

# Percentage missing per column
(df.isnull().sum() / len(df)) * 100

# Visualize the missingness pattern
sns.heatmap(df.isnull(), cbar=True)
```
## Types of Missingness
### MCAR (Missing Completely At Random)

No pattern, just random chance. Example: survey responses lost in the mail.

### MAR (Missing At Random)

Missingness depends on OTHER observed variables. Example: men are less likely to report weight (but you know gender).

### MNAR (Missing Not At Random)

Missingness depends on the MISSING value itself. Example: high earners don't report income.
**Why it matters:** MNAR is hardest to handle correctly.
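You can't prove which mechanism you have from the data alone, but you can look for patterns. A minimal sketch (the `weight` and `gender` columns and the toy values are hypothetical) that checks whether missingness in one column varies with another observed column, which would hint at MAR:

```python
import pandas as pd

# Hypothetical data: 'weight' is sometimes missing, 'gender' is fully observed
df = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "weight": [None, 80.0, None, 62.0, 58.0, 65.0],
})

# Missing rate of 'weight' within each gender group;
# a large gap between groups suggests MAR rather than MCAR
missing_rate = df["weight"].isnull().groupby(df["gender"]).mean()
print(missing_rate)
```

A roughly equal missing rate across groups is consistent with MCAR; a large gap means you should not treat the data as missing completely at random.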
## Strategy 1: Delete
### Delete Rows

```python
df_clean = df.dropna()                             # Remove any row with missing values
df_clean = df.dropna(subset=['important_column'])  # Only check specific columns
```
**Use when:**

- Very few missing values (<5%)
- Large dataset (you can afford to lose rows)
- Data is MCAR
**Don't use when:**

- Many missing values
- Missingness has a pattern
### Delete Columns

```python
# Remove columns with >50% missing values
# (thresh is the minimum number of non-missing values required, as an integer)
threshold = 0.5
df_clean = df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)))
```
**Use when:**

- Column is mostly missing
- Column isn't important
## Strategy 2: Impute (Fill In)
### Simple Imputation
```python
from sklearn.impute import SimpleImputer

# Numerical: fill with mean or median
imputer = SimpleImputer(strategy='mean')  # or 'median'
X_imputed = imputer.fit_transform(X)

# Categorical: fill with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
```
**Mean vs. Median:**

- Mean: good for roughly normal distributions
- Median: more robust to outliers
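A quick illustration of why the median is the safer fill value when outliers are present (the toy numbers are illustrative, not from the text):

```python
import pandas as pd

# Five values, one of which is an extreme outlier
s = pd.Series([10, 12, 11, 13, 500])

print(s.mean())    # → 109.2, pulled far up by the outlier
print(s.median())  # → 12.0, stays near the typical values
```

Filling missing entries with 109.2 would badly misrepresent a column whose typical value is near 12.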
### Constant Value

```python
imputer = SimpleImputer(strategy='constant', fill_value=0)
```
### Pandas Methods

```python
# Forward fill (use the previous value; fillna(method='ffill') is deprecated)
df['col'] = df['col'].ffill()

# Backward fill (use the next value)
df['col'] = df['col'].bfill()

# Fill with a specific value
df['col'] = df['col'].fillna(df['col'].median())
```
## Strategy 3: Advanced Imputation
### KNN Imputation

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
Each missing value is estimated from the k most similar samples (nearest neighbors measured on the observed features).
### Iterative Imputation (MICE)

```python
# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)
```
Models each feature as a function of other features. Iterates until convergence.
## Strategy 4: Add a Missing Indicator
Sometimes "missing" is information itself!
```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Combine imputation with binary "was missing" indicator columns
transformer = FeatureUnion([
    ('imputer', SimpleImputer(strategy='mean')),
    ('indicator', MissingIndicator()),
])

X_transformed = transformer.fit_transform(X)
# Now includes original features (imputed) + binary missing indicators
```
## Practical Guidelines
| Missing % | Recommendation                  |
|-----------|---------------------------------|
| <5%       | Simple imputation or deletion   |
| 5-25%     | Imputation (consider indicator) |
| 25-50%    | Advanced imputation + indicator |
| >50%      | Consider dropping the column    |
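The table above is easy to automate as a per-column triage step. A minimal sketch (the function name and the exact cutoffs are illustrative, taken straight from the table):

```python
import pandas as pd

def recommend_strategy(df: pd.DataFrame) -> pd.Series:
    """Map each column's missing percentage to a guideline label."""
    pct = df.isnull().mean() * 100  # percent missing per column
    bins = [-1, 5, 25, 50, 100]
    labels = [
        "simple imputation or deletion",
        "imputation (consider indicator)",
        "advanced imputation + indicator",
        "consider dropping column",
    ]
    return pd.cut(pct, bins=bins, labels=labels)

# Toy frame: 'a' has 0% missing, 'b' has 75% missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, None, None, None],
})
print(recommend_strategy(df))
```

Treat the output as a starting point for inspection, not an automatic decision, since the cutoffs are rules of thumb.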
## Important: Fit on Train Only!
```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # Fit on train only
X_test_imputed = imputer.transform(X_test)        # Transform test with train statistics
```
Don't let test data influence imputation values!
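The same leakage rule applies inside cross-validation. Putting the imputer in a `Pipeline` lets scikit-learn refit it on each training fold automatically, so no fold's held-out data ever influences the fill values. A sketch with toy data (the synthetic dataset, missing rate, and model choice are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data with ~10% of values knocked out at random
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is refit on each fold's training portion only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Imputing the full dataset before splitting would quietly leak test-fold statistics into training, inflating the scores.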
## Quick Decision Guide
```
Missing values?
│
├─ <5%? → Delete rows or simple imputation
│
├─ Numerical?
│   ├─ Has outliers? → Median
│   └─ No outliers?  → Mean or KNN
│
└─ Categorical?
    └─ Most frequent or new category "Missing"
```
## Key Takeaway
There's no perfect solution. The best approach depends on:

- How much data is missing
- Why it's missing
- Your model's sensitivity to imputation
Always document your choices and test different approaches!
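"Test different approaches" can be as literal as scoring each candidate imputer in the same cross-validated pipeline and comparing the results (the synthetic data and model choice here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data with ~10% missing values
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

# Score each imputation strategy under identical conditions
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
results = {}
for name, imputer in imputers.items():
    pipe = Pipeline([
        ("impute", imputer),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()
print(results)
```

Whichever strategy wins on your data, record it alongside the scores so the choice is reproducible.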