Handling Missing Data in Machine Learning
Learn practical techniques to handle missing values in your datasets effectively.
Real-world data is messy. Missing values are everywhere. Here's how to deal with them.
Why Data Goes Missing
- User didn't fill a form field
- Sensor malfunction
- Data corruption
- Feature didn't exist before a certain date
- Intentionally not collected
First: Understand Your Missing Data
import pandas as pd
# Check missing values
df.isnull().sum()
# Percentage missing per column
(df.isnull().sum() / len(df)) * 100
# Visualize
import seaborn as sns
sns.heatmap(df.isnull(), cbar=True)
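If you want those checks as a single sortable table, a small helper along these lines works (a sketch; `df` is assumed to be your DataFrame):
import pandas as pd
def missing_summary(df):
    # Count and percentage of missing values per column, worst offenders first
    summary = pd.DataFrame({
        'n_missing': df.isnull().sum(),
        'pct_missing': df.isnull().mean() * 100,
    })
    return summary.sort_values('pct_missing', ascending=False)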
Types of Missingness
MCAR (Missing Completely At Random)
No pattern. Random chance.
Example: Survey responses lost in the mail.
MAR (Missing At Random)
Missingness depends on OTHER observed variables.
Example: Men less likely to report weight (but you know gender).
MNAR (Missing Not At Random)
Missingness depends on the MISSING value itself.
Example: High earners don't report income.
Why it matters: MNAR is the hardest to handle correctly, because the missingness itself carries information that the observed columns can't recover.
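If the three definitions feel abstract, here is a small sketch (synthetic data; numpy and pandas assumed) that generates each pattern on an income column:
import numpy as np
import pandas as pd
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'gender': rng.choice(['M', 'F'], size=1000),
    'income': rng.lognormal(mean=10, sigma=0.5, size=1000),
})
# MCAR: every value has the same 10% chance of being dropped
mcar = df['income'].mask(rng.random(len(df)) < 0.10)
# MAR: missingness depends on another observed column (gender)
mar = df['income'].mask((df['gender'] == 'M') & (rng.random(len(df)) < 0.30))
# MNAR: missingness depends on the value itself (high earners hide income)
mnar = df['income'].mask(df['income'] > df['income'].quantile(0.8))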
Strategy 1: Delete
Delete Rows
df_clean = df.dropna() # Remove any row with missing values
df_clean = df.dropna(subset=['important_column']) # Only check specific columns
Use when:
- Very few missing values (<5%)
- Large dataset (can afford to lose rows)
- Data is MCAR
Don't use when:
- Many missing values
- Missingness has a pattern
Delete Columns
# Remove columns with >50% missing
threshold = 0.5
df_clean = df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)))  # thresh = minimum non-missing values a column needs to be kept
Use when:
- Column is mostly missing
- Column isn't important
Strategy 2: Impute (Fill In)
Simple Imputation
from sklearn.impute import SimpleImputer
# Numerical: fill with mean or median
imputer = SimpleImputer(strategy='mean') # or 'median'
X_imputed = imputer.fit_transform(X)
# Categorical: fill with most frequent
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
Mean vs Median:
- Mean: Good for normal distribution
- Median: Better with outliers
Constant Value
imputer = SimpleImputer(strategy='constant', fill_value=0)
Pandas Methods
# Forward fill (propagate the previous value)
df['col'] = df['col'].ffill()
# Backward fill (use the next value)
df['col'] = df['col'].bfill()
# Fill with a specific value
df['col'] = df['col'].fillna(df['col'].median())
Strategy 3: Advanced Imputation
KNN Imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
Uses the K most similar rows (nearest neighbors) to estimate each missing value.
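On a tiny toy matrix (values chosen purely for illustration) you can see the neighbor averaging at work:
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
# The NaN becomes (1 + 3) / 2 = 2.0, the mean of its two nearest rows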
Iterative Imputation (MICE)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)
Models each feature as a function of other features. Iterates until convergence.
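By default each feature is modeled with BayesianRidge, but any sklearn regressor can be plugged in; for example, a missForest-style setup (hyperparameters here are placeholders):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)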
Strategy 4: Add Missing Indicator
Sometimes "missing" is information itself!
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion
# Combine imputation + missing indicator
transformer = FeatureUnion([
    ('imputer', SimpleImputer(strategy='mean')),
    ('indicator', MissingIndicator()),
])
X_transformed = transformer.fit_transform(X)
# Now includes original features (imputed) + binary missing indicators
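If you don't want to wire up a FeatureUnion, the same effect is available directly via the add_indicator flag on SimpleImputer (KNNImputer and IterativeImputer accept it too):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_transformed = imputer.fit_transform(X)  # imputed columns followed by binary missing indicators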
Practical Guidelines
| Missing % | Recommendation |
|---|---|
| <5% | Simple imputation or deletion |
| 5-25% | Imputation (consider indicator) |
| 25-50% | Advanced imputation + indicator |
| >50% | Consider dropping column |
Important: Fit on Train Only!
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train) # Fit on train
X_test_imputed = imputer.transform(X_test) # Transform test
Don't let test data influence imputation values!
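The easiest way to enforce this, especially once cross-validation is involved, is to put the imputer inside a Pipeline so it is refit on training data only (the model choice below is just an example):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)         # imputer fitted on training data only
print(pipe.score(X_test, y_test))  # test data is only transformed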
Quick Decision Guide
Missing values?
│
├─ <5%? → Delete rows or simple imputation
│
├─ Numerical?
│ ├─ Has outliers? → Median
│ └─ No outliers? → Mean or KNN
│
└─ Categorical?
└─ Most frequent or new category "Missing"
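For the categorical branch above, treating missingness as its own category is a one-liner; the column name here is just a placeholder:
from sklearn.impute import SimpleImputer
df['city'] = df['city'].fillna('Missing')  # 'city' is a hypothetical categorical column
imputer = SimpleImputer(strategy='constant', fill_value='Missing')  # sklearn equivalent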
Key Takeaway
There's no perfect solution. The best approach depends on:
- How much data is missing
- Why it's missing
- Your model's sensitivity to imputation
Always document your choices and test different approaches!
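As a sketch of what "test different approaches" can look like in practice, compare a few imputers with the same downstream model (the model and imputers below are placeholder choices; X and y are your features and target):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                      ('median', SimpleImputer(strategy='median')),
                      ('knn', KNNImputer(n_neighbors=5))]:
    # Each imputer is refit inside every cross-validation fold, so there is no leakage
    pipe = Pipeline([('imputer', imputer),
                     ('model', RandomForestClassifier(random_state=0))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')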