Handling Missing Data in Machine Learning

Learn practical techniques to handle missing values in your datasets effectively.

Sarah Chen
December 19, 2025

Real-world data is messy. Missing values are everywhere. Here's how to deal with them.

## Why Data Goes Missing

- User didn't fill in a form field
- Sensor malfunction
- Data corruption
- Feature didn't exist before a certain date
- Intentionally not collected

## First: Understand Your Missing Data

```python
import pandas as pd
import seaborn as sns

# Check missing values
df.isnull().sum()

# Percentage missing per column
(df.isnull().sum() / len(df)) * 100

# Visualize
sns.heatmap(df.isnull(), cbar=True)
```
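
The checks above can be bundled into one small helper. This is a minimal sketch on a made-up DataFrame; the column names and the `missing_summary` function are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "city": ["Oslo", "Lima", None, "Pune", "Kyiv"],
    "income": [52000, 61000, np.nan, 58000, 75000],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("pct_missing", ascending=False)

summary = missing_summary(df)
print(summary)
```

Sorting by percentage puts the columns that need the most attention at the top.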

## Types of Missingness

### MCAR (Missing Completely At Random)

No pattern: values go missing by pure chance, unrelated to any variable. Example: survey responses lost in the mail.

### MAR (Missing At Random)

Missingness depends on other *observed* variables, not on the missing value itself. Example: men are less likely to report their weight (but you know each respondent's gender).

### MNAR (Missing Not At Random)

Missingness depends on the missing value itself. Example: high earners don't report their income.

**Why it matters:** MNAR is the hardest to handle correctly, because the observed data alone can't tell you what the missing values look like.
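
To make the three mechanisms concrete, here is a small simulation on synthetic data. Everything in it is an assumption chosen for illustration: the `age` and `income` variables, their relationship, and the missingness probabilities. It compares the naive mean of the observed values with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data (assumed): income rises with age, and age is always observed
age = rng.uniform(20, 65, size=n)
income = 1000 * age + rng.normal(0, 5000, size=n)
older = age > 45

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: the chance of being missing depends only on an observed variable (age)
mar_mask = rng.random(n) < np.where(older, 0.5, 0.1)
# MNAR: the chance of being missing depends on the value itself (income)
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "MNAR": income[~mnar_mask].mean(),
}
for name, m in observed_means.items():
    print(f"{name}: naive observed mean is off by {m - true_mean:+.0f}")

# Under MAR the bias can be corrected with the observed variable:
# average within each age group, then weight by the group's known size.
mar_corrected = (
    older.mean() * income[older & ~mar_mask].mean()
    + (~older).mean() * income[~older & ~mar_mask].mean()
)
print(f"MAR, corrected using age: off by {mar_corrected - true_mean:+.0f}")
```

In this toy setup the MCAR mean stays close to the truth, while both MAR and MNAR pull the naive mean down. The difference is that the MAR bias can be removed using `age`, whereas for MNAR no observed variable explains the gap.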

## Strategy 1: Delete

### Delete Rows

```python
# Remove any row with missing values
df_clean = df.dropna()

# Only check specific columns
df_clean = df.dropna(subset=['important_column'])
```

**Use when:**

- Very few missing values (<5%)
- Large dataset (can afford to lose rows)
- Data is MCAR

**Don't use when:**

- Many missing values
- Missingness has a pattern
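
One thing worth checking before calling `dropna()`: a small percentage missing per column can still add up to a large share of rows lost, because the gaps may sit in different rows. A minimal sketch with made-up data (the frame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, 4 columns, each column missing exactly one value (5%)
df = pd.DataFrame(
    np.arange(80, dtype=float).reshape(20, 4),
    columns=["a", "b", "c", "d"],
)
for row, col in enumerate(df.columns):
    df.loc[row, col] = np.nan  # a different row is affected in each column

# Share of missing values per column
per_column_pct = df.isnull().mean() * 100

# Share of rows that dropna() would remove
rows_lost_pct = (1 - len(df.dropna()) / len(df)) * 100

print(per_column_pct.round(1).to_dict())
print(f"Rows lost by dropna(): {rows_lost_pct:.0f}%")
```

Here every column is only 5% missing, yet dropping incomplete rows removes 4 of the 20 rows.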

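### Delete Columns

```python
# Remove columns with >50% missing
threshold = 0.5
# Keep columns whose missing fraction is at most the threshold
# (dropna's thresh= expects a count of non-missing values, so a boolean mask is simpler here)
df_clean = df.loc[:, df.isnull().mean() <= threshold]
```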

**Use when:**

- Column is mostly missing
- Column isn't important

## Strategy 2: Impute (Fill In)

### Simple Imputation

```python
from sklearn.impute import SimpleImputer

# Numerical: fill with mean or median
imputer = SimpleImputer(strategy='mean')  # or 'median'
X_imputed = imputer.fit_transform(X)

# Categorical: fill with most frequent
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
```

**Mean vs Median:**

- Mean: good for roughly symmetric (e.g. normal) distributions
- Median: better when the data is skewed or has outliers
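
A quick way to see the difference is to look at the fill value each strategy learns. The sketch below uses a made-up salary column with one extreme value; the numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries (in thousands) with one extreme outlier and one missing value
X = np.array([[40.0], [42.0], [45.0], [48.0], [1000.0], [np.nan]])

# statistics_ holds the per-column fill value learned during fit
mean_fill = SimpleImputer(strategy="mean").fit(X).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(X).statistics_[0]

print(mean_fill)    # 235.0
print(median_fill)  # 45.0
```

The single outlier drags the mean far above every typical value, while the median stays in the middle of the bulk of the data.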

### Constant Value

```python
imputer = SimpleImputer(strategy='constant', fill_value=0)
```

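### Pandas Methods

```python
# Each call returns a new Series; assign it back to keep the result.

# Forward fill (use previous value)
# fillna(method='ffill') is deprecated in pandas 2.1+; use .ffill() instead
df['col'].ffill()

# Backward fill (use next value)
df['col'].bfill()

# Fill with specific value
df['col'].fillna(df['col'].median())
```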
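
Forward fill suits ordered data such as sensor readings, but a long gap can end up filled with a stale value. One option is the `limit` argument, sketched here on a made-up daily series (the readings are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with a three-day gap
readings = pd.Series(
    [20.5, np.nan, np.nan, np.nan, 21.0],
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)

# Carry the last known value forward, but for at most 2 consecutive gaps
filled = readings.ffill(limit=2)
print(filled)
```

The first two gaps take the previous reading and the third stays missing, so it can be handled explicitly instead of silently reusing an old value.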

## Strategy 3: Advanced Imputation

### KNN Imputation

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```

Estimates each missing value from the K most similar samples (by default, the average of those neighbors' values for that feature).
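
Here is a tiny worked example, using a small made-up matrix and `n_neighbors=2` so the arithmetic is easy to follow by hand:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# [[1.  2.  4. ]
#  [3.  4.  3. ]
#  [5.5 6.  5. ]
#  [8.  8.  7. ]]
```

The missing value in the first row becomes 4.0, the average of 3.0 and 5.0 from its two nearest rows. Distances are computed only over the features both rows have observed.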

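### Iterative Imputation (MICE)

```python
# IterativeImputer is still experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)
```

Models each feature with missing values as a function of the other features, and repeats the process until the estimates converge or `max_iter` rounds are reached.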
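
As a rough illustration, the toy data below is built so that the second column is about twice the first. The arrays are made up, and the exact imputed values depend on the default estimator, so treat the output as approximate.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Training data where column 1 is roughly 2 x column 0
X_train = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

# New rows with gaps to fill
X_new = np.array([[np.nan, 2], [6, np.nan], [np.nan, 6]])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)
X_new_imputed = imputer.transform(X_new)

# Rounded for readability; the raw values are close to, not exactly, these
print(np.round(X_new_imputed))
```

The imputer picks up the relationship between the columns, so the filled values land near 1, 12, and 3 respectively.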

## Strategy 4: Add Missing Indicator

Sometimes "missing" is information itself!

```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Combine imputation + missing indicator
transformer = FeatureUnion([
    ('imputer', SimpleImputer(strategy='mean')),
    ('indicator', MissingIndicator())
])

X_transformed = transformer.fit_transform(X)
# Now includes original features (imputed) + binary missing indicators
```
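
A shorter route to the same kind of output is the `add_indicator=True` option on `SimpleImputer`, which appends the indicator columns for you. A minimal sketch on a made-up 3x2 array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: one missing value in each column
X = np.array([
    [1.0, np.nan],
    [np.nan, 2.0],
    [3.0, 4.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_transformed = imputer.fit_transform(X)
print(X_transformed)
# [[1. 3. 0. 1.]
#  [2. 2. 1. 0.]
#  [3. 4. 0. 0.]]
```

The first two columns are the mean-imputed features. The last two flag where the original value was missing, one indicator per feature that had gaps during `fit`.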

## Practical Guidelines

| Missing % | Recommendation |
|-----------|----------------|
| <5% | Simple imputation or deletion |
| 5-25% | Imputation (consider indicator) |
| 25-50% | Advanced imputation + indicator |
| >50% | Consider dropping column |
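
The table can be applied per column with a few lines of code. This is only a sketch: the `recommend` helper is hypothetical, and the handling of values that sit exactly on a boundary (5%, 25%, 50%) is an assumption, since the table doesn't say which row they belong to.

```python
import numpy as np
import pandas as pd

def recommend(pct_missing: float) -> str:
    """Map a column's missing percentage to the guideline table's advice."""
    if pct_missing == 0:
        return "Nothing to do"
    if pct_missing < 5:
        return "Simple imputation or deletion"
    if pct_missing <= 25:   # boundary handling is an assumption
        return "Imputation (consider indicator)"
    if pct_missing <= 50:   # boundary handling is an assumption
        return "Advanced imputation + indicator"
    return "Consider dropping column"

def column_with_missing(n_rows: int, n_missing: int) -> np.ndarray:
    """Build a toy numeric column whose first n_missing entries are NaN."""
    values = np.arange(n_rows, dtype=float)
    values[:n_missing] = np.nan
    return values

# Hypothetical frame: 40 rows with 2.5%, 10%, 30% and 60% missing
df = pd.DataFrame({
    "a": column_with_missing(40, 1),
    "b": column_with_missing(40, 4),
    "c": column_with_missing(40, 12),
    "d": column_with_missing(40, 24),
})

pct_missing = df.isnull().mean() * 100
plan = pct_missing.map(recommend)
print(plan)
```

Treat the result as a starting point. The thresholds are rules of thumb, and why the data is missing matters as much as how much of it is.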

## Important: Fit on Train Only!

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # Fit on train
X_test_imputed = imputer.transform(X_test)        # Transform test
```

Don't let test data influence imputation values!
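
The same rule applies inside cross-validation. Putting the imputer in a scikit-learn `Pipeline` means it is refit on the training portion of every fold. The sketch below uses synthetic data, and `LogisticRegression` is just a stand-in model chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic classification data with about 10% of values knocked out at random
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is part of the model, so each fold fits it on that fold's training rows only
model = make_pipeline(
    SimpleImputer(strategy="median"),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Imputing the whole dataset first and then cross-validating would leak information from the validation rows into the fill values.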

## Quick Decision Guide

```
Missing values?
│
├─ <5%? → Delete rows or simple imputation
│
├─ Numerical?
│   ├─ Has outliers? → Median
│   └─ No outliers? → Mean or KNN
│
└─ Categorical?
    └─ Most frequent or new category "Missing"
```
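
The numerical and categorical branches of the guide can be combined in one preprocessing step with `ColumnTransformer`. This is a sketch on a made-up frame: the column names are hypothetical, and the median is used for all numeric columns on the assumption that outliers may be present.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type data
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 47.0],
    "income": [52000.0, 61000.0, np.nan, 900000.0],
    "city": ["Oslo", np.nan, "Lima", "Oslo"],
})
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    # Numerical branch: median is robust to the outlier in income
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical branch: treat missing as its own category
    ("cat", SimpleImputer(strategy="constant", fill_value="Missing"), categorical_cols),
])

# Output columns follow the order of the transformers above
imputed = pd.DataFrame(
    preprocess.fit_transform(df),
    columns=numeric_cols + categorical_cols,
)
print(imputed)
```

Swapping `"constant"` for `"most_frequent"` gives the other categorical option from the guide.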

## Key Takeaway

There's no perfect solution. The best approach depends on:

- How much data is missing
- Why it's missing
- Your model's sensitivity to imputation

Always document your choices and test different approaches!
