Handling Missing Data in Machine Learning

Learn practical techniques to handle missing values in your datasets effectively.

Sarah Chen
December 19, 2025

Real-world data is messy. Missing values are everywhere. Here's how to deal with them.

Why Data Goes Missing

  • User didn't fill a form field
  • Sensor malfunction
  • Data corruption
  • Feature didn't exist before a certain date
  • Intentionally not collected

First: Understand Your Missing Data

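import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes df is a DataFrame you have already loaded

# Count of missing values per column
df.isnull().sum()

# Percentage missing per column
df.isnull().mean() * 100

# Visualize where the gaps are (one cell per value)
sns.heatmap(df.isnull(), cbar=True)
plt.show()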
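
To see what these checks return, here is a minimal sketch on a small made-up DataFrame. The column names and values are invented for illustration, and the `summary` table is just one convenient way to combine the two checks.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, np.nan],
    "city": ["Paris", "Lima", None, "Oslo", "Rome"],
    "income": [52000, 61000, 58000, np.nan, 75000],
})

# One row per column: how many values are missing, and what share
summary = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(summary)
```

Sorting by percentage puts the worst columns at the top, which is usually where you want to start.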

Types of Missingness

MCAR (Missing Completely At Random)

No pattern. Missingness is unrelated to any variable, observed or not.
Example: Survey responses lost in the mail.

MAR (Missing At Random)

Missingness depends on other observed variables, not on the missing value itself.
Example: Men are less likely to report their weight, but gender is recorded.

MNAR (Missing Not At Random)

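Missingness depends on the missing value itself.
Example: High earners are less likely to report their income.

Why it matters: the mechanism determines which fixes are safe. Deleting rows is generally safe only under MCAR, MAR can be handled by imputing from the related variables, and MNAR is the hardest to handle correctly because the observed data alone cannot reveal it.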
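
A quick simulation can make the three mechanisms concrete. This is a sketch on synthetic data: the age/income relationship, the missingness probabilities, and the 70,000 cutoff are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data: income depends on age (assumed relationship, illustration only)
age = rng.uniform(20, 60, n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, n)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.3

# MAR: missingness depends on age (observed), not on income itself
mar_missing = rng.random(n) < np.where(age > 40, 0.5, 0.1)

# MNAR: missingness depends on the income value that goes missing
mnar_missing = rng.random(n) < np.where(income > 70_000, 0.6, 0.1)

# Compare the mean of the observed incomes with the true mean
true_mean = income.mean()
bias = {}
for name, mask in [("MCAR", mcar_missing), ("MAR", mar_missing), ("MNAR", mnar_missing)]:
    bias[name] = income[~mask].mean() - true_mean
    print(f"{name}: observed mean off by {bias[name]:+.0f}")
```

Under MCAR the observed mean stays close to the truth, while under MAR and MNAR it drifts. In the MAR case the gap can in principle be corrected by conditioning on age, because age is observed. In the MNAR case no observed column explains the gap.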

Strategy 1: Delete

Delete Rows

df_clean = df.dropna()  # Remove any row with missing values
df_clean = df.dropna(subset=['important_column'])  # Only check specific columns

Use when:

  • Very few missing values (<5%)
  • Large dataset (can afford to lose rows)
  • Data is MCAR

Don't use when:

  • Many missing values
  • Missingness has a pattern
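
Row deletion can cost more than the per-column numbers suggest, because a row is dropped if any of its columns has a gap. The sketch below uses random toy data (ten invented columns, each about 10% missing) to check the loss before committing to `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: each column is only about 10% missing
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 10)), columns=[f"f{i}" for i in range(10)])
df = df.mask(rng.random(df.shape) < 0.10)

# Measure how many rows survive before actually dropping anything
rows_before = len(df)
rows_after = len(df.dropna())
print(f"Per-column missing: about {df.isnull().mean().mean():.0%}")
print(f"Rows lost by dropna(): {1 - rows_after / rows_before:.0%}")
```

With ten independent columns at 10% missing each, the chance that a row is fully complete is 0.9^10, or about 35%, so roughly two thirds of the rows disappear.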

Delete Columns

# Remove columns with >50% missing
threshold = 0.5
df_clean = df.loc[:, df.isnull().mean() <= threshold]  # keep columns at or below the threshold

Use when:

  • Column is mostly missing
  • Column isn't important

Strategy 2: Impute (Fill In)

Simple Imputation

from sklearn.impute import SimpleImputer

# Numerical columns: fill with mean or median
num_imputer = SimpleImputer(strategy='mean')  # or 'median'
X_num_imputed = num_imputer.fit_transform(X_num)  # X_num: your numeric columns only

# Categorical columns: fill with most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
X_cat_imputed = cat_imputer.fit_transform(X_cat)  # X_cat: your categorical columns only

Mean vs Median:

  • Mean: Good for roughly symmetric distributions without extreme values
  • Median: Better for skewed data or data with outliers
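
A tiny example shows how much one outlier can move the fill value. The salaries below are invented, and `statistics_` is the attribute where a fitted `SimpleImputer` stores the value it will fill in for each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy salaries (invented): one extreme outlier and one missing value
X = np.array([[30_000.0], [32_000.0], [35_000.0], [38_000.0], [1_000_000.0], [np.nan]])

# Fit both strategies and read back the value each one would fill in
mean_fill = SimpleImputer(strategy="mean").fit(X).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(X).statistics_[0]

print(f"mean fill:   {mean_fill:,.0f}")
print(f"median fill: {median_fill:,.0f}")
```

The mean lands far above every typical salary in the column, while the median stays in the middle of them.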

Constant Value

imputer = SimpleImputer(strategy='constant', fill_value=0)

Pandas Methods

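# Each call returns a new Series; assign it back to keep the result.
# Note: fillna(method=...) is deprecated in recent pandas, so use ffill()/bfill().

# Forward fill (use previous value)
df['col'].ffill()

# Backward fill (use next value)
df['col'].bfill()

# Fill with a specific value, here the column median
df['col'].fillna(df['col'].median())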
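
Forward and backward fill make the most sense on ordered data such as time series. Here is a small sketch with invented daily sensor readings. The `limit` argument caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

# Toy sensor readings (invented), one per day
readings = pd.Series(
    [20.1, np.nan, np.nan, 21.4, np.nan],
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)

forward = readings.ffill()          # carry the last known value forward
limited = readings.ffill(limit=1)   # fill at most 1 consecutive missing value
backward = readings.bfill()         # pull the next known value backward

print(pd.DataFrame({
    "raw": readings,
    "ffill": forward,
    "ffill_limit1": limited,
    "bfill": backward,
}))
```

Backward fill uses values from the future, so it is usually a poor fit for forecasting tasks. Also note that the last reading stays missing under `bfill()` because nothing comes after it.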

Strategy 3: Advanced Imputation

KNN Imputation

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)

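Estimates each missing value from the K most similar samples, where similarity is measured on the features both samples have observed. Because it relies on distances, scale your features first.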
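
As a rough sketch of how this plays out, the toy example below (invented heights and salaries) scales the features and then imputes one missing salary from the two nearest rows. Scaling first is a common practice assumed here, not something the imputer requires.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (invented): columns are height in cm and salary
X = np.array([
    [160.0, 40_000.0],
    [165.0, 42_000.0],
    [180.0, 90_000.0],
    [185.0, 95_000.0],
    [162.0, np.nan],   # salary missing; height is close to the first two rows
])

# StandardScaler ignores NaNs when fitting and keeps them in the output,
# so it can run before the imputer
pipeline = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_imputed = pipeline.fit_transform(X)

# Undo the scaling to read the imputed salary in original units
X_imputed = pipeline.named_steps["standardscaler"].inverse_transform(X_scaled_imputed)
print(X_imputed[-1])
```

The filled salary is the average of the two neighbours with the closest heights, so it lands near the lower salaries and not near the overall column mean.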

Iterative Imputation (MICE)

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)

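Models each feature with missing values as a function of the other features, then cycles through the features repeatedly until the estimates stabilize or max_iter rounds have run. Note that scikit-learn's IterativeImputer is inspired by MICE but returns a single imputation by default, not multiple.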
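
To get a feel for when this helps, here is a sketch on synthetic data where one column is almost exactly twice the other. The relationship and noise level are assumptions chosen so that the other feature is very informative. On weakly correlated data the gain over a plain mean fill would be smaller.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data (assumed relationship): second column is about 2x the first
x1 = rng.normal(10, 2, 500)
x2 = 2 * x1 + rng.normal(0, 0.1, 500)
X = np.column_stack([x1, x2])

# Hide about 20% of the second column
X_missing = X.copy()
hidden = rng.random(500) < 0.2
X_missing[hidden, 1] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Compare against filling with the column mean as a baseline
mice_error = np.abs(X_imputed[hidden, 1] - X[hidden, 1]).mean()
mean_error = np.abs(np.nanmean(X_missing[:, 1]) - X[hidden, 1]).mean()
print(f"iterative imputer error: {mice_error:.2f}, mean fill error: {mean_error:.2f}")
```

Because the true values are known here, the two errors can be compared directly. On real data you would have to mask some known values yourself to run the same check.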

Strategy 4: Add Missing Indicator

Sometimes "missing" is information itself!

from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Combine imputation + missing indicator
transformer = FeatureUnion([
    ('imputer', SimpleImputer(strategy='mean')),
    ('indicator', MissingIndicator())
])

X_transformed = transformer.fit_transform(X)
# Output: imputed features, followed by one binary indicator per feature that had missing values during fit
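
For the common case of one imputer plus indicators, `SimpleImputer` also accepts `add_indicator=True`, which gives a similar result without the `FeatureUnion`. A minimal sketch on invented data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (invented): only the second column has missing values
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, np.nan],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_transformed = imputer.fit_transform(X)

# Two imputed columns, plus one indicator column for the feature that had gaps
print(X_transformed)
```

Only features that had missing values during fit get an indicator column, so the output here has three columns, not four.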

Practical Guidelines

Missing %    Recommendation
<5%          Simple imputation or deletion
5-25%        Imputation (consider indicator)
25-50%       Advanced imputation + indicator
>50%         Consider dropping column
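
These thresholds are easy to turn into a per-column report. The `recommend` function below is a hypothetical helper written for this post, not a library function, and the cutoffs are rules of thumb, not hard limits.

```python
import numpy as np
import pandas as pd

def recommend(pct_missing: float) -> str:
    """Map a column's missing percentage to a rule-of-thumb action (hypothetical helper)."""
    if pct_missing == 0:
        return "Nothing to do"
    if pct_missing < 5:
        return "Simple imputation or deletion"
    if pct_missing <= 25:
        return "Imputation (consider indicator)"
    if pct_missing <= 50:
        return "Advanced imputation + indicator"
    return "Consider dropping column"

# Toy data, invented for illustration
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, 3, 4],
    "c": [np.nan, np.nan, np.nan, 4],
})

# Percentage missing per column, then the matching recommendation
pct = df.isnull().mean() * 100
report = pct.map(recommend)
print(report)
```

Treat the output as a starting point. A column that matters a lot to your model may be worth keeping even above 50% missing.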

Important: Fit on Train Only!

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # Fit on train
X_test_imputed = imputer.transform(X_test)        # Transform test

Don't let test data influence imputation values!
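
The same rule applies inside cross-validation. One way to get it right automatically is to put the imputer in a pipeline, as in this sketch. The synthetic dataset, the 10% missing rate, and the choice of logistic regression are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of values knocked out at random (illustration only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

# Inside a pipeline, the imputer is re-fit on the training part of every fold
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy per fold: {scores.round(2)}")
```

Because the pipeline is fitted as one unit, each fold's medians come only from that fold's training rows.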

Quick Decision Guide

Missing values?
    │
    ├─ <5%? → Delete rows or simple imputation
    │
    ├─ Numerical?
    │   ├─ Has outliers? → Median
    │   └─ No outliers? → Mean or KNN
    │
    └─ Categorical?
        └─ Most frequent or new category "Missing"
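
The two branches of the guide can be applied side by side with a `ColumnTransformer`. The sketch below uses an invented two-column DataFrame, with median for the numeric column and a new "Missing" category for the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (invented column names and values)
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 90.0],
    "plan": pd.Series(["basic", np.nan, "pro", "basic"], dtype=object),
})

# Numeric column gets the median; categorical column gets a new "Missing" label
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", SimpleImputer(strategy="constant", fill_value="Missing"), ["plan"]),
])

filled = pd.DataFrame(preprocess.fit_transform(df), columns=["age", "plan"])
print(filled)
```

A separate "Missing" category lets the model treat the gap as its own signal, which ties in with the missing indicator idea from Strategy 4.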

Key Takeaway

There's no perfect solution. The best approach depends on:

  • How much data is missing
  • Why it's missing
  • Your model's sensitivity to imputation

Always document your choices and test different approaches!
