# Handling Missing Data in Machine Learning
Learn practical techniques to handle missing values in your datasets effectively.
Real-world data is messy. Missing values are everywhere. Here's how to deal with them.
## Why Data Goes Missing
- User didn't fill a form field
- Sensor malfunction
- Data corruption
- Feature didn't exist before a certain date
- Intentionally not collected
## First: Understand Your Missing Data
```python
import pandas as pd
import seaborn as sns

# Count missing values per column
df.isnull().sum()

# Percentage missing per column
(df.isnull().sum() / len(df)) * 100

# Visualize the missingness pattern
sns.heatmap(df.isnull(), cbar=True)
```
## Types of Missingness
### MCAR (Missing Completely At Random)

No pattern, just random chance. Example: survey responses lost in the mail.

### MAR (Missing At Random)

Missingness depends on OTHER observed variables. Example: men are less likely to report weight (but you know gender).

### MNAR (Missing Not At Random)

Missingness depends on the MISSING value itself. Example: high earners don't report income.
**Why it matters:** MNAR is hardest to handle correctly.
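You can't prove which mechanism you have from the data alone, but you can look for patterns. A minimal sketch (the `weight` and `gender` columns and the toy values are hypothetical) that checks whether missingness in one column varies with another observed column, which would hint at MAR:

```python
import pandas as pd

# Hypothetical data: 'weight' is sometimes missing, 'gender' is fully observed
df = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "weight": [None, 80.0, None, 62.0, 58.0, 65.0],
})

# Missing rate of 'weight' within each gender group;
# a large gap between groups suggests MAR rather than MCAR
missing_rate = df["weight"].isnull().groupby(df["gender"]).mean()
print(missing_rate)
```

A roughly equal missing rate across groups is consistent with MCAR; a large gap means you should not treat the data as missing completely at random.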
## Strategy 1: Delete
### Delete Rows

```python
df_clean = df.dropna()                             # Remove any row with missing values
df_clean = df.dropna(subset=['important_column'])  # Only check specific columns
```
**Use when:**

- Very few missing values (<5%)
- Large dataset (you can afford to lose rows)
- Data is MCAR
**Don't use when:**

- Many missing values
- Missingness has a pattern
### Delete Columns

```python
# Remove columns with >50% missing values
# (thresh is the minimum number of non-missing values required, as an integer)
threshold = 0.5
df_clean = df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)))
```
**Use when:**

- Column is mostly missing
- Column isn't important
## Strategy 2: Impute (Fill In)
### Simple Imputation
```python
from sklearn.impute import SimpleImputer

# Numerical: fill with mean or median
imputer = SimpleImputer(strategy='mean')  # or 'median'
X_imputed = imputer.fit_transform(X)

# Categorical: fill with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')
X_imputed = imputer.fit_transform(X)
```
**Mean vs. Median:**

- Mean: good for roughly normal distributions
- Median: more robust to outliers
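A quick illustration of why the median is the safer fill value when outliers are present (the toy numbers are illustrative, not from the text):

```python
import pandas as pd

# Five values, one of which is an extreme outlier
s = pd.Series([10, 12, 11, 13, 500])

print(s.mean())    # → 109.2, pulled far up by the outlier
print(s.median())  # → 12.0, stays near the typical values
```

Filling missing entries with 109.2 would badly misrepresent a column whose typical value is near 12.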
### Constant Value

```python
imputer = SimpleImputer(strategy='constant', fill_value=0)
```
### Pandas Methods

```python
# Forward fill (use the previous value; fillna(method='ffill') is deprecated)
df['col'] = df['col'].ffill()

# Backward fill (use the next value)
df['col'] = df['col'].bfill()

# Fill with a specific value
df['col'] = df['col'].fillna(df['col'].median())
```
## Strategy 3: Advanced Imputation
### KNN Imputation

```python
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)
```
Each missing value is estimated from the k most similar samples (nearest neighbors measured on the observed features).
### Iterative Imputation (MICE)

```python
# The experimental flag must be imported before IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X_imputed = imputer.fit_transform(X)
```
Models each feature as a function of other features. Iterates until convergence.
## Strategy 4: Add a Missing Indicator
Sometimes "missing" is information itself!
```python
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import FeatureUnion

# Combine imputation with binary "was missing" indicator columns
transformer = FeatureUnion([
    ('imputer', SimpleImputer(strategy='mean')),
    ('indicator', MissingIndicator()),
])

X_transformed = transformer.fit_transform(X)
# Now includes original features (imputed) + binary missing indicators
```
## Practical Guidelines
| Missing % | Recommendation                  |
|-----------|---------------------------------|
| <5%       | Simple imputation or deletion   |
| 5-25%     | Imputation (consider indicator) |
| 25-50%    | Advanced imputation + indicator |
| >50%      | Consider dropping the column    |
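The table above is easy to automate as a per-column triage step. A minimal sketch (the function name and the exact cutoffs are illustrative, taken straight from the table):

```python
import pandas as pd

def recommend_strategy(df: pd.DataFrame) -> pd.Series:
    """Map each column's missing percentage to a guideline label."""
    pct = df.isnull().mean() * 100  # percent missing per column
    bins = [-1, 5, 25, 50, 100]
    labels = [
        "simple imputation or deletion",
        "imputation (consider indicator)",
        "advanced imputation + indicator",
        "consider dropping column",
    ]
    return pd.cut(pct, bins=bins, labels=labels)

# Toy frame: 'a' has 0% missing, 'b' has 75% missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, None, None, None],
})
print(recommend_strategy(df))
```

Treat the output as a starting point for inspection, not an automatic decision, since the cutoffs are rules of thumb.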
## Important: Fit on Train Only!
```python
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)  # Fit on train only
X_test_imputed = imputer.transform(X_test)        # Transform test with train statistics
```
Don't let test data influence imputation values!
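The same leakage rule applies inside cross-validation. Putting the imputer in a `Pipeline` lets scikit-learn refit it on each training fold automatically, so no fold's held-out data ever influences the fill values. A sketch with toy data (the synthetic dataset, missing rate, and model choice are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data with ~10% of values knocked out at random
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is refit on each fold's training portion only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Imputing the full dataset before splitting would quietly leak test-fold statistics into training, inflating the scores.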
## Quick Decision Guide
```
Missing values?
│
├─ <5%? → Delete rows or simple imputation
│
├─ Numerical?
│   ├─ Has outliers? → Median
│   └─ No outliers?  → Mean or KNN
│
└─ Categorical?
    └─ Most frequent or new category "Missing"
```
## Key Takeaway
There's no perfect solution. The best approach depends on:

- How much data is missing
- Why it's missing
- Your model's sensitivity to imputation
Always document your choices and test different approaches!
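"Test different approaches" can be as literal as scoring each candidate imputer in the same cross-validated pipeline and comparing the results (the synthetic data and model choice here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data with ~10% missing values
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

# Score each imputation strategy under identical conditions
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
results = {}
for name, imputer in imputers.items():
    pipe = Pipeline([
        ("impute", imputer),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    results[name] = cross_val_score(pipe, X, y, cv=5).mean()
print(results)
```

Whichever strategy wins on your data, record it alongside the scores so the choice is reproducible.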