Avoiding Data Leakage in Machine Learning
Learn how to identify and prevent data leakage that causes models to fail in production.
Data leakage occurs when your model learns from information it wouldn't have access to at prediction time. The result: great test scores, disastrous production performance. It's one of the most common mistakes in machine learning.
What is Data Leakage?
Training with leakage:           Production (no leakage):
  "Future" info available          Only past/present available
  ├── Test accuracy: 99%           ├── Actual accuracy: 60%
  └── Model looks great!           └── Model fails!
Type 1: Target Leakage
Features that contain information about the target but wouldn't be available at prediction time.
Example: Predicting loan default
Feature: "account_closed" = True
Problem: Account closed BECAUSE they defaulted!
This wouldn't be known when making lending decision.
Fix: Think about what you'd know at prediction time.
# Bad: These are consequences of the target
features_to_remove = [
'account_closed', # Result of default
'sent_to_collections', # Result of default
'bankruptcy_filed' # Result of default
]
X = df.drop(columns=features_to_remove + ['target'])
Type 2: Train-Test Contamination
Test data information leaks into training.
Common mistakes:
Fitting on All Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: Scaler sees test data!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leaks test-set statistics into scaling
X_train, X_test = train_test_split(X_scaled, ...)

# RIGHT: Fit only on training data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform!
Feature Selection on All Data
# WRONG
selected_features = select_best_features(X, y)  # Selection sees test info
X_train, X_test = train_test_split(X[selected_features], ...)

# RIGHT: split first, then select using only the training fold
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
selected_features = select_best_features(X_train, y_train)
X_train = X_train[selected_features]
X_test = X_test[selected_features]
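The damage from selecting features on all the data is easy to demonstrate on pure noise. In the sketch below (synthetic data, generic scikit-learn estimators chosen for illustration), the true accuracy is chance, yet the leaky procedure reports a much higher cross-validation score; wrapping selection in a Pipeline keeps it inside each fold and restores an honest estimate:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)   # random labels: true accuracy ~ 0.5

# Leaky: select the "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Safe: selection happens inside each CV fold via a Pipeline
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
safe = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy: {leaky:.2f}")  # well above chance despite noise
print(f"safe CV accuracy:  {safe:.2f}")   # near chance, as it should be
```

The only difference between the two estimates is where feature selection runs; everything else is identical.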
Type 3: Temporal Leakage
Using future information to predict the past.
# WRONG: Random split with time series
X_train, X_test = train_test_split(X, ...) # 2023 data in train, 2020 in test!
# RIGHT: Split by time
train = df[df['date'] < '2023-01-01']
test = df[df['date'] >= '2023-01-01']
Also watch for features:
# Bad: Uses future aggregations
df['user_total_purchases'] = df.groupby('user')['amount'].transform('sum')
# Better: Use only rows strictly before each one
# (assumes df is sorted by date within each user)
df['user_purchases_before'] = df.groupby('user')['amount'].transform(
    lambda s: s.shift().expanding().sum()
)
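To make the shift-then-expand idea concrete, here is a tiny worked example (assuming rows are already in time order within each user; the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'user':   ['a', 'a', 'a', 'b', 'b'],
    'amount': [10, 20, 30, 5, 15],
})

# shift() drops the current row from its own history;
# expanding().sum() then accumulates only earlier rows
df['past_total'] = df.groupby('user')['amount'].transform(
    lambda s: s.shift().expanding().sum()
)
print(df)
# user 'a': NaN, 10, 30 -- each row sees only purchases before it
# user 'b': NaN, 5
```

The NaN on each user's first row is correct: there is no past to aggregate, which is exactly what production would see for a new customer.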
Type 4: Duplicate/Near-Duplicate Rows
# If same row appears in train and test
# Model memorizes, doesn't generalize
# Check for duplicates before splitting
print(f"Duplicates: {df.duplicated().sum()}")
# Remove duplicates
df = df.drop_duplicates()
# Or split by unique entity (e.g., customer_id)
unique_customers = df['customer_id'].unique()
train_customers, test_customers = train_test_split(unique_customers)
train = df[df['customer_id'].isin(train_customers)]
test = df[df['customer_id'].isin(test_customers)]
How to Detect Leakage
1. Suspiciously Good Performance
if test_accuracy > 0.95:
print("This seems too good. Check for leakage!")
2. Feature Importance Check
# If a feature is unrealistically important, investigate
importance = model.feature_importances_
top_feature = feature_names[importance.argmax()]
print(f"Top feature: {top_feature}")
# Ask: "Would this be available at prediction time?"
3. Time-Based Validation
# If time-based validation is much worse than random CV
cv_score = cross_val_score(model, X, y, cv=5).mean()
time_score = ...  # fill in with your time-based split score
if cv_score - time_score > 0.1:
    print("Possible temporal leakage!")
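One way to compute the time-based score is scikit-learn's TimeSeriesSplit, which only ever trains on earlier folds and tests on later ones. A minimal sketch (the synthetic X and y are placeholders; rows are assumed to be in chronological order):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # placeholder features, rows in time order
y = rng.integers(0, 2, size=200)   # placeholder labels
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Random folds can mix future rows into training
cv_score = cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean()
# TimeSeriesSplit always trains on the past, tests on the future
time_score = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)).mean()

if cv_score - time_score > 0.1:
    print("Possible temporal leakage!")
```

On real time-series data, a large gap between the two scores is a strong hint that some feature is smuggling in future information.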
Prevention Checklist
- Split data BEFORE any preprocessing
- Use pipelines to prevent leakage
- For time series, always split by time
- Check: "Would I have this feature at prediction time?"
- Remove duplicates before splitting
- Be suspicious of very high scores
Safe Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Pipeline applies steps correctly during CV
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(k=20)),
('classifier', RandomForestClassifier())
])
# Safe cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
Key Takeaway
Data leakage is silent and deadly: your model looks great until it reaches production. Always split data first, use pipelines, think carefully about what information would be available at prediction time, and be suspicious of results that seem too good. Prevention is far easier than detection!