
Avoiding Data Leakage in Machine Learning

Learn how to identify and prevent data leakage that causes models to fail in production.

Sarah Chen
December 19, 2025

Data leakage occurs when your model trains on information that won't be available at prediction time. The result: great test scores, disastrous production performance. It's one of the most common mistakes in machine learning.

What is Data Leakage?

Training with leakage:        Production (no leakage):
"Future" info available       Only past/present available
├── Test accuracy: 99%        ├── Actual accuracy: 60%
└── Model looks great!        └── Model fails!

Type 1: Target Leakage

Features that contain information about the target but wouldn't be available at prediction time.

Example: Predicting loan default

Feature: "account_closed" = True
Problem: Account closed BECAUSE they defaulted!
         This wouldn't be known when making the lending decision.

Fix: Think about what you'd know at prediction time.

# Bad: These are consequences of the target
features_to_remove = [
    'account_closed',      # Result of default
    'sent_to_collections', # Result of default  
    'bankruptcy_filed'     # Result of default
]

X = df.drop(columns=features_to_remove + ['target'])
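As a rough screen (a sketch on made-up data, not a real dataset), features that correlate almost perfectly with the target are prime leakage suspects:

```python
import pandas as pd
import numpy as np

# Hypothetical loan data: 'account_closed' is a consequence of default
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'income': rng.normal(50_000, 15_000, 1000),
    'loan_amount': rng.normal(10_000, 3_000, 1000),
})
df['target'] = (rng.random(1000) < 0.2).astype(int)
df['account_closed'] = df['target']  # leaks the label exactly

# Heuristic: near-perfect correlation with the target is a red flag
corr = df.corr()['target'].drop('target').abs()
suspects = corr[corr > 0.95].index.tolist()
print(suspects)  # ['account_closed']
```

This heuristic only catches linear, one-column leaks; the "would I know this at prediction time?" question still has to be asked feature by feature.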

Type 2: Train-Test Contamination

Test data information leaks into training.

Common mistakes:

Fitting on All Data

# WRONG: Scaler sees test data!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leaks test statistics
X_train, X_test = train_test_split(X_scaled, ...)

# RIGHT: Fit only on training data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform!

Feature Selection on All Data

# WRONG
selected_features = select_best_features(X, y)  # Uses test info
X_train, X_test = train_test_split(X[selected_features], ...)

# RIGHT
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
selected_features = select_best_features(X_train, y_train)
X_train = X_train[selected_features]
X_test = X_test[selected_features]

Type 3: Temporal Leakage

Using future information to predict the past.

# WRONG: Random split with time series
X_train, X_test = train_test_split(X, ...)  # 2023 data in train, 2020 in test!

# RIGHT: Split by time
train = df[df['date'] < '2023-01-01']
test = df[df['date'] >= '2023-01-01']

Also watch for features:

# Bad: Uses future aggregations
df['user_total_purchases'] = df.groupby('user')['amount'].transform('sum')

# Better: Use only past data (assumes df is sorted by date)
df['user_purchases_before'] = df.groupby('user')['amount'].transform(
    lambda x: x.shift().expanding().sum()
)
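For cross-validating time-ordered data, scikit-learn's TimeSeriesSplit keeps every validation fold strictly after its training fold. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data, assumed already sorted chronologically
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index: no peeking at the future
    assert train_idx.max() < test_idx.min()
    print(f"train [0..{train_idx.max()}] -> test [{test_idx.min()}..{test_idx.max()}]")
```

Pass `tscv` as the `cv` argument to `cross_val_score` to get leakage-free time-series CV.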

Type 4: Duplicate/Near-Duplicate Rows

# If same row appears in train and test
# Model memorizes, doesn't generalize

# Check for duplicates before splitting
print(f"Duplicates: {df.duplicated().sum()}")

# Remove duplicates
df = df.drop_duplicates()

# Or split by unique entity (e.g., customer_id)
unique_customers = df['customer_id'].unique()
train_customers, test_customers = train_test_split(unique_customers)
train = df[df['customer_id'].isin(train_customers)]
test = df[df['customer_id'].isin(test_customers)]
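scikit-learn also ships GroupShuffleSplit, which performs this entity-level split directly. A sketch with made-up customer data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: five rows per customer
df = pd.DataFrame({
    'customer_id': np.repeat(np.arange(10), 5),
    'amount': np.random.default_rng(0).normal(size=50),
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df['customer_id']))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No customer appears on both sides of the split
overlap = set(train['customer_id']) & set(test['customer_id'])
print(len(overlap))  # 0
```

GroupKFold works the same way when you need grouped cross-validation rather than a single split.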

How to Detect Leakage

1. Suspiciously Good Performance

if test_accuracy > 0.95:
    print("This seems too good. Check for leakage!")

2. Feature Importance Check

# If a feature is unrealistically important, investigate
importance = model.feature_importances_
top_feature = feature_names[importance.argmax()]
print(f"Top feature: {top_feature}")
# Ask: "Would this be available at prediction time?"
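A follow-up check (sketch with a synthetic leaky column): retrain on the single most important feature alone. If it nearly matches the full model, that feature is probably leaking the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 10))
X[:, 3] = y + rng.normal(scale=0.01, size=300)  # synthetic leaky feature

model = RandomForestClassifier(random_state=0).fit(X, y)
top = int(model.feature_importances_.argmax())

# A near-perfect model on the top feature alone is a leakage red flag
solo_score = cross_val_score(RandomForestClassifier(random_state=0),
                             X[:, [top]], y, cv=3).mean()
print(top, round(solo_score, 2))
```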

3. Time-Based Validation

# If time-based validation is much worse than random CV, suspect leakage
cv_score = cross_val_score(model, X, y, cv=5).mean()
time_score = ...  # fill in: score from a time-based split

if cv_score - time_score > 0.1:
    print("Possible temporal leakage!")

Prevention Checklist

  • Split data BEFORE any preprocessing
  • Use pipelines to prevent leakage
  • For time series, always split by time
  • Check: "Would I have this feature at prediction time?"
  • Remove duplicates before splitting
  • Be suspicious of very high scores

Safe Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Pipeline applies steps correctly during CV
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(k=20)),
    ('classifier', RandomForestClassifier())
])

# Safe cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)

Key Takeaway

Data leakage is silent and deadly: your model looks great right up until production. Always split data first, use pipelines, think carefully about what information would actually be available at prediction time, and be suspicious of results that seem too good. Prevention is much easier than detection!

#Machine Learning#Data Leakage#Best Practices#Intermediate