Avoiding Data Leakage in Machine Learning
Learn how to identify and prevent data leakage that causes models to fail in production.
Data leakage occurs when your model learns from information it wouldn't have access to at prediction time. The result: great test scores, disastrous production performance. It's one of the most common mistakes in machine learning.
What is Data Leakage?
Training with leakage:           Production (no leakage):
  "Future" info available          Only past/present available
  ├── Test accuracy: 99%           ├── Actual accuracy: 60%
  └── Model looks great!           └── Model fails!
Type 1: Target Leakage
Features that contain information about the target but wouldn't be available at prediction time.
Example: Predicting loan default
Feature: "account_closed" = True
Problem: Account closed BECAUSE they defaulted!
This wouldn't be known when making lending decision.
Fix: Think about what you'd know at prediction time.
# Bad: These are consequences of the target
features_to_remove = [
'account_closed', # Result of default
'sent_to_collections', # Result of default
'bankruptcy_filed' # Result of default
]
X = df.drop(columns=features_to_remove + ['target'])
Type 2: Train-Test Contamination
Test data information leaks into training.
Common mistakes:
Fitting on All Data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: Scaler sees test data!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leaks test-set statistics into scaling
X_train, X_test = train_test_split(X_scaled, ...)

# RIGHT: Fit only on training data
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Only transform!
Feature Selection on All Data
# WRONG
selected_features = select_best_features(X, y)  # Selection sees test info
X_train, X_test = train_test_split(X[selected_features], ...)

# RIGHT: split first, then select using only the training fold
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
selected_features = select_best_features(X_train, y_train)
X_train = X_train[selected_features]
X_test = X_test[selected_features]
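The damage from selecting features on all the data is easy to demonstrate on pure noise. In the sketch below (synthetic data, generic scikit-learn estimators chosen for illustration), the true accuracy is chance, yet the leaky procedure reports a much higher cross-validation score; wrapping selection in a Pipeline keeps it inside each fold and restores an honest estimate:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)   # random labels: true accuracy ~ 0.5

# Leaky: select the "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Safe: selection happens inside each CV fold via a Pipeline
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])
safe = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy: {leaky:.2f}")  # well above chance despite noise
print(f"safe CV accuracy:  {safe:.2f}")   # near chance, as it should be
```

The only difference between the two estimates is where feature selection runs; everything else is identical.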
Type 3: Temporal Leakage
Using future information to predict the past.
# WRONG: Random split with time series
X_train, X_test = train_test_split(X, ...) # 2023 data in train, 2020 in test!
# RIGHT: Split by time
train = df[df['date'] < '2023-01-01']
test = df[df['date'] >= '2023-01-01']
Also watch for features:
# Bad: Uses future aggregations
df['user_total_purchases'] = df.groupby('user')['amount'].transform('sum')
# Better: Use only rows strictly before each one
# (assumes df is sorted by date within each user)
df['user_purchases_before'] = df.groupby('user')['amount'].transform(
    lambda s: s.shift().expanding().sum()
)
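To make the shift-then-expand idea concrete, here is a tiny worked example (assuming rows are already in time order within each user; the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'user':   ['a', 'a', 'a', 'b', 'b'],
    'amount': [10, 20, 30, 5, 15],
})

# shift() drops the current row from its own history;
# expanding().sum() then accumulates only earlier rows
df['past_total'] = df.groupby('user')['amount'].transform(
    lambda s: s.shift().expanding().sum()
)
print(df)
# user 'a': NaN, 10, 30 -- each row sees only purchases before it
# user 'b': NaN, 5
```

The NaN on each user's first row is correct: there is no past to aggregate, which is exactly what production would see for a new customer.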
Type 4: Duplicate/Near-Duplicate Rows
# If same row appears in train and test
# Model memorizes, doesn't generalize
# Check for duplicates before splitting
print(f"Duplicates: {df.duplicated().sum()}")
# Remove duplicates
df = df.drop_duplicates()
# Or split by unique entity (e.g., customer_id)
unique_customers = df['customer_id'].unique()
train_customers, test_customers = train_test_split(unique_customers)
train = df[df['customer_id'].isin(train_customers)]
test = df[df['customer_id'].isin(test_customers)]
How to Detect Leakage
1. Suspiciously Good Performance
if test_accuracy > 0.95:
print("This seems too good. Check for leakage!")
2. Feature Importance Check
# If a feature is unrealistically important, investigate
importance = model.feature_importances_
top_feature = feature_names[importance.argmax()]
print(f"Top feature: {top_feature}")
# Ask: "Would this be available at prediction time?"
3. Time-Based Validation
# If time-based validation is much worse than random CV
cv_score = cross_val_score(model, X, y, cv=5).mean()
time_score = ...  # fill in with your time-based split score
if cv_score - time_score > 0.1:
    print("Possible temporal leakage!")
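One way to compute the time-based score is scikit-learn's TimeSeriesSplit, which only ever trains on earlier folds and tests on later ones. A minimal sketch (the synthetic X and y are placeholders; rows are assumed to be in chronological order):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # placeholder features, rows in time order
y = rng.integers(0, 2, size=200)   # placeholder labels
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Random folds can mix future rows into training
cv_score = cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean()
# TimeSeriesSplit always trains on the past, tests on the future
time_score = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5)).mean()

if cv_score - time_score > 0.1:
    print("Possible temporal leakage!")
```

On real time-series data, a large gap between the two scores is a strong hint that some feature is smuggling in future information.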
Prevention Checklist
- Split data BEFORE any preprocessing
- Use pipelines to prevent leakage
- For time series, always split by time
- Check: "Would I have this feature at prediction time?"
- Remove duplicates before splitting
- Be suspicious of very high scores
Safe Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Pipeline applies steps correctly during CV
pipeline = Pipeline([
('scaler', StandardScaler()),
('selector', SelectKBest(k=20)),
('classifier', RandomForestClassifier())
])
# Safe cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
Key Takeaway
Data leakage is silent and deadly: your model looks great until it reaches production. Always split data first, use pipelines, think carefully about what information would be available at prediction time, and be suspicious of results that seem too good. Prevention is far easier than detection!