Cross-Validation: The Right Way to Evaluate Models
Learn cross-validation techniques to get reliable estimates of model performance.
A single train-test split can be misleading. Cross-validation gives you more reliable performance estimates.
The Problem with a Single Split
# Single split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # 85%
But what if you got lucky (or unlucky) with that particular split?
Different random splits give different results:
- Split 1: 85%
- Split 2: 78%
- Split 3: 91%
Which one is the real performance?
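You can see this instability directly by re-running the split with a few different seeds. A minimal sketch, assuming X, y are already loaded and using LogisticRegression as a stand-in model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Same data, same model -- only the random split changes
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"Seed {seed}: {model.score(X_test, y_test):.2f}")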
K-Fold Cross-Validation
Split data into K parts. Train on K-1 parts, test on 1 part. Repeat K times.
5-Fold CV:
Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN] → Score: 84%
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN] → Score: 87%
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN] → Score: 82%
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN] → Score: 86%
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST] → Score: 85%
Average: 84.8% ± 1.7%
Every data point gets to be in the test set exactly once!
Code Example
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} ± {scores.std() * 2:.3f}")
Output:
Scores: [0.84 0.87 0.82 0.86 0.85]
Mean: 0.848
Std: 0.017
~95% interval: 0.848 ± 0.034
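For intuition, here is roughly what cross_val_score does under the hood, written as a manual loop. A sketch, assuming X and y are NumPy arrays:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))
print(f"Mean: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")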
Choosing K
| K Value | Pros | Cons |
|---|---|---|
| K=5 | Good balance of accuracy and compute; common default | Slightly noisier estimate than K=10 |
| K=10 | More reliable estimate | Slower |
| K=n (LOOCV) | Maximum training data per fit | Very slow; estimate can have high variance |
Rule of thumb: K=5 or K=10 for most cases.
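If you're unsure, it's cheap to compare a couple of values of K on your own data. A sketch, reusing the cross_val_score call from above:

for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"K={k}: {scores.mean():.3f} ± {scores.std():.3f}")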
Types of Cross-Validation
Stratified K-Fold (for Classification)
Maintains class proportions in each fold:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
Use it for classification, especially with imbalanced data. (With a classifier and an integer cv, cross_val_score already applies stratified folds by default, though without shuffling.)
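You can verify that stratification works by counting labels in each test fold. A sketch, assuming y is a NumPy array of integer class labels:

import numpy as np

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    # Each fold's test set should show roughly the same class proportions
    print(f"Fold {fold} test class counts: {np.bincount(y[test_idx])}")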
Leave-One-Out (LOOCV)
K equals the number of samples: each sample serves as the test set exactly once.
from sklearn.model_selection import LeaveOneOut
cv = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=cv)
Use when: The dataset is very small (LOOCV requires n model fits, so it is impractical for large n)
Time Series Split
For time-dependent data. Train on past, test on future.
Split 1: [TRAIN][TEST]
Split 2: [TRAIN][TRAIN][TEST]
Split 3: [TRAIN][TRAIN][TRAIN][TEST]
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
Must use for: Stock prices, sales forecasting, any time-ordered data
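You can print the index ranges to see the expanding window in action. A sketch, assuming rows are already in time order:

for i, (train_idx, test_idx) in enumerate(cv.split(X), start=1):
    # The training window always starts at row 0 and grows; the test window moves forward
    print(f"Split {i}: train=[0..{train_idx[-1]}], test=[{test_idx[0]}..{test_idx[-1]}]")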
Group K-Fold
When samples belong to groups that shouldn't be split across train and test folds (otherwise information leaks between them).
from sklearn.model_selection import GroupKFold
# Each patient's samples stay together
groups = patient_ids
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, groups=groups)
Use when: Multiple samples from same person/entity
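A quick sanity check that no group ends up on both sides of a split. A sketch, assuming groups is array-like:

import numpy as np

groups = np.asarray(groups)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Train and test groups must be disjoint, or patient-level information leaks
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))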
Getting More Information
Cross-Validate with Multiple Metrics
from sklearn.model_selection import cross_validate
scores = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],  # binary-classification metrics
    return_train_score=True
)
print(f"Test Accuracy: {scores['test_accuracy'].mean():.3f}")
print(f"Test F1: {scores['test_f1'].mean():.3f}")
print(f"Train Accuracy: {scores['train_accuracy'].mean():.3f}")
Get Predictions
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each sample is predicted by the model
# that was trained without it
predictions = cross_val_predict(model, X, y, cv=5)
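These out-of-fold predictions are useful for diagnostics that need one prediction per sample, such as a confusion matrix. A sketch:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y, predictions))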
Nested Cross-Validation
For hyperparameter tuning + evaluation without data leakage:
Outer loop: Model evaluation
Inner loop: Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
# Inner CV for tuning
inner_cv = StratifiedKFold(n_splits=5)
param_grid = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv)
# Outer CV for evaluation
outer_cv = StratifiedKFold(n_splits=5)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(f"Nested CV Score: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
Common Mistakes
1. Data Leakage in Preprocessing
# WRONG: the scaler learns statistics (mean, std) from ALL rows,
# including rows that will later land in test folds
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # leaks test-fold information
scores = cross_val_score(model, X_scaled, y, cv=5)
# RIGHT - use a Pipeline so scaling is re-fit inside each fold
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),   # fit on each fold's training data only
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
2. Using the Test Set Multiple Times
# WRONG
score = model.score(X_test, y_test) # Looked at test set
# ... tune model ...
score = model.score(X_test, y_test) # Looked again!
# RIGHT
# Use CV for tuning, test set only at the very end
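A minimal sketch of that workflow (the parameter grid here is illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a final test set up front
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune with CV on the training portion only
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Touch the test set exactly once, at the very end
print(f"Final test score: {search.score(X_test, y_test):.3f}")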
Summary
| Scenario | CV Type |
|---|---|
| Classification | StratifiedKFold |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Small data | LOOCV |
| General | KFold |
Key takeaway: Always use cross-validation. A single train-test score is one noisy draw; cross-validation gives you both a mean and a spread.