# Cross-Validation: The Right Way to Evaluate Models
Learn cross-validation techniques to get reliable estimates of model performance.
A single train-test split can be misleading. Cross-validation gives you reliable, robust performance estimates.
## The Problem with a Single Split
```python
# Single split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # 85%
```
But what if you got lucky (or unlucky) with that particular split?
Different random splits give different results:

- Split 1: 85%
- Split 2: 78%
- Split 3: 91%
Which one is the real performance?
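You can see this for yourself by scoring the same model on a few different random splits. Here is a minimal sketch; `make_classification` and `LogisticRegression` are just stand-ins for whatever data and model you are actually using:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy dataset standing in for your real data
X, y = make_classification(n_samples=300, random_state=0)

# Same model, same data -- only the random split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"random_state={seed}: accuracy = {model.score(X_test, y_test):.2f}")
```

The scores will typically wobble by several percentage points, even though nothing about the model or data changed.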
## K-Fold Cross-Validation
Split the data into K parts. Train on K-1 parts, test on the remaining part. Repeat K times.
```
5-Fold CV:

Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN] → Score: 84%
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN] → Score: 87%
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN] → Score: 82%
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN] → Score: 86%
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST] → Score: 85%

Average: 84.8% ± 1.7%
```
Every data point gets to be in the test set exactly once!
## Code Example
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} ± {scores.std() * 2:.3f}")
```
Output:

```
Scores: [0.84 0.87 0.82 0.86 0.85]
Mean: 0.848
Std: 0.017
95% CI: 0.848 ± 0.034
```
## Choosing K
| K Value | Pros | Cons |
|---------|------|------|
| K=5 | Good balance, common choice | - |
| K=10 | More reliable estimate | Slower |
| K=n (LOOCV) | Uses all data | Very slow, high variance |
**Rule of thumb:** K=5 or K=10 for most cases.
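If you want to check the trade-off on your own data, it is easy to compare a couple of K values directly. A quick sketch, assuming the same `model`, `X`, and `y` as in the code example above:

```python
from sklearn.model_selection import cross_val_score

# Compare the estimate (and its spread) for different K on the same model
for k in (5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"K={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```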
## Types of Cross-Validation
### Stratified K-Fold (for Classification)
Maintains class proportions in each fold:
```python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
```
**Always use this for classification**, especially with imbalanced data.
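To convince yourself that stratification is working, count the class labels that land in each test fold; every fold should roughly mirror the overall class balance. A small sketch, assuming `X` and `y` are NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each test fold should roughly mirror the overall class proportions
for i, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    labels, counts = np.unique(y[test_idx], return_counts=True)
    print(f"Fold {i}: {dict(zip(labels.tolist(), counts.tolist()))}")
```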
### Leave-One-Out (LOOCV)
K equals the number of samples, so each sample is the test set exactly once.
```python
from sklearn.model_selection import LeaveOneOut

cv = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=cv)
```
**Use when:** Very small dataset
### Time Series Split
For time-dependent data. Train on past, test on future.
```
Split 1: [TRAIN][TEST]
Split 2: [TRAIN][TRAIN][TEST]
Split 3: [TRAIN][TRAIN][TRAIN][TEST]
```
```python
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```
**Must use for:** Stock prices, sales forecasting, any time-ordered data
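To see the "train on past, test on future" pattern in the indices themselves, you can print what `TimeSeriesSplit` produces on a toy, time-ordered array (the array here is just a stand-in for your real features):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-ordered data: indices 0..99 stand in for consecutive days
X_toy = np.arange(100).reshape(-1, 1)

cv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, test_idx) in enumerate(cv.split(X_toy), start=1):
    print(f"Split {i}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Every test window starts after the training window ends, so the model never peeks at the future.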
### Group K-Fold
When samples belong to groups that shouldn't be split across folds.
```python
from sklearn.model_selection import GroupKFold

# Each patient's samples stay together
groups = patient_ids
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, groups=groups)
```
**Use when:** Multiple samples from same person/entity
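To verify that no group leaks across folds, you can print which groups end up in each test fold. A small sketch with made-up patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 12 samples from 4 hypothetical patients (3 samples each)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
X_toy = np.random.rand(12, 2)

cv = GroupKFold(n_splits=4)
for i, (train_idx, test_idx) in enumerate(cv.split(X_toy, groups=groups), start=1):
    print(f"Fold {i}: test patients = {sorted(set(groups[test_idx].tolist()))}")
```

Each patient appears in exactly one test fold, so the model is always evaluated on people it has never seen.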
## Getting More Information
### Cross-Validate with Multiple Metrics
```python
from sklearn.model_selection import cross_validate

scores = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

print(f"Test Accuracy: {scores['test_accuracy'].mean():.3f}")
print(f"Test F1: {scores['test_f1'].mean():.3f}")
print(f"Train Accuracy: {scores['train_accuracy'].mean():.3f}")
```
### Get Predictions
```python
from sklearn.model_selection import cross_val_predict

# Predictions from each fold
predictions = cross_val_predict(model, X, y, cv=5)
```
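Because these are out-of-fold predictions for every sample, you can feed them into any metric you like, for example a confusion matrix or a full classification report (assuming the classification setup from above):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Every row of `predictions` was made by a model that never saw that sample
print(confusion_matrix(y, predictions))
print(classification_report(y, predictions))
```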
## Nested Cross-Validation
For hyperparameter tuning + evaluation without data leakage:
```
Outer loop: Model evaluation
Inner loop: Hyperparameter tuning
```
```python
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV for tuning
inner_cv = StratifiedKFold(n_splits=5)
param_grid = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv)

# Outer CV for evaluation
outer_cv = StratifiedKFold(n_splits=5)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV Score: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```
## Common Mistakes
### 1. Data Leakage in Preprocessing

```python
# WRONG - the scaler learns from all data, including the test folds
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)

# RIGHT - use a Pipeline so scaling is fit only on the training folds
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
```
### 2. Using the Test Set Multiple Times

```python
# WRONG
score = model.score(X_test, y_test)  # Looked at the test set
# ... tune model ...
score = model.score(X_test, y_test)  # Looked again!

# RIGHT
# Use CV for tuning; touch the test set only once, at the very end
```
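One way to keep the test set honest: carve it off first, do all tuning with cross-validation on the training portion only, and score the test set exactly once at the end. A minimal sketch; the split ratio and parameter grid are just illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Hold out a test set and don't touch it during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Tune hyperparameters with cross-validation on the training data only
param_grid = {'C': [0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

# 3. One final evaluation on the untouched test set
print(f"Final test score: {grid.score(X_test, y_test):.3f}")
```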
## Summary
| Scenario | CV Type |
|----------|---------|
| Classification | StratifiedKFold |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Small data | LOOCV |
| General | KFold |
**Key takeaway:** Always use cross-validation. A single score is not reliable!