
Cross-Validation: The Right Way to Evaluate Models

Learn cross-validation techniques to get reliable estimates of model performance.

Sarah Chen
December 19, 2025

A single train-test split can be misleading. Cross-validation gives you a much more reliable estimate of model performance.

The Problem with a Single Split

# Single split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # 85%

But what if you got lucky (or unlucky) with that particular split?

Different random splits give different results:

  • Split 1: 85%
  • Split 2: 78%
  • Split 3: 91%

Which one is the real performance?
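
You can see this for yourself by re-running the split with different seeds. A quick sketch (assuming X and y are already loaded):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Same data, same model -- only the random split changes
for seed in [0, 1, 2]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"Seed {seed}: {model.score(X_test, y_test):.2f}")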

K-Fold Cross-Validation

Split the data into K parts. Train on K-1 parts, test on the remaining part. Repeat K times, rotating which part is held out.

5-Fold CV:

Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN] → Score: 84%
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN] → Score: 87%
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN] → Score: 82%
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN] → Score: 86%
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST] → Score: 85%

Average: 84.8% ± 1.7%

Every data point gets to be in the test set exactly once!
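
This loop is exactly what scikit-learn's KFold does for you. A minimal sketch (assuming X and y are NumPy arrays):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Train on the other 4 folds, evaluate on the held-out fold
    model.fit(X[train_idx], y[train_idx])
    print(f"Fold {fold}: {model.score(X[test_idx], y[test_idx]):.2f}")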

Code Example

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} ± {scores.std() * 2:.3f}")

Output:

Scores: [0.84 0.87 0.82 0.86 0.85]
Mean: 0.848
Std: 0.017
Rough 95% interval: 0.848 ± 0.034

Choosing K

K Value      Pros                         Cons
K=5          Good balance, common choice  -
K=10         More reliable estimate       Slower
K=n (LOOCV)  Uses all data                Very slow, high variance

Rule of thumb: K=5 or K=10 for most cases.
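
If you're unsure, it's cheap to measure the trade-off on your own data. A sketch (reusing model, X, y from above):

import time

for k in [5, 10]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=k)
    print(f"K={k}: {scores.mean():.3f} ± {scores.std():.3f} "
          f"(took {time.time() - start:.1f}s)")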

Types of Cross-Validation

Stratified K-Fold (for Classification)

Maintains class proportions in each fold:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

Always use this for classification, especially with imbalanced data.
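
To confirm stratification is working, compare each test fold's class balance to the full dataset. A sketch (assuming y is a NumPy array of integer class labels):

import numpy as np

print("Overall: ", np.bincount(y) / len(y))
for _, test_idx in cv.split(X, y):
    # Each fold's class proportions should mirror the overall proportions
    print("Fold:    ", np.bincount(y[test_idx]) / len(test_idx))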

Leave-One-Out (LOOCV)

K = number of samples; each sample is the test set exactly once.

from sklearn.model_selection import LeaveOneOut

cv = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=cv)

Use when: Very small dataset

Time Series Split

For time-dependent data. Train on past, test on future.

Split 1: [TRAIN][TEST]
Split 2: [TRAIN][TRAIN][TEST]
Split 3: [TRAIN][TRAIN][TRAIN][TEST]

from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)

Must use for: Stock prices, sales forecasting, any time-ordered data
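
Printing the split indices makes the expanding-window pattern concrete. A sketch:

for i, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=3).split(X), start=1):
    # Train always ends before test begins -- no peeking into the future
    print(f"Split {i}: train rows {train_idx.min()}-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")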

Group K-Fold

When samples belong to groups that shouldn't be split across train and test.

from sklearn.model_selection import GroupKFold

# Each patient's samples stay together
groups = patient_ids
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, groups=groups)

Use when: Multiple samples from same person/entity
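
A quick sanity check that no group leaks across a split. A sketch (assuming groups is array-like):

import numpy as np

groups_arr = np.asarray(groups)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # Train and test must contain disjoint sets of patients
    assert set(groups_arr[train_idx]).isdisjoint(groups_arr[test_idx])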

Getting More Information

Cross-Validate with Multiple Metrics

from sklearn.model_selection import cross_validate

scores = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

print(f"Test Accuracy: {scores['test_accuracy'].mean():.3f}")
print(f"Test F1: {scores['test_f1'].mean():.3f}")
print(f"Train Accuracy: {scores['train_accuracy'].mean():.3f}")

Get Predictions

from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each made by a model that never trained on that sample
predictions = cross_val_predict(model, X, y, cv=5)
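
Because every prediction comes from a model that never saw that sample, diagnostics built from them are honest. For example:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y, predictions))
print(classification_report(y, predictions))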

Nested Cross-Validation

For hyperparameter tuning + evaluation without data leakage:

Outer loop: Model evaluation
  Inner loop: Hyperparameter tuning

from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV for tuning
inner_cv = StratifiedKFold(n_splits=5)
param_grid = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv)

# Outer CV for evaluation
outer_cv = StratifiedKFold(n_splits=5)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV Score: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

Common Mistakes

1. Data Leakage in Preprocessing

# WRONG
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Learns from ALL data, including test folds
scores = cross_val_score(model, X_scaled, y, cv=5)

# RIGHT - Use Pipeline
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),  # fit on each training fold only
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)

2. Using the Test Set Multiple Times

# WRONG
score = model.score(X_test, y_test)  # Looked at test set
# ... tune model ...
score = model.score(X_test, y_test)  # Looked again!

# RIGHT
# Use CV for tuning, test set only at the very end
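
Put together, a leak-free workflow looks roughly like this (a sketch, reusing the pipe defined earlier):

from sklearn.model_selection import train_test_split

# Hold out a final test set once, up front
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# All tuning and model selection happens via CV on the training set only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# The held-out test set is touched exactly once, at the very end
final_score = pipe.fit(X_train, y_train).score(X_test, y_test)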

Summary

Scenario        CV Type
Classification  StratifiedKFold
Time series     TimeSeriesSplit
Grouped data    GroupKFold
Small data      LOOCV
General         KFold

Key takeaway: Always use cross-validation. A single score is not reliable!

#Machine Learning  #Cross-Validation  #Model Evaluation  #Beginner