
Cross-Validation: The Right Way to Evaluate Models

Learn cross-validation techniques to get reliable estimates of model performance.

Sarah Chen
December 19, 2025

A single train-test split can be misleading. Cross-validation gives you reliable, robust performance estimates.

The Problem with a Single Split

```python
from sklearn.model_selection import train_test_split

# Single split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)  # 85%
```

But what if you got lucky (or unlucky) with that particular split?

Different random splits give different results:

- Split 1: 85%
- Split 2: 78%
- Split 3: 91%

Which one is the real performance?
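You can see this instability for yourself by scoring the same model over a few random splits. (A minimal sketch; `X`, `y`, and the choice of `LogisticRegression` are placeholders for your own data and model.)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, same model - only the random split changes
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"Seed {seed}: accuracy = {model.score(X_test, y_test):.3f}")
```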

K-Fold Cross-Validation

Split data into K parts. Train on K-1 parts, test on 1 part. Repeat K times.

```
5-Fold CV:

Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN] → Score: 84%
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN] → Score: 87%
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN] → Score: 82%
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN] → Score: 86%
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST] → Score: 85%

Average: 84.8% ± 1.8%
```

Every data point gets to be in the test set exactly once!

Code Example

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")
print(f"95% CI: {scores.mean():.3f} ± {scores.std() * 2:.3f}")
```

Output:

```
Scores: [0.84 0.87 0.82 0.86 0.85]
Mean: 0.848
Std: 0.018
95% CI: 0.848 ± 0.036
```
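Under the hood, `cross_val_score` does roughly the following. (A sketch using scikit-learn's `KFold`; assumes `X` and `y` are NumPy arrays and `model` is the estimator defined above.)

```python
from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    fold_scores.append(fold_model.score(X[test_idx], y[test_idx]))

print(fold_scores)
```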

Choosing K

| K Value | Pros | Cons |
|---------|------|------|
| K=5 | Good balance, common choice | - |
| K=10 | More reliable estimate | Slower |
| K=n (LOOCV) | Uses all data | Very slow, high variance |

**Rule of thumb:** K=5 or K=10 for most cases.

Types of Cross-Validation

### Stratified K-Fold (for Classification)

Maintains class proportions in each fold:

```python
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
```

**Always use for classification!** Especially with imbalanced data.
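To convince yourself the folds really are stratified, print the class balance of each test fold. (A small sketch, assuming a binary 0/1 target stored in a NumPy array `y`.)

```python
import numpy as np

for i, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    # Positive-class fraction in this fold's test split
    rate = np.mean(y[test_idx] == 1)
    print(f"Fold {i}: positive rate = {rate:.2f}")
```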

### Leave-One-Out (LOOCV)

K = number of samples. Each sample is test set once.

```python
from sklearn.model_selection import LeaveOneOut

cv = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=cv)
```

**Use when:** Very small dataset

### Time Series Split

For time-dependent data. Train on past, test on future.

```
Split 1: [TRAIN][TEST]
Split 2: [TRAIN][TRAIN][TEST]
Split 3: [TRAIN][TRAIN][TRAIN][TEST]
```

```python
from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)
```

**Must use for:** Stock prices, sales forecasting, any time-ordered data
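Printing the fold indices makes the expanding-window behavior explicit. (A sketch; assumes the rows of `X` are already sorted by time.)

```python
for i, (train_idx, test_idx) in enumerate(cv.split(X), start=1):
    # The training window always ends before the test window begins
    print(f"Split {i}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```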

### Group K-Fold

When samples belong to groups that shouldn't be split.

```python
from sklearn.model_selection import GroupKFold

# Each patient's samples stay together
groups = patient_ids
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, groups=groups)
```

**Use when:** Multiple samples from same person/entity
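A quick sanity check that no group ends up on both sides of a split. (A sketch, assuming `groups` is a NumPy array such as the `patient_ids` above.)

```python
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # A patient must never appear in both train and test
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    assert not overlap, f"Leaked groups: {overlap}"
```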

Getting More Information

### Cross-Validate with Multiple Metrics

```python
from sklearn.model_selection import cross_validate

scores = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

print(f"Test Accuracy: {scores['test_accuracy'].mean():.3f}")
print(f"Test F1: {scores['test_f1'].mean():.3f}")
print(f"Train Accuracy: {scores['train_accuracy'].mean():.3f}")
```

### Get Predictions

```python
from sklearn.model_selection import cross_val_predict

# Predictions from each fold
predictions = cross_val_predict(model, X, y, cv=5)
```
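Out-of-fold predictions are handy for error analysis, because every prediction comes from a model that never saw that sample during fitting. (A sketch for a classification problem.)

```python
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y, predictions))
print(classification_report(y, predictions))
```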

Nested Cross-Validation

For hyperparameter tuning + evaluation without data leakage:

```
Outer loop: Model evaluation
Inner loop: Hyperparameter tuning
```

```python
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV for tuning
inner_cv = StratifiedKFold(n_splits=5)
param_grid = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv)

# Outer CV for evaluation
outer_cv = StratifiedKFold(n_splits=5)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv)

print(f"Nested CV Score: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")
```

Common Mistakes

### 1. Data Leakage in Preprocessing

```python
# WRONG
X_scaled = scaler.fit_transform(X)  # Learns from all data
scores = cross_val_score(model, X_scaled, y, cv=5)

# RIGHT - Use a Pipeline so scaling is fit inside each training fold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipe, X, y, cv=5)
```

### 2. Using Test Set Multiple Times

```python
# WRONG
score = model.score(X_test, y_test)  # Looked at test set
# ... tune model ...
score = model.score(X_test, y_test)  # Looked again!

# RIGHT
# Use CV for tuning, test set only at the very end
```
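Put together, the workflow looks roughly like this. (A sketch; the parameter grid and `LogisticRegression` are illustrative.)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set once, up front
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Tune with cross-validation on the training portion only
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Touch the test set exactly once, at the very end
print(f"Final test score: {search.score(X_test, y_test):.3f}")
```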

Summary

| Scenario | CV Type |
|----------|---------|
| Classification | StratifiedKFold |
| Time series | TimeSeriesSplit |
| Grouped data | GroupKFold |
| Small data | LOOCV |
| General | KFold |

**Key takeaway:** Always use cross-validation. A single score is not reliable!

#Machine Learning #Cross-Validation #Model Evaluation #Beginner