
Cross-Validation: Testing Your Model Properly

Master cross-validation, the standard way to evaluate ML models. Learn how k-fold and stratified k-fold work and when to use each, so that the scores you report reflect how the model will behave in production.

Dr. Alex Kumar
December 18, 2025

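Testing your model on the same data you trained it on is like taking an exam with the answer key in hand: the score says little about what the model actually learned. Cross-validation evaluates the model on data it did not see during training, which gives a more honest estimate of how it will perform on new data.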

What is Cross-Validation?

Cross-validation splits your data into multiple folds, trains on some folds and tests on the others, then repeats the process so that every fold is used for testing once. Averaging over these rounds gives a more reliable estimate of performance on unseen data than a single train/test split.

K-Fold Cross-Validation

K-fold is the most common variant: split the data into k folds (usually 5 or 10), train on k-1 of them, test on the remaining one, and repeat k times so that each fold serves as the test set exactly once. You get k performance scores, from which you can calculate the average and standard deviation.
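
To make the splitting step concrete, here is a minimal sketch in plain Python of how the folds could be generated. The helper name `k_fold_indices` and its parameters are hypothetical, chosen for illustration; libraries such as scikit-learn provide their own, more complete implementations.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Shuffle once so the folds do not depend on the original row order.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: train on {len(train)} samples, test on indices {sorted(test)}")
```

Every sample lands in exactly one test fold, so each data point is used for testing once and for training k-1 times.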

Stratified K-Fold

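For classification problems with imbalanced classes, use stratified k-fold. It keeps the proportion of each class in every fold approximately equal to the proportion in the full dataset, so no fold ends up with too few minority-class examples to produce a meaningful score.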
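
The idea behind stratification can be sketched in a few lines of plain Python: group the samples by class, then deal each class out across the folds in turn. The function name `stratified_fold_assignment` is hypothetical, and this simplified version skips the within-class shuffling a library implementation would normally offer.

```python
from collections import Counter, defaultdict

def stratified_fold_assignment(labels, k):
    """Return a fold number (0..k-1) for every sample, balancing classes across folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_of = [None] * len(labels)
    for members in by_class.values():
        # Deal the members of each class out round-robin across the folds.
        for position, index in enumerate(members):
            fold_of[index] = position % k
    return fold_of

labels = ["A"] * 80 + ["B"] * 20  # toy data with an 80/20 class imbalance
fold_of = stratified_fold_assignment(labels, k=5)

for fold in range(5):
    counts = Counter(label for label, f in zip(labels, fold_of) if f == fold)
    print(f"fold {fold}: {dict(counts)}")
```

With this toy data every fold receives 16 samples of class A and 4 of class B, matching the 80/20 split of the full dataset.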

When to Use What

I'll show you when to use a simple train/test split, when to use k-fold, and when to use stratified k-fold. Understanding the difference helps you evaluate models correctly and avoid false confidence.


Common Questions & Answers

Q1

What is k-fold cross-validation?

A

K-fold cross-validation splits the data into k roughly equal parts (folds). The model trains on k-1 folds and is tested on the remaining fold. This process repeats k times, with each fold serving as the test set once. You get k performance scores, and their average and variance give a more reliable evaluation than a single split.

python
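from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Placeholder data so the snippet runs on its own; substitute your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Create model (fixed seed so the results are reproducible)
model = RandomForestClassifier(random_state=42)

# 5-fold cross-validation; shuffle so the folds do not depend on row order
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

# Each score comes from a different train/test split
print(f"Average accuracy: {scores.mean():.3f}")
print(f"Standard deviation: {scores.std():.3f}")

Q2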

What is stratified k-fold cross-validation?

A

Stratified k-fold keeps the proportion of each class in every fold approximately the same as in the original dataset. This matters most for imbalanced datasets, where a plain random split can leave some folds with very few, or even no, examples of a minority class.

python
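from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder imbalanced data (about 80% / 20%); substitute your own X and y
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
model = RandomForestClassifier(random_state=42)

# For classification with imbalanced classes
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skfold, scoring='accuracy')
print(f"Average accuracy: {scores.mean():.3f}")

# Each fold maintains the class distribution:
# if the original data has 80% class A and 20% class B,
# each fold will have approximately 80% A and 20% B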
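
One way to check this claim is to count the minority-class samples that land in each test fold. The sketch below uses toy labels that are assumed purely for illustration (80 samples of class 0, 20 of class 1) and compares `KFold` with `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels, assumed for illustration: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not affect how the folds are drawn

def minority_counts(splitter):
    """Count the class-1 samples in the test part of each fold."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

With stratification each test fold holds exactly 4 of the 20 minority samples, whereas the plain shuffled split can distribute them unevenly from fold to fold.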