# Understanding Training, Validation, and Test Sets
Learn why we split data into training, validation, and test sets, and how to do it correctly.
Why split your data? Because you need to know if your model actually works on NEW data, not just the data it learned from.
## The Problem
Imagine studying for an exam by memorizing all the practice questions.
- If the exam has the SAME questions → you ace it.
- If the exam has DIFFERENT questions → you might fail.
This is exactly what happens with ML models. We need to test on unseen data.
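You can see this memorization effect with a quick experiment. The sketch below is illustrative only: it uses scikit-learn with a synthetic dataset and an unconstrained decision tree (any model that can memorize would do), and compares the score on the data the model learned from against a held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for demonstration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unconstrained tree can memorize
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))  # typically ~1.0 (memorized)
print("Test accuracy:", model.score(X_test, y_test))     # usually noticeably lower
```

The gap between those two numbers is exactly what a held-out set exists to reveal.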
## The Three Splits
### Training Set (60-80% of data)

- Model learns from this
- Like the textbook you study from
### Validation Set (10-20% of data)

- Used to tune the model
- Like practice tests
- Helps you decide model settings
### Test Set (10-20% of data)

- Final evaluation only
- Like the actual exam
- NEVER touch until the very end
## Why Three Sets? Why Not Two?
With just train/test:
```python
# Bad approach
train_model(training_data)
if accuracy_on_test < 0.90:
    tweak_settings()  # Oops! Now the test set influenced your choices
    train_again()
```
The test set gets "leaked" into your decisions. You need a separate validation set for tuning.
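A cleaner workflow keeps the roles separate: tune on the validation set, and score the test set exactly once at the end. Here is a minimal sketch, assuming you have already split your data into `X_train`/`X_val`/`X_test` (as in the next section); the model and the list of candidate settings are just placeholders.

```python
from sklearn.linear_model import LogisticRegression

best_c, best_val_score = None, -1.0

# Try candidate settings using ONLY the validation set
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_c, best_val_score = c, val_score

# Touch the test set once, with the chosen setting, at the very end
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", final_model.score(X_test, y_test))
```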
## How to Split
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set (20% of all data)
train_val, test = train_test_split(data, test_size=0.2)

# Second split: separate validation from training
# (20% of the remaining 80% = 16% of all data)
train, val = train_test_split(train_val, test_size=0.2)

# Result: 64% train, 16% val, 20% test
```
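One practical note: both calls above produce a different random split every run unless you pass `random_state`, which `train_test_split` accepts as a seed. For reproducible experiments you might write:

```python
# Fixing the seed makes the split identical across runs
train_val, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.2, random_state=42)
```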
## Common Splits
| Dataset Size | Train | Validation | Test |
|------------------|-------|------------|------|
| Small (<1K) | 60% | 20% | 20% |
| Medium (1K-100K) | 70% | 15% | 15% |
| Large (>100K) | 80% | 10% | 10% |
With big data, even 10% is still a large number of examples, so you can afford smaller validation and test percentages while keeping the estimates reliable.
## Important Rules
### 1. Shuffle Before Splitting

```python
from sklearn.utils import shuffle

# Data might be ordered (all cats first, then dogs)
data = shuffle(data)  # Randomize the row order first!
```
### 2. Keep Test Set Sacred

Never use the test set for:

- Choosing features
- Tuning hyperparameters
- Deciding which model to use
### 3. Split BEFORE Any Processing

```python
# Wrong - data leakage!
normalized_data = normalize(all_data)
train, test = split(normalized_data)

# Right
train, test = split(raw_data)
train_normalized = normalize(train)  # Learn stats from train only
test_normalized = apply_same_normalization(test)
```
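In scikit-learn terms, "learn stats from train only" usually means fitting a scaler on the training data and reusing it for the test data. A minimal sketch with `StandardScaler` (the `train`/`test` variables here stand in for numeric feature arrays):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_normalized = scaler.fit_transform(train)  # learns mean/std from train only
test_normalized = scaler.transform(test)        # reuses the train statistics
```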
## Cross-Validation
When data is limited, use k-fold cross-validation:
```
Fold 1: [VAL]  [TRAIN][TRAIN][TRAIN][TRAIN]
Fold 2: [TRAIN][VAL]  [TRAIN][TRAIN][TRAIN]
Fold 3: [TRAIN][TRAIN][VAL]  [TRAIN][TRAIN]
Fold 4: [TRAIN][TRAIN][TRAIN][VAL]  [TRAIN]
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][VAL]
```
Each data point lands in the validation fold exactly once; average the scores across folds to get your overall estimate.
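scikit-learn can run this loop for you: `cross_val_score` rotates the folds and returns one score per fold. A small sketch, using a built-in dataset and a placeholder model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 folds: each point is validated once
print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```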
## Quick Summary
| Set | Purpose | When Used |
|------------|----------------|-------------------|
| Training | Learn patterns | During training |
| Validation | Tune settings | While developing |
| Test | Final score | Once, at the end |
## The Golden Rule
Your test set score should represent real-world performance. If you peek at it during development, you're fooling yourself about how good your model really is.