ML · 6 min read

Understanding Training, Validation, and Test Sets

Learn why we split data into training, validation, and test sets, and how to do it correctly.

Sarah Chen
December 19, 2025

# Training, Validation, and Test Sets

Why split your data? Because you need to know if your model actually works on NEW data, not just the data it learned from.

## The Problem

Imagine studying for an exam by memorizing all the practice questions.

- If the exam has the SAME questions → You ace it
- If the exam has DIFFERENT questions → You might fail

This is exactly what happens with ML models. We need to test on unseen data.

## The Three Splits

### Training Set (60-80% of data)

- Model learns from this
- Like the textbook you study from

### Validation Set (10-20% of data)

- Used to tune the model
- Like practice tests
- Helps you decide model settings

### Test Set (10-20% of data)

- Final evaluation only
- Like the actual exam
- NEVER touch until the very end

## Why Three Sets? Why Not Two?

With just train/test:

```python
# Bad approach
train_model(training_data)
if accuracy_on_test < 0.90:
    tweak_settings()  # Oops! Now the test set influenced your choices
    train_again()
```

The test set gets "leaked" into your decisions. You need a separate validation set for tuning.
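
With a separate validation set, tuning stays honest. Here is a minimal sketch of that loop; `train_model`, `evaluate`, and `candidate_settings` are hypothetical placeholders in the same spirit as the snippet above, not a real API:

```python
# Better approach: tune on validation, save test for the very end.
# train_model, evaluate, candidate_settings are hypothetical placeholders.
best_model, best_score = None, 0.0
for settings in candidate_settings:
    model = train_model(training_data, settings)
    score = evaluate(model, validation_data)   # validation guides your choices
    if score > best_score:
        best_model, best_score = model, score

final_score = evaluate(best_model, test_data)  # test set touched exactly once
```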

## How to Split

```python
from sklearn.model_selection import train_test_split

# First split: separate the test set
train_val, test = train_test_split(data, test_size=0.2)

# Second split: separate validation from training
train, val = train_test_split(train_val, test_size=0.2)

# Result: 64% train, 16% val, 20% test
```
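
Two refinements worth knowing: `train_test_split` accepts a `stratify` argument that keeps class proportions consistent across splits, and `random_state` makes the split reproducible. A runnable sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 4)             # stand-in features
y = np.random.randint(0, 2, size=1000)  # stand-in binary labels

# stratify keeps class proportions equal in every split;
# random_state fixes the seed so the split is reproducible
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.2, stratify=y_tv, random_state=42
)
```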

## Common Splits

| Dataset Size     | Train | Validation | Test |
|------------------|-------|------------|------|
| Small (<1K)      | 60%   | 20%        | 20%  |
| Medium (1K-100K) | 70%   | 15%        | 15%  |
| Large (>100K)    | 80%   | 10%        | 10%  |

With big data, you can afford smaller validation/test percentages: 10% of a million rows is still 100,000 examples, plenty for a reliable estimate.

## Important Rules

### 1. Shuffle Before Splitting

```python
from sklearn.utils import shuffle

# Data might be ordered (all cats first, then dogs)
data = shuffle(data)  # Randomize first!
```

(`train_test_split` shuffles by default. The one exception: don't shuffle time-series data, or future information leaks into training.)

### 2. Keep Test Set Sacred

Never use the test set for:

- Choosing features
- Tuning hyperparameters
- Deciding which model to use

### 3. Split BEFORE Any Processing

```python
# Wrong - data leakage!
normalized_data = normalize(all_data)
train, test = split(normalized_data)

# Right
train, test = split(raw_data)
train_normalized = normalize(train)  # Learn stats from train only
test_normalized = apply_same_normalization(test)
```
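
For a concrete version of the "right" pattern, here is a sketch using scikit-learn's `StandardScaler`; the random `X` is just a stand-in for your raw feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)  # stand-in for your raw feature matrix

# Split first, on raw data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```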

## Cross-Validation

When data is limited, use k-fold cross-validation:

```
Fold 1: [VAL]   [TRAIN] [TRAIN] [TRAIN] [TRAIN]
Fold 2: [TRAIN] [VAL]   [TRAIN] [TRAIN] [TRAIN]
Fold 3: [TRAIN] [TRAIN] [VAL]   [TRAIN] [TRAIN]
Fold 4: [TRAIN] [TRAIN] [TRAIN] [VAL]   [TRAIN]
Fold 5: [TRAIN] [TRAIN] [TRAIN] [TRAIN] [VAL]
```

Each data point serves as validation exactly once. Average the k scores for a more stable estimate than a single split gives you.
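
scikit-learn's `cross_val_score` handles the fold rotation for you. A minimal runnable example on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 rotates through the five folds shown above
scores = cross_val_score(model, X, y, cv=5)
print(f"Per-fold accuracy: {scores}")
print(f"Mean: {scores.mean():.3f}")
```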

## Quick Summary

| Set        | Purpose        | When Used        |
|------------|----------------|------------------|
| Training   | Learn patterns | During training  |
| Validation | Tune settings  | While developing |
| Test       | Final score    | Once, at the end |

## The Golden Rule

Your test set score should represent real-world performance. If you peek at it during development, you're fooling yourself about how good your model really is.

#MachineLearning #DataSplitting #Beginner