# Understanding Training, Validation, and Test Sets
Learn why we split data into training, validation, and test sets, and how to do it correctly.
Why split your data? Because you need to know if your model actually works on NEW data, not just the data it learned from.
## The Problem
Imagine studying for an exam by memorizing all the practice questions.
- If the exam has the SAME questions → you ace it.
- If the exam has DIFFERENT questions → you might fail.
This is exactly what happens with ML models. We need to test on unseen data.
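You can see this memorization effect with a quick experiment. The sketch below is illustrative only: it uses scikit-learn with a synthetic dataset and an unconstrained decision tree (any model that can memorize would do), and compares the score on the data the model learned from against a held-out set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for demonstration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unconstrained tree can memorize
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))  # typically ~1.0 (memorized)
print("Test accuracy:", model.score(X_test, y_test))     # usually noticeably lower
```

The gap between those two numbers is exactly what a held-out set exists to reveal.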
## The Three Splits
### Training Set (60-80% of data)

- Model learns from this
- Like the textbook you study from
### Validation Set (10-20% of data)

- Used to tune the model
- Like practice tests
- Helps you decide model settings
### Test Set (10-20% of data)

- Final evaluation only
- Like the actual exam
- NEVER touch until the very end
## Why Three Sets? Why Not Two?
With just train/test:
```python
# Bad approach
train_model(training_data)
if accuracy_on_test < 0.90:
    tweak_settings()  # Oops! Now the test set influenced your choices
    train_again()
```
The test set gets "leaked" into your decisions. You need a separate validation set for tuning.
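A cleaner workflow keeps the roles separate: tune on the validation set, and score the test set exactly once at the end. Here is a minimal sketch, assuming you have already split your data into `X_train`/`X_val`/`X_test` (as in the next section); the model and the list of candidate settings are just placeholders.

```python
from sklearn.linear_model import LogisticRegression

best_c, best_val_score = None, -1.0

# Try candidate settings using ONLY the validation set
for c in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_c, best_val_score = c, val_score

# Touch the test set once, with the chosen setting, at the very end
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", final_model.score(X_test, y_test))
```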
## How to Split
```python
from sklearn.model_selection import train_test_split

# First split: separate the test set (20% of all data)
train_val, test = train_test_split(data, test_size=0.2)

# Second split: separate validation from training
# (20% of the remaining 80% = 16% of all data)
train, val = train_test_split(train_val, test_size=0.2)

# Result: 64% train, 16% val, 20% test
```
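One practical note: both calls above produce a different random split every run unless you pass `random_state`, which `train_test_split` accepts as a seed. For reproducible experiments you might write:

```python
# Fixing the seed makes the split identical across runs
train_val, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.2, random_state=42)
```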
## Common Splits
| Dataset Size | Train | Validation | Test |
|------------------|-------|------------|------|
| Small (<1K) | 60% | 20% | 20% |
| Medium (1K-100K) | 70% | 15% | 15% |
| Large (>100K) | 80% | 10% | 10% |
With big data, even 10% is still a large number of examples, so you can afford smaller validation and test percentages while keeping the estimates reliable.
## Important Rules
### 1. Shuffle Before Splitting

```python
from sklearn.utils import shuffle

# Data might be ordered (all cats first, then dogs)
data = shuffle(data)  # Randomize the row order first!
```
### 2. Keep Test Set Sacred

Never use the test set for:

- Choosing features
- Tuning hyperparameters
- Deciding which model to use
### 3. Split BEFORE Any Processing

```python
# Wrong - data leakage!
normalized_data = normalize(all_data)
train, test = split(normalized_data)

# Right
train, test = split(raw_data)
train_normalized = normalize(train)  # Learn stats from train only
test_normalized = apply_same_normalization(test)
```
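In scikit-learn terms, "learn stats from train only" usually means fitting a scaler on the training data and reusing it for the test data. A minimal sketch with `StandardScaler` (the `train`/`test` variables here stand in for numeric feature arrays):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_normalized = scaler.fit_transform(train)  # learns mean/std from train only
test_normalized = scaler.transform(test)        # reuses the train statistics
```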
## Cross-Validation
When data is limited, use k-fold cross-validation:
```
Fold 1: [VAL]  [TRAIN][TRAIN][TRAIN][TRAIN]
Fold 2: [TRAIN][VAL]  [TRAIN][TRAIN][TRAIN]
Fold 3: [TRAIN][TRAIN][VAL]  [TRAIN][TRAIN]
Fold 4: [TRAIN][TRAIN][TRAIN][VAL]  [TRAIN]
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][VAL]
```
Each data point lands in the validation fold exactly once; average the scores across folds to get your overall estimate.
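scikit-learn can run this loop for you: `cross_val_score` rotates the folds and returns one score per fold. A small sketch, using a built-in dataset and a placeholder model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 folds: each point is validated once
print("Per-fold accuracy:", scores)
print("Average accuracy:", scores.mean())
```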
## Quick Summary
| Set | Purpose | When Used |
|------------|----------------|-------------------|
| Training | Learn patterns | During training |
| Validation | Tune settings | While developing |
| Test | Final score | Once, at the end |
## The Golden Rule
Your test set score should represent real-world performance. If you peek at it during development, you're fooling yourself about how good your model really is.