Cross-Validation
Evaluate model performance properly by testing on more than one split of the data.
What is Cross-Validation?
Cross-validation is a more reliable way to measure model performance than a single train/test split.
Like taking multiple practice exams instead of just one!
Why Needed?
A single split might be lucky or unlucky:

- **Lucky**: Easy test data → accuracy is overestimated
- **Unlucky**: Hard test data → accuracy is underestimated
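To see why this matters, here is a minimal sketch (using a synthetic dataset from scikit-learn's `make_classification`, purely for illustration) that scores the same model on five different random splits:

```python
# Illustrative only: the same model, scored on five different single splits.
# A synthetic dataset stands in for your own X, y here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)

for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(random_state=0)
    model.fit(X_train, y_train)
    # The score moves around depending on which rows land in the test set
    print(f"split {seed}: accuracy = {model.score(X_test, y_test):.2f}")
```

Cross-validation averages out this luck by testing on every part of the data.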
K-Fold Cross-Validation
Split data into K parts, test K times:
**5-Fold Example**:

1. Use fold 1 as test, the others as train
2. Use fold 2 as test, the others as train
3. Use fold 3 as test, the others as train
4. Use fold 4 as test, the others as train
5. Use fold 5 as test, the others as train
**Final score**: Average of all 5 tests
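Before using the one-line helper shown in the next section, it can help to see the procedure spelled out. Here is a rough sketch of 5-fold CV done by hand with scikit-learn's `KFold`, assuming `X` and `y` are NumPy arrays:

```python
# Sketch of the 5-fold procedure above, done by hand with KFold.
# Assumes X (features) and y (labels) are NumPy arrays.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in kf.split(X):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                       # train on 4 folds
    fold_scores.append(model.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print(f"Fold scores: {fold_scores}")
print(f"Final score (average): {np.mean(fold_scores):.2f}")
```

`cross_val_score`, shown next, runs exactly this loop for you.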
Python Code
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

# 5-fold cross-validation (X = features, y = labels)
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores: {scores}")
print(f"Average: {scores.mean():.2f}")
print(f"Std Dev: {scores.std():.2f}")
```
Stratified K-Fold
Keeps class proportions in each fold:
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=skf)
```
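As a quick, purely illustrative check, you can print the positive-class rate per fold; assuming `y` is a NumPy array of 0/1 labels, each fold should roughly match the overall rate:

```python
# Illustrative check: each stratified fold keeps roughly the same
# fraction of positives as the full dataset (y assumed to be 0/1 labels).
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
print(f"Overall positive rate: {y.mean():.2f}")
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {i} positive rate: {y[test_idx].mean():.2f}")
```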
Leave-One-Out
Use each sample as the test set once, training on all the others:
Good for small datasets, but slow!
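A minimal sketch with scikit-learn's `LeaveOneOut`, reusing the `model`, `X`, and `y` from above; note that the model is refit once per sample:

```python
# Leave-one-out CV: with n samples, the model is fit n times,
# so this only makes sense for small datasets.
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print(f"Number of fits: {len(scores)}")        # equals the number of samples
print(f"Average accuracy: {scores.mean():.2f}")
```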
Time Series Split
For time-based data:
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
```
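To see what this does, here is a small illustrative example with a made-up array of 12 time-ordered samples; each training set only contains indices that come before its test set, so the model never trains on the future:

```python
# Illustrative: print the index ranges TimeSeriesSplit produces.
# X_demo is a hypothetical array of 12 time-ordered samples.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X_demo), start=1):
    print(f"Split {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```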
Best Practices
- Use 5 or 10 folds
- Use stratified folds for imbalanced data
- Use a time series split for temporal data
- More folds = more reliable, but slower
Remember
- More reliable than a single split
- 5-fold is the standard choice
- Report the average performance across folds