Gradient Boosting: Sequential Ensemble Learning
Understand how Gradient Boosting builds trees sequentially, each fixing the mistakes of the previous ones.
Random Forest builds trees in parallel. Gradient Boosting builds them sequentially, where each tree learns from the mistakes of all previous trees.
The Core Idea
1. Train a simple model
2. Calculate errors (residuals)
3. Train the next model to predict those errors
4. Add it to the ensemble
5. Repeat
Each new tree focuses on what the ensemble got wrong.
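To make the loop concrete, here is a minimal from-scratch sketch for regression with squared error, where the residuals are exactly the errors each new tree learns to predict. The function names and defaults are illustrative, assuming only numpy and scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Return the initial constant prediction and the list of fitted trees."""
    base = y.mean()
    prediction = np.full(len(y), base)                 # 1. start with a simple model
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                     # 2. errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                         # 3. next tree predicts those errors
        prediction += learning_rate * tree.predict(X)  # 4. add it to the ensemble
        trees.append(tree)                             # 5. repeat
    return base, trees

def predict_boosting(X, base, trees, learning_rate=0.1):
    prediction = np.full(X.shape[0], base)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```

Libraries like XGBoost add regularization and clever optimizations on top, but this is the essence of the algorithm.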
Random Forest vs Gradient Boosting
| Aspect | Random Forest | Gradient Boosting |
|--------|--------------|-------------------|
| Tree Building | Parallel | Sequential |
| Tree Type | Deep, independent | Shallow, dependent |
| Speed | Faster training | Slower training |
| Overfitting | More resistant | Can overfit easily |
| Tuning | Less sensitive | Needs careful tuning |
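As a rough illustration of the table, the following hedged sketch cross-validates both ensembles on the same synthetic data using scikit-learn; the dataset and parameters are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data just for the comparison
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)                    # deep, independent trees
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)   # shallow, sequential trees

print("Random Forest:    ", cross_val_score(rf, X_demo, y_demo, cv=5).mean())
print("Gradient Boosting:", cross_val_score(gb, X_demo, y_demo, cv=5).mean())
```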
Implementation with XGBoost
XGBoost is the most popular gradient boosting library:
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X, y: your feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and train
model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,          # Shallow trees
    learning_rate=0.1,    # Step size
    subsample=0.8,        # Row sampling
    colsample_bytree=0.8  # Column sampling
)
model.fit(X_train, y_train)

# Evaluate
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```
Key Parameters
**n_estimators:** Number of boosting rounds
- More = potentially better, but slower
- Use early stopping to find the optimal count

**learning_rate:** How much each tree contributes
- Lower = more trees needed, but better generalization (see the sketch below)
- Typical: 0.01 - 0.3

**max_depth:** Tree depth
- Shallow (3-6) for boosting
- Deeper trees can overfit
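The learning_rate / n_estimators trade-off can be seen directly. This is a hypothetical sketch reusing the X_train/X_test split from above; the specific configurations are illustrative, and the general pattern is that a lower rate needs more rounds to reach comparable accuracy:

```python
import xgboost as xgb

configs = [
    {"learning_rate": 0.3,  "n_estimators": 100},
    {"learning_rate": 0.1,  "n_estimators": 300},
    {"learning_rate": 0.03, "n_estimators": 1000},
]

for params in configs:
    model = xgb.XGBClassifier(max_depth=3, **params)
    model.fit(X_train, y_train)
    print(params, "test accuracy:", model.score(X_test, y_test))
```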
Early Stopping
Don't guess the number of trees. Let the data tell you:
```python
model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    early_stopping_rounds=50
)

# X_val, y_val: a held-out validation split (separate from the test set)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model.best_iteration}")
```
Popular Libraries
1. **XGBoost** - Fast, widely used
2. **LightGBM** - Even faster, good for large data
3. **CatBoost** - Great with categorical features
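All three expose a similar scikit-learn-style interface, so switching is mostly a matter of renaming a few parameters. A minimal sketch, assuming the lightgbm and catboost packages are installed:

```python
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Roughly equivalent configurations; exact defaults and behavior differ per library
lgbm = LGBMClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
cat = CatBoostClassifier(iterations=100, depth=3, learning_rate=0.1, verbose=False)

lgbm.fit(X_train, y_train)
cat.fit(X_train, y_train)
```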
When to Use
**Gradient Boosting shines when:**
- You need maximum accuracy
- You have time to tune
- Tabular/structured data
- Kaggle competitions!
**Consider alternatives when:**
- Training speed is critical
- You have limited data (may overfit)
- Interpretability is key
Key Takeaway
Gradient Boosting often gives the best performance on tabular data. Start with XGBoost, use shallow trees (max_depth=3-6), and always use early stopping. Remember that learning_rate and n_estimators work together: a lower rate needs more estimators.