# Random Forest: Ensemble Learning Made Simple
Learn how Random Forests combine multiple decision trees for better predictions and reduced overfitting.
One decision tree is decent. A hundred decision trees voting together? That's powerful. Welcome to Random Forest.
## The Core Idea
Random Forest = Many Decision Trees + Voting
Each tree gets:

- A random subset of training data (bagging)
- A random subset of features at each split
Final prediction = majority vote (classification) or average (regression)
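Here's a minimal from-scratch sketch of that recipe (bagging plus per-tree randomness, then voting), just to make the idea concrete. It assumes `X_train`, `y_train`, and `X_test` are NumPy arrays with integer class labels, as in the implementation section below; in practice you'd use scikit-learn's `RandomForestClassifier`, which does all of this for you.

```python
# Illustrative sketch only: bagging + majority vote with plain decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n_trees = 100
trees = []

for _ in range(n_trees):
    # Bagging: sample training rows with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority vote across all trees (assumes integer class labels)
all_preds = np.array([t.predict(X_test) for t in trees])
ensemble_pred = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, all_preds)
```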
## Why Does This Work?
**Single Tree Problems:**

- Prone to overfitting
- Sensitive to small data changes
- Can miss important patterns
**Forest Solutions:**

- Individual errors cancel out
- More stable predictions
- Captures diverse patterns
Think of it like asking 100 people for directions vs asking 1 person. The crowd is usually right.
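You can see the effect directly by comparing cross-validated accuracy of one tree against a forest on the same data. The snippet below is illustrative only (the synthetic dataset is a stand-in, not a benchmark):

```python
# Rough comparison: a single deep tree vs. a 100-tree forest on synthetic data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("Single tree:", cross_val_score(single_tree, X_demo, y_demo, cv=5).mean())
print("Forest:     ", cross_val_score(forest, X_demo, y_demo, cv=5).mean())
```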
## Implementation
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create and train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,       # Number of trees
    max_depth=10,           # Limit tree depth
    min_samples_split=5,    # Min samples to split
    random_state=42
)
rf.fit(X_train, y_train)

# Predict
predictions = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
```
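If you need probabilities rather than hard labels, the fitted forest can report those too (scikit-learn averages each tree's probability estimates). A small follow-up using the `rf` model from the block above:

```python
# Averaged class probabilities from the trained forest
probs = rf.predict_proba(X_test)
print(probs[:5])  # one row per sample, one column per class
```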
## Key Parameters
| Parameter | What It Does | Typical Values |
|-----------|--------------|----------------|
| n_estimators | Number of trees | 100-500 |
| max_depth | Tree depth limit | 10-30 |
| min_samples_split | Min samples to split | 2-10 |
| max_features | Features per split | 'sqrt' or 'log2' |
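If you'd rather search those ranges than guess, a randomized search works well. This is a sketch, not a prescription: the grid and `n_iter` below are arbitrary starting points, and it reuses `X_train`/`y_train` from the implementation section.

```python
# One way to search the parameter ranges from the table above
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 300, 500],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,      # sample 20 combinations instead of trying them all
    cv=5,
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```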
## Feature Importance
Random Forest tells you which features matter most:
```python
import pandas as pd

# Get feature importance
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

print(importance.head(10))
```
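A quick horizontal bar chart makes the ranking easier to read. This assumes matplotlib is available and reuses the `importance` DataFrame from above:

```python
import matplotlib.pyplot as plt

top = importance.head(10).iloc[::-1]  # reverse so the biggest bar ends up on top
plt.barh(top["feature"], top["importance"])
plt.xlabel("Importance")
plt.title("Random Forest feature importance")
plt.tight_layout()
plt.show()
```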
## When to Use Random Forest
**Good for:**

- Tabular data
- When you need feature importance
- Classification and regression (see the regression sketch at the end of this section)
- When you want decent results without much tuning
**Not ideal for:**

- Very high-dimensional sparse data
- When interpretability is critical
- Real-time predictions (slower than a single tree)
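For completeness, here's what the regression counterpart looks like: same ensemble, but each tree's numeric prediction is averaged instead of voted on. The synthetic dataset below is purely illustrative.

```python
# Regression variant: predictions are averaged across trees
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Xr, yr = make_regression(n_samples=1000, n_features=10, noise=0.3, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(Xr_train, yr_train)
print("R^2 on test set:", reg.score(Xr_test, yr_test))
```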
## Key Takeaway
Random Forest is often the first algorithm to try on tabular data. It's robust, needs no feature scaling, provides feature importance, and works well out of the box. Start with 100 trees and tune from there!