Random Forest: Ensemble Learning Made Simple
Learn how Random Forests combine multiple decision trees for better predictions and reduced overfitting.
One decision tree is decent. A hundred decision trees voting together? That's powerful. Welcome to Random Forest.
The Core Idea
Random Forest = Many Decision Trees + Voting
Each tree gets:
- A bootstrap sample of the training data, drawn with replacement (bagging)
- A random subset of features to consider at each split
Final prediction = majority vote (classification) or average (regression)
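To see the mechanics, here is a hand-rolled sketch of that recipe built from plain decision trees. The toy dataset from make_classification and the choice of 25 trees are purely illustrative, not part of any standard recipe.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
# Toy dataset for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bagging: draw a bootstrap sample (rows sampled with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each split considers only sqrt(n_features) features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)
# Majority vote across all trees
votes = np.stack([t.predict(X) for t in trees])        # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # works for 0/1 labels
print(f"Ensemble accuracy on training data: {(forest_pred == y).mean():.2f}")
This is essentially what RandomForestClassifier does for you, minus a lot of optimization.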
Why Does This Work?
Single Tree Problems:
- Prone to overfitting
- Sensitive to small data changes
- Can miss important patterns
Forest Solutions:
- Individual errors cancel out, because trees trained on different samples and features make largely uncorrelated mistakes
- More stable predictions
- Captures diverse patterns
Think of it like asking 100 people for directions instead of 1. As long as they don't all make the same mistake, the crowd is usually right.
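Here is a quick way to see that effect, comparing a single tree to a forest with cross-validation. The breast cancer dataset is just a convenient stand-in for your own data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Illustrative dataset - swap in your own X, y
X, y = load_breast_cancer(return_X_y=True)
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5)
# The forest usually scores higher and varies less across folds
print(f"Single tree: {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Forest:      {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")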
Implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Prepare data (X = feature matrix, y = labels, defined elsewhere)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Limit tree depth
    min_samples_split=5,   # Min samples to split
    random_state=42
)
rf.fit(X_train, y_train)
# Predict
predictions = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2f}")
Key Parameters
| Parameter | What It Does | Typical Values |
|---|---|---|
| n_estimators | Number of trees | 100-500 |
| max_depth | Tree depth limit | 10-30 |
| min_samples_split | Min samples to split | 2-10 |
| max_features | Features per split | 'sqrt' or 'log2' |
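A common way to tune these is a randomized search over the ranges in the table. The candidate values below are only an illustrative starting point, reusing X_train and y_train from earlier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# Candidate values drawn from the table above (illustrative, not exhaustive)
param_distributions = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2'],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,      # number of random parameter combinations to try
    cv=5,
    n_jobs=-1,      # use all CPU cores
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")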
Feature Importance
Random Forest tells you which features matter most:
import pandas as pd
# Get feature importance (feature_names = list of column names for X)
importance = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(importance.head(10))
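If you prefer a picture, a horizontal bar chart of that same DataFrame is an easy next step (assumes matplotlib is installed):
import matplotlib.pyplot as plt
# Plot the ten most important features, largest on top
top = importance.head(10).iloc[::-1]
plt.barh(top['feature'], top['importance'])
plt.xlabel('Importance')
plt.title('Top 10 Random Forest feature importances')
plt.tight_layout()
plt.show()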
When to Use Random Forest
Good for:
- Tabular data
- When you need feature importance
- Classification and regression
- When you want decent results without much tuning
Not ideal for:
- Very high-dimensional sparse data
- When interpretability is critical
- Real-time predictions (slower than single tree)
Key Takeaway
Random Forest is often the first algorithm to try for tabular data. It's robust to noise and outliers, needs little preprocessing, provides feature importance, and works well out of the box. Start with 100 trees and tune from there!