Decision Trees: How They Work
Understand Decision Trees - one of the most intuitive and interpretable ML algorithms.
Decision Trees are exactly what they sound like—a tree of decisions. They're intuitive, interpretable, and surprisingly powerful.
The Concept
Think of playing 20 Questions:
- Is it alive? → Yes
- Is it an animal? → Yes
- Does it have 4 legs? → Yes
- Is it bigger than a cat? → Yes
- Is it a dog? → Yes!
That's a decision tree!
Visual Example
                 [Age > 30?]
                /           \
              Yes            No
              /                \
    [Income > 50k?]        [Student?]
       /        \           /       \
     Yes         No        Yes       No
      |          |          |         |
    [BUY]    [MAYBE]      [BUY]   [NO BUY]
Each internal node = a question about a feature
Each leaf node = a prediction
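To make the picture concrete, here's a minimal hand-written sketch of the same tree as nested if/else logic (the thresholds and labels simply mirror the diagram above; nothing here is learned from data):
def predict(age, income, is_student):
    # Hand-coded version of the diagram above (illustrative only)
    if age > 30:
        if income > 50_000:
            return "BUY"
        return "MAYBE"
    else:
        if is_student:
            return "BUY"
        return "NO BUY"
print(predict(age=35, income=60_000, is_student=False))  # BUY
print(predict(age=22, income=20_000, is_student=True))   # BUY
Training a decision tree is just the process of discovering questions and thresholds like these automatically from data.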
How Does It Build the Tree?
The Goal: Find the Best Splits
At each step, the algorithm asks: "Which question separates the data best?"
Measuring "Best" - Information Gain
Imagine you have 50 spam and 50 non-spam emails.
Bad split: Left has 45 spam + 40 non-spam, Right has 5 spam + 10 non-spam
(Still mixed up!)
Good split: Left has 48 spam + 2 non-spam, Right has 2 spam + 48 non-spam
(Much cleaner!)
This "cleanness" is measured using Gini Impurity or Entropy.
Gini Impurity
gini = 1 - (p_class1² + p_class2² + ...)
# Pure node (all same class): gini = 0
# Mixed node (50-50): gini = 0.5
Lower Gini = Better split
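As a quick sanity check, here's a short sketch that scores the two spam splits above with Gini impurity (the helper names are made up; the size-weighted average of the children's impurity is the standard way a tree compares candidate splits):
def gini(counts):
    # Gini impurity of a node, given its class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
def split_gini(left, right):
    # Children's impurity, weighted by how many samples land in each
    n = sum(left) + sum(right)
    return sum(left) / n * gini(left) + sum(right) / n * gini(right)
print(split_gini([45, 40], [5, 10]))  # ~0.49 -- bad split, still very mixed
print(split_gini([48, 2], [2, 48]))   # ~0.08 -- good split, much purer children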
Code Example
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load famous iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create and train tree
tree = DecisionTreeClassifier(max_depth=3) # Limit depth to prevent overfitting
tree.fit(X_train, y_train)
# Evaluate
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
# See feature importance
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
Output:
Accuracy: 95.56%
sepal length (cm): 0.000
sepal width (cm): 0.000
petal length (cm): 0.587
petal width (cm): 0.413
Petal features are most important for classifying iris types!
Visualizing the Tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(15, 10))
plot_tree(tree,
          feature_names=iris.feature_names,
          class_names=iris.target_names,
          filled=True)
plt.show()
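If you prefer the rules as plain text rather than a plot, sklearn's export_text prints the same tree as indented, rule-style text; a minimal sketch reusing the tree trained above:
from sklearn.tree import export_text
print(export_text(tree, feature_names=iris.feature_names))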
For Regression Too!
Decision Trees can predict numbers, not just categories:
from sklearn.tree import DecisionTreeRegressor
# Predict house prices (here X_train/y_train stand for house features and prices,
# not the iris split from earlier)
tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
Instead of voting (classification), leaf nodes average the training values.
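You can see that averaging on a tiny made-up example (the numbers are invented purely for illustration): with max_depth=1 the tree makes a single split and predicts the mean of the training targets on each side.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
# Tiny toy dataset: y jumps from ~1 to ~10 around x = 3.5
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.0, 1.2, 0.8, 9.0, 10.0, 11.0])
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(stump.predict([[2], [5]]))  # ~[1.0, 10.0] -- each leaf returns its mean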
The Overfitting Problem
Decision Trees LOVE to overfit. Without limits, they'll create a rule for every single training example.
No limits:
- Accuracy on training: 100%
- Accuracy on test: 65%
(Memorized, didn't learn!)
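A quick way to see this yourself is to compare an unrestricted tree with a depth-limited one; a sketch reusing the iris train/test split from the code example above (exact numbers vary by dataset and random split, and the gap is much larger on noisier data than on iris):
from sklearn.tree import DecisionTreeClassifier
unlimited = DecisionTreeClassifier().fit(X_train, y_train)  # no limits at all
limited = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("unlimited train/test:", unlimited.score(X_train, y_train), unlimited.score(X_test, y_test))
print("limited   train/test:", limited.score(X_train, y_train), limited.score(X_test, y_test))
# The unrestricted tree hits ~100% on training data; the telling
# number is how far its test score falls behind that.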
How to Prevent Overfitting
tree = DecisionTreeClassifier(
    max_depth=5,           # Limit tree depth
    min_samples_split=10,  # Need at least 10 samples to split
    min_samples_leaf=5,    # Each leaf needs at least 5 samples
    max_features='sqrt'    # Only consider sqrt(n) features per split
)
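Rather than guessing these values, you can let cross-validation choose them. A minimal sketch with GridSearchCV, again on the iris split from earlier (the parameter grid is just an example, not a recommendation):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
param_grid = {"max_depth": [2, 3, 5, 8, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)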
Pros and Cons
Pros ✅
- Interpretable: You can explain every prediction
- No scaling needed: Works with raw features
- Handles mixed data: Numbers and categories
- Finds non-linear patterns: Unlike linear models
- Fast: Quick to train and predict
Cons ❌
- Overfits easily: Needs careful tuning
- Unstable: Small data changes = very different tree
- Greedy: Might miss globally optimal splits
- Biased: Prefers features with many values
Decision Tree vs Linear Models
| Aspect | Decision Tree | Linear Model |
|---|---|---|
| Decision boundary | Rectangular | Straight line |
| Interpretability | Visual rules | Coefficients |
| Feature scaling | Not needed | Usually needed |
| Handles non-linearity | Yes | No |
| Stability | Low | High |
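To see the non-linearity row in action, here's a small sketch comparing a tree with logistic regression on sklearn's make_moons toy dataset, whose two classes can't be separated by a straight line (exact scores depend on the noise level and random seed):
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Two interleaving half-moons: a classic non-linear toy problem
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print("linear:", LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te))
print("tree:  ", DecisionTreeClassifier(max_depth=5).fit(X_tr, y_tr).score(X_te, y_te))
# The tree's axis-aligned splits can trace the curved boundary,
# so it usually scores noticeably higher here.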
Key Insight
Decision Trees are powerful alone but even more powerful together. Random Forest (many trees) and Gradient Boosting (trees that learn from mistakes) dominate machine learning competitions.
We'll cover those next!