Decision Trees: How They Work
Understand Decision Trees - one of the most intuitive and interpretable ML algorithms.
Decision Trees are exactly what they sound like—a tree of decisions. They're intuitive, interpretable, and surprisingly powerful.
The Concept
Think of playing 20 Questions:

- Is it alive? → Yes
- Is it an animal? → Yes
- Does it have 4 legs? → Yes
- Is it bigger than a cat? → Yes
- Is it a dog? → Yes!
That's a decision tree!
Visual Example
```
             [Age > 30?]
             /         \
           Yes          No
           /              \
   [Income > 50k?]     [Student?]
      /       \          /     \
    Yes        No      Yes      No
     |          |       |        |
   [BUY]    [MAYBE]   [BUY]   [NO BUY]
```
- Each internal node = a question about a feature
- Each leaf node = a prediction
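You can think of the diagram above as nothing more than nested `if`/`else` statements. Here's a rough sketch of that idea (the feature names and thresholds are just the ones from the picture, not a trained model):

```python
# Hand-written version of the tree above: each `if` is an internal node,
# each return value is a leaf prediction.
def predict(age, income, is_student):
    if age > 30:
        if income > 50_000:
            return "BUY"
        return "MAYBE"
    else:
        if is_student:
            return "BUY"
        return "NO BUY"

print(predict(age=35, income=60_000, is_student=False))  # BUY
print(predict(age=22, income=20_000, is_student=True))   # BUY
```

Training a decision tree is just learning which questions to ask and in what order.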
How Does It Build the Tree?
### The Goal: Find the Best Splits
At each step, the algorithm asks: "Which question separates the data best?"
### Measuring "Best" - Information Gain
Imagine you have 50 spam and 50 non-spam emails.
**Bad split:** Left has 45 spam + 40 non-spam, Right has 5 spam + 10 non-spam (Still mixed up!)
**Good split:** Left has 48 spam + 2 non-spam, Right has 2 spam + 48 non-spam (Much cleaner!)
This "cleanness" is measured using **Gini Impurity** or **Entropy**.
### Gini Impurity
```python
# Gini impurity: 1 - (p_class1² + p_class2² + ...)
def gini(proportions):
    return 1 - sum(p ** 2 for p in proportions)

gini([1.0, 0.0])  # Pure node (all same class): 0.0
gini([0.5, 0.5])  # Mixed node (50-50): 0.5
```
Lower Gini = Better split
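To score a whole split, the algorithm averages the children's impurities, weighted by how many samples land in each child. Here's a quick sketch that re-checks the spam example above, reusing the `gini` function just defined (the helper name `split_gini` is just for illustration):

```python
# Weighted Gini of a split: each child's impurity, weighted by child size
def split_gini(left_counts, right_counts):
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    left = gini([c / n_left for c in left_counts])
    right = gini([c / n_right for c in right_counts])
    return (n_left / n) * left + (n_right / n) * right

# (spam, non-spam) counts in each child
print(split_gini((45, 40), (5, 10)))   # bad split  -> ~0.49
print(split_gini((48, 2), (2, 48)))    # good split -> ~0.08
```

The good split scores much lower, so the tree would pick that question. It repeats this search at every node until the leaves are pure (or a stopping rule kicks in).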
Code Example
```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the famous iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train the tree
tree = DecisionTreeClassifier(max_depth=3)  # Limit depth to prevent overfitting
tree.fit(X_train, y_train)

# Evaluate
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

# See feature importance
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```
Output:

```
Accuracy: 95.56%
sepal length (cm): 0.000
sepal width (cm): 0.000
petal length (cm): 0.587
petal width (cm): 0.413
```
Petal features are most important for classifying iris types!
Visualizing the Tree
```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```
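If you'd rather skip matplotlib, scikit-learn can also dump the same tree as plain-text rules:

```python
from sklearn.tree import export_text

# Print the learned splits as indented if/else-style rules
print(export_text(tree, feature_names=list(iris.feature_names)))
```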
For Regression Too!
Decision Trees can predict numbers, not just categories:
```python
from sklearn.tree import DecisionTreeRegressor

# Same API, e.g. predicting house prices
tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
```
Instead of voting (classification), leaf nodes average the training values.
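You can see the averaging directly with a toy example (made-up numbers, purely for illustration): a depth-1 tree splits the data once, and each leaf predicts the mean of the training targets that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Two obvious clusters of targets: ~2 on the left, ~11 on the right
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(stump.predict([[2], [11]]))  # ~[2.0, 11.0]: the mean of each leaf
```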
The Overfitting Problem
Decision Trees LOVE to overfit. Without limits, they'll create a rule for every single training example.
```
No limits:
- Accuracy on training: 100%
- Accuracy on test:      65%   (Memorized, didn't learn!)
```
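The exact numbers depend on the dataset, but you can reproduce the gap yourself. A minimal sketch using noisy synthetic data (the dataset and variable names here are just for this demo):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes memorization useless on new data
X_noisy, y_noisy = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0
)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noisy, y_noisy, random_state=0
)

full_tree = DecisionTreeClassifier(random_state=0).fit(Xn_train, yn_train)
print("train:", full_tree.score(Xn_train, yn_train))  # typically 1.0 — memorized
print("test: ", full_tree.score(Xn_test, yn_test))    # noticeably lower
```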
### How to Prevent Overfitting
```python
tree = DecisionTreeClassifier(
    max_depth=5,            # Limit tree depth
    min_samples_split=10,   # Need at least 10 samples to split a node
    min_samples_leaf=5,     # Each leaf needs at least 5 samples
    max_features='sqrt'     # Only consider sqrt(n) features per split
)
```
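Rather than guessing these values, one common approach is to let cross-validation pick them. A quick sketch with `GridSearchCV` (the grid values below are just examples, not recommendations):

```python
from sklearn.model_selection import GridSearchCV

# Try a few pruning settings and keep the one with the best CV score
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```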
Pros and Cons
### Pros ✅

- **Interpretable**: You can explain every prediction
- **No scaling needed**: Works with raw features
- **Handles mixed data**: Numbers and categories
- **Finds non-linear patterns**: Unlike linear models
- **Fast**: Quick to train and predict
### Cons ❌

- **Overfits easily**: Needs careful tuning
- **Unstable**: Small data changes = very different tree
- **Greedy**: Might miss globally optimal splits
- **Biased**: Prefers features with many values
Decision Tree vs Linear Models
| Aspect | Decision Tree | Linear Model |
|--------|---------------|--------------|
| Decision boundary | Rectangular | Straight line |
| Interpretability | Visual rules | Coefficients |
| Feature scaling | Not needed | Usually needed |
| Handles non-linearity | Yes | No |
| Stability | Low | High |
Key Insight
Decision Trees are powerful alone but even more powerful together. **Random Forest** (many trees) and **Gradient Boosting** (trees that learn from mistakes) dominate machine learning competitions.
We'll cover those next!