# Handling Imbalanced Datasets
Learn techniques to handle imbalanced classes when one class heavily outnumbers the other.
When 99% of your data is class A and 1% is class B, your model might just predict A for everything and get 99% accuracy. That's the imbalanced data problem.
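A quick sketch of the trap in plain Python, using a made-up 99/1 label split (the numbers are illustrative):

```python
# 1,000 labels: 990 of class "A", 10 of class "B" (illustrative split)
y_true = ["A"] * 990 + ["B"] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = ["A"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.1%}")  # 99.0% -- yet it never finds class B

# Of the 10 real B's, how many did the model catch?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p == "B")
print(f"Minority class found: {caught}")  # 0
```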
## Why It's a Problem
```
Fraud Detection:

████████████████████████████████  99.9% Normal
█                                  0.1% Fraud
```

A model can get 99.9% accuracy by predicting "Normal" for everything. But that's useless for catching fraud!
## Common Scenarios
- Fraud detection (frauds are rare)
- Disease diagnosis (rare diseases)
- Anomaly detection
- Churn prediction (most customers don't churn)
## Solution 1: Use the Right Metrics
Accuracy is misleading. Use these instead:
```python
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, roc_auc_score
)

# Get the full picture
print(classification_report(y_test, y_pred))

# ROC-AUC works well for imbalanced data
auc = roc_auc_score(y_test, y_pred_proba)
```
**Key metrics:**

- **Precision:** Of predicted positives, how many are correct?
- **Recall:** Of actual positives, how many did we find?
- **F1:** Balance of precision and recall
- **ROC-AUC:** Overall ranking ability
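The definitions above can be checked by hand on a small, made-up confusion matrix:

```python
# Hypothetical counts for the positive (minority) class
tp, fp, fn = 8, 4, 2  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of 12 predicted positives, 8 are correct
recall = tp / (tp + fn)     # of 10 actual positives, 8 were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.800 f1=0.727
```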
## Solution 2: Class Weights
Tell the model to care more about the minority class:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Automatic balancing
model = LogisticRegression(class_weight='balanced')

# Manual weights: errors on class 1 cost 10x more
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

# For XGBoost
model = XGBClassifier(scale_pos_weight=99)  # ratio of negative to positive samples
```
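For intuition, scikit-learn's `'balanced'` mode weights each class as `n_samples / (n_classes * count_per_class)`, so rarer classes get proportionally larger weights. A pure-Python sketch of that formula on made-up counts:

```python
from collections import Counter

# Made-up labels: 990 negatives, 10 positives
y = [0] * 990 + [1] * 10
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# The 'balanced' heuristic: weight = n_samples / (n_classes * class_count)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)  # {0: ~0.505, 1: 50.0} -- the minority class counts ~100x more
```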
## Solution 3: Resampling
### Oversample Minority Class (SMOTE)
```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(f"Before: {Counter(y_train)}")
print(f"After:  {Counter(y_resampled)}")
```
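Conceptually, SMOTE builds each synthetic minority point by interpolating between a real minority sample and one of its minority-class nearest neighbors. A minimal sketch of just that interpolation step (the full algorithm also does the k-NN search, which is omitted here):

```python
import random

def smote_interpolate(x, neighbor, rng=random):
    """Create one synthetic point on the segment between x and a neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
sample = [1.0, 2.0]
neighbor = [3.0, 4.0]
synthetic = smote_interpolate(sample, neighbor, rng)
print(synthetic)  # lies on the line segment between the two points
```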
### Undersample Majority Class
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
```
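Random undersampling simply keeps every minority row and a random, equal-sized subset of the majority rows. A stdlib-only sketch, assuming list-of-rows data and a hypothetical `random_undersample` helper:

```python
import random
from collections import Counter

def random_undersample(X, y, minority_label, seed=42):
    """Keep all minority rows; sample an equal number of majority rows."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    kept = rng.sample(majority, k=len(minority)) + minority
    rng.shuffle(kept)
    return [x for x, _ in kept], [t for _, t in kept]

X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5
X_res, y_res = random_undersample(X, y, minority_label=1)
print(Counter(y_res))  # 5 of each class
```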
### Combine Both
```python
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)
```
## Solution 4: Change the Threshold
The default threshold is 0.5. Adjusting it shifts the precision/recall trade-off:
```python
from sklearn.metrics import precision_recall_curve

# Get probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Lower the threshold to catch more positives
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)

# Find an optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Choose a threshold based on your precision/recall needs
```
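The effect of moving the threshold can be seen on a tiny hand-made set of scores (pure Python, made-up numbers):

```python
# Made-up predicted probabilities and true labels
y_proba = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
y_true  = [0,   0,   1,    0,   1,   1,   1]

def precision_recall_at(threshold):
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for th in (0.5, 0.3):
    p, r = precision_recall_at(th)
    print(f"threshold={th}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=1.00, recall=0.75
# threshold=0.3: precision=0.80, recall=1.00
```

Here, lowering the threshold from 0.5 to 0.3 trades some precision for full recall, which is often the right trade when missing a positive is costly.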
## Solution 5: Ensemble Methods
`BalancedRandomForestClassifier` undersamples the majority class within each bootstrap sample, so every tree trains on balanced data:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
```
## Which Solution to Use?
| Situation | Recommended Approach |
|-----------|----------------------|
| Quick fix | Class weights |
| Very imbalanced | SMOTE + class weights |
| Large dataset | Undersampling |
| Small dataset | SMOTE |
| Need high recall | Lower threshold |
## Complete Pipeline
```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create a balanced pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        class_weight='balanced',
        n_estimators=100
    ))
])

pipeline.fit(X_train, y_train)

# Evaluate properly
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
## Key Takeaway
Never use accuracy alone for imbalanced data. Start with class weights (easiest), try SMOTE for synthetic oversampling, and always evaluate with precision/recall/F1/AUC. The right approach depends on your specific needs: sometimes missing frauds is worse than false alarms (prioritize recall), sometimes the opposite (prioritize precision).