# Handling Imbalanced Data

Techniques for dealing with datasets where the class distribution is heavily unequal.
## What is Imbalanced Data?

A dataset is imbalanced when one class has far more examples than the others.
**Example**: Fraud detection
- 99,000 normal transactions
- 1,000 fraud transactions
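A quick label count makes the skew concrete. A minimal sketch, assuming the labels live in a NumPy array `y`:

```python
import numpy as np

# Hypothetical labels: 0 = normal, 1 = fraud
y = np.array([0] * 99_000 + [1] * 1_000)

classes, counts = np.unique(y, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"Class {cls}: {count} ({count / len(y):.1%})")
# Class 0: 99000 (99.0%)
# Class 1: 1000 (1.0%)
```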
## Why It's a Problem

The model learns to always predict the majority class:

Predict "Not Fraud" for everything → 99% accuracy! But it catches 0% of actual fraud.
## Solution 1: Oversampling

Add more minority class examples. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority samples by interpolating between existing ones:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original: {len(y)}")
print(f"After SMOTE: {len(y_resampled)}")
```
## Solution 2: Undersampling

Remove majority class examples until the classes are balanced:
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
```
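To verify the result, `collections.Counter` gives a quick before-and-after view of the class counts (the numbers in the comments assume the 99,000/1,000 fraud example):

```python
from collections import Counter

print(f"Before: {Counter(y)}")            # Counter({0: 99000, 1: 1000})
print(f"After:  {Counter(y_resampled)}")  # Counter({0: 1000, 1: 1000})
```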
## Solution 3: Class Weights

Tell the model to penalize mistakes on the minority class more heavily:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced')
model.fit(X, y)
```
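Under the hood, `'balanced'` weights each class by `n_samples / (n_classes * class_count)`; scikit-learn exposes the same calculation directly, shown here with the 99:1 fraud ratio from above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99_000 + [1] * 1_000)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# {0: 0.505..., 1: 50.0}: a mistake on a fraud case costs 99x more
```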
## Solution 4: Different Metrics
Don't use accuracy! Use:
- **Precision**: Of predicted frauds, how many were real?
- **Recall**: Of real frauds, how many did we catch?
- **F1-Score**: Balance of precision and recall
```python
from sklearn.metrics import classification_report

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```
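The summary below also recommends AUC, which needs predicted probabilities rather than hard labels. A sketch, assuming the fitted `model` and the `X_test`/`y_test` split from the snippet above:

```python
from sklearn.metrics import f1_score, roc_auc_score

# AUC is computed from the probability of the positive (fraud) class
probabilities = model.predict_proba(X_test)[:, 1]
print(f"F1:  {f1_score(y_test, predictions):.3f}")
print(f"AUC: {roc_auc_score(y_test, probabilities):.3f}")
```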
## Real Example - Credit Card Fraud
```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Apply SMOTE to balance the training data
smote = SMOTE()
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Train with class weights on top of the balanced data
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train_balanced, y_train_balanced)
```
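To close the loop, evaluate on a test split that was never resampled, so the scores reflect the real class ratio. A sketch, assuming `X_test` and `y_test` came from a `train_test_split` done before applying SMOTE:

```python
from sklearn.metrics import classification_report

# The test set keeps its natural imbalance; only the training data was balanced
predictions = model.predict(X_test)
print(classification_report(y_test, predictions, target_names=['Not Fraud', 'Fraud']))
```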
## When to Use What
- **Lots of data**: Undersample
- **Little data**: Oversample (SMOTE)
- **Can't resample**: Use class weights
## Remember
- Never use accuracy alone
- Use F1-score or AUC
- Combine multiple techniques (see the pipeline sketch below)
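For combining techniques, imbalanced-learn's `Pipeline` is the safe way to do it inside cross-validation, because the sampler is re-fit on each training fold only. A sketch pairing SMOTE with class weights, assuming `X` and `y` exist:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(class_weight='balanced')),
])

# Score with F1 so the minority class drives the metric
scores = cross_val_score(pipeline, X, y, scoring='f1', cv=5)
print(f"F1 per fold: {scores}")
```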