# Handling Imbalanced Datasets
Learn techniques to handle imbalanced classes when one class heavily outnumbers the other.
When 99% of your data is class A and 1% is class B, your model might just predict A for everything and get 99% accuracy. That's the imbalanced data problem.
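A quick sketch of the trap in plain Python, using a made-up 99/1 label split (the numbers are illustrative):

```python
# 1,000 labels: 990 of class "A", 10 of class "B" (illustrative split)
y_true = ["A"] * 990 + ["B"] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = ["A"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.1%}")  # 99.0% -- yet it never finds class B

# Of the 10 real B's, how many did the model catch?
caught = sum(1 for t, p in zip(y_true, y_pred) if t == "B" and p == "B")
print(f"Minority class found: {caught}")  # 0
```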
## Why It's a Problem
```
Fraud Detection:

████████████████████████████████  99.9% Normal
█                                  0.1% Fraud
```

A model can get 99.9% accuracy by predicting "Normal" for everything. But that's useless for catching fraud!
## Common Scenarios
- Fraud detection (frauds are rare)
- Disease diagnosis (rare diseases)
- Anomaly detection
- Churn prediction (most customers don't churn)
## Solution 1: Use the Right Metrics
Accuracy is misleading. Use these instead:
```python
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, roc_auc_score
)

# Get the full picture
print(classification_report(y_test, y_pred))

# ROC-AUC works well for imbalanced data
auc = roc_auc_score(y_test, y_pred_proba)
```
**Key metrics:**

- **Precision:** Of predicted positives, how many are correct?
- **Recall:** Of actual positives, how many did we find?
- **F1:** Balance of precision and recall
- **ROC-AUC:** Overall ranking ability
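The definitions above can be checked by hand on a small, made-up confusion matrix:

```python
# Hypothetical counts for the positive (minority) class
tp, fp, fn = 8, 4, 2  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of 12 predicted positives, 8 are correct
recall = tp / (tp + fn)     # of 10 actual positives, 8 were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.800 f1=0.727
```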
## Solution 2: Class Weights
Tell the model to care more about the minority class:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Automatic balancing
model = LogisticRegression(class_weight='balanced')

# Manual weights: errors on class 1 cost 10x more
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

# For XGBoost
model = XGBClassifier(scale_pos_weight=99)  # ratio of negative to positive samples
```
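For intuition, scikit-learn's `'balanced'` mode weights each class as `n_samples / (n_classes * count_per_class)`, so rarer classes get proportionally larger weights. A pure-Python sketch of that formula on made-up counts:

```python
from collections import Counter

# Made-up labels: 990 negatives, 10 positives
y = [0] * 990 + [1] * 10
counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# The 'balanced' heuristic: weight = n_samples / (n_classes * class_count)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)  # {0: ~0.505, 1: 50.0} -- the minority class counts ~100x more
```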
## Solution 3: Resampling
### Oversample Minority Class (SMOTE)
```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(f"Before: {Counter(y_train)}")
print(f"After:  {Counter(y_resampled)}")
```
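Conceptually, SMOTE builds each synthetic minority point by interpolating between a real minority sample and one of its minority-class nearest neighbors. A minimal sketch of just that interpolation step (the full algorithm also does the k-NN search, which is omitted here):

```python
import random

def smote_interpolate(x, neighbor, rng=random):
    """Create one synthetic point on the segment between x and a neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
sample = [1.0, 2.0]
neighbor = [3.0, 4.0]
synthetic = smote_interpolate(sample, neighbor, rng)
print(synthetic)  # lies on the line segment between the two points
```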
### Undersample Majority Class
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
```
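Random undersampling simply keeps every minority row and a random, equal-sized subset of the majority rows. A stdlib-only sketch, assuming list-of-rows data and a hypothetical `random_undersample` helper:

```python
import random
from collections import Counter

def random_undersample(X, y, minority_label, seed=42):
    """Keep all minority rows; sample an equal number of majority rows."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    kept = rng.sample(majority, k=len(minority)) + minority
    rng.shuffle(kept)
    return [x for x, _ in kept], [t for _, t in kept]

X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5
X_res, y_res = random_undersample(X, y, minority_label=1)
print(Counter(y_res))  # 5 of each class
```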
### Combine Both
```python
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)
```
## Solution 4: Change the Threshold
The default threshold is 0.5. Adjusting it shifts the precision/recall trade-off:
```python
from sklearn.metrics import precision_recall_curve

# Get probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]

# Lower the threshold to catch more positives
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)

# Find an optimal threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Choose a threshold based on your precision/recall needs
```
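The effect of moving the threshold can be seen on a tiny hand-made set of scores (pure Python, made-up numbers):

```python
# Made-up predicted probabilities and true labels
y_proba = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
y_true  = [0,   0,   1,    0,   1,   1,   1]

def precision_recall_at(threshold):
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for th in (0.5, 0.3):
    p, r = precision_recall_at(th)
    print(f"threshold={th}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=1.00, recall=0.75
# threshold=0.3: precision=0.80, recall=1.00
```

Here, lowering the threshold from 0.5 to 0.3 trades some precision for full recall, which is often the right trade when missing a positive is costly.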
## Solution 5: Ensemble Methods
`BalancedRandomForestClassifier` undersamples the majority class within each bootstrap sample, so every tree trains on balanced data:

```python
from imblearn.ensemble import BalancedRandomForestClassifier

model = BalancedRandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
```
## Which Solution to Use?
| Situation | Recommended Approach |
|-----------|----------------------|
| Quick fix | Class weights |
| Very imbalanced | SMOTE + class weights |
| Large dataset | Undersampling |
| Small dataset | SMOTE |
| Need high recall | Lower threshold |
## Complete Pipeline
```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Create a balanced pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        class_weight='balanced',
        n_estimators=100
    ))
])

pipeline.fit(X_train, y_train)

# Evaluate properly
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
## Key Takeaway
Never use accuracy alone for imbalanced data. Start with class weights (easiest), try SMOTE for synthetic oversampling, and always evaluate with precision/recall/F1/AUC. The right approach depends on your specific needs: sometimes missing frauds is worse than false alarms (prioritize recall), sometimes the opposite (prioritize precision).