# Handling Imbalanced Data

Techniques for dealing with datasets where the class distribution is heavily unequal.
## What is Imbalanced Data?

A dataset is imbalanced when one class has far more examples than the others.
**Example**: Fraud detection
- 99,000 normal transactions
- 1,000 fraud transactions
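A quick label count makes the skew concrete. A minimal sketch, assuming the labels live in a NumPy array `y`:

```python
import numpy as np

# Hypothetical labels: 0 = normal, 1 = fraud
y = np.array([0] * 99_000 + [1] * 1_000)

classes, counts = np.unique(y, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"Class {cls}: {count} ({count / len(y):.1%})")
# Class 0: 99000 (99.0%)
# Class 1: 1000 (1.0%)
```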
## Why It's a Problem

The model learns to always predict the majority class:

Predict "Not Fraud" for everything → 99% accuracy! But it catches 0% of actual fraud.
## Solution 1: Oversampling

Add more minority class examples. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority samples by interpolating between existing ones:
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)

print(f"Original: {len(y)}")
print(f"After SMOTE: {len(y_resampled)}")
```
## Solution 2: Undersampling

Remove majority class examples until the classes are balanced:
```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
```
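To verify the result, `collections.Counter` gives a quick before-and-after view of the class counts (the numbers in the comments assume the 99,000/1,000 fraud example):

```python
from collections import Counter

print(f"Before: {Counter(y)}")            # Counter({0: 99000, 1: 1000})
print(f"After:  {Counter(y_resampled)}")  # Counter({0: 1000, 1: 1000})
```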
## Solution 3: Class Weights

Tell the model to penalize mistakes on the minority class more heavily:
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced')
model.fit(X, y)
```
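Under the hood, `'balanced'` weights each class by `n_samples / (n_classes * class_count)`; scikit-learn exposes the same calculation directly, shown here with the 99:1 fraud ratio from above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99_000 + [1] * 1_000)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))
# {0: 0.505..., 1: 50.0}: a mistake on a fraud case costs 99x more
```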
## Solution 4: Different Metrics
Don't use accuracy! Use:
- **Precision**: Of predicted frauds, how many were real?
- **Recall**: Of real frauds, how many did we catch?
- **F1-Score**: Balance of precision and recall
```python
from sklearn.metrics import classification_report

predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```
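The summary below also recommends AUC, which needs predicted probabilities rather than hard labels. A sketch, assuming the fitted `model` and the `X_test`/`y_test` split from the snippet above:

```python
from sklearn.metrics import f1_score, roc_auc_score

# AUC is computed from the probability of the positive (fraud) class
probabilities = model.predict_proba(X_test)[:, 1]
print(f"F1:  {f1_score(y_test, predictions):.3f}")
print(f"AUC: {roc_auc_score(y_test, probabilities):.3f}")
```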
## Real Example - Credit Card Fraud
```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Apply SMOTE to balance the training data
smote = SMOTE()
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Train with class weights on top of the balanced data
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train_balanced, y_train_balanced)
```
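To close the loop, evaluate on a test split that was never resampled, so the scores reflect the real class ratio. A sketch, assuming `X_test` and `y_test` came from a `train_test_split` done before applying SMOTE:

```python
from sklearn.metrics import classification_report

# The test set keeps its natural imbalance; only the training data was balanced
predictions = model.predict(X_test)
print(classification_report(y_test, predictions, target_names=['Not Fraud', 'Fraud']))
```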
## When to Use What
- **Lots of data**: Undersample
- **Little data**: Oversample (SMOTE)
- **Can't resample**: Use class weights
## Remember
- Never use accuracy alone
- Use F1-score or AUC
- Combine multiple techniques (see the pipeline sketch below)
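For combining techniques, imbalanced-learn's `Pipeline` is the safe way to do it inside cross-validation, because the sampler is re-fit on each training fold only. A sketch pairing SMOTE with class weights, assuming `X` and `y` exist:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(class_weight='balanced')),
])

# Score with F1 so the minority class drives the metric
scores = cross_val_score(pipeline, X, y, scoring='f1', cv=5)
print(f"F1 per fold: {scores}")
```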