9 min read

Handling Imbalanced Datasets

Learn techniques to handle imbalanced classes when one class heavily outnumbers the other.

Sarah Chen
December 19, 2025


When 99% of your data is class A and 1% is class B, your model might just predict A for everything and get 99% accuracy. That's the imbalanced data problem.

Why It's a Problem

Fraud Detection:
  ████████████████████████████████ 99.9% Normal
  █ 0.1% Fraud

Model can get 99.9% accuracy by predicting "Normal" for everything.
But that's useless for catching fraud!
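The accuracy paradox above is easy to demonstrate. A minimal sketch with hypothetical labels (999 normal, 1 fraud) and a model that always predicts "Normal":

```python
import numpy as np

# Hypothetical labels: 999 normal (0), 1 fraud (1)
y_true = np.array([0] * 999 + [1])

# A "model" that predicts Normal for everything
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_true == y_pred).mean()   # 999 of 1000 correct
recall = y_pred[y_true == 1].mean()    # fraction of actual frauds caught

print(f"Accuracy: {accuracy:.3f}, Recall: {recall:.1f}")
# Accuracy: 0.999, Recall: 0.0
```

High accuracy, zero frauds caught: exactly the failure mode described above.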

Common Scenarios

  • Fraud detection (rare frauds)
  • Disease diagnosis (rare diseases)
  • Anomaly detection
  • Churn prediction (most don't churn)

Solution 1: Use the Right Metrics

Accuracy is misleading. Use these instead:

from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, roc_auc_score
)

# Get the full picture: per-class precision, recall, F1
print(classification_report(y_test, y_pred))

# ROC-AUC needs probabilities, not hard labels
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

Key metrics:

  • Precision: Of predicted positives, how many are correct?
  • Recall: Of actual positives, how many did we find?
  • F1: Balance of precision and recall
  • ROC-AUC: Overall ranking ability
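These definitions reduce to simple ratios of confusion-matrix counts. A hand computation with hypothetical counts (8 true positives, 4 false positives, 2 false negatives):

```python
# Hypothetical confusion-matrix counts for a fraud model
tp, fp, fn = 8, 4, 2

precision = tp / (tp + fp)   # of predicted positives, how many are correct?
recall = tp / (tp + fn)      # of actual positives, how many did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that none of these involve true negatives, which is why they stay informative when the negative class dominates.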

Solution 2: Class Weights

Tell the model to care more about the minority class:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Automatic balancing: weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')

# Manual weights: make each class-1 sample count 10x in the loss
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

# For XGBoost: set to the ratio of negative to positive samples
model = XGBClassifier(scale_pos_weight=99)
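Rather than hard-coding the ratio, you can compute it from your labels. A small sketch with hypothetical 99:1 labels:

```python
import numpy as np

# Hypothetical labels: 990 negatives, 10 positives
y = np.array([0] * 990 + [1] * 10)

neg, pos = np.bincount(y)
scale_pos_weight = neg / pos   # ratio of negatives to positives

print(scale_pos_weight)  # 99.0
```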

Solution 3: Resampling

Oversample Minority Class (SMOTE)

from collections import Counter

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print(f"Before: {Counter(y_train)}")
print(f"After: {Counter(y_resampled)}")
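Under the hood, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. A pure-NumPy sketch of that core idea (toy points, not the full algorithm):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])

# Pick a point and a (here hand-chosen) neighbor, then interpolate
x = minority[0]
neighbor = minority[1]
gap = rng.random()                     # uniform in [0, 1)
synthetic = x + gap * (neighbor - x)   # lies on the segment x -> neighbor

print(synthetic)
```

Because the new point lies between two real minority samples, it stays in a plausible region of feature space instead of simply duplicating a row.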

Undersample Majority Class

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

Combine Both

from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)

Solution 4: Change the Threshold

Most classifiers use a default decision threshold of 0.5. Adjust it:

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# Lower threshold to catch more positives
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)

# Find optimal threshold
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Choose threshold based on your precision/recall needs
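One common choice is the threshold that maximizes F1. A self-contained sketch on hypothetical scores (10 toy samples; in practice you would use your validation-set probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities
y_test = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_proba = np.array([0.1, 0.2, 0.15, 0.3, 0.45, 0.35, 0.4, 0.8, 0.9, 0.6])

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# F1 at each candidate threshold; the final precision/recall pair
# (precision=1, recall=0) has no associated threshold, so drop it
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

print(f"Best threshold: {best_threshold:.2f}, F1: {f1.max():.2f}")
```

Pick the threshold on a validation set, not the test set, or the chosen cutoff will be optimistically biased.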

Solution 5: Ensemble Methods

from imblearn.ensemble import BalancedRandomForestClassifier

# Undersamples the majority class within each bootstrap sample
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

Which Solution to Use?

Situation            Recommended Approach
Quick fix            Class weights
Very imbalanced      SMOTE + class weights
Large dataset        Undersampling
Small dataset        SMOTE
Need high recall     Lower threshold

Complete Pipeline

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Create balanced pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        class_weight='balanced',
        n_estimators=100
    ))
])

pipeline.fit(X_train, y_train)

# Evaluate properly
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
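Using imblearn's Pipeline matters for cross-validation: it refits SMOTE on each training fold only, so no synthetic samples leak into validation folds. As a sklearn-only sketch of the same evaluation principle, scoring an imbalanced problem with F1 under cross-validation (synthetic data, hypothetical parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic ~95/5 imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

model = RandomForestClassifier(
    class_weight='balanced', n_estimators=100, random_state=42
)

# Score with F1, not accuracy, so the minority class drives the metric
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```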

Key Takeaway

Never use accuracy alone for imbalanced data. Start with class weights (the easiest fix), try SMOTE for synthetic oversampling, and always evaluate with precision, recall, F1, and AUC. The right approach depends on your specific needs: sometimes missing frauds is worse than false alarms (prioritize recall); sometimes the opposite (prioritize precision).

#Machine Learning  #Imbalanced Data  #SMOTE  #Class Weights  #Intermediate