ML · 9 min read
Handling Imbalanced Datasets
Learn techniques to handle imbalanced classes when one class heavily outnumbers the other.
Sarah Chen
December 19, 2025
When 99% of your data is class A and 1% is class B, your model might just predict A for everything and get 99% accuracy. That's the imbalanced data problem.
Why It's a Problem
Fraud Detection:
████████████████████████████████ 99.9% Normal
█ 0.1% Fraud
Model can get 99.9% accuracy by predicting "Normal" for everything.
But that's useless for catching fraud!
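The trap is easy to demonstrate: a baseline that always predicts the majority class scores near-perfect accuracy while catching nothing. A minimal sketch using scikit-learn's `DummyClassifier` on made-up data (the 999/1 split is illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up data: 999 normal transactions (0), 1 fraud (1)
X = np.random.rand(1000, 5)
y = np.array([0] * 999 + [1])

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
y_pred = dummy.predict(X)

print(accuracy_score(y, y_pred))  # 0.999 (looks great)
print(recall_score(y, y_pred))    # 0.0 (catches zero frauds)
```

The recall of 0.0 is the number that matters: the model never flags a single fraud.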
Common Scenarios
- Fraud detection (rare frauds)
- Disease diagnosis (rare diseases)
- Anomaly detection
- Churn prediction (most don't churn)
Solution 1: Use the Right Metrics
Accuracy is misleading. Use these instead:
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    classification_report, roc_auc_score
)
# Get the full picture: per-class precision, recall, F1
print(classification_report(y_test, y_pred))
# ROC-AUC needs probabilities, not hard labels
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
Key metrics:
- Precision: Of predicted positives, how many are correct?
- Recall: Of actual positives, how many did we find?
- F1: Balance of precision and recall
- ROC-AUC: Overall ranking ability
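To make these concrete, here's a worked toy example (all numbers invented): a fraud model that catches 80 of 100 frauds while raising 40 false alarms among 900 normal transactions:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented results: 100 actual frauds among 1,000 transactions
y_true = np.array([1] * 100 + [0] * 900)
y_pred = np.concatenate([
    np.ones(80), np.zeros(20),   # 80 frauds caught, 20 missed
    np.ones(40), np.zeros(860),  # 40 false alarms among the normals
]).astype(int)

print(precision_score(y_true, y_pred))  # 80 / (80 + 40) ≈ 0.667
print(recall_score(y_true, y_pred))     # 80 / (80 + 20) = 0.80
print(f1_score(y_true, y_pred))         # harmonic mean ≈ 0.727
```

Note that accuracy here would be 94%, yet the model still misses one fraud in five.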
Solution 2: Class Weights
Tell the model to care more about the minority class:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
# Automatic balancing: weights inversely proportional to class frequencies
model = LogisticRegression(class_weight='balanced')
# Manual weights: errors on class 1 cost 10x more
model = RandomForestClassifier(class_weight={0: 1, 1: 10})
# For XGBoost: ratio of negative to positive samples
model = XGBClassifier(scale_pos_weight=99)
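Rather than hard-coding 99, you can derive `scale_pos_weight` from the training labels. A small sketch (the `y_train` here is an invented 0/1 array):

```python
import numpy as np

# Invented labels: 990 negatives, 10 positives (1% positive class)
y_train = np.array([0] * 990 + [1] * 10)

# scale_pos_weight = (number of negatives) / (number of positives)
ratio = (y_train == 0).sum() / (y_train == 1).sum()
print(ratio)  # 99.0
```

Pass the computed value as `scale_pos_weight=ratio` so the weight stays correct when your data changes.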
Solution 3: Resampling
Oversample Minority Class (SMOTE)
from collections import Counter
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Before: {Counter(y_train)}")
print(f"After: {Counter(y_resampled)}")
Undersample Majority Class
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
Combine Both
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train, y_train)
Solution 4: Change the Threshold
Default threshold is 0.5. Adjust it:
import numpy as np
from sklearn.metrics import precision_recall_curve
# Get probabilities for the positive class
y_proba = model.predict_proba(X_test)[:, 1]
# Lower the threshold to catch more positives (higher recall, lower precision)
threshold = 0.3
y_pred = (y_proba >= threshold).astype(int)
# Or search for the threshold that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precision * recall / (precision + recall + 1e-9)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]
Solution 5: Ensemble Methods
from imblearn.ensemble import BalancedRandomForestClassifier
# Undersamples the majority class in each bootstrap sample
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Which Solution to Use?
| Situation | Recommended Approach |
|---|---|
| Quick fix | Class weights |
| Very imbalanced | SMOTE + class weights |
| Large dataset | Undersampling |
| Small dataset | SMOTE |
| Need high recall | Lower threshold |
Complete Pipeline
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# Create balanced pipeline
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        class_weight='balanced',
        n_estimators=100
    ))
])
pipeline.fit(X_train, y_train)
# Evaluate properly
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
Key Takeaway
Never use accuracy alone for imbalanced data. Start with class weights (easiest), try SMOTE for synthetic oversampling, and always evaluate with precision/recall/F1/AUC. The right approach depends on your specific needs - sometimes missing frauds is worse than false alarms (prioritize recall), sometimes the opposite (prioritize precision).
#Machine Learning#Imbalanced Data#SMOTE#Class Weights#Intermediate