Handling Imbalanced Data
Deal with unequal class distributions.
Robert Anderson
December 18, 2025
Practical ways to fix imbalanced datasets.
What is Imbalanced Data?
A dataset is imbalanced when one class has far more examples than the others.
Example: Fraud detection
- 99,000 normal transactions
- 1,000 fraud transactions
Why It's a Problem
The model learns to always predict the majority class:
Predict "Not Fraud" for everything → 99% accuracy!
But catches 0% of actual fraud!
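The accuracy paradox is easy to reproduce. A minimal sketch using scikit-learn's DummyClassifier on hypothetical labels matching the 99:1 fraud split above (the features are dummy zeros, since this classifier ignores them):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 99,000 normal (0), 1,000 fraud (1)
y = np.array([0] * 99_000 + [1] * 1_000)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

# Always predict the majority class ("not fraud")
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X, y)
preds = majority.predict(X)

print(f"Accuracy: {accuracy_score(y, preds):.2%}")      # 99.00%
print(f"Fraud recall: {recall_score(y, preds):.2%}")    # 0.00%
```

99% accuracy, zero fraud caught, exactly as described above.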
Solution 1: Oversampling
Create more minority-class examples. SMOTE doesn't duplicate rows; it synthesizes new samples by interpolating between existing minority points. Apply it to the training set only, never the test set:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"Original: {len(y)}")
print(f"After SMOTE: {len(y_resampled)}")
Solution 2: Undersampling
Remove majority class examples:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
Solution 3: Class Weights
Tell the model to penalize minority-class mistakes more heavily:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight='balanced')
model.fit(X, y)
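If you're curious what 'balanced' actually computes, scikit-learn sets each weight to n_samples / (n_classes * count(class)). A quick sketch on a toy split with the same 99:1 ratio:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with the article's 99:1 ratio, scaled down to 100 samples
y = np.array([0] * 99 + [1] * 1)

# 'balanced' weight = n_samples / (n_classes * count(class))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # class 1 gets ~99x the weight of class 0
```

With 99 vs. 1 samples, class 0 gets 100/(2*99) ≈ 0.505 and class 1 gets 100/(2*1) = 50.0.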
Solution 4: Different Metrics
Don't use accuracy! Use:
Precision: Of predicted frauds, how many were real?
Recall: Of real frauds, how many did we catch?
F1-Score: Balance of precision and recall
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
Real Example - Credit Card Fraud
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
# Apply SMOTE
smote = SMOTE()
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
# Train with class weights
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train_balanced, y_train_balanced)
When to Use What
- Lots of data: Undersample
- Little data: Oversample (SMOTE)
- Can't resample: Use class weights
Remember
- Never use accuracy alone
- Use F1-score or AUC
- Combine multiple techniques
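To tie the last two points together, a minimal sketch computing F1 and AUC on a synthetic imbalanced dataset (illustrative data and parameters, not the fraud example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 90:10 dataset (illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_tr, y_tr)

f1 = f1_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])  # AUC needs probabilities
print(f"F1: {f1:.3f}  AUC: {auc:.3f}")
```

Note that roc_auc_score takes predicted probabilities (or scores), not hard class labels.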
#AI#Intermediate#Imbalanced