Data Imbalance
Data imbalance occurs when one class in a dataset has far more samples than another. Because most training examples come from the dominant class, the model learns that class best and predicts the minority class poorly.
Why Data Imbalance Is a Problem
- The model focuses on the majority class.
- The model ignores rare cases.
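- Accuracy becomes misleading. On a 99:1 dataset, always predicting the majority class gives 99% accuracy while detecting no minority cases.
- Predictions lose fairness, because errors concentrate on the under-represented class.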
Common Examples
- Fraud detection. Fraudulent cases are a small fraction of all cases.
- Medical diagnosis. Rare diseases appear with low frequency.
- Spam detection. Spam and legitimate (ham) messages rarely occur in equal numbers.
Effects of Data Imbalance
- High accuracy but poor real-world performance
- Biased model outputs
- Weak recall on minority class
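
The effects listed above can be seen on a small made-up example. The sketch below assumes a hypothetical dataset of 990 negatives and 10 positives and a "model" that always predicts the majority class. The class counts are illustrative, not taken from real data.

```python
# Hypothetical toy dataset: 990 negatives (class 0) and 10 positives (class 1).
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
minority_recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.99
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The accuracy looks excellent even though not a single minority case is found.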
Ways to Handle Data Imbalance
1. Undersampling
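Remove samples from the majority class until the classes are closer in size. This discards data, so it works best when the majority class is large.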
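
A minimal sketch of random undersampling, using only the Python standard library. The helper name `random_undersample` and the toy data are assumptions made for illustration. The sketch shrinks every class to the size of the smallest one, which is one possible target among several.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class has as many as the smallest class."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    # Every class is cut down to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))  # sample without replacement
    kept.sort()  # keep the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]

# Hypothetical toy data: 8 majority samples (0) and 2 minority samples (1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(X, y)
print(Counter(y_res))  # Counter({0: 2, 1: 2})
```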
2. Oversampling
Increase the number of minority-class samples by duplicating existing ones. Exact duplicates add no new information, so the model can overfit to them.
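
A minimal sketch of oversampling by duplication, again in plain Python. The helper name `random_oversample` and the toy data are assumptions for illustration. Minority samples are drawn with replacement until every class matches the largest one.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    target = max(len(indices) for indices in by_class.values())
    chosen = []
    for indices in by_class.values():
        chosen.extend(indices)  # keep every original sample
        # Draw the missing samples with replacement (duplicates).
        chosen.extend(rng.choices(indices, k=target - len(indices)))
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Hypothetical toy data: 8 majority samples (0) and 2 minority samples (1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({0: 8, 1: 8})
```

Resampling is normally applied to the training split only, so that duplicated samples do not also appear in the evaluation data.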
3. SMOTE
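SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic samples for the minority class by interpolating between a minority sample and one of its nearest minority-class neighbors.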
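
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not a full implementation. The function name `smote_sketch`, the default `k=3`, and the toy points are assumptions. In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is usually preferable.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position between base (gap = 0) and neighbour (gap -> 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Because each synthetic point lies between two real minority samples, it stays inside the region the minority class already occupies.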
4. Class Weighting
Give higher weight to minority samples during training, so that errors on the minority class cost more in the loss.
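
One common way to choose the weights is to make them inversely proportional to class frequency: weight = n_samples / (n_classes * class_count). This is the heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the 90:10 labels below are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical labels: 90 majority samples (0) and 10 minority samples (1).
y = [0] * 90 + [1] * 10

weights = balanced_class_weights(y)
print({label: round(w, 3) for label, w in weights.items()})  # {0: 0.556, 1: 5.0}

# Per-sample weights, in the form many training APIs accept.
sample_weights = [weights[label] for label in y]
```

With these weights each class contributes the same total weight to the loss (90 * 0.556 and 10 * 5.0 both come to about 50).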
Evaluation Tips
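- Use precision and recall instead of relying on accuracy alone.
- Use the F1 score, the harmonic mean of precision and recall.
- Use a confusion matrix to see which classes are confused with each other.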
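
The metrics above can be computed directly from the four cells of a binary confusion matrix. The sketch below is a plain-Python illustration. The function name `binary_metrics` and the example labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return confusion-matrix counts plus precision, recall and F1
    for the class treated as positive (here, the minority class)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1  # minority case correctly detected
        elif t != positive and p == positive:
            fp += 1  # false alarm
        elif t == positive and p != positive:
            fn += 1  # missed minority case
        else:
            tn += 1  # majority case correctly rejected
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

m = binary_metrics(y_true, y_pred)
print(m["tp"], m["fp"], m["fn"], m["tn"])  # 1 1 2 6
print(round(m["precision"], 2), round(m["recall"], 2), round(m["f1"], 2))  # 0.5 0.33 0.4
```

Here accuracy would be 70%, yet recall shows that only one of the three minority cases was found.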
Data Imbalance in Moroccan Darija (Translated)
Data imbalance occurs when one class is present in large quantity and another class in small quantity. The model learns more from the large class and neglects the small class.
The Problem
- The model focuses on the majority.
- Predictions for the minority class become weak.
- Accuracy looks good, but real performance is not.
The Solution
- Undersampling.
- Oversampling.
- SMOTE.
- Class weighting.
Conclusion
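Data imbalance creates models biased toward the majority class. Addressing it with resampling or class weighting, and judging results with precision, recall, and F1 rather than accuracy alone, improves fairness and prediction quality.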