
Mitigating Class Imbalance in Machine Learning Datasets

Strategies to address CI in numeric, text, and image datasets (for example, synthetic data generation, resampling)


[!IMPORTANT] Class Imbalance (CI) occurs when the number of samples in one class (the majority) significantly outweighs the others (the minority). Left unaddressed, ML models typically default to predicting the majority class to achieve high accuracy, while failing to detect the critical minority class (e.g., fraud, rare diseases).


Learning Objectives

After studying this guide, you should be able to:

  • Identify class imbalance across numeric, text, and image datasets.
  • Contrast resampling techniques (Oversampling vs. Undersampling).
  • Explain the difference between Data Augmentation and Synthetic Data Generation.
  • Select appropriate mitigation strategies (SMOTE, GANs, Class Weighting) based on data type.

Key Terms & Glossary

  • Class Imbalance (CI): A disproportionate distribution of labels in a training dataset.
  • Oversampling: Increasing minority class representation by duplicating samples or creating new ones.
  • Undersampling: Decreasing majority class representation by removing samples.
  • SMOTE (Synthetic Minority Over-sampling Technique): An algorithm that generates synthetic samples by interpolating between existing minority data points.
  • Data Augmentation: Creating new training samples by applying transformations (flips, rotations, synonym replacement) to existing data.
  • Synthetic Data: Entirely new data generated from scratch (e.g., via GANs) that does not use original dataset records directly.

The "Big Idea"

The core challenge of machine learning is not just "accuracy" but "generalization." In an imbalanced dataset, a model can achieve 99% accuracy by simply guessing the majority class every time, yet it is useless for its intended purpose (like finding the 1% of fraudulent transactions). Addressing CI is the process of forcing the model to value the "rare event" as much as the "common event."


Formula / Concept Box

1. Class Weighting Logic

In the loss function, we assign a weight $W$ to each class to penalize the model more for missing the minority class:

$$L_{total} = W_{maj} \cdot L_{maj} + W_{min} \cdot L_{min}$$

Typically, $W_{min} = \frac{\text{Total Samples}}{\text{Number of Minority Samples}}$.
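The weighting formula can be sketched in a few lines of plain Python. This is a minimal illustration: `class_weights` is a hypothetical helper following the unnormalized formula above, and real frameworks (e.g., scikit-learn's `class_weight="balanced"`) normalize by the number of classes as well.

```python
from collections import Counter

def class_weights(labels):
    # W_class = total samples / samples in that class (per the formula above);
    # hypothetical helper, deliberately unnormalized
    total = len(labels)
    return {cls: total / n for cls, n in Counter(labels).items()}

labels = ["normal"] * 9900 + ["fraud"] * 100
weights = class_weights(labels)
# The rare class receives a far larger penalty weight than the common one:
print(weights["fraud"])  # 100.0
```

Note that the majority-class weight stays close to 1 (here 10,000 / 9,900), so only minority errors are strongly amplified.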

2. SMOTE Interpolation

To create a new point $P_{new}$ between point $A$ and its neighbor $B$:

$$P_{new} = A + \text{rand}(0, 1) \times (B - A)$$
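The interpolation step can be sketched in plain Python. `smote_point` is a hypothetical helper (full SMOTE also selects $B$ from the k nearest minority neighbors, which is omitted here); the `rng` parameter stands in for $\text{rand}(0, 1)$.

```python
import random

def smote_point(a, b, rng=random.random):
    # P_new = A + rand(0,1) * (B - A), applied per feature
    t = rng()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

# With t fixed at 0.5, the synthetic point is the midpoint of A and B:
print(smote_point([1, 1], [3, 4], rng=lambda: 0.5))  # [2.0, 2.5]
```

With $A=(1,1)$ and $B=(3,4)$ this reproduces the synthetic sample at $(2, 2.5)$ shown in the SMOTE diagram below.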


Visual Anchors

Decision Flow for CI Mitigation


SMOTE Visual Representation

This diagram shows how SMOTE creates a synthetic point along the line segment joining two existing minority samples.

```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (0,0) -- (5,0) node[right] {Feature 1};
  \draw[->] (0,0) -- (0,5) node[above] {Feature 2};
  % Minority Points
  \filldraw[blue] (1,1) circle (2pt) node[anchor=north] {Point A};
  \filldraw[blue] (3,4) circle (2pt) node[anchor=south] {Point B};
  % Interpolation Line
  \draw[dashed, blue] (1,1) -- (3,4);
  % Synthetic Point
  \filldraw[red] (2,2.5) circle (3pt) node[anchor=west] {Synthetic Sample (SMOTE)};
  % Majority Points
  \draw (4,1) node {X};
  \draw (4.5,2) node {X};
  \draw (3.5,0.5) node {X};
\end{tikzpicture}
```


Hierarchical Outline

  • I. Resampling Techniques
    • A. Oversampling (Increases minority count)
      • Random Oversampling (Simple duplication; risks overfitting)
      • SMOTE (Interpolation; creates more diverse samples)
    • B. Undersampling (Decreases majority count)
      • Random Undersampling (Fast; risks information loss)
  • II. Data Augmentation
    • A. Image: Rotations, flips, color jittering, scaling.
    • B. Text: Back-translation, synonym replacement, random insertion.
  • III. Generative Approaches (Synthetic Data)
    • A. GANs (Generative Adversarial Networks): Generator creates data; Discriminator tests it.
    • B. Diffusion Models: Adding/removing noise to generate diverse image samples.
  • IV. Algorithmic Adjustments
    • A. Class Weighting: Penalizing minority class errors more heavily.
    • B. SageMaker Clarify: Tooling to detect pre-training bias (CI metrics).
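As one concrete example of the text-augmentation branch above, synonym replacement can be sketched with the standard library alone. The `SYNONYMS` table is a toy stand-in; real pipelines typically draw synonyms from WordNet or embedding-based lookups.

```python
import random

# Toy synonym table (hypothetical; real systems use a lexical database)
SYNONYMS = {"quick": ["fast", "speedy"], "payment": ["transaction", "transfer"]}

def synonym_replace(sentence, p=0.5, seed=None):
    # Swap each known word for a random synonym with probability p
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)

# With p=1.0 every known word is replaced, yielding a new minority-class sample
print(synonym_replace("quick payment declined", p=1.0, seed=0))
```

Each call with a different seed produces a slightly different sentence, which is exactly the variety an underrepresented text class needs.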

Comparison Tables

| Feature | Oversampling (SMOTE) | Undersampling | Data Augmentation |
| --- | --- | --- | --- |
| Data Change | Adds synthetic samples | Removes majority samples | Modifies existing samples |
| Primary Risk | Overfitting (if simple) | Loss of useful data | May not add enough variety |
| Best Use Case | Small minority class | Massive majority class | Image/Text data |
| Complexity | Medium | Low | Medium/High |

Definition-Example Pairs

  • Technique: Undersampling
    • Definition: Reducing the number of samples from the majority class to match the minority.
    • Example: In a credit card dataset with 1,000,000 legitimate transactions and 1,000 frauds, you might randomly select only 5,000 legitimate transactions to train the model.
  • Technique: GANs (Generative Adversarial Networks)
    • Definition: Using two neural networks (Generator and Discriminator) to create highly realistic synthetic data.
    • Example: Generating realistic MRI scans of a rare brain tumor to supplement a medical imaging dataset that lacks sufficient positive cases.
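The undersampling example above can be sketched as follows. `undersample` is a hypothetical helper, and the row counts are scaled down from the credit-card scenario for brevity.

```python
import random

def undersample(majority, minority, keep=5000, seed=42):
    # Randomly keep only `keep` majority rows; minority rows are kept in full
    rng = random.Random(seed)
    return rng.sample(majority, min(keep, len(majority))) + list(minority)

legit = [("legit", i) for i in range(100_000)]  # stand-in for 1,000,000 rows
fraud = [("fraud", i) for i in range(1_000)]
train = undersample(legit, fraud, keep=5000)
print(len(train))  # 6000
```

The discarded legitimate rows are the "information loss" risk the outline mentions: any rare-but-real pattern among them is gone for good.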

Worked Example: Fraud Detection Setup

Scenario: You have a dataset of 100,000 bank transactions. 99,900 are "Normal" and 100 are "Fraud" (0.1% minority).

Step 1: Metric Selection. Do NOT use accuracy: a model that always predicts "Normal" scores 99.9% accuracy while missing every fraud. Use Precision-Recall Area Under the Curve (PR-AUC) or the F1-score instead.
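To make the accuracy trap concrete, here is a quick sketch of the trivial "always Normal" classifier, using the counts from the scenario:

```python
def always_normal_metrics(n_normal, n_fraud):
    # The degenerate classifier predicts "Normal" for every transaction
    accuracy = n_normal / (n_normal + n_fraud)
    recall_fraud = 0.0  # it never flags a single fraud
    return accuracy, recall_fraud

acc, rec = always_normal_metrics(99_900, 100)
print(acc, rec)  # 0.999 0.0
```

99.9% accuracy, zero fraud recall: this is why recall-sensitive metrics such as PR-AUC and F1 are the right yardstick here.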

Step 2: Apply SMOTE (Numeric Data). Since transaction amounts and timestamps are numeric, apply SMOTE to the 100 Fraud cases to generate 4,900 synthetic cases, bringing the total to 5,000 Fraud cases.

Step 3: Apply Class Weighting. Set the model parameters to weight the Fraud class 20x more than the Normal class. This ensures the "cost" of missing a fraud is much higher than misidentifying a normal transaction.
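Step 3 can be illustrated with a hand-rolled weighted binary log loss. This is a sketch only: frameworks expose the same idea through parameters such as `class_weight` or `pos_weight`, and the 20x figure is the tuning choice from the scenario, not a universal constant.

```python
import math

def weighted_log_loss(y_true, p_pred, fraud_weight=20.0):
    # Errors on fraud (y=1) are penalized fraud_weight times more heavily
    total = 0.0
    for y, p in zip(y_true, p_pred):
        w = fraud_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confidently missing a fraud costs far more than the mirror-image mistake:
miss_fraud = weighted_log_loss([1], [0.1])   # predicted 10% fraud; was fraud
miss_normal = weighted_log_loss([0], [0.9])  # predicted 90% fraud; was normal
print(miss_fraud > miss_normal)  # True
```

Because both mistakes are equally confident, the fraud miss costs exactly 20x more, which is precisely the asymmetry class weighting is meant to create.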


Checkpoint Questions

  1. Why is accuracy a poor metric for imbalanced datasets?
  2. What is the primary risk associated with undersampling the majority class?
  3. How does SMOTE differ from simple random oversampling?
  4. Which AWS tool is used to identify pre-training bias like Class Imbalance?
Answers
  1. Accuracy ignores class distribution; a model can be highly accurate by simply predicting the majority class while failing the minority task.
  2. You might delete valuable patterns/information from the majority class.
  3. Random oversampling duplicates existing rows (leading to overfitting); SMOTE creates new, interpolated rows.
  4. Amazon SageMaker Clarify.

Muddy Points & Cross-Refs

  • Augmentation vs. Synthetic: Learners often confuse these. Remember: Augmentation starts with an existing photo/text and tweaks it. Synthetic (GANs) creates a brand-new "fake" record from noise/patterns.
  • When to stop oversampling? You don't always need a 50/50 split. Often, a 10/90 or 20/80 split is enough for the model to learn the minority features without drowning in synthetic noise.
  • Next Steps: See SageMaker Clarify for post-training bias metrics and Cost-Sensitive Learning for deep-dives into weighting.
