
Mitigating Class Imbalance in Machine Learning Datasets

Strategies to address CI in numeric, text, and image datasets (for example, synthetic data generation, resampling)


[!IMPORTANT] Class Imbalance (CI) occurs when the number of samples in one class (the majority) significantly outweighs the others (the minority). Left unaddressed, ML models typically default to predicting the majority class to achieve high accuracy, while failing to detect the critical minority class (e.g., fraud, rare diseases).


Learning Objectives

After studying this guide, you should be able to:

  • Identify class imbalance across numeric, text, and image datasets.
  • Contrast resampling techniques (Oversampling vs. Undersampling).
  • Explain the difference between Data Augmentation and Synthetic Data Generation.
  • Select appropriate mitigation strategies (SMOTE, GANs, Class Weighting) based on data type.

Key Terms & Glossary

  • Class Imbalance (CI): A disproportionate distribution of labels in a training dataset.
  • Oversampling: Increasing minority class representation by duplicating samples or creating new ones.
  • Undersampling: Decreasing majority class representation by removing samples.
  • SMOTE (Synthetic Minority Over-sampling Technique): An algorithm that generates synthetic samples by interpolating between existing minority data points.
  • Data Augmentation: Creating new training samples by applying transformations (flips, rotations, synonym replacement) to existing data.
  • Synthetic Data: Entirely new data generated from scratch (e.g., via GANs) that does not use original dataset records directly.

The "Big Idea"

The core challenge of machine learning is not just "accuracy" but "generalization." In an imbalanced dataset, a model can achieve 99% accuracy by simply guessing the majority class every time, yet it is useless for its intended purpose (like finding the 1% of fraudulent transactions). Addressing CI is the process of forcing the model to value the "rare event" as much as the "common event."


Formula / Concept Box

1. Class Weighting Logic

In the loss function, we assign a weight $W$ to each class to penalize the model more for missing the minority class:

$$L_{total} = W_{maj} \cdot L_{maj} + W_{min} \cdot L_{min}$$

Typically, $W_{min} = \frac{\text{Total Samples}}{\text{Number of Minority Samples}}$.
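The weighting formula can be sketched in a few lines of plain Python. This is a minimal illustration: `class_weights` is a hypothetical helper following the unnormalized formula above, and real frameworks (e.g., scikit-learn's `class_weight="balanced"`) normalize by the number of classes as well.

```python
from collections import Counter

def class_weights(labels):
    # W_class = total samples / samples in that class (per the formula above);
    # hypothetical helper, deliberately unnormalized
    total = len(labels)
    return {cls: total / n for cls, n in Counter(labels).items()}

labels = ["normal"] * 9900 + ["fraud"] * 100
weights = class_weights(labels)
# The rare class receives a far larger penalty weight than the common one:
print(weights["fraud"])  # 100.0
```

Note that the majority-class weight stays close to 1 (here 10,000 / 9,900), so only minority errors are strongly amplified.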

2. SMOTE Interpolation

To create a new point $P_{new}$ between point $A$ and its neighbor $B$:

$$P_{new} = A + \text{rand}(0, 1) \times (B - A)$$
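The interpolation step can be sketched in plain Python. `smote_point` is a hypothetical helper (full SMOTE also selects $B$ from the k nearest minority neighbors, which is omitted here); the `rng` parameter stands in for $\text{rand}(0, 1)$.

```python
import random

def smote_point(a, b, rng=random.random):
    # P_new = A + rand(0,1) * (B - A), applied per feature
    t = rng()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

# With t fixed at 0.5, the synthetic point is the midpoint of A and B:
print(smote_point([1, 1], [3, 4], rng=lambda: 0.5))  # [2.0, 2.5]
```

With $A=(1,1)$ and $B=(3,4)$ this reproduces the synthetic sample at $(2, 2.5)$ shown in the SMOTE diagram below.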


Visual Anchors

Decision Flow for CI Mitigation


SMOTE Visual Representation

This diagram shows how SMOTE creates a synthetic point along the line segment joining two existing minority samples.

```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (0,0) -- (5,0) node[right] {Feature 1};
  \draw[->] (0,0) -- (0,5) node[above] {Feature 2};
  % Minority Points
  \filldraw[blue] (1,1) circle (2pt) node[anchor=north] {Point A};
  \filldraw[blue] (3,4) circle (2pt) node[anchor=south] {Point B};
  % Interpolation Line
  \draw[dashed, blue] (1,1) -- (3,4);
  % Synthetic Point
  \filldraw[red] (2,2.5) circle (3pt) node[anchor=west] {Synthetic Sample (SMOTE)};
  % Majority Points
  \draw (4,1) node {X};
  \draw (4.5,2) node {X};
  \draw (3.5,0.5) node {X};
\end{tikzpicture}
```


Hierarchical Outline

  • I. Resampling Techniques
    • A. Oversampling (Increases minority count)
      • Random Oversampling (Simple duplication; risks overfitting)
      • SMOTE (Interpolation; creates more diverse samples)
    • B. Undersampling (Decreases majority count)
      • Random Undersampling (Fast; risks information loss)
  • II. Data Augmentation
    • A. Image: Rotations, flips, color jittering, scaling.
    • B. Text: Back-translation, synonym replacement, random insertion.
  • III. Generative Approaches (Synthetic Data)
    • A. GANs (Generative Adversarial Networks): Generator creates data; Discriminator tests it.
    • B. Diffusion Models: Adding/removing noise to generate diverse image samples.
  • IV. Algorithmic Adjustments
    • A. Class Weighting: Penalizing minority class errors more heavily.
    • B. SageMaker Clarify: Tooling to detect pre-training bias (CI metrics).
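As one concrete example of the text-augmentation branch above, synonym replacement can be sketched with the standard library alone. The `SYNONYMS` table is a toy stand-in; real pipelines typically draw synonyms from WordNet or embedding-based lookups.

```python
import random

# Toy synonym table (hypothetical; real systems use a lexical database)
SYNONYMS = {"quick": ["fast", "speedy"], "payment": ["transaction", "transfer"]}

def synonym_replace(sentence, p=0.5, seed=None):
    # Swap each known word for a random synonym with probability p
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        if word in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[word]))
        else:
            out.append(word)
    return " ".join(out)

# With p=1.0 every known word is replaced, yielding a new minority-class sample
print(synonym_replace("quick payment declined", p=1.0, seed=0))
```

Each call with a different seed produces a slightly different sentence, which is exactly the variety an underrepresented text class needs.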

Comparison Tables

| Feature | Oversampling (SMOTE) | Undersampling | Data Augmentation |
| --- | --- | --- | --- |
| Data Change | Adds synthetic samples | Removes majority samples | Modifies existing samples |
| Primary Risk | Overfitting (if simple) | Loss of useful data | May not add enough variety |
| Best Use Case | Small minority class | Massive majority class | Image/Text data |
| Complexity | Medium | Low | Medium/High |

Definition-Example Pairs

  • Technique: Undersampling
    • Definition: Reducing the number of samples from the majority class to match the minority.
    • Example: In a credit card dataset with 1,000,000 legitimate transactions and 1,000 frauds, you might randomly select only 5,000 legitimate transactions to train the model.
  • Technique: GANs (Generative Adversarial Networks)
    • Definition: Using two neural networks (Generator and Discriminator) to create highly realistic synthetic data.
    • Example: Generating realistic MRI scans of a rare brain tumor to supplement a medical imaging dataset that lacks sufficient positive cases.
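The undersampling example above can be sketched as follows. `undersample` is a hypothetical helper, and the row counts are scaled down from the credit-card scenario for brevity.

```python
import random

def undersample(majority, minority, keep=5000, seed=42):
    # Randomly keep only `keep` majority rows; minority rows are kept in full
    rng = random.Random(seed)
    return rng.sample(majority, min(keep, len(majority))) + list(minority)

legit = [("legit", i) for i in range(100_000)]  # stand-in for 1,000,000 rows
fraud = [("fraud", i) for i in range(1_000)]
train = undersample(legit, fraud, keep=5000)
print(len(train))  # 6000
```

The discarded legitimate rows are the "information loss" risk the outline mentions: any rare-but-real pattern among them is gone for good.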

Worked Example: Fraud Detection Setup

Scenario: You have a dataset of 100,000 bank transactions. 99,900 are "Normal" and 100 are "Fraud" (0.1% minority).

Step 1: Metric Selection. Do NOT use accuracy: a model that always predicts "Normal" scores 99.9% accuracy while missing every fraud. Use Precision-Recall Area Under the Curve (PR-AUC) or the F1-score instead.
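To make the accuracy trap concrete, here is a quick sketch of the trivial "always Normal" classifier, using the counts from the scenario:

```python
def always_normal_metrics(n_normal, n_fraud):
    # The degenerate classifier predicts "Normal" for every transaction
    accuracy = n_normal / (n_normal + n_fraud)
    recall_fraud = 0.0  # it never flags a single fraud
    return accuracy, recall_fraud

acc, rec = always_normal_metrics(99_900, 100)
print(acc, rec)  # 0.999 0.0
```

99.9% accuracy, zero fraud recall: this is why recall-sensitive metrics such as PR-AUC and F1 are the right yardstick here.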

Step 2: Apply SMOTE (Numeric Data). Since transaction amounts and timestamps are numeric, apply SMOTE to the 100 Fraud cases to generate 4,900 synthetic cases, bringing the total to 5,000 Fraud cases.

Step 3: Apply Class Weighting. Set the model parameters to weight the Fraud class 20x more than the Normal class. This ensures the "cost" of missing a fraud is much higher than misidentifying a normal transaction.
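Step 3 can be illustrated with a hand-rolled weighted binary log loss. This is a sketch only: frameworks expose the same idea through parameters such as `class_weight` or `pos_weight`, and the 20x figure is the tuning choice from the scenario, not a universal constant.

```python
import math

def weighted_log_loss(y_true, p_pred, fraud_weight=20.0):
    # Errors on fraud (y=1) are penalized fraud_weight times more heavily
    total = 0.0
    for y, p in zip(y_true, p_pred):
        w = fraud_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confidently missing a fraud costs far more than the mirror-image mistake:
miss_fraud = weighted_log_loss([1], [0.1])   # predicted 10% fraud; was fraud
miss_normal = weighted_log_loss([0], [0.9])  # predicted 90% fraud; was normal
print(miss_fraud > miss_normal)  # True
```

Because both mistakes are equally confident, the fraud miss costs exactly 20x more, which is precisely the asymmetry class weighting is meant to create.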


Checkpoint Questions

  1. Why is accuracy a poor metric for imbalanced datasets?
  2. What is the primary risk associated with undersampling the majority class?
  3. How does SMOTE differ from simple random oversampling?
  4. Which AWS tool is used to identify pre-training bias like Class Imbalance?
Answers
  1. Accuracy ignores class distribution; a model can be highly accurate by simply predicting the majority class while failing the minority task.
  2. You might delete valuable patterns/information from the majority class.
  3. Random oversampling duplicates existing rows (leading to overfitting); SMOTE creates new, interpolated rows.
  4. Amazon SageMaker Clarify.

Muddy Points & Cross-Refs

  • Augmentation vs. Synthetic: Learners often confuse these. Remember: Augmentation starts with an existing photo/text and tweaks it. Synthetic (GANs) creates a brand-new "fake" record from noise/patterns.
  • When to stop oversampling? You don't always need a 50/50 split. Often, a 10/90 or 20/80 split is enough for the model to learn the minority features without drowning in synthetic noise.
  • Next Steps: See SageMaker Clarify for post-training bias metrics and Cost-Sensitive Learning for deep-dives into weighting.
