Study Guide: Pre-training Bias Metrics in Machine Learning
Pre-training bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL])
Pre-training Bias Metrics in Machine Learning
This study guide focuses on identifying and measuring bias in datasets before model training begins, primarily using Amazon SageMaker Clarify. Ensuring data integrity at this stage is crucial for building fair and ethical machine learning models.
Learning Objectives
- Define and interpret Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
- Identify facets within a dataset that represent protected or sensitive attributes.
- Evaluate the severity of bias based on metric ranges (e.g., [-1, +1]).
- Select appropriate mitigation strategies based on pre-training bias findings.
Key Terms & Glossary
- Facet: A specific feature or attribute in a dataset (e.g., gender, age, or postal code) being analyzed for potential bias.
- Facet a: The feature value defining a demographic that bias typically favors (privileged group).
- Facet d: The feature value defining a demographic that bias typically disfavors (underprivileged group).
- Class Imbalance (CI): A metric measuring the difference in the number of samples between different facets.
- Difference in Proportions of Labels (DPL): A metric measuring the imbalance of positive outcomes (labels) between different facets.
- SageMaker Clarify: An AWS service used to detect bias in ML models and datasets and provide explainability.
The "Big Idea"
Bias in machine learning is often a "Garbage In, Garbage Out" problem. If the training data contains historical or systemic imbalances, the model will learn and amplify these biases. Pre-training metrics allow data scientists to quantify these imbalances before the model is ever built, providing a window for mitigation (like resampling or synthetic data generation) that ensures the final product is equitable.
Formula / Concept Box
| Metric | Range | Ideal Value | Description |
|---|---|---|---|
| Class Imbalance (CI) | $[-1, +1]$ | 0 | Measures whether one facet has significantly more data points than another. |
| DPL (Binary/Categorical) | $[-1, +1]$ | 0 | Measures whether one facet receives positive outcomes more frequently than another. |
| DPL (Continuous) | $[-\infty, +\infty]$ | 0 | Measures the difference in mean label values across facets. |
[!IMPORTANT] For all SageMaker Clarify metrics, a value of 0 (or near 0) denotes no bias or perfect balance.
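The formulas in the box above can be sketched in plain Python. The CI and binary DPL definitions below follow the standard SageMaker Clarify forms ($CI = (n_a - n_d)/(n_a + n_d)$ and $DPL = q_a - q_d$); the sample and label counts passed in at the end are illustrative only.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d).
    Range [-1, +1]; 0 means facets a and d have equal sample counts."""
    return (n_a - n_d) / (n_a + n_d)

def dpl_binary(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, the difference in positive-label proportions.
    Range [-1, +1]; 0 means equal positive-outcome rates."""
    return pos_a / n_a - pos_d / n_d

def dpl_continuous(labels_a, labels_d):
    """For continuous labels: difference in mean label value across facets.
    Unbounded; 0 means equal means."""
    return sum(labels_a) / len(labels_a) - sum(labels_d) / len(labels_d)

print(class_imbalance(1000, 125))          # ~0.778 -> facet a overrepresented
print(dpl_binary(800, 1000, 25, 125))      # ~0.6   -> outcomes favor facet a
print(dpl_continuous([50, 60], [30, 40]))  # 20.0   -> higher mean label for facet a
```

Note that a CI of 0 and a DPL of 0 fall out directly from equal counts and equal proportions, matching the "Ideal Value" column above.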
Hierarchical Outline
- I. Introduction to Bias Detection
  - Amazon SageMaker Clarify: The primary tool for pre-training and post-training bias detection.
  - BiasConfig Class: Used in the `sagemaker.clarify` library to configure analysis settings.
- II. Identifying Facets
  - Sensitive Attributes: Selecting columns like `is_fraudulent`, `gender`, or `race` as facets.
  - Group Definitions: Understanding Facet a (favored) vs. Facet d (disfavored).
- III. Core Pre-training Metrics
  - Class Imbalance (CI):
    - $+1$: Complete imbalance toward facet a (all samples belong to facet a).
    - $-1$: Complete imbalance toward facet d.
  - Difference in Proportions of Labels (DPL):
    - Measures the outcome distribution across facets.
    - Positive value: Facet a has a higher proportion of positive labels.
    - Negative value: Facet d has a higher proportion of positive labels.
- IV. Application Areas
  - Numeric Data: Tabular credit scores or income.
  - Image Data: Object detection and classification bias.
  - Text Data: Sentiment analysis across different demographic keywords.
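The `BiasConfig` setup mentioned in the outline can be sketched as a minimal configuration fragment. It assumes the SageMaker Python SDK (`sagemaker`) is installed; the column name `gender`, the facet value `"female"`, and the positive label value `1` are hypothetical placeholders for illustration.

```python
from sagemaker import clarify

# Minimal pre-training bias configuration sketch. The facet and label
# values below are hypothetical; substitute your dataset's columns.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],          # label values that count as positive outcomes
    facet_name="gender",                    # sensitive attribute (the facet) to analyze
    facet_values_or_threshold=["female"],   # values defining facet d (the disfavored group)
)
```

A `SageMakerClarifyProcessor` then consumes this object (together with a `DataConfig`) through its `run_pre_training_bias` method to produce the CI, DPL, and related metrics as a report.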
Visual Anchors
Bias Detection Workflow
Visualizing Facet Imbalance
```latex
\begin{tikzpicture}
  % Bar for Facet a
  \draw[fill=blue!30] (0,0) rectangle (1,4);
  \node at (0.5,-0.3) {Facet a};
  \node at (0.5,4.3) {1000 Samples};
  % Bar for Facet d
  \draw[fill=red!30] (2,0) rectangle (3,0.5);
  \node at (2.5,-0.3) {Facet d};
  \node at (2.5,0.8) {125 Samples};
  % Baseline and annotation
  \draw[dashed] (-0.5,0) -- (4,0);
  \node[right] at (4,2) {CI (High Bias)};
\end{tikzpicture}
```
Definition-Example Pairs
- Metric: Class Imbalance (CI)
- Definition: The normalized difference in the number of samples ($n_a$ vs. $n_d$) between facets: $CI = \frac{n_a - n_d}{n_a + n_d}$.
- Example: In a medical dataset, if there are 9,000 records for "Group A" and only 100 for "Group B," the CI will be close to +1, indicating Group A is heavily overrepresented.
- Metric: Difference in Proportions of Labels (DPL)
- Definition: The difference between the proportion of positive outcomes in facet a and the proportion in facet d: $DPL = q_a - q_d$.
- Example: If 80% of applicants in Facet a are approved for a loan but only 20% in Facet d, then $DPL = 0.80 - 0.20 = 0.60$, indicating a strong bias in outcomes toward Facet a.
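A quick numeric check of the two definition-example pairs above, using only the counts and rates stated in the text:

```python
# Medical dataset example: 9,000 records for Group A, 100 for Group B.
n_a, n_d = 9_000, 100
ci = (n_a - n_d) / (n_a + n_d)
print(round(ci, 3))            # 0.978 -> close to +1, Group A heavily overrepresented

# Loan approval example: 80% approval for facet a, 20% for facet d.
q_a, q_d = 0.80, 0.20
print(round(q_a - q_d, 2))     # 0.6 -> strong outcome bias toward facet a
```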
Worked Examples
Example 1: Credit Card Fraud
Scenario: You have a dataset where 99.9% of transactions are is_fraudulent = 0 (not fraud) and 0.1% are is_fraudulent = 1 (fraud).
- Metric Selection: You calculate Class Imbalance (CI).
- Result: The CI value is $0.998$.
- Interpretation: Since the value is near $+1$, there is a severe majority-class imbalance.
- Action: Use undersampling on the majority class or oversampling (synthetic data generation) on the minority class before splitting data into training and test sets.
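The action step above (undersampling the majority class or oversampling the minority class) can be sketched with the standard library alone. This is a generic rebalancing illustration, not a SageMaker-specific API; the toy lists mirror the 99.9% / 0.1% fraud scenario.

```python
import random

random.seed(42)  # reproducible sketch

# Toy stand-ins for transaction labels, mirroring the scenario above.
majority = [0] * 999   # is_fraudulent = 0
minority = [1] * 1     # is_fraudulent = 1

# Option 1: undersample the majority class down to the minority size.
under = random.sample(majority, k=len(minority)) + minority

# Option 2: oversample the minority class (with replacement) up to the
# majority size; SMOTE-style synthetic generation refines this idea.
over = majority + random.choices(minority, k=len(majority))

print(len(under), sum(under))  # 2 rows, one per class
print(len(over), sum(over))    # 1998 rows, 999 per class
```

Either way, rebalance only the training split: applying resampling before the train/test split leaks duplicated or synthetic minority rows into the test set.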
Example 2: Hiring Algorithm
Scenario: A company checks whether its resume screening tool is biased against a certain zip code (Facet d).
- Facet a (Other zip codes) Positive Outcome Rate: $0.50$
- Facet d (Target zip code) Positive Outcome Rate: $0.10$
- Calculation: $DPL = 0.50 - 0.10 = 0.40$
- Conclusion: The dataset shows a $0.40$ bias in favor of other zip codes. The data scientist must investigate whether the labels themselves are biased before training.
Checkpoint Questions
- What does a CI value of 0 indicate?
- In SageMaker Clarify, which facet (a or d) represents the group potentially disfavored by bias?
- If you find a high DPL in your pre-training data, which library/class do you use to configure the analysis in SageMaker?
- True or False: DPL for continuous labels has a fixed range of [-1, +1].
Answers
- A CI of 0 indicates no class imbalance (perfectly balanced samples).
- Facet d is the disfavored demographic.
- The `BiasConfig` class within the `sagemaker.clarify` library.
- False. For continuous labels, the range is $[-\infty, +\infty]$.
Muddy Points & Cross-Refs
- CI vs. DPL: Students often confuse these. Remember: CI is about the count of records (rows), while DPL is about the count of "Yes/Success" outcomes within those rows.
- Facet a/d Assignment: There is no "hard" rule on which group is a vs. d, but convention usually assigns a to the majority or historically privileged group so that positive metric values indicate bias toward that group.
- Next Steps: After pre-training metrics, refer to Chapter 5 for mitigation strategies like SMOTE or cost-sensitive learning.
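To make the SMOTE reference concrete, here is a simplified SMOTE-style sketch using only the standard library. Real SMOTE interpolates between a sample and its k-nearest minority neighbors; this sketch interpolates between random pairs of minority samples, which captures the core idea (synthetic points on line segments between existing minority points) without the neighbor search.

```python
import random

def smote_like(minority, n_new, rng=random.Random(7)):
    """SMOTE-style sketch: create synthetic minority points by linear
    interpolation between random pairs of existing minority samples.
    (Real SMOTE interpolates toward k-nearest neighbors instead.)"""
    synthetic = []
    for _ in range(n_new):
        p, q = rng.sample(minority, 2)  # two distinct minority samples
        lam = rng.random()              # interpolation factor in [0, 1]
        synthetic.append(tuple(pi + lam * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5)]  # toy 2-D feature vectors
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # 4 synthetic samples inside the minority region
```

Because each synthetic point lies on a segment between two real minority points, it always falls within the minority class's bounding box, which is why SMOTE is preferred over naive duplication.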
Comparison Tables
| Feature | Class Imbalance (CI) | Difference in Proportions of Labels (DPL) |
|---|---|---|
| Focus | Dataset size distribution | Outcome (label) distribution |
| Use Case | Identifying underrepresented groups | Identifying unequal success rates |
| Range (Binary) | $[-1, +1]$ | $[-1, +1]$ |
| Calculation Type | Sample count differences | Probability/Proportion differences |
| Remedy | Resampling / Augmentation | Data re-labeling / Algorithmic constraints |