Study Guide: Pre-training Bias Metrics in Machine Learning
Pre-training bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL])
Pre-training Bias Metrics in Machine Learning
This study guide focuses on identifying and measuring bias in datasets before model training begins, primarily using Amazon SageMaker Clarify. Ensuring data integrity at this stage is crucial for building fair and ethical machine learning models.
Learning Objectives
- Define and interpret Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
- Identify facets within a dataset that represent protected or sensitive attributes.
- Evaluate the severity of bias based on metric ranges (e.g., [-1, +1]).
- Select appropriate mitigation strategies based on pre-training bias findings.
Key Terms & Glossary
- Facet: A specific feature or attribute in a dataset (e.g., gender, age, or postal code) being analyzed for potential bias.
- Facet a: The feature value defining a demographic that bias typically favors (privileged group).
- Facet d: The feature value defining a demographic that bias typically disfavors (underprivileged group).
- Class Imbalance (CI): A metric measuring the difference in the number of samples between different facets.
- Difference in Proportions of Labels (DPL): A metric measuring the imbalance of positive outcomes (labels) between different facets.
- SageMaker Clarify: An AWS service used to detect bias in ML models and datasets and provide explainability.
The "Big Idea"
Bias in machine learning is often a "Garbage In, Garbage Out" problem. If the training data contains historical or systemic imbalances, the model will learn and amplify these biases. Pre-training metrics allow data scientists to quantify these imbalances before the model is ever built, providing a window for mitigation (like resampling or synthetic data generation) that ensures the final product is equitable.
Formula / Concept Box
| Metric | Range | Ideal Value | Description |
|---|---|---|---|
| Class Imbalance (CI) | $[-1, +1]$ | 0 | Measures whether one facet has significantly more data points than another. |
| DPL (Binary/Categorical) | $[-1, +1]$ | 0 | Measures whether one facet receives positive outcomes more frequently than another. |
| DPL (Continuous) | $[-\infty, +\infty]$ | 0 | Measures the difference in mean label values across facets. |
[!IMPORTANT] For all SageMaker Clarify metrics, a value of 0 (or near 0) denotes no bias or perfect balance.
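The formulas in the box above can be sketched in plain Python. The CI and binary DPL definitions below follow the standard SageMaker Clarify forms ($CI = (n_a - n_d)/(n_a + n_d)$ and $DPL = q_a - q_d$); the sample and label counts passed in at the end are illustrative only.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d).
    Range [-1, +1]; 0 means facets a and d have equal sample counts."""
    return (n_a - n_d) / (n_a + n_d)

def dpl_binary(pos_a, n_a, pos_d, n_d):
    """DPL = q_a - q_d, the difference in positive-label proportions.
    Range [-1, +1]; 0 means equal positive-outcome rates."""
    return pos_a / n_a - pos_d / n_d

def dpl_continuous(labels_a, labels_d):
    """For continuous labels: difference in mean label value across facets.
    Unbounded; 0 means equal means."""
    return sum(labels_a) / len(labels_a) - sum(labels_d) / len(labels_d)

print(class_imbalance(1000, 125))          # ~0.778 -> facet a overrepresented
print(dpl_binary(800, 1000, 25, 125))      # ~0.6   -> outcomes favor facet a
print(dpl_continuous([50, 60], [30, 40]))  # 20.0   -> higher mean label for facet a
```

Note that a CI of 0 and a DPL of 0 fall out directly from equal counts and equal proportions, matching the "Ideal Value" column above.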
Hierarchical Outline
- I. Introduction to Bias Detection
  - Amazon SageMaker Clarify: The primary tool for pre-training and post-training bias detection.
  - BiasConfig Class: Used in the `sagemaker.clarify` library to configure analysis settings.
- II. Identifying Facets
  - Sensitive Attributes: Selecting columns like `is_fraudulent`, `gender`, or `race` as facets.
  - Group Definitions: Understanding Facet a (favored) vs. Facet d (disfavored).
- III. Core Pre-training Metrics
  - Class Imbalance (CI):
    - $+1$: Complete imbalance toward facet a (all samples belong to facet a).
    - $-1$: Complete imbalance toward facet d.
  - Difference in Proportions of Labels (DPL):
    - Measures the outcome distribution across facets.
    - Positive value: Facet a has a higher proportion of positive labels.
    - Negative value: Facet d has a higher proportion of positive labels.
- IV. Application Areas
  - Numeric Data: Tabular credit scores or income.
  - Image Data: Object detection and classification bias.
  - Text Data: Sentiment analysis across different demographic keywords.
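The `BiasConfig` setup mentioned in the outline can be sketched as a minimal configuration fragment. It assumes the SageMaker Python SDK (`sagemaker`) is installed; the column name `gender`, the facet value `"female"`, and the positive label value `1` are hypothetical placeholders for illustration.

```python
from sagemaker import clarify

# Minimal pre-training bias configuration sketch. The facet and label
# values below are hypothetical; substitute your dataset's columns.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],          # label values that count as positive outcomes
    facet_name="gender",                    # sensitive attribute (the facet) to analyze
    facet_values_or_threshold=["female"],   # values defining facet d (the disfavored group)
)
```

A `SageMakerClarifyProcessor` then consumes this object (together with a `DataConfig`) through its `run_pre_training_bias` method to produce the CI, DPL, and related metrics as a report.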
Visual Anchors
Bias Detection Workflow
Visualizing Facet Imbalance
```latex
\begin{tikzpicture}
  % Bar for Facet a
  \draw[fill=blue!30] (0,0) rectangle (1,4);
  \node at (0.5,-0.3) {Facet a};
  \node at (0.5,4.3) {1000 Samples};
  % Bar for Facet d
  \draw[fill=red!30] (2,0) rectangle (3,0.5);
  \node at (2.5,-0.3) {Facet d};
  \node at (2.5,0.8) {125 Samples};
  % Baseline and annotation
  \draw[dashed] (-0.5,0) -- (4,0);
  \node[right] at (4,2) {CI (High Bias)};
\end{tikzpicture}
```
Definition-Example Pairs
- Metric: Class Imbalance (CI)
- Definition: The normalized difference in the number of samples ($n_a$ vs. $n_d$) between facets: $CI = \frac{n_a - n_d}{n_a + n_d}$.
- Example: In a medical dataset, if there are 9,000 records for "Group A" and only 100 for "Group B," the CI will be close to +1, indicating Group A is heavily overrepresented.
- Metric: Difference in Proportions of Labels (DPL)
- Definition: The difference between the proportion of positive outcomes in facet a and the proportion in facet d: $DPL = q_a - q_d$.
- Example: If 80% of applicants in Facet a are approved for a loan but only 20% in Facet d, then $DPL = 0.80 - 0.20 = 0.60$, indicating a strong bias in outcomes toward Facet a.
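A quick numeric check of the two definition-example pairs above, using only the counts and rates stated in the text:

```python
# Medical dataset example: 9,000 records for Group A, 100 for Group B.
n_a, n_d = 9_000, 100
ci = (n_a - n_d) / (n_a + n_d)
print(round(ci, 3))            # 0.978 -> close to +1, Group A heavily overrepresented

# Loan approval example: 80% approval for facet a, 20% for facet d.
q_a, q_d = 0.80, 0.20
print(round(q_a - q_d, 2))     # 0.6 -> strong outcome bias toward facet a
```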
Worked Examples
Example 1: Credit Card Fraud
Scenario: You have a dataset where 99.9% of transactions are is_fraudulent = 0 (not fraud) and 0.1% are is_fraudulent = 1 (fraud).
- Metric Selection: You calculate Class Imbalance (CI).
- Result: The CI value is $0.998$.
- Interpretation: Since the value is near $+1$, there is a severe majority-class imbalance.
- Action: Use undersampling on the majority class or oversampling (synthetic data generation) on the minority class before splitting data into training and test sets.
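The action step above (undersampling the majority class or oversampling the minority class) can be sketched with the standard library alone. This is a generic rebalancing illustration, not a SageMaker-specific API; the toy lists mirror the 99.9% / 0.1% fraud scenario.

```python
import random

random.seed(42)  # reproducible sketch

# Toy stand-ins for transaction labels, mirroring the scenario above.
majority = [0] * 999   # is_fraudulent = 0
minority = [1] * 1     # is_fraudulent = 1

# Option 1: undersample the majority class down to the minority size.
under = random.sample(majority, k=len(minority)) + minority

# Option 2: oversample the minority class (with replacement) up to the
# majority size; SMOTE-style synthetic generation refines this idea.
over = majority + random.choices(minority, k=len(majority))

print(len(under), sum(under))  # 2 rows, one per class
print(len(over), sum(over))    # 1998 rows, 999 per class
```

Either way, rebalance only the training split: applying resampling before the train/test split leaks duplicated or synthetic minority rows into the test set.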
Example 2: Hiring Algorithm
Scenario: A company checks whether its resume screening tool is biased against a certain zip code (Facet d).
- Facet a (Other zip codes) Positive Outcome Rate: $0.50$
- Facet d (Target zip code) Positive Outcome Rate: $0.10$
- Calculation: $DPL = 0.50 - 0.10 = 0.40$
- Conclusion: The dataset shows a $0.40$ bias in favor of other zip codes. The data scientist must investigate whether the labels themselves are biased before training.
Checkpoint Questions
- What does a CI value of 0 indicate?
- In SageMaker Clarify, which facet (a or d) represents the group potentially disfavored by bias?
- If you find a high DPL in your pre-training data, which library/class do you use to configure the analysis in SageMaker?
- True or False: DPL for continuous labels has a fixed range of [-1, +1].
Answers
- A CI of 0 indicates no class imbalance (perfectly balanced samples).
- Facet d is the disfavored demographic.
- The `BiasConfig` class within the `sagemaker.clarify` library.
- False. For continuous labels, the range is $[-\infty, +\infty]$.
Muddy Points & Cross-Refs
- CI vs. DPL: Students often confuse these. Remember: CI is about the count of records (rows), while DPL is about the count of "Yes/Success" outcomes within those rows.
- Facet a/d Assignment: There is no "hard" rule on which group is a vs. d, but convention usually assigns a to the majority or historically privileged group so that positive metric values indicate bias toward that group.
- Next Steps: After pre-training metrics, refer to Chapter 5 for mitigation strategies like SMOTE or cost-sensitive learning.
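To make the SMOTE reference concrete, here is a simplified SMOTE-style sketch using only the standard library. Real SMOTE interpolates between a sample and its k-nearest minority neighbors; this sketch interpolates between random pairs of minority samples, which captures the core idea (synthetic points on line segments between existing minority points) without the neighbor search.

```python
import random

def smote_like(minority, n_new, rng=random.Random(7)):
    """SMOTE-style sketch: create synthetic minority points by linear
    interpolation between random pairs of existing minority samples.
    (Real SMOTE interpolates toward k-nearest neighbors instead.)"""
    synthetic = []
    for _ in range(n_new):
        p, q = rng.sample(minority, 2)  # two distinct minority samples
        lam = rng.random()              # interpolation factor in [0, 1]
        synthetic.append(tuple(pi + lam * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5)]  # toy 2-D feature vectors
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # 4 synthetic samples inside the minority region
```

Because each synthetic point lies on a segment between two real minority points, it always falls within the minority class's bounding box, which is why SMOTE is preferred over naive duplication.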
Comparison Tables
| Feature | Class Imbalance (CI) | Difference in Proportions of Labels (DPL) |
|---|---|---|
| Focus | Dataset size distribution | Outcome (label) distribution |
| Use Case | Identifying underrepresented groups | Identifying unequal success rates |
| Range (Binary) | $[-1, +1]$ | $[-1, +1]$ |
| Calculation Type | Sample count differences | Probability/Proportion differences |
| Remedy | Resampling / Augmentation | Data re-labeling / Algorithmic constraints |