
Study Guide: Pre-training Bias Metrics in Machine Learning

Pre-training bias metrics for numeric, text, and image data (for example, class imbalance [CI], difference in proportions of labels [DPL])


This study guide focuses on identifying and measuring bias in datasets before model training begins, primarily using Amazon SageMaker Clarify. Ensuring data integrity at this stage is crucial for building fair and ethical machine learning models.

Learning Objectives

  • Define and interpret Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
  • Identify facets within a dataset that represent protected or sensitive attributes.
  • Evaluate the severity of bias based on metric ranges (e.g., [-1, +1]).
  • Select appropriate mitigation strategies based on pre-training bias findings.

Key Terms & Glossary

  • Facet: A specific feature or attribute in a dataset (e.g., gender, age, or postal code) being analyzed for potential bias.
  • Facet a: The feature value defining a demographic that bias typically favors (privileged group).
  • Facet d: The feature value defining a demographic that bias typically disfavors (underprivileged group).
  • Class Imbalance (CI): A metric measuring the normalized difference in sample counts between facets: CI = (n_a - n_d) / (n_a + n_d).
  • Difference in Proportions of Labels (DPL): A metric measuring the imbalance of positive outcomes (labels) between facets: DPL = q_a - q_d, where q is the proportion of positive labels in each facet.
  • SageMaker Clarify: An AWS service used to detect bias in ML models and datasets and provide explainability.

The "Big Idea"

Bias in machine learning is often a "Garbage In, Garbage Out" problem. If the training data contains historical or systemic imbalances, the model will learn and amplify these biases. Pre-training metrics allow data scientists to quantify these imbalances before the model is ever built, providing a window for mitigation (like resampling or synthetic data generation) that ensures the final product is equitable.

Formula / Concept Box

| Metric | Range | Ideal Value | Description |
| --- | --- | --- | --- |
| Class Imbalance (CI) | [-1, +1] | 0 | Measures whether one facet has significantly more data points than another. |
| DPL (binary/categorical) | [-1, +1] | 0 | Measures whether one facet receives positive outcomes more frequently than another. |
| DPL (continuous) | (-∞, +∞) | 0 | Measures the difference in mean label values across facets. |

[!IMPORTANT] For all SageMaker Clarify metrics, a value of 0 (or near 0) denotes no bias or perfect balance.
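Both formulas can be sketched in a few lines of Python (an illustrative sketch of the definitions above; SageMaker Clarify computes these metrics for you on real datasets):

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the
    sample counts for facet a and facet d."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d: the difference in the proportion of
    positive labels between facet a and facet d."""
    return pos_a / n_a - pos_d / n_d

# Perfectly balanced data yields the ideal value of 0 for both metrics:
print(class_imbalance(500, 500))   # → 0.0
print(dpl(100, 500, 100, 500))     # → 0.0
```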

Hierarchical Outline

  • I. Introduction to Bias Detection
    • Amazon SageMaker Clarify: The primary tool for pre-training and post-training bias detection.
    • BiasConfig Class: Used in the sagemaker.clarify library to configure analysis settings.
  • II. Identifying Facets
    • Sensitive Attributes: Selecting columns such as gender, race, or postal code as facets (a label column like is_fraudulent can also be checked for class imbalance).
    • Group Definitions: Understanding Facet a (favored) vs. Facet d (disfavored).
  • III. Core Pre-training Metrics
    • Class Imbalance (CI):
      • +1: Complete imbalance toward facet a (facet d has no samples).
      • -1: Complete imbalance toward facet d (facet a has no samples).
    • Difference in Proportions of Labels (DPL):
      • Measures the outcome distribution across facets.
      • Positive value: Facet a has a higher proportion of positive labels.
      • Negative value: Facet d has a higher proportion of positive labels.
  • IV. Application Areas
    • Numeric Data: Tabular credit scores or income.
    • Image Data: Object detection and classification bias.
    • Text Data: Sentiment analysis across different demographic keywords.
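The BiasConfig setup from the outline can be sketched roughly as follows. This is a hedged configuration sketch, not a runnable pipeline: it assumes the sagemaker SDK is installed, the dataset has a "gender" column with 0 marking the disfavored group, and a DataConfig and Clarify processor already exist; check the current sagemaker.clarify documentation for exact signatures.

```python
from sagemaker import clarify

# Declare the positive label value and the sensitive facet to analyze.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # label values counted as positive outcomes
    facet_name="gender",             # sensitive column (the facet)
    facet_values_or_threshold=[0],   # values defining the disfavored group (facet d)
)

# A SageMakerClarifyProcessor can then run only the pre-training metrics:
# clarify_processor.run_pre_training_bias(
#     data_config=data_config,        # assumed DataConfig pointing at the S3 dataset
#     data_bias_config=bias_config,
#     methods=["CI", "DPL"],
# )
```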

Visual Anchors

Bias Detection Workflow


Visualizing Facet Imbalance

\begin{tikzpicture}
  % Bar for facet a
  \draw[fill=blue!30] (0,0) rectangle (1,4);
  \node at (0.5,-0.3) {Facet a};
  \node at (0.5,4.3) {1000 Samples};

  % Bar for facet d
  \draw[fill=red!30] (2,0) rectangle (3,0.5);
  \node at (2.5,-0.3) {Facet d};
  \node at (2.5,0.8) {125 Samples};

  % Baseline and CI annotation
  \draw[dashed] (-0.5,0) -- (4,0);
  \node[right] at (4,2) {$CI \approx 0.78$ (High Bias)};
\end{tikzpicture}

Definition-Example Pairs

  • Metric: Class Imbalance (CI)
    • Definition: The normalized difference in the number of samples (n) between facets.
    • Example: In a medical dataset, if there are 9,000 records for "Group A" and only 100 for "Group B," the CI will be close to +1, indicating Group A is heavily overrepresented.
  • Metric: Difference in Proportions of Labels (DPL)
    • Definition: The difference between the ratio of positive outcomes in facet a and facet d.
    • Example: If 80% of applicants in Facet a are approved for a loan, but only 20% in Facet d are approved, the DPL is 0.6 (0.8 - 0.2), indicating a strong bias in outcomes toward Facet a.

Worked Examples

Example 1: Credit Card Fraud

Scenario: You have a dataset where 99.9% of transactions are is_fraudulent = 0 (not fraud) and 0.1% are is_fraudulent = 1 (fraud).

  1. Metric Selection: You calculate Class Imbalance (CI).
  2. Result: The CI value is 0.998.
  3. Interpretation: Since the value is near +1, there is a severe majority-class imbalance.
  4. Action: Use undersampling on the majority class or oversampling (synthetic data generation) on the minority class before splitting data into training and test sets.
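Steps 1–3 can be reproduced numerically (a toy sketch; the 1,000,000-transaction total is illustrative):

```python
# 99.9% legitimate vs 0.1% fraudulent transactions
n_not_fraud = 999_000   # is_fraudulent = 0
n_fraud = 1_000         # is_fraudulent = 1

ci = (n_not_fraud - n_fraud) / (n_not_fraud + n_fraud)
print(ci)  # → 0.998 (near +1: severe majority-class imbalance)
```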

Example 2: Hiring Algorithm

Scenario: A company checks whether its resume screening tool is biased against a certain zip code (Facet d).

  • Facet a (other zip codes) Positive Outcome Rate: 0.50
  • Facet d (target zip code) Positive Outcome Rate: 0.10
  • Calculation: DPL = 0.50 - 0.10 = 0.40
  • Conclusion: The dataset shows a 0.40 bias in favor of other zip codes. The data scientist must investigate whether the labels themselves are biased before training.
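The same arithmetic in code (a toy sketch of the calculation above):

```python
rate_a = 0.50  # approval rate for other zip codes (facet a)
rate_d = 0.10  # approval rate for the target zip code (facet d)

dpl = rate_a - rate_d
print(dpl)  # → 0.4 (positive: outcomes favor facet a)
```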

Checkpoint Questions

  1. What does a CI value of 0 indicate?
  2. In SageMaker Clarify, which facet (a or d) represents the group potentially disfavored by bias?
  3. If you find a high DPL in your pre-training data, which library/class do you use to configure the analysis in SageMaker?
  4. True or False: DPL for continuous labels has a fixed range of [-1, +1].
Answers
  1. A CI of 0 indicates no class imbalance (perfectly balanced samples).
  2. Facet d is the disfavored demographic.
  3. The BiasConfig class within the sagemaker.clarify library.
  4. False. For continuous labels, the range is (-∞, +∞).

Muddy Points & Cross-Refs

  • CI vs. DPL: Students often confuse these. Remember: CI is about the count of records (rows), while DPL is about the count of "Yes/Success" outcomes within those rows.
  • Facet a/d Assignment: There is no hard rule on which group is a or d, but convention usually assigns a to the majority or historically privileged group so that positive metric values indicate bias toward that group.
  • Next Steps: After pre-training metrics, refer to Chapter 5 for mitigation strategies like SMOTE or cost-sensitive learning.

Comparison Tables

| Feature | Class Imbalance (CI) | Difference in Proportions of Labels (DPL) |
| --- | --- | --- |
| Focus | Dataset size distribution | Outcome (label) distribution |
| Use Case | Identifying underrepresented groups | Identifying unequal success rates |
| Range (Binary) | [-1, +1] | [-1, +1] |
| Calculation Type | Sample count differences | Probability/proportion differences |
| Remedy | Resampling / augmentation | Data re-labeling / algorithmic constraints |
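The "resampling" remedy can be sketched as naive random oversampling of the disfavored facet (a toy standard-library sketch; production pipelines more often use techniques such as SMOTE from imbalanced-learn):

```python
import random

def oversample_facet(rows, facet_key, disfavored_value, seed=42):
    """Duplicate random facet-d rows until both facets have equal counts."""
    rng = random.Random(seed)
    favored = [r for r in rows if r[facet_key] != disfavored_value]
    disfavored = [r for r in rows if r[facet_key] == disfavored_value]
    while len(disfavored) < len(favored):
        disfavored.append(rng.choice(disfavored))
    return favored + disfavored

rows = [{"facet": "a"}] * 1000 + [{"facet": "d"}] * 125
balanced = oversample_facet(rows, "facet", "d")
# After oversampling, CI = (1000 - 1000) / 2000 = 0: the imbalance is removed
```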
