Model Performance Analysis & Bias Detection with SageMaker Clarify
Selecting and interpreting evaluation metrics and detecting model bias
Model Performance Analysis & Bias Detection
This guide covers the critical tasks of selecting appropriate evaluation metrics, interpreting model performance, and identifying both pre-training and post-training biases using AWS tools like SageMaker Clarify.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between classification and regression evaluation metrics.
- Calculate and interpret Precision, Recall, and F1-score from a confusion matrix.
- Identify pre-training bias metrics like Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
- Utilize SageMaker Clarify to monitor and mitigate model bias across the ML lifecycle.
- Explain the trade-offs between model performance, training time, and cost.
Key Terms & Glossary
- Confusion Matrix: An $N \times N$ table used for evaluating the performance of a classification model, where $N$ is the number of target classes.
- Facet: A specific feature in a dataset (e.g., age, gender, or location) analyzed for potential bias.
- Facet a (Favored): The group/demographic that the bias currently favors.
- Facet d (Disfavored): The group/demographic that is underrepresented or negatively impacted by bias.
- Bias Drift: A phenomenon where a model's predictions become increasingly unfair toward certain groups over time in production.
- SHAP (SHapley Additive exPlanations): A method used in SageMaker Clarify to provide local and global explanations for model decisions.
The "Big Idea"
Model performance isn't just about high accuracy; it's about trust and reliability. A model with 99% accuracy can still be useless if it fails on the minority class in an imbalanced dataset, or worse, unethical if it discriminates against specific demographics. Evaluation is the bridge between training a mathematical function and deploying a responsible business solution.
Formula / Concept Box
| Metric | Formula / Definition | Context |
|---|---|---|
| Accuracy | $(TP + TN) / (TP + TN + FP + FN)$ | General performance (balanced data) |
| Precision | $TP / (TP + FP)$ | Minimizing False Positives (e.g., Spam detection) |
| Recall | $TP / (TP + FN)$ | Minimizing False Negatives (e.g., Cancer screening) |
| F1-Score | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ | Harmonic mean; balance of Precision/Recall |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Regression; penalizes large errors |
| Class Imbalance (CI) | $(n_a - n_d) / (n_a + n_d)$, range $[-1, 1]$ | Pre-training bias; skew between facet groups |
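The classification and regression formulas above can be sketched as plain-Python helpers. This is a minimal illustration; production code would normally use scikit-learn's `precision_score`, `recall_score`, and related functions instead:

```python
import math

def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives the model found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def rmse(actual: list, predicted: list) -> float:
    """Root mean squared error; squaring penalizes large errors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual))

# Illustrative counts: TP=30, FP=10, FN=5 (the spam example in Worked Examples)
p, r = precision(30, 10), recall(30, 5)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.75 0.857 0.8
```

Note that F1 sits between precision and recall but closer to the smaller of the two, which is why it is preferred over plain accuracy on uneven class distributions.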
Visual Anchors
Bias Detection Workflow
Visualizing the Confusion Matrix
\begin{tikzpicture}[scale=1.5]
  \draw[thick] (0,0) rectangle (2,2);
  \draw[thick] (0,1) -- (2,1);
  \draw[thick] (1,0) -- (1,2);
  \node at (0.5, 1.5) {TP};
  \node at (1.5, 1.5) {FN};
  \node at (0.5, 0.5) {FP};
  \node at (1.5, 0.5) {TN};
  \node[rotate=90] at (-0.3, 1) {Actual};
  \node at (1, 2.3) {Predicted};
  \node at (0.5, 2.1) {Pos};
  \node at (1.5, 2.1) {Neg};
  \node at (-0.2, 1.5) {Pos};
  \node at (-0.2, 0.5) {Neg};
\end{tikzpicture}
Hierarchical Outline
- I. Performance Evaluation Metrics
- Classification
- Confusion Matrix: Foundation for error analysis.
- Precision vs. Recall: The classic trade-off.
- F1 Score: Use when class distribution is uneven.
- ROC/AUC: Measures the ability of the model to distinguish between classes.
- Regression
- RMSE: Standard for error measurement.
- MAPE: Useful for business-level percentage error communication.
- II. Bias Detection with SageMaker Clarify
- Pre-training Bias
- Class Imbalance (CI): Measures skew in label distribution.
- DPL: Difference in proportions of positive outcomes across facets.
- Post-training Bias
- EOD (Equal Opportunity Difference): Measures if true positive rates are equal across facets.
- PPD (Predictive Parity Difference): Measures if precision is equal across facets.
- III. Explainability
- SHAP Values: Identifying feature importance globally and locally.
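The pre-training metrics in section II can be computed directly from label counts. A minimal sketch; the facet counts are illustrative, and SageMaker Clarify computes these same quantities for you from a `BiasConfig`:

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means balanced, +/-1 means fully skewed."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """Difference in Proportions of Labels: the positive-label rate of the
    favored facet minus that of the disfavored facet."""
    return pos_a / n_a - pos_d / n_d

# Hypothetical hiring dataset: 800 records in facet a (560 positive labels),
# 200 records in facet d (60 positive labels)
print(class_imbalance(800, 200))  # 0.6
print(dpl(560, 800, 60, 200))     # ~0.4: facet a's positive rate is 40 points higher
```

CI flags skew in how many records each facet has, while DPL flags skew in the outcomes those records carry; a dataset can be clean on one metric and badly biased on the other.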
Definition-Example Pairs
- Overfitting: A model learns noise in the training data rather than the underlying pattern.
- Example: A housing price model that predicts prices perfectly on old data but fails on new listings because it memorized specific street names.
- Measurement Bias: Data is collected in a way that distorts the true values.
- Example: A smart scale that is calibrated incorrectly, consistently reporting weights 2lbs lighter than reality, leading to skewed health models.
- Disparate Impact: When a policy or model unintentionally discriminates against a protected group.
- Example: A loan approval model that uses zip codes as a feature, which may inadvertently proxy for race, resulting in lower approval rates for minority neighborhoods.
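Disparate impact is often quantified as the ratio of positive-outcome rates between the disfavored and favored groups; a common rule of thumb (the "four-fifths rule") flags ratios below 0.8. A small sketch with hypothetical approval counts:

```python
def disparate_impact_ratio(pos_d: int, n_d: int, pos_a: int, n_a: int) -> float:
    """Ratio of the disfavored group's positive rate to the favored group's.
    Values near 1.0 indicate parity; values below ~0.8 warrant investigation."""
    return (pos_d / n_d) / (pos_a / n_a)

# Hypothetical loan approvals: favored group 400/500 approved,
# disfavored group 120/300 approved
ratio = disparate_impact_ratio(120, 300, 400, 500)
print(round(ratio, 2))  # 0.5 -> well below 0.8, so the model needs review
```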
Worked Examples
Example 1: Calculating Classification Metrics
Scenario: A model predicts if an email is "Spam" or "Not Spam".
- Results: TP=30, FP=10, FN=5, TN=55.
- Precision: $30 / (30 + 10) = 0.75$ (75%)
- Recall: $30 / (30 + 5) \approx 0.857$ (85.7%)
- Interpretation: The model is better at finding all spam (Recall) than it is at being sure a predicted spam is actually spam (Precision).
Example 2: Detecting Data Bias
Scenario: A dataset for "Job Hiring" has 1000 records. Facet A (Men) has 800 records. Facet D (Women) has 200 records.
- Class Imbalance (CI): (800 - 200) / (800 + 200) = 0.6.
- Action: A CI of 0.6 is far from 0 (a perfectly balanced dataset has CI = 0), so the data is heavily skewed toward the majority facet. The engineer should consider undersampling the majority or oversampling the minority.
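The oversampling action suggested above can be sketched with stdlib Python. This is random oversampling with replacement (real pipelines often use a library such as imbalanced-learn, and the record dicts here are stand-ins for feature rows):

```python
import random

def oversample_minority(majority: list, minority: list, seed: int = 0) -> list:
    """Duplicate randomly chosen minority records until both facets
    contribute the same number of rows, driving CI toward 0."""
    rng = random.Random(seed)  # seeded for reproducibility
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

men = [{"facet": "a"}] * 800
women = [{"facet": "d"}] * 200
balanced = oversample_minority(men, women)
counts = (sum(r["facet"] == "a" for r in balanced),
          sum(r["facet"] == "d" for r in balanced))
print(counts)  # (800, 800) -> CI = (800 - 800) / 1600 = 0
```

Oversampling duplicates information rather than adding it, so it should be paired with checks that the model does not simply memorize the repeated minority rows.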
Checkpoint Questions
- Which metric is most appropriate for a fraud detection model where missing a fraud case is much more expensive than a false alarm?
- In SageMaker Clarify, what does a CI value of 0 indicate?
- What is the difference between global and local explainability in SHAP?
- Why might a model's accuracy be misleading in an imbalanced dataset?
Muddy Points & Cross-Refs
- Precision-Recall Trade-off: Increasing the classification threshold increases Precision but decreases Recall. Choosing the right threshold depends on the business cost of False Positives vs. False Negatives.
- Bias vs. Fairness: Bias is a mathematical measurement (e.g., CI). Fairness is a social and legal requirement. SageMaker Clarify provides the metrics, but the engineer must decide the threshold for what is "fair."
- Cross-Ref: For implementation details, see the SageMaker SDK `BiasConfig` documentation.
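The Precision-Recall trade-off above can be demonstrated with a tiny threshold sweep. The scores and labels are made up for illustration; raising the threshold makes positive predictions rarer but more reliable:

```python
def precision_recall_at(threshold, scores, labels):
    """Classify score >= threshold as positive, then compute precision/recall."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions -> vacuously precise
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2]   # model confidence per example
labels = [True, True, False, True, False, False, True, False]  # ground truth
for t in (0.3, 0.6, 0.9):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.6: precision=0.75, recall=0.75
# threshold=0.9: precision=1.00, recall=0.50
```

The sweep makes the business question concrete: the "right" threshold is the one whose False Positive / False Negative mix is cheapest for the use case, not the one with the prettiest single metric.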
Comparison Tables
| Feature | Precision | Recall |
|---|---|---|
| Goal | Minimize False Positives | Minimize False Negatives |
| Business Impact | Avoid "Crying Wolf" | Avoid "Missing the Target" |
| Focus | Quality of positive predictions | Completeness of positive coverage |

| Bias Type | When to Measure | Key Metrics |
|---|---|---|
| Pre-training | Before model training | CI, DPL, Facet Imbalance |
| Post-training | After training / In Production | EOD, PPD, Bias Drift |