Model Performance Analysis & Bias Detection with SageMaker Clarify
Selecting and interpreting evaluation metrics and detecting model bias
Model Performance Analysis & Bias Detection
This guide covers the critical tasks of selecting appropriate evaluation metrics, interpreting model performance, and identifying both pre-training and post-training biases using AWS tools like SageMaker Clarify.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between classification and regression evaluation metrics.
- Calculate and interpret Precision, Recall, and F1-score from a confusion matrix.
- Identify pre-training bias metrics like Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
- Utilize SageMaker Clarify to monitor and mitigate model bias across the ML lifecycle.
- Explain the trade-offs between model performance, training time, and cost.
Key Terms & Glossary
- Confusion Matrix: An $N \times N$ table used for evaluating the performance of a classification model, where $N$ is the number of target classes.
- Facet: A specific feature in a dataset (e.g., age, gender, or location) analyzed for potential bias.
- Facet a (Favored): The group/demographic that the bias currently favors.
- Facet d (Disfavored): The group/demographic that is underrepresented or negatively impacted by bias.
- Bias Drift: A phenomenon where a model's predictions become increasingly unfair toward certain groups over time in production.
- SHAP (SHapley Additive exPlanations): A method used in SageMaker Clarify to provide local and global explanations for model decisions.
The "Big Idea"
Model performance isn't just about high accuracy; it's about trust and reliability. A model with 99% accuracy can still be useless if it fails on the minority class in an imbalanced dataset, or worse, unethical if it discriminates against specific demographics. Evaluation is the bridge between training a mathematical function and deploying a responsible business solution.
Formula / Concept Box
| Metric | Formula / Definition | Context |
|---|---|---|
| Accuracy | $(TP + TN) / (TP + TN + FP + FN)$ | General performance (balanced data) |
| Precision | $TP / (TP + FP)$ | Minimizing False Positives (e.g., Spam detection) |
| Recall | $TP / (TP + FN)$ | Minimizing False Negatives (e.g., Cancer screening) |
| F1-Score | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ | Harmonic mean; balance of Precision/Recall |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Regression; penalizes large errors |
| Class Imbalance (CI) | $(n_a - n_d) / (n_a + n_d)$, range $[-1, 1]$ | Pre-training bias; skew between facet groups |
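The classification and regression formulas above can be sketched as plain-Python helpers. This is a minimal illustration; production code would normally use scikit-learn's `precision_score`, `recall_score`, and related functions instead:

```python
import math

def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives the model found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def rmse(actual: list, predicted: list) -> float:
    """Root mean squared error; squaring penalizes large errors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted)) / len(actual))

# Illustrative counts: TP=30, FP=10, FN=5 (the spam example in Worked Examples)
p, r = precision(30, 10), recall(30, 5)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.75 0.857 0.8
```

Note that F1 sits between precision and recall but closer to the smaller of the two, which is why it is preferred over plain accuracy on uneven class distributions.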
Visual Anchors
Bias Detection Workflow
Visualizing the Confusion Matrix
\begin{tikzpicture}[scale=1.5]
  \draw[thick] (0,0) rectangle (2,2);
  \draw[thick] (0,1) -- (2,1);
  \draw[thick] (1,0) -- (1,2);
  \node at (0.5, 1.5) {TP};
  \node at (1.5, 1.5) {FN};
  \node at (0.5, 0.5) {FP};
  \node at (1.5, 0.5) {TN};
  \node[rotate=90] at (-0.3, 1) {Actual};
  \node at (1, 2.3) {Predicted};
  \node at (0.5, 2.1) {Pos};
  \node at (1.5, 2.1) {Neg};
  \node at (-0.2, 1.5) {Pos};
  \node at (-0.2, 0.5) {Neg};
\end{tikzpicture}
Hierarchical Outline
- I. Performance Evaluation Metrics
- Classification
- Confusion Matrix: Foundation for error analysis.
- Precision vs. Recall: The classic trade-off.
- F1 Score: Use when class distribution is uneven.
- ROC/AUC: Measures the ability of the model to distinguish between classes.
- Regression
- RMSE: Standard for error measurement.
- MAPE: Useful for business-level percentage error communication.
- II. Bias Detection with SageMaker Clarify
- Pre-training Bias
- Class Imbalance (CI): Measures skew in label distribution.
- DPL: Difference in proportions of positive outcomes across facets.
- Post-training Bias
- EOD (Equal Opportunity Difference): Measures if true positive rates are equal across facets.
- PPD (Predictive Parity Difference): Measures if precision is equal across facets.
- III. Explainability
- SHAP Values: Identifying feature importance globally and locally.
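The pre-training metrics in section II can be computed directly from label counts. A minimal sketch; the facet counts are illustrative, and SageMaker Clarify computes these same quantities for you from a `BiasConfig`:

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means balanced, +/-1 means fully skewed."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """Difference in Proportions of Labels: the positive-label rate of the
    favored facet minus that of the disfavored facet."""
    return pos_a / n_a - pos_d / n_d

# Hypothetical hiring dataset: 800 records in facet a (560 positive labels),
# 200 records in facet d (60 positive labels)
print(class_imbalance(800, 200))  # 0.6
print(dpl(560, 800, 60, 200))     # ~0.4: facet a's positive rate is 40 points higher
```

CI flags skew in how many records each facet has, while DPL flags skew in the outcomes those records carry; a dataset can be clean on one metric and badly biased on the other.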
Definition-Example Pairs
- Overfitting: A model learns noise in the training data rather than the underlying pattern.
- Example: A housing price model that predicts prices perfectly on old data but fails on new listings because it memorized specific street names.
- Measurement Bias: Data is collected in a way that distorts the true values.
- Example: A smart scale that is calibrated incorrectly, consistently reporting weights 2lbs lighter than reality, leading to skewed health models.
- Disparate Impact: When a policy or model unintentionally discriminates against a protected group.
- Example: A loan approval model that uses zip codes as a feature, which may inadvertently proxy for race, resulting in lower approval rates for minority neighborhoods.
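Disparate impact is often quantified as the ratio of positive-outcome rates between the disfavored and favored groups; a common rule of thumb (the "four-fifths rule") flags ratios below 0.8. A small sketch with hypothetical approval counts:

```python
def disparate_impact_ratio(pos_d: int, n_d: int, pos_a: int, n_a: int) -> float:
    """Ratio of the disfavored group's positive rate to the favored group's.
    Values near 1.0 indicate parity; values below ~0.8 warrant investigation."""
    return (pos_d / n_d) / (pos_a / n_a)

# Hypothetical loan approvals: favored group 400/500 approved,
# disfavored group 120/300 approved
ratio = disparate_impact_ratio(120, 300, 400, 500)
print(round(ratio, 2))  # 0.5 -> well below 0.8, so the model needs review
```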
Worked Examples
Example 1: Calculating Classification Metrics
Scenario: A model predicts if an email is "Spam" or "Not Spam".
- Results: TP=30, FP=10, FN=5, TN=55.
- Precision: $30 / (30 + 10) = 0.75$ (75%)
- Recall: $30 / (30 + 5) \approx 0.857$ (85.7%)
- Interpretation: The model is better at finding all spam (Recall) than it is at being sure a predicted spam is actually spam (Precision).
Example 2: Detecting Data Bias
Scenario: A dataset for "Job Hiring" has 1000 records. Facet A (Men) has 800 records. Facet D (Women) has 200 records.
- Class Imbalance (CI): (800 - 200) / (800 + 200) = 0.6.
- Action: A CI of 0.6 is far from 0 (a perfectly balanced dataset has CI = 0), so the data is heavily skewed toward the majority facet. The engineer should consider undersampling the majority or oversampling the minority.
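The oversampling action suggested above can be sketched with stdlib Python. This is random oversampling with replacement (real pipelines often use a library such as imbalanced-learn, and the record dicts here are stand-ins for feature rows):

```python
import random

def oversample_minority(majority: list, minority: list, seed: int = 0) -> list:
    """Duplicate randomly chosen minority records until both facets
    contribute the same number of rows, driving CI toward 0."""
    rng = random.Random(seed)  # seeded for reproducibility
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

men = [{"facet": "a"}] * 800
women = [{"facet": "d"}] * 200
balanced = oversample_minority(men, women)
counts = (sum(r["facet"] == "a" for r in balanced),
          sum(r["facet"] == "d" for r in balanced))
print(counts)  # (800, 800) -> CI = (800 - 800) / 1600 = 0
```

Oversampling duplicates information rather than adding it, so it should be paired with checks that the model does not simply memorize the repeated minority rows.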
Checkpoint Questions
- Which metric is most appropriate for a fraud detection model where missing a fraud case is much more expensive than a false alarm?
- In SageMaker Clarify, what does a CI value of 0 indicate?
- What is the difference between global and local explainability in SHAP?
- Why might a model's accuracy be misleading in an imbalanced dataset?
Muddy Points & Cross-Refs
- Precision-Recall Trade-off: Increasing the classification threshold increases Precision but decreases Recall. Choosing the right threshold depends on the business cost of False Positives vs. False Negatives.
- Bias vs. Fairness: Bias is a mathematical measurement (e.g., CI). Fairness is a social and legal requirement. SageMaker Clarify provides the metrics, but the engineer must decide the threshold for what is "fair."
- Cross-Ref: For implementation details, see the SageMaker SDK `BiasConfig` documentation.
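The Precision-Recall trade-off above can be demonstrated with a tiny threshold sweep. The scores and labels are made up for illustration; raising the threshold makes positive predictions rarer but more reliable:

```python
def precision_recall_at(threshold, scores, labels):
    """Classify score >= threshold as positive, then compute precision/recall."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions -> vacuously precise
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2]   # model confidence per example
labels = [True, True, False, True, False, False, True, False]  # ground truth
for t in (0.3, 0.6, 0.9):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.6: precision=0.75, recall=0.75
# threshold=0.9: precision=1.00, recall=0.50
```

The sweep makes the business question concrete: the "right" threshold is the one whose False Positive / False Negative mix is cheapest for the use case, not the one with the prettiest single metric.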
Comparison Tables
| Feature | Precision | Recall |
|---|---|---|
| Goal | Minimize False Positives | Minimize False Negatives |
| Business Impact | Avoid "Crying Wolf" | Avoid "Missing the Target" |
| Focus | Quality of positive predictions | Completeness of positive coverage |

| Bias Type | When to Measure | Key Metrics |
|---|---|---|
| Pre-training | Before model training | CI, DPL, Facet Imbalance |
| Post-training | After training / In Production | EOD, PPD, Bias Drift |