Lab: Detecting Bias and Ensuring Data Integrity with SageMaker Clarify and AWS Glue

Ensure data integrity and prepare data for modeling

In this lab, you will act as a Machine Learning Engineer responsible for validating a dataset before model training. You will use AWS Glue Data Quality to identify missing values and duplicates, and Amazon SageMaker Clarify to detect pre-training bias metrics like Class Imbalance (CI).

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for S3 storage and SageMaker processing resources.

Prerequisites

  • An AWS account with Administrator access or the AmazonSageMakerFullAccess and AWSGlueConsoleFullAccess policies.
  • AWS CLI configured on your local machine.
  • Basic knowledge of Python and S3.

Learning Objectives

  • Validate Data Quality: Use AWS Glue to check for missing values and schema consistency.
  • Detect Pre-training Bias: Calculate Class Imbalance (CI) and Difference in Proportions of Labels (DPL) using SageMaker Clarify.
  • Handle Sensitive Data: Identify PII (Personally Identifiable Information) patterns for masking.

Architecture Overview

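[Diagram not available in this version. In outline: dataset.csv is uploaded to S3, checked against an AWS Glue Data Quality ruleset, and described in a SageMaker Clarify analysis configuration for pre-training bias metrics.]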

Visualizing Bias

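Checking for imbalance is an early step in validating a dataset. If one category (e.g., "Approved") significantly outweighs another (e.g., "Rejected"), or one group supplies far more rows than another, the model may become biased toward the majority. SageMaker Clarify's Class Imbalance (CI) metric measures the second case: how unevenly the groups of a facet, such as gender, are represented.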

\begin{tikzpicture}[scale=0.8]
  \draw[thick] (0,0) circle (2cm);
  % Group A: 80% of the circle (288 degrees)
  \draw[fill=blue!30] (0,0) -- (0,2) arc (90:378:2) -- cycle;
  % Group B: 20% of the circle (72 degrees)
  \draw[fill=red!30] (0,0) -- (0,2) arc (90:18:2) -- cycle;
  \node at (-1,0) {\textbf{80\% Group A}};
  \node at (1.5,1) {\textbf{20\% Group B}};
  \draw[->] (3,1) -- (4,1) node[right] {High Class Imbalance (CI)};
\end{tikzpicture}
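To put a number on the figure, the sketch below applies the Class Imbalance formula CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the row counts of the larger and smaller group. The function name and the 100-row total are assumptions chosen for illustration; only the 80/20 split comes from the figure.

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """Return CI = (n_a - n_d) / (n_a + n_d), a value between -1 and +1."""
    total = n_a + n_d
    if total == 0:
        raise ValueError("at least one sample is required")
    return (n_a - n_d) / total


# The 80/20 split from the figure, assuming 100 rows in total.
ci_figure = class_imbalance(80, 20)
print(f"CI = {ci_figure:.2f}")
```

A value near 0 indicates balanced groups, while values approaching +1 or -1 indicate that one group dominates the dataset.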

Step-by-Step Instructions

Step 1: Prepare the S3 Environment

You need a bucket to store your dataset and the resulting bias reports.

bash
# Create a unique bucket name
export BUCKET_NAME="brainybee-lab-data-integrity-$(date +%s)"
aws s3 mb "s3://$BUCKET_NAME"
Console Alternative
  1. Navigate to the S3 Console.
  2. Click Create bucket.
  3. Enter a name like `brainybee-lab-data-integrity-` followed by a unique suffix, and click Create.
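If you script this step in Python instead of the shell, the same timestamp-suffix idea might look like the sketch below. The helper names are hypothetical, and the regular expression covers only a simplified subset of the S3 naming rules (3 to 63 characters, lowercase letters, digits, and hyphens, starting and ending with a letter or digit).

```python
import re
import time

# Simplified subset of the S3 bucket naming rules (assumption: no dots allowed here).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix: str = "brainybee-lab-data-integrity") -> str:
    """Append the current Unix timestamp, like $(date +%s) in the shell command."""
    return f"{prefix}-{int(time.time())}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Check the candidate name against the simplified pattern above."""
    return bool(_BUCKET_RE.match(name))


bucket_name = make_bucket_name()
print(bucket_name, looks_like_valid_bucket_name(bucket_name))
```

The check is only a local pre-flight test; S3 still rejects names that are already taken by another account.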

Step 2: Upload Sample Data

Create a local file named dataset.csv with the following imbalanced data (predicting loan approval based on age and gender):

csv
age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
bash
aws s3 cp dataset.csv s3://$BUCKET_NAME/input/dataset.csv

Step 3: Run Glue Data Quality Rules

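Define a ruleset that requires the age column to be complete (no nulls) and the loan_approved column to contain only 0 or 1. This command only registers the ruleset; evaluating it against the data also requires a Glue Data Catalog table for the dataset and a run started with `aws glue start-data-quality-ruleset-evaluation-run`.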

bash
aws glue create-data-quality-ruleset \
  --name "LoanDataIntegrity" \
  --ruleset "Rules = [ IsComplete \"age\", ColumnValues \"loan_approved\" in [ 0, 1 ] ]"
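Before relying on Glue, it can help to see what the two rules assert. The following sketch mirrors them locally in plain Python against the sample CSV. It is not how Glue evaluates DQDL, and the function name, result keys, and extra duplicate count are assumptions added for illustration.

```python
import csv
import io

SAMPLE_CSV = """age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
"""


def check_rules(csv_text: str) -> dict:
    """Apply local equivalents of the two ruleset checks, plus a duplicate count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # IsComplete "age": every row has a non-empty age value.
    age_complete = all((row.get("age") or "").strip() for row in rows)
    # ColumnValues "loan_approved" in [0, 1]: only the two allowed labels occur.
    label_binary = all(row.get("loan_approved") in {"0", "1"} for row in rows)
    # Extra check (not part of the ruleset): rows that repeat an earlier row.
    seen = set()
    duplicate_rows = 0
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            duplicate_rows += 1
        seen.add(key)
    return {
        "age_complete": age_complete,
        "label_binary": label_binary,
        "duplicate_rows": duplicate_rows,
    }


sample_result = check_rules(SAMPLE_CSV)
print(sample_result)
```

Running the same function on a deliberately broken copy of the file (an empty age, a label of 2, a repeated row) is a quick way to confirm that each check fails when it should.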

Step 4: Configure SageMaker Clarify for Bias Detection

We will check if there is a bias against a specific gender.

bash
# Note: This requires a JSON configuration for Clarify.
# In a real scenario, you would use the SageMaker SDK.
# Here we simulate the configuration setup.
# "facet" names the sensitive column to analyze. No "predictor" block is
# included because pre-training bias metrics do not need a model.
cat <<EOF > analysis_config.json
{
  "dataset_type": "text/csv",
  "headers": ["age", "gender", "loan_approved"],
  "label": "loan_approved",
  "label_values_or_threshold": [1],
  "facet": [{ "name_or_index": "gender" }],
  "methods": {
    "pre_training_bias": { "methods": ["CI", "DPL"] }
  }
}
EOF
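To see what the two requested metrics would report for this dataset, the sketch below computes them by hand using CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where q is the proportion of positive labels in each group. Treating `F` as the disadvantaged facet value is an assumption made for illustration, and the function name is hypothetical; this is not the Clarify implementation.

```python
import csv
import io

SAMPLE_CSV = """age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
"""


def pre_training_bias(csv_text, facet, disadvantaged_value, label, positive_label="1"):
    """Compute CI and DPL for one facet column, comparing one value to the rest."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    group_d = [r for r in rows if r[facet] == disadvantaged_value]
    group_a = [r for r in rows if r[facet] != disadvantaged_value]
    n_a, n_d = len(group_a), len(group_d)
    if n_a == 0 or n_d == 0:
        raise ValueError("both groups need at least one row")
    # Proportion of positive labels within each group.
    q_a = sum(r[label] == positive_label for r in group_a) / n_a
    q_d = sum(r[label] == positive_label for r in group_d) / n_d
    return {"CI": (n_a - n_d) / (n_a + n_d), "DPL": q_a - q_d}


metrics = pre_training_bias(
    SAMPLE_CSV, facet="gender", disadvantaged_value="F", label="loan_approved"
)
print(metrics)
```

With four `M` rows (all approved) and two `F` rows (one approved), CI is 2/6 and DPL is 1.0 - 0.5 = 0.5, so the sample is skewed both in representation and in outcomes.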

Checkpoints

| Checkpoint | Action | Expected Result |
| --- | --- | --- |
| S3 Upload | `aws s3 ls s3://$BUCKET_NAME/input/` | `dataset.csv` is listed. |
| Glue Ruleset | `aws glue get-data-quality-ruleset --name "LoanDataIntegrity"` | Ruleset JSON is returned. |
| Clarify Config | `ls analysis_config.json` | File exists with correct JSON syntax. |
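Note that `ls` confirms only that the file exists. A minimal sketch of a syntax check is shown below; the function name and the list of expected keys are assumptions based on the configuration used in this lab, not an official schema validator.

```python
import json

GOOD_CONFIG = """{
  "dataset_type": "text/csv",
  "headers": ["age", "gender", "loan_approved"],
  "label": "loan_approved",
  "methods": {"pre_training_bias": {"methods": ["CI", "DPL"]}}
}"""


def validate_analysis_config(text: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    # Keys this lab's configuration relies on (assumed list, not a full schema).
    for key in ("dataset_type", "headers", "label", "methods"):
        if key not in config:
            problems.append(f"missing key: {key}")
    if "label" in config and config["label"] not in config.get("headers", []):
        problems.append("label is not one of the headers")
    return problems


print(validate_analysis_config(GOOD_CONFIG))
```

To check the file from Step 4, pass its contents instead, for example `validate_analysis_config(open("analysis_config.json").read())`.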

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| AccessDenied | IAM user lacks S3/Glue permissions. | Attach AdministratorAccess for lab purposes. |
| NoSuchBucket | Typo in bucket name variable. | Run `echo $BUCKET_NAME` to verify. |
| InvalidInput | CSV headers don't match config. | Ensure the first line of dataset.csv matches the headers in analysis_config.json. |
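The header mismatch in the last row can be caught locally before anything is submitted. The sketch below compares the first CSV line with a header list. The function name is hypothetical and the inline strings stand in for the contents of dataset.csv and analysis_config.json.

```python
import csv
import io


def headers_match(csv_text: str, config_headers: list) -> bool:
    """Return True when the CSV's first row equals the configured headers, in order."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    return [name.strip() for name in first_row] == list(config_headers)


csv_text = "age,gender,loan_approved\n25,M,1\n"
config_headers = ["age", "gender", "loan_approved"]
print(headers_match(csv_text, config_headers))
```

The comparison is order-sensitive on purpose, since a reordered header list would attach the wrong name to each column.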

Clean-Up / Teardown

To prevent further costs, delete the resources created:

bash
# Delete S3 objects and bucket
aws s3 rb "s3://$BUCKET_NAME" --force

# Delete Glue Ruleset
aws glue delete-data-quality-ruleset --name "LoanDataIntegrity"

Stretch Challenge

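Task: Add a check that detects PII, such as email or phone columns that were accidentally included in your training set.

Hint: AWS Glue exposes sensitive data detection as the Detect Sensitive Data transform in AWS Glue Studio rather than as a Data Quality (DQDL) rule type, so add it as a separate job step. Within the ruleset itself, you can approximate the check with a pattern rule such as `ColumnValues "<column>" matches "<regex>"`.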
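As a starting point for the challenge, the sketch below scans row values with two simplified regular expressions. The patterns (a basic email shape and a US-style 3-3-4 phone number), the function name, and the `contact` column are all assumptions for illustration; they are far less thorough than a managed PII detector.

```python
import re

# Simplified patterns (assumptions): real PII detection needs broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def find_pii_columns(rows: list) -> dict:
    """Map each column name to the sorted PII kinds found in its values."""
    found = {}
    for row in rows:
        for column, value in row.items():
            for kind, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    found.setdefault(column, set()).add(kind)
    return {column: sorted(kinds) for column, kinds in found.items()}


# Hypothetical rows with a contact column that is not part of the lab dataset.
rows = [
    {"age": "25", "gender": "M", "contact": "alex@example.com"},
    {"age": "30", "gender": "F", "contact": "555-010-1234"},
]
pii_columns = find_pii_columns(rows)
print(pii_columns)
```

Columns flagged this way are candidates for masking or removal before the data reaches training.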

Cost Estimate

  • S3 Storage: Negligible (<$0.01 for this volume).
  • Glue Data Quality: $0.44 per DPU-Hour (Small runs usually fit in free tier or cost <$0.10).
  • SageMaker Clarify: Billed at standard Processing Job rates ($0.20 - $0.50 per run for small instances).
  • Total Estimated Spend: < $1.00.

Concept Review

| Concept | Definition | Example |
| --- | --- | --- |
| Class Imbalance (CI) | One group has significantly more samples than the others; Clarify computes it over the values of a facet. | 90% of transactions are "Legit", 10% are "Fraud". |
| Difference in Proportions of Labels (DPL) | Difference in the proportion of positive outcomes between groups. | Approval rate for men is 80%, for women is 40%. |
| Data Masking | Replacing sensitive data with functional surrogates. | Replacing Social Security Numbers with XXX-XX-1234. |
| Anonymization | Removing all PII so individuals cannot be identified. | Removing names, exact birthdays, and specific addresses. |
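The masking example in the table can be sketched in a few lines. The function name is hypothetical, and the pattern assumes Social Security Numbers are written in the dashed `NNN-NN-NNNN` form; it keeps the last four digits, as in the table's example.

```python
import re

# Assumes dashed SSNs; the last four digits are captured so they can be kept.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace the first five digits of each SSN with X characters."""
    return SSN_RE.sub(r"XXX-XX-\1", text)


masked = mask_ssn("SSN on file: 123-45-1234")
print(masked)
```

Because the last four digits survive, this is masking rather than anonymization: the value remains partly identifying and should still be treated as sensitive.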
