Lab: Detecting Bias and Ensuring Data Integrity with SageMaker Clarify and AWS Glue
Ensure data integrity and prepare data for modeling
In this lab, you will act as a Machine Learning Engineer responsible for validating a dataset before model training. You will use AWS Glue Data Quality to identify missing values and duplicates, and Amazon SageMaker Clarify to detect pre-training bias metrics like Class Imbalance (CI).
[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for S3 storage and SageMaker processing resources.
Prerequisites
- An AWS account with Administrator access or the `AmazonSageMakerFullAccess` and `AWSGlueConsoleFullAccess` policies.
- AWS CLI configured on your local machine.
- Basic knowledge of Python and S3.
Learning Objectives
- Validate Data Quality: Use AWS Glue to check for missing values and schema consistency.
- Detect Pre-training Bias: Calculate Class Imbalance (CI) and Difference in Proportions of Labels (DPL) using SageMaker Clarify.
- Handle Sensitive Data: Identify PII (Personally Identifiable Information) patterns for masking.
Architecture Overview
Visualizing Bias
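Understanding class imbalance is the first step in data integrity. If one group significantly outnumbers another in the training data, or one outcome (e.g., "Approved") significantly outweighs another (e.g., "Rejected"), the model may become biased. SageMaker Clarify's CI metric measures the imbalance between facet groups, such as genders.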
\begin{tikzpicture}[scale=0.8]
  \draw[thick] (0,0) circle (2cm);
  % 80% of the circle is 288 degrees, 20% is 72 degrees
  \draw[fill=blue!30] (0,0) -- (0,2) arc (90:378:2) -- cycle;
  \draw[fill=red!30] (0,0) -- (0,2) arc (90:18:2) -- cycle;
  \node at (-1,0) {\textbf{80\% Group A}};
  \node at (1.5,1) {\textbf{20\% Group B}};
  \draw[->] (3,1) -- (4,1) node[right] {High Class Imbalance (CI)};
\end{tikzpicture}
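SageMaker Clarify's documentation defines CI as the normalized difference between the sizes of two facet groups, CI = (n_a - n_d) / (n_a + n_d), which ranges from -1 to 1. The function below is a minimal sketch of that formula, not Clarify's implementation; the name `class_imbalance` and the 80/20 counts (taken from the chart above) are illustrative assumptions.

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """Return CI = (n_a - n_d) / (n_a + n_d) for two facet group sizes."""
    total = n_a + n_d
    if total == 0:
        raise ValueError("at least one sample is required")
    return (n_a - n_d) / total


# Counts mirroring the 80% / 20% split in the chart (illustrative values).
print(round(class_imbalance(80, 20), 2))  # 0.6
print(round(class_imbalance(50, 50), 2))  # 0.0
```

A value near 0 indicates balanced groups; values near 1 or -1 indicate that one group dominates the dataset.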
Step-by-Step Instructions
Step 1: Prepare the S3 Environment
You need a bucket to store your dataset and the resulting bias reports.
# Create a unique bucket name
export BUCKET_NAME="brainybee-lab-data-integrity-$(date +%s)"
aws s3 mb "s3://$BUCKET_NAME"
Console Alternative
- Navigate to the S3 Console.
- Click Create bucket.
- Enter a name starting with `brainybee-lab-data-integrity-` followed by a unique suffix, and click Create.
Step 2: Upload Sample Data
Create a local file named dataset.csv with the following imbalanced data (predicting loan approval based on age and gender):
age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
Then upload the file to your bucket:
aws s3 cp dataset.csv "s3://$BUCKET_NAME/input/dataset.csv"
Step 3: Run Glue Data Quality Rules
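Define a ruleset to ensure the age column is never null and the loan_approved column is binary. The `aws glue create-data-quality-ruleset` command only registers the ruleset; evaluating it against the data requires a separate evaluation run on a Glue Data Catalog table.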
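Before registering the ruleset, the two rules described here can be approximated locally to confirm that the sample data would pass. This is a sketch using only the Python standard library, not a substitute for a Glue evaluation; the function name `check_rows` and the inline copy of the sample CSV are assumptions made for illustration.

```python
import csv
import io

# Inline copy of the lab's dataset.csv so the sketch runs on its own.
SAMPLE_CSV = """age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
"""


def check_rows(csv_text: str) -> list[str]:
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Local stand-in for the rule: IsComplete "age"
        if not (row.get("age") or "").strip():
            problems.append(f"line {line_no}: age is missing")
        # Local stand-in for the rule: ColumnValues "loan_approved" in [0, 1]
        if (row.get("loan_approved") or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: loan_approved is not 0 or 1")
    return problems


print(check_rows(SAMPLE_CSV))  # [] for the lab's sample data
```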
aws glue create-data-quality-ruleset \
  --name "LoanDataIntegrity" \
  --ruleset "Rules = [ IsComplete \"age\", ColumnValues \"loan_approved\" in [ 0, 1 ] ]"
Step 4: Configure SageMaker Clarify for Bias Detection
We will check if there is a bias against a specific gender.
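To get a feel for the number Clarify would report, DPL can be computed by hand on the six sample rows as the difference between the positive-label rates of the two gender groups, DPL = q_a - q_d. The sketch below follows that documented formula but is not Clarify's implementation; the function name `dpl` and the choice of "F" as the facet of interest are assumptions.

```python
import csv
import io

# Inline copy of the lab's dataset.csv so the sketch runs on its own.
SAMPLE_CSV = """age,gender,loan_approved
25,M,1
30,F,1
22,M,1
45,F,0
35,M,1
28,M,1
"""


def dpl(csv_text, facet_col, facet_value, label_col, positive_label="1"):
    """Return q_a - q_d, where group d has facet_col == facet_value."""
    counts = {"a": [0, 0], "d": [0, 0]}  # [positives, total] per group
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = "d" if row[facet_col] == facet_value else "a"
        counts[group][1] += 1
        if row[label_col] == positive_label:
            counts[group][0] += 1
    if counts["a"][1] == 0 or counts["d"][1] == 0:
        raise ValueError("both groups need at least one row")
    q_a = counts["a"][0] / counts["a"][1]  # approval rate, other group
    q_d = counts["d"][0] / counts["d"][1]  # approval rate, facet of interest
    return q_a - q_d


print(dpl(SAMPLE_CSV, "gender", "F", "loan_approved"))  # 0.5
```

Here all four male applicants are approved (rate 1.0) and one of two female applicants is approved (rate 0.5), so the difference is 0.5.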
# Note: This requires a JSON configuration for Clarify.
# In a real scenario, you would use the SageMaker SDK.
# Here we simulate the configuration setup.
# "facet" names the sensitive column to analyze; a pre-training analysis needs no "predictor" section.
cat <<EOF > analysis_config.json
{
  "dataset_type": "text/csv",
  "headers": ["age", "gender", "loan_approved"],
  "label": "loan_approved",
  "label_values_or_threshold": [1],
  "facet": [
    { "name_or_index": "gender", "value_or_threshold": ["F"] }
  ],
  "methods": {
    "pre_training_bias": {
      "methods": ["CI", "DPL"]
    }
  }
}
EOF
Checkpoints
| Checkpoint | Action | Expected Result |
|---|---|---|
| S3 Upload | `aws s3 ls s3://$BUCKET_NAME/input/` | dataset.csv is listed. |
| Glue Ruleset | `aws glue get-data-quality-ruleset --name "LoanDataIntegrity"` | Ruleset JSON is returned. |
| Clarify Config | `python3 -m json.tool analysis_config.json` | The JSON is printed back without a parse error. |
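The Clarify Config checkpoint can be taken one step further by confirming that the config's headers and label agree with the CSV header row. The helper below is a hypothetical local check, not part of any AWS tool; `config_problems` and `EXAMPLE_CONFIG` are names introduced only for this sketch.

```python
import json


def config_problems(config_text: str, csv_header_line: str) -> list[str]:
    """Return mismatches between a Clarify-style config and a CSV header row."""
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level JSON value must be an object"]
    csv_headers = [name.strip() for name in csv_header_line.split(",")]
    problems = []
    if config.get("headers") != csv_headers:
        problems.append("headers do not match the CSV header row")
    if config.get("label") not in csv_headers:
        problems.append("label is not a column in the CSV")
    return problems


# Trimmed-down config containing only the keys this check looks at.
EXAMPLE_CONFIG = '{"headers": ["age", "gender", "loan_approved"], "label": "loan_approved"}'
print(config_problems(EXAMPLE_CONFIG, "age,gender,loan_approved"))  # []
```

In practice you would read `analysis_config.json` and the first line of `dataset.csv` from disk and pass their contents to the function.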
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `AccessDenied` | IAM user lacks S3/Glue permissions. | Attach AdministratorAccess for lab purposes. |
| `NoSuchBucket` | Typo in bucket name variable. | Run `echo $BUCKET_NAME` to verify. |
| `InvalidInput` | CSV headers don't match config. | Ensure the first line of dataset.csv matches the headers in analysis_config.json. |
Clean-Up / Teardown
To prevent further costs, delete the resources created:
# Delete S3 objects and bucket
aws s3 rb s3://$BUCKET_NAME --force
# Delete Glue Ruleset
aws glue delete-data-quality-ruleset --name "LoanDataIntegrity"
Stretch Challenge
Task: Add a rule to the Glue Ruleset to detect PII.
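Hint: Look at Glue's Detect Sensitive Data capability (offered as a transform in Glue Studio) to find out whether any email or phone columns were accidentally included in your training set, and check the DQDL reference for the rule types currently available in Glue Data Quality.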
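While working on the challenge, a quick local scan can show which columns look like they hold emails or phone numbers. The sketch below uses deliberately simple regular expressions as a heuristic; it is not how Glue detects sensitive data, and the patterns, the `find_pii_columns` name, and the sample rows are all assumptions for illustration.

```python
import re

# Deliberately loose patterns: good enough for a smoke test, not for compliance.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{6,}\d"),
}


def find_pii_columns(rows):
    """Return sorted (column, kind) pairs where any value fully matches a pattern."""
    hits = set()
    for row in rows:
        for column, value in row.items():
            text = str(value).strip()
            for kind, pattern in PATTERNS.items():
                if pattern.fullmatch(text):
                    hits.add((column, kind))
    return sorted(hits)


# Hypothetical rows with an accidental "contact" column.
rows = [
    {"age": "25", "contact": "a@example.com"},
    {"age": "30", "contact": "+1 555-010-9999"},
]
print(find_pii_columns(rows))  # [('contact', 'email'), ('contact', 'phone')]
```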
Cost Estimate
- S3 Storage: Negligible (<$0.01 for this volume).
- Glue Data Quality: $0.44 per DPU-Hour (Small runs usually fit in free tier or cost <$0.10).
- SageMaker Clarify: Billed at standard Processing Job rates ($0.20 - $0.50 per run for small instances).
- Total Estimated Spend: < $1.00.
Concept Review
| Concept | Definition | Example |
|---|---|---|
| Class Imbalance (CI) | One group (facet value) has significantly more samples than another; the term is also used loosely for skewed label classes. | 80% of applicants belong to Group A, 20% to Group B. |
| DPL | Difference in proportions of positive outcomes between groups. | Approval rate for Men is 80%, for Women is 40%. |
| Data Masking | Replacing sensitive data with functional surrogates. | Replacing Social Security Numbers with XXX-XX-1234. |
| Anonymization | Removing all PII so individuals cannot be identified. | Removing names, exact birthdays, and specific addresses. |
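The data masking example from the table can be sketched in a few lines. This is one possible approach, assuming Social Security Numbers appear in the `NNN-NN-NNNN` format; the `mask_ssn` name is hypothetical, and the function keeps the last four digits as in the table's example.

```python
import re

# Matches NNN-NN-NNNN and captures the last four digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(text: str) -> str:
    """Replace the first five digits of each SSN-like token with X."""
    return SSN_RE.sub(r"XXX-XX-\1", text)


print(mask_ssn("SSN on file: 123-45-1234"))  # SSN on file: XXX-XX-1234
```

Note that masking of this kind keeps part of the identifier, so it does not by itself amount to anonymization as defined in the table.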