Lab: Implementing Data Privacy and Governance on AWS
This hands-on lab guides you through implementing data privacy controls, PII identification, and fine-grained access control using AWS Lake Formation, AWS Glue, and Amazon S3. These skills are critical for the AWS Certified Data Engineer - Associate (DEA-C01) exam.
[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges. Estimated cost is < $0.50 if within the Free Tier.
Prerequisites
- An active AWS Account.
- AWS CLI installed and configured with Administrator access.
- Basic knowledge of IAM (Identity and Access Management).
- A text editor to create a sample CSV file.
- Placeholder: Replace <YOUR_ACCOUNT_ID> and <YOUR_REGION> with your actual details.
Learning Objectives
- Create a secure data lake storage structure in Amazon S3.
- Catalog datasets using AWS Glue Crawlers.
- Implement column-level Fine-Grained Access Control (FGAC) using AWS Lake Formation.
- Understand the workflow for PII Identification and data masking.
Architecture Overview
Step-by-Step Instructions
Step 1: Prepare Sample Data and S3 Bucket
First, we create a bucket and a mock dataset containing PII (Personally Identifiable Information).
- Create a file named users.csv with the following content:
user_id,name,email,phone,credit_card,zipcode
1,John Doe,john@example.com,555-0101,1234-5678-9012,90210
2,Jane Smith,jane@example.com,555-0102,9876-5432-1098,10001
- Run the following CLI commands:
# Create a unique bucket name
BUCKET_NAME="brainybee-lab-privacy-<YOUR_ACCOUNT_ID>"
# Create the bucket
aws s3 mb s3://$BUCKET_NAME --region <YOUR_REGION>
# Upload the data to a 'raw' prefix
aws s3 cp users.csv s3://$BUCKET_NAME/raw/users.csv
Console alternative
- Navigate to S3 > Create bucket.
- Name it brainybee-lab-privacy-<YOUR_ACCOUNT_ID>.
- Click Create.
- Upload the users.csv file into a new folder named raw/.
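If you would rather create the sample file from the shell than from a text editor, the sketch below writes the same two records with a here-document. The field-count check at the end is an addition for illustration and is not part of the original lab.

```shell
# Write the sample dataset from Step 1 to users.csv.
cat > users.csv <<'EOF'
user_id,name,email,phone,credit_card,zipcode
1,John Doe,john@example.com,555-0101,1234-5678-9012,90210
2,Jane Smith,jane@example.com,555-0102,9876-5432-1098,10001
EOF

# Sanity check: count rows whose field count is not 6.
bad_rows=$(awk -F',' 'NF != 6 { n++ } END { print n + 0 }' users.csv)
echo "rows with an unexpected field count: $bad_rows"
```

A non-zero count usually means a stray comma or a missing column, which would lead the crawler to infer a different schema than the one this lab expects.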
Step 2: Set Up Lake Formation Permissions
Before Lake Formation can manage the data, the IAM role/user running this lab must be a Data Lake Administrator.
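The administrator entry is identified by an IAM ARN. If you are signed in through an assumed role, aws sts get-caller-identity reports an STS session ARN instead, so a small conversion can help. The helper below is a sketch: the function name to_iam_arn and the sample ARN are made up for illustration, and it assumes the role has no IAM path, because the session ARN does not carry one.

```shell
# Convert an STS assumed-role ARN into the matching IAM role ARN.
# Any other ARN (for example an IAM user ARN) is printed unchanged.
# Assumption: the role has no IAM path; session ARNs omit the path.
to_iam_arn() {
  case "$1" in
    arn:aws:sts::*:assumed-role/*)
      account=${1#arn:aws:sts::}   # drop the prefix, leaving "<account>:assumed-role/..."
      account=${account%%:*}       # keep only the account id
      role=${1#*:assumed-role/}    # leaves "<role>/<session>"
      role=${role%%/*}             # keep only the role name
      printf 'arn:aws:iam::%s:role/%s\n' "$account" "$role"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}

# Example with a made-up ARN. In the lab you would pass the output of:
#   aws sts get-caller-identity --query Arn --output text
to_iam_arn "arn:aws:sts::123456789012:assumed-role/LabAdmin/my-session"
```

The printed value is the form to place in the DataLakeAdmins list.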
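# Grant your current IAM principal administrative rights in Lake Formation.
# Data lake administrators are IAM users or roles, so use the ARN of the user or role
# running this lab; <YOUR_USER_NAME> is a placeholder for your IAM user name.
# Note: put-data-lake-settings replaces the existing settings, including any admins already listed.
aws lakeformation put-data-lake-settings --data-lake-settings '{
"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<YOUR_ACCOUNT_ID>:user/<YOUR_USER_NAME>"}]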
}'
Step 3: Register S3 Location & Catalog Data
We will now register the S3 bucket as a managed location and run a crawler to populate the metadata.
# Register the S3 location (Lake Formation needs a role to access it; here the service-linked role)
aws lakeformation register-resource --resource-arn arn:aws:s3:::$BUCKET_NAME --use-service-linked-role
# Create a Glue Database
aws glue create-database --database-input '{"Name": "lab_privacy_db"}'
# Create and run a crawler (assumes an IAM role named AWSGlueServiceRole-Lab already exists and can read the bucket)
aws glue create-crawler --name privacy-crawler --role AWSGlueServiceRole-Lab --database-name lab_privacy_db --targets '{"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/raw/"}]}'
aws glue start-crawler --name privacy-crawler
[!TIP] It takes about 1-2 minutes for the crawler to finish. You can check status with aws glue get-crawler --name privacy-crawler.
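Rather than re-running the status command by hand, you can poll until the crawler is idle again. The function below is a minimal sketch; the name wait_for_crawler and the default interval and attempt count are illustrative choices, not values from the lab.

```shell
# Poll a Glue crawler until it returns to the READY state.
# Usage: wait_for_crawler <crawler-name> [poll-seconds] [max-attempts]
wait_for_crawler() {
  name=$1
  interval=${2:-15}
  attempts=${3:-40}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(aws glue get-crawler --name "$name" --query 'Crawler.State' --output text)
    echo "crawler $name state: $state"
    if [ "$state" = "READY" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for crawler $name" >&2
  return 1
}
```

Call it as wait_for_crawler privacy-crawler once the crawler has been started. READY only means the crawler is idle; the Crawler.LastCrawl.Status field in the same get-crawler response indicates whether the last run succeeded.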
Step 4: Implement Column-Level Security
In this step, we restrict an IAM user from seeing the credit_card and phone columns.
# Create a data cell filter (the Lake Formation mechanism for FGAC).
# ColumnNames lists the columns that stay visible; credit_card and phone are omitted.
# ColumnNames and ColumnWildcard are mutually exclusive, so only ColumnNames is set.
aws lakeformation create-data-cells-filter --table-data '{
"TableCatalogId": "<YOUR_ACCOUNT_ID>",
"TableName": "raw",
"DatabaseName": "lab_privacy_db",
"Name": "mask_pii_filter",
"RowFilter": {"AllRowsWildcard": {}},
"ColumnNames": ["user_id", "name", "email", "zipcode"]
}'
Checkpoints
- Glue Catalog Check: Run aws glue get-table --database-name lab_privacy_db --name raw. You should see the schema including credit_card.
- Lake Formation Check: Open the Lake Formation Console > Data filters. You should see mask_pii_filter listed.
- Permission Verification: Ensure your principal does not have SELECT on the full table but only through the filter.
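The permission verification checkpoint presumes that a grant through the filter exists. One way to create and inspect it is sketched below. The function names are illustrative, and the database, table, and filter names are assumed to match Steps 3 and 4.

```shell
# Grant SELECT through the data cell filter instead of on the whole table.
# Usage: grant_filter_select <principal-arn> <account-id>
# Assumption: names match Steps 3 and 4 (lab_privacy_db, raw, mask_pii_filter).
grant_filter_select() {
  principal_arn=$1
  account_id=$2
  resource=$(printf '{"DataCellsFilter": {"TableCatalogId": "%s", "DatabaseName": "lab_privacy_db", "TableName": "raw", "Name": "mask_pii_filter"}}' "$account_id")
  aws lakeformation grant-permissions \
    --principal "DataLakePrincipalIdentifier=$principal_arn" \
    --permissions SELECT \
    --resource "$resource"
}

# List the principal's grants on the table, to check for a direct
# table-level SELECT that would bypass the filter.
list_table_grants() {
  aws lakeformation list-permissions \
    --principal "DataLakePrincipalIdentifier=$1" \
    --resource '{"Table": {"DatabaseName": "lab_privacy_db", "Name": "raw"}}'
}
```

For example, grant_filter_select arn:aws:iam::<YOUR_ACCOUNT_ID>:user/lab-analyst <YOUR_ACCOUNT_ID>, where lab-analyst is a hypothetical IAM user. If the table still carries the default IAMAllowedPrincipals grant, the filter may appear to have no effect until that grant is revoked.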
Troubleshooting
| Error | Possible Cause | Fix |
|---|---|---|
| Insufficient Lake Formation Permissions | You aren't a Data Lake Admin. | Repeat Step 2 or add your IAM ARN to Admin list in LF Console. |
| Crawler failed: Access Denied | S3 bucket policy or Glue Role issues. | Ensure the IAM Role used by the crawler has s3:GetObject on the bucket. |
| Table not found | Crawler hasn't finished. | Wait for the crawler status to return to READY. |
Clean-Up / Teardown
Execute these commands to remove all resources and stop charges:
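# 1. Delete the data cell filter, Glue crawler, and database
aws lakeformation delete-data-cells-filter --table-catalog-id <YOUR_ACCOUNT_ID> --database-name lab_privacy_db --table-name raw --name mask_pii_filter
aws glue delete-crawler --name privacy-crawler
aws glue delete-database --name lab_privacy_db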
# 2. Deregister S3 location
aws lakeformation deregister-resource --resource-arn arn:aws:s3:::$BUCKET_NAME
# 3. Delete S3 Bucket and objects
aws s3 rb s3://$BUCKET_NAME --force
Cost Estimate
| Service | Estimated Cost |
|---|---|
| S3 | Free Tier ($0.023 per GB-month if exceeded) |
| AWS Glue | ~$0.44 per DPU-Hour (Crawler typically uses 2 DPUs for < 2 mins) |
| Lake Formation | $0.00 (Governance is free) |
| Total | ~$0.10 - $0.25 |
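The Glue line can be checked with a little arithmetic. The sketch below assumes the figures in the table (2 DPUs at $0.44 per DPU-hour) plus a 10-minute minimum billing duration per crawler run, which is what the AWS Glue pricing page listed at the time of writing. Confirm current rates for your Region before relying on it.

```shell
# Rough crawler cost: DPUs x billed hours x price per DPU-hour.
# Assumptions: 2 DPUs and $0.44 per DPU-hour (from the table above),
# and a 10-minute minimum billed duration per crawler run.
dpus=2
price_per_dpu_hour=0.44
run_minutes=2
min_billed_minutes=10

crawler_cost=$(awk -v d="$dpus" -v p="$price_per_dpu_hour" -v r="$run_minutes" -v m="$min_billed_minutes" \
  'BEGIN { billed = (r > m) ? r : m; printf "%.4f", d * p * billed / 60 }')
echo "estimated crawler cost per run: \$$crawler_cost"
```

Under these assumptions a single run comes to about $0.15, which falls inside the total range shown in the table.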
Stretch Challenge
Row-Level Filtering: Modify the Data Cell Filter in Step 4 to only show users where zipcode = '90210'. This demonstrates how to handle data residency requirements (e.g., ensuring local analysts only see local data).
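One possible starting point for the challenge is a second filter that adds a row predicate to the column list. The helper below only builds the JSON. The filter name zip_residency_filter is hypothetical, and the quoted comparison assumes the crawler inferred zipcode as a string; drop the quotes in the predicate if it was inferred as a number.

```shell
# Build the JSON for a data cell filter that combines a row predicate
# with the visible-column list from Step 4.
# Usage: row_filter_json <account-id> <zipcode>
# Assumptions: zip_residency_filter is a hypothetical name; zipcode is a string column.
row_filter_json() {
  printf '{"TableCatalogId": "%s", "DatabaseName": "lab_privacy_db", "TableName": "raw", "Name": "zip_residency_filter", "RowFilter": {"FilterExpression": "zipcode = '\''%s'\''"}, "ColumnNames": ["user_id", "name", "email", "zipcode"]}\n' "$1" "$2"
}

# Print the payload for a made-up account id.
row_filter_json 123456789012 90210
```

The output can then be passed to the CLI, for example aws lakeformation create-data-cells-filter --table-data "$(row_filter_json <YOUR_ACCOUNT_ID> 90210)". Remember to delete this extra filter during teardown.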
Concept Review
Data Privacy Pillars on AWS
\begin{tikzpicture}[node distance=2cm]
\draw[thick, fill=blue!10] (0,0) circle (1.5cm) node[align=center] {\textbf{Encryption}\\KMS / TLS};
\draw[thick, fill=red!10] (2.5,0) circle (1.5cm) node[align=center] {\textbf{Governance}\\Lake Formation};
\draw[thick, fill=green!10] (1.25,-2) circle (1.5cm) node[align=center] {\textbf{Discovery}\\Macie / Glue};
\node at (1.25, 1) {\textbf{Data Privacy Triad}};
\end{tikzpicture}
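- Data Masking: Replacing sensitive data with functional aliases. Lake Formation data cell filters do not substitute values; they hide sensitive columns and rows from a principal, which gives a similar protective effect at query time.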
- PII Identification: The process of finding sensitive strings (SSNs, Emails). Amazon Macie automates this using machine learning.
- Principle of Least Privilege: Granting users only the specific columns and rows they need to perform their jobs.
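To make the first two ideas concrete, here is a naive local sketch of identification and masking on the lab's sample data. It is plain pattern matching with awk for illustration only, and it does not reflect how Amazon Macie or Lake Formation work internally. The file name users_sample.csv and both function names are made up.

```shell
# Recreate the lab's sample data so the sketch runs on its own.
cat > users_sample.csv <<'EOF'
user_id,name,email,phone,credit_card,zipcode
1,John Doe,john@example.com,555-0101,1234-5678-9012,90210
2,Jane Smith,jane@example.com,555-0102,9876-5432-1098,10001
EOF

# Identification: print the names of columns whose first data row
# looks like an email address.
find_email_columns() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
    NR == 2 { for (i = 1; i <= NF; i++) if ($i ~ /^[^@ ]+@[^@ ]+\.[^@ ]+$/) print header[i] }
  ' "$1"
}

# Masking: in the credit_card column (field 5), replace the digits of
# every dash-separated group except the last with X.
mask_credit_card() {
  awk -F',' -v OFS=',' '
    NR > 1 {
      n = split($5, parts, "-"); masked = ""
      for (i = 1; i < n; i++) { gsub(/[0-9]/, "X", parts[i]); masked = masked parts[i] "-" }
      $5 = masked parts[n]
    }
    { print }
  ' "$1"
}

find_email_columns users_sample.csv
mask_credit_card users_sample.csv
```

The first call prints the column name email. The second prints the file with card numbers such as XXXX-XXXX-9012, leaving the header and the other columns untouched.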
| Feature | IAM Policy | Lake Formation |
|---|---|---|
| Granularity | Resource/API level | Row, Column, Cell level |
| Ease of Use | Complex JSON policies | Visual Grant/Revoke |
| Cross-Account | Manual Role Assumption | Automated Data Sharing |