
Lab: Implementing Data Privacy and Governance on AWS

This hands-on lab guides you through implementing data privacy controls, PII identification, and fine-grained access control using AWS Lake Formation, AWS Glue, and Amazon S3. These skills are critical for the AWS Certified Data Engineer - Associate (DEA-C01) exam.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges. Estimated cost is < $0.50 if within the Free Tier.

Prerequisites

  • An active AWS Account.
  • AWS CLI installed and configured with Administrator access.
  • Basic knowledge of IAM (Identity and Access Management).
  • A text editor to create a sample CSV file.
  • Placeholders: replace <YOUR_ACCOUNT_ID> and <YOUR_REGION> with your AWS account ID and Region wherever they appear.

Learning Objectives

  • Create a secure data lake storage structure in Amazon S3.
  • Catalog datasets using AWS Glue Crawlers.
  • Implement column-level Fine-Grained Access Control (FGAC) using AWS Lake Formation.
  • Understand the workflow for PII Identification and data masking.

Architecture Overview

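[Diagram: users.csv in Amazon S3 (raw/ prefix) → AWS Glue crawler → Glue Data Catalog (lab_privacy_db) → Lake Formation data filter → column-restricted access for an IAM principal]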

Step-by-Step Instructions

Step 1: Prepare Sample Data and S3 Bucket

First, we create a bucket and a mock dataset containing PII (Personally Identifiable Information).

  1. Create a file named users.csv with the following content:

    csv
    user_id,name,email,phone,credit_card,zipcode
    1,John Doe,john@example.com,555-0101,1234-5678-9012,90210
    2,Jane Smith,jane@example.com,555-0102,9876-5432-1098,10001
  2. Run the following CLI commands:

bash
# Create a unique bucket name
BUCKET_NAME="brainybee-lab-privacy-<YOUR_ACCOUNT_ID>"

# Create the bucket
aws s3 mb "s3://$BUCKET_NAME" --region <YOUR_REGION>

# Upload the data to a 'raw' prefix
aws s3 cp users.csv "s3://$BUCKET_NAME/raw/users.csv"

Console alternative
  1. Navigate to S3 > Create bucket.
  2. Name it brainybee-lab-privacy-<YOUR_ACCOUNT_ID>.
  3. Click Create.
  4. Upload the users.csv file into a new folder named raw/.
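
The "secure storage" objective can be made explicit with two bucket-level settings. The sketch below wraps them in a hypothetical helper, `harden_bucket`, which is not part of the original lab. It assumes S3-managed encryption (SSE-S3) is acceptable; a production data lake might prefer a KMS key. New buckets already receive both settings by default, so these calls mainly document intent.

```shell
# Hypothetical helper (not part of the original lab): make the bucket's
# privacy defaults explicit. Usage: harden_bucket "$BUCKET_NAME"
harden_bucket() {
  bucket="$1"

  # Block every form of public access on the bucket.
  aws s3api put-public-access-block \
    --bucket "$bucket" \
    --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true" || return 1

  # Encrypt new objects at rest with S3-managed keys
  # (assumption: SSE-S3 is sufficient for this lab).
  aws s3api put-bucket-encryption \
    --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
}
```

Call it once after the bucket exists, for example `harden_bucket "$BUCKET_NAME"`.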

Step 2: Set Up Lake Formation Permissions

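Before Lake Formation can manage the data, the IAM user or role running this lab must be a Data Lake Administrator. Because the goal is to grant rights to your current identity, the commands below look up the ARN behind your CLI credentials instead of using the account root ARN.

bash
# Look up the ARN of the identity behind your CLI credentials
# (usable as-is for an IAM user; for an assumed role, substitute the IAM role ARN)
ADMIN_ARN=$(aws sts get-caller-identity --query Arn --output text)

# Grant that principal administrative rights in Lake Formation
aws lakeformation put-data-lake-settings --data-lake-settings \
  '{"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "'"$ADMIN_ARN"'"}]}'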
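
To confirm the setting took effect, the administrator list can be read back with `get-data-lake-settings`. The helpers below are a minimal sketch; the function names are hypothetical and not part of the original lab.

```shell
# Hypothetical helpers (not part of the original lab).

# Print the ARN of every Data Lake Administrator, one per line.
list_lf_admins() {
  aws lakeformation get-data-lake-settings \
    --query 'DataLakeSettings.DataLakeAdmins[].DataLakePrincipalIdentifier' \
    --output text | tr '\t' '\n'
}

# Succeed only if the given ARN is in the administrator list.
# Usage: is_lf_admin "<PRINCIPAL_ARN>" && echo "admin confirmed"
is_lf_admin() {
  list_lf_admins | grep -Fxq "$1"
}
```

Keep in mind that `put-data-lake-settings` replaces the whole settings object, so in a shared account it is worth running `list_lf_admins` first and carrying existing administrators over.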

Step 3: Register S3 Location & Catalog Data

We will now register the S3 bucket as a managed location and run a crawler to populate the metadata.

bash
# Register the S3 location with Lake Formation (using its service-linked role)
aws lakeformation register-resource --resource-arn "arn:aws:s3:::$BUCKET_NAME" --use-service-linked-role

# Create a Glue database
aws glue create-database --database-input '{"Name": "lab_privacy_db"}'

# Create and run a crawler
# (assumes an IAM role named AWSGlueServiceRole-Lab already exists with Glue and S3 read access)
aws glue create-crawler --name privacy-crawler --role AWSGlueServiceRole-Lab \
  --database-name lab_privacy_db \
  --targets '{"S3Targets": [{"Path": "s3://'"$BUCKET_NAME"'/raw/"}]}'
aws glue start-crawler --name privacy-crawler

[!TIP] It takes about 1-2 minutes for the crawler to finish. You can check status with aws glue get-crawler --name privacy-crawler.
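
Rather than re-running the status command by hand, the wait can be scripted. The following is a sketch only: `wait_for_crawler` is a hypothetical helper, and the 15-second interval and 40-poll limit are assumed defaults, not values from the lab.

```shell
# Hypothetical helper: poll until the crawler returns to READY.
# Usage: wait_for_crawler privacy-crawler [INTERVAL_SECONDS] [MAX_POLLS]
wait_for_crawler() {
  name="$1"
  interval="${2:-15}"   # seconds between polls (assumed default)
  max_polls="${3:-40}"  # give up after this many polls (assumed default)
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Crawler.State is READY, RUNNING, or STOPPING.
    state=$(aws glue get-crawler --name "$name" \
      --query 'Crawler.State' --output text) || return 1
    echo "crawler $name: $state"
    if [ "$state" = "READY" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for crawler $name" >&2
  return 1
}
```

READY only means the crawler is idle again. Whether the run succeeded is reported separately in `Crawler.LastCrawl.Status` of the same `get-crawler` response.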

Step 4: Implement Column-Level Security

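In this step, we define a filter that hides the credit_card and phone columns. An IAM principal that is granted SELECT through this filter, rather than on the full table, never sees those columns in its query results.

bash
# Create a data cells filter (the Lake Formation mechanism for FGAC).
# ColumnNames lists the only columns the filter exposes; credit_card and phone are left out.
# ColumnNames and ColumnWildcard are alternatives, so only one of them is specified.
# RowFilter with AllRowsWildcard keeps every row, making this a column-level filter.
aws lakeformation create-data-cells-filter --table-data-cells-filter '{
  "TableCatalogId": "<YOUR_ACCOUNT_ID>",
  "DatabaseName": "lab_privacy_db",
  "TableName": "raw",
  "Name": "mask_pii_filter",
  "RowFilter": {"AllRowsWildcard": {}},
  "ColumnNames": ["user_id", "name", "email", "zipcode"]
}'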
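
A filter restricts nothing until a principal is granted access through it. The sketch below shows one way to grant SELECT on the filter; `grant_filter_select` is a hypothetical helper, and the analyst ARN you pass in is an assumption, since the lab does not create a second user.

```shell
# Hypothetical helper: grant SELECT through mask_pii_filter only.
# Usage: grant_filter_select "<ANALYST_PRINCIPAL_ARN>" "<YOUR_ACCOUNT_ID>"
grant_filter_select() {
  principal_arn="$1"   # IAM user or role that should see the filtered view
  account_id="$2"      # account that owns the Glue Data Catalog

  # Identify the filter by catalog, database, table, and filter name.
  resource=$(printf '{"DataCellsFilter":{"TableCatalogId":"%s","DatabaseName":"lab_privacy_db","TableName":"raw","Name":"mask_pii_filter"}}' "$account_id")

  aws lakeformation grant-permissions \
    --principal "DataLakePrincipalIdentifier=$principal_arn" \
    --permissions SELECT \
    --resource "$resource"
}
```

If the table still carries the default grant to the IAMAllowedPrincipals group, IAM permissions alone decide access and the filter has no visible effect, so that grant may need to be revoked first.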

Checkpoints

  1. Glue Catalog Check: Run aws glue get-table --database-name lab_privacy_db --name raw. You should see the schema including credit_card.
  2. Lake Formation Check: Open the Lake Formation Console > Data filters. You should see mask_pii_filter listed.
  3. Permission Verification: Ensure your principal does not have SELECT on the full table but only through the filter.
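
The first two checkpoints can also be scripted. This is a minimal sketch with hypothetical helper names; it assumes the crawler named the table `raw` after the S3 prefix, as the lab does.

```shell
# Checkpoint 1: succeed if the cataloged table has the named column.
# Usage: table_has_column credit_card
table_has_column() {
  aws glue get-table --database-name lab_privacy_db --name raw \
    --query 'Table.StorageDescriptor.Columns[].Name' --output text |
    tr '\t' '\n' | grep -Fxq "$1"
}

# Checkpoint 2: succeed if the data cells filter exists on the table.
# Usage: filter_exists "<YOUR_ACCOUNT_ID>" [FILTER_NAME]
filter_exists() {
  table=$(printf '{"CatalogId":"%s","DatabaseName":"lab_privacy_db","Name":"raw"}' "$1")
  aws lakeformation list-data-cells-filter --table "$table" \
    --query 'DataCellsFilters[].Name' --output text |
    tr '\t' '\n' | grep -Fxq "${2:-mask_pii_filter}"
}
```

For the third checkpoint, `aws lakeformation list-permissions --principal DataLakePrincipalIdentifier=<PRINCIPAL_ARN>` lists what a principal has been granted, which makes a stray table-level SELECT easy to spot.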

Troubleshooting

| Error | Possible Cause | Fix |
|---|---|---|
| Insufficient Lake Formation Permissions | You aren't a Data Lake Admin. | Repeat Step 2 or add your IAM ARN to the administrators list in the Lake Formation console. |
| Crawler failed: Access Denied | S3 bucket policy or Glue role issues. | Ensure the IAM role used by the crawler has s3:GetObject and s3:ListBucket on the bucket. |
| Table not found | Crawler hasn't finished. | Wait for the crawler status to return to READY. |

Clean-Up / Teardown

Execute these commands to remove all resources and stop charges:

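bash
# 1. Delete the data cells filter, Glue crawler, and database
aws lakeformation delete-data-cells-filter --table-catalog-id <YOUR_ACCOUNT_ID> \
  --database-name lab_privacy_db --table-name raw --name mask_pii_filter
aws glue delete-crawler --name privacy-crawler
aws glue delete-database --name lab_privacy_db

# 2. Deregister the S3 location
aws lakeformation deregister-resource --resource-arn "arn:aws:s3:::$BUCKET_NAME"

# 3. Delete the S3 bucket and its objects
aws s3 rb "s3://$BUCKET_NAME" --force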

Cost Estimate

| Service | Estimated Cost |
|---|---|
| S3 | Free Tier ($0.023 per GB-month if exceeded) |
| AWS Glue | ~$0.44 per DPU-hour (crawler typically uses 2 DPUs for < 2 mins) |
| Lake Formation | $0.00 (governance is free) |
| Total | ~$0.10 - $0.25 |
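
The Glue figure can be checked with a little arithmetic. The sketch below uses the table's numbers (2 DPUs at $0.44 per DPU-hour) and additionally assumes a 10-minute billing minimum per crawler run. That minimum is an assumption taken from the public Glue pricing page and should be verified for your Region. `crawler_cost` is a hypothetical helper.

```shell
# Estimate the Glue crawler charge for one run, in USD.
# Usage: crawler_cost DPUS RUN_MINUTES [RATE_PER_DPU_HOUR] [MIN_BILLED_MINUTES]
crawler_cost() {
  awk -v dpus="$1" -v mins="$2" -v rate="${3:-0.44}" -v minb="${4:-10}" 'BEGIN {
    billed = (mins < minb) ? minb : mins   # apply the assumed billing minimum
    printf "%.2f\n", dpus * (billed / 60) * rate
  }'
}

crawler_cost 2 2          # 2 DPUs, 2-minute run billed as 10 minutes: prints 0.15
crawler_cost 2 2 0.44 0   # same run with no billing minimum: prints 0.03
```

Under the assumed minimum, a single crawler run lands inside the ~$0.10 - $0.25 total. Without it, the run would cost only a few cents.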

Stretch Challenge

Row-Level Filtering: Modify the Data Cell Filter in Step 4 to only show users where zipcode = '90210'. This demonstrates how to handle data residency requirements (e.g., ensuring local analysts only see local data).
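
One possible starting point for the challenge is sketched below. It creates a second filter instead of editing the first; the name `local_zip_filter` and both helper functions are hypothetical. The filter expression is passed as an argument because the crawler may infer zipcode as a numeric column, in which case the quotes around 90210 would need to be dropped.

```shell
# Build the JSON for a row-level data cells filter.
# Usage: row_filter_json "<YOUR_ACCOUNT_ID>" [FILTER_EXPRESSION]
row_filter_json() {
  account_id="$1"
  default_expr="zipcode = '90210'"       # expression from the challenge text
  filter_expr="${2:-$default_expr}"      # e.g. "zipcode = 90210" for a numeric column
  printf '{"TableCatalogId":"%s","DatabaseName":"lab_privacy_db","TableName":"raw","Name":"local_zip_filter","RowFilter":{"FilterExpression":"%s"},"ColumnNames":["user_id","name","email","zipcode"]}\n' \
    "$account_id" "$filter_expr"
}

# Create the filter in Lake Formation from that JSON.
# Usage: create_row_filter "<YOUR_ACCOUNT_ID>" [FILTER_EXPRESSION]
create_row_filter() {
  aws lakeformation create-data-cells-filter \
    --table-data-cells-filter "$(row_filter_json "$@")"
}
```

Running `row_filter_json "<YOUR_ACCOUNT_ID>"` on its own prints the payload, which is a convenient way to review it before creating anything.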

Concept Review

Data Privacy Pillars on AWS

\begin{tikzpicture}[node distance=2cm]
  \draw[thick, fill=blue!10] (0,0) circle (1.5cm) node[align=center] {\textbf{Encryption}\\KMS / TLS};
  \draw[thick, fill=red!10] (2.5,0) circle (1.5cm) node[align=center] {\textbf{Governance}\\Lake Formation};
  \draw[thick, fill=green!10] (1.25,-2) circle (1.5cm) node[align=center] {\textbf{Discovery}\\Macie / Glue};
  \node at (1.25, 1) {\textbf{Data Privacy Triad}};
\end{tikzpicture}

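  • Data Masking: Replacing sensitive data with functional aliases. Lake Formation data filters do not rewrite values; they hide sensitive columns, rows, or cells from a principal, which serves a similar purpose.
  • PII Identification: The process of finding sensitive strings (SSNs, emails). Amazon Macie automates this for data in S3 using machine learning and pattern matching.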
  • Principle of Least Privilege: Granting users only the specific columns and rows they need to perform their jobs.

| Feature | IAM Policy | Lake Formation |
|---|---|---|
| Granularity | Resource/API level | Row, Column, Cell level |
| Ease of Use | Complex JSON policies | Visual Grant/Revoke |
| Cross-Account | Manual Role Assumption | Automated Data Sharing |
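
To make PII identification concrete, here is a small local sketch that flags columns whose values look like emails, phone numbers, or card numbers. It is not how Macie works. The regular expressions are assumptions fitted to the formats in this lab's users.csv, and `scan_pii` is a hypothetical helper.

```shell
# Flag CSV columns whose values match simple PII patterns.
# Usage: scan_pii FILE.csv
scan_pii() {
  awk -F',' '
    NR == 1 { ncols = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^[^@ ]+@[^@ ]+\.[^@ ]+$/) hit[i] = "email"
        else if ($i ~ /^[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/) hit[i] = "phone"
        else if ($i ~ /^[0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])+$/) hit[i] = "card"
      }
    }
    END { for (i = 1; i <= ncols; i++) if (i in hit) print name[i] ": " hit[i] }
  ' "$1"
}

# Recreate the lab's sample data in a temporary file and scan it.
sample=$(mktemp)
cat > "$sample" <<'EOF'
user_id,name,email,phone,credit_card,zipcode
1,John Doe,john@example.com,555-0101,1234-5678-9012,90210
2,Jane Smith,jane@example.com,555-0102,9876-5432-1098,10001
EOF

scan_pii "$sample"
# email: email
# phone: phone
# credit_card: card
```

Note that the scan flags email, while the Step 4 filter leaves that column visible. Findings like this are the input for deciding which columns a filter should expose.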
