Data Classification & Sensitive Data Management

This guide covers the critical aspects of identifying and protecting sensitive data within the AWS ecosystem, specifically focusing on the requirements for the AWS Certified Developer - Associate (DVA-C02) exam.

Learning Objectives

After studying this guide, you should be able to:

Define and differentiate between PII, PHI, and other data classification levels.
Implement data masking and sanitization techniques within application code.
Identify AWS services used to discover and secure sensitive data.
Apply encryption strategies for environment variables and secrets.

Key Terms & Glossary

Term	Definition	Example
PII	Personally Identifiable Information; any data that can be used to identify a specific individual.	Social Security Number (SSN), Full Name, Home Address.
PHI	Protected Health Information; health-related data created, used, or disclosed in the course of providing health care.	Medical records, health insurance IDs, lab results.
Data Masking	The process of obscuring specific data within a database or log to protect sensitive information while keeping it functional.	Showing only the last four digits of a credit card: `****-****-****-1234`.
Sanitization	The process of removing sensitive information from a data set so that it can be safely used for other purposes.	Removing user IDs and names from a dataset before sending it to a data science team.
Tokenization	Replacing sensitive data with a non-sensitive equivalent (token) that has no extrinsic value.	Using a randomly generated string to represent a credit card number in a transaction log.
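
To make the tokenization entry concrete, here is a minimal sketch of a token vault. The `TokenVault` class, its method names, and the `tok_` prefix are hypothetical and exist only for illustration; a real vault would be a hardened, access-controlled service rather than an in-memory dictionary.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value never maps to two tokens.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it cannot be reversed without the vault.
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
card_token = vault.tokenize("4111-1111-1111-1111")
print(card_token)  # safe to write to a transaction log
```

Unlike masking, the original value can be recovered, but only by a caller with access to the vault.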

The "Big Idea"

[!IMPORTANT] You cannot protect what you haven't identified. Data Classification is the foundational step of the security lifecycle. It involves categorizing data based on its sensitivity so that appropriate security controls (encryption, access policies, logging) can be applied proportionally to the risk.

Formula / Concept Box

Data Sensitivity Levels

Level	Description	Typical Handling
Public	Low risk; intended for public consumption.	No encryption required; open access.
Internal	Moderate risk; only for company employees.	Standard IAM permissions.
Confidential	High risk; unauthorized disclosure causes harm.	Encryption at rest and in transit required.
Restricted	Critical risk; strictly regulated (PII/PHI).	KMS Encryption, MFA, Fine-grained access, Audit logs.

Hierarchical Outline

Understanding Sensitive Data Types
- Personally Identifiable Information (PII): Direct identifiers (SSN) vs. Indirect identifiers (Zip code + Birthdate).
- Protected Health Information (PHI): Governed by HIPAA in the US; requires Business Associate Addendum (BAA) with AWS.
- Payment Card Industry (PCI): Standards for credit card data (PCI DSS).
Managing Sensitive Data in Code
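- Environment Variables: Never store raw secrets in configuration files such as app.yaml or in Lambda environment variables, which anyone with permission to view the function configuration can read. Store them in AWS Secrets Manager or Systems Manager Parameter Store (SecureString) and fetch them at runtime.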
- Logging: Ensure logs do not contain PII. Implement interceptor patterns to scrub data before it reaches CloudWatch.
Data Masking & Sanitization
- Static Masking: Permanent change to data (useful for dev/test environments).
- Dynamic Masking: On-the-fly masking based on the user's permissions.
AWS Discovery Tools
- Amazon Macie: Uses machine learning and pattern matching to automatically discover and classify sensitive data, such as PII, in Amazon S3 buckets.
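
The advice on environment variables above can be sketched as follows. This is a minimal example, assuming the secret is stored in Secrets Manager as a JSON string; the secret name `prod/payments/api-key` and the helper `load_api_credentials` are hypothetical. The client is passed in as a parameter so the helper can be exercised without AWS access.

```python
import json


def load_api_credentials(client, secret_id):
    """Fetch a JSON secret at runtime instead of reading it from an env var."""
    # get_secret_value returns the decrypted secret in "SecretString".
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def lambda_handler(event, context):
    # Imported here so the helper above stays testable without boto3.
    import boto3

    client = boto3.client("secretsmanager")
    # Hypothetical secret name; replace with your own.
    credentials = load_api_credentials(client, "prod/payments/api-key")
    # Use the credentials here, and never log or return them.
    return {"statusCode": 200, "loaded_keys": len(credentials)}
```

The function's execution role needs `secretsmanager:GetSecretValue` on the secret. In practice you would likely cache the result outside the handler to avoid one API call per invocation.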

Visual Anchors

[Figure: Data Classification Workflow]

[Figure: Sensitivity Pyramid]

Definition-Example Pairs

Term: Deterministic Masking
- Definition: A technique where the same original value is always replaced with the same masked value, preserving data integrity for joins.
- Example: Masking "User_123" as "ABC_999" across both the Sales and Marketing databases so records can still be linked without revealing the true ID.
Term: Redaction
- Definition: The absolute removal of sensitive data from a document or log file.
- Example: A Lambda function that uses a Regular Expression (Regex) to find any 9-digit number resembling an SSN and replacing it with [REDACTED] before printing to stdout.
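
Both techniques above can be sketched in a few lines. This is one possible approach, not the only one: deterministic masking is done here with a keyed hash (HMAC-SHA256), and redaction with a regular expression. The names `MASKING_KEY`, `deterministic_mask`, and `redact_ssn` are hypothetical, and the key shown is a placeholder that would normally come from a secret store.

```python
import hashlib
import hmac
import re

# Placeholder key for illustration; load a real key from a secret store.
MASKING_KEY = b"demo-key-load-from-a-secret-store"

# Nine digits, with or without SSN-style hyphens, bounded by non-word characters.
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")


def deterministic_mask(value: str, key: bytes = MASKING_KEY) -> str:
    """Map the same input to the same masked ID every time."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a longer prefix lowers collision risk.
    return "ID_" + digest[:12]


def redact_ssn(text: str) -> str:
    """Replace anything that looks like an SSN with a fixed marker."""
    return SSN_PATTERN.sub("[REDACTED]", text)


# The same user ID yields the same masked ID in both datasets.
print(deterministic_mask("User_123") == deterministic_mask("User_123"))
print(redact_ssn("Customer SSN 123-45-6789 verified"))
```

Because the mask is keyed, two databases can still be joined on the masked ID as long as both use the same key. The regex is deliberately broad, so it also redacts 9-digit numbers that are not SSNs.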

Worked Examples

Example 1: Securely Handling PII in Lambda

Scenario: A developer needs to process a user's profile, which includes a phone number. A reference to the phone number must be logged for debugging, but the full number must not be visible to DevOps engineers viewing CloudWatch.

Step-by-Step Solution:

Fetch Data: Retrieve the user record from DynamoDB.
Apply Masking Logic:
```python
def mask_phone(phone_number):
    # Return a masked form that exposes only the last 4 digits
    return "***-***-" + phone_number[-4:]


raw_phone = "555-0199-1234"
print(f"Processing user: {mask_phone(raw_phone)}")
```
Result: The log displays Processing user: ***-***-1234, protecting the user's privacy while still letting the developer confirm that processing ran and match the record by its last four digits.
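
Calling a masking function by hand at every log statement is easy to forget. The interceptor pattern from the outline can be sketched with a standard-library `logging.Filter` that scrubs each record before it is emitted. The class name `PhoneMaskingFilter` and the phone pattern are assumptions chosen to match the format used in this example.

```python
import logging
import re

# Matches 555-0199-1234 style numbers (also 3-3-4) and captures the last 4 digits.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3,4}-(\d{4})\b")


class PhoneMaskingFilter(logging.Filter):
    """Mask phone numbers in every log record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Merge the arguments into the message first so they get scrubbed too.
        message = record.getMessage()
        record.msg = PHONE_PATTERN.sub(r"***-***-\1", message)
        record.args = ()
        return True  # keep the record, now masked


logger = logging.getLogger("profile")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.addFilter(PhoneMaskingFilter())
logger.addHandler(handler)

logger.info("Processing user: %s", "555-0199-1234")
```

In Lambda the runtime already attaches a handler to the root logger, so there you would likely add the filter to that existing handler instead of creating a new one.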

Example 2: Automating Discovery with Macie

Scenario: An organization has 500 S3 buckets and is worried that employees might have accidentally uploaded spreadsheets containing PHI.

Step-by-Step Solution:

Enable Amazon Macie: Turn on the service in the AWS Management Console.
Run Discovery Job: Select the target S3 buckets for a "Sensitive Data Discovery" job.
Review Findings: Macie identifies a file patient_list.csv in a public bucket with a high "Sensitivity Score" because it detected patterns matching medical record numbers.
Remediate: Use an EventBridge rule that matches Macie findings to invoke a Lambda function that blocks public access on the bucket, or to alert the security team.
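
The remediation step can be sketched as a Lambda function targeted by the EventBridge rule. This is a minimal example under assumptions: the bucket name is read from `detail.resourcesAffected.s3Bucket.name`, which should be verified against a real Macie finding event, and the helper names are hypothetical. The S3 client is passed in so the logic can be tested without AWS access.

```python
def bucket_from_macie_finding(event):
    """Extract the affected bucket name from a Macie finding event."""
    # Assumed field path; confirm it against a sample event in your account.
    return event["detail"]["resourcesAffected"]["s3Bucket"]["name"]


def block_public_access(s3_client, bucket_name):
    """Turn on all four S3 Block Public Access settings for the bucket."""
    s3_client.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


def lambda_handler(event, context):
    # Imported here so the helpers above stay testable without boto3.
    import boto3

    bucket = bucket_from_macie_finding(event)
    block_public_access(boto3.client("s3"), bucket)
    return {"remediated_bucket": bucket}
```

Automatic remediation can break legitimate public content, so some teams route the finding to a notification first and make the bucket private only after review.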

Checkpoint Questions

What is the primary difference between PII and PHI?
If you need to store an API key for a third-party service, which AWS service is preferred for encryption and automatic rotation?
True or False: Data Masking is primarily used for encryption at rest.
Which AWS service uses Machine Learning to find PII in S3?
Why should you avoid using the "Full Name" of a customer in a Lambda function's Environment Variable?

Answers

PII (Personally Identifiable Information) identifies an individual generally (like an SSN), whereas PHI (Protected Health Information) is specifically related to medical and health records.
AWS Secrets Manager (supports rotation; Parameter Store does not natively rotate secrets).
False. Data Masking is used to obscure data for display or testing; encryption at rest uses mathematical algorithms (like AES-256) to secure data on disk.
Amazon Macie.
Environment variables are often visible in the AWS Console and in deployment logs; sensitive data should be fetched at runtime from a secure store like Secrets Manager.
