
Free AWS Certified Data Engineer - Associate (DEA-C01) Study Resources

This comprehensive Certified Data Engineer - Associate (DEA-C01) hub provides study notes, practice tests, flashcards, and hands-on labs, all supported by a personal AI tutor to help you master the AWS Certified Data Engineer - Associate (DEA-C01) certification.

  • 635 Practice Questions
  • 9 Mock Exams
  • 153 Study Notes
  • 680 Flashcard Decks
  • 2 Source Materials

AWS Certified Data Engineer - Associate (DEA-C01) Study Notes & Guides

153 AI-generated study notes covering the full AWS Certified Data Engineer - Associate (DEA-C01) curriculum. Showing 10 complete guides below.


AWS Data Engineering: Addressing Changes to Data Characteristics

Address changes to the characteristics of data


AWS Data Engineering: Addressing Changes to Data Characteristics

This guide covers Task 2.4.2 of the AWS Certified Data Engineer – Associate (DEA-C01) exam. It focuses on how data engineers manage the evolving nature of data, including schema drift, structural changes, and lifecycle management within the AWS ecosystem.

Learning Objectives

By the end of this guide, you will be able to:

  • Define Schema Evolution and identify strategies for handling Schema Drift.
  • Configure AWS Glue Crawlers to automatically detect and update metadata.
  • Differentiate between tools used for schema conversion like AWS SCT and AWS DMS.
  • Implement data lifecycle policies in Amazon S3 and Amazon DynamoDB to manage data aging.
  • Establish Data Lineage to track changes across the data environment.

Key Terms & Glossary

  • Schema Drift: The phenomenon where source data systems change their structure (e.g., adding/removing columns) without notifying downstream consumers.
  • Data Catalog: A persistent metadata store (like AWS Glue Data Catalog) that provides a unified view of data across various sources.
  • Partition Projection: A technique in Amazon Athena that speeds up queries on highly partitioned tables by computing partition values from table configuration rather than retrieving partition metadata from the Glue Data Catalog.
  • TTL (Time to Live): A mechanism in DynamoDB that automatically deletes items from a table after a specific timestamp to reduce storage costs.
  • DQDL (Data Quality Definition Language): A declarative language used in AWS Glue to define rules for validating data quality.

The "Big Idea"

In a modern data architecture, change is the only constant. Data characteristics—such as its schema, volume, and velocity—evolve over time. A Data Engineer's primary responsibility is to build resilient pipelines that can gracefully handle these changes without manual intervention. This involves balancing automated discovery (Glue Crawlers) with rigid governance (Lake Formation) and cost-optimized storage (S3 Lifecycle).

Formula / Concept Box

| Concept | Tool / Rule | Impact |
| --- | --- | --- |
| Schema Updates | AWS Glue Crawler (UpdateTable) | Automatically adds new columns to the Data Catalog. |
| Structural Mapping | AWS Schema Conversion Tool (SCT) | Converts source database schemas to a different target engine (e.g., Oracle to Aurora). |
| Data Aging | S3 Lifecycle Policies | Automates transitions: S3 Standard → S3 Glacier → Expiration. |
| Item Expiration | DynamoDB TTL | Deletes data based on an epoch timestamp attribute without using RCU/WCU. |
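The Data Aging entry above maps directly onto an S3 lifecycle configuration. Below is a minimal Python sketch of the payload shape that boto3's `put_bucket_lifecycle_configuration` accepts; the rule ID, prefix, and day counts are illustrative, not prescriptive.

```python
def build_lifecycle_config(prefix, glacier_after_days, expire_after_days):
    """Payload shape for s3.put_bucket_lifecycle_configuration(
    Bucket=..., LifecycleConfiguration=...). Values are illustrative."""
    return {
        "Rules": [
            {
                "ID": f"age-out-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # S3 Standard -> S3 Glacier after N days...
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                # ...then expire (delete) the objects entirely.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }

config = build_lifecycle_config("raw/", glacier_after_days=90, expire_after_days=365)
```

Passing `config` as the `LifecycleConfiguration` argument would enact the Standard → Glacier → Expiration flow for everything under the `raw/` prefix.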

Hierarchical Outline

  • I. Schema Evolution & Management
    • AWS Glue Data Catalog: The central metadata repository for AWS Lake House architectures.
    • Glue Crawlers: Automate schema discovery; can be configured to add new columns or mark deleted columns as deprecated.
    • Schema Versioning: Keeping history of schema changes to ensure backward compatibility for Athena/Redshift Spectrum queries.
  • II. Addressing Structural Changes
    • AWS SCT: Used for heterogeneous migrations; transforms schema, functions, and stored procedures.
    • AWS DMS: Performs the actual data movement; can handle simple schema changes during replication.
  • III. Managing Data Characteristics over Time
    • S3 Versioning: Protects against accidental deletes and allows rollbacks to previous states of data.
    • Partitioning Strategies: Using date-based partitioning (year=2023/month=10/day=24) to optimize query performance as data grows.
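A date-based partition prefix like the one above is easiest to keep consistent when generated in code. A small helper (the function name is ours, not an AWS API):

```python
from datetime import date

def partition_prefix(d: date) -> str:
    """Hive-style date partition prefix (year=/month=/day=) that Athena
    and Glue can prune on; zero-padded for consistent lexical sorting."""
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}"

prefix = partition_prefix(date(2023, 10, 24))
# prefix == "year=2023/month=10/day=24"
```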

Visual Anchors

Data Cataloging Workflow


S3 Lifecycle Transition Logic


Definition-Example Pairs

  • Term: Heterogeneous Migration
    • Definition: Moving data between different database engines where the schema must be converted.
    • Example: Migrating an on-premises Microsoft SQL Server database to an Amazon Aurora PostgreSQL cluster using AWS SCT to rewrite the SQL syntax.
  • Term: Data Lineage
    • Definition: A visual map of the data's journey, showing where it originated and how it was transformed.
    • Example: Using Amazon SageMaker ML Lineage Tracking to see which specific S3 dataset was used to train a specific version of an AI model.

Worked Examples

Example 1: Handling Added Columns in a CSV Batch

Scenario: A marketing team adds a promo_code column to their daily CSV upload in S3. Your Athena queries are failing because the Data Catalog doesn't know about this column.

Solution:

  1. Run the AWS Glue Crawler assigned to that S3 path.
  2. Set the crawler configuration to "Update the table definition in the data catalog" for any schema changes.
  3. The Crawler detects the new column and updates the Metadata. Athena can now query the new column immediately without manual SQL ALTER TABLE commands.
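Steps 1–2 above correspond to the `SchemaChangePolicy` argument of the Glue `create_crawler` API. A hedged sketch of the argument shape — crawler name, role ARN, path, and database are placeholders:

```python
def crawler_kwargs(name, role_arn, s3_path, database):
    """Argument shape for glue.create_crawler(**kwargs); all names and
    ARNs below are hypothetical placeholders."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # "Update the table definition in the data catalog" maps to:
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }

kwargs = crawler_kwargs(
    "marketing-daily-csv",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://marketing-uploads/daily/",
    "marketing_db",
)
```

With `UPDATE_IN_DATABASE`, a rerun of the crawler adds the new promo_code column to the catalog table; `DEPRECATE_IN_DATABASE` marks removed columns instead of deleting them.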

Example 2: Optimizing DynamoDB Storage Costs

Scenario: A gaming app stores temporary session data in DynamoDB. This data is only needed for 24 hours.

Solution:

  1. Add a TimeToLive attribute to each item (format: Unix Epoch time).
  2. Enable TTL on the DynamoDB table, selecting that attribute.
  3. Result: DynamoDB automatically deletes expired sessions (typically within 48 hours of expiration), and these deletes do not consume Write Capacity Units (WCUs).
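A minimal sketch of step 1, computing the TTL attribute as a Unix epoch timestamp. The item shape and attribute name are illustrative; the attribute name only has to match the one selected when enabling TTL on the table.

```python
import time

def session_item(session_id, ttl_hours=24):
    """DynamoDB item with a Unix-epoch expiry attribute; the attribute
    name ("TimeToLive" here) must match the one chosen when enabling
    TTL on the table."""
    return {
        "session_id": session_id,
        "TimeToLive": int(time.time()) + ttl_hours * 3600,
    }

item = session_item("abc123")
```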

Checkpoint Questions

  1. What is the difference between AWS SCT and AWS DMS regarding schema changes?
  2. How does Partition Projection improve performance for highly partitioned data in S3?
  3. Which S3 feature allows you to recover a file that was overwritten by a script with incorrect data?
  4. When should you use AWS Glue DataBrew instead of a Glue ETL script?

Comparison Tables

| Feature | AWS Glue Crawler | AWS SCT |
| --- | --- | --- |
| Primary Purpose | Metadata Discovery (S3/RDS/NoSQL) | Schema Conversion (Database-to-Database) |
| Target Output | Glue Data Catalog Tables | SQL DDL Scripts / Converted Schema |
| Handling Change | Detects schema drift automatically | Manual re-run for structural redesigns |
| Use Case | Populating Data Lakes | Database Migrations |

Muddy Points & Cross-Refs

  • Crawler vs. Manual Entry: If your schema is extremely stable and you want to prevent unauthorized changes, manual entry is better. Crawlers are best for evolving datasets.
  • Partitioning vs. Indexing: In Redshift, use Sort Keys for performance; in S3/Athena, use Partitions (folders) to limit the amount of data scanned.
  • S3 Versioning vs. Backup: Versioning is for immediate recovery of specific objects; AWS Backup is for cross-region disaster recovery and compliance-level snapshots.

Analyzing Logs with AWS Services: A Study Guide

Analyze logs by using AWS services (for example, Athena, CloudWatch Logs Insights, Amazon OpenSearch Service)


Analyzing Logs with AWS Services

This study guide covers the core AWS services used to aggregate, process, and analyze log data for operational health, security auditing, and performance optimization.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Amazon CloudWatch, Amazon Athena, and Amazon OpenSearch Service for log analysis.
  • Identify the correct service for analyzing CloudTrail API calls and VPC Flow Logs.
  • Explain the role of AWS Glue and Amazon EMR in processing unstructured or large-scale log volumes.
  • Utilize SQL and Natural Language queries to extract insights from log streams.

Key Terms & Glossary

  • Serialization/Deserialization: The process of converting data from a readable representation (text) into a format suited for storage or transmission (often compressed binary), and back again.
  • Log Group: A group of log streams that share the same retention, monitoring, and access control settings in CloudWatch.
  • PII (Personally Identifiable Information): Sensitive data that must be identified (e.g., using Amazon Macie) and potentially masked during log processing.
  • Hot Data: Data that is frequently accessed and stored on high-performance storage (used primarily in Amazon OpenSearch Service).
  • Anomaly Detection: Using baselines to identify deviations in API call volumes or error rates (e.g., CloudTrail Insights).

The "Big Idea"

In a distributed cloud environment, logs are the "source of truth" for both security and operations. The core challenge is not just collecting logs, but normalizing diverse formats (application logs, system logs, API traces) so they can be queried at scale. AWS provides a tiered approach: CloudWatch for real-time monitoring, Athena for cost-effective SQL analysis on S3, and OpenSearch for complex, full-text interactive analytics.

Formula / Concept Box

| Feature | CloudWatch Logs Insights | Amazon Athena | Amazon OpenSearch Service |
| --- | --- | --- | --- |
| Data Source | CloudWatch Log Groups | Amazon S3 | OpenSearch Cluster (Hot Data) |
| Query Language | Specialized Query Syntax | Standard SQL | DSL / SQL / Lucene |
| Primary Use | Operational Troubleshooting | Compliance / Long-term Audit | Interactive Analytics / Search |
| Setup Effort | Zero (Managed) | Low (Define Schema) | Medium (Manage Cluster) |

Hierarchical Outline

  • 1. Native Logging Services
    • Amazon CloudWatch: Centralized store for application and AWS service logs. Includes alarms and dashboards.
    • AWS CloudTrail: Records API activity across the AWS account for governance and auditing.
  • 2. Interactive Analysis Tools
    • CloudWatch Logs Insights: Interactive querying of logs; supports natural language query generation and field auto-detection.
    • Amazon Athena: Serverless SQL queries on log data stored in S3 (VPC Flow Logs, CloudTrail, S3 Access Logs).
  • 3. Advanced Analytics & Visualization
    • Amazon OpenSearch Service: Distributed engine for log analytics, security intelligence, and full-text search.
    • Amazon Managed Grafana: Visualization tool to analyze metrics, logs, and traces across multiple AWS sources.
  • 4. Log Processing Pipelines
    • AWS Glue / Amazon EMR: Used for terabyte-scale logs or custom formats that require transformation before analysis.

Visual Anchors

Log Analysis Flowchart


Architecture: Log Ingestion and Processing


Definition-Example Pairs

  • CloudTrail Insights: Continuously analyzes management events to baseline API call volumes.
    • Example: An alert is triggered when the RunInstances API call volume spikes 300% above the normal baseline, indicating a potential security breach or script error.
  • VPC Flow Logs: Captures information about the IP traffic to and from network interfaces in a VPC.
    • Example: Using Athena to query Flow Logs to identify which specific IP addresses are being rejected by security group rules.
  • System Tables (Redshift): Internal tables used to monitor data warehouse performance.
    • Example: Querying STL_QUERY_METRICS to find the CPU usage and disk I/O of a specific long-running financial report.

Worked Examples

Example 1: CloudWatch Logs Insights Query

To find the number of errors per 5-minute bin in an application log:

```
fields @timestamp, @message
| filter @message like /Error/
| stats count(*) as errorCount by bin(5m)
| sort errorCount desc
```

Example 2: Querying CloudTrail Logs in Athena

If CloudTrail logs are stored in S3, you can use SQL to find who deleted a specific S3 bucket:

```sql
SELECT eventTime, userIdentity.arn, sourceIPAddress
FROM cloudtrail_logs
WHERE eventName = 'DeleteBucket'
  AND requestParameters LIKE '%my-target-bucket-name%'
ORDER BY eventTime DESC;
```

Checkpoint Questions

  1. Which service allows you to use natural language to generate queries for log data?
  2. If you have terabytes of unstructured custom logs, which two services are recommended for processing them into a queryable format?
  3. What is the main difference between Amazon Kendra and Amazon OpenSearch Service regarding query types?
  4. How long does it typically take for VPC Flow Logs to appear in a CloudWatch Log Group after configuration?

Comparison Tables

| Use Case | Recommended Service | Why? |
| --- | --- | --- |
| Finding specific API errors | CloudTrail Insights | Automatically baselines "normal" and flags anomalies. |
| Full-text search in logs | OpenSearch Service | Built on Apache Lucene; optimized for string matching and indexing. |
| Ad-hoc SQL on S3 files | Amazon Athena | Serverless; pay-per-query; no infrastructure to manage. |
| Debugging Lambda code | CloudWatch Logs | Native integration; Lambda automatically streams stdout/stderr here. |

Muddy Points & Cross-Refs

  • Athena vs. OpenSearch: Use Athena for cost-effective, occasional analysis of massive datasets (Data Lake). Use OpenSearch for frequent, interactive dashboarding and sub-second search latency (Hot data).
  • Glue vs. EMR: Both use Spark. Use AWS Glue for serverless, event-driven ETL. Use Amazon EMR for long-running, complex clusters where you need granular control over the Spark environment.
  • Serialization Pitfall: Remember that Athena requires a defined schema (DDL). If your logs change format, queries might fail unless you update the Glue Data Catalog or use JSON extraction functions.

[!TIP] When analyzing logs for the exam, always look for the keyword "SQL" (Athena), "Real-time/Dashboard" (OpenSearch), or "API/Audit" (CloudTrail).


Mastering Log Analysis with AWS Services: DEA-C01 Study Guide

Analyze logs with AWS services (for example, Athena, Amazon EMR, Amazon OpenSearch Service, CloudWatch Logs Insights, big data application logs)


Mastering Log Analysis with AWS Services

This guide covers the critical skills required for the AWS Certified Data Engineer - Associate (DEA-C01) regarding log analysis, monitoring, and auditing using AWS native tools like Athena, CloudWatch, and OpenSearch.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between CloudWatch Logs Insights, Amazon Athena, and Amazon OpenSearch for log analysis.
  • Configure AWS CloudTrail and CloudTrail Insights for API auditing.
  • Use Amazon EMR and AWS Glue for processing large-scale or unstructured log data.
  • Monitor Amazon Redshift using system tables and audit logs.
  • Apply Serialization/Deserialization (SerDe) concepts to log transformation.

Key Terms & Glossary

  • SerDe (Serialization/Deserialization): The process of converting data from one format to another (e.g., text to binary for storage, binary to text for reading).
  • CloudWatch Logs Insights: An interactive query service that uses a purpose-built query language to analyze logs in CloudWatch.
  • CloudTrail Insights: A feature that identifies unusual API activity by baselining normal operational patterns.
  • OpenSearch Dashboards: A visualization tool (formerly Kibana) for exploring data indexed in Amazon OpenSearch clusters.
  • STL Tables: System tables in Amazon Redshift used for monitoring query metrics and alerts.

The "Big Idea"

Logging is not just about storage; it is about observability and traceability. In the AWS ecosystem, log data flows from sources (EC2, Lambda, VPC) into central repositories (S3, CloudWatch). From there, the complexity and volume of the logs determine the tool: CloudWatch Insights for quick operational fixes, Athena for serverless SQL queries on S3 data lakes, and OpenSearch for real-time, interactive search and visualization.

Formula / Concept Box

| Feature | Primary Service | Key Attribute |
| --- | --- | --- |
| Ad-hoc SQL on S3 | Amazon Athena | Serverless, pay-per-query, no infrastructure management. |
| Real-time Search | Amazon OpenSearch | Low-latency, indexing, visualization-heavy. |
| Big Data / Custom Logic | Amazon EMR / Glue | Distributed processing (Spark/Hive) for petabyte scale. |
| Operational Triage | CloudWatch Insights | Natural language query generation; auto-detects log fields. |

Hierarchical Outline

  • I. Centralized Log Storage
    • Amazon S3: Durable, cost-effective storage class (Standard, Glacier) for long-term audits.
    • Amazon CloudWatch Logs: Real-time ingestion point for application and service logs.
  • II. Interactive Analysis Tools
    • CloudWatch Logs Insights: Interactively query logs; supports visualization via graphs.
    • Amazon Athena: Querying S3 logs directly using Standard SQL; integrates with Glue Data Catalog.
  • III. Advanced Search & Visualization
    • Amazon OpenSearch Service: Managed cluster for indexing logs for sub-second search results.
    • Amazon Managed Grafana: Visualizing metrics and logs across multiple AWS accounts.
  • IV. Auditing & Security
    • AWS CloudTrail: Tracks API calls; identifies "who, what, where, when."
    • CloudTrail Lake: Centralized, immutable store for long-term API query history.

Visual Anchors

Log Ingestion and Analysis Pipeline


Query Complexity vs. Data Scale


Definition-Example Pairs

  • Metric Filter
    • Definition: A feature in CloudWatch that searches for patterns in logs and turns them into numerical metrics.
    • Example: Searching for the string "404" in web server logs to create an alarm for broken links.
  • STL_ALERT_EVENT_LOG
    • Definition: A Redshift system table that records alerts (e.g., missing statistics) during query execution.
    • Example: A data engineer queries this table to find out why a specific ETL job is suddenly running slowly due to disk space constraints.
  • CloudTrail Insights
    • Definition: An anomaly detection tool for API management events.
    • Example: Receiving an alert because an IAM user who usually creates 2 S3 buckets a day suddenly creates 500 in an hour.
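The Metric Filter pair above boils down to turning pattern matches into a number. A few lines of Python illustrate that core behavior (the log lines are made up):

```python
def metric_filter_count(log_lines, pattern):
    """Count log events containing a pattern -- the data point a
    CloudWatch metric filter would publish for its metric."""
    return sum(1 for line in log_lines if pattern in line)

access_log = [
    "GET /index.html 200",
    "GET /missing.png 404",
    "GET /old-page 404",
]
broken_links = metric_filter_count(access_log, "404")  # an alarm threshold could fire on this value
```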

Worked Examples

Scenario: Identifying High-Traffic IPs in Web Logs

The Problem: You have 100GB of web server logs in an S3 bucket and need to find the top 5 IP addresses that accessed your site in the last 24 hours.

The Solution:

  1. Define Schema: Use an AWS Glue Crawler to scan the S3 bucket and create a table in the Glue Data Catalog.
  2. Query with Athena:
    ```sql
    SELECT remote_ip, COUNT(*) AS request_count
    FROM web_logs
    WHERE request_timestamp > current_timestamp - interval '1' day
    GROUP BY remote_ip
    ORDER BY request_count DESC
    LIMIT 5;
    ```
  3. Result: Athena returns the data as a CSV or displays it directly in the console for visualization.

Checkpoint Questions

  1. Which service provides natural language query generation to help users write log queries?
  2. True or False: Audit logging for Amazon Redshift is enabled by default.
  3. What is the main difference between Amazon Kendra and Amazon OpenSearch regarding query logic?
  4. When should you choose Amazon EMR over Amazon Athena for log analysis?

[!NOTE] Answer Key:

  1. CloudWatch Logs Insights.
  2. False (must be explicitly enabled to S3 or CloudWatch).
  3. Kendra uses Natural Language Processing (ML); OpenSearch uses SQL-like string matches and indexing.
  4. Choose EMR when logs are unstructured/custom and require complex Spark transformations or distributed processing at a massive scale.

Comparison Tables

| Service | Latency | Language | Best For… |
| --- | --- | --- | --- |
| Athena | Seconds/Minutes | Standard SQL | Ad-hoc analytics on S3 data lakes. |
| OpenSearch | Sub-second | SQL / DSL | Real-time monitoring and dashboards. |
| CloudWatch Insights | Seconds | Purpose-built | Quick operational troubleshooting. |
| CloudTrail Lake | Seconds | SQL | Long-term security and compliance audits. |

Muddy Points & Cross-Refs

  • SerDe Confusion: Remember that Serialization = Data to Storage (Binary); Deserialization = Storage to Readable (Text). Use this when configuring Athena or Glue to read custom formats.
  • Redshift Logging: Redshift logs aren't just one type. There are Connection logs, User logs, and User Activity logs. Each has a specific path in CloudWatch: /aws/redshift/cluster/<name>/<type>.
  • OpenSearch Serverless: If you don't want to manage nodes or clusters, remember you can now use Amazon OpenSearch Serverless.
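A quick Python illustration of the SerDe round trip from the muddy point above, using JSON as the storage format; any SerDe follows the same object → stored bytes → object cycle.

```python
import json

# Serialize a record for storage, then deserialize it back -- the
# object -> stored form -> object cycle every SerDe performs.
record = {"user_id": 42, "event": "login"}

stored = json.dumps(record).encode("utf-8")    # serialization: object -> bytes
restored = json.loads(stored.decode("utf-8"))  # deserialization: bytes -> object
```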

AWS Authorization Methods: RBAC, ABAC, and TBAC

Apply authorization methods that address business needs (role-based, tag-based, and attribute-based)


AWS Authorization Methods: RBAC, ABAC, and TBAC

This study guide focuses on designing and applying authorization mechanisms that align with business needs, specifically highlighting the differences between Role-Based (RBAC), Tag-Based (TBAC), and Attribute-Based (ABAC) access controls within the AWS ecosystem.

Learning Objectives

By the end of this guide, you should be able to:

  • Differentiate between RBAC, ABAC, and TBAC in the context of IAM and AWS Lake Formation.
  • Design IAM policies that implement the principle of least privilege using condition keys.
  • Implement fine-grained access control (row, column, and cell-level) using AWS Lake Formation tags.
  • Evaluate the best authorization method based on organizational scale and complexity.

Key Terms & Glossary

  • Principal: An entity (user, group, or role) that can make a request for an action or operation on an AWS resource.
  • RBAC (Role-Based Access Control): A traditional authorization model where permissions are assigned to roles, and users gain those permissions by assuming the role.
  • ABAC (Attribute-Based Access Control): An authorization strategy that defines permissions based on attributes (such as tags) of the user and the resource.
  • TBAC (Tag-Based Access Control): A specific implementation of ABAC where tags are the primary attributes used for evaluation; heavily utilized in AWS Lake Formation (LF-TBAC).
  • Least Privilege: The security practice of granting only the minimum permissions required to perform a task.
  • Permissions Boundary: An advanced feature where you use a managed policy to set the maximum permissions that an identity-based policy can grant to an IAM entity.

The "Big Idea"

In early cloud adoption, RBAC was sufficient: "If you are a Data Engineer, you get the Data Engineer role." However, as organizations grow to thousands of users and resources, managing individual roles for every project becomes an administrative nightmare. The shift toward ABAC/TBAC allows permissions to scale dynamically. Instead of creating new roles, you simply tag resources and users (e.g., Project=Omega). If the tags match, access is granted. This moves security from static "gatekeeping" to dynamic "logic-based" enforcement.

Formula / Concept Box

| Element | Purpose | Example |
| --- | --- | --- |
| Effect | Allow or deny | `"Effect": "Allow"` |
| Action | The specific API call | `"Action": ["s3:GetObject"]` |
| Resource | The ARN of the target | `"Resource": "arn:aws:s3:::my-bucket/*"` |
| Condition | Logic for when the policy applies | `"StringEquals": {"aws:ResourceTag/Project": "${aws:PrincipalTag/Project}"}` |

Hierarchical Outline

  1. Role-Based Access Control (RBAC)
    • Structure: Identity → Role → Policy.
    • Use Case: Broad departmental access (e.g., all Finance users access Finance bucket).
    • Limitation: "Role Explosion" — creating too many roles for specific projects.
  2. Attribute-Based Access Control (ABAC)
    • Structure: Policy logic checks for matching attributes on Principal and Resource.
    • Benefits: High scalability; permissions update automatically when tags change.
    • Mechanism: Uses Condition blocks in IAM JSON policies.
  3. Tag-Based Access Control (TBAC) in Lake Formation
    • LF-Tags: Specialized tags for the Data Catalog (Databases, Tables, Columns).
    • Inheritance: Tags applied at the Database level can be inherited by Tables and Columns.
    • Granularity: Enables row-level (PartiQL filters) and column-level (inclusion/exclusion) security.

Visual Anchors

Authorization Logic Flow


Identity vs. Resource Policy Intersection


[!NOTE] For most services, an "Allow" in either an identity-based OR resource-based policy is sufficient. However, for KMS, you must have permission in the Key Policy specifically.

Definition-Example Pairs

  • Term: Role-Based Access Control (RBAC)

    • Definition: Permissions based on job function.
    • Example: An AdminRole allows iam:* actions. Any user assigned to this role can manage all IAM settings regardless of which project they belong to.
  • Term: Attribute-Based Access Control (ABAC)

    • Definition: Permissions based on matching metadata between user and resource.
    • Example: A developer with the tag Project=Blue can only start EC2 instances that also have the tag Project=Blue. If they move to Project=Red, their tag is updated, and they automatically gain access to Red resources without changing the policy.
  • Term: Row-Level Security

    • Definition: Restricting access to specific records within a table based on data values.
    • Example: In a Sales table, a Regional Manager for 'West' is restricted via Lake Formation to only see rows where region_id = 'West'.
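The ABAC example above reduces to a tag comparison. Here is a toy Python version of that matching logic — a simplification for intuition, not the real IAM evaluator:

```python
def abac_allows(principal_tags, resource_tags, key):
    """Toy ABAC check: grant access only when the principal and the
    resource carry matching values for the given tag key."""
    return (
        key in principal_tags
        and key in resource_tags
        and principal_tags[key] == resource_tags[key]
    )

dev = {"Project": "Blue"}
assert abac_allows(dev, {"Project": "Blue"}, "Project")     # tags match
assert not abac_allows(dev, {"Project": "Red"}, "Project")  # tags differ
```

Retagging the developer to Project=Red flips the result with no policy change — exactly the scalability argument made above.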

Worked Examples

Example 1: Constructing an ABAC Policy

Scenario: Allow developers to manage S3 objects only if the object's Environment tag matches the user's Environment tag.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::company-data-lake/*",
      "Condition": {
        "StringEquals": {
          "s3:ExistingObjectTag/Environment": "${aws:PrincipalTag/Environment}"
        }
      }
    }
  ]
}
```

Example 2: Lake Formation Cell-Level Security

Scenario: A data analyst needs access to the Customers table but must not see the SSN column, and can only see customers from the UK.

  1. Step 1: In Lake Formation, create a Data Filter.
  2. Step 2: Define the column filter (Exclude ssn).
  3. Step 3: Define the row filter (PartiQL: country = 'UK').
  4. Step 4: Grant the SELECT permission to the analyst's IAM role using this specific filter.

Checkpoint Questions

  1. Which authorization method is most effective for preventing "Role Explosion" in large, fast-growing organizations?
  2. In Lake Formation, if you apply an LF-Tag to a Database, what happens to the tables within that database by default?
  3. True or False: A Permissions Boundary can be used to grant a user additional permissions they don't already have.
  4. Which AWS service is specifically used to manage fine-grained access (rows/columns) for Amazon S3 data used by Athena and EMR?

Comparison Tables

| Feature | RBAC | ABAC / TBAC |
| --- | --- | --- |
| Primary Logic | User role / job title | Tags / attributes |
| Scalability | Low (requires more roles as it grows) | High (dynamic, based on metadata) |
| Management | Centralized in IAM roles | Decentralized via tagging |
| Granularity | Coarse-grained | Fine-grained (down to rows/columns) |
| Best For | Internal admin tasks | Multi-tenant data lakes |

Muddy Points & Cross-Refs

  • Policy Evaluation Logic: Remember that an Explicit Deny always wins. Even if an ABAC policy allows access, a Service Control Policy (SCP) or Permissions Boundary that denies it will block the user.
  • Cross-Account Access: When accessing a resource in another account, you need permissions in both the identity-based policy (Account A) and the resource-based policy (Account B).
  • Lake Formation vs. IAM: Lake Formation doesn't replace IAM; it works with it. You still need IAM permissions to access the Lake Formation APIs, but Lake Formation handles the data-level permissions (the "Who can see this row?" logic).
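The "explicit Deny always wins" rule above can be captured in a toy evaluator — a deliberate simplification, since real evaluation also folds in SCPs, permissions boundaries, and session policies:

```python
def evaluate(decisions):
    """Toy policy evaluation order: an explicit Deny always wins, then
    at least one Allow is required; the default is an implicit deny."""
    if "Deny" in decisions:
        return "Deny"
    if "Allow" in decisions:
        return "Allow"
    return "ImplicitDeny"

assert evaluate(["Allow", "Deny"]) == "Deny"  # explicit deny beats the allow
assert evaluate(["Allow"]) == "Allow"
assert evaluate([]) == "ImplicitDeny"         # no matching statement
```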

[!TIP] For the Exam: If the question mentions "scale," "dynamic," or "frequent project changes," think ABAC. If it mentions "standardized job functions," think RBAC.


Applying IAM Policies to Roles, Endpoints, and Services

Apply IAM policies to roles, endpoints, and services (for example, S3 Access Points, AWS PrivateLink)


Applying IAM Policies to Roles, Endpoints, and Services

This study guide focuses on the critical skill of securing AWS resources by applying granular Identity and Access Management (IAM) policies. This is a core competency for the AWS Certified Data Engineer – Associate exam, specifically regarding data privacy, governance, and authentication mechanisms.

Learning Objectives

  • Distinguish between different IAM policy types (identity-based, resource-based, and permissions boundaries).
  • Configure IAM roles for service-to-service communication using the principle of least privilege.
  • Implement specialized access controls like S3 Access Points and VPC Endpoints (PrivateLink).
  • Evaluate effective permissions when multiple policy types overlap.

Key Terms & Glossary

  • Principal: An entity (user, role, or account) that can perform actions on AWS resources.
  • IAM Role: An identity with specific permissions that can be assumed by anyone (users or services) who needs them, providing temporary security credentials.
  • Service-Linked Role: A unique type of IAM role that is linked directly to an AWS service and predefined by the service for its own use.
  • ARN (Amazon Resource Name): A standardized format to uniquely identify AWS resources across all of AWS.
  • S3 Access Point: A named network endpoint with a dedicated access policy that describes how data can be accessed using that endpoint.
  • AWS PrivateLink: Technology that provides private connectivity between VPCs and AWS services without exposing data to the internet.

The "Big Idea"

In a data engineering ecosystem, security is not just about "who" has access, but "how" and "from where" that access occurs. By combining IAM Roles (identities) with Resource-Based Policies (on the data itself) and Network Endpoints (the path to the data), you create a multi-layered defense. This "Defense in Depth" ensures that even if a credential is leaked, the data remains protected by network constraints and resource-level locks.

Formula / Concept Box

IAM Policy Structure

Every IAM policy statement contains these four core elements:

| Element | Description | Example |
| --- | --- | --- |
| Effect | Whether the statement allows or denies access. | `"Effect": "Allow"` |
| Action | The specific API operation(s) being permitted. | `"Action": "s3:GetObject"` |
| Resource | The specific AWS resource(s) the action applies to. | `"Resource": "arn:aws:s3:::my-bucket/*"` |
| Condition | Optional: when the policy is in effect. | `"Condition": {"IpAddress": {"aws:SourceIp": "1.2.3.4/32"}}` |

Hierarchical Outline

  1. IAM Policy Types
    • Identity-Based: Attached to users/roles; defines what an identity can do.
    • Resource-Based: Attached to resources (e.g., S3 buckets, SQS queues); defines who can access the resource.
    • Permissions Boundaries: A managed policy used to set the maximum permissions that an identity-based policy can grant.
  2. Access Delegation & Roles
    • Service Roles: Assumed by AWS services (e.g., Lambda, EMR) to interact with other resources.
    • Cross-Account Access: Using roles to allow a principal in Account A to access resources in Account B safely.
  3. Modern S3 Security
    • S3 Access Points: Simplifies managing data access for shared datasets; unique policies for different applications.
    • Block Public Access: An account-level or bucket-level guardrail to prevent accidental exposure.
  4. Network-Level IAM (Endpoints)
    • Interface VPC Endpoints: Uses PrivateLink to keep traffic within the AWS backbone.
    • Endpoint Policies: Resource-based policies attached to a VPC endpoint to control which principals can use it.

Visual Anchors

Policy Evaluation Logic


S3 Access Point Architecture


Definition-Example Pairs

  • Service-Linked Role
    • Definition: A role predefined by an AWS service that includes all the permissions the service requires to call other AWS services on your behalf.
    • Example: An AWSServiceRoleForAutoScaling allows EC2 Auto Scaling to launch or terminate instances when your scaling policies are triggered.
  • Least-Privilege Principle
    • Definition: Granting only the specific permissions required to perform a task and nothing more.
    • Example: Instead of granting s3:* to a Lambda function, you grant s3:GetObject and restrict the resource to arn:aws:s3:::my-app-data/logs/*.

Worked Examples

Scenario: Cross-Account S3 Access

Goal: An EC2 instance in Account A (Dev) needs to read data from an S3 bucket in Account B (Production).

Step 1: Create a Role in Account B (Production) Define a Trust Policy that allows Account A to assume the role.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::ACCOUNT_A_ID:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Step 2: Attach Permissions to the Role in Account B Attach a policy allowing s3:GetObject on the specific bucket.
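A minimal permissions policy for this step might look like the following sketch (the bucket name `prod-data-bucket` is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::prod-data-bucket/*"
    }
  ]
}
```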

Step 3: Grant Permission in Account A (Dev) Attach an identity-based policy to the EC2 instance profile in Account A allowing it to call sts:AssumeRole on the ARN of the role created in Step 1.
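The identity-based policy attached to the instance profile in Account A could be sketched as follows (the role name `CrossAccountS3ReadRole` is a placeholder for the role created in Step 1):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::ACCOUNT_B_ID:role/CrossAccountS3ReadRole"
    }
  ]
}
```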

Checkpoint Questions

  1. Can a Permissions Boundary grant access to a resource if the Identity-based policy is missing?
  2. What is the main advantage of using an S3 Access Point over a single large bucket policy for a shared dataset?
  3. Why should you avoid using long-term IAM user credentials for application authentication?
Answers
  1. No. Permissions boundaries only limit the maximum permissions; they cannot grant access on their own.
  2. It prevents a single bucket policy from becoming overly complex and reaching the size limit as more users/applications are added.
  3. Long-term credentials (access keys) increase the risk of permanent compromise if leaked; roles use temporary credentials that expire automatically.

Comparison Tables

AWS Managed vs. Customer Managed Policies

| Feature | AWS Managed | Customer Managed |
| --- | --- | --- |
| Creation | Created and maintained by AWS. | Created and maintained by you. |
| Editability | Cannot be edited. | Fully customizable. |
| Updates | AWS adds new permissions automatically. | You must update permissions manually. |
| Scope | Broad (e.g., ReadOnlyAccess). | Precise (Least Privilege). |

Muddy Points & Cross-Refs

  • Service Role vs. Service-Linked Role: This is a common point of confusion. A Service Role is a standard IAM role you create for a service to assume. A Service-Linked Role is a special role owned and managed by the service itself—you cannot modify its permissions.
  • Public Access: Remember that S3 Block Public Access settings override any bucket policies or ACLs that attempt to grant public access.
  • Cross-Ref: For more on auditing these permissions, study AWS CloudTrail and IAM Access Analyzer (which checks for unintended external access).
Study Guide (940 words)

AWS Storage Services: Purpose-Built Data Stores and Vector Indexing

Apply storage services to appropriate use cases (for example, using indexing algorithms like Hierarchical Navigable Small Worlds [HNSW] with Amazon Aurora PostgreSQL and using Amazon MemoryDB for fast key/value pair access)

Read full article

AWS Storage Services: Purpose-Built Data Stores and Vector Indexing

This guide focuses on selecting the appropriate AWS storage service for specific performance, cost, and functional requirements. It highlights modern advancements such as vector indexing (HNSW) for AI/ML and ultra-fast in-memory processing.

Learning Objectives

After studying this guide, you should be able to:

  • Identify the correct AWS storage service based on access patterns (e.g., key-value vs. relational).
  • Explain the role of Hierarchical Navigable Small Worlds (HNSW) indexing in Amazon Aurora PostgreSQL.
  • Differentiate between Amazon MemoryDB and Amazon ElastiCache for high-speed data access.
  • Select appropriate vector index types (HNSW vs. IVF) for similarity search workloads.
  • Map data types (structured, semi-structured, graph) to their optimal AWS database services.

Key Terms & Glossary

  • Vector Embedding: A numerical representation of data (text, images) that allows for similarity searching based on distance in a multi-dimensional space.
  • HNSW (Hierarchical Navigable Small Worlds): An indexing algorithm used for efficient Approximate Nearest Neighbor (ANN) searches in high-dimensional vector data.
  • IVF (Inverted File Index): A vector indexing method that partitions the vector space into clusters to speed up search by narrowing the search area.
  • Sub-millisecond Latency: Response times under 1ms, typically achieved by in-memory data stores like MemoryDB.
  • ACID Compliance: Atomicity, Consistency, Isolation, Durability—properties that guarantee reliable database transactions (Standard for Aurora/RDS).
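To build intuition for the similarity searches these services perform, the sketch below computes the cosine similarity of two small vectors with plain `awk`; real vector stores compute the same kind of distance across millions of high-dimensional embeddings, which is why ANN indexes like HNSW exist. The vectors here are made-up sample data.

```bash
# Cosine similarity of two 3-dimensional vectors (toy example).
# Parallel vectors give 1.0; orthogonal vectors give 0.0.
awk 'BEGIN {
  split("1 2 3", a); split("2 4 6", b)          # b = 2*a, so similarity is 1
  for (i = 1; i <= 3; i++) {
    dot += a[i] * b[i]                          # dot product
    na  += a[i] ^ 2; nb += b[i] ^ 2             # squared norms
  }
  printf "%.4f\n", dot / (sqrt(na) * sqrt(nb))  # prints 1.0000
}'
```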

The "Big Idea"

AWS advocates for Purpose-Built Databases. Instead of forcing all data into a single relational database, data engineers should select tools that match the specific shape and speed of the workload. A modern application might use Aurora for transactional data, MemoryDB for high-speed sessions, and OpenSearch for full-text search, all working in concert to provide a scalable architecture.

Formula / Concept Box

| Feature | Amazon MemoryDB | Amazon Aurora (with pgvector) | Amazon DynamoDB |
| --- | --- | --- | --- |
| Primary Engine | Redis-compatible | PostgreSQL/MySQL | NoSQL (Key-Value) |
| Primary Goal | Ultra-fast performance + durability | Relational + vector search | Massively scalable key-value |
| Typical Latency | Microseconds | Milliseconds | Single-digit milliseconds |
| Vector Support | Limited (Redis Search) | HNSW / IVF | No (requires integration) |

Hierarchical Outline

  • I. High-Performance Key-Value Storage
    • Amazon MemoryDB: Redis-compatible, in-memory, but with Multi-AZ Durability. Ideal for microservices and banking ledgers.
    • Amazon ElastiCache: Best for non-durable caching (speed only). Data is lost if the cache fails/restarts.
  • II. Vector Search and AI Workloads
    • Amazon Aurora PostgreSQL: Supports pgvector extension.
    • HNSW Indexing: High precision, faster query speed, but higher memory usage during index build.
    • IVF Indexing: Lower memory footprint, faster build times, but potentially lower recall/accuracy than HNSW.
  • III. Specialized Databases
    • Amazon Neptune: Graph data (social connections, fraud networks).
    • Amazon OpenSearch: Log analytics and semantic search.
    • Amazon Redshift: OLAP (Analytics) and Data Warehousing.

Visual Anchors

Storage Selection Flowchart


Vector Space Concept (HNSW vs. IVF)


Definition-Example Pairs

  • Graph Database (Amazon Neptune): A database optimized for representing relationships between entities.
    • Example: Identifying fraudulent user accounts by tracing common IP addresses and credit card numbers used across multiple accounts.
  • In-Memory Database (MemoryDB): A database that keeps its entire data set in RAM for speed but logs transactions to multiple AZs for safety.
    • Example: A real-time leaderboard for a global gaming application where updates must be instant but scores cannot be lost.
  • Vector Search (Aurora pgvector): Searching for data based on semantic meaning rather than keywords.
    • Example: Searching an image catalog for "sunset over mountains" by comparing the vector representation of the query to the vectors of the images.

Worked Examples

Example 1: Selecting for Low Latency and Durability

Scenario: A financial service needs a key-value store for transaction processing. They require sub-millisecond response times but cannot risk losing any data if a node fails.

  • Incorrect Choice: ElastiCache (not durable; data in RAM is volatile).
  • Correct Choice: Amazon MemoryDB. It uses a distributed transactional log to ensure that even though data is served from RAM, it is written to disk across multiple Availability Zones.

Example 2: Implementing Vector Search for RAG

Scenario: A developer is building a Retrieval-Augmented Generation (RAG) system using Amazon Bedrock. They need to store millions of document embeddings and retrieve the most relevant ones within 50ms.

  • Implementation: Enable the pgvector extension on an Amazon Aurora PostgreSQL instance. Use the HNSW index type for the vector column to ensure high-speed retrieval of the nearest neighbors with high accuracy.
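The implementation above could be sketched in SQL as follows (a non-authoritative sketch: the table name, column names, and embedding dimension are hypothetical, and HNSW support assumes a pgvector version that includes the `hnsw` access method):

```sql
-- Enable the pgvector extension on the Aurora PostgreSQL instance
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical table of document embeddings (1536 dimensions assumed)
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)
);

-- HNSW index for approximate nearest-neighbor search using cosine distance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Retrieve the 5 most similar documents to a query embedding
-- (the literal vector below is a placeholder)
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
LIMIT 5;
```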

Checkpoint Questions

  1. Which service would you choose for a social media application's "friend-of-a-friend" recommendation feature? (Answer: Amazon Neptune)
  2. What is the primary difference between MemoryDB and ElastiCache regarding data safety? (Answer: MemoryDB is durable across multiple AZs; ElastiCache is primarily a volatile cache)
  3. In vector search, which indexing algorithm is generally faster for queries at the cost of higher memory usage: IVF or HNSW? (Answer: HNSW)
  4. Which NoSQL service is best suited for simple, massive-scale key-value lookups with single-digit millisecond latency? (Answer: Amazon DynamoDB)

Comparison Tables

Vector Indexing Comparison

| Feature | HNSW (Hierarchical Navigable Small Worlds) | IVF (Inverted File Index) |
| --- | --- | --- |
| Search Speed | Very fast | Fast (once clusters are pruned) |
| Memory Usage | High (builds a graph in memory) | Low (uses centroids and clusters) |
| Accuracy | High | Moderate (dependent on cluster count) |
| Best Use Case | Small to medium datasets where speed is king | Very large datasets with memory constraints |

Muddy Points & Cross-Refs

  • HNSW vs. IVF Memory: Students often confuse memory usage. Remember: HNSW stands for Heavy memory usage because it builds a complex graph of connections between every data point.
  • MemoryDB vs. DynamoDB DAX: While both provide fast access, MemoryDB is a standalone Redis database, whereas DAX is a cache specifically for DynamoDB. If you need a full Redis API, use MemoryDB.
  • Cross-Ref: For more on how to generate the vectors used in Aurora, see Unit 4: Machine Learning and Bedrock Integration.
Curriculum Overview (875 words)

Curriculum Overview: AWS Audit Logs and Governance for Data Engineers

Audit Logs

Read full article

Curriculum Overview: AWS Audit Logs and Governance for Data Engineers

This curriculum provides a structured path to mastering the logging, monitoring, and auditing requirements necessary for the AWS Certified Data Engineer - Associate (DEA-C01) certification. It focuses on implementing robust audit trails to ensure data pipeline resiliency, security, and compliance.

Prerequisites

Before starting this module, students should possess the following foundational knowledge:

  • AWS Cloud Practitioner Essentials: Familiarity with core AWS services (S3, EC2, IAM).
  • IAM Fundamentals: Understanding of users, roles, and policies to manage permissions.
  • Data Format Basics: Ability to read and interpret JSON (the primary format for AWS logs).
  • SQL Basics: Proficiency in standard SQL for querying logs via Amazon Athena.

Module Breakdown

| Module | Title | Primary Services | Difficulty |
| --- | --- | --- | --- |
| 1 | Fundamentals of AWS CloudTrail | CloudTrail, CloudTrail Lake | Beginner |
| 2 | Centralized Logging with CloudWatch | CloudWatch Logs, Insights | Intermediate |
| 3 | Service-Specific Audit Configurations | Amazon Redshift, Amazon S3, EMR | Intermediate |
| 4 | Advanced Log Analysis & Visualization | Amazon Athena, OpenSearch, QuickSight | Advanced |
| 5 | Compliance and Governance Workflows | AWS Config, Macie, EventBridge | Advanced |

Learning Objectives per Module

Module 1: Fundamentals of AWS CloudTrail

  • Configure CloudTrail Trails: Move beyond the default 90-day event history to create permanent, multi-region trails.
  • Distinguish Event Types: Understand the difference between Management Events (control plane) and Data Events (e.g., S3 object-level actions).
  • Querying with CloudTrail Lake: Execute SQL-based queries on activity logs without managing complex ETL pipelines.

Module 2: Centralized Logging with CloudWatch

  • Log Ingestion: Configure AWS services (Lambda, Glue, EMR) to push application-level logs to CloudWatch Logs.
  • Insights & Filtering: Use CloudWatch Logs Insights to perform high-speed searches and aggregate log data.
  • Alarm Integration: Create CloudWatch Alarms to trigger SNS notifications when specific error patterns appear in logs.

Module 3: Service-Specific Audit Configurations

  • Redshift Auditing: Enable connection, user, and user activity logs (Note: This must be explicitly enabled; it is not on by default).
  • S3 Server Access Logging: Enable bucket-level access logging to capture every request made to a specific bucket.
  • EMR Debugging: Access and analyze logs for large-scale distributed processing clusters.

Module 4: Advanced Log Analysis

  • Schema Definition: Use AWS Glue Crawlers to catalog log files stored in S3 for Athena querying.
  • OpenSearch Integration: Deploy OpenSearch (formerly Elasticsearch) for full-text search and real-time dashboarding of log data.

Visual Anchors

Log Flow Architecture


Audit Choice Matrix


Success Metrics

To demonstrate mastery of this curriculum, the learner must be able to:

  • Metric 1: Successfully query a CloudTrail log to identify the specific IAM user who deleted an AWS Glue job within the last 24 hours.
  • Metric 2: Configure a Redshift cluster to export audit logs to an S3 bucket and verify the logs appear in the specified prefix.
  • Metric 3: Build a CloudWatch Logs Insights query that identifies the top 5 most frequent error codes in a Lambda function log group.
  • Metric 4: Describe the specific use cases for S3 Storage Lens versus CloudTrail for monitoring data access patterns.

Real-World Application

[!IMPORTANT] Scenario: The "Bad Actor" Investigation A financial services company notices that a sensitive dataset in S3 was modified outside of business hours.

  • Step 1: Use AWS CloudTrail to identify the PutObject API call that overwrote the object (S3 objects are modified by being rewritten; there is no UpdateObject API) and find the source IP and IAM credentials used.
  • Step 2: Cross-reference with AWS Config to see the state of the bucket's encryption policy at the time of the change.
  • Step 3: Use Amazon Athena to scan historical S3 Server Access Logs to determine if the same IP has been performing reconnaissance (Read-Only activity) over the past month.
  • Result: The data engineer provides a complete "Chain of Custody" report for compliance officers, satisfying GDPR/HIPAA requirements for auditability.

Comparison of Primary Audit Tools

| Feature | AWS CloudTrail | Amazon CloudWatch Logs | Amazon S3 Access Logs |
| --- | --- | --- | --- |
| Focus | "Who did what?" (API level) | "What happened?" (App level) | "Who accessed the file?" |
| Data Format | JSON | Plain text / JSON | Space-delimited |
| Query Tool | CloudTrail Lake / Athena | Logs Insights | Athena |
| Real-time? | ~15 min delay | Near real-time | Periodic delivery |
Hands-On Lab (850 words)

Hands-On Lab: Implementing and Analyzing Audit Logs in AWS

Audit Logs

Read full article

Hands-On Lab: Implementing and Analyzing Audit Logs in AWS

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges.

Prerequisites

Before starting this lab, ensure you have the following:

  • An AWS Account with Administrator access.
  • AWS CLI installed and configured with credentials (aws configure).
  • Basic knowledge of JSON and the AWS Console.
  • IAM Permissions to manage S3, CloudTrail, and CloudWatch Logs.

Learning Objectives

By the end of this lab, you will be able to:

  1. Create and configure a multi-region AWS CloudTrail trail.
  2. Enable S3 Data Events for granular tracking of object-level activity.
  3. Integrate CloudTrail with Amazon CloudWatch Logs for real-time monitoring.
  4. Analyze audit logs using the CloudTrail Event History and CloudWatch Log Insights.

Architecture Overview


Step-by-Step Instructions

Step 1: Create an S3 Bucket for Log Storage

CloudTrail requires an S3 bucket to store the log files for long-term auditing and compliance.

```bash
# Generate a unique bucket name
BUCKET_NAME="brainybee-audit-logs-$(aws sts get-caller-identity --query Account --output text)"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME --region <YOUR_REGION>
```
Console alternative
  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket.
  3. Bucket name: brainybee-audit-logs-<ACCOUNT_ID>.
  4. Keep other settings as default and click Create bucket.

Step 2: Create a CloudWatch Log Group

To enable real-time analysis, we need a destination for CloudTrail events in CloudWatch.

```bash
aws logs create-log-group --log-group-name /aws/cloudtrail/audit-log-lab
```
Console alternative
  1. Navigate to CloudWatch > Logs > Log groups.
  2. Click Create log group.
  3. Log group name: /aws/cloudtrail/audit-log-lab.
  4. Click Create.

Step 3: Configure the CloudTrail Trail

Now we will create the trail that captures all management events and routes them to S3 and CloudWatch.

```bash
# Create the trail
aws cloudtrail create-trail \
  --name LabAuditTrail \
  --s3-bucket-name $BUCKET_NAME \
  --is-multi-region-trail \
  --cloud-watch-logs-log-group-arn $(aws logs describe-log-groups --log-group-name-prefix /aws/cloudtrail/audit-log-lab --query "logGroups[0].arn" --output text) \
  --cloud-watch-logs-role-arn <YOUR_CLOUDTRAIL_IAM_ROLE_ARN>

# Start logging
aws cloudtrail start-logging --name LabAuditTrail
```

[!NOTE] In the console, AWS automatically creates the IAM role for CloudWatch integration. In the CLI, you must provide a role with permissions to create log streams and put log events.
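The permissions policy for that CLI-supplied role could be sketched like this (the region, account ID, and log-group ARN are placeholders; the role's trust policy must also allow `cloudtrail.amazonaws.com` to assume it):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:us-east-1:ACCOUNT_ID:log-group:/aws/cloudtrail/audit-log-lab:*"
    }
  ]
}
```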

Console alternative
  1. Navigate to CloudTrail > Trails > Create trail.
  2. Trail name: LabAuditTrail.
  3. Storage location: Choose "Use existing S3 bucket" and select the bucket from Step 1.
  4. CloudWatch Logs: Check "Enabled".
  5. Log group: Select the group from Step 2.
  6. IAM Role: Choose "New" and let AWS create the default role.
  7. Click Next, then Create trail.

Step 4: Generate and View Activity

Perform actions in your account to generate logs (e.g., create an S3 folder or modify a security group).

```bash
# Create a dummy object to generate a 'PutObject' event (if data events are enabled)
aws s3 cp hello.txt s3://$BUCKET_NAME/test-activity.txt
```

Checkpoints

  1. Verify Trail Status: Run aws cloudtrail get-trail-status --name LabAuditTrail. The IsLogging field should be true.
  2. Check S3 Delivery: Navigate to your S3 bucket. You should see a folder structure starting with AWSLogs/.
  3. CloudWatch Logs: Navigate to the Log Group. You should see log streams being populated with JSON entries of your recent API calls.

Troubleshooting

| Problem | Potential Cause | Fix |
| --- | --- | --- |
| No logs in S3 | Bucket policy | Ensure the S3 bucket policy allows cloudtrail.amazonaws.com to PutObject. |
| Logs not appearing in CloudWatch | IAM role permissions | Verify the CloudWatch Logs role has logs:CreateLogStream and logs:PutLogEvents permissions. |
| Delay in logs | Propagation time | CloudTrail logs can take up to 15 minutes to appear in CloudWatch/S3. |
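For the bucket-policy fix in the troubleshooting table, a working policy typically resembles the sketch below (bucket name and account ID are placeholders; CloudTrail checks the bucket ACL before writing, hence the two statements):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::brainybee-audit-logs-ACCOUNT_ID"
    },
    {
      "Sid": "AWSCloudTrailWrite",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::brainybee-audit-logs-ACCOUNT_ID/AWSLogs/ACCOUNT_ID/*",
      "Condition": { "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" } }
    }
  ]
}
```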

Clean-Up / Teardown

To avoid charges, delete the resources created in this lab:

```bash
# Stop and delete the trail
aws cloudtrail stop-logging --name LabAuditTrail
aws cloudtrail delete-trail --name LabAuditTrail

# Delete the Log Group
aws logs delete-log-group --log-group-name /aws/cloudtrail/audit-log-lab

# Empty and delete the S3 bucket
aws s3 rb s3://$BUCKET_NAME --force
```

Cost Estimate

  • CloudTrail: The first management trail in each region is Free. Data events (if enabled) are charged at $0.10 per 100,000 events.
  • S3: Standard storage rates apply (negligible for small log files).
  • CloudWatch Logs: Ingestion is charged at ~$0.50/GB (depending on region). This lab will likely stay within the Free Tier limits.

Stretch Challenge

Enable S3 Data Events for your specific bucket. Use CloudWatch Logs Insights to write a query that identifies all DeleteObject calls made in the last hour.
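One possible shape for that Logs Insights query (CloudTrail events arrive in CloudWatch as JSON, so fields such as `eventName` can be filtered directly; the field selection here is illustrative, not the only correct answer):

```
fields @timestamp, userIdentity.arn, requestParameters.key
| filter eventName = "DeleteObject"
| sort @timestamp desc
| limit 50
```

Set the query time range to the last hour in the Logs Insights console before running it.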

Concept Review

| Feature | CloudTrail Event History | CloudTrail Trails |
| --- | --- | --- |
| Retention | 90 days | Indefinite (based on S3 lifecycle) |
| Scope | Management events only | Management + data events |
| Cost | Free | Paid (per event processed) |
| Multi-region | Single-region view | Can be multi-region |
Curriculum Overview (845 words)

Curriculum Overview: Authentication Mechanisms for AWS Data Engineering

Authentication Mechanisms

Read full article

Curriculum Overview: Authentication Mechanisms for AWS Data Engineering

This curriculum provides a comprehensive guide to implementing, managing, and auditing authentication within the AWS ecosystem, specifically tailored for the AWS Certified Data Engineer – Associate (DEA-C01). It covers the spectrum from basic IAM credentials to sophisticated identity federation and secret rotation strategies.


Prerequisites

Before starting this module, students should possess the following foundational knowledge:

  • Foundational AWS Knowledge: Familiarity with the AWS Management Console and the Shared Responsibility Model.
  • Basic Security Concepts: Understanding of the difference between Authentication (Who are you?) and Authorization (What can you do?).
  • Networking Basics: A baseline understanding of VPCs, Subnets, and Security Groups.
  • Data Literacy: Basic knowledge of how data flows between services like Amazon S3, AWS Glue, and Amazon Redshift.

Module Breakdown

| Module | Topic | Difficulty | Key Services |
| --- | --- | --- | --- |
| 1 | IAM Fundamentals & Identities | Beginner | IAM Users, Groups, Roles |
| 2 | Programmatic Auth & Secret Management | Intermediate | Secrets Manager, SSM Parameter Store |
| 3 | Cross-Service & Connectivity Auth | Intermediate | VPC Endpoints, Security Groups, PrivateLink |
| 4 | Enterprise Identity & Governance | Advanced | IAM Identity Center, Lake Formation, SSO |
| 5 | Service-Specific Auth (MSK, Redshift, OpenSearch) | Advanced | MSK IAM, Redshift Data Sharing |

Module Objectives

Module 1: IAM Fundamentals & Identities

  • Goal: Master the creation and management of IAM principals.
  • Objectives:
    • Differentiate between IAM Users (long-term credentials) and IAM Roles (temporary security tokens).
    • Implement the Principle of Least Privilege using custom IAM policies.
    • Configure trust relationships for service-linked roles (e.g., allowing Lambda to access S3).

Module 2: Programmatic Auth & Secret Management

  • Goal: Securely manage application-level credentials without hardcoding.
  • Objectives:
    • Implement automatic credential rotation using AWS Secrets Manager.
    • Store sensitive parameters (API keys, DB strings) in Systems Manager Parameter Store.
    • Compare the use cases for Secrets Manager vs. Parameter Store.

Module 3: Cross-Service & Connectivity Auth

  • Goal: Secure the network perimeter for data traffic.
  • Objectives:
    • Configure VPC Interface Endpoints for OpenSearch and Redshift.
    • Utilize S3 Gateway Endpoints to ensure data never leaves the AWS private network.
    • Enforce HTTPS-only protocols for sensitive data ingestion.

Module 4: Enterprise Identity & Governance

  • Goal: Scale authentication for large organizations.
  • Objectives:
    • Integrate IAM Identity Center with external Directory Services.
    • Apply fine-grained access control at the database, table, and column level via AWS Lake Formation.

Visual Anchors

Identity Flow Architecture


The Hierarchy of Authentication


Success Metrics

To demonstrate mastery of this curriculum, a student should be able to:

  1. Draft a Zero-Trust Policy: Write a JSON IAM policy that restricts access to a specific S3 prefix using ${aws:username} variables.
  2. Automate Rotation: Successfully configure a Lambda function to rotate a Redshift password in Secrets Manager every 30 days.
  3. Secure a Pipeline: Design a multi-service pipeline (EMR to Redshift) where all communication occurs over VPC Endpoints with no public IP addresses.
  4. Audit Access: Use AWS CloudTrail to identify which IAM principal deleted a specific Glue Table.
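For Metric 1, the `${aws:username}` pattern typically looks like the following sketch (the bucket name and prefix layout are placeholders): the policy variable expands to each caller's IAM user name at evaluation time, so one policy covers every user's personal prefix.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::team-data-bucket/home/${aws:username}/*"
    }
  ]
}
```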

[!IMPORTANT] For the DEA-C01 exam, remember that IAM Role-based authentication is the recommended best practice for internal AWS service-to-service communication, while IAM Users are primarily for external tools or CLI access.


Real-World Application

Authentication mechanisms are the "first line of defense" in any data engineering role. Understanding these tools is critical for:

  • Compliance (GDPR/HIPAA): Ensuring that only authorized personnel can view PII (Personally Identifiable Information) through fine-grained Lake Formation permissions.
  • Security Posture: Preventing data breaches caused by hardcoded credentials in GitHub or public S3 buckets.
  • Operational Efficiency: Using SSO (IAM Identity Center) to manage thousands of users through a single directory rather than managing individual IAM users.
  • Multi-tenant Architectures: Isolating data for different Lines of Business (LOBs) within a single MSK cluster or Redshift instance using IAM-based access control.
Comparison of Managed vs. Unmanaged Auth

| Feature | Managed (e.g., IAM Identity Center) | Unmanaged (e.g., DB-native users) |
| --- | --- | --- |
| Credential Storage | Centralized in AWS | Decentralized in DB engine |
| Auditability | Unified in CloudTrail | Scattered across service logs |
| Scalability | High (handles thousands of users) | Low (manual user creation) |
| Rotation | Automated via AWS tools | Often manual or requires custom scripts |
Hands-On Lab (945 words)

Lab: Implementing Secure Authentication with IAM Roles and Secrets Manager

Authentication Mechanisms

Read full article

Lab: Implementing Secure Authentication with IAM Roles and Secrets Manager

In this lab, you will apply industry-standard authentication mechanisms within an AWS environment. You will move away from risky long-term IAM user credentials and instead implement IAM Roles for service-to-service authentication and AWS Secrets Manager for secure credential storage and rotation.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for the EC2 instance and Secrets Manager secrets.

Prerequisites

  • An active AWS Account.
  • AWS CLI configured on your local machine with AdministratorAccess.
  • Basic familiarity with the Linux command line.
  • Access to a region where Amazon EC2 and AWS Secrets Manager are available (e.g., us-east-1).

Learning Objectives

  • Create and attach an IAM Role to an EC2 instance to eliminate hardcoded credentials.
  • Implement the Principle of Least Privilege using custom IAM policies.
  • Securely store and retrieve sensitive information using AWS Secrets Manager.
  • Verify authentication flows through the AWS CLI.

Architecture Overview

This diagram illustrates the flow of authentication. Instead of storing an Access Key on the EC2 instance, the instance "assumes" a role to gain temporary security credentials.


Step-by-Step Instructions

Step 1: Create a Least-Privilege IAM Policy

First, we define exactly what our data processor is allowed to do. We want it to list objects in a specific bucket and retrieve a specific secret.

```bash
# Create a policy file
cat <<EOF > lab-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::brainybee-lab-*", "arn:aws:s3:::brainybee-lab-*/*"]
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    }
  ]
}
EOF

# Create the IAM Policy
aws iam create-policy --policy-name DataEngineerLabPolicy --policy-document file://lab-policy.json
```
Console alternative
  1. Navigate to IAM > Policies > Create Policy.
  2. Select the JSON tab and paste the code above.
  3. Name it DataEngineerLabPolicy.

Step 2: Create the IAM Role and Instance Profile

Services cannot "assume" a role unless we grant them permission to do so via a Trust Policy.

```bash
# Create trust policy for EC2
cat <<EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the Role
aws iam create-role --role-name DataEngineerRole --assume-role-policy-document file://trust-policy.json

# Attach the policy from Step 1 (replace <YOUR_ACCOUNT_ID>)
aws iam attach-role-policy --role-name DataEngineerRole --policy-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:policy/DataEngineerLabPolicy

# Create Instance Profile (required for EC2 to use a role)
aws iam create-instance-profile --instance-profile-name DataEngineerInstanceProfile
aws iam add-role-to-instance-profile --instance-profile-name DataEngineerInstanceProfile --role-name DataEngineerRole
```

Step 3: Store a Secret in Secrets Manager

Instead of hardcoding a database password in your app, you will store it in the managed service.

```bash
aws secretsmanager create-secret --name "lab/db/password" \
  --description "Database password for data engineering lab" \
  --secret-string "{\"username\":\"admin\",\"password\":\"P@ssw0rd123!\"}"
```

[!TIP] In a production environment, you would enable Rotation to automatically change this password every 30-90 days.

Step 4: Launch EC2 with the Instance Profile

Now we launch a small instance and tell AWS to give it the identity we just created.

```bash
# Launch a t2.micro instance
aws ec2 run-instances --image-id ami-0c101f26f147fa7fd --count 1 --instance-type t2.micro \
  --iam-instance-profile Name=DataEngineerInstanceProfile \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=AuthenticationLab}]'
```

Checkpoints

  1. Verify Role Attachment: Navigate to the EC2 console. Select your instance and check the "IAM Role" field. It should say DataEngineerRole.
  2. Test Authentication: SSH into your instance (or use EC2 Instance Connect) and run:
     ```bash
     aws secretsmanager get-secret-value --secret-id lab/db/password
     ```
     If successful, you will see the JSON secret without ever having to run `aws configure` on that machine.

Visual Concept: IAM Policy Structure


Troubleshooting

| Error | Likely Cause | Fix |
| --- | --- | --- |
| An error occurred (AccessDenied) | The IAM policy does not have the correct ARN for the secret or bucket. | Check the Resource block in lab-policy.json. |
| InstanceProfile not found | There is a propagation delay in IAM. | Wait 60 seconds and try the command again. |
| Connection timeout | Security Group is not allowing SSH (port 22). | Update the VPC Security Group to allow your IP on port 22. |

Concept Review

| Mechanism | Best Use Case | Security Benefit |
| --- | --- | --- |
| IAM User | Humans accessing the Console/CLI. | Individual accountability. |
| IAM Role | Applications or services (EC2, Lambda). | No long-term credentials to leak. |
| Secrets Manager | Database credentials, API keys. | Automatic rotation and encryption. |
| Identity Center | Large organizations with many users. | Centralized SSO and directory sync. |

Clean-Up / Teardown

To avoid charges, delete these resources in order:

```bash
# 1. Terminate EC2 Instance (get the ID from the console or previous output)
aws ec2 terminate-instances --instance-ids <YOUR_INSTANCE_ID>

# 2. Delete Secret
aws secretsmanager delete-secret --secret-id lab/db/password --force-delete-without-recovery

# 3. Remove Role from Profile and Delete
aws iam remove-role-from-instance-profile --instance-profile-name DataEngineerInstanceProfile --role-name DataEngineerRole
aws iam delete-instance-profile --instance-profile-name DataEngineerInstanceProfile
aws iam detach-role-policy --role-name DataEngineerRole --policy-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:policy/DataEngineerLabPolicy
aws iam delete-role --role-name DataEngineerRole
aws iam delete-policy --policy-arn arn:aws:iam::<YOUR_ACCOUNT_ID>:policy/DataEngineerLabPolicy
```

Cost Estimate

  • EC2 t2.micro: Free Tier eligible (otherwise ~$0.0116/hour).
  • Secrets Manager: $0.40 per secret per month (pro-rated for this lab: <$0.01).
  • IAM: Free.

Stretch Challenge

Try to modify the IAM Policy so the EC2 instance can only retrieve the secret if it is accessed from within your specific VPC. Look up the aws:SourceVpc condition key in AWS documentation.
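A starting point for that condition might look like the statement below (the VPC ID is a placeholder). Note that `aws:SourceVpc` is only populated when the request travels through a VPC endpoint, so pairing this with an Interface VPC Endpoint for Secrets Manager is part of the challenge.

```json
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "*",
  "Condition": { "StringEquals": { "aws:SourceVpc": "vpc-0abc123de456f7890" } }
}
```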

More Study Notes (143)

Curriculum Overview: AWS Authorization Mechanisms for Data Engineers

Authorization Mechanisms

785 words

Lab: Implementing Least-Privilege Authorization with IAM Roles and Policies

Authorization Mechanisms

850 words

Automating Data Pipelines: Event-Driven Processing with Step Functions and Lambda

Automate data processing by using AWS services

940 words

Curriculum Overview: Automating Data Processing with AWS (DEA-C01)

Automate data processing by using AWS services

845 words

AWS Certified Data Engineer – Associate (DEA-C01): Curriculum Overview

AWS - Certified Data Engineer - Associate DEA-C01

895 words

Mastering Technical Data Catalogs: AWS Glue and Apache Hive

Build and reference a technical data catalog (for example, AWS Glue Data Catalog, Apache Hive metastore)

1,050 words

AWS Data Pipeline Engineering: Performance, Availability, and Resilience

Build data pipelines for performance, availability, scalability, resiliency, and fault tolerance

945 words

Data Engineering Study Guide: Integrating AWS Lambda with Amazon Kinesis

Call a Lambda function from Kinesis

864 words

Mastering Programmatic Access: AWS SDKs and Developer Tools for Data Engineering

Call SDKs to access Amazon features from code

1,085 words

Curriculum Overview: Cataloging and Schema Evolution (AWS Data Engineer Associate)

Cataloging and Schema Evolution

820 words

Lab: Mastering Schema Evolution with AWS Glue Crawlers

Cataloging and Schema Evolution

945 words

Configuring Encryption Across AWS Account Boundaries

Configure encryption across AWS account boundaries

945 words

AWS Lambda: Concurrency and Performance Optimization

Configure Lambda functions to meet concurrency and performance needs

925 words

AWS Data Store Selection & Configuration Guide

Configure the appropriate storage services for specific access patterns and requirements (for example, Amazon Redshift, Amazon EMR, Lake Formation, Amazon RDS, DynamoDB)

925 words

Mastering Data Source Connectivity: JDBC & ODBC in AWS

Connect to different data sources (for example, Java Database Connectivity [JDBC], Open Database Connectivity [ODBC])

925 words

Mastering AWS Custom Policies & The Principle of Least Privilege

Construct custom policies that meet the principle of least privilege

1,150 words

AWS Data Engineering: Consuming and Maintaining Data APIs

Consume and maintain data APIs

845 words

Mastering Data API Consumption and Creation on AWS

Consume data APIs

1,050 words

Mastering IP Allowlisting and Network Connectivity for Data Sources

Create allowlists for IP addresses to allow connections to data sources

945 words

Mastering AWS Data Catalogs: Business and Technical Metadata Management

Create and manage business data catalogs (for example, Amazon SageMaker Catalog)

945 words

Credential Management and Secret Rotation with AWS Secrets Manager

Create and rotate credentials for password management (for example, AWS Secrets Manager)

925 words

Mastering AWS IAM: Identities, Policies, and Endpoints

Create and update AWS Identity and Access Management (IAM) groups, roles, endpoints, and services

920 words

Mastering Custom IAM Policies: Beyond AWS Managed Defaults

Create custom IAM policies when a managed policy does not meet the needs

890 words

AWS Data APIs: Building the Front Door for Your Data Lake

Create data APIs to make data available to other systems by using AWS services

875 words

AWS Glue: Source and Target Connections for Data Cataloging

Create new source or target connections for cataloging (for example, AWS Glue)

1,050 words

Data Analysis and Querying Using AWS Services: Curriculum Overview

Data Analysis and Querying Using AWS Services

745 words

Lab: Building a Serverless Data Lake with AWS Glue and Amazon Athena

Data Analysis and Querying Using AWS Services

1,050 words

Curriculum Overview: Data Encryption and Masking in AWS

Data Encryption and Masking

680 words

Hands-On Lab: Implementing Data Encryption and PII Masking on AWS

Data Encryption and Masking

920 words

Curriculum Overview: Data Lifecycle Management (AWS DEA-C01)

Data Lifecycle Management

842 words

Hands-On Lab: Implementing Automated Data Lifecycle Management on AWS

Data Lifecycle Management

945 words

Curriculum Overview: Data Models and Schema Evolution

Data Models and Schema Evolution

845 words

Lab: Managing Schema Evolution with AWS Glue and Athena

Data Models and Schema Evolution

920 words

Curriculum Overview: Data Privacy and Governance

Data Privacy and Governance

820 words

Lab: Implementing Data Privacy and Governance on AWS

Data Privacy and Governance

1,050 words

Automating Data Quality Validation with AWS Glue and DQDL

Data Quality and Validation

945 words

Curriculum Overview: Data Quality and Validation (AWS DEA-C01)

Data Quality and Validation

685 words

Lab: Building a Real-Time Serverless Transformation Pipeline with Amazon Data Firehose and AWS Lambda

Data Transformation and Processing

925 words

AWS Data Engineering: Data Aggregation, Rolling Averages, Grouping, and Pivoting

Define data aggregation, rolling average, grouping, and pivoting

920 words

Mastering Data Quality Rules: AWS Glue Data Quality & DataBrew

Define data quality rules (for example, DataBrew)

920 words

Fundamentals of Distributed Computing for Data Engineering

Define distributed computing

1,245 words

Stateful vs. Stateless Data Transactions: AWS Data Engineering Guide

Define stateful and stateless data transactions

940 words

AWS Certified Data Engineer: Foundations of Big Data (The 5 Vs)

Define volume, velocity, and variety of data (for example, structured data, unstructured data)

945 words

Study Guide: Deleting Data to Meet Business and Legal Requirements

Delete data to meet business and legal requirements

948 words

AWS Logging, Monitoring, and Auditing for Data Engineers

Deploy logging and monitoring solutions to facilitate auditing and traceability

920 words

Data Optimization: Indexing, Partitioning, and Compression Strategies

Describe best practices for indexing, partitioning strategies, compression, and other data optimization techniques

945 words

Mastering CI/CD for Data Pipelines

Describe continuous integration and continuous delivery (CI/CD) (implementation, testing, and deployment of data pipelines)

1,085 words

AWS Data Engineering: Data Sampling Techniques & Quality Validation

Describe data sampling techniques

850 words

Data Structures and Algorithms for Data Engineering (DEA-C01)

Describe data structures and algorithms (for example, graph data structures and tree data structures)

925 words

AWS Data Governance Frameworks and Sharing Patterns

Describe governance data framework and data sharing patterns

890 words

Data Ingestion Replayability: AWS Implementation Guide

Describe replayability of data ingestion pipelines

895 words

AWS Managed vs. Unmanaged Services: A Strategic Study Guide

Describe the differences between managed services and unmanaged services

875 words

AWS Study Guide: Provisioned vs. Serverless Services

Describe tradeoffs between provisioned services and serverless services

920 words

AWS Data Engineer Associate: Vector Indexing (HNSW & IVF)

Describe vector index types (for example, HNSW, IVF)

890 words

Study Guide: Vectorization and Amazon Bedrock Knowledge Bases

Describe vectorization concepts (for example, Amazon Bedrock knowledge base)

870 words

Mastering AWS Data Schemas: Redshift, DynamoDB, and Lake Formation

Design schemas for Amazon Redshift, DynamoDB, and Lake Formation

1,145 words

Mastering AWS Glue Crawlers and Data Catalogs

Discover schemas and use AWS Glue crawlers to populate data catalogs

920 words

Encryption in Transit: Mastering Data Protection on the Wire

Enable encryption in transit or before transit for data

915 words

Establishing Data Lineage with AWS Tools

Establish data lineage by using AWS tools (for example, Amazon SageMaker ML Lineage Tracking and Amazon SageMaker Catalog)

865 words

S3 Lifecycle Management: Automating Data Expiration and Cost Optimization

Expire data when it reaches a specific age by using S3 Lifecycle policies

945 words

AWS Data Engineering: Extracting & Preparing Logs for Audits

Extract logs for audits

945 words

Data Governance and Permissions: Amazon Redshift Data Sharing

Grant permissions for data sharing (for example, data sharing for Amazon Redshift)

945 words

AWS Data Engineer: Implementing & Maintaining Serverless Workflows

Implement and maintain serverless workflows

940 words

Mastering Batch Ingestion Configuration for AWS Data Engineering

Implement appropriate configuration options for batch ingestion

864 words

Amazon Redshift: Data Migration and Remote Access Methods

Implement data migration or remote access methods (for example, Amazon Redshift federated queries, Amazon Redshift materialized views, Amazon Redshift Spectrum)

920 words

Data Privacy Strategies: Preventing Replication to Disallowed AWS Regions

Implement data privacy strategies to prevent backups or replications of data to disallowed AWS Regions

985 words

Study Guide: Implementing Data Skew Mechanisms

Implement data skew mechanisms

1,085 words

AWS Data Transformation Services: Comprehensive DEA-C01 Study Guide

Implement data transformation services based on requirements (for example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift)

925 words

Study Guide: Implementing PII Identification and Data Privacy

Implement PII identification (for example, Amazon Macie with Lake Formation)

925 words

AWS Data Store Selection: Cost and Performance Optimization

Implement the appropriate storage services for specific cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka [Amazon MSK])

920 words

Mastering Throttling and Rate Limits in AWS Data Engineering

Implement throttling and overcoming rate limits (for example, DynamoDB, Amazon RDS, Kinesis)

1,084 words

Data Integration Mastery: Combining Multiple Sources for AWS Data Engineering

Integrate data from multiple sources

1,050 words

Integrating Large Language Models (LLMs) for Data Processing

Integrate large language models (LLMs) for data processing

940 words

Study Guide: Integrating Migration Tools into Data Processing Systems

Integrate migration tools into data processing systems (for example, AWS Transfer Family)

1,050 words

DEA-C01: Integrating AWS Services for High-Volume Logging & Auditing

Integrate various AWS services to perform logging (for example, Amazon EMR in cases of large volumes of log data)

945 words

Data Consistency and Quality with AWS Glue DataBrew

Investigate data consistency (for example, DataBrew)

1,050 words

Mastering Data Sovereignty in AWS: A Guide for Data Engineers

Maintain data sovereignty

875 words

Lab: Monitoring and Auditing AWS Data Pipelines

Maintaining and Monitoring Data Pipelines

948 words

Maintaining and Monitoring Data Pipelines: Curriculum Overview

Maintaining and Monitoring Data Pipelines

820 words

Mastering Data Access with Amazon SageMaker Catalog

Manage data access through Amazon SageMaker Catalog projects

1,085 words

Amazon EventBridge: Managing Events and Schedulers for Data Pipelines

Manage events and schedulers (for example, Amazon EventBridge)

1,142 words

Managing Fan-In and Fan-Out for Streaming Data Distribution

Manage fan-in and fan-out for streaming data distribution

985 words

AWS Data Store Security: Managing Access, Locks, and Permissions

Manage locks to prevent access to data (for example, Amazon Redshift, Amazon RDS)

875 words

Managing Open Table Formats: Apache Iceberg for Data Engineering

Manage open table formats (for example, Apache Iceberg)

820 words

AWS Lake Formation: Centralized Governance and Fine-Grained Access Control

Manage permissions through AWS Lake Formation (for Amazon Redshift, Amazon EMR, Amazon Athena, and Amazon S3)

915 words

S3 Lifecycle Management: Automating Storage Tier Transitions

Manage S3 Lifecycle policies to change the storage tier of S3 data

945 words

Mastering Data Lifecycle: S3 Versioning and DynamoDB TTL

Manage S3 versioning and DynamoDB TTL

945 words

Optimizing Data Ingestion & Transformation Runtime

Optimize code to reduce runtime for data ingestion and transformation

945 words

Optimizing Container Usage for Data Engineering: Amazon ECS & EKS

Optimize container usage for performance needs (for example, Amazon Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service [Amazon ECS])

940 words

Cost Optimization Strategies for Data Processing (DEA-C01)

Optimize costs while processing data

875 words

AWS Data Engineering: Orchestrating Data Pipelines with MWAA and Step Functions

Orchestrate data pipelines (for example, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions)

895 words

AWS Data Ingestion: Building an Automated Batch Pipeline with S3, Lambda, and Glue

Perform data ingestion

1,050 words

Curriculum Overview: Performing Data Ingestion (AWS DEA-C01)

Perform data ingestion

820 words

Mastering Data Movement: Amazon S3 and Amazon Redshift COPY/UNLOAD Operations

Perform load and unload operations to move data between Amazon S3 and Amazon Redshift

875 words

Mastering Schema Conversion with AWS SCT and DMS

Perform schema conversion (for example, by using the AWS Schema Conversion Tool [AWS SCT] and AWS Database Migration Service [AWS DMS] Schema Conversion)

875 words

Curriculum Overview: Pipeline Orchestration and Programming

Pipeline Orchestration and Programming

785 words

Lab: Orchestrating Serverless Data Pipelines with AWS Step Functions

Pipeline Orchestration and Programming

1,142 words

Data Preparation for Transformation: AWS Glue DataBrew and SageMaker Unified Studio

Prepare data for transformation (for example, AWS Glue DataBrew and Amazon SageMaker Unified Studio)

945 words

Curriculum Overview: Programming Concepts for Data Engineering (AWS DEA-C01)

Programming Concepts

785 words

Lab: Building a Serverless Data Processor with AWS Lambda and Python

Programming Concepts

985 words

AWS Certified Data Engineer: Protecting Data with Resiliency and Availability

Protect data with appropriate resiliency and availability

1,184 words

Database Access and Authority: Amazon Redshift and AWS Security

Provide database users, groups, and roles access and authority in a database (for example, for Amazon Redshift)

945 words

Mastering Amazon Athena: Serverless SQL for Data Lakes

Query data (for example, Amazon Athena)

1,055 words

AWS Certified Data Engineer Associate: Reading Data from Batch Sources

Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)

925 words

Reading Data from Streaming Sources: AWS Data Engineer Study Guide

Read data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift)

1,142 words

Data Quality Engineering on AWS: Checks and Validation

Run data quality checks while processing the data (for example, checking for empty fields)

1,050 words

Curriculum Overview: Selecting Optimal Data Stores (AWS DEA-C01)

Selecting Optimal Data Stores

860 words

Lab: Implementing Optimal Data Store Strategies on AWS

Selecting Optimal Data Stores

845 words

AWS Data Engineering: Setting Up Event Triggers (S3 & EventBridge)

Set up event triggers (for example, Amazon S3 Event Notifications, EventBridge)

880 words

Mastering AWS IAM Roles: A Study Guide for Data Engineers

Set up IAM roles for access (for example, AWS Lambda, Amazon API Gateway, AWS CLI, AWS CloudFormation)

890 words

Mastering Schedulers and Orchestration in AWS

Set up schedulers by using Amazon EventBridge, Apache Airflow, or time-based schedules for jobs and crawlers

1,152 words

AWS Certified Data Engineer: Secure Credential Management

Store application and database credentials (for example, Secrets Manager, AWS Systems Manager Parameter Store)

890 words

DEA-C01 Study Guide: Synchronizing Partitions with Data Catalogs

Synchronize partitions with a data catalog

920 words

Transforming Data Formats: CSV to Apache Parquet in AWS

Transform data between formats (for example, from .csv to Apache Parquet)

1,145 words

Troubleshooting and Orchestrating Amazon Managed Workflows

Troubleshoot Amazon managed workflows

985 words

Mastering Data Transformation Troubleshooting & Performance Optimization

Troubleshoot and debug common transformation failures and performance issues

980 words

AWS Data Engineering: Troubleshooting and Maintaining Pipelines

Troubleshoot and maintain pipelines (for example, AWS Glue, Amazon EMR)

940 words

Study Guide: Troubleshooting Performance Issues in AWS Data Pipelines

Troubleshoot performance issues

945 words

Study Guide: Updating VPC Security Groups

Update VPC security groups

925 words

Mastering Amazon CloudWatch Logs: Configuration and Automation for Data Engineers

Use Amazon CloudWatch Logs to log application data (with a focus on configuration and automation)

1,185 words

Mastering Application Logging with Amazon CloudWatch Logs

Use Amazon CloudWatch Logs to store application logs

920 words

AWS Lambda Storage: Mounting Volumes for Data Pipelines

Use and mount storage volumes from within Lambda functions

1,350 words

Mastering Athena Notebooks with Apache Spark

Use Athena notebooks that use Apache Spark to explore data

985 words

Mastering AWS CloudTrail Lake: Centralized Logging and Analysis

Use AWS CloudTrail Lake for centralized logging queries

915 words

Mastering AWS CloudTrail for API Auditing and Governance

Use AWS CloudTrail to track API calls

1,184 words

Mastering AWS CloudTrail for API Tracking and Auditing

Use AWS CloudTrail to track API calls

860 words

Automating Data Processing with AWS Lambda: A Comprehensive Study Guide

Use AWS Lambda to automate data processing

875 words

Mastering Data Catalogs: Discovering and Consuming Data at Source

Use data catalogs to consume data from the data's source

942 words

Mastering SageMaker Unified Studio: Domains, Domain Units, and Projects

Use domain, domain units, and projects for SageMaker Unified Studio

925 words

AWS Key Management Service (KMS) & Data Encryption Guide

Use encryption keys to encrypt or decrypt data (for example, AWS Key Management Service [AWS KMS])

985 words

AWS Infrastructure as Code (IaC) for Data Engineering

Use infrastructure as code (IaC) for repeatable resource deployment (for example, AWS CloudFormation and AWS Cloud Development Kit [AWS CDK])

890 words

Mastering Infrastructure as Code (IaC) for Data Engineering

Use Infrastructure as Code (IaC) to deploy data engineering solutions

920 words

Monitoring and Alerting in AWS Data Pipelines

Use notifications during monitoring to send alerts

920 words

AWS Notification Services for Data Pipelines: Amazon SNS and SQS

Use notification services to send alerts (for example, Amazon Simple Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon SQS])

1,150 words

AWS Orchestration Services for Data ETL Pipelines

Use orchestration services to build workflows for data ETL pipelines (for example, Lambda, EventBridge, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows)

1,150 words

Mastering Programming Languages & Frameworks for AWS Data Engineering

Use programming languages and frameworks for data engineering (for example, Python, SQL, Scala, R, Java, Bash, PowerShell)

925 words

Software Engineering Best Practices for Data Engineering

Use software engineering best practices for data engineering (for example, version control, testing, logging, monitoring)

1,080 words

SQL Querying and Data Transformation: Amazon Redshift & Athena

Use SQL in Amazon Redshift and Athena to query data or to create views

925 words

AWS SAM: Packaging and Deploying Serverless Data Pipelines

Use the AWS Serverless Application Model (AWS SAM) to package and deploy serverless data pipelines (for example, Lambda functions, Step Functions, DynamoDB tables)

895 words

AWS Data Processing: EMR, Redshift, and Glue

Use the features of AWS services to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue)

948 words

AWS Certified Data Engineer: Verifying and Cleaning Data

Verify and clean data (for example, Lambda, Athena, QuickSight, Jupyter Notebooks, Amazon SageMaker Data Wrangler)

920 words

Mastering AWS Config: Tracking Account Configuration Changes

Viewing configuration changes that have occurred in an account (for example, AWS Config)

945 words

Mastering Data Visualization: Amazon QuickSight and AWS Glue DataBrew

Visualize data by using AWS services and tools (for example, DataBrew, Amazon QuickSight)

880 words

Ready to practice? Jump straight in — no sign-up needed.

Take practice tests, review flashcards, and read study notes right now.

Take a Practice Test

AWS Certified Data Engineer - Associate (DEA-C01) Practice Questions

Try 15 sample questions from a bank of 635. Answers and detailed explanations included.

Q1 (medium)

How does AWS Config represent and track the historical changes of a specific resource within an AWS account over time?

A.

It records a video session of the AWS Management Console to document manual configuration steps performed by users.

B.

It generates a Configuration Item (CI) as a point-in-time snapshot whenever a change in the resource state or its relationships is detected.

C.

It performs full account data backups on a fixed 24-hour cycle and calculates checksum differences between the daily snapshots.

D.

It creates a historical record only when an Amazon CloudWatch alarm is triggered, signaling that a performance threshold has been reached.


Correct Answer: B

AWS Config tracks resource changes through the use of Configuration Items (CIs).

  1. Capture Mechanism: Unlike systems that rely on fixed intervals or periodic backups, AWS Config is event-driven. It automatically generates a new CI whenever it detects a change in a resource's configuration attributes (like instance size or tags), metadata, or its relationships to other resources.
  2. Point-in-Time Snapshot: Each CI serves as a snapshot of the resource at that specific moment (t_n).
  3. Configuration History: The sequence of these CIs over time creates a 'Configuration History.' This allows administrators to perform 'diff' operations to see exactly what changed between two points in time (t_1 vs t_2).

Option B correctly explains this snapshot-based, change-triggered mechanism for auditing resource history.
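
The 'diff' operation in step 3 can be illustrated with a small sketch, where plain Python dictionaries stand in for two Configuration Items captured at t_1 and t_2 (this illustrates the concept only; it is not the AWS Config API):

```python
def diff_config_items(old, new):
    """Compare two point-in-time snapshots and report changed attributes."""
    changes = {}
    for key in old.keys() | new.keys():
        if old.get(key) != new.get(key):
            changes[key] = {"before": old.get(key), "after": new.get(key)}
    return changes

# Two hypothetical CIs for the same EC2 instance, captured at t_1 and t_2
ci_t1 = {"instanceType": "t2.micro", "tags": {"env": "dev"}}
ci_t2 = {"instanceType": "t3.small", "tags": {"env": "dev"}}

print(diff_config_items(ci_t1, ci_t2))
# {'instanceType': {'before': 't2.micro', 'after': 't3.small'}}
```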

Q2 (medium)

A data engineering team is managing an Amazon Kinesis Data Stream that feeds several independent microservices. As the team adds more consumer applications, they observe frequent ReadProvisionedThroughputExceeded exceptions across all consumers. Investigations show that the combined read requests from the applications exceed the default 2 MB/s per-shard limit. Which architectural solution effectively resolves this contention by ensuring that each consumer application receives its own dedicated throughput?

A.

Enable Amazon Kinesis Enhanced Fan-out for the consumers to provide each one with a dedicated 2 MB/s throughput pipe per shard.

B.

Perform a resharding operation to double the number of shards in the stream, thereby increasing the aggregate read throughput available to the shared consumers.

C.

Change the stream capacity mode to 'On-Demand' to remove the fixed 2 MB/s read limit and allow for automatic scaling of consumer throughput.

D.

Implement Amazon SQS queues as a buffer between the Kinesis Data Stream and each consumer application to manage bursty read requests.


Correct Answer: A

In Amazon Kinesis Data Streams, standard consumers share a fixed read throughput limit of 2 MB/s per shard. When multiple applications (fan-out) read from the same shard simultaneously, they compete for this 2 MB/s total capacity, leading to throughput contention and ReadProvisionedThroughputExceeded errors.

  • Enhanced Fan-out (EFO) is the correct solution. It allows developers to register consumers to use EFO, which provides each registered consumer with its own dedicated 2 MB/s throughput pipe per shard. This isolates consumers from one another and significantly reduces latency by using an HTTP/2 push-based delivery mechanism rather than standard polling.
  • Resharding (B) increases the total aggregate throughput of the stream but does not solve the fundamental problem of multiple consumers competing for the 2 MB/s limit on any specific shard containing the data they need.
  • On-Demand mode (C) scales the number of shards automatically but still maintains the standard 2 MB/s shared read limit per shard for non-EFO consumers.
  • SQS (D) can help buffer downstream processing but does not increase the read capacity of the Kinesis Data Stream itself; if the Kinesis read limit is exceeded, the data cannot even be pushed into SQS efficiently.

Therefore, the final answer is A.
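
The throughput contention behind this question can be modeled in a few lines. This is a toy calculation, not the Kinesis API; it assumes standard consumers split the shard limit evenly and ignores per-call limits:

```python
SHARD_READ_LIMIT_MBPS = 2.0  # standard per-shard read limit, shared by polling consumers

def per_consumer_throughput_mbps(num_consumers, enhanced_fan_out=False):
    """Ideal MB/s each consumer can read from a single shard."""
    if enhanced_fan_out:
        # EFO gives every registered consumer its own dedicated pipe
        return SHARD_READ_LIMIT_MBPS
    # Standard consumers split the shared shard limit between them
    return SHARD_READ_LIMIT_MBPS / num_consumers

print(per_consumer_throughput_mbps(4))                         # 0.5
print(per_consumer_throughput_mbps(4, enhanced_fan_out=True))  # 2.0
```

With four standard consumers on one shard, each gets only a quarter of the limit; registering them as EFO consumers restores the full 2 MB/s to each.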

Q3 (medium)

A company stores log files in an Amazon S3 bucket using the prefix logs/. To comply with data retention policies and minimize costs, the company needs to ensure that any object under this prefix is automatically and permanently deleted 90 days after it is created. The bucket does not have versioning enabled. Which S3 Lifecycle configuration should be applied to meet this requirement?

A.

Create a Lifecycle rule with a filter restricted to the logs/ prefix and an Expiration action set to 90 days.

B.

Create a Lifecycle rule for the entire bucket with a Transition action to move objects to S3 Glacier Deep Archive after 90 days.

C.

Create a Lifecycle rule with a filter restricted to the logs/ prefix and an Expiration action set to 90 days after the Last Accessed date.

D.

Create a Lifecycle rule for the entire bucket that transitions objects to S3 One Zone-IA after 30 days and then to S3 Standard-IA after 60 days.


Correct Answer: A

To automate the deletion of objects based on their age, you must use an S3 Lifecycle Expiration action. The 'Days' parameter in an expiration rule specifies the number of days after an object's creation that Amazon S3 will delete the object. Because the requirement is specific to the logs/ prefix, the rule must include a Filter to limit its scope; otherwise, it would apply to all objects in the bucket. In a bucket without versioning, the expiration action results in the permanent removal of the object.

  • Option B is incorrect because a Transition action only moves data to a different storage class; it does not delete it.
  • Option C is incorrect because S3 Lifecycle expiration rules are calculated from the creation date, not the last accessed date.
  • Option D is incorrect because it only moves data between storage classes and applies to the entire bucket rather than the specific prefix.

The effective expiration date is calculated by adding the number of days to the object's creation time and rounding the result to midnight UTC of the following day.
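
The rounding rule for the effective expiration date can be sketched as a small helper. This is a simplified model of the documented behavior (it assumes UTC timestamps and ignores edge cases such as rules applied to pre-existing objects):

```python
from datetime import datetime, timedelta, timezone

def effective_expiration(created_at, days):
    """Add `days` to the creation time, then round up to the next midnight UTC."""
    candidate = created_at + timedelta(days=days)
    midnight = candidate.replace(hour=0, minute=0, second=0, microsecond=0)
    if candidate > midnight:
        midnight += timedelta(days=1)
    return midnight

created = datetime(2024, 1, 10, 15, 30, tzinfo=timezone.utc)
print(effective_expiration(created, 90))  # 2024-04-10 00:00:00+00:00
```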

Q4 (medium)

A developer needs to encrypt a batch of 120 files, each measuring approximately 256 KB. Due to the file size exceeding the AWS KMS direct encryption limits, the developer implements envelope encryption using a single AWS KMS Customer Managed Key (CMK) to generate a data key that will be shared across the batch for local encryption. What is the minimum number of AWS KMS API calls required to securely encrypt these files and obtain the encrypted data key for storage?

A.

1

B.

120

C.

240

D.

7,680


Correct Answer: A

To solve this, we must evaluate the AWS KMS API limits and the mechanics of envelope encryption:

  1. Identify API Limits: The AWS KMS Encrypt API operation has a maximum plaintext payload limit of 4,096 bytes (4 KB). Since each file is 256 KB, they cannot be encrypted directly using the Encrypt call.
  2. Apply Envelope Encryption: For data exceeding 4 KB, envelope encryption is the best practice. The process begins with a single call to the GenerateDataKey API.
  3. Evaluate the API Response: A single GenerateDataKey request returns both a plaintext data key (to be used by the application for local encryption) and an encrypted copy of that same data key (to be stored as metadata). No separate call is needed to encrypt the data key itself.
  4. Local Processing: Once the application has the plaintext data key, it performs the encryption of all 120 files locally (e.g., using the AWS Encryption SDK). Local cryptographic operations do not consume KMS API calls.

Therefore, the minimum number of KMS API calls is 1 (the initial GenerateDataKey call).

Distractor calculation note: 120 files × (256 KB / 4 KB) = 7,680 calls, which represents the incorrect approach of attempting to chunk files to fit the Encrypt API limit.
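
The call counts in the explanation and the distractor note can be checked with plain arithmetic (no KMS calls involved):

```python
FILE_COUNT = 120
FILE_SIZE_KB = 256
ENCRYPT_LIMIT_KB = 4  # KMS Encrypt accepts at most 4,096 bytes of plaintext

# Envelope encryption: one GenerateDataKey call returns both the plaintext data
# key and its encrypted copy; encrypting the 120 files then happens locally.
envelope_calls = 1

# Distractor: chunking every file into 4 KB pieces and calling Encrypt on each
chunked_calls = FILE_COUNT * (FILE_SIZE_KB // ENCRYPT_LIMIT_KB)

print(envelope_calls, chunked_calls)  # 1 7680
```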

Q5 (medium)

A security engineer is configuring permissions for a developer who needs to perform s3:PutObject and s3:GetObject operations, but only on a specific Amazon S3 bucket named project-alpha-data. The engineer reviews the AWS managed policy AmazonS3FullAccess, but notes that it uses "Resource": "*", which violates the company's least-privilege security requirement. Which action should the engineer take to meet the requirement with the least operational overhead?

A.

Create a customer managed policy that specifies the required actions and the specific ARN for the project-alpha-data bucket, then attach it to the developer's IAM role.

B.

Modify the JSON of the AmazonS3FullAccess AWS managed policy to replace the wildcard in the Resource element with the specific bucket ARN.

C.

Implement an inline policy for the developer, as inline policies are the only IAM policy type that supports the Condition element for fine-grained resource control.

D.

Attach the AmazonS3FullAccess policy to the developer and apply a Service Control Policy (SCP) to the IAM role to grant the specific permissions while overriding the broader managed policy.


Correct Answer: A

To follow the principle of least privilege when an AWS managed policy is too broad, the correct approach is to create a customer managed policy.

  1. Why not B? AWS managed policies are created and administered by AWS; they are read-only and cannot be modified by customers.
  2. Why not C? Both customer managed policies and inline policies support the Condition element and resource-level permissions. However, customer managed policies are preferred because they support versioning and can be reused across multiple users or roles, whereas inline policies have a strict one-to-one relationship and are harder to manage at scale.
  3. Why not D? Service Control Policies (SCPs) are used as guardrails to limit the maximum available permissions in an account; they do not grant permissions. Attaching a broad managed policy and trying to 'fix' it with an SCP is structurally incorrect for IAM permission granting.

Therefore, a customer managed policy providing specific access to arn:aws:s3:::project-alpha-data/* is the standard best practice.
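
A minimal sketch of that customer managed policy follows. The Sid is arbitrary, and if the developer also needs to list the bucket, a separate s3:ListBucket statement on the bucket ARN itself would be required:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProjectAlphaObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::project-alpha-data/*"
    }
  ]
}
```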

Q6 (medium)

A data engineer needs to grant a Data Analyst access to a table named sales_data in the AWS Glue Data Catalog. To comply with security policies, the analyst must only be allowed to view records where the region column is set to 'North' and must be restricted from viewing the customer_ssn column. The underlying data is stored in Amazon S3 and is registered with AWS Lake Formation.

Which configuration should the engineer implement to meet these requirements with the least administrative effort?

A.

Create a Data Filter in AWS Lake Formation that includes all required columns except customer_ssn and specifies the row filter expression region = 'North'. Grant the analyst 'Select' permissions using this filter and ensure their IAM role has the lakeformation:GetDataAccess permission.

B.

Apply an IAM policy to the analyst's role with a Condition element that uses s3:ExistingObjectTag to filter rows and a policy statement that denies s3:GetObject for the specific S3 prefixes associated with the customer_ssn metadata.

C.

Grant the analyst 'Full Table' access in AWS Lake Formation, then use an IAM policy with an explicit Deny effect on the glue:GetTable action for the customer_ssn column and the s3:ListBucket action for non-North prefixes.

D.

Enable LF-Tags and assign the tag Access: Restricted to the table. Use a wildcard LF-Tag expression Region: * to grant access to the 'North' rows and apply a column-level exclusion for customer_ssn in the IAM policy attached to the analyst.


Correct Answer: A

To implement fine-grained access control (FGAC) in AWS Lake Formation, the most effective method is using Data Filters.

  1. Row-level security: A Data Filter uses a PartiQL-based row filter expression, such as region = 'North', to restrict which records are returned to the user.
  2. Column-level security: Within the same Data Filter, you can explicitly include or exclude columns (e.g., excluding customer_ssn).
  3. Permissions: Once the filter is created, you grant the 'Select' permission to the principal (the Data Analyst) referencing that filter.
  4. IAM Requirements: For the analyst to actually retrieve the data via Lake Formation's credential vending, they must have the lakeformation:GetDataAccess permission in their IAM policy.

Options B and C are incorrect because IAM and S3 bucket policies cannot filter data within a file (like a CSV or Parquet file); they can only control access at the object (file) or prefix level. Option D is incorrect because LF-Tags are used for scaling permissions to resources, but row-level filtering still requires a Data Filter, and wildcards in LF-Tags do not function as row-level logic.
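A hedged sketch of how the filter in option A could be created via the Lake Formation CreateDataCellsFilter API, e.g. passed as `--cli-input-json` to `aws lakeformation create-data-cells-filter` (the account ID, database, and filter name below are placeholders):

```json
{
  "TableData": {
    "TableCatalogId": "111122223333",
    "DatabaseName": "sales_db",
    "TableName": "sales_data",
    "Name": "north-no-ssn",
    "RowFilter": {
      "FilterExpression": "region = 'North'"
    },
    "ColumnWildcard": {
      "ExcludedColumnNames": ["customer_ssn"]
    }
  }
}
```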

Q7medium

A cloud administrator is implementing Attribute-Based Access Control (ABAC) to manage permissions for AWS Resource Groups. All IAM users have a Project tag assigned to them. The administrator wants to create a single IAM policy that allows users to execute the resource-groups:UpdateGroup action only if the Project tag on the Resource Group matches the Project tag on the IAM user making the request.

Which Condition block should be added to the IAM policy statement to fulfill this requirement dynamically?

A.

"Condition": { "StringEquals": { "aws:ResourceTag/Project": "${aws:PrincipalTag/Project}" } }

B.

"Condition": { "StringEquals": { "aws:RequestTag/Project": "${aws:PrincipalTag/Project}" } }

C.

"Condition": { "StringEquals": { "aws:ResourceTag/Project": "Alpha" } }

D.

"Condition": { "ForAllValues:StringEquals": { "aws:TagKeys": ["Project"] } }


Correct Answer: A

To implement dynamic Attribute-Based Access Control (ABAC), the IAM policy must compare an attribute of the resource with an attribute of the principal (user or role).

  1. aws:ResourceTag/Project is the condition key used to reference the Project tag value of the AWS Resource Group being accessed.
  2. ${aws:PrincipalTag/Project} is a policy variable that resolves to the value of the Project tag attached to the IAM user making the call.
  • Option B is incorrect because aws:RequestTag checks the tags being passed during an API call (such as CreateGroup or TagResource), not the tags already existing on the resource.
  • Option C is incorrect because hardcoding "Alpha" makes the policy static and not scalable for other projects, failing the requirement for a single dynamic policy.
  • Option D is incorrect because aws:TagKeys is used to restrict which tag keys can be used in a request, not to compare values for authorization.

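For context, the winning condition block sits inside a full policy statement along these lines (a minimal sketch; Resource is left as a wildcard here because the tag comparison does the scoping):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AbacUpdateGroupByProjectTag",
      "Effect": "Allow",
      "Action": "resource-groups:UpdateGroup",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Project": "${aws:PrincipalTag/Project}"
        }
      }
    }
  ]
}
```

Because the policy variable resolves at request time, one policy serves every project team without per-project statements.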

Q8 (medium)

An organization needs to implement a multi-channel notification system. When a critical system event occurs, the architecture must perform three tasks: trigger a custom Python script for immediate automated remediation, send a persistent message to a processing queue for later asynchronous auditing, and notify a mobile operations team via SMS. Which architectural pattern most efficiently meets these requirements using the fan-out pattern?

A.

Publish the event to an Amazon SNS topic. Subscribe an AWS Lambda function, an Amazon SQS queue, and an SNS SMS endpoint to the topic.

B.

Send the event to an Amazon SQS queue. Configure the queue to push messages simultaneously to an AWS Lambda function and an SNS topic for SMS distribution.

C.

Publish the event to an Amazon SNS topic. Configure an AWS Lambda function as the sole subscriber to manually parse the message and push it to SQS and SMS endpoints.

D.

Send the event to an Amazon SQS queue, which acts as a buffer. Configure the queue to trigger an Amazon SNS topic only after the message has been successfully audited.

E.

Use Amazon EventBridge to send the event to an SQS queue, which then triggers a Lambda function to handle all downstream SMS and auditing tasks sequentially.


Correct Answer: A

The most efficient way to handle a single event that needs to trigger multiple disparate actions is the Amazon SNS fan-out pattern.

  1. Fan-out Mechanism: By publishing a message to an Amazon SNS topic, you can broadcast that message to multiple subscribers simultaneously.
  2. Automated Action: An AWS Lambda function can be a direct subscriber to the SNS topic, allowing it to run the remediation script as soon as the event is published.
  3. Asynchronous Auditing: By subscribing an Amazon SQS queue to the SNS topic, you ensure that a copy of the message is persisted. This provides decoupling, meaning the auditing process can happen at its own pace without affecting the immediate remediation.
  4. Human Notification: SNS natively supports SMS, Email, and Mobile Push as subscription protocols, satisfying the requirement to notify the operations team.

Option B is incorrect because SQS does not support push-based fan-out to multiple diverse endpoints; it is designed for point-to-point delivery. Option C is inefficient because it uses Lambda as an unnecessary intermediary, increasing complexity and cost. Options D and E introduce sequential processing, which delays the remediation and notification.
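A trimmed CloudFormation sketch of the fan-out topology in option A (RemediationFunction, its resource-based permission for SNS, the SQS queue policy, and the phone number are placeholders or omitted for brevity):

```yaml
Resources:
  CriticalEventsTopic:
    Type: AWS::SNS::Topic

  AuditQueue:
    Type: AWS::SQS::Queue

  # Lambda subscription: runs the remediation script on publish
  RemediationSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref CriticalEventsTopic
      Protocol: lambda
      Endpoint: !GetAtt RemediationFunction.Arn

  # SQS subscription: persists a copy for asynchronous auditing
  AuditSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref CriticalEventsTopic
      Protocol: sqs
      Endpoint: !GetAtt AuditQueue.Arn

  # SMS subscription: notifies the on-call operations team
  SmsSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref CriticalEventsTopic
      Protocol: sms
      Endpoint: "+15550100001"
```

A single Publish call to the topic then reaches all three subscribers in parallel.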

Q9 (medium)

In the context of troubleshooting performance in a data streaming pipeline, which of the following best explains the concept of backpressure and its impact on system components?

A.

It is a flow-control mechanism where a downstream bottleneck forces upstream components to reduce their processing rate to match the capacity of the sink.

B.

It is a performance state caused by a data source producing records too slowly, leading to excessive idle time for downstream transformation operators.

C.

It is a network-level diagnostic indicating that the Maximum Transmission Unit (MTU) between processing shards must be increased to handle peak data volume.

D.

It is a high-latency mode where the system bypasses internal memory buffers at the source to prevent data loss, resulting in immediate message drops.


Correct Answer: A

Backpressure is a fundamental concept in data streaming (e.g., Flink, Kafka, Kinesis).

  1. Definition: It occurs when a downstream operator (the sink or a transformation step) cannot process data at the rate it is being received.
  2. Propagation: This bottleneck creates a 'pressure' that moves backward through the pipeline. Upstream operators detect that their output buffers are full and are forced to slow down their own consumption or production rate.
  3. Purpose: It serves as a flow-control mechanism to prevent the system from being overwhelmed, which helps avoid out-of-memory (OOM) errors or uncontrolled data loss during spikes.
  4. Troubleshooting: When troubleshooting, engineers typically look for the 'root' of the pressure by starting at the final sink and moving backward until they find the component with high resource utilization or throttling.

Option A correctly identifies this as a downstream-to-upstream propagation of rate limits.
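The flow-control behavior can be illustrated in miniature with a bounded in-memory buffer: a fast producer blocks on put() whenever the slow consumer falls behind, which is exactly the downstream-to-upstream slowdown described above (a toy sketch using the Python standard library, not an AWS API):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=3)   # small bounded buffer between the two stages
consumed = []

def producer():
    for i in range(10):
        buf.put(i)             # blocks while the buffer is full: backpressure

def consumer():
    for _ in range(10):
        item = buf.get()
        time.sleep(0.005)      # the slow sink / bottleneck
        consumed.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()

# Every record arrives, in order, with no loss -- the producer was simply slowed down.
print(consumed == list(range(10)))  # True
```

The bounded buffer plays the role a stream's output buffers play in Flink or Kinesis consumers: when it fills, the upstream stage's rate is forced down to match the sink.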

Q10 (medium)

A data engineer at a retail company wants to analyze real-time sales data stored in an Amazon Aurora PostgreSQL database. The data is required for a reporting dashboard hosted on an Amazon Redshift cluster. To ensure the dashboard reflects the most current data without the overhead of an ETL (Extract, Transform, Load) process, the engineer decides to use Amazon Redshift Federated Queries. Which of the following describes the correct architectural setup and configuration steps for this requirement?

A.

Create an external schema in Redshift using the CREATE EXTERNAL SCHEMA command, providing the Aurora endpoint and specifying the database username and password as plain text parameters within the SQL statement.

B.

Store the Aurora database credentials in AWS Secrets Manager, attach an IAM role to the Redshift cluster with permissions to access that secret, and then create an external schema using the SECRET_ARN.

C.

Configure Amazon Redshift Spectrum to create an external database in the AWS Glue Data Catalog that points directly to the Aurora storage volume in Amazon S3.

D.

Deploy an Amazon Redshift Zero-ETL integration to replicate the Aurora PostgreSQL tables into a local Redshift schema, which maintains data freshness via continuous replication.


Correct Answer: B

Amazon Redshift Federated Queries enable users to query live data in Amazon RDS or Aurora directly from Redshift without moving or migrating the data. To configure this securely:

  1. Secrets Management: Database credentials (username/password) must be stored in AWS Secrets Manager for security and rotation.
  2. Permissions: An IAM role must be associated with the Amazon Redshift cluster. This role requires a policy that allows the secretsmanager:GetSecretValue action for the specific secret.
  3. External Schema: The connection is established by executing the CREATE EXTERNAL SCHEMA command in Redshift, using the FROM POSTGRES (or MYSQL) clause and identifying the credentials via the SECRET_ARN.

Option A is incorrect because Redshift Federated Queries do not support passing plaintext credentials in the SQL command for security reasons. Option C is incorrect because Redshift Spectrum is designed for querying data in Amazon S3, not for direct live queries against relational databases. Option D is incorrect because while Zero-ETL provides real-time analysis, it involves data replication into Redshift, whereas Federated Queries provide remote access without data movement. Therefore, Option B is the correct configuration.
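A sketch of the resulting DDL (the endpoint, role ARN, and secret ARN are placeholders):

```sql
CREATE EXTERNAL SCHEMA sales_live
FROM POSTGRES
DATABASE 'salesdb' SCHEMA 'public'
URI 'aurora-sales.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:111122223333:secret:aurora-creds-AbCdEf';
```

Once created, Aurora tables can be joined with local Redshift tables, e.g. SELECT * FROM sales_live.orders.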

Q11 (medium)

A data analytics company manages a workload with two primary characteristics: unpredictable daily query spikes from customer dashboards and a requirement to occasionally perform analytical queries on 5 PB of historical log data stored in Amazon S3. Which configuration provides the most cost-effective and performant solution for this workload?

A.

Use Amazon Redshift Serverless for the dashboard queries and Redshift Spectrum to query the historical logs in Amazon S3.

B.

Provision an Amazon Redshift cluster using RA3 nodes and load all 5 PB of historical data into Redshift Managed Storage (RMS).

C.

Implement Redshift Spectrum for all queries, including high-frequency dashboard requests, targeting raw data stored in Amazon S3.

D.

Use an Amazon RDS for PostgreSQL instance to store historical logs and a Redshift Provisioned cluster for the daily dashboard queries.


Correct Answer: A

To optimize for cost and performance in this scenario, a hybrid 'Lakehouse' approach is best.

  1. Unpredictable Spikes: Amazon Redshift Serverless is ideal for unpredictable or intermittent workloads because it automatically scales compute capacity (measured in Redshift Processing Units, or RPUs) up and down based on demand. You only pay for the seconds the work is running, avoiding costs for idle provisioned clusters.
  2. Large Historical Data: For 5 PB of 'cold' data in S3, Redshift Spectrum is the most cost-effective choice. It allows you to run SQL queries directly against data in S3 without the need to load it into Redshift local storage. Loading 5 PB into RA3 nodes would result in significantly higher storage costs compared to S3 pricing.
  3. Why other options are incorrect: Option B is prohibitively expensive for 5 PB of storage. Option C would suffer from higher latency for dashboard users and high per-terabyte scan costs for frequent queries. Option D is incorrect because RDS is an OLTP service and cannot efficiently handle petabyte-scale analytical queries.

Final Answer: Use Amazon Redshift Serverless for the dashboard queries and Redshift Spectrum to query the historical logs in Amazon S3.

Q12 (medium)

An organization needs to centralize application logs from multiple Amazon EC2 instances to meet compliance requirements for data retention and access auditing. When using Amazon CloudWatch Logs, how should the logs be organized to efficiently manage these shared policies?

A.

Logs are organized into log groups, which serve as the primary container for log streams and the level at which retention and access control policies are defined.

B.

Log retention and IAM access permissions must be configured independently for each log stream to ensure granular control over specific application instances.

C.

Application logs are stored as CloudWatch Metrics, which automatically retain the full text of log entries for 15 months based on account-level settings.

D.

CloudWatch Logs requires an associated Amazon S3 bucket for every log group to define the data retention and expiration period via S3 Lifecycle policies.


Correct Answer: A

Amazon CloudWatch Logs uses a specific hierarchy for data organization:

  1. Log Events: A record of an activity recorded by the application (e.g., a timestamp and a message).
  2. Log Streams: A sequence of log events that share the same source, such as a specific EC2 instance or container.
  3. Log Groups: A group of log streams that share the same retention, monitoring, and access control settings.

Because compliance requirements like GDPR or HIPAA often apply to an entire application or environment, setting the retention policy and IAM permissions at the log group level allows for centralized and efficient management. Changes to the log group settings automatically apply to all current and future log streams within that group. Option A is the correct answer. Options B and D are incorrect because policies are not managed at the stream level and S3 is not required for basic retention. Option C is incorrect because metrics and logs are distinct data types with different storage characteristics.

Q13 (medium)

A data engineer is designing an AWS Glue Data Catalog table for a global sales dataset stored in Amazon S3. The storage structure follows a hierarchical partitioning scheme: s3://bucket/sales/year=YYYY/month=MM/region=RR/. The dataset covers 5 years, with 12 months per year, and operations across 2,000 distinct regions.

Calculate the total number of partitions generated by this structure and determine the most appropriate AWS Glue Data Catalog optimization strategy to ensure low query planning latency for Amazon Athena, given that the default partition limit is 100,000 per table.

A.

120,000 partitions; implement Partition Projection to calculate partition locations from metadata.

B.

120,000 partitions; implement Partition Indexing on all three partition keys to speed up catalog retrieval.

C.

2,017 partitions; implement an incremental AWS Glue Crawler to manage metadata updates and reduce crawling time.

D.

60 partitions; use S3 Lifecycle policies to archive metadata periodically and maintain the 100,000 limit.


Correct Answer: A

To calculate the total number of partitions in a hierarchical structure, you must multiply the cardinality of each partition level:

  1. Calculate the total count: Total Partitions = 5 (years) × 12 (months) × 2,000 (regions) = 120,000 partitions.
  2. Analyze the constraints: The AWS Glue Data Catalog has a default limit of 100,000 partitions per table. Reaching or exceeding this volume can significantly increase query planning latency because the query engine (Athena) must fetch metadata for all partitions from the catalog before execution.
  3. Determine the strategy: Partition Projection is the most appropriate solution here. It allows Athena to calculate partition locations based on configuration metadata (like ranges or enum values) instead of performing a look-up in the Glue Data Catalog. This bypasses the 100,000-partition limit and reduces latency for high-cardinality datasets.

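A hedged sketch of the corresponding Athena table properties (the year range, region values, and bucket path are placeholders; a real enum for 2,000 regions would list them all, or use an injected projection instead):

```sql
ALTER TABLE sales SET TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.year.type'      = 'integer',
  'projection.year.range'     = '2019,2023',
  'projection.month.type'     = 'integer',
  'projection.month.range'    = '1,12',
  'projection.month.digits'   = '2',
  'projection.region.type'    = 'enum',
  'projection.region.values'  = 'R001,R002,R003',
  'storage.location.template' = 's3://bucket/sales/year=${year}/month=${month}/region=${region}/'
);
```

With these properties set, Athena computes partition locations at query time and never consults the Glue Data Catalog for partition metadata.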

Q14 (medium)

A developer is implementing a data retention policy for a serverless application. The application stores metadata in a DynamoDB table and corresponding files in a versioned S3 bucket. Metadata items must expire exactly 7 days after they are created.

If a metadata item is created at Unix epoch timestamp 1,672,531,200 (January 1, 2023, 00:00:00 UTC), what is the correct value to store in the DynamoDB Time to Live (TTL) attribute to ensure the item is marked for deletion after the 7-day retention period?

A.

1,672,531,200,000

B.

1,673,136,000

C.

1,673,136,000,000

D.

1,673,308,800


Correct Answer: B

To calculate the DynamoDB TTL value, follow these steps:

  1. Identify the Required Format: DynamoDB TTL requires the expiration time in Unix epoch format in seconds. Using milliseconds will cause DynamoDB to ignore the attribute.
  2. Calculate the Retention Period in Seconds: 7 days must be converted to seconds.
    • 7 days × 24 hours/day × 60 minutes/hour × 60 seconds/minute = 604,800 seconds.
  3. Calculate the Final Timestamp: Add the retention seconds to the creation timestamp.
    • 1,672,531,200 (Creation) + 604,800 (Retention) = 1,673,136,000.

Distractor Analysis:

  • Option A is incorrect because it is the creation timestamp in milliseconds.
  • Option C is incorrect because it is the expiration timestamp in milliseconds.
  • Option D is incorrect because it adds an unnecessary 48-hour buffer (172,800 seconds) to the timestamp. While DynamoDB may take up to 48 hours to actually delete the item from the partition, the TTL attribute value itself should only represent the target expiration time.

The correct TTL value is 1,673,136,000.
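The arithmetic above as a runnable check:

```python
from datetime import datetime, timedelta, timezone

creation_epoch = 1_672_531_200                   # 2023-01-01T00:00:00Z, in seconds
retention_seconds = 7 * 24 * 60 * 60             # 604,800 seconds in 7 days
ttl_value = creation_epoch + retention_seconds   # value stored in the TTL attribute

# Cross-check against a calendar computation
expires = datetime(2023, 1, 1, tzinfo=timezone.utc) + timedelta(days=7)
assert ttl_value == int(expires.timestamp()) == 1_673_136_000

print(ttl_value)  # 1673136000
```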

Q15 (medium)

A data engineering team is designing a serverless data processing pipeline that involves several AWS Lambda functions executed in a specific sequence. They need to ensure that the entire infrastructure can be deployed consistently and repeatably across Development, Staging, and Production environments. Which approach represents the best practice for implementing and maintaining this workflow using Infrastructure as Code (IaC)?

A.

Define an AWS::StepFunctions::StateMachine to orchestrate AWS::Lambda::Function resources in an AWS CloudFormation or SAM template, using the Parameters section for environment-specific variables.

B.

Develop a single monolithic AWS Lambda function to handle all orchestration logic through synchronous code execution and deploy it using a zip file upload in the AWS Management Console.

C.

Use the AWS CLI to manually create and link each resource sequentially across different environments to ensure exact control over resource identifiers and dependency links.

D.

Utilize AWS CloudTrail to record the API calls made during the manual setup of the Development environment and re-execute those logs in target accounts for replication.


Correct Answer: A

To implement a repeatable and maintainable serverless workflow, the following principles apply:

  1. Declarative Infrastructure: Using AWS CloudFormation or the AWS Serverless Application Model (SAM) allows for the declarative definition of resources. This ensures that the environment is identical every time it is deployed, reducing 'configuration drift'.
  2. Orchestration: Defining an AWS::StepFunctions::StateMachine is the preferred way to orchestrate multiple Lambda functions. It manages state, retries, and error handling more effectively than a monolithic function or custom code.
  3. Reusability: The Parameters section of a CloudFormation template enables the template to be reused across different accounts or regions by passing environment-specific values (like S3 bucket names or IAM roles) at runtime.

Incorrect Options:

  • B describes a monolithic approach that is difficult to scale, test, and maintain, and it lacks the benefits of automated IaC.
  • C relies on manual intervention and scripts that are error-prone and do not provide the state tracking offered by CloudFormation.
  • D incorrectly identifies AWS CloudTrail; CloudTrail is an auditing and logging service, not a tool for infrastructure deployment or replication.

Therefore, the best practice is Option A.
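A trimmed SAM-style sketch of option A (StatesExecutionRole and the three Lambda functions are assumed to be defined elsewhere in the template; all names are placeholders):

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, staging, prod]

Resources:
  # Lambda functions and the execution role are omitted for brevity
  PipelineStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: !Sub "data-pipeline-${Environment}"
      RoleArn: !GetAtt StatesExecutionRole.Arn
      DefinitionString: !Sub |
        {
          "StartAt": "Extract",
          "States": {
            "Extract":   { "Type": "Task", "Resource": "${ExtractFunction.Arn}",   "Next": "Transform" },
            "Transform": { "Type": "Task", "Resource": "${TransformFunction.Arn}", "Next": "Load" },
            "Load":      { "Type": "Task", "Resource": "${LoadFunction.Arn}",      "End": true }
          }
        }
```

Deploying the same template with Environment=dev, staging, or prod yields three identical stacks that differ only in the parameterized values.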

These are 15 of 635 questions available.

AWS Certified Data Engineer - Associate (DEA-C01) Flashcards

680 flashcards for spaced-repetition study. Showing 30 sample cards below.

Address changes to the characteristics of data(5 cards shown)

Question

Schema Evolution

Answer

The ability of a data processing system to adapt to changes in the data structure (schema) over time without failing.

Key Concepts

  • Backward Compatibility: New code can read old data.
  • Forward Compatibility: Old code can read new data.
  • Full Compatibility: Both backward and forward compatible.

[!NOTE] In AWS, this is primarily managed via the AWS Glue Data Catalog which maintains version history for table definitions.

Question

Schema Drift

Answer

The phenomenon where the metadata of source systems changes unexpectedly (e.g., a new field is added to a JSON payload or a column type changes), potentially breaking downstream ETL pipelines.

Strategies to Address Drift

  • Schema-on-Read: Use tools like Amazon Athena to define the schema at query time.
  • AWS Glue Crawlers: Configure crawlers to automatically update the Data Catalog when changes are detected.
  • Data Quality Rules: Use AWS Glue Data Quality (DQDL) to detect and alert on unexpected characteristic changes.

Question

AWS Glue Schema Registry

Answer

A feature that allows you to centralize and control the evolution of schemas for streaming data.

Functions

  • Integrates with Amazon Kinesis Data Streams and Amazon MSK.
  • Validates data produced by applications against a registered schema.
  • Prevents "poison pill" records (data that doesn't match the schema) from entering the pipeline.

Question

Partition Projection

Answer

A mechanism in Amazon Athena used to address changes in data volume and high-cardinality partitioning by calculating partition values from configuration rather than metadata lookups.

Benefits

  • Reduces the overhead of managing thousands of partitions in the Glue Data Catalog.
  • Highly effective for datasets where data characteristics include highly predictable paths (e.g., s3://bucket/year/month/day/).

[!TIP] Use this when you have millions of partitions or frequently changing time-based data characteristics to avoid MSCK REPAIR TABLE timeouts.

Question

AWS Schema Conversion Tool (AWS SCT)

Answer

A standalone tool used to convert database schemas when moving between different database engines (heterogeneous migration).

Role in Data Characteristics

  • It addresses changes in data types and structural paradigms (e.g., converting an OLTP schema to an OLAP schema like Amazon Redshift).
  • It provides a Migration Assessment Report that identifies items that cannot be converted automatically and require manual intervention.

| Source | Target |
| --- | --- |
| Oracle/SQL Server | Amazon Redshift |
| Cassandra | Amazon DynamoDB |
| MongoDB | Amazon DocumentDB |

Amazon CloudWatch Logs for Application Data(10 cards shown)

Question

CloudWatch Log Group

Answer

A Log Group is a collection of log streams that share the same retention, monitoring, and access control settings.

| Feature | Description |
| --- | --- |
| Retention | How long logs are kept (1 day to 10 years, or Infinite). |
| Access Control | Managed via IAM policies at the group level. |
| Usage | Typically represents a single application or service. |

[!NOTE] You define a log group to aggregate logs from multiple instances of the same application component.

Question

Amazon CloudWatch Logs

Answer

A managed service used to centralize, store, and monitor log files from AWS resources and applications. It allows for real-time monitoring of systems and applications using your existing log data.

Key Integrations in Data Engineering

| Service | Log Content |
| --- | --- |
| AWS Glue | ETL job execution status, runtime metrics, and errors. |
| AWS Lambda | Function execution logs and custom logger output. |
| Amazon EMR | Spark, Hive, and other big data workload performance logs. |
| Amazon Redshift | Connection, user, and activity logs (must be enabled). |

[!NOTE] CloudWatch Logs helps align data engineering practices with regulations like GDPR or HIPAA by providing a central audit trail.

Question

Log Group

Answer

The primary administrative unit in CloudWatch Logs. A Log Group is a collection of log streams that share the same retention, monitoring, and access control settings.


[!TIP] Use Log Groups to organize logs by application or environment (e.g., /prod/ecommerce/web-server).

Question

CloudWatch Log Stream

Answer

A Log Stream is a sequence of log events that share the same source, such as a specific instance of an application or a specific container.


[!TIP] In Lambda, each execution environment (container) creates its own unique Log Stream within the function's Log Group.

Question

CloudWatch Logs Insights

Answer

A fully managed, pay-as-you-go analytics service used to interactively search and analyze log data using a specialized query language.

Common Commands:

  • filter: Search for specific terms or patterns.
  • stats: Calculate aggregations (e.g., count, sum, avg).
  • sort: Order results by timestamp or field.

Example Query:

```
fields @timestamp, @message
| filter @message like /Error/
| stats count(*) by bin(1h)
```

Question

CloudWatch Logs Insights

Answer

A fully managed, interactive log analysis service that allows you to search and analyze your log data in CloudWatch Logs using a purpose-built query language.

Example Query for Data Pipelines:

```
fields @timestamp, @message
| filter @message like /ERROR/ or @message like /FAIL/
| sort @timestamp desc
| limit 20
```

[!NOTE] It is a pay-per-query service, making it a cost-effective alternative to maintaining a dedicated OpenSearch cluster for infrequent log analysis.

Question

Metric Filter

Answer

A feature that allows you to search and extract specific patterns or terms from log events and transform them into numerical CloudWatch Metrics.

Workflow:

  1. Define Pattern: e.g., [ip, user, timestamp, request, status_code=500, size]
  2. Assign Metric: Create a metric named InternalServerErrorCount.
  3. Set Alarm: Trigger an SNS notification if the count exceeds 5 in a 1-minute period.

[!TIP] Use Metric Filters to monitor the health of your data pipelines without having to write custom monitoring code.

Question

put_log_events (Boto3 / SDK)

Answer

The primary API action used to programmatically upload batches of log events to a specific log stream.

Key Requirements:

  • logGroupName: Destination group.
  • logStreamName: Destination stream.
  • logEvents: Array of objects containing timestamp and message.
  • sequenceToken: Required for subsequent uploads to the same stream to ensure ordering.

[!WARNING] If you provide an invalid sequenceToken, the API returns an InvalidSequenceTokenException containing the correct next token.

Question

CloudWatch Metric Filters

Answer

Metric filters define patterns to search for in log data as it is sent to CloudWatch Logs, turning log data into numerical CloudWatch Metrics.


Use Case: Create a filter for the term "404" in web logs to create a custom metric for NotFoundErrors, then set an alarm if the count exceeds 10 per minute.

Question

Redshift Audit Logging Configuration

Answer

The process of capturing and exporting logs related to cluster security and usage. Unlike basic metrics, audit logging in Amazon Redshift is not enabled by default.

Implementation Steps

  1. Enable Export: You must explicitly enable audit logging in the Redshift console or via API.
  2. Choose Destination: Specify a destination: Amazon CloudWatch Logs or an Amazon S3 prefix.
  3. Define Log Path: For CloudWatch, the group follows a standard path: /aws/redshift/cluster/<cluster_name>/<log_type>

[!WARNING] For connection logs specifically, the path will be /aws/redshift/cluster/<cluster_name>/connectionlog. Ensure IAM permissions are correctly set for the cluster to write to CloudWatch.

Amazon EventBridge & Event Management(5 cards shown)

Question

Amazon EventBridge

Answer

A serverless event bus service that helps build event-driven architectures by routing data from AWS services, custom applications, and SaaS providers to various targets.

[!NOTE] Formerly known as Amazon CloudWatch Events, it uses the same API but offers expanded features like schema registries and third-party SaaS integrations.

Question

Event Bus

Answer

The primary resource in Amazon EventBridge that acts as a router. It receives events and delivers them to zero or more destinations (targets) based on defined rules.


[!TIP] Think of it as a central hub or post office that sorts incoming mail (events) and redirects it to the correct recipients.

Question

EventBridge Rules

Answer

Logic applied to an Event Bus to match incoming events and route them to specific targets. There are two primary types:

| Rule Type | Description | Example |
| --- | --- | --- |
| Event-driven | Triggered by a state change in an AWS resource or custom app. | An S3 object creation starts a Glue job. |
| Schedule-based | Triggered at specific times or intervals (Cron or Rate expressions). | Running a cleanup script every Friday at 8 PM. |

[!NOTE] A single event can match multiple rules, allowing it to be sent to multiple downstream services simultaneously.

Question

EventBridge Targets

Answer

The downstream resources that EventBridge invokes when an event matches a rule. A single rule can have up to 5 targets.

Common Targets:

  • Compute: AWS Lambda, AWS Batch
  • Orchestration: AWS Step Functions, Amazon MWAA (Airflow)
  • Storage/Streaming: Amazon S3, Kinesis Data Streams, Data Firehose
  • Databases: Amazon Redshift (via Data API)
  • Messaging: Amazon SNS, Amazon SQS

Question

Event Transformation (Input Transformer)

Answer

A feature that allows you to modify the JSON payload of an event before it reaches its target.

Why use it?

  • To extract specific fields from a large event JSON.
  • To reformat data to match the input schema of a target (e.g., a specific Lambda parameter).
  • To add static text or variables to the message.

[!TIP] This is highly useful for creating human-readable notifications in SNS or Slack from raw system events.
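An Input Transformer is configured as two pieces: input paths (JSONPath-style expressions that pull fields out of the event) and a template with `<placeholder>` slots. The sketch below mimics that substitution in plain Python; the event, paths, and template are invented for illustration.

```python
import re

# Hypothetical sketch of Input Transformer behavior: input paths extract
# fields from the event JSON, then a template substitutes them.
def transform(event: dict, input_paths: dict, template: str) -> str:
    values = {}
    for name, path in input_paths.items():
        # Resolve a path like "$.detail.bucket.name" against the event
        node = event
        for part in path.lstrip("$.").split("."):
            node = node[part]
        values[name] = node
    # Replace each <name> placeholder with its extracted value
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]), template)

event = {"detail": {"bucket": {"name": "raw-data"}, "object": {"key": "sales.csv"}}}
paths = {"bucket": "$.detail.bucket.name", "key": "$.detail.object.key"}
template = "New file <key> arrived in bucket <bucket>"

print(transform(event, paths, template))
# New file sales.csv arrived in bucket raw-data
```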

Amazon Redshift Data Sharing & Permissions (5 cards shown)

Question

Amazon Redshift Data Sharing

Answer

A feature that allows sharing live, read-only data across Redshift clusters, AWS accounts, or Regions without the need to move or copy the data.

Key Benefits

  • Zero-ETL: Eliminates the need for complex data pipelines to replicate data.
  • Workload Isolation: Consumers can query data without impacting the performance of the producer's compute resources.
  • Data Currency: Consumers see live updates as soon as they are committed in the source cluster.

[!TIP] Use this to move from a siloed architecture to a hub-and-spoke or data mesh model.

Question

Outbound vs. Inbound Shares

Answer

The two primary components involved in the Amazon Redshift data sharing workflow.

| Component | Description |
| --- | --- |
| Outbound Share | Created by the Producer cluster to define which schemas, tables, or views are shared. |
| Inbound Share | Received by the Consumer cluster, which then creates a local database reference to query the shared objects. |

Question

Role-Based Access Control (RBAC)

Answer

A security mechanism in Redshift that simplifies permission management by assigning privileges to roles instead of individual users.

Core Features

  • Inheritance: Supports role nesting (assigning a role to another role).
  • Efficiency: Changing a role's permissions automatically updates all assigned users.
  • Commands: Uses GRANT to provide access and REVOKE to remove it.

[!NOTE] RBAC helps implement the Principle of Least Privilege by ensuring users only have the specific permissions required for their role.
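Role nesting means a role can be granted to another role, and the grantee inherits the union of all privileges reachable through the chain. The sketch below illustrates that resolution; the role names and privilege sets are invented, and real Redshift RBAC is managed with `CREATE ROLE` and `GRANT` in SQL.

```python
# Illustrative sketch of nested-role privilege resolution (not Redshift code).
role_privs = {
    "analyst": {"SELECT"},
    "etl_operator": {"INSERT", "UPDATE"},
    "data_engineer": set(),  # gets everything via nesting below
}
role_grants = {"data_engineer": {"analyst", "etl_operator"}}  # role-to-role grants

def effective_privs(role: str) -> set:
    """Union of a role's own privileges and those of every nested role."""
    privs = set(role_privs.get(role, set()))
    for nested in role_grants.get(role, ()):
        privs |= effective_privs(nested)
    return privs

print(sorted(effective_privs("data_engineer")))  # ['INSERT', 'SELECT', 'UPDATE']
```

This is also why revoking a privilege from one role immediately affects every user and role downstream of it.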

Question

Row-Level Security (RLS)

Answer

A granular access control feature that restricts the specific rows a user or role can view within a table based on predefined policies.

Implementation

  • Policy Logic: Defined using SQL predicates (e.g., WHERE region = 'US').
  • Filtering: When a user queries the table, Redshift silently applies the policy to filter results.

[!WARNING] Avoid complex subqueries or excessive table joins within RLS policies, as they can significantly degrade query performance.
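Conceptually, an RLS policy behaves as a filter the engine silently appends to every query on the table. The Python sketch below illustrates that effect on sample rows; the table, rows, and policy are hypothetical, and real policies are defined in SQL with `CREATE RLS POLICY` and attached per table and role.

```python
# Illustrative sketch of the effect of an RLS policy (not Redshift code).
rows = [
    {"order_id": 1, "region": "US", "amount": 120},
    {"order_id": 2, "region": "EU", "amount": 80},
    {"order_id": 3, "region": "US", "amount": 45},
]

# Policy predicate equivalent to: WHERE region = 'US'
def us_only(row: dict) -> bool:
    return row["region"] == "US"

# What a US-restricted user "sees" when querying the table
visible = [r for r in rows if us_only(r)]
print([r["order_id"] for r in visible])  # [1, 3]
```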

Question

Centralized Governance via AWS Lake Formation

Answer

The integration of Redshift data sharing with Lake Formation to manage permissions centrally across the AWS environment.

How it Works

  1. Producer clusters register data shares with Lake Formation.
  2. Administrators use LF-Tags (Tag-Based Access Control) to define permissions.
  3. Consumers access shared data through the Lake Formation catalog, which handles cross-account authorization.

Amazon S3 and Redshift Data Movement (5 cards shown)

Question

COPY Command

Answer

The SQL command used to load data into Amazon Redshift tables from external sources, most commonly Amazon S3.

Key Features:

  • Parallelism: Loads data in parallel using all compute nodes in the cluster.
  • Efficiency: Significantly faster than performing multiple INSERT statements.
  • Flexibility: Supports various formats including CSV, JSON, Parquet, and Avro.

[!TIP] Use a Manifest File (a JSON file listing specific S3 objects) with the COPY command to ensure the correct files are loaded and to handle cross-account or cross-region access.
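The manifest mentioned in the tip is a small JSON file with an `entries` array; each entry names one S3 object and whether loading it is mandatory. A sketch that builds one (the bucket and keys are made up):

```python
import json

# Sketch of a Redshift COPY manifest: a JSON document listing the exact
# S3 objects to load. Bucket and keys here are hypothetical.
manifest = {
    "entries": [
        {"url": "s3://raw-data-bucket/2024/01/part-0000.csv", "mandatory": True},
        {"url": "s3://raw-data-bucket/2024/01/part-0001.csv", "mandatory": True},
    ]
}
print(json.dumps(manifest, indent=2))
```

The file is itself uploaded to S3 and referenced by adding the `MANIFEST` option to the COPY command; entries marked `"mandatory": true` cause the load to fail if that object is missing.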

Question

UNLOAD Command

Answer

The SQL command used to export the results of a query from Amazon Redshift to one or more files in an Amazon S3 bucket.

Characteristics:

| Feature | Details |
| --- | --- |
| Parallelism | Enabled by default; writes data in parallel to multiple files based on the number of slices in the cluster. |
| Compression | Supports GZIP, BZIP2, and ZSTD to reduce storage costs in S3. |
| Format | Can export as delimited text (CSV), JSON, or Parquet. |

[!WARNING] By default, UNLOAD creates multiple files. If you need a single file, you must use the PARALLEL OFF option, though this is slower and not recommended for large datasets.

Question

Amazon Redshift Spectrum

Answer

A feature that enables Redshift to execute SQL queries directly against data stored in Amazon S3 without the need to load the data into Redshift local storage.


Use Case: Ideal for querying "cold" or infrequent data, or for performing ad-hoc analysis on massive datasets in the data lake while joining them with "hot" data stored locally in Redshift.

Question

Hot vs. Cold Data Strategy

Answer

An architectural pattern in a Lakehouse environment used to optimize performance and cost by tiering data storage.

  • Hot Data: Frequently accessed, structured data stored in Amazon Redshift for high-performance BI and reporting.
  • Cold Data: Infrequently accessed or raw data stored in Amazon S3 for cost-efficiency.

[!NOTE] The UNLOAD command is frequently used to move "aging" data from Redshift to S3 to free up expensive local SSD storage while keeping it accessible via Redshift Spectrum or Athena.

Question

IAM Role for COPY/UNLOAD

Answer

The security mechanism required for an Amazon Redshift cluster to access Amazon S3 buckets for loading or unloading data.

Implementation:

  1. Create an IAM Role with policies like AmazonS3ReadOnlyAccess (for COPY) or AmazonS3FullAccess (for UNLOAD).
  2. Attach the role to the Redshift Cluster.
  3. Reference the Role's ARN in the SQL command:

```sql
COPY table_name
FROM 's3://bucket/path'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```

Showing 30 of 680 flashcards. Study all flashcards →

Ready to ace AWS Certified Data Engineer - Associate (DEA-C01)?

Access all 635 practice questions, 9 timed mock exams, study notes, and flashcards — no sign-up required.

Start Studying — Free