AWS Certified Data Engineer Associate: Reading Data from Batch Sources
Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)
This guide covers the essential AWS services and strategies for ingesting and reading data from batch sources, a core competency for the DEA-C01 exam.
Learning Objectives
- Identify the primary AWS services used for batch ingestion (S3, Glue, EMR, Redshift, DMS, AppFlow, Lambda).
- Select the appropriate batch service based on data volume, complexity, and operational overhead.
- Configure ingestion options such as Glue bookmarks, Redshift COPY commands, and S3 event triggers.
- Differentiate between serverless (Glue, Lambda) and provisioned (EMR, Redshift) batch processing models.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of reading data from a source, modifying it, and loading it into a destination.
- Data Catalog: A central metadata repository (like AWS Glue Data Catalog) that stores information about data formats and locations.
- DPU (Data Processing Unit): The unit of relative power used to measure the capacity of an AWS Glue job.
- Job Bookmark: A mechanism in AWS Glue that maintains state information to prevent the re-processing of old data in a source location.
- Federated Query: The ability to query data across different sources (e.g., Redshift querying RDS via JDBC) without moving the data.
The "Big Idea"
Batch ingestion is the "heartbeat" of the modern data lake. Unlike streaming (real-time), batch ingestion moves large volumes of data at specific intervals (e.g., hourly, daily). The goal is to move raw data from various sources—on-premises databases, SaaS apps, or flat files—into a centralized AWS landing zone (usually Amazon S3) or a data warehouse (Amazon Redshift) for structured analysis. Selecting the right tool depends on whether you value customization (EMR), simplicity (Glue), SQL-familiarity (Redshift), or event-driven triggers (Lambda).
Formula / Concept Box
| Concept | Metric / Rule | Key Consideration |
|---|---|---|
| Lambda Timeout | 15 Minutes | Use for small files/triggers only |
| Glue Worker Types | G.1X, G.2X, G.025X | G.2X for memory-intensive Spark jobs; G.025X for low-volume streaming |
| Redshift COPY | File count should be a multiple of the cluster's slice count | Splitting files lets every slice load in parallel |
| DMS Mode | Full Load vs. CDC | Full Load is the "Batch" portion of DMS |
Hierarchical Outline
- I. Storage as the Foundation (Amazon S3)
- Landing Zone: The primary destination for raw batch data.
- S3 Events: Triggering Lambda or SQS when a new file arrives.
- II. Managed ETL Services (AWS Glue)
- Crawlers: Automated schema discovery and cataloging.
- Connectors: Pre-built interfaces for JDBC, MongoDB, and SaaS.
- Bookmarks: Ensuring "exactly-once" processing by tracking state.
- III. Big Data Clusters (Amazon EMR)
- Ecosystem: Support for Spark, Hive, Presto, and HBase.
- Customization: Full root access to instances for complex dependencies.
- IV. Warehouse Ingestion (Amazon Redshift)
- COPY Command: High-speed batch loading from S3.
- Spectrum: Reading data directly from S3 without loading into Redshift tables.
- V. Specialty Ingestion
- Amazon AppFlow: Securely reading data from SaaS (Salesforce, Zendesk, etc.).
- AWS DMS: Batch migration from RDBMS/NoSQL sources to AWS.
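The "S3 Events" bullet above maps to a bucket notification configuration. The sketch below shows the shape of that configuration as JSON; the Lambda ARN, prefix, and suffix are placeholder values for illustration.

```json
{
  "LambdaFunctionConfigurations": [
    {
      "Id": "trigger-batch-ingest",
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-glue-job",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "landing/" },
            { "Name": "suffix", "Value": ".csv" }
          ]
        }
      }
    }
  ]
}
```

Filtering on prefix and suffix keeps the trigger from firing on files outside the landing zone, which matters when multiple pipelines share one bucket.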
Visual Anchors
Batch Data Flow Architecture
S3 Ingestion Trigger Logic
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, fill=blue!10, text centered,
                       rounded corners, minimum width=3cm, minimum height=1cm}]
  \node (s3) {S3 Bucket};
  \node (event) [right of=s3, xshift=2cm] {S3 Event Notification};
  \node (lambda) [below of=event] {AWS Lambda};
  \node (glue) [left of=lambda, xshift=-2cm] {Glue Job (Process)};
  \draw[->, thick] (s3) -- node[above] {File Upload} (event);
  \draw[->, thick] (event) -- node[right] {Trigger} (lambda);
  \draw[->, thick] (lambda) -- node[above] {Start Job} (glue);
\end{tikzpicture}
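The trigger logic in the diagram can be sketched as a Lambda handler. This is a minimal illustration, not a production function: the Glue job name is hypothetical, and the actual `start_job_run` call is shown only as a comment so the parsing logic stands alone.

```python
import json

# Hypothetical Glue job name -- substitute your own.
GLUE_JOB_NAME = "process-landing-files"

def extract_glue_arguments(event):
    """Pull the bucket and key out of an S3 event notification
    and shape them as --arguments for a Glue job run."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"--source_path": f"s3://{bucket}/{key}"}

def lambda_handler(event, context):
    args = extract_glue_arguments(event)
    # In a real deployment you would start the Glue job here:
    #   boto3.client("glue").start_job_run(JobName=GLUE_JOB_NAME, Arguments=args)
    return {"statusCode": 200, "body": json.dumps(args)}
```

Keeping the Lambda this thin respects its 15-minute timeout: it hands the heavy lifting to Glue rather than processing the file itself.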
Definition-Example Pairs
- Service: Amazon AppFlow
- Definition: A fully managed integration service that enables secure data transfer between SaaS applications and AWS services.
- Example: Automatically pulling Salesforce Opportunity records every night at 12:00 AM and saving them as Parquet files in S3.
- Service: AWS Glue Bookmarks
- Definition: A feature that persists state information from a job run to identify data that has already been processed.
- Example: A Glue job reads a folder containing 1,000 CSVs. After run #1, 10 new files are added. Run #2 uses bookmarks to read only the 10 new files.
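The bookmark example above can be illustrated with a plain-Python analogy. Note this is only a sketch of the *idea* behind bookmarks (persist what was seen, process only what is new); real bookmarks are managed by the awsglue library via `job.init()` and `job.commit()`, not by user code like this.

```python
def new_files(all_files, bookmark_state):
    """Return only files not recorded in a previous run."""
    return sorted(f for f in all_files if f not in bookmark_state)

def commit(bookmark_state, processed):
    """Record the files just processed, analogous to job.commit()."""
    bookmark_state.update(processed)

state = set()

run1 = new_files({"a.csv", "b.csv"}, state)            # first run sees everything
commit(state, run1)

run2 = new_files({"a.csv", "b.csv", "c.csv"}, state)   # second run sees only the new file
```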
Worked Examples
Example 1: High-Speed Loading into Redshift
Scenario: You have 500GB of log data in S3 and need to load it into Redshift for a BI dashboard.
- Format: Ensure the data is in a COPY-supported format such as Parquet (columnar) or delimited text (CSV).
- Split: Split the data into multiple files (e.g., 128MB each) to allow Redshift's compute slices to ingest in parallel.
- Command: Use the COPY command:

```sql
COPY schema.table
FROM 's3://mybucket/logs/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```
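The "split to match slices" step can be checked with quick arithmetic. The helper below is an illustrative sketch: the 16-slice figure is an assumption (slice count depends on node type and count), and 128 MB is simply a commonly used target file size.

```python
def copy_file_plan(total_bytes, num_slices, target_file_mb=128):
    """Suggest a file count for a Redshift COPY: a multiple of
    the slice count, close to the target per-file size."""
    target = target_file_mb * 1024 * 1024
    files = max(1, round(total_bytes / target))
    # Round up to the next multiple of the slice count so every
    # slice receives an equal share of files.
    files = ((files + num_slices - 1) // num_slices) * num_slices
    return files, total_bytes / files

# 500 GB across a cluster assumed to have 16 slices
files, size_per_file = copy_file_plan(500 * 1024**3, 16)
```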
Example 2: Cross-Account Data Discovery
Scenario: Data is stored in a different AWS account's S3 bucket, and you need to catalog it.
- Permissions: Set up a Bucket Policy in the source account allowing the destination Glue IAM role access.
- Crawler: Create an AWS Glue Crawler in your account.
- Target: Point the Crawler to the cross-account S3 path.
- Result: The metadata is populated in your local Glue Data Catalog for use in Athena.
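The "Permissions" step above corresponds to a bucket policy in the source account. The sketch below shows its general shape; the account ID, role name, and bucket name are hypothetical placeholders.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCrossAccountGlueCrawler",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/GlueCrawlerRole"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::source-data-bucket",
        "arn:aws:s3:::source-data-bucket/*"
      ]
    }
  ]
}
```

Note that both the bucket ARN (for `s3:ListBucket`) and the object ARN with `/*` (for `s3:GetObject`) are required; omitting one is a common cause of crawler failures.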
Checkpoint Questions
- Which service would you choose if you need to run a Spark job with a custom Python library not supported by Glue?
- What is the maximum execution time for an AWS Lambda function used for batch processing?
- How does Amazon Redshift Spectrum differ from the Redshift COPY command?
- When should you use AWS Glue "Python Shell" jobs instead of "Spark" jobs?
Comparison Tables
| Feature | AWS Glue | Amazon EMR | AWS Lambda | Amazon Redshift |
|---|---|---|---|---|
| Management | Serverless | Managed Clusters | Serverless | Serverless or Provisioned |
| Scaling | Automatic (DPUs) | Manual or Auto-scaling | Automatic (Concurrency) | Manual or RA3 Auto-scale |
| Max Runtime | 48 Hours | Indefinite | 15 Minutes | Indefinite |
| Primary Language | Python, Scala (Spark) | Spark, Hive, Flink, Pig | Python, Node, Java, etc. | SQL |
| Best For | Standard ETL, Cataloging | Complex, High-scale Big Data | Small, Event-driven tasks | SQL-heavy Aggregations |
Muddy Points & Cross-Refs
- Glue vs. EMR: This is a common exam trap. Choose Glue for standard Spark ETL where you want to "set it and forget it." Choose EMR if you need specific versions of open-source tools (like Presto) or need to log into the machine to tune OS-level settings.
- JDBC vs. Native S3: Reading via JDBC (e.g., from RDS) is usually slower than reading flat files from S3. For massive batch loads, migrate the DB to S3 first using DMS, then process with Glue.
- Pricing: Remember that Glue is billed per DPU-Hour (minimum 1 minute), while EMR is billed per instance-hour plus the EMR software fee.
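The DPU-Hour billing rule above can be made concrete with a small calculation. The rate used here is illustrative only (check current AWS pricing), and the 1-minute minimum matches Glue 2.0 and later.

```python
def glue_job_cost(dpus, runtime_minutes, rate_per_dpu_hour=0.44):
    """Estimate a Glue job's cost. The rate is an illustrative
    assumption; billing has a 1-minute minimum (Glue 2.0+)."""
    billed_minutes = max(runtime_minutes, 1)
    return dpus * (billed_minutes / 60) * rate_per_dpu_hour

# A 10-DPU job running 30 minutes
cost = glue_job_cost(10, 30)

# A 12-second job is still billed for the 1-minute minimum
short_cost = glue_job_cost(10, 0.2)
```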