AWS Certified Data Engineer Associate: Reading Data from Batch Sources
Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)
This guide covers the essential AWS services and strategies for ingesting and reading data from batch sources, a core competency for the DEA-C01 exam.
Learning Objectives
- Identify the primary AWS services used for batch ingestion (S3, Glue, EMR, Redshift, DMS, AppFlow, Lambda).
- Select the appropriate batch service based on data volume, complexity, and operational overhead.
- Configure ingestion options such as Glue bookmarks, Redshift COPY commands, and S3 event triggers.
- Differentiate between serverless (Glue, Lambda) and provisioned (EMR, Redshift) batch processing models.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of reading data from a source, modifying it, and loading it into a destination.
- Data Catalog: A central metadata repository (like AWS Glue Data Catalog) that stores information about data formats and locations.
- DPU (Data Processing Unit): The unit of relative power used to measure the capacity of an AWS Glue job.
- Job Bookmark: A mechanism in AWS Glue that maintains state information to prevent the re-processing of old data in a source location.
- Federated Query: The ability to query data across different sources (e.g., Redshift querying RDS via JDBC) without moving the data.
The "Big Idea"
Batch ingestion is the "heartbeat" of the modern data lake. Unlike streaming (real-time), batch ingestion moves large volumes of data at specific intervals (e.g., hourly, daily). The goal is to move raw data from various sources—on-premises databases, SaaS apps, or flat files—into a centralized AWS landing zone (usually Amazon S3) or a data warehouse (Amazon Redshift) for structured analysis. Selecting the right tool depends on whether you value customization (EMR), simplicity (Glue), SQL-familiarity (Redshift), or event-driven triggers (Lambda).
Formula / Concept Box
| Concept | Metric / Rule | Key Consideration |
|---|---|---|
| Lambda Timeout | 15 Minutes | Use for small files/triggers only |
| Glue Worker Types | G.1X, G.2X, G.025X | G.2X for memory-intensive Spark jobs; G.025X for low-volume streaming |
| Redshift COPY | File count should be a multiple of the cluster's slice count | Splitting files lets every slice load in parallel |
| DMS Mode | Full Load vs. CDC | Full Load is the "Batch" portion of DMS |
Hierarchical Outline
- I. Storage as the Foundation (Amazon S3)
- Landing Zone: The primary destination for raw batch data.
- S3 Events: Triggering Lambda or SQS when a new file arrives.
- II. Managed ETL Services (AWS Glue)
- Crawlers: Automated schema discovery and cataloging.
- Connectors: Pre-built interfaces for JDBC, MongoDB, and SaaS.
- Bookmarks: Ensuring "exactly-once" processing by tracking state.
- III. Big Data Clusters (Amazon EMR)
- Ecosystem: Support for Spark, Hive, Presto, and HBase.
- Customization: Full root access to instances for complex dependencies.
- IV. Warehouse Ingestion (Amazon Redshift)
- COPY Command: High-speed batch loading from S3.
- Spectrum: Reading data directly from S3 without loading into Redshift tables.
- V. Specialty Ingestion
- Amazon AppFlow: Securely reading data from SaaS (Salesforce, Zendesk, etc.).
- AWS DMS: Batch migration from RDBMS/NoSQL sources to AWS.
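The "S3 Events" bullet above maps to a bucket notification configuration. The sketch below shows the shape of that configuration as JSON; the Lambda ARN, prefix, and suffix are placeholder values for illustration.

```json
{
  "LambdaFunctionConfigurations": [
    {
      "Id": "trigger-batch-ingest",
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-glue-job",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "landing/" },
            { "Name": "suffix", "Value": ".csv" }
          ]
        }
      }
    }
  ]
}
```

Filtering on prefix and suffix keeps the trigger from firing on files outside the landing zone, which matters when multiple pipelines share one bucket.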
Visual Anchors
Batch Data Flow Architecture
S3 Ingestion Trigger Logic
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, fill=blue!10, text centered,
                       rounded corners, minimum width=3cm, minimum height=1cm}]
  \node (s3) {S3 Bucket};
  \node (event) [right of=s3, xshift=2cm] {S3 Event Notification};
  \node (lambda) [below of=event] {AWS Lambda};
  \node (glue) [left of=lambda, xshift=-2cm] {Glue Job (Process)};
  \draw[->, thick] (s3) -- node[above] {File Upload} (event);
  \draw[->, thick] (event) -- node[right] {Trigger} (lambda);
  \draw[->, thick] (lambda) -- node[above] {Start Job} (glue);
\end{tikzpicture}
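The trigger logic in the diagram can be sketched as a Lambda handler. This is a minimal illustration, not a production function: the Glue job name is hypothetical, and the actual `start_job_run` call is shown only as a comment so the parsing logic stands alone.

```python
import json

# Hypothetical Glue job name -- substitute your own.
GLUE_JOB_NAME = "process-landing-files"

def extract_glue_arguments(event):
    """Pull the bucket and key out of an S3 event notification
    and shape them as --arguments for a Glue job run."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {"--source_path": f"s3://{bucket}/{key}"}

def lambda_handler(event, context):
    args = extract_glue_arguments(event)
    # In a real deployment you would start the Glue job here:
    #   boto3.client("glue").start_job_run(JobName=GLUE_JOB_NAME, Arguments=args)
    return {"statusCode": 200, "body": json.dumps(args)}
```

Keeping the Lambda this thin respects its 15-minute timeout: it hands the heavy lifting to Glue rather than processing the file itself.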
Definition-Example Pairs
- Service: Amazon AppFlow
- Definition: A fully managed integration service that enables secure data transfer between SaaS applications and AWS services.
- Example: Automatically pulling Salesforce Opportunity records every night at 12:00 AM and saving them as Parquet files in S3.
- Service: AWS Glue Bookmarks
- Definition: A feature that persists state information from a job run to identify data that has already been processed.
- Example: A Glue job reads a folder containing 1,000 CSVs. After run #1, 10 new files are added. Run #2 uses bookmarks to read only the 10 new files.
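The bookmark example above can be illustrated with a plain-Python analogy. Note this is only a sketch of the *idea* behind bookmarks (persist what was seen, process only what is new); real bookmarks are managed by the awsglue library via `job.init()` and `job.commit()`, not by user code like this.

```python
def new_files(all_files, bookmark_state):
    """Return only files not recorded in a previous run."""
    return sorted(f for f in all_files if f not in bookmark_state)

def commit(bookmark_state, processed):
    """Record the files just processed, analogous to job.commit()."""
    bookmark_state.update(processed)

state = set()

run1 = new_files({"a.csv", "b.csv"}, state)            # first run sees everything
commit(state, run1)

run2 = new_files({"a.csv", "b.csv", "c.csv"}, state)   # second run sees only the new file
```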
Worked Examples
Example 1: High-Speed Loading into Redshift
Scenario: You have 500GB of log data in S3 and need to load it into Redshift for a BI dashboard.
- Format: Ensure the data is in a COPY-supported format such as Parquet (columnar) or delimited text (CSV).
- Split: Split the data into multiple files (e.g., 128MB each) to allow Redshift's compute slices to ingest in parallel.
- Command: Use the COPY command:

```sql
COPY schema.table
FROM 's3://mybucket/logs/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```
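The "split to match slices" step can be checked with quick arithmetic. The helper below is an illustrative sketch: the 16-slice figure is an assumption (slice count depends on node type and count), and 128 MB is simply a commonly used target file size.

```python
def copy_file_plan(total_bytes, num_slices, target_file_mb=128):
    """Suggest a file count for a Redshift COPY: a multiple of
    the slice count, close to the target per-file size."""
    target = target_file_mb * 1024 * 1024
    files = max(1, round(total_bytes / target))
    # Round up to the next multiple of the slice count so every
    # slice receives an equal share of files.
    files = ((files + num_slices - 1) // num_slices) * num_slices
    return files, total_bytes / files

# 500 GB across a cluster assumed to have 16 slices
files, size_per_file = copy_file_plan(500 * 1024**3, 16)
```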
Example 2: Cross-Account Data Discovery
Scenario: Data is stored in a different AWS account's S3 bucket, and you need to catalog it.
- Permissions: Set up a Bucket Policy in the source account allowing the destination Glue IAM role access.
- Crawler: Create an AWS Glue Crawler in your account.
- Target: Point the Crawler to the cross-account S3 path.
- Result: The metadata is populated in your local Glue Data Catalog for use in Athena.
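The "Permissions" step above corresponds to a bucket policy in the source account. The sketch below shows its general shape; the account ID, role name, and bucket name are hypothetical placeholders.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCrossAccountGlueCrawler",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::222222222222:role/GlueCrawlerRole"
      },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::source-data-bucket",
        "arn:aws:s3:::source-data-bucket/*"
      ]
    }
  ]
}
```

Note that both the bucket ARN (for `s3:ListBucket`) and the object ARN with `/*` (for `s3:GetObject`) are required; omitting one is a common cause of crawler failures.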
Checkpoint Questions
- Which service would you choose if you need to run a Spark job with a custom Python library not supported by Glue?
- What is the maximum execution time for an AWS Lambda function used for batch processing?
- How does Amazon Redshift Spectrum differ from the Redshift COPY command?
- When should you use AWS Glue "Python Shell" jobs instead of "Spark" jobs?
Comparison Tables
| Feature | AWS Glue | Amazon EMR | AWS Lambda | Amazon Redshift |
|---|---|---|---|---|
| Management | Serverless | Managed Clusters | Serverless | Serverless or Provisioned |
| Scaling | Automatic (DPUs) | Manual or Auto-scaling | Automatic (Concurrency) | Manual or RA3 Auto-scale |
| Max Runtime | 48 Hours | Indefinite | 15 Minutes | Indefinite |
| Primary Language | Python, Scala (Spark) | Spark, Hive, Flink, Pig | Python, Node, Java, etc. | SQL |
| Best For | Standard ETL, Cataloging | Complex, High-scale Big Data | Small, Event-driven tasks | SQL-heavy Aggregations |
Muddy Points & Cross-Refs
- Glue vs. EMR: This is a common exam trap. Choose Glue for standard Spark ETL where you want to "set it and forget it." Choose EMR if you need specific versions of open-source tools (like Presto) or need to log into the machine to tune OS-level settings.
- JDBC vs. Native S3: Reading via JDBC (e.g., from RDS) is usually slower than reading flat files from S3. For massive batch loads, migrate the DB to S3 first using DMS, then process with Glue.
- Pricing: Remember that Glue is billed per DPU-Hour (minimum 1 minute), while EMR is billed per instance-hour plus the EMR software fee.
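The DPU-Hour billing rule above can be made concrete with a small calculation. The rate used here is illustrative only (check current AWS pricing), and the 1-minute minimum matches Glue 2.0 and later.

```python
def glue_job_cost(dpus, runtime_minutes, rate_per_dpu_hour=0.44):
    """Estimate a Glue job's cost. The rate is an illustrative
    assumption; billing has a 1-minute minimum (Glue 2.0+)."""
    billed_minutes = max(runtime_minutes, 1)
    return dpus * (billed_minutes / 60) * rate_per_dpu_hour

# A 10-DPU job running 30 minutes
cost = glue_job_cost(10, 30)

# A 12-second job is still billed for the 1-minute minimum
short_cost = glue_job_cost(10, 0.2)
```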