
AWS Certified Data Engineer Associate: Reading Data from Batch Sources

Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)

This guide covers the essential AWS services and strategies for ingesting and reading data from batch sources, a core competency for the DEA-C01 exam.

Learning Objectives

  • Identify the primary AWS services used for batch ingestion (S3, Glue, EMR, Redshift, DMS, AppFlow, Lambda).
  • Select the appropriate batch service based on data volume, complexity, and operational overhead.
  • Configure ingestion options such as Glue bookmarks, Redshift COPY commands, and S3 event triggers.
  • Differentiate between serverless (Glue, Lambda) and provisioned (EMR, Redshift) batch processing models.

Key Terms & Glossary

  • ETL (Extract, Transform, Load): The process of reading data from a source, modifying it, and loading it into a destination.
  • Data Catalog: A central metadata repository (like AWS Glue Data Catalog) that stores information about data formats and locations.
  • DPU (Data Processing Unit): The unit of relative power used to measure the capacity of an AWS Glue job.
  • Job Bookmark: A mechanism in AWS Glue that maintains state information to prevent the re-processing of old data in a source location.
  • Federated Query: The ability to query data across different sources (e.g., Redshift querying RDS via JDBC) without moving the data.

The "Big Idea"

Batch ingestion is the "heartbeat" of the modern data lake. Unlike streaming (real-time), batch ingestion moves large volumes of data at specific intervals (e.g., hourly, daily). The goal is to move raw data from various sources—on-premises databases, SaaS apps, or flat files—into a centralized AWS landing zone (usually Amazon S3) or a data warehouse (Amazon Redshift) for structured analysis. Selecting the right tool depends on whether you value customization (EMR), simplicity (Glue), SQL-familiarity (Redshift), or event-driven triggers (Lambda).

Formula / Concept Box

| Concept | Metric / Rule | Key Consideration |
|---|---|---|
| Lambda Timeout | 15 minutes | Use for small files/triggers only |
| Glue Worker Types | G.1X, G.2X, G.025X | G.2X is for memory-intensive Spark jobs |
| Redshift COPY | Parallelism = number of slices | Split files to match the number of slices for speed |
| DMS Mode | Full Load vs. CDC | Full Load is the "batch" portion of DMS |

Hierarchical Outline

  • I. Storage as the Foundation (Amazon S3)
    • Landing Zone: The primary destination for raw batch data.
    • S3 Events: Triggering Lambda or SQS when a new file arrives.
  • II. Managed ETL Services (AWS Glue)
    • Crawlers: Automated schema discovery and cataloging.
    • Connectors: Pre-built interfaces for JDBC, MongoDB, and SaaS.
    • Bookmarks: Ensuring "exactly-once" processing by tracking state.
  • III. Big Data Clusters (Amazon EMR)
    • Ecosystem: Support for Spark, Hive, Presto, and HBase.
    • Customization: Full root access to instances for complex dependencies.
  • IV. Warehouse Ingestion (Amazon Redshift)
    • COPY Command: High-speed batch loading from S3.
    • Spectrum: Reading data directly from S3 without loading into Redshift tables.
  • V. Specialty Ingestion
    • Amazon AppFlow: Securely reading data from SaaS (Salesforce, Zendesk, etc.).
    • AWS DMS: Batch migration from RDBMS/NoSQL sources to AWS.

Visual Anchors

Batch Data Flow Architecture

S3 Ingestion Trigger Logic

\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text centered, rounded corners, minimum width=3cm, minimum height=1cm}]
  \node (s3) {S3 Bucket};
  \node (event) [right of=s3, xshift=2cm] {S3 Event Notification};
  \node (lambda) [below of=event] {AWS Lambda};
  \node (glue) [left of=lambda, xshift=-2cm] {Glue Job (Process)};

  \draw[->, thick] (s3) -- node[above] {File Upload} (event);
  \draw[->, thick] (event) -- node[right] {Trigger} (lambda);
  \draw[->, thick] (lambda) -- node[above] {Start Job} (glue);
\end{tikzpicture}
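The trigger logic in the diagram can be sketched as a minimal Lambda handler. This is a sketch, not a definitive implementation: the Glue job name and the `--input_path` argument key are hypothetical, and the boto3 call is shown as a comment so the event-parsing logic stands on its own.

```python
import json

GLUE_JOB_NAME = "process-raw-logs"  # hypothetical Glue job name


def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


def handler(event, context):
    # In a real deployment the client is created once, outside the handler:
    #   import boto3
    #   glue = boto3.client("glue")
    started = []
    for bucket, key in extract_s3_objects(event):
        # Illustrative only -- starts the Glue job for each new object:
        #   glue.start_job_run(JobName=GLUE_JOB_NAME,
        #                      Arguments={"--input_path": f"s3://{bucket}/{key}"})
        started.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps(started)}
```

In practice the same S3 event can instead target SQS for buffering, which decouples bursty uploads from the downstream Glue job.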

Definition-Example Pairs

  • Service: Amazon AppFlow
    • Definition: A fully managed integration service that enables secure data transfer between SaaS applications and AWS services.
    • Example: Automatically pulling Salesforce Opportunity records every night at 12:00 AM and saving them as Parquet files in S3.
  • Service: AWS Glue Bookmarks
    • Definition: A feature that persists state information from a job run to identify data that has already been processed.
    • Example: A Glue job reads a folder containing 1,000 CSVs. After run #1, 10 new files are added. Run #2 uses bookmarks to read only the 10 new files.
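Conceptually, a bookmark is persisted state that filters out already-seen inputs. The toy model below illustrates the bookmark example above; it is not the actual awsglue API, which tracks state per source via `transformation_ctx` and persists it on `job.commit()`.

```python
class JobBookmark:
    """Toy model of a Glue job bookmark: remember what was already processed."""

    def __init__(self):
        self.processed = set()

    def new_files(self, files):
        # Return only files not seen in a previous run.
        return [f for f in files if f not in self.processed]

    def commit(self, files):
        # Glue persists bookmark state only when job.commit() succeeds.
        self.processed.update(files)


bookmark = JobBookmark()

# Run #1: the folder holds 1,000 CSVs, all of them new.
run1 = [f"file_{i}.csv" for i in range(1000)]
bookmark.commit(bookmark.new_files(run1))

# Run #2: 10 new files were added; only those are picked up.
run2 = run1 + [f"new_{i}.csv" for i in range(10)]
print(len(bookmark.new_files(run2)))  # 10
```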

Worked Examples

Example 1: High-Speed Loading into Redshift

Scenario: You have 500GB of log data in S3 and need to load it into Redshift for a BI dashboard.

  1. Format: Ensure the data is in a supported format such as Parquet (columnar) or delimited text (e.g., CSV).
  2. Split: Split the data into multiple files (e.g., 128MB each) to allow Redshift's compute slices to ingest in parallel.
  3. Command: Use the COPY command:

```sql
COPY schema.table
FROM 's3://mybucket/logs/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```
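Step 2's sizing rule reduces to quick arithmetic. A sketch, assuming a hypothetical 4-node cluster with 2 slices per node (actual slice counts vary by node type):

```python
import math


def split_count(total_bytes, target_bytes, slices):
    """File count for a parallel COPY: roughly target-sized files,
    rounded up to a multiple of the slice count so every slice has work."""
    files = math.ceil(total_bytes / target_bytes)
    return math.ceil(files / slices) * slices


GB = 1024**3
MB = 1024**2

# 500 GB of logs, ~128 MB per file, 8 slices (4 nodes x 2 slices each)
print(split_count(500 * GB, 128 * MB, 8))  # 4000
```

4,000 files is already a multiple of 8, so each slice ingests 500 files in parallel.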

Example 2: Cross-Account Data Discovery

Scenario: Data is stored in a different AWS account's S3 bucket, and you need to catalog it.

  1. Permissions: Set up a Bucket Policy in the source account allowing the destination Glue IAM role access.
  2. Crawler: Create an AWS Glue Crawler in your account.
  3. Target: Point the Crawler to the cross-account S3 path.
  4. Result: The metadata is populated in your local Glue Data Catalog for use in Athena.
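Step 1's bucket policy can be sketched as follows. The bucket name, account ID, and role name are placeholders, and the `put_bucket_policy` call (run in the source account) is shown as a comment:

```python
import json

SOURCE_BUCKET = "shared-data-bucket"  # bucket in the source account
# Glue crawler role in the destination account (placeholder ARN):
CRAWLER_ROLE = "arn:aws:iam::123456789012:role/MyGlueCrawlerRole"

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCrossAccountGlueCrawler",
        "Effect": "Allow",
        "Principal": {"AWS": CRAWLER_ROLE},
        # The crawler must list the bucket and read its objects.
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{SOURCE_BUCKET}",        # ListBucket applies here
            f"arn:aws:s3:::{SOURCE_BUCKET}/*",      # GetObject applies here
        ],
    }],
}

# Applied in the source account, e.g.:
#   import boto3
#   boto3.client("s3").put_bucket_policy(Bucket=SOURCE_BUCKET,
#                                        Policy=json.dumps(policy))
print(json.dumps(policy, indent=2))
```

Note the destination role also needs a matching IAM policy on its own side; a bucket policy alone is not sufficient for cross-account access.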

Checkpoint Questions

  1. Which service would you choose if you need to run a Spark job with a custom Python library not supported by Glue?
  2. What is the maximum execution time for an AWS Lambda function used for batch processing?
  3. How does Amazon Redshift Spectrum differ from the Redshift COPY command?
  4. When should you use AWS Glue "Python Shell" jobs instead of "Spark" jobs?

Comparison Tables

| Feature | AWS Glue | Amazon EMR | AWS Lambda | Amazon Redshift |
|---|---|---|---|---|
| Management | Serverless | Managed clusters | Serverless | Serverless or provisioned |
| Scaling | Automatic (DPUs) | Manual or auto-scaling | Automatic (concurrency) | Manual or RA3 auto-scale |
| Max Runtime | 48 hours | Indefinite | 15 minutes | Indefinite |
| Primary Language | Python, Scala (Spark) | Spark, Hive, Flink, Pig | Python, Node, Java, etc. | SQL |
| Best For | Standard ETL, cataloging | Complex, high-scale big data | Small, event-driven tasks | SQL-heavy aggregations |

Muddy Points & Cross-Refs

  • Glue vs. EMR: This is a common exam trap. Choose Glue for standard Spark ETL where you want to "set it and forget it." Choose EMR if you need specific versions of open-source tools (like Presto) or need to log into the machine to tune OS-level settings.
  • JDBC vs. Native S3: Reading via JDBC (e.g., from RDS) is usually slower than reading flat files from S3. For massive batch loads, migrate the DB to S3 first using DMS, then process with Glue.
  • Pricing: Remember that Glue is billed per DPU-Hour (minimum 1 minute), while EMR is billed per instance-hour plus the EMR software fee.
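The Glue pricing rule translates into simple arithmetic. The $0.44 per DPU-hour rate below is illustrative only; actual rates vary by region and Glue version:

```python
def glue_job_cost(dpus, runtime_minutes, rate_per_dpu_hour=0.44, min_minutes=1):
    """Estimated Glue job cost: DPUs x billed hours x hourly rate,
    with the per-job minimum billing duration applied."""
    billed_minutes = max(runtime_minutes, min_minutes)
    return dpus * (billed_minutes / 60) * rate_per_dpu_hour


# 10 DPUs running for 30 minutes at the illustrative rate
print(round(glue_job_cost(10, 30), 2))  # 2.2
```

A job that finishes in seconds is still billed for the 1-minute minimum, which is why very small, frequent tasks are often cheaper on Lambda.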
