AWS SAA-C03: High-Performing Data Ingestion and Transformation
Determine high-performing data ingestion and transformation solutions
This study guide focuses on Domain 3.5 of the AWS Certified Solutions Architect - Associate (SAA-C03) exam. It covers the architecture of data pipelines, real-time streaming, and the tools required to transform raw data into actionable insights.
Learning Objectives
By the end of this guide, you should be able to:
- Differentiate between batch and stream ingestion patterns.
- Select the appropriate Kinesis service (Data Streams, Firehose, or Video Streams) based on latency and target requirements.
- Architect serverless ETL (Extract, Transform, Load) pipelines using AWS Glue.
- Implement secure file transfers using AWS Transfer Family.
- Optimize data formats (Parquet/ORC) for downstream analytical performance.
Key Terms & Glossary
- Data Lake: A centralized, elastic repository (typically Amazon S3) that stores structured and unstructured data at any scale.
- ETL (Extract, Transform, Load): The process of pulling data from sources, changing its format/structure, and loading it into a destination.
- Shard: The base unit of throughput in Kinesis Data Streams; provides a fixed capacity of 1 MB/sec (or 1,000 records/sec) data input and 2 MB/sec data output.
- Schema-on-Read: A data analysis strategy where data is stored in its raw form and a schema is only applied when the data is read (key for Data Lakes).
- Crawler: An AWS Glue feature that scans data stores (S3, RDS) to automatically infer schemas and populate the Glue Data Catalog.
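The shard limits in the glossary translate directly into capacity planning: you need enough shards to cover ingest volume, record rate, and egress volume simultaneously. A minimal sketch of that arithmetic (provisioned mode, using the per-shard limits above):

```python
import math

def shards_needed(mb_in_per_sec: float, records_per_sec: float,
                  mb_out_per_sec: float) -> int:
    """Estimate the provisioned shard count for a Kinesis Data Stream.
    Per-shard limits: 1 MB/s or 1,000 records/s in, 2 MB/s out."""
    return max(
        math.ceil(mb_in_per_sec / 1.0),      # ingest bandwidth constraint
        math.ceil(records_per_sec / 1000.0), # ingest record-rate constraint
        math.ceil(mb_out_per_sec / 2.0),     # consumer egress constraint
    )

# 5 MB/s in, 3,500 records/s, 12 MB/s fan-out to consumers:
print(shards_needed(5, 3500, 12))  # -> 6 (egress is the bottleneck)
```

Note that the binding constraint here is consumer egress, not ingest; adding enhanced fan-out consumers (which get dedicated 2 MB/s each) would change this calculation.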
The "Big Idea"
Modern data architecture has shifted from rigid, expensive silos to the Data Lake model. Instead of cleaning data before storage (which limits future use), AWS promotes ingesting data in its rawest form into Amazon S3. High-performing solutions use Amazon Kinesis for immediate insights and AWS Glue to prepare that data for high-speed querying in tools such as Amazon Athena or Amazon Redshift. This decoupled approach ensures scalability and cost-efficiency.
Formula / Concept Box
| Feature | Kinesis Data Streams | Kinesis Data Firehose |
|---|---|---|
| Latency | Real-time (~200ms) | Near real-time (60s buffer minimum) |
| Data Retention | 24 hours to 365 days | No retention (transient) |
| Scaling | Manual/Auto Sharding | Fully managed (automatic) |
| Transformation | Requires custom code (Lambda/KCL) | Built-in via Lambda blueprints |
| Destination | Custom Apps, EMR, Kinesis Analytics | S3, Redshift, OpenSearch, Splunk |
[!IMPORTANT] If the exam question mentions "real-time" and "custom processing," think Data Streams. If it mentions "delivery to S3" and "zero administration," think Firehose.
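When you do choose Data Streams, each record carries a partition key that determines which shard receives it: Kinesis MD5-hashes the key into a 128-bit space that is divided among the shards' hash-key ranges. A rough sketch of that routing (illustrative only; real shards carry explicit hash-key ranges that can split and merge):

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Approximate Kinesis record routing: MD5-hash the partition key
    into a 128-bit integer, then map it onto evenly sized hash ranges."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(h // range_size, shard_count - 1)

# All records for one key land on one shard, preserving per-key ordering:
print(shard_for_key("customer-42", 4))
```

The practical takeaway: a skewed partition key (e.g., one hot customer ID) concentrates traffic on one shard and creates a "hot shard" bottleneck regardless of total shard count.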
Hierarchical Outline
- Streaming Ingestion (Amazon Kinesis)
- Data Streams: High-throughput, low-latency. Requires managing Shards.
- Data Firehose: Simplest way to load streaming data into AWS data stores. Can transform data using Lambda (e.g., CSV to JSON).
- Video Streams: Ingests and stores video/audio for machine learning and playback.
- Serverless Transformation (AWS Glue)
- Glue Data Catalog: A central metadata repository.
- Glue Studio: Visual interface for creating ETL jobs.
- Glue DataBrew: Visual data preparation tool for non-coders (clean/normalize).
- Managed File Transfer
- AWS Transfer Family: Provides SFTP, FTPS, and FTP access directly to Amazon S3 or EFS.
- Orchestration & Governance
- AWS Lake Formation: Simplifies the setup of a secure Data Lake (manages Glue, S3, and IAM in one place).
Visual Anchors
Data Pipeline Architecture
Kinesis Shard Mechanism
\begin{tikzpicture}[node distance=1.5cm]
  \draw[thick] (0,0) rectangle (6,3);
  \node at (3,3.3) {\textbf{Kinesis Data Stream}};
  \draw[fill=blue!10] (0.5,0.5) rectangle (5.5,1.1) node[midway] {Shard 1: 1MB/s In | 2MB/s Out};
  \draw[fill=blue!10] (0.5,1.4) rectangle (5.5,2.0) node[midway] {Shard 2: 1MB/s In | 2MB/s Out};
  \draw[fill=blue!10] (0.5,2.3) rectangle (5.5,2.9) node[midway] {Shard N: Scalable Capacity};
  \draw[->, thick] (-1.5,1.5) -- (-0.2,1.5) node[midway, above] {Producers};
  \draw[->, thick] (6.2,1.5) -- (7.5,1.5) node[midway, above] {Consumers};
\end{tikzpicture}
Definition-Example Pairs
- Definition: Partitioning: The process of organizing data in S3 into a folder structure (e.g., /year/month/day/) to limit the amount of data scanned by queries.
- Example: A query for sales in "March 2023" only reads files in s3://bucket/2023/03/, reducing costs and increasing performance in Amazon Athena.
- Definition: FindMatches ML: A transform in AWS Glue that uses machine learning to identify and merge duplicate records across data sets.
- Example: Identifying that "John Doe, NY" and "J. Doe, New York" are the same customer without writing complex regular expressions.
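The partitioning definition above can be made concrete as a key-building helper. This sketch uses Hive-style `key=value` prefixes (which Glue crawlers recognize as partition columns automatically); the plain `/2023/03/` layout from the example works too, but then partitions must be registered manually or via partition projection:

```python
from datetime import datetime

def partitioned_key(prefix: str, event_time: datetime, filename: str) -> str:
    """Build a Hive-style year/month/day S3 object key so Athena can
    prune partitions instead of scanning the whole data lake."""
    return (f"{prefix}/year={event_time.year}"
            f"/month={event_time.month:02d}"
            f"/day={event_time.day:02d}/{filename}")

print(partitioned_key("sales", datetime(2023, 3, 15), "orders.json"))
# -> sales/year=2023/month=03/day=15/orders.json
```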
Worked Examples
Example 1: Real-time Fraud Detection
Scenario: A financial company needs to analyze credit card transactions as they happen and trigger an alert within 2 seconds if fraud is detected.
- Solution: Use Kinesis Data Streams. Transactions are sent to the stream. A Lambda function is triggered by the stream, runs fraud detection logic, and writes flagged events to a DynamoDB table.
- Why?: Firehose has a minimum 60-second buffer, which is too slow for this requirement.
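The stream-triggered Lambda in Example 1 can be sketched as below. Kinesis delivers records to Lambda base64-encoded, so the handler decodes each payload before applying detection logic; the threshold rule and field names (`amount`, `id`) are placeholders, not a real fraud model:

```python
import base64
import json

def handler(event, context=None):
    """Sketch of a Kinesis-triggered Lambda: decode each record and
    flag transactions above a (hypothetical) amount threshold."""
    flagged = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        txn = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if txn.get("amount", 0) > 10_000:  # placeholder fraud rule
            flagged.append(txn["id"])
    # In the real pipeline, flagged IDs would be written to DynamoDB here.
    return {"flagged": flagged}
```

Because Lambda polls the stream and invokes per batch, end-to-end latency stays well inside the 2-second requirement.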
Example 2: Log Aggregation for Analytics
Scenario: A company wants to store application logs from thousands of EC2 instances in an S3 bucket and convert them from JSON to Parquet for cheap querying with Athena.
- Solution: Use Kinesis Data Firehose. Install the Kinesis Agent on the EC2 instances to push logs. Enable Firehose record format conversion (which reads a table schema from the Glue Data Catalog) to convert JSON to Parquet, and configure S3 prefix partitioning on the delivery stream. (Lambda transformation blueprints handle custom record cleanup, but Parquet/ORC output specifically uses the built-in format conversion feature.)
- Why?: This is a "zero-ops" solution that handles scaling and transformation automatically.
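If custom record cleanup is needed before delivery, Firehose invokes a transformation Lambda with a batch of records and expects each one back with a `recordId`, a `result` status, and base64-encoded data. A minimal sketch of that contract (the normalization logic itself is just an illustration):

```python
import base64
import json

def firehose_transform(event, context=None):
    """Sketch of a Firehose transformation Lambda: re-emit every record
    with result Ok/Dropped and base64 data. Here it compacts each JSON
    payload and appends a newline so objects in S3 are line-delimited."""
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        line = json.dumps(payload, separators=(",", ":")) + "\n"
        output.append({
            "recordId": rec["recordId"],   # must echo the incoming ID
            "result": "Ok",                # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(line.encode()).decode(),
        })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, or Firehose treats the batch as failed and retries it.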
Checkpoint Questions
- Which Kinesis service is best for delivering data directly to an Amazon OpenSearch cluster with minimal management?
- Answer: Kinesis Data Firehose (it has a native destination for OpenSearch).
- You have data in an on-premises Hadoop cluster and want to move it to AWS for analysis while keeping the same folder structure, using SFTP. Which service do you use?
- Answer: AWS Transfer Family.
- How can you reduce the cost of Amazon Athena queries on a 10TB Data Lake stored in S3?
- Answer: Use AWS Glue to convert the data into a columnar format like Apache Parquet and partition the data by time or region.
- What is the primary difference between AWS Glue Studio and AWS Glue DataBrew?
- Answer: Glue Studio is for building ETL pipelines/jobs; DataBrew is for visual data cleaning and normalization (pre-processing).
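The cost lever behind checkpoint question 3 is worth making concrete: Athena bills per terabyte of data scanned, so Parquet conversion and partitioning pay off exactly in proportion to the bytes they eliminate. A rough cost model (assuming the commonly cited $5/TB rate; check current pricing for your region):

```python
def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Rough Athena cost estimate: pay-per-TB-scanned. Columnar formats
    and partitioning reduce bytes_scanned, which is the entire lever."""
    return round(bytes_scanned / 10**12 * price_per_tb, 4)

# Full scan of a 10 TB lake vs. a partitioned, columnar 50 GB scan:
print(athena_query_cost(10 * 10**12))  # -> 50.0
print(athena_query_cost(50 * 10**9))   # -> 0.25
```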