Mastering AWS Data Transfer Solutions: SAA-C03 Study Guide
Designing data transfer solutions
Learning Objectives
After studying this guide, you should be able to:
- Select the appropriate AWS Snow Family device based on data volume and compute requirements.
- Differentiate between AWS DataSync and AWS Transfer Family for online data migrations.
- Design data streaming architectures using Amazon Kinesis Data Streams and Data Firehose.
- Evaluate the cost-effectiveness of physical migration versus over-the-wire transfer.
- Understand the role of AWS Glue in transforming data during the ingestion process.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of gathering data from various sources, changing its format, and loading it into a destination. AWS Glue is an example of a managed ETL service.
- Hydration: The process of initially loading a large amount of data into a storage system or data lake.
- Point-in-Time Migration: A one-time movement of data, typically using physical devices like Snowball.
- Streaming Data: Data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously (e.g., application logs).
- VPC Endpoint: A private connection between your VPC and supported AWS services without requiring an internet gateway.
The "Big Idea"
Data transfer in AWS is not a "one-size-fits-all" task. It is a balancing act between Volume, Velocity, and Cost. When moving petabytes of data, the physics of limited bandwidth often makes shipping hardware faster than sending the data over the internet. Conversely, for continuous, small-scale updates, automated managed services provide the efficiency needed for high-performing architectures.
Formula / Concept Box
| Concept | Metric / Formula / Rule | Use Case |
|---|---|---|
| Transfer Time | $T = \dfrac{\text{data size (bits)}}{\text{link speed (bits/s)}}$ | Deciding between Snowball vs. Direct Connect |
| Snowball Edge Storage | 80 TB Usable | Large scale data migration |
| Snowcone Storage | 8 TB Usable (HDD) or 14 TB Usable (SSD) | Small/Edge location migration |
| Kinesis Data Streams | Real-time (70ms latency) | High-performance analytics |
| Kinesis Data Firehose | Near real-time (60s+ latency) | Loading data into S3/Redshift |
Hierarchical Outline
- I. Physical Migration (AWS Snow Family)
  - Snowcone: Ultra-portable, 8 TB usable HDD storage (14 TB on the SSD model), 2 vCPUs.
  - Snowball Edge:
    - Storage Optimized: 80 TB storage, 40 vCPUs.
    - Compute Optimized: 42 TB storage, 52 vCPUs (for edge ML/processing).
  - Snowmobile: 45 ft shipping container, up to 100 PB per truck.
- II. Online Data Transfer
  - AWS DataSync: Automates moving data between on-premises storage and AWS (S3, EFS, FSx).
  - AWS Transfer Family: Managed support for SFTP, FTPS, and FTP.
  - Amazon S3 Transfer Acceleration: Uses CloudFront’s edge locations to speed up long-distance uploads.
- III. Data Ingestion & Transformation
  - Amazon Kinesis: Handles streaming data (Video Streams, Data Streams, Data Firehose).
  - AWS Glue: Serverless ETL for transforming data (e.g., CSV to Parquet).
Visual Anchors
Migration Decision Logic
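One way to read the decision logic implied by this guide's outline is as a small function. The following is a minimal sketch, not an official AWS decision tree: the function name, its parameters, and the 14-day deadline are all hypothetical choices made for illustration.

```python
def recommend_transfer_service(
    *,
    continuous_stream: bool = False,
    needs_file_protocol: bool = False,
    data_tb: float = 0.0,
    link_mbps: float = 0.0,
    max_days: float = 14.0,  # hypothetical deadline, not an AWS rule
) -> str:
    """Return a coarse service recommendation for a data transfer workload."""
    # Continuously generated records point to a streaming service.
    if continuous_stream:
        return "Amazon Kinesis"
    # Partners or legacy clients that speak SFTP/FTPS/FTP.
    if needs_file_protocol:
        return "AWS Transfer Family"
    # No usable network link: physical shipment is the only option.
    if link_mbps <= 0:
        return "AWS Snow Family"
    # Estimate over-the-wire time, assuming the link is fully utilized.
    total_bits = data_tb * 1024**4 * 8
    days = total_bits / (link_mbps * 1_000_000) / 86_400
    return "AWS Snow Family" if days > max_days else "AWS DataSync"


if __name__ == "__main__":
    print(recommend_transfer_service(data_tb=100, link_mbps=100))
    print(recommend_transfer_service(data_tb=0.5, link_mbps=1000))
```

The ordering of the checks is itself an assumption: it treats the transfer pattern (streaming or file protocol) as more decisive than volume, and only then compares the estimated transfer time against the deadline.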
DataSync Architecture
\begin{tikzpicture}[node distance=2cm]
  \draw[thick] (0,0) rectangle (2.5,1.5) node[midway] {\begin{tabular}{c} On-Prem \\ Storage \end{tabular}};
  \draw[->, thick] (2.5,0.75) -- (4,0.75) node[midway, above] {Agent};
  \draw[thick, dashed] (4,-0.5) rectangle (7,2.5) node[at start, below right] {AWS Cloud};
  \draw[thick] (4.5,0.75) circle (0.5cm) node {Sync};
  \draw[->, thick] (5,0.75) -- (6,0.75);
  \node at (6.5,0.75) [draw] {S3 / EFS};
\end{tikzpicture}
Definition-Example Pairs
- AWS DataSync
- Definition: An online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS storage services.
- Example: A hospital needs to sync 500GB of daily medical imaging from their local NAS to an Amazon S3 bucket every night.
- Kinesis Data Firehose
- Definition: An extract-transform-load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics tools.
- Example: A gaming company streaming player clickstream data directly into an S3 bucket to be analyzed later by Amazon Athena.
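The Firehose example can be sketched from the producer side. This is an illustrative sketch only: `send_click_event`, the stream name, and the event fields are hypothetical, and the client is assumed to be a boto3 Firehose client (`boto3.client("firehose")`), whose `put_record` call takes `DeliveryStreamName` and a `Record` containing `Data` bytes. The client is passed in so the function can be exercised without AWS credentials.

```python
import json


def send_click_event(firehose_client, stream_name: str, event: dict) -> dict:
    """Send one clickstream event to a Firehose delivery stream."""
    # Firehose concatenates records when writing to S3, so a trailing
    # newline keeps each JSON event on its own line for Athena to read.
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )


# Example usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   client = boto3.client("firehose")
#   send_click_event(client, "player-clickstream", {"player": "p1", "action": "jump"})
```

For higher volumes, batching several events per request is usually preferable to one call per event, but the single-record form keeps the sketch easy to follow.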
Worked Examples
Scenario: The Bandwidth Trap
Problem: A company has 100 TB of data to move to AWS. They have a dedicated 100 Mbps internet connection available for this task. Should they use the internet or a Snowball Edge?
Calculation:
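- Total Data in bits: $100 \times 1024^{4} \text{ bytes} \times 8 \approx 8.8 \times 10^{14}$ bits.
- Speed in bits per second: $100 \text{ Mbps} = 10^{8}$ bits/s.
- Total Seconds: $8.8 \times 10^{14} \div 10^{8} \approx 8.8 \times 10^{6}$ s.
- Total Days: $8.8 \times 10^{6} \div 86{,}400 \approx 101.8$ days, assuming the link is fully utilized the entire time.
Solution: Since more than 101 days is likely unacceptable for a business migration, the company should order two Snowball Edge Storage Optimized devices (80TB usable each, so one device cannot hold 100 TB) to complete the transfer in approximately 1–2 weeks (including shipping time).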
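The same arithmetic can be scripted to test other data volumes and link speeds. This is a minimal sketch: the function names and the `utilization` parameter are illustrative additions, and it assumes 1 TB = 1024^4 bytes, which is the convention that reproduces the figure of roughly 101 days.

```python
import math

SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 1024**4  # binary terabyte (assumption; matches the ~101-day result)


def transfer_days(data_tb: float, link_mbps: float, utilization: float = 1.0) -> float:
    """Days needed to send data_tb terabytes over a link_mbps connection."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    total_bits = data_tb * BYTES_PER_TB * 8
    bits_per_second = link_mbps * 1_000_000 * utilization
    return total_bits / bits_per_second / SECONDS_PER_DAY


def devices_needed(data_tb: float, usable_tb_per_device: float = 80) -> int:
    """Number of devices required, rounding up to whole devices."""
    return math.ceil(data_tb / usable_tb_per_device)


if __name__ == "__main__":
    print(f"{transfer_days(100, 100):.1f} days over a 100 Mbps link")
    print(f"{devices_needed(100)} Snowball Edge Storage Optimized device(s)")
```

Lowering `utilization` (for example to 0.5 when the link is shared with production traffic) lengthens the estimate proportionally, which strengthens the case for physical shipment.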
Checkpoint Questions
- Which Snow Family device is specifically designed for compute-heavy workloads at the edge?
- What is the primary difference between Kinesis Data Streams and Kinesis Data Firehose regarding data retention?
- True or False: AWS Transfer Family supports the SMB protocol for file transfers.
- Why would a company choose AWS Glue during a data migration?
Answers
- AWS Snowball Edge Compute Optimized.
- Kinesis Data Streams retains data (24 hours by default, extendable up to 365 days) so that multiple consumers can read and replay it; Firehose only delivers data to a destination and does not store it for replay.
- False. It supports SFTP, FTPS, and FTP.
- To transform data formats (e.g., CSV to Parquet) or clean data before it reaches the data lake.