Core AWS Data Sources for Machine Learning
How to use the core AWS data sources (for example, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon FSx for NetApp ONTAP)
This guide explores the primary AWS storage services used in Machine Learning (ML) workflows, specifically focusing on Amazon S3, Amazon EFS, and the Amazon FSx family. Understanding the trade-offs between object, file, and block storage is critical for building performant and cost-effective ML pipelines.
Learning Objectives
By the end of this guide, you should be able to:
- Distinguish between object (S3), file (EFS/FSx), and block (EBS) storage architectures.
- Select the appropriate storage service based on ML access patterns (e.g., high-throughput training vs. durable data lakes).
- Explain how Amazon SageMaker natively integrates with S3, EFS, and FSx for Lustre.
- Describe the role of AWS DataSync in migrating data between on-premises systems and AWS storage.
Key Terms & Glossary
- Object Storage: A data storage architecture that manages data as objects (data, metadata, and a unique identifier) rather than files in a hierarchy.
- 11 Nines Durability: A standard representing 99.999999999% durability, meaning data loss is statistically near-zero over a year.
- Lazy Loading: A design pattern where data is loaded from a source (like S3) into a cache (like FSx for Lustre) only when it is first requested.
- POSIX Compliance: A family of standards that defines portable file system behavior (paths, permissions, locking) across operating systems; essential for applications that expect standard file system semantics.
- Throughput: The amount of data moved from one place to another in a given time period (e.g., MB/s).
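Two of these terms are easy to make concrete with arithmetic. The sketch below puts numbers on "11 nines" and "throughput"; the object count, dataset size, and bandwidth figures are illustrative assumptions, not AWS SLA values:

```python
# Back-of-the-envelope numbers for two glossary terms.
# All inputs below are illustrative assumptions, not AWS SLA figures.

# 11 nines durability: expected objects lost per year out of 10 million stored.
durability = 0.99999999999            # 99.999999999%
objects = 10_000_000
expected_losses_per_year = objects * (1 - durability)
print(f"Expected losses/year: {expected_losses_per_year:.5f}")  # roughly 0.0001

# Throughput: time to move a 500 GB training set at two different rates.
dataset_gb = 500
fast_seconds = dataset_gb / 1.0       # 1 GB/s (e.g., a parallel file system)
slow_seconds = dataset_gb / 0.1       # 100 MB/s (e.g., a single network stream)
print(f"At 1 GB/s: {fast_seconds:.0f} s; at 100 MB/s: {slow_seconds:.0f} s")
```

In other words, at 11 nines you would expect to lose about one object in 10,000 years out of 10 million stored, while a 10x throughput difference turns an 83-minute data load into an 8-minute one.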
The "Big Idea"
In Machine Learning, data is the oxygen of the model. However, not all data is accessed the same way. A data lake (S3) provides massive, cheap durability for raw data, but it might not be fast enough for high-performance training jobs. File systems (EFS/FSx) bridge this gap by providing high-concurrency access and low-latency file interfaces, acting as the high-speed delivery mechanism that feeds training instances during the model development lifecycle.
Formula / Concept Box
| Feature | Amazon S3 | Amazon EFS | Amazon FSx for Lustre | Amazon EBS |
|---|---|---|---|---|
| Storage Type | Object | File (Network) | File (High-Perf) | Block (Direct) |
| Access Pattern | HTTP/API | Concurrent (multi-EC2) | Concurrent (massively parallel) | Single instance |
| Best For | Data Lakes | Shared notebooks | Fast ML Training | DBs / OS Drives |
| Scaling | Virtually unlimited | Elastic (automatic) | Provisioned (scratch or persistent) | Manually provisioned |
Hierarchical Outline
- Amazon S3 (The Foundation)
- Scalable Object Storage (unlimited capacity and high durability).
- Data Lake Cornerstone: Acts as the central repository for raw and processed ML datasets.
- Cost Optimization: Uses storage classes (Standard, IA, Glacier) to manage lifecycle costs.
- Amazon EFS (Collaborative File Storage)
- Serverless File System: Automatically grows and shrinks with data.
- Shared Access: Allows multiple EC2 instances or SageMaker notebooks to access the same dataset simultaneously.
- Amazon FSx Family (Specialized Performance)
- FSx for Lustre: Integrated with S3 to provide a high-performance "buffer" for training.
- FSx for NetApp ONTAP: Enterprise-grade features for existing NetApp workloads.
- Data Ingestion (DataSync)
- Automated Movement: Simplifies migration from on-prem to S3, EFS, or FSx.
- Security: Includes end-to-end encryption and integrity validation.
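The S3 cost-optimization point above is typically expressed as a lifecycle policy that tiers data down through storage classes over time. The sketch below shows the shape of such a policy (the rule ID, prefix, and day thresholds are hypothetical; with boto3, a dict like this would be passed as `LifecycleConfiguration` to `s3.put_bucket_lifecycle_configuration`):

```python
# Sketch of an S3 lifecycle policy that tiers raw ML data down over time.
# Rule ID, prefix, and day thresholds are hypothetical examples.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-raw-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 180, "StorageClass": "GLACIER"},     # long-term archive
            ],
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
print(rule["Transitions"][1]["StorageClass"])  # GLACIER
```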
Visual Anchors
Data Storage Hierarchy for ML
Access Characteristics
\begin{tikzpicture}[scale=0.8]
  \draw[thick,->] (0,0) -- (8,0) node[right] {Latency};
  \draw[thick,->] (0,0) -- (0,6) node[above] {Throughput/Concurrency};
  % S3
  \draw[fill=blue!20] (6,1) circle (0.8) node {S3};
  % EFS
  \draw[fill=green!20] (3,3) circle (1) node {EFS};
  % FSx for Lustre
  \draw[fill=red!20] (1.5,5) circle (1) node {FSx/Lustre};
  % EBS
  \draw[fill=orange!20] (0.5,2) circle (0.5) node {EBS};
  \node[below] at (6,-0.2) {High};
  \node[below] at (0.5,-0.2) {Low};
\end{tikzpicture}
Definition-Example Pairs
- Amazon S3: A scalable object storage service used as a data lake.
- Example: A coffee shop chain stores millions of raw JSON transaction logs from thousands of stores globally in a single S3 bucket for future ML analysis.
- Amazon EFS: A serverless, scalable file system for concurrent shared access.
- Example: A team of ten data scientists all mount the same EFS volume to their SageMaker Studio instances to share code libraries and processed CSV datasets.
- FSx for Lustre: A high-performance file system optimized for fast processing.
- Example: An autonomous vehicle company uses FSx for Lustre to "lazy load" terabytes of video data from S3, allowing their training cluster to start processing images in seconds rather than hours.
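The lazy-loading pattern in the FSx for Lustre example begins with a file system linked to an S3 bucket. A sketch of the request parameters follows (the subnet ID, bucket name, and deployment type are hypothetical; with boto3, a dict like this would be unpacked into `fsx.create_file_system(**params)`):

```python
# Sketch of creating an FSx for Lustre file system linked to an S3 bucket,
# enabling lazy loading. All names and IDs are hypothetical placeholders.
params = {
    "FileSystemType": "LUSTRE",
    "StorageCapacity": 1200,                # GiB
    "SubnetIds": ["subnet-0abc1234"],       # hypothetical subnet
    "LustreConfiguration": {
        "DeploymentType": "SCRATCH_2",      # scratch: fast, non-replicated
        "ImportPath": "s3://example-training-data",  # objects load on first read
    },
}
print(params["LustreConfiguration"]["ImportPath"])
```

Because of the S3 link, file metadata is visible immediately while object contents are pulled in only on first read, which is why the training cluster can start in seconds.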
Worked Examples
Example 1: Selecting Storage for IoT Data
Scenario: You are building a pipeline for IoT sensor data. The data is unprocessed, comes from thousands of devices, and must be stored in a centralized, highly available repository.
Decision: Amazon S3.
Reasoning: S3 is the ideal choice for unprocessed, raw data: it is more cost-effective than EFS or EBS at large volumes, offers eleven nines of durability, and integrates easily with ingestion tools.
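A common way to lay out the raw IoT data in S3 is with date- and device-partitioned object keys, which downstream query tools can prune efficiently. The prefix scheme below is an illustrative convention, not a requirement of S3:

```python
from datetime import datetime, timezone

# Illustrative partitioned key scheme for raw IoT logs in an S3 data lake.
# The "raw/" prefix and Hive-style partitions (year=/month=/day=) are a
# common convention; the device ID is a hypothetical example.
def object_key(device_id: str, ts: datetime) -> str:
    return (f"raw/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"device={device_id}/{int(ts.timestamp())}.json")

key = object_key("sensor-042", datetime(2024, 3, 7, tzinfo=timezone.utc))
print(key)  # raw/year=2024/month=03/day=07/device=sensor-042/1709769600.json
```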
Example 2: Optimizing Training Time
Scenario: Your ML training job takes too long because it must download 500 GB of images from S3 to the local instance every time the job starts.
Solution: Implement Amazon FSx for Lustre with S3 integration.
Steps:
- Create an FSx for Lustre file system linked to your S3 bucket.
- Configure the SageMaker training job to use the FSx file system as the data source.
- Result: Data is streamed directly to the training instance (Lazy Loading), eliminating the download delay and providing high-throughput file access.
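The steps above map onto the training job's input channel configuration. The sketch below shows the relevant fragment (the file system ID and directory path are hypothetical; with boto3, a list like this would be passed as `InputDataConfig` to `sagemaker.create_training_job`):

```python
# Sketch of pointing a SageMaker training job at FSx for Lustre instead of S3.
# The file system ID and directory path are hypothetical placeholders.
input_data_config = [
    {
        "ChannelName": "training",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-0123456789abcdef0",
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",    # read-only is enough for training
                "DirectoryPath": "/fsx/images",  # path inside the mounted file system
            }
        },
    }
]
fs = input_data_config[0]["DataSource"]["FileSystemDataSource"]
print(fs["FileSystemType"])  # FSxLustre
```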
Checkpoint Questions
- Which service provides 11 nines of durability and is considered the "cornerstone" of a data lake?
- True or False: Amazon EBS is the best choice for sharing a dataset across 50 different EC2 instances simultaneously.
- What are the two data loading design patterns supported by FSx for Lustre when interacting with S3?
- Why might a machine learning engineer choose EFS over EBS for a team collaboration environment?
> [!TIP]
> Remember: S3 is for durability/scale, EFS is for ease-of-use/sharing, and FSx for Lustre is for raw speed/throughput.
Muddy Points & Cross-Refs
- EFS vs. FSx for Lustre: Students often confuse these. Remember: EFS is "General Purpose" (think home directories, simple sharing). FSx for Lustre is "High Performance" (think massive parallel compute, HPC, and heavy ML training).
- Block vs. Object: EBS (Block) is like a hard drive plugged into ONE computer. S3 (Object) is like a web-based storage folder that anything with an API key can talk to.
- DataSync: It is not a storage service itself, but a mover. Use it to get data into the sources listed above.
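The "mover, not storage" point can be made concrete with the shape of a DataSync task: a source location, a destination location, and verification options. The ARNs below are hypothetical placeholders; with boto3, keywords like these would go to `datasync.create_task`:

```python
# Sketch of a DataSync task definition. DataSync moves data between two
# registered locations; it stores nothing itself. ARNs are hypothetical.
task_params = {
    "SourceLocationArn": "arn:aws:datasync:us-east-1:111122223333:location/loc-src",
    "DestinationLocationArn": "arn:aws:datasync:us-east-1:111122223333:location/loc-dst",
    "Options": {
        "VerifyMode": "POINT_IN_TIME_CONSISTENT",  # end-to-end integrity validation
    },
}
print(task_params["Options"]["VerifyMode"])
```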
Comparison Tables
File Storage Comparison
| Service | Primary Use Case | Scaling Behavior | Performance Profile |
|---|---|---|---|
| Amazon EFS | Shared content, notebooks | Elastic (Automatic) | Consistent, low-latency |
| FSx for Lustre | ML Training, HPC | Provisioned/Manual | Ultra-high throughput |
| FSx for ONTAP | Enterprise migrations | Provisioned | Multi-protocol (NFS/SMB) |
| Amazon EBS | Databases, Boot volumes | Provisioned | Single-digit ms latency |