Core AWS Data Sources for Machine Learning
How to use the core AWS data sources (for example, Amazon S3, Amazon Elastic File System [Amazon EFS], Amazon FSx for NetApp ONTAP)
This guide explores the primary AWS storage services used in Machine Learning (ML) workflows, specifically focusing on Amazon S3, Amazon EFS, and the Amazon FSx family. Understanding the trade-offs between object, file, and block storage is critical for building performant and cost-effective ML pipelines.
Learning Objectives
By the end of this guide, you should be able to:
- Distinguish between object (S3), file (EFS/FSx), and block (EBS) storage architectures.
- Select the appropriate storage service based on ML access patterns (e.g., high-throughput training vs. durable data lakes).
- Explain how Amazon SageMaker natively integrates with S3, EFS, and FSx for Lustre.
- Describe the role of AWS DataSync in migrating data between on-premises systems and AWS storage.
Key Terms & Glossary
- Object Storage: A data storage architecture that manages data as objects (data, metadata, and a unique identifier) rather than files in a hierarchy.
- 11 Nines Durability: A standard representing 99.999999999% durability, meaning data loss is statistically near-zero over a year.
- Lazy Loading: A design pattern where data is loaded from a source (like S3) into a cache (like FSx for Lustre) only when it is first requested.
- POSIX Compliance: A family of standards that defines portable file system behavior (paths, permissions, locking) across operating systems; essential for applications that expect standard file system semantics.
- Throughput: The amount of data moved from one place to another in a given time period (e.g., MB/s).
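Two of these terms are easy to make concrete with arithmetic. The sketch below puts numbers on "11 nines" and "throughput"; the object count, dataset size, and bandwidth figures are illustrative assumptions, not AWS SLA values:

```python
# Back-of-the-envelope numbers for two glossary terms.
# All inputs below are illustrative assumptions, not AWS SLA figures.

# 11 nines durability: expected objects lost per year out of 10 million stored.
durability = 0.99999999999            # 99.999999999%
objects = 10_000_000
expected_losses_per_year = objects * (1 - durability)
print(f"Expected losses/year: {expected_losses_per_year:.5f}")  # roughly 0.0001

# Throughput: time to move a 500 GB training set at two different rates.
dataset_gb = 500
fast_seconds = dataset_gb / 1.0       # 1 GB/s (e.g., a parallel file system)
slow_seconds = dataset_gb / 0.1       # 100 MB/s (e.g., a single network stream)
print(f"At 1 GB/s: {fast_seconds:.0f} s; at 100 MB/s: {slow_seconds:.0f} s")
```

In other words, at 11 nines you would expect to lose about one object in 10,000 years out of 10 million stored, while a 10x throughput difference turns an 83-minute data load into an 8-minute one.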
The "Big Idea"
In Machine Learning, data is the oxygen of the model. However, not all data is accessed the same way. A data lake (S3) provides massive, cheap durability for raw data, but it might not be fast enough for high-performance training jobs. File systems (EFS/FSx) bridge this gap by providing high-concurrency access and low-latency file interfaces, acting as the high-speed delivery mechanism that feeds training instances during the model development lifecycle.
Formula / Concept Box
| Feature | Amazon S3 | Amazon EFS | Amazon FSx for Lustre | Amazon EBS |
|---|---|---|---|---|
| Storage Type | Object | File (Network) | File (High-Perf) | Block (Direct) |
| Access Pattern | HTTP/API | Concurrent (multi-EC2) | Concurrent (massively parallel) | Single instance |
| Best For | Data Lakes | Shared notebooks | Fast ML Training | DBs / OS Drives |
| Scaling | Virtually unlimited | Elastic (automatic) | Provisioned (scratch or persistent) | Manually provisioned |
Hierarchical Outline
- Amazon S3 (The Foundation)
- Scalable Object Storage (unlimited capacity and high durability).
- Data Lake Cornerstone: Acts as the central repository for raw and processed ML datasets.
- Cost Optimization: Uses storage classes (Standard, IA, Glacier) to manage lifecycle costs.
- Amazon EFS (Collaborative File Storage)
- Serverless File System: Automatically grows and shrinks with data.
- Shared Access: Allows multiple EC2 instances or SageMaker notebooks to access the same dataset simultaneously.
- Amazon FSx Family (Specialized Performance)
- FSx for Lustre: Integrated with S3 to provide a high-performance "buffer" for training.
- FSx for NetApp ONTAP: Enterprise-grade features for existing NetApp workloads.
- Data Ingestion (DataSync)
- Automated Movement: Simplifies migration from on-prem to S3, EFS, or FSx.
- Security: Includes end-to-end encryption and integrity validation.
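The S3 cost-optimization point above is typically expressed as a lifecycle policy that tiers data down through storage classes over time. The sketch below shows the shape of such a policy (the rule ID, prefix, and day thresholds are hypothetical; with boto3, a dict like this would be passed as `LifecycleConfiguration` to `s3.put_bucket_lifecycle_configuration`):

```python
# Sketch of an S3 lifecycle policy that tiers raw ML data down over time.
# Rule ID, prefix, and day thresholds are hypothetical examples.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-raw-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                {"Days": 180, "StorageClass": "GLACIER"},     # long-term archive
            ],
        }
    ]
}

rule = lifecycle_configuration["Rules"][0]
print(rule["Transitions"][1]["StorageClass"])  # GLACIER
```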
Visual Anchors
Data Storage Hierarchy for ML
Access Characteristics
\begin{tikzpicture}[scale=0.8]
  \draw[thick,->] (0,0) -- (8,0) node[right] {Latency};
  \draw[thick,->] (0,0) -- (0,6) node[above] {Throughput/Concurrency};
  % S3
  \draw[fill=blue!20] (6,1) circle (0.8) node {S3};
  % EFS
  \draw[fill=green!20] (3,3) circle (1) node {EFS};
  % FSx for Lustre
  \draw[fill=red!20] (1.5,5) circle (1) node {FSx/Lustre};
  % EBS
  \draw[fill=orange!20] (0.5,2) circle (0.5) node {EBS};
  \node[below] at (6,-0.2) {High};
  \node[below] at (0.5,-0.2) {Low};
\end{tikzpicture}
Definition-Example Pairs
- Amazon S3: A scalable object storage service used as a data lake.
- Example: A coffee shop chain stores millions of raw JSON transaction logs from thousands of stores globally in a single S3 bucket for future ML analysis.
- Amazon EFS: A serverless, scalable file system for concurrent shared access.
- Example: A team of ten data scientists all mount the same EFS volume to their SageMaker Studio instances to share code libraries and processed CSV datasets.
- FSx for Lustre: A high-performance file system optimized for fast processing.
- Example: An autonomous vehicle company uses FSx for Lustre to "lazy load" terabytes of video data from S3, allowing their training cluster to start processing images in seconds rather than hours.
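The lazy-loading pattern in the FSx for Lustre example begins with a file system linked to an S3 bucket. A sketch of the request parameters follows (the subnet ID, bucket name, and deployment type are hypothetical; with boto3, a dict like this would be unpacked into `fsx.create_file_system(**params)`):

```python
# Sketch of creating an FSx for Lustre file system linked to an S3 bucket,
# enabling lazy loading. All names and IDs are hypothetical placeholders.
params = {
    "FileSystemType": "LUSTRE",
    "StorageCapacity": 1200,                # GiB
    "SubnetIds": ["subnet-0abc1234"],       # hypothetical subnet
    "LustreConfiguration": {
        "DeploymentType": "SCRATCH_2",      # scratch: fast, non-replicated
        "ImportPath": "s3://example-training-data",  # objects load on first read
    },
}
print(params["LustreConfiguration"]["ImportPath"])
```

Because of the S3 link, file metadata is visible immediately while object contents are pulled in only on first read, which is why the training cluster can start in seconds.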
Worked Examples
Example 1: Selecting Storage for IoT Data
Scenario: You are building a pipeline for IoT sensor data. The data is unprocessed, comes from thousands of devices, and must be stored in a centralized, highly available repository.
Decision: Amazon S3.
Reasoning: S3 is the ideal choice for unprocessed, raw data: it is more cost-effective than EFS or EBS at large volumes, offers eleven nines of durability, and integrates easily with ingestion tools.
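A common way to lay out the raw IoT data in S3 is with date- and device-partitioned object keys, which downstream query tools can prune efficiently. The prefix scheme below is an illustrative convention, not a requirement of S3:

```python
from datetime import datetime, timezone

# Illustrative partitioned key scheme for raw IoT logs in an S3 data lake.
# The "raw/" prefix and Hive-style partitions (year=/month=/day=) are a
# common convention; the device ID is a hypothetical example.
def object_key(device_id: str, ts: datetime) -> str:
    return (f"raw/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"device={device_id}/{int(ts.timestamp())}.json")

key = object_key("sensor-042", datetime(2024, 3, 7, tzinfo=timezone.utc))
print(key)  # raw/year=2024/month=03/day=07/device=sensor-042/1709769600.json
```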
Example 2: Optimizing Training Time
Scenario: Your ML training job takes too long because it must download 500 GB of images from S3 to the local instance every time the job starts.
Solution: Implement Amazon FSx for Lustre with S3 integration.
Steps:
- Create an FSx for Lustre file system linked to your S3 bucket.
- Configure the SageMaker training job to use the FSx file system as the data source.
- Result: Data is streamed directly to the training instance (Lazy Loading), eliminating the download delay and providing high-throughput file access.
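The steps above map onto the training job's input channel configuration. The sketch below shows the relevant fragment (the file system ID and directory path are hypothetical; with boto3, a list like this would be passed as `InputDataConfig` to `sagemaker.create_training_job`):

```python
# Sketch of pointing a SageMaker training job at FSx for Lustre instead of S3.
# The file system ID and directory path are hypothetical placeholders.
input_data_config = [
    {
        "ChannelName": "training",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-0123456789abcdef0",
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",    # read-only is enough for training
                "DirectoryPath": "/fsx/images",  # path inside the mounted file system
            }
        },
    }
]
fs = input_data_config[0]["DataSource"]["FileSystemDataSource"]
print(fs["FileSystemType"])  # FSxLustre
```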
Checkpoint Questions
- Which service provides 11 nines of durability and is considered the "cornerstone" of a data lake?
- True or False: Amazon EBS is the best choice for sharing a dataset across 50 different EC2 instances simultaneously.
- What are the two data loading design patterns supported by FSx for Lustre when interacting with S3?
- Why might a machine learning engineer choose EFS over EBS for a team collaboration environment?
> [!TIP]
> Remember: S3 is for durability/scale, EFS is for ease-of-use/sharing, and FSx for Lustre is for raw speed/throughput.
Muddy Points & Cross-Refs
- EFS vs. FSx for Lustre: Students often confuse these. Remember: EFS is "General Purpose" (think home directories, simple sharing). FSx for Lustre is "High Performance" (think massive parallel compute, HPC, and heavy ML training).
- Block vs. Object: EBS (Block) is like a hard drive plugged into ONE computer. S3 (Object) is like a web-based storage folder that anything with an API key can talk to.
- DataSync: It is not a storage service itself, but a mover. Use it to get data into the sources listed above.
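The "mover, not storage" point can be made concrete with the shape of a DataSync task: a source location, a destination location, and verification options. The ARNs below are hypothetical placeholders; with boto3, keywords like these would go to `datasync.create_task`:

```python
# Sketch of a DataSync task definition. DataSync moves data between two
# registered locations; it stores nothing itself. ARNs are hypothetical.
task_params = {
    "SourceLocationArn": "arn:aws:datasync:us-east-1:111122223333:location/loc-src",
    "DestinationLocationArn": "arn:aws:datasync:us-east-1:111122223333:location/loc-dst",
    "Options": {
        "VerifyMode": "POINT_IN_TIME_CONSISTENT",  # end-to-end integrity validation
    },
}
print(task_params["Options"]["VerifyMode"])
```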
Comparison Tables
File Storage Comparison
| Service | Primary Use Case | Scaling Behavior | Performance Profile |
|---|---|---|---|
| Amazon EFS | Shared content, notebooks | Elastic (Automatic) | Consistent, low-latency |
| FSx for Lustre | ML Training, HPC | Provisioned/Manual | Ultra-high throughput |
| FSx for ONTAP | Enterprise migrations | Provisioned | Multi-protocol (NFS/SMB) |
| Amazon EBS | Databases, Boot volumes | Provisioned | Single-digit ms latency |