AWS SageMaker AI Script Mode: Deep Dive Study Guide
Using SageMaker AI script mode with SageMaker AI supported frameworks to train models (for example, TensorFlow, PyTorch)
This guide covers the utilization of SageMaker AI Script Mode, a powerful feature that allows Machine Learning Engineers to use custom training scripts with popular frameworks like TensorFlow and PyTorch while leveraging SageMaker's managed infrastructure.
Learning Objectives
By the end of this module, you should be able to:
- Define the core purpose of Script Mode and when to choose it over built-in algorithms.
- Configure a SageMaker Estimator using the Python SDK for PyTorch or TensorFlow.
- Structure a custom Python training script including data loading and model saving logic.
- Identify the lifecycle of a training job from script submission to S3 artifact generation.
- Differentiate between managed framework containers and Bring Your Own Container (BYOC).
Key Terms & Glossary
- Script Mode: A SageMaker feature where you provide a standard Python script and SageMaker executes it inside a pre-configured framework container.
- Entry Point: The specific Python file (`.py`) that contains the main execution logic for your training job.
- Estimator: A high-level object in the SageMaker Python SDK that encapsulates the configuration for a training job (e.g., `PyTorch`, `TensorFlow`).
- Model Artifacts: The output files (usually `model.tar.gz`) produced by your script and automatically uploaded to Amazon S3 by SageMaker.
- Managed Container: Docker images maintained by AWS that come pre-installed with frameworks like PyTorch, TensorFlow, and MXNet.
The "Big Idea"
> [!IMPORTANT]
> The "Big Idea": Script Mode is the "Goldilocks" of SageMaker training. While Built-in Algorithms offer maximum ease and BYOC (Bring Your Own Container) offers maximum control, Script Mode provides the flexibility of custom code with the convenience of AWS-managed, framework-optimized environments.
Formula / Concept Box
| Component | Implementation Detail |
|---|---|
| The Script | Must handle environment variables like SM_MODEL_DIR and SM_CHANNEL_TRAINING. |
| The Estimator | Requires entry_point, framework_version, instance_type, and role. |
| Data Input | Mapping S3 paths to local container paths (e.g., /opt/ml/input/data/). |
| Model Output | Saving files to /opt/ml/model/ ensures they are archived and sent to S3. |
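The contract in the table above can be sketched as a minimal entry-point script. This is an illustrative skeleton, not a complete training job: the fallback defaults (`tempfile.mkdtemp()` and the current directory) are assumptions added so the sketch also runs outside a SageMaker container, where `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` would normally be set.

```python
# Minimal sketch of a Script Mode entry point (a hypothetical train.py).
# SageMaker injects hyperparameters as CLI arguments and paths as env vars.
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    # SageMaker sets these env vars inside the container; the fallbacks
    # below are only so this sketch runs locally.
    parser.add_argument("--model-dir",
                        default=os.environ.get("SM_MODEL_DIR", tempfile.mkdtemp()))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "."))
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # ... the actual training loop would go here ...
    # Anything written under args.model_dir is archived into model.tar.gz
    # and uploaded to S3 when the job completes.
    out_path = os.path.join(args.model_dir, "model.json")
    with open(out_path, "w") as f:
        json.dump({"epochs": args.epochs, "lr": args.lr}, f)
    return out_path


if __name__ == "__main__":
    main()
```

Because the script only reads CLI arguments and environment variables, the same file works unchanged on a laptop and inside the managed container.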
Hierarchical Outline
- I. Anatomy of a Script Mode Training Job
  - The Script (`train.py`): Standard Python code using specific libraries (TF, PyTorch).
  - The Environment: Managed Docker containers provided by AWS.
  - The Launcher: The SageMaker Python SDK `fit()` method.
- II. Developing the Training Script
  - Argument Parsing: Using `argparse` to receive hyperparameters from SageMaker.
  - Environment Variables: Accessing data paths via `os.environ` (e.g., `SM_CHANNELS`).
  - Saving the Model: Mandatory step of writing to the correct directory for persistence.
- III. Configuring the Estimator
  - Selecting Frameworks: Defining versions (e.g., `PyTorch 2.0`).
  - Compute Resources: Choosing Managed Spot Instances or On-Demand instances.
  - Dependencies: Using a `requirements.txt` file in the source directory.
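To make the "Dependencies" point from the outline concrete, a typical source directory might look like this (the file names follow the examples used elsewhere in this guide):

```text
src/
├── train.py          # the entry_point passed to the Estimator
├── utils.py          # supporting module imported by train.py
└── requirements.txt  # extra pip packages, installed before training starts
```

SageMaker compresses this folder, uploads it to S3, and unpacks it inside the container before launching your script.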
Visual Anchors
The Script Mode Workflow
Local vs. Managed Path Mapping
```latex
\begin{tikzpicture}[node distance=2cm, every node/.style={draw, rectangle, align=center, fill=blue!10}]
  \node (s3) [fill=green!10] {Amazon S3 \\ s3://bucket/data/};
  \node (cont) [right of=s3, xshift=3cm] {SageMaker Container \\ /opt/ml/input/data/train/};
  \node (script) [below of=cont] {Your Python Script \\ (reads from local path)};
  \draw[->, thick] (s3) -- node[above] {Data Sync} (cont);
  \draw[->, thick] (cont) -- (script);
\end{tikzpicture}
```
Definition-Example Pairs
- Hyperparameter Injection: The process where SageMaker passes parameters to your script as command-line arguments.
  - Example: Passing `batch_size=64` in the SDK `Estimator` results in SageMaker running `python train.py --batch_size 64` inside the container.
- Source Directory: A folder containing your main script and any supporting modules or requirements.
  - Example: A folder `src/` containing `train.py`, `utils.py`, and `requirements.txt` is compressed and uploaded to S3 automatically.
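The hyperparameter-injection example above can be illustrated with a small helper. Note that `to_cli_args` is a hypothetical function written for this guide to mimic the SDK's dict-to-flags mapping; it is not part of the SageMaker SDK itself.

```python
# Hypothetical illustration of hyperparameter injection: SageMaker serializes
# the Estimator's hyperparameters dict into command-line flags for the
# entry-point script.
def to_cli_args(hyperparameters):
    """Turn {'batch_size': 64} into ['--batch_size', '64']."""
    args = []
    for name, value in hyperparameters.items():
        args += ["--" + str(name), str(value)]
    return args


cmd = ["python", "train.py"] + to_cli_args({"batch_size": 64})
print(" ".join(cmd))  # python train.py --batch_size 64
```

This is why the entry point uses `argparse`: from the script's point of view, hyperparameters are just ordinary CLI flags.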
Worked Examples
Example: PyTorch Estimator Setup
This example shows how to initialize a training job using the SageMaker Python SDK.
```python
from sagemaker.pytorch import PyTorch

# Define the Estimator
pytorch_estimator = PyTorch(
    entry_point='train.py',         # Your custom script
    source_dir='src',               # Folder with supporting code
    role=role,                      # IAM role with S3 permissions
    instance_count=1,
    instance_type='ml.p3.2xlarge',  # GPU instance
    framework_version='2.0',
    py_version='py310',
    hyperparameters={
        'epochs': 10,
        'lr': 0.001
    }
)

# Start the job
pytorch_estimator.fit({'training': 's3://my-bucket/training-data/'})
```

Step-by-Step Breakdown:
- Entry Point: SageMaker will search for `train.py` inside the `src` folder.
- Infrastructure: AWS provisions an `ml.p3.2xlarge` instance and pulls the official PyTorch 2.0 image.
- Data Mapping: The S3 data is downloaded to `/opt/ml/input/data/training/` within the container before your script starts.
- Execution: SageMaker runs the script. Any model saved to `/opt/ml/model/` will be compressed and uploaded back to S3 upon completion.
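The data-mapping step above can be summarized in code. The two helpers below are illustrative functions written for this guide (not SDK APIs), but the directory convention and the `SM_CHANNEL_<NAME>` environment-variable pattern follow the container layout described in this section.

```python
import posixpath


# Each channel name passed to fit({...}) becomes a subdirectory of the
# container's input path, and SageMaker exports a matching env var so the
# script never has to hard-code the location.
def channel_path(channel, base="/opt/ml/input/data"):
    return posixpath.join(base, channel)


def channel_env_var(channel):
    return "SM_CHANNEL_" + channel.upper()


print(channel_path("training"))     # /opt/ml/input/data/training
print(channel_env_var("training"))  # SM_CHANNEL_TRAINING
```

So `fit({'training': 's3://my-bucket/training-data/'})` yields a `training` channel whose contents appear at `/opt/ml/input/data/training/` inside the container.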
Comparison Tables
| Feature | Built-in Algorithms | Script Mode | Bring Your Own Container (BYOC) |
|---|---|---|---|
| Code Effort | Zero (Config only) | Medium (Python script) | High (Docker + Code) |
| Flexibility | Low (Fixed logic) | High (Custom code) | Maximum (Custom OS/System) |
| Maintenance | AWS managed | AWS managed container | User managed container |
| Use Case | Common tasks (XGBoost) | Custom logic in TF/PyTorch | Proprietary libraries/Non-Python |
Checkpoint Questions
- Question: Where must your training script save the final model so that SageMaker persists it to S3?
  - Answer: `/opt/ml/model/` (referenced by the environment variable `SM_MODEL_DIR`).
- Question: How can you install additional Python libraries that are not in the base SageMaker framework container?
  - Answer: Include a `requirements.txt` file in your `source_dir`. SageMaker will run `pip install` on it automatically.
- Question: True or False: Script Mode requires you to write a Dockerfile.
  - Answer: False. Script Mode uses AWS-provided Docker images.
Muddy Points & Cross-Refs
- Environment Variables: Many students find the `/opt/ml/...` path structure confusing. Remember: SageMaker maps S3 data onto the container's local disk. Always read the SageMaker-provided environment variables (e.g., `SM_MODEL_DIR`, `SM_CHANNEL_TRAINING`) rather than hard-coding paths.
- Local Mode: Before running a one-hour job on a GPU instance, use `instance_type='local'` to test your Script Mode logic on your notebook instance first.
- Deep Dive: For full control over the runtime environment (e.g., specific Linux C++ libraries), study BYOC (Bring Your Own Container) in the next chapter.