Hands-On Lab: Training and Refining Models with Amazon SageMaker
This lab provides a hands-on experience in executing the machine learning lifecycle on AWS. You will focus on Content Domain 2: ML Model Development, specifically training a model and refining its performance using Hyperparameter Optimization (HPO).
[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for storage and resource metadata.
Prerequisites
- AWS Account: An active AWS account with permissions to manage SageMaker and S3.
- IAM Role: A SageMaker execution role with the `AmazonSageMakerFullAccess` and `AmazonS3FullAccess` policies attached.
- AWS CLI: Installed and configured on your local machine with `<YOUR_REGION>` (e.g., `us-east-1`).
- Knowledge: Basic understanding of Python and the XGBoost algorithm.
Learning Objectives
- Configure an Amazon S3 bucket for training data and model artifacts.
- Launch a SageMaker Training Job using the built-in XGBoost algorithm.
- Execute a Hyperparameter Tuning Job to refine model accuracy.
- Analyze the impact of hyperparameters like `learning_rate` and `max_depth` on model performance.
Architecture Overview
Step-by-Step Instructions
Step 1: Prepare the S3 Environment
We need a centralized location for our input data and the resulting model weights.
```shell
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-training-<UNIQUE_ID>
```

Console alternative:
- Navigate to S3 in the AWS Management Console.
- Click Create bucket.
- Name it `brainybee-lab-training-<UNIQUE_ID>` and click Create.
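If you want a programmatic way to generate the random suffix, here is a minimal Python sketch (using `uuid` is just one convenient option; any unique lowercase string works):

```python
import re
import uuid

# Generate a short random suffix so the bucket name is globally unique.
suffix = uuid.uuid4().hex[:8]
bucket_name = f"brainybee-lab-training-{suffix}"

# S3 bucket names must be 3-63 characters: lowercase letters, digits, hyphens.
assert re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", bucket_name)
print(bucket_name)
```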
Step 2: Define Hyperparameters and Algorithm
For our initial training, we will use the SageMaker built-in XGBoost. We must define the initial hyperparameters such as num_round and eta (learning rate).
```shell
# Define variables for the training job
JOB_NAME="xgboost-training-$(date +%Y-%m-%d-%H-%M-%S)"
ROLE_ARN="<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>"
IMAGE="<ECR_IMAGE_URI_FOR_XGBOOST>"
```

[!TIP] The XGBoost image URI varies by region. Use the SageMaker Python SDK locally to find the latest URI for your region.
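The shell variables above can also be assembled in Python; a minimal sketch (the hyperparameter values here are illustrative baselines, not tuned settings):

```python
from datetime import datetime, timezone

# Mirror the shell: a timestamped, unique training-job name.
job_name = "xgboost-training-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")

# Baseline hyperparameters for the built-in XGBoost algorithm.
# SageMaker expects hyperparameter values as strings.
hyperparameters = {
    "num_round": "100",              # number of boosting rounds
    "eta": "0.3",                    # learning rate (XGBoost default)
    "objective": "reg:squarederror", # regression objective
}
print(job_name, hyperparameters)
```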
Step 3: Launch Training Job
We will trigger a single training job to establish a baseline performance metric.
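The inline JSON passed to `--input-data-config` is easy to mistype; one way to generate it programmatically (the bucket name is a placeholder, as in Step 1):

```python
import json

bucket = "brainybee-lab-training-<UNIQUE_ID>"  # placeholder: substitute your bucket
input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/data/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }
]
# Print compact JSON suitable for pasting into the CLI command.
print(json.dumps(input_data_config))
```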
```shell
aws sagemaker create-training-job \
    --training-job-name $JOB_NAME \
    --algorithm-specification TrainingImage=$IMAGE,TrainingInputMode=File \
    --role-arn $ROLE_ARN \
    --input-data-config '[{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/", "S3DataDistributionType": "FullyReplicated"}}}]' \
    --output-data-config S3OutputPath=s3://brainybee-lab-training-<UNIQUE_ID>/output/ \
    --resource-config InstanceType=ml.m5.xlarge,InstanceCount=1,VolumeSizeInGB=5 \
    --stopping-condition MaxRuntimeInSeconds=3600
```

Step 4: Refine the Model with Hyperparameter Tuning (HPO)
Instead of manual trial-and-error, we will use a Tuning Job to search for the best max_depth and eta values using Bayesian Optimization.
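To build intuition for the search space declared in the tuning job (`eta` in [0.1, 0.5], `max_depth` in [3, 9]), here is a toy random sampler over the same ranges (random sampling is shown only for contrast; the tuning job itself uses Bayesian optimization):

```python
import random

random.seed(0)  # reproducible demo

def sample_candidate():
    # Same ranges as the tuning-job config.
    return {
        "eta": random.uniform(0.1, 0.5),    # continuous parameter range
        "max_depth": random.randint(3, 9),  # integer parameter range
    }

# MaxNumberOfTrainingJobs = 10 in the tuning-job config.
candidates = [sample_candidate() for _ in range(10)]
for c in candidates:
    assert 0.1 <= c["eta"] <= 0.5 and 3 <= c["max_depth"] <= 9
print(candidates[0])
```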
```shell
aws sagemaker create-hyper-parameter-tuning-job \
    --hyper-parameter-tuning-job-name "hpo-refinement-$(date +%s)" \
    --hyper-parameter-tuning-job-config '{"Strategy": "Bayesian", "HyperParameterTuningJobObjective": {"Type": "Minimize", "MetricName": "validation:rmse"}, "ResourceLimits": {"MaxNumberOfTrainingJobs": 10, "MaxParallelTrainingJobs": 2}, "ParameterRanges": {"ContinuousParameterRanges": [{"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}], "IntegerParameterRanges": [{"Name": "max_depth", "MinValue": "3", "MaxValue": "9"}]}}' \
    --training-job-definition '{"AlgorithmSpecification": {"TrainingImage": "'$IMAGE'", "TrainingInputMode": "File"}, "RoleArn": "'$ROLE_ARN'", "InputDataConfig": [{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/"}}}], "OutputDataConfig": {"S3OutputPath": "s3://brainybee-lab-training-<UNIQUE_ID>/hpo-output/"}, "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 5}, "StoppingCondition": {"MaxRuntimeInSeconds": 3600}}'
```

Visualizing the Refinement Process
In hyperparameter optimization, the algorithm attempts to find the global minimum of the loss function by adjusting parameters across a search space.
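A toy illustration of that idea, minimizing a one-dimensional "loss" over a parameter grid (purely illustrative; real HPO runs a full training job per candidate point):

```python
def loss(eta):
    # Pretend the validation RMSE is quadratic with its minimum at eta = 0.3.
    return (eta - 0.3) ** 2 + 1.0

# Evaluate candidates across the search space and keep the best one.
grid = [0.1 + 0.05 * i for i in range(9)]  # 0.10, 0.15, ..., 0.50
best_eta = min(grid, key=loss)
print(best_eta, loss(best_eta))
```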
Checkpoints
- Verify Training Status: Run `aws sagemaker describe-training-job --training-job-name <JOB_NAME>`. The `TrainingJobStatus` should move from `InProgress` to `Completed`.
- Check HPO Progress: In the SageMaker Console, under Training > Hyperparameter tuning jobs, ensure your job shows multiple child training jobs being spawned.
- Inspect S3 Output: Confirm that a `model.tar.gz` file exists in your output bucket path.
Clean-Up / Teardown
To prevent costs associated with stored data and training metadata:
```shell
# Delete the S3 bucket and all its contents
aws s3 rb s3://brainybee-lab-training-<UNIQUE_ID> --force

# List and delete any created notebooks (if used)
aws sagemaker list-notebook-instances
```

Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| `AccessDenied` | IAM role lacks S3 permissions. | Attach `AmazonS3FullAccess` to the execution role. |
| `ResourceLimitExceeded` | Account has hit the training instance quota. | Use a smaller instance (e.g., `ml.m5.large`) or request a quota increase. |
| `InvalidImageUri` | XGBoost image URI is incorrect for the region. | Check the official AWS SageMaker documentation for the ECR URI in your specific region. |
Stretch Challenge
Early Stopping: Modify the Training Job configuration to include an Early Stopping parameter. This prevents overfitting and saves costs by terminating the job when the validation metric stops improving for a defined number of epochs.
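The early-stopping idea can be sketched in a few lines (the metric values are synthetic and the patience of 3 rounds is an arbitrary choice):

```python
def should_stop(history, patience=3):
    # Stop when the validation metric has not improved for `patience` rounds.
    if len(history) <= patience:
        return False
    best_so_far = min(history[:-patience])
    return min(history[-patience:]) >= best_so_far

# Synthetic validation RMSE per round: improves, then plateaus.
rmse = [0.90, 0.75, 0.60, 0.55, 0.56, 0.57, 0.58]
stopped_at = next(i for i in range(1, len(rmse) + 1) if should_stop(rmse[:i]))
print(stopped_at)  # training would halt after this round
```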
Cost Estimate
- S3 Storage: Negligible for this lab (<$0.01).
- SageMaker Training: `ml.m5.xlarge` costs approximately $0.23 per hour. Total lab time (~30 mins) with 10 HPO jobs should cost between $1.00 and $2.50.
- Free Tier: This lab may be eligible for the SageMaker Free Tier if your account is less than 2 months old and you haven't exceeded the 250-hour limit.
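A quick back-of-the-envelope check of the training estimate (the hourly rate is the approximate on-demand figure quoted above and varies by region; the half hour per job is a rough worst-case assumption):

```python
hourly_rate = 0.23    # approx. ml.m5.xlarge on-demand $/hour
jobs = 10             # MaxNumberOfTrainingJobs in the HPO config
hours_per_job = 0.5   # assumed billed instance time per tuning job

estimated_cost = hourly_rate * jobs * hours_per_job
print(f"${estimated_cost:.2f}")
```

The result lands inside the $1.00 to $2.50 range quoted in the estimate.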
Concept Review
| Term | Definition | Real-World Impact |
|---|---|---|
| Epoch | One full pass through the dataset. | High epochs can lead to overfitting; low epochs to underfitting. |
| Batch Size | Number of samples processed before updating parameters. | Influences memory usage and training stability. |
| L2 Regularization | Adds a squared penalty to weights (Ridge). | Prevents single features from dominating the model, reducing variance. |
| Bayesian Tuning | Optimization that uses previous results to pick next parameters. | Much more efficient than Random Search for complex parameter spaces. |
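The L2 (Ridge) penalty in the table is simply a squared-weight term added to the base loss; a minimal numeric illustration (`lam` is an arbitrary regularization strength chosen for the example):

```python
def l2_penalized_loss(mse, weights, lam=0.1):
    # Ridge: base loss plus lam times the sum of squared weights.
    return mse + lam * sum(w * w for w in weights)

# Larger weights incur a larger penalty, discouraging any one feature
# from dominating the model.
print(l2_penalized_loss(0.5, [1.0, -2.0, 0.5]))
```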