Hands-On Lab: Training and Refining Models with Amazon SageMaker
This lab provides a hands-on experience in executing the machine learning lifecycle on AWS. You will focus on Content Domain 2: ML Model Development, specifically training a model and refining its performance using Hyperparameter Optimization (HPO).
[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for storage and resource metadata.
Prerequisites
- AWS Account: An active AWS account with permissions to manage SageMaker and S3.
- IAM Role: A SageMaker execution role with the `AmazonSageMakerFullAccess` and `AmazonS3FullAccess` policies attached.
- AWS CLI: Installed and configured on your local machine with `<YOUR_REGION>` (e.g., `us-east-1`).
- Knowledge: Basic understanding of Python and the XGBoost algorithm.
Learning Objectives
- Configure an Amazon S3 bucket for training data and model artifacts.
- Launch a SageMaker Training Job using the built-in XGBoost algorithm.
- Execute a Hyperparameter Tuning Job to refine model accuracy.
- Analyze the impact of hyperparameters like `learning_rate` and `max_depth` on model performance.
Architecture Overview
Step-by-Step Instructions
Step 1: Prepare the S3 Environment
We need a centralized location for our input data and the resulting model weights.
```shell
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-training-<UNIQUE_ID>
```

Console alternative:
- Navigate to S3 in the AWS Management Console.
- Click Create bucket.
- Name it `brainybee-lab-training-<UNIQUE_ID>` and click Create.
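If you want a programmatic way to generate the random suffix, here is a minimal Python sketch (using `uuid` is just one convenient option; any unique lowercase string works):

```python
import re
import uuid

# Generate a short random suffix so the bucket name is globally unique.
suffix = uuid.uuid4().hex[:8]
bucket_name = f"brainybee-lab-training-{suffix}"

# S3 bucket names must be 3-63 characters: lowercase letters, digits, hyphens.
assert re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", bucket_name)
print(bucket_name)
```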
Step 2: Define Hyperparameters and Algorithm
For our initial training, we will use the SageMaker built-in XGBoost. We must define the initial hyperparameters such as num_round and eta (learning rate).
```shell
# Define variables for the training job
JOB_NAME="xgboost-training-$(date +%Y-%m-%d-%H-%M-%S)"
ROLE_ARN="<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>"
IMAGE="<ECR_IMAGE_URI_FOR_XGBOOST>"
```

[!TIP] The XGBoost image URI varies by region. Use the SageMaker Python SDK locally to find the latest URI for your region.
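The shell variables above can also be assembled in Python; a minimal sketch (the hyperparameter values here are illustrative baselines, not tuned settings):

```python
from datetime import datetime, timezone

# Mirror the shell: a timestamped, unique training-job name.
job_name = "xgboost-training-" + datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")

# Baseline hyperparameters for the built-in XGBoost algorithm.
# SageMaker expects hyperparameter values as strings.
hyperparameters = {
    "num_round": "100",              # number of boosting rounds
    "eta": "0.3",                    # learning rate (XGBoost default)
    "objective": "reg:squarederror", # regression objective
}
print(job_name, hyperparameters)
```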
Step 3: Launch Training Job
We will trigger a single training job to establish a baseline performance metric.
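The inline JSON passed to `--input-data-config` is easy to mistype; one way to generate it programmatically (the bucket name is a placeholder, as in Step 1):

```python
import json

bucket = "brainybee-lab-training-<UNIQUE_ID>"  # placeholder: substitute your bucket
input_data_config = [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/data/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }
]
# Print compact JSON suitable for pasting into the CLI command.
print(json.dumps(input_data_config))
```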
```shell
aws sagemaker create-training-job \
    --training-job-name $JOB_NAME \
    --algorithm-specification TrainingImage=$IMAGE,TrainingInputMode=File \
    --role-arn $ROLE_ARN \
    --input-data-config '[{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/", "S3DataDistributionType": "FullyReplicated"}}}]' \
    --output-data-config S3OutputPath=s3://brainybee-lab-training-<UNIQUE_ID>/output/ \
    --resource-config InstanceType=ml.m5.xlarge,InstanceCount=1,VolumeSizeInGB=5 \
    --stopping-condition MaxRuntimeInSeconds=3600
```

Step 4: Refine the Model with Hyperparameter Tuning (HPO)
Instead of manual trial-and-error, we will use a Tuning Job to search for the best max_depth and eta values using Bayesian Optimization.
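To build intuition for the search space declared in the tuning job (`eta` in [0.1, 0.5], `max_depth` in [3, 9]), here is a toy random sampler over the same ranges (random sampling is shown only for contrast; the tuning job itself uses Bayesian optimization):

```python
import random

random.seed(0)  # reproducible demo

def sample_candidate():
    # Same ranges as the tuning-job config.
    return {
        "eta": random.uniform(0.1, 0.5),    # continuous parameter range
        "max_depth": random.randint(3, 9),  # integer parameter range
    }

# MaxNumberOfTrainingJobs = 10 in the tuning-job config.
candidates = [sample_candidate() for _ in range(10)]
for c in candidates:
    assert 0.1 <= c["eta"] <= 0.5 and 3 <= c["max_depth"] <= 9
print(candidates[0])
```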
```shell
aws sagemaker create-hyper-parameter-tuning-job \
    --hyper-parameter-tuning-job-name "hpo-refinement-$(date +%s)" \
    --hyper-parameter-tuning-job-config '{"Strategy": "Bayesian", "HyperParameterTuningJobObjective": {"Type": "Minimize", "MetricName": "validation:rmse"}, "ResourceLimits": {"MaxNumberOfTrainingJobs": 10, "MaxParallelTrainingJobs": 2}, "ParameterRanges": {"ContinuousParameterRanges": [{"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}], "IntegerParameterRanges": [{"Name": "max_depth", "MinValue": "3", "MaxValue": "9"}]}}' \
    --training-job-definition '{"AlgorithmSpecification": {"TrainingImage": "'$IMAGE'", "TrainingInputMode": "File"}, "RoleArn": "'$ROLE_ARN'", "InputDataConfig": [{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/"}}}], "OutputDataConfig": {"S3OutputPath": "s3://brainybee-lab-training-<UNIQUE_ID>/hpo-output/"}, "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 5}, "StoppingCondition": {"MaxRuntimeInSeconds": 3600}}'
```

Visualizing the Refinement Process
In hyperparameter optimization, the algorithm attempts to find the global minimum of the loss function by adjusting parameters across a search space.
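A toy illustration of that idea, minimizing a one-dimensional "loss" over a parameter grid (purely illustrative; real HPO runs a full training job per candidate point):

```python
def loss(eta):
    # Pretend the validation RMSE is quadratic with its minimum at eta = 0.3.
    return (eta - 0.3) ** 2 + 1.0

# Evaluate candidates across the search space and keep the best one.
grid = [0.1 + 0.05 * i for i in range(9)]  # 0.10, 0.15, ..., 0.50
best_eta = min(grid, key=loss)
print(best_eta, loss(best_eta))
```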
Checkpoints
- Verify Training Status: Run `aws sagemaker describe-training-job --training-job-name <JOB_NAME>`. The `TrainingJobStatus` should move from `InProgress` to `Completed`.
- Check HPO Progress: In the SageMaker Console, under Training > Hyperparameter tuning jobs, ensure your job shows multiple child training jobs being spawned.
- Inspect S3 Output: Confirm that a `model.tar.gz` file exists in your output bucket path.
Clean-Up / Teardown
To prevent costs associated with stored data and training metadata:
```shell
# Delete the S3 bucket and all its contents
aws s3 rb s3://brainybee-lab-training-<UNIQUE_ID> --force

# List and delete any created notebooks (if used)
aws sagemaker list-notebook-instances
```

Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| `AccessDenied` | IAM role lacks S3 permissions. | Attach `AmazonS3FullAccess` to the execution role. |
| `ResourceLimitExceeded` | Account has hit the training instance quota. | Use a smaller instance (e.g., `ml.m5.large`) or request a quota increase. |
| `InvalidImageUri` | XGBoost image URI is incorrect for the region. | Check the official AWS SageMaker documentation for the ECR URI in your specific region. |
Stretch Challenge
Early Stopping: Modify the Training Job configuration to include an Early Stopping parameter. This prevents overfitting and saves costs by terminating the job when the validation metric stops improving for a defined number of epochs.
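The early-stopping idea can be sketched in a few lines (the metric values are synthetic and the patience of 3 rounds is an arbitrary choice):

```python
def should_stop(history, patience=3):
    # Stop when the validation metric has not improved for `patience` rounds.
    if len(history) <= patience:
        return False
    best_so_far = min(history[:-patience])
    return min(history[-patience:]) >= best_so_far

# Synthetic validation RMSE per round: improves, then plateaus.
rmse = [0.90, 0.75, 0.60, 0.55, 0.56, 0.57, 0.58]
stopped_at = next(i for i in range(1, len(rmse) + 1) if should_stop(rmse[:i]))
print(stopped_at)  # training would halt after this round
```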
Cost Estimate
- S3 Storage: Negligible for this lab (<$0.01).
- SageMaker Training: `ml.m5.xlarge` costs approximately $0.23 per hour. Total lab time (~30 mins) with 10 HPO jobs should cost between $1.00 and $2.50.
- Free Tier: This lab may be eligible for the SageMaker Free Tier if your account is less than 2 months old and you haven't exceeded the 250-hour limit.
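A quick back-of-the-envelope check of the training estimate (the hourly rate is the approximate on-demand figure quoted above and varies by region; the half hour per job is a rough worst-case assumption):

```python
hourly_rate = 0.23    # approx. ml.m5.xlarge on-demand $/hour
jobs = 10             # MaxNumberOfTrainingJobs in the HPO config
hours_per_job = 0.5   # assumed billed instance time per tuning job

estimated_cost = hourly_rate * jobs * hours_per_job
print(f"${estimated_cost:.2f}")
```

The result lands inside the $1.00 to $2.50 range quoted in the estimate.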
Concept Review
| Term | Definition | Real-World Impact |
|---|---|---|
| Epoch | One full pass through the dataset. | High epochs can lead to overfitting; low epochs to underfitting. |
| Batch Size | Number of samples processed before updating parameters. | Influences memory usage and training stability. |
| L2 Regularization | Adds a squared penalty to weights (Ridge). | Prevents single features from dominating the model, reducing variance. |
| Bayesian Tuning | Optimization that uses previous results to pick next parameters. | Much more efficient than Random Search for complex parameter spaces. |
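The L2 (Ridge) penalty in the table is simply a squared-weight term added to the base loss; a minimal numeric illustration (`lam` is an arbitrary regularization strength chosen for the example):

```python
def l2_penalized_loss(mse, weights, lam=0.1):
    # Ridge: base loss plus lam times the sum of squared weights.
    return mse + lam * sum(w * w for w in weights)

# Larger weights incur a larger penalty, discouraging any one feature
# from dominating the model.
print(l2_penalized_loss(0.5, [1.0, -2.0, 0.5]))
```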