
Hands-On Lab: Training and Refining Models with Amazon SageMaker

This lab provides a hands-on experience in executing the machine learning lifecycle on AWS. You will focus on Content Domain 2: ML Model Development, specifically training a model and refining its performance using Hyperparameter Optimization (HPO).

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for storage and resource metadata.

Prerequisites

  • AWS Account: An active AWS account with permissions to manage SageMaker and S3.
  • IAM Role: A SageMaker execution role with AmazonSageMakerFullAccess and AmazonS3FullAccess policies attached.
  • AWS CLI: Installed and configured on your local machine with <YOUR_REGION> (e.g., us-east-1).
  • Knowledge: Basic understanding of Python and the XGBoost algorithm.

Learning Objectives

  • Configure an Amazon S3 bucket for training data and model artifacts.
  • Launch a SageMaker Training Job using the built-in XGBoost algorithm.
  • Execute a Hyperparameter Tuning Job to refine model accuracy.
  • Analyze the impact of hyperparameters like learning_rate and max_depth on model performance.

Architecture Overview

[Architecture diagram: an S3 bucket holds the training data under a data/ prefix; SageMaker training and hyperparameter tuning jobs read from it and write model artifacts back to the output/ and hpo-output/ prefixes.]

Step-by-Step Instructions

Step 1: Prepare the S3 Environment

We need a centralized location for our input data and the resulting model weights.

```bash
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-training-<UNIQUE_ID>
```
Console alternative
  1. Navigate to S3 in the AWS Management Console.
  2. Click Create bucket.
  3. Name it brainybee-lab-training-<UNIQUE_ID> and click Create.
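
The training and tuning jobs later in this lab read from a data/ prefix in this bucket, so the dataset must be uploaded first. A minimal sketch, assuming your training data is in a local file named train.csv (substitute your own file name):

```shell
# Upload the training dataset to the data/ prefix the training job will read from
aws s3 cp train.csv s3://brainybee-lab-training-<UNIQUE_ID>/data/train.csv

# Verify the upload
aws s3 ls s3://brainybee-lab-training-<UNIQUE_ID>/data/
```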

Step 2: Define Hyperparameters and Algorithm

For our initial training, we will use the SageMaker built-in XGBoost. We must define the initial hyperparameters such as num_round and eta (learning rate).

```bash
# Define variables for the training job
JOB_NAME="xgboost-training-$(date +%Y-%m-%d-%H-%M-%S)"
ROLE_ARN="<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>"
IMAGE="<ECR_IMAGE_URI_FOR_XGBOOST>"
```

[!TIP] The XGBoost image URI varies by region. Use the SageMaker Python SDK locally to find the latest URI for your region.
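
One way to look up the URI, assuming you have the SageMaker Python SDK installed locally (pip install sagemaker); the version pin shown here is illustrative:

```shell
# Print the regional ECR image URI for the built-in XGBoost algorithm
python -c "from sagemaker import image_uris; print(image_uris.retrieve('xgboost', region='us-east-1', version='1.7-1'))"
```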

Step 3: Launch Training Job

We will trigger a single training job to establish a baseline performance metric.

```bash
# num_round is required by XGBoost; these hyperparameter values are a starting baseline
aws sagemaker create-training-job \
    --training-job-name $JOB_NAME \
    --algorithm-specification TrainingImage=$IMAGE,TrainingInputMode=File \
    --role-arn $ROLE_ARN \
    --hyper-parameters num_round=100,eta=0.2,objective=reg:squarederror \
    --input-data-config '[{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/", "S3DataDistributionType": "FullyReplicated"}}}]' \
    --output-data-config S3OutputPath=s3://brainybee-lab-training-<UNIQUE_ID>/output/ \
    --resource-config InstanceType=ml.m5.xlarge,InstanceCount=1,VolumeSizeInGB=5 \
    --stopping-condition MaxRuntimeInSeconds=3600
```
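
Training runs asynchronously, so it helps to block until the job finishes and then read off the baseline metric. A sketch using the CLI's built-in waiter:

```shell
# Block until the training job reaches a terminal state
aws sagemaker wait training-job-completed-or-stopped --training-job-name $JOB_NAME

# Print the final status and reported metrics for the baseline
aws sagemaker describe-training-job --training-job-name $JOB_NAME \
    --query '{Status: TrainingJobStatus, Metrics: FinalMetricDataList}'
```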

Step 4: Refine the Model with Hyperparameter Tuning (HPO)

Instead of manual trial-and-error, we will use a Tuning Job to search for the best max_depth and eta values using Bayesian Optimization.

```bash
# num_round and objective are held fixed (StaticHyperParameters) while eta and max_depth are tuned
aws sagemaker create-hyper-parameter-tuning-job \
    --hyper-parameter-tuning-job-name "hpo-refinement-$(date +%s)" \
    --hyper-parameter-tuning-job-config '{"Strategy": "Bayesian", "HyperParameterTuningJobObjective": {"Type": "Minimize", "MetricName": "validation:rmse"}, "ResourceLimits": {"MaxNumberOfTrainingJobs": 10, "MaxParallelTrainingJobs": 2}, "ParameterRanges": {"ContinuousParameterRanges": [{"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}], "IntegerParameterRanges": [{"Name": "max_depth", "MinValue": "3", "MaxValue": "9"}]}}' \
    --training-job-definition '{"AlgorithmSpecification": {"TrainingImage": "'$IMAGE'", "TrainingInputMode": "File"}, "RoleArn": "'$ROLE_ARN'", "StaticHyperParameters": {"num_round": "100", "objective": "reg:squarederror"}, "InputDataConfig": [{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://brainybee-lab-training-<UNIQUE_ID>/data/", "S3DataDistributionType": "FullyReplicated"}}}], "OutputDataConfig": {"S3OutputPath": "s3://brainybee-lab-training-<UNIQUE_ID>/hpo-output/"}, "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 5}, "StoppingCondition": {"MaxRuntimeInSeconds": 3600}}'
```

Visualizing the Refinement Process

In hyperparameter optimization, the algorithm attempts to find the global minimum of the loss function by adjusting parameters across a search space.


Checkpoints

  1. Verify Training Status: Run aws sagemaker describe-training-job --training-job-name <JOB_NAME>. The TrainingJobStatus should move from InProgress to Completed.
  2. Check HPO Progress: In the SageMaker Console, under Training > Hyperparameter tuning jobs, ensure your job shows multiple child training jobs being spawned.
  3. Inspect S3 Output: Confirm that a model.tar.gz file exists in your output bucket path.
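
Once the tuning job completes, you can identify the winning trial directly from the CLI. A sketch (substitute your tuning job's name):

```shell
# Report the best child training job, its objective metric, and the tuned hyperparameter values
aws sagemaker describe-hyper-parameter-tuning-job \
    --hyper-parameter-tuning-job-name <TUNING_JOB_NAME> \
    --query 'BestTrainingJob.{Name: TrainingJobName, Objective: FinalHyperParameterTuningJobObjectiveMetric.Value, TunedParams: TunedHyperParameters}'
```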

Clean-Up / Teardown

To prevent costs associated with stored data and training metadata:

```bash
# Delete the S3 bucket and all its contents
aws s3 rb s3://brainybee-lab-training-<UNIQUE_ID> --force

# List and delete any created notebooks (if used)
aws sagemaker list-notebook-instances
```

Troubleshooting

| Error | Cause | Solution |
| --- | --- | --- |
| AccessDenied | IAM role lacks S3 permissions. | Attach AmazonS3FullAccess to the execution role. |
| ResourceLimitExceeded | Account has hit training instance quota. | Use a smaller instance (e.g., ml.m5.large) or request a quota increase. |
| InvalidImageUri | XGBoost image URI is incorrect for the region. | Check the official AWS SageMaker documentation for the ECR URI in your specific region. |

Stretch Challenge

Early Stopping: Modify the Hyperparameter Tuning Job configuration to enable early stopping. This saves costs by terminating child training jobs that are unlikely to improve on the current best objective metric, and it also helps guard against overfitting.
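
In SageMaker HPO this is controlled by the TrainingJobEarlyStoppingType field of the tuning job config. A sketch of the fragment to add (the rest of the config is unchanged):

```shell
# Inside the JSON passed to --hyper-parameter-tuning-job-config, add:
#   "TrainingJobEarlyStoppingType": "Auto"
# Valid values are "Off" (default) and "Auto", which lets SageMaker stop
# underperforming child training jobs early.
```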

Cost Estimate

  • S3 Storage: Negligible for this lab (<$0.01).
  • SageMaker Training: ml.m5.xlarge costs approximately $0.23 per hour. Total lab time (~30 mins) with 10 HPO jobs should cost between $1.00 and $2.50.
  • Free Tier: This lab may be eligible for the SageMaker Free Tier if your account is within its first 2 months of SageMaker usage; check the current Free Tier terms for the training-hour limits and eligible instance types before relying on it.

Concept Review

| Term | Definition | Real-World Impact |
| --- | --- | --- |
| Epoch | One full pass through the dataset. | High epochs can lead to overfitting; low epochs to underfitting. |
| Batch Size | Number of samples processed before updating parameters. | Influences memory usage and training stability. |
| L2 Regularization | Adds a squared penalty to weights (Ridge). | Prevents single features from dominating the model, reducing variance. |
| Bayesian Tuning | Optimization that uses previous results to pick next parameters. | Much more efficient than Random Search for complex parameter spaces. |
