Lab: Deploying and Scaling ML Infrastructure on AWS
Select deployment infrastructure based on existing architecture and requirements
This hands-on lab guides you through the process of selecting, deploying, and configuring auto-scaling for machine learning infrastructure on AWS, specifically focusing on Amazon SageMaker AI endpoints. You will learn to navigate the tradeoffs between different deployment targets and implement infrastructure-as-code principles.
Prerequisites
To successfully complete this lab, you need:
- An AWS account with AdministratorAccess permissions.
- AWS CLI installed and configured on your local machine with access to `<YOUR_REGION>` (e.g., `us-east-1`).
- Python 3.9+ with the `sagemaker` and `boto3` libraries installed (`pip install sagemaker boto3`).
- Basic familiarity with the AWS Management Console.
Learning Objectives
By the end of this lab, you will be able to:
- Select the appropriate deployment target based on latency and cost requirements.
- Deploy a pre-trained model to a SageMaker Real-Time Endpoint using the Python SDK.
- Configure Auto Scaling policies to handle fluctuating traffic patterns.
- Verify infrastructure scaling using CloudWatch metrics and CLI commands.
Architecture Overview
The lab provisions three main components: an IAM execution role, a SageMaker Real-Time Endpoint backed by `ml.t2.medium` instances, and an Application Auto Scaling policy that scales the endpoint between 1 and 3 instances based on CloudWatch invocation metrics.
Step-by-Step Instructions
Step 1: Create an IAM Execution Role
SageMaker requires an IAM role to access S3 buckets and CloudWatch logs on your behalf.
```bash
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "sagemaker.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBee-SageMaker-Role --assume-role-policy-document file://trust-policy.json

# Attach the SageMaker Full Access policy
aws iam attach-role-policy --role-name BrainyBee-SageMaker-Role --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```

Console alternative: navigate to IAM > Roles > Create Role, select AWS Service and SageMaker, attach the AmazonSageMakerFullAccess policy, and name the role BrainyBee-SageMaker-Role.
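If you prefer scripting over the CLI, the same role can be created with `boto3`. A minimal sketch (the `sagemaker_trust_policy` helper name is illustrative; the commented calls mirror the CLI commands above):

```python
import json

def sagemaker_trust_policy() -> dict:
    """Trust policy allowing SageMaker to assume the role (same JSON as the CLI step)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# With boto3 installed and credentials configured, the role could be created like:
#   import boto3
#   iam = boto3.client("iam")
#   iam.create_role(
#       RoleName="BrainyBee-SageMaker-Role",
#       AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
#   )
#   iam.attach_role_policy(
#       RoleName="BrainyBee-SageMaker-Role",
#       PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
#   )
```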
Step 2: Deploy a Model to a Real-Time Endpoint
We will use a sample pre-trained model provided by AWS to create a Real-Time endpoint.
```python
import sagemaker
from sagemaker.image_uris import retrieve

session = sagemaker.Session()
role = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBee-SageMaker-Role"
region = session.boto_region_name

# Retrieve the container URI for XGBoost
container = retrieve("xgboost", region, version="1.5-1")

# Define the model (the explicit name makes teardown easier later)
model = sagemaker.model.Model(
    image_uri=container,
    model_data="s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/model/xgboost-model.tar.gz",
    role=role,
    name="brainybee-credit-risk-model",
)

# Deploy the endpoint (blocks until the endpoint is InService)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    endpoint_name="brainybee-credit-risk-ep",
)

print(f"Endpoint Status: {predictor.endpoint_name} is live!")
```

> [!NOTE]
> Deployment typically takes 3–5 minutes as AWS provisions the underlying EC2 instances.
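Once the endpoint is live, you can smoke-test it with a single invocation. A minimal sketch (the `to_csv_payload` helper is illustrative; the XGBoost container accepts `text/csv` input, and the feature values below are placeholders):

```python
def to_csv_payload(features: list) -> str:
    """Serialize one feature row as CSV, the format the XGBoost container expects."""
    return ",".join(str(f) for f in features)

# With boto3 configured and the endpoint InService:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(
#       EndpointName="brainybee-credit-risk-ep",
#       ContentType="text/csv",
#       Body=to_csv_payload([1.0, 2.0, 3.5]),  # replace with real feature values
#   )
#   print(response["Body"].read().decode())
```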
Step 3: Configure Auto Scaling Policies
Now, we will enable the infrastructure to scale from 1 to 3 instances based on the number of invocations.
```bash
# 1. Register the endpoint as a scalable target
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/brainybee-credit-risk-ep/variant/AllTraffic \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 1 \
    --max-capacity 3

# 2. Apply a Target Tracking scaling policy
aws application-autoscaling put-scaling-policy \
    --policy-name BrainyBeeScalingPolicy \
    --service-namespace sagemaker \
    --resource-id endpoint/brainybee-credit-risk-ep/variant/AllTraffic \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{"TargetValue": 50.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"}, "ScaleInCooldown": 300, "ScaleOutCooldown": 60}'
```

Checkpoints
- Endpoint verification: run `aws sagemaker describe-endpoint --endpoint-name brainybee-credit-risk-ep`. The `EndpointStatus` should be `InService`.
- Auto Scaling registration: run `aws application-autoscaling describe-scalable-targets --service-namespace sagemaker`. You should see your endpoint listed with a `MinCapacity` of 1 and a `MaxCapacity` of 3.
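Both checkpoints can also be scripted against the JSON responses of those two CLI calls. A sketch (helper names are illustrative; each function inspects a parsed response dictionary):

```python
def check_endpoint(describe_endpoint: dict) -> bool:
    """Checkpoint 1: the DescribeEndpoint response should report InService."""
    return describe_endpoint.get("EndpointStatus") == "InService"

def check_scaling(describe_targets: dict,
                  resource_id: str = "endpoint/brainybee-credit-risk-ep/variant/AllTraffic") -> bool:
    """Checkpoint 2: the scalable target should exist with MinCapacity 1 / MaxCapacity 3."""
    for target in describe_targets.get("ScalableTargets", []):
        if target.get("ResourceId") == resource_id:
            return target.get("MinCapacity") == 1 and target.get("MaxCapacity") == 3
    return False
```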
Visualizing the Scaling Logic
The following graph represents how the auto-scaling policy manages instance count relative to the target metric (Invocations per Instance).
```latex
\begin{tikzpicture}
  \draw[->] (0,0) -- (6,0) node[right]{Load (Invocations/Min)};
  \draw[->] (0,0) -- (0,4) node[above]{Instances};
  \draw[dashed, red] (0,2) -- (6,2) node[right]{Target: 50};
  \draw[blue, thick] (0,1) -- (2.5,1) -- (3,2) -- (4.5,2) -- (5,3);
  \node at (1.2,0.7) {1 Instance};
  \node at (3.8,1.7) {Scaling Out};
  \node at (5.5,3.2) {3 Instances (Max)};
\end{tikzpicture}
```
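As a rough mental model, target tracking adjusts the instance count to keep invocations-per-instance near the target. A simplified steady-state approximation (the real service reacts via CloudWatch alarms and the cooldowns configured above, so transitions are not instantaneous):

```python
import math

def desired_instances(invocations_per_minute: float,
                      target_per_instance: float = 50.0,
                      min_capacity: int = 1,
                      max_capacity: int = 3) -> int:
    """Instances needed to keep invocations-per-instance at or below the target,
    clamped to the registered min/max capacity."""
    if invocations_per_minute <= 0:
        return min_capacity
    needed = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, needed))

desired_instances(40)   # → 1 (under the 50/instance target)
desired_instances(90)   # → 2 (scaling out)
desired_instances(500)  # → 3 (clamped at max capacity)
```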
Troubleshooting
| Error | Cause | Solution |
|---|---|---|
| `AccessDenied` | IAM role lacks `sagemaker:CreateEndpoint` | Re-attach AmazonSageMakerFullAccess or check the role ARN. |
| `ResourceLimitExceeded` | Account limit for `ml.t2.medium` reached | Request a limit increase or use a different instance type. |
| `ValidationError` | Incorrect resource ID format in the CLI | Ensure the format is `endpoint/<name>/variant/AllTraffic`. |
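One way to catch the `ValidationError` case before calling the CLI is to sanity-check the resource ID locally. A sketch (the pattern assumes alphanumeric-and-hyphen names, which matches SageMaker endpoint naming rules):

```python
import re

# Expected shape: endpoint/<name>/variant/<variant-name>
RESOURCE_ID_PATTERN = re.compile(r"^endpoint/[A-Za-z0-9\-]+/variant/[A-Za-z0-9\-]+$")

def valid_resource_id(resource_id: str) -> bool:
    """Check the Application Auto Scaling resource ID format before calling the CLI."""
    return RESOURCE_ID_PATTERN.fullmatch(resource_id) is not None
```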
Clean-Up / Teardown
> [!WARNING]
> Failure to delete these resources will result in ongoing charges for the `ml.t2.medium` instance.
```bash
# Delete the scaling policy
aws application-autoscaling delete-scaling-policy \
    --policy-name BrainyBeeScalingPolicy \
    --service-namespace sagemaker \
    --resource-id endpoint/brainybee-credit-risk-ep/variant/AllTraffic \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount

# Delete the endpoint and configuration
aws sagemaker delete-endpoint --endpoint-name brainybee-credit-risk-ep
aws sagemaker delete-endpoint-config --endpoint-config-name brainybee-credit-risk-ep

# Delete the model (run `aws sagemaker list-models` if you deployed under a different name)
aws sagemaker delete-model --model-name brainybee-credit-risk-model
```

Stretch Challenge
Serverless Transformation: Modify your Python deployment script to use a ServerlessInferenceConfig instead of a dedicated instance type. Compare the deployment time and how AWS handles scaling without an explicit scaling policy.
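One possible starting point for the challenge, sketched against the low-level CreateEndpointConfig request shape (the `serverless_variant` helper and its default values are illustrative; the commented `sagemaker` SDK call shows the equivalent high-level route):

```python
def serverless_variant(model_name: str,
                       memory_mb: int = 2048,
                       max_concurrency: int = 5) -> dict:
    """Production variant for a serverless endpoint config (no instance type or count)."""
    return {
        "ModelName": model_name,
        "VariantName": "AllTraffic",
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,      # valid values: 1024-6144, in 1024 MB steps
            "MaxConcurrency": max_concurrency,
        },
    }

# With the sagemaker SDK installed, model.deploy accepts the equivalent:
#   from sagemaker.serverless import ServerlessInferenceConfig
#   predictor = model.deploy(
#       serverless_inference_config=ServerlessInferenceConfig(
#           memory_size_in_mb=2048, max_concurrency=5
#       ),
#       endpoint_name="brainybee-credit-risk-serverless",
#   )
```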
Cost Estimate
- SageMaker Real-Time (`ml.t2.medium`): ~$0.056 per hour per instance; running 1 instance for 24 hours costs ~$1.34.
- Storage (S3/CloudWatch): negligible for this lab (~$0.01).
- Total estimated spend: under $0.10 for this 30-minute lab if resources are cleaned up immediately.
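The estimates above follow directly from the hourly rate. A small sketch of the arithmetic (the rate shown reflects us-east-1 pricing at the time of writing; verify current rates before budgeting):

```python
def endpoint_cost(hourly_rate: float, instances: int, hours: float) -> float:
    """On-demand cost for a real-time endpoint: rate x instances x hours, in USD."""
    return round(hourly_rate * instances * hours, 2)

endpoint_cost(0.056, 1, 24)   # → 1.34 (one instance for a day)
endpoint_cost(0.056, 1, 0.5)  # → 0.03 (this 30-minute lab)
```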
Concept Review
| Deployment Type | Use Case | Scaling Behavior | Cost Model |
|---|---|---|---|
| Real-Time | Low latency, steady traffic | Manual or Auto Scaling | Hourly instance fee |
| Serverless | Intermittent traffic, cold-start okay | Managed by AWS (0 to Max) | Pay per invocation/duration |
| Asynchronous | Large payloads, long processing | Managed (scale to zero) | Hourly instance fee |
| Batch | Bulk processing, no persistent endpoint | Not applicable | Pay for job duration |
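The table can be condensed into a rule-of-thumb selector. A sketch (the function and its boolean inputs are illustrative simplifications of the table, not an official decision procedure):

```python
def pick_deployment_type(low_latency: bool, intermittent: bool,
                         large_payloads: bool, persistent_endpoint: bool) -> str:
    """Map the requirements from the table above to a deployment type."""
    if not persistent_endpoint:
        return "Batch"            # bulk processing, no standing endpoint
    if large_payloads:
        return "Asynchronous"     # large payloads, long processing times
    if intermittent and not low_latency:
        return "Serverless"       # spiky traffic where cold starts are acceptable
    return "Real-Time"            # low latency, steady traffic
```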
> [!IMPORTANT]
> Choosing between managed (SageMaker) and unmanaged (ECS/EKS) deployment targets depends on the need for control. Managed endpoints reduce operational burden, while unmanaged targets provide custom networking and deeper infrastructure control.