Optimizing ML Infrastructure: Monitoring and Cost Management Lab

This lab provides hands-on experience in monitoring machine learning infrastructure and implementing cost-optimization strategies on AWS. You will learn to use Amazon CloudWatch for performance tracking and implement tagging and budgeting for cost visibility.

Prerequisites

AWS Account: An active AWS account with permissions for SageMaker, CloudWatch, and AWS Budgets.
AWS CLI: Installed and configured with administrator-level access.
Region: This lab uses us-east-1, but you can use any region where SageMaker is available.
Knowledge: Basic understanding of SageMaker instances and CloudWatch metrics.

Learning Objectives

Implement a resource tagging strategy for ML cost allocation.
Create an Amazon CloudWatch Dashboard to monitor ML compute performance.
Configure a CloudWatch Alarm to trigger notifications on resource over-utilization.
Set up an AWS Budget to track and alert on ML-specific spending.

Architecture Overview

This lab demonstrates the flow from infrastructure utilization to monitoring and financial alerting.

Step-by-Step Instructions

Step 1: Create a Tagged SageMaker Notebook Instance

Tagging is the foundation of cost allocation. We will create a small instance with specific metadata tags.

bash

aws sagemaker create-notebook-instance \
    --notebook-instance-name "brainybee-lab-notebook" \
    --instance-type "ml.t3.medium" \
    --role-arn "<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>" \
    --tags Key="Project",Value="ML-Optimization-Lab" Key="Environment",Value="Dev"

Console alternative

Open the Amazon SageMaker Console.
Navigate to Notebooks > Notebook instances.
Click Create notebook instance.
Name it brainybee-lab-notebook.
Expand the Tags section.
Add Key: Project | Value: ML-Optimization-Lab.
Add Key: Environment | Value: Dev.
Click Create notebook instance.

[!TIP] Use a low-cost instance type such as ml.t3.medium for experimentation to keep costs down.
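
Before moving on, it can help to confirm that the instance is running and that both tags were applied. The following is a minimal sketch rather than part of the original lab: the function name `verify_notebook_tags` is hypothetical, and it assumes the AWS CLI is configured for the same account and region used above.

```shell
# Hypothetical helper: waits for the notebook to reach InService,
# then prints its tags as tab-separated Key/Value pairs.
verify_notebook_tags() {
    local name="$1"
    local arn

    # Block until the instance is InService (the waiter polls for you).
    aws sagemaker wait notebook-instance-in-service \
        --notebook-instance-name "$name" || return 1

    # Tags are listed by ARN, so look the ARN up first.
    arn=$(aws sagemaker describe-notebook-instance \
        --notebook-instance-name "$name" \
        --query "NotebookInstanceArn" --output text) || return 1

    aws sagemaker list-tags --resource-arn "$arn" \
        --query "Tags[].[Key,Value]" --output text
}
```

Calling `verify_notebook_tags "brainybee-lab-notebook"` should list the Project and Environment tags once the instance has finished booting.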

Step 2: Create a CloudWatch Dashboard

Dashboards provide near-real-time visibility into infrastructure health. The dashboard below tracks CPU utilization; other metrics can be added as further widgets.

bash

# Note: Dashboards are typically created via JSON configuration.
# This command creates a simple dashboard with a CPU metric widget.
aws cloudwatch put-dashboard \
    --dashboard-name "ML-Infrastructure-Monitor" \
    --dashboard-body '{"widgets":[{"type":"metric","x":0,"y":0,"width":12,"height":6,"properties":{"metrics":[["AWS/SageMaker","CPUUtilization","NotebookInstanceName","brainybee-lab-notebook"]],"period":300,"stat":"Average","region":"us-east-1","title":"Notebook CPU Utilization"}}]}'

Console alternative

Navigate to CloudWatch > Dashboards.
Click Create dashboard and name it ML-Infrastructure-Monitor.
Click Add widget > Line.
Select Metrics > SageMaker > Host Metrics.
Search for brainybee-lab-notebook and select CPUUtilization.
Click Create widget and then Save dashboard.
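
The single-line JSON passed to `--dashboard-body` is hard to read and edit. One possible approach, sketched here under the assumption that you are working in a writable directory, is to keep the same body in a file and pass it with `file://`. The file name `dashboard.json` and the helper `apply_dashboard` are illustrative names, not part of the lab.

```shell
# Write the same dashboard body as the CLI command above to a file.
cat > dashboard.json <<'EOF'
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [
          ["AWS/SageMaker", "CPUUtilization", "NotebookInstanceName", "brainybee-lab-notebook"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Notebook CPU Utilization"
      }
    }
  ]
}
EOF

# Optional sanity check: confirm the file parses as JSON (needs python3).
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool dashboard.json >/dev/null && echo "dashboard.json: valid JSON"
fi

# Hypothetical helper that applies the file; call it once the JSON looks right.
apply_dashboard() {
    aws cloudwatch put-dashboard \
        --dashboard-name "ML-Infrastructure-Monitor" \
        --dashboard-body file://dashboard.json
}
```

Running `apply_dashboard` creates or overwrites the dashboard of the same name, so it is safe to re-run after editing the file.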

Step 3: Configure a Utilization Alarm

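Alarms flag resources that need rightsizing. The alarm below fires when average CPU stays above 80% for two consecutive 5-minute periods, a sign that the instance is undersized. An alarm with a low threshold can be used in the same way to spot idle "zombie" instances.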

bash

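# Alarm when average CPU exceeds 80% for two consecutive 5-minute periods.
# No --alarm-actions is set, so the alarm only changes state; pass an SNS
# topic ARN via --alarm-actions to receive notifications.
aws cloudwatch put-metric-alarm \
    --alarm-name "High-CPU-Usage-ML" \
    --metric-name "CPUUtilization" \
    --namespace "AWS/SageMaker" \
    --statistic "Average" \
    --period 300 \
    --threshold 80 \
    --comparison-operator "GreaterThanThreshold" \
    --dimensions Name=NotebookInstanceName,Value=brainybee-lab-notebook \
    --evaluation-periods 2 \
    --unit "Percent"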
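
The alarm above catches over-utilization. To catch the opposite case, an idle instance that is still being billed, the same command can be inverted. This is a sketch under assumptions: the alarm name `Low-CPU-Usage-ML`, the 5% threshold, the one-hour window (12 periods of 300 seconds), and the helper name `create_idle_alarm` are all illustrative choices, not values from the lab.

```shell
# Hypothetical helper: alarm when average CPU stays below 5% for one hour.
create_idle_alarm() {
    local notebook="$1"
    aws cloudwatch put-metric-alarm \
        --alarm-name "Low-CPU-Usage-ML" \
        --metric-name "CPUUtilization" \
        --namespace "AWS/SageMaker" \
        --statistic "Average" \
        --period 300 \
        --threshold 5 \
        --comparison-operator "LessThanThreshold" \
        --dimensions "Name=NotebookInstanceName,Value=$notebook" \
        --evaluation-periods 12 \
        --treat-missing-data "notBreaching" \
        --unit "Percent"
}
```

Setting `--treat-missing-data` to `notBreaching` keeps the alarm from firing while the instance is stopped and emitting no data. If you run `create_idle_alarm "brainybee-lab-notebook"`, remember to delete the extra alarm during teardown with `aws cloudwatch delete-alarms --alarm-names "Low-CPU-Usage-ML"`.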

Step 4: Set Up an AWS Budget

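Budgets help prevent unexpected costs by alerting you when spending crosses a threshold. The budget below emails you when actual spend exceeds 80% of a $10 monthly limit. As defined, it covers all spending in the account, and it only alerts; it does not stop any resources.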

bash

# Write the budget definition to budget.json
cat > budget.json <<'EOF'
{"BudgetLimit": {"Amount": "10", "Unit": "USD"}, "BudgetName": "ML-Lab-Budget", "BudgetType": "COST", "TimeUnit": "MONTHLY"}
EOF

aws budgets create-budget \
    --account-id <YOUR_ACCOUNT_ID> \
    --budget file://budget.json \
    --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]}]'
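
To track ML-specific spending rather than the whole account, a budget can be scoped to the Project tag from Step 1. The sketch below rests on assumptions that should be checked against the AWS Budgets documentation: that the `CostFilters` key `TagKeyValue` accepts values of the form `user:<key>$<value>`, and that the Project tag has already been activated as a cost allocation tag. The budget name `ML-Lab-Tagged-Budget`, the file name `budget-ml.json`, and the helper `create_tagged_budget` are hypothetical.

```shell
# Budget definition limited to resources tagged Project=ML-Optimization-Lab.
# The quoted 'EOF' keeps the "$" in the filter value literal.
cat > budget-ml.json <<'EOF'
{
  "BudgetName": "ML-Lab-Tagged-Budget",
  "BudgetLimit": {"Amount": "10", "Unit": "USD"},
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {"TagKeyValue": ["user:Project$ML-Optimization-Lab"]}
}
EOF

# Hypothetical helper: creates the budget with an 80% actual-spend email alert.
# Usage: create_tagged_budget <account-id> <email>
create_tagged_budget() {
    local account_id="$1" email="$2"
    aws budgets create-budget \
        --account-id "$account_id" \
        --budget file://budget-ml.json \
        --notifications-with-subscribers "[{\"Notification\":{\"NotificationType\":\"ACTUAL\",\"ComparisonOperator\":\"GREATER_THAN\",\"Threshold\":80,\"ThresholdType\":\"PERCENTAGE\"},\"Subscribers\":[{\"SubscriptionType\":\"EMAIL\",\"Address\":\"$email\"}]}]"
}
```

If you create this second budget, delete it during teardown as well, using `aws budgets delete-budget` with `--budget-name "ML-Lab-Tagged-Budget"`.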

Checkpoints

Verification: Run aws sagemaker list-notebook-instances. Ensure your notebook status is InService.
Monitoring: Navigate to the CloudWatch Console. Does your ML-Infrastructure-Monitor dashboard show a data point for the instance?
Billing: Navigate to AWS Billing > Cost Allocation Tags and search for Project. A newly created tag can take up to 24 hours to appear; once it does, activate it so that it shows up in cost reports.

Troubleshooting

Error	Cause	Fix
`AccessDeniedException`	Missing IAM permissions for SageMaker or CloudWatch.	Attach `AmazonSageMakerFullAccess` (and `CloudWatchFullAccess` for the dashboard and alarm steps) to your IAM user or role.
`ResourceNotFound`	Metric has not been emitted yet.	Wait 5-10 mins for the SageMaker instance to finish booting and emit metrics.
`ResourceLimitExceeded`	You have reached the quota for notebook instances.	Delete old notebooks or request a quota increase.

Teardown

[!IMPORTANT] To avoid ongoing charges, you must delete the resources created in this lab.

bash

# 1. Stop the notebook instance
aws sagemaker stop-notebook-instance --notebook-instance-name "brainybee-lab-notebook"

# 2. Wait for the instance to stop, then delete it
aws sagemaker wait notebook-instance-stopped --notebook-instance-name "brainybee-lab-notebook"
aws sagemaker delete-notebook-instance --notebook-instance-name "brainybee-lab-notebook"

# 3. Delete the CloudWatch Alarm
aws cloudwatch delete-alarms --alarm-names "High-CPU-Usage-ML"

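# 4. Delete the Dashboard
aws cloudwatch delete-dashboards --dashboard-names "ML-Infrastructure-Monitor"

# 5. Delete the Budget
aws budgets delete-budget --account-id <YOUR_ACCOUNT_ID> --budget-name "ML-Lab-Budget"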

Stretch Challenge

Automated Rightsizing: Using AWS Lambda and EventBridge, write a script that automatically stops a SageMaker Notebook instance if its CPU utilization remains below 5% for more than 1 hour.
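
As a starting point, the core idle check can be prototyped from the command line before it is moved into Lambda. The sketch below covers only that check; the Lambda function and the EventBridge schedule remain the challenge. The function names `is_idle` and `stop_if_idle` are hypothetical, and the `date -d` call assumes GNU date.

```shell
# Succeeds only if at least one datapoint is given and every datapoint is
# below the threshold. No data is treated as "not idle" to avoid stopping
# an instance on missing metrics.
is_idle() {
    local threshold="$1"
    shift
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk -v t="$threshold" '$1 >= t { busy = 1 } END { exit busy }'
}

# Fetches the last hour of 5-minute CPU averages and stops the notebook
# if all of them are below 5%.
stop_if_idle() {
    local notebook="$1"
    local start end averages
    # GNU date syntax; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
    start=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
    end=$(date -u +%Y-%m-%dT%H:%M:%SZ)

    averages=$(aws cloudwatch get-metric-statistics \
        --namespace "AWS/SageMaker" \
        --metric-name "CPUUtilization" \
        --dimensions "Name=NotebookInstanceName,Value=$notebook" \
        --start-time "$start" --end-time "$end" \
        --period 300 --statistics Average \
        --query "Datapoints[].Average" --output text) || return 1

    # $averages is intentionally unquoted so each value becomes one argument.
    if is_idle 5 $averages; then
        aws sagemaker stop-notebook-instance --notebook-instance-name "$notebook"
    fi
}
```

Keeping the comparison in `is_idle` separate from the AWS calls makes the decision logic easy to test on its own, and the same split carries over to a Lambda handler.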

Cost Estimate

SageMaker ml.t3.medium: ~$0.05/hour (Free Tier eligible for the first 2 months).
CloudWatch Dashboard: $3.00/month (Free for the first 3 dashboards).
AWS Budgets: First 2 budgets are free.
Total Lab Cost: < $0.10 if completed within 1 hour and resources are deleted.
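
The notebook charge is simply hours multiplied by the hourly rate, which also shows why teardown matters. A small worked example, using the approximate $0.05/hour figure from the list above (an estimate, not a quoted price) and a hypothetical helper named `estimate_cost`:

```shell
# Cost in USD = hours * hourly rate, printed with two decimals.
estimate_cost() {
    awk -v h="$1" -v r="$2" 'BEGIN { printf "%.2f\n", h * r }'
}

estimate_cost 1 0.05     # one-hour lab run: prints 0.05
estimate_cost 730 0.05   # left running for a 730-hour month: prints 36.50
```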

Concept Review

Tool	Primary Function	ML Optimization Use-Case
CloudWatch	Monitoring & Observability	Identifying underutilized instances for rightsizing.
AWS Budgets	Financial Governance	Alerting on cost overruns in training pipelines.
Cost Explorer	Historical Cost Analysis	Identifying which ML project (via tags) is most expensive.
Compute Optimizer	Rightsizing Recommendations	Suggesting a shift from P3 to G4dn instances for inference.

Monitoring Architecture Visualization

[Diagram: the metric evaluation cycle]