Optimizing ML Infrastructure: Monitoring and Cost Management Lab
This lab provides hands-on experience in monitoring machine learning infrastructure and implementing cost-optimization strategies on AWS. You will learn to use Amazon CloudWatch for performance tracking and implement tagging and budgeting for cost visibility.
Prerequisites
- AWS Account: An active AWS account with permissions for SageMaker, CloudWatch, and AWS Budgets.
- AWS CLI: Installed and configured with administrator-level access.
- Region: This lab uses
us-east-1, but you can use any region where SageMaker is available. - Knowledge: Basic understanding of SageMaker instances and CloudWatch metrics.
Learning Objectives
- Implement a resource tagging strategy for ML cost allocation.
- Create an Amazon CloudWatch Dashboard to monitor ML compute performance.
- Configure a CloudWatch Alarm to trigger notifications on resource over-utilization.
- Set up an AWS Budget to track and alert on ML-specific spending.
Architecture Overview
This lab demonstrates the flow from infrastructure utilization to monitoring and financial alerting.
Step-by-Step Instructions
Step 1: Create a Tagged SageMaker Notebook Instance
Tagging is the foundation of cost allocation. We will create a small instance with specific metadata tags.
aws sagemaker create-notebook-instance \
--notebook-instance-name "brainybee-lab-notebook" \
--instance-type "ml.t3.medium" \
--role-arn "<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>" \
--tags Key="Project",Value="ML-Optimization-Lab" Key="Environment",Value="Dev"
Console alternative
- Open the Amazon SageMaker Console.
- Navigate to Notebooks > Notebook instances.
- Click Create notebook instance.
- Name it brainybee-lab-notebook.
- Choose ml.t3.medium as the instance type and select your SageMaker execution role.
- Expand the Tags section.
- Add Key: Project | Value: ML-Optimization-Lab.
- Add Key: Environment | Value: Dev.
- Click Create notebook instance.
[!TIP] Always use lower-cost instances like ml.t3.medium for experimentation to minimize expense.
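Because cost allocation depends on the tags actually being present, it can help to verify them right after creation. The following is a minimal sketch, assuming the AWS CLI is configured; missing_tags is a hypothetical helper name, not part of the AWS CLI.

```shell
# Sketch: report which required tag keys are missing from a SageMaker resource.
# missing_tags is a hypothetical helper; it prints the missing keys (space
# separated) or an empty line when every required key is present.
missing_tags() {
  local arn="$1"; shift
  local keys missing="" required
  keys="$(aws sagemaker list-tags --resource-arn "$arn" \
            --query 'Tags[].Key' --output text)" || return 2
  for required in "$@"; do
    # Text output separates keys with whitespace; match whole keys only.
    if ! printf '%s\n' $keys | grep -qx -- "$required"; then
      missing="$missing $required"
    fi
  done
  printf '%s\n' "${missing# }"
}

# Usage (requires AWS credentials):
#   arn="$(aws sagemaker describe-notebook-instance \
#            --notebook-instance-name "brainybee-lab-notebook" \
#            --query 'NotebookInstanceArn' --output text)"
#   missing_tags "$arn" Project Environment
```

An empty result means both tag keys from this step are attached to the instance.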
Step 2: Create a CloudWatch Dashboard
Dashboards provide real-time visibility into infrastructure health (CPU and Memory utilization).
# Note: Dashboards are typically created via JSON configuration.
# This command creates a simple dashboard with a CPU metric widget.
aws cloudwatch put-dashboard \
--dashboard-name "ML-Infrastructure-Monitor" \
--dashboard-body '{"widgets":[{"type":"metric","x":0,"y":0,"width":12,"height":6,"properties":{"metrics":[["AWS/SageMaker","CPUUtilization","NotebookInstanceName","brainybee-lab-notebook"]],"period":300,"stat":"Average","region":"us-east-1","title":"Notebook CPU Utilization"}}]}'
Console alternative
- Navigate to CloudWatch > Dashboards.
- Click Create dashboard and name it ML-Infrastructure-Monitor.
- Click Add widget > Line.
- Select Metrics > SageMaker > Host Metrics.
- Search for brainybee-lab-notebook and select CPUUtilization.
- Click Create widget and then Save dashboard.
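The single-line JSON in the CLI command is hard to read and edit. One possible alternative, sketched below, generates the same widget definition from variables and passes it as a file. dashboard_body is a hypothetical helper name; the namespace, metric, and dimension are copied from the command in this step.

```shell
# Sketch: print a dashboard body for one notebook CPU widget.
# dashboard_body is a hypothetical helper; arguments are the notebook
# instance name and the AWS region.
dashboard_body() {
  local notebook="$1" region="$2"
  cat <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [["AWS/SageMaker", "CPUUtilization", "NotebookInstanceName", "${notebook}"]],
        "period": 300,
        "stat": "Average",
        "region": "${region}",
        "title": "Notebook CPU Utilization"
      }
    }
  ]
}
EOF
}

# Usage (requires AWS credentials):
#   dashboard_body "brainybee-lab-notebook" "us-east-1" > dashboard.json
#   aws cloudwatch put-dashboard \
#     --dashboard-name "ML-Infrastructure-Monitor" \
#     --dashboard-body file://dashboard.json
```

Keeping the body in a file also makes it easier to review dashboard changes in version control.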
Step 3: Configure a Utilization Alarm
Alarms help identify "zombie" instances or over-utilized resources that need rightsizing.
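The alarm command in this step has no --alarm-actions, so the alarm changes state without notifying anyone. A minimal sketch of one way to add email notifications through Amazon SNS follows. The function name create_alarm_with_email and the topic name ml-lab-alerts are hypothetical; the alarm settings mirror the ones used in this step.

```shell
# Sketch: create an SNS topic, subscribe an email address, and attach the
# topic to the CPU alarm. create_alarm_with_email and "ml-lab-alerts" are
# hypothetical names chosen for illustration.
create_alarm_with_email() {
  local notebook="$1" email="$2" topic_arn
  topic_arn="$(aws sns create-topic --name "ml-lab-alerts" \
                 --query 'TopicArn' --output text)" || return 1
  # The recipient must confirm the subscription email before alerts arrive.
  aws sns subscribe --topic-arn "$topic_arn" \
    --protocol email --notification-endpoint "$email" >/dev/null || return 1
  aws cloudwatch put-metric-alarm \
    --alarm-name "High-CPU-Usage-ML" \
    --metric-name "CPUUtilization" \
    --namespace "AWS/SageMaker" \
    --statistic "Average" \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 80 \
    --comparison-operator "GreaterThanThreshold" \
    --dimensions "Name=NotebookInstanceName,Value=${notebook}" \
    --alarm-actions "$topic_arn"
}

# Usage (requires AWS credentials):
#   create_alarm_with_email "brainybee-lab-notebook" "your-email@example.com"
```

With a period of 300 seconds and 2 evaluation periods, the alarm fires only after CPU stays above 80% for roughly 10 minutes. If you use this variant, remember to delete the SNS topic during teardown.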
aws cloudwatch put-metric-alarm \
--alarm-name "High-CPU-Usage-ML" \
--metric-name "CPUUtilization" \
--namespace "AWS/SageMaker" \
--statistic "Average" \
--period 300 \
--threshold 80 \
--comparison-operator "GreaterThanThreshold" \
--dimensions Name=NotebookInstanceName,Value=brainybee-lab-notebook \
--evaluation-periods 2 \
--unit "Percent"
Step 4: Set Up an AWS Budget
Budgets prevent unexpected costs by alerting you when spending exceeds a set limit.
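The budget.json used in this step tracks spending for the whole account. To scope a budget to this lab's resources, a tag filter can be added. The sketch below assumes the Budgets CostFilters field with a TagKeyValue entry in the form user:Key$Value, and it assumes the Project tag has already been activated as a cost allocation tag. budget_json is a hypothetical helper name.

```shell
# Sketch: print a budget definition limited to one cost allocation tag.
# budget_json is a hypothetical helper. Arguments: budget name, monthly
# limit in USD, tag key, tag value.
budget_json() {
  local name="$1" limit_usd="$2" tag_key="$3" tag_value="$4"
  # \$ keeps a literal dollar sign between the tag key and the tag value.
  cat <<EOF
{
  "BudgetName": "${name}",
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "BudgetLimit": {"Amount": "${limit_usd}", "Unit": "USD"},
  "CostFilters": {"TagKeyValue": ["user:${tag_key}\$${tag_value}"]}
}
EOF
}

# Usage:
#   budget_json "ML-Lab-Budget" 10 "Project" "ML-Optimization-Lab" > budget.json
```

With a 10 USD limit and an 80 percent threshold, the notification in this step would trigger once actual tagged spend passes 8 USD.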
# Save this as budget.json
# {"BudgetLimit": {"Amount": "10", "Unit": "USD"}, "BudgetName": "ML-Lab-Budget", "BudgetType": "COST", "TimeUnit": "MONTHLY"}
aws budgets create-budget \
--account-id <YOUR_ACCOUNT_ID> \
--budget file://budget.json \
--notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENT"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]}]'
Checkpoints
- Verification: Run aws sagemaker list-notebook-instances. Ensure your notebook status is InService.
- Monitoring: Navigate to the CloudWatch Console. Does your ML-Infrastructure-Monitor dashboard show a data point for the instance?
- Billing: Navigate to AWS Billing > Cost Allocation Tags. Search for Project. A newly created tag can take up to 24 hours to appear; once it does, make sure it is activated.
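The first checkpoint can also be scripted. This is a rough sketch, assuming the AWS CLI is configured; notebook_status and checkpoint_notebook are hypothetical helper names.

```shell
# Sketch: check whether a notebook instance has reached InService.
# notebook_status and checkpoint_notebook are hypothetical helpers.
notebook_status() {
  aws sagemaker describe-notebook-instance \
    --notebook-instance-name "$1" \
    --query 'NotebookInstanceStatus' --output text
}

checkpoint_notebook() {
  local status
  status="$(notebook_status "$1")" || return 2
  if [ "$status" = "InService" ]; then
    echo "OK: $1 is InService"
  else
    echo "WAIT: $1 is $status"
    return 1
  fi
}

# Usage (requires AWS credentials):
#   checkpoint_notebook "brainybee-lab-notebook"
# To block until the instance is ready instead of polling by hand:
#   aws sagemaker wait notebook-instance-in-service \
#     --notebook-instance-name "brainybee-lab-notebook"
```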
Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| AccessDeniedException | Missing IAM permissions for SageMaker or CloudWatch. | Attach AmazonSageMakerFullAccess to your IAM user. |
| ResourceNotFound | Metric has not been emitted yet. | Wait 5-10 mins for the SageMaker instance to finish booting and emit metrics. |
| LimitExceeded | You have reached the quota for notebook instances. | Delete old notebooks or request a quota increase. |
Teardown
[!IMPORTANT] To avoid ongoing charges, you must delete the resources created in this lab.
# 1. Stop the notebook instance
aws sagemaker stop-notebook-instance --notebook-instance-name "brainybee-lab-notebook"
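# 2. Wait until the instance has stopped, then delete it
aws sagemaker wait notebook-instance-stopped --notebook-instance-name "brainybee-lab-notebook"
aws sagemaker delete-notebook-instance --notebook-instance-name "brainybee-lab-notebook"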
# 3. Delete the CloudWatch Alarm
aws cloudwatch delete-alarms --alarm-names "High-CPU-Usage-ML"
# 4. Delete the Dashboard
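aws cloudwatch delete-dashboards --dashboard-names "ML-Infrastructure-Monitor"
# 5. Delete the budget
aws budgets delete-budget --account-id <YOUR_ACCOUNT_ID> --budget-name "ML-Lab-Budget"
Stretch Challenge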
Automated Rightsizing: Using AWS Lambda and EventBridge, write a script that automatically stops a SageMaker Notebook instance if its CPU utilization remains below 5% for more than 1 hour.
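As a possible starting point for the detection half of this challenge, the sketch below creates an alarm that enters the ALARM state when CPU stays under 5% for a full hour. It reuses the namespace and dimension from Step 3; create_idle_alarm and the Idle- alarm name prefix are hypothetical. Stopping the instance in response (the Lambda and EventBridge part) is left to you.

```shell
# Sketch: idle-detection alarm for a notebook instance.
# create_idle_alarm is a hypothetical helper; the alarm name prefix "Idle-"
# is an arbitrary choice.
create_idle_alarm() {
  local notebook="$1" period=300 window=3600
  # 3600 s / 300 s = 12 consecutive five-minute periods, i.e. one hour.
  local periods=$(( window / period ))
  aws cloudwatch put-metric-alarm \
    --alarm-name "Idle-${notebook}" \
    --namespace "AWS/SageMaker" \
    --metric-name "CPUUtilization" \
    --dimensions "Name=NotebookInstanceName,Value=${notebook}" \
    --statistic "Maximum" \
    --period "$period" \
    --evaluation-periods "$periods" \
    --threshold 5 \
    --comparison-operator "LessThanThreshold" \
    --treat-missing-data "notBreaching"
}

# Usage (requires AWS credentials):
#   create_idle_alarm "brainybee-lab-notebook"
```

Using the Maximum statistic rather than Average means a single busy period within the hour is enough to keep the instance from being flagged as idle.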
Cost Estimate
- SageMaker ml.t3.medium: ~$0.05/hour (Free tier eligible for first 2 months).
- CloudWatch Dashboard: $3.00/month (Free for the first 3 dashboards).
- AWS Budgets: Budgets without actions are free; the first 2 action-enabled budgets are also free.
- Total Lab Cost: < $0.10 if completed within 1 hour and resources are deleted.
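The compute part of this estimate is a simple rate-times-hours calculation, sketched below. lab_cost is a hypothetical helper, and the 0.05 USD/hour default is the approximate rate quoted above, not a live price lookup.

```shell
# Sketch: estimate notebook compute cost from runtime and hourly rate.
# lab_cost is a hypothetical helper. Arguments: hours, optional USD rate
# per hour (defaults to the approximate ml.t3.medium rate from this lab).
lab_cost() {
  local hours="$1" rate="${2:-0.05}"
  awk -v h="$hours" -v r="$rate" 'BEGIN { printf "%.2f\n", h * r }'
}

# Usage:
#   lab_cost 1      # one hour of lab time
#   lab_cost 720    # instance left running for a 30-day month
```

At the quoted rate, one hour comes to about 0.05 USD, while an instance forgotten for a 30-day month (720 hours) comes to about 36 USD, which is why the teardown step matters.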
Concept Review
| Tool | Primary Function | ML Optimization Use-Case |
|---|---|---|
| CloudWatch | Monitoring & Observability | Identifying underutilized instances for rightsizing. |
| AWS Budgets | Financial Governance | Alerting on cost overruns in training pipelines. |
| Cost Explorer | Historical Cost Analysis | Identifying which ML project (via tags) is most expensive. |
| Compute Optimizer | Rightsizing Recommendations | Suggesting a shift from P3 to G4dn instances for inference. |
Monitoring Architecture Visualization
Below is a representation of the metric evaluation cycle using TikZ.
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, rounded corners, minimum height=1cm, text centered}]
  \node (S) {ML Instance};
  \node (C) [right of=S, xshift=2cm] {CloudWatch Metrics};
  \node (A) [below of=C] {Alarm Threshold};
  \node (E) [right of=A, xshift=2cm] {Action (SNS/Stop)};
  \draw[->, thick] (S) -- (C) node[midway, above] {CPU/Mem};
  \draw[->, thick] (C) -- (A) node[midway, right] {Evaluation};
  \draw[->, thick] (A) -- (E) node[midway, above] {Trigger};
\end{tikzpicture}