Optimizing ML Infrastructure: Monitoring and Cost Management Lab

Monitor and optimize infrastructure and costs

This lab provides hands-on experience in monitoring machine learning infrastructure and implementing cost-optimization strategies on AWS. You will learn to use Amazon CloudWatch for performance tracking and implement tagging and budgeting for cost visibility.

Prerequisites

  • AWS Account: An active AWS account with permissions for SageMaker, CloudWatch, and AWS Budgets.
  • AWS CLI: Installed and configured with administrator-level access.
  • Region: This lab uses us-east-1, but you can use any region where SageMaker is available.
  • Knowledge: Basic understanding of SageMaker instances and CloudWatch metrics.

Learning Objectives

  • Implement a resource tagging strategy for ML cost allocation.
  • Create an Amazon CloudWatch Dashboard to monitor ML compute performance.
  • Configure a CloudWatch Alarm to trigger notifications on resource over-utilization.
  • Set up an AWS Budget to track and alert on ML-specific spending.

Architecture Overview

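This lab follows the flow from infrastructure utilization to monitoring and financial alerting: a tagged SageMaker notebook instance reports metrics to CloudWatch, a dashboard and an alarm track its utilization, and cost allocation tags together with an AWS Budget provide spending visibility and alerts.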

Step-by-Step Instructions

Step 1: Create a Tagged SageMaker Notebook Instance

Tagging is the foundation of cost allocation. We will create a small instance with specific metadata tags.

bash
aws sagemaker create-notebook-instance \
  --notebook-instance-name "brainybee-lab-notebook" \
  --instance-type "ml.t3.medium" \
  --role-arn "<YOUR_SAGEMAKER_EXECUTION_ROLE_ARN>" \
  --tags Key="Project",Value="ML-Optimization-Lab" Key="Environment",Value="Dev"

Console alternative
  1. Open the Amazon SageMaker Console.
  2. Navigate to Notebooks > Notebook instances.
  3. Click Create notebook instance.
  4. Name it brainybee-lab-notebook, choose the ml.t3.medium instance type, and select your SageMaker execution role.
  5. Expand the Tags section.
  6. Add Key: Project | Value: ML-Optimization-Lab.
  7. Add Key: Environment | Value: Dev.
  8. Click Create notebook instance.
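
Cost allocation only works when tag keys and values are spelled the same way on every resource, so it can help to build the --tags argument from a single list. The sketch below is a hypothetical helper (tags_arg is not part of the AWS CLI) that converts KEY=VALUE pairs into the Key=...,Value=... shorthand used in the CLI command above. It assumes tag values contain no spaces.

```bash
# Hypothetical helper (not an AWS CLI feature): turn KEY=VALUE pairs
# into the Key=...,Value=... shorthand accepted by --tags.
# Assumption: keys and values contain no spaces.
tags_arg() {
  out=""
  for pair in "$@"; do
    key="${pair%%=*}"     # text before the first '='
    value="${pair#*=}"    # text after the first '='
    out="${out:+$out }Key=${key},Value=${value}"
  done
  printf '%s\n' "$out"
}

tags_arg Project=ML-Optimization-Lab Environment=Dev
# prints: Key=Project,Value=ML-Optimization-Lab Key=Environment,Value=Dev
```

The output could be passed as --tags $(tags_arg ...), left unquoted so the shell splits it into one argument per tag.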

[!TIP] Prefer lower-cost instance types such as ml.t3.medium for experimentation to keep costs down.

Step 2: Create a CloudWatch Dashboard

Dashboards provide near-real-time visibility into infrastructure health, such as CPU and memory utilization.

bash
# Note: Dashboards are typically created via JSON configuration.
# This command creates a simple dashboard with a CPU metric widget.
aws cloudwatch put-dashboard \
  --dashboard-name "ML-Infrastructure-Monitor" \
  --dashboard-body '{
    "widgets": [{
      "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "metrics": [["AWS/SageMaker", "CPUUtilization", "NotebookInstanceName", "brainybee-lab-notebook"]],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "Notebook CPU Utilization"
      }
    }]
  }'

Console alternative
  1. Navigate to CloudWatch > Dashboards.
  2. Click Create dashboard and name it ML-Infrastructure-Monitor.
  3. Click Add widget > Line.
  4. Select Metrics > SageMaker > Host Metrics.
  5. Search for brainybee-lab-notebook and select CPUUtilization.
  6. Click Create widget and then Save dashboard.
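
If you monitor more than one notebook, or work outside us-east-1, the dashboard body can be generated instead of edited by hand. The following is a minimal sketch: dashboard_body is a hypothetical helper name, and it reuses the namespace, metric, and dimension from the CLI command in this step without checking that the metric exists in your account.

```bash
# Hypothetical helper: print a one-widget dashboard body for a notebook instance.
# Arguments: notebook name, then region (assumed default: us-east-1).
dashboard_body() {
  notebook="$1"
  region="${2:-us-east-1}"
  printf '{"widgets":[{"type":"metric","x":0,"y":0,"width":12,"height":6,"properties":{"metrics":[["AWS/SageMaker","CPUUtilization","NotebookInstanceName","%s"]],"period":300,"stat":"Average","region":"%s","title":"Notebook CPU Utilization"}}]}\n' \
    "$notebook" "$region"
}

# Print the body for the lab notebook
dashboard_body "brainybee-lab-notebook" "us-east-1"
```

The result could be supplied to put-dashboard as --dashboard-body "$(dashboard_body brainybee-lab-notebook)".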

Step 3: Configure a Utilization Alarm

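Alarms flag resources that need rightsizing. The alarm below fires on sustained high CPU, which indicates an over-utilized instance; the same pattern with a low threshold would flag idle "zombie" instances.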

bash
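# Goes into ALARM when average CPU exceeds 80% for two consecutive 5-minute periods.
# To receive notifications, add --alarm-actions "<YOUR_SNS_TOPIC_ARN>" (a placeholder for an SNS topic you own).
aws cloudwatch put-metric-alarm \
  --alarm-name "High-CPU-Usage-ML" \
  --metric-name "CPUUtilization" \
  --namespace "AWS/SageMaker" \
  --statistic "Average" \
  --period 300 \
  --threshold 80 \
  --comparison-operator "GreaterThanThreshold" \
  --dimensions Name=NotebookInstanceName,Value=brainybee-lab-notebook \
  --evaluation-periods 2 \
  --unit "Percent"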
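
To make the threshold and evaluation-periods settings concrete, here is a simplified model of how such an alarm reaches its state. It is only a sketch: alarm_state is a hypothetical function, and it ignores parts of CloudWatch's real evaluation such as missing-data handling. It reports ALARM only when each of the most recent N datapoints is above the threshold.

```bash
# Simplified model of a GreaterThanThreshold alarm (not CloudWatch's actual logic).
# Arguments: threshold, evaluation periods, then datapoints from oldest to newest.
alarm_state() {
  threshold="$1"
  periods="$2"
  shift 2
  printf '%s\n' "$@" | LC_ALL=C awk -v t="$threshold" -v n="$periods" '
    NF { v[++c] = $1 }                       # collect non-empty datapoints
    END {
      if (c < n) { print "INSUFFICIENT_DATA"; exit }
      for (i = c - n + 1; i <= c; i++)       # inspect only the last n datapoints
        if (v[i] <= t) { print "OK"; exit }
      print "ALARM"
    }'
}

alarm_state 80 2 50 85 90   # prints ALARM (last two datapoints exceed 80)
alarm_state 80 2 85 90 70   # prints OK (latest datapoint is below 80)
```

With period 300 and evaluation-periods 2, as in the command above, this corresponds to roughly ten minutes of sustained high CPU before the alarm fires.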

Step 4: Set Up an AWS Budget

Budgets prevent unexpected costs by alerting you when spending exceeds a set limit.

bash
# Save the budget definition as budget.json
cat > budget.json <<'EOF'
{
  "BudgetName": "ML-Lab-Budget",
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "BudgetLimit": {"Amount": "10", "Unit": "USD"}
}
EOF

# Create the budget with an email alert at 80% of actual spend
aws budgets create-budget \
  --account-id "<YOUR_ACCOUNT_ID>" \
  --budget file://budget.json \
  --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENT"},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]}]'
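
Because the notification uses a PERCENT threshold, the dollar amount that triggers the email depends on the budget limit. A small illustrative calculation follows (budget_alert_at is a hypothetical helper, not an AWS command) for the 80% threshold on the $10 budget used here.

```bash
# Hypothetical helper: dollar amount at which a percentage-based budget alert fires.
# Arguments: budget limit in USD, then threshold in percent.
budget_alert_at() {
  LC_ALL=C awk -v limit="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", limit * pct / 100 }'
}

budget_alert_at 10 80   # prints 8.00
```

In this lab the email is therefore sent once actual monthly spend passes $8.00.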

Checkpoints

  1. Verification: Run aws sagemaker list-notebook-instances. Ensure your notebook status is InService.
  2. Monitoring: Navigate to the CloudWatch Console. Does your ML-Infrastructure-Monitor dashboard show a data point for the instance?
  3. Billing: Navigate to AWS Billing > Cost Allocation Tags and search for Project. A newly created tag can take up to 24 hours to appear; once it does, make sure it is activated so that it shows up in cost reports.

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| AccessDeniedException | Missing IAM permissions for SageMaker or CloudWatch. | Attach AmazonSageMakerFullAccess to your IAM user. |
| ResourceNotFound | Metric has not been emitted yet. | Wait 5-10 minutes for the SageMaker instance to finish booting and emit metrics. |
| LimitExceeded | You have reached the quota for notebook instances. | Delete old notebooks or request a quota increase. |

Teardown

[!IMPORTANT] To avoid ongoing charges, you must delete the resources created in this lab.

bash
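# 1. Stop the notebook instance and wait until it has stopped
aws sagemaker stop-notebook-instance --notebook-instance-name "brainybee-lab-notebook"
aws sagemaker wait notebook-instance-stopped --notebook-instance-name "brainybee-lab-notebook"

# 2. Delete the notebook instance
aws sagemaker delete-notebook-instance --notebook-instance-name "brainybee-lab-notebook"

# 3. Delete the CloudWatch Alarm
aws cloudwatch delete-alarms --alarm-names "High-CPU-Usage-ML"

# 4. Delete the Dashboard
aws cloudwatch delete-dashboards --dashboard-names "ML-Infrastructure-Monitor"

# 5. Delete the budget created in Step 4
aws budgets delete-budget --account-id "<YOUR_ACCOUNT_ID>" --budget-name "ML-Lab-Budget"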

Stretch Challenge

Automated Rightsizing: Using AWS Lambda and EventBridge, write a script that automatically stops a SageMaker Notebook instance if its CPU utilization remains below 5% for more than 1 hour.
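
As a starting point, the idle decision at the heart of this challenge can be sketched on its own, leaving the Lambda and EventBridge wiring to you. The function below is hypothetical (should_stop is an illustrative name) and assumes you have already fetched CPU datapoints at a fixed period, ordered oldest to newest. It succeeds only when a full window of datapoints is available and every one of them is below the idle threshold.

```bash
# Hypothetical idle check for the stretch challenge (decision logic only).
# Arguments: threshold percent, window seconds, period seconds, then datapoints.
# Exit status 0 means "idle for the whole window"; 1 means "keep running".
should_stop() {
  threshold="$1"
  window="$2"
  period="$3"
  shift 3
  required=$(( window / period ))            # e.g. 3600 / 300 = 12 datapoints
  printf '%s\n' "$@" | LC_ALL=C awk -v t="$threshold" -v n="$required" '
    NF { v[++c] = $1 }                       # collect non-empty datapoints
    END {
      if (n < 1 || c < n) exit 1             # not enough history to decide
      for (i = c - n + 1; i <= c; i++)
        if (v[i] >= t) exit 1                # busy at least once in the window
      exit 0
    }'
}

# Twelve 5-minute datapoints, all below 5%: the instance counts as idle
if should_stop 5 3600 300 1 2 1 0 3 2 1 1 0 2 4 1; then
  echo "idle: stop the notebook instance"
fi
```

A scheduled job could run this check and, when it succeeds, call aws sagemaker stop-notebook-instance for the instance in question.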

Cost Estimate

  • SageMaker ml.t3.medium: ~$0.05/hour (Free Tier eligible for the first 2 months).
  • CloudWatch Dashboard: $3.00/month (Free for the first 3 dashboards).
  • AWS Budgets: First 2 budgets are free.
  • Total Lab Cost: < $0.10 if completed within 1 hour and resources are deleted.
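
The notebook estimate is simple arithmetic, and working it through shows why teardown matters. This sketch uses the approximate $0.05/hour rate quoted above as an assumed default (actual pricing varies by region and over time); lab_cost is a hypothetical helper name.

```bash
# Rough notebook cost: hours multiplied by hourly rate.
# Arguments: hours, then optional rate in USD/hour (assumed default: 0.05).
lab_cost() {
  LC_ALL=C awk -v h="$1" -v r="${2:-0.05}" 'BEGIN { printf "%.2f\n", h * r }'
}

lab_cost 1     # prints 0.05  (lab finished within an hour)
lab_cost 720   # prints 36.00 (instance left running for a 30-day month)
```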

Concept Review

| Tool | Primary Function | ML Optimization Use-Case |
|---|---|---|
| CloudWatch | Monitoring & Observability | Identifying underutilized instances for rightsizing. |
| AWS Budgets | Financial Governance | Alerting on cost overruns in training pipelines. |
| Cost Explorer | Historical Cost Analysis | Identifying which ML project (via tags) is most expensive. |
| Compute Optimizer | Rightsizing Recommendations | Suggesting a shift from P3 to G4dn instances for inference. |

Monitoring Architecture Visualization

Below is a representation of the metric evaluation cycle using TikZ.

\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, rounded corners, minimum height=1cm, text centered}]
  \node (S) {ML Instance};
  \node (C) [right of=S, xshift=2cm] {CloudWatch Metrics};
  \node (A) [below of=C] {Alarm Threshold};
  \node (E) [right of=A, xshift=2cm] {Action (SNS/Stop)};

  \draw[->, thick] (S) -- (C) node[midway, above] {CPU/Mem};
  \draw[->, thick] (C) -- (A) node[midway, right] {Evaluation};
  \draw[->, thick] (A) -- (E) node[midway, above] {Trigger};
\end{tikzpicture}
