AWS ML Engineer Associate: Scripting & Creating ML Infrastructure (Task 3.2)
Create and script infrastructure based on existing architecture and requirements
This guide covers the essential knowledge for Task 3.2: Create and script infrastructure based on existing architecture and requirements from the MLA-C01 exam. It focuses on Infrastructure as Code (IaC), scaling strategies, and containerization for Machine Learning workloads.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between on-demand and provisioned resource models.
- Compare and apply different auto-scaling policies for SageMaker endpoints.
- Explain the tradeoffs between AWS CloudFormation and AWS CDK.
- Identify the correct container service (ECS, EKS, SageMaker) for a given ML requirement.
- Automate the provisioning of compute resources using IaC templates.
Key Terms & Glossary
- Infrastructure as Code (IaC): Managing and provisioning infrastructure through machine-readable definition files rather than manual configuration.
- CloudFormation Stack: A collection of AWS resources that you can manage as a single unit.
- AWS CDK (Cloud Development Kit): A software development framework for defining cloud infrastructure in familiar programming languages (Python, TypeScript, etc.).
- SageMaker Endpoint: A fully managed, auto-scalable HTTPS endpoint that hosts ML models for real-time inference.
- BYOC (Bring Your Own Container): The practice of using custom Docker images for SageMaker training or hosting when built-in algorithms are insufficient.
The "Big Idea"
The transition from manual setup to Automated Provisioning is the cornerstone of MLOps. By treating infrastructure as code, ML Engineers ensure that production environments are identical to testing environments, enabling repeatability, version control, and rapid scaling to meet fluctuating inference demands.
Formula / Concept Box
| Feature | AWS CloudFormation | AWS CDK |
|---|---|---|
| Language | Declarative (YAML/JSON) | Imperative/Object-Oriented (Python, JS, Go) |
| Abstraction | Low-level (Direct resource mapping) | High-level "Constructs" (Pre-configured patterns) |
| Execution | Built-in engine | Transpiles to CloudFormation templates |
| Best For | Simple, static infrastructure | Complex, logic-heavy ML pipelines |
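The table's key difference can be made concrete: imperative CDK-style code ultimately synthesizes a declarative template. A minimal sketch in plain Python (no CDK dependency; the model names and instance type are illustrative) that uses a loop to generate a CloudFormation-style template for many endpoints:

```python
import json

def synth_template(model_names, instance_type="ml.m5.xlarge"):
    """Imperatively build a declarative CloudFormation-style template:
    one EndpointConfig and one Endpoint per model, similar in spirit to
    what the CDK synthesizes from high-level constructs."""
    resources = {}
    for name in model_names:
        resources[f"{name}Config"] = {
            "Type": "AWS::SageMaker::EndpointConfig",
            "Properties": {
                "ProductionVariants": [{
                    "InitialInstanceCount": 1,
                    "InstanceType": instance_type,
                    "ModelName": name,
                    "VariantName": "AllTraffic",
                }],
            },
        }
        resources[f"{name}Endpoint"] = {
            "Type": "AWS::SageMaker::Endpoint",
            "Properties": {
                "EndpointConfigName": {
                    "Fn::GetAtt": [f"{name}Config", "EndpointConfigName"]
                }
            },
        }
    return {"Resources": resources}

# 50 endpoints from one loop -- the scenario where CDK-style logic shines.
template = synth_template([f"Model{i}" for i in range(50)])
print(len(template["Resources"]))  # → 100
print(json.dumps(template["Resources"]["Model0Endpoint"], indent=2))
```

Hand-writing the equivalent YAML would mean repeating 100 resource blocks; this is the "complex, logic-heavy" case the table assigns to CDK.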
Hierarchical Outline
- I. Infrastructure as Code (IaC) Paradigms
- Declarative: Defining the what (desired end state). Examples: CloudFormation, Terraform.
- Imperative: Defining the how (step-by-step instructions). Example: Shell scripts using AWS CLI.
- II. AWS Provisioning Tools
- CloudFormation: Uses Templates (blueprints) and Stacks (deployed resources).
- AWS CDK: Allows developers to use Python/TypeScript to generate CloudFormation templates.
- III. Scaling Policies for ML
- Target Tracking: Adjusts capacity based on a specific metric (e.g., maintain 70% CPU).
- Step Scaling: Increases/decreases capacity based on the size of the alarm breach.
- Scheduled Scaling: Scales based on known time patterns (e.g., business hours).
- IV. Containerization Services
- Amazon ECR: Registry for storing Docker images.
- Amazon ECS: Simple, AWS-native container orchestration (serverless when run on Fargate).
- Amazon EKS: Managed Kubernetes for complex, portable microservices.
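The target-tracking policy in III can be sketched numerically: Application Auto Scaling adjusts capacity roughly in proportion to how far the metric sits from the target. This is a simplified model of the documented behavior, ignoring cooldowns, min/max bounds, and alarm evaluation periods:

```python
import math

def target_tracking_desired(current_capacity, metric_value, target_value):
    """Simplified target-tracking rule: scale capacity proportionally so
    the per-instance metric returns to the target. Real Application Auto
    Scaling also applies cooldowns, capacity bounds, and alarm periods."""
    return max(1, math.ceil(current_capacity * metric_value / target_value))

# 4 instances each handling 1500 invocations against a target of 1000:
print(target_tracking_desired(4, 1500, 1000))  # → 6 (scale out)
# Metric well under target: capacity shrinks, but never below 1 here.
print(target_tracking_desired(4, 500, 1000))   # → 2 (scale in)
```

Step scaling, by contrast, would apply fixed capacity deltas per breach size rather than this proportional calculation.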
Visual Anchors
- The IaC Workflow (diagram): code/template → stack → deployed resources.
- Scaling Logic Diagram: metric breach → alarm → scaling policy → capacity change.
Definition-Example Pairs
- On-Demand Resources: Resources that are launched and paid for as they are used.
- Example: Launching a SageMaker notebook instance for a quick data exploration task.
- Provisioned Resources: Capacity that is pre-allocated and available instantly, but incurs cost even when idle.
- Example: Using Provisioned Concurrency for Lambda functions to eliminate cold starts in real-time inference.
- Metric-Based Scaling: Triggering a scale-out event based on hardware or application performance.
- Example: Scaling a SageMaker endpoint because the InvocationsPerInstance metric exceeded 1,000.
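That metric-based example can be expressed in CloudFormation by registering the endpoint variant with Application Auto Scaling and attaching a target-tracking policy on the predefined InvocationsPerInstance metric. A sketch (endpoint, variant, and role names are illustrative, and the IAM role is assumed to be defined elsewhere):

```yaml
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    MinCapacity: 1
    MaxCapacity: 4
    ResourceId: endpoint/MyEndpoint/variant/AllTraffic
    ScalableDimension: sagemaker:variant:DesiredInstanceCount
    ServiceNamespace: sagemaker
    RoleARN: !GetAtt AutoScalingRole.Arn  # assumed role defined elsewhere
ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: InvocationsTargetTracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 1000.0
      PredefinedMetricSpecification:
        PredefinedMetricType: SageMakerVariantInvocationsPerInstance
```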
Worked Example: CloudFormation Snippet
Creating a SageMaker Endpoint requires three resources: the Model, the Endpoint Configuration, and the Endpoint itself.
```yaml
Resources:
  # MyRole and the ContainerImageUri parameter are assumed to be
  # defined elsewhere in the template.
  MyModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: !GetAtt MyRole.Arn
      PrimaryContainer:
        Image: !Ref ContainerImageUri
  MyEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - InitialInstanceCount: 1
          InstanceType: ml.m5.xlarge
          ModelName: !GetAtt MyModel.ModelName
          VariantName: AllTraffic
  MyEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointConfigName: !GetAtt MyEndpointConfig.EndpointConfigName
```
Checkpoint Questions
- Which IaC tool would you choose if your team wants to use loops and logic in Python to define 50 different model endpoints? (Answer: AWS CDK)
- What is the primary benefit of using Amazon ECR in an ML pipeline? (Answer: It provides a secure, managed registry to store and version Docker images used for training and inference.)
- True or False: CloudFormation can roll back all changes if a single resource in a stack fails to provision. (Answer: True)
Muddy Points & Cross-Refs
- ECS vs. EKS: Use ECS for AWS-native simplicity; use EKS if you require Kubernetes-specific APIs or are migrating from an on-premises Kubernetes cluster.
- Scaling Metrics: Choosing between CPUUtilization and InvocationsPerInstance is tricky. CPU is better for compute-heavy models, while Invocations is better for light models with high throughput.
- SageMaker Neo: Often confused with scaling. Neo optimizes the model for specific hardware (e.g., edge devices), while Auto Scaling manages the number of instances.
Comparison Tables
| Deployment Target | Use Case | Pros | Cons |
|---|---|---|---|
| SageMaker Endpoints | Standard ML Hosting | Managed, easy auto-scaling | Can be more expensive |
| AWS Lambda | Intermittent/Spiky traffic | Pay-per-use, serverless | Cold starts, 15-min limit |
| Amazon ECS/EKS | Microservices Architecture | High control, portability | Operational overhead |
| SageMaker Batch | Large non-real-time datasets | Cost-effective for bulk | High latency (not real-time) |