AWS ML Model Selection: Strategic Approaches and Customization Tiers
Choose a modeling approach
[!IMPORTANT] This study guide focuses on Domain 2.1: Choose a modeling approach for the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam. It covers the transition from data preparation to selecting the optimal AWS service tier and algorithm.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between the three tiers of AWS ML services (AI Services, ML Services, and Frameworks).
- Select a modeling approach based on specific business requirements and data types.
- Identify appropriate feature engineering techniques to handle outliers and data skewness prior to modeling.
- Recognize pre-training bias metrics and strategies to mitigate class imbalance.
Key Terms & Glossary
- AI Services: Fully managed, pre-built models (e.g., Amazon Rekognition, Lex) that require no ML expertise to integrate via APIs.
- Amazon SageMaker: A fully managed service that provides the tools to build, train, and deploy ML models at scale (ML Services tier).
- Class Imbalance (CI): A situation where one class in a dataset is significantly more frequent than others, potentially leading to biased models.
- Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable.
- Inference: The process of using a trained ML model to make predictions on new, unseen data.
The "Big Idea"
Choosing a modeling approach is about finding the right balance between speed-to-market and customization. AWS categorizes its offerings into three tiers so that engineers can decide whether to use a "black-box" API for common tasks (AI Services) or to build a custom architecture from scratch (Frameworks/Infrastructure) for unique niche problems.
Formula / Concept Box
| Concept | Purpose / Definition | Key Implementation |
|---|---|---|
| Log Transformation | Handles right-skewed data | y = log(x) |
| Box-Cox | Normalizes non-normal data | Requires strictly positive data |
| Z-Score Scaling | Scales mean to 0, Std Dev to 1 | Note: Does NOT fix skewness |
| Class Imbalance (CI) | Bias metric: CI = (n_a − n_d) / (n_a + n_d) | SageMaker Clarify |
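Class Imbalance from the table above is the normalized difference between group counts. A minimal NumPy sketch (the facet column values are illustrative):

```python
import numpy as np

# Binary facet column: 1 = group a (e.g., majority), 0 = group d (minority)
facet = np.array([1, 1, 1, 1, 1, 1, 0, 0])

n_a = np.sum(facet == 1)  # members of group a
n_d = np.sum(facet == 0)  # members of group d

# CI ranges from -1 to +1; 0 means the two groups are perfectly balanced
ci = (n_a - n_d) / (n_a + n_d)
print(ci)  # 0.5 -> group a is over-represented
```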
Hierarchical Outline
- Defining the Problem
- Identify the business goal (e.g., churn prediction, image recognition).
- Determine the data type (Structured, Text, Image, Audio).
- The Three Tiers of Customization
- Tier 1: AI Services (Pre-built, low effort, high abstraction).
- Tier 2: ML Services (SageMaker) (Managed environment, high flexibility).
- Tier 3: ML Frameworks (PyTorch, TensorFlow on EC2/EKS, maximum control).
- Data Integrity & Bias Check
- Measuring pre-training bias (DPL, CI).
- Handling outliers (Removal vs. Imputation).
- Inference Strategy Selection
- Real-time: Low latency (e.g., web apps).
- Batch: Large-scale, non-interactive (e.g., weekly reports).
Visual Anchors
AWS ML Service Selection Flow
Handling Data Skewness
Definition-Example Pairs
- Skewness Handling: Applying mathematical functions to normalize data distribution.
- Example: Transforming home prices (which usually have a long tail of very expensive homes) using a Logarithmic Transformation to make the data more "bell-shaped" for a linear regression model.
- Class Imbalance Mitigation: Techniques to ensure a model doesn't ignore the minority class.
- Example: In credit card fraud detection, where 99.9% of transactions are legitimate, using SMOTE (Synthetic Minority Over-sampling Technique) to generate artificial fraud cases for training.
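A SMOTE-style synthetic point is an interpolation between a minority-class sample and a neighbor. A toy NumPy sketch of that idea (real projects should use the imbalanced-learn library, which also restricts interpolation to the k nearest neighbors rather than a random pair):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points (e.g., two features of fraudulent transactions)
minority = np.array([[1.0, 2.0],
                     [1.2, 1.8],
                     [0.9, 2.2]])

def smote_like_sample(X, rng):
    """Create one synthetic point on the segment between two minority samples."""
    i, j = rng.choice(len(X), size=2, replace=False)
    gap = rng.random()                 # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i])  # new point between the two originals

synthetic = smote_like_sample(minority, rng)
print(synthetic)  # lies between existing minority points, feature by feature
```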
Worked Examples
Example 1: The Coffee Shop Churn Model
Scenario: A coffee shop wants to predict which customers will stop visiting based on transaction frequency and average spend.
- Approach: This is a tabular classification problem. Since the shop has specific customer data, Amazon SageMaker (Tier 2) is chosen over pre-built AI services.
- Data Prep: Engineers notice the "Spend" column has outliers (one person bought a $500 espresso machine). They decide to use Median Imputation to reduce the impact of this outlier on the training process.
- Bias Check: Only 5% of the data represents "churned" customers. The team uses SageMaker Data Wrangler to resample the data to ensure the model learns the patterns of those who leave.
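The outlier-handling step in this example can be sketched with NumPy, flagging outliers with the common 1.5 × IQR rule and then imputing the median (the spend values are invented for illustration):

```python
import numpy as np

# Daily "Spend" values; the $500 espresso machine is a clear outlier
spend = np.array([4.5, 5.0, 6.2, 3.8, 5.5, 500.0])

# Flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(spend, [25, 75])
iqr = q3 - q1
is_outlier = (spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)

# Replace flagged values with the median of the remaining data
median = np.median(spend[~is_outlier])
cleaned = np.where(is_outlier, median, spend)
print(cleaned)  # the $500 entry is replaced by the median spend
```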
Checkpoint Questions
- Which AWS service tier is best for a developer with no machine learning experience who needs to add speech-to-text to an app?
- Why doesn't MinMax Scaling fix a skewed distribution?
- What is the difference between Class Imbalance (CI) and Difference in Proportions of Labels (DPL)?
- When would you choose Batch Inference over Real-time Inference?
Muddy Points & Cross-Refs
- Standardization vs. Normalization: A common point of confusion. Standardization (Z-score) makes data have a mean of 0 but preserves the shape (and skewness). If your model requires a Gaussian distribution (like many linear models), you must use Power Transforms (Box-Cox) first.
- Cross-Ref: For more on how to technically implement these transforms, see Chapter 3: Data Transformation and Feature Engineering.
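The contrast above can be demonstrated numerically: standardization leaves the skewness coefficient unchanged, while a Box-Cox power transform drives it toward zero. A sketch using NumPy and SciPy on synthetic right-skewed (log-normal) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # heavily right-skewed

z = (x - x.mean()) / x.std()   # z-score standardization
bc, _ = stats.boxcox(x)        # Box-Cox power transform (requires x > 0)

print(stats.skew(x))   # large positive skew
print(stats.skew(z))   # same skew: standardization only shifts and rescales
print(stats.skew(bc))  # near zero: distribution is now roughly Gaussian
```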
Comparison Tables
AWS ML Tiers Comparison
| Feature | AI Services | ML Services (SageMaker) | ML Frameworks |
|---|---|---|---|
| Control | Minimum (API Only) | High (Notebooks/SDK) | Maximum (Infra/OS) |
| ML Expertise | Not required | Intermediate/Expert | Expert |
| Cost Model | Per-request | Managed Instance Hours | EC2/Compute Costs |
| Best Use Case | Text/Vision/Search | Custom Business Logic | Research/Niche Architecture |
Skewness Transformation Comparison
| Method | Best For | Requirement |
|---|---|---|
| Log Transform | Right-skewed data | x > 0 |
| Square Root | Mild skewness | x ≥ 0 |
| Box-Cox | General Power Transform | x > 0 |
| Yeo-Johnson | General Power Transform | Works with negative numbers |
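The last two rows can be illustrated with SciPy: Box-Cox rejects non-positive inputs, while Yeo-Johnson accepts them. A short sketch on made-up data containing zero and negative values:

```python
import numpy as np
from scipy import stats

data_with_negatives = np.array([-3.0, -1.0, 0.0, 2.0, 8.0, 40.0])

# stats.boxcox would raise an error here because the data are not strictly
# positive; Yeo-Johnson extends the power-transform idea to all real values.
yj, lam = stats.yeojohnson(data_with_negatives)

print(lam)             # fitted power parameter (lambda)
print(stats.skew(yj))  # lower than the original right skew
```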