AWS ML Model Selection: Strategic Approaches and Customization Tiers
Choose a modeling approach
[!IMPORTANT] This study guide focuses on Domain 2.1: Choose a modeling approach for the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam. It covers the transition from data preparation to selecting the optimal AWS service tier and algorithm.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between the three tiers of AWS ML services (AI Services, ML Services, and Frameworks).
- Select a modeling approach based on specific business requirements and data types.
- Identify appropriate feature engineering techniques to handle outliers and data skewness prior to modeling.
- Recognize pre-training bias metrics and strategies to mitigate class imbalance.
Key Terms & Glossary
- AI Services: Fully managed, pre-built models (e.g., Amazon Rekognition, Lex) that require no ML expertise to integrate via APIs.
- Amazon SageMaker: A fully managed service that provides the tools to build, train, and deploy ML models at scale (ML Services tier).
- Class Imbalance (CI): A situation where one class in a dataset is significantly more frequent than others, potentially leading to biased models.
- Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable.
- Inference: The process of using a trained ML model to make predictions on new, unseen data.
The "Big Idea"
Choosing a modeling approach is about finding the right balance between speed-to-market and customization. AWS categorizes its offerings into three tiers so that engineers can decide whether to use a "black-box" API for common tasks (AI Services) or to build a custom architecture from scratch (Frameworks/Infrastructure) for unique niche problems.
Formula / Concept Box
| Concept | Purpose / Definition | Key Implementation |
|---|---|---|
| Log Transformation | Handles right-skewed data | y = log(x) |
| Box-Cox | Normalizes non-normal data | Requires strictly positive data |
| Z-Score Scaling | Scales mean to 0, Std Dev to 1 | Note: Does NOT fix skewness |
| Class Imbalance (CI) | Bias metric: CI = (n_a − n_d) / (n_a + n_d) | SageMaker Clarify |
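Class Imbalance from the table above is the normalized difference between group counts. A minimal NumPy sketch (the facet column values are illustrative):

```python
import numpy as np

# Binary facet column: 1 = group a (e.g., majority), 0 = group d (minority)
facet = np.array([1, 1, 1, 1, 1, 1, 0, 0])

n_a = np.sum(facet == 1)  # members of group a
n_d = np.sum(facet == 0)  # members of group d

# CI ranges from -1 to +1; 0 means the two groups are perfectly balanced
ci = (n_a - n_d) / (n_a + n_d)
print(ci)  # 0.5 -> group a is over-represented
```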
Hierarchical Outline
- Defining the Problem
- Identify the business goal (e.g., churn prediction, image recognition).
- Determine the data type (Structured, Text, Image, Audio).
- The Three Tiers of Customization
- Tier 1: AI Services (Pre-built, low effort, high abstraction).
- Tier 2: ML Services (SageMaker) (Managed environment, high flexibility).
- Tier 3: ML Frameworks (PyTorch, TensorFlow on EC2/EKS, maximum control).
- Data Integrity & Bias Check
- Measuring pre-training bias (DPL, CI).
- Handling outliers (Removal vs. Imputation).
- Inference Strategy Selection
- Real-time: Low latency (e.g., web apps).
- Batch: Large-scale, non-interactive (e.g., weekly reports).
Visual Anchors
AWS ML Service Selection Flow
Handling Data Skewness
Definition-Example Pairs
- Skewness Handling: Applying mathematical functions to normalize data distribution.
- Example: Transforming home prices (which usually have a long tail of very expensive homes) using a Logarithmic Transformation to make the data more "bell-shaped" for a linear regression model.
- Class Imbalance Mitigation: Techniques to ensure a model doesn't ignore the minority class.
- Example: In credit card fraud detection, where 99.9% of transactions are legitimate, using SMOTE (Synthetic Minority Over-sampling Technique) to generate artificial fraud cases for training.
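A SMOTE-style synthetic point is an interpolation between a minority-class sample and a neighbor. A toy NumPy sketch of that idea (real projects should use the imbalanced-learn library, which also restricts interpolation to the k nearest neighbors rather than a random pair):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points (e.g., two features of fraudulent transactions)
minority = np.array([[1.0, 2.0],
                     [1.2, 1.8],
                     [0.9, 2.2]])

def smote_like_sample(X, rng):
    """Create one synthetic point on the segment between two minority samples."""
    i, j = rng.choice(len(X), size=2, replace=False)
    gap = rng.random()                 # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i])  # new point between the two originals

synthetic = smote_like_sample(minority, rng)
print(synthetic)  # lies between existing minority points, feature by feature
```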
Worked Examples
Example 1: The Coffee Shop Churn Model
Scenario: A coffee shop wants to predict which customers will stop visiting based on transaction frequency and average spend.
- Approach: This is a tabular classification problem. Since the shop has specific customer data, Amazon SageMaker (Tier 2) is chosen over pre-built AI services.
- Data Prep: Engineers notice the "Spend" column has outliers (one person bought a $500 espresso machine). They decide to use Median Imputation to reduce the impact of this outlier on the training process.
- Bias Check: Only 5% of the data represents "churned" customers. The team uses SageMaker Data Wrangler to resample the data to ensure the model learns the patterns of those who leave.
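The outlier-handling step in this example can be sketched with NumPy, flagging outliers with the common 1.5 × IQR rule and then imputing the median (the spend values are invented for illustration):

```python
import numpy as np

# Daily "Spend" values; the $500 espresso machine is a clear outlier
spend = np.array([4.5, 5.0, 6.2, 3.8, 5.5, 500.0])

# Flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(spend, [25, 75])
iqr = q3 - q1
is_outlier = (spend < q1 - 1.5 * iqr) | (spend > q3 + 1.5 * iqr)

# Replace flagged values with the median of the remaining data
median = np.median(spend[~is_outlier])
cleaned = np.where(is_outlier, median, spend)
print(cleaned)  # the $500 entry is replaced by the median spend
```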
Checkpoint Questions
- Which AWS service tier is best for a developer with no machine learning experience who needs to add speech-to-text to an app?
- Why doesn't MinMax Scaling fix a skewed distribution?
- What is the difference between Class Imbalance (CI) and Difference in Proportions of Labels (DPL)?
- When would you choose Batch Inference over Real-time Inference?
Muddy Points & Cross-Refs
- Standardization vs. Normalization: A common point of confusion. Standardization (Z-score) makes data have a mean of 0 but preserves the shape (and skewness). If your model requires a Gaussian distribution (like many linear models), you must use Power Transforms (Box-Cox) first.
- Cross-Ref: For more on how to technically implement these transforms, see Chapter 3: Data Transformation and Feature Engineering.
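The contrast above can be demonstrated numerically: standardization leaves the skewness coefficient unchanged, while a Box-Cox power transform drives it toward zero. A sketch using NumPy and SciPy on synthetic right-skewed (log-normal) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # heavily right-skewed

z = (x - x.mean()) / x.std()   # z-score standardization
bc, _ = stats.boxcox(x)        # Box-Cox power transform (requires x > 0)

print(stats.skew(x))   # large positive skew
print(stats.skew(z))   # same skew: standardization only shifts and rescales
print(stats.skew(bc))  # near zero: distribution is now roughly Gaussian
```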
Comparison Tables
AWS ML Tiers Comparison
| Feature | AI Services | ML Services (SageMaker) | ML Frameworks |
|---|---|---|---|
| Control | Minimum (API Only) | High (Notebooks/SDK) | Maximum (Infra/OS) |
| ML Expertise | Not required | Intermediate/Expert | Expert |
| Cost Model | Per-request | Managed Instance Hours | EC2/Compute Costs |
| Best Use Case | Text/Vision/Search | Custom Business Logic | Research/Niche Architecture |
Skewness Transformation Comparison
| Method | Best For | Requirement |
|---|---|---|
| Log Transform | Right-skewed data | x > 0 |
| Square Root | Mild skewness | x ≥ 0 |
| Box-Cox | General Power Transform | x > 0 |
| Yeo-Johnson | General Power Transform | Works with negative numbers |
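The last two rows can be illustrated with SciPy: Box-Cox rejects non-positive inputs, while Yeo-Johnson accepts them. A short sketch on made-up data containing zero and negative values:

```python
import numpy as np
from scipy import stats

data_with_negatives = np.array([-3.0, -1.0, 0.0, 2.0, 8.0, 40.0])

# stats.boxcox would raise an error here because the data are not strictly
# positive; Yeo-Johnson extends the power-transform idea to all real values.
yj, lam = stats.yeojohnson(data_with_negatives)

print(lam)             # fitted power parameter (lambda)
print(stats.skew(yj))  # lower than the original right skew
```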