AWS ML Model Selection: Strategic Approaches and Customization Tiers

Choose a modeling approach

[!IMPORTANT] This study guide focuses on Domain 2.1: Choose a modeling approach for the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam. It covers the transition from data preparation to selecting the optimal AWS service tier and algorithm.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between the three tiers of AWS ML services (AI Services, ML Services, and Frameworks).
  • Select a modeling approach based on specific business requirements and data types.
  • Identify appropriate feature engineering techniques to handle outliers and data skewness prior to modeling.
  • Recognize pre-training bias metrics and strategies to mitigate class imbalance.

Key Terms & Glossary

  • AI Services: Fully managed, pre-built models (e.g., Amazon Rekognition, Lex) that require no ML expertise to integrate via APIs.
  • Amazon SageMaker: A fully managed service that provides the tools to build, train, and deploy ML models at scale (ML Services tier).
  • Class Imbalance (CI): A situation where one class in a dataset is significantly more frequent than others, potentially leading to biased models.
  • Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable.
  • Inference: The process of using a trained ML model to make predictions on new, unseen data.

The "Big Idea"

Choosing a modeling approach is about finding the right balance between speed-to-market and customization. AWS categorizes its offerings into three tiers so that engineers can decide whether to use a "black-box" API for common tasks (AI Services) or to build a custom architecture from scratch (Frameworks/Infrastructure) for unique niche problems.

Formula / Concept Box

| Concept | Purpose / Definition | Key Implementation |
| --- | --- | --- |
| Log Transformation | Handles right-skewed data | $y = \log(x)$ |
| Box-Cox | Normalizes non-normal data | Requires strictly positive data |
| Z-Score Scaling | Scales mean to 0, std dev to 1 | Note: does NOT fix skewness |
| Class Imbalance (CI) | Metric: $CI = \frac{n_a - n_b}{n_a + n_b}$ | SageMaker Clarify |
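The CI metric above can be computed directly from the facet counts. A minimal sketch (the class counts are illustrative, not from the guide):

```python
def class_imbalance(n_a: int, n_b: int) -> float:
    """Pre-training Class Imbalance metric, CI = (n_a - n_b) / (n_a + n_b).

    Ranges from -1 to 1; values near either extreme signal that one
    class dominates the dataset (the situation SageMaker Clarify flags).
    """
    return (n_a - n_b) / (n_a + n_b)

# 950 legitimate vs 50 fraudulent transactions:
print(class_imbalance(950, 50))   # 0.9 -> severe imbalance
print(class_imbalance(500, 500))  # 0.0 -> perfectly balanced
```

A CI close to ±1 is the cue to apply the resampling techniques discussed later in this guide.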

Hierarchical Outline

  1. Defining the Problem
    • Identify the business goal (e.g., churn prediction, image recognition).
    • Determine the data type (Structured, Text, Image, Audio).
  2. The Three Tiers of Customization
    • Tier 1: AI Services (Pre-built, low effort, high abstraction).
    • Tier 2: ML Services (SageMaker) (Managed environment, high flexibility).
    • Tier 3: ML Frameworks (PyTorch, TensorFlow on EC2/EKS, maximum control).
  3. Data Integrity & Bias Check
    • Measuring pre-training bias (DPL, CI).
    • Handling outliers (Removal vs. Imputation).
  4. Inference Strategy Selection
    • Real-time: Low latency (e.g., web apps).
    • Batch: Large-scale, non-interactive (e.g., weekly reports).

Visual Anchors

AWS ML Service Selection Flow


Handling Data Skewness


Definition-Example Pairs

  • Skewness Handling: Applying mathematical functions to normalize data distribution.
    • Example: Transforming home prices (which usually have a long tail of very expensive homes) using a Logarithmic Transformation to make the data more "bell-shaped" for a linear regression model.
  • Class Imbalance Mitigation: Techniques to ensure a model doesn't ignore the minority class.
    • Example: In credit card fraud detection, where 99.9% of transactions are legitimate, using SMOTE (Synthetic Minority Over-sampling Technique) to generate artificial fraud cases for training.
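SMOTE's core idea, generating synthetic minority points by interpolating between real ones, fits in a few lines. This is a simplified sketch that picks a random interpolation partner (true SMOTE uses the k nearest neighbors; in practice you would reach for `imblearn.over_sampling.SMOTE`); the fraud feature vectors are invented for illustration:

```python
import random

def smote_sketch(minority, n_new=4, seed=0):
    """Create n_new synthetic minority-class samples by linear interpolation
    between two randomly chosen real minority samples (simplified SMOTE:
    the real algorithm chooses the partner among k nearest neighbors)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct real samples
        lam = rng.random()               # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Three real "fraud" feature vectors (amount, hour) -> four synthetic ones
fraud = [(120.0, 2.0), (300.0, 3.0), (95.0, 4.0)]
print(smote_sketch(fraud))
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's feature region rather than being arbitrary noise.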

Worked Examples

Example 1: The Coffee Shop Churn Model

Scenario: A coffee shop wants to predict which customers will stop visiting based on transaction frequency and average spend.

  1. Approach: This is a tabular classification problem. Since the shop has specific customer data, Amazon SageMaker (Tier 2) is chosen over pre-built AI services.
  2. Data Prep: Engineers notice the "Spend" column has outliers (one person bought a $500 espresso machine). They decide to use Median Imputation to reduce the impact of this outlier on the training process.
  3. Bias Check: Only 5% of the data represents "churned" customers. The team uses SageMaker Data Wrangler to resample the data to ensure the model learns the patterns of those who leave.
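The median-imputation step from the scenario can be sketched with the standard library, using the common Tukey-fence rule to decide what counts as an outlier (the fence rule and the spend values are illustrative choices, not prescribed by the exam):

```python
from statistics import median, quantiles

def impute_outliers(values):
    """Replace values above the Tukey upper fence (Q3 + 1.5 * IQR)
    with the column median, limiting the outlier's pull on training."""
    q1, _, q3 = quantiles(values, n=4)        # quartiles of the column
    upper_fence = q3 + 1.5 * (q3 - q1)
    med = median(values)
    return [med if v > upper_fence else v for v in values]

# Typical coffee purchases plus one $500 espresso machine:
spend = [4.5, 5.0, 6.2, 5.5, 4.8, 500.0]
print(impute_outliers(spend))  # the 500.0 is replaced by the median, 5.25
```

Note the trade-off: imputation keeps the row (and its other features) in the training set, whereas outright removal would discard it.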

Checkpoint Questions

  1. Which AWS service tier is best for a developer with no machine learning experience who needs to add speech-to-text to an app?
  2. Why doesn't MinMax Scaling fix a skewed distribution?
  3. What is the difference between Class Imbalance (CI) and Difference in Proportions of Labels (DPL)?
  4. When would you choose Batch Inference over Real-time Inference?

Muddy Points & Cross-Refs

  • Standardization vs. Normalization: A common point of confusion. Standardization (Z-score) makes data have a mean of 0 but preserves the shape (and skewness). If your model requires a Gaussian distribution (like many linear models), you must use Power Transforms (Box-Cox) first.
  • Cross-Ref: For more on how to technically implement these transforms, see Chapter 3: Data Transformation and Feature Engineering.
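The muddy point above can be verified numerically: standardization is an affine rescaling, so the skewness coefficient is unchanged, while a log transform actually reduces it. A small stdlib demonstration (the price list is invented to be right-skewed):

```python
import math
from statistics import mean, pstdev

def skewness(xs):
    """Fisher-Pearson moment coefficient of skewness: E[(x - mu)^3] / sigma^3."""
    mu, sd = mean(xs), pstdev(xs)
    return sum((x - mu) ** 3 for x in xs) / (len(xs) * sd ** 3)

prices = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # long right tail
zscored = [(x - mean(prices)) / pstdev(prices) for x in prices]
logged = [math.log(x) for x in prices]

# Z-scoring shifts and rescales but preserves the asymmetry exactly;
# the log transform genuinely pulls in the tail.
print(skewness(prices), skewness(zscored), skewness(logged))
```

This is why a pipeline for a linear model often applies a power transform first and standardizes afterwards, not the other way around.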

Comparison Tables

AWS ML Tiers Comparison

| Feature | AI Services | ML Services (SageMaker) | ML Frameworks |
| --- | --- | --- | --- |
| Control | Minimum (API only) | High (Notebooks/SDK) | Maximum (Infra/OS) |
| ML Expertise | Not required | Intermediate/Expert | Expert |
| Cost Model | Per-request | Managed instance hours | EC2/compute costs |
| Best Use Case | Text/Vision/Search | Custom business logic | Research/niche architectures |

Skewness Transformation Comparison

| Method | Best For | Requirement |
| --- | --- | --- |
| Log Transform | Right-skewed data | $x > 0$ |
| Square Root | Mild skewness | $x \ge 0$ |
| Box-Cox | General power transform | $x > 0$ |
| Yeo-Johnson | General power transform | Works with negative numbers |
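The input requirements in the table follow directly from the transforms' standard definitions. A sketch implementing both formulas (in practice you would use `sklearn.preprocessing.PowerTransformer`, which also fits the λ parameter for you):

```python
import math

def box_cox(x: float, lam: float) -> float:
    """Box-Cox power transform: (x^lam - 1)/lam, or log(x) when lam = 0.
    Defined only for strictly positive x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def yeo_johnson(x: float, lam: float) -> float:
    """Yeo-Johnson transform: extends Box-Cox to zero and negative inputs
    via a mirrored branch with exponent (2 - lam)."""
    if x >= 0:
        return math.log1p(x) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return -math.log1p(-x) if lam == 2 else -(((1 - x) ** (2 - lam) - 1) / (2 - lam))

print(box_cox(10.0, 0))        # log(10) ≈ 2.3026
print(yeo_johnson(-3.0, 0.5))  # valid where Box-Cox would raise an error
```

For exam purposes, the key distinction is simply the domain: Box-Cox needs $x > 0$, Yeo-Johnson does not.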
