
Unit 2 Study Guide: ML Model Development

This guide covers the core concepts of Content Domain 2 for the AWS Certified Machine Learning Engineer – Associate exam, focusing on selecting, training, and refining machine learning models.

Learning Objectives

After studying this unit, you should be able to:

  • Distinguish between the three tiers of AWS ML tools: AI Services, ML Services, and ML Frameworks.
  • Identify appropriate ML algorithms based on specific business problem requirements.
  • Explain the iterative nature of model training and parameter adjustment.
  • Quantify data bias using metrics like Class Imbalance (CI) and Difference in Proportions of Labels (DPL).
  • Select strategies to mitigate bias, such as resampling and synthetic data generation.

Key Terms & Glossary

  • Model Development: The heart of the ML lifecycle; the process of designing, training, and optimizing a model to recognize patterns.
  • Class Imbalance (CI): A bias metric occurring when one category is significantly more frequent than others in a dataset.
  • Difference in Proportions of Labels (DPL): A metric used to identify bias by comparing the observed outcomes across different facets of a dataset.
  • Hyperparameters: External configurations set before training (e.g., learning rate) that guide the learning process.
  • SageMaker Clarify: An AWS tool used specifically for detecting bias in datasets and monitoring model behavior for fairness.

The "Big Idea"

Model development is not a "one-and-done" task; it is an iterative cycle of trial and error. It represents the transition from raw data preparation to the creation of an intelligent artifact. In the AWS ecosystem, the goal is to find the right "Tier of Abstraction"—balancing the speed of pre-built AI services against the total control of custom frameworks.

Formula / Concept Box

| Metric | Formula | Application |
| --- | --- | --- |
| Class Imbalance (CI) | $CI = \frac{n_a - n_b}{n_a + n_b}$ | Measures the disparity between the number of samples in different classes. |
| DPL | $DPL = q_a - q_b$ | Compares the proportion of positive outcomes between two groups (facets). |
| Training Goal | $\min(\text{Loss})$ | The process of adjusting parameters to minimize the error function. |
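The two bias metrics above can be computed directly from class and facet counts. The sketch below is a minimal illustration; the function names are my own and not part of any AWS API.

```python
def class_imbalance(n_a: int, n_b: int) -> float:
    """CI = (n_a - n_b) / (n_a + n_b); ranges from -1 to 1, 0 means balanced."""
    return (n_a - n_b) / (n_a + n_b)

def dpl(positives_a: int, total_a: int, positives_b: int, total_b: int) -> float:
    """DPL = q_a - q_b, where q is the proportion of positive outcomes per facet."""
    return positives_a / total_a - positives_b / total_b

print(class_imbalance(1000, 200))   # high imbalance toward the first class
print(dpl(75, 100, 50, 100))        # facet a receives positive outcomes more often
```

A CI near 0 and a DPL near 0 suggest a balanced dataset; values near ±1 flag strong skew worth investigating before training.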

Hierarchical Outline

  • I. Choosing a Modeling Approach (The Three Tiers)
    • AI Services: Fully managed, pre-trained (e.g., Rekognition, Polly). No ML expertise required.
    • ML Services (Amazon SageMaker): Managed platform for custom models. Use your own data without managing infrastructure.
    • ML Frameworks & Infrastructure: Deep customization using PyTorch, TensorFlow, or MXNet on EC2/EKS.
  • II. Model Training & Refinement
    • Pattern Recognition: Feeding data to identify relationships.
    • Parameter Tuning: The model adjusts internal weights to minimize prediction error.
    • Generalization: The ability of a model to perform well on unseen data.
  • III. Data Integrity & Bias Mitigation
    • Pre-training Bias: Detecting issues before the model learns (e.g., Selection bias).
    • Mitigation Strategies: Shuffling, augmenting data, or using synthetic data generation (SMOTE).
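The resampling strategy listed above can be illustrated with naive random oversampling, a minimal pure-Python sketch; in practice, SMOTE synthesizes new minority-class points by interpolating between neighbors rather than duplicating existing samples.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class (naive oversampling)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = rng.choices(xs, k=target - len(xs))  # resample with replacement
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

xs, ys = oversample_minority([1, 2, 3, 4, 5], ["a", "a", "a", "a", "b"])
print(ys.count("a"), ys.count("b"))  # both classes now have 4 samples
```

Duplication is cheap but risks overfitting to repeated minority examples, which is why interpolation-based methods like SMOTE are often preferred.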

Visual Anchors

Model Selection Logic

(Diagram placeholder: a decision flow for choosing between AI Services, ML Services (SageMaker), and ML Frameworks based on required customization and available ML expertise.)

The Learning Curve (Loss vs. Iterations)

\begin{tikzpicture}
  \draw[->] (0,0) -- (6,0) node[right] {Training Iterations};
  \draw[->] (0,0) -- (0,4) node[above] {Loss (Error)};
  \draw[blue, thick] (0.5,3.5) .. controls (1.5,1) and (3,0.5) .. (5.5,0.4);
  \node[blue] at (4,1.5) {Convergence};
  \draw[dashed, red] (0,0.4) -- (6,0.4) node[right] {Minimum Error};
\end{tikzpicture}
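The convergence curve above can be reproduced with a tiny gradient-descent loop. This is an illustrative toy (a one-parameter quadratic loss, not a real model): the learning rate is a hyperparameter set up front, while the weight is the parameter being learned.

```python
# Minimal gradient descent on f(w) = (w - 3)^2: loss shrinks each iteration,
# mirroring the convergence curve above.
def train(lr=0.1, steps=20, w=0.0):
    losses = []
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)^2
        w -= lr * grad          # parameter update (the learned part)
        losses.append((w - 3) ** 2)
    return w, losses

w, losses = train()
print(round(w, 3))              # close to the minimum at w = 3
print(all(b <= a for a, b in zip(losses, losses[1:])))  # loss never increases
```

With too large a learning rate the same loop diverges instead of converging, which is why the learning rate is one of the first hyperparameters to tune.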

Definition-Example Pairs

  • AI Service → A pre-built API for common tasks. → Example: Using Amazon Transcribe to convert a recorded meeting into a text transcript without training a speech model yourself.
  • ML Framework → A library of code for building custom neural networks. → Example: Building a proprietary trading algorithm using PyTorch on AWS P4d instances.
  • Data Augmentation → Increasing dataset size by creating modified versions of existing data. → Example: Flipping or rotating images of products to ensure a computer vision model recognizes them from any angle.
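The data-augmentation example above can be sketched in plain Python, treating an image as a 2D list of pixel values (a stand-in for a real image array; production pipelines would use a library such as Pillow or torchvision).

```python
def augment_flips(image):
    """Return the original image plus its horizontal and vertical flips."""
    h_flip = [row[::-1] for row in image]  # mirror each row left-to-right
    v_flip = image[::-1]                   # reverse the row order top-to-bottom
    return [image, h_flip, v_flip]

img = [[1, 2],
       [3, 4]]
for variant in augment_flips(img):
    print(variant)
```

Each flip is a label-preserving transformation: a flipped cat is still a cat, so the dataset triples in size without new collection effort.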

Worked Examples

Case Study: Churn Prediction

Problem: A subscription service wants to predict customer churn.

  1. Data Collection: Gather logs of customer logins, payment history, and support tickets.
  2. Feature Engineering: Create a feature called days_since_last_login.
  3. Selection: Choose SageMaker XGBoost (an ML Service) because the data is tabular and the problem requires a custom-trained model.
  4. Training: Feed 80% of data to the model. The model discovers that support_tickets > 3 is a strong predictor of churn.
  5. Evaluation: Test on the remaining 20% to ensure accuracy.
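Steps 4 and 5 (the 80/20 split) can be sketched in plain Python. This is a minimal illustration; a real pipeline would typically use scikit-learn's `train_test_split` or SageMaker's training/validation channels.

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle and split records so a held-out set can measure generalization."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

customers = [{"id": i, "support_tickets": i % 5} for i in range(100)]
train, test = train_test_split(customers)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting matters: if records are ordered (e.g., by signup date), an unshuffled split leaks a systematic difference between the train and test sets.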

Calculating Class Imbalance (CI)

Scenario: You have 1000 images of cats and 200 images of dogs.

  • n_cats = 1000
  • n_dogs = 200
  • CI = (1000 − 200) / (1000 + 200) = 800 / 1200 ≈ 0.67
  • Interpretation: A CI of 0.67 indicates a high imbalance toward cats, which may lead to a model that struggles to identify dogs accurately.

Checkpoint Questions

  1. Which AWS service tier is most appropriate for a developer who needs to add image recognition to an app in 24 hours but has no ML experience?
  2. What is the primary difference between a parameter and a hyperparameter?
  3. Name two pre-training bias metrics supported by SageMaker Clarify.
  4. If a dataset has high Class Imbalance, what technique can be used to generate more samples of the minority class?

Muddy Points & Cross-Refs

  • Confusion between parameters and hyperparameters: Remember that Parameters are learned by the model (e.g., weights in a linear regression), while Hyperparameters are set by you (e.g., the depth of a decision tree).
  • Bias vs. Variance: While this guide focuses on bias metrics (fairness), remember to cross-reference with Domain 2.3 regarding model performance (overfitting/underfitting).
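The parameter/hyperparameter distinction above can be made concrete with a one-variable linear fit (an illustrative toy, not any AWS API): `lr` and `epochs` are hyperparameters you choose before training, while `w` and `b` are parameters the loop learns from the data.

```python
# Fit y = w*x + b by stochastic gradient descent on squared error.
def fit(xs, ys, lr=0.05, epochs=1000):
    w, b = 0.0, 0.0                      # parameters: learned from data
    for _ in range(epochs):              # epochs: hyperparameter, set by you
        for x, y in zip(xs, ys):
            err = (w * x + b) - y
            w -= lr * err * x            # lr: hyperparameter, set by you
            b -= lr * err
    return w, b

w, b = fit([0, 1, 2, 3], [1, 3, 5, 7])   # data generated by y = 2x + 1
print(round(w, 2), round(b, 2))          # recovers roughly w = 2, b = 1
```

Changing `lr` changes how training proceeds but is never itself learned; that asymmetry is the exam-relevant distinction.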

Comparison Tables

| Feature | AI Services | ML Services (SageMaker) | ML Frameworks (IaaS) |
| --- | --- | --- | --- |
| Ease of Use | Very High | Medium | Low |
| Customization | Low | High | Very High |
| Model Ownership | AWS Owned | User Owned | User Owned |
| Expertise Needed | None | Data Science / Dev | Deep ML Engineering |
| Examples | Rekognition, Lex | SageMaker Studio | EC2 + PyTorch |
