Establishing and Monitoring Performance Baselines in Machine Learning

Methods to create performance baselines

Performance baselines are the foundational "yardsticks" used to evaluate the effectiveness of machine learning models. Without a baseline, it is impossible to quantify whether a complex model is actually providing value or if it is simply consuming more resources for marginal gains.

Learning Objectives

By the end of this guide, you should be able to:

  • Define the purpose of a performance baseline in the ML lifecycle.
  • Identify appropriate "simple models" to serve as initial baselines.
  • Explain how to use SageMaker Monitoring to create constraints from baselines.
  • Differentiate between data drift and model drift monitoring.
  • Select evaluation metrics that align with specific business objectives.

Key Terms & Glossary

  • Baseline: A reference point (often a simple model or heuristic) used to compare the performance of more complex models.
  • Data Drift: A change in the statistical properties of input data over time (e.g., a feature's mean shifts).
  • Model Drift (Concept Drift): A decline in model predictive power due to changes in the relationship between input features and the target variable.
  • Ground Truth: The actual, verified "correct" answer for a prediction, used to calculate model drift in production.
  • Constraint: A specific threshold (e.g., Recall > 0.8) generated during a baseline job that triggers an alert if violated.
  • Deequ: An open-source tool (used by AWS) built on Apache Spark for measuring data quality in large datasets.

The "Big Idea"

Think of a performance baseline as the "Minimum Viable Model." In the same way an architect uses a scale model before building a skyscraper, an ML engineer uses a baseline to prove that the problem is solvable and to set a floor for performance. If a simple linear regression achieves 80% accuracy, a complex deep learning model costing 10x more must significantly exceed that 80% to justify its existence.

Formula / Concept Box

| Metric | Formula | Best Used For... |
|---|---|---|
| Precision | $\frac{TP}{TP + FP}$ | Minimizing False Positives (e.g., spam detection) |
| Recall | $\frac{TP}{TP + FN}$ | Minimizing False Negatives (e.g., fraud detection) |
| F1-Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Balancing Precision and Recall |
| RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Regression problems (sensitive to outliers) |
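The metrics in the table above can be computed directly from confusion-matrix counts. The following is a minimal sketch in plain Python (the function names are illustrative; in practice you would typically use `sklearn.metrics`):

```python
import math

def precision(tp, fp):
    """TP / (TP + FP): of everything flagged positive, how much was real."""
    return tp / (tp + fp)

def recall(tp, fn):
    """TP / (TP + FN): of all real positives, how many were caught."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def rmse(y_true, y_pred):
    """Root mean squared error for regression baselines."""
    n = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n)

# Example: 70 true positives, 10 false positives, 30 false negatives
p = precision(70, 10)  # 0.875
r = recall(70, 30)     # 0.7
print(round(f1(p, r), 3))  # 0.778
```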

Hierarchical Outline

  1. Phase 1: Initial Baseline Creation
    • Simple Model Selection: Use Linear Regression (regression) or Logistic Regression (classification).
    • Metric Selection: Align metrics with Business Objectives (e.g., Precision vs. Recall).
    • Data Validation: Use simple models to identify data leakage or bias early.
  2. Phase 2: SageMaker Baseline Jobs
    • Metric Computation: Automated jobs calculate statistics (mean, median, standard deviation).
    • Constraint Generation: SageMaker creates a constraints.json file representing the "normal" range for these metrics.
  3. Phase 3: Production Monitoring
    • Data Drift: Comparing real-time input data to the baseline using Deequ.
    • Model Drift: Comparing model predictions to Ground Truth labels stored in S3.
    • Alerting: Integration with CloudWatch and SNS to trigger retraining or human intervention.
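Phase 2's constraint generation can be illustrated with a toy sketch. This is not the SageMaker API; it only mimics the idea behind the `statistics.json` / `constraints.json` pair that a Model Monitor baseline job produces (the field names and the ±3σ rule here are illustrative assumptions):

```python
import json
import statistics

def build_baseline_constraints(feature_values, num_std=3):
    """Compute per-feature statistics and derive simple range constraints,
    loosely mirroring what a SageMaker baseline job emits (names illustrative)."""
    stats, constraints = {}, {}
    for name, values in feature_values.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        stats[name] = {"mean": mean, "std_dev": std}
        constraints[name] = {
            "lower_bound": mean - num_std * std,  # "normal" range floor
            "upper_bound": mean + num_std * std,  # "normal" range ceiling
        }
    return stats, constraints

# Baseline training data: interest rates clustered around 3%
baseline = {"interest_rate": [3.1, 3.3, 2.9, 3.0, 3.2]}
stats, constraints = build_baseline_constraints(baseline)
print(json.dumps(constraints, indent=2))
```

In production, incoming feature values falling outside these bounds would count as constraint violations.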

Visual Anchors

ML Baseline Workflow

*(Diagram placeholder: the workflow flows from initial baseline creation, through SageMaker baseline jobs, to production monitoring and alerting.)*

Visualizing a Baseline (ROC Curve)

This diagram represents why we need a baseline. The diagonal line represents a random guess baseline ($AUC = 0.5$).

\begin{tikzpicture}
  % Axes
  \draw [->] (0,0) -- (5,0) node[right] {False Positive Rate};
  \draw [->] (0,0) -- (0,5) node[above] {True Positive Rate};
  % Random Guess Baseline
  \draw [dashed, gray] (0,0) -- (4,4) node[right] {Random Guess (Baseline)};
  % Simple Model
  \draw [blue, thick] (0,0) .. controls (1,3) and (2,3.5) .. (4,4) node[above left] {Simple Baseline Model};
  % Complex Model
  \draw [red, thick] (0,0) .. controls (0.5,3.5) and (1,4) .. (4,4) node[above left] {Optimized Model};
  % Labels
  \node at (2,-0.5) {0};
  \node at (4,-0.5) {1};
  \node at (-0.5,4) {1};
\end{tikzpicture}

Definition-Example Pairs

  • Simple Model Baseline: Starting with a model that has high interpretability and low compute cost.
    • Example: Using Logistic Regression to predict customer churn before trying a Gradient Boosted Tree (XGBoost).
  • Performance Constraints: Hard limits set on evaluation metrics that indicate a model is no longer fit for purpose.
    • Example: Setting a constraint that Recall must not drop below 0.8. If a new batch of data results in a recall of 0.75, a CloudWatch alarm triggers.
  • Data Drift Monitoring: Detecting when the distribution of the features entering the model changes.
    • Example: A housing price model trained on data from 2020 (low interest rates) receives 2024 data (high interest rates); the "Interest Rate" feature mean has drifted significantly.
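The "Interest Rate" drift example above can be sketched as a simple mean-shift check. This is a toy z-score heuristic, not what Deequ or Model Monitor actually compute (they apply richer per-metric constraints), but it captures the idea of comparing live data to baseline statistics:

```python
def detect_mean_drift(baseline_mean, baseline_std, live_values, threshold=3.0):
    """Flag drift when the live mean sits more than `threshold` baseline
    standard deviations from the baseline mean (illustrative heuristic)."""
    live_mean = sum(live_values) / len(live_values)
    z = abs(live_mean - baseline_mean) / baseline_std
    return z > threshold, z

# 2020 baseline: rates around 3% (std 0.5); 2024 live data around 7%
drifted, z = detect_mean_drift(3.0, 0.5, [6.8, 7.1, 7.0, 6.9])
print(drifted)  # True
```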

Worked Examples

Step-by-Step: Comparing a Baseline to a Complex Model

Scenario: You are building a model to predict whether a credit card transaction is fraudulent.

  1. Set the Baseline: You train a Logistic Regression model.
    • Result: Precision: 0.85, Recall: 0.70.
    • Observation: The model is fast and cheap, but misses 30% of fraud.
  2. Train the Complex Model: You train an XGBoost model with hyperparameter tuning.
    • Result: Precision: 0.86, Recall: 0.92.
  3. Analysis:
    • Precision Improvement: +1% (Marginal).
    • Recall Improvement: +22% (Significant).
    • Decision: The XGBoost model is justified because the significant increase in Recall (catching fraud) outweighs the extra compute cost compared to the baseline.
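The decision step above can be expressed as a small comparison helper. The 5-point recall-gain threshold is an illustrative policy choice, not a fixed rule:

```python
def justify_upgrade(baseline, candidate, min_recall_gain=0.05):
    """Return whether the candidate's recall gain over the baseline
    clears the bar that justifies its extra compute cost (illustrative)."""
    gain = candidate["recall"] - baseline["recall"]
    return gain >= min_recall_gain, gain

baseline = {"precision": 0.85, "recall": 0.70}   # Logistic Regression
candidate = {"precision": 0.86, "recall": 0.92}  # Tuned XGBoost
ok, gain = justify_upgrade(baseline, candidate)
print(ok, round(gain, 2))  # True 0.22
```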

Checkpoint Questions

  1. Why is it beneficial to start with a simple model like Linear Regression before moving to a Deep Learning architecture?
  2. Which service is used to log results from SageMaker Drift Monitoring and trigger alerts?
  3. True or False: The baseline job typically runs after the model is deployed to production.
  4. What is the difference between Data Drift and Model Drift?
Answers
  1. Simple models provide a reference point for improvement, help identify data issues like bias/leakage, and reduce initial computational costs.
  2. Amazon CloudWatch (integrated with SNS for notifications).
  3. False. The baseline job runs before training or retraining to establish constraints.
  4. Data Drift focuses on changes in input features; Model Drift focuses on changes in the accuracy/quality of predictions compared to the actual outcome (ground truth).

Muddy Points & Cross-Refs

  • Data Drift vs. Model Drift: This is a common exam pitfall. Remember: Data Drift = Input features change (e.g., users get younger). Model Drift = Predictions are wrong (e.g., the model can no longer identify a fraudulent transaction even if input ranges look normal).
  • Interpretation with Clarify: While baselines set the performance floor, SageMaker Clarify is used to explain why the model is making certain predictions and to detect bias. These are often used together in the "Analyze Model Performance" phase.

Comparison Tables

Simple Baseline vs. Advanced Model

| Feature | Simple Model (Baseline) | Advanced Model (Production) |
|---|---|---|
| Examples | Logistic Regression, Mean-filler | XGBoost, Neural Networks |
| Interpretability | High | Low ("Black Box") |
| Compute Cost | Low | High |
| Primary Goal | Establish a performance floor | Maximize business value |

Data Drift vs. Model Drift

| Attribute | Data Drift Monitoring | Model Drift Monitoring |
|---|---|---|
| What is measured? | Input feature distributions | Predictions vs. Actuals |
| Tools used | Deequ / Spark | S3 Ground Truth comparison |
| Required data | Inference request data | Ground Truth labels |
| CloudWatch metrics | Feature mean, % missing values | Precision, Recall, F1 |

[!IMPORTANT] For the AWS exam, remember that Deequ is the underlying engine for SageMaker Model Monitor's data quality checks. Also, recall that Model Drift requires you to periodically upload labeled data to S3 for comparison.
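The model-drift side of the comparison, checking predictions against ground-truth labels and a recall constraint, can be sketched as follows. This is a toy stand-in for SageMaker model quality monitoring, not its actual API:

```python
def check_recall_constraint(predictions, ground_truth, min_recall=0.8):
    """Compare predictions to ground-truth labels and flag a model-drift
    violation when recall drops below the baseline constraint (toy example)."""
    tp = sum(1 for p, y in zip(predictions, ground_truth) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predictions, ground_truth) if p == 0 and y == 1)
    current_recall = tp / (tp + fn) if (tp + fn) else 1.0
    return current_recall >= min_recall, current_recall

# 3 of 4 actual positives caught -> recall 0.75, below the 0.8 constraint
ok, current_recall = check_recall_constraint([1, 1, 1, 0, 0], [1, 1, 1, 1, 0])
print(ok, current_recall)  # False 0.75
```

In a real pipeline, the `False` result would surface as a CloudWatch metric breach that triggers an SNS alert or a retraining job.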
