SageMaker Model Monitor: Detecting Data Drift in Production
Machine learning models often degrade in performance as the data they encounter in production diverges from their training data. This phenomenon, known as data drift, can lead to inaccurate predictions and poor business outcomes. In this lab, you will configure Amazon SageMaker Model Monitor to automatically detect these deviations by establishing a baseline and scheduling periodic quality checks.
[!WARNING] This lab involves provisioning AWS resources that may incur costs. Remember to run the teardown commands at the end to avoid ongoing charges.
Prerequisites
- AWS Account: An active AWS account with Administrator access.
- CLI Tools: AWS CLI configured with `<YOUR_REGION>` (e.g., `us-east-1`).
- IAM Permissions: Ensure your execution role has `AmazonSageMakerFullAccess` and `CloudWatchFullAccess`.
- S3 Bucket: A bucket named `brainybee-lab-monitor-<YOUR_ACCOUNT_ID>` to store baselines and captured data.
Learning Objectives
- Enable Data Capture on an active SageMaker Inference Endpoint.
- Execute a Baseline Job to generate statistical constraints from training data.
- Configure a Monitoring Schedule to compare production traffic against the baseline.
- Visualize Data Quality Metrics and drift alerts in Amazon CloudWatch.
Architecture Overview
Endpoint (with Data Capture enabled) → captured request/response JSONL in S3 → scheduled monitoring job compares traffic against the baseline → drift metrics and alerts surface in Amazon CloudWatch.
Step-by-Step Instructions
Step 1: Enable Data Capture on Endpoint
To monitor a model, we must first tell SageMaker to save a sample of the incoming requests and outgoing predictions to S3.
```shell
# Define the Data Capture configuration.
# Endpoint configs are immutable, so create a new config that adds capture,
# then switch the endpoint over to it. (variants.json holds the
# ProductionVariants copied from your existing endpoint config.)
aws sagemaker create-endpoint-config \
    --endpoint-config-name "MyModelConfig-capture" \
    --production-variants file://variants.json \
    --data-capture-config '{ "EnableCapture": true, "InitialSamplingPercentage": 100, "DestinationS3Uri": "s3://brainybee-lab-monitor-<YOUR_ACCOUNT_ID>/captured-data/", "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}] }'

aws sagemaker update-endpoint \
    --endpoint-name "MyModelEndpoint" \
    --endpoint-config-name "MyModelConfig-capture"
```

Console alternative:
- Navigate to SageMaker > Inference > Endpoints.
- Select your endpoint and click Update.
- Under Data Capture, toggle to Enabled.
- Set Sampling percentage to 100% and provide your S3 path.
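The same capture settings can be expressed as a boto3 payload from a notebook. A minimal sketch, assuming the lab's bucket and config names (the commented-out `create_endpoint_config` call shows where the payload goes; `ProductionVariants` must be copied from your existing config):

```python
# Build the DataCaptureConfig payload used by boto3's SageMaker client.
account_id = "<YOUR_ACCOUNT_ID>"  # placeholder, as in the CLI example

data_capture_config = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,  # capture all traffic for the lab
    "DestinationS3Uri": f"s3://brainybee-lab-monitor-{account_id}/captured-data/",
    "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
}

# In a notebook you would pass this to create_endpoint_config:
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="MyModelConfig-capture",
#     ProductionVariants=[...],          # copy from your existing config
#     DataCaptureConfig=data_capture_config,
# )
print(data_capture_config["DestinationS3Uri"])
```

In production you would typically lower `InitialSamplingPercentage` (e.g., 10–20%) to limit S3 storage and monitoring-job input size.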
Step 2: Establish a Baseline
Before we can detect drift, we need to know what "normal" looks like. We use a baseline job to analyze our training dataset.
```shell
# Note: This step is typically done via the SageMaker Python SDK in a notebook.
# It triggers a processing job that writes statistics.json and constraints.json
# to S3.
```

[!TIP] The baseline job calculates the mean, standard deviation, and distribution for every feature in your dataset.
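To make the tip concrete, here is a toy sketch of the per-feature statistics and constraints a baseline job produces. This is pure Python and illustrative only; the real job runs in a distributed processing container, and `suggest_baseline` here is just a name echoing the SDK's method:

```python
import math

def suggest_baseline(rows, feature_names):
    """Toy sketch: per-feature mean/std ("statistics") and a completeness
    check ("constraints"), mirroring statistics.json / constraints.json.
    Illustrative only -- not the actual Model Monitor implementation."""
    statistics, constraints = {}, {}
    for i, name in enumerate(feature_names):
        values = [row[i] for row in rows if row[i] is not None]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        statistics[name] = {"mean": mean, "std_dev": math.sqrt(variance)}
        constraints[name] = {"completeness": len(values) / len(rows)}
    return statistics, constraints

rows = [(1.0, 10.0), (2.0, 12.0), (3.0, None)]  # one missing "income" value
stats, consts = suggest_baseline(rows, ["age", "income"])
print(stats["age"])      # {'mean': 2.0, 'std_dev': 0.816...}
print(consts["income"])  # {'completeness': 0.666...}
```

The monitoring jobs in Step 3 then check each batch of captured traffic against exactly these kinds of per-feature expectations.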
Step 3: Create a Monitoring Schedule
Now, schedule a recurring job (e.g., hourly) to compare the captured production data against the baseline.
```shell
# Assumes a data-quality job definition named "DataQualityJobDef" already
# exists (created with `aws sagemaker create-data-quality-job-definition`).
# cron(0 * * * ? *) runs at the top of every hour.
aws sagemaker create-monitoring-schedule \
    --monitoring-schedule-name "DailyDataQualityShift" \
    --monitoring-schedule-config '{ "MonitoringJobDefinitionName": "DataQualityJobDef", "MonitoringType": "DataQuality", "ScheduleConfig": { "ScheduleExpression": "cron(0 * * * ? *)" } }'
```

Checkpoints
- S3 Verification: Run `aws s3 ls s3://brainybee-lab-monitor-<YOUR_ACCOUNT_ID>/captured-data/` to ensure JSONL files are appearing after you send test traffic to the endpoint.
- Status Check: Check the SageMaker console under Model Monitoring to ensure the schedule status is `Scheduled` or `Executing`.
Visualizing Drift
The following diagram illustrates the concept of a feature distribution shift that Model Monitor would flag:
```latex
\begin{tikzpicture}[scale=0.8]
  % Baseline distribution
  \draw[blue, thick] (-3,0) .. controls (-1,0) and (-0.5,3) .. (0,3)
                     .. controls (0.5,3) and (1,0) .. (3,0);
  \node[blue] at (0,3.3) {Baseline (Training)};
  % Drifted distribution
  \draw[red, dashed, thick] (-1,0) .. controls (1,0) and (1.5,3) .. (2,3)
                            .. controls (2.5,3) and (3,0) .. (5,0);
  \node[red] at (2,3.3) {Production (Drifted)};
  % Axes
  \draw[->] (-4,0) -- (6,0) node[right] {Feature Value};
  \draw[->] (-4,0) -- (-4,4) node[above] {Density};
  % Drift indicator
  \draw[<->, thick] (0,1.5) -- (2,1.5);
  \node at (1,1.8) {\textbf{DRIFT}};
\end{tikzpicture}
```
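The gap in the diagram can also be quantified numerically. A minimal pure-Python sketch of a mean-shift check (the function name and the z-score rule are illustrative assumptions, not Model Monitor's actual statistical test):

```python
import random
import statistics

def mean_shift_alert(baseline, production, z_threshold=3.0):
    """Flag drift when the production mean sits more than z_threshold
    baseline standard deviations away from the baseline mean.
    Illustrative only -- not Model Monitor's actual test."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(production) - mu) / sigma
    return shift > z_threshold, shift

random.seed(0)
baseline = [random.gauss(0.0, 1.0) for _ in range(1000)]  # "training" feature
drifted = [random.gauss(4.0, 1.0) for _ in range(1000)]   # shifted production

print(mean_shift_alert(baseline, baseline[:500])[0])  # False: no drift
print(mean_shift_alert(baseline, drifted)[0])         # True: drift flagged
```

Real monititoring jobs compare full distributions (not just means), which is why the baseline stores per-feature distribution statistics rather than a single summary number.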
Troubleshooting
| Issue | Likely Cause | Fix |
|---|---|---|
| No Data in S3 | DataCaptureConfig not applied | Check endpoint configuration status in CLI. |
| Job Fails | IAM Role missing S3 permissions | Ensure the SageMaker execution role has s3:PutObject for your bucket. |
| Metrics missing or sparse | Low traffic volume | Send a burst of at least 50-100 requests so the next monitoring run has enough samples for a meaningful analysis. |
Clean-Up / Teardown
Avoid charges by deleting the monitoring schedule and endpoint.
```shell
# 1. Delete Monitoring Schedule
aws sagemaker delete-monitoring-schedule --monitoring-schedule-name "DailyDataQualityShift"

# 2. Delete Endpoint
aws sagemaker delete-endpoint --endpoint-name "MyModelEndpoint"

# 3. Empty S3 Bucket
aws s3 rm s3://brainybee-lab-monitor-<YOUR_ACCOUNT_ID> --recursive
```

Stretch Challenge
Model Quality Monitoring: Instead of just monitoring input data (Data Quality), set up a Model Quality monitor. This requires providing "Ground Truth" labels for the inferences made in Step 1 and comparing the model's accuracy/F1-score against the baseline performance.
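The core of the Model Quality computation is joining captured predictions with their ground-truth labels and re-deriving the quality metrics. A minimal pure-Python sketch for a binary classifier (illustrative, not the Model Quality container's implementation):

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and binary F1 from ground-truth labels and
    captured predictions -- the kind of metrics a Model Quality
    monitor compares against baseline performance."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Ground truth joined to captured inferences (conceptually by record ID)
acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(round(acc, 2), round(f1, 2))  # 0.6 0.67
```

Note the extra operational requirement this implies: you must upload ground-truth labels to S3 on an ongoing basis, with a delay tolerance matching how quickly true outcomes become known.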
Cost Estimate
| Service | Usage Type | Estimated Cost |
|---|---|---|
| SageMaker Endpoint | ml.m5.xlarge (On-Demand) | ~$0.23 / hour |
| Monitoring Job | ml.m5.xlarge (Processing) | ~$0.23 / job run |
| S3 Storage | Data Logs | <$0.01 (negligible for lab) |
Concept Review
Comparison Table: Monitoring Types
| Feature | Data Quality | Model Quality | Bias Drift | Feature Attribution |
|---|---|---|---|---|
| What it monitors | Distribution of input features | Accuracy, Precision, Recall | Fairness metrics over time | Change in feature importance |
| Requirement | Baseline + Live Data | Baseline + Ground Truth | Baseline + Live Data | Baseline + Live Data |
| Detection | Outliers, Missing values | Accuracy drop | Prediction bias | Changing rank of features |