Hands-On Lab: Evaluating Foundation Models with Amazon Bedrock
Methods to evaluate foundation model (FM) performance
Foundation models (FMs) are highly capable text generators, but determining whether a given model meets business objectives requires rigorous evaluation. As highlighted in the AWS Certified AI Practitioner (AIF-C01) curriculum, model evaluation relies on automated standard metrics (such as ROUGE, BLEU, and BERTScore), benchmark datasets, and human evaluation.
In this guided lab, you will configure an automated model evaluation job in Amazon Bedrock. You will evaluate a text summarization model using the Built-in CNN/DailyMail dataset and extract a quantitative benchmark score using the ROUGE metric.
Prerequisites
Before starting this lab, ensure you have:
- An active AWS Account with administrator access.
- The AWS CLI v2 installed and configured locally.
- Amazon Bedrock Model Access granted for the model you intend to evaluate (e.g., Amazon Titan Text Premier or Anthropic Claude).
- Basic familiarity with foundational generative AI concepts and JSON.
Learning Objectives
By completing this lab, you will be able to:
- Explain the role of automated model evaluation using benchmark datasets.
- Configure and launch a Model Evaluation Job using Amazon Bedrock.
- Understand standard evaluation metrics, specifically how ROUGE measures n-gram overlap.
- Securely store and retrieve model performance scores using Amazon S3.
Architecture Overview
The following diagram illustrates the flow of our automated model evaluation process:
Step-by-Step Instructions
Step 1: Set Up Storage for Evaluation Results
Amazon Bedrock requires an S3 bucket to store the evaluation output reports, which contain our metrics (like ROUGE and BERTScore).
```bash
aws s3 mb s3://brainybee-eval-results-<YOUR_ACCOUNT_ID> --region us-east-1
```

> [!TIP]
> S3 bucket names must be globally unique. Replace `<YOUR_ACCOUNT_ID>` with your actual 12-digit AWS account number to ensure uniqueness.
▶Console alternative
- Navigate to the Amazon S3 console.
- Click Create bucket.
- Enter the bucket name: `brainybee-eval-results-<YOUR_ACCOUNT_ID>`.
- Set the AWS Region to `us-east-1` (or your preferred region).
- Leave all other settings as default and click Create bucket.
Step 2: Configure IAM Permissions for the Evaluation Job
Bedrock needs permission to read benchmark datasets (if custom) and write the performance score to your S3 bucket.
First, create a trust policy file named `trust-policy.json`:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Next, execute the CLI command to create the IAM role:
```bash
aws iam create-role \
  --role-name BedrockEvalRole \
  --assume-role-policy-document file://trust-policy.json
```

Attach the required managed policies for S3 and Bedrock access (Note: for production, use least-privilege inline policies; we use managed policies here to simplify the lab):
```bash
aws iam attach-role-policy \
  --role-name BedrockEvalRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

aws iam attach-role-policy \
  --role-name BedrockEvalRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
```

▶Console alternative
- Navigate to the IAM console and select Roles > Create role.
- Select Custom trust policy and paste the JSON provided above.
- Click Next, search for `AmazonS3FullAccess` and `AmazonBedrockFullAccess`, and check the boxes next to both.
- Click Next, name the role `BedrockEvalRole`, and click Create role.
Step 3: Understand the Evaluation Metric (ROUGE)
Before running the job, let's look at the metric we are about to measure: ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE evaluates text summarization by measuring the overlap of n-grams (contiguous sequences of n words) between an FM's generated text and a human-written reference text; the higher the overlap, the closer the generated summary is to the human reference.
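To build intuition for the score Bedrock will report, here is a minimal, illustrative ROUGE-N computation in plain Python. This is not the exact implementation Bedrock uses; the whitespace tokenization is deliberately naive:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of contiguous n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Naive ROUGE-N: recall, precision, and F1 over clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram match is clipped to its reference count
    recall = overlap / max(sum(ref.values()), 1)        # ROUGE is recall-oriented
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

scores = rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1)
print(scores)  # 5 of the 6 reference unigrams appear in the candidate
```

Running this shows a ROUGE-1 recall of 5/6: only "lay" from the reference is missing from the generated text.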
Step 4: Define the Evaluation Job Configuration
Create a file named `eval-config.json`. This tells Bedrock to run an automated evaluation over the built-in CNN/DailyMail dataset for the Summarization task, scored with ROUGE.
```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "Summarization",
        "dataset": { "name": "Builtin.CNNDailyMail" },
        "metricNames": ["ROUGE_N"]
      }
    ]
  }
}
```

Step 5: Execute the Bedrock Evaluation Job
Now, submit the job to Amazon Bedrock to kick off the automated evaluation.
```bash
aws bedrock create-evaluation-job \
  --job-name "brainybee-rouge-eval-01" \
  --role-arn "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BedrockEvalRole" \
  --evaluation-config file://eval-config.json \
  --inference-config '{"models": [{"bedrockModel": {"modelIdentifier":"amazon.titan-text-premier-v1:0"}}]}' \
  --output-data-config '{"s3Uri":"s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>/output/"}'
```

> [!IMPORTANT]
> You must substitute `<YOUR_ACCOUNT_ID>` with your actual account number. Ensure you have requested access to the Amazon Titan Text Premier model in the Bedrock console prior to running this command.
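If you prefer the SDK over the CLI, the same job can be submitted with boto3. The sketch below builds the request as a plain dict first so you can inspect it before sending; the parameter values mirror the CLI command above, and `<YOUR_ACCOUNT_ID>` remains a placeholder you must fill in:

```python
def build_eval_job_request(account_id: str) -> dict:
    """Assemble a create_evaluation_job request mirroring the CLI flags above."""
    bucket = f"brainybee-eval-results-{account_id}"
    return {
        "jobName": "brainybee-rouge-eval-01",
        "roleArn": f"arn:aws:iam::{account_id}:role/BedrockEvalRole",
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "Summarization",
                        "dataset": {"name": "Builtin.CNNDailyMail"},
                        "metricNames": ["ROUGE_N"],
                    }
                ]
            }
        },
        "inferenceConfig": {
            "models": [
                {"bedrockModel": {"modelIdentifier": "amazon.titan-text-premier-v1:0"}}
            ]
        },
        "outputDataConfig": {"s3Uri": f"s3://{bucket}/output/"},
    }

if __name__ == "__main__":
    import boto3  # imported here so the request builder stays dependency-free

    # Requires AWS credentials, Bedrock model access, and your real account ID.
    bedrock = boto3.client("bedrock", region_name="us-east-1")
    response = bedrock.create_evaluation_job(**build_eval_job_request("<YOUR_ACCOUNT_ID>"))
    print(response["jobArn"])
```

Keeping the request-building logic in a pure function makes it easy to unit-test the configuration without touching AWS.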
▶Console alternative
- Navigate to the Amazon Bedrock console.
- In the left navigation pane, under Assessment and deployment, choose Model evaluation.
- Click Create evaluation job.
- Select Automated evaluation.
- Give the job a name: `brainybee-rouge-eval-01`.
- Select Summarization as the Task Type, and select ROUGE as the metric.
- For the dataset, choose the built-in dataset CNN/DailyMail.
- For the model, select Amazon Titan Text Premier.
- Specify your newly created S3 bucket `brainybee-eval-results-<YOUR_ACCOUNT_ID>` for output.
- Under IAM Role, choose the `BedrockEvalRole` you created earlier, then click Create.
Checkpoints
Verify that your job was successfully created and is processing.
- Check Job Status:
```bash
aws bedrock list-evaluation-jobs --query "jobSummaries[0].{Name:jobName, Status:status}"
```

Expected Output:
```json
{
  "Name": "brainybee-rouge-eval-01",
  "Status": "InProgress"
}
```

(Wait 10-15 minutes for the job to complete and the status to change to Completed.)
- Check the Output in S3:
Once the job is completed, verify that the quantitative performance scores have been uploaded.
```bash
aws s3 ls s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>/output/ --recursive
```

Expected Output: you should see a JSON file containing the calculated ROUGE scores for the model.
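Rather than re-running the status check by hand, you can poll the job from a small script. This is a sketch assuming boto3 and AWS credentials are available; the terminal-status check is factored into a pure helper, and the status strings are the values Bedrock reports for evaluation jobs (assumed here to be `InProgress`, `Completed`, `Failed`, and `Stopped`):

```python
import time

# Statuses an evaluation job can end in; "InProgress" is the only
# state worth continuing to poll on (assumed set of values).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def is_terminal(status: str) -> bool:
    """True once the evaluation job will no longer change state."""
    return status in TERMINAL_STATUSES

def wait_for_job(client, job_arn: str, poll_seconds: int = 60) -> str:
    """Poll Bedrock until the evaluation job reaches a terminal status."""
    while True:
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        print(f"status: {status}")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)

if __name__ == "__main__":
    import boto3  # imported here so the helpers above stay dependency-free

    bedrock = boto3.client("bedrock", region_name="us-east-1")
    # Replace with the jobArn returned by create-evaluation-job.
    final = wait_for_job(bedrock, "<YOUR_JOB_ARN>")
    print(f"final status: {final}")
```

Once `wait_for_job` returns `Completed`, the S3 listing above will show the results file.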
Teardown
[!WARNING] Remember to run the teardown commands to avoid ongoing charges and maintain a clean environment.
Clean up the resources provisioned during this lab:
- Empty and delete the S3 bucket:
```bash
aws s3 rm s3://brainybee-eval-results-<YOUR_ACCOUNT_ID> --recursive
aws s3 rb s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>
```

- Detach policies and delete the IAM role:

```bash
aws iam detach-role-policy --role-name BedrockEvalRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BedrockEvalRole --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
aws iam delete-role --role-name BedrockEvalRole
```

(Note: Bedrock evaluation jobs themselves are historical records and do not incur ongoing hourly charges once completed, so they do not need to be actively "deleted" from the console.)
Troubleshooting
| Error Message / Behavior | Probable Cause | Fix / Solution |
|---|---|---|
| `AccessDeniedException: Model access not granted` | Your AWS account has not been granted access to the specified FM. | Go to the Bedrock Console > Model Access > Request access to Amazon Titan Text Premier. |
| `ValidationException: Role does not have necessary permissions` | The IAM role lacks the proper Bedrock or S3 permissions, or the trust policy is malformed. | Verify that `AmazonS3FullAccess` and `AmazonBedrockFullAccess` are attached to `BedrockEvalRole`. |
| `InvalidBucketName` | You forgot to replace `<YOUR_ACCOUNT_ID>` with your actual AWS account ID, or left uppercase letters in the bucket name. | Re-run the bucket creation using a globally unique, all-lowercase string. |
| The job fails immediately in the console | Your selected region might not support the chosen model or automated evaluation. | Ensure you are operating in a supported region like us-east-1 (N. Virginia) or us-west-2 (Oregon). |