Hands-On Lab: Evaluating Foundation Models with Amazon Bedrock

Methods to evaluate foundation model (FM) performance

Foundation models (FMs) are highly capable text generators, but determining whether their output meets business objectives requires rigorous evaluation. As highlighted in the AWS Certified AI Practitioner (AIF-C01) curriculum, model evaluation relies on automated standard metrics (like ROUGE, BLEU, and BERTScore), benchmark datasets, and human evaluation.

In this guided lab, you will configure an automated model evaluation job in Amazon Bedrock. You will evaluate a text summarization model using the Built-in CNN/DailyMail dataset and extract a quantitative benchmark score using the ROUGE metric.

Prerequisites

Before starting this lab, ensure you have:

  • An active AWS Account with administrator access.
  • The AWS CLI v2 installed and configured locally.
  • Amazon Bedrock Model Access granted for the model you intend to evaluate (e.g., Amazon Titan Text Premier or Anthropic Claude).
  • Basic familiarity with foundational generative AI concepts and JSON.

Learning Objectives

By completing this lab, you will be able to:

  1. Explain the role of automated model evaluation using benchmark datasets.
  2. Configure and launch a Model Evaluation Job using Amazon Bedrock.
  3. Understand standard evaluation metrics, specifically how ROUGE measures n-gram overlap.
  4. Securely store and retrieve model performance scores using Amazon S3.

Architecture Overview

The automated model evaluation process flows as follows: you submit an evaluation job to Amazon Bedrock (via the CLI or console), Bedrock runs the built-in benchmark dataset through the target model, computes the configured metric, and writes the results report to your S3 bucket.

Step-by-Step Instructions

Step 1: Set Up Storage for Evaluation Results

Amazon Bedrock requires an S3 bucket to store the evaluation output reports, which contain our metrics (like ROUGE and BERTScore).

```bash
aws s3 mb s3://brainybee-eval-results-<YOUR_ACCOUNT_ID> --region us-east-1
```

> [!TIP]
> S3 bucket names must be globally unique. Replace `<YOUR_ACCOUNT_ID>` with your actual 12-digit AWS account number to ensure uniqueness.

Console alternative
  1. Navigate to the Amazon S3 console.
  2. Click Create bucket.
  3. Enter the bucket name: brainybee-eval-results-<YOUR_ACCOUNT_ID>.
  4. Set the AWS Region to us-east-1 (or your preferred region).
  5. Leave all other settings as default and click Create bucket.
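Before running the `mb` command, you can catch naming mistakes locally. The sketch below checks a candidate name against the core S3 rules (3-63 characters; lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit); the `check_bucket_name` helper is ours, not part of the AWS CLI, and it does not check global uniqueness:

```shell
# Local sanity check for the core S3 bucket-naming rules:
# 3-63 chars, lowercase letters/digits/dots/hyphens, must start
# and end with a letter or digit. (Does not check uniqueness.)
check_bucket_name() {
  local name="$1"
  if [[ "$name" =~ ^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$ ]]; then
    echo "OK: $name"
  else
    echo "INVALID: $name"
  fi
}

check_bucket_name "brainybee-eval-results-123456789012"   # passes
check_bucket_name "BrainyBee-Eval"                        # rejected (uppercase)
```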

Step 2: Configure IAM Permissions for the Evaluation Job

Bedrock needs permission to read custom benchmark datasets (if you supply one) and to write the evaluation results to your S3 bucket.

First, create a trust policy file named trust-policy.json:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Next, execute the CLI command to create the IAM role:

```bash
aws iam create-role \
  --role-name BedrockEvalRole \
  --assume-role-policy-document file://trust-policy.json
```

Attach the required managed policies for S3 and Bedrock access (Note: For production, use least-privilege inline policies. We use managed policies here to simplify the lab):

```bash
aws iam attach-role-policy \
  --role-name BedrockEvalRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

aws iam attach-role-policy \
  --role-name BedrockEvalRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
```

Console alternative
  1. Navigate to the IAM console and select Roles > Create role.
  2. Select Custom trust policy and paste the JSON provided above.
  3. Click Next, search for AmazonS3FullAccess and AmazonBedrockFullAccess, and check the boxes next to both.
  4. Click Next, name the role BedrockEvalRole, and click Create role.

Step 3: Understand the Evaluation Metric (ROUGE)

Before running the job, let's look at the metric we are about to measure: ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE evaluates text summarization by comparing the overlap of n-grams (words) between an FM's generated text and a human-written reference text.
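To build intuition for what Bedrock will compute, here is a toy, set-based approximation of ROUGE-1 recall in the shell: the fraction of the reference's unique words that also appear in the candidate summary. Real ROUGE uses clipped n-gram counts rather than unique-word sets, so this is an illustration, not the exact formula:

```shell
# Toy approximation of ROUGE-1 recall: what fraction of the reference's
# unique words also appear in the candidate summary?
# (Real ROUGE uses clipped n-gram counts, not unique-word sets.)
ref="the cat sat on the mat"
cand="a cat sat near a mat"

# Count unique words shared by both texts.
overlap=$(comm -12 <(tr ' ' '\n' <<<"$ref" | sort -u) \
                   <(tr ' ' '\n' <<<"$cand" | sort -u) | wc -l)
# Count unique words in the reference.
total=$(tr ' ' '\n' <<<"$ref" | sort -u | wc -l)

awk -v o="$overlap" -v t="$total" 'BEGIN { printf "ROUGE-1 recall ~ %.2f\n", o/t }'
```

Here "cat", "sat", and "mat" overlap out of five unique reference words, so the score is 0.60. A higher n (ROUGE-2, ROUGE-L) rewards longer matching sequences, not just shared vocabulary.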


Step 4: Define the Evaluation Job Configuration

Create a file named eval-config.json. This tells Bedrock to run an automated evaluation over the built-in CNN/DailyMail dataset for the Summarization task, scored with the ROUGE metric.

```json
{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "Summarization",
        "dataset": { "name": "Builtin.CNNDailyMail" },
        "metricNames": ["ROUGE_N"]
      }
    ]
  }
}
```

Step 5: Execute the Bedrock Evaluation Job

Now, submit the job to Amazon Bedrock to kick off the automated evaluation.

```bash
aws bedrock create-evaluation-job \
  --job-name "brainybee-rouge-eval-01" \
  --role-arn "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BedrockEvalRole" \
  --evaluation-config file://eval-config.json \
  --inference-config '{"models": [{"bedrockModel": {"modelIdentifier": "amazon.titan-text-premier-v1:0"}}]}' \
  --output-data-config '{"s3Uri": "s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>/output/"}'
```

> [!IMPORTANT]
> You must substitute `<YOUR_ACCOUNT_ID>` with your actual account number. Ensure you have requested access to the Amazon Titan Text Premier model in the Bedrock console prior to running this command.

Console alternative
  1. Navigate to the Amazon Bedrock console.
  2. In the left navigation pane, under Assessment and deployment, choose Model evaluation.
  3. Click Create evaluation job.
  4. Select Automated evaluation.
  5. Give the job a name: brainybee-rouge-eval-01.
  6. Select Summarization as the Task Type, and select ROUGE as the metric.
  7. For dataset, choose the built-in dataset CNN/DailyMail.
  8. For the model, select Amazon Titan Text Premier.
  9. Specify your newly created S3 bucket brainybee-eval-results-<YOUR_ACCOUNT_ID> for output.
  10. Under IAM Role, choose the BedrockEvalRole you created earlier, then click Create.

Checkpoints

Verify that your job was successfully created and is processing.

  1. Check Job Status:
```bash
aws bedrock list-evaluation-jobs --query "jobSummaries[0].{Name:jobName, Status:status}"
```

Expected Output:

```json
{
  "Name": "brainybee-rouge-eval-01",
  "Status": "InProgress"
}
```

(Wait 10-15 minutes for the job to complete and the status to change to Completed.)

  2. Check the Output in S3:

Once the job is completed, verify that the quantitative performance scores have been uploaded.

```bash
aws s3 ls s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>/output/ --recursive
```

Expected Output: You should see a JSON file containing the calculated ROUGE scores for the model.
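Once you download the report, you can extract the score programmatically. The record shape below (`automatedEvaluationResult`, `scores`, `metricName`, `result`) is an illustrative assumption, not the exact Bedrock output schema; adapt the keys to match the file you actually receive:

```shell
# Hypothetical report record -- the real Bedrock output schema may differ.
cat > sample-record.json <<'EOF'
{"automatedEvaluationResult": {"scores": [{"metricName": "ROUGE_N", "result": 0.42}]}}
EOF

# Walk the (assumed) structure and print each metric's score.
python3 - <<'EOF'
import json

with open("sample-record.json") as f:
    record = json.load(f)

for score in record["automatedEvaluationResult"]["scores"]:
    print(f'{score["metricName"]}: {score["result"]}')
EOF
```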

Teardown

> [!WARNING]
> Remember to run the teardown commands to avoid ongoing charges and maintain a clean environment.

Clean up the resources provisioned during this lab:

  1. Empty and delete the S3 bucket:
```bash
aws s3 rm s3://brainybee-eval-results-<YOUR_ACCOUNT_ID> --recursive
aws s3 rb s3://brainybee-eval-results-<YOUR_ACCOUNT_ID>
```
  2. Detach policies and delete the IAM role:
```bash
aws iam detach-role-policy --role-name BedrockEvalRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BedrockEvalRole --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
aws iam delete-role --role-name BedrockEvalRole
```

(Note: Bedrock evaluation jobs themselves are historical records and do not incur ongoing hourly charges once completed, so they do not need to be actively "deleted" from the console).

Troubleshooting

| Error Message / Behavior | Probable Cause | Fix / Solution |
| --- | --- | --- |
| `AccessDeniedException: Model access not granted` | Your AWS account has not been granted access to the specified FM. | Go to the Bedrock console > Model Access > Request access to Amazon Titan Text Premier. |
| `ValidationException: Role does not have necessary permissions` | The IAM role lacks the proper Bedrock or S3 permissions, or the trust policy is malformed. | Verify that AmazonS3FullAccess and AmazonBedrockFullAccess are attached to BedrockEvalRole. |
| `InvalidBucketName` | You forgot to replace `<YOUR_ACCOUNT_ID>` with your actual AWS account ID, or left uppercase letters in the bucket name. | Re-run the bucket creation using a globally unique, all-lowercase string. |
| The job fails immediately in the console | Your selected region might not support the chosen model or automated evaluation. | Ensure you are operating in a supported region such as us-east-1 (N. Virginia) or us-west-2 (Oregon). |
