
Hands-On Lab: Evaluating Foundation Models with Amazon Bedrock

Methods to evaluate foundation model (FM) performance

Welcome to this guided hands-on lab. In this session, you will learn how to evaluate a Foundation Model's (FM) performance using benchmark datasets and standard evaluation metrics such as ROUGE and BLEU.

We will use Amazon Bedrock to run an automated model evaluation job. This process uses a judge model and automated algorithms to compare the FM's generated responses against reference answers provided by subject matter experts (SMEs).

Prerequisites

Before starting this lab, ensure you have the following:

  • AWS Account: Active AWS account with Administrator access or permissions to use Amazon Bedrock, IAM, and S3.
  • AWS CLI: Installed and configured with your credentials (aws configure).
  • Model Access: Access requested and granted for Amazon Titan Text G1 - Express in the Amazon Bedrock console (us-east-1 region recommended).
  • Prior Knowledge: Basic understanding of Foundation Models, prompting, and standard evaluation metrics (ROUGE, BLEU, BERTScore).

Learning Objectives

By the end of this lab, you will be able to:

  1. Prepare and format a benchmark dataset for automated FM evaluation.
  2. Configure an automated model evaluation job using Amazon Bedrock.
  3. Understand how ROUGE and BLEU metrics measure text similarity.
  4. Analyze the performance score and quantitative metrics generated by the evaluation.

Architecture Overview

The following diagram illustrates the automated evaluation workflow we will build. A benchmark dataset is fed into Amazon Bedrock, which generates responses and passes them to a Judge Model/Evaluation Engine to compute the performance score.


Understanding ROUGE-N Overlap

Before diving into the steps, recall that metrics like ROUGE evaluate automatic summarization by measuring n-gram overlap. Below is a visual representation of how ROUGE-1 (unigram) overlap is calculated between an FM's output and a human reference:

*(Diagram: ROUGE-1 unigram overlap between the FM's output and the human reference.)*
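The overlap idea is easy to see in code. Below is a simplified sketch of ROUGE-1 in Python (no stemming or stopword handling, which production implementations may apply); it illustrates the metric, not the exact algorithm Bedrock runs:

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1(
    "Q3 revenue grew 15% on cloud sales",
    "Q3 revenue grew 15% driven by cloud sales",
)
# 6 of the 8 reference unigrams appear in the candidate, so recall is 0.75.
```

Intuitively, recall asks "how much of the reference did the model cover?", while precision asks "how much of the model's output is supported by the reference?" BLEU works similarly but is precision-oriented and combines several n-gram sizes.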

Step-by-Step Instructions

Step 1: Create an S3 Bucket for Datasets and Results

First, we need a location to store our benchmark datasets and the output metrics from our evaluation job.

```bash
aws s3 mb s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID> --region us-east-1
```

[!TIP] Replace <YOUR_ACCOUNT_ID> with your actual 12-digit AWS account ID to ensure the bucket name is globally unique.

Console alternative
  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket.
  3. Name it brainybee-lab-eval-<YOUR_ACCOUNT_ID> and select us-east-1.
  4. Leave all other settings as default and click Create bucket.

Step 2: Create and Upload the Benchmark Dataset

Automated evaluation requires a dataset containing a prompt and a reference response. We will create a JSON Lines (.jsonl) file focusing on a summarization task to test the ROUGE metric.

Create a local file named dataset.jsonl:

```bash
cat <<'EOF' > dataset.jsonl
{"prompt": "Summarize the following text: The company's Q3 revenue increased by 15% due to strong cloud sales, though hardware sales declined.", "referenceResponse": "Q3 revenue grew 15% driven by cloud sales, offsetting hardware declines."}
{"prompt": "Summarize the following text: The protagonist travels across the desert for forty days, only to find that the hidden city was a mirage created by the heat.", "referenceResponse": "The protagonist journeys through the desert for forty days to discover the hidden city is just a mirage."}
EOF
```
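If you prefer to build the dataset programmatically, writing each record with `json.dumps` guarantees that every line is valid JSON, which avoids a common source of `ValidationException` errors later. A minimal sketch:

```python
import json

# Each record pairs a prompt with the SME-written reference answer.
records = [
    {
        "prompt": "Summarize the following text: The company's Q3 revenue increased by 15% due to strong cloud sales, though hardware sales declined.",
        "referenceResponse": "Q3 revenue grew 15% driven by cloud sales, offsetting hardware declines.",
    },
    {
        "prompt": "Summarize the following text: The protagonist travels across the desert for forty days, only to find that the hidden city was a mirage created by the heat.",
        "referenceResponse": "The protagonist journeys through the desert for forty days to discover the hidden city is just a mirage.",
    },
]

# JSON Lines format: one JSON object per line, no surrounding [ ] brackets.
with open("dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```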

Upload this file to your S3 bucket:

```bash
aws s3 cp dataset.jsonl s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/datasets/dataset.jsonl
```

Step 3: Configure the IAM Role for Bedrock Evaluation

Amazon Bedrock requires permissions to read your dataset from S3 and write the performance score back to S3.

[!IMPORTANT] Creating IAM roles for Bedrock via CLI requires specific trust policies. For this step, using the AWS Console is highly recommended as it auto-generates the correct policies.

Console path (Recommended)

We will let the Bedrock console create the IAM role automatically in Step 4. You do not need to do anything manually here. The wizard will create a role named AmazonBedrock-EvaluationJob... with the required s3:GetObject and s3:PutObject permissions.

If you must use the CLI, you will need to create a trust policy for bedrock.amazonaws.com and attach an S3 read/write policy to it.
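For reference, that trust policy grants the Bedrock service permission to assume the role. The sketch below builds the standard IAM trust-policy JSON in Python; the structure follows the generic IAM format, so verify it against what the console actually generates for your account:

```python
import json

# Trust policy allowing the Amazon Bedrock service to assume the evaluation role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

You would save this output to a file and pass it via `aws iam create-role --assume-role-policy-document file://trust-policy.json`, then attach an S3 read/write policy scoped to your lab bucket.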

Step 4: Run the Automated Model Evaluation Job

Now, we will evaluate the FM's ability to summarize text. The job will generate its own answers and compare them to our reference answers using metrics like ROUGE, BLEU, and BERTScore.

Console path (Recommended)
  1. Navigate to Amazon Bedrock > Model evaluation in the AWS Console.
  2. Click Create evaluation.
  3. Choose Automated evaluation.
  4. Job name: brainybee-summarization-eval
  5. Task type: Select Summarization (This task natively triggers ROUGE metrics).
  6. Model: Select Titan Text G1 - Express.
  7. Dataset: Choose Provide your own dataset.
    • S3 URI: s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/datasets/dataset.jsonl
    • Prompt column: prompt
    • Reference column: referenceResponse
  8. Evaluation results: Specify your bucket: s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/results/
  9. IAM role: Select Create a new service role.
  10. Click Create.

To view the job status via CLI:

```bash
aws bedrock list-evaluation-jobs --query "jobSummaries[*].[jobName,status]" --output table
```
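If you'd rather check a specific job from a script, a small helper can pull one job's status out of the `jobSummaries` list (the same structure the CLI query above reads). The sample response below is illustrative, not real API output:

```python
def job_status(list_response: dict, job_name: str):
    """Return the status of the named evaluation job, or None if it isn't listed."""
    for summary in list_response.get("jobSummaries", []):
        if summary.get("jobName") == job_name:
            return summary.get("status")
    return None

# Illustrative response shape, mirroring the jobSummaries[*].[jobName,status] query.
sample = {"jobSummaries": [{"jobName": "brainybee-summarization-eval", "status": "InProgress"}]}
print(job_status(sample, "brainybee-summarization-eval"))
```

In practice you would feed this function the response from a boto3 `bedrock` client's `list_evaluation_jobs` call instead of the hardcoded sample.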

Step 5: Analyze the Performance Score

Once the job status changes to Completed (this takes about 5-10 minutes), you can retrieve the performance score and metrics.

```bash
aws s3 ls s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/results/ --recursive
```

You should see a metrics.json file in your S3 output path.

Download the metrics to view the evaluation:

```bash
aws s3 cp s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/results/<JOB_ID>/metrics.json .
cat metrics.json
```

Inside the JSON, you will see scores for ROUGE-N and BERTScore. These provide a quantitative measure of the FM's performance on your benchmark dataset.
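The exact layout of metrics.json depends on the job configuration, so rather than assume a schema, a small recursive scan can surface every numeric score the file contains:

```python
import json

def collect_scores(node, prefix=""):
    """Recursively collect all numeric leaves from a parsed JSON document."""
    scores = {}
    if isinstance(node, dict):
        for key, value in node.items():
            scores.update(collect_scores(value, f"{prefix}{key}."))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            scores.update(collect_scores(value, f"{prefix}{i}."))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        scores[prefix.rstrip(".")] = node
    return scores

# Hypothetical shape for illustration only; inspect your real metrics.json.
sample = {"rouge": {"rouge1": 0.62, "rouge2": 0.41}, "bertscore": 0.88}
print(collect_scores(sample))
```

To run it against the downloaded file, replace the sample with `json.load(open("metrics.json"))`.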


Checkpoints

After completing the steps above, run the following verification commands:

  1. Verify S3 Upload:

    ```bash
    aws s3 ls s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>/datasets/
    ```

    Expected Output: Shows dataset.jsonl with a file size > 0.

  2. Check Job Status:

    ```bash
    aws bedrock list-evaluation-jobs --max-items 1
    ```

    Expected Output: A JSON block showing "status": "Completed" (or "InProgress").


Troubleshooting

| Error / Issue | Cause | Fix |
| --- | --- | --- |
| `AccessDenied` when running Bedrock job | IAM role lacks S3 permissions. | Ensure the auto-created Bedrock role has access to the specific S3 bucket you created. |
| `ValidationException` on dataset | The .jsonl file is formatted incorrectly. | Ensure each line is a valid JSON object. Do not use standard JSON array brackets `[ ]` at the start/end of the file. |
| `ModelNotAvailableException` | You haven't requested access to Titan Text. | Go to Bedrock > Model access > Manage model access, and request access to Titan Text G1 - Express. |
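The `ValidationException` row is worth automating: a quick validator can confirm that every line of the dataset is a standalone JSON object with the two required keys before you submit the job. A self-contained sketch (it writes its own demo file, with one deliberately broken line, so you can see the output shape):

```python
import json

REQUIRED_KEYS = ("prompt", "referenceResponse")

def validate_jsonl(path: str) -> list:
    """Return a list of (line_number, problem) tuples; empty means the file looks valid."""
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, not flagged
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            for key in REQUIRED_KEYS:
                if key not in record:
                    problems.append((lineno, f"missing key: {key}"))
    return problems

# Demo file: one valid record, one record missing referenceResponse.
with open("demo.jsonl", "w") as f:
    f.write(json.dumps({"prompt": "p", "referenceResponse": "r"}) + "\n")
    f.write('{"prompt": "only a prompt"}\n')

print(validate_jsonl("demo.jsonl"))
```

Run it against dataset.jsonl before uploading; an empty list means the file should pass Bedrock's format check.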

Clean-Up / Teardown

[!WARNING] Remember to run the teardown commands to avoid ongoing charges. S3 storage incurs costs over time.

  1. Delete the S3 Bucket and all its contents:

    ```bash
    aws s3 rm s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID> --recursive
    aws s3 rb s3://brainybee-lab-eval-<YOUR_ACCOUNT_ID>
    ```
  2. Delete the IAM Role (if using Console auto-creation):

    ```bash
    aws iam delete-role --role-name <NAME_OF_AUTOCREATED_ROLE>
    ```

    (Note: You must detach the policy from the role before deleting it via CLI, or simply delete it via the IAM Console).

  3. Bedrock Jobs: Evaluation jobs in Bedrock do not incur ongoing costs once completed. They remain in your job history.
