
Curriculum Overview: Evaluating FM Applications (RAG, Agents, Workflows)

AWS Certified AI Practitioner (AIF-C01) exam objective: Identify approaches to evaluate the performance of applications built with FMs (for example, RAG, agents, workflows).

Welcome to the curriculum overview for evaluating the performance of applications built with Foundation Models (FMs). Evaluating generative AI is fundamentally different from traditional software testing. Because models are non-deterministic and creative, we must rely on a blend of automated metrics, judge models, and human-in-the-loop assessments to ensure applications like Retrieval-Augmented Generation (RAG) and Agentic workflows are production-ready.


Prerequisites

Before diving into this curriculum, learners must possess foundational knowledge in the following areas:

  • Generative AI Fundamentals: Understanding tokens, embeddings, prompt engineering, and the role of transformer-based LLMs.
  • Vector Search & Embeddings: Familiarity with how data is transformed into numerical vectors and stored in specialized databases (e.g., Amazon OpenSearch, pgvector in Amazon Aurora).
  • RAG Architecture: A conceptual grasp of how Retrieval-Augmented Generation pairs a retrieval mechanism with an FM to ground responses in external knowledge.
  • Basic ML Evaluation: Understanding traditional evaluation metrics such as Accuracy, F1 Score ($F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$), and Area Under the Curve (AUC).
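The F1 formula above translates directly into code; a minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.5))  # roughly 0.615
```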

[!IMPORTANT] If you are not yet familiar with the difference between a Foundation Model's base weights and in-context learning (like RAG), review those concepts before beginning this evaluation curriculum.


Module Breakdown

This curriculum is structured to take you from evaluating raw text output to assessing complex, multi-step AI agents.

| Module | Focus Area | Difficulty | Key Concepts Covered |
|---|---|---|---|
| Module 1 | Base FM Evaluation | Beginner | ROUGE, BLEU, BERTScore, Human Evaluation, Automated Benchmarks |
| Module 2 | RAG Application Evaluation | Intermediate | Context Relevance, Context Coverage, Correctness, Completeness, Faithfulness |
| Module 3 | Evaluating AI Agents | Advanced | Orchestration, Multi-step logic, Task completion rates, Tool use accuracy |
| Module 4 | Business Value & ROI | Intermediate | User engagement, Productivity, TCO (Total Cost of Ownership), Risk mitigation |

Curriculum Learning Path

[Diagram: curriculum learning path]

Learning Objectives per Module

Module 1: Base FM Evaluation

  • Identify quantitative metrics used to evaluate text summarization and translation (ROUGE for n-gram overlap, BLEU for phrase precision, and BERTScore for deep contextual embeddings).
  • Differentiate between automated benchmarking and human evaluation (focus groups, emotional intelligence checks, bias detection).
  • Design a "Judge Model" workflow where an AI evaluates another AI's outputs against Subject Matter Expert (SME) reference answers.
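The n-gram metrics above can be sketched without external libraries. Below is a simplified ROUGE-N recall (the fraction of reference n-grams recovered by the candidate); real ROUGE also reports precision and F-measure, and production work would use an established implementation such as the Hugging Face `evaluate` library.

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N recall: share of reference n-grams found in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    # Clipped counts: each reference n-gram is matched at most as often as it occurs.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

print(rouge_n("the cat lay on the mat", "the cat sat on the mat"))  # 5 of 6 unigrams
```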

Module 2: Evaluating RAG Applications

  • Measure Retrieval Quality: Assess Context Relevance (did we fetch the right data?) and Context Coverage (did we fetch all of it?).
  • Measure Generation Quality: Assess Faithfulness (is the answer strictly derived from the retrieved context without hallucinations?) and Completeness (does it fully answer the user's prompt?).
  • Implement automated evaluation frameworks using services like Amazon Bedrock Model Evaluation.
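As a toy illustration of the faithfulness idea, the sketch below scores the fraction of answer tokens that appear in the retrieved context. Real evaluators (including managed services like Amazon Bedrock Model Evaluation) rely on judge models or embeddings rather than lexical overlap; the stop-word list here is an arbitrary assumption.

```python
def token_support(answer: str, context: str) -> float:
    """Toy faithfulness proxy: share of non-trivial answer tokens present in the context."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in"}
    ctx = set(context.lower().split())
    tokens = [t for t in answer.lower().split() if t not in stop]
    if not tokens:
        return 1.0
    return sum(t in ctx for t in tokens) / len(tokens)

ctx = "refunds are processed within 14 days of purchase"
print(token_support("refunds processed within 14 days", ctx))  # fully supported
print(token_support("refunds processed within 30 days", ctx))  # "30" is unsupported
```

A low score flags answers that drift from the retrieved passages, i.e. candidate hallucinations worth a closer look.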

Module 3: Evaluating Agentic Workflows

  • Evaluate multi-step task execution using Amazon Bedrock Agents.
  • Debug model routing and logical reasoning capabilities, especially when the agent is required to invoke external APIs or search internal documents.
  • Assess error recovery: How well does the agent handle API failures or ambiguous user prompts?
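A minimal harness for these agent metrics might aggregate per-task execution traces. The trace fields (`expected_tool`, `called_tool`, `completed`) are hypothetical names for illustration, not an Amazon Bedrock Agents schema.

```python
# Hypothetical agent traces: each records the tool the agent should have invoked,
# the tool it actually invoked, and whether the end-to-end task succeeded.
traces = [
    {"expected_tool": "search_docs", "called_tool": "search_docs", "completed": True},
    {"expected_tool": "hr_api",      "called_tool": "search_docs", "completed": False},
    {"expected_tool": "hr_api",      "called_tool": "hr_api",      "completed": True},
    {"expected_tool": None,          "called_tool": None,          "completed": True},
]

task_completion_rate = sum(t["completed"] for t in traces) / len(traces)
tool_accuracy = sum(t["called_tool"] == t["expected_tool"] for t in traces) / len(traces)
print(task_completion_rate, tool_accuracy)  # 0.75 0.75
```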

Module 4: Business Value & ROI

  • Correlate technical metrics (like BERTScore) with business metrics (efficiency, conversion rates, customer lifetime value).
  • Determine if an FM effectively meets overarching business objectives, balancing latency, accuracy, and operational costs.
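As a sketch of the cost side of ROI, the back-of-envelope arithmetic below computes cost per resolved query. Every figure is an invented assumption for illustration, not AWS pricing.

```python
# Illustrative assumptions (not real pricing): token costs, volume, and resolution rate.
queries_per_month = 50_000
avg_tokens_per_query = 1_500
cost_per_1k_tokens = 0.003   # assumed blended inference price, USD
resolution_rate = 0.82       # fraction of queries resolved without human escalation

monthly_inference_cost = queries_per_month * avg_tokens_per_query / 1_000 * cost_per_1k_tokens
cost_per_resolved_query = monthly_inference_cost / (queries_per_month * resolution_rate)
print(round(monthly_inference_cost, 2), round(cost_per_resolved_query, 4))
# roughly $225/month, about half a cent per resolved query
```

Comparing this figure against the cost of a human handling the same query is one concrete way to connect evaluation work to business value.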

The "Judge Model" Pattern

A common pattern in evaluating complex LLM applications is using a secondary, highly capable LLM as a "Judge."

  1. FM Processing: The primary model generates an answer.
  2. Judge Model: The evaluator AI compares the generated answer to an SME's reference answer.
  3. Performance Score: The Judge assigns a quantitative score based on criteria like relevance and safety.
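The steps above can be sketched as prompt construction plus score parsing. The rubric wording and JSON shape are assumptions for illustration; the prompt would be sent to the judge model through whatever inference API you use (for example, Amazon Bedrock), which is omitted here.

```python
import json

def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    """Assemble a rubric prompt for a judge LLM (wording is illustrative)."""
    return (
        "You are an impartial evaluator. Score the candidate answer against the "
        "reference answer on relevance and safety, from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        'Respond with JSON only, e.g. {"relevance": 4, "safety": 5}.'
    )

def parse_judge_scores(model_output: str) -> dict:
    """Parse the judge's JSON reply into integer scores."""
    return {k: int(v) for k, v in json.loads(model_output).items()}

# Simulated judge reply (a real call to the judge model is omitted):
print(parse_judge_scores('{"relevance": 4, "safety": 5}'))
```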

Success Metrics

How will you know you have mastered this curriculum? Mastery is achieved when you can successfully:

  1. Select the Right Metric: Given a use-case (e.g., machine translation vs. document summarization), immediately identify whether BLEU, ROUGE, or a semantic metric like BERTScore is appropriate.
  2. Diagnose RAG Failures: Look at a poor output from a RAG application and correctly determine if the failure was in the retrieval phase (poor vector search) or the generation phase (hallucination or lack of faithfulness).
  3. Calculate Application ROI: Articulate the cost-tradeoffs of custom evaluation workflows versus out-of-the-box managed tools like Amazon Bedrock.

Visualizing RAG Evaluation Focus Areas

[Diagram: RAG evaluation focus areas]

Real-World Application

Why does this matter for your career as an AI Practitioner?

Moving a Generative AI application from a "Proof of Concept" (PoC) to a production environment is the hardest part of the AI lifecycle. In a sandbox, a chatbot hallucinating a feature might be funny; in production, hallucinating a refund policy can cost a company millions of dollars and damage its reputation.

By mastering the evaluation of FM applications, you will be the engineer who ensures safety, reliability, and business alignment.

[!TIP] Real-World Scenario: You deploy an Amazon Bedrock Agent to help internal HR staff retrieve benefits information. Users complain the bot is "unhelpful." Using the skills from this curriculum, you apply automated RAG evaluation. You discover the Generation Correctness is high, but Context Coverage is low. You realize the chunking strategy on the vector database is cutting off the bottom half of the policy documents. You fix the chunking, evaluation scores rise, and user satisfaction goes up. That is the power of systematic AI evaluation!
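The chunking failure described in the tip above is easy to reason about with a minimal fixed-size chunker. The `chunk_size` and `overlap` values are illustrative; real pipelines typically split on sentence or token boundaries instead of raw characters.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks. Overlap ensures content near a
    boundary appears intact in at least one chunk, so retrieval does not lose it."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(chr(97 + i % 26) for i in range(250))  # stand-in policy document
print(len(chunk_text(doc)))  # 3 overlapping chunks cover all 250 characters
```

Too large a `chunk_size` with a hard document cutoff, or zero overlap, is exactly the kind of defect that shows up as low Context Coverage while Generation Correctness stays high.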

This robust evaluation framework ensures ethical, responsible, and highly effective AI systems, bridging the gap between raw technological capability and actual enterprise value.
