[Speaker Report] JAWS-UG Osaka re:Invent re:Cap Lightning Talk Event: I gave a presentation titled "Let's evaluate AI agents with Amazon Bedrock AgentCore Evaluations!" with forced termination if UFOs appear!

2026.01.26


Introduction

Hello, I'm Kanno from the Consulting Department, and I'm a big fan of La Mu supermarket.

I gave a presentation titled "Let's Evaluate AI Agents with Amazon Bedrock AgentCore Evaluations!" at the JAWS-UG Osaka re:Invent re:Cap LT Conference held on Monday, January 26, 2026!

This event was a re:Invent 2025 recap Lightning Talk conference, with presentation slots of 1, 3, or 5 minutes to choose from. I presented in the 5-minute slot. There was a unique rule that your talk would be forcibly ended if the UFO team appeared (finishing too early was also not allowed), so I was nervous about fitting within the time limit...!

Actually, re:Invent 2025 was my first time attending! Good memories include watching The Wizard of Oz at the Sphere and a kind fellow passenger retrieving the earphones I dropped on the airplane.

Among all the announcements, I personally wanted to talk about Amazon Bedrock AgentCore Evaluations, which I introduced in this Lightning Talk.

Presentation Materials

What is Amazon Bedrock AgentCore?

Before discussing Amazon Bedrock AgentCore Evaluations, I'll briefly touch on Amazon Bedrock AgentCore.

Amazon Bedrock AgentCore is a managed service that makes it easy to build, deploy, and operate AI agents.

CleanShot 2026-01-26 at 15.34.24@2x

For example, when creating an AI agent with knowledge about supermarkets, you can host the agent on AgentCore Runtime and implement it with the following configuration:

CleanShot 2026-01-26 at 15.18.51@2x

For a question like "Tell me about an affordable and recommended supermarket," the agent would use tools to research the knowledge base and respond with something like "The recommended supermarket is La Mu."
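
In code, the agent side of this configuration can be put together quite compactly. Here's a minimal sketch (not the actual code from the repository introduced later), assuming the Strands Agents SDK, the retrieve tool from strands-agents-tools, and the AgentCore Python SDK; the knowledge base ID and the {"prompt": ...} payload shape are placeholders:

import os

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent
from strands_tools import retrieve  # tool that searches a Bedrock Knowledge Base

# The retrieve tool reads the target knowledge base from an environment variable (placeholder ID here)
os.environ.setdefault("KNOWLEDGE_BASE_ID", "YOUR_KB_ID")

app = BedrockAgentCoreApp()

agent = Agent(
    system_prompt=(
        "You are an assistant that recommends supermarkets in the Kansai region. "
        "Always search the knowledge base before answering."
    ),
    tools=[retrieve],
)

@app.entrypoint
def invoke(payload):
    # The {"prompt": ...} payload shape is an assumption; align it with how you invoke the Runtime
    return {"result": str(agent(payload.get("prompt", "")))}

if __name__ == "__main__":
    app.run()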

Are You Evaluating Your AI Agents?

After creating an AI agent, are you evaluating it?

Not doing it... seems difficult... but I want to evaluate whether it's working properly and improve the AI agent I created...
Indeed, we don't want to just build it once and be done; we want to analyze and continuously update it, right?

Good news for those with such concerns!

There was a welcome update at re:Invent 2025. That is Amazon Bedrock AgentCore Evaluations (Preview).

https://dev.classmethod.jp/articles/agentcore-evaluations/

Amazon Bedrock AgentCore Evaluations

Amazon Bedrock AgentCore Evaluations is a feature that allows you to evaluate the AI agents you're developing and operating directly from the console.

Results can be easily checked from the dashboard, and the evaluation itself is LLM-based (LLM-as-a-Judge).

CleanShot 2026-01-26 at 14.53.00@2x

It's great that you can evaluate the interaction between users and AI agents, including tool usage, based on criteria, and check it in near real-time from the console!

Evaluation Methods

Since the evaluation is based on logs, it doesn't impact operational agents and can be viewed in near real-time on the console.

There are two evaluation methods:

Online Evaluation: Enables continuous, near real-time monitoring of agent quality. You can specify sampling rates and filter conditions, and the evaluation results can also be viewed from the Observability dashboard.
On-demand evaluation: Lets you evaluate on demand by specifying particular session IDs, etc. Easy to run with the Starter Toolkit.

It's nice that both methods don't affect the operational agent.

This time, I'll try Online Evaluation!

If you're interested in On-demand evaluation, there's a blog post that demonstrates it, which you can refer to as needed!

https://dev.classmethod.jp/articles/amazon-bedrock-agentcore-evaluations-on-demand-evaluation-starter-toolkit/

Trying It Out

Prerequisites

The source code used in this demonstration is available in the repository below:

https://github.com/yuu551/supermarket-agent-cdk

It uses CDK to deploy an AI agent with knowledge focused on supermarkets in the Kansai region.

Deploying the Agent

First, let's deploy the AI agent.

git clone https://github.com/yuu551/supermarket-agent-cdk.git
cd supermarket-agent-cdk

Install the necessary dependencies and deploy with CDK.

pnpm install
pnpm dlx cdk deploy

Once deployment is complete, a supermarket AI agent will be built on Amazon Bedrock AgentCore.

The configuration looks like this:

CleanShot 2026-01-26 at 15.21.16@2x

The stack uploads the supermarket knowledge documents and creates the knowledge base, but the data source is not synchronized yet, so I'll run the sync manually.

CleanShot 2026-01-26 at 15.05.54@2x
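
By the way, the sync can also be started from code instead of the console. A sketch with boto3 (the knowledge base ID and data source ID are placeholders; take them from the CDK outputs or the console):

import time
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder IDs: replace with the values from the deployed stack
KB_ID = "YOUR_KB_ID"
DS_ID = "YOUR_DATA_SOURCE_ID"

# Start the ingestion (sync) job for the data source
job = bedrock_agent.start_ingestion_job(knowledgeBaseId=KB_ID, dataSourceId=DS_ID)
job_id = job["ingestionJob"]["ingestionJobId"]

# Poll until the sync completes
while True:
    status = bedrock_agent.get_ingestion_job(
        knowledgeBaseId=KB_ID, dataSourceId=DS_ID, ingestionJobId=job_id
    )["ingestionJob"]["status"]
    print(status)
    if status in ("COMPLETE", "FAILED"):
        break
    time.sleep(10)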

Setting Up Evaluation in the Console

Configure the settings from the Amazon Bedrock AgentCore console. I selected the following three evaluation metrics:

Faithfulness: Evaluates whether the information in the response is supported by the provided context/sources
Goal success rate: Evaluates whether the conversation successfully achieved the user's goal
Tool selection accuracy: Evaluates whether the agent selected the appropriate tool for the task

CleanShot 2026-01-26 at 14.58.49@2x

From the Select evaluators section in the console, check these three options and save the settings, and you're good to go!

Asking Questions to the AI Agent

Once the configuration is complete, let's ask the AI agent some questions.
You can test it from the Agent Sandbox in the console.

CleanShot 2026-01-26 at 15.08.44@2x
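
You can also send questions to the deployed runtime directly from code rather than the console. A rough sketch with boto3 (the runtime ARN is a placeholder, and the {"prompt": ...} payload shape is an assumption that depends on the agent's entrypoint):

import json
import uuid
import boto3

agentcore = boto3.client("bedrock-agentcore")

resp = agentcore.invoke_agent_runtime(
    agentRuntimeArn="arn:aws:bedrock-agentcore:REGION:ACCOUNT_ID:runtime/your-runtime-id",  # placeholder
    runtimeSessionId=str(uuid.uuid4()),  # session IDs need to be sufficiently long and unique
    payload=json.dumps({"prompt": "Tell me about an affordable and recommended supermarket"}).encode(),
)

# For a non-streaming JSON response, read the body; streaming responses are handled differently
print(json.loads(resp["response"].read()))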

I asked one question the agent should be able to answer and one it should not:

Question 1: Tell me about an affordable and recommended supermarket
Question 2: Tell me about Costco

Question 1 can be answered because the information is in the knowledge base, but Question 2 about Costco should not be answerable since that information is not in the knowledge base.

CleanShot 2026-01-24 at 22.50.33@2x

It's properly recommending supermarkets based on the knowledge base.

Looking at the actual response to Question 2, the Costco question was met with "I apologize, but I couldn't find any information about Costco in the search results." As expected, it couldn't answer.

CleanShot 2026-01-26 at 15.10.26@2x

Viewing Evaluation Results

The evaluation summary for several questions can be viewed from the Gen AI Observability dashboard. The dashboard shows the average score for each evaluation metric.

CleanShot 2026-01-26 at 15.10.57@2x

It's great to be able to quickly see how well it's meeting the criteria!

Diving Deeper into the Evaluation Results

Let's break it down further to see specifically how the Costco question was evaluated. The evaluation results are stored in CloudWatch Logs, so let's look at GoalSuccessRate as an example.

You can check the log group from the link in the settings screen.

CleanShot 2026-01-26 at 15.12.03@2x
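
The same records can also be pulled from the log group with boto3, for example like this (the log group name below is a placeholder, so use the one linked from the settings screen):

import boto3

logs = boto3.client("logs")

# Placeholder name: use the log group shown in the Evaluations settings screen
LOG_GROUP = "/aws/bedrock-agentcore/evaluations/your-agent"

events = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    filterPattern='"GoalSuccessRate"',  # narrow down to the metric we want to inspect
    limit=10,
)

for event in events["events"]:
    print(event["message"])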

The logs were recorded as follows:

CleanShot 2026-01-25 at 16.52.48@2x

The translated evaluation result log is as follows:

The user asked about Costco in Japanese.

The AI assistant appropriately used the "retrieve" tool to search for information about Costco in the knowledge base.

As a result of the search, 5 documents with scores above 0.4 were found, but none of them contained specific information about Costco. Instead, information about various supermarkets in the Kansai region (La Mu, Tamade, Hankyu Oasis, Life, AEON, Konan) was displayed.

The AI assistant correctly interpreted the tool output and honestly informed the user that no direct information about Costco was found in the knowledge base.

The user's goal was to learn about Costco, but as confirmed by the tool output, the knowledge base does not contain that information, so the assistant could not directly fulfill this request.

The assistant responded professionally to this limitation by maintaining transparency while suggesting alternatives.

This question received a 0 for Goal Success Rate.
Indeed, the goal of providing information about Costco was not achieved. In this case, we'd simply want to add Costco information to the knowledge base.

On the other hand, it received a score of 1 for Faithfulness.

CleanShot 2026-01-25 at 16.59.31@2x

The user asked about "Costco". The assistant executed a search using the keyword "Costco", but the results returned were 5 pieces of information about various supermarkets in the Kansai region (La Mu, Life, AEON, Tamade, Konan, etc.) and contained no information about Costco.

The assistant's response accurately reflects this situation in the following ways:

It acknowledges that no direct information about Costco was included in the search results.
It correctly identifies the information actually found (information about Kansai region supermarkets such as La Mu, Tamade, Hankyu Oasis, Life, and AEON).
It explicitly states that Costco information is not available in the current knowledge base.
It suggests that if there are specific questions about Costco, it will provide support again.
The assistant's response is completely faithful to the course of the conversation so far.
It does not fabricate information about Costco, correctly conveys the output of the search tool, and appropriately acknowledges the limitations of the available data. There are no contradictions between the assistant's response and the conversation history.

From the perspective of faithfulness, it seems that the agent has met the criteria.

Insights from Evaluation Results

It's important to check not just the scores but also what the AI used as the basis for its judgments...!!

There can be gaps between the AI's evaluation criteria and human evaluation criteria, so combine manual review with the automated scores as needed! Analyze the results to identify bottlenecks and improve your AI agent's behavior!

Additional Note: Strands Agents Eval Feature

Strands Agents also has an evaluation feature (strands_evals), so I recommend trying it for cases where you want more sophisticated evaluation than AgentCore Evaluations!

You can implement it like this:

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import OutputEvaluator

# Define test cases
test_cases = [
    Case(name="knowledge-1", input="What is the capital of France?", expected_output="Paris"),
    Case(name="math-1", input="What is 5 × 12 × 1.08?", expected_output="64.8"),
    Case(
        name="knowledge-2",
        input="Who is Yudai Kanno?",
        expected_output="I don't know",
    ),
]

# Task function
def get_response(case: Case) -> str:
    agent = Agent(system_prompt="An assistant that provides accurate information")
    return str(agent(case.input))

# LLM Judge evaluator
evaluator = OutputEvaluator(rubric="Evaluate accuracy and completeness on a scale of 1.0-0.0")

# Run test
experiment = Experiment(cases=test_cases, evaluators=[evaluator])
reports = experiment.run_evaluations(get_response)
reports[0].run_display()

When you run it, the evaluation results are displayed like test runner output, which is easy to understand.

CleanShot 2026-01-25 at 17.10.59@2x

Conclusion

AI agents aren't just "build and forget"; this is a very welcome update for accumulating feedback and evaluations to make improvements and get closer to ideal behavior!
Let's use Evaluations to keep improving the AI agents we create!

At this JAWS-UG Osaka re:Invent re:Cap LT Conference, I finished with 7 seconds remaining, just slightly early, and was taken away by the UFO team at the end!
2 more seconds... so frustrating...
Thank you to all the organizers and everyone who listened!

If this presentation makes you want to learn more about AgentCore, I'd be very happy if you also read the following article!

https://dev.classmethod.jp/articles/amazon-bedrock-agentcore-2025-summary/

I hope this article was helpful. Thank you for reading until the end!
