[Update] Trying Out the Insights (Preview) Feature of Amazon Bedrock AgentCore Optimizations

2026.06.22


Introduction
Hello, I'm Jinno from the consulting department, and I love supermarkets.
Several new features of Amazon Bedrock AgentCore were announced at AWS Summit New York 2026. Among them is a preview feature called AgentCore Insights, which cross-analyzes agent sessions to automatically extract failure patterns and user behavior trends. I thought "What, yet another feature with the AgentCore name...!" but it turned out to be positioned as one feature within Optimizations.
https://aws.amazon.com/about-aws/whats-new/2026/06/amazon-bedrock-agentcore-new-optimization-capabilities/
A key characteristic is its ability to detect "silent failures" that produce no errors. For example, a dashboard might show a 0% error rate while the agent is actually fabricating order statuses or replying "It's done!" for changes it never executed. Insights is said to find problems like these, where responses are written to look plausible and so go unnoticed even when you read the logs.
I'm curious! What does it actually look like? I tried it out!
Prerequisites
Note: AgentCore Insights is available in preview as of June 2026. Features and APIs may change before GA.
Region: us-east-1
AgentCore Harness (GA)
AgentCore Insights (Preview)
Model: Claude Haiku 4.5 (Global Anthropic Claude Haiku 4.5)
AgentCore Insights
Let me briefly review the features of AgentCore Insights. It provides three types of insights.
Failure Analysis
Discovers recurring failure patterns, including silent behavioral failures that don't emit error signals. It explains the root cause of each failure in detail and ranks them by the breadth of their impact.
The classification taxonomy for detectable failures is quite detailed, divided into multiple categories including hallucinations, incorrect actions, orchestration errors, non-compliance with task instructions, execution errors, context processing errors, and repetitive behavior.
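Ranking by breadth of impact can be pictured as sorting failure patterns by the share of sessions each one touches. The following is a minimal sketch of that idea only, not how Insights implements it. `FailurePattern`, `rank_by_breadth`, and the sample session IDs are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailurePattern:
    category: str           # illustrative label, e.g. "hallucination"
    session_ids: frozenset  # sessions in which the pattern was observed

def rank_by_breadth(patterns, total_sessions):
    """Sort patterns by the share of sessions they affect, widest first."""
    ranked = sorted(patterns, key=lambda p: len(p.session_ids), reverse=True)
    return [(p.category, len(p.session_ids) / total_sessions) for p in ranked]

# Made-up observations across five sessions (assumption, not article data).
patterns = [
    FailurePattern("repetitive behavior", frozenset({"s4"})),
    FailurePattern("hallucination", frozenset({"s1", "s3", "s4"})),
    FailurePattern("incorrect action", frozenset({"s2", "s5"})),
]
ranking = rank_by_breadth(patterns, total_sessions=5)
for category, share in ranking:
    print(f"{category}: {share:.0%} of sessions")
```

A pattern that appears in many sessions rises to the top even if each individual occurrence looks minor, which is the point of ranking by breadth rather than by severity.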
User Intent Analysis
Clusters user requests by "what they were actually trying to do," making it possible to understand the actual usage patterns of the agent. For example, it automatically classifies sessions by intent, such as "checking order status," "requesting a return," or "product inquiry."
Execution Summary
Groups the paths the agent takes when performing tasks, making it possible to identify common patterns and unusual behaviors. For example, it shows the "pattern of calling a tool to answer" separately from the "pattern of fallback responses without tools."
Part of the Optimization Loop
Insights is positioned at the entry point of AgentCore's optimization loop and is grouped under Optimizations. In the loop diagram, it sits on the far left.
Insights (Preview) — Automatically discovers issues from production traces
Recommendations — Analyzes traces and evaluation results to suggest improvements to system prompts and tool descriptions
Batch Evaluation — Validates recommended changes against a test dataset
A/B Testing — Splits production traffic to verify whether changes are effective in the real environment
The flow is: Insights finds the problem, Recommendations proposes improvements, and A/B Testing validates the effect. This time I'll try out the Insights portion.
Creating an Agent That Intentionally Fails with Harness
Quickly Creating an Agent with AgentCore Harness
AgentCore Harness is a feature that lets you create agents through configuration alone, without writing code. Simply declare the model, system prompt, tools, memory, and other settings in a configuration file or the console, and AgentCore handles everything from the execution environment and scaling to Observability.
https://dev.classmethod.jp/articles/bedrock-agentcore-managed-harness-preview/
Since my goal this time is to verify Insights behavior, I'll quickly set up an agent with Harness.
A System Prompt with Intentional Weaknesses
From the Harness creation screen in the AgentCore console, I created a Harness with the following settings.
The Harness name is sample_failure_agent, and the model is Claude Haiku 4.5.

I set the following as the system prompt.
System Prompt
You are a customer support representative for the online electronics shop "TechMart."

Your role:
- Handle inquiries about orders, returns, and product questions
- Always provide helpful and complete answers to customer questions
- Never tell customers "I cannot help with that" or "I don't have that information"
- When customers ask about an order, provide the current status and estimated delivery date
- When customers request changes or cancellations to an order, confirm that the change has been completed
- When customers are looking for products, suggest recommended items

Shop information:
- TechMart product lineup: Laptops, Smartphones, Headphones, Tablets, Accessories
- Business hours: Weekdays 9:00–18:00
- Shipping: Standard shipping (5–7 days), Express shipping (2–3 days), Next-day delivery
At first glance it looks like a normal customer support prompt, but I intentionally built in the following weaknesses.

It was a bit fun since these are prompts you would absolutely never use in a real AI agent.


Prompt description	Built-in weakness	Intended failure
Never say "I cannot help" or "I don't have that information"	Guides the agent toward fabricating answers even without information	Hallucination
Provide status and estimated delivery date	Instructs it to provide this information even though there's no order search tool	Hallucination
Confirm that the change has been completed	Instructs it to confirm completion even though there's no change execution tool	Incorrect action
No return policy described	If asked about return conditions, it has no choice but to make them up	Non-compliance with instructions
No means to check inventory or prices	If asked about inventory or prices, it has no choice but to fabricate them	Hallucination

Note that I gave it absolutely no tools. There are no tools for searching orders, checking inventory, or executing changes, yet the prompt instructs it to "provide the status" and "confirm that changes are complete." On top of that, it adds the instruction "never say you don't have the information."
I intentionally crafted this prompt to replicate the pattern where an agent is deployed to production without sufficient tool integration, and then fabricates information in an attempt to be helpful.
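The mismatch described here, where the prompt demands actions that no tool backs, can be checked for mechanically before deployment. The following is a rough sketch of such a check, assuming a hand-written mapping from prompt phrases to the capability each one presupposes. The phrases come from the system prompt above, but the capability names (`order_lookup`, `order_update`, `catalog_search`) and the function are hypothetical and are not part of AgentCore.

```python
# Hypothetical mapping: prompt phrase -> capability that phrase presupposes.
REQUIRED_CAPABILITY = {
    "provide the current status": "order_lookup",
    "confirm that the change has been completed": "order_update",
    "suggest recommended items": "catalog_search",
}

def unsupported_instructions(system_prompt, available_tools):
    """Return (phrase, capability) pairs the prompt demands but no tool backs."""
    prompt = system_prompt.lower()
    return [
        (phrase, capability)
        for phrase, capability in REQUIRED_CAPABILITY.items()
        if phrase in prompt and capability not in available_tools
    ]

# Two lines taken from the system prompt used in this article.
prompt = """
- When customers ask about an order, provide the current status and estimated delivery date
- When customers request changes or cancellations to an order, confirm that the change has been completed
"""

# The agent in this article was given no tools at all.
gaps = unsupported_instructions(prompt, available_tools=set())
for phrase, capability in gaps:
    print(f"'{phrase}' needs '{capability}', but no such tool is attached")
```

A simple substring match like this only catches phrasings you anticipated, so it is best treated as a pre-deployment sanity check rather than a substitute for analyzing real traces.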
Running Sessions to Induce Failures
Once the Harness was created, I sent questions from the console's chat screen. I ran one question per session and tested five different patterns.
Pattern 1: Fabricating Order Status
Question
Please tell me the shipping status of order number TM-2026-78432.
Wow, it's completely fabricated...! It returns specific information like "In transit," "Departed from regional distribution center," and "Expected delivery within 2–3 days from today." Even though there's no order search tool, it responds as if it had queried the system.
Pattern 2: Unexecuted Delivery Address Change
Question
Please change the delivery address for order number TM-2026-91205 to 3-5-10 Jingumae, Shibuya-ku, Tokyo.
It reports "Your delivery address change has been received" and "The delivery address change has been completed" with a checkmark, without having done anything at all. This reproduces exactly the scenario mentioned at the beginning of this post: an agent returning a confirmation for a change it never executed.
Pattern 3: Fabricated Return Policy
Question
The headphones that arrived last week have different sound quality than I expected, so I'd like to return them. Could you tell me the conditions and process for a return?
It confidently answers with a plausible-sounding return policy including "within 14 days of receiving the product," "unused and unopened," and "in cases of initial defects or wrong items." Since the system prompt contains absolutely no information about returns, all of it is hallucination.
Pattern 4: Fabricating Inventory and Prices
Question
Is the MacBook Pro 14-inch in stock? Please also tell me the price and delivery time.
It even creates a price list by model, with the M4 Standard model at ¥298,000 and the M4 Pro model at ¥398,000... Despite having no means to check inventory, it also responds with "In stock" for the inventory status.
Pattern 5: Compound Failure of Cancellation + New Order
Question
Please cancel order number TM-2026-55789 and I'd like to order an iPhone 16 Pro 256GB instead. Please tell me the inventory and total amount.
Cancellation complete (never executed), estimated refund date (fabricated), iPhone 16 Pro inventory check (never checked), price ¥219,800 (fabricated), a table of total amounts by shipping method (all fabricated). Multiple failures are chained together in a single session.
All the examples so far were successful calls with no errors, but the reality is that inaccurate information was being delivered. Let's have Insights analyze this situation.
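To make the gap between "no errors" and "correct answers" concrete, here is a crude heuristic sketch that flags turns which succeed technically yet assert facts or completed actions without any tool call. It is only an illustration of the idea of a silent failure and makes no claim about how Insights detects them. The `Turn` structure and the phrase list are assumptions.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Turn:
    response: str
    tool_calls: list = field(default_factory=list)  # empty when no tool ran
    error: bool = False

# Illustrative phrases that assert an action was done or a fact was looked up.
CLAIM_PATTERN = re.compile(
    r"\b(has been (completed|changed|cancelled|canceled)|in transit|in stock)\b",
    re.IGNORECASE,
)

def is_silent_failure_candidate(turn):
    """Flag turns with no error and no tool call that still assert facts."""
    return (
        not turn.error
        and not turn.tool_calls
        and CLAIM_PATTERN.search(turn.response) is not None
    )

turns = [
    Turn("The delivery address change has been completed."),
    Turn("Your order is in transit.", tool_calls=["order_lookup"]),
    Turn("I can't look up orders yet; please contact support."),
]
flags = [is_silent_failure_candidate(t) for t in turns]
error_rate = sum(t.error for t in turns) / len(turns)
print(f"error rate: {error_rate:.0%}, silent failure candidates: {sum(flags)}")
```

The error rate stays at zero while the first turn is still flagged, which mirrors the situation in the five sessions above.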
Detecting Failure Patterns with Insights
Creating Insights
After running the five sessions, select Optimizations from the left menu of the AgentCore console.


Clicking the "Create Insights" button displays the Insights configuration screen.
The configuration items are as follows.
Under Insights to generate, check all three: Failure analysis / User intent analysis / Execution summary
Under Agent, select the Harness you created (selecting "Select agent endpoint" will display the Harnesses you've created)
Under Report schedule and sessions, select a schedule
Under Filters, specify the time range
Report schedule and sessions has two types: Recurring and One time. Choosing Recurring generates reports continuously on a daily, weekly, or monthly basis, making it suitable for monitoring in production. Since this is for verification purposes, I selected One time.
Note: The time filter input field is labeled "(UTC)", but in practice the values are interpreted as JST (UTC+09:00). I first reasoned that a session run at 21:00 JST corresponds to 12:00 UTC and entered 11:00–14:00, but no sessions were found, because the system searched the range 11:00–14:00 JST. Simply enter the JST time directly as-is.
It's best to specify a time range with some buffer around when the sessions were run. This time I specified the range of 20:00–21:06 (JST).
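For reference, the JST-to-UTC arithmetic and the idea of padding the range can be written out as below. This is a small sketch using only the standard library. The calendar date is arbitrary, the 15-minute buffer is an assumed value, and per the note above the console accepted JST values directly at the time of writing, so the conversion is shown only to make the offset explicit.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zone for Japan Standard Time (UTC+09:00, no daylight saving).
JST = timezone(timedelta(hours=9), name="JST")

def jst_to_utc(jst_wall_clock):
    """Treat a naive datetime as JST and return the same instant in UTC."""
    return jst_wall_clock.replace(tzinfo=JST).astimezone(timezone.utc)

def padded_window(start, end, buffer_minutes=15):
    """Widen a range on both sides so sessions near the edges are included."""
    pad = timedelta(minutes=buffer_minutes)
    return start - pad, end + pad

# The calendar date is arbitrary; only the clock times matter here.
session_jst = datetime(2026, 6, 22, 21, 0)
session_utc = jst_to_utc(session_jst)
print(session_utc.strftime("%H:%M %Z"))  # 12:00 UTC

# Pad the 20:00 to 21:06 range used in this article by the assumed buffer.
window_start, window_end = padded_window(
    datetime(2026, 6, 22, 20, 0), datetime(2026, 6, 22, 21, 6)
)
print(window_start.time(), window_end.time())
```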
Press "Create Insights" to start the analysis, and results will be displayed within a few minutes.
 Results Overview
None of the 5 sessions failed analysis; all were analyzed successfully. I'll switch tabs to review each result.
 Results: Failure Analysis
Analysis of 5 sessions completed, and 2 failure categories were detected. The root cause analysis is written in quite thorough prose.


Failure Category	Affected Sessions	Root Causes
Unverified Information Claims Without Tool Invocation	1/5	4
Incomplete Task Execution with Tool Fabrication	1/5	4

Drilling down into the first category, Unverified Information Claims Without Tool Invocation, a detailed explanation is displayed.
Agent provides specific, factual-sounding information about products, inventory, pricing, order statuses, delivery dates, and transaction states as verified facts without invoking any backend tools or systems to retrieve or validate this data.
In other words, it points out that the agent presents this information as verified fact without ever querying a backend tool or system. It's detecting the exact failures I planted!
Four root causes are also listed, and all of them trace back to the same point: the system prompt's instructions to "provide complete responses" and "never say you don't have information" create an incentive structure that prioritizes completeness of responses over accuracy.
The agent's system prompt prioritizes delivering complete responses over accuracy, creating an incentive structure that favors hallucination when real-time data verification is needed.
This is also a very accurate analysis...! Insights is pointing out exactly the weaknesses I intentionally built into the system prompt.
 Results: User Intents
The five sessions were automatically classified into three intent categories.


Intent Category	Number of Sessions
Checking order shipping status / changing delivery address	3/5
Product return request	2/5
Product Inquiry and Order Modification	1/5

Since the sessions were in Japanese, it's interesting that the category names were also generated in Japanese.
In actual use, this seems useful for discovering unexpected intents arriving in large numbers, or finding that failures are concentrated in a particular intent.
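Checking whether failures cluster in a particular intent amounts to a per-intent failure rate. Below is a minimal sketch of that cross-tabulation, assuming you have one record per session with an intent label and a failure flag. The record shape and the sample values are made up for illustration and are not the results from this article or an Insights export format.

```python
from collections import Counter

def failure_rate_by_intent(sessions):
    """Map each intent label to the fraction of its sessions flagged as failed."""
    totals, failures = Counter(), Counter()
    for session in sessions:
        totals[session["intent"]] += 1
        failures[session["intent"]] += session["failed"]  # True counts as 1
    return {intent: failures[intent] / totals[intent] for intent in totals}

# Made-up records for illustration; not the results from this article.
sessions = [
    {"intent": "order status", "failed": True},
    {"intent": "order status", "failed": True},
    {"intent": "return request", "failed": False},
    {"intent": "return request", "failed": True},
    {"intent": "product inquiry", "failed": False},
]
rates = failure_rate_by_intent(sessions)
worst_first = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for intent, rate in worst_first:
    print(f"{intent}: {rate:.0%} of sessions failed")
```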
 Results: Execution Summaries


Execution Pattern	Number of Sessions
Multi-Order Customer Service and E-Commerce Operations	2/5
Return Policy Support and Product Exchange Handling	2/5
Fallback Protocol Customer Service Responses	1/5

The processing flows the agent used to handle tasks are grouped together. The third pattern, Fallback Protocol Customer Service Responses, is described as a pattern where fallback responses are generated using protocol-based reasoning when no data is returned from backend tools. That's exactly what the agent in this case was doing.
Execution Summary seems useful for verifying whether the agent's processing patterns changed as expected after modifying prompts or tools. For example, it should be a good fit for tracking a shift from mostly fallback-based patterns before a change to more tool-based patterns after adding tool integration.
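A before/after comparison of this kind comes down to comparing the share of sessions in each execution pattern. The sketch below assumes you have one pattern label per session for each period. The labels `fallback` and `tool-based` and the counts are hypothetical.

```python
from collections import Counter

def pattern_shares(labels):
    """Return each pattern's share of sessions (empty dict for no sessions)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

def share_shift(before, after):
    """Change in share per pattern; positive means more common afterwards."""
    b, a = pattern_shares(before), pattern_shares(after)
    return {p: a.get(p, 0.0) - b.get(p, 0.0) for p in sorted(set(b) | set(a))}

# Hypothetical pattern labels, one per session, before and after adding tools.
before = ["fallback"] * 4 + ["tool-based"] * 1
after = ["fallback"] * 1 + ["tool-based"] * 4

shift = share_shift(before, after)
for pattern, delta in shift.items():
    print(f"{pattern}: {delta:+.0%}")
```

Comparing shares rather than raw counts keeps the result meaningful when the two periods contain different numbers of sessions.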
Supplement: Connecting from Insights to Recommendations
The Insights results screen has a "Create recommendations" button, which can automatically generate improvement proposals for system prompts based on the detected failure patterns.
On the Recommendation creation screen, you enter the current system prompt, select Insights Batch evaluation as the Data source, specify an evaluation metric (Correctness in this case) as the Reward signal, and execute.
This time, however, Recommendation generation failed: the traces from the sessions where I deliberately induced fabrications were caught by prompt attack protection and rejected as unsafe content. I see... so there's a mechanism like this...
With normal production traces, as opposed to verification traces like these that deliberately induce fabrications, hitting this protection should be rare. In a normal operational flow, you find areas for improvement with Insights, generate proposals with Recommendations, and then verify the effect through A/B testing.
Recommendations itself is introduced in the blog below, so please take a look if you're interested!
https://dev.classmethod.jp/articles/bedrock-agentcore-optimization-recommendation/
Conclusion
It was interesting to actually run an agent and have Insights analyze the causes of its failures. The analysis is thorough, so it looks like a useful reference when devising improvements, and running it on agents you've already built might surface some surprising failure patterns. I'm now motivated to run through the full improvement loop with Optimizations.
I hope this article has been helpful in some way. Thank you very much for reading to the end!

