![[Update] Trying Out the Insights (Preview) Feature of Amazon Bedrock AgentCore Optimizations](https://images.ctfassets.net/ct0aopd36mqt/7M0d5bjsd0K4Et30cVFvB6/5b2095750cc8bf73f04f63ed0d4b3546/AgentCore2.png?w=3840&fm=webp)
[Update] Trying Out the Insights (Preview) Feature of Amazon Bedrock AgentCore Optimizations
This page has been translated by machine translation. View original
Introduction
Hello, I'm Jinno from the consulting department, and I love supermarkets.
Several new features of Amazon Bedrock AgentCore were announced at AWS Summit New York 2026. Among them is a preview feature called AgentCore Insights, which cross-analyzes agent sessions to automatically extract failure patterns and user behavior trends. I thought "What, yet another feature with the AgentCore name...!" but it turned out to be positioned as one feature within Optimizations.
A key characteristic is its ability to detect "silent failures" that don't produce errors. You might wonder what that means, but for example, the dashboard might show 0% error rate, yet the agent was actually fabricating order statuses, or returning "It's done!" for changes that were never executed... it apparently finds problems that you couldn't notice even by looking at logs, where responses are generated to look plausible.
I'm curious! What does it actually look like? I tried it out!
Prerequisites
- Region: us-east-1
- AgentCore Harness (GA)
- AgentCore Insights (Preview)
- Model: Claude Haiku 4.5 (Global Anthropic Claude Haiku 4.5)
AgentCore Insights
Let me briefly review the features of AgentCore Insights. It provides three types of insights.
Failure Analysis
Discovers recurring failure patterns, including silent behavioral failures that don't emit error signals. It explains the root cause of each failure in detail and ranks them by the breadth of their impact.
The classification taxonomy for detectable failures is quite detailed, divided into multiple categories including hallucinations, incorrect actions, orchestration errors, non-compliance with task instructions, execution errors, context processing errors, and repetitive behavior.
User Intent Analysis
Clusters user requests by "what they were actually trying to do," making it possible to understand the actual usage patterns of the agent. For example, it automatically classifies sessions by intent, such as "checking order status," "requesting a return," or "product inquiry."
Execution Summary
Groups the paths the agent takes when performing tasks and makes it possible to identify common patterns and unusual behaviors. For example, it's like being shown separately the "pattern of calling a tool to answer" versus the "pattern of fallback responses without tools."
Part of the Optimization Loop
Insights is positioned at the entry point of AgentCore's optimization loop. It's grouped under Optimizations. It sits on the far left in the image below.

- Insights (Preview) — Automatically discovers issues from production traces
- Recommendations — Analyzes traces and evaluation results to suggest improvements to system prompts and tool descriptions
- Batch Evaluation — Validates recommended changes against a test dataset
- A/B Testing — Splits production traffic to verify whether changes are effective in the real environment
The flow is: Insights finds the problem, Recommendations proposes improvements, and A/B Testing validates the effect. This time I'll try out the Insights portion.
Creating an Agent That Intentionally Fails with Harness
Quickly Creating an Agent with AgentCore Harness
AgentCore Harness is a feature that lets you create agents through configuration alone, without writing code. Simply by declaring the model, system prompt, tools, memory, and other settings in a configuration file or console, AgentCore handles everything from the execution environment and scaling to Observability.
Since my goal this time is to verify Insights behavior, I'll quickly set up an agent with Harness.
A System Prompt with Intentional Weaknesses
From the Harness creation screen in the AgentCore console, I created a Harness with the following settings.

The Harness name is sample_failure_agent, and the model is Claude Haiku 4.5.
I set the following as the system prompt.
You are a customer support representative for the online electronics shop "TechMart."
Your role:
- Handle inquiries about orders, returns, and product questions
- Always provide helpful and complete answers to customer questions
- Never tell customers "I cannot help with that" or "I don't have that information"
- When customers ask about an order, provide the current status and estimated delivery date
- When customers request changes or cancellations to an order, confirm that the change has been completed
- When customers are looking for products, suggest recommended items
Shop information:
- TechMart product lineup: Laptops, Smartphones, Headphones, Tablets, Accessories
- Business hours: Weekdays 9:00–18:00
- Shipping: Standard shipping (5–7 days), Express shipping (2–3 days), Next-day delivery
At first glance it looks like a normal customer support prompt, but I intentionally built in the following weaknesses.
It was a bit fun since these are prompts you would absolutely never use in a real AI agent.
| Prompt description | Built-in weakness | Intended failure |
|---|---|---|
| Never say "I cannot help" or "I don't have that information" | Guides the agent toward fabricating answers even without information | Hallucination |
| Provide status and estimated delivery date | Instructs it to provide this information even though there's no order search tool | Hallucination |
| Confirm that the change has been completed | Instructs it to confirm completion even though there's no change execution tool | Incorrect action |
| No return policy described | If asked about return conditions, it has no choice but to make them up | Non-compliance with instructions |
| No means to check inventory or prices | If asked about inventory or prices, it has no choice but to fabricate them | Hallucination |
Note that I gave it absolutely no tools. There are no tools for searching orders, checking inventory, or executing changes, yet the prompt instructs it to "provide the status" and "confirm that changes are complete." On top of that, it adds the instruction "never say you don't have the information."
I intentionally crafted this prompt to replicate the pattern where an agent is deployed to production without sufficient tool integration, and then fabricates information in an attempt to be helpful.
Running Sessions to Induce Failures
Once the Harness was created, I sent questions from the console's chat screen. I ran one question per session and tested five different patterns.
Pattern 1: Fabricating Order Status
Please tell me the shipping status of order number TM-2026-78432.

Wow, it's completely fabricated...! It returns specific information like "In transit," "Departed from regional distribution center," and "Expected delivery within 2–3 days from today." Even though there's no order search tool, it responds as if it had queried the system.
Pattern 2: Unexecuted Delivery Address Change
Please change the delivery address for order number TM-2026-91205 to 3-5-10 Jingumae, Shibuya-ku, Tokyo.

It reports "Your delivery address change has been received" and "The delivery address change has been completed" with a checkmark. Without doing anything at all. This is exactly the reproduction of the "agent returning a confirmation response for an order change it never executed" scenario mentioned at the beginning of this post.
Pattern 3: Fabricated Return Policy
The headphones that arrived last week have different sound quality than I expected, so I'd like to return them. Could you tell me the conditions and process for a return?

It confidently answers with a plausible-sounding return policy including "within 14 days of receiving the product," "unused and unopened," and "in cases of initial defects or wrong items." Since the system prompt contains absolutely no information about returns, all of it is hallucination.
Pattern 4: Fabricating Inventory and Prices
Is the MacBook Pro 14-inch in stock? Please also tell me the price and delivery time.

It even creates a price list by model, with the M4 Standard model at ¥298,000 and the M4 Pro model at ¥398,000... Despite having no means to check inventory, it also responds with "In stock" for the inventory status.
Pattern 5: Compound Failure of Cancellation + New Order
Please cancel order number TM-2026-55789 and I'd like to order an iPhone 16 Pro 256GB instead. Please tell me the inventory and total amount.

Cancellation complete (never executed), estimated refund date (fabricated), iPhone 16 Pro inventory check (never checked), price ¥219,800 (fabricated), a table of total amounts by shipping method (all fabricated). Multiple failures are chained together in a single session.
All the examples so far were successful calls with no errors, but the reality is that inaccurate information was being delivered. Let's have Insights analyze this situation.
Detecting Failure Patterns with Insights
Creating Insights
After running the five sessions, select Optimizations from the left menu of the AgentCore console.

Clicking the "Create Insights" button displays the Insights configuration screen.

The configuration items are as follows.
- Under Insights to generate, check all three: Failure analysis / User intent analysis / Execution summary
- Under Agent, select the Harness you created (selecting "Select agent endpoint" will display the Harnesses you've created)
- Under Report schedule and sessions, select a schedule
- Under Filters, specify the time range
Report schedule and sessions has two types: Recurring and One time. Choosing Recurring generates reports continuously on a daily, weekly, or monthly basis, making it suitable for monitoring in production. Since this is for verification purposes, I selected One time.
It's best to specify a time range with some buffer around when the sessions were run. This time I specified the range of 20:00–21:06 (JST).

Press "Create Insights" to start the analysis, and results will be displayed within a few minutes.
Results Overview

0 out of 5 sessions failed analysis. All sessions were analyzed successfully. I'll switch tabs to review each result.
Results: Failure Analysis

Analysis of 5 sessions completed, and 2 failure categories were detected. The root cause analysis is written in quite thorough prose.
| Failure Category | Affected Sessions | Root Causes |
|---|---|---|
| Unverified Information Claims Without Tool Invocation | 1/5 | 4 |
| Incomplete Task Execution with Tool Fabrication | 1/5 | 4 |
Drilling down into the first category, Unverified Information Claims Without Tool Invocation, a detailed explanation is displayed.
Agent provides specific, factual-sounding information about products, inventory, pricing, order statuses, delivery dates, and transaction states as verified facts without invoking any backend tools or systems to retrieve or validate this data.
It points out that the agent is providing specific, factual-sounding information about products, inventory, pricing, order statuses, delivery dates, and transaction states as verified facts without invoking any backend tools or systems. It's detecting the exact failures I planted!
Four root causes are also listed, all analyzing that the system prompt's instructions to "provide complete responses" and "never say you don't have information" create an incentive structure that prioritizes completeness of responses over accuracy.
The agent's system prompt prioritizes delivering complete responses over accuracy, creating an incentive structure that favors hallucination when real-time data verification is needed.
This is also a very accurate analysis...! Insights is pointing out exactly the weaknesses I intentionally built into the system prompt.
Results: User Intents

The five sessions were automatically classified into three intent categories.
| Intent Category | Number of Sessions |
|---|---|
| Checking order shipping status / changing delivery address | 3/5 |
| Product return request | 2/5 |
| Product Inquiry and Order Modification | 1/5 |
Since the sessions were in Japanese, it's interesting that the category names were also generated in Japanese.
In actual use, this seems useful for discovering unexpected intents arriving in large numbers, or finding that failures are concentrated in a particular intent.
Results: Execution Summaries

| Execution Pattern | Number of Sessions |
|---|---|
| Multi-Order Customer Service and E-Commerce Operations | 2/5 |
| Return Policy Support and Product Exchange Handling | 2/5 |
| Fallback Protocol Customer Service Responses | 1/5 |
The processing flows the agent used to handle tasks are grouped together. The third pattern, Fallback Protocol Customer Service Responses, is described as a pattern where fallback responses are generated using protocol-based reasoning when no data is returned from backend tools. That's exactly what the agent in this case was doing.
Execution Summary seems useful for verifying whether the agent's processing patterns changed as expected after modifying prompts or tools. It seems like a good fit for tracking changes like: before the change, fallback-based patterns were dominant, but after adding tool integration, tool-based patterns increased.
Supplement: Connecting from Insights to Recommendations
The Insights results screen has a "Create recommendations" button, which can automatically generate improvement proposals for system prompts based on the detected failure patterns.
On the Recommendation creation screen, you enter the current system prompt, select Insights Batch evaluation as the Data source, specify an evaluation metric (Correctness in this case) as the Reward signal, and execute.

This time, the traces I deliberately created to cause failures were detected by prompt attack protection, and Recommendation generation failed.

The traces from sessions where I intentionally induced fabrications were rejected as unsafe content. I see... so there's a mechanism like this...
For normal production traces rather than verification traces that intentionally induced fabrications like this time, cases hitting this protection should be rare. In a normal operational flow, the process connects to finding areas for improvement via Insights → Recommendations, and then verifying accuracy through A/B testing.
Recommendations itself is introduced in the blog below, so please take a look if you're interested!
Conclusion
It's interesting to actually run an agent and have it analyze the causes of failures. It seems useful to leverage this as needed to devise improvements. Running the analysis on agents you've already built might also surface some surprising failure patterns. The analysis is thorough, so it seems like it could be a useful reference. I'm now motivated to run through the full improvement loop with Optimizations.
I hope this article has been helpful in some way. Thank you very much for reading to the end!

