[SPS311] Building Operational Resilience for Workloads Using Generative AI #AWSreInvent

AWS re:Invent 2025
2025.12.04
This page has been translated by machine translation. View original
Hello.
I am Takaaki Tanaka from the Smart Factory team in the Technology Department of the Manufacturing Business.
I would like to organize and introduce the content of the workshop [SPS311] Building operational resilience using generative AI that I attended at AWS re:Invent 2024.
 Session OverviewBuilding operational resilience requires proactively identifying and mitigating risks. In this workshop, you'll learn how to use AWS managed Generative AI services in real-world scenarios to assess readiness, proactively improve architecture, respond quickly to events, troubleshoot issues, and implement effective observability practices. You'll also leverage AWS Countdown and the AWS Well-Architected Framework as reference frameworks when starting to use Generative AI services in operations. Through hands-on activities, you'll learn strategies for debugging problems, detecting anomalies and incidents, and optimizing architecture to improve workload resilience. Please bring a laptop to participate.
 Scenario and Workshop ObjectivesParticipants are set as key members of a Cloud Center of Excellence (CCoE) team. The mission is quite realistic, including:
About to release a business-critical internal application
Thousands of users across the organization will use it
Currently configured with a single EC2 instance, not production-ready
Expecting 2,500 connections per second during peak times on release day
Unclear how much the current configuration can handle
Limited time available for evaluation, redesign, and implementation
This is where AWS and the AI agent Kiro appear as technical partners.

The flow involves implementing the AWS Countdown process, onboarding incident detection and response systems, and using generative AI to accelerate analysis, automate evaluation, and make improvements in less time than traditional manual approaches.
 What We're BuildingStarting from an unoptimized single-instance configuration, we apply AWS Countdown and AWS Well-Architected Framework best practices to transform it into a production-ready, highly resilient system.
 Performance RequirementsProcess 2,500 simultaneous connections per second
Maintain average response time under 250 milliseconds
Achieve a success rate of 99.5% or higher
 Resilience RequirementsBe fault-tolerant against single Availability Zone (AZ) failures
Maximum downtime of 1 minute during AZ failover
RTO (Recovery Time Objective): 2 hours
RPO (Recovery Point Objective): 10 minutes
 Generative AI and MCPThis workshop provided hands-on experience with operational resilience using generative AI and MCP.
 Role of Generative AIGenerative AI creates new content such as text, code, and configuration files, going beyond standard analysis to produce human-like suggestions and outputs.
The workshop utilizes it particularly for:
Analyzing existing infrastructure
Suggesting improvements based on best practices
Generating specific implementation steps (code, configuration files, architecture diagrams that can be opened in Draw.io)
 What is MCPMCP (Model Context Protocol) is an open protocol that connects LLM applications with external data sources and tools.
The MCP server acts as a standardized interface, allowing AI assistants like Kiro to interact with AWS environments:
Reading AWS configurations
Running scripts
Real-time monitoring of execution results and metrics
This workshop used the following three MCP servers:
AWS Documentation MCP Server
AWS Knowledge MCP Server
AWS API MCP Server
 Kiro IDEKiro is an agent-type AI with IDE and CLI capabilities that supports the transition from prototype to production environment through specification-driven development.
Transforms user prompts into detailed specifications
Generates practical code, documentation, and test code from specifications
Takes you all the way to deliverables that can be shared with the team
Kiro's agents assist and automate tasks such as:
Breaking down and solving complex technical challenges
Generating technical documentation
Automatically generating and supporting unit tests
This allows engineers to maintain control throughout the development lifecycle, not just during the prototype stage.
 Kiro's Value in Operational ResilienceOperational resilience is not just about "preventing failures." When inevitable problems occur, what matters is how quickly you can understand the situation, diagnose the cause, and implement a solution.
Kiro functions as an AI pair programmer who understands both your codebase and AWS environment:
Debugging infrastructure code (Terraform, CDK, CloudFormation)
Reviewing and suggesting improvements for templates
Assisting with runbook creation
Supporting troubleshooting during production incidents
Through these capabilities, it contributes to increased work speed and reduced human errors.
Kiro IDE came pre-installed on a Windows EC2 instance for the workshop, accessible via AWS SSM Fleet Manager. CLI operation instructions were also provided.
 Checking Service Quotas with KiroBy integrating Kiro with the AWS MCP Server, it can help check service quotas and support the quota increase request process.
Service quotas can be accessed directly from the AWS Management Console
Not only checking current values but also requesting quota increases through the console
Kiro can analyze and guide on "which quotas are becoming bottlenecks" and "which quotas should be increased"
Being able to interactively handle such information through generative AI makes capacity planning and pre-release checks more efficient.
 Evaluating Initial Architecture, Observability, and Load TestingThe workshop structure includes both traditional procedures where you directly operate the AWS console and procedures where you issue instructions from Kiro's chat screen.
Console operations are presented as "reference implementations," with the main focus on experiencing "how much can be delegated to Kiro" and "how much effort can be reduced."
 Adding AWS User NotificationsFor operational resilience, it's essential that stakeholders receive immediate notifications when incidents occur.

In the workshop, we added notification settings linked to CloudWatch alarms and incident detection.
Setting up notification channels (email, Slack, etc.)
Organizing notification content (which metrics and thresholds trigger notifications)
Creating templates for runbooks (initial response procedures) after receiving notifications
Kiro can also assist with template generation and configuration support in this area, following a process where you decide "what to monitor and at what thresholds" through AI guidance.
 CloudWatch Anomaly DetectionThe workshop also covers using CloudWatch Anomaly Detection to automatically detect deviations from normal metric patterns.
Automatic learning of baselines
No need to manually determine thresholds
Early detection of traffic spikes or increased error rates
Here too, you can efficiently configure CloudWatch by instructing Kiro with commands like "set up anomaly detection for this metric" and "tell me the recommended thresholds and notification conditions."
 Enhancing Core Infrastructure Resilience VPCTo ensure a successful launch, infrastructure that can scale with demand is essential. The workshop application's VPC is in the following state:
Just enough to support the initial deployment
Very small CIDR block with no room for future scale-out
You're asked to verify the following before service launch:
Network IP capacity
Whether IP address depletion will occur during scaling
If the design can support long-term growth
When IP addresses run short, the following problems can occur:
Auto Scaling groups can't launch new instances
Load balancers can't be provisioned
Scaling out comes to a complete halt
The goal is to analyze VPC IP capacity, identify bottlenecks, and implement a network design that supports both initial launch and future growth.

For this analysis and redesign, Kiro helps by reading existing configurations and suggesting improvements, assisting with CIDR redesign or additional VPC considerations.
 NAT GatewayA typical single point of failure is the NAT gateway, which in the current configuration was:
Only one NAT gateway in the VPC
If that AZ fails, private subnets in all AZs lose Internet access
External API calls, software updates, and outbound traffic all stop, causing the application to fail
You're asked to verify network fault tolerance against single AZ failures and fix it before release.
Ask Kiro to help with the following:
Design private subnets in each AZ to route through NAT gateways in the same AZ
Implement multi-AZ NAT gateway configuration to maintain internet connectivity even if one AZ fails
 Migration to Three-Tier ArchitectureThere are strong concerns about the scalability of the current compute layer, for simple reasons:
All traffic is handled by a single EC2 instance
If that instance fails, the entire application goes down
During traffic spikes that exceed capacity, everything stops again
This is a typical "single instance operation."
To withstand production operations, the following requirements must be met:
Be scalable
Have high availability
Continue operations without interruption even during AZ failures
What you want Kiro to help with:
Replace the single EC2 instance with a multi-AZ Auto Scaling group
Add an Application Load Balancer (ALB) to distribute requests to the Auto Scaling group
Minimize the risk of failure due to EC2 capacity shortages
Here too, through dialogue with Kiro, it made the following changes:
Creation of Launch Templates/Launch Configurations based on existing EC2 settings
Automatic generation of target groups, ALB, and listener configurations
Proposal of Auto Scaling policies (based on CPU, RPS, response time, etc.)
 SummaryThrough this workshop, I was able to practically experience how the combination of MCP Server integration with AWS environments, Kiro's design/implementation/operational support, and AWS Countdown/Well-Architected Framework/Incident Detection and Response can help improve operational resilience.
I was particularly impressed by the following points:
Generative AI strongly supports the "research, design, and review" parts that traditionally required manual time investment
The final decisions are still made by humans, with AI functioning as a "pair engineer"
I'd like to consider how much of this can be incorporated into future design and operations.