Testing the AWS DevOps Agent (Preview) according to test scenarios #AWSreInvent

AWS re:Invent 2025
2025.12.03
Is everyone enjoying AWS re:Invent 2025?
In this post I try out AWS DevOps Agent, which was announced yesterday and looks very useful for troubleshooting.

There are already articles about AWS DevOps Agent, so please check them out as well.

https://dev.classmethod.jp/articles/aws-devops-agent-25/

https://dev.classmethod.jp/articles/aws-devops-agent-preview-awsreinvent-troubleshooting/
What is AWS DevOps Agent

AWS DevOps Agent is a newly announced AI agent service that autonomously resolves and prevents incidents as a "Frontier Agent".

Frontier Agents are a new class of AI agent defined by AWS that operate independently, without human intervention, and can handle many tasks.
Other Frontier Agents, such as AWS Security Agent and Kiro Autonomous Agent, have also been announced.

https://dev.classmethod.jp/articles/aws-security-agent-public-preview-reinvent-2025/

https://dev.classmethod.jp/articles/kiro-autonomous-agent-early-access/
Among these, AWS DevOps Agent can automatically investigate alerts, conduct investigations using natural language, and identify root causes to resolve incidents.
Trying a test with EC2 CPU usage alerts

While there are many features, you probably want to try a simple use case first.

Conveniently, AWS documentation provides a test scenario where EC2 CPU usage increases and triggers an alert.
https://docs.aws.amazon.com/devopsagent/latest/userguide/getting-started-create-a-test-environment.html#test-option-a-ec2-cpu-capacity-test
Test option A: EC2 CPU capacity test

Purpose: Validate AWS DevOps Agent's ability to detect and investigate EC2 performance issues

Estimated time: 5 minutes setup + 10 minutes automatic execution

Difficulty: Fully automated (no manual steps required)
Let's try this one today.
Test Scenario Flow

The flow of this test scenario is as follows.
We log in to the EC2 instance, run a script that puts load on the CPU, and trigger a CloudWatch alarm.
We then have AWS DevOps Agent investigate the alarm and identify the root cause.
Prerequisites

An Agent Space needs to be created first, so let's create one.

Creating an Agent Space

Enter a space name and configure the role and tag settings. This time I created a role with the default settings.

You can open the Web App for incident investigation from Operator access.

If you can open this screen, you're ready to go.

Now let's create a test environment.
Creating a Test Environment

Save the following CloudFormation (CFn) template from the documentation as AWS-AIDevOps-ec2-test.yaml.

AWS-AIDevOps-ec2-test.yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS AIDevOps EC2 CPU Test Stack'

Parameters:
  MyIP:
    Type: String
    Description: Your current IP address for SSH access (find at https://whatismyipaddress.com)
    Default: '0.0.0.0/0'

Resources:
  # Security Group for SSH access
  TestSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: AWS-AIDevOps-test-sg
      GroupDescription: AWS AIDevOps beta testing security group
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref MyIP
          Description: SSH access from your IP
      Tags:
        - Key: Name
          Value: AWS-AIDevOps-Test-SG
        - Key: Purpose
          Value: AWS-AIDevOps-Testing

  # Key Pair for SSH access
  TestKeyPair:
    Type: AWS::EC2::KeyPair
    Properties:
      KeyName: AWS-AIDevOps-test-key
      KeyType: rsa
      Tags:
        - Key: Name
          Value: AWS-AIDevOps-Test-Key
        - Key: Purpose
          Value: AWS-AIDevOps-Testing

  # EC2 Instance for CPU testing
  TestInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3.micro
      ImageId: '{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64}}'
      KeyName: !Ref TestKeyPair
      SecurityGroupIds:
        - !Ref TestSecurityGroup
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y htop

          # Create the CPU stress test script
          cat > /home/ec2-user/cpu-stress-test.sh << 'EOF'
          #!/bin/bash
          echo "Starting AWS AIDevOps CPU Stress Test"
          echo "Time: $(date)"
          echo "Instance: $(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
          echo ""

          # Get number of CPU cores
          CORES=$(nproc)
          echo "CPU Cores: $CORES"
          echo ""

          echo "Starting stress test (5 minutes)..."
          echo "This will generate >70% CPU usage to trigger CloudWatch alarm"
          echo ""

          # Create CPU load using yes command
          echo "Starting CPU load processes..."
          for i in $(seq 1 $CORES); do
              (yes > /dev/null) &
              CPU_PID=$!
              echo "Started CPU load process $i (PID: $CPU_PID)"
              echo $CPU_PID >> /tmp/cpu_test_pids
          done

          # Auto-cleanup after 5 minutes
          (sleep 300 && echo "Stopping CPU load processes..." && kill $(cat /tmp/cpu_test_pids 2>/dev/null) 2>/dev/null && rm -f /tmp/cpu_test_pids) &

          echo ""
          echo "CPU load processes started for 5 minutes"
          echo "Check CloudWatch for alarm trigger in 3-5 minutes"
          EOF

          chmod +x /home/ec2-user/cpu-stress-test.sh
          chown ec2-user:ec2-user /home/ec2-user/cpu-stress-test.sh

          # Create auto-shutdown script (safety mechanism)
          cat > /home/ec2-user/auto-shutdown.sh << 'SHUTDOWN_EOF'
          #!/bin/bash
          echo "Auto-shutdown scheduled for 2 hours from now: $(date)"
          sleep 7200
          echo "Auto-shutdown executing at: $(date)"
          sudo shutdown -h now
          SHUTDOWN_EOF

          chmod +x /home/ec2-user/auto-shutdown.sh
          nohup /home/ec2-user/auto-shutdown.sh > /home/ec2-user/auto-shutdown.log 2>&1 &

          echo "AWS AIDevOps test setup completed at $(date)" > /home/ec2-user/setup-complete.txt
      Tags:
        - Key: Name
          Value: AWS-AIDevOps-Test-Instance
        - Key: Purpose
          Value: AWS-AIDevOps-Testing

  # CloudWatch Alarm for CPU utilization
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: AWS-AIDevOps-EC2-CPU-Test
      AlarmDescription: AWS-AIDevOps beta test - EC2 CPU utilization alarm
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 60
      EvaluationPeriods: 1
      Threshold: 70
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref TestInstance
      TreatMissingData: notBreaching

Outputs:
  InstanceId:
    Description: EC2 Instance ID for testing
    Value: !Ref TestInstance

  SecurityGroupId:
    Description: Security Group ID
    Value: !Ref TestSecurityGroup

  AlarmName:
    Description: CloudWatch Alarm Name
    Value: !Ref CPUAlarm

  SSHCommand:
    Description: SSH command to connect to instance
    Value: !Sub 'ssh -i "AWS-AIDevOps-test-key.pem" ec2-user@${TestInstance.PublicDnsName}'

Here's a list of resources created by running this template:

| Resource | Type | Role |
| --- | --- | --- |
| TestSecurityGroup | EC2::SecurityGroup | Allows SSH access (port 22) |
| TestKeyPair | EC2::KeyPair | RSA key for EC2 SSH authentication |
| TestInstance | EC2::Instance | Instance for CPU load testing (t3.micro) |
| CPUAlarm | CloudWatch::Alarm | Detects CPU usage exceeding 70% |
Let's upload and deploy it to CloudFormation.

Set the stack name to AWS-AIDevOps-EC2-Test.
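
If you prefer the CLI to the console, the same upload and deploy step can be sketched roughly as follows. This is a minimal sketch, assuming the AWS CLI is installed and configured for your account and Region. The helper names to_cidr and deploy_test_stack are hypothetical, and looking up your IP via checkip.amazonaws.com is my own choice rather than something from the documentation. Passing MyIP as a /32 also avoids the template default of 0.0.0.0/0, which opens SSH to the whole internet.

```shell
# Hypothetical helper: turn a bare IPv4 address into a single-host CIDR.
to_cidr() {
  printf '%s/32\n' "$1"
}

# Hypothetical helper: deploy the saved template, restricting SSH to your own IP.
deploy_test_stack() {
  local my_ip
  # Assumption: any "what is my IP" service would work here.
  my_ip=$(curl -s https://checkip.amazonaws.com)
  aws cloudformation deploy \
    --template-file AWS-AIDevOps-ec2-test.yaml \
    --stack-name AWS-AIDevOps-EC2-Test \
    --parameter-overrides "MyIP=$(to_cidr "$my_ip")"
}

# Usage: deploy_test_stack
```

Nothing runs until you call deploy_test_stack, so you can review the command first.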

The CPU stress test will start automatically 5 minutes after the instance launches.

https://docs.aws.amazon.com/devopsagent/latest/userguide/getting-started-create-a-test-environment.html
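The documentation says the above, but no alarm was triggered in my environment. The template's UserData only creates cpu-stress-test.sh and never schedules or runs it, which would explain this. So I logged into the EC2 instance and ran ./cpu-stress-test.sh manually.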
   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
[ec2-user@ip-172-31-18-113 ~]$ ./cpu-stress-test.sh
Starting AWS AIDevOps CPU Stress Test
Time: Wed Dec  3 02:04:11 UTC 2025
Instance: 

CPU Cores: 2

Starting stress test (5 minutes)...
This will generate >70% CPU usage to trigger CloudWatch alarm

Starting CPU load processes...
Started CPU load process 1 (PID: 26055)
Started CPU load process 2 (PID: 26056)

CPU load processes started for 5 minutes
Check CloudWatch for alarm trigger in 3-5 minutes
[ec2-user@ip-172-31-18-113 ~]$ 
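
Note that the Instance: line in the output above is blank. A likely explanation (my assumption, I did not investigate it further) is that the script queries the instance metadata service without a session token, while Amazon Linux 2023 instances default to IMDSv2, which requires one. A minimal sketch of the token-based lookup, where the function name imds_instance_id is hypothetical:

```shell
# Hypothetical helper: fetch the instance ID via IMDSv2 (token-based metadata access).
imds_instance_id() {
  local token
  # Step 1: request a short-lived session token (TTL is in seconds).
  token=$(curl -s --max-time 2 -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  # Step 2: present the token when reading metadata.
  curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/instance-id"
}

# Usage (on the EC2 instance): echo "Instance: $(imds_instance_id)"
```

The blank line does not affect the test itself, since the script only uses the instance ID for display.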
I was able to confirm the alert successfully.
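
The alarm state can also be checked from the CLI instead of the console. A minimal sketch, assuming the AWS CLI is configured for the same account and Region as the stack. The function name alarm_state is hypothetical.

```shell
# Hypothetical helper: print the current state of a CloudWatch alarm
# (OK, ALARM, or INSUFFICIENT_DATA).
alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' \
    --output text
}

# Usage: alarm_state AWS-AIDevOps-EC2-CPU-Test
```

Since the stress test only runs for 5 minutes, the alarm should go back to OK shortly after the load stops.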

Now let's see how AWS DevOps Agent investigates this issue.
Investigating the error with AWS DevOps Agent

Let's go back to the Web App we opened earlier and fill in the Start Investigation form.
Investigation details

Enter what happened, in this case that the "AWS-AIDevOps-EC2-CPU-Test" alarm fired.
Investigate the cause of the "AWS-AIDevOps-EC2-CPU-Test" alert.

Investigation starting point

Enter the time when the CloudWatch alarm entered the ALARM state.
CloudWatch Alarm occurred in alarm state at 2025-12-03 11:06:45

I left "Date and time of incident" at its default, since not much time had passed since the alarm occurred.
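
One thing to watch when entering the starting point is the time zone. The alarm time above is presumably local time (JST, UTC+9; an assumption based on the script output showing 02:04:11 UTC a couple of minutes earlier). A minimal sketch for converting such a timestamp to UTC, assuming GNU date; the helper name to_utc is hypothetical.

```shell
# Hypothetical helper: convert a timestamp with an explicit UTC offset to UTC.
# Assumes GNU date (the default on most Linux distributions); BSD/macOS date uses different flags.
to_utc() {
  date -u -d "$1" '+%Y-%m-%d %H:%M:%S UTC'
}

# The +0900 offset (JST) is an assumption about the alarm time entered above.
to_utc '2025-12-03 11:06:45 +0900'
```

With GNU date this prints 2025-12-03 02:06:45 UTC, which is easier to line up with UTC timestamps in logs.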
When the investigation starts, a timeline appears and the investigation progresses. It's fun to watch the timeline as the investigation automatically advances.

After a few minutes, the root cause was identified and the Root Cause tab appeared.

Checking the content, it had correctly identified that I ran the test script.

User executed CPU stress test script during SSH session

The CPU spike to 98.34% on EC2 instance i-0ac6ad9e37829b762 was caused by manual execution of the CPU stress test script '/home/ec2-user/cpu-stress-test.sh' during an SSH session. User cm-suzuki.jun established an SSH connection via EC2 Instance Connect at 02:04:02 UTC, and the CPU spike began at 02:05:00 UTC (1 minute later), lasting exactly 5 minutes until 02:10:00 UTC.

That's spot on. It even identified who caused the incident and how it occurred.
We already know the cause, but there's a Generate Mitigation Plan button, so let's try it.
It will suggest countermeasures based on the incident.
A new tab was created and the Mitigation Plan was displayed.
Here is a translated excerpt from the plan:
This is not an operational failure requiring remediation, but an expected result of system behavior in a test environment. Since the system is functioning as designed, no immediate operational mitigation measures are necessary.
Since I had only run a test script, no action was needed, but it's great that it would propose a plan for a real incident.
Don't forget to delete the CloudFormation stack you created after the test.
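
The cleanup can be done from the CLI as well. A minimal sketch, assuming the AWS CLI is configured and the stack name used earlier; the function name delete_test_stack is hypothetical.

```shell
# Hypothetical helper: delete a CloudFormation stack and block until deletion finishes.
delete_test_stack() {
  aws cloudformation delete-stack --stack-name "$1" &&
    aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Usage: delete_test_stack AWS-AIDevOps-EC2-Test
```

The wait step returns once the stack is gone, so you can be sure the instance, key pair, security group, and alarm were all removed.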
Summary

I ran AWS's official test scenario with AWS DevOps Agent (Preview) and confirmed how it works.
Having it automatically identify root causes and suggest countermeasures is really helpful.
In particular, not having to trace CloudTrail logs manually, and having the agent identify who did what and when, will likely reduce incident response time significantly.
It seems there are also integration features like GitHub integration, MCP server connection, and Slack notifications. If you're interested, please give it a try!
References

Frontier agent – AWS DevOps Agent – AWS

What is AWS DevOps Agent - AWS DevOps Agent

AWS DevOps Agent helps you accelerate incident response and improve system reliability (preview) | AWS News Blog