I tried to see how far AWS DevOps Agent can identify CDK configuration errors #AWSreInvent

It pinpointed the problem in the faulty CDK code! Now that's DevOps!
2025.12.05

AWS DevOps Agent was announced in preview at re:Invent 2025.
DevelopersIO has been buzzing about it, too!

https://dev.classmethod.jp/articles/aws-devops-agent-25/

https://dev.classmethod.jp/articles/reinvent2025-report-cop362/

https://dev.classmethod.jp/articles/aws-devops-agent-datadog-mcp-connect/

https://dev.classmethod.jp/articles/aws-devops-agent-preview-pagerduty-webhook-integration/

https://dev.classmethod.jp/articles/aws-devops-agent-slack-integration/

https://dev.classmethod.jp/articles/aws-devops-agent-preview-awsreinvent-troubleshooting/

To get the most out of this tool, I tested whether it could identify the release that caused an issue and analyze the code behind it.

Application Premise

Let's assume a simple application like this:

Lambda triggered every minute → writes to DynamoDB

That's it.

import * as cdk from "aws-cdk-lib";
import * as dynamodb from "aws-cdk-lib/aws-dynamodb";
import * as events from "aws-cdk-lib/aws-events";
import * as targets from "aws-cdk-lib/aws-events-targets";
import * as nodejs from "aws-cdk-lib/aws-lambda-nodejs";

const app = new cdk.App();
const stack = new cdk.Stack(app, "PlayDevopsAgentsStack", {
	env: {
		region: "us-east-1",
	},
});

// Prepare a DynamoDB Table
const table = new dynamodb.TableV2(stack, "Table", {
	partitionKey: { name: "pk", type: dynamodb.AttributeType.STRING },
	sortKey: { name: "sk", type: dynamodb.AttributeType.STRING },
	removalPolicy: cdk.RemovalPolicy.DESTROY,
});

// Prepare a Lambda that writes to the Table
const fn = new nodejs.NodejsFunction(stack, "Function", {
	environment: {
		TABLE_NAME: table.tableName,
	},
});
table.grantReadWriteData(fn);

// Schedule Lambda to run periodically
const rule = new events.Rule(stack, "Rule", {
	schedule: events.Schedule.rate(cdk.Duration.minutes(1)),
});
rule.addTarget(new targets.LambdaFunction(fn));

// Set an alarm for Lambda errors
fn.metricErrors().createAlarm(stack, "FunctionErrorAlarm", {
	threshold: 1,
	evaluationPeriods: 1,
});

The Lambda function simply writes a random value to DynamoDB:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocument } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({});
const doc = DynamoDBDocument.from(client);

const TABLE_NAME = process.env.TABLE_NAME;

export const handler = async () => {
	await doc.put({
		TableName: TABLE_NAME,
		Item: {
			pk: "1",
			sk: new Date().toISOString(),
			val: Math.random() * 100,
		},
	});
};

Causing an Error

After creating an AWS DevOps Agent space, I revoke the Lambda function's permissions to cause an error.
The goal is for the Agent to identify this difference (the commented-out line).
I comment out the following code:

// Remove write permissions from Lambda
// table.grantReadWriteData(fn);
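
To make concrete what the commented-out line takes away: `table.grantReadWriteData(fn)` makes CDK synthesize an IAM policy on the function's role that allows DynamoDB actions such as `dynamodb:PutItem`. The sketch below is not part of the original project. `allowsAction` and both sample templates are hypothetical, and the real synthesized policy lists more actions and a `Resource`. It only shows how a template, given as a plain object, could be checked for an `AWS::IAM::Policy` statement that allows a given action.

```typescript
// Hypothetical helper for illustration; not part of the original repository.
// Only the parts of a CloudFormation template that the check reads are typed.
type Statement = { Effect: string; Action: string | string[] };
type TemplateResource = {
  Type: string;
  Properties?: { PolicyDocument?: { Statement?: Statement[] } };
};
type Template = { Resources: { [logicalId: string]: TemplateResource } };

// True if some AWS::IAM::Policy has an Allow statement listing `action`
// (exact match only; wildcards such as "dynamodb:*" are not expanded).
function allowsAction(template: Template, action: string): boolean {
  for (const logicalId of Object.keys(template.Resources)) {
    const resource = template.Resources[logicalId];
    if (resource.Type !== "AWS::IAM::Policy") continue;
    const statements = resource.Properties?.PolicyDocument?.Statement ?? [];
    for (const statement of statements) {
      if (statement.Effect !== "Allow") continue;
      const actions = Array.isArray(statement.Action)
        ? statement.Action
        : [statement.Action];
      if (actions.indexOf(action) !== -1) return true;
    }
  }
  return false;
}

// Made-up templates: one as if the grant were present,
// one as if the grant line were commented out.
const withGrant: Template = {
  Resources: {
    FunctionRolePolicy: {
      Type: "AWS::IAM::Policy",
      Properties: {
        PolicyDocument: {
          Statement: [
            { Effect: "Allow", Action: ["dynamodb:GetItem", "dynamodb:PutItem"] },
          ],
        },
      },
    },
  },
};
const withoutGrant: Template = {
  Resources: { Table: { Type: "AWS::DynamoDB::GlobalTable" } },
};

console.log(allowsAction(withGrant, "dynamodb:PutItem")); // true
console.log(allowsAction(withoutGrant, "dynamodb:PutItem")); // false
```

In a real project the object would come from the synthesized template JSON (for example the file `cdk synth` writes under `cdk.out`); wiring that up is left out here.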

After deploying with CDK, the CloudWatch Alarm started firing as expected.

Screenshot 2025-12-04 10.05.18

Requesting Investigation from AWS DevOps Agent

I asked the Agent to start an investigation.
(I wonder if this can be automated...)

Screenshot 2025-12-04 9.57.55

Here are the investigation results:

Screenshot 2025-12-04 10.30.33

Translation:

The Lambda function received an AccessDeniedException error when trying to write to the DynamoDB Table.
The Lambda function's role does not have permission to execute dynamodb:PutItem.
Infrastructure was deployed at 17:42-17:43Z but this permission was not granted.
It was successful for 8 minutes when DynamoDB writes weren't needed, but began failing when writes started.

This analysis is wrong: the function had been writing to DynamoDB from the start, and the errors began only when the permissions were removed. If the Agent could see the code diff, its analysis would be more accurate.
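
For reference, the `AccessDeniedException` in the report is what the handler's `doc.put` call rejects with once the grant is gone. As a minimal sketch, a permission failure could be told apart from other errors like this. The helper name `isAccessDenied` is hypothetical, and it assumes the thrown error carries the exception name in its `name` property, as service exceptions in AWS SDK for JavaScript v3 do.

```typescript
// Hypothetical helper: decides whether an unknown thrown value looks like
// an IAM permission failure. Assumes the exception name is exposed on
// `name`, e.g. "AccessDeniedException".
function isAccessDenied(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  return name === "AccessDeniedException";
}

// Stand-in for the error the handler would see after the grant is removed.
// The message text is made up for illustration.
const denied = new Error("not authorized to perform: dynamodb:PutItem");
denied.name = "AccessDeniedException";

console.log(isAccessDenied(denied)); // true
console.log(isAccessDenied(new Error("timeout"))); // false
```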

I also asked for a mitigation plan, but it only suggested granting the permission with the CLI:

aws iam put-role-policy ...

I want a more accurate analysis and a concrete fix in the code.

Setting up a Pipeline to Identify Releases

To help the Agent tell the problematic release apart from earlier ones, I set up a DevOps Agent Pipeline.

This feature connects with GitHub or GitLab to monitor workflows.

https://docs.aws.amazon.com/devopsagent/latest/userguide/configuring-capabilities-connecting-ci-cd-pipelines-index.html

You can add a Pipeline from AWS DevOps Agent console → Select Agent Space → Capabilities tab → Pipeline section.

Screenshot 2025-12-04 11.21.43

Screenshot 2025-12-04 11.22.24

Screenshot 2025-12-04 11.23.49

I also set up deployment through GitHub Actions.

First, I restored the CDK code to a working state and deployed it, then commented out the line again to reproduce the error.

// Once again, removing write permissions from Lambda
// table.grantReadWriteData(fn);

Requesting Investigation from AWS DevOps Agent v2

After confirming the CloudWatch Alarm was triggered, I started an Investigation.

Screenshot 2025-12-04 12.20.58

Translation:

The Lambda function received an AccessDeniedException error when trying to execute dynamodb:PutItem. This was caused by missing necessary permissions in the execution role.
This permission gap was caused by a CloudFormation deployment that removed the policy at 19:36:26Z.

It still doesn't mention CDK or GitHub Actions.
I asked it to propose a fix.

Screenshot 2025-12-04 12.28.56

Oh! It's mentioning GitHub Actions!

Translation:

To restore the IAM state from before Lambda started failing, we should redeploy the workflow:
Go to GitHub repository https://github.com/yamatatsu/play-devops-agents → Actions → Deploy workflow → Find the execution that worked previously → Click 'Re-run all jobs' to execute a rollback → Watch it succeed → Wait for CloudFormation to reach UPDATE_COMPLETE

It even identified the repository! 👀
Though it would have been nice if it identified the specific job too...
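
The "find the execution that worked previously" step is the part left to the reader. Here is a minimal sketch of that selection, assuming a list of workflow runs shaped like the objects GitHub's REST API returns (only `id`, `conclusion`, `updated_at`, and `head_sha` are used). The function name `lastGoodRunBefore` and the sample runs are made up, and treating `updated_at` as a rough completion time is an assumption. Completion time matters because the breaking deployment itself also finished with `success`, so it has to be excluded by timing rather than by status.

```typescript
// Subset of a GitHub Actions workflow run; sample values below are invented.
type WorkflowRun = {
  id: number;
  conclusion: string | null; // "success", "failure", ... or null while running
  updated_at: string; // ISO 8601; used here as an approximate completion time
  head_sha: string;
};

// Returns the most recent successful run that had already finished before
// `changedAt` (the time the Agent reports for the breaking change),
// or undefined if there is none.
function lastGoodRunBefore(
  runs: WorkflowRun[],
  changedAt: string,
): WorkflowRun | undefined {
  const cutoff = Date.parse(changedAt);
  let best: WorkflowRun | undefined;
  for (const run of runs) {
    const finished = Date.parse(run.updated_at);
    if (run.conclusion !== "success" || finished >= cutoff) continue;
    if (best === undefined || finished > Date.parse(best.updated_at)) {
      best = run;
    }
  }
  return best;
}

// Invented runs: 1 is the last good deploy, 2 is the deploy that removed the
// policy (still running at the cutoff), 3 is an older failed run.
const sampleRuns: WorkflowRun[] = [
  { id: 3, conclusion: "failure", updated_at: "2025-12-04T18:52:00Z", head_sha: "ccc333" },
  { id: 1, conclusion: "success", updated_at: "2025-12-04T19:13:00Z", head_sha: "aaa111" },
  { id: 2, conclusion: "success", updated_at: "2025-12-04T19:38:00Z", head_sha: "bbb222" },
];

const good = lastGoodRunBefore(sampleRuns, "2025-12-04T19:36:26Z");
console.log(good?.id); // 1
```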

Setting up MCP and RunBook

This still isn't quite production-ready.
The AWS CDK philosophy is that "deployed cloud resources are a complete snapshot of the code repository; to roll back, roll back the code in the repository." So rather than re-running a past job, I'd prefer the Agent to suggest a code fix.

To enable this, I first set up github-mcp-server so the Agent can analyze code changes in detail.
(This might already be possible after setting up Pipeline... I'll experiment separately.)

MCP Server can be added from AWS DevOps Agent console → Select Agent Space → Capabilities tab → MCP Server section.

Screenshot 2025-12-04 14.26.29

Screenshot 2025-12-04 14.18.10

Screenshot 2025-12-04 14.18.25

Screenshot 2025-12-04 14.20.39

I also set up a RunBook.
Unlike the previous settings, a RunBook is configured from the AWS DevOps Agent web app.

Screenshot 2025-12-04 14.33.32

Screenshot 2025-12-04 14.33.47

The RunBook I configured is also committed to GitHub.

Requesting Investigation from AWS DevOps Agent v3

Following the same procedure, I caused another error and asked AWS DevOps Agent to investigate.

Here's the result:

Screenshot 2025-12-04 15.12.25

Translation:

This happened because CloudFormation removed Lambda permissions

It doesn't seem to have used the RunBook... I asked it to come up with a mitigation plan.

Screenshot 2025-12-04 15.15.13

Translation:

I couldn't identify immediate operational mitigation measures.
Root cause analysis and CloudFormation template examination reveal this isn't a rollback-eligible deployment, but a flaw in the existing Infrastructure as Code configuration.
The Lambda function was deployed at 2025-12-04T19:19:11Z with code trying to write to DynamoDB table, but the CloudFormation template didn't include the necessary IAM permissions.
This configuration gap exists in the source CloudFormation/CDK template itself. Rolling back the Lambda deployment won't solve the issue, as the IAM role settings were insufficient from the initial deployment.
To solve this problem, you need to update the Infrastructure-as-Code template, add DynamoDB permissions to the Lambda execution role, and redeploy the stack.
This requires code changes and development work, not an immediate operational rollback or configuration adjustment.

Oh! It suddenly got better? It seems to have read the RunBook!
Still, it would have been better if it had used the RunBook during the investigation phase and analyzed the code properly...

Requesting Investigation from AWS DevOps Agent v4

I checked the official documentation to find out whether a RunBook can be used during an investigation, and learned that you can ask "Please reinvestigate using RunBook" in the chat.

https://docs.aws.amazon.com/devopsagent/latest/userguide/userguide-devops-agent-runbooks.html

I made the request:

Screenshot 2025-12-04 15.25.27

It looks like it's going to investigate!

Screenshot 2025-12-04 15.25.47

It's using the MCP Server to investigate!

Screenshot 2025-12-04 15.27.07

It's happening!!!

Screenshot 2025-12-04 15.34.16

Translation:

The grant part of CDK code is commented out
The root cause is in packages/cdk/src/app.ts. Line 26 table.grantReadWriteData(fn); is commented out.
This is code that grants read & write permissions to the Lambda function for the DynamoDB table. Commenting it out removes the permissions and causes the AccessDeniedException error.

That's completely correct! 💯
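
The kind of change the Agent spotted can be illustrated with a small diff scan. This is only a sketch of the idea, not how the Agent or github-mcp-server works internally: the function name `findCommentedOutLines` is hypothetical, the sample diff is hand-written (its hunk numbers are illustrative), and the header check is a simple heuristic.

```typescript
// Returns the code lines that a unified diff turns into `//` comments,
// i.e. a removed line whose text reappears as an added line behind "//".
function findCommentedOutLines(diff: string): string[] {
  const removed: string[] = [];
  const commented: string[] = [];
  for (const line of diff.split("\n")) {
    // Skip file headers such as "--- a/app.ts" and "+++ b/app.ts".
    if (line.indexOf("---") === 0 || line.indexOf("+++") === 0) continue;
    if (line.charAt(0) === "-") {
      removed.push(line.slice(1).trim());
    } else if (line.charAt(0) === "+") {
      const added = line.slice(1).trim();
      if (added.indexOf("//") === 0) commented.push(added.slice(2).trim());
    }
  }
  return removed.filter(
    (code) => code !== "" && commented.indexOf(code) !== -1,
  );
}

// Hand-written diff modeled on the change made in this article.
const sampleDiff = [
  "--- a/packages/cdk/src/app.ts",
  "+++ b/packages/cdk/src/app.ts",
  "@@ -25,3 +25,4 @@",
  " });",
  "-table.grantReadWriteData(fn);",
  "+// Remove write permissions from Lambda",
  "+// table.grantReadWriteData(fn);",
  " ",
].join("\n");

console.log(findCommentedOutLines(sampleDiff));
// [ 'table.grantReadWriteData(fn);' ]
```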

We're getting closer to the dream where AWS DevOps Agent automatically finds the root cause in code during AWS incidents!

Conclusion

In this article, I successfully connected AWS DevOps Agent with GitHub to identify code mistakes.
Ultimately, I hope we can reach a world where if a problem occurs at night, a fix PR is ready by morning.

For that to happen, it needs to start investigations automatically from CloudWatch Alarms...
And it needs to reference RunBooks without being asked via chat...
I'll continue researching how to make this possible.

That's all!
