I tried APM investigation and CloudFormation-based recommended monitor creation with the official Datadog MCP server
Hello. I'm Shiina from the Operations Department.
Introduction
My long-awaited request for access to the official Datadog MCP server Preview has finally been approved.
Using the MCP server, AI agents like Claude Code can directly access Datadog observability data.
With natural language instructions like "investigate recent error traces" or "create a monitor suitable for this configuration," it can handle APM trace analysis, log investigation, and monitor creation.
In this post, in addition to the setup procedure, I test use cases for APM trace analysis, error investigation, and recommended monitor creation based on a CloudFormation template.
What is the Datadog MCP Server
The Datadog MCP server acts as a bridge connecting Datadog's observability data with AI agents.
It enables direct access to Datadog's data and tools like logs, traces, and monitors from MCP-compatible clients such as Claude Code and Cursor.
Key Disclaimers
When using the Preview, be aware of the following:
- Production use is not supported
- Only Datadog organizations on the allowlist can use it
- Users are responsible for verifying compliance requirements for AI tools
- Datadog may collect information about MCP server usage
Toolsets
The Datadog MCP server supports a mechanism called toolsets.
By enabling only necessary toolsets, you can reduce context window consumption.
Available toolsets are as follows:
| Toolset | Overview |
|---|---|
| core | Default set for handling logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, and notebooks |
| alerting | Monitor validation, group search, and template retrieval |
| apm | APM trace analysis, span search, Watchdog insights, performance investigation |
| dbm | Integration with Database Monitoring |
| error-tracking | Error Tracking operations |
| feature-flags | Feature flag creation, listing, updates, and management |
| llmobs | LLM Observability span search and analysis |
| product-analytics | Product analytics query operations |
| networks | Cloud Network Monitoring and Network Device Monitoring analysis |
| onboarding | Guided setup and configuration for Datadog |
| security | Code security scanning, security signal and finding searches |
| software-delivery | Integration with Software Delivery, CI Visibility, Test Optimization, etc. |
| synthetics | Synthetic test operations |
Usage
When using Claude Code, configure the MCP server in ~/.claude.json.
Toolsets are specified as query parameters for the endpoint URL.
- For using the core tools (default)
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"
    }
  }
}
- For using core, apm, and alerting tools
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
    }
  }
}
Authentication Setup
You can use OAuth 2.0 for MCP server authentication.
Browser authentication is available, so you don't need to define API and APP keys.
This is safer as you don't have to store authentication information in your local environment.
Permission Scopes
The scopes granted with OAuth 2.0 authentication are mainly for read operations.
You can access a wide range of data including APM, logs, metrics, dashboards, and monitors.
However, for update operations like creating monitors or editing dashboards, you need to use the Datadog API with API and APP keys.
Setup
Let's use the US1 site endpoint and enable toolsets (core, apm, alerting).
1. MCP Server Configuration
- Add the configuration using the claude mcp add command.
claude mcp add -s user --transport http datadog "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
The file has been modified.
Added HTTP MCP server datadog with URL: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting to user config
File modified: /Users/shiina.yuichi/.claude.json
Checking ~/.claude.json, the configuration has been added.
"mcpServers": {
"datadog": {
"type": "http",
"url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
},
- Check the MCP list.
claude mcp list
Setup is complete if the datadog configuration appears in the list.
Checking MCP server health...
claude.ai Slack: https://mcp.slack.com/mcp - ✓ Connected
claude.ai Gmail: https://gmail.mcp.claude.com/mcp - ! Needs authentication
claude.ai Google Calendar: https://gcal.mcp.claude.com/mcp - ✓ Connected
awslabs.aws-documentation-mcp-server: uvx awslabs.aws-documentation-mcp-server@latest - ✓ Connected
awslabs.aws-iac-mcp-server: uvx awslabs.aws-iac-mcp-server@latest - ✓ Connected
datadog: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting (HTTP) - ! Needs authentication
2. MCP Server Authentication
- Launch Claude Code.
claude
- Check with the /mcp command.
Before authentication, it shows "needs authentication".
User MCPs (/Users/shiina.yuichi/.claude.json)
❯ awslabs.aws-documentation-mcp-server · ✔ connected
awslabs.aws-iac-mcp-server · ✔ connected
datadog · △ needs authentication
pencil · ✔ connected
claude.ai
claude.ai Gmail · △ needs authentication
claude.ai Google Calendar · ✔ connected
claude.ai Slack · ✔ connected
- Move the cursor to datadog and press Enter.
- Select "1. Authenticate" from the details.
- Select "Allow" on the Authorize access screen in the browser.
- Select your Datadog organization.
- Confirm the access permissions and authorize.
- Upon success, "Authentication Successful!" is displayed.
- Claude shows "Authentication successful. Connected to datadog."
- Check the status again with the /mcp command.
Authentication is complete if the datadog status is "connected".
User MCPs (/Users/shiina.yuichi/.claude.json)
❯ awslabs.aws-documentation-mcp-server · ✔ connected
awslabs.aws-iac-mcp-server · ✔ connected
datadog · ✔ connected
pencil · ✔ connected
claude.ai
claude.ai Gmail · △ needs authentication
claude.ai Google Calendar · ✔ connected
claude.ai Slack · ✔ connected
3. Checking Available Tools
Let's check what tools are available with the Datadog MCP server.
What tools can I use with the Datadog MCP?
I found that 24 tools are available.
They cover a wide range of daily operational tasks, including log searches, trace analysis, and monitor operations.

Test Scenarios
Using Claude Code (Opus 4.6 model), I tested the following two scenarios:
- APM Investigation of a Shopping Web Application
  - Trace analysis, identifying error spans and latency spans, and proposing recommended actions
- Creating Monitors for a Serverless Application
  - Auto-creating recommended monitors based on an IaC template defining DynamoDB, API Gateway, and Lambda
APM Investigation of a Shopping Web Application
Using the MCP server, I'll analyze APM traces, investigate errors, and identify the causes of delays in a shopping site that intentionally includes delay logic and random errors.
Trace Analysis
There have been reports of response delays on the website "shopping-site".
Please analyze recent APM traces.

With just natural language instructions, it automatically performed service configuration understanding, span searching, and analysis.
It detected 8 error spans and identified the top bottlenecks among the delay spans.
Error Span Investigation
There seem to be errors when adding items to the cart despite inventory being available.
Please identify the cause from spans and logs

The investigation revealed two root causes:
- Random error generation logic causing product retrieval failures before reaching the DB, resulting in cart addition errors
- While DB operations in POST /api/cart/add take about 19ms in total, there's a delay of about 5.3 seconds in uninstrumented processing
Although these are intentional issues embedded in the application, it identified the root causes from span attributes and trace timelines.
Delay Span Investigation
GET /api/products takes over 1 second in some cases.
Please identify the root cause

By comparing the DB query processing times, it detected that delays were intentionally added in the application layer.
Furthermore, from the correlation between delay_ms and search_length in logs, it even estimated the delay logic as "search string length × 200ms".
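As a rough sketch, the inferred logic would look something like this in Python. This is my reconstruction of the intentional delay based on the agent's estimate, not the app's actual code, and the function name is hypothetical:

```python
def simulate_search_delay(search: str) -> int:
    """Hypothetical reconstruction of the delay logic the agent inferred:
    roughly 200 ms per character of the search string."""
    delay_ms = len(search) * 200
    # time.sleep(delay_ms / 1000)  # the demo app presumably sleeps here
    return delay_ms

print(simulate_search_delay("shoes"))  # 5 characters -> 1000 ms
```

A correlation like this between delay_ms and search string length in logs is exactly the kind of pattern the agent picked up on.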
Recommended Actions
Do you have any suggestions for optimizing this application?


Based on the analysis results, optimization suggestions organized by urgency were provided.
They cover application code issues (removing delay logic and random errors) and architecture improvements (asynchronous log sending and connection pooling optimization).
The expected effects after improvements include latency comparisons for each endpoint, making it easier to prioritize responses.
Creating Monitors for a Serverless Application
I'll implement the design, creation, and status checking of recommended monitors for infrastructure resources defined in a CloudFormation template.
Since Datadog query syntax and metric names have unique specifications, there's a risk of hallucinations in AI-generated definitions.
A validate_monitor_definition tool is available for validation, which I'll utilize.
Monitor Design
Please create Datadog monitors for the infrastructure resources defined in template.yaml in this directory.
- template.yaml analysis: Read the file and identify defined infrastructure resources (AWS Lambda, DynamoDB, API Gateway, SQS, SNS, etc.).
- Recommended monitoring items for each resource: Select appropriate monitors (error rates, latency, throttling, queue depth, etc.) for each resource type based on Datadog best practices.
- Query syntax and metric name verification: Before creating monitor definitions, verify that the query syntax and metric names used in Datadog are correct.
- Monitor definition creation: Create Datadog monitors using verified queries.
- Alert notification destination: example@example.com
- Notification template: Please follow best practices.

It identified 8 Lambda functions, 3 DynamoDB tables, and API Gateway from template.yaml and generated 7 monitor definitions.
The workflow involved first checking existing metric names in the environment using the List Metrics tool, then validating each monitor definition with validate_monitor_definition.
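For reference, a metric-alert definition of the kind being validated looks roughly like the following. The function name, thresholds, and options here are illustrative, not the agent's actual output; the notification handle matches the example address used in the prompt:

```python
# Illustrative Datadog metric-alert definition: function name, thresholds,
# and options are examples, not the exact values the agent generated
lambda_error_monitor = {
    "name": "[Lambda] Errors on my-function",
    "type": "metric alert",
    "query": ("sum(last_5m):sum:aws.lambda.errors{functionname:my-function}"
              ".as_count() > 5"),
    "message": ("{{#is_alert}}Lambda errors detected. Check recent "
                "invocations.{{/is_alert}} @example@example.com"),
    "options": {
        "thresholds": {"critical": 5, "warning": 3},
        "notify_no_data": True,
        "no_data_timeframe": 10,
    },
}
print(lambda_error_monitor["query"])
```

Passing a payload like this through validate_monitor_definition before creation is what catches wrong metric names or option combinations.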

Two of the API Gateway monitors actually hit validation errors on new_group_delay, but the agent automatically fixed the options and passed re-validation.
For Lambda Duration monitors, it read the timeout values defined in template.yaml (900 seconds for scan/ai-analysis, 30 seconds for API systems) and set 80-83% of these as thresholds.
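That threshold arithmetic can be sketched as follows. The function labels are shorthand for the article's scan/ai-analysis and API-system Lambdas, and the flat 80% factor is an assumption within the 80-83% range mentioned:

```python
# Lambda timeouts from template.yaml, per the article (seconds)
timeouts_s = {"scan": 900, "ai-analysis": 900, "api": 30}

# Alert when duration exceeds ~80% of the configured timeout
thresholds_ms = {fn: int(t * 0.8 * 1000) for fn, t in timeouts_s.items()}
print(thresholds_ms)  # scan/ai-analysis -> 720000 ms, api -> 24000 ms
```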
The notification templates include response procedures for Alert, Warning, and Recovery states, at a quality ready for real operations.
Creating Monitors
Since monitors can't be created via the MCP server, we'll use the Datadog API.
API and APP keys need to be set as environment variables, so I'll exit Claude Code once and set them from the shell.
export DD_API_KEY=""
export DD_APP_KEY=""
export DD_SITE=datadoghq.com
Resuming with the claude --resume command.

There was a JSON parsing error during the first curl execution, but it checked the response and proceeded with test monitor creation and deletion.
Switching to batch creation with a Python script, it created all 7 monitors.
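A minimal sketch of such a batch-creation script, assuming the DD_API_KEY / DD_APP_KEY / DD_SITE environment variables exported above. The monitor definition shown is illustrative, not one of the seven the agent generated:

```python
import json
import os
import urllib.request

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
API_URL = f"https://api.{DD_SITE}/api/v1/monitor"

# Illustrative definitions; the actual script posted the 7 validated monitors
monitors = [
    {
        "name": "[Lambda] High error count on my-function",
        "type": "metric alert",
        "query": ("sum(last_5m):sum:aws.lambda.errors"
                  "{functionname:my-function}.as_count() > 5"),
        "message": "Errors detected. @example@example.com",
        "tags": ["service:key-inventory"],
        "options": {"thresholds": {"critical": 5}},
    },
]

def create_monitor(definition: dict) -> dict:
    """POST one monitor definition to the Datadog Monitors API (v1)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(definition).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__" and "DD_API_KEY" in os.environ:
    for definition in monitors:
        created = create_monitor(definition)
        print(f"created monitor {created['id']}: {created['name']}")
```

Tagging each monitor (here service:key-inventory) is what makes the status check in the next step straightforward.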
Monitor Status Check
Please tell me the current status of the created monitors in detail

It searched for created monitors with the service:key-inventory tag and retrieved the status of all 7 monitors and 11 groups.
The result shows 10 groups are OK and 1 group is No Data.
Right after creating monitors, seeing "No Data" can cause concern about whether monitoring is working properly.
It's helpful that it explains the reasons for No Data in each monitor.
I confirmed that the monitors are functioning as intended.
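A tag-based status check like this can also be scripted directly against the Monitors search API; here is a minimal sketch, again assuming the API/APP key environment variables (the exact search query syntax is my assumption):

```python
import json
import os
import urllib.parse
import urllib.request

DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")

def build_search_url(query: str) -> str:
    """Build the Monitors search endpoint URL for a given query."""
    return (f"https://api.{DD_SITE}/api/v1/monitor/search?"
            + urllib.parse.urlencode({"query": query}))

def search_monitors(query: str) -> list:
    """Call the Datadog monitor search API (needs DD_API_KEY / DD_APP_KEY)."""
    req = urllib.request.Request(build_search_url(query), headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["monitors"]

if __name__ == "__main__" and "DD_API_KEY" in os.environ:
    for m in search_monitors('tag:"service:key-inventory"'):
        print(m["id"], m["name"], m["status"])
```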


Summary
I tested the official Datadog MCP server with Claude Code in APM investigation and monitor creation scenarios.
For APM investigation, span analysis, bottleneck identification, and root cause estimation were accomplished end-to-end with just natural language instructions.
Compared to traditional investigations that require navigating multiple console screens, the efficiency improvement is tangible.
For monitor creation, it read resources from the CloudFormation template and generated monitor definitions following best practices.
Since Datadog query syntax has unique specifications, it's important to include validation with the validate_monitor_definition tool to prevent hallucination-induced misconfiguration.
Although still in Preview, I clearly felt the potential for AI agents to improve observability operations efficiency.
I'm looking forward to the GA release.
I hope this article has been helpful.