I tried APM investigation and creating recommended monitors from CloudFormation on the official Datadog MCP server

I connected the official Datadog MCP server to Claude Code and tested APM investigation and the creation of recommended monitors based on CloudFormation. This article summarizes the setup procedure and the verification results.
2026.03.09

Hello. I'm Shiina from the Operations department.

Introduction

My long-awaited request for the Preview of the official Datadog MCP server has finally been approved.
With the MCP server, AI agents such as Claude Code can access Datadog observability data directly.
Natural language instructions such as "investigate recent error traces" or "create monitors suitable for this configuration" are enough to drive APM trace analysis, log investigation, and monitor creation.
This article covers the setup procedure and the use cases I tried: APM trace analysis, error investigation, and recommended monitor creation based on a CloudFormation template.

What Is the Datadog MCP Server

The Datadog MCP server acts as a bridge connecting Datadog observability data with AI agents.
From MCP-compatible clients like Claude Code or Cursor, you can directly access Datadog's data and tools such as logs, traces, and monitors.

Key Caveats

Note the following points when using the Preview:

  • Use in production environments is not supported
  • Only Datadog organizations on the allowlist can use it
  • Checking compliance requirements for AI tools is the user's responsibility
  • Datadog may collect information about MCP server usage

Toolsets

The Datadog MCP server supports a mechanism called toolsets.
By enabling only the necessary toolsets, you can reduce context window consumption.

Available toolsets are as follows:

Toolset            Overview
core               Default set handling logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, and notebooks
alerting           Monitor validation, group search, and template retrieval
apm                APM trace analysis, span search, Watchdog insights, performance investigation
dbm                Integration with Database Monitoring
error-tracking     Error Tracking operations
feature-flags      Feature flag creation, listing, updating, and management
llmobs             LLM Observability span search and analysis
product-analytics  Product analytics query operations
networks           Cloud Network Monitoring and Network Device Monitoring analysis
onboarding         Guided setup and configuration of Datadog
security           Code security scanning, security signal and detection result searches
software-delivery  Integration with Software Delivery such as CI Visibility and Test Optimization
synthetics         Synthetic test operations

Usage

To use the server with Claude Code, configure it in ~/.claude.json.
Toolsets are specified as a query parameter on the endpoint URL.

  • When using only the core toolset (default)
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"
    }
  }
}
  • When using the core, apm, and alerting toolsets
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
    }
  }
}
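
The two configurations above differ only in the toolsets query parameter, so the URL can be assembled from a list of toolset names. The following is a minimal Python sketch under that assumption. The helper names build_mcp_url and build_claude_config are hypothetical, and the base URL is the US1 endpoint shown above.

```python
import json

# US1 endpoint taken from the configuration examples above.
BASE_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"


def build_mcp_url(toolsets=None, base_url=BASE_URL):
    """Return the endpoint URL, appending ?toolsets=... only when toolsets are given."""
    if not toolsets:
        # No query parameter: the server falls back to the default core toolset.
        return base_url
    return f"{base_url}?toolsets={','.join(toolsets)}"


def build_claude_config(toolsets=None):
    """Return the mcpServers fragment in the same shape as the JSON examples above."""
    return {
        "mcpServers": {
            "datadog": {"type": "http", "url": build_mcp_url(toolsets)}
        }
    }


# Print the fragment for the core, apm, and alerting toolsets.
print(json.dumps(build_claude_config(["core", "apm", "alerting"]), indent=2))
```

The printed fragment is meant to be merged by hand into the existing ~/.claude.json. The sketch does not write to the file.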

Authentication Settings

The MCP server supports OAuth 2.0 authentication.
Because you authenticate in the browser, there is no need to define API and APP keys.
It is reassuring that no credentials have to be stored in the local environment.

Permission Scopes

The scopes granted by OAuth 2.0 authentication are mainly read-only.
You can read a wide range of data including APM, logs, metrics, dashboards, and monitors.
On the other hand, to perform update operations such as creating monitors or editing dashboards, you need to use the Datadog API with API and APP keys.

Setup

We'll use the US1 site endpoint and enable the core, apm, and alerting toolsets.

1. MCP Server Configuration

  1. Add configuration using the mcp add command.
claude mcp add -s user --transport http datadog "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"

The file has been modified.

Added HTTP MCP server datadog with URL: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting to user config
File modified: /Users/shiina.yuichi/.claude.json

Looking at ~/.claude.json confirms that the configuration has been added.

  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
    },
  2. Check the MCP list.
claude mcp list

If datadog appears in the list, the configuration is complete.

Checking MCP server health...

claude.ai Slack: https://mcp.slack.com/mcp - Connected
claude.ai Gmail: https://gmail.mcp.claude.com/mcp - ! Needs authentication
claude.ai Google Calendar: https://gcal.mcp.claude.com/mcp - Connected
awslabs.aws-documentation-mcp-server: uvx awslabs.aws-documentation-mcp-server@latest - Connected
awslabs.aws-iac-mcp-server: uvx awslabs.aws-iac-mcp-server@latest - Connected
datadog: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting (HTTP) - ! Needs authentication
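
If you want to check the status from a script, the output above can be parsed line by line. The following Python sketch assumes the "name: target - status" layout seen in this sample output. That layout is only inferred from the sample and may differ between Claude Code versions. The function name parse_mcp_list is hypothetical.

```python
def parse_mcp_list(output):
    """Map each server name to its status, based on 'name: target - status' lines."""
    statuses = {}
    for line in output.splitlines():
        # Skip headers and blank lines that do not look like a server entry.
        if ": " not in line or " - " not in line:
            continue
        name, rest = line.split(": ", 1)
        _target, status = rest.rsplit(" - ", 1)
        # Drop the leading "! " marker shown in front of warning states.
        statuses[name.strip()] = status.strip().lstrip("! ").strip()
    return statuses


# Sample lines copied from the output above.
SAMPLE_OUTPUT = """Checking MCP server health...

claude.ai Slack: https://mcp.slack.com/mcp - Connected
datadog: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting (HTTP) - ! Needs authentication
"""

print(parse_mcp_list(SAMPLE_OUTPUT))
```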

2. MCP Server Authentication

  1. Launch Claude Code.
claude
  2. Check with the /mcp command.
    Before authentication, "needs authentication" is displayed.
    User MCPs (/Users/shiina.yuichi/.claude.json)
    awslabs.aws-documentation-mcp-server · connected
    awslabs.aws-iac-mcp-server · connected
    datadog · needs authentication
    pencil · connected

    claude.ai
    claude.ai Gmail · needs authentication
    claude.ai Google Calendar · connected
    claude.ai Slack · connected
  3. Move the cursor to datadog and press Enter.

  4. Select "1. Authenticate" from the details.

  5. Select "Allow" in the Authorize access screen in the browser.

  6. Select your Datadog organization.

  7. Check the access permissions and authorize.

  8. If successful, "Authentication Successful!" will be displayed.

  9. On the Claude side, "Authentication successful. Connected to datadog." is displayed.

  10. Check the status again with the /mcp command.
    Authentication is complete if the datadog status is "connected".

    User MCPs (/Users/shiina.yuichi/.claude.json)
    awslabs.aws-documentation-mcp-server · connected
    awslabs.aws-iac-mcp-server · connected
    datadog · connected
    pencil · connected

    claude.ai
    claude.ai Gmail · needs authentication
    claude.ai Google Calendar · connected
    claude.ai Slack · connected

3. Check Available Tools

Let's ask Claude Code which tools are available with the Datadog MCP server.

What tools can I use with Datadog MCP?

The response shows that 24 tools are available.
They cover a wide range of daily operational tasks such as log searching, trace analysis, and monitor operations.

Verification Scenarios

Using Claude Code (Opus 4.6 model), I verified the following two scenarios:

  1. APM Investigation of a Shopping Web Application

    • Trace analysis, identification of error spans and latency spans, suggestion of recommended actions
  2. Creating Monitors for a Serverless Application

    • Automatic creation of recommended monitors based on an IaC template defining DynamoDB, API Gateway, and Lambda

APM Investigation of a Shopping Web Application

Using the MCP server, I'll analyze APM traces, investigate errors, and identify the causes of latency for a shopping site to which I intentionally added delay logic and random errors.

Trace Analysis

We've received reports of response delays for the "shopping-site" website.
Please analyze recent APM traces.

With just natural language instructions, it automatically worked out the service configuration, searched the spans, and analyzed them.
It detected 8 error spans and identified the top bottlenecks among the latency spans.

Error Span Investigation

It seems that errors occur when adding items to the cart despite items being in stock.
Please identify the cause from spans and logs

It continued the investigation and identified two root causes:

  • Due to random error generation logic, product retrieval fails before reaching the DB, causing cart addition errors
  • In POST /api/cart/add, DB operations take about 19ms in total, but there's a 5.3-second gap in uninstrumented processing causing delays

These are both intentional issues I added to the app, but it identified the root causes from span attributes and trace timelines.

Latency Span Investigation

Some GET /api/products are taking over 1 second.
Please identify the root cause

By comparing against the actual DB query processing times, it detected that the delay was being added intentionally in the application layer.
Furthermore, from the correlation between delay_ms and search_length in the logs, it even estimated the delay logic as "search string length × 200ms".
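
This kind of estimate can be reproduced by fitting a line through the origin to pairs of search_length and delay_ms. The following is a minimal Python sketch of that idea. It is not what the agent actually ran. The sample values are made up for illustration and are not taken from the real logs, and the function name estimate_delay_per_char is hypothetical.

```python
def estimate_delay_per_char(samples):
    """Least-squares slope through the origin for delay_ms = slope * search_length."""
    numerator = sum(length * delay for length, delay in samples)
    denominator = sum(length * length for length, _ in samples)
    if denominator == 0:
        raise ValueError("need at least one sample with a non-zero search_length")
    return numerator / denominator


# Hypothetical (search_length, delay_ms) pairs, NOT values from the actual logs.
SAMPLES = [(3, 600), (5, 1000), (8, 1600)]

print(estimate_delay_per_char(SAMPLES))
```

With real log data the points will not sit exactly on a line, so the slope is only an estimate of the per-character delay.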

Optimization Suggestions

Do you have any suggestions for optimizing this application?

Based on the analysis results, it output optimization suggestions organized by urgency.
They cover application code issues (removing the delay logic and random errors) and architectural improvements (asynchronous log sending and connection pool tuning).
It also presented a comparison of the expected latency of each endpoint after the improvements, which makes it easy to prioritize the fixes.

Creating Monitors for a Serverless Application

I'll design and create recommended monitors for the infrastructure resources defined in a CloudFormation template, then check their status.
Because Datadog's query syntax and metric names are product-specific, AI-generated definitions carry a risk of hallucination.
I'll use the validate_monitor_definition tool to validate the definitions before creating the monitors.

Monitor Design

Please create Datadog monitors for the infrastructure resources defined in template.yaml in the directory.

- Analyze template.yaml: Read the file and identify the defined infrastructure resources (AWS Lambda, DynamoDB, API Gateway, SQS, SNS, etc.).
- Identify recommended monitoring items for each resource: Based on Datadog best practices, select appropriate monitors (error rates, latency, throttling, queue depth, etc.) for each resource type.
- Verify query syntax and metric names: Before creating monitor definitions, verify that the query syntax and metric names used in Datadog are correct.
- Create monitor definitions: Create Datadog monitors using validated queries.
- Alert notification destination: example@example.com
- Notification template: Please follow best practices.

It identified 8 Lambda functions, 3 DynamoDB tables, and API Gateway from template.yaml, and generated 7 monitor definitions.
It first checked which metric names exist in the environment using the List Metrics tool, then validated each monitor definition with validate_monitor_definition.

Validation errors for new_group_delay occurred in 2 API Gateway monitors, but it automatically fixed the options and passed the re-validation.

For the Lambda Duration monitor, it read the timeout values defined in template.yaml (900 seconds for scan/ai-analysis, 30 seconds for the API functions) and set the thresholds to 80-83% of those values.
The notification templates include response procedures for the Alert, Warning, and Recovery states, and in my view they are good enough to use in production as they are.
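
The threshold arithmetic is simple enough to check by hand. The following Python sketch derives a duration threshold from a function timeout, using the timeout values quoted above and 80% as an example ratio. The function name is hypothetical, and the result is in seconds, so convert it to whatever unit the metric in your monitor query uses.

```python
def duration_threshold_seconds(timeout_seconds, percent=80):
    """Return the given percentage of a Lambda timeout as an alert threshold in seconds."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 0 and 100 (exclusive)")
    return timeout_seconds * percent / 100


# Timeout values quoted in the article: 900 s (scan/ai-analysis) and 30 s (API functions).
for timeout in (900, 30):
    print(timeout, duration_threshold_seconds(timeout))
```

Alerting before the timeout is reached leaves time to react before invocations start failing outright.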

Creating Monitors

Since monitors cannot be created via the MCP server, we'll use the Datadog API.
API and APP keys need to be set as environment variables, so exit Claude Code once and set them from the shell.

export DD_API_KEY=""
export DD_APP_KEY=""
export DD_SITE=datadoghq.com

Resume with the claude --resume command.

A JSON parsing error occurred on the first curl execution, but it checked the response and then created and deleted a test monitor.
It then switched to batch creation with a Python script and created all 7 monitors.
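
The script that Claude Code generated is not reproduced here. As a rough sketch of what such a script has to do, the following Python code builds a request for the Datadog "create a monitor" endpoint (POST /api/v1/monitor) using only the standard library. The monitor definition is a made-up example and is not one of the 7 monitors from this verification. The function names and the functionname tag value are hypothetical.

```python
import json
import urllib.request


def build_create_monitor_request(definition, api_key, app_key, site="datadoghq.com"):
    """Build (but do not send) a POST request for the Datadog create-monitor endpoint."""
    url = f"https://api.{site}/api/v1/monitor"
    headers = {
        "Content-Type": "application/json",
        "DD-API-KEY": api_key,
        "DD-APPLICATION-KEY": app_key,
    }
    body = json.dumps(definition).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def create_monitor(request):
    """Send the request and return the parsed JSON response (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Illustrative definition only; the query and names are assumptions, not from the article.
EXAMPLE_MONITOR = {
    "name": "[example] Lambda errors",
    "type": "query alert",
    "query": "sum(last_5m):sum:aws.lambda.errors{functionname:example-function}.as_count() > 0",
    "message": "Lambda errors detected. @example@example.com",
    "tags": ["service:key-inventory"],
}
```

To actually create the monitor, pass the values of DD_API_KEY, DD_APP_KEY, and DD_SITE from the environment to build_create_monitor_request and hand the result to create_monitor. Keeping the request construction separate from the network call makes it possible to inspect the payload before anything is sent.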

Checking Monitors

Please tell me the current status of the created monitors in detail

It searched for created monitors using the service:key-inventory tag and retrieved the status of all 7 monitors and 11 groups.
The result shows 10 groups are OK and 1 group is No Data.

Seeing "No Data" right after creating monitors can make you wonder whether monitoring is working properly.
It is helpful that it explains the reason for No Data for each monitor.
I confirmed that the monitors are functioning as intended.

Summary

I tried combining the official Datadog MCP server with Claude Code for APM investigation and monitor creation scenarios.
For APM investigation, it handled everything from span analysis to bottleneck identification and root cause estimation with just natural language instructions.
This felt more efficient than a conventional investigation, which requires navigating multiple screens in the console.
For monitor creation, it read the resources from the CloudFormation template and generated monitor definitions that follow best practices.
Because Datadog's query syntax is product-specific, validating definitions with the validate_monitor_definition tool is important to prevent misconfiguration caused by hallucination.
I see real potential for AI agents to make observability operations more efficient.

I hope this article is helpful.

References

https://docs.datadoghq.com/bits_ai/mcp_server/
https://docs.datadoghq.com/bits_ai/mcp_server/setup?tab=cursor
https://github.com/datadog-labs/mcp-server
