I tried APM investigation and creating recommended monitors from CloudFormation on the official Datadog MCP server


I tried connecting Datadog's official MCP server to Claude Code to test APM investigation and creating recommended monitors based on CloudFormation. I'll share the setup procedure and validation results.
2026.03.09

This page has been translated by machine translation.

Hello. I'm Shiina from the Operations Department.

Introduction

My long-awaited access request for the official Datadog MCP server Preview has finally been approved.
Using the MCP server, AI agents like Claude Code can directly access Datadog observability data.
With natural language instructions like "investigate recent error traces" or "create a monitor suitable for this configuration," it can handle APM trace analysis, log investigation, and monitor creation.
In addition to setup procedures, I tested use cases for APM trace analysis, error investigation, and recommended monitor creation based on CloudFormation templates.

What is the Datadog MCP Server

The Datadog MCP server acts as a bridge connecting Datadog's observability data with AI agents.
It enables direct access to Datadog's data and tools like logs, traces, and monitors from MCP-compatible clients such as Claude Code and Cursor.

Key Disclaimers

When using the Preview, be aware of the following:

  • Production use is not supported
  • Only Datadog organizations on the allowlist can use it
  • Users are responsible for verifying compliance requirements for AI tools
  • Datadog may collect information about MCP server usage

Toolsets

The Datadog MCP server supports a mechanism called toolsets.
By enabling only necessary toolsets, you can reduce context window consumption.

Available toolsets are as follows:

  • core: Default set for handling logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, and notebooks
  • alerting: Monitor validation, group search, and template retrieval
  • apm: APM trace analysis, span search, Watchdog insights, and performance investigation
  • dbm: Integration with Database Monitoring
  • error-tracking: Error Tracking operations
  • feature-flags: Feature flag creation, listing, updates, and management
  • llmobs: LLM Observability span search and analysis
  • product-analytics: Product analytics query operations
  • networks: Cloud Network Monitoring and Network Device Monitoring analysis
  • onboarding: Guided setup and configuration for Datadog
  • security: Code security scanning, security signal and finding searches
  • software-delivery: Integration with Software Delivery, CI Visibility, Test Optimization, etc.
  • synthetics: Synthetic test operations

Usage

When using Claude Code, configure the MCP server in ~/.claude.json.
Toolsets are specified as query parameters for the endpoint URL.

  • For using the core tools (default)
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"
    }
  }
}
  • For using core, apm, and alerting tools
{
  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
    }
  }
}
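The two snippets above differ only in the query string. A small helper can generate either form of the `~/.claude.json` entry (a convenience sketch, not part of the official tooling):

```python
import json

BASE_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"

def datadog_mcp_config(toolsets=None) -> dict:
    """Build the mcpServers entry, appending toolsets as a query parameter when given."""
    url = BASE_URL if not toolsets else f"{BASE_URL}?toolsets={','.join(toolsets)}"
    return {"mcpServers": {"datadog": {"type": "http", "url": url}}}

# Reproduces the second example above
print(json.dumps(datadog_mcp_config(["core", "apm", "alerting"]), indent=2))
```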

Authentication Setup

You can use OAuth 2.0 for MCP server authentication.
Browser authentication is available, so you don't need to define API and APP keys.
This is safer as you don't have to store authentication information in your local environment.

Permission Scopes

The scopes granted with OAuth 2.0 authentication are mainly for read operations.
You can access a wide range of data including APM, logs, metrics, dashboards, and monitors.
However, for update operations like creating monitors or editing dashboards, you need to use the Datadog API with API and APP keys.

Setup

Let's use the US1 site endpoint and enable toolsets (core, apm, alerting).

1. MCP Server Configuration

  1. Add configuration using the mcp add command.
claude mcp add -s user --transport http datadog "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"


Added HTTP MCP server datadog with URL: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting to user config
File modified: /Users/shiina.yuichi/.claude.json

Checking ~/.claude.json, the configuration has been added.

  "mcpServers": {
    "datadog": {
      "type": "http",
      "url": "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting"
    },
  2. Check the MCP list.
claude mcp list

Setup is complete if the datadog configuration appears in the list.

Checking MCP server health...

claude.ai Slack: https://mcp.slack.com/mcp - Connected
claude.ai Gmail: https://gmail.mcp.claude.com/mcp - ! Needs authentication
claude.ai Google Calendar: https://gcal.mcp.claude.com/mcp - Connected
awslabs.aws-documentation-mcp-server: uvx awslabs.aws-documentation-mcp-server@latest - Connected
awslabs.aws-iac-mcp-server: uvx awslabs.aws-iac-mcp-server@latest - Connected
datadog: https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting (HTTP) - ! Needs authentication

2. MCP Server Authentication

  1. Launch Claude Code.
claude
  2. Check with the /mcp command.
    Before authentication, it shows "needs authentication".
    User MCPs (/Users/shiina.yuichi/.claude.json)
    awslabs.aws-documentation-mcp-server · connected
    awslabs.aws-iac-mcp-server · connected
    datadog · needs authentication
    pencil · connected

    claude.ai
    claude.ai Gmail · needs authentication
    claude.ai Google Calendar · connected
    claude.ai Slack · connected
  3. Move the cursor to datadog and press Enter.

  4. Select "1. Authenticate" from the details.

  5. Select "Allow" on the Authorize access screen in the browser.

  6. Select your Datadog organization.

  7. Confirm the access permissions and authorize.

  8. Upon success, "Authentication Successful!" is displayed.

  9. Claude shows "Authentication successful. Connected to datadog."

  10. Check the status again with the /mcp command.
    Authentication is complete if the datadog status is "connected".

    User MCPs (/Users/shiina.yuichi/.claude.json)
    awslabs.aws-documentation-mcp-server · connected
    awslabs.aws-iac-mcp-server · connected
    datadog · connected
    pencil · connected

    claude.ai
    claude.ai Gmail · needs authentication
    claude.ai Google Calendar · connected
    claude.ai Slack · connected

3. Checking Available Tools

Let's check what tools are available with the Datadog MCP server.

What tools can I use with the Datadog MCP?

I found that 24 tools are available.
They cover a wide range of daily operational tasks, including log searches, trace analysis, and monitor operations.

Test Scenarios

Using Claude Code (Opus 4.6 model), I tested the following two scenarios:

  1. APM Investigation of a Shopping Web Application

    • Trace analysis, identifying error spans and latency spans, proposing recommended actions
  2. Creating Monitors for a Serverless Application

    • Auto-creating recommended monitors based on an IaC template defining DynamoDB, API Gateway, and Lambda

APM Investigation of a Shopping Web Application

For a shopping site that intentionally includes delay logic and random errors, I'll analyze APM traces, investigate errors, and identify the causes of delays via the MCP server.

Trace Analysis

There have been reports of response delays on the website "shopping-site".
Please analyze recent APM traces.


With just natural language instructions, it automatically performed service configuration understanding, span searching, and analysis.
It detected 8 error spans and identified the top bottlenecks among the delay spans.

Error Span Investigation

There seem to be errors when adding items to the cart despite inventory being available.
Please identify the cause from spans and logs


The investigation revealed two root causes:

  • Random error generation logic causing product retrieval failures before reaching the DB, resulting in cart addition errors
  • While DB operations in POST /api/cart/add take about 19ms in total, there's a delay of about 5.3 seconds in uninstrumented processing

Although these are intentional issues embedded in the application, it identified the root causes from span attributes and trace timelines.

Delay Span Investigation

GET /api/products takes over 1 second in some cases.
Please identify the root cause


By comparing the DB query processing times, it detected that delays were intentionally added in the application layer.
Furthermore, from the correlation between delay_ms and search_length in logs, it even estimated the delay logic as "search string length × 200ms".
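The inferred relationship can be written out as a quick sanity check. This is a reconstruction of the agent's estimate, not the application's actual code:

```python
DELAY_PER_CHAR_MS = 200  # slope estimated from the delay_ms vs. search_length correlation

def estimated_delay_ms(search: str) -> int:
    """Delay the agent inferred: search string length × 200 ms."""
    return len(search) * DELAY_PER_CHAR_MS

# A 6-character search term would add 1,200 ms, consistent with the >1 s responses observed.
print(estimated_delay_ms("laptop"))  # → 1200
```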

Do you have any suggestions for optimizing this application?


Based on the analysis results, optimization suggestions organized by urgency were provided.
They cover application code issues (removing delay logic and random errors) and architecture improvements (asynchronous log sending and connection pooling optimization).
The suggestions also include expected post-improvement latency for each endpoint, making it easier to prioritize fixes.

Creating Monitors for a Serverless Application

I'll walk through designing and creating recommended monitors for the infrastructure resources defined in a CloudFormation template, then check their status.
Since Datadog query syntax and metric names have unique specifications, AI-generated definitions carry a risk of hallucination.
A validate_monitor_definition tool is available for validation, which I'll make use of.

Monitor Design

Please create Datadog monitors for the infrastructure resources defined in template.yaml in this directory.

- template.yaml analysis: Read the file and identify defined infrastructure resources (AWS Lambda, DynamoDB, API Gateway, SQS, SNS, etc.).
- Recommended monitoring items for each resource: Select appropriate monitoring monitors (error rates, latency, throttling, queue depth, etc.) for each resource type based on Datadog best practices.
- Query syntax and metric name verification: Before creating monitor definitions, verify that the query syntax and metric names used in Datadog are correct.
- Monitor definition creation: Create Datadog monitors using verified queries.
- Alert notification destination: example@example.com
- Notification template: Please follow best practices.

It identified 8 Lambda functions, 3 DynamoDB tables, and API Gateway from template.yaml and generated 7 monitor definitions.
The workflow involved first checking existing metric names in the environment using the List Metrics tool, then validating each monitor definition with validate_monitor_definition.


Validation errors with new_group_delay occurred in two of the API Gateway monitors, but it automatically fixed the options and passed re-validation.

For Lambda Duration monitors, it read the timeout values defined in template.yaml (900 seconds for scan/ai-analysis, 30 seconds for API systems) and set 80-83% of these as thresholds.
The notification templates include response procedures for Alert, Warning, and Recovery states, at a quality ready for real-world operation.
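The threshold arithmetic is easy to verify. A sketch using the timeouts quoted above, assuming the 80% end of the range (the function groupings are illustrative, not names from template.yaml):

```python
# Timeouts from template.yaml as reported in the article (seconds)
LAMBDA_TIMEOUTS_S = {"scan": 900, "ai-analysis": 900, "api": 30}
THRESHOLD_RATIO = 0.8  # the agent used 80-83%; the low end is shown here

def duration_threshold_s(function_group: str) -> int:
    """Alert before a function approaches its configured timeout."""
    return round(LAMBDA_TIMEOUTS_S[function_group] * THRESHOLD_RATIO)

print(duration_threshold_s("scan"))  # → 720
print(duration_threshold_s("api"))   # → 24
```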

Creating Monitors

Since monitors can't be created via the MCP server, we'll use the Datadog API.
API and APP keys need to be set as environment variables, so I'll exit Claude Code once and set them from the shell.

export DD_API_KEY=""
export DD_APP_KEY=""
export DD_SITE=datadoghq.com

Resuming with the claude --resume command.


There was a JSON parsing error during the first curl execution, but it checked the response and proceeded with test monitor creation and deletion.
Switching to batch creation with a Python script, it created all 7 monitors.
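The batch step can be sketched with the Python standard library. The endpoint and key headers follow Datadog's v1 Monitors API and the environment variables set above; monitors.json is a stand-in file name for the validated definitions:

```python
import json
import os
import urllib.request

def build_monitor_request(definition: dict, site: str = "datadoghq.com") -> urllib.request.Request:
    """Build a POST request for the Datadog v1 Monitors API."""
    return urllib.request.Request(
        f"https://api.{site}/api/v1/monitor",
        data=json.dumps(definition).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ.get("DD_API_KEY", ""),
            "DD-APPLICATION-KEY": os.environ.get("DD_APP_KEY", ""),
        },
        method="POST",
    )

if __name__ == "__main__":
    # monitors.json: a list of validated monitor definitions (stand-in file name)
    with open("monitors.json") as f:
        for definition in json.load(f):
            with urllib.request.urlopen(build_monitor_request(definition)) as resp:
                created = json.load(resp)
            print(f"created {created['id']}: {created['name']}")
```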

Monitor Status Check

Please tell me the current status of the created monitors in detail


It searched for created monitors with the service:key-inventory tag and retrieved the status of all 7 monitors and 11 groups.
The result shows 10 groups are OK and 1 group is No Data.

Right after creating monitors, seeing "No Data" can cause concern about whether monitoring is working properly.
It's helpful that it explains the reasons for No Data in each monitor.
I confirmed that the monitors are functioning as intended.

Summary

I tested the official Datadog MCP server with Claude Code in APM investigation and monitor creation scenarios.
For APM investigation, span analysis, bottleneck identification, and root cause estimation were accomplished end-to-end with just natural language instructions.
Compared to traditional investigations that require navigating multiple console screens, the efficiency improvement is tangible.
For monitor creation, it read resources from the CloudFormation template and generated monitor definitions following best practices.
Since Datadog query syntax has unique specifications, it's important to include validation with the validate_monitor_definition tool to prevent hallucination-induced misconfiguration.
Although still in Preview, I clearly felt the potential for AI agents to improve observability operations efficiency.
I'm looking forward to the GA release.

I hope this article has been helpful.

References

https://docs.datadoghq.com/bits_ai/mcp_server/
https://docs.datadoghq.com/bits_ai/mcp_server/setup?tab=cursor
https://github.com/datadog-labs/mcp-server
