Visualizing the Incident Investigation Thought Process Using Datadog MCP: Creating Skills

I created a skill that visualizes the thought process behind incident investigations by combining the Datadog MCP server with Claude Code Skills.

Shiina Yuichi

2026.03.25


Hello. I am Shiina from the Operations department.

Introduction

Recently, the Datadog MCP server became generally available, and many of you are probably already using it for incident investigations.

While the MCP server lets the AI obtain and analyze telemetry data directly, it is hard to see how the AI reached its conclusions.

To share findings with the team and allow humans to verify AI judgments, I felt we needed a mechanism to record and visualize the investigation process itself.

So I used Claude Code Skills to create a skill that records the investigation process as a report while cycling through the "observation → hypothesis formulation → verification" loop.

I tried conducting an actual investigation using this skill, so I'd like to introduce it.

Overview of the Skill

This is an incident investigation support skill that uses the Datadog MCP server.

It narrows down causes by repeating the "observation → hypothesis formulation → verification" loop and visualizes the entire investigation process as an interactive HTML report.
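
The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the skill's actual implementation (the skill itself is defined in SKILL.md and executed by Claude Code). The names `Hypothesis`, `investigate`, and the three stub callables are hypothetical, and the stubs return placeholder values instead of calling Datadog.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    # The three statuses named in this article.
    VALIDATED = "Validated"
    INCONCLUSIVE = "Inconclusive"
    INVALIDATED = "Invalidated"


@dataclass
class Hypothesis:
    statement: str
    status: Status = Status.INCONCLUSIVE
    evidence: List[str] = field(default_factory=list)


def investigate(observe, hypothesize, verify, max_rounds=5):
    """Repeat observe -> hypothesize -> verify until one hypothesis is validated.

    Returns every hypothesis that was checked, in order, so the whole
    process (including rejected hypotheses) can be reported afterwards.
    """
    log = []
    observations = []
    for _ in range(max_rounds):
        observations.append(observe())
        for hyp in hypothesize(observations):
            hyp.status, note = verify(hyp)
            hyp.evidence.append(note)
            log.append(hyp)
            if hyp.status is Status.VALIDATED:
                return log
    return log


# Stub callables with placeholder data (assumptions, no Datadog access).
def observe():
    return "placeholder observation"


def hypothesize(observations):
    return [Hypothesis("Hypothesis A"), Hypothesis("Hypothesis B")]


def verify(hyp):
    if hyp.statement == "Hypothesis B":
        return Status.VALIDATED, "stub: supporting evidence"
    return Status.INVALIDATED, "stub: no supporting evidence"


log = investigate(observe, hypothesize, verify)
for hyp in log:
    print(hyp.status.value, "-", hyp.statement)
```

Keeping the rejected hypotheses in the returned log, rather than only the final answer, is what lets a report show why each candidate cause was ruled out.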

Main Features

Supports root cause identification through "observation → hypothesis formulation → verification" in incident investigations
Evidence collection using Datadog MCP tools (metrics, logs, APM, events, etc.)
Management and visualization of hypothesis status (Validated / Inconclusive / Invalidated)
Automatic generation of interactive HTML investigation reports
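
To give a feel for the last two features, here is a minimal Python sketch that turns a list of hypothesis records into a standalone HTML page with one collapsible block per hypothesis. It is an assumption-laden illustration: the function name `render_report`, the record fields, and the color mapping are hypothetical, and the sketch does not use the repository's `investigation-report.html` template.

```python
from html import escape

# Hypothetical color mapping for the three statuses named above.
STATUS_COLORS = {
    "Validated": "green",
    "Inconclusive": "gray",
    "Invalidated": "red",
}


def render_report(title, hypotheses):
    """Return an HTML page with one <details> block per hypothesis."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head><meta charset='utf-8'>",
        f"<title>{escape(title)}</title></head><body>",
        f"<h1>{escape(title)}</h1>",
    ]
    for hyp in hypotheses:
        status = hyp["status"]
        if status not in STATUS_COLORS:
            raise ValueError(f"unknown status: {status}")
        # Escape all free text so evidence snippets cannot break the markup.
        items = "".join(f"<li>{escape(e)}</li>" for e in hyp["evidence"])
        parts.append(
            f"<details><summary>{escape(hyp['statement'])} "
            f"<strong style='color:{STATUS_COLORS[status]}'>[{status}]</strong>"
            f"</summary><ul>{items}</ul></details>"
        )
    parts.append("</body></html>")
    return "\n".join(parts)


# Placeholder records, not real investigation results.
page = render_report(
    "Example investigation",
    [
        {"statement": "Hypothesis A", "status": "Invalidated",
         "evidence": ["stub note <1>"]},
        {"statement": "Hypothesis B", "status": "Validated",
         "evidence": ["stub note 2", "stub note 3"]},
    ],
)
print(page)
```

The `<details>` element gives click-to-expand behavior without any JavaScript, which keeps the output a single file that is easy to share.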

SKILL.md

Please refer to the repository below.
https://github.com/shiina-yuichi/dd-mcp-investigator/blob/main/.claude/skills/incident-investigation/SKILL.md

Setup

Prerequisites

Claude Code CLI installed
Datadog MCP server configured

Please also refer to the following article for setup instructions.

https://dev.classmethod.jp/articles/datadog-mcp-server-apm-cloudformation-monitor/

Skill Installation

Clone the repository into any directory.
git clone https://github.com/shiina-yuichi/dd-mcp-investigator.git
The file structure will be as follows.
dd-mcp-investigator/
├── .claude/
│   └── skills/
│       └── incident-investigation/
│           ├── SKILL.md                        # Skill definition (auto-detected by Claude Code)
│           └── templates/
│               └── investigation-report.html   # HTML report template
├── CLAUDE.md                                   # Project config
└── README.md
Navigate to the directory and launch Claude Code.

SKILL.md placed under .claude/skills/ will be automatically detected by Claude Code.
cd dd-mcp-investigator
claude
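
If the skill does not show up later, it can help to confirm that the files landed where the tree above says they should. The following Python sketch only inspects the directory layout. It does not reproduce how Claude Code detects skills, and the function name `list_project_skills` is hypothetical.

```python
from pathlib import Path


def list_project_skills(project_root):
    """Return the names of directories under .claude/skills that contain a SKILL.md."""
    skills_dir = Path(project_root) / ".claude" / "skills"
    if not skills_dir.is_dir():
        return []
    return sorted(p.parent.name for p in skills_dir.glob("*/SKILL.md"))


# Check the current directory (run this from the repository root).
print(list_project_skills("."))
```

Run from the root of the cloned repository, the list should include incident-investigation if the layout matches the file structure shown above.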
Confirm the skill registration.

Use the /skills command to display the skill list and confirm that incident-investigation is displayed.
  Skills
  1 skills

  Project skills (.claude/skills)
  incident-investigation · ~40 description tokens

Usage

Call the skill with the /incident-investigation command and request an investigation in natural language.

⏺ Please tell me about the incident you want to investigate. Providing any of the following information will help me start the investigation:

  1. Monitor alert — monitor name or ID
  2. Incident — incident ID or summary
  3. User report — problem symptoms, affected service name, time of occurrence, etc.

  What problem would you like to investigate?

Trying It Out

I tried the following two incident investigations using Claude Code (Opus 4.6 model):
User report (APM latency investigation)
Monitor alert

APM Latency Investigation

Instructions
Investigate around 3/18 14:30 (JST)
The API request "GET /api/products" on the website "shopping-site" has some requests taking over 1 second.
Please identify the root cause
Output Results



A complete incident investigation was performed using the Datadog MCP server.

Let's check the file output in HTML format.
Report

Evidence

Steps

Evaluation

The skill narrowed down the cause by repeating the verification loop across several hypotheses.

The report visualizes the investigation process, including which observational data was examined and how each judgment was made.

In this investigation, hypotheses such as a computational explosion in regex or full-text search were considered, but sufficient evidence was found for blocking processing at the application layer, and that became the conclusion.

Monitor Alert Investigation

Instructions
⏺ Please tell me about the incident or alert you want to investigate. Any of the following information will help me start the investigation:

  - Monitor alert: monitor name, ID, or alert content
  - Incident: incident ID, title, or description
  - User report: problem symptoms, affected services or host names

  What problem would you like to investigate?
For investigating a monitor alert, I specified a monitor ID.
Investigate monitor ID XXXXX
Output Results

This also completed a full incident investigation using the Datadog MCP server.

Let's check the file output in HTML format.
Report

Evidence

Steps

Evaluation

After the metrics had been examined thoroughly, the decisive evidence was found in the logs.

The evaluation also shows why certain hypotheses were rejected.

Conclusion

I created a skill that visualizes the thought process of incident investigations by combining Claude Code Skills and the Datadog MCP server.

By cycling through the "observation → hypothesis formulation → verification" loop, the skill makes it possible to track what data the AI examined and how it made its judgments, so humans can verify that its conclusions are valid.

Additionally, saving it as a report can be useful for sharing knowledge within the team and for post-incident reviews.

As MCP servers make it more common to delegate incident investigations to AI, I believe visualizing and recording the investigation process will become increasingly important.
I hope this article is helpful to you.