I implemented disk monitoring without Lambda using Step Functions + JSONata in Kiro which supports Opus 4.6

I implemented disk depletion prediction monitoring without Lambda using Kiro CLI's latest model Claude Opus 4.6, Step Functions SDK integration, and JSONata. It analyzes past trends from CloudWatch Agent metrics to predict disk full situations in advance and sends notifications via SNS. I'll also introduce implementation challenges and the improved JSONata processing capabilities of Opus 4.6.
2026.02.07

When monitoring disk usage on EC2, the straightforward approach is to set up a CloudWatch alarm for each instance. Once the fleet grows to 10 or 20 instances, creating and managing the alarms becomes cumbersome, and alarm settings must be updated whenever instances are added or removed.

CloudWatch alarms also struggle with predictive scenarios such as "the disk will be full in 3 days at this rate," which require complex custom Metric Math configurations.

With support from Kiro running the new Opus 4.6 model, I built a system that predicts disk depletion across many EC2 instances in advance, using only Step Functions SDK integrations and JSONata built-in functions. This article introduces it.

Design Session with Kiro CLI

Kiro CLI allows easy model switching using the /model command.

$ /model

 Press (Up/Down) to navigate - Enter to select model
  auto | 1x credit | Models chosen by task for optimal usage and consistent quality
> claude-opus-4.6 (current) | 2.2x credit | Experimental preview of Claude Opus 4.6
  claude-opus-4.5 | 2.2x credit | The latest Claude Opus model
  claude-sonnet-4.5 | 1.3x credit | The latest Claude Sonnet model
  claude-sonnet-4 | 1.3x credit | Hybrid reasoning and coding for regular use
  claude-haiku-4.5 | 0.4x credit | The latest Claude Haiku model

I selected Claude Opus 4.6, which became available in February 2026. Models with stronger reasoning capabilities are better suited to act as design partners.

https://dev.classmethod.jp/articles/amazon-bedrock-claude-opus-4-5-opus-4-6-check/

The initial requirement was simple:

Monitor HDD usage. Notify via SNS if disk space is likely to be depleted or reach 95% within 72 hours. Implement without Lambda using Step Functions SDK integration and JSONATA.

Through dialogue with Kiro, I refined the plan over multiple iterations.

Evolution Process

v1 (Basic Design): Simple configuration retrieving trend data with GetMetricStatistics, predicting depletion with JSONata, and sending SNS notifications on threshold exceedance.

v2 (Operational Extension): Added tag-based dynamic instance retrieval, parallel processing with Map, CRITICAL/WARNING hierarchy, CloudFormation integration (Express/Standard toggle), and various parameterization.

v3 (Multi-disk/Multi-OS Support): Introduced automatic disk detection with ListMetrics and parallel processing using a double Map structure (instance x disk).

The "Dimensions exact match problem" was a key discovery. CloudWatch's GetMetricStatistics only returns data when dimensions match exactly those used when storing metrics. We solved this by incorporating ListMetrics, enabling support for Windows and Linux environments with multiple attached drives.

Issues Discovered During Testing

Several issues emerged during testing after design and implementation:

  • Ghost disk problem: Detached volumes remain in ListMetrics. Solution: Use RecentlyActive: PT3H to focus only on disks active in the last 3 hours
  • JSONata corner cases: Added defensive logic for insufficient data, negative increments, and division by zero
  • Immediate threshold detection: Changed order to evaluate CRITICAL even with insufficient data if current value exceeds threshold

Architecture

EventBridge Scheduler (rate(1 hour))
    |
    v
Step Functions (EXPRESS / STANDARD switchable)
    |
    +-- EC2 DescribeInstances (tag filter)
    |
    +-- Map (per instance)
    |   +-- CloudWatch ListMetrics (detect all disks)
    |   +-- Map (per disk)
    |       +-- CloudWatch GetMetricStatistics (last 14 days)
    |       +-- JSONata (prediction calculation & evaluation)
    |
    +-- JSONata (generate CRITICAL/WARNING/OK aggregated report)
    |
    +-- Choice -- SNS Publish or normal completion

The double Map structure processes instance x disk combinations in parallel. For 10 instances with 2 disks each, the CloudWatch API cost is about $0.22/month.

Prediction Logic

Instead of average growth rate, we use the largest daily increase from the past 14 days:

1. Get daily average data for the past 14 days (14 data points)
2. Compute the differences between adjacent days and exclude negative increments (disk cleanup, etc.)
3. Take the maximum positive increment
4. daysUntilFull = (100 - current usage) / maximum daily increment

This worst-case evaluation can detect patterns with temporary rapid increases from batch processing.
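
The four steps above can be sketched in Python for readers who want to check the arithmetic outside Step Functions. This is a minimal illustration, not the state machine's JSONata. The function name `days_until_full` and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(daily_averages: Sequence[float]) -> Optional[float]:
    """Worst-case days until 100%, from daily usage averages (oldest first).

    Returns None when there are fewer than two points or no positive increase.
    """
    if len(daily_averages) < 2:
        return None
    # Differences between adjacent days; negative values (cleanups) are dropped.
    increases = [
        later - earlier
        for earlier, later in zip(daily_averages, daily_averages[1:])
        if later - earlier > 0
    ]
    if not increases:
        return None
    current = daily_averages[-1]
    return (100 - current) / max(increases)


# Hypothetical sample: a cleanup on day 3, then a 5-point jump on day 4.
usage = [70.0, 72.0, 68.0, 73.0, 74.0]
print(days_until_full(usage))  # (100 - 74) / 5 = 5.2
```

Dividing by the largest single-day jump rather than the mean is what makes the estimate pessimistic: one batch-driven spike is enough to shorten the predicted time to full.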

Implementation Challenges

There were several gotchas with Step Functions' JSONata + CloudFormation combination:

1. {% %} is mandatory

In Step Functions' JSONata mode, expressions must be enclosed in {% %}. This is clearly stated in AWS official documentation.

# OK: evaluated as a JSONata expression
Output: "{% $states.input.Reservations[].Instances[] %}"

# Bad: without {% %}, the expression is not evaluated
Output: "$states.input.Reservations[].Instances[]"

2. DefinitionSubstitutions type issues

Values passed through DefinitionSubstitutions become strings. For numeric fields (e.g., Period), conversion with $number() is necessary.

DefinitionSubstitutions:
  MetricPeriod: !Ref MetricPeriod

# Within state machine
Period: "{% $number('${MetricPeriod}') %}"

3. $sort doesn't work as expected

In Step Functions' JSONata implementation, $sort's comparison function didn't work as expected for Timestamp string sorting. Resolved by switching to $reduce to get the latest value.

# Using $reduce instead of $sort to get the latest data point
$latest := $reduce($dp, function($acc, $v) {
  $v.Timestamp > $acc.Timestamp ? $v : $acc
})
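
The same "keep the larger timestamp" reduction can be sketched in Python. It assumes the timestamps are ISO 8601 strings in one format and time zone, so that plain string comparison matches chronological order. The sample datapoints are hypothetical and deliberately out of order.

```python
from functools import reduce

# Hypothetical datapoints shaped like GetMetricStatistics output (unordered).
datapoints = [
    {"Timestamp": "2026-02-05T00:00:00Z", "Average": 61.2},
    {"Timestamp": "2026-02-07T00:00:00Z", "Average": 64.8},
    {"Timestamp": "2026-02-06T00:00:00Z", "Average": 63.0},
]

# Keep whichever datapoint has the larger timestamp string.
latest = reduce(
    lambda acc, v: v if v["Timestamp"] > acc["Timestamp"] else acc,
    datapoints,
)
print(latest["Average"])  # 64.8
```

A single pass like this avoids sorting entirely, which is all that is needed when only the newest datapoint matters.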

4. and operator unavailable

In my testing, the and operator caused errors in Step Functions' JSONata when combining a null check with a comparison. Nested ternary operators work instead.

# Bad: error
$a != null and $a <= 3 ? 'WARNING' : 'OK'

# OK: works
$a != null ? ($a <= 3 ? 'WARNING' : 'OK') : 'OK'

5. Single element array issue

JSONata path expressions that produce an array of objects return a bare object, not an array, when the result has exactly one element. This causes type errors when the value is passed to Map Items, so wrap the expression in [...] to force an array.

# Bad: returns a bare object when there is only one element
$states.input.disks.{ 'dimensions': dimensions }

# OK: always an array
[$states.input.disks.{ 'dimensions': dimensions }]

Test Environment and Execution

Test EC2

I created a CloudFormation template that provisions:

  • t4g.nano / EBS 8GB gp3
  • SSM Core + CloudWatch Agent managed policies
  • UserData to auto-start CWAgent with standard settings
  • Monitor:DiskCheck tag

CRITICAL Test Execution

I filled the disk to about 97% using SSM Run Command and manually executed the state machine:

# Fill the disk
aws ssm send-command \
  --instance-ids i-0123456789abcdef0 \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["dd if=/dev/zero of=/var/fill_disk bs=1M count=5600"]'

# Execute state machine
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:ap-northeast-1:xxxxxxxxxxxx:stateMachine:disk-usage-monitor

During testing, MetricPeriod=300 (5 minutes) and LookbackDays=1 were set to reduce data accumulation wait time.

Test Results

{
  "subject": "Disk Usage Alert",
  "timestamp": "2026-02-07T08:47:49.163Z",
  "summary": {
    "totalInstances": 1,
    "totalDisks": 1,
    "criticalCount": 1,
    "warningCount": 0,
    "okCount": 0
  },
  "critical": [
    {
      "instanceId": "i-0123456789abcdef0",
      "instanceName": "disk-usage-monitor-test-env-test",
      "path": "/",
      "device": "nvme0n1p1",
      "fstype": "xfs",
      "currentUsage": 96.1,
      "maxDailyIncrease": 71.46,
      "daysUntilFull": 0.1,
      "worstDay": "2026-02-07"
    }
  ]
}

The CRITICAL evaluation, instance information, and predicted values were all correctly output.

Cost

| Item | Monthly cost (10 instances × 2 disks) |
|---|---|
| CloudWatch API (ListMetrics + GetMetricStatistics) | $0.22 |
| Step Functions Express (state transitions + execution time) | Less than $0.01 |
| EventBridge Scheduler | Within free tier |
| Total | Approximately $0.23/month |

Summary

By combining Step Functions SDK integrations with JSONata, we've achieved a Lambda-less disk depletion prediction monitoring system. Using ListMetrics to automatically detect all disks on each instance, GetMetricStatistics to retrieve trend data, and JSONata for prediction calculations, we've created a versatile system that doesn't depend on specific OS or disk configurations.

While JSONata on Step Functions allows implementing complex logic, it has unique constraints such as the unavailability of the and operator and $sort comparison functions not working as expected. Additionally, when embedding state machine definitions in CloudFormation, you must use the {% %} syntax, and values passed via DefinitionSubstitutions become strings requiring $number() type conversion. There are many issues that documentation alone doesn't fully cover, making implementation complexity non-trivial.

In previous attempts with Kiro, we tried implementing Step Functions with JSONata, but due to challenges with complex expression generation and handling corner cases, we often settled for the proven JSONPath + Lambda approach.

With the latest Opus 4.6 model, the accuracy of generated JSONata expressions has clearly improved. It provided effective support throughout the design, implementation, and testing phases, which shows how much its ability to handle JSONata has grown.
If maintaining CloudWatch alarms or Lambda functions for monitoring purposes is challenging, I recommend trying Step Functions with JSONata implementation supported by Kiro.

Below are the complete plan document and CloudFormation template created with Kiro.

plan.md - Implementation Plan v4
# Disk Usage Monitor Step Functions Implementation Plan v4

## Change History

| Version | Changes |
|---|---|
| v3 | Initial version. All disk auto-detection with ListMetrics, double Map structure |
| v4 | Reflecting implementation differences. Empty array guards, improved prediction logic, MetricPeriod parameter added |

## 1. Overview

Monitor HDD usage custom metrics collected by CloudWatch Agent using Step Functions (SDK integration + JSONata) to detect disk depletion risks and send structured JSON notifications to SNS. Implemented without Lambda.

Monitoring logic:
- Use ListMetrics to automatically detect all disks (Dimension combinations) for each instance
- Identify maximum daily increment from the past 14 days of daily data
- Determine if disk will reach full (100%) within 3 days if that increment continues
- Immediate alert if current value exceeds threshold (default 95%)

## 2. Prerequisites

- CloudWatch Agent configured on target EC2 (namespace: `CWAgent`, metric: `disk_used_percent`)
- Identification tags applied to monitored EC2 instances (e.g., `Monitor:DiskCheck`)
- SNS topic created
- CWAgent Dimensions settings (append_dimensions, etc.) irrelevant. Auto-detected by ListMetrics

## 3. Architecture

EventBridge Scheduler (regular execution: rate(1 hour))
    |
    v
Step Functions (default: EXPRESS / can switch to STANDARD via parameter)
    |
    +-- [1] EC2 DescribeInstances (tag filter + running only)
    |
    +-- [2] Pass/JSONata (format instance IDs/names into array)
    |
    +-- [2.5] Choice (if 0 instances, exit normally)
    |
    +-- [3] Map (parallel: per instance)
    |       |
    |       +-- [3-1] CloudWatch ListMetrics (get all disk Dimension combinations)
    |       |
    |       +-- [3-2] Pass/JSONata (format Dimension combinations into array)
    |       |
    |       +-- [3-2.5] Choice (if 0 disks, end with empty result)
    |       |
    |       +-- [3-3] Map (parallel: per disk)
    |       |       +-- CloudWatch GetMetricStatistics (past 14 days, Period as parameter)
    |       |       +-- Pass/JSONata (calculate maximum increment through all pair comparisons, determine depletion risk)
    |       |
    |       +-- Aggregate instanceId/instanceName/results in Output
    |
    +-- [4] Pass/JSONata (categorize into CRITICAL/WARNING/OK, generate report)
    |
    +-- [5] Choice -- CRITICAL or WARNING exists -- SNS Publish
                   -- none -- normal exit

Express mode outputs to CloudWatch Logs (14-day retention)

## 4. State Machine Definition

| State | Type | Process |
|---|---|---|
| GetInstances | Task (SDK: EC2) | DescribeInstances. Get targets using tag filter + instance-state-name=running |
| ExtractInstanceList | Pass (JSONata) | Format instance IDs and Name tags into array |
| CheckHasInstances | Choice | If 0 instances, go to NoAction. Prevents empty array error in Map |
| CheckEachInstance | Map (parallel) | Execute following for each instance (outer Map). ToleratedFailurePercentage: 100 |
| -- ListDiskMetrics | Task (SDK: CloudWatch) | Get all disk Dimensions for that instance using ListMetrics |
| -- ExtractDimensions | Pass (JSONata) | Format Dimension combinations into array. Return empty array if metrics is null/empty |
| -- CheckHasDisks | Choice | If 0 disks, go to NoDisksFound. Prevents empty array error in Map |
| -- NoDisksFound | Pass | Return empty results and end |
| -- CheckEachDisk | Map (parallel) | Execute following for each disk (inner Map). ToleratedFailurePercentage: 100 |
| ---- GetDiskMetrics | Task (SDK: CloudWatch) | Get past LookbackDays data with GetMetricStatistics. Period is MetricPeriod parameter |
| ---- EvaluateRisk | Pass (JSONata) | Calculate max increment via all pair comparisons, determine depletion risk |
| AggregateReport | Pass (JSONata) | Categorize into CRITICAL/WARNING/OK, generate JSON report |
| HasAlerts | Choice | criticalCount > 0 or warningCount > 0 |
| SendReport | Task (SDK: SNS) | Publish JSON report. Subject: "Disk Usage Alert" |
| NoAction | Succeed | Normal exit |

## 5. CloudWatch API Specifications

### 5.1 ListMetrics (Disk Auto-Detection)

Once per instance. Gets all Dimension combinations for all disks on that instance.

| Parameter | Value |
|---|---|
| Namespace | CWAgent |
| MetricName | disk_used_percent |
| Dimensions | [{"Name": "InstanceId", "Value": "<target ID>"}] |
| RecentlyActive | PT3H |

RecentlyActive: PT3H returns only disks that have sent data within the last 3 hours.

### 5.2 GetMetricStatistics (Trend Data Retrieval)

Once per disk. Specifies the complete Dimensions obtained from ListMetrics as-is.

| Parameter | Value |
|---|---|
| Namespace | CWAgent |
| MetricName | disk_used_percent |
| Period | MetricPeriod parameter (default: 86400 seconds = 1 day) |
| Statistics | ["Average"] |
| StartTime | now - LookbackDays |
| EndTime | now |
| Dimensions | Use ListMetrics response as-is |
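
As a rough illustration of the table above, the following Python sketch assembles the same request arguments. It is not part of the state machine. The helper name `build_get_metric_statistics_args` and the sample dimension set are hypothetical, and only the parameter names are taken from the table.

```python
from datetime import datetime, timedelta, timezone


def build_get_metric_statistics_args(dimensions, lookback_days=14, period=86400, now=None):
    """Build the request arguments described in the table above."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        "Dimensions": dimensions,  # passed through unchanged from ListMetrics
        "StartTime": end - timedelta(days=lookback_days),
        "EndTime": end,
        "Period": int(period),  # must be a number, not a string
        "Statistics": ["Average"],
    }


# Hypothetical dimension set, as ListMetrics might return it for one disk.
dims = [
    {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},
    {"Name": "path", "Value": "/"},
    {"Name": "device", "Value": "nvme0n1p1"},
    {"Name": "fstype", "Value": "xfs"},
]
args = build_get_metric_statistics_args(
    dims, now=datetime(2026, 2, 7, tzinfo=timezone.utc)
)
print(args["StartTime"].isoformat())  # 2026-01-24T00:00:00+00:00
```

Passing the dimension list through untouched is the important part: dropping or reordering nothing keeps the request aligned with how the metric was stored.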

### 5.3 Cost Estimate

For 10 instances x avg. 2 disks/instance:
- ListMetrics: 10 calls/execution
- GetMetricStatistics: 20 calls/execution
- Total: 30 calls/execution x 24 executions/day x 30 days = 21,600 calls/month -> approx. $0.22/month
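
The estimate can be reproduced with a few lines of Python. The unit price of $0.01 per 1,000 requests is an assumption inferred from the totals above, so check current CloudWatch pricing before relying on it.

```python
# Figures from the estimate above.
instances = 10
disks_per_instance = 2
executions_per_day = 24  # rate(1 hour)
days_per_month = 30

# Assumed unit price, inferred from the article's totals; check current pricing.
price_per_1000_requests = 0.01

# One ListMetrics call per instance plus one GetMetricStatistics call per disk.
calls_per_execution = instances + instances * disks_per_instance
calls_per_month = calls_per_execution * executions_per_day * days_per_month
monthly_cost = calls_per_month / 1000 * price_per_1000_requests

print(calls_per_month)         # 21600
print(round(monthly_cost, 2))  # 0.22
```

Because the call count scales with instances × disks × executions, lengthening the schedule interval is the simplest lever for reducing cost on larger fleets.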

## 6. Prediction Logic (JSONata)

### 6.1 Processing Flow

1. Get latest datapoint using $reduce ($sort is unstable in Step Functions' JSONata, so not used)
2. If fewer than 2 datapoints -> unpredictable (INSUFFICIENT_DATA)
3. Calculate maximum increment and its date through all pair comparisons ($map + $reduce)
4. If maximum increment ≤ 0 -> no increasing trend (daysUntilFull: null)
5. daysUntilFull = (100 - currentUsage) / maxDailyIncrease
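
The flow above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a translation of the actual JSONata. The names `max_increase` and `predict` and the sample data are hypothetical. Note that because every earlier/later pair is compared, the largest increase may span more than one period, which makes the result more pessimistic than a strict day-over-day difference.

```python
def max_increase(datapoints):
    """Largest rise between any earlier/later pair, with the later point's date."""
    best = {"delta": 0.0, "date": None}
    for later in datapoints:
        for earlier in datapoints:
            if later["Timestamp"] > earlier["Timestamp"]:
                delta = later["Average"] - earlier["Average"]
                if delta > best["delta"]:
                    best = {"delta": delta, "date": later["Timestamp"][:10]}
    return best


def predict(datapoints):
    """Return the prediction fields; all None when data is insufficient."""
    if len(datapoints) < 2:
        return {"daysUntilFull": None, "maxDailyIncrease": None, "worstDay": None}
    latest = max(datapoints, key=lambda p: p["Timestamp"])
    worst = max_increase(datapoints)
    if worst["delta"] <= 0:
        return {"daysUntilFull": None, "maxDailyIncrease": 0, "worstDay": None}
    return {
        "daysUntilFull": round((100 - latest["Average"]) / worst["delta"], 1),
        "maxDailyIncrease": round(worst["delta"], 2),
        "worstDay": worst["date"],
    }


# Hypothetical sample: a cleanup on 02-05 followed by steady growth.
sample = [
    {"Timestamp": "2026-02-04T00:00:00Z", "Average": 80.0},
    {"Timestamp": "2026-02-05T00:00:00Z", "Average": 78.0},
    {"Timestamp": "2026-02-06T00:00:00Z", "Average": 84.0},
    {"Timestamp": "2026-02-07T00:00:00Z", "Average": 85.0},
]
print(predict(sample))
# {'daysUntilFull': 2.1, 'maxDailyIncrease': 7.0, 'worstDay': '2026-02-07'}
```

In the sample, the largest rise is 7 points between 02-05 and 02-07, a two-day span, which shows how the all-pairs comparison can exceed any single day's change.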

### 6.2 Judgment Criteria

| Condition | Status | Priority |
|---|---|---|
| currentUsage >= thresholdPercent | CRITICAL | 1 (highest) |
| datapoints < 2 | INSUFFICIENT_DATA | 2 |
| daysUntilFull <= predictionDays | WARNING | 3 |
| others | OK | 4 |
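
The priority order in this table can be expressed as a short Python function. It is a sketch for checking the rules by hand. The function name `classify` is hypothetical, and the defaults mirror the ThresholdPercent and PredictionDays parameters.

```python
def classify(current_usage, datapoint_count, days_until_full,
             threshold_percent=95, prediction_days=3):
    """Apply the judgment criteria in priority order."""
    # Priority 1: the threshold check runs first, even with insufficient data.
    if current_usage >= threshold_percent:
        return "CRITICAL"
    # Priority 2: fewer than two datapoints means no trend can be computed.
    if datapoint_count < 2:
        return "INSUFFICIENT_DATA"
    # Priority 3: a prediction exists and falls within the warning window.
    if days_until_full is not None and days_until_full <= prediction_days:
        return "WARNING"
    return "OK"


print(classify(96.1, 1, None))   # CRITICAL
print(classify(87.3, 14, 2.5))   # WARNING
print(classify(60.0, 14, None))  # OK
```

Checking the threshold before the datapoint count is what lets a nearly full disk on a brand-new instance still raise an alert.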

### 6.3 Corner Case Handling

| Case | Behavior |
|---|---|
| New instance (1 or fewer datapoints) | INSUFFICIENT_DATA. Not alert target |
| Disk cleanup days with usage decrease | Consider only positive increments in all pair comparisons |
| Usage flat or decreasing across all days | No positive increments -> daysUntilFull: null -> OK |
| Current value already 100% | CRITICAL (immediate judgment on threshold exceeded, even with insufficient data) |
| No disk metrics for instance | Return empty results (NoDisksFound state) |
| 0 monitored instances | Normal exit at CheckHasInstances |

### 6.4 EvaluateRisk Output Fields

| Field | Type | Description |
|---|---|---|
| path | string | Mount path (e.g., /, /data, C:) |
| device | string | Device name (e.g., nvme0n1p1, C:) |
| fstype | string | File system (e.g., xfs, NTFS) |
| currentUsage | number | Current usage (1 decimal place) |
| maxDailyIncrease | number/null | Maximum daily increment (2 decimal places). Null if insufficient data |
| daysUntilFull | number/null | Days until 100% (1 decimal place). Null if no increasing trend |
| worstDay | string/null | Date of maximum increment (YYYY-MM-DD). Null if no increasing trend |
| status | string | CRITICAL / WARNING / INSUFFICIENT_DATA / OK |

## 7. SNS Notification Message

Subject: Disk Usage Alert

{
  "subject": "Disk Usage Alert",
  "timestamp": "2026-02-07T08:47:49.163Z",
  "summary": {
    "totalInstances": 10,
    "totalDisks": 18,
    "criticalCount": 1,
    "warningCount": 2,
    "okCount": 15
  },
  "critical": [
    {
      "instanceId": "i-0123456789abcdef0",
      "instanceName": "db-server",
      "path": "/data",
      "device": "nvme1n1",
      "fstype": "xfs",
      "currentUsage": 96.2,
      "maxDailyIncrease": 2.8,
      "daysUntilFull": 1.4,
      "worstDay": "2026-02-03"
    }
  ],
  "warning": [
    {
      "instanceId": "i-0abcdef1234567890",
      "instanceName": "app-server",
      "path": "/",
      "device": "nvme0n1p1",
      "fstype": "xfs",
      "currentUsage": 87.3,
      "maxDailyIncrease": 5.1,
      "daysUntilFull": 2.5,
      "worstDay": "2026-02-05"
    }
  ]
}

## 8. CloudFormation Parameters

| Parameter | Default | Description |
|---|---|---|
| StateMachineType | EXPRESS | EXPRESS or STANDARD |
| TagKey | Monitor | Monitoring target tag key |
| TagValue | DiskCheck | Monitoring target tag value |
| ThresholdPercent | 95 | CRITICAL judgment threshold (%) |
| PredictionDays | 3 | WARNING judgment days |
| LookbackDays | 14 | Historical days for trend analysis |
| MetricPeriod | 86400 | Metric aggregation period (seconds). Set to 300 during testing |
| SnsTopicArn | (required) | Notification SNS topic ARN |
| ScheduleExpression | rate(1 hour) | Execution schedule |

template.yaml - CloudFormation Template
AWSTemplateFormatVersion: '2010-09-09'
Description: Disk Usage Monitor - Step Functions (Lambdaless / JSONata)

Parameters:
  StateMachineType:
    Type: String
    Default: EXPRESS
    AllowedValues: [EXPRESS, STANDARD]
  TagKey:
    Type: String
    Default: Monitor
  TagValue:
    Type: String
    Default: DiskCheck
  ThresholdPercent:
    Type: Number
    Default: 95
  PredictionDays:
    Type: Number
    Default: 3
  LookbackDays:
    Type: Number
    Default: 14
  SnsTopicArn:
    Type: String
    AllowedPattern: arn:aws:sns:.+
  MetricPeriod:
    Type: Number
    Default: 86400
    Description: Metric aggregation period in seconds (86400=daily, 300=5min for testing)
  ScheduleExpression:
    Type: String
    Default: rate(1 hour)

Conditions:
  IsExpress: !Equals [!Ref StateMachineType, EXPRESS]

Resources:
  StateMachineLogGroup:
    Type: AWS::Logs::LogGroup
    Condition: IsExpress
    Properties:
      LogGroupName: !Sub /aws/vendedlogs/states/${AWS::StackName}
      RetentionInDays: 14

  StateMachineRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: DiskMonitorPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - cloudwatch:ListMetrics
                  - cloudwatch:GetMetricStatistics
                Resource: '*'
              - Effect: Allow
                Action: sns:Publish
                Resource: !Ref SnsTopicArn
              - !If
                - IsExpress
                - Effect: Allow
                  Action:
                    - logs:CreateLogDelivery
                    - logs:GetLogDelivery
                    - logs:UpdateLogDelivery
                    - logs:DeleteLogDelivery
                    - logs:ListLogDeliveries
                    - logs:PutResourcePolicy
                    - logs:DescribeResourcePolicies
                    - logs:DescribeLogGroups
                  Resource: '*'
                - !Ref AWS::NoValue

  SchedulerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: scheduler.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: InvokeStateMachine
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - states:StartExecution
                  - states:StartSyncExecution
                Resource: !GetAtt StateMachine.Arn

  StateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: !Sub ${AWS::StackName}
      StateMachineType: !Ref StateMachineType
      RoleArn: !GetAtt StateMachineRole.Arn
      LoggingConfiguration: !If
        - IsExpress
        - Level: ALL
          IncludeExecutionData: true
          Destinations:
            - CloudWatchLogsLogGroup:
                LogGroupArn: !GetAtt StateMachineLogGroup.Arn
        - !Ref AWS::NoValue
      DefinitionSubstitutions:
        TagKey: !Ref TagKey
        TagValue: !Ref TagValue
        ThresholdPercent: !Ref ThresholdPercent
        PredictionDays: !Ref PredictionDays
        LookbackDays: !Ref LookbackDays
        MetricPeriod: !Ref MetricPeriod
        SnsTopicArn: !Ref SnsTopicArn
      Definition:
        QueryLanguage: JSONata
        Comment: Disk Usage Monitor - Lambdaless
        StartAt: GetInstances
        States:
          GetInstances:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Arguments:
              Filters:
                - Name: "tag:${TagKey}"
                  Values:
                    - "${TagValue}"
                - Name: instance-state-name
                  Values:
                    - running
            Next: ExtractInstanceList

          ExtractInstanceList:
            Type: Pass
            Output: >-
              {% [$states.input.Reservations[].Instances[].{
                'instanceId': InstanceId,
                'instanceName': Tags[Key='Name'].Value
              }] %}
            Next: CheckHasInstances

          CheckHasInstances:
            Type: Choice
            Choices:
              - Condition: "{% $count($states.input) > 0 %}"
                Next: CheckEachInstance
            Default: NoAction

          CheckEachInstance:
            Type: Map
            Items: "{% $states.input %}"
            MaxConcurrency: 10
            ToleratedFailurePercentage: 100
            ItemProcessor:
              ProcessorConfig:
                Mode: INLINE
              StartAt: ListDiskMetrics
              States:
                ListDiskMetrics:
                  Type: Task
                  Resource: arn:aws:states:::aws-sdk:cloudwatch:listMetrics
                  Arguments:
                    Namespace: CWAgent
                    MetricName: disk_used_percent
                    Dimensions:
                      - Name: InstanceId
                        Value: "{% $states.input.instanceId %}"
                    RecentlyActive: PT3H
                  Output:
                    instanceId: "{% $states.input.instanceId %}"
                    instanceName: "{% $states.input.instanceName %}"
                    metrics: "{% $states.result.Metrics %}"
                  Next: ExtractDimensions

                ExtractDimensions:
                  Type: Pass
                  Output:
                    instanceId: "{% $states.input.instanceId %}"
                    instanceName: "{% $states.input.instanceName %}"
                    disks: >-
                      {% $states.input.metrics
                        ? [$states.input.metrics.{ 'dimensions': Dimensions }]
                        : [] %}
                  Next: CheckHasDisks

                CheckHasDisks:
                  Type: Choice
                  Choices:
                    - Condition: "{% $count($states.input.disks) > 0 %}"
                      Next: CheckEachDisk
                  Default: NoDisksFound

                NoDisksFound:
                  Type: Pass
                  Output:
                    instanceId: "{% $states.input.instanceId %}"
                    instanceName: "{% $states.input.instanceName %}"
                    results: []
                  End: true

                CheckEachDisk:
                  Type: Map
                  Items: >-
                    {% [$states.input.disks.{
                      'dimensions': dimensions,
                      'instanceId': $states.input.instanceId,
                      'instanceName': $states.input.instanceName
                    }] %}
                  MaxConcurrency: 5
                  ToleratedFailurePercentage: 100
                  ItemProcessor:
                    ProcessorConfig:
                      Mode: INLINE
                    StartAt: GetDiskMetrics
                    States:
                      GetDiskMetrics:
                        Type: Task
                        Resource: arn:aws:states:::aws-sdk:cloudwatch:getMetricStatistics
                        Arguments:
                          Namespace: CWAgent
                          MetricName: disk_used_percent
                          Dimensions: "{% $states.input.dimensions %}"
                          StartTime: >-
                            {% $fromMillis($millis() - ${LookbackDays} * 86400000) %}
                          EndTime: "{% $fromMillis($millis()) %}"
                          Period: "{% $number('${MetricPeriod}') %}"
                          Statistics:
                            - Average
                        Output:
                          dimensions: "{% $states.input.dimensions %}"
                          datapoints: "{% $states.result.Datapoints %}"
                        Next: EvaluateRisk

                      EvaluateRisk:
                        Type: Pass
                        Output: >-
                          {% (
                            $dims := $states.input.dimensions;
                            $path := ($dims[Name='path']).Value;
                            $device := ($dims[Name='device']).Value;
                            $fstype := ($dims[Name='fstype']).Value;
                            $dp := $states.input.datapoints;
                            $cnt := $count($dp);
                            $latest := $cnt > 0
                              ? $reduce($dp, function($acc, $v) {
                                  $v.Timestamp > $acc.Timestamp ? $v : $acc
                                })
                              : {'Average': 0};
                            $current := $latest.Average;
                            $prediction := $cnt < 2
                              ? { 'daysUntilFull': null, 'maxDailyIncrease': null, 'worstDay': null }
                              : (
                                $maxes := $map($dp, function($a) {
                                  $reduce($dp, function($best, $b) {
                                    ($a.Timestamp > $b.Timestamp)
                                      ? ( $d := $a.Average - $b.Average;
                                          $d > $best.delta
                                            ? {'delta': $d, 'date': $a.Timestamp}
                                            : $best )
                                      : $best
                                  }, {'delta': 0, 'date': ''})
                                });
                                $worst := $reduce($maxes, function($acc, $m) {
                                  $m.delta > $acc.delta ? $m : $acc
                                });
                                $worst.delta > 0
                                  ? {
                                    'daysUntilFull': $round((100 - $current) / $worst.delta, 1),
                                    'maxDailyIncrease': $round($worst.delta, 2),
                                    'worstDay': $substringBefore($worst.date, 'T')
                                  }
                                  : { 'daysUntilFull': null, 'maxDailyIncrease': 0, 'worstDay': null }
                              );
                            $status := $current >= ${ThresholdPercent}
                              ? 'CRITICAL'
                              : $cnt < 2
                                ? 'INSUFFICIENT_DATA'
                                : $prediction.daysUntilFull != null
                                  ? ($prediction.daysUntilFull <= ${PredictionDays}
                                      ? 'WARNING' : 'OK')
                                  : 'OK';
                            {
                              'path': $path,
                              'device': $device,
                              'fstype': $fstype,
                              'currentUsage': $round($current, 1),
                              'maxDailyIncrease': $prediction.maxDailyIncrease,
                              'daysUntilFull': $prediction.daysUntilFull,
                              'worstDay': $prediction.worstDay,
                              'status': $status
                            }
                          ) %}
                        End: true
                  Output:
                    instanceId: "{% $states.input.instanceId %}"
                    instanceName: "{% $states.input.instanceName %}"
                    results: "{% $states.result %}"
                  End: true
            Next: AggregateReport

          AggregateReport:
            Type: Pass
            Output: >-
              {% (
                $all := $states.input;
                $allDisks := $all.results[];
                {
                  'subject': 'Disk Usage Alert',
                  'timestamp': $now(),
                  'summary': {
                    'totalInstances': $count($all),
                    'totalDisks': $count($allDisks),
                    'criticalCount': $count($filter($allDisks, function($d) { $d.status = 'CRITICAL' })),
                    'warningCount': $count($filter($allDisks, function($d) { $d.status = 'WARNING' })),
                    'okCount': $count($filter($allDisks, function($d) { $d.status = 'OK' }))
                  },
                  'critical': $all.(
                    $inst := $;
                    results[status='CRITICAL'].{
                      'instanceId': $inst.instanceId,
                      'instanceName': $inst.instanceName,
                      'path': path, 'device': device, 'fstype': fstype,
                      'currentUsage': currentUsage,
                      'maxDailyIncrease': maxDailyIncrease,
                      'daysUntilFull': daysUntilFull,
                      'worstDay': worstDay
                    }
                  )[],
                  'warning': $all.(
                    $inst := $;
                    results[status='WARNING'].{
                      'instanceId': $inst.instanceId,
                      'instanceName': $inst.instanceName,
                      'path': path, 'device': device, 'fstype': fstype,
                      'currentUsage': currentUsage,
                      'maxDailyIncrease': maxDailyIncrease,
                      'daysUntilFull': daysUntilFull,
                      'worstDay': worstDay
                    }
                  )[]
                }
              ) %}
            Next: HasAlerts

          HasAlerts:
            Type: Choice
            Choices:
              - Condition: >-
                  {% $states.input.summary.criticalCount > 0
                    or $states.input.summary.warningCount > 0 %}
                Next: SendReport
            Default: NoAction

          SendReport:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:sns:publish
            Arguments:
              TopicArn: "${SnsTopicArn}"
              Subject: "{% $states.input.subject %}"
              Message: "{% $string($states.input) %}"
            End: true

          NoAction:
            Type: Succeed

  Scheduler:
    Type: AWS::Scheduler::Schedule
    Properties:
      Name: !Sub ${AWS::StackName}-scheduler
      ScheduleExpression: !Ref ScheduleExpression
      FlexibleTimeWindow:
        Mode: 'OFF'
      Target:
        Arn: !GetAtt StateMachine.Arn
        RoleArn: !GetAtt SchedulerRole.Arn
        Input: '{}'

Outputs:
  StateMachineArn:
    Value: !GetAtt StateMachine.Arn
  StateMachineType:
    Value: !Ref StateMachineType
  LogGroupName:
    Condition: IsExpress
    Value: !Ref StateMachineLogGroup
