I implemented disk monitoring without Lambda using Step Functions + JSONata, with help from Kiro and its newly supported Opus 4.6
When monitoring disk usage on EC2, the straightforward approach is to set up CloudWatch alarms for each instance. Once the fleet grows to 10 or 20 instances, however, creating and managing those alarms becomes cumbersome, and alarm settings have to be updated every time an instance is added or removed.
Additionally, CloudWatch alarms present challenges when trying to predict scenarios like "disk will be full in 3 days at this rate," requiring complex custom Metric Math configurations.
With support from Kiro running the new Opus 4.6 model, I took the opportunity to build a system that predicts disk depletion in advance across many EC2 instances, using only Step Functions SDK integrations and JSONata built-in functions. This post introduces that system.
Design Session with Kiro CLI
Kiro CLI allows easy model switching using the /model command.
$ /model
Press (Up/Down) to navigate - Enter to select model
auto | 1x credit | Models chosen by task for optimal usage and consistent quality
> claude-opus-4.6 (current) | 2.2x credit | Experimental preview of Claude Opus 4.6
claude-opus-4.5 | 2.2x credit | The latest Claude Opus model
claude-sonnet-4.5 | 1.3x credit | The latest Claude Sonnet model
claude-sonnet-4 | 1.3x credit | Hybrid reasoning and coding for regular use
claude-haiku-4.5 | 0.4x credit | The latest Claude Haiku model
I selected Claude Opus 4.6, which became available in February 2026. Models with higher reasoning capabilities are better suited as design partners.
The initial requirement was simple:
Monitor HDD usage. Notify via SNS if disk space is likely to be depleted or reach 95% within 72 hours. Implement without Lambda using Step Functions SDK integration and JSONATA.
Through dialogue with Kiro, I refined the plan over multiple iterations.
Evolution Process
v1 (Basic Design): Simple configuration retrieving trend data with GetMetricStatistics, predicting depletion with JSONata, and sending SNS notifications on threshold exceedance.
v2 (Operational Extension): Added tag-based dynamic instance retrieval, parallel processing with Map, CRITICAL/WARNING hierarchy, CloudFormation integration (Express/Standard toggle), and various parameterization.
v3 (Multi-disk/Multi-OS Support): Introduced automatic disk detection with ListMetrics and parallel processing using a double Map structure (instance x disk).
The "Dimensions exact match problem" was a key discovery. CloudWatch's GetMetricStatistics only returns data when dimensions match exactly those used when storing metrics. We solved this by incorporating ListMetrics, enabling support for Windows and Linux environments with multiple attached drives.
Issues Discovered During Testing
Several issues emerged during testing after design and implementation:
- Ghost disk problem: Detached volumes remain in ListMetrics. Solution: use RecentlyActive: PT3H to consider only disks that were active in the last 3 hours
- JSONata corner cases: Added defensive logic for insufficient data, negative increments, and division by zero (a minimal sketch follows this list)
- Immediate threshold detection: Changed the evaluation order so that CRITICAL is raised when the current value already exceeds the threshold, even with insufficient data
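For illustration, here is a minimal sketch of those guards in plain JSONata, using hypothetical datapoints and variable names (the actual EvaluateRisk expression in the template below is more involved):
(
  /* Hypothetical inputs for illustration */
  $dp := [
    {'Timestamp': '2026-02-05T00:00:00Z', 'Average': 80.0},
    {'Timestamp': '2026-02-06T00:00:00Z', 'Average': 79.0}   /* cleanup day: usage decreased */
  ];
  $current := $dp[-1].Average;
  $cnt := $count($dp);

  /* Guard 1: fewer than 2 datapoints -> prediction impossible */
  $cnt < 2
    ? {'status': 'INSUFFICIENT_DATA', 'daysUntilFull': null}
    : (
        /* Guard 2: negative increments (disk cleanup) do not count as growth */
        $inc := $dp[-1].Average - $dp[0].Average;
        $maxInc := $inc > 0 ? $inc : 0;
        /* Guard 3: no increasing trend -> return null instead of dividing by zero */
        $maxInc > 0
          ? {'status': 'OK', 'daysUntilFull': $round((100 - $current) / $maxInc, 1)}
          : {'status': 'OK', 'daysUntilFull': null}
      )
)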
Architecture
EventBridge Scheduler (rate(1 hour))
|
v
Step Functions (EXPRESS / STANDARD switchable)
|
+-- EC2 DescribeInstances (tag filter)
|
+-- Map (per instance)
| +-- CloudWatch ListMetrics (detect all disks)
| +-- Map (per disk)
| +-- CloudWatch GetMetricStatistics (last 14 days)
| +-- JSONata (prediction calculation & evaluation)
|
+-- JSONata (generate CRITICAL/WARNING/OK aggregated report)
|
+-- Choice -- SNS Publish or normal completion
The double Map structure processes instance x disk combinations in parallel. Cost is about $0.22/month for 10 instances with 2 disks each.
Prediction Logic
Instead of average growth rate, we use the largest daily increase from the past 14 days:
1. Get daily average data for the past 14 days (14 data points)
2. Exclude negative increments (disk cleanup, etc.) from adjacent day differences
3. Get the maximum positive increment
4. daysUntilFull = (100 - current usage) / maximum daily increment
This worst-case evaluation can detect patterns with temporary rapid increases from batch processing.
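As a simplified illustration in plain JSONata, using hypothetical daily averages (the EvaluateRisk expression in the template below compares all datapoint pairs rather than only adjacent days, and adds the guards shown earlier):
(
  /* Hypothetical daily averages in percent used, oldest first */
  $usage := [70.0, 71.5, 71.0, 78.2, 79.0];
  $current := $usage[-1];

  /* Adjacent-day differences, keeping only positive increments */
  $increments := $filter(
    $map([1..$count($usage) - 1], function($i) { $usage[$i] - $usage[$i - 1] }),
    function($d) { $d > 0 }
  );

  /* Worst observed single-day increase: about 7.2 here */
  $maxInc := $max($increments);

  /* (100 - 79.0) / 7.2 ~= 2.9 days until full */
  {'currentUsage': $current, 'maxDailyIncrease': $maxInc, 'daysUntilFull': $round((100 - $current) / $maxInc, 1)}
)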
Implementation Challenges
There were several gotchas with Step Functions' JSONata + CloudFormation combination:
1. {% %} is mandatory
In Step Functions' JSONata mode, expressions must be enclosed in {% %}. This is clearly stated in AWS official documentation.
# OK: correct
Output: "{% $states.input.Reservations[].Instances[] %}"
# NG: doesn't work
Output: "$states.input.Reservations[].Instances[]"
2. DefinitionSubstitutions type issues
Values passed through DefinitionSubstitutions become strings. For numeric fields (e.g., Period), conversion with $number() is necessary.
DefinitionSubstitutions:
MetricPeriod: !Ref MetricPeriod
# Within state machine
Period: "{% $number('${MetricPeriod}') %}"
3. $sort doesn't work as expected
In Step Functions' JSONata implementation, $sort's comparison function didn't work as expected for Timestamp string sorting. Resolved by switching to $reduce to get the latest value.
# Using $reduce instead of $sort to get the latest data point
$latest := $reduce($dp, function($acc, $v) {
$v.Timestamp > $acc.Timestamp ? $v : $acc
})
4. and operator unavailable
The and operator causes errors in Step Functions' JSONata. Use nested ternary operators instead.
# NG: error
$a != null and $a <= 3 ? 'WARNING' : 'OK'
# OK: works
$a != null ? ($a <= 3 ? 'WARNING' : 'OK') : 'OK'
5. Single element array issue
JSONata expressions generating object arrays sometimes return a single object (not an array) when there's only one element. This causes type errors when passed to Map Items, so explicitly make it an array with [...].
# NG: becomes an object with 1 element
$states.input.disks.{ 'dimensions': dimensions }
# OK: always an array
[$states.input.disks.{ 'dimensions': dimensions }]
Test Environment and Execution
Test EC2
I created a CloudFormation template for the test environment that provisions:
- t4g.nano / EBS 8GB gp3
- SSM Core + CloudWatch Agent managed policies
- UserData to auto-start CWAgent with standard settings
- Monitor:DiskCheck tag
CRITICAL Test Execution
Filled the disk to 97% using SSM RunCommand and manually executed the state machine:
# Fill the disk
aws ssm send-command \
--instance-ids i-0123456789abcdef0 \
--document-name "AWS-RunShellScript" \
--parameters 'commands=["dd if=/dev/zero of=/var/fill_disk bs=1M count=5600"]'
# Execute state machine
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:ap-northeast-1:xxxxxxxxxxxx:stateMachine:disk-usage-monitor
During testing, MetricPeriod=300 (5 minutes) and LookbackDays=1 were set to reduce data accumulation wait time.
Test Results
{
"subject": "Disk Usage Alert",
"timestamp": "2026-02-07T08:47:49.163Z",
"summary": {
"totalInstances": 1,
"totalDisks": 1,
"criticalCount": 1,
"warningCount": 0,
"okCount": 0
},
"critical": [
{
"instanceId": "i-0123456789abcdef0",
"instanceName": "disk-usage-monitor-test-env-test",
"path": "/",
"device": "nvme0n1p1",
"fstype": "xfs",
"currentUsage": 96.1,
"maxDailyIncrease": 71.46,
"daysUntilFull": 0.1,
"worstDay": "2026-02-07"
}
]
}
The CRITICAL evaluation, instance information, and predicted values were all correctly output.
Cost
| Item | Monthly (10 instances × 2 disks) |
|---|---|
| CloudWatch API (ListMetrics + GetMetricStatistics) | $0.22 |
| Step Functions Express (state transitions + execution time) | Less than $0.01 |
| EventBridge Scheduler | Within free tier |
| Total | Approximately $0.23/month |
Summary
By combining Step Functions SDK integrations with JSONata, we've achieved a Lambda-less disk depletion prediction monitoring system. Using ListMetrics to automatically detect all disks on each instance, GetMetricStatistics to retrieve trend data, and JSONata for prediction calculations, we've created a versatile system that doesn't depend on specific OS or disk configurations.
While JSONata on Step Functions allows implementing complex logic, it has its own constraints, such as the and operator being unavailable and $sort comparison functions not working as expected. Additionally, when embedding state machine definitions in CloudFormation, you must use the {% %} syntax, and values passed via DefinitionSubstitutions become strings that require $number() conversion. Many of these issues aren't fully covered by the documentation, so the implementation effort is non-trivial.
In previous attempts with Kiro, we tried implementing Step Functions with JSONata, but due to challenges with complex expression generation and handling corner cases, we often settled for the proven JSONPath + Lambda approach.
With the latest Opus 4.6 model, JSONata expression generation accuracy has clearly improved, providing effective support across the design, implementation, and testing phases and making it possible to take full advantage of JSONata's processing power.
If maintaining CloudWatch alarms or Lambda functions for monitoring purposes is challenging, I recommend trying Step Functions with JSONata implementation supported by Kiro.
Below are the complete plan document and CloudFormation template created with Kiro.
plan.md - Implementation Plan v4
# Disk Usage Monitor Step Functions Implementation Plan v4
## Change History
| Version | Changes |
|---|---|
| v3 | Initial version. All disk auto-detection with ListMetrics, double Map structure |
| v4 | Reflecting implementation differences. Empty array guards, improved prediction logic, MetricPeriod parameter added |
## 1. Overview
Monitor HDD usage custom metrics collected by CloudWatch Agent using Step Functions (SDK integration + JSONata) to detect disk depletion risks and send structured JSON notifications to SNS. Implemented without Lambda.
Monitoring logic:
- Use ListMetrics to automatically detect all disks (Dimension combinations) for each instance
- Identify maximum daily increment from the past 14 days of daily data
- Determine if disk will reach full (100%) within 3 days if that increment continues
- Immediate alert if current value exceeds threshold (default 95%)
## 2. Prerequisites
- CloudWatch Agent configured on target EC2 (namespace: `CWAgent`, metric: `disk_used_percent`)
- Identification tags applied to monitored EC2 instances (e.g., `Monitor:DiskCheck`)
- SNS topic created
- CWAgent Dimensions settings (append_dimensions, etc.) don't matter; disks are auto-detected via ListMetrics
## 3. Architecture
EventBridge Scheduler (regular execution: rate(1 hour))
|
v
Step Functions (default: EXPRESS / can switch to STANDARD via parameter)
|
+-- [1] EC2 DescribeInstances (tag filter + running only)
|
+-- [2] Pass/JSONata (format instance IDs/names into array)
|
+-- [2.5] Choice (if 0 instances, exit normally)
|
+-- [3] Map (parallel: per instance)
| |
| +-- [3-1] CloudWatch ListMetrics (get all disk Dimension combinations)
| |
| +-- [3-2] Pass/JSONata (format Dimension combinations into array)
| |
| +-- [3-2.5] Choice (if 0 disks, end with empty result)
| |
| +-- [3-3] Map (parallel: per disk)
| | +-- CloudWatch GetMetricStatistics (past 14 days, Period as parameter)
| | +-- Pass/JSONata (calculate maximum increment through all pair comparisons, determine depletion risk)
| |
| +-- Aggregate instanceId/instanceName/results in Output
|
+-- [4] Pass/JSONata (categorize into CRITICAL/WARNING/OK, generate report)
|
+-- [5] Choice -- CRITICAL or WARNING exists -- SNS Publish
-- none -- normal exit
Express mode outputs to CloudWatch Logs (14-day retention)
## 4. State Machine Definition
| State | Type | Process |
|---|---|---|
| GetInstances | Task (SDK: EC2) | DescribeInstances. Get targets using tag filter + instance-state-name=running |
| ExtractInstanceList | Pass (JSONata) | Format instance IDs and Name tags into array |
| CheckHasInstances | Choice | If 0 instances, go to NoAction. Prevents empty array error in Map |
| CheckEachInstance | Map (parallel) | Execute following for each instance (outer Map). ToleratedFailurePercentage: 100 |
| -- ListDiskMetrics | Task (SDK: CloudWatch) | Get all disk Dimensions for that instance using ListMetrics |
| -- ExtractDimensions | Pass (JSONata) | Format Dimension combinations into array. Return empty array if metrics is null/empty |
| -- CheckHasDisks | Choice | If 0 disks, go to NoDisksFound. Prevents empty array error in Map |
| -- NoDisksFound | Pass | Return empty results and end |
| -- CheckEachDisk | Map (parallel) | Execute following for each disk (inner Map). ToleratedFailurePercentage: 100 |
| ---- GetDiskMetrics | Task (SDK: CloudWatch) | Get past LookbackDays data with GetMetricStatistics. Period is MetricPeriod parameter |
| ---- EvaluateRisk | Pass (JSONata) | Calculate max increment via all pair comparisons, determine depletion risk |
| AggregateReport | Pass (JSONata) | Categorize into CRITICAL/WARNING/OK, generate JSON report |
| HasAlerts | Choice | criticalCount > 0 or warningCount > 0 |
| SendReport | Task (SDK: SNS) | Publish JSON report. Subject: "Disk Usage Alert" |
| NoAction | Succeed | Normal exit |
## 5. CloudWatch API Specifications
### 5.1 ListMetrics (Disk Auto-Detection)
Once per instance. Gets all Dimension combinations for all disks on that instance.
| Parameter | Value |
|---|---|
| Namespace | CWAgent |
| MetricName | disk_used_percent |
| Dimensions | [{"Name": "InstanceId", "Value": "<target ID>"}] |
| RecentlyActive | PT3H |
RecentlyActive: PT3H returns only disks that have sent data within the last 3 hours.
### 5.2 GetMetricStatistics (Trend Data Retrieval)
Once per disk. Specifies the complete Dimensions obtained from ListMetrics as-is.
| Parameter | Value |
|---|---|
| Namespace | CWAgent |
| MetricName | disk_used_percent |
| Period | MetricPeriod parameter (default: 86400 seconds = 1 day) |
| Statistics | ["Average"] |
| StartTime | now - LookbackDays |
| EndTime | now |
| Dimensions | Use ListMetrics response as-is |
### 5.3 Cost Estimate
For 10 instances x avg. 2 disks/instance:
- ListMetrics: 10 calls/execution
- GetMetricStatistics: 20 calls/execution
- Total: 30 calls/execution x 24 executions/day x 30 days = 21,600 calls/month -> approx. $0.22/month
## 6. Prediction Logic (JSONata)
### 6.1 Processing Flow
1. Get latest datapoint using $reduce ($sort is unstable in Step Functions' JSONata, so not used)
2. If fewer than 2 datapoints -> unpredictable (INSUFFICIENT_DATA)
3. Calculate maximum increment and its date through all pair comparisons ($map + $reduce)
4. If maximum increment ≤ 0 -> no increasing trend (daysUntilFull: null)
5. daysUntilFull = (100 - currentUsage) / maxDailyIncrease
### 6.2 Judgment Criteria
| Condition | Status | Priority |
|---|---|---|
| currentUsage >= thresholdPercent | CRITICAL | 1 (highest) |
| datapoints < 2 | INSUFFICIENT_DATA | 2 |
| daysUntilFull <= predictionDays | WARNING | 3 |
| others | OK | 4 |
### 6.3 Corner Case Handling
| Case | Behavior |
|---|---|
| New instance (1 or fewer datapoints) | INSUFFICIENT_DATA. Not alert target |
| Disk cleanup days with usage decrease | Consider only positive increments in all pair comparisons |
| Usage flat or decreasing across all days | No positive increments -> daysUntilFull: null -> OK |
| Current value already 100% | CRITICAL (immediate judgment on threshold exceeded, even with insufficient data) |
| No disk metrics for instance | Return empty results (NoDisksFound state) |
| 0 monitored instances | Normal exit at CheckHasInstances |
### 6.4 EvaluateRisk Output Fields
| Field | Type | Description |
|---|---|---|
| path | string | Mount path (e.g., /, /data, C:) |
| device | string | Device name (e.g., nvme0n1p1, C:) |
| fstype | string | File system (e.g., xfs, NTFS) |
| currentUsage | number | Current usage (1 decimal place) |
| maxDailyIncrease | number/null | Maximum daily increment (2 decimal places). Null if insufficient data |
| daysUntilFull | number/null | Days until 100% (1 decimal place). Null if no increasing trend |
| worstDay | string/null | Date of maximum increment (YYYY-MM-DD). Null if no increasing trend |
| status | string | CRITICAL / WARNING / INSUFFICIENT_DATA / OK |
## 7. SNS Notification Message
Subject: Disk Usage Alert
{
"subject": "Disk Usage Alert",
"timestamp": "2026-02-07T08:47:49.163Z",
"summary": {
"totalInstances": 10,
"totalDisks": 18,
"criticalCount": 1,
"warningCount": 2,
"okCount": 15
},
"critical": [
{
"instanceId": "i-0123456789abcdef0",
"instanceName": "db-server",
"path": "/data",
"device": "nvme1n1",
"fstype": "xfs",
"currentUsage": 96.2,
"maxDailyIncrease": 2.8,
"daysUntilFull": 1.4,
"worstDay": "2026-02-03"
}
],
"warning": [
{
"instanceId": "i-0abcdef1234567890",
"instanceName": "app-server",
"path": "/",
"device": "nvme0n1p1",
"fstype": "xfs",
"currentUsage": 87.3,
"maxDailyIncrease": 5.1,
"daysUntilFull": 2.5,
"worstDay": "2026-02-05"
}
]
}
## 8. CloudFormation Parameters
| Parameter | Default | Description |
|---|---|---|
| StateMachineType | EXPRESS | EXPRESS or STANDARD |
| TagKey | Monitor | Monitoring target tag key |
| TagValue | DiskCheck | Monitoring target tag value |
| ThresholdPercent | 95 | CRITICAL judgment threshold (%) |
| PredictionDays | 3 | WARNING judgment days |
| LookbackDays | 14 | Historical days for trend analysis |
| MetricPeriod | 86400 | Metric aggregation period (seconds). Set to 300 during testing |
| SnsTopicArn | (required) | Notification SNS topic ARN |
| ScheduleExpression | rate(1 hour) | Execution schedule |
template.yaml - CloudFormation Template
AWSTemplateFormatVersion: '2010-09-09'
Description: Disk Usage Monitor - Step Functions (Lambdaless / JSONata)
Parameters:
StateMachineType:
Type: String
Default: EXPRESS
AllowedValues: [EXPRESS, STANDARD]
TagKey:
Type: String
Default: Monitor
TagValue:
Type: String
Default: DiskCheck
ThresholdPercent:
Type: Number
Default: 95
PredictionDays:
Type: Number
Default: 3
LookbackDays:
Type: Number
Default: 14
SnsTopicArn:
Type: String
AllowedPattern: arn:aws:sns:.+
MetricPeriod:
Type: Number
Default: 86400
Description: Metric aggregation period in seconds (86400=daily, 300=5min for testing)
ScheduleExpression:
Type: String
Default: rate(1 hour)
Conditions:
IsExpress: !Equals [!Ref StateMachineType, EXPRESS]
Resources:
StateMachineLogGroup:
Type: AWS::Logs::LogGroup
Condition: IsExpress
Properties:
LogGroupName: !Sub /aws/vendedlogs/states/${AWS::StackName}
RetentionInDays: 14
StateMachineRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: states.amazonaws.com
Action: sts:AssumeRole
Policies:
- PolicyName: DiskMonitorPolicy
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- ec2:DescribeInstances
- cloudwatch:ListMetrics
- cloudwatch:GetMetricStatistics
Resource: '*'
- Effect: Allow
Action: sns:Publish
Resource: !Ref SnsTopicArn
- !If
- IsExpress
- Effect: Allow
Action:
- logs:CreateLogDelivery
- logs:GetLogDelivery
- logs:UpdateLogDelivery
- logs:DeleteLogDelivery
- logs:ListLogDeliveries
- logs:PutResourcePolicy
- logs:DescribeResourcePolicies
- logs:DescribeLogGroups
Resource: '*'
- !Ref AWS::NoValue
SchedulerRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Principal:
Service: scheduler.amazonaws.com
Action: sts:AssumeRole
Policies:
- PolicyName: InvokeStateMachine
PolicyDocument:
Version: '2012-10-17'
Statement:
- Effect: Allow
Action:
- states:StartExecution
- states:StartSyncExecution
Resource: !GetAtt StateMachine.Arn
StateMachine:
Type: AWS::StepFunctions::StateMachine
Properties:
StateMachineName: !Sub ${AWS::StackName}
StateMachineType: !Ref StateMachineType
RoleArn: !GetAtt StateMachineRole.Arn
LoggingConfiguration: !If
- IsExpress
- Level: ALL
IncludeExecutionData: true
Destinations:
- CloudWatchLogsLogGroup:
LogGroupArn: !GetAtt StateMachineLogGroup.Arn
- !Ref AWS::NoValue
DefinitionSubstitutions:
TagKey: !Ref TagKey
TagValue: !Ref TagValue
ThresholdPercent: !Ref ThresholdPercent
PredictionDays: !Ref PredictionDays
LookbackDays: !Ref LookbackDays
MetricPeriod: !Ref MetricPeriod
SnsTopicArn: !Ref SnsTopicArn
Definition:
QueryLanguage: JSONata
Comment: Disk Usage Monitor - Lambdaless
StartAt: GetInstances
States:
GetInstances:
Type: Task
Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
Arguments:
Filters:
- Name: "tag:${TagKey}"
Values:
- "${TagValue}"
- Name: instance-state-name
Values:
- running
Next: ExtractInstanceList
ExtractInstanceList:
Type: Pass
Output: >-
{% [$states.input.Reservations[].Instances[].{
'instanceId': InstanceId,
'instanceName': Tags[Key='Name'].Value
}] %}
Next: CheckHasInstances
CheckHasInstances:
Type: Choice
Choices:
- Condition: "{% $count($states.input) > 0 %}"
Next: CheckEachInstance
Default: NoAction
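          # Outer Map: one iteration per monitored instance, run in parallel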
CheckEachInstance:
Type: Map
Items: "{% $states.input %}"
MaxConcurrency: 10
ToleratedFailurePercentage: 100
ItemProcessor:
ProcessorConfig:
Mode: INLINE
StartAt: ListDiskMetrics
States:
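                # Discover every disk (dimension combination) that reported disk_used_percent for this instance in the last 3 hours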
ListDiskMetrics:
Type: Task
Resource: arn:aws:states:::aws-sdk:cloudwatch:listMetrics
Arguments:
Namespace: CWAgent
MetricName: disk_used_percent
Dimensions:
- Name: InstanceId
Value: "{% $states.input.instanceId %}"
RecentlyActive: PT3H
Output:
instanceId: "{% $states.input.instanceId %}"
instanceName: "{% $states.input.instanceName %}"
metrics: "{% $states.result.Metrics %}"
Next: ExtractDimensions
ExtractDimensions:
Type: Pass
Output:
instanceId: "{% $states.input.instanceId %}"
instanceName: "{% $states.input.instanceName %}"
disks: >-
{% $states.input.metrics
? [$states.input.metrics.{ 'dimensions': Dimensions }]
: [] %}
Next: CheckHasDisks
CheckHasDisks:
Type: Choice
Choices:
- Condition: "{% $count($states.input.disks) > 0 %}"
Next: CheckEachDisk
Default: NoDisksFound
NoDisksFound:
Type: Pass
Output:
instanceId: "{% $states.input.instanceId %}"
instanceName: "{% $states.input.instanceName %}"
results: []
End: true
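                # Inner Map: one iteration per detected disk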
CheckEachDisk:
Type: Map
Items: >-
{% [$states.input.disks.{
'dimensions': dimensions,
'instanceId': $states.input.instanceId,
'instanceName': $states.input.instanceName
}] %}
MaxConcurrency: 5
ToleratedFailurePercentage: 100
ItemProcessor:
ProcessorConfig:
Mode: INLINE
StartAt: GetDiskMetrics
States:
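                      # Dimensions from ListMetrics are passed through unchanged (GetMetricStatistics requires an exact match)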
GetDiskMetrics:
Type: Task
Resource: arn:aws:states:::aws-sdk:cloudwatch:getMetricStatistics
Arguments:
Namespace: CWAgent
MetricName: disk_used_percent
Dimensions: "{% $states.input.dimensions %}"
StartTime: >-
{% $fromMillis($millis() - ${LookbackDays} * 86400000) %}
EndTime: "{% $fromMillis($millis()) %}"
Period: "{% $number('${MetricPeriod}') %}"
Statistics:
- Average
Output:
dimensions: "{% $states.input.dimensions %}"
datapoints: "{% $states.result.Datapoints %}"
Next: EvaluateRisk
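                      # Find the largest increase between any two datapoints and derive days until full plus a status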
EvaluateRisk:
Type: Pass
Output: >-
{% (
$dims := $states.input.dimensions;
$path := ($dims[Name='path']).Value;
$device := ($dims[Name='device']).Value;
$fstype := ($dims[Name='fstype']).Value;
$dp := $states.input.datapoints;
$cnt := $count($dp);
$latest := $cnt > 0
? $reduce($dp, function($acc, $v) {
$v.Timestamp > $acc.Timestamp ? $v : $acc
})
: {'Average': 0};
$current := $latest.Average;
$prediction := $cnt < 2
? { 'daysUntilFull': null, 'maxDailyIncrease': null, 'worstDay': null }
: (
$maxes := $map($dp, function($a) {
$reduce($dp, function($best, $b) {
($a.Timestamp > $b.Timestamp)
? ( $d := $a.Average - $b.Average;
$d > $best.delta
? {'delta': $d, 'date': $a.Timestamp}
: $best )
: $best
}, {'delta': 0, 'date': ''})
});
$worst := $reduce($maxes, function($acc, $m) {
$m.delta > $acc.delta ? $m : $acc
});
$worst.delta > 0
? {
'daysUntilFull': $round((100 - $current) / $worst.delta, 1),
'maxDailyIncrease': $round($worst.delta, 2),
'worstDay': $substringBefore($worst.date, 'T')
}
: { 'daysUntilFull': null, 'maxDailyIncrease': 0, 'worstDay': null }
);
$status := $current >= ${ThresholdPercent}
? 'CRITICAL'
: $cnt < 2
? 'INSUFFICIENT_DATA'
: $prediction.daysUntilFull != null
? ($prediction.daysUntilFull <= ${PredictionDays}
? 'WARNING' : 'OK')
: 'OK';
{
'path': $path,
'device': $device,
'fstype': $fstype,
'currentUsage': $round($current, 1),
'maxDailyIncrease': $prediction.maxDailyIncrease,
'daysUntilFull': $prediction.daysUntilFull,
'worstDay': $prediction.worstDay,
'status': $status
}
) %}
End: true
Output:
instanceId: "{% $states.input.instanceId %}"
instanceName: "{% $states.input.instanceName %}"
results: "{% $states.result %}"
End: true
Next: AggregateReport
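          # Build the CRITICAL/WARNING/OK summary report across all instances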
AggregateReport:
Type: Pass
Output: >-
{% (
$all := $states.input;
$allDisks := $all.results[];
{
'subject': 'Disk Usage Alert',
'timestamp': $now(),
'summary': {
'totalInstances': $count($all),
'totalDisks': $count($allDisks),
'criticalCount': $count($filter($allDisks, function($d) { $d.status = 'CRITICAL' })),
'warningCount': $count($filter($allDisks, function($d) { $d.status = 'WARNING' })),
'okCount': $count($filter($allDisks, function($d) { $d.status = 'OK' }))
},
'critical': $all.(
$inst := $;
results[status='CRITICAL'].{
'instanceId': $inst.instanceId,
'instanceName': $inst.instanceName,
'path': path, 'device': device, 'fstype': fstype,
'currentUsage': currentUsage,
'maxDailyIncrease': maxDailyIncrease,
'daysUntilFull': daysUntilFull,
'worstDay': worstDay
}
)[],
'warning': $all.(
$inst := $;
results[status='WARNING'].{
'instanceId': $inst.instanceId,
'instanceName': $inst.instanceName,
'path': path, 'device': device, 'fstype': fstype,
'currentUsage': currentUsage,
'maxDailyIncrease': maxDailyIncrease,
'daysUntilFull': daysUntilFull,
'worstDay': worstDay
}
)[]
}
) %}
Next: HasAlerts
HasAlerts:
Type: Choice
Choices:
- Condition: >-
{% $states.input.summary.criticalCount > 0
or $states.input.summary.warningCount > 0 %}
Next: SendReport
Default: NoAction
SendReport:
Type: Task
Resource: arn:aws:states:::aws-sdk:sns:publish
Arguments:
TopicArn: "${SnsTopicArn}"
Subject: "{% $states.input.subject %}"
Message: "{% $string($states.input) %}"
End: true
NoAction:
Type: Succeed
Scheduler:
Type: AWS::Scheduler::Schedule
Properties:
Name: !Sub ${AWS::StackName}-scheduler
ScheduleExpression: !Ref ScheduleExpression
FlexibleTimeWindow:
Mode: 'OFF'
Target:
Arn: !GetAtt StateMachine.Arn
RoleArn: !GetAtt SchedulerRole.Arn
Input: '{}'
Outputs:
StateMachineArn:
Value: !GetAtt StateMachine.Arn
StateMachineType:
Value: !Ref StateMachineType
LogGroupName:
Condition: IsExpress
Value: !Ref StateMachineLogGroup
