I Tried Implementing Interruption Handling for Spot Instances Behind an ELB (ALB) with Step Functions

In an environment that uses Spot Instances behind an ELB (ALB), I tried automating the response to instance termination notices without Lambda, using the Step Functions AWS SDK service integrations.
2024.10.24

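In an environment that uses Spot Instances behind an ELB (ALB), I implemented automated handling of Spot interruption notices with AWS Step Functions.

By using the Step Functions AWS SDK service integrations, this works without any Lambda functions.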

Process overview

a. Detecting the Spot Instance interruption notice

  • EventBridge detects the EC2 Spot Instance interruption notice
  • EventBridge starts the Step Functions state machine
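
To make step a concrete, here is a minimal Python sketch of how the event pattern used in the template below selects an interruption warning and how the instance ID is read from it. The matcher is a simplified stand-in for EventBridge matching (exact match on top-level fields only), and the account, region, and instance ID in the sample event are placeholder values, not values from my environment.

```python
# Same fields as the EventPattern in the CloudFormation template below.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}

# Sample event shaped like an interruption warning.
# Account, region, and instance ID are placeholders (assumptions).
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "2024-10-23T13:55:59Z",
    "region": "ap-northeast-1",
    "detail": {
        "instance-id": "i-0123456789abcdef0",
        "instance-action": "terminate",
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: each pattern field must allow the event's value."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())


def extract_instance_id(event: dict) -> str:
    """Equivalent of the JSONPath $.detail.instance-id used by PrepareInstanceId."""
    return event["detail"]["instance-id"]


if __name__ == "__main__":
    if matches(EVENT_PATTERN, SAMPLE_EVENT):
        print(extract_instance_id(SAMPLE_EVENT))
```
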

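b. Retrieving and validating instance information

  • PrepareInstanceId: extracts the ID of the instance being interrupted
  • GetEC2InstanceInfo: retrieves the instance details
  • ExtractTags: extracts the Auto Scaling group tag (aws:autoscaling:groupName)
  • ValidateTag: checks that the tag exists
  • GetAutoScalingGroupInfo: retrieves the Auto Scaling group (ASG) information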
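
The tag lookup in step b can be sketched in Python as follows. This is only an illustration of what ExtractTags and ValidateTag do with the DescribeInstances result; the function name and the sample response (including the ASG name "my-spot-asg") are hypothetical.

```python
from typing import Optional

# Tag that Auto Scaling attaches to the instances it launches.
ASG_TAG_KEY = "aws:autoscaling:groupName"

# Trimmed sample in the shape of a DescribeInstances response.
# The instance ID and ASG name are placeholders (assumptions).
SAMPLE_RESPONSE = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "Tags": [
                        {"Key": "Name", "Value": "web"},
                        {"Key": ASG_TAG_KEY, "Value": "my-spot-asg"},
                    ],
                }
            ]
        }
    ]
}


def find_asg_name(response: dict) -> Optional[str]:
    """Return the ASG name from the first instance's tags, or None if absent."""
    reservations = response.get("Reservations", [])
    if not reservations or not reservations[0].get("Instances"):
        return None
    tags = reservations[0]["Instances"][0].get("Tags", [])
    values = [tag["Value"] for tag in tags if tag.get("Key") == ASG_TAG_KEY]
    # ValidateTag corresponds to this check: no tag means the flow stops.
    return values[0] if values else None


if __name__ == "__main__":
    print(find_asg_name(SAMPLE_RESPONSE))
```

An instance that was not launched by an Auto Scaling group has no such tag, which is the case the state machine routes to HandleError.
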

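c. Parallel processing (two parallel branches):
Branch 1: Target group handling

  • ExtractTargetGroupARNs: extracts the target group ARNs
  • GetTargetGroupInfo: retrieves the target group information
  • ValidateTargetGroups: validates the target groups
  • GetTargetHealth: retrieves the health of the targets
  • PrepareForFiltering: prepares the data for filtering
  • FindMatchingTarget: identifies the matching target
  • DeregisterTarget: deregisters the target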
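
The filtering in branch 1 can be sketched in Python like this. It mirrors the condition FindMatchingTarget applies (same instance ID and a "healthy" state); the function name and sample data are hypothetical.

```python
from typing import Dict, List

# Trimmed sample in the shape of TargetHealthDescriptions.
# Instance IDs are placeholders (assumptions).
SAMPLE_TARGETS: List[Dict] = [
    {"Target": {"Id": "i-0123456789abcdef0", "Port": 80},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0fedcba9876543210", "Port": 80},
     "TargetHealth": {"State": "healthy"}},
]


def find_matching_targets(descriptions: List[Dict], instance_id: str) -> List[Dict]:
    """Keep only the entries for the given instance that are currently healthy."""
    return [
        d for d in descriptions
        if d["Target"]["Id"] == instance_id
        and d["TargetHealth"]["State"] == "healthy"
    ]


if __name__ == "__main__":
    matched = find_matching_targets(SAMPLE_TARGETS, "i-0123456789abcdef0")
    # Deregistration is attempted only when at least one entry matched.
    print(len(matched) > 0)
```

A target that is already draining or unhealthy does not match, so DeregisterTarget is skipped for it.
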

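Branch 2: Auto Scaling handling

  • PrepareCapacityComparison: computes the new desired capacity (current value + 1)
  • CheckScalingCondition: checks that the new value does not exceed the maximum size
  • UpdateAutoScalingGroup: updates the ASG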
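
The decision in branch 2 is simple arithmetic, sketched here in Python. The function name is hypothetical; the logic follows PrepareCapacityComparison (add 1) and CheckScalingCondition (compare against MaxSize).

```python
from typing import Optional


def plan_desired_capacity(desired: int, max_size: int) -> Optional[int]:
    """Return desired + 1 if it fits within max_size, otherwise None (skip update)."""
    new_desired = desired + 1
    return new_desired if new_desired <= max_size else None


if __name__ == "__main__":
    # With the test environment's settings (DesiredCapacity 2, MaxSize 6),
    # the ASG would be raised to 3 so a replacement launches early.
    print(plan_desired_capacity(2, 6))
```

When the group is already at its maximum size, the update is skipped and the replacement is left to the normal Auto Scaling behavior.
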

Step Functions Workflow Studio

SpotInterruptionStateMachine

CloudFormation

I deployed the EventBridge rule and the Step Functions state machine with CloudFormation.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Spot Instance Interruption Handler with ALB and Auto Scaling Integration'

Resources:
  SpotInterruptionEventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: "Capture Spot Instance Interruption Warnings"
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      State: "ENABLED"
      Targets: 
        - Arn: !Ref SpotInterruptionStateMachine
          Id: "SpotInterruptionStateMachine"
          RoleArn: !GetAtt EventBridgeRole.Arn

  EventBridgeRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: InvokeStepFunction
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref SpotInterruptionStateMachine

  SpotInterruptionStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineType: EXPRESS
      LoggingConfiguration:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt StepFunctionsLogGroup.Arn
      DefinitionString: !Sub |
        {
          "Comment": "Handle Spot Instance Interruption",
          "StartAt": "PrepareInstanceId",
          "States": {
            "PrepareInstanceId": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.detail.instance-id"
              },
              "Next": "GetEC2InstanceInfo"
            },
            "GetEC2InstanceInfo": {
              "Type": "Task",
              "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
              "Parameters": {
                "InstanceIds.$": "States.Array($.InstanceId)"
              },
              "ResultPath": "$.InstanceInfo",
              "Next": "ExtractTags",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "Next": "HandleError"
                }
              ]
            },
            "ExtractTags": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "Tags.$": "$.InstanceInfo.Reservations[0].Instances[0].Tags[?(@.Key == 'aws:autoscaling:groupName')]"
              },
              "Next": "ValidateTag"
            },
            "ValidateTag": {
              "Type": "Choice",
              "Choices": [
                {
                  "And": [
                    {
                      "Variable": "$.Tags[0].Value",
                      "IsPresent": true
                    }
                  ],
                  "Next": "PrepareAutoScalingGroupName"
                }
              ],
              "Default": "HandleError"
            },
            "PrepareAutoScalingGroupName": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "AutoScalingGroupName.$": "$.Tags[0].Value"
              },
              "Next": "GetAutoScalingGroupInfo"
            },
            "GetAutoScalingGroupInfo": {
              "Type": "Task",
              "Resource": "arn:aws:states:::aws-sdk:autoscaling:describeAutoScalingGroups",
              "Parameters": {
                "AutoScalingGroupNames.$": "States.Array($.AutoScalingGroupName)"
              },
              "ResultPath": "$.AutoScalingGroupInfo",
              "Next": "ParallelProcessing",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "Next": "HandleError"
                }
              ]
            },
            "ParallelProcessing": {
              "Type": "Parallel",
              "Branches": [
                {
                  "StartAt": "ExtractTargetGroupARNs",
                  "States": {
                    "ExtractTargetGroupARNs": {
                      "Type": "Pass",
                      "Parameters": {
                        "InstanceId.$": "$.InstanceId",
                        "TargetGroupARNs.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].TargetGroupARNs",
                        "AutoScalingGroupInfo.$": "$.AutoScalingGroupInfo"
                      },
                      "Next": "GetTargetGroupInfo"
                    },
                    "GetTargetGroupInfo": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:elasticloadbalancingv2:describeTargetGroups",
                      "Parameters": {
                        "TargetGroupArns.$": "$.TargetGroupARNs"
                      },
                      "ResultPath": "$.TargetGroupInfo",
                      "Next": "ValidateTargetGroups",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.ErrorInfo",
                          "Next": "HandleTargetGroupError"
                        }
                      ]
                    },
                    "ValidateTargetGroups": {
                      "Type": "Choice",
                      "Choices": [
                        {
                          "And": [
                            {
                              "Variable": "$.TargetGroupInfo.TargetGroups[0]",
                              "IsPresent": true
                            }
                          ],
                          "Next": "GetTargetHealth"
                        }
                      ],
                      "Default": "HandleTargetGroupError"
                    },
                    "GetTargetHealth": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:elasticloadbalancingv2:describeTargetHealth",
                      "Parameters": {
                        "TargetGroupArn.$": "$.TargetGroupInfo.TargetGroups[0].TargetGroupArn"
                      },
                      "ResultPath": "$.TargetHealth",
                      "Next": "PrepareForFiltering",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.ErrorInfo",
                          "Next": "HandleTargetGroupError"
                        }
                      ]
                    },
                    "PrepareForFiltering": {
                      "Type": "Pass",
                      "Parameters": {
                        "TargetGroupArn.$": "$.TargetGroupInfo.TargetGroups[0].TargetGroupArn",
                        "InstanceId.$": "$.InstanceId",
                        "Targets.$": "$.TargetHealth.TargetHealthDescriptions"
                      },
                      "Next": "FindMatchingTarget"
                    },
                    "FindMatchingTarget": {
                      "Type": "Pass",
                      "Parameters": {
                        "TargetGroupArn.$": "$.TargetGroupArn",
                        "InstanceId.$": "$.InstanceId",
                        "MatchedTarget.$": "$.Targets[?(@.Target.Id == $.InstanceId && @.TargetHealth.State == 'healthy')]"
                      },
                      "Next": "ValidateTargetForDeregister"
                    },
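                    "ValidateTargetForDeregister": {
                      "Type": "Choice",
                      "Choices": [
                        {
                          "And": [
                            {
                              "Variable": "$.MatchedTarget[0]",
                              "IsPresent": true
                            }
                          ],
                          "Next": "DeregisterTarget"
                        }
                      ],
                      "Default": "TargetGroupSkipped"
                    },
                    "TargetGroupSkipped": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Skipped",
                        "InstanceId.$": "$.InstanceId",
                        "Message": "No healthy matching target found, deregistration skipped"
                      },
                      "End": true
                    },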
                    "DeregisterTarget": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:elasticloadbalancingv2:deregisterTargets",
                      "Parameters": {
                        "TargetGroupArn.$": "$.TargetGroupArn",
                        "Targets": [
                          {
                            "Id.$": "$.InstanceId"
                          }
                        ]
                      },
                      "ResultPath": "$.DeregisterResult",
                      "Next": "TargetGroupSuccess",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.ErrorInfo",
                          "Next": "HandleTargetGroupError"
                        }
                      ]
                    },
                    "HandleTargetGroupError": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Error",
                        "Error": {
                          "Message": "Target group operation failed",
                          "InstanceId.$": "$.InstanceId"
                        }
                      },
                      "End": true
                    },
                    "TargetGroupSuccess": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Success",
                        "InstanceId.$": "$.InstanceId",
                        "DeregisterResult.$": "$.DeregisterResult"
                      },
                      "End": true
                    }
                  }
                },
                {
                  "StartAt": "PrepareCapacityComparison",
                  "States": {
                    "PrepareCapacityComparison": {
                      "Type": "Pass",
                      "Parameters": {
                        "InstanceId.$": "$.InstanceId",
                        "AutoScalingGroupInfo.$": "$.AutoScalingGroupInfo",
                        "NewDesiredCapacity.$": "States.MathAdd($.AutoScalingGroupInfo.AutoScalingGroups[0].DesiredCapacity, 1)",
                        "MaxSize.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].MaxSize"
                      },
                      "Next": "CheckScalingCondition"
                    },
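                    "CheckScalingCondition": {
                      "Type": "Choice",
                      "Choices": [
                        {
                          "And": [
                            {
                              "Variable": "$.NewDesiredCapacity",
                              "NumericLessThanEqualsPath": "$.MaxSize"
                            }
                          ],
                          "Next": "UpdateAutoScalingGroup"
                        }
                      ],
                      "Default": "ASGSkipped"
                    },
                    "ASGSkipped": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Skipped",
                        "InstanceId.$": "$.InstanceId",
                        "Reason.$": "States.Format('Requested capacity {} exceeds max size {}', $.NewDesiredCapacity, $.MaxSize)"
                      },
                      "End": true
                    },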

                    "UpdateAutoScalingGroup": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:autoscaling:updateAutoScalingGroup",
                      "Parameters": {
                        "AutoScalingGroupName.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].AutoScalingGroupName",
                        "DesiredCapacity.$": "$.NewDesiredCapacity"
                      },
                      "ResultPath": "$.UpdateASGResult",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.UpdateASGError",
                          "Next": "ASGWithError"
                        }
                      ],
                      "Next": "ASGSuccess"
                    },
                    "ASGWithError": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Warning",
                        "InstanceId.$": "$.InstanceId",
                        "Error.$": "$.UpdateASGError",
                        "Message": "Auto Scaling Group update failed but continuing execution"
                      },
                      "End": true
                    },
                    "ASGSuccess": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Success",
                        "InstanceId.$": "$.InstanceId",
                        "AutoScalingInfo": {
                          "CurrentDesiredCapacity.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].DesiredCapacity",
                          "MaxSize.$": "$.MaxSize",
                          "RequestedDesiredCapacity.$": "$.NewDesiredCapacity"
                        }
                      },
                      "End": true
                    }
                  }
                }
              ],
              "ResultPath": "$.ParallelResults",
              "Next": "MergeResults"
            },
            "MergeResults": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "FinalState": {
                  "TargetGroupOperations.$": "$.ParallelResults[0]",
                  "AutoScalingOperations.$": "$.ParallelResults[1]"
                }
              },
              "Next": "Success"
            },
            "HandleError": {
              "Type": "Fail",
              "Error": "SpotInterruptionHandlingError",
              "Cause": "Error occurred during spot interruption handling"
            },
            "Success": {
              "Type": "Succeed"
            }
          }
        }
      RoleArn: !GetAtt StepFunctionsExecutionRole.Arn

  StepFunctionsExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: EC2AndAutoScalingAccess
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - autoscaling:DescribeAutoScalingGroups
                  - autoscaling:UpdateAutoScalingGroup
                  - elasticloadbalancing:DescribeTargetGroups
                  - elasticloadbalancing:DescribeTargetHealth
                  - elasticloadbalancing:DeregisterTargets
                Resource: "*"
        - PolicyName: CloudWatchLogsAccess
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogDelivery
                  - logs:GetLogDelivery
                  - logs:UpdateLogDelivery
                  - logs:DeleteLogDelivery
                  - logs:ListLogDeliveries
                  - logs:PutResourcePolicy
                  - logs:DescribeResourcePolicies
                  - logs:DescribeLogGroups
                Resource: "*"

  StepFunctionsLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/states/${AWS::StackName}-spotinterruption"
      RetentionInDays: 180

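Verification

Building the test environment

To evaluate the Spot Instance interruption scenario, I built the following components with CloudFormation:

  • Application Load Balancer (ALB)
  • Auto Scaling group (configured for Spot Instances)
  • Security groups
  • Basic web server configuration (launch template)
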
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Spot Instance EC2 Autoscaling with ALB'
Parameters:
  VPCID:
    Type: AWS::EC2::VPC::Id
    Description: Select the VPC where you want to deploy the resources
  EC2SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: Select subnets for EC2 instances and ALB
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instances
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  LatestAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64

Resources:
  ALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for ALB
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  EC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for EC2 instances
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          SourceSecurityGroupId: !Ref ALBSecurityGroup
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      SecurityGroups: 
        - !Ref ALBSecurityGroup
      Subnets: !Ref EC2SubnetIds
      Type: application
  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ALBTargetGroup
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
  ALBTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckPath: /
      Name: my-alb-target-group
      Port: 80
      Protocol: HTTP
      TargetType: instance
      VpcId: !Ref VPCID
  Ec2InstanceLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: EC2AutoScalingLaunchTemplate
      LaunchTemplateData:
        SecurityGroupIds:
          - !Ref EC2SecurityGroup
        InstanceInitiatedShutdownBehavior: terminate
        KeyName: !Ref 'KeyName'
        ImageId: !Ref LatestAmiId
        InstanceType: t3.nano
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            yum install -y httpd
            systemctl start httpd
            systemctl enable httpd
            echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html
  Ec2InstanceAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier: !Ref EC2SubnetIds
      MinSize: 2
      MaxSize: 6
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref ALBTargetGroup
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandAllocationStrategy: prioritized
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: capacity-optimized
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref 'Ec2InstanceLaunchTemplate'
            Version: !GetAtt 'Ec2InstanceLaunchTemplate.LatestVersionNumber'
          Overrides:
            - InstanceType: t3a.nano
            - InstanceType: t3.nano

Running the interruption test

I used AWS Fault Injection Service (FIS) to simulate a Spot Instance interruption.

Spot Instance interruption

https://dev.classmethod.jp/articles/aws-fis-spot-instance-interruptions-ec2-console/

Results

Performance evaluation

  • Spot interruption notice issued: 13:55:59
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "time": "2024-10-23T13:55:59Z",
  • Target deregistration executed: 13:56:01.525 (about 2 seconds later)

DeregisterTarget

Graph view

Table view

The state machine execution took 0.474 seconds.

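Summary

Spot Instance interruption handling can also be implemented in a similar way with Lambda, but using Step Functions this time let me strengthen the error handling and the parallel processing.

In addition, Lambda requires periodic runtime updates, whereas the Step Functions AWS SDK service integrations remove that maintenance work.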

https://dev.classmethod.jp/articles/ec2-spot-interruption-lambda/

If you run Spot Instances launched by Auto Scaling behind an ELB (ALB) and need to prepare for interruption notices, please give this setup a try.

© Classmethod, Inc. All rights reserved.