[Update] Amazon SageMaker HyperPod now integrates with Amazon EventBridge and can deliver status change events

2025.05.11
Hello! This is Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division.
Amazon SageMaker HyperPod now integrates with Amazon EventBridge and can deliver status change events.
https://aws.amazon.com/jp/about-aws/whats-new/2025/05/amazon-sagemaker-hyperpod-integrates-amazon-eventbridge-status-change-events/
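What's new
As the title says, the integration with Amazon EventBridge makes it possible to receive state events for HyperPod clusters and for the nodes that belong to a cluster.
Events are delivered in the following formats.
HyperPod cluster state transition
{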
   "version": "0",
   "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
   "detail-type": "SageMaker HyperPod Cluster State Change",
   "source": "aws.sagemaker",
   "account": "111122223333",
   "time": "2025-04-28T16:59:01Z",
   "region": "us-west-2",
   "resources": [
      "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
   ],
   "detail": {
      "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
      "ClusterName": "sample-cluster",
      "ClusterStatus": "InService",
      "CreationTime": 1745858447412,
      "FailureMessage": "",
      "InstanceGroups": [
         {
            "CurrentCount": 1,
            "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-AmazonSagemakerClusterExecutionR-123OTacPcKk1",
            "InstanceGroupName": "example instance group name",
            "InstanceStorageConfigs": [
               {}
            ],
            "InstanceType": "ml.t3.medium",
            "LifeCycleConfig": {
               "OnCreate": "on_create.sh",
               "SourceS3Uri": "s3://sagemaker-hyperpod//LifeCycleScripts/base-config/provisioning_parameters.json"
            },
            "OnStartDeepHealthChecks": [
               "example health checks"
            ],
            "OverrideVpcConfig": {
               "SecurityGroupIds": [
                  "SecurityGroupId1"
               ],
               "Subnets": [
                  "Subnet1"
               ]
            },
            "Status": "Failed",
            "TargetCount": 2,
            "ThreadsPerCore": 2,
            "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:111122223333:training-plan/large-models-fine-tuning",
            "TrainingPlanStatus": "NotApplicable"
         }
      ],
      "NodeRecovery": "Automatic",
      "Orchestrator": {
         "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-hyperPod-eks-cluster"
         }
      },
      "VpcConfig": {
         "SecurityGroupIds": [
            "SecurityGroupId2"
         ],
         "Subnets": [
            "Subnet2"
         ]
      }
   }
}
Node state change within a cluster
{
    "version": "0",
    "id": "0bd4a141-0a02-9d8a-f977-3924c3fb259c",
    "detail-type": "SageMaker HyperPod Cluster Node Health Event",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2021-10-25T01:52:12Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
    ],
    "detail": {
        "ClusterName": "sample-cluster",
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster",
        "InstanceId": "i-12345678abcdefghi",
        "Tags": {},
        "HealthSummary": {
            "HealthStatus": "Unhealthy",
            "HealthStatusReason": "HyperPod Health Monitoring Agent (HMA) has detected fault type NvidiaErrorTerminate on this node and is unhealthy.",
            "RepairAction": "None",
            "Recommendation": "Please Replace the Faulty Node."
        }
    }
}
https://docs.aws.amazon.com/sagemaker/latest/dg/automating-sagemaker-with-eventbridge.html#eventbridge-hyperpod-node-health
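The two event shapes above differ mainly in their detail-type and in the fields under detail. As a minimal sketch (the function name summarize_event is hypothetical, and the events are trimmed copies of the samples above), a consumer could branch on detail-type like this:

```python
CLUSTER_STATE = "SageMaker HyperPod Cluster State Change"
NODE_HEALTH = "SageMaker HyperPod Cluster Node Health Event"


def summarize_event(event: dict) -> str:
    """Return a one-line summary for either HyperPod event type."""
    detail = event.get("detail") or {}
    detail_type = event.get("detail-type")
    if detail_type == CLUSTER_STATE:
        return f"cluster {detail.get('ClusterName')} is {detail.get('ClusterStatus')}"
    if detail_type == NODE_HEALTH:
        health = detail.get("HealthSummary") or {}
        return (
            f"node {detail.get('InstanceId')} in {detail.get('ClusterName')} "
            f"is {health.get('HealthStatus')}: {health.get('Recommendation')}"
        )
    return f"unhandled detail-type: {detail_type}"


# Trimmed copies of the two sample events (only the fields used here).
cluster_event = {
    "detail-type": CLUSTER_STATE,
    "detail": {"ClusterName": "sample-cluster", "ClusterStatus": "InService"},
}
node_event = {
    "detail-type": NODE_HEALTH,
    "detail": {
        "ClusterName": "sample-cluster",
        "InstanceId": "i-12345678abcdefghi",
        "HealthSummary": {
            "HealthStatus": "Unhealthy",
            "Recommendation": "Please Replace the Faulty Node.",
        },
    },
}

print(summarize_event(cluster_event))
print(summarize_event(node_event))
```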
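Expected use cases
To begin with, integrating with EventBridge suggests use cases like the following.
SNS
  Email notifications to operators' email addresses
  Notifications to Slack or Teams via Amazon Q Developer in chat applications (formerly AWS Chatbot)

Systems Manager
  Failure recovery and initial investigation with Run Command

CloudWatch Logs
  Keeping events as logs to support permanent fixes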
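EventBridge can target an SNS topic directly, but if you want to shape the message yourself (for example in a Lambda function), a sketch for the SNS use case might look like the following. The function names and the injected sns_client are assumptions for illustration; publish with TopicArn, Subject, and Message is the boto3 SNS call this relies on, and the subject is truncated because SNS limits subjects to 100 characters.

```python
def build_notification(event: dict) -> dict:
    """Build the Subject/Message pair for an SNS publish call."""
    detail = event.get("detail") or {}
    subject = f"[HyperPod] {event.get('detail-type', 'unknown event')}"
    lines = [
        f"Cluster: {detail.get('ClusterName')}",
        f"Time: {event.get('time')}",
    ]
    if "ClusterStatus" in detail:
        lines.append(f"Status: {detail['ClusterStatus']}")
    if "HealthSummary" in detail:
        health = detail["HealthSummary"]
        lines.append(f"Health: {health.get('HealthStatus')}")
        lines.append(f"Reason: {health.get('HealthStatusReason')}")
    return {"Subject": subject[:100], "Message": "\n".join(lines)}


def notify(sns_client, topic_arn: str, event: dict) -> None:
    """Publish the formatted event; sns_client is e.g. boto3.client("sns")."""
    sns_client.publish(TopicArn=topic_arn, **build_notification(event))
```

Passing the client in keeps the formatting logic testable without AWS credentials.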

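The HyperPod cluster state change event captures transitions to the following cluster states (the ClusterStatus values).
Creating: the cluster is being created
Deleting: the cluster is being deleted
Failed: cluster creation or update failed
InService: the cluster is available
RollingBack: cluster creation or update is being rolled back
SystemUpdating: the cluster's system software is being updated
Updating: the cluster's configuration is being changed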
Being notified when a cluster reaches InService, after initial creation or a system update, looks handy.
For node state changes, it is nice that node health details such as HealthStatusReason in HealthSummary are recorded.
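The sample events carry only the current ClusterStatus, so detecting "the cluster just became InService" requires remembering the previous status. The following is a minimal in-memory sketch under that assumption; StatusTracker is a hypothetical helper, and a real handler such as a Lambda function would need to persist the last status somewhere.

```python
IN_PROGRESS = {"Creating", "Updating", "SystemUpdating", "RollingBack", "Deleting"}


class StatusTracker:
    """Remember the last ClusterStatus seen for each cluster ARN."""

    def __init__(self) -> None:
        self._last: dict[str, str] = {}

    def observe(self, event: dict) -> bool:
        """Return True when the cluster moved from an in-progress state to InService."""
        detail = event["detail"]
        arn = detail["ClusterArn"]
        status = detail["ClusterStatus"]
        previous = self._last.get(arn)
        self._last[arn] = status
        return status == "InService" and previous in IN_PROGRESS


def make_event(status: str) -> dict:
    # Trimmed event with only the fields the tracker reads.
    arn = "arn:aws:sagemaker:us-west-2:111122223333:cluster/sample-cluster"
    return {"detail": {"ClusterArn": arn, "ClusterStatus": status}}


tracker = StatusTracker()
for status in ["Creating", "InService", "Updating", "InService"]:
    print(status, tracker.observe(make_event(status)))
```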
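Trying it out
This time I use the pattern of recording EventBridge events in CloudWatch Logs to check what events are actually published beyond those shown in the documentation.
The EventBridge + CloudWatch Logs part is set up in the Management Console, and the SageMaker HyperPod part uses HashiCorp Terraform.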
Creating the log groups
First, create the log groups that events are recorded to.
I created two log groups: /event/hyperpod/cluster and /event/hyperpod/node.
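The article creates the log groups in the Management Console. As an alternative sketch, the same two log groups could be created with boto3's CloudWatch Logs client; the function name ensure_log_groups is hypothetical, and the client is passed in so the logic can be exercised without AWS access.

```python
LOG_GROUPS = ["/event/hyperpod/cluster", "/event/hyperpod/node"]


def ensure_log_groups(logs_client, names=LOG_GROUPS) -> list[str]:
    """Create each log group if it does not exist; return the names created."""
    created = []
    for name in names:
        try:
            logs_client.create_log_group(logGroupName=name)
            created.append(name)
        except logs_client.exceptions.ResourceAlreadyExistsException:
            # The log group is already there, so there is nothing to do.
            pass
    return created


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ensure_log_groups(boto3.client("logs"))
```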
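Creating the events
Next, create the events. Again I create two rules, one for cluster state changes and one for node changes.
They were already available for selection as AWS events.
HyperPod cluster state transition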
{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster State Change"]
}
Node state change within a cluster
{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker HyperPod Cluster Node Health Event"]
}
Select the log group as the target and you are done.
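The article sets up the rules in the console. A rough boto3 equivalent could look like the sketch below, using the EventBridge put_rule and put_targets calls. The rule names, target Id, and put_rules function are hypothetical. EventBridge also needs permission to write to the log groups (a CloudWatch Logs resource policy), which this sketch does not cover.

```python
import json

RULES = {
    # Hypothetical rule names; patterns and log groups are the ones used above.
    "hyperpod-cluster-state": {
        "pattern": {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker HyperPod Cluster State Change"],
        },
        "log_group": "/event/hyperpod/cluster",
    },
    "hyperpod-node-health": {
        "pattern": {
            "source": ["aws.sagemaker"],
            "detail-type": ["SageMaker HyperPod Cluster Node Health Event"],
        },
        "log_group": "/event/hyperpod/node",
    },
}


def put_rules(events_client, region: str, account_id: str) -> None:
    """Create one rule per event type and point it at its log group."""
    for name, cfg in RULES.items():
        events_client.put_rule(
            Name=name,
            EventPattern=json.dumps(cfg["pattern"]),
            State="ENABLED",
        )
        log_group_arn = (
            f"arn:aws:logs:{region}:{account_id}:log-group:{cfg['log_group']}"
        )
        events_client.put_targets(
            Rule=name,
            Targets=[{"Id": "logs", "Arn": log_group_arn}],
        )


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   put_rules(boto3.client("events"), "us-west-2", "111122223333")
```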
Creating the cluster
Create the cluster. Slurm is used as the orchestrator.
https://github.com/takakuni-classmethod/genai-blog/tree/main/sagemaker_hyperpod_101
Creating log
When the cluster was created, a Creating log was recorded.
{
    "version": "0",
    "id": "4c3ca6b2-3d93-0fdc-24da-1adbfa06d6df",
    "detail-type": "SageMaker HyperPod Cluster State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2025-05-11T07:54:44Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf"
    ],
    "detail": {
        "SdkResponseMetadata": null,
        "SdkHttpMetadata": null,
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf",
        "ClusterName": "slurm-orchestrator-no-vpc",
        "ClusterStatus": "Creating",
        "CreationTime": 1746950084370,
        "FailureMessage": "",
        "InstanceGroups": [
            {
                "CurrentCount": 0,
                "TargetCount": 1,
                "InstanceGroupName": "controller-machine",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "Creating",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            },
            {
                "CurrentCount": 0,
                "TargetCount": 1,
                "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "Creating",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            }
        ],
        "RestrictedInstanceGroups": null,
        "VpcConfig": null,
        "Orchestrator": null,
        "ResilienceConfig": null,
        "NodeRecovery": "",
        "Tags": {}
    }
}
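In the event above, CreationTime (1746950084370) appears to be a Unix epoch in milliseconds: converted to UTC it lines up with the event's time field. A small sketch under that assumption, with hypothetical helper names:

```python
from datetime import datetime, timedelta, timezone


def creation_time_utc(creation_time_ms: int) -> datetime:
    """Convert an epoch-milliseconds value to an aware UTC datetime."""
    seconds, millis = divmod(creation_time_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        milliseconds=millis
    )


def seconds_since_creation(event: dict) -> float:
    """Seconds between the cluster's CreationTime and the event's time field."""
    event_time = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    created = creation_time_utc(event["detail"]["CreationTime"])
    return (event_time - created).total_seconds()


created = creation_time_utc(1746950084370)
print(created.isoformat())  # 2025-05-11T07:54:44.370000+00:00
```

The same helper can be applied to later events to see how long a cluster took to reach a given status.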
InService log
Once the cluster became available, the status changed to InService. As expected.
{
    "version": "0",
    "id": "1dd4d7d3-01ce-e9d1-7f07-2f28d53daa62",
    "detail-type": "SageMaker HyperPod Cluster State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2025-05-11T08:05:33Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf"
    ],
    "detail": {
        "SdkResponseMetadata": null,
        "SdkHttpMetadata": null,
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf",
        "ClusterName": "slurm-orchestrator-no-vpc",
        "ClusterStatus": "InService",
        "CreationTime": 1746950084370,
        "FailureMessage": "",
        "InstanceGroups": [
            {
                "CurrentCount": 1,
                "TargetCount": 1,
                "InstanceGroupName": "controller-machine",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "InService",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            }
        ],
        "RestrictedInstanceGroups": null,
        "VpcConfig": null,
        "Orchestrator": null,
        "ResilienceConfig": null,
        "NodeRecovery": "",
        "Tags": {}
    }
}
Node health events
Check the node health events. No logs had been recorded, perhaps because no failures had occurred.
Shutting down an instance
I shut down one instance.
Updating log
The change put ClusterStatus into Updating, so a cluster state transition log was recorded.
{
    "version": "0",
    "id": "ed911b94-3094-229f-cbcf-2407c825a5c5",
    "detail-type": "SageMaker HyperPod Cluster State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2025-05-11T08:04:29Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf"
    ],
    "detail": {
        "SdkResponseMetadata": null,
        "SdkHttpMetadata": null,
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf",
        "ClusterName": "slurm-orchestrator-no-vpc",
        "ClusterStatus": "Updating",
        "CreationTime": 1746950084370,
        "FailureMessage": "",
        "InstanceGroups": [
            {
                "CurrentCount": 1,
                "TargetCount": 1,
                "InstanceGroupName": "controller-machine",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "InService",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            },
            {
                "CurrentCount": 1,
                "TargetCount": 0,
                "InstanceGroupName": "worker-group-1",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "Deleting",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            }
        ],
        "RestrictedInstanceGroups": null,
        "VpcConfig": null,
        "Orchestrator": null,
        "ResilienceConfig": null,
        "NodeRecovery": "",
        "Tags": {}
    }
}
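In the Updating event above, the instance group being changed is the one whose CurrentCount differs from its TargetCount. A minimal sketch for picking such groups out of the detail payload (pending_groups is a hypothetical helper, and the data is a trimmed copy of the event above):

```python
def pending_groups(detail: dict) -> list[tuple]:
    """Return (name, current, target, status) for groups still converging."""
    pending = []
    # InstanceGroups may be missing or null, so fall back to an empty list.
    for group in detail.get("InstanceGroups") or []:
        if group.get("CurrentCount") != group.get("TargetCount"):
            pending.append(
                (
                    group.get("InstanceGroupName"),
                    group.get("CurrentCount"),
                    group.get("TargetCount"),
                    group.get("Status"),
                )
            )
    return pending


# Trimmed copy of the Updating event's detail.
updating_detail = {
    "ClusterStatus": "Updating",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "CurrentCount": 1,
         "TargetCount": 1, "Status": "InService"},
        {"InstanceGroupName": "worker-group-1", "CurrentCount": 1,
         "TargetCount": 0, "Status": "Deleting"},
    ],
}

print(pending_groups(updating_detail))
```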
InService log
When the cluster became available again, the status changed to InService. It seems best to filter events as needed to control the log volume.
{
    "version": "0",
    "id": "1dd4d7d3-01ce-e9d1-7f07-2f28d53daa62",
    "detail-type": "SageMaker HyperPod Cluster State Change",
    "source": "aws.sagemaker",
    "account": "111122223333",
    "time": "2025-05-11T08:05:33Z",
    "region": "us-west-2",
    "resources": [
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf"
    ],
    "detail": {
        "SdkResponseMetadata": null,
        "SdkHttpMetadata": null,
        "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/ac7vzqgh29mf",
        "ClusterName": "slurm-orchestrator-no-vpc",
        "ClusterStatus": "InService",
        "CreationTime": 1746950084370,
        "FailureMessage": "",
        "InstanceGroups": [
            {
                "CurrentCount": 1,
                "TargetCount": 1,
                "InstanceGroupName": "controller-machine",
                "InstanceType": "ml.t3.medium",
                "LifeCycleConfig": {
                    "SourceS3Uri": "s3://sagemaker-hyperpod-lifecycle-111122223333-us-west-2/config/",
                    "OnCreate": "on_create.sh"
                },
                "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-hyperpod-role",
                "ThreadsPerCore": 2,
                "InstanceStorageConfigs": null,
                "EnableBurnInTest": null,
                "OnStartDeepHealthCheck": null,
                "OnStartDeepHealthChecks": null,
                "Status": "InService",
                "FailureMessages": null,
                "ScalingConfig": null,
                "TrainingPlanArn": "",
                "TrainingPlanStatus": "NotApplicable",
                "OverrideVpcConfig": null,
                "CustomMetadata": null,
                "ScheduledUpdateConfig": null,
                "CurrentImageId": null,
                "DesiredImageId": null
            }
        ],
        "RestrictedInstanceGroups": null,
        "VpcConfig": null,
        "Orchestrator": null,
        "ResilienceConfig": null,
        "NodeRecovery": "",
        "Tags": {}
    }
}
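One way to reduce log volume is to narrow the event pattern on detail.ClusterStatus. The sketch below builds such a pattern, assuming only InService and Failed are of interest, and includes a deliberately simplified local matcher for checking it against sample events. The matcher handles only exact-value lists and nested objects; it is not a reimplementation of EventBridge's matching rules.

```python
import json

FILTERED_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker HyperPod Cluster State Change"],
    # Assumption: only these two statuses should be logged.
    "detail": {"ClusterStatus": ["InService", "Failed"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified check: every pattern key must match by exact value."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def make_event(status: str) -> dict:
    # Trimmed event with only the fields the pattern inspects.
    return {
        "source": "aws.sagemaker",
        "detail-type": "SageMaker HyperPod Cluster State Change",
        "detail": {"ClusterStatus": status},
    }


print(json.dumps(FILTERED_PATTERN, indent=2))
print(matches(FILTERED_PATTERN, make_event("InService")))
print(matches(FILTERED_PATTERN, make_event("Updating")))
```

The printed JSON is what would go into the rule's event pattern in place of the broader pattern used earlier.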
I checked the node state logs again, but no log stream had been created.
Perhaps events are only sent when a node's Health actually changes.
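Summary
That was "Amazon SageMaker HyperPod now integrates with Amazon EventBridge and can deliver status change events."
Personally, from my usual testing, being notified at the moment a cluster reaches InService looks useful.
I hope this helps. This was Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division!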