Sorting Out Scale-Down in SageMaker HyperPod: Slurm Orchestrator Edition
Hello! I'm Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division.
SageMaker HyperPod clusters recently gained support for deleting instance groups.
Changes Added new capability in the UpdateCluster operation to remove instance groups from your SageMaker HyperPod cluster.
This makes it more flexible to grow and shrink the instance groups and instances in a HyperPod cluster. In this post, I sort out how scale-down works with the Slurm orchestrator.
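Summary
Here is the summary up front. I found the following requirements.
Common
- At least one instance group must exist in the cluster
- There are two ways to scale down instances
  - Increase or decrease InstanceCount with the UpdateCluster API
    - The instances removed during scale-down are chosen at random
  - Specify particular instances and delete them with the BatchDeleteClusterNodes API
- There is no difference between a user-defined VPC environment and the SageMaker-managed VPC environment
- Instances and instance groups designated as the controller group cannot be deleted
- Instance groups designated as worker groups or login groups can be reduced to 0 instances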
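Trying it out
I prepare two SageMaker HyperPod clusters that use the Slurm orchestrator: one with a VPC configuration and one without.
The instance groups start in the following state.
- controller-machine (1 instance)
- worker-group-1 (1 instance)
- worker-group-2 (1 instance)
The scenarios are as follows.
- Add an instance group (worker-group-3)
- Add instances to the instance group (worker-group-3) (2 instances)
- Delete an instance from the instance group (worker-group-3) with the UpdateCluster API (1 instance)
- Delete an instance from the instance group (worker-group-3) with the BatchDeleteClusterNodes API (1 instance)
- Delete the instance group (worker-group-3)
- Delete the instance group (worker-group-2)
  - Can an instance group be deleted while it still contains instances?
- Delete an instance from the instance group (worker-group-1) with the UpdateCluster API (1 instance)
- Delete an instance from the instance group (controller-machine) with the UpdateCluster API (1 instance)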
Adding an instance group
First, I add an instance group.
Instances will be added later, so I create the instance group with an instance count of 0.
I edit the JSON files used for UpdateCluster, one per cluster.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
- }
+ },
+ {
+ "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
+ "InstanceCount": 0,
+ "InstanceGroupName": "worker-group-3",
+ "InstanceStorageConfigs": [],
+ "InstanceType": "ml.t3.medium",
+ "LifeCycleConfig": {
+ "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
+ "OnCreate": "on_create.sh"
+ },
+ "ThreadsPerCore": 2
+ }
]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
- }
+ },
+ {
+ "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
+ "InstanceCount": 0,
+ "InstanceGroupName": "worker-group-3",
+ "InstanceStorageConfigs": [],
+ "InstanceType": "ml.t3.medium",
+ "LifeCycleConfig": {
+ "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
+ "OnCreate": "on_create.sh"
+ },
+ "ThreadsPerCore": 2
+ }
]
}
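The two edits above follow the same pattern: copy an existing group definition, rename it, and set its count to 0. As a rough sketch only, that step could be scripted as below. The function name `add_instance_group` and the idea of cloning a "template" group are my own assumptions for illustration and are not part of the AWS CLI or SDK. The function only builds the dictionary that would be saved as the `--cli-input-json` file.

```python
import copy


def add_instance_group(payload, new_name, template_name, instance_count=0):
    """Return a copy of an UpdateCluster payload with one extra instance group.

    Hypothetical helper: the new group is cloned from an existing group
    (template_name) so that ExecutionRole, InstanceType, LifeCycleConfig and
    the other settings are carried over unchanged.
    """
    groups = payload["InstanceGroups"]
    if any(g["InstanceGroupName"] == new_name for g in groups):
        raise ValueError(f"instance group already exists: {new_name}")

    template = next(
        (g for g in groups if g["InstanceGroupName"] == template_name), None
    )
    if template is None:
        raise ValueError(f"template group not found: {template_name}")

    new_group = copy.deepcopy(template)
    new_group["InstanceGroupName"] = new_name
    new_group["InstanceCount"] = instance_count

    # Work on a copy so the caller's payload is left untouched.
    result = copy.deepcopy(payload)
    result["InstanceGroups"].append(new_group)
    return result
```

For example, `add_instance_group(payload, "worker-group-3", "worker-group-2")` would yield the same shape as the edited files above, assuming `payload` holds the contents of the file before the edit.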
The cluster changes are applied with the AWS CLI.
takakuni@Mac genai-blog % aws sagemaker update-cluster \
--cli-input-json file://update_cluster_slurm_novpc.json
Enter MFA code for arn:aws:iam::482842011168:mfa/cm-takakuni.shinnosuke:
{
"ClusterArn": "arn:aws:sagemaker:ap-northeast-1:accountid:cluster/wmv5lg859xc8"
}
takakuni@Mac genai-blog % aws sagemaker update-cluster \
--cli-input-json file://update_cluster_slurm_vpc.json
{
"ClusterArn": "arn:aws:sagemaker:ap-northeast-1:accountid:cluster/t829p1gig0w1"
}
The empty instance group has been created.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 0,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 0,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
Adding instances
Next, I add instances to the instance group.
I modify provisioning_parameters.json and update the S3 object used by the lifecycle scripts.
{
"controller_group":"controller-machine",
"version":"1.0.0",
"worker_groups":[
{
"instance_group_name":"worker-group-1",
"instance_type":"ml.t3.medium"
},
{
"instance_group_name":"worker-group-2",
"instance_type":"ml.t3.medium"
},
+ {
+ "instance_group_name":"worker-group-3",
+ "instance_type":"ml.t3.medium"
+ }
],
"workload_manager":"slurm"
}
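As a small illustration of the change above, the following sketch registers a worker group in the parsed contents of provisioning_parameters.json. The helper name `add_worker_group` is hypothetical. The key names are taken from the file shown above. Uploading the result back to S3 is deliberately left out.

```python
import copy


def add_worker_group(params, group_name, instance_type):
    """Return a copy of provisioning_parameters with one more worker group.

    Hypothetical helper: does nothing if the group is already listed, so it
    is safe to call repeatedly.
    """
    result = copy.deepcopy(params)
    workers = result.setdefault("worker_groups", [])
    if any(w["instance_group_name"] == group_name for w in workers):
        return result  # already registered; nothing to add
    workers.append(
        {"instance_group_name": group_name, "instance_type": instance_type}
    )
    return result
```

Calling `add_worker_group(params, "worker-group-3", "ml.t3.medium")` on the original file contents should produce the structure shown above.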
I change the instance count from 0 to 2.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
- "InstanceCount": 0,
+ "InstanceCount": 2,
"InstanceGroupName": "worker-group-3",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
- "InstanceCount": 0,
+ "InstanceCount": 2,
"InstanceGroupName": "worker-group-3",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
]
}
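Changing only InstanceCount, as in the two files above, is easy to get wrong by hand when the file lists several groups. A minimal sketch of that single edit follows. The name `set_instance_count` is my own and hypothetical. The same function would also cover scale-down by passing a smaller count.

```python
import copy


def set_instance_count(payload, group_name, count):
    """Return a copy of an UpdateCluster payload with one group's count changed.

    Hypothetical helper: only InstanceCount of the named group is modified;
    every other group and setting is copied as-is.
    """
    if count < 0:
        raise ValueError("InstanceCount must be 0 or greater")

    result = copy.deepcopy(payload)
    for group in result["InstanceGroups"]:
        if group["InstanceGroupName"] == group_name:
            group["InstanceCount"] = count
            return result
    raise KeyError(f"instance group not found: {group_name}")
```

For the step above this would be `set_instance_count(payload, "worker-group-3", 2)`.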
This also went through without problems, and the instance count increased to 2. The Status is InService, so the group is ready to run.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 2,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 2,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
Deleting instances with the UpdateCluster API
Now I delete instances. First, I reduce the instance count with the UpdateCluster API.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
- "InstanceCount": 2,
+ "InstanceCount": 1,
"InstanceGroupName": "worker-group-3",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
- "InstanceCount": 2,
+ "InstanceCount": 1,
"InstanceGroupName": "worker-group-3",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
]
}
The instance count has changed to 1, and the Status is InService.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
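The check performed by eye above (CurrentCount matches the requested value and Status is InService) can be written as a small predicate over the describe-cluster output. This is only a sketch: the function name `group_is_settled` is hypothetical, and it relies solely on the `CurrentCount`, `InstanceGroupName`, and `Status` fields visible in the outputs above. Fetching the output and polling until the predicate holds are left to the caller.

```python
def group_is_settled(describe_output, group_name, expected_count):
    """Return True when the named group shows the expected count and is InService.

    Hypothetical helper operating on the parsed JSON of
    `aws sagemaker describe-cluster`. A missing group counts as not settled.
    """
    for group in describe_output.get("InstanceGroups", []):
        if group.get("InstanceGroupName") == group_name:
            return (
                group.get("CurrentCount") == expected_count
                and group.get("Status") == "InService"
            )
    return False
```

With the output above, `group_is_settled(output, "worker-group-3", 1)` would be the condition to wait for after the scale-down request.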
Deleting instances with the BatchDeleteClusterNodes API
Next, I use the BatchDeleteClusterNodes API to delete the remaining instance in worker-group-3.
# Identify the instance
takakuni@Mac genai-blog % NODE_ID=$(aws sagemaker list-cluster-nodes \
--cluster-name slurm-orchestrator-vpc \
--instance-group-name-contains "worker-group-3" | jq -r '.ClusterNodeSummaries[].InstanceId')
# Show the instance ID
takakuni@Mac genai-blog % echo "Deleting nodes: $NODE_ID"
Deleting nodes: i-0d60f6a5d1ba07c6c
# Delete the instance
takakuni@Mac genai-blog % aws sagemaker batch-delete-cluster-nodes \
--cluster-name slurm-orchestrator-vpc \
--node-ids $NODE_ID
{
"Successful": [
"i-0d60f6a5d1ba07c6c"
]
}
# Identify the instance
takakuni@Mac genai-blog % NODE_ID=$(aws sagemaker list-cluster-nodes \
--cluster-name slurm-orchestrator-no-vpc \
--instance-group-name-contains "worker-group-3" | jq -r '.ClusterNodeSummaries[].InstanceId')
# Show the instance ID
takakuni@Mac genai-blog % echo "Deleting nodes: $NODE_ID"
Deleting nodes: i-0f7184e323b14beca
# Delete the instance
takakuni@Mac genai-blog % aws sagemaker batch-delete-cluster-nodes \
--cluster-name slurm-orchestrator-no-vpc \
--node-ids $NODE_ID
{
"Successful": [
"i-0f7184e323b14beca"
]
}
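One point to keep in mind with the commands above: as its name suggests, `--instance-group-name-contains` is a substring filter, so a filter of "worker-group-3" would presumably also match a group named, say, "worker-group-30". The sketch below picks node IDs by exact group name from the parsed list-cluster-nodes output instead. The function name `node_ids_for_group` is hypothetical, and I am assuming each entry in `ClusterNodeSummaries` carries an `InstanceGroupName` field alongside the `InstanceId` used above.

```python
def node_ids_for_group(list_nodes_output, group_name):
    """Return the instance IDs that belong to exactly the named group.

    Hypothetical helper operating on the parsed JSON of
    `aws sagemaker list-cluster-nodes`. Assumes each summary has
    InstanceGroupName and InstanceId keys.
    """
    return [
        node["InstanceId"]
        for node in list_nodes_output.get("ClusterNodeSummaries", [])
        if node.get("InstanceGroupName") == group_name
    ]
```

The resulting list could then be passed to `batch-delete-cluster-nodes --node-ids`, as in the commands above.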
Checking the clusters, worker-group-3 is back to being an instance group with an instance count of 0.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 0,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 0,
"InstanceGroupName": "worker-group-3",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
Deleting an instance group
Now for deleting the instance group. Instance groups are deleted with the UpdateCluster API.
Remove the target instance group from InstanceGroups and put its name in InstanceGroupsToDelete.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
+ }
- },
- {
- "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
- "InstanceCount": 1,
- "InstanceGroupName": "worker-group-3",
- "InstanceStorageConfigs": [],
- "InstanceType": "ml.t3.medium",
- "LifeCycleConfig": {
- "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
- "OnCreate": "on_create.sh"
- },
- "ThreadsPerCore": 2
- }
- ]
+ ],
+ "InstanceGroupsToDelete": ["worker-group-3"]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
+ }
- },
- {
- "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
- "InstanceCount": 1,
- "InstanceGroupName": "worker-group-3",
- "InstanceStorageConfigs": [],
- "InstanceType": "ml.t3.medium",
- "LifeCycleConfig": {
- "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
- "OnCreate": "on_create.sh"
- },
- "ThreadsPerCore": 2
- }
- ]
+ ],
+ "InstanceGroupsToDelete": ["worker-group-3"]
}
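The edit shown in the two files above (drop the group from InstanceGroups, name it in InstanceGroupsToDelete) can be sketched as one function. The name `remove_instance_group` is hypothetical. The guard against removing the last group reflects the requirement from the summary that at least one instance group must remain. The function does not know which group is the controller, so that restriction is not checked here.

```python
import copy


def remove_instance_group(payload, group_name):
    """Return a copy of an UpdateCluster payload that deletes one instance group.

    Hypothetical helper: the group is removed from InstanceGroups and its
    name becomes the only entry in InstanceGroupsToDelete.
    """
    result = copy.deepcopy(payload)
    groups = result["InstanceGroups"]
    remaining = [g for g in groups if g["InstanceGroupName"] != group_name]

    if len(remaining) == len(groups):
        raise KeyError(f"instance group not found: {group_name}")
    if not remaining:
        raise ValueError("at least one instance group must remain in the cluster")

    result["InstanceGroups"] = remaining
    # Overwrite rather than append, so names of groups deleted in an earlier
    # request are not sent again.
    result["InstanceGroupsToDelete"] = [group_name]
    return result
```

For the step above this would be `remove_instance_group(payload, "worker-group-3")`.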
worker-group-3 has been removed from InstanceGroups.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-2",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
Deleting an instance group that contains instances
I also try deleting an instance group while it still contains instances.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
+ }
- },
- {
- "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
- "InstanceCount": 1,
- "InstanceGroupName": "worker-group-2",
- "InstanceStorageConfigs": [],
- "InstanceType": "ml.t3.medium",
- "LifeCycleConfig": {
- "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
- "OnCreate": "on_create.sh"
- },
- "ThreadsPerCore": 2
- }
],
"InstanceGroupsToDelete": ["worker-group-2"]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
+ }
- },
- {
- "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
- "InstanceCount": 1,
- "InstanceGroupName": "worker-group-2",
- "InstanceStorageConfigs": [],
- "InstanceType": "ml.t3.medium",
- "LifeCycleConfig": {
- "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
- "OnCreate": "on_create.sh"
- },
- "ThreadsPerCore": 2
- }
],
"InstanceGroupsToDelete": ["worker-group-2"]
}
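The hand edit above (drop the group from InstanceGroups, then name it in InstanceGroupsToDelete) can be sketched in Python. This is a minimal sketch under stated assumptions, not the tooling used in this post: the helper name build_group_deletion_request is hypothetical, the sample request is abbreviated, and the function only builds the request dictionary. Sending it (for example with the AWS CLI or an SDK) is assumed to happen separately.

```python
import copy


def build_group_deletion_request(request, group_name):
    """Return a copy of an UpdateCluster request that deletes one instance group.

    `request` mirrors the JSON files in this post (ClusterName + InstanceGroups).
    The helper itself is hypothetical; only the field names come from the post.
    """
    updated = copy.deepcopy(request)
    groups = updated.get("InstanceGroups", [])
    if not any(g["InstanceGroupName"] == group_name for g in groups):
        raise ValueError(f"instance group not found: {group_name}")

    # Drop the group from the desired state ...
    updated["InstanceGroups"] = [
        g for g in groups if g["InstanceGroupName"] != group_name
    ]
    # ... and name it explicitly in InstanceGroupsToDelete.
    to_delete = list(updated.get("InstanceGroupsToDelete", []))
    if group_name not in to_delete:
        to_delete.append(group_name)
    updated["InstanceGroupsToDelete"] = to_delete
    return updated


# Abbreviated version of the request used in this post.
request = {
    "ClusterName": "slurm-orchestrator-vpc",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "InstanceCount": 1},
        {"InstanceGroupName": "worker-group-1", "InstanceCount": 1},
        {"InstanceGroupName": "worker-group-2", "InstanceCount": 1},
    ],
}
updated = build_group_deletion_request(request, "worker-group-2")
print([g["InstanceGroupName"] for g in updated["InstanceGroups"]])
print(updated["InstanceGroupsToDelete"])
```

The helper works on a deep copy, so the original request stays available for comparison or rollback.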
worker-group-2 has been removed from InstanceGroups even though it still contained an instance, leaving only controller-machine and worker-group-1.
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "InService",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
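The same check can be done programmatically. The sketch below mirrors the jq projection used in this post on a DescribeCluster-style dictionary and confirms that a group is gone. The function names summarize_cluster and group_names are hypothetical, and the sample response is abbreviated from the output above rather than fetched from AWS.

```python
GROUP_KEYS = (
    "CurrentCount",
    "InstanceGroupName",
    "InstanceType",
    "LifeCycleConfig",
    "ExecutionRole",
    "Status",
)


def summarize_cluster(response):
    """Keep only the fields that the jq filter in this post selects."""
    return {
        "ClusterName": response.get("ClusterName"),
        "ClusterStatus": response.get("ClusterStatus"),
        "InstanceGroups": [
            {key: group.get(key) for key in GROUP_KEYS}
            for group in response.get("InstanceGroups", [])
        ],
        # Missing keys become None, just as jq prints null.
        "VpcConfig": response.get("VpcConfig"),
    }


def group_names(response):
    """List the instance group names in a DescribeCluster-style response."""
    return [g["InstanceGroupName"] for g in response.get("InstanceGroups", [])]


# Abbreviated sample based on the no-VPC cluster output above.
sample = {
    "ClusterName": "slurm-orchestrator-no-vpc",
    "ClusterStatus": "InService",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "CurrentCount": 1, "Status": "InService"},
        {"InstanceGroupName": "worker-group-1", "CurrentCount": 1, "Status": "InService"},
    ],
}
summary = summarize_cluster(sample)
print(group_names(sample))
print("worker-group-2" in group_names(sample))
```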
Deleting all instances in an instance group
Finally, let's try to delete the instance in controller-machine by setting its InstanceCount to 0.
{
"ClusterName": "slurm-orchestrator-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
+ "InstanceCount": 0,
- "InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
+ ]
- ],
- "InstanceGroupsToDelete": ["worker-group-2"]
}
{
"ClusterName": "slurm-orchestrator-no-vpc",
"InstanceGroups": [
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
+ "InstanceCount": 0,
- "InstanceCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
},
{
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"InstanceCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceStorageConfigs": [],
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ThreadsPerCore": 2
}
+ ]
- ],
- "InstanceGroupsToDelete": ["worker-group-2"]
}
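Since this request ends in a failed update (as the result that follows shows), a pre-flight check on the client side may be worth having. The sketch below flags a request that would shrink or delete the controller group before it is sent. It is an illustration only: the function name controller_violations, the current_counts mapping, and the controller_group argument are all hypothetical, and the rule it encodes is simply the behavior observed in this post, not a documented API contract.

```python
def controller_violations(request, current_counts, controller_group):
    """Return human-readable problems if the request touches the controller group.

    request:          UpdateCluster-style dict (InstanceGroups, InstanceGroupsToDelete)
    current_counts:   {group name: current instance count}, e.g. from DescribeCluster
    controller_group: name of the Slurm controller instance group (caller-supplied)
    """
    problems = []
    if controller_group in request.get("InstanceGroupsToDelete", []):
        problems.append(f"{controller_group} is listed in InstanceGroupsToDelete")

    for group in request.get("InstanceGroups", []):
        if group["InstanceGroupName"] != controller_group:
            continue
        current = current_counts.get(controller_group, 0)
        target = group["InstanceCount"]
        if target < current:
            problems.append(f"{controller_group} would shrink from {current} to {target}")
    return problems


# Abbreviated version of the request in this section.
request = {
    "ClusterName": "slurm-orchestrator-vpc",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "InstanceCount": 0},
        {"InstanceGroupName": "worker-group-1", "InstanceCount": 1},
    ],
}
current_counts = {"controller-machine": 1, "worker-group-1": 1}
problems = controller_violations(request, current_counts, "controller-machine")
print(problems)
```

Returning a list instead of raising lets the caller decide whether to abort or just warn.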
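The update fails with the message "Cannot delete the nodes in controller group: controller-machine.", which states that nodes in the controller group cannot be deleted. The cluster status becomes Failed, while both instance groups remain InService with one instance each.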
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
ClusterName: .ClusterName,
FailureMessage: .FailureMessage,
ClusterStatus: .ClusterStatus,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-vpc",
"FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
"ClusterStatus": "Failed",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
"Status": "InService"
}
],
"VpcConfig": {
"SecurityGroupIds": [
"sg-06732926c332f6d14"
],
"Subnets": [
"subnet-0f0417ec77f811354"
]
}
}
aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
ClusterName: .ClusterName,
ClusterStatus: .ClusterStatus,
FailureMessage: .FailureMessage,
InstanceGroups: [.InstanceGroups[] | {
CurrentCount: .CurrentCount,
InstanceGroupName: .InstanceGroupName,
InstanceType: .InstanceType,
LifeCycleConfig: .LifeCycleConfig,
ExecutionRole: .ExecutionRole,
Status: .Status
}],
VpcConfig: .VpcConfig
}'
{
"ClusterName": "slurm-orchestrator-no-vpc",
"ClusterStatus": "Failed",
"FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
"InstanceGroups": [
{
"CurrentCount": 1,
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
},
{
"CurrentCount": 1,
"InstanceGroupName": "worker-group-1",
"InstanceType": "ml.t3.medium",
"LifeCycleConfig": {
"SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
"Status": "InService"
}
],
"VpcConfig": null
}
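When running updates like this from a script, it helps to surface the failure reason instead of reading the JSON by eye. The following is a minimal sketch that inspects a DescribeCluster-style dictionary. The function name update_failure is hypothetical, and the sample is abbreviated from the output above.

```python
def update_failure(response):
    """Return the failure message if the cluster reports Failed, else None."""
    if response.get("ClusterStatus") != "Failed":
        return None
    # FailureMessage may be absent, so fall back to a placeholder.
    return response.get("FailureMessage") or "(no FailureMessage reported)"


# Abbreviated sample based on the output above.
failed = {
    "ClusterName": "slurm-orchestrator-no-vpc",
    "ClusterStatus": "Failed",
    "FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "CurrentCount": 1, "Status": "InService"},
        {"InstanceGroupName": "worker-group-1", "CurrentCount": 1, "Status": "InService"},
    ],
}
message = update_failure(failed)
print(message)
```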
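Summary
That wraps up "Sorting out scale-down in SageMaker HyperPod: Slurm orchestrator edition".
As far as I remember, until recently the Slurm orchestrator did not allow nodes to be deleted from any instance group, but after this update even entire instance groups can be deleted. For someone like me who wants to keep verification costs as low as possible, this is a very welcome update. I hope this post is helpful to someone.
This was Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division!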