Sorting Out Scale-Down in SageMaker HyperPod: Slurm Orchestrator Edition

2025.02.24

Hello! I'm Takakuni (@takakuni_) from the Consulting Department of the Cloud Business Division.

Recently, SageMaker HyperPod clusters gained support for deleting instance groups.

Changes Added new capability in the UpdateCluster operation to remove instance groups from your SageMaker HyperPod cluster.

https://awsapichanges.com/archive/changes/f24a18-api.sagemaker.html

This makes adding and removing instance groups and instances in a HyperPod cluster more flexible. In this post, I sort out how scale-down behaves with the Slurm orchestrator.

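Summary

Here is the summary first. I found the following requirements.

Common

  • At least one instance group must exist in the cluster
  • There are two ways to scale down instances
    • Changing InstanceCount with the UpdateCluster API
      • The instances to be removed are chosen at random; you cannot pick them
    • Deleting specific instances with the BatchDeleteClusterNodes API
  • There is no difference between a user-defined VPC environment and the SageMaker-managed VPC environment
  • Instances and instance groups assigned to the controller group cannot be deleted
  • Instance groups assigned to worker groups or login groups can be reduced to 0 instances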
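The findings above can be written down as a small pre-flight check. The following is a minimal sketch in Python, not an AWS API: `check_scale_down` and its parameters are hypothetical names, and the rules it encodes are only the observations listed in this summary.

```python
def check_scale_down(groups, controller_group, groups_to_delete=(), new_counts=None):
    """Return a list of rule violations for a planned scale-down.

    groups: dict mapping instance group name -> current instance count.
    controller_group: name of the Slurm controller instance group.
    groups_to_delete: names of instance groups the plan removes.
    new_counts: dict mapping instance group name -> desired instance count.
    """
    new_counts = new_counts or {}
    problems = []

    # Observation: the controller group itself cannot be deleted.
    if controller_group in groups_to_delete:
        problems.append("controller group cannot be deleted")

    # Observation: instances in the controller group cannot be removed.
    current = groups[controller_group]
    if new_counts.get(controller_group, current) < current:
        problems.append("controller instances cannot be removed")

    # Observation: at least one instance group must remain in the cluster.
    remaining = [name for name in groups if name not in groups_to_delete]
    if not remaining:
        problems.append("at least one instance group must remain")

    return problems
```

Worker and login groups are deliberately not checked for a minimum count, because the summary notes that they can be reduced to 0 instances.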

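Trying it out

I prepare two SageMaker HyperPod clusters that use the Slurm orchestrator: one with a VPC configuration and one without.

The initial state of the instance groups is as follows.

  • controller-machine (1 instance)
  • worker-group-1 (1 instance)
  • worker-group-2 (1 instance)

The scenarios to run are as follows.

  1. Add an instance group (worker-group-3)
  2. Add instances to the instance group (worker-group-3) (2 instances)
  3. Delete an instance from the instance group (worker-group-3) with the UpdateCluster API (1 instance)
  4. Delete an instance from the instance group (worker-group-3) with the BatchDeleteClusterNodes API (1 instance)
  5. Delete the instance group (worker-group-3)
  6. Delete the instance group (worker-group-2)
    1. Can an instance group be deleted while it still contains instances?
  7. Delete the instance in the instance group (worker-group-1) with the UpdateCluster API (1 instance)
  8. Delete the instance in the instance group (controller-machine) with the UpdateCluster API (1 instance)

Adding an instance group

First, add an instance group.

Instances are added later, so create the instance group with an instance count of 0.

Modify the JSON file used for UpdateCluster.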

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
-    }
+    },
+    {
+      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
+      "InstanceCount": 0,
+      "InstanceGroupName": "worker-group-3",
+      "InstanceStorageConfigs": [],
+      "InstanceType": "ml.t3.medium",
+      "LifeCycleConfig": {
+        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
+        "OnCreate": "on_create.sh"
+      },
+      "ThreadsPerCore": 2
+    }
  ]
}
update_cluster_slurm_novpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
-    }
+    },
+    {
+      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
+      "InstanceCount": 0,
+      "InstanceGroupName": "worker-group-3",
+      "InstanceStorageConfigs": [],
+      "InstanceType": "ml.t3.medium",
+      "LifeCycleConfig": {
+        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
+        "OnCreate": "on_create.sh"
+      },
+      "ThreadsPerCore": 2
+    }
  ]
}
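Editing these files by hand gets repetitive, so the same change can be generated. The following is a minimal sketch, assuming the payload has the `InstanceGroups` shape shown above; `add_instance_group` is a hypothetical helper that clones an existing group's settings instead of retyping them.

```python
import copy


def add_instance_group(payload, template_name, new_name, count=0):
    """Return a copy of an UpdateCluster payload with one more instance group.

    The new group copies every setting of `template_name` and only changes
    the group name and the instance count. The input payload is not modified.
    """
    updated = copy.deepcopy(payload)
    groups = updated["InstanceGroups"]

    if any(g["InstanceGroupName"] == new_name for g in groups):
        raise ValueError(f"instance group already exists: {new_name}")

    template = next(
        (g for g in groups if g["InstanceGroupName"] == template_name), None
    )
    if template is None:
        raise KeyError(template_name)

    new_group = copy.deepcopy(template)
    new_group["InstanceGroupName"] = new_name
    new_group["InstanceCount"] = count
    groups.append(new_group)
    return updated
```

Writing the result with `json.dump` produces a file that can be passed to `--cli-input-json` in the same way as the hand-edited files.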

The cluster is updated with the AWS CLI.

takakuni@Mac genai-blog % aws sagemaker update-cluster \
--cli-input-json file://update_cluster_slurm_novpc.json
Enter MFA code for arn:aws:iam::accountid:mfa/cm-takakuni.shinnosuke:
{
    "ClusterArn": "arn:aws:sagemaker:ap-northeast-1:accountid:cluster/wmv5lg859xc8"
}
takakuni@Mac genai-blog % aws sagemaker update-cluster \
--cli-input-json file://update_cluster_slurm_vpc.json
{
    "ClusterArn": "arn:aws:sagemaker:ap-northeast-1:accountid:cluster/t829p1gig0w1"
}

An empty instance group has been created.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 0,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 0,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}

Adding instances

Next, add instances to the instance group.

Modify provisioning_parameters.json and update the S3 object used by the lifecycle scripts.

provisioning_parameters.json
{
  "controller_group":"controller-machine",
  "version":"1.0.0",
  "worker_groups":[
    {
      "instance_group_name":"worker-group-1",
      "instance_type":"ml.t3.medium"
    },
    {
      "instance_group_name":"worker-group-2",
      "instance_type":"ml.t3.medium"
-    }
+    },
+    {
+      "instance_group_name":"worker-group-3",
+      "instance_type":"ml.t3.medium"
+    }
  ],
  "workload_manager":"slurm"
}
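The same edit to provisioning_parameters.json can be scripted. This is a minimal sketch, assuming the file has the `worker_groups` layout shown above; `add_worker_group` is a hypothetical helper name, and uploading the result to S3 is left out.

```python
def add_worker_group(params, name, instance_type):
    """Return provisioning parameters with a worker group entry added.

    If a worker group with the same name is already listed, the parameters
    are returned unchanged, so the call is safe to repeat.
    """
    updated = dict(params)
    worker_groups = [dict(g) for g in params.get("worker_groups", [])]

    if not any(g["instance_group_name"] == name for g in worker_groups):
        worker_groups.append(
            {"instance_group_name": name, "instance_type": instance_type}
        )

    updated["worker_groups"] = worker_groups
    return updated
```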

Change the instance count from 0 to 2.

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
-      "InstanceCount": 0,
+      "InstanceCount": 2,
      "InstanceGroupName": "worker-group-3",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
  ]
}
update_cluster_slurm_novpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
-      "InstanceCount": 0,
+      "InstanceCount": 2,
      "InstanceGroupName": "worker-group-3",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
  ]
}

This also worked without problems: the instance count increased to 2. Status is InService, so the group is ready to run.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 2,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 2,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}
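Reading the describe-cluster output by eye works for two clusters, but the same check can be expressed in code. The following is a minimal sketch that inspects an already-parsed describe-cluster response; `cluster_settled` is a hypothetical helper, and it relies only on the `ClusterStatus`, `Status`, and `CurrentCount` fields visible in the output above.

```python
def cluster_settled(description, expected_counts):
    """Return True when the cluster has reached the expected state.

    description: parsed JSON from `aws sagemaker describe-cluster`.
    expected_counts: dict mapping instance group name -> expected CurrentCount.
    """
    if description.get("ClusterStatus") != "InService":
        return False

    groups = {
        g["InstanceGroupName"]: g for g in description.get("InstanceGroups", [])
    }
    for name, count in expected_counts.items():
        group = groups.get(name)
        if group is None:
            return False
        # Both the group status and the instance count have to match.
        if group.get("Status") != "InService" or group.get("CurrentCount") != count:
            return False
    return True
```

For the state shown above, `cluster_settled(description, {"worker-group-3": 2})` would be the condition to wait for before moving on to the next scenario.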

Deleting instances: UpdateCluster API

Now delete instances. First, reduce the instance count with the UpdateCluster API.

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
-      "InstanceCount": 2,
+      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-3",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
  ]
}
update_cluster_slurm_novpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
-      "InstanceCount": 2,
+      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-3",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
  ]
}
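Only one field changes in each file, so this step is easy to script. The following is a minimal sketch, assuming the payload shape shown above; `set_instance_count` is a hypothetical helper name.

```python
import copy


def set_instance_count(payload, group_name, count):
    """Return a copy of an UpdateCluster payload with one group's count changed.

    Raises ValueError for a negative count and KeyError when the group is
    not present in the payload. The input payload is not modified.
    """
    if count < 0:
        raise ValueError("InstanceCount must be 0 or greater")

    updated = copy.deepcopy(payload)
    for group in updated["InstanceGroups"]:
        if group["InstanceGroupName"] == group_name:
            group["InstanceCount"] = count
            return updated
    raise KeyError(group_name)
```

Note that lowering the count this way does not let you choose which instance is removed; as the summary says, the instance is picked at random.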

The instance count changed to 1, and Status is InService.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}

Deleting instances: BatchDeleteClusterNodes API

Next, delete the remaining instance in worker-group-3 with the BatchDeleteClusterNodes API.

slurm-orchestrator-vpc
# Identify the instance
takakuni@Mac genai-blog % NODE_ID=$(aws sagemaker list-cluster-nodes \
    --cluster-name slurm-orchestrator-vpc \
    --instance-group-name-contains "worker-group-3" | jq -r '.ClusterNodeSummaries[].InstanceId')
# Show the instance ID
takakuni@Mac genai-blog % echo "Deleting nodes: $NODE_ID"
Deleting nodes: i-0d60f6a5d1ba07c6c
# Delete the instance
takakuni@Mac genai-blog % aws sagemaker batch-delete-cluster-nodes \
    --cluster-name slurm-orchestrator-vpc \
    --node-ids $NODE_ID
{
    "Successful": [
        "i-0d60f6a5d1ba07c6c"
    ]
}
slurm-orchestrator-no-vpc
# Identify the instance
takakuni@Mac genai-blog % NODE_ID=$(aws sagemaker list-cluster-nodes \
    --cluster-name slurm-orchestrator-no-vpc \
    --instance-group-name-contains "worker-group-3" | jq -r '.ClusterNodeSummaries[].InstanceId')
# Show the instance ID
takakuni@Mac genai-blog % echo "Deleting nodes: $NODE_ID"
Deleting nodes: i-0f7184e323b14beca
# Delete the instance
takakuni@Mac genai-blog % aws sagemaker batch-delete-cluster-nodes \
    --cluster-name slurm-orchestrator-no-vpc \
    --node-ids $NODE_ID
{
    "Successful": [
        "i-0f7184e323b14beca"
    ]
}
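The commands above work because each group holds a single instance. With more nodes, two details matter: the `--instance-group-name-contains` filter is, as its name suggests, a substring match, and a long ID list may need to be split across several calls. The following is a minimal sketch that handles both on an already-parsed list-cluster-nodes response; `node_id_batches` and `batch_size` are hypothetical names, and the right batch size is an assumption to check against the BatchDeleteClusterNodes API reference.

```python
def node_id_batches(list_nodes_response, group_name, batch_size):
    """Group the instance IDs of one instance group into fixed-size batches.

    list_nodes_response: parsed JSON from `aws sagemaker list-cluster-nodes`.
    group_name: exact instance group name to keep (no substring matching).
    batch_size: maximum number of IDs per batch; must be 1 or greater.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be 1 or greater")

    ids = [
        node["InstanceId"]
        for node in list_nodes_response.get("ClusterNodeSummaries", [])
        if node.get("InstanceGroupName") == group_name
    ]
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
```

Each returned batch can then be passed as the `--node-ids` argument of one `aws sagemaker batch-delete-cluster-nodes` call.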

Checking the cluster shows that the instance group is back to an instance count of 0.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'

{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 0,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 0,
      "InstanceGroupName": "worker-group-3",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}

Deleting instance groups

Now move on to deleting instance groups. Instance groups are deleted with the UpdateCluster API.

Remove the target instance group from InstanceGroups and put its name in InstanceGroupsToDelete.
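The two-part edit described here (drop the entry, then list its name) can be generated in one step. The following is a minimal sketch, assuming the payload shape used throughout this post; `mark_group_for_deletion` is a hypothetical helper name.

```python
import copy


def mark_group_for_deletion(payload, group_name):
    """Return a copy of an UpdateCluster payload that deletes one group.

    The group is removed from InstanceGroups and its name is appended to
    InstanceGroupsToDelete. Raises KeyError when the group is not present.
    The input payload is not modified.
    """
    updated = copy.deepcopy(payload)
    kept = [
        g for g in updated["InstanceGroups"]
        if g["InstanceGroupName"] != group_name
    ]
    if len(kept) == len(updated["InstanceGroups"]):
        raise KeyError(group_name)

    updated["InstanceGroups"] = kept
    updated.setdefault("InstanceGroupsToDelete", []).append(group_name)
    return updated
```

Doing both changes in one function avoids a payload that names the same group in both lists.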

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
+    }
-    },
-    {
-      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
-      "InstanceCount": 1,
-      "InstanceGroupName": "worker-group-3",
-      "InstanceStorageConfigs": [],
-      "InstanceType": "ml.t3.medium",
-      "LifeCycleConfig": {
-        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
-        "OnCreate": "on_create.sh"
-      },
-      "ThreadsPerCore": 2
-    }
-  ]
+  ],
+  "InstanceGroupsToDelete": ["worker-group-3"]
}
update_cluster_slurm_novpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
+    }
-    },
-    {
-      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
-      "InstanceCount": 1,
-      "InstanceGroupName": "worker-group-3",
-      "InstanceStorageConfigs": [],
-      "InstanceType": "ml.t3.medium",
-      "LifeCycleConfig": {
-        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
-        "OnCreate": "on_create.sh"
-      },
-      "ThreadsPerCore": 2
-    }
-  ]
+  ],
+  "InstanceGroupsToDelete": ["worker-group-3"]
}
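
The manual edit above (drop the group from InstanceGroups, list its name in InstanceGroupsToDelete) can also be sketched as a small helper. This is a minimal sketch, not part of the original verification: the function name `build_group_deletion_request` and the trimmed sample request are hypothetical, and the entries are shortened for readability.

```python
import copy
import json


def build_group_deletion_request(update_request, group_name):
    """Return a new UpdateCluster request body that deletes `group_name`.

    The group is dropped from InstanceGroups and its name is appended to
    InstanceGroupsToDelete, mirroring the manual JSON edit in the article.
    """
    request = copy.deepcopy(update_request)  # leave the caller's dict untouched
    names = [g["InstanceGroupName"] for g in request["InstanceGroups"]]
    if group_name not in names:
        raise ValueError(f"instance group not found: {group_name}")
    request["InstanceGroups"] = [
        g for g in request["InstanceGroups"] if g["InstanceGroupName"] != group_name
    ]
    to_delete = request.setdefault("InstanceGroupsToDelete", [])
    if group_name not in to_delete:
        to_delete.append(group_name)
    return request


# Hypothetical, trimmed sample; real entries also carry ExecutionRole,
# LifeCycleConfig, InstanceType and so on.
sample_request = {
    "ClusterName": "slurm-orchestrator-vpc",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "InstanceCount": 1},
        {"InstanceGroupName": "worker-group-3", "InstanceCount": 0},
    ],
}

deletion_request = build_group_deletion_request(sample_request, "worker-group-3")
print(json.dumps(deletion_request, indent=2))
```

The printed JSON has the same shape as the edited files above. With boto3, the dict could presumably be passed as `boto3.client("sagemaker").update_cluster(**deletion_request)`, but that call was not part of this verification.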

Running describe-cluster confirms that worker-group-3 has been removed from InstanceGroups.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}
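
Instead of reading the output by eye, the same check can be done on the parsed response. The following is a rough sketch under the assumption that the describe-cluster JSON has already been loaded into a dict; the helper names and the trimmed sample response are hypothetical.

```python
def instance_group_counts(describe_response):
    """Map each instance group name to its CurrentCount."""
    return {
        group["InstanceGroupName"]: group["CurrentCount"]
        for group in describe_response.get("InstanceGroups", [])
    }


def group_is_deleted(describe_response, group_name):
    """True when `group_name` no longer appears in the response."""
    return group_name not in instance_group_counts(describe_response)


# Sample shaped like the describe-cluster output above (fields trimmed).
sample_response = {
    "ClusterName": "slurm-orchestrator-vpc",
    "ClusterStatus": "InService",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "CurrentCount": 1},
        {"InstanceGroupName": "worker-group-1", "CurrentCount": 1},
        {"InstanceGroupName": "worker-group-2", "CurrentCount": 1},
    ],
}

print(instance_group_counts(sample_response))
print(group_is_deleted(sample_response, "worker-group-3"))
```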

Deleting an instance group that still contains instances

Next, try deleting an instance group that still contains instances.

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
+    }
-    },
-    {
-      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
-      "InstanceCount": 1,
-      "InstanceGroupName": "worker-group-2",
-      "InstanceStorageConfigs": [],
-      "InstanceType": "ml.t3.medium",
-      "LifeCycleConfig": {
-        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
-        "OnCreate": "on_create.sh"
-      },
-      "ThreadsPerCore": 2
-    }
  ],
  "InstanceGroupsToDelete": ["worker-group-2"]
}
update_cluster_slurm_no_vpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
+    }
-    },
-    {
-      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
-      "InstanceCount": 1,
-      "InstanceGroupName": "worker-group-2",
-      "InstanceStorageConfigs": [],
-      "InstanceType": "ml.t3.medium",
-      "LifeCycleConfig": {
-        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
-        "OnCreate": "on_create.sh"
-      },
-      "ThreadsPerCore": 2
-    }
  ],
  "InstanceGroupsToDelete": ["worker-group-2"]
}
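
Because this request deletes a group that still has a running instance, a pre-check that warns before sending it may be useful. This is only a sketch of such a guard: `groups_deleted_with_instances` is a hypothetical helper, and the sample dicts are trimmed stand-ins for the describe-cluster response and the update request.

```python
def groups_deleted_with_instances(describe_response, groups_to_delete):
    """Return {name: CurrentCount} for groups slated for deletion that still hold instances."""
    current = {
        group["InstanceGroupName"]: group.get("CurrentCount", 0)
        for group in describe_response.get("InstanceGroups", [])
    }
    return {
        name: current[name]
        for name in groups_to_delete
        if current.get(name, 0) > 0
    }


# Trimmed stand-in for the cluster state before the deletion request.
sample_response = {
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "CurrentCount": 1},
        {"InstanceGroupName": "worker-group-1", "CurrentCount": 1},
        {"InstanceGroupName": "worker-group-2", "CurrentCount": 1},
    ]
}
update_request = {
    "ClusterName": "slurm-orchestrator-vpc",
    "InstanceGroupsToDelete": ["worker-group-2"],
}

occupied = groups_deleted_with_instances(
    sample_response, update_request["InstanceGroupsToDelete"]
)
for name, count in occupied.items():
    print(f"warning: {name} still has {count} instance(s)")
```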

worker-group-2 has been removed from InstanceGroups even though it still contained an instance, leaving only controller-machine and worker-group-1.

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}
takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "InService",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}

Removing all instances from an instance group

Finally, try removing the instance from controller-machine by setting its InstanceCount to 0.

update_cluster_slurm_vpc.json
{
  "ClusterName": "slurm-orchestrator-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
+      "InstanceCount": 0,
-      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
+  ]
-  ],
-  "InstanceGroupsToDelete": ["worker-group-2"]
}
update_cluster_slurm_no_vpc.json
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "InstanceGroups": [
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
+      "InstanceCount": 0,
-      "InstanceCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    },
    {
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "InstanceCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceStorageConfigs": [],
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ThreadsPerCore": 2
    }
+  ]
-  ],
-  "InstanceGroupsToDelete": ["worker-group-2"]
}
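
The InstanceCount change above can be expressed as a small helper that rewrites one group's count and leaves the rest of the request alone. This is a minimal sketch; `with_instance_count` and the trimmed sample request are hypothetical names introduced for illustration.

```python
import copy


def with_instance_count(update_request, group_name, instance_count):
    """Return a copy of the request with `group_name` set to `instance_count`."""
    if instance_count < 0:
        raise ValueError("instance_count must be 0 or greater")
    request = copy.deepcopy(update_request)  # do not mutate the caller's dict
    for group in request["InstanceGroups"]:
        if group["InstanceGroupName"] == group_name:
            group["InstanceCount"] = instance_count
            return request
    raise ValueError(f"instance group not found: {group_name}")


# Trimmed sample; real entries also carry ExecutionRole, LifeCycleConfig, etc.
sample_request = {
    "ClusterName": "slurm-orchestrator-vpc",
    "InstanceGroups": [
        {"InstanceGroupName": "controller-machine", "InstanceCount": 1},
        {"InstanceGroupName": "worker-group-1", "InstanceCount": 1},
    ],
}

scaled_request = with_instance_count(sample_request, "controller-machine", 0)
print(scaled_request["InstanceGroups"][0])
```

The helper only builds the request body. Whether the service accepts the new count is a separate matter.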

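The update fails. FailureMessage reads "Cannot delete the nodes in controller group: controller-machine.", so nodes in the controller group cannot be removed. ClusterStatus is Failed, and CurrentCount for controller-machine stays at 1.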

takakuni@Mac genai-blog % aws sagemaker describe-cluster --cluster-name slurm-orchestrator-vpc | jq '{
    ClusterName: .ClusterName,
    FailureMessage: .FailureMessage,
    ClusterStatus: .ClusterStatus,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-vpc",
  "FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
  "ClusterStatus": "Failed",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://vpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-vpc-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-06732926c332f6d14"
    ],
    "Subnets": [
      "subnet-0f0417ec77f811354"
    ]
  }
}

aws sagemaker describe-cluster --cluster-name slurm-orchestrator-no-vpc | jq '{
    ClusterName: .ClusterName,
    ClusterStatus: .ClusterStatus,
    FailureMessage: .FailureMessage,
    InstanceGroups: [.InstanceGroups[] | {
        CurrentCount: .CurrentCount,
        InstanceGroupName: .InstanceGroupName,
        InstanceType: .InstanceType,
        LifeCycleConfig: .LifeCycleConfig,
        ExecutionRole: .ExecutionRole,
        Status: .Status
    }],
    VpcConfig: .VpcConfig
}'
{
  "ClusterName": "slurm-orchestrator-no-vpc",
  "ClusterStatus": "Failed",
  "FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
  "InstanceGroups": [
    {
      "CurrentCount": 1,
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    },
    {
      "CurrentCount": 1,
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.t3.medium",
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://novpc-lifecycle-accountid/config/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::accountid:role/hyperpod-role",
      "Status": "InService"
    }
  ],
  "VpcConfig": null
}
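
When scripting these updates, the failure can be detected from the same two fields shown above, ClusterStatus and FailureMessage. The snippet below is a sketch under that assumption; `update_failure` is a hypothetical helper and the sample responses are trimmed.

```python
def update_failure(describe_response):
    """Return the FailureMessage when ClusterStatus is "Failed", else None."""
    if describe_response.get("ClusterStatus") == "Failed":
        return describe_response.get("FailureMessage", "")
    return None


# Trimmed sample mirroring the failed describe-cluster output above.
failed_response = {
    "ClusterName": "slurm-orchestrator-no-vpc",
    "ClusterStatus": "Failed",
    "FailureMessage": "Cannot delete the nodes in controller group: controller-machine.",
}
# Trimmed sample of a healthy cluster for comparison.
healthy_response = {
    "ClusterName": "slurm-orchestrator-no-vpc",
    "ClusterStatus": "InService",
}

for response in (failed_response, healthy_response):
    message = update_failure(response)
    if message is None:
        print(f"{response['ClusterName']}: {response['ClusterStatus']}")
    else:
        print(f"{response['ClusterName']}: update failed: {message}")
```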

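Summary

That wraps up "Sorting out scale-down in SageMaker HyperPod: Slurm orchestrator edition".

As I recall, until recently the Slurm orchestrator did not allow nodes to be deleted in any instance group, but after this update even whole instance groups can be deleted. For someone like me who wants to keep verification costs as low as possible, this is a very welcome update. I hope this post helps someone.

This was Takakuni (@takakuni_) from the Consulting Department, Cloud Business Division!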


© Classmethod, Inc. All rights reserved.