Environment Setup Pitfalls Learned from the Amazon SageMaker HyperPod Workshop

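This year Santa left with a single remark: "Ho ho ho, I couldn't get the GPU you asked for, so I'll cover your AWS GPU bill instead." Musing that Santa and Amazon must be on very good terms these days, I decided to take him at his word and run GPUs flat out on a SageMaker HyperPod cluster.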
#SageMaker HyperPod
#HPC
#AWS
大村保貴
2024.12.25
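As someone with no HyperPod experience, I figured the fastest way to learn it was to get hands-on. I tried the workshop and ran into plenty of questions and pitfalls during environment setup. This article walks through them.

The workshop

In this workshop you can try training a large language model (LLM) on a HyperPod cluster that uses Slurm.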
https://catalog.workshops.aws/sagemaker-hyperpod/en-US
Part of this content was offered as a workshop at re:Invent 2024.
https://dev.classmethod.jp/articles/reinvent2024-report-aim403/
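What this article covers

I used the workshop content to build the environment in my own AWS account. This article collects the pitfalls and questions I hit during that setup.
At AWS-hosted events such as re:Invent, attendees are usually handed an AWS account in which the required environment is already built, so environment setup tends to be skipped. Here I cover what to watch for when you build it in your own AWS account.

0. Prerequisites

Use your own AWS account to set up the infrastructure, such as FSx for Lustre.

2. Own Account

The workshop's CloudFormation template provisions the AWS infrastructure that HyperPod depends on first. The cluster itself is created in the next chapter.

Notes on parameter settings

The default stack name, sagemaker-hyperpod, is the convenient choice. If the stack has any other name, the HyperPod cluster creation script prompts you to enter it. Unless something prevents it, such as a name collision, stick with the default stack name.
Set the AZ parameter to an AZ ID in the region you want to use. The template's default is for us-west-2.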
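AZ IDs (for example usw2-az1) are easy to confuse with AZ names (for example us-west-2a). The following is a minimal sketch, not part of the workshop, for listing the name-to-ID mapping of a region before filling in the parameter. The function names list_az_ids and looks_like_az_id are hypothetical helpers, and the pattern check is only a rough format test: it does not confirm that the AZ exists in your account.

```shell
# Print "ZoneName ZoneId" pairs for one region (requires AWS CLI credentials).
list_az_ids() {
  aws ec2 describe-availability-zones \
    --region "$1" \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
    --output text
}

# Rough format check: AZ IDs look like "usw2-az1", AZ names like "us-west-2a".
looks_like_az_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z]+[0-9]+-az[0-9]+$'
}

# Usage (example values):
#   list_az_ids us-west-2
#   looks_like_az_id usw2-az1 && echo "looks like an AZ ID"
```

Running the format check on the value you are about to paste catches the common mistake of entering an AZ name where the template expects an AZ ID.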
Resources whose cost deserves attention

Of the resources created by the CloudFormation template, I picked out the ones whose cost is worth watching.
FSx for Lustre
NAT Gateway
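Neither can be stopped, so there is no way to pause the charges. To save money, either work through the workshop in one short burst or tear down and rebuild the environment each time.

1. Cluster Setup

This section builds the HyperPod cluster. Two approaches are offered: run the provided setup script, or click through the management console. I went with the provided script.

1. (Option A) Easy Cluster Setup

Running the script from CloudShell worked without problems. To be safe, I installed the libraries it needs before starting.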
pip install boto3 jsonschema
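The setup script's own banner lists pip, jq, boto3, and jsonschema as requirements. As a minimal sketch, assuming python3 is the interpreter the script will use, the check below prints whatever is still missing. The helper names missing_commands and missing_python_modules are hypothetical and not part of the workshop.

```shell
# Print every command from the argument list that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Print every Python module from the argument list that fails to import.
missing_python_modules() {
  for mod in "$@"; do
    python3 -c "import $mod" >/dev/null 2>&1 || printf '%s\n' "$mod"
  done
}

# Usage:
#   missing_commands aws jq pip
#   missing_python_modules boto3 jsonschema
```

Empty output from both calls means the prerequisites named in the banner are in place.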
A setup script that builds the cluster interactively is provided. Below are the points to watch when running it.
bash automate-cluster-creation.sh
https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh
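Configuring the groups (nodes)

I did not understand the HyperPod terminology, so I looked it up after deploying the cluster.
The controller instance group is the Slurm controller node, and it is required. Compared with other AWS services that use Slurm, it corresponds to the head node in AWS ParallelCluster and to the Slurm controller in AWS PCS. In short, it is the node where slurmctld runs.
The login group consists of dedicated login nodes, and creating it is optional. It corresponds to the login nodes in AWS ParallelCluster and AWS PCS.
The worker group consists of the nodes that do the computation, and it is required. It corresponds to the compute nodes in AWS ParallelCluster and AWS PCS. Instance Count is the number of instances that run at all times. At this point I mistook it for the maximum number of instances and entered 10, which left me with 10 ml.c5.xlarge instances running constantly.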
Enter the name for the controller instance group [controller-machine]:
Enter the instance type for the controller [ml.m5.12xlarge]: ml.c5.large
Do you want to add a login group? (yes/no):
yes

Enter the instance type for the login group [ml.m5.4xlarge]: ml.c5.large
✅ Login Group added

=== Worker Group Configuration ===
Do you want to add a worker instance group? (yes/no): [yes]:
Configuring Worker Group 1
Enter the instance type for worker group 1 [ml.c5.4xlarge]: ml.c5.xlarge
Enter the instance count for worker group 1 [4]: 10
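Because Instance Count is the always-on count, it can help to total it across all groups before letting the script create the cluster. The script writes its result to cluster-config.json, and the sketch below sums every InstanceCount value in such a file. It is a plain text scan using grep and awk so that it has no extra dependencies. A real JSON tool such as jq would be more robust, and total_instance_count is a hypothetical helper name.

```shell
# Sum every "InstanceCount": N value found in a cluster config file.
# This is a plain text scan, not a JSON parser.
total_instance_count() {
  grep -Eo '"InstanceCount"[[:space:]]*:[[:space:]]*[0-9]+' "$1" |
    grep -Eo '[0-9]+$' |
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Usage (file name taken from the setup script's output):
#   total_instance_count cluster-config.json
```

For the configuration in this article (one controller, one login node, and 10 workers) the total is 12 instances, all of which run until the cluster is deleted.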
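Watch your service quotas

Check in advance that the instance types you specify for each node group can actually be launched. If you accept the setup script's defaults, the controller uses ml.m5.12xlarge. Instance types that start with ml. are governed by SageMaker service quotas, not EC2 quotas.
Because I do not normally use ml. instance types, the quota was restrictive and the controller node failed to launch. The setup script helpfully shows which quota was hit (ml.m5.12xlarge for cluster usage). If creation fails, request an increase for that quota or choose a different instance type.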
✅ Creating cluster for you!
⚠️  Error occurred while creating the cluster:

An error occurred (ResourceLimitExceeded) when calling the CreateCluster operation: The account-level service limit 'ml.m5.12xlarge for cluster usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.
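The quota name quoted in the error can be looked up directly in Service Quotas. The sketch below is an illustration under assumptions rather than part of the workshop: quota_name_from_error and show_sagemaker_quota are hypothetical helper names, and the lookup assumes the quota is returned by list-service-quotas under the sagemaker service code. If it does not appear there, list-aws-default-service-quotas may show the default instead.

```shell
# Extract the quota name from a ResourceLimitExceeded message read on stdin.
quota_name_from_error() {
  sed -n "s/.*account-level service limit '\([^']*\)'.*/\1/p"
}

# Show the current value of one SageMaker quota (requires AWS CLI credentials).
show_sagemaker_quota() {
  aws service-quotas list-service-quotas \
    --service-code sagemaker \
    --query "Quotas[?QuotaName=='$1'].[QuotaName,Value]" \
    --output table
}

# Usage:
#   show_sagemaker_quota 'ml.m5.12xlarge for cluster usage'
```

Checking the value for every instance type you plan to enter, before running the setup script, avoids finding out about the limit only at cluster creation time.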
For reference, here is the output when creation succeeds.
Next Steps:
Do you want the script to create the cluster for you now? (yes/no): [yes]:
⚠️  Please note:
   - Cluster creation may take some time (~15-20 min)
   - This operation may incur costs on your AWS account
   - Ensure you understand the implications before proceeding

✅ Creating cluster for you!
✅ Cluster creation request submitted successfully. To monitor the progress of cluster creation, you can either check the SageMaker console, or you can run:.
watch -n 1 aws sagemaker list-clusters --output table
Thank you for using the SageMaker HyperPod Cluster Creation Script!
For any issues or questions, please refer to the AWS documentation.
https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started.html

Exiting script. Good luck with your SageMaker HyperPod journey! 👋
Full setup script session

For reference, here is the complete interactive session from a successful deployment.
[cloudshell-user@ip-10-134-45-116 automate-smhp-slurm]$ ./automate-cluster-creation.sh

=================================================

==== 🚀 Welcome to the SageMaker HyperPod Cluster Creation Script! 🚀 ====

=================================================
Before running this script, please ensure the following:

1. 🔑 IAM Credentials:
   You have Administrator Access Credentials in IAM.
   This is crucial as we'll be using CloudFormation to create IAM roles and policies.
   Run 'aws configure' to set up your credentials.

2. 🌐 VPC Stack:
   Deploy the sagemaker-hyperpod VPC stack using:
   https://catalog.workshops.aws/sagemaker-hyperpod/en-US/00-setup/02-own-account
   This creates essential resources: VPC, subnets, FSx Lustre volumes,
   S3 bucket, and IAM role for your SageMaker HyperPod cluster.

3. 📊 Observability Stack:
   It's highly recommended to deploy the observability stack as well.
   Navigate to https://catalog.workshops.aws/sagemaker-hyperpod/en-US/00-setup/02-own-account#2.-deploy-cluster-observability-stack-(recommended) to deploy the stack

4. 💻 Development Environment:
   Ensure you have a Linux-based development environment (macOS works great too).

5. 🔧 Packages required for this script to run:
   Ensure you install the following: pip, jq, boto3, and jsonschema

Ready to proceed? Press Enter to continue or Ctrl+C to exit...

🔍 AWS Account Verification
Your AWS Account ID is: 123456789012
Press Enter to confirm ✅ or Ctrl+C to exit❌...

📦 1a: AWS CLI Installation and Verification
=== Checking AWS CLI Installation ===
✅ AWS CLI found. Checking version...
Current version: 2.22.18
Min. required version: 2.17.1
✅ AWS CLI version 2.22.18 is up to date.
=== AWS CLI Check Complete ===

🌎 AWS Region Configuration
Please confirm that your AWS region is us-east-1 (default).
If not, enter the AWS region where you want to set up your cluster (e.g., us-west-2):
> us-east-1
✅ Region updated to: us-east-1

Your region is set to: us-east-1
Ensure your chosen region supports SageMaker HyperPod.
You can check out https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html#sagemaker-hyperpod-available-regions to learn about supported regions.
Press Enter to continue...

🔧 Setting Up Lifecycle Scripts
Cloning awsome-distributed-training
⚠️  The directory 'awsome-distributed-training' already exists.
Do you want to remove it and clone again? (yes/no):
yes
Removing existing directory...
Cloning repository...
Cloning into 'awsome-distributed-training'...
remote: Enumerating objects: 687, done.
remote: Counting objects: 100% (687/687), done.
remote: Compressing objects: 100% (559/559), done.
remote: Total 687 (delta 101), reused 475 (delta 57), pack-reused 0 (from 0)
Receiving objects: 100% (687/687), 27.63 MiB | 25.88 MiB/s, done.
Resolving deltas: 100% (101/101), done.
✅ Repository cloned successfully
Enter the name of the SageMaker VPC CloudFormation stack that was deployed as a prerequisite (default: sagemaker-hyperpod):
ohmura-sagemaker-hyperpod
✅ Configuration script updated with stack name: ohmura-sagemaker-hyperpod
Generating new environment variables...
✅ New environment variables generated and sourced

=== Environment Variables Summary ===
Note: You may ignore the INSTANCES parameter for now
Current environment variables:
export AWS_REGION=us-east-1
export INSTANCES=g5.12xlarge
export VPC_ID=vpc-08a6dd95e3c6d8b80
export SUBNET_ID=subnet-0f86c5bd4038e1a6e
export PUBLIC_SUBNET_ID=subnet-020fb679044617906
export FSX_ID=fs-0a3d78c27db03d58e
export FSX_MOUNTNAME=ccuitb4v
export SECURITY_GROUP=sg-028d6a11916c85acf
export ROLE=arn:aws:iam::123456789012:role/ohmura-sagemaker-hyperpod-AmazonSagemakerClusterExe-HPprMfnS4bYv
export ROLENAME=ohmura-sagemaker-hyperpod-AmazonSagemakerClusterExe-HPprMfnS4bYv
export BUCKET=sagemaker-lifecycle-4681d320

=== Environment Setup Complete ===
=== Setting Up Lifecycle Scripts ===
Did you deploy the optional hyperpod-observability CloudFormation stack? (yes/no)
Observability not enabled. Continuing with default configuration
Uploading your lifecycle scripts to S3 bucket sagemaker-lifecycle-4681d320
✅ Lifecycle scripts uploaded successfully

=== Lifecycle Scripts Setup Complete ===
✅ Lifecycle scripts setup completed

🚀 Creating the Cluster
1c. Generating cluster configuration...

=== Lifecycle Scripts Setup Complete ===
Enter the name for the controller instance group [controller-machine]:
Enter the instance type for the controller [ml.m5.12xlarge]: ml.c5.large
Do you want to add a login group? (yes/no):
yes
Enter the instance type for the login group [ml.m5.4xlarge]: ml.c5.large
✅ Login Group added

=== Worker Group Configuration ===
Do you want to add a worker instance group? (yes/no): [yes]:
Configuring Worker Group 1
Enter the instance type for worker group 1 [ml.c5.4xlarge]: ml.c5.xlarge
Enter the instance count for worker group 1 [4]: 10
Are you using training plans? (yes/no):
no
✅ Worker Group 1 added
Do you want to add another worker instance group? (yes/no): [no]: no
What would you like to name your cluster? (default: ml-cluster):
✅ cluster-config.json created successfully

Creating provisioning_parameters.json...
✅ provisioning_parameters.json created successfully

Copying configuration to S3 bucket...
✅ Provisioning Parameters uploaded successfully

=== Cluster Configuration Complete ===
✅ Cluster configuration created successfully
ℹ️  Validating the generated configuration before proceeding
✅ Cluster configuration validated!
ℹ️  For your viewing, here's the cluster configuration generated. Please make sure it looks right before proceeding. Press enter to continue, or Ctrl+C to exit and make changes
{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "login-group",
      "InstanceType": "ml.c5.large",
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
          }
        }
      ],
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-lifecycle-4681d320/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/ohmura-sagemaker-hyperpod-AmazonSagemakerClusterExe-HPprMfnS4bYv",
      "ThreadsPerCore": 2
    },
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.c5.large",
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
          }
        }
      ],
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-lifecycle-4681d320/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/ohmura-sagemaker-hyperpod-AmazonSagemakerClusterExe-HPprMfnS4bYv",
      "ThreadsPerCore": 2
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.c5.xlarge",
      "InstanceCount": 10,
      "InstanceStorageConfigs": [
        {
          "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
          }
        }
      ],
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-lifecycle-4681d320/src",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/ohmura-sagemaker-hyperpod-AmazonSagemakerClusterExe-HPprMfnS4bYv",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": [
      "sg-028d6a11916c85acf"
    ],
    "Subnets": [
      "subnet-0f86c5bd4038e1a6e"
    ]
  }
}

=================================================

==== 🎉 Cluster Creation Script Completed! 🎉 ====

=================================================
Congratulations! You've completed all the preparatory steps.
Next Steps:
Do you want the script to create the cluster for you now? (yes/no): [yes]:
⚠️  Please note:
   - Cluster creation may take some time (~15-20 min)
   - This operation may incur costs on your AWS account
   - Ensure you understand the implications before proceeding

✅ Creating cluster for you!
✅ Cluster creation request submitted successfully. To monitor the progress of cluster creation, you can either check the SageMaker console, or you can run:.
watch -n 1 aws sagemaker list-clusters --output table
Thank you for using the SageMaker HyperPod Cluster Creation Script!
For any issues or questions, please refer to the AWS documentation.
https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started.html

Exiting script. Good luck with your SageMaker HyperPod journey! 👋

[cloudshell-user@ip-10-134-45-116 automate-smhp-slurm]$
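g. SSH into Cluster

This section logs in to the cluster created by the setup script.

Connecting with Session Manager

The --target value requires the cluster ID, the node group name, and the instance ID. The cluster ID and the node group name are joined with _ (underscore), while the node group name and the instance ID are joined with - (hyphen). The two are easy to mix up, so be careful.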
aws ssm start-session --target sagemaker-cluster:4oq7xjbaxltw_login-group-i-00068f3caa180fb18
CLUSTER_ID=hoge
NODE_GROUP_NAME=fuga
INSTANCE_ID=piyo

aws ssm start-session --target sagemaker-cluster:${CLUSTER_ID}_${NODE_GROUP_NAME}-${INSTANCE_ID}
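The three parts can be looked up with the AWS CLI instead of being copied by hand. The following is a minimal sketch, assuming the cluster ID is the last path segment of the cluster ARN (arn:aws:sagemaker:...:cluster/<id>). cluster_id_from_arn and build_ssm_target are hypothetical helper names, and ml-cluster is the cluster name used in this article.

```shell
# The cluster ID is assumed to be the last path segment of the cluster ARN.
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Join the parts: underscore after the cluster ID, hyphen before the instance ID.
build_ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Usage (requires AWS CLI credentials):
#   arn="$(aws sagemaker describe-cluster --cluster-name ml-cluster \
#     --query ClusterArn --output text)"
#   aws sagemaker list-cluster-nodes --cluster-name ml-cluster \
#     --query 'ClusterNodeSummaries[].[InstanceGroupName,InstanceId]' --output text
#   aws ssm start-session \
#     --target "$(build_ssm_target "$(cluster_id_from_arn "$arn")" login-group i-00068f3caa180fb18)"
```

Keeping the separators in one function means the underscore and hyphen only have to be right in one place.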
The user after login is root

Connecting through Session Manager logged me in as root. The default user on HyperPod instances was not ssm-user.
# whoami
root
The documentation says to switch users with sudo su - ubuntu.
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-run-jobs-access-nodes.html
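Worker groups are always running

Checking the cluster's running instances showed 10 instances up in the worker group. I had assumed they would launch only when a job was submitted, as in AWS ParallelCluster or AWS PCS, but I was wrong.
The sinfo command also confirmed that 10 nodes were up. I had only just created the cluster, so of course there were no jobs.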
# sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*            up   infinite     10   idle ip-10-1-45-185,ip-10-1-52-110,ip-10-1-62-81,ip-10-1-63-193,ip-10-1-67-25,ip-10-1-72-114,ip-10-1-104-[113,221],ip-10-1-109-180,ip-10-1-112-220
ml.c5.xlarge    up   infinite     10   idle ip-10-1-45-185,ip-10-1-52-110,ip-10-1-62-81,ip-10-1-63-193,ip-10-1-67-25,ip-10-1-72-114,ip-10-1-104-[113,221],ip-10-1-109-180,ip-10-1-112-220

# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
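Changing the worker group instance count from the management console fails

I tried to change the number of running worker group instances from the management console.
I changed the quantity from 10 to 1 and saved.
Only the words Internal Error were displayed, and the change was not applied. I tried several times but could not determine the cause.

Deleting the environment

By the time the cluster was created it was getting late. I did not want to pay for it overnight, so I deleted everything for now.

Deleting the cluster

The following command deleted the cluster successfully.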
$ aws sagemaker delete-cluster --cluster-name ml-cluster
Note that this deletes only the cluster. The S3 bucket that holds the lifecycle scripts uploaded by the setup script is not deleted.

Deleting the CloudFormation stack

Deleting the stack created from the provided CloudFormation template completed successfully.
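Summary

The main points to watch when setting up a HyperPod environment are as follows.
Check service quotas
  Instance types that start with ml. are governed by SageMaker quotas
  Default limits are low, so request an increase where needed

Workshop costs
  FSx for Lustre and NAT Gateway cannot be stopped
  Consider a short burst of use or rebuilding each time

Notes on using the setup script
  Watch the worker group instance count
  Instance Count is the number of instances that run at all times
  At the time of writing, changing the instance count from the management console failed with Internal Error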


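Closing thoughts

When learning a new service, I find there is a lot to gain from actually getting hands-on. The trade-off is that my progress through the workshop has been slow. I will keep going with the aim of finishing it.

References

SageMaker HyperPod reference - Amazon SageMaker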