Slurm 25.05 with Automatic Job Requeueing Support Is Now Available on AWS PCS

2025.10.23

Introduction

AWS Parallel Computing Service (PCS) has added support for Slurm 25.05. This post walks through the new features and changes in Slurm 25.05, verified on an AWS PCS cluster built for that purpose.

https://aws.amazon.com/jp/about-aws/whats-new/2025/10/aws-parallel-computing-service-supports-pcs-slurm-v25-05/

Key New Features in Slurm 25.05

The three main Slurm 25.05 features now available on AWS PCS are:

  1. Support for multi-cluster sackd configuration
  2. requeue_on_resume_failure enabled by default
  3. Changes to LogTimeFormat

Let's look at each in detail.

Support for Multi-Cluster sackd Configuration

Overview

A single login node can now connect to multiple AWS PCS clusters. This lets you reduce the number of login nodes, lowering both management and operating costs.

What changed

Slurm 25.05 allows multiple sackd (Slurm Auth and Cred Kiosk Daemon) instances to coexist on the same login node. As a result, a single login node can access multiple clusters.

To run multiple sackd daemons side by side, you use systemd's RuntimeDirectory option. By giving each cluster its own directory, such as slurm-<clustername>, a dedicated socket path is created per cluster, for example /run/slurm-cluster1/sack.socket and /run/slurm-cluster2/sack.socket, which makes connections to multiple clusters possible.

Reference: https://slurm.schedmd.com/sackd.html
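
To make the mechanism concrete, the sketch below creates a second sackd unit isolated with RuntimeDirectory. The unit name, the install path /opt/slurm, and the controller address are assumptions on my part, not the official AWS PCS procedure (that is the script linked in the Setup subsection below).

Sketch: second sackd instance isolated via RuntimeDirectory (hypothetical names and paths)
# Run as root on the login node.
cat <<'EOF' > /etc/systemd/system/sackd-cluster2.service
[Unit]
Description=sackd for cluster2

[Service]
# --conf-server points at the second cluster's slurmctld (placeholder address)
ExecStart=/opt/slurm/sbin/sackd --conf-server <cluster2-slurmctld>:6817
# systemd creates /run/slurm-cluster2 and exports RUNTIME_DIRECTORY to the service,
# so this instance's auth socket becomes /run/slurm-cluster2/sack.socket
RuntimeDirectory=slurm-cluster2

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now sackd-cluster2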

Specifically, the following changes were included (a client-side sketch follows the changelog link below):

  • Clients now look for /run/slurm-<cluster>/sack.socket first
  • The SLURM_CONF environment variable can point the configuration source and the auth socket to separate locations
  • sackd recognizes the RUNTIME_DIRECTORY environment variable, giving each cluster its own isolated runtime environment

Reference: https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md
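
From the client side, a session on the shared login node might select a cluster roughly as follows. The per-cluster config path is hypothetical; the AWS PCS script referenced in the Setup subsection arranges the actual layout.

Sketch: selecting a cluster from the shared login node (hypothetical paths)
# Point client commands at cluster2's configuration source
export SLURM_CONF=/etc/slurm-cluster2/slurm.conf
# With the changes above, commands then prefer that cluster's auth socket,
# /run/slurm-<cluster>/sack.socket
sinfo
squeue -u "$USER"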

Setup

I wanted to try this myself, but it turned out to be fairly involved, so I will verify it when I have more time. For the setup procedure, see the documentation below.

https://docs.aws.amazon.com/pcs/latest/userguide/multi-cluster-login-script.html

requeue_on_resume_failure Enabled by Default

Overview

Slurm's requeue_on_resume_failure SchedulerParameters option is now enabled by default. When an instance fails to launch, the job is automatically requeued.

Verifying on a real cluster

I ran scontrol show config on an AWS PCS cluster (Slurm 25.05) and confirmed that requeue_on_resume_failure is set in SchedulerParameters.

Excerpt from scontrol show config
SchedulerParameters     = requeue_on_resume_failure
ResumeTimeout           = 1800 sec
SLURM_VERSION           = 25.05.3

ResumeTimeout is the maximum time Slurm waits for a node to respond after ResumeProgram starts it. On AWS PCS it is set to 1800 seconds (30 minutes). If a node does not come up within that window, it is automatically marked DOWN and the job is requeued.

Reference: Slurm Workload Manager - slurm.conf
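
One way to observe the behavior from the login node might look like the sketch below. The partition name matches the compute-queue defined in the CloudFormation template later in this post, and <jobid> is a placeholder.

Sketch: checking the requeue behavior
# Confirm the relevant settings
scontrol show config | grep -E 'SchedulerParameters|ResumeTimeout|JobRequeue'

# Submit a job that triggers a node launch. If the EC2 instance does not come up
# within ResumeTimeout, the node is marked DOWN and the job should go back to
# PENDING instead of failing.
sbatch -p compute-queue --wrap 'hostname'
scontrol show job <jobid> | grep -E 'JobState|Restarts'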

About the timeout setting

At first glance, 1800 seconds (30 minutes) looks long; in my experience a compute node usually comes up in about 10 minutes. I suspect the generous value accounts for the following:

  • Time needed for compute node initialization
  • Launch delays caused by EC2 capacity shortages, with some margin built in

Changes to LogTimeFormat

What changed

The LogTimeFormat=format_stderr option has been removed. Log files and standard error (stderr) now always use the same timestamp format.

Formats currently available

The following formats are available in Slurm 25.05 (see the sketch after the reference below):

  • iso8601
  • iso8601_ms (default)
  • rfc5424
  • rfc5424_ms
  • rfc3339
  • clock
  • short
  • thread_id

Reference: Slurm Workload Manager - slurm.conf
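
For reference, on a self-managed Slurm installation the format is selected with a single slurm.conf line; AWS PCS manages slurm.conf for you, so the line below is purely illustrative.

slurm.conf (illustrative)
# Any of the formats listed above can be set here
LogTimeFormat=rfc5424_ms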

Verifying on a real cluster

The AWS PCS cluster (Slurm 25.05) uses the rfc5424_ms format.

Excerpt from scontrol show config
LogTimeFormat           = rfc5424_ms

Verification Environment

For the verification in this post, I built an AWS PCS cluster with the following configuration.

Item             Value
Region           Tokyo (ap-northeast-1)
Slurm version    25.05.3
Login node       t3.micro × 1
Compute nodes    m7i.large × up to 4
Shared storage   Amazon EFS

(Screenshot: Parallel Computing Service console, ap-northeast-1)

Here is the CloudFormation template used to build it.

CloudFormation template
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Simple AWS PCS cluster for verification - 1 login node (t3.micro) and up to 4 compute nodes (m7i.large) with EFS storage'

Parameters:
  ClusterName:
    Type: String
    Default: 'simple-pcs-cluster'
    Description: 'Name of the PCS cluster'
  AMIID:
    Type: AWS::EC2::Image::Id
    Default: 'ami-056a2dabd78ff3886'
  SlurmVersion:
    Type: String
    Default: '25.05'
    Description: 'Slurm version for the PCS cluster'
    AllowedValues:
      - '25.05'
      - '24.11'

Resources:
  # VPC
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: '10.0.0.0/16'
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-vpc'

  # Internet Gateway
  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-igw'

  InternetGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref VPC

  # Public Subnet
  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [0, !GetAZs '']
      CidrBlock: '10.0.1.0/24'
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-public-subnet-1'

  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [1, !GetAZs '']
      CidrBlock: '10.0.2.0/24'
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-public-subnet-2'

  PublicSubnet3:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [2, !GetAZs '']
      CidrBlock: '10.0.3.0/24'
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-public-subnet-3'

  # Private Subnet
  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [0, !GetAZs '']
      CidrBlock: '10.0.11.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-private-subnet-1'

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [1, !GetAZs '']
      CidrBlock: '10.0.12.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-private-subnet-2'

  PrivateSubnet3:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [2, !GetAZs '']
      CidrBlock: '10.0.13.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-private-subnet-3'

  # Isolated Subnet
  IsolatedSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [0, !GetAZs '']
      CidrBlock: '10.0.21.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-isolated-subnet-1'

  IsolatedSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [1, !GetAZs '']
      CidrBlock: '10.0.22.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-isolated-subnet-2'

  IsolatedSubnet3:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [2, !GetAZs '']
      CidrBlock: '10.0.23.0/24'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-isolated-subnet-3'

  # NAT Gateway
  NatGateway1EIP:
    Type: AWS::EC2::EIP
    DependsOn: InternetGatewayAttachment
    Properties:
      Domain: vpc

  NatGateway1:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatGateway1EIP.AllocationId
      SubnetId: !Ref PublicSubnet1

  # Route Tables
  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-public-routes'

  DefaultPublicRoute:
    Type: AWS::EC2::Route
    DependsOn: InternetGatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: '0.0.0.0/0'
      GatewayId: !Ref InternetGateway

  PublicSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet1

  PublicSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet2

  PublicSubnet3RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet3

  PrivateRouteTable1:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-private-routes-1'

  DefaultPrivateRoute1:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      DestinationCidrBlock: '0.0.0.0/0'
      NatGatewayId: !Ref NatGateway1

  PrivateSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet2

  PrivateSubnet3RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet3

  # Security Groups
  LoginSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub '${ClusterName}-login-sg'
      GroupDescription: 'Security group for PCS login nodes'
      VpcId: !Ref VPC
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: 'Allow all outbound traffic'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-login-sg'

  ComputeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub '${ClusterName}-compute-sg'
      GroupDescription: 'Security group for PCS compute nodes'
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 10.0.0.0/16
          Description: 'SSH access from VPC for PCS requirement'
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          Description: 'Allow all outbound traffic'
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-compute-sg'

  # Allow all traffic from login to compute
  ComputeSecurityGroupIngressFromLogin:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ComputeSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref LoginSecurityGroup
      Description: 'Allow all traffic from login nodes'

  # Allow all traffic from compute to login
  LoginSecurityGroupIngressFromCompute:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref LoginSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref ComputeSecurityGroup
      Description: 'Allow all traffic from compute nodes'

  # Self-referencing rules for login nodes
  LoginSecurityGroupIngressSelfRef:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref LoginSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref LoginSecurityGroup
      Description: 'Allow all traffic within login nodes'

  # Self-referencing rules for compute nodes
  ComputeSecurityGroupIngressSelfRef:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref ComputeSecurityGroup
      IpProtocol: -1
      SourceSecurityGroupId: !Ref ComputeSecurityGroup
      Description: 'Allow all traffic within compute nodes'

  EFSSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: !Sub '${ClusterName}-efs-sg'
      GroupDescription: 'Security group for EFS'
      VpcId: !Ref VPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 2049
          ToPort: 2049
          SourceSecurityGroupId: !Ref LoginSecurityGroup
        - IpProtocol: tcp
          FromPort: 2049
          ToPort: 2049
          SourceSecurityGroupId: !Ref ComputeSecurityGroup
      Tags:
        - Key: Name
          Value: !Sub '${ClusterName}-efs-sg'

  # EFS
  EFSFileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      ThroughputMode: bursting
      Encrypted: true
      FileSystemTags:
        - Key: Name
          Value: !Sub '${ClusterName}-efs'

  EFSMountTarget1:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref EFSFileSystem
      SubnetId: !Ref PrivateSubnet1
      SecurityGroups:
        - !Ref EFSSecurityGroup

  EFSMountTarget2:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref EFSFileSystem
      SubnetId: !Ref PrivateSubnet2
      SecurityGroups:
        - !Ref EFSSecurityGroup

  EFSMountTarget3:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId: !Ref EFSFileSystem
      SubnetId: !Ref PrivateSubnet3
      SecurityGroups:
        - !Ref EFSSecurityGroup

  # IAM Role for PCS Instance
  PCSInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub 'AWSPCS-${AWS::StackName}-${AWS::Region}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      Policies:
        - PolicyName: PCSComputeNodePolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - pcs:RegisterComputeNodeGroupInstance
                Resource: '*'

  PCSInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref PCSInstanceRole

  # Launch Template for Login Nodes
  LoginLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub '${ClusterName}-login-launch-template'
      LaunchTemplateData:
        ImageId: !Ref AMIID
        SecurityGroupIds:
          - !Ref LoginSecurityGroup
        IamInstanceProfile:
          Arn: !GetAtt PCSInstanceProfile.Arn
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/x-shellscript; charset="us-ascii"

            #!/bin/bash
            yum update -y
            yum install -y amazon-efs-utils
            mkdir -p /shared
            echo "${EFSFileSystem}.efs.${AWS::Region}.amazonaws.com:/ /shared efs defaults,_netdev" >> /etc/fstab
            mount -a
            chown ec2-user:ec2-user /shared
            chmod 755 /shared

            # Add Slurm commands to PATH for all users
            echo 'export PATH=/opt/slurm/bin:$PATH' >> /etc/profile.d/slurm.sh
            chmod +x /etc/profile.d/slurm.sh

            # Add to ec2-user's bashrc for immediate availability
            echo 'export PATH=/opt/slurm/bin:$PATH' >> /home/ec2-user/.bashrc
            chown ec2-user:ec2-user /home/ec2-user/.bashrc

            --==BOUNDARY==--
        MetadataOptions:
          HttpTokens: required
          HttpPutResponseHopLimit: 2
          HttpEndpoint: enabled

  # Launch Template for Compute Nodes
  ComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub '${ClusterName}-compute-launch-template'
      LaunchTemplateData:
        ImageId: !Ref AMIID
        SecurityGroupIds:
          - !Ref ComputeSecurityGroup
        IamInstanceProfile:
          Arn: !GetAtt PCSInstanceProfile.Arn
        UserData:
          Fn::Base64: !Sub |
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/x-shellscript; charset="us-ascii"

            #!/bin/bash
            yum update -y
            yum install -y amazon-efs-utils
            mkdir -p /shared
            echo "${EFSFileSystem}.efs.${AWS::Region}.amazonaws.com:/ /shared efs defaults,_netdev" >> /etc/fstab
            mount -a
            chown ec2-user:ec2-user /shared
            chmod 755 /shared

            # Add Slurm commands to PATH for all users
            echo 'export PATH=/opt/slurm/bin:$PATH' >> /etc/profile.d/slurm.sh
            chmod +x /etc/profile.d/slurm.sh

            # Add to ec2-user's bashrc for immediate availability
            echo 'export PATH=/opt/slurm/bin:$PATH' >> /home/ec2-user/.bashrc
            chown ec2-user:ec2-user /home/ec2-user/.bashrc

            --==BOUNDARY==--
        MetadataOptions:
          HttpTokens: required
          HttpPutResponseHopLimit: 2
          HttpEndpoint: enabled

  # PCS Cluster
  PCSCluster:
    Type: AWS::PCS::Cluster
    Properties:
      Name: !Ref ClusterName
      Size: SMALL
      Scheduler:
        Type: SLURM
        Version: !Ref SlurmVersion
      Networking:
        SubnetIds:
          - !Ref PrivateSubnet1
        SecurityGroupIds:
          - !Ref LoginSecurityGroup
          - !Ref ComputeSecurityGroup

  # Login Node Group
  LoginNodeGroup:
    Type: AWS::PCS::ComputeNodeGroup
    Properties:
      ClusterId: !GetAtt PCSCluster.Id
      Name: login
      ScalingConfiguration:
        MinInstanceCount: 1
        MaxInstanceCount: 1
      IamInstanceProfileArn: !GetAtt PCSInstanceProfile.Arn
      CustomLaunchTemplate:
        TemplateId: !Ref LoginLaunchTemplate
        Version: 1
      SubnetIds:
        - !Ref PublicSubnet1
      AmiId: !Ref AMIID
      InstanceConfigs:
        - InstanceType: 't3.micro'

  # Compute Node Group
  ComputeNodeGroup:
    Type: AWS::PCS::ComputeNodeGroup
    Properties:
      ClusterId: !GetAtt PCSCluster.Id
      Name: compute
      ScalingConfiguration:
        MinInstanceCount: 0
        MaxInstanceCount: 4
      IamInstanceProfileArn: !GetAtt PCSInstanceProfile.Arn
      CustomLaunchTemplate:
        TemplateId: !Ref ComputeLaunchTemplate
        Version: 1
      SubnetIds:
        - !Ref PrivateSubnet1
      AmiId: !Ref AMIID
      InstanceConfigs:
        - InstanceType: 'm7i.large'

  # Queue
  ComputeQueue:
    Type: AWS::PCS::Queue
    Properties:
      ClusterId: !GetAtt PCSCluster.Id
      Name: compute-queue
      ComputeNodeGroupConfigurations:
        - ComputeNodeGroupId: !GetAtt ComputeNodeGroup.Id

Outputs:
  ClusterId:
    Description: 'The ID of the PCS cluster'
    Value: !GetAtt PCSCluster.Id
    Export:
      Name: !Sub '${AWS::StackName}-ClusterId'

  VPCId:
    Description: 'The ID of the VPC'
    Value: !Ref VPC
    Export:
      Name: !Sub '${AWS::StackName}-VPCId'

  EFSFileSystemId:
    Description: 'The ID of the EFS file system'
    Value: !Ref EFSFileSystem
    Export:
      Name: !Sub '${AWS::StackName}-EFSFileSystemId'

  PCSConsoleUrl:
    Description: 'URL to access the cluster in the PCS console'
    Value: !Sub 'https://${AWS::Region}.console.aws.amazon.com/pcs/home?region=${AWS::Region}#/clusters/${PCSCluster.Id}'
    Export:
      Name: !Sub '${AWS::StackName}-PCSConsoleUrl'

  EC2ConsoleUrl:
    Description: 'URL to access login node instances via Session Manager'
    Value: !Sub 'https://${AWS::Region}.console.aws.amazon.com/ec2/home?region=${AWS::Region}#Instances:instanceState=running;tag:aws:pcs:compute-node-group-id=${LoginNodeGroup.Id}'
    Export:
      Name: !Sub '${AWS::StackName}-EC2ConsoleUrl'

  PCSInstanceRoleArn:
    Description: 'ARN of the PCS Instance Role with vended log delivery permissions'
    Value: !GetAtt PCSInstanceRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-PCSInstanceRoleArn'
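
For reference, deploying the template with the AWS CLI might look like the following. The file name is arbitrary, and CAPABILITY_NAMED_IAM is required because the template creates an IAM role with an explicit RoleName.

Deployment example
aws cloudformation deploy \
  --region ap-northeast-1 \
  --stack-name simple-pcs-cluster \
  --template-file simple-pcs-cluster.yaml \
  --capabilities CAPABILITY_NAMED_IAM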

Summary

AWS Parallel Computing Service now supports Slurm 25.05.

This post covered the following three key new features:

  1. Support for multi-cluster sackd configuration
    • A single login node can now access multiple clusters
  2. requeue_on_resume_failure enabled by default
    • Automatic retries on instance launch failure improve cluster reliability
  3. Changes to LogTimeFormat
    • More consistent log output

In particular, enabling requeue_on_resume_failure by default is a useful update for dynamically scaling environments like AWS, where it helps absorb capacity shortages.

Closing Thoughts

This ended up being less about the AWS PCS update itself and more about digging into what changed in Slurm 25.05, which took some effort to research, but it was a good learning experience. Once AWS ParallelCluster moves to Slurm 25.x, I expect similar updates to show up there as well.

Appendix

Here is the full Slurm configuration captured on the AWS PCS cluster (Slurm 25.05.3).

Full output of scontrol show config
Configuration data as of 2025-10-23T06:22:02
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageExternalHost = (null)
AccountingStorageParameters = (null)
AccountingStoragePort   = 0
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = (null)
AccountingStorageUser   = root
AccountingStoreFlags    = (null)
AcctGatherEnergyType    = (null)
AcctGatherFilesystemType = (null)
AcctGatherInterconnectType = (null)
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = (null)
AllowSpecResourcesUsage = no
AuthAltTypes            = (null)
AuthAltParameters       = (null)
AuthInfo                = use_client_ids,cred_expire=70
AuthType                = auth/slurm
BatchStartTimeout       = 10 sec
BcastExclude            = /lib,/usr/lib,/lib64,/usr/lib64
BcastParameters         = (null)
BOOT_TIME               = 2025-10-23T02:53:27
BurstBufferType         = (null)
CertgenParameters       = (null)
CertgenType             = (null)
CertmgrParameters       = (null)
CertmgrType             = (null)
CliFilterPlugins        = (null)
ClusterName             = simple-pcs-cluster
CommunicationParameters = NoCtldInAddrAny
CompleteWait            = 0 sec
CpuFreqDef              = Unknown
CpuFreqGovernors        = OnDemand,Performance,UserSpace
CredType                = auth/slurm
DataParserParameters    = (null)
DebugFlags              = AuditRPCs,GLOB_SILENCE,Power
DefMemPerNode           = UNLIMITED
DependencyParameters    = (null)
DisableRootJobs         = no
EioTimeout              = 60
EnforcePartLimits       = ALL
EpilogMsgTime           = 2000 usec
FairShareDampeningFactor = 1
FederationParameters    = (null)
FirstJobId              = 1
GresTypes               = gpu
GpuFreqDef              = (null)
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HashPlugin              = hash/sha3
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
InteractiveStepOptions  = --interactive --preserve-env --pty $SHELL
JobAcctGatherFrequency  = 30
JobAcctGatherType       = (null)
JobAcctGatherParams     = (null)
JobCompHost             = localhost
JobCompLoc              = /var/log/container/controller/daemon/slurm-25.05-jobcomp.log
JobCompParams           = (null)
JobCompPort             = 0
JobCompType             = jobcomp/pcs
JobCompUser             = root
JobContainerType        = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
Licenses                = (null)
LogTimeFormat           = rfc5424_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 257
MaxBatchRequeue         = 5
MaxDBDMsgs              = 0
MaxJobCount             = 257
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxNodeCount            = 5
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = (null)
MCSParameters           = (null)
MessageTimeout          = 210 sec
MinJobAge               = 300 sec
MpiDefault              = pmix
MpiParams               = (null)
NEXT_JOB_ID             = 1
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /opt/slurm/lib/slurm
PlugStackConfig         = (null)
PreemptMode             = OFF
PreemptParameters       = (null)
PreemptType             = (null)
PreemptExemptTime       = 00:00:00
PrEpParameters          = (null)
PrEpPlugins             = prep/script
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = no
PriorityFlags           =
PriorityMaxAge          = 7-00:00:00
PriorityType            = priority/multifactor
PriorityUsageResetPeriod = NONE
PriorityWeightAge       = 0
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
PrologEpilogTimeout     = 65534
PrologFlags             = Alloc,Contain
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /sbin/reboot
ReconfigFlags           = KeepPartState
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = /etc/slurm/scripts/resume_program.sh
ResumeRate              = 0 nodes/min
ResumeTimeout           = 1800 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
SchedulerParameters     = requeue_on_resume_failure
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
ScronParameters         = (null)
SelectType              = select/cons_tres
SelectTypeParameters    = CR_CPU
SlurmUser               = slurm(401)
SlurmctldAddr           = (null)
SlurmctldDebug          = verbose
SlurmctldHost[0]        = slurmctld-primary(10.0.11.198)
SlurmctldLogFile        = /dev/null
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = (null)
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 300 sec
SlurmctldParameters     = idle_on_node_suspend,power_save_min_interval=0,node_reg_mem_percent=75,enable_configless,disable_triggers,max_powered_nodes=32
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurm/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = (null)
SlurmdTimeout           = 180 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurm/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 25.05.3
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /mnt/efs/slurm/statesave
SuspendExcNodes         = login-[1-1],
SuspendExcParts         = (null)
SuspendExcStates        = DYNAMIC_FUTURE,DYNAMIC_NORM
SuspendProgram          = /etc/slurm/scripts/suspend_program.sh
SuspendRate             = 0 nodes/min
SuspendTime             = 600 sec
SuspendTimeout          = 60 sec
SwitchParameters        = (null)
SwitchType              = (null)
TaskEpilog              = (null)
TaskPlugin              = task/cgroup,task/affinity
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TLSParameters           = (null)
TLSType                 = tls/none
TmpFS                   = /tmp
TopologyParam           = RoutePart
TopologyPlugin          = topology/flat
TrackWCKey              = no
TreeWidth               = 16
UsePam                  = no
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 1050 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupMountpoint        = /sys/fs/cgroup
CgroupPlugin            = autodetect
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
EnableControllers       = no
EnableExtraControllers  = (null)
IgnoreSystemd           = no
IgnoreSystemdOnFailure  = no
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinRAMSpace             = 30MB
SystemdTimeout          = 1000 ms

MPI Plugins Configuration:
PMIxCliTmpDirBase       = (null)
PMIxCollFence           = (null)
PMIxDebug               = 0
PMIxDirectConn          = yes
PMIxDirectConnEarly     = no
PMIxDirectConnUCX       = no
PMIxDirectSameArch      = no
PMIxEnv                 = (null)
PMIxFenceBarrier        = no
PMIxNetDevicesUCX       = (null)
PMIxTimeout             = 300
PMIxTlsUCX              = (null)

Slurmctld(primary) at slurmctld-primary is UP
