ECS task won't start? Use a runbook to pinpoint the cause quickly!

2025.04.28

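Hello, this is Shiina from the Technical Support team.

Introduction

Have you ever had an ECS task stop unexpectedly or fail to start?
Task failures have many possible causes, such as misconfigured IAM permissions, flaws in the task definition, or network problems, and identifying the right one can take time.
This article shows how to troubleshoot efficiently with the AWS-provided Systems Manager Automation runbook "AWSSupport-TroubleshootECSTaskFailedToStart".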

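Runbook overview

"AWSSupport-TroubleshootECSTaskFailedToStart" is a runbook that analyzes problems preventing an ECS task from starting.

Processing overview

The runbook performs the following steps.

  • Checks whether the executing user or role has the required IAM permissions.
  • If network connectivity needs to be verified, creates a Lambda function in the VPC to simulate connectivity.
  • Analyzes the root cause of the failure based on information from the Lambda function and the ECS task.
  • If network connectivity was verified, deletes the Lambda function and related resources it created.
  • Outputs the analysis results.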

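Input parameters

Parameter | Required | Description | Notes
AutomationAssumeRole | No | IAM role that Systems Manager Automation uses to perform actions on your behalf | If omitted, the permissions of the executing user are used
ClusterName | Yes | Cluster name |
TaskId | Yes | Task ID |
CloudwatchRetentionPeriod | No | Retention period for the Lambda function logs stored in CloudWatch Logs |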

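IAM role setup (optional)

You can optionally specify an IAM role so that Systems Manager Automation can perform actions on your behalf.
For this walkthrough, I created a dedicated role with the minimum required permissions and specified it in the AutomationAssumeRole parameter.

  • Role name
    Any name (example: SSM-TroubleshootECSTaskFailedToStart-Role)

  • Inline policy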

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudtrail:LookupEvents",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceAttribute",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeNetworkAcls",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeVpcAttribute",
        "ecr:DescribeImages",
        "ecr:GetRepositoryPolicy",
        "ecs:DescribeContainerInstances",
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeTasks",
        "iam:AttachRolePolicy",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DetachRolePolicy",
        "iam:GetInstanceProfile",
        "iam:GetRole",
        "iam:ListRoles",
        "iam:ListUsers",
        "iam:PassRole",
        "iam:SimulateCustomPolicy",
        "iam:SimulatePrincipalPolicy",
        "kms:DescribeKey",
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:InvokeFunction",
        "lambda:TagResource",
        "logs:DescribeLogGroups",
        "logs:PutRetentionPolicy",
        "secretsmanager:DescribeSecret",
        "ssm:DescribeParameters",
        "sts:GetCallerIdentity",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeMountTargets",
        "elasticfilesystem:DescribeMountTargetSecurityGroups",
        "elasticfilesystem:DescribeFileSystemPolicy"
      ],
      "Resource": "*"
    }
  ]
}

  • Trust relationship

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "ssm.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
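
The role setup above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and that the inline policy shown earlier has been saved to a local JSON file. The file path argument and the inline policy name are hypothetical and do not come from the article.

```python
import json

ROLE_NAME = "SSM-TroubleshootECSTaskFailedToStart-Role"  # example name used in this article


def build_trust_policy() -> dict:
    """Return a trust policy that lets Systems Manager assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_runbook_role(inline_policy_path: str, role_name: str = ROLE_NAME) -> str:
    """Create the role, attach the inline policy, and return the role ARN."""
    import boto3  # imported lazily so build_trust_policy() works without boto3

    # Load the inline policy JSON from a local file (hypothetical path).
    with open(inline_policy_path, encoding="utf-8") as f:
        inline_policy = json.load(f)

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="TroubleshootECSTaskFailedToStart",  # hypothetical policy name
        PolicyDocument=json.dumps(inline_policy),
    )
    return role["Role"]["Arn"]
```

Calling `create_runbook_role("inline-policy.json")` would return the ARN of the new role, which can then be passed as AutomationAssumeRole.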

Running the runbook

Open the document from the following URL.
https://ap-northeast-1.console.aws.amazon.com/systems-manager/automation/execute/AWSSupport-TroubleshootECSTaskFailedToStart?region=ap-northeast-1

Specify the following parameters and click "Execute".

  • AutomationAssumeRole: SSM-TroubleshootECSTaskFailedToStart-Role
  • ClusterName: cluster name
  • TaskId: task ID
  • CloudwatchRetentionPeriod: 30
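
The same execution can be started without the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. It uses the SSM `start_automation_execution` API, which takes each parameter value as a list of strings. The helper names are hypothetical, and the role is assumed to be passed as a full ARN (for example `arn:aws:iam::<account-id>:role/SSM-TroubleshootECSTaskFailedToStart-Role`).

```python
DOCUMENT_NAME = "AWSSupport-TroubleshootECSTaskFailedToStart"


def build_parameters(cluster_name, task_id, assume_role_arn=None, retention_days=None):
    """Build the Parameters map for the runbook; every value is a list of strings."""
    params = {"ClusterName": [cluster_name], "TaskId": [task_id]}
    if assume_role_arn:
        # Optional: when omitted, the caller's own permissions are used.
        params["AutomationAssumeRole"] = [assume_role_arn]
    if retention_days is not None:
        params["CloudwatchRetentionPeriod"] = [str(retention_days)]
    return params


def start_runbook(parameters, region="ap-northeast-1"):
    """Start the runbook and return the automation execution ID."""
    import boto3  # imported lazily so build_parameters() works without boto3

    ssm = boto3.client("ssm", region_name=region)
    response = ssm.start_automation_execution(
        DocumentName=DOCUMENT_NAME,
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

For example, `start_runbook(build_parameters("my-cluster", "my-task-id", retention_days=30))` would start an execution with the caller's permissions; the cluster name and task ID here are placeholders.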

Identifying the cause of a Secrets Manager problem

The error message suggests that the task failed to start because an AccessDeniedException occurred while retrieving the secret value.
Let's run the runbook to check.

Execution result

Let's look at the automation execution results.

Output
EniDeletionMessage.Status
No output available yet because the step is not successfully executed
ExecutionResults.ExecutionLogs

Runbook Execution logs
######################

++++++++++++++++++++++++++++++
STEP: PreflightPermissionChecks
++++++++++++++++++++++++++++++
INFO: The IAM Identity arn:aws:iam::XXXXXXXXXXXX:role/SSM-TroubleshootECSTaskFailedToStart-Role used to execute the Runbook has all required permission

++++++++++++++++++++++++++
STEP: NetworkToolDeployment
++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++
STEP: CoreFailureReasonEvaluation
+++++++++++++++++++++++++++++++++++
GenericChecks:Checking Task networking and public IP assignment

GenericChecks:Checking Task networking and public IP assignment

GenericChecks:Checking if Image Pull Rate Limit occurred

RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of ECR domains

RegistryAnalysis:Checking required VPC endpoints for ECR

RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of external registry domains

LogAnalysis:Checking log configuration permission and group existence

SecretAnalysis:Checking SecretsManager credential existence and KMS settings

+++++++++++++++++++++++
STEP: DeletionLifecycle
+++++++++++++++++++++++

WARNING: ENI Deletion: Skipped, Network tool AWS Lambda function [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution id not found
WARNING: Lambda Function Deletion: Network tool AWS Lambda function [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution not found, No further action required
INFO: The log group [/aws/lambda/NetworkToolSSMRunbookExecution93fcd121a150] could not be found, skipping to change retention policy as resource doesnt exist

WARNING: IAM Role Deletion: Network tool AWS Lambda execution role [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution not found, No further action required

ExecutionResults.TaskFailureReason

Task Failure Reason Analysis:
#############################

SECRETS REFERENCE CHECKS
========================
The Secrets Manager secret arn:aws:secretsmanager:ap-northeast-1:XXXXXXXXXXXX:secret:XXXXX that you have referenced could not be found

The Secrets Manager secret arn:aws:secretsmanager:ap-northeast-1:XXXXXXXXXXXX:secret:XXXXX that you have referenced could not be found
This shows that the error is caused not by the task execution role's policy but by the fact that the secret referenced in the task definition does not exist.
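
The analysis text can also be retrieved programmatically instead of being read in the console. The following is a minimal sketch, assuming boto3 is installed and using the SSM `get_automation_execution` API. It assumes the output key matches the name shown in the console (`ExecutionResults.TaskFailureReason`), and the set of terminal statuses is simplified. The function names are hypothetical.

```python
import time

# Simplified set of statuses treated as "finished" in this sketch.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def extract_output(execution: dict, key: str) -> str:
    """Join the string list stored under one output key; empty string if absent."""
    outputs = execution.get("Outputs", {})
    return "\n".join(outputs.get(key, []))


def wait_for_execution(execution_id, region="ap-northeast-1", poll_seconds=30):
    """Poll until the automation execution finishes and return its description."""
    import boto3  # imported lazily so extract_output() works without boto3

    ssm = boto3.client("ssm", region_name=region)
    while True:
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATUSES:
            return execution
        time.sleep(poll_seconds)
```

With an execution ID in hand, `extract_output(wait_for_execution(execution_id), "ExecutionResults.TaskFailureReason")` would return the failure analysis as a single string.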

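Identifying the cause of a network connectivity problem

The error message suggests that the task failed to start because a network connection error (timeout) occurred while pulling the image from ECR (Elastic Container Registry).
Let's run the runbook to check.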

Execution result

Let's look at the automation execution results.

Output
EniDeletionMessage.Status
VPC Lambda ENI Deletion Status:
#############################

INFO: IAM Role Deletion: Successfully deleted the role [NetworkToolSSMRunbookExecutioneadb54f07588] which got created as part of this execution
The VPC Lambda ENI ['eni-XXXXXXXXXXXX'] not found, the lambda service has deleted it automatically.
No further actions required.

ExecutionResults.ExecutionLogs
Runbook Execution logs
######################

++++++++++++++++++++++++++++++
STEP: PreflightPermissionChecks
++++++++++++++++++++++++++++++
INFO: The IAM Identity arn:aws:iam::XXXXXXXXXXXX:role/SSM-TroubleshootECSTaskFailedToStart-Role used to execute the Runbook has all required permission

++++++++++++++++++++++++++
STEP: NetworkToolDeployment
++++++++++++++++++++++++++
WARNING: An Lambda function execution role named NetworkToolSSMRunbookExecutioneadb54f07588 has been successfully created.
The Role will be deleted as part of the DeletionLifecycle execution or at Lambda_ENI_Deletion_Handler however watch for any exception or information message under runbook Execution logs of global output section

WARNING: An VPC Lambda Function arn:aws:lambda:ap-northeast-1:XXXXXXXXXXXX:function:NetworkToolSSMRunbookExecutioneadb54f07588 has been successfully deployed to test VPC Networking.
The Lambda function will be deleted as part of the DeletionLifecycle execution step however watch for any exception or information message under runbook Execution logs of global output section

+++++++++++++++++++++++++++++++++++
STEP: CoreFailureReasonEvaluation
+++++++++++++++++++++++++++++++++++
GenericChecks:Checking Task networking and public IP assignment

GenericChecks:Checking Task networking and public IP assignment

GenericChecks:Checking if Image Pull Rate Limit occurred

RegistryAnalysis:describe_image tag: 'latest' for existence check

RegistryAnalysis:Checking ECS/Fargate Agent role permission to perform ['ecr:GetAuthorizationToken']

RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of ECR domains

RegistryAnalysis:Checking required VPC endpoints for ECR

RegistryAnalysis:Checking VPCe analysis for s3

RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of external registry domains

LogAnalysis:Checking network results for CW endpoints

LogAnalysis:Checking egress security group rules for DNS resolved IP(s) of CW endpoint

LogAnalysis:Checking log configuration permission and group existence

+++++++++++++++++++++++
STEP: DeletionLifecycle
+++++++++++++++++++++++

INFO: ENI Deletion: Found HyperPlane ENI ['eni-XXXXXXXXXXXX'] which got created in this execution, The ENI will be attempted for deletion as part of step [Lambda_ENI_Deletion_Handler]
INFO: Lambda Function Deletion: Successfully deleted the AWS Lambda function [NetworkToolSSMRunbookExecutioneadb54f07588]  which got created as part of this execution
INFO: Successfully modified Cloudwatch log group retention period for group /aws/lambda/NetworkToolSSMRunbookExecutioneadb54f07588 to the given value 30

ExecutionResults.TaskFailureReason

Task Failure Reason Analysis:
#############################

REGISTRY CHECKS
===============
The registry domain [XXXXXXXXXXXX.dkr.ecr.ap-northeast-1.amazonaws.com] encountered socket connection failure for DNS Resolved IP(s) ['52.192.36.247', '52.192.36.247', '52.192.36.247'] with same network settings used by ECS / Fargate Agent. For more information check the following Knowledge Center articles:
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/ 
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-error/ 

You have enabled ECR related private endpoints ['s3'], However missing endpoint/s {'ecr.dkr', 'ecr.api'} 
See https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html 

The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources

The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources

Log Configuration Checks
========================
The registry domain [logs.ap-northeast-1.amazonaws.com] encountered socket connection failure for DNS Resolved IP(s) ['18.181.204.238', '18.181.204.238', '18.181.204.238', '57.181.184.225', '57.181.184.225', '57.181.184.225', '18.181.204.198', '18.181.204.198', '18.181.204.198', '18.181.204.204', '18.181.204.204', '18.181.204.204', '18.181.204.203', '18.181.204.203', '18.181.204.203', '18.181.204.209', '18.181.204.209', '18.181.204.209', '18.181.204.225', '18.181.204.225', '18.181.204.225', '57.181.184.228', '57.181.184.228', '57.181.184.228'] with same network settings used by ECS / Fargate Agent. For more information check the following Knowledge Center articles:- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/ 
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-error/ 

The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources

The output shows that a VPC endpoint for S3 exists, but the com.amazonaws.ap-northeast-1.ecr.api and com.amazonaws.ap-northeast-1.ecr.dkr endpoints have not been created.
The analysis also found that the route table attached to the subnet has no route to an internet gateway (IGW) or NAT gateway (NATGW), so ECR cannot be reached.
In addition, the connection to logs.ap-northeast-1.amazonaws.com failed, which means CloudWatch Logs is unreachable as well.
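
If the fix is to add VPC endpoints rather than an IGW or NAT gateway route, the gap can be checked with a short script. The following is a minimal sketch, assuming boto3 is installed. The ECR-related services (`ecr.api`, `ecr.dkr`, `s3`) are taken from the runbook output above, and adding `logs` is an assumption based on the CloudWatch Logs connection failure. The function names are hypothetical, and pagination of `describe_vpc_endpoints` is ignored for brevity.

```python
def required_endpoint_services(region: str, uses_awslogs: bool = True) -> set:
    """Endpoint service names a task needs in a subnet without IGW/NAT access."""
    services = {
        f"com.amazonaws.{region}.ecr.api",
        f"com.amazonaws.{region}.ecr.dkr",
        f"com.amazonaws.{region}.s3",
    }
    if uses_awslogs:
        # Assumption: the task sends logs to CloudWatch Logs.
        services.add(f"com.amazonaws.{region}.logs")
    return services


def missing_endpoint_services(region, existing_service_names, uses_awslogs=True):
    """Return the required service names that have no endpoint yet, sorted."""
    required = required_endpoint_services(region, uses_awslogs)
    return sorted(required - set(existing_service_names))


def list_endpoint_services(vpc_id, region):
    """List the service names of the VPC endpoints that exist in the given VPC."""
    import boto3  # imported lazily so the pure helpers work without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return [endpoint["ServiceName"] for endpoint in response["VpcEndpoints"]]
```

For the VPC in this example, where only the S3 endpoint exists, `missing_endpoint_services` would report the `ecr.api`, `ecr.dkr`, and `logs` services.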

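About the network connectivity verification

When network connectivity verification is deemed necessary, a Lambda function (NetworkToolSSMRunbookExecutionxxxxxxxxxx) is deployed.
After the verification finishes, the Lambda function is deleted, but because the ENI also has to be deleted, the process includes a 30-minute sleep, so the runbook takes longer to run.
Note that the CloudWatch Logs log group is not deleted automatically, so you need to delete it manually.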
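
The manual log group cleanup can be scripted as well. The following is a minimal sketch, assuming boto3 is installed and that the leftover log groups follow the `/aws/lambda/NetworkToolSSMRunbookExecution...` naming seen in the execution logs above. The function names are hypothetical, and the default is a dry run that only lists the targets.

```python
LOG_GROUP_PREFIX = "/aws/lambda/NetworkToolSSMRunbookExecution"


def select_runbook_log_groups(log_group_names):
    """Keep only the log groups created for the runbook's network tool Lambda."""
    return [name for name in log_group_names if name.startswith(LOG_GROUP_PREFIX)]


def delete_runbook_log_groups(region="ap-northeast-1", dry_run=True):
    """List the leftover log groups, and delete them when dry_run is False."""
    import boto3  # imported lazily so select_runbook_log_groups() works without boto3

    logs = boto3.client("logs", region_name=region)
    paginator = logs.get_paginator("describe_log_groups")
    names = []
    for page in paginator.paginate(logGroupNamePrefix=LOG_GROUP_PREFIX):
        names.extend(group["logGroupName"] for group in page["logGroups"])

    targets = select_runbook_log_groups(names)
    if not dry_run:
        for name in targets:
            logs.delete_log_group(logGroupName=name)
    return targets
```

Running `delete_runbook_log_groups()` first and reviewing the returned names before calling it again with `dry_run=False` keeps the deletion deliberate.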

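Summary

When an ECS task stops or fails to start, the AWS-provided Systems Manager Automation runbook "AWSSupport-TroubleshootECSTaskFailedToStart" can make identifying the cause more efficient.
It automatically analyzes a wide range of factors, including IAM permissions, network settings, and Secrets Manager problems.
Give it a try to shorten your troubleshooting time.

I hope this article is helpful to someone.

Reference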

https://docs.aws.amazon.com/ja_jp/systems-manager-automation-runbooks/latest/userguide/automation-aws-troubleshootecstaskfailedtostart.html

© Classmethod, Inc. All rights reserved.