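ECS task won't start? Pinpoint the cause quickly with a runbook!
Hello. This is Shiina from the technical support team.
Introduction
Have you ever had an ECS task stop unexpectedly or fail to start?
Task failures have many possible causes, including misconfigured IAM permissions, flaws in the task definition, and network problems, so identifying the cause can take time.
This post shows how to troubleshoot efficiently with the AWS-provided Systems Manager Automation runbook "AWSSupport-TroubleshootECSTaskFailedToStart".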
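Runbook overview
"AWSSupport-TroubleshootECSTaskFailedToStart" is a runbook that analyzes problems preventing an ECS task from starting.
Processing overview
The runbook performs the following steps.
- Checks whether the executing user or role has the required IAM permissions.
- If network connectivity needs to be verified, creates a Lambda function inside the VPC to simulate connectivity.
- Analyzes the root cause of the failure based on information from the Lambda function and the ECS task.
- If network connectivity was verified, deletes the Lambda function and related resources that it created.
- Outputs the analysis results.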
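Input parameters
Parameter | Required | Description | Notes |
---|---|---|---|
AutomationAssumeRole | | IAM role that Systems Manager Automation uses to perform actions on your behalf | If omitted, the permissions of the user running the runbook are used |
ClusterName | Yes | Cluster name | |
TaskId | Yes | Task ID | |
CloudwatchRetentionPeriod | | Retention period for the Lambda function logs stored in CloudWatch Logs | |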
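
If you start the runbook from a script instead of the console, the parameters in the table above have to be passed as a mapping from parameter name to a list of strings. The following is a minimal sketch of that mapping. The helper name `build_runbook_parameters` is hypothetical, and I am assuming that the API expects the role ARN for AutomationAssumeRole.

```python
from typing import Dict, List, Optional


def build_runbook_parameters(
    cluster_name: str,
    task_id: str,
    automation_assume_role: Optional[str] = None,
    cloudwatch_retention_period: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Build the Parameters mapping (hypothetical helper).

    Systems Manager Automation takes every parameter value as a list of strings.
    """
    # ClusterName and TaskId are the two required parameters.
    params: Dict[str, List[str]] = {
        "ClusterName": [cluster_name],
        "TaskId": [task_id],
    }
    # Optional parameters are only sent when a value was given.
    if automation_assume_role:
        # Assumption: the API expects the role ARN here.
        params["AutomationAssumeRole"] = [automation_assume_role]
    if cloudwatch_retention_period is not None:
        params["CloudwatchRetentionPeriod"] = [str(cloudwatch_retention_period)]
    return params
```

Leaving out the optional arguments yields a mapping with only ClusterName and TaskId, in which case the runbook runs with the permissions of the caller.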
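IAM role setup (optional)
This runbook optionally accepts an IAM role so that Systems Manager Automation can perform actions on your behalf.
For this post, I created a dedicated role with minimal permissions and specified it in the AutomationAssumeRole parameter.
- Role name: any name (example: SSM-TroubleshootECSTaskFailedToStart-Role)
- Inline policy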
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudtrail:LookupEvents",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceAttribute",
        "ec2:DescribeIamInstanceProfileAssociations",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeNetworkAcls",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeRouteTables",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcEndpoints",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeVpcAttribute",
        "ecr:DescribeImages",
        "ecr:GetRepositoryPolicy",
        "ecs:DescribeContainerInstances",
        "ecs:DescribeServices",
        "ecs:DescribeTaskDefinition",
        "ecs:DescribeTasks",
        "iam:AttachRolePolicy",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:DetachRolePolicy",
        "iam:GetInstanceProfile",
        "iam:GetRole",
        "iam:ListRoles",
        "iam:ListUsers",
        "iam:PassRole",
        "iam:SimulateCustomPolicy",
        "iam:SimulatePrincipalPolicy",
        "kms:DescribeKey",
        "lambda:CreateFunction",
        "lambda:DeleteFunction",
        "lambda:GetFunctionConfiguration",
        "lambda:InvokeFunction",
        "lambda:TagResource",
        "logs:DescribeLogGroups",
        "logs:PutRetentionPolicy",
        "secretsmanager:DescribeSecret",
        "ssm:DescribeParameters",
        "sts:GetCallerIdentity",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeMountTargets",
        "elasticfilesystem:DescribeMountTargetSecurityGroups",
        "elasticfilesystem:DescribeFileSystemPolicy"
      ],
      "Resource": "*"
    }
  ]
}
- Trust relationship
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ssm.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
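
The role can also be created from a script. The following is a rough sketch that builds the trust relationship shown above and attaches an inline policy through the IAM `create_role` and `put_role_policy` calls. The function names and the default policy name are hypothetical, and the list of actions is passed in by the caller (for example, the actions from the inline policy above) rather than repeated here.

```python
import json
from typing import Any, Dict, List


def build_trust_policy() -> Dict[str, Any]:
    """Trust policy that lets Systems Manager assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ssm.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_inline_policy(actions: List[str]) -> Dict[str, Any]:
    """Single Allow statement covering the given actions on all resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }


def create_runbook_role(
    iam_client: Any,
    role_name: str,
    actions: List[str],
    policy_name: str = "TroubleshootECSTaskFailedToStart",  # hypothetical name
) -> None:
    """Create the role, then attach the inline policy to it."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(build_trust_policy()),
    )
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_inline_policy(actions)),
    )
```

With boto3 installed, this could be called as `create_runbook_role(boto3.client("iam"), "SSM-TroubleshootECSTaskFailedToStart-Role", actions)`.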
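Running the runbook
Open the AWSSupport-TroubleshootECSTaskFailedToStart document in the Systems Manager console.
Specify the following parameters and click "Execute".
- AutomationAssumeRole: SSM-TroubleshootECSTaskFailedToStart-Role
- ClusterName: the cluster name
- TaskId: the task ID
- CloudwatchRetentionPeriod: 30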
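
The same execution can be started outside the console. The following is a minimal sketch that starts the runbook through the Systems Manager `start_automation_execution` call and polls `get_automation_execution` until it finishes. The function name `run_runbook` is hypothetical, and the set of terminal statuses is my assumption covering the common cases, not an exhaustive list.

```python
import time
from typing import Any, Callable, Dict, List

RUNBOOK_NAME = "AWSSupport-TroubleshootECSTaskFailedToStart"

# Assumption: statuses after which the execution no longer changes.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def run_runbook(
    ssm_client: Any,
    parameters: Dict[str, List[str]],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], Any] = time.sleep,
) -> Dict[str, Any]:
    """Start the runbook and block until it reaches a terminal status."""
    started = ssm_client.start_automation_execution(
        DocumentName=RUNBOOK_NAME,
        Parameters=parameters,
    )
    execution_id = started["AutomationExecutionId"]
    while True:
        response = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )
        execution = response["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATUSES:
            return execution
        # Not finished yet: wait before asking again.
        sleep(poll_seconds)
```

With boto3 installed, `run_runbook(boto3.client("ssm"), {"ClusterName": ["my-cluster"], "TaskId": ["my-task-id"]})` would return the execution details, whose `Outputs` entry should hold the text shown in the following sections. Because the network verification path includes a long wait, a generous polling interval is reasonable.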
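Identifying the cause of a Secrets Manager problem
The error message suggests that the task failed to start because an AccessDeniedException occurred while retrieving the secret value.
Let's run the runbook and check.
Execution result
Let's look at the automation execution results.
Output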
EniDeletionMessage.Status
No output available yet because the step is not successfully executed
ExecutionResults.ExecutionLogs
Runbook Execution logs
######################
++++++++++++++++++++++++++++++
STEP: PreflightPermissionChecks
++++++++++++++++++++++++++++++
INFO: The IAM Identity arn:aws:iam::XXXXXXXXXXXX:role/SSM-TroubleshootECSTaskFailedToStart-Role used to execute the Runbook has all required permission
++++++++++++++++++++++++++
STEP: NetworkToolDeployment
++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++
STEP: CoreFailureReasonEvaluation
+++++++++++++++++++++++++++++++++++
GenericChecks:Checking Task networking and public IP assignment
GenericChecks:Checking Task networking and public IP assignment
GenericChecks:Checking if Image Pull Rate Limit occurred
RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of ECR domains
RegistryAnalysis:Checking required VPC endpoints for ECR
RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of external registry domains
LogAnalysis:Checking log configuration permission and group existence
SecretAnalysis:Checking SecretsManager credential existence and KMS settings
+++++++++++++++++++++++
STEP: DeletionLifecycle
+++++++++++++++++++++++
WARNING: ENI Deletion: Skipped, Network tool AWS Lambda function [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution id not found
WARNING: Lambda Function Deletion: Network tool AWS Lambda function [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution not found, No further action required
INFO: The log group [/aws/lambda/NetworkToolSSMRunbookExecution93fcd121a150] could not be found, skipping to change retention policy as resource doesnt exist
WARNING: IAM Role Deletion: Network tool AWS Lambda execution role [NetworkToolSSMRunbookExecution93fcd121a150] associated with this specific execution not found, No further action required
ExecutionResults.TaskFailureReason
Task Failure Reason Analysis:
#############################
SECRETS REFERENCE CHECKS
========================
The Secrets Manager secret arn:aws:secretsmanager:ap-northeast-1:XXXXXXXXXXXX:secret:XXXXX that you have referenced could not be found
The Secrets Manager secret arn:aws:secretsmanager:ap-northeast-1:XXXXXXXXXXXX:secret:XXXXX that you have referenced could not be found
This shows that the error is not caused by the task execution role's policy. It occurs because the secret specified in the task definition does not exist.
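
The TaskFailureReason output is plain text, with each check category written as a heading underlined by "=" characters. If you want to process it in a script, the following is a rough sketch that splits the text into sections and drops repeated lines such as the duplicated message above. The function name `parse_failure_reason` is hypothetical, and the layout is inferred only from the outputs shown in this post.

```python
from typing import Dict, List, Optional


def parse_failure_reason(text: str) -> Dict[str, List[str]]:
    """Group findings by section heading, removing duplicate lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for index, line in enumerate(lines):
        if set(line) <= {"=", "#"}:
            continue  # underline rows carry no content
        next_line = lines[index + 1] if index + 1 < len(lines) else ""
        if next_line and set(next_line) == {"="}:
            # A line underlined with "=" starts a new section.
            current = line
            sections.setdefault(current, [])
        elif current is not None and line not in sections[current]:
            sections[current].append(line)
    return sections
```

Lines that appear before the first "=" heading, such as the "Task Failure Reason Analysis:" title, are ignored.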
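Identifying the cause of a network connectivity problem
The error message suggests that the task failed to start because a network connection error (timeout) occurred while pulling the image from ECR (Elastic Container Registry).
Let's run the runbook and check.
Execution result
Let's look at the automation execution results.
Output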
EniDeletionMessage.Status
VPC Lambda ENI Deletion Status:
#############################
INFO: IAM Role Deletion: Successfully deleted the role [NetworkToolSSMRunbookExecutioneadb54f07588] which got created as part of this execution
The VPC Lambda ENI ['eni-XXXXXXXXXXXX'] not found, the lambda service has deleted it automatically.
No further actions required.
ExecutionResults.ExecutionLogs
Runbook Execution logs
######################
++++++++++++++++++++++++++++++
STEP: PreflightPermissionChecks
++++++++++++++++++++++++++++++
INFO: The IAM Identity arn:aws:iam::XXXXXXXXXXXX:role/SSM-TroubleshootECSTaskFailedToStart-Role used to execute the Runbook has all required permission
++++++++++++++++++++++++++
STEP: NetworkToolDeployment
++++++++++++++++++++++++++
WARNING: An Lambda function execution role named NetworkToolSSMRunbookExecutioneadb54f07588 has been successfully created.
The Role will be deleted as part of the DeletionLifecycle execution or at Lambda_ENI_Deletion_Handler however watch for any exception or information message under runbook Execution logs of global output section
WARNING: An VPC Lambda Function arn:aws:lambda:ap-northeast-1:XXXXXXXXXXXX:function:NetworkToolSSMRunbookExecutioneadb54f07588 has been successfully deployed to test VPC Networking.
The Lambda function will be deleted as part of the DeletionLifecycle execution step however watch for any exception or information message under runbook Execution logs of global output section
+++++++++++++++++++++++++++++++++++
STEP: CoreFailureReasonEvaluation
+++++++++++++++++++++++++++++++++++
GenericChecks:Checking Task networking and public IP assignment
GenericChecks:Checking Task networking and public IP assignment
GenericChecks:Checking if Image Pull Rate Limit occurred
RegistryAnalysis:describe_image tag: 'latest' for existence check
RegistryAnalysis:Checking ECS/Fargate Agent role permission to perform ['ecr:GetAuthorizationToken']
RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of ECR domains
RegistryAnalysis:Checking required VPC endpoints for ECR
RegistryAnalysis:Checking VPCe analysis for s3
RegistryAnalysis:Checking Security group egress rules for DNS resolved IP/s of external registry domains
LogAnalysis:Checking network results for CW endpoints
LogAnalysis:Checking egress security group rules for DNS resolved IP(s) of CW endpoint
LogAnalysis:Checking log configuration permission and group existence
+++++++++++++++++++++++
STEP: DeletionLifecycle
+++++++++++++++++++++++
INFO: ENI Deletion: Found HyperPlane ENI ['eni-XXXXXXXXXXXX'] which got created in this execution, The ENI will be attempted for deletion as part of step [Lambda_ENI_Deletion_Handler]
INFO: Lambda Function Deletion: Successfully deleted the AWS Lambda function [NetworkToolSSMRunbookExecutioneadb54f07588] which got created as part of this execution
INFO: Successfully modified Cloudwatch log group retention period for group /aws/lambda/NetworkToolSSMRunbookExecutioneadb54f07588 to the given value 30
ExecutionResults.TaskFailureReason
Task Failure Reason Analysis:
#############################
REGISTRY CHECKS
===============
The registry domain [XXXXXXXXXXXX.dkr.ecr.ap-northeast-1.amazonaws.com] encountered socket connection failure for DNS Resolved IP(s) ['52.192.36.247', '52.192.36.247', '52.192.36.247'] with same network settings used by ECS / Fargate Agent. For more information check the following Knowledge Center articles:
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-error/
You have enabled ECR related private endpoints ['s3'], However missing endpoint/s {'ecr.dkr', 'ecr.api'}
See https://docs.aws.amazon.com/AmazonECR/latest/userguide/vpc-endpoints.html
The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources
The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources
Log Configuration Checks
========================
The registry domain [logs.ap-northeast-1.amazonaws.com] encountered socket connection failure for DNS Resolved IP(s) ['18.181.204.238', '18.181.204.238', '18.181.204.238', '57.181.184.225', '57.181.184.225', '57.181.184.225', '18.181.204.198', '18.181.204.198', '18.181.204.198', '18.181.204.204', '18.181.204.204', '18.181.204.204', '18.181.204.203', '18.181.204.203', '18.181.204.203', '18.181.204.209', '18.181.204.209', '18.181.204.209', '18.181.204.225', '18.181.204.225', '18.181.204.225', '57.181.184.228', '57.181.184.228', '57.181.184.228'] with same network settings used by ECS / Fargate Agent. For more information check the following Knowledge Center articles:- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-api-error-ecr/
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-pull-container-error/
The route table rtb-XXXXXXXXXXXX associated with Task Subnet subnet-XXXXXXXXXXXX does not have network access either via Internet Gateway or Nat Gateway.
If you are leveraging custom networking path via peered network or via Transit Gateway please ensure the subnet where the task is placed have network connectivity to pull all required resources
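The output shows that a VPC endpoint for S3 exists, but the com.amazonaws.ap-northeast-1.ecr.api and com.amazonaws.ap-northeast-1.ecr.dkr endpoints have not been created.
The analysis also found that the route table attached to the subnet has no route to an internet gateway (IGW) or a NAT gateway (NATGW), so ECR cannot be reached.
In addition, the connection to logs.ap-northeast-1.amazonaws.com failed, which shows that CloudWatch Logs is also unreachable and contributes to the problem.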
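
The endpoint finding can be reproduced with a simple set difference. The following is a minimal sketch, assuming a subnet with no internet route needs the ecr.api, ecr.dkr, and s3 endpoints to pull images from ECR. The function names are hypothetical, and the lookup helper reads only the first page of `describe_vpc_endpoints` results.

```python
from typing import Any, Iterable, Set, Tuple


def missing_ecr_endpoints(
    region: str,
    existing_service_names: Iterable[str],
    required_suffixes: Tuple[str, ...] = ("ecr.api", "ecr.dkr", "s3"),
) -> Set[str]:
    """Return the required endpoint service names that are not present."""
    required = {f"com.amazonaws.{region}.{suffix}" for suffix in required_suffixes}
    return required - set(existing_service_names)


def list_endpoint_service_names(ec2_client: Any, vpc_id: str) -> Set[str]:
    """Collect the service names of the VPC endpoints in one VPC."""
    response = ec2_client.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return {endpoint["ServiceName"] for endpoint in response["VpcEndpoints"]}
```

For the VPC in this example, where only the S3 endpoint exists, `missing_ecr_endpoints("ap-northeast-1", ["com.amazonaws.ap-northeast-1.s3"])` returns the ecr.api and ecr.dkr service names. Passing `required_suffixes=("ecr.api", "ecr.dkr", "s3", "logs")` would also cover the CloudWatch Logs connection when the task uses it for logging.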
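About the network connectivity verification
If network connectivity needs to be verified, a Lambda function (NetworkToolSSMRunbookExecutionxxxxxxxxxx) is deployed.
After the verification finishes, the Lambda function is deleted. Because the ENI also has to be deleted, the runbook includes a 30-minute sleep, so the overall execution takes longer.
Note that the CloudWatch Logs log group is not deleted automatically, so you need to delete it manually.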
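
The manual cleanup can be scripted. The following is a rough sketch that lists the leftover log groups by name prefix through the CloudWatch Logs `describe_log_groups` call and deletes them with `delete_log_group`. The function name and the `dry_run` flag are hypothetical, and the prefix is taken from the log group names in the outputs above.

```python
from typing import Any, Dict, List

# Prefix taken from the log group names shown in the execution logs.
LOG_GROUP_PREFIX = "/aws/lambda/NetworkToolSSMRunbookExecution"


def delete_network_tool_log_groups(logs_client: Any, dry_run: bool = True) -> List[str]:
    """List, and optionally delete, log groups left behind by the runbook."""
    names: List[str] = []
    kwargs: Dict[str, str] = {"logGroupNamePrefix": LOG_GROUP_PREFIX}
    while True:
        page = logs_client.describe_log_groups(**kwargs)
        names.extend(group["logGroupName"] for group in page.get("logGroups", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # fetch the next page
    if not dry_run:
        for name in names:
            logs_client.delete_log_group(logGroupName=name)
    return names
```

The default `dry_run=True` only returns the matching names, so you can review them before deleting anything. If you specified a retention period, the log events expire on their own, but the log group itself remains.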
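Summary
When an ECS task stops or fails to start, the AWS-provided Systems Manager Automation runbook "AWSSupport-TroubleshootECSTaskFailedToStart" can make identifying the cause more efficient.
It automatically analyzes a wide range of factors, including IAM permissions, network settings, and Secrets Manager problems.
Give it a try to cut down your troubleshooting time.
I hope this post is helpful to someone.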