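Pondering the AWS IAM Control Plane and Data Plane #devio2022

It is fun to imagine what the impact would be if a failure occurred in the IAM control plane, and what it would be if one occurred in the IAM data plane. (Best of all is for neither to happen.)

Good evening, this is Chiba (Yuki).

At DevelopersIO 2022, an online event hosted by our company, I gave a talk titled "Where Does It Run? Pondering the AWS IAM Control Plane and Data Plane."

The session is for people who think, "I use AWS IAM all the time, but I have never thought about how it works behind the scenes." Its theme is to share information that makes their world a little richer, and its goal is to deepen understanding of the AWS IAM control plane and data plane.

Below are the slides I used, a summary of the talk, and the video.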

Slides

Summary

Overall picture

The following single image summarizes the talk.

IAM_Control_plane_data_plane

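Based on this image, the talk covered the following topics.

  • What the IAM control plane is
  • What the IAM data plane is
  • Derived keys of the secret access key
  • A failure of the IAM control plane?
  • A failure of the IAM data plane?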

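IAM control plane and data plane

  • Control plane
    • Exists only in the US East (N. Virginia) Region
    • Stores the IAM resources
  • Data plane
    • IAM resources are replicated to it from the control plane
    • Exists in every Region and performs authentication and authorization within that Region
    • Distributed across at least three AZs

Customers can reach the control plane through the global endpoint, but they cannot access the data plane directly. The entry below covers this in more detail.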
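To make the relationship concrete, here is a toy model in Python. It is only a sketch of the idea above, not how AWS implements IAM: the class names, methods, and the role and policy strings are all made up for illustration. It shows that writes go to a single control plane, that replication is one-way, and that a regional data plane answers from its own replica.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Hypothetical stand-in for the IAM control plane (single Region, source of truth)."""
    region: str = "us-east-1"
    available: bool = True
    resources: dict = field(default_factory=dict)

    def put_role(self, name: str, policy: str) -> None:
        # Changes to IAM resources are only accepted while the control plane is up.
        if not self.available:
            raise RuntimeError("control plane unavailable: IAM changes cannot be made")
        self.resources[name] = policy


@dataclass
class DataPlane:
    """Hypothetical stand-in for one Region's IAM data plane (read-only replica)."""
    region: str
    replica: dict = field(default_factory=dict)

    def replicate_from(self, control_plane: ControlPlane) -> None:
        # Propagation is one-way: control plane -> data plane.
        if not control_plane.available:
            raise RuntimeError("control plane unavailable: propagation is delayed")
        self.replica = dict(control_plane.resources)

    def is_known(self, name: str) -> bool:
        # Authentication and authorization consult only the local replica.
        return name in self.replica


control = ControlPlane()
tokyo = DataPlane("ap-northeast-1")
control.put_role("app-role", "allow s3:GetObject")
tokyo.replicate_from(control)

control.available = False          # simulate a control plane outage
print(tokyo.is_known("app-role"))  # True: the replicated role still resolves locally
```

In this model, an outage of `ControlPlane` blocks new changes and their propagation, while lookups against an existing replica keep working.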

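Derived keys of the secret access key

This part covers derived keys, the mechanism that lets IAM endpoints handle a huge volume of API requests while maintaining security.

  • The secret access key itself exists only in the control plane
  • A derived key that incorporates the date and the Region is generated for the data plane
  • At the IAM endpoint, a further derived key that adds the service information to the key above is generated and cached

This material is based on the following session.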

AWS re:Invent 2021 - Keynote with Dr. Werner Vogels - YouTube
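The chain of derived keys can be sketched with the publicly documented Signature Version 4 signing-key derivation. This assumes the derived keys in the bullets above correspond to the SigV4 intermediate keys; the function names and the dummy secret are mine, and how IAM stores or caches these keys internally is not something this sketch can show.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """HMAC-SHA256 of message under key, returned as raw bytes."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_region_key(secret_access_key: str, date: str, region: str) -> bytes:
    """Derive a key bound to one day (YYYYMMDD) and one Region."""
    k_date = _hmac_sha256(("AWS4" + secret_access_key).encode("utf-8"), date)
    return _hmac_sha256(k_date, region)


def derive_signing_key(region_key: bytes, service: str) -> bytes:
    """Add the service name, then the fixed SigV4 terminator string."""
    k_service = _hmac_sha256(region_key, service)
    return _hmac_sha256(k_service, "aws4_request")


# Dummy secret for illustration only; never hard-code a real one.
region_key = derive_region_key("EXAMPLE-SECRET-NOT-REAL", "20220912", "eu-north-1")
s3_key = derive_signing_key(region_key, "s3")
ec2_key = derive_signing_key(region_key, "ec2")
print(len(region_key), s3_key != ec2_key)
```

Because each step is a one-way HMAC, a key scoped to a date, Region, and service cannot be turned back into the secret access key, and it is useless outside that scope. That is what makes it reasonable to distribute and cache derived keys while keeping the secret itself in one place.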

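A failure of the IAM control plane?

From the past incidents listed on the Health Dashboard, I picked one that looks like it occurred in the IAM control plane.

The event data is reused from what I looked up in the entry below.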

% curl https://status.aws.amazon.com/data.json\
 | jq '.archive | sort_by(.date) | .[] | select(.service == "iam")'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  212k  100  212k    0     0  55903      0  0:00:03  0:00:03 --:--:-- 55941
{
  "service_name": "AWS Identity and Access Management",
  "summary": "[RESOLVED] IAM errors and propagation latency",
  "date": "1629996723",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 9:52 AM PDT</span>&nbsp;We are investigating elevated latencies and errors on the IAM APIs. In addition, we are investigating propagation delays for recently created or recently updated IAM users, credentials, roles, policies. Authentication and authorization of existing users, credentials, roles, policies are not impacted. Other AWS services like AWS CloudFormation that use IAM roles were also impacted.</div><div><span class=\"yellowfg\">10:39 AM PDT</span>&nbsp;Between 6:44 AM and 9:12 AM PDT, customers experienced elevated latency and error rates in response to IAM API requests, as well as delays in describing recently created or modified IAM resources. In addition, between 6:44 AM and 10:02 AM, propagation of IAM API updates was delayed in the ME-SOUTH-1, EU-SOUTH-1, AP-EAST-1, and AF-SOUTH-1 regions, and newly created or recently updated IAM users, credentials, roles, and policies may not have been available for authentication and authorization in those regions. Other AWS Services that rely on IAM changes to provision resources, such as CloudFormation, were also impacted. Authentication and authorization for existing users, credentials, roles, policies were not impacted. The issue has been resolved and the service is operating normally.</div>",
  "service": "iam"
}
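  • Latency and error rates for IAM API requests increased
  • Propagation of IAM resource updates to some Regions was delayed
  • Authentication and authorization using existing IAM resources were not affected

In this example, authentication and authorization based on the existing IAM resource information held in each Region's data plane appear to have been unaffected.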
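The raw event is a little hard to read because `date` is a Unix timestamp and `description` is HTML. Below is a small Python sketch for making one event readable. The helper names are hypothetical, the sample description is cut down to the first sentence of the output quoted above, and the fixed UTC-7 offset is an assumption that only holds while PDT is in effect.

```python
import html
import re
from datetime import datetime, timedelta, timezone
from typing import List

# Fixed UTC-7 offset; an assumption that fits the summer events quoted in this post.
PDT = timezone(timedelta(hours=-7), "PDT")


def event_time_pdt(event: dict) -> datetime:
    """Convert the event's epoch-seconds `date` string to a PDT datetime."""
    return datetime.fromtimestamp(int(event["date"]), tz=PDT)


def description_updates(event: dict) -> List[str]:
    """Split the HTML description into one plain-text line per <div> update."""
    updates = []
    for chunk in re.findall(r"<div>(.*?)</div>", event["description"], flags=re.S):
        text = html.unescape(re.sub(r"<[^>]+>", "", chunk))  # drop tags, then decode &nbsp;
        updates.append(" ".join(text.split()))                # normalize whitespace
    return updates


# Abbreviated copy of the event shown above (description truncated for brevity).
sample = {
    "service": "iam",
    "date": "1629996723",
    "description": '<div><span class="yellowfg"> 9:52 AM PDT</span>&nbsp;'
                   "We are investigating elevated latencies and errors on the IAM APIs.</div>",
}

print(event_time_pdt(sample).strftime("%Y-%m-%d %H:%M %Z"))  # 2021-08-26 09:52 PDT
for line in description_updates(sample):
    print(line)
```

The converted timestamp lines up with the first "9:52 AM PDT" update in the description, which suggests `date` records when the event was first posted.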

A failure of the IAM data plane?

Earlier I retrieved events whose "service" is IAM. This time I retrieved events whose "description" contains "IAM".

% curl https://status.aws.amazon.com/data.json\
 | jq '.archive | sort_by(.date) | .[] | select(.description | contains("IAM"))'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  212k  100  212k    0     0  58763      0  0:00:03  0:00:03 --:--:-- 58797
{
  "service_name": "Amazon Simple Storage Service (Stockholm)",
  "summary": "[RESOLVED] Elevated IAM Error Rate",
  "date": "1623976331",
  "status": "3",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 5:32 PM PDT</span>&nbsp;We are investigating an elevated failure rate for AWS IAM authentication API requests in the EU-NORTH-1 Region.  This issue will also cause API requests for other AWS services in the EU-NORTH-1 Region to experience elevated failures. </div><div><span class=\"yellowfg\"> 5:45 PM PDT</span>&nbsp;We are seeing signs of recovery of the IAM authentication API error rates, and this is also showing improvement for other services.  We are actively working towards full mitigation.</div><div><span class=\"yellowfg\"> 5:52 PM PDT</span>&nbsp;The IAM authentication API error rates continue to improve, and many services have fully recovered.  We are actively working towards full mitigation.</div><div><span class=\"yellowfg\"> 5:58 PM PDT</span>&nbsp;Between 4:59 PM and 5:32 PM PDT we experienced elevated failure rate for AWS IAM authentication API requests in the EU-NORTH-1 Region. While some other services continue to move toward recovery, requests made to S3 are no longer experiencing elevated error rates. The issue has been resolved and the service is operating normally.</div>",
  "service": "s3-eu-north-1"
}
{
  "service_name": "Amazon Elastic Compute Cloud (Stockholm)",
  "summary": "[RESOLVED] Elevated IAM Error Rate",
  "date": "1623977653",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 5:54 PM PDT</span>&nbsp;The IAM authentication API error rates continue to improve, and many services have fully recovered.  We are actively working towards full mitigation.</div><div><span class=\"yellowfg\"> 6:14 PM PDT</span>&nbsp;Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused AWS services to experience increased error rates, as they rely on the IAM service to authenticate their API requests. Services began to see recovery at 5:24 PM, with full recovery at 6:02 PM. All services are operating normally in the EU-NORTH-1 Region at this time.</div>",
  "service": "ec2-eu-north-1"
}
{
  "service_name": "Amazon Elastic Load Balancing (Stockholm)",
  "summary": "[RESOLVED] ELB API and Connectivity Issues",
  "date": "1623980954",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 6:49 PM PDT</span>&nbsp;Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused ELB to experience increased API error rates and latencies, as the APIs rely on the IAM service to authenticate requests. Recovery for the APIs began recovery at 5:24 PM, with full recovery at 6:11 PM. Between 6:03 PM and 6:28 PM PDT, we experienced an increase in connectivity issues and WAF errors for some load balancers. The issue has been resolved and the service is operating normally.</div><div><span class=\"yellowfg\"> 6:51 PM PDT</span>&nbsp;Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused ELB to experience increased API error rates and latencies, as the APIs rely on the IAM service to authenticate requests. Recovery for the APIs began recovery at 5:24 PM, with full recovery at 6:11 PM. Between 6:03 PM and 6:28 PM PDT, we experienced an increase in connectivity issues and WAF errors for some load balancers. The issue has been resolved and the service is operating normally.</div>",
  "service": "elb-eu-north-1"
}
{
  "service_name": "AWS Identity and Access Management",
  "summary": "[RESOLVED] IAM errors and propagation latency",
  "date": "1629996723",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 9:52 AM PDT</span>&nbsp;We are investigating elevated latencies and errors on the IAM APIs. In addition, we are investigating propagation delays for recently created or recently updated IAM users, credentials, roles, policies. Authentication and authorization of existing users, credentials, roles, policies are not impacted. Other AWS services like AWS CloudFormation that use IAM roles were also impacted.</div><div><span class=\"yellowfg\">10:39 AM PDT</span>&nbsp;Between 6:44 AM and 9:12 AM PDT, customers experienced elevated latency and error rates in response to IAM API requests, as well as delays in describing recently created or modified IAM resources. In addition, between 6:44 AM and 10:02 AM, propagation of IAM API updates was delayed in the ME-SOUTH-1, EU-SOUTH-1, AP-EAST-1, and AF-SOUTH-1 regions, and newly created or recently updated IAM users, credentials, roles, and policies may not have been available for authentication and authorization in those regions. Other AWS Services that rely on IAM changes to provision resources, such as CloudFormation, were also impacted. Authentication and authorization for existing users, credentials, roles, policies were not impacted. The issue has been resolved and the service is operating normally.</div>",
  "service": "iam"
}

The Stockholm events were recorded under several service categories (S3, EC2, and ELB), and each one reports an increase in IAM authentication API errors.

| Time (PDT) | Service | Description |
|---|---|---|
| 5:32 PM | s3 | We are investigating an elevated failure rate for AWS IAM authentication API requests in the EU-NORTH-1 Region. This issue will also cause API requests for other AWS services in the EU-NORTH-1 Region to experience elevated failures. |
| 5:45 PM | s3 | We are seeing signs of recovery of the IAM authentication API error rates, and this is also showing improvement for other services. We are actively working towards full mitigation. |
| 5:52 PM | s3 | The IAM authentication API error rates continue to improve, and many services have fully recovered. We are actively working towards full mitigation. |
| 5:58 PM | s3 | Between 4:59 PM and 5:32 PM PDT we experienced elevated failure rate for AWS IAM authentication API requests in the EU-NORTH-1 Region. While some other services continue to move toward recovery, requests made to S3 are no longer experiencing elevated error rates. The issue has been resolved and the service is operating normally. |
| 5:54 PM | ec2 | The IAM authentication API error rates continue to improve, and many services have fully recovered. We are actively working towards full mitigation. |
| 6:14 PM | ec2 | Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused AWS services to experience increased error rates, as they rely on the IAM service to authenticate their API requests. Services began to see recovery at 5:24 PM, with full recovery at 6:02 PM. All services are operating normally in the EU-NORTH-1 Region at this time. |
| 6:49 PM | elb | Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused ELB to experience increased API error rates and latencies, as the APIs rely on the IAM service to authenticate requests. Recovery for the APIs began recovery at 5:24 PM, with full recovery at 6:11 PM. Between 6:03 PM and 6:28 PM PDT, we experienced an increase in connectivity issues and WAF errors for some load balancers. The issue has been resolved and the service is operating normally. |
| 6:51 PM | elb | Starting at 4:59 PM PDT, the AWS IAM authentication API experienced elevated error rates in the EU-NORTH-1 Region. This issue caused ELB to experience increased API error rates and latencies, as the APIs rely on the IAM service to authenticate requests. Recovery for the APIs began recovery at 5:24 PM, with full recovery at 6:11 PM. Between 6:03 PM and 6:28 PM PDT, we experienced an increase in connectivity issues and WAF errors for some load balancers. The issue has been resolved and the service is operating normally. |

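No similar errors were recorded in other Regions, and no event was logged under the "IAM" service around that time, so I assume this failure occurred only in the data plane of one specific Region (Stockholm).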

IAM_dataplane_failuer

This example is a good reminder that API requests to services such as EC2 and S3 are passed behind the scenes to the IAM data plane for authentication and authorization.

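Video

Closing

I spent some time pondering the AWS IAM control plane and data plane.

None of this is required knowledge if you simply use IAM, but a deeper understanding may make using it more enjoyable. It may also help when you assess the impact of an outage.

Note that the events returned by curl https://status.aws.amazon.com/data.json generally cover only about the past year, so the output quoted in this post could no longer be retrieved as of September 12, 2022. In that sense, I was lucky that it was still available while I was preparing the slides.

I hope this is useful to fellow IAM enthusiasts.

That's all from Chiba Yuki (@batchicchi).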