
[Quick Tip] How to Check the History of Past AWS Outages

2020.02.22


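Nakayama (Jun) here.

If you are considering AWS, you are probably curious about how reliable it actually is.

As Werner Vogels, CTO of Amazon.com, puts it, "Everything fails, all the time." The basic approach is therefore to make the components of a system redundant to the degree that the system's reliability requirements demand (Design for Failure).

That said, a system's reliability also depends on the reliability of the components themselves, of the external services it integrates with, and of the infrastructure those services run on.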

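AWS publishes what it does to improve the reliability of each service and of its infrastructure through event presentations, whitepapers, and similar material. For example, the AWS global network and the design of Regions and edge locations are covered in an AWS event presentation.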

But you may well have wondered:

"So how has it actually held up?"

This article introduces ways to check exactly that, along with the information sources to use.

Service Health Dashboard

Outages that affect a large number of users are announced on the Service Health Dashboard.

Service Health Dashboard

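The Service Health Dashboard shows both "Current Status" and "Status History". Status History covers the past year. However, the GUI displays only one week at a time, and although the services are split up by geography, there are so many of them that the page is poorly suited to reviewing how things have gone over the long term.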

Getting the Status History as JSON

As it turns out, the Status History is also available in JSON format. (I only learned this the other day. Thanks to the colleague who told me!)

https://status.aws.amazon.com/data.json

As an example, let's look at the outage history for CloudFront.

$ curl -s https://status.aws.amazon.com/data.json | jq '.archive | sort_by(.date | tonumber) | .[] | select(.service == "cloudfront")'
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Delays in propagating certain configuration changes to a few CloudFront Edge locations ",
  "date": "1528895774",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 6:16 AM PDT</span>&nbsp;We are investigating longer than usual propagation times to propagate certain configuration changes to a few of our edge locations: the creation and deletion of CloudFront distributions and the updating of certificates for a CloudFront distribution. All other CloudFront configuration changes are propagating normally. End-user requests for content from our edge locations are not affected by this issue and are being served normally. </div><div><span class=\"yellowfg\"> 6:59 AM PDT</span>&nbsp;Between 3:25 AM and 6:45 AM PDT, CloudFront experienced longer than usual propagation times to propagate certain configuration changes to a few of our edge locations. This issue has been resolved and the service is operating normally. During this time, end user requests for content from our edge locations were not affected by this issue and were being served normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Elevated Lambda@Edge errors",
  "date": "1547501684",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 1:34 PM PST</span>&nbsp;We are investigating elevated Lambda@Edge error rates that may be affecting some customers. CloudFront customers who do not have Lambda@Edge functions are unaffected by this issue. </div><div><span class=\"yellowfg\"> 2:11 PM PST</span>&nbsp;Between 11:40 AM PST and 2:07 PM PST, customers may have experienced elevated Lambda@Edge error rates. CloudFront customers who do not have Lambda@Edge functions were unaffected by this issue. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Change Propagation Delays",
  "date": "1551301596",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 1:06 PM PST</span>&nbsp;We’re investigating longer than usual propagation times for changes to CloudFront configurations. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\"> 1:30 PM PST</span>&nbsp;Between 10:09 AM and 1:23 PM PST customers may have experienced longer than usual propagation times while making changes to CloudFront configurations. End-user requests for content from edge locations were not affected. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Change Propagation Delays",
  "date": "1553885279",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\">11:48 AM PDT</span>&nbsp;We’re investigating longer than usual propagation times for changes to CloudFront configurations. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\">12:52 PM PDT</span>&nbsp;We have identified the root cause of the longer than usual propagation times for changes to CloudFront configurations. We continue to work toward resolution. </div><div><span class=\"yellowfg\"> 1:18 PM PDT</span>&nbsp;Between 8:50 AM and 12:59 PM PDT, we experienced longer than usual propagation times for changes to CloudFront configurations. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Increased Error Rates",
  "date": "1556576555",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 3:22 PM PDT</span>&nbsp;We are investigating increased error rates for requests served by certain edge locations in South East Asia.</div><div><span class=\"yellowfg\"> 3:59 PM PDT</span>&nbsp;Between 2:16 PM and 3:43 PM PDT we experienced increased error rates for requests served by edge locations in South East Asia. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Increased Invalidation Error Rates",
  "date": "1568887640",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 3:07 AM PDT</span>&nbsp;We are investigating elevated error rates for content invalidation. Some customers may also receive \"Rate exceeded\" exceptions when attempting to invalidate content. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\"> 3:36 AM PDT</span>&nbsp;Between September 18 10:58 PM and September 19 2:44 AM PDT, we experienced increased API error rates impacting CreateInvalidation API. Some customers may have also received \"Rate exceeded\" exceptions when attempting to invalidate content. End-user requests for content were not affected by this issue, and content from our edge locations were not affected and continue to be served normally. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Change Propagation Delays",
  "date": "1573033383",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 1:43 AM PST</span>&nbsp;We are investigating longer than usual propagation times to propagate certain configuration changes made either via console, CloudFront APIs, or AWS CloudFormation templates to some of our edge locations. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\"> 2:39 AM PST</span>&nbsp;We can confirm longer than usual propagation times to propagate configuration changes made either via console, CloudFront APIs, or AWS CloudFormation templates to some of our edge locations, and continue to work toward resolution. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\"> 4:03 AM PST</span>&nbsp;We continue to work on mitigating the cause of longer than usual propagation times to propagate configuration changes made either via console, CloudFront APIs, or AWS CloudFormation templates to some of our edge locations. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\"> 6:23 AM PST</span>&nbsp;Between November 5 9:52 PM and November 6 6:00 AM PST, we experienced intermittent delays in propagating CloudFront distribution changes to some edge locations. End-user requests for content from our edge locations were not affected. The issue has been resolved and the service is operating normally. </div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Elevated CloudFront API errors",
  "date": "1577471528",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\">10:32 AM PST</span>&nbsp;We are investigating increased CloudFront API errors and longer than usual propagation times while making changes to CloudFront configurations. End-user requests for content from edge locations are not affected.</div><div><span class=\"yellowfg\">11:16 AM PST</span>&nbsp;We have resolved the issues related to CloudFront APIs. We have identified the root cause of the longer than usual change propagation delays for invalidations and CloudFront configurations and are actively working towards resolution. End-user requests for content from edge locations are not affected and continue to be served normally.</div><div><span class=\"yellowfg\">12:04 PM PST</span>&nbsp;Between 9:30 AM and 11:01 AM PST, customers may have seen elevated CloudFront API errors. These have now been resolved. Due to these API errors there was a backlog of changes that resulted in longer than usual change propagation delays for CloudFront configurations and invalidations. This backlog of changes is actively being processed. End-user requests for content from edge locations are not affected and continue to be served normally.</div><div><span class=\"yellowfg\"> 2:01 PM PST</span>&nbsp;Between 9:30 AM and 11:01 AM PST, customers may have seen elevated CloudFront API errors. Due to these API errors there was a backlog of changes that resulted in longer than usual change propagation delays for CloudFront configurations and invalidations. The backlog of Invalidation changes were fully processed by 11:35 AM. The backlog of CloudFront configuration changes was fully processed by 2:00 PM PST. All issues have been fully resolved and the system is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Change Propagation Delays",
  "date": "1580371190",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\">11:59 PM PST</span>&nbsp;We are investigating longer than usual propagation times for changes to CloudFront configurations. End-user requests for content from our edge locations are not affected by this issue and are being served normally.</div><div><span class=\"yellowfg\">Jan 30, 12:15 AM PST</span>&nbsp;We have confirmed that we are seeing increased propagation times for changes to a few CloudFront edge locations. Majority of CloudFront edge locations are consuming configuration changes normally. End-user requests for content from our edge locations are not affected by this issue and are being served normally. </div><div><span class=\"yellowfg\">Jan 30,  1:23 AM PST</span>&nbsp;Between January 29 9:12 PM and January 30 12:48 AM PST we experienced delays in propagation times for changes to CloudFront configurations. During this time end-user requests for content from our edge locations were not affected. The issue has been resolved and the service is operating normally.</div>",
  "service": "cloudfront"
}
{
  "service_name": "Amazon CloudFront",
  "summary": "[RESOLVED] Increased Console Errors",
  "date": "1581553041",
  "status": "1",
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 4:17 PM PST</span>&nbsp;Between 1:10 PM and 3:47 PM PST we experienced periods of increased error rates when accessing the CloudFront Management Console. Customers may have received a 404 Response. Existing distributions and the CloudFront APIs were not impacted by this issue. The issue has been resolved and the CloudFront Management Console is operating normally.</div>",
  "service": "cloudfront"
}
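
For readers who would rather do the same filtering in a script, here is a minimal Python sketch of the jq pipeline above. It assumes the document has the shape shown in the output (a top-level `archive` list whose records carry `service` and a string `date`). The function names `fetch_status_data` and `events_for_service` are hypothetical helpers, not part of any AWS tooling.

```python
import json
from urllib.request import urlopen

# URL taken from this article; whether it is still served is not guaranteed.
DATA_URL = "https://status.aws.amazon.com/data.json"


def fetch_status_data(url=DATA_URL, timeout=10.0):
    """Download and parse the Status History JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def events_for_service(data, service):
    """Return the archived events for one service key, oldest first.

    "date" is an epoch-seconds string in the output above, so it is
    converted to int before sorting instead of being compared as text.
    """
    matches = [e for e in data.get("archive", []) if e.get("service") == service]
    return sorted(matches, key=lambda e: int(e["date"]))
```

Calling `events_for_service(fetch_status_data(), "cloudfront")` should yield the same records as the command above. Note that `status` appears both as a number and as a string in the output, so convert it with `int()` before comparing.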

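Two events affected the availability of services that use CloudFront (and one of those affected only customers using Lambda@Edge). The rest were issues such as delays in propagating configuration changes or errors in the management console, and their descriptions include statements such as "End-user requests for content from our edge locations are not affected by this issue and are being served normally."

How you evaluate this will vary from person to person, but if end users are not affected, I suspect the remaining problems are tolerable in many cases. (This is purely my personal opinion.)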
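
The manual reading above can be roughly approximated in code. The following Python sketch strips the HTML from each `description` and flags events whose text says that requests were "not affected" or "not impacted". The phrase list and the helper names are my own assumptions, not an AWS classification, so treat the result as a first pass to review by hand.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

# Heuristic phrases (an assumption of this sketch, not an official field).
NO_IMPACT_PHRASES = ("not affected", "not impacted")


def plain_text(description):
    """Strip tags, decode entities such as &nbsp;, and collapse whitespace."""
    text = TAG_RE.sub(" ", description)
    text = html.unescape(text).replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()


def mentions_no_end_user_impact(event):
    """True if the description contains one of the heuristic phrases."""
    text = plain_text(event["description"]).lower()
    return any(phrase in text for phrase in NO_IMPACT_PHRASES)


def split_by_impact(events):
    """Return (possibly_impacting, stated_no_impact) lists of events."""
    impacting = [e for e in events if not mentions_no_end_user_impact(e)]
    no_impact = [e for e in events if mentions_no_end_user_impact(e)]
    return impacting, no_impact
```

A keyword match like this cannot judge severity, and a description could mention "not affected" while still describing impact elsewhere, so it is only useful for narrowing down which entries to read.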

How far back do the records go?

For CloudFront, the oldest record was an event on June 13, 2018, as shown below.

$ date --date='@1528895774'
Wed Jun 13 22:16:14 JST 2018
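
The `date --date='@...'` form is GNU-specific, so here is a small Python sketch of the same conversion for environments without GNU date. It assumes the `date` field is epoch seconds stored as a string, as in the output above. `event_time` is a hypothetical helper name, and the fixed UTC+9 offset stands in for JST.

```python
from datetime import datetime, timedelta, timezone

# JST has no daylight saving time, so a fixed UTC+9 offset is sufficient.
JST = timezone(timedelta(hours=9), "JST")


def event_time(date_field, tz=JST):
    """Convert the epoch-seconds string from a record into an aware datetime."""
    return datetime.fromtimestamp(int(date_field), tz=tz)


print(event_time("1528895774").isoformat())  # 2018-06-13T22:16:14+09:00
```

Passing `tz=timezone.utc` instead gives the UTC time, which is easier to line up with the PDT/PST timestamps inside each description.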

Incidentally, the oldest record across all services was an event on May 10, 2018 (JST).

$ curl -s https://status.aws.amazon.com/data.json | jq '.archive | sort_by(.date | tonumber) | .[0]'
{
  "service_name": "Auto Scaling (N. Virginia)",
  "summary": "[RESOLVED] Increased API Error Rates",
  "date": "1525914401",
  "status": 1,
  "details": "",
  "description": "<div><span class=\"yellowfg\"> 6:06 PM PDT</span>&nbsp;We are investigating increased API error rates in the US-EAST-1 Region.</div><div><span class=\"yellowfg\"> 6:26 PM PDT</span>&nbsp;We are continuing to investigate increased API error rates in the US-EAST-1 Region</div><div><span class=\"yellowfg\"> 7:24 PM PDT</span>&nbsp;We have identified the root cause of the elevated API error rates and can confirm that most customers have recovered. We are continuing to work towards full resolution.</div><div><span class=\"yellowfg\"> 7:40 PM PDT</span>&nbsp;Between 5:42 PM and 7:04 PM PDT, we experienced increased API error rates in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.</div>",
  "service": "autoscaling-us-east-1"
}
$ date --date='@1525914401'
Thu May 10 10:06:41 JST 2018

At the time of writing, I have not found a way to view records older than this. If one exists, please let me know!
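
To check the window the archive covers in one step, and to get the long-term view that the dashboard GUI makes awkward, something like the following Python sketch could be used. It assumes `archive` is the list under the `archive` key of the parsed JSON. `coverage_summary` is a hypothetical helper, and years are counted in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def coverage_summary(archive):
    """Return (oldest, newest, events_per_year) for a list of archive records.

    Times are timezone-aware UTC datetimes; events_per_year is a Counter
    keyed by UTC calendar year.
    """
    if not archive:
        return None, None, Counter()
    times = sorted(
        datetime.fromtimestamp(int(e["date"]), tz=timezone.utc) for e in archive
    )
    per_year = Counter(t.year for t in times)
    return times[0], times[-1], per_year
```

Filtering `archive` by `service` before calling it gives the same summary for a single service.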

AWS Post-Event Summaries

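In addition to the event-by-event history, detailed reports are published for relatively large-scale outages. A report on the outage that occurred in the Tokyo Region in August 2019 is among them.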

AWS Post-Event Summaries

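Summary

When evaluating something, it is very important to base the evaluation on facts. Have you ever passed judgment based on speculation or hearsay?

This may be obvious, but I wrote this article partly as a reminder to myself.

That's all from the field.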