Updated October 23. AWS published its fastest post-mortem yet. We'd welcome your views.

AWS’s US-EAST-1 incident this morning crippled over 108 AWS services – and knocked out services at thousands of customers.

As The Stack published this story, over 78 AWS services were still degraded. 

Amazon earlier pointed to “DNS resolution of the DynamoDB API endpoint in US-EAST-1” as the cause behind the cascading incident. 

AWS outage cause: Internal network blamed

It later added that the incident “originated from within the EC2 internal network” without sharing further technical details – before narrowing this down in a later update to “an underlying internal subsystem responsible for monitoring the health of our network load balancers.”

The incident has echoes of a similarly extensive US-EAST-1 outage on December 7, 2021. AWS initially cited “impairment of several network devices” in the wake of that incident.

Its engineers spent hours grappling with internal DNS issues that erupted as a result of that outage, spawning problems of their own.

It later explained in a post-mortem that its internal network (which hosts foundational AWS services including monitoring, internal DNS, and parts of the EC2 control plane) faced “congestion” after “automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered unexpected behavior… [from] clients inside the internal network.” 

This, in turn, triggered a “large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network”; chaos duly ensued. 
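AWS never detailed exactly what those internal clients were doing, but the textbook client-side defence against that kind of lockstep retry surge is capped exponential backoff with jitter. A minimal Python sketch, purely illustrative: the request callable and the ConnectionError it raises are placeholders, not anything from AWS's internal stack.

```python
import random
import time

def call_with_backoff(request, max_attempts=5, base=0.2, cap=10.0):
    """Retry `request`, sleeping a random ("full jitter") interval that grows
    exponentially, so a fleet of clients doesn't retry in lockstep and
    re-congest a recovering service."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Sleep anywhere between 0 and the (capped) exponential delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```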

US-EAST-1 endpoints...

Late Monday, complex dependencies meant that “global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues,” AWS said, with support also borked. Updated: In its latest notice, AWS said:

At 12:26 AM on October 20, we identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints... [we then had] a subsequent impairment in the internal subsystem of EC2 that is responsible for launching EC2 instances due to its dependency on DynamoDB. As we continued to work through EC2 instance launch impairments, Network Load Balancer health checks also became impaired, resulting in network connectivity issues in multiple services such as Lambda, DynamoDB, and CloudWatch... By 3:01 PM, all AWS services returned to normal operations. Some services such as AWS Config, Redshift, and Connect continue to have a backlog of messages that they will finish processing over the next few hours. We will share a detailed AWS post-event summary.
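For the uninitiated, “DNS resolution issues for the regional DynamoDB service endpoints” means clients simply could not turn dynamodb.us-east-1.amazonaws.com into an IP address. A minimal Python sketch of that kind of check, ours rather than AWS's:

```python
import socket

# The regional endpoint named in AWS's status updates.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def endpoint_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one IP address."""
    try:
        return len(socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)) > 0
    except socket.gaierror:
        # Resolution failed: the symptom that kicked off the cascade.
        return False

if __name__ == "__main__":
    print(ENDPOINT, "resolves:", endpoint_resolves(ENDPOINT))
```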

The support outage, in particular, is frustrating for customers, given that following a major 2021 AWS outage the hyperscaler pledged to build a “new support system architecture that actively runs across multiple AWS regions”, and in August 2022 quietly announced a new support console “built using the latest architecture standards for high availability and region redundancy…” 

The internet, of course, is held together everywhere by duct tape and is just one line of code away from absolute mayhem. 

We await the post-mortem...

We love a good post-mortem at The Stack, particularly for a hyperscaler outage. Whether it’s Google Cloud in Paris trying to explain how it caught fire, got flooded, and ran out of water; Microsoft Azure admitting, sotto voce, that its encryption key infrastructure is a rotten mess* and is breaking things; or AWS explaining how it lost control of data centre cooling and couldn’t execute “purge mode” as servers overheated, we’re all twitching curtains; ditto for things like HPE deleting 77TB of critical data from a supercomputer with a borked update. 

We await the latest AWS post-mortem with interest. 

Our inbox, meanwhile, is flooded with largely asinine comment. Some of it is better than the rest. Ismael Wrixen, CEO of ThriveCart, is sensible: “Today’s outage isn’t just an ‘east coast AWS’ problem; it’s a reminder that 100% uptime is a myth for everyone.” He added by email: “The internet runs on shared infrastructure. The real story isn’t just that AWS had a critical issue, but how many businesses discovered their platform partner had no plan for it, especially outside of US hours. This is a harsh wake-up call about the critical need for multi-regional redundancy.”
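What does “multi-regional redundancy” look like at the application layer? A minimal sketch, assuming a hypothetical DynamoDB Global Table called “orders” replicated to us-west-2, with a client that fails over when the primary region’s endpoint won’t answer; illustrative only, and no substitute for a full DR plan.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

TABLE = "orders"                      # hypothetical Global Table
REGIONS = ["us-east-1", "us-west-2"]  # primary first, then the replica

# Fail fast so a sick region doesn't stall the whole request path.
FAST_FAIL = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1})

def get_order(order_id: str):
    """Try each region in turn; return the item from the first that answers."""
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region, config=FAST_FAIL)
        try:
            resp = client.get_item(TableName=TABLE, Key={"order_id": {"S": order_id}})
            return resp.get("Item")
        except (BotoCoreError, ClientError) as exc:
            # Covers DNS and connection failures; log and try the next region.
            print(f"{region} unavailable ({exc.__class__.__name__}), failing over")
    return None
```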

*It is now better.

Read this: Operational resilience and stress-testing for "wartime".


