When AWS’s US-EAST-1 suffered serious issues in September 2021 and December 2021, the outages also took down AWS Support’s case management system (where worried customers can file tickets) and crippled its ability to swiftly update its own service health dashboard that flags such issues – a failure that left one of many furious customers blasting it for having made “US-EAST-1 a source of systemic risk for every single AWS customer.”
Whilst no cloud provider has been immune to downtime, short or long, that particular issue fed concerns among more conservative CIOs and architects about cloud resilience (yes, still an issue for some). It was also singled out in Gartner’s 2022 Magic Quadrant for Cloud Infrastructure and Platform Services, which last month pointed to AWS’s “regional dependencies and communication” as cause for concern.
As Gartner’s analysts put it in October: “AWS’s operational incident of 7 December 2021 revealed some multiregion dependencies on the internal AWS network, which is hosted in US-EAST-1. Because US-EAST-1 also hosts support ticketing for North America, AWS customers also had difficulty communicating with technical support during the incident. Compounding this was AWS’s failure to communicate adequately in a timely fashion about the outages as they occurred and the confoundingly inaccurate Health Dashboard…”
AWS has actually shipped a fix for this issue, although Gartner could be forgiven for not knowing it.
In the wake of December 2021’s issue, AWS explained in a detailed post-mortem: “We have been working on several enhancements to our Support Services to ensure we can more reliably and quickly communicate with customers during operational issues. We expect to release a new version of our Service Health Dashboard early next year that will make it easier to understand service impact and a new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers.”
Perhaps wary of drawing attention back to the embarrassing (?) issue, its sole update since appears to have been a 105-word note published on August 1, 2022 saying that it had launched a new “AWS Support Center console URL… [that] ensures you can always contact AWS Support via the AWS Support Center Console.”
This, it added, “is built using the latest architecture standards for high availability and region redundancy.”
Pressing for detail, The Stack confirmed that new architectures for both the AWS support case management system (for customers to communicate with technical support) and the health dashboard were completed and deployed over the summer. AWS has not published more details about this new AWS Support architecture.
NB: Whether this makes you mildly frustrated that AWS didn’t communicate this excellent news better to the world, or suggests to you that Gartner’s Magic Quadrants are written with little primary research, is a moot point: It’s been fixed. Next time US-EAST-1 throws a wobble, expect to still be able to file support tickets, and expect a more responsive and possibly even honest service health dashboard (one can live in hope).
A secondary side note: Not only did last year’s outage take down Amazon, Prime, Alexa, Disney+, Instacart etc., but it also hit many of the visibility solutions designed to track this kind of issue. ThousandEyes, Datadog, Splunk (SignalFX), and New Relic all reported impacts from 2021 AWS outages: Datadog reported delays that impacted multiple products, Splunk (SignalFX) confirmed that its AWS cloud metric syncer data ingestion was impacted, and New Relic told customers that some AWS Infrastructure and polling metrics were delayed in the US.
Just as having support and customer updates hosted on your core infrastructure is problematic, so is depending on a monitoring solution hosted within the environment being monitored – something noted in 2018 by Adrian Cockcroft, then a VP at AWS, who wrote in a blog: “The first thing that would be useful [for resilient monitoring] is to have a monitoring system that has failure modes which are uncorrelated with the infrastructure it is monitoring. For efficiency it is common to co-locate a monitoring system with the infrastructure, in the same datacenter or cloud region, but that sets up common dependencies that could cause both to fail together.”
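For teams wanting to sanity-check their own setup against Cockcroft’s point, a rough sketch of the idea in Python might look like the below. It simply flags any monitor that shares a provider or region with the thing it watches. The `Placement` type, the function names, and the provider and region labels are all hypothetical, invented here for illustration; nothing in it reflects how AWS or any monitoring vendor actually models its infrastructure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Placement:
    """Where a component runs. Both fields are illustrative labels only."""
    provider: str
    region: str


def shared_failure_domains(target: Placement, monitor: Placement) -> List[str]:
    """Return the failure domains a monitor shares with its target."""
    shared = []
    if monitor.provider == target.provider:
        shared.append("provider")
        # A region only counts as shared when the provider matches too.
        if monitor.region == target.region:
            shared.append("region")
    return shared


def correlated_monitors(
    target: Placement, monitors: Dict[str, Placement]
) -> Dict[str, List[str]]:
    """Map each monitor name to the domains it shares with the target.

    Monitors that share nothing with the target are left out of the result.
    """
    report = {}
    for name, placement in monitors.items():
        shared = shared_failure_domains(target, placement)
        if shared:
            report[name] = shared
    return report


if __name__ == "__main__":
    # Hypothetical workload and monitor placements.
    workload = Placement("aws", "us-east-1")
    monitors = {
        "in-region-agent": Placement("aws", "us-east-1"),
        "cross-region-probe": Placement("aws", "eu-west-1"),
        "external-probe": Placement("other-cloud", "region-a"),
    }
    for name, shared in correlated_monitors(workload, monitors).items():
        print(f"{name} shares: {', '.join(shared)}")
```

A monitor that appears in the report is not necessarily a problem (co-location is often chosen for efficiency, as Cockcroft notes), but at least one monitor should be absent from it if alerting is to survive a regional failure.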