AWS is scrambling to bring “additional cooling system capacity online” after overheating server racks in one of its critical US-EAST-1 region data centers knocked EC2 instances, OpenSearch and other cloud services offline.
The hyperscaler first reported the incident at 5:25pm PDT on Thursday (1:25am BST on Friday, May 8), saying EC2 instances and EBS volumes hosted on impacted hardware had been affected by a loss of power amid a “thermal event.”
Cryptocurrency exchange Coinbase was among the AWS customers affected, with trading halted.
AWS told customers via its status page at 8:06pm PDT that EC2 Instances, EBS Volumes, and other AWS Services were seeing “elevated error rates and latencies for some workflows. As part of our recovery effort, we have shifted traffic away from the impacted Availability Zone for most services.”
As The Stack published, AWS was telling affected customers that it was making progress resolving impaired instances and “working towards full recovery… In the impacted Availability Zone, EC2 Instances, EBS Volumes, and other AWS Services may continue to experience elevated error rates and latencies... Customers will continue to see some of their affected EC2 instances and EBS volumes as impaired until we achieve full recovery.”
It added: “We recommend customers utilize one of the other Availability Zones in the US-EAST-1 Region”; other AZs were unaffected by the issue.
The incident is the fourth AWS has recognised for US-EAST-1 so far in 2026, following three limited incidents in February, its service history shows.
Regions are geographically dispersed physical locations with separate power infrastructure, networking, and connectivity, and are “designed not to be simultaneously impacted by a shared fate scenario like utility power, water disruption, fiber isolation, earthquakes, fires, tornadoes, or floods.”
Each typically contains three or more Availability Zones (AZs), each consisting of “one or more independent data centers” and isolated from the others to limit service failure contagion.
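For customers following AWS’s advice to shift workloads to another AZ, a minimal boto3 sketch along these lines would enumerate the Region’s zones and pin a replacement instance to an unaffected one. (The AMI ID, instance type, and zone name below are placeholders for illustration, not details from this incident.)

```python
import boto3

# US-EAST-1 client; each Region has its own endpoint and its own set of AZs.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Enumerate the Region's Availability Zones that are currently available.
zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)["AvailabilityZones"]
for zone in zones:
    print(zone["ZoneName"], zone["State"])

# Launch a replacement instance pinned to a specific (unaffected) AZ.
# The AMI ID and instance type are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # any AZ other than the impacted one
)
```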
It’s getting hot in here…
Heating incidents in data centres can, on a bad day, escalate fast, as AWS learned in Japan back in 2019, when a cascade of minor issues added up to a severe outage in its AP-NORTHEAST-1 region.
In that incident, a bug in the third-party cooling control system triggered “excessive interactions between the control system and the devices in the datacenter” that ultimately knocked the control system offline.
The data centre default, if that happened, was for the cooling systems to go into maximum cooling mode until control system functionality was restored.
In one Tokyo DC, the system decided to shut down instead.
In that scenario, engineers can put the cooling system into “purge” mode to quickly exhaust hot air, “but this also failed”, and as temperatures rose past a trigger point, servers began powering themselves off.
With no temperature control system available, engineers had to manually check equipment and put systems into a maximum cooling configuration.
“During this process, it was discovered that the PLCs controlling some of the air handling units were also unresponsive. These controllers [also] needed to be reset.”
How’s that for a bad hair day in a data centre! Affected by the outage? Suffered a shocker in a DC yourself? We’d love to hear your horror stories.