Over 90 Google Cloud services were knocked offline in its Paris region after a major data centre incident – believed to have been caused by a water leak triggering a fire in a battery room of a co-location data centre.
The incident also briefly triggered the complete global outage of its Cloud Console. (“Primary impact was observed from 2023-04-25 23:15:30 PDT to 2023-04-26 03:38:40 PDT” Google Cloud confirmed.)
Google Cloud blamed “water intrusion”, telling customers on April 27 that it expects to see “an extended outage for some services” following the incident at a data centre owned and operated by Global Switch. (Cloud providers, like other customers, often rent racks in co-location data centres run by third parties.)
Google first reported the incident as “an issue affecting multiple Cloud services in the europe-west9-a zone” late on April 25. It has since restored some Google Cloud Paris region services, but others remain affected.
UPDATED May 1: Anthos Service Mesh, Google Compute Engine, Google Kubernetes Engine, Cloud Memorystore, Google Cloud Bigtable, Persistent Disk, Google Cloud SQL, Database Migration Service, Google Cloud Dataproc and Cloud Filestore were all still affected a week on. “Impact is now limited to services in europe-west9-a. The impact for Cloud Bigtable continues in europe-west9-a. For the remaining products, impact is limited to instances located in the affected data center. Previously unaffected instances for these products will continue to work with no impact. There is no ETA for full recovery of affected instances in europe-west9-a at this time. We expect to see extended outages for these resources” Google Cloud told customers on May 1.
The incident in its europe-west9 region “caused a multi-cluster failure that led to a shutdown of multiple zones. Impact is now limited to services in europe-west9-a. There is no ETA for full recovery of operations in europe-west9-a at this time. We expect to see extended outages for some services. Customers are advised to failover to other zones/regions if they are impacted” Google Cloud updated customers on April 27 at 11:53am Paris time.
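For readers wondering what “failover to other zones/regions” looks like in practice, the sketch below is a minimal, hypothetical illustration in Python; the service endpoints and health-check path are placeholders we have invented for the example, not Google or customer systems. The idea is simply to prefer the primary zone but fall back to a sibling zone, and then to another region, when health checks stop answering.

```python
# Minimal, hypothetical sketch of the "failover to other zones/regions" advice:
# try the primary zone's endpoint first, then a sibling zone, then another region.
# All endpoint URLs below are illustrative placeholders, not real services.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://svc.europe-west9-a.example.internal/healthz",  # primary zone (affected)
    "https://svc.europe-west9-b.example.internal/healthz",  # sibling zone
    "https://svc.europe-west1-b.example.internal/healthz",  # different region
]

def pick_healthy_endpoint(endpoints, timeout=2.0):
    """Return the first endpoint whose health check answers 200, else None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # endpoint unreachable: fall through to the next zone/region
    return None

if __name__ == "__main__":
    target = pick_healthy_endpoint(ENDPOINTS)
    print(target or "no healthy zone or region found - escalate")
```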
Global Switch said Wednesday: “A fire incident has occurred in a room at one of the two data centres in our Paris campus this morning. The Fire Brigade has been in attendance and the fire is now contained.
“The fire response systems in the building have performed as designed and no one has been injured. A number of customers have been temporarily affected and our site team is working to restore services” it added.
The fire was sparked after a cooling system water pump failure let water leak into the battery room, according to a post to the French Network Operators Group. (The Stack appreciates that this will raise questions about resilience and coolant design that we hope to have more answers on in future…)
It has also triggered lively debate about Google Cloud’s resilience – after damage to a single availability zone (AZ) led to the general unavailability of a region. Many customers assumed outright physical separation of AZs, rather than their being housed in a single data centre with separation across networks and power supplies.
(“A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region. To protect against the loss of an entire region due to natural disaster, have a disaster recovery plan and know how to bring up your application in the unlikely event that your primary region is lost” Google Cloud advises in its guidance on availability.)
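As a rough illustration of that guidance, the sketch below counts Compute Engine instances per zone in a region and flags when everything sits in a single failure domain such as europe-west9-a. It assumes the google-cloud-compute Python client library and default credentials; the project ID is a placeholder, and this is our own sketch rather than code from Google or any affected customer.

```python
# Sketch: audit how VM instances are spread across the zones of one region,
# so a single-zone failure (as in europe-west9-a) doesn't take everything down.
# Assumes the google-cloud-compute client library and default credentials.
from collections import Counter
from google.cloud import compute_v1

PROJECT = "my-project"   # placeholder project ID
REGION = "europe-west9"  # the region hit by the incident

def instances_per_zone(project: str, region: str) -> Counter:
    """Count Compute Engine instances in each zone of the given region."""
    zones_client = compute_v1.ZonesClient()
    instances_client = compute_v1.InstancesClient()
    counts = Counter()
    for zone in zones_client.list(project=project):
        if not zone.name.startswith(region + "-"):
            continue  # skip zones outside the region of interest
        counts[zone.name] = sum(
            1 for _ in instances_client.list(project=project, zone=zone.name)
        )
    return counts

if __name__ == "__main__":
    spread = instances_per_zone(PROJECT, REGION)
    for zone, n in sorted(spread.items()):
        print(f"{zone}: {n} instances")
    if sum(1 for n in spread.values() if n) < 2:
        print("Warning: all instances are in one zone - a single failure domain.")
```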
n.b. When things start going south in data centres, fixes can be harder to get right than many assume. A post-mortem from an AWS data centre incident in Tokyo gives a colourful example of cascading issues...
Affected? Views on the incident? Get in touch.