Google Cloud cut off developer platform provider Railway’s API, VMs, and database instance with no warning, via an automated action.
That took its production environment (and customer workloads) down completely.
Railway says it spends some $2 million a month on GCP.
Its services have now been restored after eight hours.
Its founder, Jake Cooper, was left fuming. The incident was “beyond insane,” he raged on social media.
Railway added in a swift incident post mortem that it will now strip GCP services from “our data plane’s hot path, and keep them only for secondary/failover.”
It had previously suffered an incident in 2023 that Cooper blamed on GCP automatically lowering its quota and then refusing to upgrade it. “We’ve had such a litany of issues I don’t know how anybody uses this product,” he wrote at the time (but apparently kept on using it).
It’s a poor look for GCP, coming 24 months after Google Cloud deleted the entire Google Cloud VMware Engine (GCVE) environment it hosted across two regions for Australian investment firm UniSuper.
(That incident was later blamed by an apologetic Google on an “inadvertent misconfiguration of the GCVE service by Google operators due to leaving a parameter blank. This had the unintended and then unknown consequence of defaulting the customer’s GCVE Private Cloud to a fixed term, with automatic deletion at the end of that period.”)
See also: Google praises UniSuper’s CIO after GCP error deleted $124 billion firm’s entire private cloud
Railway said in an FAQ that it runs hardware in eight sites across four locations around the world, but “at the start of 2026, due to demand on our systems, we have bursted back onto the cloud on AWS and GCP.
“A subset of non-latency sensitive customers and Enterprise customers are using a public cloud for their hosts. However, when we migrated fully onto Metal in Mar. 2025, we kept our API and our DB on GCP as we felt that leaving that workload was well within our risk model.”
Criticised by some users for failing to architect for resilience, the company explained in its post mortem that its network is “a mesh ring, built up of high availability fiber interconnects between Metal <> GCP <> AWS.
“However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud,” it said.
“Railway's edge proxies maintain a cache of routing tables from the network control plane, which is hosted within Google Cloud. While that cache held, workloads on Railway Metal and AWS continued to serve traffic. Once the cache expired, the edge could no longer resolve routes to active instances, and workloads across all regions, including Metal and AWS, began returning 404 errors. This caused the network outage impact to cascade beyond Google Cloud, into these regions as well, even though the workloads themselves remained online,” it added.
“We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage.”
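The cascade Railway describes, where edge proxies kept serving from a cached routing table until it expired and then returned 404s, can be illustrated with a toy model. This is only a sketch of the general failure mode, not Railway's code: `RouteCache`, `ControlPlaneDown`, `simulate`, the 60-second TTL, and the host and backend names are all hypothetical. The `serve_stale` flag shows one common mitigation (falling back to the last known route when the control plane is unreachable); Railway has not said it uses or plans to use this approach.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


class ControlPlaneDown(Exception):
    """Raised by the hypothetical control-plane client when it is unreachable."""


@dataclass
class RouteCache:
    fetch: Callable[[str], str]   # hypothetical lookup: hostname -> backend
    ttl: float                    # seconds a cached route counts as fresh
    clock: Callable[[], float]    # injectable clock, so the demo needs no sleeping
    serve_stale: bool = False     # assumption: optional fallback to expired entries
    _entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, host: str) -> Optional[str]:
        now = self.clock()
        cached = self._entries.get(host)
        # Fresh entry: answer from cache without touching the control plane.
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]
        try:
            backend = self.fetch(host)
        except ControlPlaneDown:
            # Control plane unreachable and the entry is missing or expired.
            if cached is not None and self.serve_stale:
                return cached[0]
            return None  # an edge proxy would answer 404 here
        self._entries[host] = (backend, now)
        return backend


def simulate(serve_stale: bool) -> List[Optional[str]]:
    now = [0.0]
    control_plane_up = [True]

    def fetch(host: str) -> str:
        if not control_plane_up[0]:
            raise ControlPlaneDown(host)
        return "metal-1"  # the workload itself stays healthy throughout

    cache = RouteCache(fetch=fetch, ttl=60.0, clock=lambda: now[0],
                       serve_stale=serve_stale)
    results = [cache.resolve("app.example")]       # t=0: warm the cache
    control_plane_up[0] = False                    # provider cuts off the control plane
    now[0] = 30.0
    results.append(cache.resolve("app.example"))   # t=30: cache still fresh
    now[0] = 120.0
    results.append(cache.resolve("app.example"))   # t=120: cache expired
    return results


if __name__ == "__main__":
    print(simulate(serve_stale=False))
    print(simulate(serve_stale=True))
```

In the first run the final lookup fails even though the backend never went down, which mirrors the reported behaviour on Metal and AWS. Serving stale routes trades that outage for the risk of sending traffic to instances that have since moved, so it is a design choice and not a free fix.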
Google Cloud had not responded to a request for comment as The Stack published. We await its explanation with some interest.