Outages
Both primary and backup systems collapsed in 20 seconds, forcing the cancellation of thousands of flights.
"When incidents occur, the most important thing is fixing, learning and applying that knowledge institutionally. Finger-pointing and ceremonial sacrifices are not a mature response. Bias towards concrete action and banish politics."
Network configuration, CA and SWIFT issues, and certificate expiration blamed for a series of RTGS outages over the past year.
Customers howl after "ServiceNow identified an expired TLS cross-chain certificate affecting MID Server and instance-to-instance connectivity for ServiceNow customers"
The system "incorrectly determined that the healthy hosts were unhealthy and began redistributing shards..."
New details of global mega-outage revealed, with Microsoft blaming disastrous "usage spike" on an implementation error in its own response to a cyberattack.
AWS EC2 Windows instances also borked, with CrowdStrike's manual mitigation not working. Guidance available but...