Microsoft has "apologised deeply" for an Azure Portal outage late on Monday, March 15, 2021, and said it is working on a permanent fix for the authentication issue that caused it. The incident lasted from 19:00 to 21:00 GMT and saw Office 365, Teams, Xbox Live, Azure Active Directory (AD), and other services fail for millions of customers.
Blaming a cryptographic key rotation system gone awry, Redmond said Azure AD is in a "multi-phase effort to apply additional protections to the backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem", adding "we understand how incredibly impactful and unacceptable this is".
Those efforts will wrap up in mid-2021, Microsoft said in an early Root Cause Analysis (RCA) on its Azure page.
The incident comes after a similar authentication-related outage in September 2020 that lasted 17 hours. European customers also saw a three-hour outage in October, and a 90-minute outage in December.
Microsoft said: "The first phase of those SDP changes is finished, and the second phase is in a very carefully staged deployment that will finish mid-year. The initial analysis does indicate that once that is fully deployed, it will prevent the type of outage that happened today, as well as the related incident in September 2020."
Azure Portal Outage: What happened?
Systems started failing after an error occurred in the rotation of keys used to support Azure AD’s use of OpenID and other identity standard protocols for cryptographic signing operations.
"As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as 'retain' for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that 'retain' state, leading it to remove that particular key", an early Microsoft RCA said.
"Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end-users were no longer able to access those applications.
After rollback, time to mitigation for individual applications varied due to the range of server implementations that handle caching differently, Microsoft added, noting that "in [the] September incident, we also referred to our rollout of Azure AD backup authentication. That effort is progressing well. Unfortunately, it did not help in this case as it provided coverage for token issuance but did not provide coverage for token validation as that was dependent on the impacted metadata endpoint."
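The point about caching can be sketched in Python as well. The example below assumes two applications that cache the published key set with different lifetimes; the `CachedKeySet` class, the TTL values, and the timestamps (in seconds) are hypothetical and chosen only to show why recovery times differed after the rollback.

```python
class CachedKeySet:
    """Caches the published key IDs and refetches only after ttl_seconds has elapsed."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._keys = None
        self._fetched_at = None

    def keys(self, now):
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._keys = set(self._fetch())  # snapshot of the metadata at fetch time
            self._fetched_at = now
        return self._keys


# Hypothetical metadata endpoint, captured while key-a is missing.
published = {"keys": {"key-b"}}


def fetch():
    return published["keys"]


fast_app = CachedKeySet(fetch, ttl_seconds=300)
slow_app = CachedKeySet(fetch, ttl_seconds=3600)

# Both applications pick up the bad metadata at t=0.
fast_app.keys(now=0)
slow_app.keys(now=0)

# Rollback restores key-a in the published metadata.
published["keys"] = {"key-a", "key-b"}

fast_recovered = "key-a" in fast_app.keys(now=600)        # cache expired, refetched
slow_still_broken = "key-a" not in slow_app.keys(now=600)  # still serving stale cache
slow_recovered = "key-a" in slow_app.keys(now=3600)        # recovers once the TTL passes

print(fast_recovered, slow_still_broken, slow_recovered)
```

Under these assumptions the application with the shorter cache lifetime trusts the restored key within minutes, while the other keeps rejecting tokens until its cached copy expires.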