Updated 9:55 GMT, March 16, 2021. Microsoft has blamed a bug that saw an automated system remove a cryptographic key that should have been retained.
The Azure Portal and a raft of other Microsoft services including Teams, Office 365, Dynamics, and Xbox Live failed for over two hours late on Monday, March 15, in an incident Microsoft initially blamed on "a recent change to an authentication system".
The company added late Monday: "The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD's use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as 'retain' for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that 'retain' state, leading it to remove that particular key."
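Microsoft has not published the code involved, but the class of bug it describes can be sketched in a few lines: an automated pruner that deletes expired signing keys, with a "retain" flag meant to exempt keys from deletion. All names here are hypothetical, and the snippet is a minimal illustration of the failure mode, not Microsoft's actual system.

```python
from datetime import datetime, timedelta, timezone

class SigningKey:
    def __init__(self, key_id, expires_at, retain=False):
        self.key_id = key_id
        self.expires_at = expires_at
        self.retain = retain  # override: keep the key past normal rotation

def prune_expired_keys(keys, now):
    """Remove expired keys, honoring the 'retain' override.

    The bug Microsoft describes is equivalent to dropping the
    `k.retain` check below: the automation would then delete a key
    explicitly marked to be kept, invalidating tokens signed with it.
    """
    return [k for k in keys if k.retain or k.expires_at > now]

now = datetime(2021, 3, 15, 19, 15, tzinfo=timezone.utc)
keys = [
    SigningKey("migration-key", now - timedelta(days=30), retain=True),
    SigningKey("stale-key", now - timedelta(days=30)),
    SigningKey("current-key", now + timedelta(days=30)),
]
kept = prune_expired_keys(keys, now)
print([k.key_id for k in kept])  # retained and current keys survive
```

Because OpenID Connect relying parties validate token signatures against the published key set, removing a key that is still in use causes signature checks, and hence sign-ins, to fail across every dependent service at once.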
"A subset of customers using Azure Active Directory may experience an 'Experiencing authentication issues' error when logging into the portal. We will provide more information as it is provided", Azure said in a separate update, which put the start of the issue at 19:15 GMT and said engineering teams were investigating. Users noted that most Azure AD management functions were offline, as were Azure Security portals: "Calls to MS Support indicate they are impacted as well, since they cannot retrieve customer data (due to outage)", as one IT leader noted.
Microsoft also suffered authentication-related outages in September 2020.
In a later root cause analysis (RCA), the company blamed that incident on a "service update targeting an internal validation test ring [that] was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process."
The company added after that incident: "Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries. Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days. In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system's ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade. Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue."
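The ring-targeting logic Microsoft describes can be illustrated with a short, hypothetical sketch: a deployment system reads a target ring from metadata and rolls out to that ring only. The RCA describes a defect in which corrupted metadata caused every ring to be targeted at once; the defensive check below shows one way such a system can fail safe instead. Ring names and the metadata shape are assumptions for illustration, not Microsoft's actual SDP design.

```python
# Hypothetical phased-rollout rings, ordered from safest to riskiest.
RINGS = ["validation", "inner", "ring2", "ring3", "production"]

def rings_to_target(metadata):
    """Return the deployment targets for one rollout phase.

    A healthy phased rollout targets exactly one ring at a time.
    If the metadata is corrupted or unrecognized, refusing to deploy
    is safer than falling back to fanning out across all rings.
    """
    ring = metadata.get("target_ring")
    if ring not in RINGS:
        raise ValueError(f"unrecognized target ring: {ring!r}")
    return [ring]

print(rings_to_target({"target_ring": "validation"}))
```

The point of the sketch is the failure boundary: when the system cannot interpret its deployment metadata, halting the rollout confines the blast radius to zero rings, whereas the defect Microsoft describes effectively widened it to all five.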
A 17-hour Microsoft 365 outage in 2019 for users with multi-factor authentication (MFA) set up, meanwhile, was caused by requests from MFA servers to a Redis cache in Europe reaching an "operational threshold causing latency and timeouts", Microsoft told customers in late November 2019. To mitigate the issue, engineers deployed a hotfix that broke the connection between Azure's MFA service and an unnamed backend service. They then cycled the impacted servers, allowing authentication requests to succeed. (Yes, they turned it off and on again...)
Are you affected? Want to vent? Drop us a line.