A major Facebook outage (including Workplace from Facebook, Instagram, and WhatsApp) appears to have been triggered by a botched configuration change to Facebook's BGP peering routers -- something that had a cascade effect across the internet as a flood of DNS traffic looking for the sites (which combined have nearly six billion users) triggered what amounted to a global DDoS attack on public DNS resolvers.
In making whatever change it did, Facebook appears to have inadvertently withdrawn all BGP routes to its own DNS name servers.
Facebook said late on October 4: "Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt. Our services are now back online and we’re actively working to fully return them to regular operations." The company added: "We want to make clear at this time we believe the root cause of this outage was a faulty configuration change."
The Internet is divided into thousands of Autonomous Systems, loosely defined as networks and routers under the same administrative control. BGP (Border Gateway Protocol) is the routing protocol used between Autonomous Systems. DNS servers provide the IP address for a name, but BGP provides the route by which traffic reaches that IP address.
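To illustrate why withdrawing routes to name servers is enough to make a site vanish, here is a minimal Python sketch. It is a toy model rather than real BGP or DNS: the `RoutingTable` class, the `resolve` helper, the `example.test` name, the documentation-range prefixes and the AS number 64500 are all hypothetical, chosen for illustration only.

```python
import ipaddress
from typing import Dict, Optional


class RoutingTable:
    """Toy stand-in for the routes a network has learned over BGP."""

    def __init__(self) -> None:
        # Maps each announced prefix to the AS that announced it.
        self.routes: Dict[object, int] = {}

    def announce(self, prefix: str, origin_as: int) -> None:
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix: str) -> None:
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address: str) -> Optional[int]:
        """Return the origin AS of the most specific matching prefix, if any."""
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def resolve(name: str, name_server_ip: str, zone: Dict[str, str],
            table: RoutingTable) -> Optional[str]:
    """Answer a lookup only if the name server itself is reachable."""
    if table.lookup(name_server_ip) is None:
        return None  # no route to the name server, so the query never arrives
    return zone.get(name)


table = RoutingTable()
table.announce("192.0.2.0/24", 64500)     # hypothetical prefix holding the name server
table.announce("198.51.100.0/24", 64500)  # hypothetical prefix holding the web servers
zone = {"example.test": "198.51.100.10"}

before = resolve("example.test", "192.0.2.53", zone, table)
table.withdraw("192.0.2.0/24")            # only the name server prefix is withdrawn
after = resolve("example.test", "192.0.2.53", zone, table)

print(before)  # 198.51.100.10
print(after)   # None
```

In this toy model the web servers' prefix is still announced after the withdrawal, yet the name can no longer be resolved, so clients have no way of finding them.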
A peering exchange, meanwhile, is a place where different networks interconnect by establishing BGP sessions between their routers. Until recently, setting up those sessions involved a complex and largely manual process of checking and approving peering requests from ISPs and others. Facebook recently automated that process. Issues may have arisen here.
The outage, reports suggest, caused such severe networking borkage that even Facebook LANs went down.
New York Times reporter Sheera Frenkel tweeted: "Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren’t working to access doors."
Workplace from Facebook (the company's Slack/Teams enterprise collaboration alternative), which has over seven million enterprise users, is also down. Major blue chip clients include Nestlé and insurer Zurich.
As Cloudflare CTO John Graham-Cumming noted: "Between 15:50 UTC and 15:52 UTC Facebook and related properties disappeared from the Internet in a flurry of BGP updates. Now, here's the fun part. Cloudflare runs a free DNS resolver, 1.1.1.1, and lots of people use it. So Facebook etc. are down... guess what happens? People keep retrying. Software keeps retrying. We get hit by a massive flood of DNS traffic asking for facebook.com."
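The amplification described here can be sketched with some back-of-the-envelope Python. Everything in it is an assumption made for illustration: the `queries_in_window` function, the ten-minute window, the 300-second TTL and the three retries are hypothetical figures, and the model assumes clients cache successful answers for the full TTL while failed lookups are not cached at all.

```python
def queries_in_window(window_s: int, request_every_s: int, ttl_s: int,
                      retries_on_failure: int, healthy: bool) -> int:
    """Count the queries one client sends to a resolver during the window."""
    queries = 0
    cached_until = -1  # time until which a cached answer remains valid
    for now in range(0, window_s, request_every_s):
        if healthy and now < cached_until:
            continue  # answered from the client's own cache
        if healthy:
            queries += 1
            cached_until = now + ttl_s
        else:
            # Nothing to cache, so the first attempt and every retry go out.
            queries += 1 + retries_on_failure
    return queries


healthy = queries_in_window(600, 10, 300, 3, healthy=True)
failing = queries_in_window(600, 10, 300, 3, healthy=False)

print(healthy)  # 2
print(failing)  # 240
```

Under these assumed numbers a single client goes from two queries in ten minutes to 240, which hints at how a failure affecting billions of users can resemble a denial-of-service attack on resolvers.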
"Teams at Cloudflare have to get spun up to make sure things keep running smoothly during the onslaught. Good reminder the Internet is a network of networks that works through standards and cooperation."
A post on Reddit purporting to be from a Facebook staffer (its rapid deletion lends some credibility to that claim, although we could not verify it) noted: "As many of you know, DNS for FB services has been affected and this is likely a symptom of the actual issue, and that's BGP peering with Facebook peering routers has gone down [sic], very likely due to a configuration change that went into effect shortly before the outages happened."
They added: "There are people now trying to gain access to the peering routers to implement fixes, but the people with the physical access is [sic] separate from the people with the knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures."
Facebook BGP peering
A May 2021 blog post on the Facebook Engineering page -- currently inaccessible but recovered by The Stack through the ever-handy Wayback Machine [donate here] -- noted that "Initially, we managed peering via a time-intensive manual process. But there is no industry standard for how to set up a scalable, automatic peering management system. So we’ve developed a new automated method, which allows for faster self-service peering configuration. Before developing our automated system... Peers would email us to request to establish peering sessions. Next, one of our Edge engineers would verify the email and check our mutual traffic levels. To confirm the traffic levels were appropriate, that team member had to check numerous internal dashboards, reports, and rulebooks, as well as external resources, such as the potential peer’s PeeringDB record.
"The team member then would use a few internal tools to configure BGP sessions, reply back to the peer, and wait for the peer to configure their side of the network. This approach had several problems. First, there was no centralized place to see the incoming peering requests or the existing peering status. Requests could arrive over email, or several other out-of-band systems. Edge engineers had to track, parse, and hand-verify every request. Next, for each request, the team member had to manually launch and monitor an in-house tool for each peer, and then, once finished, type a response to each peering request. At last count, we estimate that this process took more than nine hours per week — wasting a whole day of each workweek on a needlessly manual process."
Whether this new system or another issue in the BGP mix is to blame, the pressure of trying to recover services for nearly six billion users must be gargantuan. Hugops to all involved in the recovery effort.
Matthew Hodgson, co-founder and CEO of messenger Element and technical co-founder of the decentralised Matrix network, noted that the outages "highlight that global outages are one of the major downsides of a centralised system. Centralised apps mean that all the eggs are in one basket. When that basket breaks, all the eggs get smashed. We saw the same last week when Slack went down. Decentralised systems are far more reliable. There’s no single point of failure so they can withstand significant disruption and still keep people and businesses communicating."