(NEXSTAR/AP) – Facebook and its subsidiaries Instagram and WhatsApp suffered a six-hour outage Monday. What caused the social media meltdown?
“We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible,” said Facebook Chief Technology Officer Mike Schroepfer on Twitter at 12:52 p.m. Pacific Time – three hours into the outage.
The company posted an update Monday evening with further explanation: “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
Doug Madory, director of internet analysis for Kentik Inc., said it appears that Facebook withdrew “authoritative DNS routes” that let the rest of the internet communicate with its properties.
Such routes point the rest of the internet to the servers that answer for Facebook in the Domain Name System, a key structure that determines where internet traffic needs to go. DNS translates an address like “facebook.com” to a numeric IP address. If Facebook’s DNS servers could no longer be reached, apps and web browsers would be unable to locate it.
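To illustrate the mechanism Madory describes, here is a toy Python model. It is not how real routers or resolvers work; it only captures the dependency that a name can be resolved only if a route to the server that answers for it is still announced. Every domain name, address and prefix in it is hypothetical (the addresses come from ranges reserved for documentation), and `reachable` and `resolve` are illustrative helpers, not any real library's API.

```python
import ipaddress

# Toy "routing table": the set of prefixes currently announced to the internet.
# Hypothetical documentation ranges (RFC 5737), not Facebook's real addresses.
routes = {
    ipaddress.ip_network("192.0.2.0/24"),     # hypothetical: authoritative DNS servers
    ipaddress.ip_network("198.51.100.0/24"),  # hypothetical: web servers
}

# Hypothetical authoritative nameserver and the single record it serves.
NAMESERVER = ipaddress.ip_address("192.0.2.53")
ZONE_RECORDS = {"www.social.example": ipaddress.ip_address("198.51.100.10")}


def reachable(addr, table):
    """An address is reachable only if some announced prefix covers it."""
    return any(addr in prefix for prefix in table)


def resolve(name, table):
    """Ask the authoritative nameserver for a name; fail if it can't be reached."""
    if not reachable(NAMESERVER, table):
        raise LookupError(f"cannot reach nameserver {NAMESERVER} to resolve {name}")
    return ZONE_RECORDS[name]


print(resolve("www.social.example", routes))  # prints 198.51.100.10

# Withdraw the route covering the nameservers.
routes.discard(ipaddress.ip_network("192.0.2.0/24"))
try:
    resolve("www.social.example", routes)
except LookupError as err:
    print(err)  # the lookup now fails
```

In this model, the prefix holding the web server is still announced after the withdrawal, yet nothing can look up its address. That is the sense in which losing the routes to DNS servers can leave apps and browsers unable to find a service at all.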
Madory said there was no sign that anyone but Facebook was responsible and discounted the possibility that another major internet player, such as a telecom company, might have inadvertently rewritten major routing tables that affect Facebook. “No one else announced these routes,” said Madory.
Computer scientists speculated that a bug introduced by a configuration change in Facebook’s routing management system could be to blame. Columbia University computer scientist Steven Bellovin tweeted that he expected Facebook would first try an automated recovery in such a case. If that failed, it could be in for “a world of hurt,” he added, because it would need to order manual changes at outside data centers.
“What it boils down to: running a LARGE, even by Internet standards, distributed system is very hard, even for the very best,” Bellovin tweeted.
Editor’s note: This story was updated Monday, Oct. 4 at 8:04 p.m. Pacific Time to include Facebook’s explanation of the outage.