Facebook disaster-proofs data centres with Project Storm
Thu 1 Sep 2016
At Facebook’s @Scale conference this week, the company opened up about the rigorous stress testing it conducts to help prevent data loss in the event of natural disasters. The stress-testing program, code-named Project Storm, involves simulating massive outages in data centers to test network resiliency.
Facebook VP and head of Infrastructure, Jay Parikh, said that 2012’s Hurricane Sandy was a ‘wake-up call’ for Facebook. It prompted the company to realize that it needed to plan to keep its users online in the event of a natural disaster.
From late 2012 through 2014, the company conducted a series of ‘mini-drills’, during which it drained traffic from different data centers and measured the effects on traffic flow and user experience. After reviewing the results and making what corrections it could, the Project Storm team completed the first large-scale simulation in 2014, during which it took down an entire data center.
The first two large-scale simulations resulted in minor problems for users, but major problems were uncovered behind the scenes. One of the most important adjustments that had to be made was in traffic management and load balancing during an outage. Another was the development of a standardized runbook that outlines each step of bringing a data center back online.
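To illustrate the kind of traffic-management question such a drill raises, here is a minimal sketch in Python of what happens to load when one data center is drained. It is not Facebook's method, which the talk did not describe in detail. The function name `drain`, the site names, and the rule of spreading displaced traffic in proportion to each remaining site's spare capacity are all assumptions made for this example.

```python
def drain(load, capacity, target):
    """Move all traffic off `target` and spread it across the other sites.

    Hypothetical model: `load` and `capacity` map site name -> traffic units,
    and displaced traffic is shared in proportion to each site's spare capacity.
    """
    if target not in load:
        raise KeyError(target)

    displaced = load[target]
    # Spare capacity of every site that stays online (never negative).
    spare = {
        site: max(0.0, capacity[site] - load[site])
        for site in load
        if site != target
    }
    total_spare = sum(spare.values())

    # The drill's key question: can the remaining sites absorb the traffic?
    if displaced > total_spare:
        raise RuntimeError(
            f"draining {target} needs {displaced} units, "
            f"but only {total_spare} are spare"
        )

    new_load = {target: 0.0}
    for site, room in spare.items():
        share = displaced * room / total_spare if total_spare else 0.0
        new_load[site] = load[site] + share
    return new_load


# Illustrative numbers only, not taken from the article.
load = {"A": 40.0, "B": 30.0, "C": 30.0}
capacity = {"A": 60.0, "B": 60.0, "C": 50.0}
after = drain(load, capacity, "C")
```

In this toy setup, draining site C moves its 30 units onto A and B, with B taking the larger share because it has more headroom. If the remaining sites lacked the headroom, the function would raise an error rather than overload them, which is the sort of gap a drill is meant to expose before a real outage does.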
While Project Storm refers to a ‘SWAT team’ of employees who focus on emergency procedures and drills, the data center outage simulations involve the entire engineering team, plus employees from different groups throughout the company. Project Storm is still active two years after its first large-scale outage simulation. Parikh revealed that the company continues to perform drills regularly and has expanded their scope to test different types of failures as well.
“We’re solving unprecedented problems, problems that are not being solved elsewhere in the industry,” said Parikh. “Every day is chock-full of lots and lots of scale problems for us. So we need to build a resilient set of services.”
When he encountered resistance to the idea of taking down a data center completely, Parikh responded, “You have to send the message to the entire company that you care about scale and resiliency.”