Framework for Responding to Data Centre Emergencies
Tue 28 Jun 2016
As modern business becomes increasingly reliant on 24/7 IT services, data centres are obliged to ensure that they can provide resilience and keep working under all circumstances. This in turn means they must be prepared for exceptional events or emergencies and have plans in place to maintain continuous operation. Despite best practices being observed during site selection, extreme weather events such as Hurricane Sandy have shown that facilities may at any time be exposed to risks above and beyond those hazards from which they are normally protected.
Accordingly, a new White Paper from Schneider Electric, a global specialist in energy management and automation, describes how good preparation and process can quickly and safely mitigate the impact of emergencies, and help prevent them from recurring. White Paper 217, How to Prepare and Respond to Data Center Emergencies, describes a framework for an effective strategy arranged across three categories: Emergency Response Procedures, Emergency Drills and Incident Management.
White Paper 217 discusses each of the categories in detail and offers proactive considerations for data centre management during the planning stage. Emergency Response Procedures include: operational measures to deploy in the event of a crisis that safely isolate faults and restore service; a crisis management plan (CMP), which is a detailed step-by-step procedure to follow when an emergency strikes; and Escalation Procedures, which are documented, prioritised contact lists outlining internal contact requirements for specific situations related to data centre operations.
Emergency Drills, which should be developed in advance and scheduled to occur regularly, guide operators through what should be done to counter the top 10 identifiable operational risks.
Incident Management includes: incident notification, a process to inform the appropriate people about any safety or mission-critical event; incident identification and reporting, which ensures that all incidents are reported as soon as the situation is stabilised and that a brief summary of the incident is prepared for circulation to appropriate staff; and failure analysis, which is a comprehensive program to determine the root cause of any incident that has resulted in system downtime or personal injury, or had the potential to do so.
Communication and speed of response to emergency situations can greatly reduce the risk to operations caused by incidents. The White Paper’s recommendations focus strongly on the need for preparedness to anticipate problems before they arise, take appropriate speedy action when they do and communicate the lessons learned from each incident so that the worst effects are not repeated.
To respond effectively to different kinds of risks and crises in data centres, organisations must act quickly and know what to expect in unexpected situations. A proper operational methodology will help avoid common mistakes, and a good Emergency Preparedness plan, encompassing people, processes and systems, will help operators run their facilities in a more predictable and effective way.
For companies that may not have adequate facilities or expertise internally to respond to risks, there is the option of service offerings from specialist subject-matter experts, such as Schneider Electric, who have many years of experience in developing and deploying best practices in data centre crisis management.