Data centre resilience: mitigate against the cost of downtime
Fri 29 Dec 2017 | Mike Elms
Mike Elms, national sales manager, Socomec Power Conversion, discusses optimising your data centre for ever-changing requirements
The resilience of a data centre – the ability to remain operational even when there has been a power outage, hardware failure or other unforeseen disruption – is becoming increasingly critical for today’s data centre manager.
Combined with spiralling energy costs, legislative requirements and stringent environmental policy, the proliferation of big data and rise of the data centre make it increasingly important to mitigate the cost of downtime.
Whether it stems from lost productivity, lost revenue, the cost of recovering a system, longer-term customer attrition or reputational damage, the total cost of downtime can be crippling.
Cause and effect
A typical server system can experience more than 125 events each month – 88% of which are attributed to surges and transient events. Blackouts caused by accidental events, short circuits, switching on of heavy loads and overloads – as well as weather events – can all negatively impact trading and revenues, resulting in the loss of data and hardware damage or disk crashes.
Impurities in the power supply also take their toll: they are a culprit in data corruption and wear to electronic parts, and sometimes the cause of irreparable component failure. They are attributed to a range of factors including spikes, lightning, surges, harmonics and noise. The need for resilience against both impurities and blackouts is brought into sharp relief when the total potential cost of downtime is considered.
When it comes to voltage quality problems and associated downtime, every situation is uniquely demanding. The Information Technology Industry Council (ITIC) has developed a power acceptability curve that comes close to providing global agreement as to what an electrical load should withstand.
The CBEMA curve, the predecessor of the ITIC curve, is one of the most frequently used power acceptability curves in the industry. The four-tier system developed by The Uptime Institute is a classification approach to site infrastructure functionality that was widely adopted in 1995 and has become one of the industry's standards for data centre reliability.
The classification is designed to provide a common standard and industry benchmark, and the goal is to develop system architecture with 99.999% availability – or ‘five nines’. But how achievable is 99.999% availability? How can data centres of all shapes and sizes protect against threats to their resilience and achieve a continuous and high quality power supply?
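To put an availability figure into perspective, it helps to convert it into the downtime it permits over a year. The following is a minimal sketch of that arithmetic; the function name is illustrative, and a 365-day year is assumed.

```python
# Convert an availability percentage into the downtime it allows per year.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {annual_downtime_minutes(pct):.2f} minutes per year")
```

On these assumptions, ‘five nines’ leaves a budget of roughly five and a quarter minutes of downtime per year.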
Can you be sure of 99.999% availability?
The delivery of a reliable, safe, high quality power supply requires an optimised combination of vital factors. Configurable redundancy, no single point of failure, devices designed for superior robustness, anomaly detection, rapid repair time and maintenance based on hot-swap modules are all key when considering improving resilience.
Furthermore, it is increasingly important to consider the complete economic model when specifying equipment and system upgrades, treating investment as a strategic asset rather than a short-term cost burden.
Resilience through modularity
The flexibility of a modular architecture enables an organisation to adapt rapidly to changing requirements. Rightsizing through modularity in design enables the power protection capacity to be added when it is needed to meet actual or existing demand, instead of total upfront deployment.
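As a rough illustration of rightsizing, the sketch below counts the modules needed to carry a given load, plus one redundant module (N+1). The module rating and the load figures are hypothetical values chosen for the example, not taken from any product range.

```python
import math


def modules_required(load_kw: float, module_kw: float, redundant_modules: int = 1) -> int:
    """Modules needed to carry the load, plus redundant spares (N+1 by default)."""
    if load_kw < 0 or module_kw <= 0:
        raise ValueError("load must be non-negative and module rating positive")
    return math.ceil(load_kw / module_kw) + redundant_modules


MODULE_KW = 25.0  # hypothetical rating of one power module

day_one = modules_required(60.0, MODULE_KW)     # actual load at commissioning
projected = modules_required(180.0, MODULE_KW)  # forecast load some years out

print(f"Install now: {day_one} modules; add up to {projected - day_one} later if the forecast holds")
```

With these figures, four modules cover the day-one load and a further five are bought only if the projected load materialises.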
This approach minimises wasted capacity when actual future loads differ from projected loads. Furthermore, while redundancy provides an attractive MTBF, the rapid repair times associated with a modular configuration can reduce MTTR to a level that enables ‘six nines’ to be achieved – 99.9999% availability.
By working directly with a manufacturer with intricate knowledge of a system, it is possible to identify and replace a defective module in less than 30 minutes.
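The link between repair time and availability follows from the conventional steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below compares an eight-hour repair with a 30-minute module swap. The system MTBF of 500,000 hours and the eight-hour repair time are hypothetical figures chosen purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


SYSTEM_MTBF_HOURS = 500_000.0  # hypothetical MTBF of a redundant system

slow_repair = availability(SYSTEM_MTBF_HOURS, 8.0)  # assumed engineer call-out and repair
hot_swap = availability(SYSTEM_MTBF_HOURS, 0.5)     # 30-minute module replacement

print(f"8 h repair:      {slow_repair * 100:.5f}%")
print(f"30 min hot swap: {hot_swap * 100:.5f}%")
```

Under these assumptions the MTBF is unchanged, yet cutting the MTTR from eight hours to 30 minutes lifts availability from roughly four nines to six nines.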
Availability and resilience can also be optimised through proactive monitoring, which expedites remedial action where required and so reduces the MTTR. The status of key operating parameters can be tracked in real time, delivering a greater degree of agility and accuracy – both virtual and physical anomalies can be addressed rapidly, in turn achieving maximum uptime and reduced operating expenditure.
Energy quality and resilience
By fitting permanent power quality monitoring systems, it is possible to check the reliability, efficiency and safety of an organisation’s electrical system. Data is collected and analysed to diagnose problems, identify deterioration in performance and highlight areas of risk – as well as locating the causes of electrical disturbances.
The latest network analyser equipment helps to ensure that the electrical system runs continuously and efficiently. By measuring electrical parameters and status, analysing the quality of energy according to Class A of IEC 61000-4-30, and measuring differential current while also providing GPS synchronisation, such equipment helps to minimise downtime and production losses, improve efficiency, and optimise running and maintenance costs.
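The kind of analysis a monitoring system performs can be illustrated with a much simplified sketch that flags RMS voltage readings as dips, swells or interruptions. This is not an implementation of the IEC 61000-4-30 Class A measurement method. The nominal voltage, the thresholds (90%, 110% and 10% of nominal) and the sample readings are all assumptions chosen for the example.

```python
NOMINAL_V = 230.0              # assumed nominal phase voltage
DIP_THRESHOLD = 0.90           # assumed: below 90% of nominal counts as a dip
SWELL_THRESHOLD = 1.10         # assumed: above 110% of nominal counts as a swell
INTERRUPTION_THRESHOLD = 0.10  # assumed: below 10% of nominal counts as an interruption


def classify_rms(voltage_rms: float, nominal: float = NOMINAL_V) -> str:
    """Label a single RMS voltage reading relative to the nominal voltage."""
    ratio = voltage_rms / nominal
    if ratio < INTERRUPTION_THRESHOLD:
        return "interruption"
    if ratio < DIP_THRESHOLD:
        return "dip"
    if ratio > SWELL_THRESHOLD:
        return "swell"
    return "normal"


# Hypothetical sequence of RMS readings from one phase.
readings = [230.1, 229.4, 198.0, 12.0, 231.0, 257.5]

events = [
    (index, volts, classify_rms(volts))
    for index, volts in enumerate(readings)
    if classify_rms(volts) != "normal"
]

for index, volts, label in events:
    print(f"sample {index}: {volts} V -> {label}")
```

A real analyser would also record the duration and timestamp of each event, which is where time synchronisation becomes useful for correlating disturbances across a site.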
Power hungry cooling systems
All facilities have dynamic environments, making it difficult to manage thermal airflow. The challenge is to match the cooling delivered to a facility with the heat generated by the current IT load, all of which needs to be monitored.
Socomec’s ATyS automatic transfer switches enhance power availability and simplify the electrical architecture, ensuring standby and alternate power availability. Fully certified to BS EN 60947-6-1, the ATyS family provides a manufacturer-built, fully programmable ATS that can be integrated into the data centre management system via serial communication. When fitted with a maintenance bypass, the ATS can be commissioned, tested and inspected with no downtime for the mechanical loads it typically serves.
Often mistakenly overlooked in favour of circuit breakers, fuses provide a compact protection solution for low-current (~32 A) loads fed from a busbar or PDU with a high prospective short-circuit current. As energy efficiency is improved by reducing the distance and impedance between transformer and final load, prospective short-circuit currents can approach 100 kA.
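To see why low impedance drives fault levels up, the sketch below estimates the prospective three-phase short-circuit current at the secondary terminals of a transformer from its rating and percentage impedance. It assumes an infinitely stiff upstream supply and ignores cable impedance and any motor contribution, so it gives an upper-bound estimate at the terminals only. The transformer figures are hypothetical.

```python
import math


def transformer_fault_current_ka(rating_kva: float, secondary_voltage_v: float,
                                 impedance_percent: float) -> float:
    """Estimate the prospective three-phase fault current (kA) at the transformer secondary.

    Assumes an infinite upstream source and no downstream cable impedance.
    """
    full_load_current_a = rating_kva * 1000.0 / (math.sqrt(3) * secondary_voltage_v)
    fault_current_a = full_load_current_a / (impedance_percent / 100.0)
    return fault_current_a / 1000.0


# Hypothetical 2,500 kVA, 400 V transformer with 6% impedance.
fault_ka = transformer_fault_current_ka(2500.0, 400.0, 6.0)
print(f"Prospective fault current: {fault_ka:.1f} kA")
```

With these figures the estimate is about 60 kA. A lower transformer impedance, or a larger transformer, pushes the figure higher, while every metre of cable between transformer and load pulls it down.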
For very large switchboards, there is the added security of short circuit and heat rise testing by Tesla lab (12,000 A AC three phase and 6,000 A AC three phase and DC) – an independent lab specialising in testing of LV components, switchgear and switchgear assemblies.
When did you last inspect your critical power system?
With the benefit of hindsight, preventive maintenance would be top of all of our agendas when it comes to data centre resilience. A comprehensive preventive maintenance programme will optimise operating efficiency.
Mechanical, electrical and battery inspections are carried out along with environmental checks. Equipment cleaning and dust removal are undertaken, and electronics testing programmes and software updates are completed. With a detailed maintenance programme and report, it is possible to increase resilience and reliability.