Are you being fooled when it comes to resilience in the data centre?
Mon 23 Jan 2017 | Ian Bitterlin
Ian Bitterlin, consulting engineer and visiting professor at the University of Leeds, asks: is resilience misrepresented as well as misunderstood?
When it comes to data centres, the word ‘resilience’ is best defined as ‘the ability to maintain ICT service in the face of environmental extremes as well as human error or deliberate sabotage’ and, generally, higher levels of resilience can be engineered into the mechanical and electrical infrastructure at a cost premium.
However, human error is well documented as the root cause of some 70% of all data centre failures – although even that can be reduced by design. For example, a dual-bus power system with a UPS in each bus can largely protect a correctly connected dual-corded load against power failure, human error and inept sabotage, but you will probably notice how careful I am with the caveats…
Of course, if you are a client/user of a data centre you clearly want to know what you are getting for your money, not least so that you can pay for what you deserve. But, in the wonderful words of John Ruskin (1819-1900): ‘There is nothing in the world that some man cannot make a little worse and sell a little cheaper, and he who considers price only is that man’s lawful prey.’ In modern terms, this means that if you pay the lowest price you are usually buying rubbish.
How to differentiate between systems?
Well, we have two metrics, somewhat interlinked and both abused:
- The Tiers of Uptime (I-IV), the Types of TIA-942 (I-IV), the Rating of BICSI (0-4, although ‘0’ doesn’t describe a data centre, so 1-4) and the Availability Class of EN50600
- Availability percentage, e.g. 99.999% (the so-called five nines).
Apart from pointing out that the Uptime rules are no longer written down for public consumption, that TIA-942 and BICSI are ANSI standards most applicable in North America, and that EN50600 isn’t yet used much… we can distil them all into four levels describing the capability of concurrent maintainability and fault tolerance.
The principles are clear. Concurrent maintainability answers the question: what is the point of building a hugely reliable (and maybe resilient) data centre that must be shut down once a year for maintenance? A fault-tolerant system, meanwhile, can have any component, path or space fail (one at a time) without impacting the ICT service.
But the greatest abuse is reserved for availability percentage; easy to calculate but capable of huge misinterpretation to fool the unwary. The first problem is that to state an availability you need just two numbers: the MTBF (mean time between failures, in hours) and the MTTR (mean time to repair, in hours). You simply express the availability by dividing the MTBF by the total time (MTBF + MTTR) and multiplying by 100%.
So, having a very long MTBF and a very short MTTR gives you an incredibly high result. Unfortunately, both MTBF and MTTR are numbers that marketing departments can guess at, if they use them at all. For example, you can quote 99.999% for a UPS simply by assuming that the client has the skills and spare parts on site and can repair it himself in 20 minutes, instead of calling the service engineer, waiting for spare parts and then re-testing before putting it back into service (often a day or longer).
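If you want to check that arithmetic yourself, here is a minimal sketch in Python; the MTBF and MTTR figures are illustrative assumptions, not data for any real UPS product.

```python
# Availability (%) = MTBF / (MTBF + MTTR) * 100.
# The MTBF and MTTR values below are illustrative assumptions only.

def availability_percent(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

mtbf = 200_000.0  # assumed MTBF of a UPS module, in hours

# Optimistic marketing assumption: skills and spares on site, repaired in 20 minutes
print(f"{availability_percent(mtbf, 20 / 60):.5f}%")   # ~99.99983%

# More realistic assumption: call-out, spare parts and re-test taking a full day
print(f"{availability_percent(mtbf, 24):.5f}%")        # ~99.98800%
```

The same hardware, two very different numbers of nines – the only thing that changed was the repair-time assumption.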
The second problem is the combination of the number of failure events (summing multiple MTTRs) and the MTBF. The original Uptime white paper (now withdrawn) attempted to link availability percentage with the four Tiers but didn’t define the period over which it would be measured. This led to the strange scenario where a low-Tier facility would offer to be offline for 53 minutes per year but the ultimate ‘IV’ would offer only 5.3 minutes. How bizarre was that? A failure once a year is a disaster, for any Tier.
Anyway, let’s not dwell on that but consider the combination problem, which particularly affects numerous very short-lived failures. The easiest way to illustrate it is to suggest that your heart is 99.9% available. That doesn’t sound too bad until you consider that it represents around 36,000 missed heartbeats a year, and that if they are all missed in one session you are very dead, while if they are evenly spread over the year you are just feeling unwell.
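For the sceptical, the heart arithmetic works out roughly like this, assuming a resting rate of about 70 beats per minute (the rate is an assumption, chosen only to make the sums simple):

```python
# Rough sketch of the 'heart at 99.9% availability' arithmetic.
beats_per_year = 70 * 60 * 24 * 365          # ~36.8 million beats at an assumed 70 bpm
missed = beats_per_year * 0.001              # the 0.1% that is 'unavailable'
print(f"{missed:,.0f} missed beats a year")  # ~36,800, the same order as the 36,000 quoted
```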
In data centre terms, look at the voltage supplied to the load. Many modern servers cannot withstand a break in supply longer than 10 ms (milliseconds), and some considerably less, so offering 99.9999999% availability in the power system (nine nines) could still produce three failures every year, each lasting 10 ms.
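The sums behind that claim are simple enough to sketch (plain arithmetic, no assumed product data):

```python
# Why 'nine nines' of power availability can still mean several 10 ms breaks a year.
seconds_per_year = 365 * 24 * 3600                  # 31,536,000 s
unavailability = 1 - 0.999999999                    # nine nines
downtime_ms = seconds_per_year * unavailability * 1000
print(f"{downtime_ms:.1f} ms of downtime a year")   # ~31.5 ms, i.e. three 10 ms breaks
```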
So what to do?
Well, there is nothing wrong with availability as a metric as long as it is clear what it is based upon. For example, ‘an availability of 99.99% measured over 10 years with a single failure lasting no longer than 10 hours’ is a clear statement of MTBF (10 years) and MTTR (10 hours).
OK, the marketing boys and girls may have rounded the answer up from 99.98859…%, but you may, by now, be getting the point: it is the MTBF that matters more than the availability and, to boot, you need the MTBF to calculate the availability in the first place. The single-failure caveat avoids the summation of multiple events.
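A quick sketch shows where that 99.98859…% comes from; the ten-year period is taken here as roughly 87,660 hours, an assumption that includes leap days:

```python
# Worked example: one 10-hour failure over a 10-year measurement period.
period_hours = 10 * 365.25 * 24          # ten years, ~87,660 h (assumed period)
mttr_hours = 10                          # the single repair
mtbf_hours = period_hours - mttr_hours   # time in service between failures
availability = mtbf_hours / (mtbf_hours + mttr_hours) * 100
print(f"{availability:.5f}%")            # 99.98859...%, rounded up to 99.99%
```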
The next time someone offers you 99.999% of anything, just ask them ‘over what period?’ and watch their expression change – it can be fun.
Of course, the ultimate failure of a resilient data centre is the easiest to achieve; it is not hacking into the UPS and turning off the power, or (as in a recent movie) raising the server inlet temperature to cause a meltdown. No, just consider the definition of a data centre: a facility housing compute, storage and I/O connectivity, right? So, walk round the outer perimeter of the property noting the location of the fibre pits, and return later that night with a few chums, each armed with a balaclava, a few gallons of unleaded and a box of matches. Whip up the (unlocked) cast-iron pit lids and within seconds you are fleeing the scene and the data centre is disabled for several days.
The same principle applies to those strange folks who want to build an earthquake-proof facility. If an earthquake hits your location it will almost certainly sever the fibre and, without connectivity, a data centre is reduced to a secure depository for second-hand ICT kit.
Consulting Engineer & Visiting Professor, Leeds University
Critical Facilities Consulting