Data centre resilience? We hold the answers
Fri 25 Aug 2017 | Jonas Caino
Jonas Caino of Etix Everywhere looks at what human beings can teach us about data centre resilience
Human beings are remarkably resilient. For a species so physically frail, we instinctively know how to survive and thrive virtually anywhere on the planet, and possibly beyond. Another species developing alongside us is the machine; more specifically, the intelligent machine.
From Amazon’s Alexa to wearable technology, from edge-driven smart sensors to powerful robotics, ‘things’ are growing at an exponential rate and becoming fully integrated into our way of life. It seems as if Moore’s law (the doubling of transistors on an integrated circuit every two years) now applies to all aspects of technological advance.
It’s the logic and data embedded in software that make machines intelligent, and the nerve centre of all of this resides within our data centres. With Murphy’s Law (whatever can go wrong, will go wrong) forever lurking in the shadows, data centres need to be resilient more than ever, just like humans. So, as data centre designers, operators, and IT and facilities managers, without being too anthropological, what can we learn from the powerfully resilient nature of us humans when it comes to data centres?
It’s in human genes
For whatever reason, our genes are designed to survive; selfishness is encoded into our DNA. That’s the starting point regarding resilience in data centres: design. Data centres have to be designed to be resilient on many levels (power, cooling, network communications as well as potential internal and external threats).
At present, when those of us in the data centre industry think of resilience, we think of redundancy, particularly as framed by the Uptime Institute’s Tier ratings. The ultimate is fault-tolerant site infrastructure, where the data centre has two active infrastructure support paths, giving the owner an availability of 99.995%; in other words, planned or unplanned downtime of roughly 26 minutes a year.
Unlike genes, Tier IV resilience design is grossly inefficient; it is almost as if efficiency and resilience are diametrically opposed. Redundancy means paying for expensive infrastructure that sits largely idle. A more efficient and cost-effective compromise may be found within the Tier III space. Rather than have two mirrored sources of active power in a classic 2N configuration, one could opt for three sources, each sized at half the load, in a distributed redundant configuration. In other words, mirroring a 300kVA load across two full-size sources leaves each at 50% utilisation, whereas three 150kVA sources carry the same load at 66.6% utilisation each. If one of the sources fails, the other two still provide 300kVA. Besides being cheaper to deploy and operate, this configuration also offers improved availability if the three sources are fully independent.
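The arithmetic behind that comparison can be sketched in a few lines. This is a minimal illustration of the figures quoted above (a 300kVA load, two 300kVA sources versus three 150kVA sources); real power designs involve far more variables than capacity and utilisation.

```python
# Illustrative comparison of classic 2N vs distributed redundant ("3/2N")
# power configurations for a 300 kVA critical load. Figures follow the
# example in the text; they are not a design recommendation.

LOAD_KVA = 300

# Classic 2N: two mirrored sources, each sized for the full load.
n2_sources = 2
n2_capacity_each = LOAD_KVA                     # 300 kVA per source
n2_installed = n2_sources * n2_capacity_each    # 600 kVA installed
n2_utilisation = LOAD_KVA / n2_installed        # 50% across the pair

# Distributed redundant: three sources, each sized for half the load.
dr_sources = 3
dr_capacity_each = LOAD_KVA / 2                 # 150 kVA per source
dr_installed = dr_sources * dr_capacity_each    # 450 kVA installed
dr_utilisation = LOAD_KVA / dr_installed        # ~66.7% per source

# After one source fails, the survivors must still carry the full load.
dr_surviving = (dr_sources - 1) * dr_capacity_each  # 300 kVA, just enough

print(f"2N: {n2_installed} kVA installed at {n2_utilisation:.0%} utilisation")
print(f"3/2N: {dr_installed:.0f} kVA installed at {dr_utilisation:.1%} utilisation")
print(f"3/2N capacity after one failure: {dr_surviving:.0f} kVA")
```

The distributed configuration installs 450kVA instead of 600kVA for the same protected load, which is where the cost and efficiency advantage comes from.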
Humans know where to put roots
Humans think carefully about where to settle and build their communities; site selection is critical for survival. The same goes for data centres. The most resilient design goes out of the window if an earthquake, a flood or a plane hits the site. We should plan for potential threats, as well as for the possible benefits around power and cooling. The availability of alternative power sources, and the use of nature to cool the data centre in innovative ways (be it the climate or natural sources of water), only serves to increase resilience, with any gain in efficiency a bonus.
Humans work together
Communities have kept humans surviving age after age. If there is a catastrophe in one location, it doesn’t affect humanity as a whole; we are able to recover. Just as humans form clusters of communities around the world, clustering whole data centres makes the enterprise resilient to threats beyond power and cooling. There is a logic in looking at infrastructure that spans a network of data centres, located across a region or the globe, with each facility identical in infrastructure architecture, look and feel. In a multi-data-centre resilient topology, each facility runs the same applications and holds one of N copies of the database, with all changes replicated across the data centres. Users can therefore communicate with any data centre at any time with no risk of failure.
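The topology described above can be sketched as follows. This is a toy model under the article’s assumptions (full replication of every write to all N copies, any surviving site able to serve reads); the class and method names are illustrative, not a real product’s API, and real systems must also handle replication lag and conflicts.

```python
# Toy sketch of a multi-data-centre resilient topology: every facility
# keeps a full copy of the shared database, every change is replicated
# to all online facilities, and any surviving facility can serve users.

class DataCentre:
    def __init__(self, name):
        self.name = name
        self.store = {}      # this facility's copy of the database
        self.online = True

class Cluster:
    def __init__(self, names):
        self.sites = [DataCentre(n) for n in names]

    def write(self, key, value):
        # Replicate the change to every online facility.
        for site in self.sites:
            if site.online:
                site.store[key] = value

    def read(self, key):
        # Any online facility can answer, so a single-site
        # catastrophe is invisible to the user.
        for site in self.sites:
            if site.online:
                return site.store.get(key)
        raise RuntimeError("no data centre available")

cluster = Cluster(["site-a", "site-b", "site-c"])
cluster.write("user:42", "active")
cluster.sites[0].online = False          # one facility goes down
print(cluster.read("user:42"))           # -> active (served by a survivor)
```

The design choice mirrored here is the article’s point: resilience comes from the cluster, not from any single facility.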
Humans learn from past mistakes
A powerful characteristic of human resilience is our ability to learn from situations that threaten our existence. In the same vein, understanding the data centre in operation is an endeavour that should last the lifetime of the facility. The starting point is monitoring. Everything needs to be monitored: temperature inside, in front of and at the rear of the racks, in the UPS/battery space, in the floor plenum, and in the cold and hot aisles. The same goes for humidity, air pressure, electrical circuits, flow and return water temperatures, return air and very early smoke detection; the list goes on. You can never have too many monitoring sources.
The focus should then be on gathering data over a substantial period and looking for trends and cycles through carefully chosen subsets of that data. Every data centre is unique, and only through a continued understanding of the data the facility gives out, distilled into the right information, can one continually plan for resilience. A good data centre infrastructure management (DCIM) tool should help with this.
Humans are independent
Human beings think for themselves. They can move away from a community and adapt to what would appear to be an alien environment. Several studies have attributed a large percentage of data centre downtime, directly or indirectly, to human error. Can taking humans out of the equation increase the resilience of machines? Certainly, artificial intelligence is making its way into the data centre space. Right now, network operation centres interface with the data centre via a DCIM tool. What if data centres become truly independent, where AI, predictive analytics and machine learning can compute, rationalise and make decisions for the benefit of the data centre? What if the data centre became self-resilient?
There are already advances in data centre self-protection systems, which can remove the human decision-making process from security access control via facial recognition and spatial awareness. Soon we expect data centres to make decisions based on predicting problems within the facility and regulating environmental and infrastructural parameters for complete optimisation, not just within a single data centre but on behalf of a network of them. DeepMind’s machine learning is already cutting the energy used for cooling Google’s data centres by up to 40%.
What happens when Moore’s Law collides with Murphy’s Law? It seems the ability to remain resilient in the face of the oncoming deluge of data and applications, plus all that comes with managing them, may be beyond humans, and at some point we will have to get machines to manage machines.