The eight pillars of data centre management
Thu 17 Nov 2016
Building a resilient, highly available data centre is only the initial, embryonic phase of its lifetime. For the following 15-20 years, how one operates and maintains a data centre is far more important than how it was built (that said, it does need to be built right in order to be operated to its maximum potential).
So what is the difference between a well-run data centre and one that is not? Let’s take a look at what makes a high-quality site stand out…
Determine your strengths and weaknesses, then concentrate on mitigating and removing the weaknesses. At least once a year, carry out a full risk assessment by walking through the data centre and reviewing the records in the incident and problem management system. This will allow you to discover any adverse trends.
The best data centres will undertake a full, wide-ranging threat and vulnerability risk assessment (TVRA) annually. This may identify a number of issues that would then be held on the risk register for action and closure, or acceptance if the cost of elimination is too high or too impactful on operations.
A fully compliant ISO20000 incident, change and problem management system should be available and in active use, so that tickets are auto-escalated based on business rules aligned with the SLAs agreed with your clients. Integrating this system with the support systems of your FM supplier or OEM suppliers can bring the benefit of rapid response to service-affecting incidents.
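As a rough illustration of SLA-driven auto-escalation, the sketch below flags unacknowledged tickets that have exceeded a response limit. The priority names, time limits and the `Ticket` fields are hypothetical; real values would come from the SLAs agreed with your clients and from whatever ticketing system is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical response limits per priority; real limits come from the agreed SLAs.
RESPONSE_LIMITS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}


@dataclass
class Ticket:
    ticket_id: str
    priority: str
    opened_at: datetime
    acknowledged_at: Optional[datetime] = None
    escalation_level: int = 0


def should_escalate(ticket: Ticket, now: datetime) -> bool:
    """Return True when an unacknowledged ticket has exceeded its response limit."""
    if ticket.acknowledged_at is not None:
        return False
    return now - ticket.opened_at > RESPONSE_LIMITS[ticket.priority]


def escalate_overdue(tickets: List[Ticket], now: datetime) -> List[Ticket]:
    """Raise the escalation level of every overdue ticket and return those tickets."""
    escalated = []
    for ticket in tickets:
        if should_escalate(ticket, now):
            ticket.escalation_level += 1
            escalated.append(ticket)
    return escalated
```

Running a check like this on a schedule, and notifying the next person in the escalation chain for each returned ticket, is one simple way to make the business rules enforceable rather than advisory.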
Energy-efficient best practice
Adhering to the best practices in the EU Code of Conduct (EUCoC) for Data Centres (Energy Efficiency) has a double benefit – providing reduced power consumption that can translate into lower costs for your customers, while also bolstering your sustainability credentials. If managed through ISO14001 Environmental Management and ISO50001 Energy Management, the real benefits can be realised in a formal, process-oriented way.
DCIM, BMS and EMS
Data centre infrastructure management (DCIM), building management system (BMS) and environmental management system (EMS) tools are available to keep tabs on how your data centre is performing. Used well, these can provide critical information on issues before they affect service and can also provide customers (whether internal or external) with valuable operational data about the performance of their installed critical environment. The top data centres can also use these to provide virtual feedback and control loops that constantly monitor and adjust the critical environment to minimise power usage while maximising availability.
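To make the idea of a feedback and control loop concrete, here is a minimal sketch of one control step that nudges a cooling setpoint up when there is thermal headroom (saving power) and down when the measured inlet temperature is too high (protecting availability). The function name, the temperature limits and the step size are all assumptions for illustration, not values taken from any DCIM, BMS or EMS product.

```python
def next_setpoint(
    current_setpoint_c: float,
    inlet_temp_c: float,
    inlet_max_c: float = 27.0,     # hypothetical upper limit for server inlet air
    margin_c: float = 2.0,         # hypothetical headroom required before relaxing cooling
    step_c: float = 0.5,           # hypothetical adjustment per control cycle
    setpoint_min_c: float = 18.0,  # hypothetical plant limits
    setpoint_max_c: float = 25.0,
) -> float:
    """Return the cooling setpoint to use for the next control cycle."""
    if inlet_temp_c > inlet_max_c:
        # Too warm: cool harder to protect availability.
        candidate = current_setpoint_c - step_c
    elif inlet_temp_c < inlet_max_c - margin_c:
        # Comfortable headroom: relax cooling to reduce power usage.
        candidate = current_setpoint_c + step_c
    else:
        # Inside the dead band: hold steady to avoid oscillation.
        candidate = current_setpoint_c
    # Never leave the range the plant is allowed to operate in.
    return min(setpoint_max_c, max(setpoint_min_c, candidate))
```

A production loop would read live sensor data and would normally be tuned and interlocked far more carefully; the point here is only the monitor, compare and adjust cycle.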
Whether a data centre employs an FM supplier to provide all the proactive maintenance and reactive support, or contracts out every type of plant and equipment to the OEM, managing the suppliers’ behaviours and responsiveness is critical to keeping the lights on – literally! SLAs should ideally be leveraged to encourage positive behaviours rather than being punitive, and should be built into the ISO20000 system to ensure that auto-escalation occurs when the response is outside of contractual limits.
Assets should be maintained in accordance with manufacturers’ defined maintenance periods, and condition reports should be recorded every time an asset is inspected as part of the planned, preventive maintenance (PPM) schedule. Don’t forget that maintenance checks will differ depending on the frequency of the visit (monthly, quarterly or yearly).
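One way to picture frequency-dependent PPM checks is a lookup from visit type to checklist, with a condition report captured for each inspection. The sketch below is purely illustrative: the check names are invented, and the rule that a longer-interval visit also covers the shorter-interval checks is an assumption, since the manufacturer’s schedule decides what is actually required.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Visit types from least to most thorough.
VISIT_ORDER = ["monthly", "quarterly", "yearly"]

# Hypothetical checks per visit type; real lists come from the manufacturer.
CHECKS = {
    "monthly": ["visual inspection", "alarm log review"],
    "quarterly": ["filter check"],
    "yearly": ["full functional test"],
}


def checks_for_visit(visit_type: str) -> List[str]:
    """Return the checks due on a visit, assuming longer-interval visits
    also include every shorter-interval check."""
    last = VISIT_ORDER.index(visit_type)
    due: List[str] = []
    for name in VISIT_ORDER[: last + 1]:
        due.extend(CHECKS[name])
    return due


@dataclass
class ConditionReport:
    """Record written every time an asset is inspected."""
    asset_id: str
    visit_type: str
    inspected_on: date
    condition: str
    checks_done: List[str]


def record_visit(asset_id: str, visit_type: str, inspected_on: date, condition: str) -> ConditionReport:
    """Build the condition report for a completed PPM visit."""
    return ConditionReport(asset_id, visit_type, inspected_on, condition, checks_for_visit(visit_type))
```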
Many FM suppliers are now integrating permits to work (PtW), risk assessments and method statements, and engineer reports into a digital solution that reduces or even eliminates the need to shuffle paperwork around. Assets that are approaching their end of life should be in a plan for replacement and disposal at least one year before that date.
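The one-year rule for end-of-life planning is easy to check automatically against an asset register. The following is a minimal sketch, assuming each asset record carries an end-of-life date and a flag showing whether a replacement plan exists; the `Asset` fields and the 365-day lead time are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Assumed planning lead time: at least one year before end of life.
PLANNING_LEAD_TIME = timedelta(days=365)


@dataclass
class Asset:
    asset_id: str
    end_of_life: date
    has_replacement_plan: bool


def needs_replacement_plan(asset: Asset, today: date) -> bool:
    """True when the asset is inside the planning window and has no plan yet."""
    inside_window = asset.end_of_life - today <= PLANNING_LEAD_TIME
    return inside_window and not asset.has_replacement_plan


def assets_missing_plans(assets: List[Asset], today: date) -> List[str]:
    """Return the IDs of assets that should already be in a replacement plan."""
    return [a.asset_id for a in assets if needs_replacement_plan(a, today)]
```

Running such a check as part of the regular asset review gives an early warning list rather than a surprise when equipment reaches the end of its supported life.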
Black building tests
The data centre will have an inbuilt level of resilience and redundancy but that is worthless unless it is tested and proven on a regular basis. Too many data centre owners believe that load bank testing is sufficient, but unless the infrastructure is stressed against a live load, in my opinion, your testing counts for nothing.
Based on your risk assessment, black building testing should be carried out at periods that support your business need. Ensure your customers are aware why this is undertaken and how important it is, and make the case that having an outage during a planned exercise is safer than during a power brownout in the middle of a thunderstorm at 2am. Of course, if you’ve been doing everything else right, then the chances of an outage are always going to be very low.
Processes and procedures
At the end of the day your data centre will run well only if it is well understood by everyone responsible for its operation and upkeep. This implies that the processes and procedures of the data centre are documented, maintained and complied with by everyone. Regular audits (plan, do, check, act) will keep processes alive, adapted to current business needs and controlled.
This post originated at Data Centre Management magazine, from the same publisher as The Stack.