Why taking a preventative approach to maintenance is key to data centre resilience
Thu 9 Nov 2017 | Jonathan Humphries
Jonathan Humphries, Service Director at Stulz, explains why investing in a preemptive maintenance strategy is crucial for ensuring a cost-efficient, resilient data centre.
When designing and building a data centre, a lot of time is dedicated to finding the right location, evaluating design models, selecting the right equipment and making sure the building is constructed correctly. There are also strict standards to follow throughout the commissioning process, including checks such as Integrated Systems Testing (IST).
However, a serious problem is now emerging: data centre operators are investing less in maintaining the infrastructure that they took considerable effort to perfect during the initial stages. This shift can be described as a move towards a reactive maintenance approach.
Understandably, every data centre needs to reduce costs as much as possible, and once a facility is live, maintenance is often singled out as an area in which to cut OPEX. Operators need to consider whether these cuts are truly being made in the right areas.
Preventative maintenance is one of the most critical strategies within a data centre. Keeping a facility maintained to its original design parameters means that it continues to perform exactly as it was designed to on Day One.
As data centre teams move towards a reactive maintenance model, many do not understand the risks they are facing. If maintenance has not been carried out correctly, a fault in one component can cascade to others. If this happens, the data centre could very quickly go from a loss of redundancy to a loss of service.
Reactive maintenance is, therefore, costlier than a preventative model. When a piece of equipment fails, it is going to be a lot more expensive to correct than if it had been maintained in a preemptive manner.
Risk management procedures
There are two sets of documents that are imperative to the effective operation of a data centre. The first, Standard Operating Procedures (SOPs), ensures that the onsite operations teams know exactly how the equipment is supposed to perform on a day-to-day basis. If the teams are working to their SOPs, they should be able to identify an issue and deal with it very quickly.
The second set is the Emergency Operating Procedures (EOPs). In an emergency, such as several pieces of equipment failing at once, data centre operators can follow these and will know what to expect in that particular scenario.
99% of failures in data centres happen as a result of human error. However, risks can be minimised if data centre operations staff have fully rehearsed their SOPs and EOPs, which limits the time in which potential issues can arise in the facility.
Touch testing is also an important preventative measure. Each quarter, onsite operations teams can undertake a review that evaluates their ability to maintain and operate the site. These live workshops can quickly identify risk areas and give data centre managers an opportunity to tailor their SOPs and EOPs from a mechanical perspective.
Many failures are also due to people not understanding how the equipment operates, or making changes such as turning pieces of equipment off. This is a really big issue. Whenever anything is changed in the data centre, it is critical that change management procedures are in place. This goes hand-in-hand with preemptive maintenance measures.
Data centre operators would also be advised to assess their IT heat loads. Many data centres have very low loads in their early days, which means the mechanical equipment cannot always operate correctly: everything is running at a very low level, and repeatedly turning equipment on and off is not a healthy use of the system.
In these cases, a low load strategy needs to be instigated to optimise the data centre equipment to suit ever-changing loads. This can help identify where a piece of equipment needs to be turned off while still ensuring resilience and design specifications.
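As a rough illustration of the sizing decision behind a low load strategy, the sketch below works out how many cooling units need to run for a given IT heat load while keeping spare units for redundancy. It is a simplified assumption, not a method described in this article: the function name `units_to_run`, the N+1 default and the idea that all units have identical capacity are hypothetical, and a real strategy would also have to respect minimum operating points and manufacturer guidance.

```python
import math


def units_to_run(it_load_kw, unit_capacity_kw, installed_units, redundant_units=1):
    """Return how many cooling units should run for the given heat load.

    Hypothetical helper: assumes identical units and a simple N+R
    redundancy rule (duty units plus `redundant_units` spares).
    """
    if it_load_kw < 0 or unit_capacity_kw <= 0:
        raise ValueError("load must be >= 0 and unit capacity must be > 0")

    # Duty units needed to carry the load; always keep at least one running.
    duty_units = max(1, math.ceil(it_load_kw / unit_capacity_kw))
    required = duty_units + redundant_units

    if required > installed_units:
        raise ValueError("installed units cannot meet the load with this redundancy")
    return required


# Example: six 60 kW units installed, 90 kW of IT load, N+1 redundancy.
running = units_to_run(it_load_kw=90, unit_capacity_kw=60, installed_units=6)
standby = 6 - running  # units that could be switched off at this load
```

Under these assumptions, three of the six units carry the load with one spare, and the remaining three are candidates for being switched off until the load grows.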
Additionally, as a manufacturer focusing on preemptive maintenance, we document every failure and have a clear idea of when a piece of equipment is likely to fail. Using a Mean Time Between Failures (MTBF) figure, it is possible to gauge at what point in its lifecycle a component in a data centre is expected to fail.
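To show how an MTBF figure can be turned into a planning number, here is a minimal sketch. It assumes a constant failure rate (the textbook exponential model), which is a simplification and not necessarily how any manufacturer models its own equipment; the function names and the example figures are hypothetical.

```python
import math


def survival_probability(hours, mtbf_hours):
    """Probability that a component runs `hours` without failing.

    Assumes a constant failure rate, so R(t) = exp(-t / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-hours / mtbf_hours)


def expected_failures(hours, mtbf_hours, unit_count=1):
    """Expected number of failures across `unit_count` identical units."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return unit_count * hours / mtbf_hours


# Example with made-up figures: ten fans, each with an MTBF of 87,600 hours
# (ten years), over one year of continuous running (8,760 hours).
one_year = 8760
p_single_fan_survives = survival_probability(one_year, 87600)
failures_in_fleet = expected_failures(one_year, 87600, unit_count=10)
```

Under this model, running for a period equal to the MTBF leaves only about a 37% chance that a given component has not yet failed, which is one reason an MTBF figure should not be read as a guaranteed service life.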
Operations teams often do not understand the importance of implementing these preemptive steps, but by not maintaining a data centre properly they are taking the risk away from the manufacturer and beginning to own that risk themselves. If there is a failure, it is now the business's own responsibility. Having a specialist work hand-in-hand with the operations team on a preventative regime will help to identify risks, alleviate maintenance pressures, and save considerable costs in the long term.