Averting UPS disaster in the data centre
Thu 18 Jan 2018 | Leo Craig
Leo Craig, general manager at Riello, asks: is it time to tighten up your maintenance approach?
When it comes to ensuring your data centre is resilient, regular maintenance of your UPS is essential. The maintenance process is designed to minimise risk and keep your UPS operating in a fail-safe, efficient manner. So far, so good. But what happens if the very act of carrying out maintenance poses a risk itself? What checks and balances can you put in place to ensure peace of mind and a watertight approach?
As British Airways discovered to their cost in the summer of 2017, human error is the main cause of problems occurring during uninterruptible power supply (UPS) maintenance procedures; engineers may throw a wrong switch, or carry out a procedure in the wrong order.
But, while it’s easy to lay the blame solely at the feet of the engineer in these instances, errors of this kind are often the result of poor operational procedures, poor labelling or even poor training. By ironing out these issues at the start of a UPS installation, risks can be avoided.
For example, if the system being installed is a critical system comprising large UPSs in parallel and a complex switchgear panel, Castell interlocks should be incorporated into the design. These force the user to switch in a controlled and safe fashion, but are often left out of the design to save costs at the start of the project.
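The way an interlock forces a fixed switching order can be illustrated with a small software model. This is a toy sketch only: the class names and the three-step sequence are hypothetical, invented for illustration, and are neither a real switching procedure nor a description of how any Castell product works.

```python
class InterlockError(Exception):
    """Raised when a switching step is attempted out of sequence."""


class InterlockedSequence:
    """Toy model: a step can only be operated once every earlier step is done."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.position = 0  # index of the next permitted step

    def operate(self, step):
        """Perform a step; return True once the whole sequence is complete."""
        if self.position >= len(self.steps):
            raise InterlockError("sequence already complete")
        expected = self.steps[self.position]
        if step != expected:
            raise InterlockError(
                f"cannot operate {step!r}; next permitted step is {expected!r}"
            )
        self.position += 1
        return self.position == len(self.steps)


# Hypothetical sequence; the step names are illustrative, not a real procedure.
sequence = InterlockedSequence([
    "transfer load to static bypass",
    "close maintenance bypass switch",
    "open UPS output switch",
])

try:
    sequence.operate("open UPS output switch")  # attempted out of order
except InterlockError as err:
    print("Blocked:", err)
```

A physical interlock achieves the same effect mechanically, so the wrong switch cannot be thrown even if the written procedure is misread.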
Simple things can make a difference. By ensuring that basic labelling and switching schematics are up to date, disaster can be averted. Having clearly documented switching procedures available is recommended. If the site is extremely critical, the procedure of pilot – co-pilot (two engineers both check the procedure before carrying out each action) will prevent most human errors.
Maintenance is typically intrusive to the UPS or switchgear, so reducing the amount of physical intervention is always a good thing. Most problems, including the failure of electrical components, are preceded by an increase in heat.
If a connection point isn’t tightened properly, for example, it will start to heat up and eventually fail in some way. Short of checking every connection physically, the most effective solution is thermal imaging. Thermal imaging technology can identify potential issues that wouldn’t necessarily be picked up using conventional techniques, without the need for physical intervention.
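The principle that heat precedes failure can be sketched as a simple check on readings taken during a thermal survey. This is a minimal illustration under stated assumptions: the function name, the connection names, the temperatures and the 30 °C rise limit are all hypothetical placeholders, and real acceptance limits should come from the equipment manufacturer or the survey provider.

```python
def flag_hot_connections(readings_c, ambient_c, max_rise_c=30.0):
    """Return (name, rise) pairs whose rise above ambient exceeds max_rise_c.

    The 30.0 default is a placeholder for illustration, not a recommended limit.
    Results are sorted hottest first.
    """
    flagged = [
        (name, round(temp - ambient_c, 1))
        for name, temp in readings_c.items()
        if temp - ambient_c > max_rise_c
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up survey readings in degrees Celsius.
readings = {
    "input breaker L1": 41.5,
    "output busbar L2": 78.0,
    "battery link 3": 44.0,
}

print(flag_hot_connections(readings, ambient_c=25.0))
# prints [('output busbar L2', 53.0)]
```

Comparing each survey against the previous one, rather than against a single fixed limit, would also show connections that are getting warmer over time.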
Monitor equipment and competency
Round-the-clock equipment monitoring also offers robust protection and should be part of your maintenance package. Rigorous training is also vital, as is ensuring that the attending engineer can carry out the work competently.
Never be afraid to ask questions of your maintenance provider – it is your responsibility to request proof of competency levels – pertaining both to the company itself and the engineers it uses. And always check ‘on the day’ that the engineer on site is competent and isn’t a last-minute sub-contractor sent in because the original engineer is off sick.
A strong maintenance package should ensure that when the UPS does fail, the response is timely and effective. Service level agreements need to be appropriate to the criticality of the application. There is no point having a maintenance contract with a 24/7 response if access to the UPS can only be gained during normal business hours. Conversely, if operations are 24/7 and very critical to the business, then a 24/7 response is a must.
Be clear on exactly what the ‘response’ constitutes: will it just be a phone call, or will someone come to site? If so, will that someone be a competent engineer?
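The mismatch between response cover, site access and operating hours can be checked with a few lines of set arithmetic. This is a rough sketch, assuming cover is described in whole-hour slots; the function names and the Monday to Friday, 09:00 to 17:00 access window are hypothetical examples.

```python
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def slots(days, start_hour, end_hour):
    """Expand a window into a set of (day, hour) slots; end_hour is exclusive."""
    return {(day, hour) for day in days for hour in range(start_hour, end_hour)}


def review_cover(operating, response, access):
    """Compare operating hours, contracted response hours and site access hours."""
    return {
        # The load is critical but the contract provides no response.
        "unprotected": operating - response,
        # Response is contracted but an engineer could not get to the UPS.
        "unusable": response - access,
    }


ALL_WEEK = slots(DAYS, 0, 24)
OFFICE_HOURS = slots(DAYS[:5], 9, 17)  # hypothetical access window

# A 24/7 contract on a site that can only be entered during office hours.
report = review_cover(operating=ALL_WEEK, response=ALL_WEEK, access=OFFICE_HOURS)
print(len(report["unprotected"]), len(report["unusable"]))
# prints 0 128
```

Here 128 of the 168 contracted hours each week could not actually be used, which is the situation the article warns against.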
Review today, protect tomorrow
Undertaking a review of your current UPS maintenance procedure will help to identify and reduce risk to critical operations that you may not have previously anticipated. By applying an extra level of due diligence today, you can help to avert disaster tomorrow.
This post originated at Data Centre Management magazine, from the same publisher as The Stack.