Testing your disaster recovery – why we keep harping on about it
Thu 19 Jun 2014
Disaster recovery is a complex process, and one in five recoveries fail because the solution has not been properly tested. But Jules Taplin has a plan.
If you’ve invested in a disaster recovery solution to ensure business continuity, you may feel reassured that you’ve done all you can to mitigate the risk of IT downtime. Too many businesses, however, undermine that investment by failing to test their disaster recovery solutions. Given the complexity of today’s infrastructure, high rates of change and the cost of disaster recovery testing, it’s not surprising that 48% of companies never test their system.*
So why is testing so important? Put simply, it’s highly unlikely that you can achieve a fast, or even reliable, recovery of your IT systems unless you test. Let me explain why. When we reconstruct a machine, as would happen during the recovery process, we rebuild an image of the machine, change its IP addressing and adjust the machine to incorporate those changes. We then test the consistency of the applications and read the event logs for errors.
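The “read the event logs” step above can be sketched as a simple error scan. This is a minimal illustration only: the pattern and sample lines are my assumptions, not Plan B’s actual tooling.

```python
import re

# Log lines mentioning these words are flagged for an engineer to work through.
ERROR_PATTERN = re.compile(r"\b(error|critical|failed)\b", re.IGNORECASE)

def scan_log(lines):
    """Return only the log lines that look like errors."""
    return [ln for ln in lines if ERROR_PATTERN.search(ln)]

sample = [
    "Service W3SVC started successfully",
    "ERROR: driver xyz failed to load",  # illustrative log text
]
print(scan_log(sample))  # -> ['ERROR: driver xyz failed to load']
```

On a first full recovery this kind of scan is exactly where the tens to hundreds of errors mentioned below surface.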
Typically, on a first full recovery test we see tens to hundreds of errors, all of which need working through before the system runs properly. Each application service is tested to ensure it starts, and finally the data is tested. This involves booting the machines, checking each is constructed correctly and able to boot, starting the applications to confirm they work, and then running testing logic against specific variables – for example, if it’s a web server, checking that its ports are available. And unless a file is actually read back, you won’t know whether the data is real and available or not.
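Those two checks – “is the port available?” and “can the data actually be read back?” – can be sketched in a few lines. The host, port and file path here are placeholder assumptions for illustration:

```python
import socket

def port_open(host, port, timeout=3):
    """Check that a recovered server is listening on the expected port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def data_readable(path, min_bytes=1):
    """Actually read the file back; a file that merely exists proves nothing."""
    try:
        with open(path, "rb") as f:
            return len(f.read(min_bytes)) >= min_bytes
    except OSError:
        return False

# Placeholder checks against a recovered web server:
checks = {
    "web port open": port_open("127.0.0.1", 8080),
    "data file readable": data_readable("/srv/data/orders.db"),
}
for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

The point of `data_readable` is the one made in the text: only a successful read proves the data is real and available.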
It’s only once this has been completed successfully that you have a recovered system. Because of the high number of errors that occur during a first recovery, working through them takes hours of specialist engineers’ time – and this is why your recovery time suffers.
Regular testing of the full recovery process will reduce your recovery time. Once the initial errors have been ironed out, subsequent recoveries become faster and your risk of downtime is significantly reduced.
Unfortunately, however, it doesn’t stop there: every change to your infrastructure introduces new errors, and recovery time is impacted again. For example, if you install BlackBerry services on your system, those services won’t work on the recovered system when a recovery is invoked. Regular testing is the key, because it is how the test process learns to check for the BlackBerry services.
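One simple way to “teach” the test process about new services is to keep a baseline of the services the recovery tests already cover and flag anything new on the live system. This is a hedged sketch – the service names and baseline file are hypothetical, not a description of Plan B’s product:

```python
import json
import os

BASELINE = "services_baseline.json"  # hypothetical baseline file

def load_baseline(path=BASELINE):
    """Services the recovery tests already know how to check."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))

def find_untested_services(current_services, path=BASELINE):
    """Services on the live system with no recovery-test coverage yet."""
    return sorted(set(current_services) - load_baseline(path))

def update_baseline(current_services, path=BASELINE):
    """Record the services once tests for them have been written."""
    with open(path, "w") as f:
        json.dump(sorted(set(current_services)), f)

# e.g. after BlackBerry is installed on the live system (placeholder names):
live = ["W3SVC", "MSExchangeIS", "BlackBerry Router"]
print(find_untested_services(live))  # anything listed here needs a new test
update_baseline(live)
```

Run nightly, a diff like this surfaces the BlackBerry-style gap before a real invocation does.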
Ideally, test nightly. By booting recovery systems every night, we see things that customers don’t – for example, the potential for disc failure, changes to permissions on services or service accounts, or application licence expiry. Daily “monitoring” tests the full architecture and ensures machines come up in rank order: booting the domain controller first and then the application-based machines matters because, for example, Active Directory needs to be ready before Exchange will work. Log files should also be read daily to check for errors.
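The rank-order boot described above amounts to a dependency-ordered loop: don’t boot the next machine until the previous one’s service answers. A minimal sketch, assuming hypothetical host names, ports and a stand-in boot call:

```python
import socket
import time

def check_ready(host, port, timeout=2):
    """Treat a service as 'ready' once it accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Rank order: domain controller first, then machines that depend on it.
# Host names and ports are placeholder assumptions.
BOOT_ORDER = [
    ("domain-controller", lambda: check_ready("dc01", 389)),   # LDAP / Active Directory
    ("exchange-server",   lambda: check_ready("exch01", 443)), # needs AD ready first
]

def boot_in_order(order, wait=5, retries=3):
    """Boot each machine in turn, refusing to continue until the previous
    one's service answers - the DC-before-Exchange rule from the text."""
    for name, ready in order:
        print(f"booting {name} ...")  # stand-in for the real boot command
        for _ in range(retries):
            if ready():
                break
            time.sleep(wait)
        else:
            raise RuntimeError(f"{name} never became ready; aborting later boots")
```

Failing fast here is deliberate: booting Exchange against an absent Active Directory would only generate misleading errors further down the log.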
If you’re looking to improve your IT availability, all of these elements should be tested – otherwise you’re running the risk of being among the 19% whose recoveries fail.*
Jules Taplin is Technical Director with Plan B Disaster Recovery