Network recovery advice: Experts weigh in
In the old days, you just had redundant everything, and disaster recovery meant switching over. Not so in the world of cloud computing, security nightmares, and virtual everything.
Recovering from a network crash isn’t as simple as it used to be, given today’s complex mix of hybrid cloud computing, security requirements, and virtual servers.
A systems administrator who wants to stay employed should consider the latest-and-greatest network recovery advice from Todd Matters, co-founder and chief architect of RackWare, and Scott Saunders, principal recovery solutions architect at Sungard Availability Services.
“Having a recovery plan for your network is really just a part of a comprehensive disaster recovery plan,” Matters said. “It’s always a good idea to have some resiliency built-in. Of course there’s cost-benefit tradeoffs there. Increasingly it’s very economical to do some disaster recovery in the cloud.”
Even if your company doesn’t use any other cloud computing services (which is unlikely in 2019), the advantage of cloud services for recovery is that they’re elastic, quickly reconfigurable, and readily available. “It’s not like the old days where you kind of had to have a bunch of duplicated hardware most of the time sitting there doing nothing,” Matters noted. But just as you would on real iron, “You want to figure out which parts of your network should be pre-provisioned and which parts should be dynamically provisioned.” External-facing DNS, routers, and anything mission-critical can be in the former category, while internal DNS and domain controllers are in the latter, he added.
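As a rough illustration of that split, the short Python sketch below sorts a made-up inventory into pre-provisioned and dynamically provisioned components. The component names, the two flags, and the rule itself are assumptions chosen to mirror Matters’ examples, not anything RackWare has published.

```python
from dataclasses import dataclass
from enum import Enum


class Provisioning(Enum):
    PRE_PROVISIONED = "pre-provisioned"  # kept ready at the recovery site
    DYNAMIC = "dynamic"                  # created on demand after a failure


@dataclass(frozen=True)
class Component:
    name: str
    external_facing: bool
    mission_critical: bool


def classify(component: Component) -> Provisioning:
    """Assumed rule: pre-provision anything external-facing or mission-critical."""
    if component.external_facing or component.mission_critical:
        return Provisioning.PRE_PROVISIONED
    return Provisioning.DYNAMIC


# Hypothetical inventory; the flags are illustrative guesses, not measurements.
INVENTORY = [
    Component("external-dns", external_facing=True, mission_critical=True),
    Component("edge-router", external_facing=True, mission_critical=True),
    Component("internal-dns", external_facing=False, mission_critical=False),
    Component("domain-controller", external_facing=False, mission_critical=False),
]

plan = {c.name: classify(c) for c in INVENTORY}

if __name__ == "__main__":
    for name, mode in plan.items():
        print(f"{name}: {mode.value}")
```

In practice the rule would likely weigh cost as well, since every pre-provisioned component is one the company pays for while it sits idle.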
“This sounds like a no-brainer, but you would be shocked at how many people don’t follow these guidelines. You absolutely need to run through your disaster recovery drills and constantly update the runbook,” Matters said, referring to a company’s documentation for exactly how and in what order to bring up each system and application following a major crash.
Security should be one of the first things brought back up. Think of the Old West: if your fort got destroyed, the first thing you’d do is surround yourself with new walls before rebuilding. One thing that shouldn’t be automated, however, is your wide-area networking, because automating it would take longer than rebuilding it manually, Matters said.
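A runbook’s bring-up order can be thought of as a dependency graph with the security layer at the root. The sketch below, which uses Python’s standard `graphlib` module (Python 3.9 or later), derives one valid order from a hypothetical set of dependencies. The system names and their relationships are invented for illustration, and the wide-area network is deliberately left out, in line with Matters’ advice to rebuild it by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical runbook data: each system maps to the systems that must be
# running before it can be started. The firewall depends on nothing, so it
# is always brought up first.
DEPENDS_ON = {
    "firewall": set(),
    "vpn-gateway": {"firewall"},
    "external-dns": {"firewall"},
    "domain-controller": {"firewall"},
    "internal-dns": {"domain-controller"},
    "app-servers": {"internal-dns", "external-dns"},
}


def recovery_order(depends_on):
    """Return one valid bring-up order in which every prerequisite comes first."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(depends_on).static_order())


order = recovery_order(DEPENDS_ON)

if __name__ == "__main__":
    for step, system in enumerate(order, start=1):
        print(f"{step}. {system}")
```

Keeping the dependencies as data means a circular dependency surfaces as an error during a drill rather than during a real outage.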
Saunders spoke in terms of RTOs, or recovery time objectives. He agreed with Matters that too many companies lack sufficient focus on network recovery, and added that annual tests are best performed by people other than those who designed them. After all, what if a disaster prevented your regular IT staff from getting to work, or worse? “A lot of people don’t think when their entire environment is down that their key people may not be able to help,” he said.
Annual full-recovery tests should be supplemented with monthly or quarterly incremental tests, such as restoring specific applications, Saunders observed. “Just because you test once a year, you could have made many incremental changes in your environment… If you have a disaster mid-term, you’re not going to be prepared. The best action as far as disaster recovery is to do all the incremental lifecycle management changes.”
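One simple way to keep those incremental tests honest is to compare each drill’s measured restore times against the stated RTOs. The Python sketch below does that for a handful of hypothetical applications. The application names, objectives, and drill timings are all made up for illustration and do not come from Sungard.

```python
from datetime import timedelta

# Hypothetical recovery time objectives per application.
RTO = {
    "email": timedelta(hours=4),
    "order-processing": timedelta(minutes=30),
    "intranet-wiki": timedelta(hours=24),
}

# Hypothetical restore times measured during the most recent drill.
LAST_DRILL = {
    "email": timedelta(hours=2, minutes=15),
    "order-processing": timedelta(minutes=50),
}


def review_drill(rto, measured):
    """Return (missed, untested): apps slower than their RTO, and apps never timed."""
    missed = sorted(
        app for app, took in measured.items() if app in rto and took > rto[app]
    )
    untested = sorted(app for app in rto if app not in measured)
    return missed, untested


missed, untested = review_drill(RTO, LAST_DRILL)

if __name__ == "__main__":
    print("Missed RTO:", missed)
    print("Not tested:", untested)
```

Listing untested applications separately matters as much as the misses, because an application that was never restored in a drill has no evidence behind its objective at all.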
A decade ago, companies measured recovery in days and hours—today it’s hours and minutes, and perhaps in another decade it’ll be minutes and seconds, Saunders noted.
Cyberinsurance is also a factor. Auditing a company’s disaster recovery plan is not currently a basis for granting coverage or denying a claim, said Joshua Motta, CEO of Coalition, a cyberinsurance specialist. However, “Very often the application for insurance will ask if the company does have an IT disaster recovery plan,” he said. “Once you get into the large enterprise, it’s probably few and far between that a company doesn’t.”