About twenty IT professionals joined us on Tuesday to compare notes about disasters, discuss disaster recovery planning, and design architectures to respond to different disasters. On Wednesday, I shared what we learned in a 90-minute "report out" presentation during Symposium. I'd like to share a brief summary of that here.
First, what is disaster recovery compared to business continuity?
In the simplest definition: disaster recovery is what you do to recover your system from a disaster. Business continuity is what you do to keep conducting business while others recover the system.
In most organizations, IT focuses on disaster recovery, and the business units focus on business continuity.
When we talk about disaster recovery, most people immediately think of the "smoking hole in the ground" scenario. They imagine examples where the data center building is engulfed in an inferno or demolished by a tornado. In fact, disasters can be large or small.
In the workshop, we led a small group breakout to discuss different disasters we had all experienced. We heard examples like a database administrator not checking the status of a backup before deleting a database to upgrade the database software (in fact, the database backup had failed). Or building management conducting a fire alarm system test, and accidentally shutting off all power to the data center. Or a water leak from an overhead sprinkler system that damaged several racks of systems in a data center. Or construction workers who accidentally cut into the only fiber data connection for the data center.
So a disaster recovery plan needs to accommodate different types of disasters at different scales, from large to small. In the best case, organizations can design an architecture for their systems that remains flexible in the face of an outage. The best disaster recovery plan is one that doesn't require people to get involved. But if you aren't able to have systems automatically "fail over" during an outage, you want to have a plan documented beforehand about how to recover your systems.
Folks divided into four working groups, and spent the rest of the day designing an architecture and documenting a response plan for different scenarios. One group discussed how to recover from partial data center damage, and another group prepared a plan for a new property tax system. A third group designed an architecture to extend flexibility for a network in the face of a possible ISP outage, while a fourth group used the workshop to outline an actual disaster recovery plan.
For details from the four working groups, please see the materials we shared from the Symposium. We captured the work of all the working groups and have made the information available to Symposium attendees. We encourage you to use this as a starting point for disaster recovery planning in your own IT organizations. Let this work help you get a jump start on planning for disasters.
At a high level, here are several key "take-aways" from the four working groups:
Identify critical people and vendors.
In a disaster, you will need to immediately reach out to technology folks to respond to the disaster, and vendors to help you get back to business quickly. Your next disaster may not be conveniently timed to occur during working hours. Make sure you have copies of contact information available off-site.
Network is important for business.
With so many organizations using Cloud and vendor-hosted applications, the network is key to your business. If your partial data center outage impacts the network, your users may not be able to access these remote systems. Also consider authentication and other components of your network.
Maintain a warm site.
If possible, you should implement dual "hot" sites where you run production applications and systems from two or more sites.
Identify critical systems and applications.
Conduct a risk analysis or other determination to understand which systems are truly your most critical. Create disaster recovery plans for these systems first. For example, most organizations can let "development" or "test" systems wait a few days while you bring up production systems.
Are your RTO and RPO realistic and achievable?
How quickly can you recover a system, and how old will your data be when you get it back up? These questions address two important factors in disaster recovery planning: RTO (Recovery Time Objective) is the target for how long it may take to recover your system, and RPO (Recovery Point Objective) is the maximum "age" of the data that you can accept when you recover.
Understand the impact.
What is the cost of an outage? In this scenario, the working group was able to quantify some numbers. For example, the interest on $100 million over three days. Use the cost of an outage to help you justify additional infrastructure.
BC and DR can sometimes intersect.
In most instances, disaster recovery is performed by the IT team, and business continuity is the responsibility of business units. But depending on the application, note that the "players" for these may overlap.
Design for redundancy and failover.
Where possible, create an architecture that remains flexible in the face of an outage. In one example, you might run production systems from two different data centers. But as you design "failover" into your architecture, look for single points of failure and address them.
Network is business critical.
This was a repeated theme throughout the day. Many organizations now outsource applications, and leverage Cloud and vendor-hosted systems. While this simplifies your IT, it means the network becomes even more critical to your operations.
Identify single points of failure.
Another repeated theme from the workshop. As you design your architecture, consider what systems are dependent on other systems. Watch for any single points of failure.
What is your upstream's DR plan?
Be mindful not to be lulled into a sense of security when outsourcing applications and services. Just because you have outsourced part of your systems to a vendor or other upstream provider doesn't mean you can ignore disaster recovery planning. Discuss disaster recovery plans with your upstream vendor to understand how they will bring your applications and data back online in the face of a disaster.
Identify systems and inter-dependencies.
As you create a disaster recovery plan for an application, look closely at the systems it connects to. What inter-dependencies exist? How does your application rely on other systems and applications?
Document the architecture.
You can advance a disaster recovery plan very quickly simply by documenting the architecture. Make this a group exercise, to sketch how each component of an application runs on different systems, and how those systems are connected.
How do you go back to normal?
It's tempting to focus only on the actual "recovery" portion of a disaster recovery plan. For example, your plan may require moving production to a "test" server. But you can't run production on the secondary system forever. Having "recovered" on the test system, how do you plan to go back to a normal state?
Analyze risk to find critical systems.
Use a risk analysis method to identify which systems and applications are the most critical to your business. One way to do this is to break it up into components: the likelihood of a failure, and the business impact of the failure. The combination of these components determines the criticality of the overall system.
I thought it was great that two of our working groups used Tuesday's workshop to carry forward work that is currently underway in their organizations. If you are a local CIO who sent staff to this workshop, your teams came away with actionable results.
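If you want to try the risk analysis take-away on your own systems, here is a minimal Python sketch of one way to do it. It assumes a 1-to-5 scale for both the likelihood of a failure and its business impact, and it multiplies the two to get a criticality score. The scale, the multiplication, the names `System` and `rank_by_criticality`, and the example systems and their scores are all illustrative assumptions, not numbers or methods from the workshop.

```python
from dataclasses import dataclass


@dataclass
class System:
    """One system or application in the risk analysis (hypothetical model)."""
    name: str
    likelihood: int  # assumed scale: 1 (failure is rare) to 5 (failure is frequent)
    impact: int      # assumed scale: 1 (minor business impact) to 5 (severe)

    @property
    def criticality(self) -> int:
        # Assumption: combine the two components by multiplying them.
        return self.likelihood * self.impact


def rank_by_criticality(systems):
    """Return the systems ordered from most to least critical."""
    return sorted(systems, key=lambda s: s.criticality, reverse=True)


# Made-up example scores, for illustration only.
systems = [
    System("property tax system", likelihood=2, impact=5),
    System("test environment", likelihood=3, impact=1),
    System("network core", likelihood=3, impact=5),
]

ranked = rank_by_criticality(systems)
for s in ranked:
    print(f"{s.name}: criticality {s.criticality}")
```

The systems at the top of the ranking would be the first candidates for a documented disaster recovery plan, while low scorers such as a test environment can wait.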