Flirtin' With Disaster

I spoke on a panel at a Marcus-Evans conference on business continuity and disaster recovery and found myself in the position of converging three themes: business continuity, security, and virtualization. Of course, I had my eyes (mostly) open while speaking, although it appears I was out of focus for much of the session, so this shot of me escaping the bull by the horns has to suffice, or at least distract from my weak Molly Hatchet references (speakers on if you click on the link).

Historically, data center management saw disaster recovery and business continuity as reactions to physical events: force majeure of nature, building inaccessibility, threats or acts of terror, or major infrastructure failures like network outages or power blackouts. Increased stress on the power grid and near-continuous construction in major cities increase the risks of the last two, but they are still somewhat contained, physical events that prompt physical reactions: spin up the redundant data center, fail over the services, and get up and running again, ideally without having missed shipments, deliveries or customer interactions.

Business continuity today has those physical events as table stakes only. The larger, more difficult problems are network service failure (due to denial of service attacks or failure of a dependent service), geographic restriction (due to pandemic fears, public transportation failures, or risk management), data disclosure and privacy, and the overall effect on brand and customer retention. What if you can't get people into an office, or have to react to an application failure that results in customer, partner or supposedly anonymous visitor information being disclosed? Welcome to this decade's disasters.

Where does virtualization fit in? Quite well, actually. Virtualization is a form of abstraction; it "hides" what's under the layer addressed by the operating system (in the case of a hypervisor) or the language virtual machine (in the case of an interpreted language). But it's critical that virtualization be used as a tool to truly drive location and network transparency, not just spin up more operating system copies. I never worry about the actual data center containing my mail server, because I only see it through an IMAP abstraction. It could move, fail over, or even switch servers, operating systems and IMAP implementations, and I'd never know. Virtualization gains in importance for business continuity because it drives the discussion of abstraction: what services are seen by what users, where, how, and on which networks?
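To make the mail server point concrete, here is a minimal Python sketch of location transparency. Everything in it is hypothetical and invented for illustration: the `MailBackend` interface, the `InMemoryBackend` stand-in, the `MailService` front end, and the data center names. It is not real IMAP code, just the shape of the idea that the user talks to one logical service while the thing behind it can fail over unnoticed.

```python
from dataclasses import dataclass
from typing import Protocol


class MailBackend(Protocol):
    """Hypothetical interface: whatever actually stores the mail."""
    name: str

    def is_up(self) -> bool: ...
    def fetch_subjects(self, user: str) -> list[str]: ...


@dataclass
class InMemoryBackend:
    """Stand-in for a mail server in some data center (illustration only)."""
    name: str
    mailboxes: dict[str, list[str]]
    up: bool = True

    def is_up(self) -> bool:
        return self.up

    def fetch_subjects(self, user: str) -> list[str]:
        return list(self.mailboxes.get(user, []))


class MailService:
    """The only thing the user sees: one logical mail service."""

    def __init__(self, backends: list[MailBackend]) -> None:
        # Backends are listed in order of preference.
        self._backends = backends

    def _active(self) -> MailBackend:
        # Pick the first backend that reports itself healthy.
        for backend in self._backends:
            if backend.is_up():
                return backend
        raise RuntimeError("no mail backend available")

    def fetch_subjects(self, user: str) -> list[str]:
        # The caller never learns which backend answered.
        return self._active().fetch_subjects(user)


if __name__ == "__main__":
    inbox = {"hal": ["BC panel notes"]}
    primary = InMemoryBackend("dc-east", inbox)
    standby = InMemoryBackend("dc-west", inbox)
    mail = MailService([primary, standby])

    print(mail.fetch_subjects("hal"))
    primary.up = False  # simulate losing the primary data center
    print(mail.fetch_subjects("hal"))
```

The sketch assumes both backends already hold the same mailboxes; in practice, keeping the replicas in sync is the hard part, and the abstraction only hides the failover, it does not create it.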

Bottom line: Business continuity planning shares several common themes with systemic security design. There's self-preservation, the notion that a system should survive after one or more failures, even if they are coincident or nested. The least privilege design philosophy ensures that each process, administrator or user is given the minimum span of control to perform a task; in security this limits root privileges, while in BC planning it ensures that you don't give conflicting directions regarding alternate data center locations. Compartmentalization drives isolation of systems that may fail in dependent ways, and helps prevent cascading failures. Proportionality helps guide investment into areas where there is perceived risk. The short form of proportionality is to not spend money on rapid recovery from risks that would have other, far-reaching effects on your business anyway. My co-author Evan Marcus used to joke that it was silly to build a data center recovery plan for a potential Godzilla attack, because if that happens we have other, larger issues to deal with. On the other hand, if you saw Cloverfield, there's a lot of infrastructure that people depend upon even when monsters are eating Manhattan.

The best planning is to write out a narrative of what would happen should your business continuity plan go into effect: script out the disaster or event that causes your company to act, write up press releases, decision-making scenarios, and some plausible risk-adjusted actions, and follow the actions out to their conclusion. If you don't have a prescribed meeting place for a building evacuation, and there's no system for employees to check in and validate their safety, then your business continuity plan may suffer when you have to scramble to find a critical employee. When disasters happen, the entire electronic and physical infrastructures are unusually stressed, and normal chains of communication break down. Without a narrative to put issues into perspective, your disaster planning document becomes a write-only memory, holding little interest or failing to gain enough inspection from key stakeholders and contributors. Start naming names, and putting brand, product and individual risks into black and white, and you'll see how your carbon-based networks hold up when the fiber and copper ones are under duress.


Good commentary.

I have reservations about your last paragraph.

Perhaps I missed the point of it, but scenario-based planning is only a small percentage of what needs to be done for business continuity. Yes, write scripts for the obvious threats like a building fire or gas leak evacuation. But you also need to build in the flexibility to handle the unexpected. Narratives are really only good for exercising your plans.

A meeting point outside the building is only good for some evacuations. Many emergency management organizations will engage in a walk-out evacuation of an urban area, because if everyone runs to their vehicles they create irresolvable gridlock and trap themselves within the very area they are trying to escape.

Will the meeting point be safe from weather, hazmat, undesirable people assaulting your staff while they wait on the sidewalk…?

You can have critical processes, but you can’t have critical employees. If you have a single point of failure in the form of an employee, your plan has a huge hole.

The best plans are made up of a series of options for executing every critical process. The options should be selected by the individual employee at the time of disaster, since that’s the only time they’ll know the best option. The options should be exercised against the threats identified for the area by local emergency management and deemed appropriate by management. Everyone should automatically know what to do as a failsafe should communications be down. A frank discussion needs to be held with the appropriate customers to inform them of your plans. Your critical supply chain needs to be investigated and secured.

Posted by Bill Lang on March 02, 2008 at 06:17 AM EST


Hal Stern's thoughts on software, services, cloud computing, security, privacy, and data management

