Flirtin' With Disaster
By stern on Mar 02, 2008
Historically, data center management saw disaster recovery and business continuity as reactions to physical events: force majeure of nature, building inaccessibility, threats or acts of terror, or major infrastructure failures like network outages or power blackouts. Increased stress on the power grid and near-continous construction in major cities increases the risks of the last two, but they are still somewhat contained, physical events that prompt physical reactions: spin up the redundant data center, fail over the services, and get up and running again, ideally without having missed shipments, deliveries or customer interactions.
Business continuity today has those physical events as table stakes only. The larger, more difficult problems are network service failure (due to denial of service attacks or failure of a dependent service), geographic restriction (due to pandemic fears, public transportation failures, or risk management), data disclosure and privacy, and the overall effect on brand and customer retention. What if you can't get people into an office, or have to react to an application failiure that results in customer, partner or supposedly anonymous visitor information being disclosed? Welcome to this decade's disasters.
Where does virtualization fit in? Quite well, actually. Virtualization is a form of abstraction; it "hides" what's under the layer addressed by the operating system (in terms of a hypervisor) or the language virtual machine (in the case of an interpreted language). But it's critical that virtualization be used as a tool to truly drive location and network transparency, not just spin up more operating system copies. I never worry about the actual data center containing my mail server, because I only see it through an IMAP abstraction. It could move, failover, or even switch servers, operating systems and IMAP implementations, and I'd never know. Virtualization gains in importance for business continuity because it drives the discussion of abstraction: what services are seen by what users, where and how on which networks?
Bottom line: Business continuity planning shares several common themes with systemic security design. There's self-preservation, the notion that a system should survive after one or more failures, even if they are coincident or nested. The least privilege design philosophy ensures that each process, administrator or user is given the minimum span of control to perform a task; in security this limits root privileges while in BC planning it ensures that you don't give conflicting directions regarding alternate data center locations. Compartmentalization drives isolation of systems that may fail in dependent ways, and helps prevent cascaded failures, and proportionality helps guide investment into areas where there is perceived risk. The short form of proportionality is to not spend money on rapid recovery from risks that would have other, far-reaching effects on your business anyway. My co-author Evan Marcus used to joke that it was silly to build a data center recovery plan for a potential Godzilla attack, because if that happens we have other, larger issues to deal with. On the other hand, if you saw Cloverfield, there's a lot of infrastructure that people depend upon even when monsters are eating Manhattan.
The best planning is to write out a narrative of what would happen should your business continuity plan go into effect: script out the disaster or event that causes your company to act, and write up press releases, decision making scenarios, and some plausible risk-adjusted actions, and follow the actions out to their conclusion. If you don't have a prescribed meeting place for a building evacuation, and there's no system for employees to check in and validate their safety, then your business continuity plan may suffer when you have to scramble to find a critical employee. When disasters happen, the entire electronic and physical infrastructures are unusually stressed, and normal chains of communication break down. Without a narrative to put issues into perspective, your disaster planning document becomes a write-only memory, holding little interest or failing to gain enough inspection from key stakeholders and contributors. Start naming names, and putting brand, product and individual risks into black and white, and you'll see how your carbon-based networks hold up when the fiber and copper ones are under duress.