By pcr on Sep 02, 2005
My job at Sun is to assist our customers with the architecture, design and implementation of complex systems. Naturally I am reluctant to mention customer names lest there be legal or public relations issues. However, I spent a large part of last year preparing a customer in New Orleans for Business Continuity and Disaster Recovery, and now those plans have been executed. I will be oblique about the precise details, but two large systems integrators are helping one of our government agencies implement a large Enterprise Resource Planning (ERP) application with a web front end, application servers in the middle tier and a very large database on the back end. The task was so large that I worked extensively with one of our partners, Mr. Chip Elmblad of Sub2 Technology Consulting. Many computers and users from around the world feed this system. Many other systems are attached to the core of the ERP system and perform functions like reporting and ad hoc queries. Hopefully this diagram will help you get the picture.
The servers are located in a building overlooking Lake Pontchartrain, very near this location on Google Maps. This view is Google's very cool hybrid satellite photo overlaid with street names. If you have seen some of the news coverage, you will notice that one of the levee breaches was on the left side of this photo, and the computers are housed in several buildings on the right side of the photo. Here is a closer view of the buildings themselves (note that the address is not accurate; it's just there to zoom in on the buildings). While I was on site, various people told me that sea level was at the 3rd floor of these buildings, and one of my friends reported that as of Tuesday there were 10 feet of water in the buildings. I would call this a disaster.
As an aside, when working for a government agency, we don't want to do 'Disaster Recovery,' we do 'Continuous Operations.' Perhaps we are shying away from the negative connotations of the word 'disaster.' Last year we tested the system when Hurricane Ivan grazed New Orleans but did not do much damage. We did invoke the process of failing over to the remote computers and all of the core systems worked well. I think some people in New Orleans looked back to last year and thought, 'We were OK last year during Hurricane Ivan so I'm not going to bother to evacuate this year.'
To describe some of the technical details of continuous operations: our partners set up similar hardware in a location near Memphis, TN to match the web servers, application servers, database servers and associated servers in New Orleans. The customer uses EMC as their storage vendor, so we used EMC's SRDF (Symmetrix Remote Data Facility) to replicate the data from New Orleans to Memphis. (Note that this could also be accomplished with Hitachi storage and TrueCopy.) At specific times each day, a consistent image of the updated database pages, together with the associated files on other servers, is shipped to the remote site. Our SLA (Service Level Agreement) was to be at most 12 hours behind in data replication. Perhaps this diagram can help illustrate some of the complexity of the data replication process.
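To make the 12-hour figure concrete, here is a minimal sketch of how one might check replication lag against that SLA. Only the 12-hour limit comes from the text above; the function names and the timestamps are made up for illustration, and this is not how SRDF itself reports its state.

```python
from datetime import datetime, timedelta

# From the SLA described above: the remote copy may be at most 12 hours behind.
MAX_LAG = timedelta(hours=12)


def replication_lag(last_consistent_copy: datetime, now: datetime) -> timedelta:
    """Return how far the remote site is behind the primary site."""
    return now - last_consistent_copy


def within_sla(last_consistent_copy: datetime, now: datetime,
               max_lag: timedelta = MAX_LAG) -> bool:
    """True if the newest consistent image at the remote site is recent enough."""
    return replication_lag(last_consistent_copy, now) <= max_lag


if __name__ == "__main__":
    # Hypothetical timestamps, not taken from the real system.
    last_copy = datetime(2005, 1, 1, 6, 0)
    print(within_sla(last_copy, datetime(2005, 1, 1, 17, 0)))  # 11 hours behind
    print(within_sla(last_copy, datetime(2005, 1, 1, 19, 0)))  # 13 hours behind
```

In practice the timestamp of the last consistent image would come from whatever the replication tooling records, and a check like this would feed an alert rather than a print statement.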
Naturally the network at the remote site cannot use the same host names and IP addresses as the primary site, because of name space requirements. It simply takes careful planning and adjustments to control and configuration files to make certain that the remote site applications behave like those at the primary site. The setup must be tested, preferably under realistic scenarios, to verify that all aspects of the system and network function properly. The core of this ERP system worked very well both last year and this week, but certain associated systems had trouble establishing connectivity with the remote site.
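As an illustration of the kind of configuration adjustment described above, here is a minimal sketch that rewrites primary-site host names and IP addresses to their remote-site equivalents in the text of a configuration file. The mapping, the host names and the addresses are all hypothetical (they use reserved example domains and documentation address ranges), and the real systems were adjusted by hand with careful planning, not by this script.

```python
import re

# Hypothetical primary-site -> remote-site mapping; not the real names or addresses.
HOST_MAP = {
    "db01.primary.example.com": "db01.remote.example.com",
    "app01.primary.example.com": "app01.remote.example.com",
    "192.0.2.15": "198.51.100.15",
}


def translate_config(text: str, host_map: dict) -> str:
    """Replace every primary-site name or address in text with its remote twin."""
    # Try longer names first so a short entry never rewrites part of a longer one.
    names = sorted(host_map, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookarounds keep 192.0.2.15 from matching inside 192.0.2.150
    # and keep a host name from matching inside a longer host name.
    pattern = re.compile(r"(?<![\w.-])(?:" + alternatives + r")(?![\w-])")
    return pattern.sub(lambda match: host_map[match.group(0)], text)


if __name__ == "__main__":
    primary_config = (
        "db.host=db01.primary.example.com:1521\n"
        "listener=192.0.2.15\n"
        "unrelated=192.0.2.150\n"
    )
    print(translate_config(primary_config, HOST_MAP))
```

A mapping kept in one place like this also gives a checklist for testing: every entry on the right-hand side should resolve and accept connections at the remote site before a failover is declared ready.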
To my friends Rene, John, Dana, Jarmaine, Linda, Brian, Jeanne, Trey, Marc, Jim, Eddie, Matt and all others, our prayers are with you. I hope your homes are OK.