Downtime and consequences
Posted on Aug 11, 2007
As you probably already know if you read my blog, I'm responsible for the high-volume websites at Sun: www.sun.com, java.sun.com, java.com, and about 40 sites and services in total. My group owns the architecture, design, development, deployment and day-to-day running of the services.
We even own the infrastructure - we worked with our IT team (the web group at Sun is based in the Marketing organization) to choose an appropriate vendor for hosting these core/critical websites and services. We chose 365Main in San Francisco, and a tremendous amount of work went into that choice, which was based on many factors. Part of it came down to their incredible infrastructure: an amazing facility built on an old Army tank turret factory (really!).
Well, 2 weeks ago (July 24th), as you've probably already heard, we lost power at the facility. All told, power was off for about 45 minutes. There's a long description of what happened at the facility, but suffice it to say it was very unpleasant.
The worst part was that during the power "outage" the power was actually fluctuating, causing our servers to start and then crash again halfway through recovery. 6 separate times. Ick. Servers 50% through an fsck got punted over and over.
So, once primary power was restored, we began working on the servers in earnest. The order of recovery was a little in flux - we've been adding SANs and databases at a furious rate, and those systems take a bit longer to come back than the raw machines. So some machines came up with no data, causing apps to spin, etc. The first apps came back as soon as power was restored; others took longer.
What was most impressive was how the team responded - working together on a conference call, just getting it done. With over 160 machines in the cage, it was a scramble. 95% of the applications were restored to functionality within 3 hrs (some not at full capacity, but the functionality was there), and all servers had recovered in about 5 hrs. Just as impressive, there weren't any hardware failures.
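The ordering problem described above (machines booting before the SANs and databases they depend on) can be sketched as a dependency graph that is walked in topological order. This is only a minimal illustration, not the tooling used during the recovery; the service names and the dependency map are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (an assumption for illustration only):
# each service lists the services that must be up before it starts.
DEPENDS_ON = {
    "san": [],
    "database": ["san"],
    "app-server": ["database"],
    "web-frontend": ["app-server"],
}


def recovery_order(depends_on):
    """Return the services in an order where every dependency comes first."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, name in enumerate(recovery_order(DEPENDS_ON), start=1):
        print(step, name)
```

Bringing services up in this order means an application only starts once its storage is back, so it does not sit spinning on missing data.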
I sent one of my engineers (thanks Dan) to the facility just to check on things, and he said it was "interesting". Ours was the only cage in the whole CoLo without engineers inside on laptops. Pretty funny - but we've always built to run remote. One of the strong points of Sun's hardware is the ability to do Lights Out Management.
Some might ask, "Does Sun have a backup datacenter?" Yes, we do, and we were down to the wire on activating the backup plan. I'm glad I didn't have to!