Downtime and consequences

As you probably already know if you read my blog, I'm responsible for the high-volume websites at Sun: www.sun.com, java.sun.com, java.com, and about 40 sites and services in total. My group owns the architecture, design, development, deployment, and day-to-day running of the services.

We even own the infrastructure: we worked with our IT team (the web group at Sun is based in the Marketing organization) to choose an appropriate vendor for hosting these core, critical websites and services. We chose 365Main in San Francisco, and a tremendous amount of work went into that decision. Part of the choice was based on their incredible infrastructure - an amazing facility built in an old Army tank turret factory (really!).

Well, two weeks ago (July 24th), as you've probably already heard, we lost power at the facility. All told, power was off for about 45 minutes. There's a long description of what happened at the facility, but suffice it to say, it was very unpleasant.

The worst part was that during the power "outage" there were actually fluctuations in the power, causing our servers to start and then crash again halfway through recovery - six separate times. Ick. Servers 50% of the way through an fsck got punted over and over.

So, once primary power was restored, we began working on the servers in earnest. The order of recovery was a little in flux - we've been adding SANs and databases at a furious rate, and those systems take a bit longer to come back than the raw machines do. As a result, some machines came up with no data, causing apps to spin. The first apps came back as soon as power was restored; others took longer. What was most impressive was how the team responded - working together on a conference call, just getting it done. With over 160 machines in the cage, it was a scramble. 95% of the applications were restored to functionality within 3 hours (some not at full capacity, but the functionality was there), and all servers were recovered in about 5 hours. Most impressive was the fact that there weren't any hardware failures.

I sent one of my engineers (thanks, Dan) to the facility just to check on things, and he said it was "interesting": ours was the only cage in the whole colo without engineers inside on laptops. Pretty funny - but we've always built to run remote. One of Sun's hardware strong points is the ability to do Lights Out Management.

Some might ask, "Does Sun have a backup datacenter?" Yes, we do - and we were down to the wire on activating the backup plan. I'm glad I didn't have to!

Comments:

No UPS? Or aren't any of the services regarded as Tier One?

Posted by Geoff Arnold on August 11, 2007 at 01:16 PM PDT #

Actually, if you review the design of 365Main, you'll find they use a "next generation" continuous power system (from Hitec). These are *never* supposed to drop, and they always run at N+2. Anyway, think of it as UPS plus plus.

Posted by was on August 11, 2007 at 01:44 PM PDT #

Regarding fsck vs. ZFS - we run ZFS on some of the sites, but it's not as though we can swap all the sites and services over to a completely new architecture overnight. As we upgrade OS versions, we take advantage of new features like ZFS. Some of our "beta" sites are actually running Nevada (OpenSolaris) bits, and I've even heard of a few trying out Xen bits...

Posted by was on August 11, 2007 at 01:49 PM PDT #

Thanks for the info, Will. I have trouble with the argument from 365Main, though. I understand there was a bug in a component that impacted the startup of the backup system; I don't understand why this was not discovered _before_ the emergency. - eduard/o

Posted by eduardo pelegri-llopart on August 12, 2007 at 01:16 AM PDT #

About

I run the engineering group responsible for Sun.com and the high volume websites at Sun.

Will Snow
Sr. Engineering Director
