I should have listened. And I typically do. Operations-wise, that is. Other things, not so much...
A couple of years ago I attended a PARC forum given by one of the top engineering guys at Yahoo. (Dang, I think he was the CTO, but looking through the PARC video archives, I can't find him.) Anyway, he had some really interesting things to say about watching the search and click trends of millions of people daily and trying to discover patterns. He was trying to see if he could find epidemics early via what people were doing on the web. It seemed to be working, until he realized that Britney Spears (or some other teen idol) had just mentioned the term he was tracking.
One of the most interesting things he said, and the reason for this post, was that he never ever deployed anything at Yahoo on fewer than two servers. They just didn't do it.
That recently bit me in the hind end... As you may have noticed, blogs.sun.com has been a bit twitchy... Way too much downtime for my liking. Well, as I mentioned, we kind of put the server up in a week - from scratch, not knowing the software or anything. I was in a meeting with the heads of blogging at Sun - Pat Chanezon, Tim Bray, Simon Phipps, Danese Cooper - and two VIPs - Jonathan Schwartz and John Fowler - when Sun took the official first steps toward blogging. Schwartz and Fowler said "go". So I took the nearest machine and my blogging/wiki expert hoffie and off we went. Note: there were LOTS of people at that meeting, so please forgive me for only focusing on a few.
Usually I'll deploy in a multi-tier architecture, but for several reasons that wasn't going to happen. We were near the end of the quarter, so we didn't have much time; we wanted to take advantage of being "top of mind" with the VIPs; and most importantly, the software wasn't written to be load balanced.
We dropped a single machine out there, and I figured that we'd be ok for a little while. Next thing I know, we have Schwartz blogging and Business Week writing articles about it! Holy moly!
Traffic spiked (seriously) like you can't believe. The only thing I believe at this point is that my BlackBerry couldn't be more annoying at 3 in the morning. Yeah, I was on call for the server. And I didn't do a great job at catching all the faults. Hoffie was out having a 5th kid... OK, his wife was having the kid, but dang!
Finally, last week I got the other machines in rotation, so if the primary server drops, we'll be able to survive for a while.
Sorry for all the downtime; realize that it was probably more painful for me than for you (heck, you weren't getting paged at 3AM). Doc Searls, please know that it was simple downtime. We've also disallowed pinging to our servers, so ping isn't a really good measure of whether a server is up (but a really good try!). Hey folks, if you want to read an informative blog, I'd suggest Doc's.
Operations notes: we took the machine down Friday afternoon for approx. 20 min. because we noticed it was down to 200k of free memory. We could have just added swap space, but we had been noticing a steady decrease in free memory that stopping and restarting the web servers wasn't clearing. It took a reboot to get the memory back. Free memory went from 200k to 7.5GB!!! The rest of the changes included tuning MySQL and the JVM for the application we use - rollerweblogger.
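A steady decrease in free memory like the one in these notes is the kind of trend a small script can flag before the 3AM page. Here is a minimal sketch, assuming free-memory readings (in KB) are already being collected at regular intervals by some other means. The function names, the sample readings, and the floor value are all hypothetical and not taken from the actual blogs.sun.com setup.

```python
from typing import Sequence


def is_steadily_declining(samples_kb: Sequence[int], min_samples: int = 4) -> bool:
    """Return True if every reading is lower than the one before it."""
    if len(samples_kb) < min_samples:
        return False  # too few readings to call it a trend
    pairs = zip(samples_kb, samples_kb[1:])
    return all(later < earlier for earlier, later in pairs)


def needs_attention(samples_kb: Sequence[int], floor_kb: int, min_samples: int = 4) -> bool:
    """Return True if the latest reading is under the floor,
    or if the readings have dropped at every step."""
    if not samples_kb:
        return False  # nothing collected yet
    below_floor = samples_kb[-1] < floor_kb
    return below_floor or is_steadily_declining(samples_kb, min_samples)


if __name__ == "__main__":
    # Made-up readings in KB, oldest first (assumption, for illustration only).
    readings = [900_000, 700_000, 450_000, 200_000]
    print(needs_attention(readings, floor_kb=100_000))
```

The check is deliberately crude: a single reading that ticks upward resets the "steady decline" signal, so in practice you would probably want a longer window or a slope estimate rather than a strict step-by-step comparison.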