Disaster Averted (aka Attack of the POD people!)

Have you ever had one of those jobs that you knew was really important, but no one else seemed to know it?  Or, have you ever had one of those jobs where no one seemed to notice all the great work you'd been doing until something went terribly wrong?

Inside the Connected Systems Network there is a quiet little piece of infrastructure called Patch Operations and Delivery (POD for short).  POD is the key part of the pipeline that internal engineers at Sun use to push out patches and updates to products like Solaris and the Java Enterprise System.  It isn't directly visible to customers, but it's the back end of important services like Update Connection and SunSolve.  If POD were to suddenly go away, people would notice!

Well, that's just what happened recently.  POD had a whopper of a meltdown.  In one of those classic IT mishaps, a sleepy little UltraSPARC-II system in a non-production lab had crept into the transaction flow for the production database.  It had been running fine for years, but then there was a failure in the storage array attached to the machine!

Without going into all the details of what transpired (there were repair scripts written, data pulled from mirrors across the planet, and people working through the night and over the weekend), I'm happy to report that POD is back to full strength.  There were some serious heroics involved in this incredibly delicate fix.  Here's a quick list of people who deserve a real thank you.

  • Jan Birkelund
  • Slim Heilpern
  • Darcy O'Connor
  • Simon Ip
  • Philippe Nave
  • Derk Norton
  • Mike Tanaka
  • Janet Bacon
  • Don Gritzmacher
  • Darl Kuhn

In honor of you brave warriors of software, I give you Wired Magazine's List of the Coolest Movie Weapons.  These may come in handy next time you have to fight a problem this big!

Thanks again, gang!

Comments:

see : http://www.blastwave.org/dclarke/blog/?q=blog/1

Posted by Dennis Clarke on November 07, 2006 at 01:07 AM PST #

Hey Dennis. Thanks for the post. Yeah, it seems these things can run forever sometimes.

Posted by Steve Wilson on November 07, 2006 at 12:34 PM PST #

Thanks for posting this information and explaining the recent patch traffic jam.

As you are obviously familiar with the innards of the patch publishing process, maybe you can comment on my list of the most pressing problems with patch access on SunSolve. See my summary from 2006/11/08 on:

http://www.par.univie.ac.at/solaris/pca/news.html

Posted by Martin Paul on November 07, 2006 at 05:21 PM PST #

Frustration resulted from not being able to track the ticket (SSO#31467). Why did it take so long for the patches with zero length to get fixed on SunSolve when they were fine from patches.sun.com? I realize there was a meltdown and all, but I would think there was some better QA already built in to catch this kind of corruption.

Posted by Bruce Riddle on November 08, 2006 at 12:48 PM PST #

Hi Bruce. I don't think the problem you mention had anything to do with POD (the thing I mentioned was purely internal and patches already on SunSolve weren't affected). However, it sounds like you and Martin are both mentioning problems with downloading from SunSolve. I'll check into that and see if there's anything known going on. Thanks.

Posted by Steve Wilson on November 08, 2006 at 12:53 PM PST #

I've been talking to the SunSolve team about the questions raised on this thread. I think I have this figured out. I'll post something this week to let people know what's going on.

Posted by Steve Wilson on November 13, 2006 at 07:40 AM PST #

About

Thoughts on cloud computing, virtualization and data center management from Steve Wilson, Oracle engineering VP.
