How many times can we learn the same lesson?

Seriously - I spent 2 hours on Friday debugging an issue that we've seen many times before, and that's annoying. It turned out that GlassFish, by default, has a really terrible configuration for running as a server (particularly under load).

An app was deployed and tests were run, but the issue never showed up, because the automated testing doesn't preserve sessions for long enough to provoke the nasty behavior (at least I think that's why the load testing never saw an issue).

It took @creechy and me about 20 minutes of wandering around the system to figure out what was wrong, and 5 minutes to fix it. Not having start/stop scripts was another problem (and one I'm fixing today).

I guess it's hard, as the organization grows, to make sure the institutional knowledge is shared. Most of the lead engineers know to just send out a "has anyone deployed X on Y? Any special things I should do?" message. We've done so much of this so many times that several of the engineers can make the changes in their sleep, and that's a problem when a new engineering group or new engineer joins the staff. Transferring that knowledge is next to impossible: you have to know to ask the question or you just don't get the right info. Wikis are great for documenting what you've done, but they tend to get out of date pretty fast, and new versions of software come out with different config issues.

We'll do better next time - but I hate the pain of downtime on a production site!

About

I run the engineering group responsible for Sun.com and the high volume websites at Sun.

Will Snow
Sr. Engineering Director
