By timatworkhomeandinbetween on Mar 14, 2005
Tired dogs, tired people after a busy weekend. I spent Saturday taking my daughter to all her activities: ballet, horse riding. Horse riding is such a technical hobby; how she manages to remember all the details and not fall off (too often) is amazing.
I am always astounded at the varying quality of "escalations" I get to interfere in. They vary from works of art, where the problem is presented in fine detail with reasoned debate and supporting material, to throwaway statements like "it goes slow". What is "it"? How is it measured? Does it ever go fast? When did it start to go slow, and what changed at that date?
Over the last XX years I have kept coming across the Kepner Tregoe rational process. Our part of Sun has adopted this methodology as an excellent way of documenting problems as well as speeding up finding the solution. I always think of it as procedural common sense: the format is logical, the answers steer you either to relevant questions or towards the solution, and as a by-product irrelevant information can be tested and discarded.
So many times you get information like "the machine started to panic on the 15th November and has panicked since", and you just have to ask "what happened on the night of the 14th November?" Ah, we applied a patch! Or you ask "do you have a similar machine with a similar workload that does not have the fault?" and hear "yes, the other twelve work perfectly", so you have to ask "so what is the difference between the working twelve and the ailing one?"
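That last question can be sketched as a small comparison. This is only an illustration of the idea and not part of the Kepner Tregoe material; the function name, the attribute names and the inventory data are all made up for the example.

```python
def distinguishing_attributes(ailing, working):
    """Return attributes where the ailing machine differs from every working one.

    `ailing` is a dict of attribute -> value; `working` is a list of such dicts.
    An attribute is reported only if no working machine shares the ailing value.
    """
    suspects = {}
    for key, bad_value in ailing.items():
        good_values = {machine.get(key) for machine in working}
        if bad_value not in good_values:
            suspects[key] = (bad_value, sorted(good_values, key=str))
    return suspects


# Hypothetical inventory data, invented purely for illustration.
working = [
    {"os_release": "9", "patch_level": "A", "memory_gb": 8},
    {"os_release": "9", "patch_level": "A", "memory_gb": 16},
]
ailing = {"os_release": "9", "patch_level": "B", "memory_gb": 8}

print(distinguishing_attributes(ailing, working))
# {'patch_level': ('B', ['A'])}
```

Anything the ailing machine shares with at least one healthy machine drops out, which is the "irrelevant information can be tested and discarded" part in action.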
If there is one way to get your call into Sun dealt with quickly, it is to use this methodology when interacting with our service organisation. There must be other similar methodologies, so my apologies to their supporters.
I stared at a crash dump today that matches a few reports each year in Sunsolve from the customer base. They all report one-off panics on a variety of hardware, all alleged to be fixed by a hardware intervention. This brings up the question of how long you need to wait after a fix has been attempted before you declare success. With my meagre grip of statistics, the time is reduced the greater the number of failures; something to do with standard deviation? So for these customers, having one failure after 3 years, we probably have to wait about 10 or 12 years before we can say that the problem is fixed by a hardware change. A mentor of mine had an interesting theory that it might be better to take a few more hits so as to reduce the standard deviation of the fault interval. This would then reduce the post-fix testing time dramatically, allowing success to be declared much earlier. I suspect a software race condition myself...
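A rough way to put a number on that waiting time, under an assumption of my own rather than anything taken from the crash data: treat the panics as arriving at random with a constant average rate (a Poisson process). If the fix did nothing, the chance of seeing no panic for t years is exp(-t / mean interval), so you wait until that chance is small enough to dismiss. The function name and the 95% confidence figure below are illustrative choices only.

```python
import math


def quiet_time_needed(mean_interval_years, confidence=0.95):
    """Failure-free time needed before believing a fix, under a Poisson model.

    Assumes failures arrive independently at a constant average rate
    (an assumption for illustration, not a measured property of the fault).
    If the fix did nothing, P(no failure in t) = exp(-t / mean_interval).
    We solve for the t at which that probability drops to 1 - confidence.
    """
    if mean_interval_years <= 0:
        raise ValueError("mean interval must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return -math.log(1.0 - confidence) * mean_interval_years


# One panic in 3 years: take 3 years as a (very rough) mean interval.
print(round(quiet_time_needed(3.0), 1))         # about 9 years
# A fault that bites every month on average can be signed off far sooner.
print(round(quiet_time_needed(1.0 / 12.0), 2))  # about 0.25 years
```

Under this simple model the wait is roughly three times the mean interval between failures, which lands in the same region as the 10 or 12 years guessed above. With only one failure on record the mean interval itself is barely known, so the real uncertainty is larger still.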
15 miles in the smart car, today's excuse - too tired.
0 miles on the bike, I'll try really hard tomorrow.