On a Thursday morning my wife goes out for coffee with her friend(!), not because she
likes coffee but because she is in the house all the time. She gave up her job as a nurse
when we had kids, and now as the kids get older we have two dogs. Finn, as a puppy, is quite
restrictive as he can't be left for long - his teeth are coming out, so he wants to chew and
chew. So I work from home on a Thursday morning.
Work! Not this morning. I walked the mutts in the park and everything was good; they behaved well
until Cleo found some fox muck! Just imagine a white fluffy Samoyed rolling around on her back in fox muck!
She was plastered in it, and the smell - bad! She got bathed in the garden when we got home so she is
white again; it just takes a couple of days for her to dry out.
This evening when I returned home I discovered they had been rolling around in the muddy garden
and my white dog is a nice muddy brown again...
I have a customer's machine where occasionally a cpu goes to 100% system time for a few seconds,
and curiosity means I want to know why. As the box is running Solaris 9 I can't use dtrace, so we have to
resort to more brute-force techniques. The question is which technique: kstat? tnf? lockstat? mdb -k?
live savecore? forced crash dump?
In the end I think the easiest will be one window running mpstat 1, and in another running
# lockstat -I -i 977 -s 30 -p sleep 2 > /tmp/file
lockstat in this mode samples each cpu 977 times a second using a level 14 interrupt.
Each sample contains the kernel stack trace. Lockstat then aggregates samples with the same
stack by keeping a count of each unique stack it sees on each cpu.
The -p, for parseable output, means each aggregated stack/count is spat out on one line.
So if you ran this on a single cpu machine and added up the number of samples for each stack in
column two of the output, it should be (1 * 977 * 2). If you have 100 cpus then the sum should equal
(100 * 977 * 2).
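That sanity check is easy to script. A minimal sketch, assuming the parseable output really does carry the sample count in column two (a synthetic file stands in for the real /tmp/file here, since the exact output format is not shown above):

```shell
# Synthetic stand-in for lockstat -I -p output: one line per aggregated
# stack, with the sample count in column two (as described in the text).
printf 'cpu0 1900 stack_a\ncpu0 54 stack_b\n' > /tmp/lockstat.sum

# Sum column two; on a single cpu machine over "sleep 2" this should be
# close to 1 * 977 * 2 = 1954.
awk '{ total += $2 } END { print total " samples" }' /tmp/lockstat.sum
```

On a real run the total will only be approximately ncpus * 977 * seconds - as the author says, it's only statistics.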
So this allows you to do time-based kernel profiling with a bit of awk. I like to grab a number
of short samples, say 1 or 2 seconds each, rather than run lockstat for 30 seconds or so and have to tune the
record count etc. It's only statistics!
So tomorrow's job is to process all those lines..
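For the processing itself, one plausible sketch is to merge the counts for identical stacks across the sample files and sort the hottest to the top. This again assumes the count sits in column two with the stack in the remaining columns - an assumption, not a guaranteed description of the real lockstat -p format - and uses a synthetic file in place of the real output:

```shell
# Synthetic stand-in for one of the short lockstat -p sample files.
cat > /tmp/lockstat.sample <<'EOF'
cpu0 3 genunix`syscall_mstate unix`idle
cpu1 2 genunix`syscall_mstate unix`idle
cpu0 4 unix`mutex_enter
EOF

# Blank the first two fields so the remaining stack text keys an associative
# array, then print the total count per unique stack, hottest first.
awk '{ count = $2; $1 = ""; $2 = ""; stacks[$0] += count }
     END { for (s in stacks) printf "%d%s\n", stacks[s], s }' /tmp/lockstat.sample |
  sort -rn
```

Point the same pipeline at all the short sample files at once and the per-cpu, per-sample counts collapse into one profile.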
I also looked at a crash dump from a cluster machine - good job it was a cluster! The customer made a good
decision there. Our machines are very reliable but they are not fault tolerant; you need to do the math(s)
and work out if a single machine can provide your required availability. In the depths of our nice new building
is a lab and at the back are some fault tolerant machines that Sun built and sold following the IMP purchase.
The machines were designed by some very very clever people and had loads of patents, but they were a little
bit expensive; it was cluster technology that stopped them selling outside the telco market: take two or three
or more cheaper machines, run some smart software, and get acceptable service availability numbers.
It was interesting as the cause of the crash was using a bogus value as a kernel thread pointer. By bogus
I mean it looked like a real address with an extra bit set - must be hardware, I thought, but by
looking a bit more I found that the bogus value was correct; it was just that we read it from
a bogus address. By bogus address I mean we thought it was a held kernel mutex but it wasn't.
A bit more digging and eventually we got to the problem. The moral is: keep looking. In fact over
the last XX years of staring at crash dumps very few have proved undiagnosable in the end.
I saw the first "help with writing dtrace" escalation today.
15 miles in the smart car, fuel gauge is saying 4 litres left.
0 miles on the bike, I'll ride tomorrow, honest!