Saturday Mar 12, 2005

Friday - just a day late.

At Home

Friday is curry night - mmm! The dogs are tired. Finn seems to grow bigger every day; he is only 5 months old but he is already heavier and taller at the shoulder than Cleo, who is a year older. If you want to see some Samoyeds have a peek here: Samoyeds. To avoid causing a pet war, here are the cats too: two-cats

At Work

So I got to the bottom of the busy cpu. That machine had a massive kernel device tree: "prtconf -v | grep nstance | wc -l" came out at about 30,000. Looking through the prtconf -v output we see that 80% of the device nodes are "driver not attached", i.e. nodes without a driver instance bound to them. Surprisingly they are all for the same driver - time to investigate what is in that driver's .conf file..
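If you just want to count the unattached nodes, grep can do the tally directly (a slight variant of the command above, not the exact one I ran):

# prtconf -v | grep -c "driver not attached"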

So what is the problem with having a big kernel device tree? Well, many utilities need to grab a snapshot of the tree in order to be able to match driver names/numbers/instances. On this machine iostat was one contributor and the other was the event gatherer for Sun's SRS product. Grabbing the snapshot involves walking the in-kernel tree and copying it into a snapshot buffer; from there it can be copied out from the kernel to userland.
This is not a lightweight operation - the less often we make the kernel execute it, the more cpu cycles are available for the application to run its own instructions.
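For the curious, userland gets at that snapshot through libdevinfo. Here is a minimal sketch of what such a consumer looks like - illustrative only, it assumes the standard di_init()/di_fini()/di_walk_node() interfaces and is not what iostat actually runs:

/* Take a full snapshot of the kernel device tree, count the nodes,
 * and release it. Each call to di_init() makes the kernel walk and
 * copy the whole tree - the expensive part described above.
 * Build (assumed): cc snapshot.c -ldevinfo */
#include <stdio.h>
#include <libdevinfo.h>

static int
count_node(di_node_t node, void *arg)
{
        (*(int *)arg)++;
        return (DI_WALK_CONTINUE);
}

int
main(void)
{
        di_node_t root;
        int nodes = 0;

        if ((root = di_init("/", DINFOCPYALL)) == DI_NODE_NIL) {
                perror("di_init");
                return (1);
        }
        (void) di_walk_node(root, DI_WALK_CLDFIRST, &nodes, count_node);
        (void) printf("%d device nodes in the snapshot\n", nodes);
        di_fini(root);
        return (0);
}

On the machine above that count would come back in the tens of thousands, and every utility that calls di_init() pays for the walk each time.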

To work out which process was issuing the ioctl, I could have trussed every application until I found the right one, but this machine has thousands of processes. So we used the tnf probes built into the kernel; luckily there are probes at the entry and return points for syscalls. So we turned those on and looked for the lines with system call number 54 (grep ioctl /etc/name_to_sysnum).

# prex -k
Type "help" for help ...
prex> buffer alloc 10m
Buffer of size 10485760 bytes allocated
prex> enable name=/syscall/
prex> trace name=/syscall/
prex> list probes name=/syscall/
name=syscall_end enable=on trace=on file=../../intel/ia32/os/syscall.c line=649 funcs=<no value>
name=syscall_start enable=on trace=on file=../../intel/ia32/os/syscall.c line=364 funcs=<no value>
prex> ktrace on
#wait a bit
prex> ktrace off
prex> quit
# tnfxtract /tmp/1.tnf
# tnfdump /tmp/1.tnf > /tmp/1.txt
# pg /tmp/1.txt  (and let's look for syscall 54)
---------------- ---------------- ----- ----- ---------- --- ------------------------- ------------------------
    Elapsed (ms)       Delta (ms)   PID LWPID    TID     CPU Probe Name                Data / Description . . .
---------------- ---------------- ----- ----- ---------- --- ------------------------- ------------------------
        0.113180         0.013207   719     1 0xffffffff839138a0   1 syscall_start             sysnum: 54
        0.113962         0.000782   719     1 0xffffffff839138a0   1 syscall_end               rval1: 0 rval2: 0 errno: 0
better dealloc that 10m buffer
# prex -k
prex> buffer dealloc
prex> quit
#

So it was pid 719; all we need now is a ps -efl..

In Between

0 miles in the smart, but I had to fill it up with petrol - the massive amount of 16 litres. So that was 180 miles for 16 litres (about 3.5 imperial gallons), which is only 50 mpg. Letting the automatic mode change gear normally gets about 65 mpg, but this tankful I have been using the manual mode as I thought I could do a better job. I will switch back, as the point of getting a smart car was a cheap way to get to work.

15 miles on the bike.


tim.

Thursday Mar 10, 2005

Comic Relief - red nose day tomorrow.

At Home
=======
On a Thursday morning my wife goes out for coffee with her friend(!), not because she
likes coffee but because she is in the house all the time. She gave up her job as a nurse
when we had kids and now as the kids get older we have two dogs. Finn as a puppy is quite
restrictive as he can't be left for long - his teeth are coming out so he wants to chew and 
chew. So I work from home on a Thursday morning.

Work! not this morning. I walked the mutts in the park and everything was good, they behaved well
until Cleo found some fox muck! Just imagine a white fluffy Samoyed rolling around on her back in fox muck!
She was plastered in it and the smell - bad! She got bathed in the garden when we got home so she is
white again; it just takes a couple of days for her to dry out.

This evening when I returned home I discovered they had been rolling around in the muddy garden 
and my white dog is a nice muddy brown again...

At Work
=======
I have a customer's machine where occasionally a cpu goes to 100% system time for a few seconds,
and curiosity means I want to know why. As the box is running solaris 9 I can't use dtrace, so we have to
resort to more brute-force techniques. The question is which technique? kstat? tnf? lockstat? mdb -k?
live savecore? forced crash dump?

In the end I think the easiest will be one window running mpstat 1, and in another running
# lockstat -I -i 977 -s 30 -p sleep 2 > /tmp/file

lockstat in this mode samples each cpu 977 times a second using a level 14 interrupt.
Each sample contains the kernel stack trace. Lockstat then aggregates samples with the same
stack by keeping a count of each unique stack it sees on each cpu.

The -p for parseable output means each aggregated stack/count is spat out on one line.

So if you ran this on a single cpu machine and added up the number of samples for each stack in
column two of the output, the total should be (1 * 977 * 2). If you have 100 cpus then the addition should equal
(100 * 977 * 2).

So this allows you to do time-based kernel profiling with a bit of awk. I like to grab a number
of short samples, say 1 or 2 seconds, rather than run lockstat for 30 seconds or so and have to tune the
record number etc. It's only statistics!
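As a sanity check on a capture - hedged, it assumes the per-stack sample count really is in column two of the parseable output as described above:

# awk '{ total += $2 } END { print total, "samples" }' /tmp/file

For a 2 second run on an N cpu machine that total should come out close to N * 977 * 2.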

So tomorrow's job is to process all those lines..

I also looked at a crash dump from a cluster machine - good job it was a cluster! The customer made a good
decision there. Our machines are very reliable but they are not fault tolerant; you need to do the math(s)
and work out if a single machine can provide your required availability. In the depths of our nice new building
is a lab, and at the back are some fault tolerant machines that Sun built and sold following the IMP purchase.
The machines were designed by some very, very clever people and had loads of patents, but they were a little
bit expensive; it was the cluster technology that stopped them selling outside the telco market. Take two or three
or more cheaper machines, run some smart software and get acceptable service availability numbers.

It was interesting, as the cause of the crash was using a bogus value as a kernel thread pointer. By bogus
I mean it looked like a real address with an extra bit set - must be hardware, I thought. But by
looking a bit more I found that the bogus value was correct; it was just that we had read it from
a bogus address. By bogus address I mean we thought it was a held kernel mutex but it wasn't;
a bit more digging and eventually we got to the problem. The moral is: keep looking. In fact over
the last XX years of staring at crash dumps very few have proved undiagnosable in the end.

I saw the first "help with writing dtrace" escalation today.

In Between
==========
15 miles in the smart car, fuel gauge is saying 4 litres left.
0 miles on the bike, I'll ride tomorrow honest!

tim

Wednesday Mar 09, 2005

weary wednesday.

At Home
========

Finished getting the tiles off the old bathroom walls - better go buy a new bathroom suite now!
Dogs dug a big hole in the garden - I could hire them out as earth movers the speed they can dig at!
Both kids at school! Both barred from the internet game Runescape as another "player" was asking them for
personal details; they both know not to supply any, but better safe than sorry. I knew it was a good decision
to have the only internet'ed computer in the lounge. It's quite funny to watch them using JDS; they seem quite
at home with it, unlike CDE which they struggled with.

At work
=======

Still looking at sparc cpu performance counters. I have looked at a couple of applications that
walked through large in-memory arrays looking for flags/pointers, which tests memory speed. Luckily on
Sparc the designers thought about this and introduced a software prefetch instruction that allows you to
start a memory fetch into the prefetch cache in the background. This is used in the USIII/USIV block copy code to
get data arriving into the cpu caches just as it is needed. The art is to predict how far ahead of your
current memory location you need to prefetch so it arrives just in time. Get it there too early and it will
be evicted by someone else; too late and you have wasted bandwidth.

So today's code marched through large memory arrays, prefetching at different offsets ahead of the loop
reading the array. On the 490 I was using, the optimum distance ahead was only 256 bytes - I thought
it would be more, so I need to experiment on a range of our hardware.

With no prefetch instruction the 1gb array was read with a 64 byte stride in 4.95 seconds.
With prefetch at 256 bytes ahead of the reader the time was down to 1.29 seconds - not bad for 15
extra instructions! This is a very tiny little example, but if you have large arrays to examine, or
you are walking long linked lists (prefetch instructions have no side effects, so you can prefetch ->next without
worrying whether it is going to be NULL), this technique could be useful..
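To make the shape of the test concrete, here is a rough sketch of the kind of loop I mean. It assumes the sparc_prefetch_read_many() macro from the Sun compilers' sun_prefetch.h (gcc users would reach for __builtin_prefetch() instead); it is illustrative rather than my exact test program:

/* March through a big buffer one cache line at a time, issuing a
 * prefetch AHEAD bytes in front of the current read so the data is
 * already in the cache by the time the loop gets there. */
#include <stdlib.h>
#include <sun_prefetch.h>       /* assumed: sparc_prefetch_read_many() */

#define STRIDE  64              /* one cache line per read */
#define AHEAD   256             /* the distance that worked best on the 490 */

long
march(const char *buf, size_t len)
{
        long total = 0;
        size_t i;

        for (i = 0; i + AHEAD < len; i += STRIDE) {
                sparc_prefetch_read_many((void *)(buf + i + AHEAD));
                total += buf[i];
        }
        return (total);
}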

It's sort of a hard concept that the cpu is running the same total number of instructions; it's just
wasting fewer cycles, as the number of pipeline stalls due to lack of data is reduced. This is talked
about in the papers on chip multithreading, where the chip can fill dead patches by executing instructions
that have data. It should be possible to produce an efficiency figure for software that relates to the
ratio of instructions executed to cycles - sounds like something that the pic counters can measure....

So setting pic0 to Cycle_cnt and pic1 to Instr_cnt, and using a prefetch interval of 256, I see it takes
1246685232 cycles to execute 620888367 instructions, or almost exactly 2 cycles per instruction.

So let's do it without prefetch at all; the prefetch instruction is replaced with a harmless or.

4907025843 cycles to execute 620922314 instructions, which is 7.90 cycles per instruction. Wow!
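For the record, those counts can be gathered from the shell with cputrack - a hedged example, where ./march is a made-up name for the test program and the counter names are as reported by cpustat -h on this cpu:

# cputrack -c pic0=Cycle_cnt,pic1=Instr_cnt ./march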

here is the magic instruction..

asm("prefetch [%i0], 0");

but if you use the compiler suite then there are some header files and macros to
use prefetch more easily.. Some of the best application performance gains have come from
using the latest compilers and allowing them to optimise for the hardware. A program compiled up
on a US1 machine many years ago will run just fine on a brand new USIV based machine, but it
won't go as fast as it could. Sun uses this in the libc_psr.so libraries: there is a nice simple generic
block copy in libc, but the runtime linker looks for the platform specific filter library to get
the go-faster version for the specific platform you are on. I haven't seen this used in any applications,
but the mechanism is there.

In Between
==========

15 miles in smart car, going to need some petrol soon; legs ached after yesterday's cycling.
0 miles on the bike, and none tomorrow as I have to work from home.

tim

Tuesday Mar 08, 2005

monday/tuesday

home
====
one child off sick, two sleepy dogs after some long walks at the weekend. If you live near Farnham
in Surrey, UK and you see two fluffy white dogs that's probably us.

Work
====
Explained again that the "%b" column in iostat is not a measure of how busy a storage device is. It is the
percentage of the sample time for which there is at least one i/o outstanding in the HBA driver/disk. It is a
measure of how "bursty" your i/o load is. The storage device will perform best when the i/o is spread out
evenly over the sample time; conversely it will perform less well if all the i/o is dropped on it at the same instant.
Worst of all is when the burst of i/o exceeds the target driver's throttle and i/os are queued on the target driver's
waitq. There they linger until an i/o comes back from the disk and then they get to start; as far as the layer above
is concerned that i/o took 1.5 times as long as it should. So the warning sign is a low %b and a wsvc_t > 0: somehow
you need to smooth out the workload, or disperse it across more drives, or check you haven't set the target driver's
throttle too low.
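A rough way to watch for that pattern from the shell - hedged, it assumes wsvc_t and %b are the 7th and 10th fields of iostat -xn output, and the thresholds are only illustrative:

# iostat -xn 5 | awk '$7 > 1 && $10 < 20 { print }'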

collect and analyzer and friends are the performance analysis parts of the compiler suite. I only learnt today that
you can merge profiling runs with hardware counters. It took me a while to work out how to use the h/ware counters, but
it is easy: you pick the two h/ware counters that you want the cpu to count, and you specify the number
of increments you want before a signal is sent to your process. So say you wanted to be notified on every dtlb miss
and to see which functions use a lot of instructions, then you could use..

collect -p off -h Instr_cnt,200000,DTLB_miss,1 ls -al /tmp

those two counter names come from /usr/sbin/cpustat -h

so what can I see from this..

 er_print -pcs test.1.er
test.1.er: Experiment has warnings, see header for details

Objects sorted by metric: Exclusive Instr_cnt Events

Excl.      Incl.      Excl.      Incl.       Name
Instr_cnt  Instr_cnt  DTLB_miss  DTLB_miss
Events     Events     Events     Events
....
      0          0      2          2         _ndoprnt + 0x000003D0
      0          0      2          2         _ndoprnt + 0x000020EC
      0          0      0          3         _ndoprnt + 0x00002898
      0          0      0          5         _rt_boot + 0x00000088
      0          0      0          5         _setup + 0x000003AC
      0          0      0          2         _start + 0x000000AC
      0    1400067      0         53         _start + 0x00000108
      0     200017      0         35         _start + 0x00000110
      0     200008      0         17         _ti_bind_clear + 0x00000050
      0          0      0          1         atexit + 0x0000001C
      0     200017      0         33         atexit_fini + 0x00000068
      0          0      0          3         call_fini + 0x000000D4
      0          0      0          5         call_fini + 0x000000DC
      0     200017      0         25         call_fini + 0x00000104
      0          0      0          5         call_init + 0x000001B0
      0          0      0          3         collector_close_experiment + 0x00000090
      0          0      0          5         collector_init + 0x000000DC
      0          0      0          1         collector_sigprof_sigaction + 0x00000108
      0     200021      0          1         dcgettext + 0x000000A0
      0     400018      0         27         do_exit_critical + 0x000000C8
      0          0      1          1         doformat + 0x00000A18
      0          0      0          1         doformat + 0x00000A2C
      0          0      1          1         doformat + 0x000015A8
..
      0    1400067      0         47         main + 0x00000A04
...

I have a lot more work to do to really understand what it is displaying. The benefit over cputrack
is that you "should" be able to find the pc causing the most data cache misses etc. and then look at changing the
code/adding prefetches to reduce the stalls and get the instruction execution rate up.

in between
==========
 0 miles in the smart
30 miles on the bike. Nearly hit the back of a lotus - the highly skilled driver overtook me 50 feet from a
junction, pulled in front and braked hard for the junction - great, thanks!

tim

Friday Mar 04, 2005

Great it's Friday!

home
====

Walked one child to school with the mutts; they like the snow so much they barked all the way.
Seven people in about a mile suggested that I needed a sledge..... it would have been fun if
I'd had one.
The other one is off sick.

Fighting with trying to copy an old audio tape onto CD on the x86 box.
I can use audiorecord and then audioplay and it sounds okay, but if I use cdrw to
blow the cd (type aur) then it comes out all soprano.. everything says it is 44.1kHz - more
like 64kHz!

# audiorecord -s 44.1k -elinear  -c 2 -v 90   /var/tmp/out.au
# file /var/tmp/out.au
/var/tmp/out.au:        audio data: 16-bit linear PCM, stereo, 44100 Hz
# cdrw -a -Taur /var/tmp/out.au
Looking for CD devices...
Initializing device...done.
Writing track 1...done.
done.
Finalizing (Can take several minutes)...done.

Yet it sounds like Pinky and Perky!

Gave up and went for a curry - my wife lets me have a curry on Friday night!

work
====
Stared at a crash dump - solaris 8, didn't make much progress, hopefully I will get
some inspiration on Monday.
Used a preload library to show that an application was the cause of the scsi error messages:
I interposed on ioctl(), and if the command was USCSICMD (man uscsi) we print out the
command and use printstack() (man walkcontext); when real_ioctl() returns we print out
the status. Absolute proof, if they needed it.. Of course on solaris 10 it would have been
about 20 lines of D and a much quicker answer for the customer. A sketch of that sort of interposer is below.
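This is roughly the shape of it - a minimal sketch using the usual dlsym(RTLD_NEXT) trick, not the exact library I used, with error handling left out:

/* preload.c - interpose on ioctl() and log USCSICMD requests.
 * Build (assumed): cc -G -Kpic -o preload.so preload.c
 * Run:             LD_PRELOAD=./preload.so application */
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <ucontext.h>             /* printstack() - see walkcontext(3C) */
#include <sys/scsi/impl/uscsi.h>  /* USCSICMD, struct uscsi_cmd */

int
ioctl(int fd, int request, ...)
{
        static int (*real_ioctl)(int, int, ...) = NULL;
        va_list ap;
        void *arg;
        int rv;

        if (real_ioctl == NULL)
                real_ioctl = (int (*)(int, int, ...))dlsym(RTLD_NEXT, "ioctl");

        va_start(ap, request);
        arg = va_arg(ap, void *);
        va_end(ap);

        if (request == USCSICMD) {
                struct uscsi_cmd *uc = arg;

                (void) fprintf(stderr, "USCSICMD fd=%d cdb[0]=0x%x\n",
                    fd, uc->uscsi_cdb[0] & 0xff);
                printstack(2);    /* whose stack issued it? */
        }

        rv = (*real_ioctl)(fd, request, arg);

        if (request == USCSICMD)
                (void) fprintf(stderr, "USCSICMD returned %d\n", rv);

        return (rv);
}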
Code reviewed an sd target driver fix, looks good. The process for introducing changes to
solaris kernel code is quite heavyweight, but this is what has led to the quality of the
kernel. Back in the 2.5 days crashes and hangs were easy to diagnose - not any more! Now they are
subtle timing races or deadlocks involving chains of threads.


in between
==========
another 15 miles in the smart ( watched a loon speed in the snow - straight into a fence..)
another 0 miles for the bike...mind you there was an inch of snow.

tim
        
    

Thursday Mar 03, 2005

stand back!

Time for a blog. 

Home 
====
Walked the two samoyeds ( white fluffy siberian dogs) in the sleet - nice for them, not for me. They had a good time
playing with their friends Boots ( English springer), a visiting Bulldog and a couple of Beagles in Farnham Park. They managed
to tip me over today; the sun has melted the frozen ground a little, and when they saw a deer before I did - crash!
Brushed the two bengal cats ( spotty cats that look like little leopards in the Sun light), put plasters on the scratches!
Took tiles off the bathroom wall with a chisel - strangely satisfying! Plastered holes where the skim coat had come off.
Must remember Mother's day on Sunday.

Work
====
Looked at 2 crashdumps, one solaris 6. The internal code name was 297, as in 1997; when you have to go back and look at it
it is primitive. I have no urge to force folks to upgrade if they are happy, but debugging it is painful.
Looking at the effects of using large page sizes ( ppgsz) to reduce dtlb misses.
Helping colleagues to get a twiki server performing better - libumem was the big win ( so far). Thing to do - look
at some way of caching the perl libraries; we used "truss -d -D" and a load of dtrace to check the timings.
Trying to interpose on a local function in ld.so.1 (what a fascinating man page ld.so.1 is) - limited success.
Did a web based training course on inclusion.
Wrote up a performance-related customer visit, lots of avenues to explore. I suspect that the changes in 117171-17
will be of benefit as they have a large 15k type machine.
My own (cheap) laptop runs solaris 10 now; it ran JDS on Suse linux before. It's faster on 10 and seems to do everything
that it did before. I was caught by the fact that gtkam needs you to do an add_drv ( man ugen) for your camera before
it can pick up the device. Sound seems to be working, and Xorg X worked perfectly. If anyone can get a battery monitor to work
on a Toshiba Satellite I'd be grateful.. I have a Broadcom based pcmcia wireless card and that works! Now I get to work in the evening
whilst my wife surfs the Web on our dual opteron based pc ( solaris 10 amd64 - just so fast!).



In between
==========
15 miles Smart Car - filled tank with 14 pounds of petrol.
0  miles Cycling   - I should have cycled ( trek 5200)  but I was late after walking the mutts, and it's cold, 2 degrees.

tim.
