Finished getting the tiles off the old bathroom walls - better go buy a new bathroom suite now!
Dogs dug a big hole in the garden - I could hire them out as earth movers the speed they can dig at!
Both kids at school! Both barred from the internet game Runescape as another "player" was asking them
for personal details. They both know not to supply any, but better safe than sorry. I knew it was a good decision
to have the only internet-connected computer in the lounge. It's quite funny to watch them using JDS, they seem quite
at home with it, unlike CDE which they struggled with.
Still looking at sparc cpu performance counters. I have looked at a couple of applications that
walked through large in-memory arrays looking for flags/pointers; this tests memory speed. Luckily on
Sparc the designers thought about this and introduced a software prefetch instruction that lets you
start a memory fetch into the prefetch cache in the background. This is used in the USIII/USIV block copy code to
get data arriving into the cpu caches just as it is needed. The art is to predict how far ahead of your
current memory location you need to prefetch so it arrives just in time. Get it there too early and it will
be evicted by someone else; too late and you have wasted bandwidth.
So today's code marched through large memory arrays, prefetching at different offsets ahead of the loop
reading the array. On the 490 I was using, the optimum distance ahead was only 256 bytes - I thought
it would be more, so I need to experiment on a range of our hardware.
With no prefetch instruction the 1GB array was read with a 64 byte stride in 4.95 seconds.
With prefetch at 256 bytes ahead of the reader the time was down to 1.29 seconds - not bad for 15
extra instructions! This is a very tiny little example, but if you have large arrays to examine, or
you are walking long linked lists (prefetch instructions have no side effects, so you can prefetch ->next without
worrying if it is going to be NULL), this technique could be useful.
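The linked-list case is worth a sketch too - again with __builtin_prefetch standing in for the SPARC instruction, and a made-up node layout. Because the prefetch has no side effects, ->next can be handed to it without a NULL check:

```c
#include <stddef.h>

/* Hypothetical node layout - the point is the prefetch pattern. */
struct node {
    struct node *next;
    long payload;
};

long list_sum(struct node *p)
{
    long sum = 0;
    while (p != NULL) {
        /* Start fetching the next node while we work on this one.
         * Prefetching NULL cannot fault, so no check is needed. */
        __builtin_prefetch(p->next, 0 /* read */, 0);
        sum += p->payload;
        p = p->next;
    }
    return sum;
}
```

One node ahead rarely hides a full memory latency, which is part of why pointer chasing is so much harder to speed up than a plain array walk.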
It's sort of a hard concept that the cpu is running the same total number of instructions; it's just
wasting fewer cycles, as the number of pipeline stalls due to lack of data is reduced. This is talked
about in the papers on chip multithreading, where the chip can fill dead patches by executing instructions
that have data. It should be possible to produce an efficiency figure for software that relates to the
ratio of instructions executed to cycles - sounds like something that the pic counters can measure....
So setting pic0 to Cycle_cnt and pic1 to Instr_cnt, and using a prefetch interval of 256, I see it takes
1246685232 cycles to execute 620888367 instructions, or almost exactly 2 cycles per instruction.
So let's do it without prefetch at all; the prefetch instruction is replaced with a harmless or.
4907025843 cycles to execute 620922314 instructions, which is 7.90 cycles per instruction. Wow!
Here is the magic instruction:
asm("prefetch [%i0], 0");
But if you use the compiler suite then there are some header files and macros to
use prefetch more easily. Some of the best application performance gains have been through
using the latest compilers and allowing them to optimise for the hardware. A program compiled up
on a US1 machine many years ago will run just fine on a brand new USIV based machine, but it
won't go as fast as it could. Sun uses this in the libc_psr.so libraries: there is a nice simple generic
block copy in libc, but the runtime linker looks for the platform-specific filter library to get
the go-faster version for the specific platform you are on. I haven't seen this used in any applications,
but the mechanism is there.
15 miles in the smart car, going to need some petrol soon; legs ached after yesterday's cycling.
0 miles on the bike, and none tomorrow as I have to work from home.