January 19, 2006

Measuring floating point use in applications

Guest Author

mrbenchmark has written a blog entry on assessing floating point performance in codes using the hardware performance counters on the UltraSPARC-III. It's a nice detailed read, so I'm not going to repeat any of what's written there.

However, there's a bit more to be said on the topic - in particular, the technique used may overestimate the floating point content of some codes.

The best place to start is a test program:

double a[1024*1024], b[1024*1024];
void main()
{
  int i, j;
  for (i=0; i<1024*1024; i++) {a[i]=b[i]=0.0;}
  for (j=0; j<100; j++) for (i=0; i<1024*1024; i++) {a[i]+=b[i];}
  for (j=0; j<100; j++) for (i=0; i<1024*1024; i++) {a[i]*=b[i];}
}

The program does 100*1024*1024 (approximately 100 million) floating point adds and the same number of floating point multiplies. The next step is to compile and run the program.

$ cc fp.c
$ cputrack -c pic0=FA_pipe_completion,pic1=Instr_cnt -c pic0=Instr_cnt,pic1=FM_pipe_completion a.out |grep exit
32.687 1 exit 55477739 1104761007 # pic0=FA_pipe_completion,pic1=Instr_cnt
32.071 1 exit 1001979477 50793545 # pic0=Instr_cnt,pic1=FM_pipe_completion

The program has been run under cputrack which follows the process and counts the number of the specified events that occur. cputrack is set to count:

  • FA_pipe_completion events - instructions that use the floating point add pipe.
  • FM_pipe_completion events - instructions that use the floating point multiply pipe.
  • Instr_cnt - the total instruction count

[If you want to find out more about the available performance counters on the UltraSPARC hardware, then the manuals are available.]

Since cputrack was asked to count two different sets of events, it alternates between them, so each set is only counted for about half the run - which means the reported figures undercount the true totals by roughly a factor of two (assuming the events are evenly distributed over the run). Given this, the total number of instructions that used the floating point add pipe can be estimated as 111 million (55,477,739*2), and the total number that used the floating point multiply pipe as about 102 million (50,793,545*2). The total number of instructions was 2.1 billion (1,104,761,007 + 1,001,979,477). So the floating point content is about 10% ((111+102)/2100).
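
To make that arithmetic concrete, here's a minimal sketch (mine, not part of the original analysis) that applies the factor-of-two correction for the counter multiplexing and works out the floating point content from the raw cputrack figures above:

#include <stdio.h>

int main(void)
{
  /* Raw counts from the two cputrack runs shown above. */
  double fa_pipe    =   55477739.0;   /* FA_pipe_completion (counted ~half the run) */
  double fm_pipe    =   50793545.0;   /* FM_pipe_completion (counted ~half the run) */
  double instr_set1 = 1104761007.0;   /* Instr_cnt from the first counter set       */
  double instr_set2 = 1001979477.0;   /* Instr_cnt from the second counter set      */

  double fa_total    = fa_pipe * 2.0;           /* ~111 million add pipe instructions      */
  double fm_total    = fm_pipe * 2.0;           /* ~102 million multiply pipe instructions */
  double instr_total = instr_set1 + instr_set2; /* ~2.1 billion instructions in total      */

  printf("Floating point content: %.1f%%\n",
         100.0 * (fa_total + fm_total) / instr_total);
  return 0;
}

Which prints a figure of around 10%, matching the estimate above.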

The floating point percentage is artificially low because the code was not compiled with any optimisation - so there were plenty of instructions which could have been optimised out.

Now, I've been careful to write "floating point add pipeline", and there is a good reason for that: it is not only floating point add and multiply instructions that go down the floating point pipes - VIS instructions do too. You may think that your code does not use any VIS instructions, but unfortunately most codes do use them at some point. Here's a simple example.

#include <strings.h>
char a[10*1024*1024], b[10*1024*1024];
void main()
{
  int i;
  for (i=0; i<100; i++) {bcopy(a,b,10*1024*1024);}
}

And here it is running under cputrack:

$ cc mov.c
$ cputrack -c pic0=FA_pipe_completion,pic1=Instr_cnt -c pic0=Instr_cnt,pic1=FM_pipe_completion a.out | grep exit
1.098 1 exit 73756539 212490199 # pic0=FA_pipe_completion,pic1=Instr_cnt
1.791 1 exit 164927147 0 # pic0=Instr_cnt,pic1=FM_pipe_completion

So this little code generated roughly 148 million (73,756,539*2) instructions that used the floating point add pipe, out of a total of 377 million instructions - nearly 40% of all the instructions executed! Now this degree of disturbance is unusual (or contrived), but there are many codes out there that do spend significant time in memcpy-type functions, so care has to be taken when using this approach.

The next obvious question is whether it is possible to check for this. Of course, the thing to do is to fire up the Performance Analyzer from Sun Studio 11, and use that to gather a profile.

$ collect a.out
Creating experiment database test.1.er ...
$ er_print -limit 3 -func test.1.er
Functions sorted by metric: Exclusive User CPU Time
Excl.      Incl.      Name
User CPU   User CPU
  sec.       sec.
1.321      1.321      _memcpy
1.321      1.321      <Total>
0.         1.321      main

From the profile it is possible to be pretty certain that most of the events come from _memcpy. To be sure, take a look at the disassembly of the hot code. To do this it is necessary to run er_print interactively in order to pick the correct version of _memcpy from libc_psr.

$ er_print test.1.er
(er_print) dis _memcpy
Available name list:
0) Cancel
1) (unknown)(/usr/lib/libc.so.1)`_memcpy
2) (unknown)(/usr/platform/sun4u-us3/lib/libc_psr.so.1)`_memcpy
3) (unknown)(/usr/lib/ld.so.1)`_memcpy
Enter selection: 2
Source file: (unknown)
Object file: /usr/platform/sun4u-us3/lib/libc_psr.so.1
Load Object: /usr/platform/sun4u-us3/lib/libc_psr.so.1
...
0. 0. [?] 720: ldd [%o1], %f2
0. 0. [?] 724: faligndata %f12, %f14, %f44
0. 0. [?] 728: ldd [%o1 + 8], %f4
0. 0. [?] 72c: faligndata %f14, %f0, %f46
0.090 0.090 [?] 730: stda %f32, [%o0] 0xf0
0. 0. [?] 734: ldd [%o1 + 16], %f6
0. 0. [?] 738: faligndata %f0, %f2, %f32
0.100 0.100 [?] 73c: inc 64, %o0
0. 0. [?] 740: ldd [%o1 + 24], %f8
0. 0. [?] 744: faligndata %f2, %f4, %f34
0. 0. [?] 748: ldd [%o1 + 32], %f10
0. 0. [?] 74c: faligndata %f4, %f6, %f36
0. 0. [?] 750: ldd [%o1 + 40], %f12
0. 0. [?] 754: dec 64, %o2
0. 0. [?] 758: faligndata %f6, %f8, %f38
0.100 0.100 [?] 75c: ldd [%o1 + 48], %f14
0. 0. [?] 760: faligndata %f8, %f10, %f40
0.030 0.030 [?] 764: ldd [%o1 + 56], %f0
## 1.001 1.001 [?] 768: prefetch [%o1 + 384], #one_read
0. 0. [?] 76c: faligndata %f10, %f12, %f42
0. 0. [?] 770: cmp %o2, 72
0. 0. [?] 774: bgu,pt %icc,0x720
0. 0. [?] 778: inc 64, %o1
...

So the reason for all the floating point add pipeline activity is the faligndata VIS instruction.

Well, that's the hypothesis, which can be checked by reading the processor documentation. Many people do not want to go that far, and there is an alternative approach - use the Analyzer to sample based on the hardware performance counters:

$ collect -h FA_pipe_completion,,FM_pipe_completion a.out
Creating experiment database test.2.er ...
$ er_print -limit 3 -func test.2.er
Functions sorted by metric: Exclusive FA_pipe_completion Events
Excl.               Incl.               Excl.               Incl.               Name
FA_pipe_completion  FA_pipe_completion  FM_pipe_completion  FM_pipe_completion
Events              Events              Events              Events
131072040           131072040           0                   0                   <Total>
131000768           131000768           0                   0                   _memcpy
71272               71272               0                   0                   collector_final_counters

Using this approach it is clearly visible that all the floating point add pipeline events come in from the _memcpy routine. It is even possible to check which instructions are generating them:

0 0 [?] 720: ldd [%o1], %f2
0 0 [?] 724: faligndata %f12, %f14, %f44
0 0 [?] 728: ldd [%o1 + 8], %f4
0 0 [?] 72c: faligndata %f14, %f0, %f46
3000012 0 [?] 730: stda %f32, [%o0] 0xf0
0 0 [?] 734: ldd [%o1 + 16], %f6
0 0 [?] 738: faligndata %f0, %f2, %f32
3000015 0 [?] 73c: inc 64, %o0
0 0 [?] 740: ldd [%o1 + 24], %f8
0 0 [?] 744: faligndata %f2, %f4, %f34
## 30000207 0 [?] 748: ldd [%o1 + 32], %f10
0 0 [?] 74c: faligndata %f4, %f6, %f36
## 27000189 0 [?] 750: ldd [%o1 + 40], %f12
0 0 [?] 754: dec 64, %o2
0 0 [?] 758: faligndata %f6, %f8, %f38
## 33000231 0 [?] 75c: ldd [%o1 + 48], %f14
0 0 [?] 760: faligndata %f8, %f10, %f40
0 0 [?] 764: ldd [%o1 + 56], %f0
## 35000114 0 [?] 768: prefetch [%o1 + 384], #one_read
0 0 [?] 76c: faligndata %f10, %f12, %f42
0 0 [?] 770: cmp %o2, 72
0 0 [?] 774: bgu,pt %icc,0x720
0 0 [?] 778: inc 64, %o1

Going back to the first case, we can also profile that:

Functions sorted by metric: Exclusive FA_pipe_completion Events
Excl.               Excl.               Name
FA_pipe_completion  FM_pipe_completion
Events              Events
104857640           104857600           <Total>
104000312           104000312           main
857328              857288              collector_final_counters

And then dig down to find where in the code the events are happening:

...
0 0 [?] 10c2c: sll %i5, 3, %o0
0 0 [?] 10c30: add %o0, %o5, %i3
0 0 [?] 10c34: ldd [%i3], %f6
## 104000312 0 [?] 10c38: ldd [%o0 + %o4], %f4
0 0 [?] 10c3c: faddd %f6, %f4, %f4
0 0 [?] 10c40: std %f4, [%i3]
0 0 [?] 10c44: inc %i5
0 0 [?] 10c48: cmp %i5, %o3
0 0 [?] 10c4c: bl 0x10c2c
0 0 [?] 10c50: nop
...
0 0 [?] 10c90: sll %i5, 3, %o0
0 0 [?] 10c94: add %o0, %o5, %i3
0 0 [?] 10c98: ldd [%i3], %f6
## 0 104000312 [?] 10c9c: ldd [%o0 + %o4], %f4
0 0 [?] 10ca0: fmuld %f6, %f4, %f4
0 0 [?] 10ca4: std %f4, [%i3]
0 0 [?] 10ca8: inc %i5
0 0 [?] 10cac: cmp %i5, %o3
0 0 [?] 10cb0: bl 0x10c90
0 0 [?] 10cb4: nop
...

A quick summary of the points:

  • The floating point event counters count more than just floating point arithmetic, so check the profile of any application that has unexpectedly high counts.
  • The VIS instructions that go down the floating point add pipeline are used in memcpy, so be suspicious when there are substantially more floating point add pipe events than floating point multiply pipe events (see the sketch after this list).
  • Use the Performance Analyzer to dig down to exactly where the events are happening.
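
As a closing illustration, here's a minimal sketch (my own, with an arbitrarily chosen 4x threshold) of that second point - flagging a run where the add pipe count looks suspiciously large relative to the multiply pipe count:

#include <stdio.h>

/* If the add pipe count greatly exceeds the multiply pipe count, the add
   pipe events may largely be VIS instructions from memcpy/bcopy rather
   than real floating point arithmetic - profile before trusting them. */
static void check_fp_counts(double fa_events, double fm_events)
{
  if (fm_events == 0.0 || fa_events / fm_events > 4.0) {
    printf("Suspicious: %.0f FA pipe events vs %.0f FM pipe events\n",
           fa_events, fm_events);
  } else {
    printf("FA and FM pipe events look balanced\n");
  }
}

int main(void)
{
  check_fp_counts(147513078.0, 0.0);          /* the bcopy test program          */
  check_fp_counts(110955478.0, 101587090.0);  /* the floating point test program */
  return 0;
}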
