January 19, 2006

# Measuring floating point use in applications

Guest Author

mrbenchmark has written a blog entry on assessing floating point performance in codes using the hardware performance counters on the UltraSPARC-III. It's a nice detailed read, so I'm not going to repeat any of what's written there.

However, there's a bit more to be said on the topic - in particular, the technique used may overestimate the floating point content of some codes.

The best place to start is a test program:

```c
double a[1024*1024], b[1024*1024];

void main()
{
  int i, j;

  for (i = 0; i < 1024*1024; i++) { a[i] = b[i] = 0.0; }

  for (j = 0; j < 100; j++)
    for (i = 0; i < 1024*1024; i++) { a[i] += b[i]; }

  for (j = 0; j < 100; j++)
    for (i = 0; i < 1024*1024; i++) { a[i] *= b[i]; }
}
```

The program does 100*1024*1024 (approx 100 million) each of floating point adds and floating point multiplies. The next step is to compile and run the program.

```
$ cc fp
$ cputrack -c pic0=FA_pipe_completion,pic1=Instr_cnt -c pic0=Instr_cnt,pic1=FM_pipe_completion a.out | grep exit
 32.687   1       exit  55477739 1104761007 # pic0=FA_pipe_completion,pic1=Instr_cnt
 32.071   1       exit 1001979477  50793545 # pic0=Instr_cnt,pic1=FM_pipe_completion
```

The program has been run under cputrack which follows the process and counts the number of the specified events that occur. cputrack is set to count:

• FA_pipe_completion events - instructions that use the floating point add pipe.
• FM_pipe_completion events - instructions that use the floating point multiply pipe.
• Instr_cnt events - the total instruction count.

[If you want to find out more about the available performance counters on the UltraSPARC hardware, then the manuals are available.]

Since cputrack was asked to count two different sets of events, it alternates between them, so each set is active for only about half of the run, and the reported counts are roughly half the true totals (assuming the events are distributed evenly over the run). Given this, the total number of instructions that used the floating point add pipe can be estimated as 111 million (55,477,739*2), and the total number that used the floating point multiply pipe as 102 million (50,793,545*2). Because Instr_cnt appears in both sets, summing its two counts gives the full-run total of 2.1 billion instructions (1,104,761,007 + 1,001,979,477). So the floating point content is about 10% ((111+102)/2100).

The floating point percentage is artificially low because the code was not compiled with any optimisation - so there were plenty of instructions which could have been optimised out.

Now, I have been careful to write "instructions that use the floating point add pipe" rather than "floating point adds", and there is a good reason for that: it is not only floating point add and multiply instructions that go down the floating point pipes - VIS instructions do too. You may think that your code does not use any VIS instructions, but unfortunately most codes do use them at some point. Here's a simple example.

```c
#include <strings.h>  /* for bcopy() */

char a[10*1024*1024], b[10*1024*1024];

void main()
{
  int i;

  for (i = 0; i < 100; i++) { bcopy(a, b, 10*1024*1024); }
}
```

And here it is running under cputrack:

```
$ cc mov.c
$ cputrack -c pic0=FA_pipe_completion,pic1=Instr_cnt -c pic0=Instr_cnt,pic1=FM_pipe_completion a.out | grep exit
  1.098   1       exit  73756539 212490199 # pic0=FA_pipe_completion,pic1=Instr_cnt
  1.791   1       exit 164927147         0 # pic0=Instr_cnt,pic1=FM_pipe_completion
```

So this little code generated about 148 million (73,756,539*2) instructions that used the floating point add pipe, out of a total of 377 million instructions (212,490,199 + 164,927,147) - nearly 40% of the total number of instructions! Now this degree of disturbance is unusual (or contrived), but there are many codes out there that do spend significant time in memcpy-type functions, so care has to be taken when using this approach.

The next obvious question is whether it is possible to check for this. Of course, the thing to do is to fire up the Performance Analyzer from Sun Studio 11, and use that to gather a profile.

```
$ collect a.out
Creating experiment database test.1.er ...
$ er_print -limit 3 -func test.1.er
Functions sorted by metric: Exclusive User CPU Time
Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
1.321     1.321      _memcpy
1.321     1.321      <Total>
0.        1.321      main
```

From the profile it is possible to be pretty certain that most of the events come from _memcpy. To be sure, take a look at the hot disassembly. To do this it is necessary to run er_print interactively, in order to pick the correct version of _memcpy from libc_psr.

```
$ er_print test.1.er
(er_print) dis _memcpy
Available name list:
      0) Cancel
      1) (unknown)(/usr/lib/libc.so.1)`_memcpy
      2) (unknown)(/usr/platform/sun4u-us3/lib/libc_psr.so.1)`_memcpy
      3) (unknown)(/usr/lib/ld.so.1)`_memcpy
Enter selection: 2
Source file: (unknown)
Object file: /usr/platform/sun4u-us3/lib/libc_psr.so.1
Load Object: /usr/platform/sun4u-us3/lib/libc_psr.so.1
...
   0.        0.                 [?]      720:  ldd         [%o1], %f2
   0.        0.                 [?]      724:  faligndata  %f12, %f14, %f44
   0.        0.                 [?]      728:  ldd         [%o1 + 8], %f4
   0.        0.                 [?]      72c:  faligndata  %f14, %f0, %f46
   0.090     0.090              [?]      730:  stda        %f32, [%o0] 0xf0
   0.        0.                 [?]      734:  ldd         [%o1 + 16], %f6
   0.        0.                 [?]      738:  faligndata  %f0, %f2, %f32
   0.100     0.100              [?]      73c:  inc         64, %o0
   0.        0.                 [?]      740:  ldd         [%o1 + 24], %f8
   0.        0.                 [?]      744:  faligndata  %f2, %f4, %f34
   0.        0.                 [?]      748:  ldd         [%o1 + 32], %f10
   0.        0.                 [?]      74c:  faligndata  %f4, %f6, %f36
   0.        0.                 [?]      750:  ldd         [%o1 + 40], %f12
   0.        0.                 [?]      754:  dec         64, %o2
   0.        0.                 [?]      758:  faligndata  %f6, %f8, %f38
   0.100     0.100              [?]      75c:  ldd         [%o1 + 48], %f14
   0.        0.                 [?]      760:  faligndata  %f8, %f10, %f40
   0.030     0.030              [?]      764:  ldd         [%o1 + 56], %f0
## 1.001     1.001              [?]      768:  prefetch    [%o1 + 384], #one_read
   0.        0.                 [?]      76c:  faligndata  %f10, %f12, %f42
   0.        0.                 [?]      770:  cmp         %o2, 72
   0.        0.                 [?]      774:  bgu,pt      %icc,0x720
   0.        0.                 [?]      778:  inc         64, %o1
...
```

So the reason for all the floating point add pipeline activity is the faligndata VIS instruction.

Well, that's the hypothesis, which can be checked by reading the processor documentation. Many people do not want to go that far, and there is an alternative approach - use the Analyzer to sample based on the hardware performance counters:

```
$ collect -h FA_pipe_completion,,FM_pipe_completion a.out
Creating experiment database test.2.er ...
$ er_print -limit 3 -func test.2.er
Functions sorted by metric: Exclusive FA_pipe_completion Events
Excl.               Incl.               Excl.               Incl.                Name
FA_pipe_completion  FA_pipe_completion  FM_pipe_completion  FM_pipe_completion
Events              Events              Events              Events
131072040           131072040           0                   0                    <Total>
131000768           131000768           0                   0                    _memcpy
    71272               71272           0                   0                    collector_final_counters
```

Using this approach it is clearly visible that all the floating point add pipe events come from the _memcpy routine. It is even possible to check which instructions are generating them:

```
          0            0                        [?]      720:  ldd         [%o1], %f2
          0            0                        [?]      724:  faligndata  %f12, %f14, %f44
          0            0                        [?]      728:  ldd         [%o1 + 8], %f4
          0            0                        [?]      72c:  faligndata  %f14, %f0, %f46
    3000012            0                        [?]      730:  stda        %f32, [%o0] 0xf0
          0            0                        [?]      734:  ldd         [%o1 + 16], %f6
          0            0                        [?]      738:  faligndata  %f0, %f2, %f32
    3000015            0                        [?]      73c:  inc         64, %o0
          0            0                        [?]      740:  ldd         [%o1 + 24], %f8
          0            0                        [?]      744:  faligndata  %f2, %f4, %f34
## 30000207            0                        [?]      748:  ldd         [%o1 + 32], %f10
          0            0                        [?]      74c:  faligndata  %f4, %f6, %f36
## 27000189            0                        [?]      750:  ldd         [%o1 + 40], %f12
          0            0                        [?]      754:  dec         64, %o2
          0            0                        [?]      758:  faligndata  %f6, %f8, %f38
## 33000231            0                        [?]      75c:  ldd         [%o1 + 48], %f14
          0            0                        [?]      760:  faligndata  %f8, %f10, %f40
          0            0                        [?]      764:  ldd         [%o1 + 56], %f0
## 35000114            0                        [?]      768:  prefetch    [%o1 + 384], #one_read
          0            0                        [?]      76c:  faligndata  %f10, %f12, %f42
          0            0                        [?]      770:  cmp         %o2, 72
          0            0                        [?]      774:  bgu,pt      %icc,0x720
          0            0                        [?]      778:  inc         64, %o1
```

Going back to the first case, we can also profile that:

```
Functions sorted by metric: Exclusive FA_pipe_completion Events
Excl.               Excl.                Name
FA_pipe_completion  FM_pipe_completion
Events              Events
104857640           104857600            <Total>
104000312           104000312            main
   857328              857288            collector_final_counters
```

And then dig down to find where in the code the events are happening:

```
...
           0                   0                [?]    10c2c:  sll         %i5, 3, %o0
           0                   0                [?]    10c30:  add         %o0, %o5, %i3
           0                   0                [?]    10c34:  ldd         [%i3], %f6
## 104000312                   0                [?]    10c38:  ldd         [%o0 + %o4], %f4
           0                   0                [?]    10c3c:  faddd       %f6, %f4, %f4
           0                   0                [?]    10c40:  std         %f4, [%i3]
           0                   0                [?]    10c44:  inc         %i5
           0                   0                [?]    10c48:  cmp         %i5, %o3
           0                   0                [?]    10c4c:  bl          0x10c2c
           0                   0                [?]    10c50:  nop
...
           0                   0                [?]    10c90:  sll         %i5, 3, %o0
           0                   0                [?]    10c94:  add         %o0, %o5, %i3
           0                   0                [?]    10c98:  ldd         [%i3], %f6
##         0           104000312                [?]    10c9c:  ldd         [%o0 + %o4], %f4
           0                   0                [?]    10ca0:  fmuld       %f6, %f4, %f4
           0                   0                [?]    10ca4:  std         %f4, [%i3]
           0                   0                [?]    10ca8:  inc         %i5
           0                   0                [?]    10cac:  cmp         %i5, %o3
           0                   0                [?]    10cb0:  bl          0x10c90
           0                   0                [?]    10cb4:  nop
...
```

A quick summary of the points:

• The floating point event counters count more than just floating point events, so check the profile of applications which have unexpectedly high counts.
• The VIS instructions that go down the floating point add pipeline are used in memcpy, so be suspicious when there are substantially more of this kind of event than the floating point multiply pipeline event.
• Use the Performance Analyzer to dig down to exactly where the events are happening.

