Author: Darryl Gove

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote that article, any on-CPU time that wasn't User time was grouped under System time. We now break this out in more detail, and more detail is good.

Here’s a similar bit of code:

#include <stdio.h>
static int i,j;
volatile double *d;
int main ()
{
  char a[10];
  d=(double*)&a[1];   /* deliberately misaligned: &a[1] is not 8-byte aligned */
  j=100000000;
  for (i=0;i < j; i++)
  {
    *d+=5.0;          /* every load and store of *d takes a misalignment trap */
  }
  printf("%f\n",*d);
  return 0;
}
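
A plausible way to build and profile the 32-bit version (assuming the Oracle Solaris Studio cc, collect, and er_print tools; the file name and the exact flags are my assumptions, not from the original article):

cc -g -O -m32 -o a.out align.c      # 32-bit build of the code above
collect ./a.out                     # records an experiment, e.g. test.1.er
er_print -disasm main test.1.er     # annotated disassembly of the hot loop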

This code stores into a misaligned double, and that’s all we need in order to generate misaligned traps and see how they are shown in the Performance Analyzer. Here’s the hot loop:

Load Object: a.out
Inclusive       Inclusive       
User CPU Time   Trap CPU Time   Name
(sec.)          (sec.)          
1.131           0.510               [?]    10928:  inc         4, %i1
0.              0.                  [?]    1092c:  ldd         [%i2], %f2
0.811           0.380               [?]    10930:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10934:  std         %f4, [%i2]
0.911           0.480               [?]    10938:  ldd         [%i2], %f2
1.121           0.370               [?]    1093c:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10940:  std         %f4, [%i2]
0.761           0.410               [?]    10944:  ldd         [%i2], %f2
0.911           0.410               [?]    10948:  faddd       %f2, %f0, %f4
0.010           0.                  [?]    1094c:  std         %f4, [%i2]
0.941           0.450               [?]    10950:  ldd         [%i2], %f2
1.111           0.380               [?]    10954:  faddd       %f2, %f0, %f4
0.              0.                  [?]    10958:  cmp         %i1, %i5
0.              0.                  [?]    1095c:  ble,pt      %icc,0x10928
0.              0.                  [?]    10960:  std         %f4, [%i2]

The first thing to notice is that Trap time is now reported separately rather than being aggregated into System time. This is useful because trap time is intrinsically different from system time, so it deserves to be displayed on its own. The new overview screen also highlights trap time, so it’s easy to spot when it’s worth investigating.

Now, you should be familiar with the “previous instruction is to blame” rule for interpreting the output from the Performance Analyzer. Traps are no different: the time reported on an instruction is due to the trap taken by the instruction before it. So the final load in the loop (the ldd at 0x10950) accounts for about 1.1s of user time and 0.38s of trap time, reported against the faddd at 0x10954 that follows it.

A slight side track about the “blame the previous instruction” rule. For misaligned accesses the problem instruction traps and its action is emulated by the trap handler, so the next instruction executed is the one following the misaligned access; that’s why the time is attributed to the following instruction. However, there are situations where an instruction is retried after the trap completes. In those cases the next instruction executed is the one that caused the trap, so the time lands on the trapping instruction itself. TLB misses and save/restore instructions are examples of this.

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

Exclusive       
User CPU Time   Name
(sec.)          
3.002           <Total>
2.882           __misalign_trap_handler
0.070           main
0.040           __do_misaligned_ldst_instr
0.010           getreg
0.              _start
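
For reference, the 64-bit run above could be produced along the same lines (again, everything other than the -xmemalign=8i flag mentioned above is my assumption):

cc -g -O -m64 -xmemalign=8i -o a.out align.c
collect ./a.out
er_print -functions test.1.er       # function list as shown above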

For 64-bit code the misaligned operations are fixed up in user-land, in __misalign_trap_handler. One (unexpected) advantage of this is that you can look at the call sites of the trap handler and identify exactly where the misaligned memory operations are:

0.480           main + 0x00000078
0.410           main + 0x0000006C
0.380           main + 0x00000060
0.370           main + 0x00000088
0.310           main + 0x0000005C
0.270           main + 0x00000068
0.260           main + 0x00000074

This is really useful if there are a number of misaligned access sites and your objective is to fix them in order of performance impact.
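
Once the hot sites are known, the usual fixes are either to align the data properly or to copy through memcpy, which lets the compiler emit access sequences that are valid for any alignment. Here’s a minimal sketch of the memcpy approach applied to the example above; the helper names are mine, not part of the original article:

#include <stdio.h>
#include <string.h>

/* Access a double through a possibly misaligned pointer via memcpy,
   so the compiler generates code that is safe for any alignment. */
static double load_double(const void *p)
{
  double v;
  memcpy(&v, p, sizeof(v));
  return v;
}

static void store_double(void *p, double v)
{
  memcpy(p, &v, sizeof(v));
}

int main ()
{
  char a[10];
  void *p = &a[1];                  /* still a misaligned address */
  int i;
  store_double(p, 0.0);
  for (i = 0; i < 100000000; i++)
  {
    store_double(p, load_double(p) + 5.0);
  }
  printf("%f\n", load_double(p));
  return 0;
}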