Author: Darryl Gove
A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we’ve changed the level of detail presented in the Performance Analyzer. When I wrote the article the time spent on-cpu that wasn’t User time was grouped as System time. We’ve now started showing more detail – and more detail is good.
Here’s a similar bit of code:
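#include <stdio.h>
static int i, j;
volatile double *d;
int main(void)
{
  char a[10];
  d = (double *)&a[1];   /* deliberately misaligned address */
  j = 100000000;
  for (i = 0; i < j; i++)
  {
    *d += 5.0;           /* misaligned load and store on every iteration */
  }
  printf("%f\n", *d);    /* print the value, not the pointer */
  return 0;
}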
This code loads from and stores to a misaligned double, and that’s all we need in order to generate misaligned traps and see how they are shown in the Performance Analyzer. Here’s the hot loop:
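As a rough illustration of what "misaligned" means here, the sketch below checks whether an address is a multiple of a given alignment. It is not part of the original program: `is_misaligned` and `offset_one_is_misaligned` are hypothetical helper names, and the sketch assumes a C11 compiler (for `_Alignas`) and a platform where converting a pointer to `uintptr_t` yields its numeric address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is not a multiple of `alignment` bytes,
   0 otherwise. The pointer-to-integer conversion is implementation-defined;
   on mainstream platforms it gives the numeric address. */
static int is_misaligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) != 0;
}

/* Hypothetical demonstration: when the buffer starts on an 8-byte
   boundary, byte offset 1 cannot be 8-byte aligned. */
static int offset_one_is_misaligned(void)
{
    _Alignas(8) static char a[16];
    return is_misaligned(&a[1], 8) && !is_misaligned(&a[0], 8);
}
```

Note that the test program uses a plain `char a[10]`, whose starting address the C language does not require to be 8-byte aligned, so whether `&a[1]` is actually misaligned depends on how the compiler lays out the stack. The profile shows that it was misaligned in this build.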
Load Object: a.out

Inclusive      Inclusive
User CPU Time  Trap CPU Time  Name
(sec.)         (sec.)
1.131          0.510          [?]    10928:  inc         4, %i1
0.             0.             [?]    1092c:  ldd         [%i2], %f2
0.811          0.380          [?]    10930:  faddd       %f2, %f0, %f4
0.             0.             [?]    10934:  std         %f4, [%i2]
0.911          0.480          [?]    10938:  ldd         [%i2], %f2
1.121          0.370          [?]    1093c:  faddd       %f2, %f0, %f4
0.             0.             [?]    10940:  std         %f4, [%i2]
0.761          0.410          [?]    10944:  ldd         [%i2], %f2
0.911          0.410          [?]    10948:  faddd       %f2, %f0, %f4
0.010          0.             [?]    1094c:  std         %f4, [%i2]
0.941          0.450          [?]    10950:  ldd         [%i2], %f2
1.111          0.380          [?]    10954:  faddd       %f2, %f0, %f4
0.             0.             [?]    10958:  cmp         %i1, %i5
0.             0.             [?]    1095c:  ble,pt      %icc,0x10928
0.             0.             [?]    10960:  std         %f4, [%i2]
So the first thing to notice is that we’re now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it’s worth displaying it differently. Fortunately the new overview screen highlights the trap time, so it’s easy to recognise when to look for it.
Now, you should be familiar with the “previous instruction is to blame” rule for interpreting the output from the Performance Analyzer. Dealing with traps is no different: the time shown against an instruction is due to the trap taken by the previous instruction. So the final load in the loop (the ldd at 10950) accounts for the roughly 1.1s of user time and 0.38s of trap time shown against the faddd at 10954.
A slight sidetrack about the “blame the previous instruction” rule. For misaligned accesses, the problem instruction traps and its action is emulated, so the next instruction executed is the one following the misaligned access. That’s why we see the time attributed to the following instruction. However, there are situations where an instruction is retried after a trap; in those cases the next instruction executed is the instruction that caused the trap. Examples of this are TLB misses and save/restore instructions.
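The attribution rule for emulated traps can be sketched as a one-instruction shift over the hot loop. This is only an illustration using the trap times from the profile above: the `row` type and the `blamed_trap_sec` function are hypothetical names, and the wrap-around from the delay-slot std at 10960 back to the inc at 10928 assumes the loop is in its steady state. It would not apply to retried traps such as TLB misses, where the time lands on the trapping instruction itself.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the disassembly profile: the trap time *displayed*
   against an instruction. */
struct row {
    const char *insn;
    double trap_sec;
};

/* Hot loop in execution order. The std in the branch delay slot (10960)
   executes before control returns to 10928. */
static const struct row loop[] = {
    {"10928 inc",   0.510}, {"1092c ldd",   0.0},   {"10930 faddd", 0.380},
    {"10934 std",   0.0},   {"10938 ldd",   0.480}, {"1093c faddd", 0.370},
    {"10940 std",   0.0},   {"10944 ldd",   0.410}, {"10948 faddd", 0.410},
    {"1094c std",   0.0},   {"10950 ldd",   0.450}, {"10954 faddd", 0.380},
    {"10958 cmp",   0.0},   {"1095c ble",   0.0},   {"10960 std",   0.0},
};
#define LOOP_LEN (sizeof loop / sizeof loop[0])

/* Trap time caused by instruction i: the time displayed against the
   instruction executed immediately after it (wrapping around the loop). */
static double blamed_trap_sec(size_t i)
{
    assert(i < LOOP_LEN);
    return loop[(i + 1) % LOOP_LEN].trap_sec;
}
```

Read this way, every ldd and std in the loop is charged with a nonzero trap time, while the inc, faddd, cmp and ble instructions are charged with none, which matches the expectation that only the memory operations trap.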
If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:
Exclusive
User CPU Time  Name
(sec.)
3.002          <Total>
2.882          __misalign_trap_handler
0.070          main
0.040          __do_misaligned_ldst_instr
0.010          getreg
0.             _start
For 64-bit code the misaligned operations are fixed in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:
0.480  main + 0x00000078
0.410  main + 0x0000006C
0.380  main + 0x00000060
0.370  main + 0x00000088
0.310  main + 0x0000005C
0.270  main + 0x00000068
0.260  main + 0x00000074
This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.
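Once a site has been identified, one common way to fix it in portable C is to go through memcpy rather than dereferencing a misaligned double pointer. The sketch below is an assumption about how such a fix might look, not something taken from the original program; `add_to_unaligned_double` and `read_unaligned_double` are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>

/* Adds `delta` to the double stored at `p`, which may have any alignment.
   memcpy makes no alignment assumption about its arguments, so the
   compiler does not have to access p with a single aligned load/store. */
static void add_to_unaligned_double(void *p, double delta)
{
    double tmp;
    assert(p != NULL);
    memcpy(&tmp, p, sizeof tmp);   /* alignment-safe load  */
    tmp += delta;
    memcpy(p, &tmp, sizeof tmp);   /* alignment-safe store */
}

/* Reads a double from an address of any alignment. */
static double read_unaligned_double(const void *p)
{
    double tmp;
    assert(p != NULL);
    memcpy(&tmp, p, sizeof tmp);
    return tmp;
}
```

In the test program, the loop body `*d += 5.0;` would become `add_to_unaligned_double(&a[1], 5.0);`. Where the data layout is under your control, aligning the data itself is usually the simpler fix, since the memcpy version may still cost more than an aligned access.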
