
Misaligned loads profiled (again)

Guest Author

A long time ago I described how misaligned loads appeared in profiles of 32-bit applications. Since then we've changed the level of detail presented in the Performance Analyzer. When I wrote that article, any on-CPU time that wasn't User time was grouped as System time. We've now started showing more detail - and more detail is good.

Here's a similar bit of code:

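#include <stdio.h>
static int i,j;
volatile double *d;
int main ()
{
  char a[10] = {0};      /* zeroed so the printed result is well defined */
  d=(double*)&a[1];      /* deliberately misaligned for a double */
  j=100000000;
  for (i=0;i < j; i++)
  {
    *d+=5.0;             /* misaligned load and store on every iteration */
  }
  printf("%f",*d);       /* print the value, not the pointer */
  return 0;
}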

This code loads from and stores to a misaligned double, and that's all we need in order to generate misaligned traps and see how they are shown in the Performance Analyzer. Here's the hot loop:

Load Object: a.out
Inclusive      Inclusive
User CPU Time  Trap CPU Time  Name
(sec.)         (sec.)
1.131          0.510          [?]  10928: inc 4, %i1
0.             0.             [?]  1092c: ldd [%i2], %f2
0.811          0.380          [?]  10930: faddd %f2, %f0, %f4
0.             0.             [?]  10934: std %f4, [%i2]
0.911          0.480          [?]  10938: ldd [%i2], %f2
1.121          0.370          [?]  1093c: faddd %f2, %f0, %f4
0.             0.             [?]  10940: std %f4, [%i2]
0.761          0.410          [?]  10944: ldd [%i2], %f2
0.911          0.410          [?]  10948: faddd %f2, %f0, %f4
0.010          0.             [?]  1094c: std %f4, [%i2]
0.941          0.450          [?]  10950: ldd [%i2], %f2
1.111          0.380          [?]  10954: faddd %f2, %f0, %f4
0.             0.             [?]  10958: cmp %i1, %i5
0.             0.             [?]  1095c: ble,pt %icc,0x10928
0.             0.             [?]  10960: std %f4, [%i2]

So the first thing to notice is that we're now reporting Trap time rather than aggregating it into System time. This is useful because trap time is intrinsically different from system time, so it's worth displaying it differently. Fortunately the new overview screen highlights the trap time, so it's easy to recognise when to look for it.

Now, you should be familiar with the "previous instruction is to blame" rule for interpreting the output from the Performance Analyzer. Dealing with traps is no different: the time reported against an instruction is due to the trap taken by the previous instruction. So the final load in the loop (the ldd at 0x10950) takes about 1.1s of user time and 0.38s of trap time, both of which are reported against the faddd at 0x10954.

A slight side track about the "blame the previous instruction" rule. For misaligned accesses the problem instruction traps and its action is emulated, so the next instruction executed is the one following the misaligned access. That's why we see time attributed to the following instruction. However, there are situations where an instruction is retried after a trap. In those cases the next instruction executed is the instruction that caused the trap, so the time is attributed to that instruction itself. Examples of this are TLB misses or save/restore instructions.
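Since all of this hinges on an address being misaligned for the type accessed through it, it can help to test that condition directly. The following is only a minimal sketch: the helper name is_aligned is hypothetical, and the alignment to test against (8 bytes for a double here) is an assumption that depends on the platform and compiler flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when the address held in p is a
   multiple of `alignment` bytes. `alignment` must be nonzero. */
static bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

For the test program above, is_aligned(d, 8) would be expected to return false whenever the array a itself starts on an 8-byte boundary, because &a[1] is then one byte past it.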

If we recompile the code as 64-bit and set -xmemalign=8i, then we get a different profile:

Exclusive       
User CPU Time Name
(sec.)
3.002 <Total>
2.882 __misalign_trap_handler
0.070 main
0.040 __do_misaligned_ldst_instr
0.010 getreg
0. _start

For 64-bit code the misaligned operations are fixed in user-land. One (unexpected) advantage of this is that you can take a look at the routines that call the trap handler and identify exactly where the misaligned memory operations are:

0.480  main + 0x00000078
0.450  main + 0x00000054
0.410  main + 0x0000006C
0.380  main + 0x00000060
0.370  main + 0x00000088
0.310  main + 0x0000005C
0.270  main + 0x00000068
0.260  main + 0x00000074

This is really useful if there are a number of sites and your objective is to fix them in order of performance impact.
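Once a site has been identified, one common way to remove the misaligned access in portable C is to copy the bytes into an aligned temporary, operate on that, and copy the result back. This is a sketch of that general technique rather than anything specific to the compiler or profile shown here; the helper name add_to_unaligned_double is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: adds `delta` to a double stored at an address
   that may be misaligned, and returns the updated value. memcpy has no
   alignment requirement on its arguments, so neither the load nor the
   store depends on p being aligned for double. */
static double add_to_unaligned_double(void *p, double delta)
{
    double tmp;
    memcpy(&tmp, p, sizeof tmp);   /* fetch the bytes into an aligned double */
    tmp += delta;
    memcpy(p, &tmp, sizeof tmp);   /* write the updated bytes back */
    return tmp;
}
```

In the test program this would replace *d+=5.0 with add_to_unaligned_double(&a[1], 5.0). How the copy is actually implemented is up to the compiler, so it's worth profiling again afterwards to confirm that the trap time has gone.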

Comments ( 3 )
  • guest Thursday, May 28, 2015

    ...But you had to artificially aggravate the issue with (double *)&a[1]; That is not something one would normally do in code, in fact, I do not believe I have ever seen such a cast in any code before.

    Can you give real-life examples where this could happen? Will the compiler normally not automatically align the code on even addresses which are multiples of four or eight (depending whether the code is 32- or 64-bit)?


  • guest Thursday, May 28, 2015

    Another question I have is, when I am writing the assembler code by hand, under which circumstances and how would I trigger the misaligned code? I'm looking at this disassembly and all the code is on even addresses, even the arithmetic.

    Can you share any methodology on spotting misaligned access when writing assembler code by hand, because I can't spot it in that listing.


  • Darryl Gove Thursday, May 28, 2015

    I think it'll take more than a reply to a comment to answer your question - so give me a bit of time to write a more detailed answer up!

    Darryl.

