    June 1, 2006

On misaligned memory accesses

Guest Author

The UltraSPARC processor does not handle misaligned memory operations (loads and stores) in hardware. An application can be compiled either to crash on a misaligned memory access or to trap to software to correct the alignment. A further option is to compile the application to assume that data is always misaligned and to use multiple loads or stores for each memory operation.

The behaviour is controlled through the -xmemalign flag as follows:

  • -xmemalign=1s This is equivalent to the old flag -misalign. The compiler assumes that everything is misaligned, and generates multiple loads or stores to access memory. This option is appropriate if most memory operations are misaligned (this is rarely the case).
  • -xmemalign=8i This has been the default since Sun Studio 9. The compiler assumes that all memory operations have the correct alignment, but uses a trap handler to correct the situations where this is not true. There is an overhead from using a trap handler to correct alignment problems.
  • -xmemalign=8s This is equivalent to the old flag -dalign, and is included in -fast. The compiler assumes correct alignment, and the application will crash if this is not the case.
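
As a rough source-level analogy for what -xmemalign=1s asks of the compiler, the sketch below accesses a double through memcpy, which makes no alignment assumption and can therefore be lowered to several smaller loads or stores. The function names are hypothetical, and the actual instruction sequence the compiler generates is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store a double at an address of any alignment.
   memcpy copies byte-wise as far as the language is concerned, so no
   single 8-byte aligned store is required. */
void store_double_unaligned(void *dst, double value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Hypothetical helper: load a double from an address of any alignment. */
double load_double_unaligned(const void *src)
{
    double value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

With a char buffer `buf`, calling `store_double_unaligned(&buf[1], 5.0)` followed by `load_double_unaligned(&buf[1])` round-trips the value without ever dereferencing a misaligned double pointer.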

To port an application which may or may not have alignment issues, the appropriate flag is -xmemalign=8i, which is enabled by default. The code will work, but may run slower because alignment issues have to be corrected in software. So the obvious question is how to detect whether there is a problem with misaligned memory accesses.

The first approach might be to compile with -xmemalign=8s, and see if the application runs. This would be a slow and rather painful way of testing this. Fortunately there are other options.

The debugger has a facility (available as a command-line tool) to check for misaligned memory accesses. The tool to use is:

bcheck <app> <params>

The following test code has a misalignment problem:

void main ()
{
    char a[10];
    double *d;
    d=(double*)&a[1];
    *d=5.0;
    printf("%f",d);
}

This code can be compiled and run under bcheck:

$ cc -g -O -xmemalign=8i miss.c
$ bcheck -all a.out
...
signal SEGV (no mapping at the fault address) in main (optimized) at line 6 in file "miss.c"
6 *d=5.0;
dbx: read of 4 bytes at address 20 failed -- Error 0

The code was compiled to correct misalignment problems, but under bcheck it failed at the misaligned memory access rather than reporting it. This is due to differences in the way that misalignment is handled in 32-bit and 64-bit modes. In 32-bit mode the kernel handles the alignment; in 64-bit mode the kernel passes control back to user code to handle it. So bcheck cannot capture the misalignment issue in 32-bit mode. The following example shows that it can be done in 64-bit mode:

$ cc -g -O -xtarget=generic64 -xmemalign=8i miss.c
$ bcheck -all a.out
...
errors are being redirected to file 'a.out.errs'
...
$ more a.out.errs
<rtc> Misaligned write (maw):
Attempting to write 4 bytes at address 0xffffffff7fffeec5
which is 197 bytes above the current stack pointer
=>[1] main() (optimized), at 0x10000261c (line ~6) in "miss.c"
<rtc> Read from unallocated (rua):
Attempting to read 4 bytes at address 0x100106e78
which is 17480 bytes into the heap; no blocks allocated
=>[1] __do_misaligned_ldst_instr(0xffffffff7fffed40, 0xffffffff7fffee00, 0x100106e78, 0xffffffff7fffe601, 0x10, 0x1), at 0x100000d74
[2] __misalign_trap_handler(0x100102a18, 0xffffffff7fffe6d1, 0x2, 0x0, 0x11e52c, 0xd0), at 0x1000020a0
[3] __rtc_dispatch(0x1, 0x100102000, 0x100102, 0x100000, 0x100002000, 0x100002), at 0xffffffff7352e910
[4] main() (optimized), at 0x1000025f8 (line ~2) in "miss.c"

The report indicates a Misaligned write (maw), so bcheck has successfully detected the misalignment problem. It also runs the code to completion, so all the problems are captured.
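
Once a report like this has pointed at a suspect pointer, a small assertion in the source can confirm the finding. The sketch below is one way to test alignment at run time; the helper name is hypothetical, and it assumes that converting a pointer to uintptr_t yields its numeric address, which holds on common platforms but is implementation-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment`,
   0 otherwise. `alignment` must be a nonzero power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* Assumes the integer value of the pointer is its address. */
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

For the test code above, `is_aligned(d, 8)` would be expected to return 0 whenever `a` itself starts on an 8-byte boundary, because `d` points one byte past the start of the array.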

Of course, not all programs can be ported to 64-bit just to check for the location of misaligned memory accesses. In fact, if the misaligned memory accesses are not causing a performance problem, there's no real reason to hunt them down and remove them. The way to check this is to profile the application and see whether there's a lot of time that might be due to misaligned memory accesses.

The following test program spends a bit more time on misaligned memory accesses, which will make the problem more apparent when the program is profiled:

static int i,j;
double *d;
void main ()
{
    char a[10];
    d=(double*)&a[1];
    j=1000000;
    for (i=0;i<j; i++)
        *d=5.0;
    printf("%f",d);
}

First of all, profiling the program built as a 32-bit executable:

$ cc -xO1 -g miss.c
$ collect a.out
$ er_print -metrics e.user:e.system -dis main test.2.er
   Excl.     Excl.
   User CPU  Sys. CPU
    sec.      sec.
...
                        9. *d=5.0;
                       10. printf("%f",d);
...
   0.        0.        [ 9]    10ccc:  std   %f32, [%o2]
## 0.370     1.081     [ 8]    10cd0:  add   %l5, 696, %l6
   0.        0.        [ 9]    10cd4:  add   %g2, 64, %g3

In 32-bit mode, there is no userland handler for misaligned loads and stores; the alignment is corrected in the kernel, so system time is an indicator that there could be problems with alignment. The time spent correcting the misaligned store is shown as system time on the following instruction.

The same test can be performed in 64-bit mode:

$ cc -g -O -xtarget=generic64 -xmemalign=8i miss.c 
$ collect a.out
$ er_print -metrics e.user:e.system -func test.4.er
Functions sorted by metric: Exclusive User CPU Time
Excl.     Excl.     Name
User CPU  Sys. CPU
 sec.      sec.
0.570     0.        <Total>
0.460     0.        __misalign_trap_handler
0.050     0.        __do_misaligned_ldst_instr
0.050     0.        __fp_read_pdreg
0.010     0.        main
0.        0.        _start

In the 64-bit case, much time is spent in the trap handler that corrects misalignment. This is a good indication that misaligned memory accesses are a problem, but does not indicate where the problem is.

$ er_print -dis main test.4.er
   Excl.     Incl.
   User CPU  User CPU
    sec.      sec.
...
   0.        0.        [ 8]    100002738:  or    %l6, 258, %l7
## 0.        0.560     [ 9]    10000273c:  std   %f32, [%o7]
   0.        0.        [ 8]    100002740:  or    %g2, 258, %g3
...

To determine where the misaligned memory accesses are occurring, it is necessary to look at the call stack for the trap handler. This quickly points to the main routine as the location. The disassembly shows Inclusive (but not Exclusive) user time attributed to the store instruction, which demonstrates that this is the instruction accessing misaligned data.
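
Once the offending store has been located, one possible fix is to give the buffer the alignment that the access needs. The sketch below is an assumption about how the test loop might be restructured, not part of the original test: a union (a hypothetical type name) overlays the byte buffer with a double, so the storage is suitably aligned and the store no longer needs correcting.

```c
#include <assert.h>

/* Hypothetical type: the double member forces the whole union,
   including the byte view, to be aligned for a double. */
union double_slot {
    double value;
    char   bytes[sizeof(double)];
};

/* Hypothetical aligned variant of the test loop: every store goes
   to correctly aligned storage. Returns the last value stored, or
   0.0 if the loop body never runs. */
double fill_aligned(long iterations)
{
    union double_slot slot;
    long i;

    assert(iterations >= 0);
    slot.value = 0.0;
    for (i = 0; i < iterations; i++)
        slot.value = 5.0;      /* aligned store, no trap needed */
    return slot.value;
}
```

Calling `fill_aligned(1000000)` performs the same number of stores as the test program. Where the data layout cannot be changed, copying through memcpy is the usual alternative.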
