Personal | Friday, March 30, 2012

Inline template efficiency

I like inline templates and use them extensively. Whenever I write code with them, I check the disassembly to confirm that the generated code is efficient. Here's one potential cause of inefficiency.

Suppose we want to use the misnamed Leading Zero Detect (LZD) instruction on T4. The instruction counts the number of leading zero bits in an integer register, so it should really be called leading zero count. We put together an inline template called lzd.il that looks like:

.inline lzd
lzd %o0,%o0
.end
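For anyone without a T4 to hand, here is a rough portable C sketch of what the instruction computes. It is not part of the template: the name lzd_reference is made up for illustration, and the sketch assumes the count covers the full 64-bit register and that an all-zero input gives 64.

```c
#include <stdint.h>

/* Portable reference for a 64-bit leading-zero count.
   Assumption: an all-zero input yields 64. */
int lzd_reference(uint64_t x)
{
    int count = 0;
    uint64_t mask = (uint64_t)1 << 63;   /* start at the top bit */

    /* Walk down from the top bit until a set bit is found
       or the mask has been shifted out entirely. */
    while (mask != 0 && (x & mask) == 0) {
        count++;
        mask >>= 1;
    }
    return count;
}
```

Under those assumptions, repeatedly applying the function to its own result starting from 0 gives 64, then 57, then 58, and stays at 58.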

And we throw together some code that uses it:

int lzd(int);
int a;
int c=0;
int main()
{
for(a=0; a<1000; a++)
{
c=lzd(c);
}
return 0;
}

We compile the code with some amount of optimisation, and look at the resulting code:

$ cc -O -xtarget=T4 -S lzd.c lzd.il
$ more lzd.s
.L77000018:
/* 0x001c 11 */ lzd %o0,%o0
/* 0x0020 9 */ ld [%i1],%i3
/* 0x0024 11 */ st %o0,[%i2]
/* 0x0028 9 */ add %i3,1,%i0
/* 0x002c */ cmp %i0,999
/* 0x0030 */ ble,pt %icc,.L77000018
/* 0x0034 */ st %i0,[%i1]

What is surprising is that we're seeing a number of loads and stores in the code. Everything could be held in registers, so why is this happening?

The problem is that the template is only inlined at the code generation stage, when the actual instructions are generated. Earlier compiler phases see an ordinary function call. A called function could read or modify global variables (like 'a' and 'c' in this code), so the compiler has to store them to memory before the call and reload them from memory after it.
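To see why the compiler has to be this cautious, here is a small sketch with a hypothetical callee, bump(), that really does modify the global loop counter. The names counter, bump and count_iterations are invented for the illustration. In real code the callee would typically live in another file, where the compiler cannot see its body.

```c
int counter;   /* global loop counter, playing the role of 'a' */

/* Hypothetical callee that modifies the global behind the caller's back. */
int bump(int x)
{
    counter += 10;
    return x + 1;
}

/* Counts how many times the loop body runs. The loop is only correct
   if 'counter' is re-read from memory after every call to bump(). */
int count_iterations(void)
{
    int calls = 0;
    for (counter = 0; counter < 100; counter++)
        calls = bump(calls);
    return calls;
}
```

Because bump() adds 10 to the counter on every call, the loop body runs 10 times rather than 100. A compiler that kept the counter in a register across the call would get this wrong, which is why it assumes the worst about any call it cannot see into.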

Fortunately, we can use a #pragma directive to tell the compiler that the routine lzd() has no side effects, meaning that it does not read from or write to memory. The directive to do that is #pragma no_side_effect(<routine name>), and it needs to be placed after the declaration of the function. The new code looks like:

int lzd(int);
#pragma no_side_effect(lzd)
int a;
int c=0;
int main()
{
for(a=0; a<1000; a++)
{
c=lzd(c);
}
return 0;
}

Now the loop looks much neater:

/* 0x0014 10 */ add %i1,1,%i1
! 11 ! {
! 12 ! c=lzd(c);
/* 0x0018 12 */ lzd %o0,%o0
/* 0x001c 10 */ cmp %i1,999
/* 0x0020 */ ble,pt %icc,.L77000018
/* 0x0024 */ nop
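If the pragma is not available, a similar effect can be sketched by hand in plain C by copying the globals into locals around the loop. This is an alternative to the approach above, not the same thing, and I have not checked what code the compiler generates for it. It is only valid when the called routine really does not touch 'a' or 'c'. The lzd_standin() function is a hypothetical portable replacement for the template so the sketch builds anywhere, and it assumes a zero input gives 64.

```c
int a;
int c = 0;

/* Hypothetical portable stand-in for the lzd template so the sketch
   builds without lzd.il; assumes a zero input gives 64. */
static int lzd_standin(int x)
{
    unsigned long long v = (unsigned long long)(long long)x;
    int n = 64;
    while (v != 0) {   /* every shift needed to reach zero is a non-leading bit */
        v >>= 1;
        n--;
    }
    return n;
}

int run_loop(void)
{
    int i;
    int local_c = c;            /* read the global once */

    for (i = 0; i < 1000; i++)
        local_c = lzd_standin(local_c);

    a = i;                      /* write the globals back once */
    c = local_c;
    return local_c;
}
```

The locals never have their address taken, so the compiler is free to keep them in registers across the call, whatever it believes about the callee.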

Comments ( 2 )
  • guest Friday, March 30, 2012

    I like gcc's extended asm syntax, where if you touch memory you need to add the "memory" clobber. And builtin functions (__builtin_clz here) for simple cases.


  • Darryl Gove Monday, April 2, 2012

    Thanks. For what I usually need to do the templates work fine. I find asm() a bit of a hassle, but for this situation it would have been great.

    There are a bunch of compiler provided inline templates, probably lzd is there, but I usually write my own.

    Darryl.

