Wednesday Nov 19, 2014

Writing inline templates

Writing some inline templates today... I've written about doing this kind of stuff in earlier posts, both as an introduction and in more detail.

I happen to need to pass a bundle of parameters on to the routine. The best way of checking how the parameters will be passed is to get the compiler to provide some initial template. Here's an example routine:

int parameters (int p0, int * p1, int * p2, int* p3, int * p4, int * p5, int * p6, int p7)
{
  return p0 + *p1 + *p2 + *p3 + *p4 + ((*p5)<<2) + ((*p6)<<3) + p7*p7;
}

In the routine I've tried to handle some of the parameters differently. I know that the first six integer parameters get passed in registers (%o0 to %o5), and that the later ones get passed on the stack. By handling them differently I can work out which loads from the stack correspond to which variables. The disassembly looks like:

-bash-4.1$ cc -g -O parameters.c -c
-bash-4.1$ dis -F parameters parameters.o
disassembly for parameters.o

    parameters:             ca 02 60 00  ld        [%o1], %g5
    parameters+0x4:         c4 02 e0 00  ld        [%o3], %g2
    parameters+0x8:         c2 02 a0 00  ld        [%o2], %g1
    parameters+0xc:         c6 03 a0 60  ld        [%sp + 0x60], %g3  // load of p7
    parameters+0x10:        88 02 00 05  add       %o0, %g5, %g4
    parameters+0x14:        d0 03 60 00  ld        [%o5], %o0
    parameters+0x18:        ca 03 20 00  ld        [%o4], %g5
    parameters+0x1c:        92 00 80 01  add       %g2, %g1, %o1
    parameters+0x20:        87 38 e0 00  sra       %g3, 0x0, %g3
    parameters+0x24:        82 01 00 09  add       %g4, %o1, %g1
    parameters+0x28:        d2 03 a0 5c  ld        [%sp + 0x5c], %o1 // load of p6
    parameters+0x2c:        88 48 c0 03  mulx      %g3, %g3, %g4     // %g4 = %g3*%g3
    parameters+0x30:        97 2a 20 02  sll       %o0, 0x2, %o3
    parameters+0x34:        94 00 40 05  add       %g1, %g5, %o2
    parameters+0x38:        da 02 60 00  ld        [%o1], %o5       
    parameters+0x3c:        84 02 c0 0a  add       %o3, %o2, %g2
    parameters+0x40:        99 2b 60 03  sll       %o5, 0x3, %o4     // %o4 = %o5<<3
    parameters+0x44:        90 00 80 0c  add       %g2, %o4, %o0
    parameters+0x48:        81 c3 e0 08  retl
    parameters+0x4c:        90 02 00 04  add       %o0, %g4, %o0

Monday Apr 02, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, and I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int);
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've previously discussed how the no_side_effect pragma cannot be combined with exceptions. We know that the template code cannot throw exceptions, but the compiler doesn't. If we give the compiler that information, it may be able to produce even better code. We can do this by adding the "throw()" exception specification to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw();
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" specification and the no_side_effect pragma.

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that I didn't mention in the article (it's implied but not stated) is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing
  nop
.end

And here's some code that calls it:

void nothing();

int main()
{
  nothing();
}

The code works when compiled as C, but not as C++ (here the template is assumed to be stored in a file called i.il):

$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++, we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
  {
#endif
    void nothing();
#ifdef __cplusplus
  }
#endif

int main()
{
  nothing();
}

Tuesday Dec 23, 2008

Debugging inline templates with dbx

Been working on inline templates to improve the performance on a couple of hot routines in a customer code. I've a couple of articles on this kind of work if you want to find out more details. There's an introductory article which covers the rules, and there's an article specifically talking about using VIS instructions.

Anyway, one of the most important things to do is to write a test harness: it's very easy to make a mistake and have the template not work for some particular situation. For these routines, one of my colleagues had already written a test harness. I ended up extending it to try a different corner case, and at that point discovered that my code no longer validated. The problem turned out to be a branch that should have been branch >= 2 and I'd coded branch != 2. The original test cases terminated with the value 2 at this point, but the new test I added ended up with the value 1, which still should have terminated, but the inline template as written didn't handle it correctly.

So I fired up dbx to take a look at what was going on:

$ cc -g test.c
$ dbx a.out
Reading a.out
(dbx) stop at 150
(dbx) run
stopped in main at line 150 in file "test.c"
  150           res1=campare(&buff1[j],buff2,i);

The stop at <line> command tells the debugger to stop at the problem line. However, the problem actually occurred when j was equal to 1, so I really should have made the breakpoint conditional:

(dbx) status
*(2) stop at "mcmp-test-all.c":150
(dbx) delete 2
(dbx) stop at 150 -if j==1
(3) stop at "mcmp-test-all.c":150 -if j == 1
(dbx) run
Running: a.out
(process id 14983)

That got me to the point where the problem occurred. My initial thought was to step through the execution of the inline template using the nexti command. However, this is pretty inefficient:

(dbx) nexti
stopped in main at 0x00011cfc
0x00011cfc: main+0x1394:        sll      %l0, 1, %l1
(dbx) nexti
stopped in main at 0x00011d00
0x00011d00: main+0x1398:        add      %l3, %l1, %l0
(dbx) nexti
stopped in main at 0x00011d04
0x00011d04: main+0x139c:        ld       [%fp - 1044], %l1

It could take quite a large number of instructions before I actually encountered the problem code, and each step takes three lines on screen. However, there's a tracei command which traces the execution at the assembly code level:

(dbx) tracei next
(dbx) cont
0x00011d08: main+0x13a0:        mov      %l0, %o0
0x00011d0c: main+0x13a4:        mov      %l2, %o1
0x00011d10: main+0x13a8:        mov      %l1, %o2
0x00011d14: main+0x13ac:        nop

The output took me through the code, and knowing the code path I had expected, I could pretty easily see the branch that caused the code to diverge.

Thursday May 15, 2008

Crossfile inlining and inline templates

Found an interesting 'feature' of using crossfile (-xipo) optimisation together with inline templates. Suppose you have a 'library' routine which is defined in one file and uses an inline template. This library routine is used all over the code. Here's an example of such a routine:

int T(int);
int W(int i)
{
 return T(i);
}

The routine W relies on an inline template (T) to do the work. The inline template contains some code like:

.inline T,0
  add  %o0,%o0,%o0
.end

The main routine resides in another file, and uses the routine W:

int W(int);
void main()
{
  W(1);
}

To use inline templates you compile the file that contains the call to the inline template together with the inline template that it calls. Assuming the template is stored in a file called t.il, that looks like this:

$ cc -c -xO4 m.c
$ cc -c -xO4 w.c t.il
$ cc -xO4 m.o w.o

However, when crossfile optimisation (-xipo) is used, the routine W is inlined into main, and now main has a dependence on the inline template. But when m.o is recompiled after W has been inlined into main, the compiler cannot see the inline template for T because it was not present on the initial compile line for m.c. The result of this is an error like:

$ cc -c -xO4 -xipo m.c
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o
Undefined                     first referenced
 symbol                             in file
T                                   m.o
ld: fatal: Symbol referencing errors. No output written to a.out

As you might guess from the above description, the workaround is not intuitive. You need to add the inline template file (t.il in this example) to the initial compile of the file m.c:

$ cc -c -xO4 -xipo m.c t.il
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o

It is not sufficient to add the inline template to the final compile line.

Looking beyond the simple test case shown above, the problem really is that when crossfile optimisation is used, the developer is no longer aware of the places in the code where inlining has happened (which is as it should be). So the developer can't know which initial compile lines to add the inline template to.

Hence, the conclusion is that whenever you are compiling code that relies on inline templates with crossfile optimisation, it is necessary to include the inline template on the compile line of every file.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge