Efficient inline templates and C++

I've talked before about calling inline templates from C++, I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions, well we know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" keyword to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Comments:

Post a Comment:
Comments are closed for this entry.
About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs