Thursday May 15, 2008

Crossfile inlining and inline templates

Found an interesting 'feature' of using crossfile (-xipo) optimisation together with inline templates. Suppose you have a 'library' routine which is defined in one file and uses an inline template. This library routine is used all over the code. Here's an example of such a routine:

int T(int);
int W(int i)
{
  return T(i);
}

The routine W relies on an inline template (T) to do the work. The inline template contains some code like:

.inline T,0
  add  %o0,%o0,%o0
.end

The main routine resides in another file, and uses the routine W:

int W(int);
void main()
{
  W(1);
}

To use inline templates you compile the file that contains the call to the inline template together with the inline template that it calls (here assumed to be in the file t.il) - like this:

$ cc -c -xO4 m.c
$ cc -c -xO4 w.c t.il
$ cc -xO4 m.o w.o

However, when crossfile optimisation (-xipo) is used, the routine W is inlined into main, and now main has a dependence on the inline template. But when m.o is recompiled after W has been inlined into main, the compiler cannot see the inline template for T because it was not present on the initial compile line for m.c. The result of this is an error like:

$ cc -c -xO4 -xipo m.c
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o
Undefined                     first referenced
 symbol                             in file
T                                   m.o
ld: fatal: Symbol referencing errors. No output written to a.out

As you might guess from the above description, the workaround is not intuitive. You need to add the inline template file to the initial compile of the file m.c:

$ cc -c -xO4 -xipo m.c t.il
$ cc -c -xO4 -xipo w.c t.il
$ cc -xO4 -xipo m.o w.o

It is not sufficient to add the inline template to the final compile line.

Looking beyond the simple test case shown above, the problem really is that when crossfile optimisation is used, the developer is no longer aware of the places in the code where inlining has happened (which is as it should be). So the developer can't know which initial compile lines to add the inline template to.

Hence, the conclusion is that whenever you are compiling code that relies on inline templates with crossfile optimisation, it is necessary to include the inline template on the compile line of every file.
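One way to follow that rule mechanically is to append the template file to every compile line in the build system. A minimal GNU make sketch (the file names m.c, w.c, and t.il are assumptions carried over from the example above):

```make
CC      = cc
CFLAGS  = -xO4 -xipo
INLINES = t.il          # inline template appended to every compile line

a.out: m.o w.o
	$(CC) $(CFLAGS) -o $@ m.o w.o

%.o: %.c
	$(CC) $(CFLAGS) -c $< $(INLINES)
```

With the template in a make variable, no individual compile line can accidentally omit it when crossfile optimisation moves an inlined call into a new object file.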

Thursday Mar 20, 2008

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital for distinguishing between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...
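For the profile feedback step mentioned above, the build becomes a three-stage cycle: compile to collect, run a representative workload, then recompile using the collected data. A sketch of the commands (the profile directory name ./profile and the source name app.c are assumptions; -xprofile accepts any name):

```
$ cc -xO4 -xipo -xprofile=collect:./profile -o app app.c
$ ./app < training.input                  # run a representative workload
$ cc -xO4 -xipo -xprofile=use:./profile -o app app.c
```

The quality of the final binary depends heavily on how representative that training run is of real workloads.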


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge