Calling libraries

I've previously blogged about measuring the performance of calling library code. Let's quickly cover where the costs come from, and what can be done about them.

The most obvious cost is that of making the call. This is probably a straightforward call instruction, although a call through an indirection first has to load the target address from memory. There's also a linkage table to negotiate - let's take a look at that:

#include <stdio.h>
void f()
{
  printf("Hello again\n");
}

int main()
{
  printf("Hello World\n");
  f();
  return 0;
}

There are two calls to printf in the code. Calls into libc are bound lazily, so the first call does the setup, and then we can see what happens more generally on the second call.

% cc -g p.c
% dbx a.out
Reading ld.so.1
Reading libc.so.1
(dbx) stop in f
(2) stop in f
(dbx) run
Running: a.out
(process id 63626)
Reading libc_psr.so.1
Hello World
stopped in f at line 4 in file "test.c"
    4     printf("Hello again\n");
(dbx) stepi
stopped in f at 0x00010bc0
0x00010bc0: f+0x0008:   bset     48, %l0
0x00010bc4: f+0x000c:   call     printf [PLT]   ! 0x20ca8
0x00010bc8: f+0x0010:   or       %l0, %g0, %o0
0x00020ca8: printf        [PLT]:        sethi    %hi(0x15000), %g1
0x00020cac: printf+0x0004 [PLT]:        sethi    %hi(0xff31c400), %g1
0x00020cb0: printf+0x0008 [PLT]:        jmp      %g1 + 0x00000024
0x00020cb4: _get_exit_frame_monitor        [PLT]:       sethi    %hi(0x18000), %g1
0xff31c424: printf       :      save     %sp, -96, %sp

So the call to printf actually jumps to the procedure linkage table (PLT), which then jumps to the actual start address of the library code.
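The effect of the linkage table can be modelled in plain C. The following is only a loose sketch under stated assumptions: the names lib_hello, hello_stub, hello_slot and call_hello are hypothetical, and the real runtime linker patches the PLT entry itself rather than a C function pointer. It does show the two behaviours seen above: the first call pays for the setup, and every later call still takes one extra jump through the table.

```c
#include <stdio.h>

/* Hypothetical stand-in for a library routine such as printf. */
static int lib_hello(const char *msg)
{
    return printf("%s\n", msg);
}

static int resolve_count = 0;            /* how many times the "setup" ran */

static int hello_stub(const char *msg);  /* forward declaration of the lazy stub */

/* One "table slot": starts out pointing at the stub, later at the real code. */
static int (*hello_slot)(const char *) = hello_stub;

/* The first call lands here: record the setup, patch the slot, forward the call. */
static int hello_stub(const char *msg)
{
    resolve_count++;
    hello_slot = lib_hello;              /* later calls skip the stub */
    return hello_slot(msg);
}

/* What the caller uses: every call is an indirect jump through the slot. */
int call_hello(const char *msg)
{
    return hello_slot(msg);
}
```

Calling call_hello() twice runs the stub only once; the second call goes through the slot straight to lib_hello, which mirrors the sethi/jmp sequence in the dbx output.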

So those are the additional costs of calling into a library. But just doing a call instruction also has some costs:

  • For SPARC processors, there's the possibility of hitting a register windows spill/fill trap.
  • The other issue with call instructions is that the compiler does not know whether the routine being called will read or write memory. So all variables that the routine might access need to be stored back to memory before the call and reloaded afterwards. This can get quite ugly, particularly for floating-point codes where there may be quite a few active registers at any one time. This behaviour can be avoided using the pragmas does_not_read_global_data, does_not_write_global_data, and no_side_effect. The no_side_effect pragma means that the compiler can eliminate the call to the routine if the return value is not used.
  • There are also ABI issues. For example, the SPARC V8 ABI requires floating point parameters to be passed in the integer registers. Doing this requires storing the fp registers to the stack and then loading the values into the integer registers, and doing the opposite on the other side of the call!
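As a rough illustration of the pragmas mentioned above, here is a minimal sketch. The routine half() and the global scale are hypothetical, and the pragma spelling is an assumption based on the Sun Studio C compiler, where the routine must be declared before the pragma names it. The guard keeps the pragmas away from other compilers.

```c
double scale = 2.0;      /* global the compiler must assume a call could change */

double half(double x);   /* hypothetical routine; the prototype precedes the pragmas */

#if defined(__SUNPRO_C)
/* Assumption: Sun Studio syntax. These assert that half() touches no globals. */
#pragma does_not_read_global_data(half)
#pragma does_not_write_global_data(half)
#pragma no_side_effect(half)
#endif

double half(double x)
{
    return x * 0.5;
}

/* Without the assertions, scale may be reloaded after every call to half(),
   because the compiler cannot rule out that the call modified it. */
double sum_scaled(const double *v, int n)
{
    double total = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        total += scale * half(v[i]);
    }
    return total;
}
```

The pragmas only help when the compiler cannot see the body of the routine, for instance when half() lives in another file or in a library; the assertions must of course be true of the routine.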

So generally calling routines can be time consuming, but what can be done?

  • Check whether the compiler can use intrinsics such as fsqrt rather than calling sqrt in libm (-xlibmil).
  • Compiling with -xO4 enables the compiler to avoid calls by inlining within the same source file.
  • Compiling and linking with -xipo enables the compiler to do cross-file inlining.
  • Make sure that every call that is made does substantial work - not just a handful of instructions.
  • Profile the application to confirm that there is real work being done in library code, and that the library routines called do perform substantial numbers of instructions on every invocation.
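To make the point about substantial work per call concrete, here is a small sketch. Both routines are hypothetical, and the comparison assumes they live in a library so that neither can be inlined into the caller.

```c
#include <stddef.h>

/* Per-element version: a full call, and its overhead, for a single multiply. */
double scale_one(double x, double factor)
{
    return x * factor;
}

/* Batched version: one call does the work for the whole array. */
void scale_all(double *v, size_t n, double factor)
{
    size_t i;
    for (i = 0; i < n; i++) {
        v[i] *= factor;
    }
}
```

A caller looping over n elements with scale_one() pays the call overhead n times, while scale_all() pays it once. If both routines were in the same source file as the caller, -xO4 might inline scale_one() and make the difference disappear, which is why profiling is the way to confirm where the time goes.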
Comments:

"But just doing a call instruction also has some costs:

For SPARC processors, there's the possibility of hitting a register windows spill/fill trap."

I know you're simplifying, but there's no way for a call instruction to cause a spill trap, let alone a fill trap. Those are caused by save/restore instructions and happen after the call instruction.

Posted by Valued Reader on June 26, 2008 at 09:14 AM PDT #

Yes, that's quite correct, the call instruction just changes the pc, the save and restore instructions in the called routine change the register windows (and not all routines need the save and restore instructions).

Posted by Darryl Gove on June 26, 2008 at 09:49 AM PDT #

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
