Thursday Jul 14, 2011

Best practices for libraries and linkers (part 8)

Part 8 is the conclusion of the series on best practices for libraries and linking. The core set of best practices is:

  • Ensure at link time that all symbols are resolved.
  • Minimise the number of symbols of global scope.
  • Specify the library search paths at link time.

Putting this series of articles together turned out to be a fair amount of work. Hopefully you can see from the scale of the topics why we chose to break it down into bite-sized chunks. I'll be happy to hear feedback on whether you found it useful, or what other topics you would like discussed.

Using symbol scoping (libraries and linker best practices part 7)

By default the compiler gives symbols defined in object files global scope. This means that they can be seen, and bound to, by any object. There are two other settings for symbol scope: "symbolic" and "hidden".

Hidden scope is the easiest to describe: a hidden symbol can only be seen within the module, and is not exported for applications or libraries to use. It is basically a locally defined symbol. There are multiple advantages to using hidden scoping where possible. It reduces the number of symbols that the linker needs to handle at runtime, which reduces start-up time. It reduces the number of names, which reduces the chance of duplicate names. Finally, hidden symbols cannot be bound to externally, so they cannot cause a link-order problem. This makes hidden scope a good choice for all those symbols that don't need to be exported.

The other option is symbolic scope. A symbol with symbolic scope is still available for other modules to bind to, so it is like a global symbol in that respect. However, a symbolic symbol can only be satisfied from within the defining library or application: an unresolved symbolic symbol foo() must bind within that module. Symbolic-scoped symbols therefore avoid the cross-library binding that causes link-order problems.

Symbols can be declared with their scoping using the specifiers __global, __symbolic, or __hidden. The compiler flag -xldscope=<scope> sets the default scoping for all symbols not otherwise scoped.

The details of all this are discussed much more thoroughly in Part 7 of the series.

The best practices for symbol scoping come in two flavours:

The easiest way of handling scoping is to declare all the defined symbols to have symbolic scoping (-xldscope=symbolic). This ensures that these symbols end up with local binding rather than pulling in definitions that are present in other libraries. The downside of this is that it could cause multiple definitions for the same symbol to become present in the address space of an application.

The other approach is to define interfaces carefully: declare exported symbols as __symbolic, so that other libraries can bind to them but this library binds to its local versions in preference; declare imported symbols as __global, which ensures that the library can bind to an external definition for the symbol; and finally use -xldscope=hidden for everything else to avoid further polluting the namespace. This is time consuming, but it reduces runtime linking costs and increases the robustness of the application.

Setting the initialisation order for libraries (Best practices for libraries and linking part 6)

Part 5 of the series talked about diagnosing initialisation problems. These are situations where the libraries are loaded in the wrong order and this causes the application not to function correctly (or at all). Part 6 discusses how to resolve this problem.

The easiest, but least reliable, approach is to reorder the libraries on the link line until they get initialised in the right order. This is an easy fix, since it is just a matter of changing the link line, but it is a poor one: it fixes only the one application rather than the root of the problem, and it is not robust, since a change in one of the libraries may cause the whole problem to recur. Better fixes involve avoiding the duplicate symbols that make the library load order matter in the first place.

If the symbols are introduced because of C++ templates, then the -instlib=<library> flag causes the compiler not to generate symbols that are defined in the listed libraries.

Direct binding is another approach. It records the exact library dependencies at link time, so the runtime linker knows exactly which libraries are required and can determine the appropriate load order. It has the side effect of enabling different libraries to bind to different definitions of the same symbol; this can be a useful feature, but it can also introduce problems.

Friday May 27, 2011

Using LD_DEBUG to examine application startup (linking best practices part 3)

Part 3 of the series on best practices for linking C/C++ applications is up. This section focuses on using LD_DEBUG to examine application startup.

The paper talks about the options LD_DEBUG=init, which shows the initialisation and finalisation stages of an application's run, and LD_DEBUG=bindings, which shows how symbols are bound between the application and its libraries.

Tuesday May 24, 2011

Best practices for linking - part 2

Part 2 of the article on library linking best practices is up on OTN. This is a relatively short read about ensuring that the library records its dependencies.

The relevant options are:

  • -z defs which will cause the linker to report any unresolved symbols found in the library. This is the default for applications, but is not the default for libraries. Using this flag requires that all the libraries that are required for successful linking are listed on the link line. Doing this will ensure that the library will fail to link rather than fail at runtime.
  • The command ldd -U -r will report if the library (or executable) is linked to libraries that it does not use. This is helpful in ensuring that the minimal number of libraries are loaded in order for an application to run.

Thursday May 21, 2009

Graph of libraries used by firefox and thunderbird

Just gathered library usage charts for firefox and thunderbird. The full charts look like:



Neither of which is particularly telling. The reduced charts look much better:



Wednesday May 20, 2009

Drawing libraries - neater eye-candy!

Chris Quenelle posted an interesting comment to my post showing the dependencies for StarOffice. As you can see from the mass of lines below, adding more dependency information to the StarOffice library map, using the latest version of ld_dot, did not make the graphic any clearer!

It turns out that the reduction operation that Chris was alluding to is implemented by tred (the "transitive reduction filter", what great technobabble!). This filtering reduces the graph down to something which even looks ok when shrunk down to fit this page:

This clarifies the relationships between the libraries. More importantly it also looks pretty.

Libraries (5) - Runtime costs - TLBs

The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:

% pmap 60500
60500:  a.out
00010000       8K r-x--  /libraries/a.out
00020000       8K rwx--  /libraries/a.out
FEEC0000      24K rwx--    [ anon ]
FEED0000       8K r-x--  /libraries/
FEEE0000       8K rwx--  /libraries/
FEEF0000       8K r-x--  /libraries/
FEF00000       8K rwx--  /libraries/
FEF10000       8K r-x--  /libraries/
FEF20000       8K rwx--  /libraries/
FEF30000       8K r-x--  /libraries/
FEF40000       8K rwx--  /libraries/
FEF50000       8K rwx--    [ anon ]
FEF60000       8K r-x--  /libraries/
FEF70000       8K rwx--  /libraries/
FEF80000       8K r-x--  /libraries/
FEF90000       8K rwx--  /libraries/
FEFA0000       8K r-x--  /libraries/
FEFB0000       8K rwx--  /libraries/
FEFC0000       8K r-x--  /libraries/

There are a finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can use quite a few TLB entries. On a CMT system, where multiple applications (or copies of the same application) are running, this puts a lot of pressure on the TLB.

One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, they can share a single context for that mapping, which can significantly reduce TLB pressure. Shared context only works for libraries that are loaded at the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders, or by any other mechanism that scrambles their locations in memory.

If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.

To round this off, let's look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.

The application jumps between 26 different routines, a->b->c...->z:

  • All the routines included in the same executable: 3s
  • The routines provided as a library, with calls therefore routed through the PLT: 6s
  • The routines provided as a library, but all declared static except for the initial routine called by main, so calls within the library avoid the PLT: 3s
  • Each routine defined in its own library, so calls go through the PLT and also require a new ITLB entry: 60s

Since the routines in this test code don't actually do anything, the overhead of calling through the PLT shows up clearly as a doubling of the runtime. However, even that is insignificant compared with the cost of calling into separate libraries, which is about 10x slower again.

Moving the experiment to look at the impact on CMT systems:

  • One copy of the executable per core of an UltraSPARC T1 processor: 1 minute
  • Two copies per core: 5 minutes
  • Four copies per core (a fully loaded system): 8 minutes

Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).

The conclusion is that care needs to be taken when deciding to split application code into libraries.

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure lookup table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls from within the library go through the PLT. Consider the following code snippet:

void func2()
{
}

void func1()
{
  func2();
}
If this is compiled into an executable the assembly code will look like:

        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of a function using the static keyword. Declaring func2 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Libraries (3) - Application startup costs

As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This takes time - it is basically a large chunk of the start-up time of an application. If you profile the start-up of an application, you'll probably not see a lot, because most of this time is OS/disk activity mapping the libraries into memory.

Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.

Libraries (2)

Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).

Tuesday May 19, 2009


I was talking to Rod Evans about the diagnostic capabilities available in the runtime linker. These are available through the environment variable LD_DEBUG. The setting LD_DEBUG=files gives diagnostic information about which libraries were loaded by which other libraries. This is rather hard to interpret, and would look better as a graph. It's relatively easy to parse the output from LD_DEBUG into dot format; this script does the parsing. The full steps to do this for the date command are:

$ LD_DEBUG=files date >ld_date 2>&1
$ ld_dot ld_date
$ dot -Tpng -o date.png

The lines in the graph represent which libraries use which other libraries. Solid lines indicate "needed" or hard links, the dotted lines represent lazy loading or dynamic loading (dlopen). The resulting graph looks like:

More complex commands like ssh pull in a larger set of libraries:

It is possible to use this on much larger applications. Unfortunately, the library dependencies tend to get very complex. This is the library map for staroffice.

Thursday Jun 26, 2008

Calling libraries

I've previously blogged about measuring the performance of calling library code. Let's quickly cover where the costs come from, and what can be done about them.

The most obvious cost is that of making the call itself. This is probably a straightforward call instruction, although an indirect call may involve loading the target address from memory first. There's also a linkage table to negotiate - let's take a look at that:

#include <stdio.h>
void f()
{
  printf("Hello again\n");
}

int main(void)
{
  printf("Hello World\n");
  f();
  return 0;
}

There are two calls to printf in the code. libc is lazy-loaded, so the first call does the set-up, and then we can see what happens more generally on the second call.

% cc -g p.c
% dbx a.out
(dbx) stop in f
(2) stop in f
(dbx) run
Running: a.out
(process id 63626)
Hello World
stopped in f at line 4 in file "test.c"
    4     printf("Hello again\n");
(dbx) stepi
stopped in f at 0x00010bc0
0x00010bc0: f+0x0008:   bset     48, %l0
0x00010bc4: f+0x000c:   call     printf [PLT]   ! 0x20ca8
0x00010bc8: f+0x0010:   or       %l0, %g0, %o0
0x00020ca8: printf        [PLT]:        sethi    %hi(0x15000), %g1
0x00020cac: printf+0x0004 [PLT]:        sethi    %hi(0xff31c400), %g1
0x00020cb0: printf+0x0008 [PLT]:        jmp      %g1 + 0x00000024
0x00020cb4: _get_exit_frame_monitor        [PLT]:       sethi    %hi(0x18000), %g1
0xff31c424: printf       :      save     %sp, -96, %sp

So the call to printf actually jumps to a procedure lookup table, which then jumps to the actual start address of the library code.

So that's the additional costs of libraries. But just doing a call instruction also has some costs:

  • For SPARC processors, there's the possibility of hitting a register windows spill/fill trap.
  • The other issue with call instructions is that the compiler does not know whether the routine being called will read or write memory, so all variables need to be stored back to memory before the call and read from memory afterwards. This can get quite ugly, particularly for floating-point codes where there may be quite a few active registers at any one time. This behaviour can be avoided using the pragmas does_not_read_global_data, does_not_write_global_data, and no_side_effect. The no_side_effect pragma means that the compiler can eliminate the call to the routine if the return value is not used.
  • There are also ABI issues. For example, the SPARC V8 ABI requires floating point parameters to be passed in the integer registers. Doing this requires storing the fp registers to the stack and then loading the values into the integer registers, and doing the opposite on the other side of the call!

So generally calling routines can be time consuming, but what can be done?

  • Check to see whether you might use intrinsics such as fsqrt rather than calling sqrt in libm (-xlibmil).
  • Compiling with -xO4 enables the compiler to avoid calls by inlining within the same source file.
  • Compiling and linking with -xipo enables the compiler to do cross-file inlining.
  • Make sure that every call that is made does substantial work - not just a handful of instructions.
  • Profile the application to confirm that there is real work being done in library code, and that the library routines called do perform substantial numbers of instructions on every invocation.

Thursday May 15, 2008

Redistributable libraries

Steve Clamage and I just put together a short article on using the redistributable libraries that are shipped as part of the compiler. The particular one we focus on is stlport4 since this library is commonly substituted for the default libCstd.

There are two points to take away from the article. First of all, that the required libraries should be copied into a new directory structure for distribution with your application - this makes it easy to patch them, and ensures that the correct version is picked up. The second point is to use the $ORIGIN token when linking the application to specify the path, relative to the location of the executable, where the library will be found at runtime.

Runtime linking is one of my bugbears. I really get fed up with software that requires libraries to be located in particular places in order for it to run, or worse, software that requires LD_LIBRARY_PATH to be set for the application to locate the libraries (see Rod Evans' blog entry).

Tuesday Oct 16, 2007

Building shared libraries for SPARCV9

By default, the SPARC compiler assumes that SPARCV9 objects are built with -xcode=abs44, which means that 44 bits are used to hold the absolute address of any object. Shared libraries should instead be built as position-independent code, using either -xcode=pic13 or -xcode=pic32 (these replace the deprecated -Kpic and -KPIC options).

If one of the object files in a library is built with abs44, then the linker will report the following error:

ld: fatal: relocation error: R_SPARC_H44: file file.o: symbol <unknown>:
relocations based on the ABS44 coding model can not be used in building
a shared object

Further details on this can be found in the compiler documentation.

Tuesday Jul 31, 2007

Snippet from book: cost of calling libraries

I've been working on a book about developing on Solaris, and I'm currently in the final stages of editing - which is a great feeling :) One of the strange things that happens at this stage is that material ends up being cut out. One of the sections that didn't make it was a discussion of the overhead of calling dynamic libraries rather than static libraries. The text is in a 'raw' format, and for some reason the document claims to have 4 pages, rather than the 3 that are there.

Wednesday Jul 25, 2007

List of Sun Studio redistributable libraries

List of libraries that are included with Sun Studio, and can be redistributed with applications compiled by Sun Studio.

All the documentation for Sun Studio 12.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

