Tuesday May 28, 2013

One executable, many platforms

Different processors have different optimal sequences of code. Fortunately, most of the time the differences are minor, and we can easily accommodate them by generating generic code.

If you needed more than this, then the "old" model was to use dynamic string tokens to pick the best library for the platform. This works well, and was the mechanism that libc.so used. However, the downside is that you now need to ship a bundle of libraries with the application; this can get (and look) a bit messy.

There's a "new" approach that uses a family of capability functions. The idea here is that multiple versions of the routine are linked into the executable, and the runtime linker picks the best for the platform that the application is running on. The routines are denoted with a suffix, after a percentage sign, indicating the platform. For example here's the family of memcpy() implementations in libc:

$ elfdump -H /usr/lib/libc.so.1 2>&1 |grep memcpy
      [10]  0x0010627c 0x00001280  FUNC LOCL  D    0 .text          memcpy%sun4u
      [11]  0x001094d0 0x00000b8c  FUNC LOCL  D    0 .text          memcpy%sun4u-opl
      [12]  0x0010a448 0x000005f0  FUNC LOCL  D    0 .text          memcpy%sun4v-hwcap1

It takes a bit of effort to produce a family of implementations. Imagine we want to print something different when an application is run on a sun4v machine. First of all we'll have a bit of code that prints out the compile-time defined string that indicates the platform we're running on:

#include <stdio.h>
static char name[]=PLATFORM;

double platform()
  printf("Running on %s\n",name);

To compile this code we need to provide the definition for PLATFORM - suitably escaped. We will need to provide two versions, a generic version that can always run, and a platform specific version that runs on sun4v platforms:

$ cc -c -o generic.o p.c -DPLATFORM=\"Generic\"
$ cc -c -o sun4v.o   p.c -DPLATFORM=\"sun4v\"

Now we have a specialised version of the routine platform() but it has the same name as the generic version, so we cannot link the two into the same executable. So what we need to do is to tag it as being the version we want to run on sun4v platforms.

This is a two step process. The first step is that we tag the object file as being a sun4v object file. This step is only necessary if the compiler has not already tagged the object file. The compiler will tag the object file appropriately if it uses instructions from a particular architecture - for example if you compiled explicitly targeting T4 using -xtarget=t4. However, if you need to tag the object file, then you can use a mapfile to add the appropriate hardware capabilities:

$mapfile_version 2


We can then ask the linker to apply these hardware capabilities from the mapfile to the object file:

$ ld -r -o sun4v_1.o -Mmapfile.sun4v sun4v.o

You can see that the capabilities have been applied using elfdump:

$ elfdump -H sun4v_1.o

Capabilities Section:  .SUNW_cap

 Object Capabilities:
     index  tag               value
       [0]  CA_SUNW_ID       sun4v
       [1]  CA_SUNW_MACH     sun4v

The second step is to take these capabilities and apply them to the functions. We do this using the linker option -zsymbolcap

$ ld -r -o sun4v_2.o -z symbolcap sun4v_1.o

You can now see that the platform function has been tagged as being for sun4v hardware:

$ elfdump -H sun4v_2.o

Capabilities Section:  .SUNW_cap

 Symbol Capabilities:
     index  tag               value
       [1]  CA_SUNW_ID       sun4v
       [2]  CA_SUNW_MACH     sun4v

     index    value      size      type bind oth ver shndx          name
      [24]  0x00000010 0x00000070  FUNC LOCL  D    0 .text          platform%sun4v

And finally you can combine the object files into a single executable. The main() routine of the executable calls platform() which will print out a different message depending on the platform. Here's the source to main():

extern void platform();

int main()

Here's what happens when the program is compiled and run on a non-sun4v platform:

$ cc -o main -O main.c sun4v_2.o generic.o
$ ./main
Running on Generic

Here's the same executable running on a sun4v platform:

$ ./main
Running on sun4v

Thursday Jul 14, 2011

Using symbol scoping. Libraries and linker best practices part 7

In general the compiler is going to scope symbols declared in object files as being global. This means that they can be seen and bound to by any object. There are two other settings for symbol scope - "symbolic" and "hidden".

Hidden scope is easiest to describe as it just means that the symbol can only be seen within the module and is not exported for applications or libraries to use. This is basically a locally defined symbol. There are multiple advantages to using hidden scoping when possible, it reduces the number of symbols that the linker needs to handle at runtime, so reduces start up time. It also reduces the number of names, so reduces the chance of duplicate names. Finally hidden symbols cannot be bound to externally, so they cannot cause a link order problem. This makes hidden scope a good choice for all those symbols that don't need to be exported.

The other option is symbolic scope. A symbol with symbolic scope is still available for other modules to bind to - so it is like a global symbol in that respect. However, a symbolic symbol can only be satisfied from within the library or application. So if I have an unresolved symbolic symbol foo() then that symbol can only bind within the library or application. So symbolic-scoped symbols avoid the cross-library issue that causes link order problems.

Symbols can be declared with their scoping; __global,__symbolic, or __hidden. We can also use the compiler flag -xldscope=<scope> to set the default scoping for all the symbols not otherwise scoped.

The details of all this are discussed much more thoroughly in Part 7 of the series.

The best practices for symbol scoping come in two flavours:

The easiest way of handling scoping is to declare all the defined symbols to have symbolic scoping (-xldscope=symbolic). This ensures that these symbols end up with local binding rather than pulling in definitions that are present in other libraries. The downside of this is that it could cause multiple definitions for the same symbol to become present in the address space of an application.

The other approach is to carefully define interfaces by declaring exported symbols to be __symbolic, so that other libraries can bind to them, but this library will bind to the local versions in preference. Then to declare imported symbols as __global which will ensure that the library can bind to an external definition for the symbol. Then finally use -xldscope=hidden to avoid further pollution of the name space. This is time consuming but reduces runtime link costs, and also increases the robustness of the application.

Setting the initialisation order for libraries (Best practices for libraries and linking part 6)

Part 5 of the series talked about diagnosing initialisation problems. These are situations where the libraries are loaded in the wrong order and this causes the application not to function correctly (or at all). Part 6 discusses how to resolve this problem.

The easiest, but the least reliable approach is to reorder the libraries on the link line until they get initialised in the right order. This is an easy fix since it is just a matter of changing the link line, but it's not reliable. There are various reasons why this is a poor fix. It is limited to just fixing the one application, and does not fix the root of the problem. It is not robust as a change in one of the libraries may cause the whole problem to recur. etc. Better fixes involve avoiding the duplicate symbol problem that causes the library load order to be indeterminate.

If the symbols are introduced because of C++ templates, then the -instlib=<library> flag causes the compiler not to generate symbols that are defined in the listed libraries.

Direct binding is another approach which records the exact library dependencies at link time so that the linker knows exactly which libraries are required, and hence can determine the appropriate load order. This has the downside that it enables different libraries to bind to different definitions of the same symbol, this could be a useful feature, but could also introduce problems.

Wednesday Jun 01, 2011

Avoiding problems at linktime (part 4 in series)

Part 4 in the series on best practices for linking is available. The key takeaways are:

  • Avoid defining duplicate symbols. The Solaris tool lari will produce a report on this issue (besides doing a bundle of other stuff). The problem with multiple definitions of symbols is that it is not predictable which definition will be picked at runtime. This is often deterministic on a particular platform, but could change on a different platform.
  • Always define libraries as a hierarchy, with no circular dependencies. If there are circular dependencies the libraries may get loaded in an unpredictable order.

Tuesday May 24, 2011

Best practices for linking - part 2

Part 2 of the article on library linking best practices is up on OTN. This is a relatively short read about ensuring that the library records its dependencies.

The relevant options are:

  • -z defs which will cause the linker to report any unresolved symbols found in the library. This is the default for applications, but is not the default for libraries. Using this flag requires that all the libraries that are required for successful linking are listed on the link line. Doing this will ensure that the library will fail to link rather than fail at runtime.
  • The command ldd -U -r lib.so will report if the library (or executable) is linked to libraries that it does not use. This is helpful in ensuring that the minimal number of libraries are loaded in order for an application to run.

Wednesday May 11, 2011

Best practices for linking libraries (part 1)

A while ago I was looking into some application start up problems. The problem turned out to be an issue relating to the order in which the libraries were loaded and initialised. It seemed to me that this was a rather tricky area, and it would be very helpful to document the best practices around it. I thought this would be a quick couple of pages, but it turned out to be a rather high page count, and I ended up working on the document with Steve Clamage (with Rod Evans helping out).

The first part of the document is available. This section covers basic linker good practices. Using -L and -R rather than LD_LIBRARY_PATH, generating relocatable code etc. The key take aways are:

  • Use -L to specify the path to where the libraries can be found at compile time.
  • Use -R to specify the location of the libraries at run time.
  • Use the token $ORIGIN to specify a relative path for the libraries' location. This avoids the need to have a hard-coded location where the libraries can be found.

Thursday May 21, 2009

Graph of libraries used by firefox and thunderbird

Just gathered library usage charts for firefox and thunderbird. The full charts look like:



Neither of which is particularly telling. The reduced charts look much better:



Wednesday May 20, 2009

Drawing libraries - neater eye-candy!

Chris Quenelle posted an interesting comment to my post which showed the dependencies for StarOffice. As you can see from the mass of lines below, adding more dependency information, using the latest version of ld_dot, into the StarOffice library map did not make the graphic any clearer!

It turns out that the reduction operation that Chris was alluding to is implemented by tred (the "transitive reduction filter", what great technobabble!). This filtering reduces the graph down to something which even looks ok when shrunk down to fit this page:

This clarifies the relationships between the libraries. More importantly it also looks pretty.

Libraries (5) - Runtime costs - TLBs

The next consideration when using libraries is that each library will get mapped in on a new virtual page of memory; as shown in this pmap output:

% pmap 60500
60500:  a.out
00010000       8K r-x--  /libraries/a.out
00020000       8K rwx--  /libraries/a.out
FEEC0000      24K rwx--    [ anon ]
FEED0000       8K r-x--  /libraries/lib1_26.so
FEEE0000       8K rwx--  /libraries/lib1_26.so
FEEF0000       8K r-x--  /libraries/lib1_25.so
FEF00000       8K rwx--  /libraries/lib1_25.so
FEF10000       8K r-x--  /libraries/lib1_24.so
FEF20000       8K rwx--  /libraries/lib1_24.so
FEF30000       8K r-x--  /libraries/lib1_23.so
FEF40000       8K rwx--  /libraries/lib1_23.so
FEF50000       8K rwx--    [ anon ]
FEF60000       8K r-x--  /libraries/lib1_22.so
FEF70000       8K rwx--  /libraries/lib1_22.so
FEF80000       8K r-x--  /libraries/lib1_21.so
FEF90000       8K rwx--  /libraries/lib1_21.so
FEFA0000       8K r-x--  /libraries/lib1_20.so
FEFB0000       8K rwx--  /libraries/lib1_20.so
FEFC0000       8K r-x--  /libraries/lib1_19.so

There are finite number of TLB entries on a chip. If each library takes an entry, and the code jumps around between libraries, then a single application can utilise quite a few TLB entries. Take a CMT system where there are multiple applications (or copies of the same application) running, and there becomes a lot of pressure on the TLB.

One of the enhancements in Solaris to support CMT processors is Shared Context. When multiple applications map the same library at the same address, then they can share a single context to map that library. This can lead to a significant reduction in the TLB pressure. Shared context only works for libraries that are loaded into the same memory locations in different contexts, so it can be defeated if the libraries are loaded in different orders or any other mechanisms that scramble the locations in memory.

If each library is mapped into a different TLB entry, then every call into a new library is a new ITLB entry, together with a jump through the PLT, together with the normal register spill/fill overhead. This can become quite a significant chunk of overhead.

To round this off, lets look at some figures from an artificial code run on an UltraSPARC T1 system that was hanging around here.

Application that jumps between 26 different routines a->b->c...->z. All the routines are included in the same executable. 3s
Application that jumps between 26 different routines a->...z. The routines are provided as a library, and calls are therefore routed through the PLT. 6s
Application that jumps between 26 different routines a->...z. The routines are provided as a library, but all are declared static except for the initial routine that is called by main. Therefore the calls within the library avoid the PLT. 3s
Application that jumps between 26 different routines a->...z. Each routine is defined in its own library, so calls to the routine have to go through the PLT, and also require a new ITLB entry to be used. 60s

Since the routines in this test code don't actually do anything, the overhead of calling through the PLT is clearly shown as a doubling of runtime. However, this is insignificant when compared with the costs of calling to separate libraries, which is about 10x slower than this.

Moving the experiment to look at the impact on CMT systems:

One copy of this executable per core of an UltraSPARC T1 processor 1 minute
Two copies of this executable per core 5 minutes
Four copies of this executable per core (fully loaded system) 8 minutes

Running multiple copies of the application has a significant impact on performance. The performance counters show very few instructions being executed, and much time being lost to ITLB misses. Now this performance is from a system without the shared context changes - so I would expect much better scaling on a system with these improvements (if I find one I'll rerun the experiment).

The conclusion is that care needs to be taken when deciding to split application code into libraries.

Libraries (4) - Runtime costs - Procedure Lookup Table (PLT)

Most applications spend the majority of their time running - rather than starting up. So it's useful to look at the costs of using libraries at runtime.

The most apparent cost of using libraries is that calls to routines now go indirectly to the target routine through the procedure look up table (PLT). Unless the developer explicitly limits the scope of a function, it is exported from the library as a global function, which means that even calls within the library will go through the PLT. Consider the following code snippet:

void func2()

void func1()

If this is compiled into an executable the assembly code will look like:

        11104:  82 10 00 0f  mov        %o7, %g1
        11108:  7f ff ff f8  call       func2   ! 0x110e8
        1110c:  9e 10 00 01  mov        %g1, %o7

However, if this is compiled as part of a library then the code looks like:

         664:  82 10 00 0f  mov         %o7, %g1
         668:  40 00 40 b9  call        .plt+0x3c       ! 0x1094c
         66c:  9e 10 00 01  mov         %g1, %o7

This is a doubling of the cost of the call.

In C it's possible to limit the scope of the function using the static keyword. Declaring func1 as static will cause the compiler to generate a direct call to that routine. The downside is that the routine will only be visible within the source file that defines it. It is also possible to use other methods to limit the visibility of symbols.

Libraries (3) - Application startup costs

As can be seen from the previous graphs, even a simple application (like ssh) can pull in a fair number of libraries. Whenever a library is pulled in, the linker has to request memory, load the image from disk, and then link in all the routines. This effort takes time - it's basically a large chunk of the start up time of an application. If you profile the start up of an application, you'll probably not see much because much of this time is basically the OS/disk activity of mapping the libraries into memory.

Of course applications also have start up costs associated with initialising data structures etc. However, the biggest risk is that applications will pull in libraries that they don't need, or perhaps do need, but don't need yet. The best work-around for this is to lazy load the libraries. Of course it's fairly easy to write code that either breaks under lazy loading or breaks lazy loading. It's not hard to work around these issues with care, and doing so can have a substantial impact on start up time.

Libraries (2)

Just updated the ld_dot script to include filter libraries. Added a profile for ssh logging into a system, rather than just showing the help message (click the image for the full size version).

Tuesday May 19, 2009


I was talking to Rod Evans about the diagnostic capabilities available in the runtime linker. These are available through the environment setting LD_DEBUG. The setting LD_DEBUG=files gives diagnostic information about which libraries were loaded by which other libraries. This is rather hard to interpret, and would look better as a graph. It's relatively easy to parse the output from LD_DEBUG into dot format. This script does the parsing. The full stesp to do this for the date command are:

$ LD_DEBUG=files date >ld_date 2>&1
$ ld_dot ld_date
$ dot -Tpng -o date.png dot.dot

The lines in the graph represent which libraries use which other libraries. Solid lines indicate "needed" or hard links, the dotted lines represent lazy loading or dynamic loading (dlopen). The resulting graph looks like:

More complex commands like ssh pull in a larger set of libraries:

It is possible to use this on much larger applications. Unfortunately, the library dependencies tend to get very complex. This is the library map for staroffice.

Monday Mar 02, 2009

Relocation Errors

ld.so.1: prog: fatal: relocation error: file ./libfoo.so.1: 
                    symbol bar: referenced symbol not found

Program prog uses libfoo.so.1, and that library has an unresolved dependency on the symbol bar. You can check for this problem using:

$ ldd -d prog

As outlined in the linker guide

Don't, what ever you do, solve it using LD_LIBRARY_PATH!

Tuesday Oct 16, 2007

Building shared libraries for SPARCV9

By default, the SPARC compiler assumes that SPARCV9 objects are built with -xcode=abs44, which means that 44 bits are used to hold the absolute address of any object. Shared libraries should be built using position independent code, either -xcode=pic13 or -xcode=pic32 (replacing the deprecated -Kpic and -KPIC options.

If one of the object files in a library is built with abs44, then the linker will report the following error:

ld: fatal: relocation error: R_SPARC_H44: file file.o: symbol <unknown>:
relocations based on the ABS44 coding model can not be used in building
a shared object
Further details on this can be found in the compiler documentation.

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming