Thursday May 17, 2012

Solaris Developer talk next week

Vijay Tatkar will be talking about developing on Solaris next week Tuesday at 9am PST.

Friday Apr 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44, the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bit or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. It does not restrict the address range for pointers (dynamically allocated memory) - they remain 64-bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
  return &data;

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64-bits, hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44 bit address can be generated in four instructions, so slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0

Monday Apr 02, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
    int mytemplate(int);

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions, well we know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" keyword to the template declaration:

  extern "C"
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
  int lzd(int);
  #pragma no_side_effect(lzd)

int a;
int c=0;

class myclass
  int routine();

int myclass::routine()
    for(a=0; a<1000; a++)
    std::cout << "Something happened" << std::endl;
 return 0;

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Friday Feb 03, 2012

Using prtpicl to get cache sizes

If you are on a SPARC system you can get cache size information using the command fpversion, which is provided with Studio:

$ fpversion
 A SPARC-based CPU is available.
 Kernel says main memory's clock rate is 1012.0 MHz.

 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.

 Use "-xtarget=sparc64vii -xcache=64/64/2:5120/256/10" code-generation option.

The cache parameters are output exactly as you would want to pass them into the compiler - for each cache it describes the size in KB, the line size in bytes, and the associativity.

fpversion doesn't exist on x86 systems. The next best thing is to use prtpicl to output system configuration information, and inspect that output for cache size. Here's the cache output for the same SPARC system using prtpicl.

$ prtpicl -v |grep cache
              :l1-icache-size    0x10000
              :l1-icache-line-size       0x40
              :l1-icache-associativity   0x2
              :l1-dcache-size    0x10000
              :l1-dcache-line-size       0x40
              :l1-dcache-associativity   0x2
              :l2-cache-size     0x500000
              :l2-cache-line-size        0x100
              :l2-cache-associativity    0xa

Tuesday Jan 17, 2012

Separation of debug and executable

To reduce the size of shipped binaries it can be useful to separate the debug information into a separate file. This procedure is covered in the dbx manual. We can use objdump to extract the debug information and then to link the executable with the extracted data.

Here's a short example executable:

#include <stdio.h>
#include <math.h>

int main()
  double d=1.0;
  d = sin(d);
  printf("sin(1.0) = %f\n",d);

Compiled with debug:

$ cc -g hello.c -lm
$ ./a.out
sin(1.0) = 0.841471

We can debug this executable with dbx. Note that, in this case, we compiled without optimisation in order to get the best debug information. Doing this does potentially sacrifice some performance. We can follow the same procedure with optimised code.

$ dbx ./a.out
(dbx) stop in main
(2) stop in main
(dbx) run
Running: a.out
(process id 53296)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) step
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
sin(1.0) = 0.841471

First of all we are going to use objcopy to extract the debug information from ./a.out and place it into ./a.out.debug:

$ /usr/sfw/bin/gobjcopy --only-keep-debug ./a.out ./a.out.debug

Now we can strip a.out of debug information:

$ strip ./a.out

To prove that this has removed the debug information we can try running under dbx:

$ dbx  ./a.out
(dbx) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx) quit

Now we want to use objcopy to make a link between the executable and its debug information:

$ /usr/sfw/bin/gobjcopy --add-gnu-debuglink=./a.out.debug ./a.out

Now when we debug the executable we are back to full debug:

$ dbx ./a.out
(dbx) stop  in main
(2) stop in main
(dbx) run
Running: a.out
(process id 58837)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) next
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
sin(1.0) = 0.841471

execution completed, exit code is 0
(dbx) quit

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that I didn't mention in the article, it's implied but not stated, is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing

And here's some code that calls it:

void nothing();

int main()

The code works when compiled as C, but not as C++:

$ cc i.c
$ ./a.out
$ CC i.c
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
    void nothing();
#ifdef __cplusplus

int main()

Thursday Jan 12, 2012

Please mind the gap

I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

Timeline view

One of my colleagues pointed out that it is possible to figure out what is causing the gaps. The call stack is indicated by the event after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.

#include <stdio.h>

void write_file()
  char block[8192];
  FILE * file = fopen("./text.txt", "w");
  for (int i=0;i<1024; i++)
    fwrite(block, sizeof(block), 1, file);

void read_file()
  char block[8192];
  FILE * file = fopen("./text.txt", "rw");
  for (int i=0;i<1024; i++)
    fwrite(block, sizeof(block), 1, file);

int main()
  for (int i=0; i<100; i++)

This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we determine that the gaps are caused by the application either opening or closing the file.


But that is not all that is going on, if we look at the information shown in the Timeline details panel for the Duration of the event we can see that it spent 210ms in the "Other Wait" micro state. So we've now got a pretty clear idea of where the time is coming from.

Wednesday Jan 11, 2012

A static function, an inline function, and a static variable walked into a bar....

... well, not really. Hacking around with some library code, so I thought I'd write up a quick refresher on scoping. Steve Clamage and I cover scoping in more detail in the series on libraries and linking. For the code I was working on today, the problem was much more limited.

I had a single file containing all the source code. I wanted to export only the minimal number of symbols that were needed to act as an interface for the library. You can imagine it being something like:

#include <stdio.h>

int count=0;

inline void printcount()
  printf("Count = %i\n",count);

void next()

If I compile this, and then use nm to inspect the resulting library, I can see a global symbol for count. The function printcount() is defined with local scope. However, the only interface I want to export is next().

bash-3.00$ cc -g -G -O -o t.c
bash-3.00$ nm|grep GLOB
[45]    |     66468|       4|OBJT |GLOB |0    |11     |count
[43]    |       724|      40|FUNC |GLOB |0    |5      |next
[42]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printf
bash-3.00$ nm |grep count
[44]    |     66460|       4|OBJT |GLOB |0    |11     |count
[32]    |       672|      52|FUNC |LOCL |0    |5      |printcount

So I can define count as a static variable, and that reduces its scope to the file in which it is defined. However, this does not actually make it disappear, it is still there, but with name mangling:

bash-3.00$ nm|grep count
[40]    |     66476|       4|OBJT |GLOB |0    |11     |$XAS4IkBuA_CPGtc.count
[33]    |       688|      52|FUNC |LOCL |0    |5      |printcount

The reason for this is that I'm building with debug (-g). With debug, I get a local version of the routine printcount(), and I get a globalised version of the variable count. If I remove -g, I get the following output from nm:

bash-3.00$ nm|grep count
[29]    |     66316|       4|OBJT |LOCL |0    |11     |count
[36]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printcount

The variable count has local scope, which is what we expected - it is no longer exported from the file, so we have avoided possible name conflicts there. However, printcount() is now no longer defined. That might be ok so long as we don't actually call the routine:

bash-3.00$ dis|grep printcount
         2e4:  7f ff ff ef  call        printcount      ! 0x2a0

Oops. We've hit the rule about needing to provide an extern version of any inline functions. Once again, I suggest parsing Douglas Walls' discussion of the topic for the gory details. Anyhow, the upshot is that this library wouldn't work. The fix is trivial, declare printcount() to be static inline, and the compiler will generate the local version of the function:

bash-3.00$ cc -G -O -o t.c
bash-3.00$ nm |grep count
[29]    |     66448|       4|OBJT |LOCL |0    |11     |count
[30]    |       664|      52|FUNC |LOCL |0    |5      |printcount

With these fixes the library no longer exports any functions but the ones I left with external linkage. This substantially reduces the risk of "undefined behaviour".

Tuesday Jan 10, 2012

What's inlined by -xlibmil

The compiler flag -xlibmil provides inline templates for some critical maths functions, but it comes with the optimisation that it does not set errno for these functions. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and determine whether you care that they don't set errno. You can see the list of functions using the command:

grep inline /compilerpath/prod/lib/
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1

From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it set errno, then don't use -xlibmil.

Monday Jan 09, 2012

Understanding binary size

One of my colleagues, Miriam Blatt, has written a great article about understanding the size of binary objects. This is worth a read because it describes both what goes into the objects and what tools you can use to discover this information.

Wednesday Dec 14, 2011

Oracle Solaris Studio 12.3

Oracle Solaris Studio 12.3 was released today. You can download it here.

There's a bundle of exciting stuff that goes into every new release. The headlines are probably the introduction of the Code Analyzer tool which does dynamic and static error reporting on an application, and the ablity of the IDE to be run on a remote system while the builds are done on the host.

I have a couple of other favourite areas of change. First of all we've got spot running on a bunch of recent processors - in particular the SPARC T4 (I'll write more about this later). Secondly, the filtering in the Performance Analyzer has been pushed to the foreground. Let's discuss filtering now.

Filtering is one of those technologies that is very powerful, but has been quite hard to use in previous releases. The change in this release has been that the filters have been placed on the right-click menu. Here's an example:

Adding and removing filters is now just a matter of right clicking. This allows you to rapidly drill down on the profile data. For example filtering out activity by processor, call stack, and so on.

Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
  T out;
  char * pcin = (char*)∈
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
    pcout[i] = pcin[sizeof(T)-i];
  return out;

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  return out;

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS so does not require a separate library to be shipped with the application. However, does not support the standard.
  • -library=stlport4 Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.

Tuesday Oct 04, 2011

Slides from Oracle Open World

The slides from my presentation are now available for download.

Sunday Aug 14, 2011


It used to be possible to have comments automatically appear without me having to moderate them. I preferred approach since otherwise it feels a bit like censorship when I need to explicitly approve them. Unfortunately that option has been removed, so I'm now having to manually approve any comments. This may mean a bit of a delay before I notice that I have comments to approve.

Tuesday Aug 09, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often its mmap, but sometimes it's something else. Every time I think that I should write something up. Well, it's finally happened, a short paper on how to write portable code using the standard headers.

Monday Aug 01, 2011

Standard header files

Interesting (but old) blog post about the standard header files included with Solaris.

Thursday Jul 14, 2011

Best practices for libraries and linkers (part 8)

Part 8 is the conclusion of the series on the best practices for libraries and linking. The core set of best practices are:

  • Ensure at link time that all symbols are resolved.
  • Minimise the number of symbols of global scope.
  • Specify the library search paths at link time.

Putting this series of articles together turned out to be a fair amount of work. Hopefully you can see from the scale of the topics why we chose to break it down into bite-sized chunks. I'll be happy to hear feedback on whether you found it useful, or what other topics you would like discussed.

Using symbol scoping. Libraries and linker best practices part 7

In general the compiler is going to scope symbols declared in object files as being global. This means that they can be seen and bound to by any object. There are two other settings for symbol scope - "symbolic" and "hidden".

Hidden scope is easiest to describe as it just means that the symbol can only be seen within the module and is not exported for applications or libraries to use. This is basically a locally defined symbol. There are multiple advantages to using hidden scoping when possible, it reduces the number of symbols that the linker needs to handle at runtime, so reduces start up time. It also reduces the number of names, so reduces the chance of duplicate names. Finally hidden symbols cannot be bound to externally, so they cannot cause a link order problem. This makes hidden scope a good choice for all those symbols that don't need to be exported.

The other option is symbolic scope. A symbol with symbolic scope is still available for other modules to bind to - so it is like a global symbol in that respect. However, a symbolic symbol can only be satisfied from within the library or application. So if I have an unresolved symbolic symbol foo() then that symbol can only bind within the library or application. So symbolic-scoped symbols avoid the cross-library issue that causes link order problems.

Symbols can be declared with their scoping; __global,__symbolic, or __hidden. We can also use the compiler flag -xldscope=<scope> to set the default scoping for all the symbols not otherwise scoped.

The details of all this are discussed much more thoroughly in Part 7 of the series.

The best practices for symbol scoping come in two flavours:

The easiest way of handling scoping is to declare all the defined symbols to have symbolic scoping (-xldscope=symbolic). This ensures that these symbols end up with local binding rather than pulling in definitions that are present in other libraries. The downside of this is that it could cause multiple definitions for the same symbol to become present in the address space of an application.

The other approach is to carefully define interfaces by declaring exported symbols to be __symbolic, so that other libraries can bind to them, but this library will bind to the local versions in preference. Then to declare imported symbols as __global which will ensure that the library can bind to an external definition for the symbol. Then finally use -xldscope=hidden to avoid further pollution of the name space. This is time consuming but reduces runtime link costs, and also increases the robustness of the application.

Setting the initialisation order for libraries (Best practices for libraries and linking part 6)

Part 5 of the series talked about diagnosing initialisation problems. These are situations where the libraries are loaded in the wrong order and this causes the application not to function correctly (or at all). Part 6 discusses how to resolve this problem.

The easiest, but the least reliable approach is to reorder the libraries on the link line until they get initialised in the right order. This is an easy fix since it is just a matter of changing the link line, but it's not reliable. There are various reasons why this is a poor fix. It is limited to just fixing the one application, and does not fix the root of the problem. It is not robust as a change in one of the libraries may cause the whole problem to recur. etc. Better fixes involve avoiding the duplicate symbol problem that causes the library load order to be indeterminate.

If the symbols are introduced because of C++ templates, then the -instlib=<library> flag causes the compiler not to generate symbols that are defined in the listed libraries.

Direct binding is another approach which records the exact library dependencies at link time so that the linker knows exactly which libraries are required, and hence can determine the appropriate load order. This has the downside that it enables different libraries to bind to different definitions of the same symbol, this could be a useful feature, but could also introduce problems.

Wednesday Jul 13, 2011

Presenting at Oracle Open World

I'll be presenting at Oracle Open World in October. The full searchable catalog is on-line, or you can browse the speakers by name.

Tuesday Jul 12, 2011

Feature Test Macros

Feature test macros are a set of macros that are either:

  • Defined by the development environment indicating that the environment conforms to a particular standard


  • Defined by the source code for the application before the header files are included to indicate that the application requires a particular environment to build

The macros define what APIs are available, and what parameters are passed through the APIs. Adherence to a particular standard (like POSIX) will define a particular set of APIs, and define their parameters. A good example of this is on Solaris where munmap changes definition depending on what standards have been requested:

$ grep munmap /usr/include/sys/*.h
/usr/include/sys/mman.h:extern int munmap(void *, size_t);
/usr/include/sys/mman.h:extern int munmap(caddr_t, size_t);

The Linux man page for feature_test_macros includes useful source code (ftm.c) for reporting which feature test macros are set by default. This changes depending on the the OS and compiler used. One of the big differences between Linux and Solaris are the feature test macros that are set by default. Here's the output from the program compiled on a Linux box and a Solaris box - both using gcc.


$ gcc ftm.c
$ ./a.out
_POSIX_C_SOURCE defined: 200809L
_BSD_SOURCE defined
_SVID_SOURCE defined


$ gcc ftm.c
$ ./a.out
_FILE_OFFSET_BITS defined: 32

The list of standards that Solaris 10 adheres to is documented under man standards, the list for Linux is documented under man feature_test_macros.

Monday Jul 11, 2011

OpenMP 3.1 specification released

OpenMP is a great way to produce parallel applications with the minimal amount of work. The 3.1 specification came out a couple of days ago. As should be apparent from the version number, its more incremental than significant. The significant changes I see are:

  • Support for min and max reductions in C/C++. This was a frustrating omission from the previous versions, so I'm pleased to see that fixed here.
  • Support for thread binding. The specification introduces OMP_PROC_BIND which binds threads to cores. This is rather similar to the original SUNW_MP_PROCBIND in Studio, which only took true or false, more recent compilers allow a much finer granularity of control. Still "true" or "false" is a good start!

Wednesday Jun 29, 2011

Library initialisation in C++ - libraries and linking part 5

Part 5 of the series of articles on linking and libraries is up. This one gets into the details of what can go wrong when writing libraries in C++. The key take aways from the article are to use:

  • LD_DEBUG=init to view runtime initialisation
  • LD_DEBUG=bindings to examine how symbols are bound to libraries at runtime

Wednesday Jun 01, 2011

Avoiding problems at linktime (part 4 in series)

Part 4 in the series on best practices for linking is available. The key takeaways are:

  • Avoid defining duplicate symbols. The Solaris tool lari will produce a report on this issue (besides doing a bundle of other stuff). The problem with multiple definitions of symbols is that it is not predictable which definition will be picked at runtime. This is often deterministic on a particular platform, but could change on a different platform.
  • Always define libraries as a hierarchy, with no circular dependencies. If there are circular dependencies the libraries may get loaded in an unpredictable order.

Friday May 27, 2011

Using LD_DEBUG to examine application startup (linking best practices part 3)

Part 3 of the series on best practices for linking C/C++ applications is up. This sections focuses on using LD_DEBUG to examine application startup.

The paper talks about the options LD_DEBUG=init which shows the initialisation and finalisation stages of an applications run, and LD_DEBUG=bindings which shows how the symbols are bound between the application and libraries.

Tuesday May 24, 2011

Best practices for linking - part 2

Part 2 of the article on library linking best practices is up on OTN. This is a relatively short read about ensuring that the library records its dependencies.

The relevant options are:

  • -z defs which will cause the linker to report any unresolved symbols found in the library. This is the default for applications, but is not the default for libraries. Using this flag requires that all the libraries that are required for successful linking are listed on the link line. Doing this will ensure that the library will fail to link rather than fail at runtime.
  • The command ldd -U -r will report if the library (or executable) is linked to libraries that it does not use. This is helpful in ensuring that the minimal number of libraries are loaded in order for an application to run.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
  double d=0.0;
  for (int i=0;i<10000000000; i++) {d+=d;}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application, sending the signal again stops data collection. Sending the signal twice we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Friday May 06, 2011

Calling functions

I was looking at some code today and it reminded me of a very common performance issue - reloading data around calls. Suppose I have some code like:

int variable;

void function(int *array)
  for (i=0; i<1000; i++)
     if (variable==1) 

You might be surprised to find that "variable" is reloaded very iteration of the loop. The reason for this is that the loop calls another function - either func1() or func2() and the compiler knows that the function might change the value of "variable" - so to be correct it needs to be reloaded.

This problem can be fixed by caching a local copy of the variable. The compiler "knows" that local (or stack based) variables don't get modified by function calls.

However, the problem is more general than this, in C++ you might observe a reloading of variables that are members of objects - for similar reasons. The general rule for avoiding this is to examine every load or store in the hot region of code to check whether it is necessary, or whether it has been introduced because of a function call.

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory ( and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find -name 'log.xml'`|grep process | sed 's/\\:.\*//' > myapp_profiles

Once we have a list of every time my application was run, I can now extract the times from that list, and sort them using the following line:

$ grep exit `cat <myapp_files`|sed 's/.\*tstamp=\\"//'|sed 's/\\".\*//'|sort -n 

Once I have the list of times I can use then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat <myapp_files`|grep 75.9 

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« March 2015
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming