Friday Oct 21, 2011

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS, so it does not require a separate library to be shipped with the application. However, it does not fully conform to the C++ standard.
  • -library=stlport4. Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of the standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.

Friday Aug 28, 2009

Maps in the STL

I was looking at some code with a colleague and we observed a lot of time being spent in code that used std::map to set up mappings between strings. The source code looked rather like the following:

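#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
{
  // Build a pair from the two strings, then insert it into the map
  mymap.insert(pair<string,string>(s1,s2));
  return 0;
}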

When compiled with CC -O -c -library=stlport4 this expands to a horrendous set of calls. Here are the first few:

$ er_src -dis func map.o|grep call 
        [?]     188d:  call    std::basic_string...::basic_string
        [?]     189f:  call    std::basic_string...::basic_string
        [?]     18b2:  call    std::basic_string...::basic_string
        [?]     18c2:  call    std::basic_string...::basic_string
        [?]     18d8:  call    std::_Rb_tree...::insert_unique
        [?]     18f8:  call    std::__node_alloc...::_M_deallocate
        [?]     190c:  call    std::_STLP_alloc_proxy...::~_STLP_alloc_proxy

What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.

There's an alternative way of assigning the mapping:

#include <map>
#include <string>

using namespace std;

int func(map<string,string>&mymap, string &s1, string &s2)
{
  // Look up (or create) the entry for s1, then assign s2 to it
  mymap[s1]=s2;
  return 0;
}

When compiled the resulting code looks a lot neater:

$ er_src -dis func map.o|grep call
        [?]     28e6:  call    std::map...::operator[]
        [?]     2903:  call    std::basic_string...::_M_assign_dispatch

Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:

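int main()
{
  map<string,string> mymap;
  string s1,s2;
  long long i;
  s1="key"; s2="value";  // placeholder constants
  for (i=0; i<100000000; i++) { func(mymap,s1,s2); }
}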

It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing early because of this. The performance of the two codes is quite surprising. The operator[] version is shown first, followed by the version that creates the pair object:

real           6.79
user           6.77
sys            0.00

real        1:03.53
user        1:03.26
sys            0.01 

So the version that creates the pair object is about 10x slower!
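The caveat about constant strings is worth spelling out, because the two versions of func do not behave identically when the key is already present. The following is a minimal sketch, not the original code: the helper names add_with_insert and add_with_index are hypothetical. It shows that insert leaves an existing entry alone, whereas operator[] followed by assignment always overwrites it.

```cpp
#include <map>
#include <string>

// Hypothetical helper mirroring the first version of func(): build a pair
// and insert it. insert() leaves an existing entry untouched.
bool add_with_insert(std::map<std::string, std::string>& m,
                     const std::string& key, const std::string& value)
{
  // .second is true only when a new element was actually inserted
  return m.insert(std::pair<std::string, std::string>(key, value)).second;
}

// Hypothetical helper mirroring the second version: operator[] finds or
// creates the entry, then the value is assigned unconditionally.
void add_with_index(std::map<std::string, std::string>& m,
                    const std::string& key, const std::string& value)
{
  m[key] = value;
}
```

With the harness above, every iteration after the first finds the key already in the map. The insert version still has to construct the pair (copying both strings) before it discovers there is nothing to insert, while the operator[] version only reassigns the value.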

Friday Jun 26, 2009

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    <Total>
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.
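To make the cost of that scan concrete, here is a deliberately simplified model of one size class. It is a sketch under stated assumptions, not mtmalloc's real data structure: ChunkList and chunks_skipped are hypothetical names, and each chunk is reduced to a count of its free blocks.

```cpp
#include <cstddef>
#include <vector>

// Simplified, hypothetical model of one size class: each entry is the number
// of free blocks left in a chunk. Not the real mtmalloc data structure.
struct ChunkList {
  std::vector<std::size_t> free_blocks;
};

// Walk the list from the front, in the same spirit as the loop at line 1032,
// and return how many full chunks were skipped before one with free space
// was found. Returns the list length when every chunk is full.
std::size_t chunks_skipped(const ChunkList& list)
{
  std::size_t skipped = 0;
  while (skipped < list.free_blocks.size() && list.free_blocks[skipped] == 0)
    ++skipped;
  return skipped;
}
```

In this model the work per allocation is proportional to the number of full chunks at the head of the list, so a long list of exhausted chunks makes every request pay for the walk.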

It's probably worth revealing the test code, at this point, so that you can see what it does:

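#include <stdlib.h>

typedef struct s {
  struct s * next;
  char padding[508];
} S;

int main()
{
  struct s * head = 0;   /* short-lived list, freed on every outer iteration */
  struct s * keep = 0;   /* long-lived list, never freed */
  struct s * current;
  for (int j=0; j<100; j++) {
    for (int i=0; i<100000; i++) {
      current=(struct s*)malloc(sizeof(struct s));
      /* Reconstructed: alternate allocations between the two lists */
      if (i&1) { current->next = head; head = current; }
      else     { current->next = keep; keep = current; }
    }
    /* Free the short-lived list, leaving holes in the chunks */
    current = head;
    while (current!=0) {
      struct s * tmp = current;
      current = current -> next;
      free(tmp);
    }
    head = 0;
  }
  return 0;
}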

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

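However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes, using mallocctl() from <mtmalloc.h>, increase the requestsize from 9 to 20:

#include <mtmalloc.h>

  mallocctl(MTCHUNKSIZE, 20);  /* called in main() before the allocation loops */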

The runtime drops from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem. Compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
{
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
  {
    std::vector<int> v;
    v.push_back(10);  // pushed value is a placeholder
  }
  return 0;
}

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      <Total>
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector<int,std::allocator<int> >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks. If you dig through the source you'll find that node_alloc has a single mutex lock, so only one thread at a time can allocate or deallocate memory, which is why the code shows no scaling.

This test code is basically creating and destroying vector objects, so it hits the allocate and deallocate routines very hard, which is why I picked it. Real codes are much less likely to have this problem at quite the same level. However, it is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops. This works for some instances, but is not a great solution, as even in the best case it makes the code more complex.
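As an illustration of the hoisting workaround, here is a serial sketch. The function names and the summing of values are hypothetical, added only so the two variants can be compared. The first function constructs a vector on every iteration, as the test code does; the second reuses a single vector.

```cpp
#include <vector>

// Version in the style of the test code: a vector is constructed and
// destroyed on every iteration, so each pass allocates and frees memory.
long sum_with_local_vector(int iterations)
{
  long total = 0;
  for (int i = 0; i < iterations; i++) {
    std::vector<int> v;
    v.push_back(i);
    total += v.back();
  }
  return total;
}

// Hoisted version: one vector is reused. clear() removes the elements but
// keeps the allocated storage, so later push_back calls need not allocate.
long sum_with_hoisted_vector(int iterations)
{
  long total = 0;
  std::vector<int> v;
  for (int i = 0; i < iterations; i++) {
    v.clear();
    v.push_back(i);
    total += v.back();
  }
  return total;
}
```

Both functions return the same result. In a parallel loop each thread would need its own hoisted vector rather than a shared one, which is part of the extra complexity mentioned above.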

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic. I wanted the optimised, multithreaded, 64-bit version, which translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D 

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

This is not, to be honest, a great start: the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     <Total>
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     *mutex_lock

So the mutex activity is coming from malloc and free, which are part of the default Solaris memory allocator. The default memory allocator is thread safe, but does not give good performance for MT codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
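The scaling can be quantified from the timings above. This is a small sketch with hypothetical helper names; it uses the usual definitions of speedup (single-thread time divided by multi-thread time) and parallel efficiency (speedup divided by thread count).

```cpp
// Speedup relative to the single-threaded run: T(1) / T(n).
double speedup(double t1, double tn)
{
  return t1 / tn;
}

// Parallel efficiency: speedup divided by the number of threads.
double efficiency(double t1, double tn, int threads)
{
  return speedup(t1, tn) / threads;
}
```

Plugging in the real times of 18.02 seconds for one thread and 3.51 seconds for eight threads gives a speedup of a little over 5x. Most of the shortfall appears between one and two threads, where user time rises from about 18 to about 27 seconds; from two to eight threads the real time roughly halves with each doubling.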

Thursday May 15, 2008

Redistributable libraries

Steve Clamage and I just put together a short article on using the redistributable libraries that are shipped as part of the compiler. The particular one we focus on is stlport4 since this library is commonly substituted for the default libCstd.

There are two points to take away from the article. First of all, that the required libraries should be copied into a new directory structure for distribution with your application - this makes it easy to patch them, and ensures that the correct version is picked up. The second point is to use the $ORIGIN token when linking the application to specify the path, relative to the location of the executable, where the library will be found at runtime.

Runtime linking is one of my bugbears. I really get fed up with software that requires libraries to be located in particular places in order for it to run, or, worse, software that requires LD_LIBRARY_PATH to be set for the application to locate the libraries (see Rod Evans' blog entry).

Friday Sep 28, 2007

ClusterTools 7 and stlport

Hit a problem trying to build a C++ application which required -library=stlport4 under ClusterTools 7. Unfortunately they don't currently mix.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

