Friday Aug 28, 2009

Maps in the STL

I was looking at some code with a colleague, and we observed that a lot of time was being spent in code that used std::map to set up mappings between strings. The source code looked rather like the following:

#include <map>
#include <string>

using namespace std;

int func(map<string,string> &mymap, string &s1, string &s2)
{
  mymap.insert(pair<string,string>(s1,s2));
  return 0;
}

When compiled with CC -O -c -library=stlport4 map.cc, this expands to a horrendous set of calls. Here are the first few:

$ er_src -dis func map.o|grep call 
        [?]     188d:  call    std::basic_string...::basic_string
        [?]     189f:  call    std::basic_string...::basic_string
        [?]     18b2:  call    std::basic_string...::basic_string
        [?]     18c2:  call    std::basic_string...::basic_string
        [?]     18d8:  call    std::_Rb_tree...::insert_unique
        [?]     18f8:  call    std::__node_alloc...::_M_deallocate
        [?]     190c:  call std::_STLP_alloc_proxy...::~_STLP_alloc_proxy
                ...

What's happening is that the act of making a pair object is causing copies to be made of the two strings that are passed into the pair constructor. Then the pair object is passed into the insert method of std::map and this results in two more copies of the strings being made. There's a bunch of other stuff going on, and the resulting code is a mess.
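To make those copies visible, here's a small sketch (not from the original code) that wraps the strings in a hypothetical counted_string type whose copy constructor increments a counter. The exact count depends on the library and language standard, but with a pre-C++11 library such as stlport4 you'd expect copies both when the pair is built and again when insert places it into the tree:

#include <map>
#include <string>
#include <iostream>

// Hypothetical wrapper: counts copy constructions so that the copies made
// by the pair constructor and by map::insert become visible.
struct counted_string
{
  static int copies;
  std::string value;
  explicit counted_string(const char *s) : value(s) {}
  counted_string(const counted_string &other) : value(other.value) { ++copies; }
  bool operator<(const counted_string &other) const { return value < other.value; }
};
int counted_string::copies = 0;

int main()
{
  std::map<counted_string,counted_string> m;
  counted_string k("123456789"), v("987654321");
  // Constructing the pair copies k and v; insert then copies the pair's
  // contents into the tree node. The exact total is library dependent.
  m.insert(std::pair<counted_string,counted_string>(k, v));
  std::cout << "copy constructions: " << counted_string::copies << "\n";
  return 0;
}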

There's an alternative way of assigning the mapping:

#include <map>
#include <string>

using namespace std;

int func(map<string,string> &mymap, string &s1, string &s2)
{
  mymap[s1]=s2;
  return 0;
}

When compiled, the resulting code looks a lot neater:

$ er_src -dis func map.o|grep call
        [?]     28e6:  call    std::map...::operator[]
        [?]     2903:  call    std::basic_string...::_M_assign_dispatch

Of course a neater chunk of code is nice, but the question is whether the code for ::operator[] contains the same ugly mess. Rather than disassembling to find out, it's simpler to time the two versions and see which does better. A simple test harness looks like:

int main()
{
  map<string,string> mymap;
  string s1,s2;
  long long i;
  s1="123456789";
  s2="987654321";
  for (i=0; i<100000000; i++)
  {
    func(mymap,s1,s2);
  }
} 

It's a less than ideal harness since it uses constant strings, and one version of the code might end up bailing out early because of this. The performance of the two codes is quite surprising. First, the timing for the operator[] version:

real           6.79
user           6.77
sys            0.00

And for the version that builds a pair and calls insert:

real        1:03.53
user        1:03.26
sys            0.01

So the version that creates the pair object is about 10x slower!
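To address the constant-string caveat, a variant of the harness could build a different key on each iteration, along the following lines (a sketch only: the iteration count is reduced because every call now inserts a new entry, and the per-iteration string formatting adds its own overhead, so the absolute times are not directly comparable with the numbers above):

#include <map>
#include <string>
#include <sstream>

using namespace std;

// func is either of the two versions shown above, compiled separately.
int func(map<string,string> &mymap, string &s1, string &s2);

int main()
{
  map<string,string> mymap;
  string s2 = "987654321";
  for (int i = 0; i < 1000000; i++)
  {
    // Build a different key each time so that every call really inserts.
    ostringstream key;
    key << "key" << i;
    string s1 = key.str();
    func(mymap, s1, s2);
  }
  return 0;
}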

Friday Jun 12, 2009

Stlport4 and multithreaded code

I finally resolved a problem that's been annoying me for about 3 years. Codes that use the Standard Template Library don't scale to multiple threads.

First off, it's probably good to take a look at a code that illustrates the problem:

#include <vector>

int main()
{
  #pragma omp parallel for default (__auto)
  for (int i=0; i<10000000; i++)
  {
    std::vector<int> v;
    v.push_back(10);
  }
  return(0);
}

The first comparison is between the serial performance of the Solaris default STL and stlport4 which is provided with the compiler.

$ CC -O t1.cc
$ timex a.out
real          15.85
user          15.64
sys            0.01
$ CC -O -library=stlport4 t1.cc
$ timex a.out
real           7.87
user           7.78
sys            0.01

This doesn't tell me anything that I didn't already know. stlport4 is (as far as I know) always faster than the STL provided by Solaris. Hence if you use C++, then you should use stlport4 in preference to the Solaris default. The constraint is that each application (libraries and all) can only use one version of the STL. So if a library that is outside your control uses the Solaris default, then the entire app must use it.

The next thing to investigate is scaling when there are multiple threads:

$ CC -O -xopenmp -library=stlport4 t1.cc
$ timex a.out
real           7.00
user           6.96
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real           7.18
user          14.28
sys            0.01

So compiling the code to use OpenMP caused no performance overhead, but running with two threads had the same runtime as a run with a single thread. We can profile the code to see what's happening:

Excl.     Incl.      Name  
User CPU  User CPU         
 sec.      sec.       
8.076     8.076      
1.571     2.272      mutex_lock_impl
1.501     1.971      mutex_unlock
1.051     4.573      std::vector<int,std::allocator<int> >::_M_insert_overflow(int*,const int&,const std::__true_type&,unsigned,bool)
0.871     8.076      _$d1A5.main
0.871     3.272      std::__node_alloc<true,0>::_M_allocate(unsigned)
0.560     1.721      std::__node_alloc<true,0>::_M_deallocate(void*,unsigned)
0.480     0.480      sigon
0.440     0.440      mutex_trylock_adaptive
0.250     0.470      mutex_unlock_queue

So the lost time is due to mutex locks. If you dig through the source, you'll find that node_alloc has a single mutex lock that only allows one thread to allocate or deallocate memory at a time, which is why the code shows no scaling.

This test code basically creates and destroys vector objects, so it hits the allocate and deallocate routines very hard, which is why I picked it. Real codes are much less likely to have this problem to quite the same extent, but it is not unusual to want to create and destroy objects within a loop. One workaround is to hoist the objects out of the hot loops, as in the sketch below. This works in some instances, but is not a great solution, as even in the best case it makes the code more complex.
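For this test code, the hoisting workaround might look something like the following sketch: each thread constructs its vector once, inside the parallel region but outside the loop, and reuses it across iterations:

#include <vector>

int main()
{
  #pragma omp parallel
  {
    std::vector<int> v;   // hoisted: constructed once per thread
    #pragma omp for
    for (int i=0; i<10000000; i++)
    {
      v.clear();          // reuse the existing storage
      v.push_back(10);
    }
  }
  return 0;
}

After the first iteration the vector's capacity is already sufficient, so the allocator is barely touched, but this restructuring is exactly the kind of extra complexity mentioned above.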

The solution I ended up using was to build the Apache STL. It turned out to be a relatively straightforward experience. The compile line is a bit cryptic; I wanted the optimised, multithreaded, 64-bit version, and this translates to:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config 

Once I had it built, I could install it with:

$ gmake BUILDTYPE=12D CONFIG=sunpro.config install PREFIX=`pwd`/install

The steps necessary to use a different STL than the ones supplied with the compiler are documented here. The compile line for the test code was:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc

So we can build the test and look at the scaling between one and two threads:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.98
user          18.93
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          18.42
user          36.73
sys            0.01

This is not, to be honest, a great start: the runtime is slower, and the code still fails to scale. However, the profile is different:

Excl.     Incl.      Name  
User CPU  User CPU         
  sec.      sec.      
21.145    21.145     
 2.572    16.411     std::vector<int,std::allocator<int> >::_C_insert_n(int*const&,unsigned long,const int&)
 2.402     4.293     mutex_unlock
 2.342     3.613     mutex_lock_impl
 1.961    10.697     std::vector<int,std::allocator<int> >::_C_realloc(unsigned long)
 1.681     5.634     free
 1.341     1.891     mutex_unlock_queue
 1.271     1.271     _free_unlocked
 0.991     0.991     sigon

So we still see a lot of mutex activity. Looking at where the mutex activity comes from provides an interesting insight:

(er_print) csingle mutex_lock
Attr.    Excl.     Incl.      Name  
User CPU  User CPU  User CPU         
 sec.      sec.      sec.       
0.170     1.681     5.634      free
0.020     0.690     4.623      malloc
0.190     0.190     0.190     *mutex_lock

So the mutex activity is coming from malloc and free, which are part of the default Solaris memory allocator. The default allocator is thread safe, but does not give good performance for multithreaded codes. There are two usual alternatives, mtmalloc and libumem. I've usually found mtmalloc to be good enough for me:

CC -m64  -O -xopenmp -library=no%Cstd \
   -I ./stdcxx-4.2.1/install/include/ \
   -L ./stdcxx-4.2.1/install/lib/     \
   -R ./stdcxx-4.2.1/install/lib/ -lstd12D t1.cc -lmtmalloc

Then we can try the timing tests again:

$ export OMP_NUM_THREADS=1
$ timex a.out
real          18.02
user          17.98
sys            0.01
$ export OMP_NUM_THREADS=2
$ timex a.out
real          13.76
user          27.05
sys            0.01
$ export OMP_NUM_THREADS=4
$ timex a.out
real           6.92
user          26.97
sys            0.02
$ export OMP_NUM_THREADS=8
$ timex a.out
real           3.51
user          26.99
sys            0.02

So the code is now scaling to multiple threads, which was the original problem. We have lost some serial performance, which is perhaps a concern, but that performance loss may be only for a particular code path, and depending on the usage of the library, we might even see gains in some of the algorithms. So depending on the situation, this might be a good enough solution. [FWIW, I also tested with libumem and did not see a significant difference in performance between the two libraries.]
