Friday Dec 16, 2005

pthread_{get,set}specific vs thread local variable

When you write multithreaded code, sometimes you want per-thread data that can be easily accessed everywhere. This need is especially common when you port an existing single-threaded application to be multithreaded. Two possible solutions among many are 1) using the pthread_{get,set}specific() API and 2) using the compiler's thread local variable support. In this entry, I'll compare those two approaches and show an example.

pthread_{get,set}specific() are part of the POSIX thread API and provide access to per-thread data through a key. This contrasts with the thread local variable, which is a language extension provided by the Sun Studio compiler. Both rely on the runtime linker's support for thread local storage (TLS), but they are quite different.

First, they look completely different in the source code. The pthread APIs are just a bunch of function calls, whereas the thread local variable is declared with a type qualifier, so it integrates more seamlessly into the source but is less portable (well, gcc supports __thread on many platforms, and the Microsoft compiler supports __declspec(thread) to do the same, so portability isn't that big of a deal in practice. Still, it is a non-standard extension, and once you use it your code won't be 100% standard C). This difference also affects compiler optimization: the compiler recognizes a thread local variable and knows its semantics, but it has no idea what the pthread APIs are about (and even if we taught the compiler to recognize them, there's not much it could do, since those APIs live in libc.so).

Second, pthread_{get,set}specific() can store only a void *. This means that if you want data bigger than a void *, you need to allocate it on the heap yourself and store a pointer to it. This contrasts with __thread, which can be used for data of any size. And for such large data, __thread doesn't need the extra indirection that pthread_{get,set}specific() requires when a pointer is stored instead of the actual variable.
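To make that last point concrete, here is a minimal sketch (my illustration, not part of the example below; the struct and the names are made up) of keeping a per-thread struct both ways:

    #include <pthread.h>
    #include <stdlib.h>

    struct ctx { int depth; char name[32]; };   /* bigger than a void * */

    /* pthread key: only a void * fits, so the struct lives on the heap
       and every access pays a pointer chase through the key. */
    pthread_key_t ctx_key;

    struct ctx *
    get_ctx(void)
    {
            struct ctx *c = pthread_getspecific(ctx_key);
            if (c == NULL) {                    /* first use in this thread */
                    c = calloc(1, sizeof (*c));
                    pthread_setspecific(ctx_key, c);
            }
            return (c);
    }

    /* __thread: the whole struct is per-thread; no heap allocation and
       no extra indirection on access. */
    __thread struct ctx tls_ctx;

    int
    main(void)
    {
            pthread_key_create(&ctx_key, free); /* free() as the destructor */
            get_ctx()->depth = 1;               /* via the key */
            tls_ctx.depth = 1;                  /* via __thread */
            return (0);
    }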

Here's a single .c file that can be built to do the equivalent thing with either pthread_{get,set}specific() or the compiler language extension.

# cat -n thread.c
     1  #include <stdio.h>
     2  #include <stdlib.h>
     3  #include <unistd.h>
     4  #include <alloca.h>
     5  #include <pthread.h>
     6
     7  void *test(void *);   /* thread routine */
     8  void nop(int);
     9
    10  #ifdef TLS
    11  __thread int i;
    12  #else
    13   pthread_key_t key_i;
    14  #endif
    15   pthread_t *tid;
    16
    17   int
    18   main(int argc, char *argv[])
    19   {
    20           int i;
    21           int iter;
    22           int nthread;
    23           if (argc <= 2)  {
    24                   printf("Usage: %s #threads #iteration\n", argv[0]);
    25                   return (1);
    26           }
    27
    28  #ifndef TLS
    29           pthread_key_create(&key_i, NULL);
    30  #endif
    31
    32           nthread = atoi(argv[1]);
    33           iter = atoi(argv[2]);
    34           tid = alloca( sizeof(pthread_t) * nthread );
    35
    36           printf("main() %d threads %d iterations\n", nthread, iter);
    37
    38           for ( i = 0; i < nthread; i++)
    39                   pthread_create(&tid[i], NULL, test,
    40                       (void *)iter);
    41           for ( i = 0; i < nthread; i++)
    42                   pthread_join(tid[i], NULL);
    43
    44           printf("main() reporting that all %d threads have terminated\n", i);
    45           return (0);
    46   }  /* main */
    47
    48   void *
    49   test(void *arg)
    50   {
    51           int count = (int)arg;
    52           int v;
    53  #ifdef TLS
    54           i = count;
    55  #else
    56           if ( pthread_setspecific( key_i, (void *)count ) != 0 ) {
    57                   printf("pthread_setspecific failed\n");
    58           }
    59  #endif
    60           printf("thread %d test count %d\n", pthread_self(), count);
    61  #ifdef TLS
    62           while( i > 10 ) {
    63                  i = i - 1;
    64                  nop(i);
    65           }
    66  #else
    67           while( (v = (int)(pthread_getspecific(key_i))) > 10 ) {
    68                  pthread_setspecific(key_i, (void *)(v - 1));
    69                  nop(v);
    70           };
    71  #endif
    72           printf("thread %d finished. count %d\n", pthread_self(), count);
    73
    74           return (NULL);
    75   }
#
This is a contrived example, but it is good enough to illustrate the difference between the compiler support and the pthread API. You might have noticed __thread at line 11: it is a type qualifier specifying that the variable is per-thread, meaning each thread sees its own copy of i. Without it, all threads would share the same i.

I built the above source code with:

# cc -fast -xarch=v8plus thread.c -DTLS -o tls.out nop.il -Wc,-Qinline-l
# cc -fast -xarch=v8plus thread.c -o specific.out nop.il -Wc,-Qinline-l
You may wonder what those extra compiler arguments (nop.il and -Wc,-Qinline-l) are about; I'll explain later. Anyway, with this compilation, let's compare the generated code for the while loops above (lines 62 and 67).
# dis -F test specific.out
    ...
    test+0x58:              40 00 40 73  call         pthread_setspecific
    test+0x5c:              d0 06 a0 84  ld           [%i2 + 0x84], %o0
    test+0x60:              90 10 00 1c  mov          %i4, %o0
    test+0x64:              01 00 00 00  nop
    test+0x68:              40 00 40 75  call         pthread_getspecific
    test+0x6c:              d0 06 a0 84  ld           [%i2 + 0x84], %o0
    test+0x70:              b8 10 00 08  mov          %o0, %i4
    test+0x74:              80 a2 20 0a  cmp          %o0, 0xa
    test+0x78:              14 4f ff f8  bg,pt        %icc, test+0x58
    test+0x7c:              92 22 20 01  sub          %o0, 0x1, %o1
    ...
The pthread API case is straightforward - there are two calls, each with a load (in the delay slot) of the value of key_i. But to really see what's going on, you need to look at pthread_getspecific() itself:
# dis -F pthread_getspecific /usr/lib/libc.so.1
                ****   DISASSEMBLER  ****


disassembly for /usr/lib/libc.so.1

section .text
pthread_getspecific()
    pthread_getspecific:    80 a2 20 09  cmp          %o0, 0x9
    pthread_getspecific+0x4: 1a 80 00 05  bgeu         pthread_getspecific+0x18
    pthread_getspecific+0x8: 9a 01 e1 00  add          %g7, 0x100, %o5
    pthread_getspecific+0xc: 97 2a 20 02  sll          %o0, 0x2, %o3
    pthread_getspecific+0x10: 81 c3 e0 08  retl
    pthread_getspecific+0x14: d0 03 40 0b  ld           [%o5 + %o3], %o0
    pthread_getspecific+0x18: c2 01 e0 fc  ld           [%g7 + 0xfc], %g1
    pthread_getspecific+0x1c: 80 a0 60 00  cmp          %g1, 0x0
    pthread_getspecific+0x20: 02 80 00 08  be           pthread_getspecific+0x40
    pthread_getspecific+0x24: 01 00 00 00  nop
    pthread_getspecific+0x28: d4 00 60 00  ld           [%g1], %o2
    pthread_getspecific+0x2c: 80 a2 00 0a  cmp          %o0, %o2
    pthread_getspecific+0x30: 1a 80 00 04  bgeu         pthread_getspecific+0x40
    pthread_getspecific+0x34: 9b 2a 20 02  sll          %o0, 0x2, %o5
    pthread_getspecific+0x38: 81 c3 e0 08  retl
    pthread_getspecific+0x3c: d0 00 40 0d  ld           [%g1 + %o5], %o0
    pthread_getspecific+0x40: 81 c3 e0 08  retl
    pthread_getspecific+0x44: 90 10 20 00  clr          %o0
#
If you follow the dependency chain, you'll see that
    pthread_getspecific+0x18: c2 01 e0 fc  ld           [%g7 + 0xfc], %g1
    pthread_getspecific+0x34: 9b 2a 20 02  sll          %o0, 0x2, %o5
    pthread_getspecific+0x3c: d0 00 40 0d  ld           [%g1 + %o5], %o0
are what it takes to get the return value. So each access through pthread_getspecific() needs three loads: one to load key_i in test(), one to load a pointer to the thread local storage area from the thread pointer %g7, and a final one to get the actual value (see _pthread_getspecific for the libc source behind pthread_getspecific()). Although I didn't show pthread_setspecific(), it's essentially similar in the common case - two loads to form the address and one store to do the actual write.
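For reference, here is roughly what that disassembly is doing, rewritten in C. The struct layout and the names below are my own reconstruction from the instructions shown (the real definitions live inside libc), so treat it as a sketch:

    #define TSD_NFAST 9                 /* the 'cmp %o0, 0x9' */

    struct tsd_slow {                   /* pointer kept at %g7 + 0xfc */
            unsigned int nalloc;        /* 'ld [%g1], %o2': slot count */
            /* slots follow; the disassembly indexes from the struct base */
    };

    struct cur_thread {                 /* what %g7 points at */
            char pad[0xfc];             /* other per-thread fields */
            struct tsd_slow *stsd;      /* slow slots, may be NULL */
            void *ftsd[TSD_NFAST];      /* fast slots at %g7 + 0x100 */
    };

    void *
    sketch_getspecific(struct cur_thread *self, unsigned int key)
    {
            if (key < TSD_NFAST)
                    return (self->ftsd[key]);            /* ld [%g7+0x100 + key*4] */
            if (self->stsd != NULL && key < self->stsd->nalloc)
                    return (((void **)self->stsd)[key]); /* ld [%g1 + key*4] */
            return (NULL);
    }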

Now, let's look at the __thread case:

# dis -F test tls.out
    ...
    test+0x54:              d0 25 a0 00  st           %o0, [%l6]
    test+0x58:              01 00 00 00  nop
    test+0x5c:              ea 05 a0 00  ld           [%l6], %l5
    test+0x60:              80 a5 60 0a  cmp          %l5, 0xa
    test+0x64:              14 4f ff fc  bg,pt        %icc, test+0x54
    test+0x68:              90 25 60 01  sub          %l5, 0x1, %o0
    ...
#
The __thread code looks straightforward on the surface - one load reads the value, and one store writes it back. The question is how %l6 is formed. Here's the assembly snippet from just before the code above:
# dis -F test tls.out
                ****   DISASSEMBLER  ****


disassembly for tls.out

section .text
test()
    test:                   9d e3 bf a0  save         %sp, -0x60, %sp
    test+0x4:               40 00 00 02  call         test+0xc
    test+0x8:               9e 10 00 0f  mov          %o7, %o7
    test+0xc:               1b 00 00 40  sethi        %hi(0x10000), %o5
    test+0x10:              3b 00 00 00  sethi        %hi(0x0), %i5
    test+0x14:              9a 03 61 5c  add          %o5, 0x15c, %o5
    test+0x18:              b8 1f 7f f8  xor          %i5, -0x8, %i4
    test+0x1c:              b6 03 40 0f  add          %o5, %o7, %i3
    test+0x20:              33 00 00 42  sethi        %hi(0x10800), %i1
    test+0x24:              b4 10 00 1c  mov          %i4, %i2
    test+0x28:              ac 01 c0 1a  add          %g7, %i2, %l6
    ...
#
I'll explain what's going on in this code sequence in more detail in another blog entry; for now, let me just say that it shows the dance between the generated code and the runtime linker that forms a pointer to the thread local storage. The important point is that this sequence doesn't need any loads, and it usually happens only once per routine (or at least outside the loop). So __thread requires only one load (or store) to access the variable.
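Here is a toy model of the idea (not the real mechanism - the offset and the names are made up): the runtime linker decides once at what offset from the thread pointer the variable lives, and after that every access is plain arithmetic plus a single load or store:

    #include <stdio.h>

    struct tls_block { int i; };        /* stands in for a thread's static TLS area */

    int
    main(void)
    {
            struct tls_block my_tls = { 42 };
            char *thread_pointer = (char *)&my_tls;  /* models %g7 */
            long offset_of_i = 0;       /* hypothetical linker-assigned offset */

            /* Forming the address needs no memory loads: */
            int *addr = (int *)(thread_pointer + offset_of_i);
            *addr = *addr - 1;          /* one load and one store, like 'i = i - 1' */
            printf("%d\n", *addr);
            return (0);
    }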
With all of the above in mind, let's compare the performance. This run was on a 750MHz UltraSPARC-III system:
# ptime ./specific.out 1  100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        5.974
user        5.903
sys         0.015
# ptime ./tls.out      1  100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        2.024
user        1.971
sys         0.011
# ptime ./specific.out 2  100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
main() reporting that all 2 threads have terminated

real        6.813
user       11.802
sys         0.028
# ptime ./tls.out      2  100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 3 finished. count 100000000
thread 2 finished. count 100000000
main() reporting that all 2 threads have terminated

real        2.150
user        3.931
sys         0.010
#
Well, as expected: three loads vs. one load, and the performance difference is close to 3x. I also tried this on a Niagara box (T2000, 1GHz, 8 cores):
# ptime ./specific.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        6.932
user        6.904
sys         0.009
# ptime ./tls.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        2.637
user        2.602
sys         0.010
# ptime ./specific.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
thread 4 finished. count 100000000
thread 6 finished. count 100000000
thread 8 finished. count 100000000
thread 9 finished. count 100000000
thread 7 finished. count 100000000
thread 5 finished. count 100000000
main() reporting that all 8 threads have terminated

real        6.966
user       55.246
sys         0.013
# ptime ./tls.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 3 finished. count 100000000
thread 9 finished. count 100000000
thread 2 finished. count 100000000
thread 4 finished. count 100000000
thread 5 finished. count 100000000
thread 6 finished. count 100000000
thread 7 finished. count 100000000
thread 8 finished. count 100000000
main() reporting that all 8 threads have terminated

real        2.734
user       21.604
sys         0.011
#
Not so surprisingly, they show a similar performance difference. Niagara performs quite well, and it scales linearly up to 8 threads. (Why no result for 32 threads? That's for the next blog entry :-))

BTW, I didn't explain what those extra compiler arguments (nop.il and -Wc,-Qinline-l) are about. A variable declared with __thread is still a variable, and hence is treated the same way as any other global variable except for how its address is formed. Without the nop() call at line 64, the compiler simply optimized the whole loop away. Once I inserted the nop() call, the compiler could no longer remove the loop (since it doesn't know what nop() does), but the call adds extra overhead, which can be quite big since the loop itself is so small. So I wrote a little inline template for nop() so that the compiler could inline the call away. Of course, once I did that, the compiler again optimized away the loop, since it could see that the call doesn't do anything. Argh. So, to keep the compiler from understanding the body of nop() while still allowing it to be inlined, I added -Wc,-Qinline-l, a code generator internal option that delays the inlining of inline templates until after the early optimizations, which prevents the loop from being optimized away.
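The post above never shows nop.il itself; assuming the standard Sun Studio inline-template format for SPARC, a minimal version would be an inline with an empty body, something like:

    # cat nop.il
        .inline nop,4
        .end
    #

Since the template body is empty, inlining it replaces the call with nothing - which is exactly why -Wc,-Qinline-l is needed to keep the loop optimizer from seeing through it too early.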

Anyway, in summary: the thread local variable and the pthread_{get,set}specific() APIs do similar things but differ in interface, performance, and portability. If you can afford to use __thread, it is usually the better choice in terms of both the amount of code change and the performance.

Tuesday Jun 28, 2005

What do you prefer - a faulty program that runs slowly, or one that dies immediately?

Whenever we change a default compilation flag, we often have to make interesting tradeoffs. A relatively recent tradeoff concerned the default value for -xmemalign (see the documentation page describing the flag). Starting with Studio 9, our compiler uses -xmemalign=8i as the default in 32-bit mode (vs. -xmemalign=4s before).

With -xmemalign=4s, a program that does an unaligned memory access dies right away, telling the developer what went wrong and where. But with -xmemalign=8i, such a program simply runs slowly, and that kind of performance degradation is somewhat difficult to track down (well, if you know where to look, dtrace is again your friend here, but you pretty much have to know the answer beforehand).
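For instance, a contrived little program like this one (my illustration) dies with SIGBUS at the first store under -xmemalign=4s, but runs to completion - slowly, with every access emulated by the kernel - under -xmemalign=8i:

    #include <stdio.h>

    int
    main(void)
    {
            char buf[sizeof (double) * 2];
            double *p = (double *)(buf + 1);  /* deliberately misaligned */
            int j;

            *p = 0.0;                         /* dies here under -xmemalign=4s */
            for (j = 0; j < 10000000; j++)
                    *p = *p + 1.0;            /* each access traps and is fixed up under 8i */
            printf("%f\n", *p);
            return (0);
    }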

Sounds bad, so why did we do it? The answer is, again, performance (what else?). When you compile code with -xmemalign=8i as opposed to 4s, the compiler can safely use 8-byte loads and stores for appropriately sized data (like double precision floating point). Since most programs are alignment-safe - that is, most code doesn't do funky typecasting (like casting a char pointer to an integer pointer and dereferencing it) - this change doesn't degrade such correct programs at all and can actually improve them somewhat. You may ask: why not -xmemalign=8s? Unfortunately, some constructs in Fortran, plus an artifact of the 32-bit SPARC ABI, make it impossible to use -xmemalign=8s even for a completely correct program. However, those occurrences are rare enough not to affect overall performance in most programs, hence the decision to switch to -xmemalign=8i.

But while you're developing code, you may actually want to use -xmemalign=4s instead of -xmemalign=8i, to make it easier to find any alignment trouble in your code. Anyway, if I have some time, I'll write about how all this unaligned access handling works - the dance between the user code and the kernel. In the meantime, if you're in a hurry, the kernel code that emulates unaligned access for 32-bit apps would be an interesting place to start looking.
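Just to give the flavor of it, the kernel's fixup amounts to something like the following (a grossly simplified sketch, not the actual kernel code): on an alignment trap it decodes the faulting load or store, performs it byte by byte, and resumes the program after the instruction.

    #include <string.h>

    /* Byte-wise copies have no alignment requirement, which is how an
       emulated 8-byte load can be done from any address (sketch only): */
    double
    emulated_load_double(const void *unaligned_addr)
    {
            double d;
            memcpy(&d, unaligned_addr, sizeof (d));
            return (d);
    }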
