Thursday May 18, 2006

My last day at Sun

Today is my last day at Sun. After 7 years of learning, coding, making friends and having fun, I've finally decided to move on to a new job. I don't know what will happen to this blog, so I've created a new one at blogger.com. Hopefully, this posting will be alive long enough for people to know what happened...

Friday Mar 03, 2006

Neither fish nor fowl - v8plus

A very well kept secret(?) is finally found - I've been wondering for a looooooong time where the heck is the public documentation on v8plus, and it turns out it's very well hidden here.

We're working on making this more easily accessible, but in the meantime, the above URL should work. If it doesn't, go to sun.com and search for 802-7447.

BTW, if you have no idea what v8plus is, it's an extension of the 32bit ABI that is compatible with the old 32bit ABI but allows the use of 64bit processor features. For more detail, you need to look at the SPARC ABI specifications called "SPARC Compliance Definition" at the SPARC International resources page.

PS. Many thanks to Steve, Tunji and Chris.

Thursday Mar 02, 2006

GCC for SPARC Systems released!

Do you use gcc on Solaris/SPARC ? Whether you're using it by choice or you're forced to use it due to other reasons, if the answer is yes, here's something you may want to try - GCC for SPARC Systems.
Enjoy!

Friday Dec 16, 2005

pthread_{get,set}specific vs thread local variable

When you write multithreaded code, you sometimes want per-thread data that can be easily accessed from anywhere. This need is especially pressing when you port an existing single-threaded application to be multithreaded. Two possible solutions among many are 1) using the pthread_{get,set}specific() API and 2) using the compiler's thread local variable support. In this post, I'll compare the two approaches and show an example.

pthread_{get,set}specific() are part of the POSIX thread API and provide access to per-thread data through a key. The thread local variable, in contrast, is a language extension provided by the Sun Studio compiler. Both rely on the runtime linker's support for thread local storage (TLS), but they are quite different.

First of all, they look completely different in the source code. The pthread APIs are just a bunch of function calls, while a thread local variable is declared with a type qualifier. So the thread local variable is more seamlessly integrated into the source but is less portable (well, gcc supports __thread on many platforms, and the Microsoft compiler supports __declspec( thread ) to do the same, so portability isn't that big of a deal in practice. But it is still a non-standard extension, and once you use it your code won't be 100% standard C). This also affects compiler optimization: the compiler recognizes a thread local variable and knows its semantics, but it has no idea what the pthread APIs are about (and even if we teach compilers to recognize them, there's not much a compiler can do because those APIs live in libc.so).

pthread_{get,set}specific() can store only a void *. This means that if you want data that's bigger than a void *, you need to allocate it on the heap yourself and store a pointer to it. __thread, in contrast, can be used for data of any size, and for such large data it doesn't need the extra indirection that pthread_{get,set}specific() incurs by storing a pointer instead of the actual variable.
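To make the size point concrete, here is a minimal sketch of the same per-thread record kept both ways. The struct stats record and the stats_via_key()/stats_via_tls() helpers are names made up for this illustration; only the pthread calls and the __thread qualifier come from the discussion above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread record, larger than a void *. */
struct stats {
    long hits;
    long misses;
};

/* Approach 1: POSIX key. The slot holds only a pointer, so the record
 * lives on the heap and is released by the key's destructor. */
static pthread_key_t stats_key;
static pthread_once_t stats_once = PTHREAD_ONCE_INIT;

static void stats_key_init(void)
{
    int rc = pthread_key_create(&stats_key, free); /* free() runs at thread exit */
    assert(rc == 0);
    (void)rc;
}

struct stats *stats_via_key(void)
{
    struct stats *s;

    pthread_once(&stats_once, stats_key_init);
    s = pthread_getspecific(stats_key);
    if (s == NULL) {                    /* first use in this thread */
        s = calloc(1, sizeof(*s));
        if (s == NULL || pthread_setspecific(stats_key, s) != 0) {
            free(s);
            return NULL;
        }
    }
    return s;
}

/* Approach 2: compiler extension. The record itself is per-thread:
 * no allocation, no destructor, no extra pointer to chase. */
static __thread struct stats tls_stats;

struct stats *stats_via_tls(void)
{
    return &tls_stats;
}
```

Both helpers hand back a pointer to the calling thread's own record, so callers look the same; the difference is the bookkeeping the key-based version has to carry around.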

Here's a single .c file that can be built to do the equivalent work with either pthread_{get,set}specific() or the compiler language extension.

# cat -n thread.c
     1  #include <stdio.h>
     2  #include <stdlib.h>
     3  #include <pthread.h>
     4  #include <alloca.h>
     5
     6
     7  void *test(void *);   /* thread routine */
     8  void nop(int);
     9
    10  #ifdef TLS
    11  __thread int i;
    12  #else
    13   pthread_key_t key_i;
    14  #endif
    15   pthread_t *tid;
    16
    17   int
    18   main(int argc, char *argv[])
    19   {
    20           int i;
    21           int iter;
    22           int nthread;
    23           if (argc <= 2)  {
    24                   printf("Usage: %s #threads #iteration\n", argv[0]);
    25                   return (1);
    26           }
    27
    28  #ifndef TLS
    29           pthread_key_create(&key_i, NULL);
    30  #endif
    31
    32           nthread = atoi(argv[1]);
    33           iter = atoi(argv[2]);
    34           tid = alloca( sizeof(pthread_t) * nthread );
    35
    36           printf("main() %d threads %d iterations\n", nthread, iter);
    37
    38           for ( i = 0; i < nthread; i++)
    39                   pthread_create(&tid[i], NULL, test,
    40                       (void *)iter);
    41           for ( i = 0; i < nthread; i++)
    42                   pthread_join(tid[i], NULL);
    43
    44           printf("main() reporting that all %d threads have terminated\n", i);
    45           return (0);
    46   }  /* main */
    47
    48   void *
    49   test(void *arg)
    50   {
    51           int count = (int)arg;
    52           int v;
    53  #ifdef TLS
    54           i = count;
    55  #else
    56           if ( pthread_setspecific( key_i, (void*)count ) != 0 ) {
    57                   printf("pthread_setspecific failed\n");
    58           }
    59  #endif
    60           printf("thread %d test count %d\n", pthread_self(), count);
    61  #ifdef TLS
    62           while( i > 10 ) {
    63                  i = i - 1;
    64                  nop(i);
    65           }
    66  #else
    67           while( (v = (int)(pthread_getspecific(key_i))) > 10 ) {
    68                  pthread_setspecific(key_i, (void*)(v -1));
    69                  nop(v);
    70           };
    71  #endif
    72           printf("thread %d finished. count %d\n", pthread_self(), count);
    73
    74           return (NULL);
    75   }
#
This is a contrived example but is good enough to illustrate the difference between the compiler support and the pthread API. You might have noticed __thread at line 11, which is a type qualifier that specifies that the variable is per-thread, meaning that each thread will see its own copy of i. Without it, all threads will share the same i.
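The "own copy" behaviour is easy to see in isolation. The following is a small sketch, separate from thread.c, with made-up names (per_thread, shared, run_demo): a second thread writes to both a __thread variable and an ordinary global, and the calling thread then looks at what it sees.

```c
#include <assert.h>
#include <pthread.h>

static __thread int per_thread;   /* each thread gets its own copy */
static int shared;                /* one copy for the whole process */

static void *worker(void *arg)
{
    per_thread = 42;              /* touches only the worker's copy */
    shared = 42;                  /* visible to every thread */
    return arg;
}

/* Runs worker() in a second thread, then reports what the calling
 * thread sees in its own per_thread copy and in shared. */
void run_demo(int *seen_per_thread, int *seen_shared)
{
    pthread_t t;
    int rc;

    per_thread = 1;
    shared = 1;
    rc = pthread_create(&t, NULL, worker, NULL);
    assert(rc == 0);
    rc = pthread_join(t, NULL);   /* worker's writes are complete after the join */
    assert(rc == 0);
    (void)rc;
    *seen_per_thread = per_thread;
    *seen_shared = shared;
}
```

After run_demo() returns, the caller's per_thread is still 1 while shared has become 42, which is exactly the difference the qualifier at line 11 makes.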

I built the above source code with:

# cc -fast -xarch=v8plus thread.c -DTLS -o tls.out nop.il -Wc,-Qinline-l
# cc -fast -xarch=v8plus thread.c -o specific.out nop.il -Wc,-Qinline-l
You may wonder what those extra compiler arguments (nop.il and -Wc,-Qinline-l) are about. I'll explain later. Anyway, with this compilation, let's compare the generated code for the while loops above (lines 62 and 67).
# dis -F test specific.out
    ...
    test+0x58:              40 00 40 73  call         pthread_setspecific
    test+0x5c:              d0 06 a0 84  ld           [%i2 + 0x84], %o0
    test+0x60:              90 10 00 1c  mov          %i4, %o0
    test+0x64:              01 00 00 00  nop
    test+0x68:              40 00 40 75  call         pthread_getspecific
    test+0x6c:              d0 06 a0 84  ld           [%i2 + 0x84], %o0
    test+0x70:              b8 10 00 08  mov          %o0, %i4
    test+0x74:              80 a2 20 0a  cmp          %o0, 0xa
    test+0x78:              14 4f ff f8  bg,pt        %icc, test+0x58
    test+0x7c:              92 22 20 01  sub          %o0, 0x1, %o1
    ...
The pthread API case is straightforward - there are two calls, with two loads that load the value of key_i. But to really see what's going on, you need to look at pthread_getspecific() itself:
# dis -F pthread_getspecific /usr/lib/libc.so.1
                ****   DISASSEMBLER  ****


disassembly for /usr/lib/libc.so.1

section .text
pthread_getspecific()
    pthread_getspecific:    80 a2 20 09  cmp          %o0, 0x9
    pthread_getspecific+0x4: 1a 80 00 05  bgeu         pthread_getspecific+0x18
    pthread_getspecific+0x8: 9a 01 e1 00  add          %g7, 0x100, %o5
    pthread_getspecific+0xc: 97 2a 20 02  sll          %o0, 0x2, %o3
    pthread_getspecific+0x10: 81 c3 e0 08  retl
    pthread_getspecific+0x14: d0 03 40 0b  ld           [%o5 + %o3], %o0
    pthread_getspecific+0x18: c2 01 e0 fc  ld           [%g7 + 0xfc], %g1
    pthread_getspecific+0x1c: 80 a0 60 00  cmp          %g1, 0x0
    pthread_getspecific+0x20: 02 80 00 08  be           pthread_getspecific+0x40
    pthread_getspecific+0x24: 01 00 00 00  nop
    pthread_getspecific+0x28: d4 00 60 00  ld           [%g1], %o2
    pthread_getspecific+0x2c: 80 a2 00 0a  cmp          %o0, %o2
    pthread_getspecific+0x30: 1a 80 00 04  bgeu         pthread_getspecific+0x40
    pthread_getspecific+0x34: 9b 2a 20 02  sll          %o0, 0x2, %o5
    pthread_getspecific+0x38: 81 c3 e0 08  retl
    pthread_getspecific+0x3c: d0 00 40 0d  ld           [%g1 + %o5], %o0
    pthread_getspecific+0x40: 81 c3 e0 08  retl
    pthread_getspecific+0x44: 90 10 20 00  clr          %o0
#
If you follow the dependency chain, you'll see that
    pthread_getspecific+0x18: c2 01 e0 fc  ld           [%g7 + 0xfc], %g1
    pthread_getspecific+0x34: 9b 2a 20 02  sll          %o0, 0x2, %o5
    pthread_getspecific+0x3c: d0 00 40 0d  ld           [%g1 + %o5], %o0
are what it takes to get the return value. So, for each access, pthread_getspecific() needs three loads (one for loading key_i in test(), one for loading a pointer to thread local storage area from thread pointer %g7, and the final load to get the actual value. See _pthread_getspecific for the libc code for pthread_getspecific). Although I didn't show pthread_setspecific(), it's essentially similar in the common case - two loads to form the address and one store to do an actual write.
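Read as C, the pthread_getspecific() listing above amounts to roughly the following. This is only a sketch reconstructed from the instructions shown; struct thread_block, its field names and model_getspecific() are invented for illustration, and the real libc source will differ in detail. The constant 9 comes from the first cmp, the two fields stand for what sits at %g7 + 0xfc and %g7 + 0x100, and the chain picked out above corresponds to the second branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFAST 9   /* from `cmp %o0, 0x9` at the top of the listing */

/* Hypothetical stand-in for the per-thread block that %g7 points at. */
struct thread_block {
    uintptr_t *slow;      /* pointer loaded from [%g7 + 0xfc]; slow[0] is a slot count */
    void *fast[NFAST];    /* slots starting at %g7 + 0x100 */
};

void *model_getspecific(const struct thread_block *self, unsigned key)
{
    assert(self != NULL);
    if (key < NFAST)                      /* first branch: one indexed load */
        return self->fast[key];
    if (self->slow != NULL && key < self->slow[0])
        return (void *)self->slow[key];   /* second branch: pointer, bound check, value */
    return NULL;                          /* no slow table, or key out of range */
}
```

The model only mirrors the control flow; it makes no attempt to reproduce the exact offsets or the locking and allocation that the real pthread_setspecific() has to do.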

Now, let's look at __thread case:

# dis -F test tls.out
    ...
    test+0x54:              d0 25 a0 00  st           %o0, [%l6]
    test+0x58:              01 00 00 00  nop
    test+0x5c:              ea 05 a0 00  ld           [%l6], %l5
    test+0x60:              80 a5 60 0a  cmp          %l5, 0xa
    test+0x64:              14 4f ff fc  bg,pt        %icc, test+0x54
    test+0x68:              90 25 60 01  sub          %l5, 0x1, %o0
    ...
#
The __thread code looks straightforward on the surface - one load reads the value, one store writes it back. The question is how %l6 is formed. Here's the assembly snippet for the code just before the above:
# dis -F test tls.out
                ****   DISASSEMBLER  ****


disassembly for tls.out

section .text
test()
    test:                   9d e3 bf a0  save         %sp, -0x60, %sp
    test+0x4:               40 00 00 02  call         test+0xc
    test+0x8:               9e 10 00 0f  mov          %o7, %o7
    test+0xc:               1b 00 00 40  sethi        %hi(0x10000), %o5
    test+0x10:              3b 00 00 00  sethi        %hi(0x0), %i5
    test+0x14:              9a 03 61 5c  add          %o5, 0x15c, %o5
    test+0x18:              b8 1f 7f f8  xor          %i5, -0x8, %i4
    test+0x1c:              b6 03 40 0f  add          %o5, %o7, %i3
    test+0x20:              33 00 00 42  sethi        %hi(0x10800), %i1
    test+0x24:              b4 10 00 1c  mov          %i4, %i2
    test+0x28:              ac 01 c0 1a  add          %g7, %i2, %l6
    ...
#
I'll explain what's going on in this code sequence in more detail in another blog entry, but let me just say that this code shows the dance between the code and the runtime linker that forms a pointer to the thread local storage. This sequence doesn't need any load, and it usually happens only once within a routine (or at least outside the loop). So, __thread requires only one load (or store) to access the variable.
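If you have to stay with the POSIX API, you can do a similar hoisting by hand: keep the data on the heap, look the pointer up once, and work through that pointer inside the loop. This is not what the benchmark in this entry measures; it is just a sketch of the idea, and counter_key, counter_key_init() and count_down() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void)
{
    int rc = pthread_key_create(&counter_key, free);
    assert(rc == 0);
    (void)rc;
}

/* Counts this thread's counter down from start to floor.
 * Returns the number of decrements, or -1 on allocation failure. */
int count_down(int start, int floor)
{
    int *slot;
    int steps = 0;

    pthread_once(&counter_once, counter_key_init);
    slot = pthread_getspecific(counter_key);   /* one lookup, outside the loop */
    if (slot == NULL) {
        slot = malloc(sizeof(*slot));
        if (slot == NULL || pthread_setspecific(counter_key, slot) != 0) {
            free(slot);
            return -1;
        }
    }
    *slot = start;
    while (*slot > floor) {    /* loop body is a plain load and store via slot */
        *slot = *slot - 1;
        steps++;
    }
    return steps;
}
```

The cost is the heap allocation and the destructor, plus the discipline of passing the pointer around instead of calling pthread_getspecific() wherever the value is needed.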
With all of the above, let's compare the performance. This run was on a 750MHz UltraSPARC-III system:
# ptime ./specific.out 1  100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        5.974
user        5.903
sys         0.015
# ptime ./tls.out      1  100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        2.024
user        1.971
sys         0.011
# ptime ./specific.out 2  100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
main() reporting that all 2 threads have terminated

real        6.813
user       11.802
sys         0.028
# ptime ./tls.out      2  100000000
main() 2 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 3 finished. count 100000000
thread 2 finished. count 100000000
main() reporting that all 2 threads have terminated

real        2.150
user        3.931
sys         0.010
#
Well, as expected, it's three loads vs one load, and the performance difference is close to 3x. I tried this on a Niagara box (T2000, 1GHz, 8 cores):
# ptime ./specific.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        6.932
user        6.904
sys         0.009
# ptime ./tls.out 1 100000000
main() 1 threads 100000000 iterations
thread 2 test count 100000000
thread 2 finished. count 100000000
main() reporting that all 1 threads have terminated

real        2.637
user        2.602
sys         0.010
# ptime ./specific.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 2 finished. count 100000000
thread 3 finished. count 100000000
thread 4 finished. count 100000000
thread 6 finished. count 100000000
thread 8 finished. count 100000000
thread 9 finished. count 100000000
thread 7 finished. count 100000000
thread 5 finished. count 100000000
main() reporting that all 8 threads have terminated

real        6.966
user       55.246
sys         0.013
# ptime ./tls.out 8 100000000
main() 8 threads 100000000 iterations
thread 2 test count 100000000
thread 3 test count 100000000
thread 4 test count 100000000
thread 5 test count 100000000
thread 6 test count 100000000
thread 7 test count 100000000
thread 8 test count 100000000
thread 9 test count 100000000
thread 3 finished. count 100000000
thread 9 finished. count 100000000
thread 2 finished. count 100000000
thread 4 finished. count 100000000
thread 5 finished. count 100000000
thread 6 finished. count 100000000
thread 7 finished. count 100000000
thread 8 finished. count 100000000
main() reporting that all 8 threads have terminated

real        2.734
user       21.604
sys         0.011
#
Not so surprisingly, they show similar performance differences. Niagara performs quite well and it scales linearly up to 8 threads (Why no result for 32 threads? That's for the next blog :)).

BTW, I didn't explain what those extra arguments to the compiler (nop.il -Wc,-Qinline-l) are about. A variable with __thread is still a variable and hence is treated the same way as any global variable except for how its address is formed. So, without the nop() call at line 64, the compiler just optimized away the loop. I inserted the nop() call at line 64 and the compiler could no longer optimize the loop away (since it doesn't know what nop() is), but that adds extra call overhead, which could be quite big since the loop itself is quite small. So I wrote a little inline template for nop() so that the compiler can inline away the nop() call. Well, once I did that, the compiler once again optimized away the loop since it saw that the call doesn't do anything. Arg. So, to prevent the compiler from understanding the body of nop() while still allowing it to be inlined, I added -Wc,-Qinline-l, which is a code generator internal option that prevents it from inlining inline templates early in the optimization, which in turn prevents the loop optimization.
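If you don't have inline templates or that internal option at hand, a more portable (and cruder) way to keep a benchmark loop alive is to give it a side effect the compiler must preserve, such as a store to a volatile object. This is a different technique from the one used for the numbers above, shown only as a sketch; sink and spin() are made-up names.

```c
static volatile int sink;   /* volatile: every store to it must really happen */

/* Counts down from start to floor, storing each value into sink.
 * Returns the number of stores performed. */
int spin(int start, int floor)
{
    int i = start;
    int stores = 0;

    while (i > floor) {
        i = i - 1;
        sink = i;           /* observable side effect, so the loop cannot be removed */
        stores++;
    }
    return stores;
}
```

Note that the volatile store itself costs a memory access per iteration, so it changes what is being measured; that is one reason the nop() template is the better fit for a loop this small.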

Anyway, in summary, the thread local variable and pthread_{get,set}specific() APIs do similar things but have different interface, performance and portability. If you can afford to use __thread, it is usually a better choice in terms of the amount of code changes and the performance.

Tuesday Dec 06, 2005

My special thanks to HP for helping Niagara message out.

I've been quite busy with lots of exciting stuff, so I just couldn't pay enough attention to posting a new blog. But I can't just stay quiet upon reading this one from HP.

I'd like to express my deep and sincere gratitude to HP PR folks. Thank you guys for helping the Niagara message out. Thanks for letting HP customers know that there's such a thing as Niagara.

PS. Many thanks to The Inquirer for a nice heads-up!
PS2. For those of you who want to know more about Niagara and what the heck HP is talking about, our own McDougal's blog is a pretty good place to start.

Thursday Sep 22, 2005

Have you ever seen US-IV+ ? (Picture)

Sun just announced new UltraSPARC-IV+ chip based systems, and this is what the new chip looks like:

Friday Jul 15, 2005

How to reorder the code to improve the performance

Is your application huge? Executables and shared objects in the tens of megabytes? Then I bet your application would benefit from code reordering (sometimes called instruction cache optimization). A new article posted at developers.sun.com has a very graphical illustration of the effect of code reordering and of how to use the compiler to do it. Sun's top ISVs use those options to build their applications. And chances are, your application may benefit as well.

Tuesday Jul 05, 2005

Interface stability and fixed configurations

It's a little dated but still a very interesting interview with Tim Marsland (our CTO of Solaris) at ACM Queue. I found the following quote very insightful:

Fixed configurations are really a Band-Aid for unstable inter-component interfaces.
When you look at our experience inside Solaris, 
the reason for what works with what is really about interface stability, or lack thereof. 

Tuesday Jun 28, 2005

What do you prefer - a faulty program to run slowly or to die immediately ?

Whenever we change a default compilation flag, we often have to make interesting tradeoffs. A relatively recent tradeoff concerned the default value for -xmemalign (Here's the link to the exact page for the description of the flag). Starting from Studio 9, our compiler uses -xmemalign=8i in 32bit mode as the default (vs -xmemalign=4s before).

With -xmemalign=4s, a program that does unaligned memory access would die right away, telling the developer what was wrong and where it went wrong. But with -xmemalign=8i, such a program would simply run slowly, and it's somewhat difficult to track down such a performance degradation (well, if you know where to look, dtrace is again your friend here, but you pretty much have to know the answer beforehand).

Sounds bad, so why did we do it? The answer is, again, performance (what else?). When you compile code with -xmemalign=8i as opposed to 4s, the compiler can safely use 8-byte loads and stores for appropriately sized data (like double precision floating point). Since most programs are alignment-safe, that is, most code doesn't do funky typecasting (like casting a char pointer to an integer pointer and accessing it), this change doesn't cause any performance degradation in those correct programs and could actually improve them somewhat. You may ask, why not -xmemalign=8s? Unfortunately, some constructs in Fortran and an artifact of the 32bit ABI on SPARC make it impossible to use -xmemalign=8s even for a completely correct program. However, those occurrences are rare enough not to affect the overall performance of most programs, hence the decision to switch to -xmemalign=8i.
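For reference, this is the kind of "funky typecasting" in question, next to an alignment-safe way of doing the same read. It's a minimal sketch with made-up function names; how the first version behaves on a misaligned pointer (slow emulation or an immediate signal) is exactly what the -xmemalign setting decides.

```c
#include <stdint.h>
#include <string.h>

/* Risky: only correct if p happens to be 4-byte aligned. On SPARC a
 * misaligned p makes this load trap. */
uint32_t load_u32_cast(const unsigned char *p)
{
    return *(const uint32_t *)p;
}

/* Safe at any alignment: copy the bytes into a properly aligned
 * object and read that instead. */
uint32_t load_u32_memcpy(const unsigned char *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

Building with -xmemalign=4s during development makes the first pattern fail loudly the moment it is handed a misaligned pointer, which is usually the cheapest way to find and replace it with the second.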

But when you're writing code, you may actually want to use -xmemalign=4s instead of -xmemalign=8i, to make it easier to find any alignment trouble in your code. Anyway, if I have some time, I'll write about how all this unaligned access handling works - the dance between the user code and the kernel. In the meantime, if you're in a hurry, here would be an interesting place to start looking at how the kernel emulates unaligned access for 32bit apps.

Thursday Jun 09, 2005

Ever heard of sound ISOLATING earphones ?

For the last few weeks, since I got my iPod Shuffle, I've been using this earphone to listen to my collection of music while in the office. There's a reason why it's called sound isolating - I hear almost nothing when I'm plugged in.

I don't hear my phone ringing unless I turn the ring volume all the way up. I don't hear my roommate talking on the phone (he doesn't exactly whisper when he's on the phone). I don't hear someone asking me for lunch. I don't hear my manager opening my room door (I sit facing away from the door), walking across my back and sitting down on my side. I have no idea he's there until he taps me on the shoulder. Ooops.

This earphone works...and works too well to use it in the office. So I ended up switching back to the less satisfying earbuds bundled with the iPod. If you're working in a noisy environment (maybe in some cubicle full of noisy coworkers or a rack full of servers), sound isolating earphones might be a great help. Of course, it reproduces music *much* better than others.

PS. I'd love to get this one if I have some spare cash to burn. Although mine sounds much better than most other earbuds, I know it still lacks precise high to mid-high frequency response (I'm not much of a bass person and the bass of E2c is enough for me). And like most other earphones, the sound field feels "closed". Maybe I should get one of these.

Monday Jun 06, 2005

MacOS X on Solaris ?

Our COO invited Apple to adopt Solaris as the base of MacOS X. For those of you who don't remember or know the history behind this, this isn't the first time there's been such a proposal. See our own Rich B's blog here for a bit of history. In fact, the OpenSTEP implementation was working (including Mail.app, albeit with fonts rendered ugly compared to native NeXTSTEP). FYI, our COO used to work for a company called Lighthouse Design, which used to produce software for NeXTSTEP.

I used to run NeXTSTEP 3.3j on my SPARCstation, as well as the earlier versions of OpenSTEP on Solaris. All NeXTSTEP GUI applications were great - it indeed was 10 years ahead of everything. However, the mach kernel and BSD server in NeXTSTEP were...at best flaky and slow, and clearly behind Solaris at the time. I think it will take years for darwin to catch up with Solaris 10 - evidence? OS X has only very recently added 64bit support and a fine grained locking/reentrant kernel, and it's been years since Solaris had those. And it will take even more time for Apple to catch up with all those new features in Solaris 10 - Dtrace, Zones, ZFS, FMA, etc. IMHO, it would make perfect sense for Apple to switch to Solaris 10 on x86, replacing Darwin. Here's hoping that Apple will consider switching to Solaris :)

PS. I know. It's even less likely than Apple's switch to Intel. But surely one can have a dream.

Friday Jun 03, 2005

How Apple can move to Intel

CNet reported that Apple is going to switch to Intel as its CPU supplier. Some people think that this means Intel will produce a PowerPC compatible processor, but I strongly doubt it. Some other people think that this would involve mostly emulation of PowerPC on x86. However, I have a slightly different take on this - I think Apple is going to use FAT binaries to smooth the transition.

NeXTSTEP used to support FAT binaries for four architectures - Motorola 68k, Intel x86, HP PA and Sun SPARC (NIHS). And it worked pretty well. The gcc compiler for NeXTSTEP had an option to produce objects for all four targets at the same time, and the linker and runtime loader all supported this very well. A single executable file can contain object code for four different architectures, and many applications were distributed for all four (NIHS), or at least for 68k and x86 (NI).

Assuming Apple didn't remove FAT support in MacOS X, it would be fairly easy for Apple to provide developers with ways to create executables for both targets (a simple recompilation for both targets would work for most applications, just like moving applications from Solaris SPARC to Solaris x86), and the ISVs can distribute one software package that supports both platforms.

Of course, Apple may still need to provide a PowerPC emulation for x86. But that shouldn't be too difficult either, since underlying OS and APIs remain the same. But emulation doesn't address the actual transition of ISV code itself, and it alone doesn't really ease the transition for ISVs like FAT binary does. So all in all, I bet Apple will revive FAT binary in OS X. Maybe it's time for Sun to push SPARC support in OS X...after all, Niagara and ROCK will be quite attractive as a CPU for Xserve :)
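To give a feel for how one file can carry several architectures, here is a rough sketch of walking the table at the front of a FAT file. It assumes the layout described in Apple's mach-o/fat.h header (a big-endian magic 0xcafebabe, a slice count, then one 20-byte descriptor per architecture); whether the NeXTSTEP-era files matched this byte for byte is an assumption on my part, and the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_MAGIC 0xcafebabeu   /* magic value from mach-o/fat.h */

/* Read a 32-bit big-endian value; the fat header is big-endian on every host. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns the number of architecture slices described in buf,
 * or -1 if buf does not start with a complete fat header. */
long fat_slice_count(const unsigned char *buf, size_t len)
{
    uint32_t n;

    if (len < 8 || be32(buf) != FAT_MAGIC)
        return -1;
    n = be32(buf + 4);
    /* each descriptor: cputype, cpusubtype, offset, size, align (5 x 4 bytes) */
    if ((len - 8) / 20 < n)
        return -1;
    return (long)n;
}

/* CPU type of slice i; the caller must ensure i < fat_slice_count(). */
uint32_t fat_slice_cputype(const unsigned char *buf, uint32_t i)
{
    return be32(buf + 8 + 20u * i);
}
```

The loader picks the descriptor whose CPU type matches the machine and maps only that slice, which is why the same file can run natively on more than one architecture.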

Monday Apr 25, 2005

Be careful with unlimited stack (or why my 32bit app can not use more than 2GB of memory)

Once in a while, this same question pops up on some of the mailing lists. Usually it's about some 32bit application trying to use more than 2GB of memory, or address space, to be exact. And the application mysteriously can not get past the 2GB mark for no apparent reason - the system has plenty of memory and enough free swap space, and 64bit programs work just fine.

I've seen various reasons for this, like a bug in malloc() in libc for example. But by far the most common reason was due to unlimited stack size.

On Solaris, if you set the stack size to unlimited, your heap can not grow past 2GB. That's because the kernel reserves upper 2GB for stack usage. IMHO, this "unlimited" is a little misleading, since it actually means the maximum stack size is 2GB.

The kernel has to put a guard page (or pages?) at the end of the ulimit-specified maximum stack size, so that stack overflow can be detected while the pages in between are mapped on demand. This means that an address range as large as the ulimit-specified stack size is essentially always reserved for stack use.

The following code shows what's going on:

# cat t.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    pid_t pid;
    int *p = (int*)(0xff000000);

    *p = 0;
    pid = getpid();
    char buf[100];
    sprintf(buf, "pmap %d\n", pid);
    system(buf);
    return 0;
}
# cc -g t.c
# ulimit -a
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
open files                    (-n) 1024
pipe size          (512 bytes, -p) 10
stack size            (kbytes, -s) unlimited
cpu time             (seconds, -t) unlimited
max user processes            (-u) 29995
virtual memory        (kbytes, -v) unlimited
# ./a.out
11267:  ./a.out
00010000       8K r-x--  /home/seongbae/tmp/a.out
00020000       8K rwx--  /home/seongbae/tmp/a.out
7FA80000     848K r-x--  /lib/libc.so.1
7FB64000      32K rwx--  /lib/libc.so.1
7FB6C000       8K rwx--  /lib/libc.so.1
7FB8A000       8K rwxs-    [ anon ]
7FB90000       8K r-x--  /platform/sun4u-us3/lib/libc_psr.so.1
7FBA0000      24K rwx--    [ anon ]
7FBB0000     176K r-x--  /lib/ld.so.1
7FBEC000       8K rwx--  /lib/ld.so.1
7FBEE000       8K rwx--  /lib/ld.so.1
FF000000   12288K rwx--    [ stack ]
 total     13424K
# ulimit -s 10240
# ulimit -a
core file size        (blocks, -c) unlimited
data seg size         (kbytes, -d) unlimited
file size             (blocks, -f) unlimited
open files                    (-n) 1024
pipe size          (512 bytes, -p) 10
stack size            (kbytes, -s) 10240
cpu time             (seconds, -t) unlimited
max user processes            (-u) 29995
virtual memory        (kbytes, -v) unlimited
# ./a.out
Segmentation Fault (core dumped)
# dbx ./a.out
Reading a.out
Reading ld.so.1
Reading libc.so.1
(dbx 1) r
Running: a.out 
(process id 11275)
Reading libc_psr.so.1
signal SEGV (no mapping at the fault address) in main at line 11 in file "t.c"
   11       *p = 0;
(dbx 2) p p
p = 0xff000000
(dbx 3) quit
# 
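If you'd rather have the application notice this itself than rely on everyone's shell settings, the limit can be read with the standard getrlimit() call. A minimal sketch; stack_limit() is a hypothetical helper, and what to do when it reports "unlimited" (warn, or lower the limit before starting the real work) is left to the application.

```c
#include <sys/resource.h>

/* Returns 1 if the soft stack limit is unlimited, 0 if it is finite
 * (the limit in kilobytes is stored in *kbytes), or -1 on error. */
int stack_limit(unsigned long long *kbytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    *kbytes = (unsigned long long)rl.rlim_cur / 1024;
    return 0;
}
```

The value reported is the same one that "ulimit -s" prints in the transcript above.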

Monday Mar 28, 2005

Debugging while driving

I carpool with a colleague who is an avid Apple user (he's got an iBook, an iMac G4 and an iMac G5, and an iPod 20G and an iPod Shuffle 1G). One day when we hopped into my car to go home, he said his iCal hangs whenever he starts it...and that was the beginning of my first ever "debugging-while-driving" session.

He suspected his calendar data was corrupted, so he had already tried backing it up and removing it. But that didn't fix the problem, so it wasn't calendar data corruption. And he had no clue what other files the iCal software accesses. I've never used MacOS X before, so I had no idea what kind of system level debugging tools it provides, and my carpool buddy didn't use his Apple boxes for any software development so he was almost as clueless as I was. However, my carpool buddy had already found and run some kind of "top"-like utility to see what processes were running, etc, and it showed iCal hanging. The utility also allowed taking a snapshot of the stack trace for a brief period of time, and it showed truss-like output. He inspected the stack trace but didn't find anything interesting - there was no particular function at the top of the stack.

Being so accustomed to Solaris, the first thing I wanted to see was a truss equivalent, but neither I nor my buddy knew the equivalent tool on MacOS X, if such a tool existed. While driving, with my buddy in the passenger seat with his iBook, I just asked him to try "ls *trace*" in a couple of the usual places (/sbin, /usr/sbin, etc), and voila, there was an executable called "ktrace". A quick "man ktrace" confirmed that this was indeed what we wanted.

The next step was to find the iCal executable - fortunately, I used to run NeXTSTEP 3.3 on my university SPARC boxes (probably I was one of very few users of such systems in Korea at the time - there were a handful of NS on Intel users but NS on SPARC was...pretty much non-existent), and remembered that a *.app was actually a directory containing a bunch of different data/executable/etc for that application, and suspected that MacOS X uses the same scheme (btw, I've never used OS X before). The directory layout under *.app sounded slightly different on MacOS X than on NeXTSTEP as my buddy read out what he saw on the screen. But he found the executable under some subdirectory and fired up ktrace on it - and it hung as expected.

As is typical for any trace, the output was huge and my buddy couldn't make heads or tails of it. I suggested looking at the files opened, especially the last file opened. The trace output showed some sort of XML based configuration file. My buddy backed up the file and deleted it, then started iCal and voila, it worked. When he copied the file back to its original location, iCal hung again. It was a sort of configuration/option/preference file and it was no big deal to start from scratch by setting the options in iCal. We speculated that the file was corrupted, causing the XML to be malformed or something similar, but didn't bother to investigate further - after all, my buddy was back to being a happy Apple user...and that was my first debugging-while-driving experience.

Wednesday Jan 05, 2005

Linking C++ objects from multiple compilers

It is relatively well known that you can not link C++ object files compiled with different compilers on various Unix platforms. But quite a few people don't seem to know exactly why that is the case, and I've recently seen quite a few emails and usenet postings on this, so let me take a shot at explaining the issue.

Unix System V has a specification called the "application binary interface", in short, ABI. The System V ABI usually has two parts - a generic, platform-independent part called gABI and a platform-specific part called psABI. gABI documentation can be found here. psABI is defined for each platform that System V is ported to, and for SPARC, it can be found here. The ABI as a whole dictates the calling convention, the linkage convention, the object file format and any other information needed to build the tools - compiler, linker, dynamic linker, program loader, etc - that produce and consume conforming object files (including executables and shared libraries).

The problem is that the ABI does not specify what compilers, linkers and runtime libraries must follow to make C++ objects compatible. Various aspects of C++ - object model, exception handling, runtime type information and name mangling - have to be common for compilers and runtimes to be compatible with each other. Since the ABI does not specify those aspects, each implementation of compilers and runtime libraries decided to do it in its own way.

The end result of this lack of ABI specification is that the two dominant compilers on Solaris, namely the Sun compiler and gcc, are not compatible with each other for C++, whereas they are compatible for C objects. This causes all kinds of headaches, and the biggest one is that if you have a C++ shared library, you have to provide two versions, one compiled by the Sun compiler and the other compiled by gcc, if you're to allow users of the library to pick any compiler they want.

There are two obvious ways to fix this issue: change gcc to follow the Sun model, or change the Sun compiler to follow gcc. The first simply won't happen - gcc is cross platform and won't change its portable way to accommodate a particular platform. The second is also problematic because gcc's C++ ABI hasn't been exactly stable - it's been revised a couple of times in incompatible ways, and Sun as a company takes backward compatibility quite seriously.

Until this issue is resolved and a common ABI is defined and agreed upon, all the people using C++ - the users, the developers, and the compiler and tools developers - will have to suffer from this C++ object file incompatibility.

About

seongbae
