Monday Mar 28, 2005

Debugging while driving

I carpool with a colleague who is an avid Apple user (he's got an iBook, an iMac G4 and an iMac G5, and an iPod 20G and an iPod Shuffle 1G). One day, as we hopped into my car to go home, he said his iCal hangs whenever he starts it...and that was the beginning of my first ever "debugging-while-driving" session.

He suspected his calendar data was corrupted, so he had already tried backing it up and removing it. That didn't fix the problem, so corrupted calendar data wasn't the cause. And he had no clue what other files iCal accesses. I've never used MacOS X before, so I had no idea what kind of system level debugging tools it provides, and my carpool buddy didn't use his Apple boxes for any software development so he was almost as clueless as I was. However, he had already found and run some kind of "top"-like utility to see what processes were running, and it showed iCal hanging. The utility also allowed taking a snapshot of the stack trace for a brief period of time, and it showed truss-like output. He inspected the stack trace but didn't find anything interesting - there was no particular function at the top of the stack.

Being so accustomed to Solaris, the first thing I wanted to see was a truss equivalent, but neither my buddy nor I knew the equivalent tool on MacOS X, if such a tool existed. While I was driving, with my buddy in the passenger seat with his iBook, I just asked him to try "ls \*trace\*" in a couple of the usual places (/sbin, /usr/sbin, etc.), and voila, there was an executable called "ktrace". A quick "man ktrace" confirmed that this was indeed what we wanted.

The next step was to find the iCal executable - fortunately, I used to run NeXTSTEP 3.3 on my university SPARC boxes (I was probably one of very few users of such systems in Korea at the time - there were a handful of NS on Intel users but NS on SPARC was...pretty much non-existent), and remembered that a \*.app was actually a directory containing a bunch of different data, executables, etc. for that application, and suspected that MacOS X uses the same scheme (btw, I've never used OS X before). The directory layout under \*.app sounded slightly different on MacOS X than on NeXTSTEP as my buddy read out what he saw on the screen. But he found the executable under some subdirectory and fired ktrace on it - and it hung as expected.

As is typical for any trace, the output was huge and my buddy couldn't make heads or tails of it. I suggested looking at the files opened, especially the last file opened. The trace output showed some sort of XML based configuration file. My buddy backed up the file and deleted it, then started iCal and voila, it worked. After copying the file back to its original location, iCal hung again. It was a sort of configuration/option/preference file, and it was no big deal to start from scratch by setting the options in iCal. We speculated that the file was corrupted, causing the XML to be malformed or something similar, but didn't bother to investigate further - after all, my buddy was back to being a happy Apple user...and that was my first debugging-while-driving experience.

Wednesday Jan 05, 2005

Linking C++ objects from multiple compilers

It is relatively well known that you cannot link C++ object files compiled with different compilers on various Unix platforms. But quite a few people don't seem to know exactly why that is the case, and I've recently seen quite a few emails and usenet postings on this, so let me take a shot at explaining the issue.

Unix System V has a specification called the "application binary interface", or ABI for short. The System V ABI usually has two parts - a generic, platform-independent part called the gABI and a platform-specific part called the psABI. gABI documentation can be found here. A psABI is defined for each platform that System V is ported to, and for SPARC, it can be found here. The ABI as a whole dictates the calling convention, the linkage convention, the object file format and any other information that's needed to build all the tools - compiler, linker, dynamic linker, program loader, etc - necessary to produce conforming object files (including executables and shared libraries).

The problem is that the ABI does not specify what compilers, linkers and runtime libraries have to follow to make C++ objects compatible. Various aspects of C++ - object model, exception handling, runtime type information and name mangling - have to be common for compilers and runtimes to be compatible with each other. Since the ABI does not specify any of those aspects, each implementation of compilers and runtime libraries decided to do it in its own way.

The end result of this lack of ABI specification is that the two dominant compilers on Solaris, namely the Sun compiler and gcc, are not compatible with each other for C++, whereas they are compatible for C objects. This causes all kinds of headaches, and the biggest one is that if you have a C++ shared library, you have to provide two versions, one compiled by the Sun compiler and the other compiled by gcc, if you want to allow users of the library to pick whichever compiler they want.

There are two obvious ways to fix this issue: change gcc to follow the Sun model, or change the Sun compiler to follow gcc. The first simply won't happen - gcc is cross platform and won't change its portable approach to accommodate a particular platform. The second is also problematic because gcc's C++ ABI hasn't been exactly stable - it's been revised a couple of times in incompatible ways, and Sun as a company takes backward compatibility quite seriously.

Until this issue is resolved and a common ABI is defined and agreed upon, all the people using C++ - the users, the developers, and the compiler and tools developers - will have to suffer from this C++ object file incompatibility.

Wednesday Dec 22, 2004

volatile, thread safety and memory ordering

I came across this excellent write up by Scott Meyers and Andrei Alexandrescu. A lot of people expect more from volatile than what's defined in the standard, and this is a good warning for them. It also touches on the multiprocessor memory ordering issue, another area that a lot of casual programmers are not even aware of. Enjoy the reading!

Friday Jul 23, 2004

Don't try to trick the compiler.

This is yet another not-a-compiler-bug-but-a-user-bug story.

Some Sun internal folks built an open source project hosted on with our compiler, and the program produced different output when compiled with -xO4 or above. So they thankfully filed a bug (btw, we're happy to look at any bugs filed against us, even if it turns out to be a user error. So please don't hesitate to file a bug if you think it's the compiler's fault).

A short analysis revealed the following:

In a file "r.h", there was a declaration like:
  typedef struct ... {
  } some_struct_t;

  extern const some_struct_t some_struct;
But in "r.c", it wasn't declared "const", and in that file, many functions modified this global variable some_struct.

The original programmer seems to have thought that since this global variable is modified only in r.c and all other files should just read it, declaring it "const" in the header was a good way to enforce that.

At -xO4 or above, our compiler starts doing aggressive inlining. And the inlining exposed some redundant loads from one field of this global variable "some_struct" in a file "a.c". Of course, the compiler happily eliminated the second, redundant load - since "some_struct" is declared "const", there's nothing for the compiler to worry about. Well, the only problem was that between the two loads there was a call to a function defined in "r.c" which modifies the field that the eliminated load was accessing. So the variable wasn't really "const" at all. Of course, this program works just fine when compiled without optimization or with low level optimization, since only inlining and redundant code elimination can reveal the problem.
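The effect of the eliminated load can be sketched in a single file. This is only an illustration of the reasoning above, not the project's code or our compiler's actual output: the names `demo_struct`, `r_update`, `as_written` and `as_optimized` are made up, and the "optimized" version is written by hand to mimic what a compiler may do once it believes the object never changes.

```c
/* Hypothetical stand-in for the global that r.c really modifies. */
struct demo_struct { int field; };
struct demo_struct demo_storage;

/* Stand-in for a function in r.c that writes the field. */
void r_update(int v) { demo_storage.field = v; }

/* What the code in a.c says: load, call, load again. */
int as_written(void) {
    int first = demo_storage.field;
    r_update(first + 1);
    int second = demo_storage.field;   /* sees the updated value */
    return first + second;
}

/* What the optimizer may produce if the header promised "const":
   the second load looks redundant, so the first value is reused. */
int as_optimized(void) {
    int first = demo_storage.field;
    r_update(first + 1);
    int second = first;                /* stale value */
    return first + second;
}
```

Starting from a field value of 1, `as_written()` returns 3 while `as_optimized()` returns 2, which is the kind of output difference that showed up only at the higher optimization levels.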

Another interesting tidbit is that there was an #ifdef that removed this "const"ness when compiled on a certain platform. I bet somebody had already been bitten by exactly the same problem, and worked around it by removing the constness for that platform. Maybe s/he thought it was a platform specific problem. I don't know.

Anyway, I guess the moral of this story is: don't try to trick your compiler.
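For what it's worth, here is one way to get the "everyone else only reads it" effect without lying to the compiler. It is a minimal sketch with made-up names (`config_t`, `config_view`, `config_set_field`), not what the project in question did: the object itself stays non-const, and other files only get a pointer-to-const. A pointer-to-const says "you may not write through this pointer", not "this object never changes", so the compiler has to assume a call may modify it.

```c
/* Hypothetical type and names, for illustration only. */
typedef struct { int field; } config_t;

/* The real object: mutable, and private to the owning file. */
static config_t config_storage;

/* Read-only view handed out to every other file. */
const config_t *config_view(void) { return &config_storage; }

/* Only the owning module is supposed to call this. */
void config_set_field(int v) { config_storage.field = v; }

/* A reader that loads the field on both sides of an update. */
int read_twice_around_update(void) {
    const config_t *c = config_view();
    int first = c->field;
    config_set_field(first + 1);
    int second = c->field;   /* must observe the update */
    return first + second;
}
```

With the field set to 1 beforehand, `read_twice_around_update()` returns 3 at any optimization level, because nothing in the declarations claims the object is immutable.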

Friday Jul 16, 2004

Buffer overflow, register window and register allocation.

I work on Sun's compiler, especially the SPARC code generator part. The inevitable (and sometimes boring, and sometimes the most interesting) part of my job is to evaluate bugs and (of course) fix them if I can. But as any engineer working on complex software knows, more often than not, a bug turns out to be a user error - in the compiler's case, that could mean the user code has a bug.

This is a story of one recent case of not-a-bug.

One of our largest ISVs filed a bug where their application received SIGSEGV when the program was compiled at -xO4 or above with our S1S8 compiler. The program worked just fine with WS6U2 at the same optimization level, so the customer naturally thought this was a compiler bug. I can't fault them for that since they had experienced quite a few compiler bugs in the past.

Because the bug went away whenever you turned off the global register allocator, it was sent to me (since I was the author of the register allocator). This particular ISV application was one of the most difficult ones to deal with, because this ISV, like most other large ISVs, does not allow their code to be shipped to us, so we have to rely on either their engineer or our support engineer working on their site.

Since there's always a possibility of a user error, running dbx's rtc or purify-like tools is one way to exclude some of the most common programming errors. Unfortunately, this application was too large and complex for dbx rtc or purify to handle correctly and produce a useful report.

The symptom was quite simple - the program got a SEGV, and at the time of the SEGV, the stack trace showed that one pointer parameter had the upper 32 bits of its 64-bit pointer zeroed. So obviously the caller of the function was the first suspect. Upon manual inspection of the disassembly, the code was clearly correct, because it looked like the following:

add %fp,1xxx,%l0

...bunch code including many calls...

call problematic_func
mov %l0,%o0

In dbx, %l0 contained a correct value right after the add, but somehow the upper 32 bits of %l0 had been zeroed out by the time control reached the problematic call. Subsequent dbx printouts showed that %l0 changed after a call to a certain function.

Assuming save and restore are correctly placed, the only other way to modify %l0 is to change the register window save area. It just so happens that %l0 is the first entry in the register window save area. Since SPARC is big-endian, the upper 32 bits (MSB) are stored at the lower address. This all suggested the function in question was overwriting the first 32 bits of the register window save area. This can happen, among other ways, if there's a buffer overflow on a local array. Because the compiler allocates stack space for local variables from the higher address to the lower address in order of appearance in the source, the first variable is usually placed at the top, right below the current %fp (or the %sp of the caller). Of course, optimization can move stuff around and get rid of variables, and most scalar variables are allocated in registers, so there's no guarantee that the above rule holds.

The preprocessed source code for the function in question looked like the following:

returntype func(something \*ptr,...) {
   wchar_t a[81];
   wchar_t b[81];

   ...initialize b by calling some initfunc...
   for(i = 0;i < wslen(b);i++) { some operation on b[i]...
   b[i] = 0;
   ...more code...

The array "a" wasn't used in the function, so the compiler didn't bother to allocate it on the stack. Thus "b" was at the top of the stack. If b were to overflow, the window save area could be overwritten - i.e. b[81] = 0 would overwrite the upper 32 bits of the %l0 save area.
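The arithmetic behind that last sentence can be checked with a small simulation. This is a sketch under stated assumptions, not the customer's code: the frame is modelled as a plain byte buffer, `wchar_t` is assumed to be 4 bytes, "b" is assumed to sit directly below the 8-byte %l0 save slot, and the slot is stored in big-endian byte order by hand so the result is the same on any host. The names `simulate_overflow`, `store_be64` and `load_be64` are made up for the example.

```c
#include <stdint.h>
#include <string.h>

enum {
    ELEM_SIZE = 4,                         /* assumed sizeof(wchar_t) */
    ARRAY_LEN = 81,                        /* wchar_t b[81] */
    SAVE_SLOT = ARRAY_LEN * ELEM_SIZE      /* %l0 slot starts right after b */
};

/* Store a 64-bit value in big-endian byte order, as SPARC would. */
static void store_be64(unsigned char *p, uint64_t v) {
    for (int i = 0; i < 8; i++)
        p[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Load a 64-bit big-endian value back. */
static uint64_t load_be64(const unsigned char *p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Model "b[81] = 0" and return what the callee's restore
   would later reload into %l0. */
uint64_t simulate_overflow(uint64_t saved_l0) {
    unsigned char frame[SAVE_SLOT + 8];
    memset(frame, 0xAA, sizeof frame);          /* arbitrary contents of b */
    store_be64(frame + SAVE_SLOT, saved_l0);    /* spilled %l0 */
    /* One element past the end of b: a 4-byte store of zero. */
    memset(frame + ARRAY_LEN * ELEM_SIZE, 0, ELEM_SIZE);
    return load_be64(frame + SAVE_SLOT);
}
```

For example, `simulate_overflow(0xFFFFFFFF7FFF1234)` returns `0x000000007FFF1234`: the lower half of the pointer survives and the upper half is zeroed, which matches what the stack trace showed.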

After hearing the above analysis, our support engineer looked at the code of the initfunc and found a bug as expected, and the bug report was closed as not a bug.

One may wonder why this code worked fine in the past. That's because %l0 wasn't live across that particular function call. The moral of the story is that any slight change in the register assignment can reveal a user error.



