Wednesday Sep 24, 2008

A Quirk of the SPARC Architecture

Most engineers rarely have to write assembly code, yet my experience has been that large applications (particularly if they've been around for a decade or so) contain at least a few functions of hand-written assembly code. Hopefully these functions will be inspected during a porting project to see if they can be rewritten in C or replaced with calls to the appropriate graphics, atomic, or numeric library.

Since working with assembly is not very common, I won't dwell on the topic very often. However, in the last six months I've seen the same mistake in three unrelated projects (porting from 32-bit to 64-bit SPARC), so this particular topic deserves to be mentioned.

Some 64-bit assembly porting is fairly mechanical: converting to the 64-bit calling conventions, using 64-bit registers, adjusting the bias of stack offsets, and accounting for 64-bit sizes and alignments. However, the following snippet of code illustrates a slight quirk of the SPARC architecture that is the root cause of a particular porting problem. In the following code, %o2 and %o3 point to memory buffers and the pointer in %o4 marks the end of the copy:

    add     %o2,1,%o2       # inc to the next destination location
    ldsb    [%o3],%o1       # load byte from source buffer
    add     %o3,1,%o3       # inc to the next source location
    cmp     %o2,%o4         # check for end of loop
    bcs     top_of_loop     # if not done, then branch to top
    stb     %o1,[%o2-1]     # store to dest buffer in delay slot of branch
The add instruction updates all 64 bits of the output register, and the memory accesses don't need to change. However, this sequence doesn't quite work for 64-bit. Unfortunately, it works so much of the time that regression tests could easily miss the failure case.

The problem is that, unlike the x64 architecture, which has two sizes of compare instructions ("cmpl" and "cmpq"), SPARC has a single compare instruction which sets two different sets of condition codes: the 32-bit codes (%icc) and the 64-bit codes (%xcc). The conditional branch in the code sequence above inspects the 32-bit condition codes, so it jumps based on a 32-bit comparison.

To correctly base the branch on the 64-bit condition codes, it needs to be rewritten to use the extended condition codes, %xcc:

    bcs     %xcc,top_of_loop
I can't explain why this particular architecture feature is so easily missed, but I can point out that the original sequence works correctly unless the memory buffer pointed to by %o2 crosses a 2GB boundary. At least for one application, the failure only occurred at a single customer site and only once every few weeks (and was, therefore, tough to diagnose).
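
The difference between the two sets of condition codes can be modeled in C. The sketch below is only an illustration, not SPARC code: the helper names are hypothetical, and it assumes that bcs after cmp acts as an unsigned less-than test. The sample values describe a buffer that straddles a 4GB boundary (which is also a 2GB boundary).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of "cmp a,b; bcs label": the branch looks at the
 * 32-bit condition codes, so only the low 32 bits take part. */
static bool branch_taken_icc(uint64_t a, uint64_t b)
{
    return (uint32_t)a < (uint32_t)b;
}

/* Hypothetical model of "cmp a,b; bcs %xcc,label": the branch looks at
 * the 64-bit condition codes, so all 64 bits take part. */
static bool branch_taken_xcc(uint64_t a, uint64_t b)
{
    return a < b;
}
```

With a destination pointer of 0xFFFFFFF0 and an end pointer of 0x100000010, branch_taken_xcc() says the loop should continue while branch_taken_icc() says it is finished, so the copy would stop early. For pointers that share the same upper 32 bits, the two tests agree, which is why the bug hides so well.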

Wednesday Sep 03, 2008

AMD Performance Counter Not Broken After All

After using the CPU performance counters on various processors for the last few years, I'm not surprised to occasionally find one which doesn't work. So, when looking at L1 cache refill statistics on a current Opteron, I assumed the worst when the event count was consistently zero. Of course, it wasn't that simple.

One clue to the complexity of using the many counters on the various CPUs is that the Solaris counter-based performance measurement tools (for example collect/analyzer, cputrack, and cpustat) all include the following footnote in their respective help messages:

    See Chapter 10 of the "BIOS and Kernel Developer's Guide for the
    Athlon 64 and AMD Opteron Processors", AMD publication #26094.

This document (and its revision, AMD publication #25759) explains that some of the performance counters use a unit mask which further specifies or qualifies the event. In the particular case of data cache refills, the unit mask specifies exactly which kind of refills are being counted, as described in the following table:

    0x01 Refill from system memory
    0x02 Refill from Shared-state line from L2 cache
    0x04 Refill from Exclusive-state line from L2
    0x08 Refill from Owned-state line from L2
    0x10 Refill from Modified-state line from L2

The problem for the naive user is that the default mask is 0x0, which means that no events are selected (and thus the counts will always be zero). The performance tools would be more user friendly if they warned that a counter was being monitored which could not possibly return any useful data (since the associated unit mask is clear). I presume they don't attempt this because of the complexity of tracking the quirks of many different supported CPUs.

To see the problem, consider the following command and output:

    $ cputrack -c DC_refill_from_L2  application
       time lwp      event      pic0
      1.015   1       tick         0
      2.015   1       tick         0
      2.178   1       exit         0
However, by specifying the unit mask (in this example, the union of all of the "refill from L2" flags), it becomes:
    $ cputrack -c DC_refill_from_L2,umask=0x1e application
       time lwp      event      pic0
      1.028   1       tick     47981
      2.018   1       tick     47225
      2.144   1       exit    101299
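
The value 0x1e is simply the bitwise OR of the four "refill from L2" bits in the table. A minimal sketch of that arithmetic follows; the enumerator and function names are hypothetical, and only the bit values come from the table.

```c
#include <stdint.h>

/* Unit-mask bits for the data cache refill event. The names are made up
 * for this example; the values are the ones listed in the table. */
enum dc_refill_umask {
    DC_REFILL_SYSTEM       = 0x01,  /* refill from system memory */
    DC_REFILL_L2_SHARED    = 0x02,  /* Shared-state line from L2 */
    DC_REFILL_L2_EXCLUSIVE = 0x04,  /* Exclusive-state line from L2 */
    DC_REFILL_L2_OWNED     = 0x08,  /* Owned-state line from L2 */
    DC_REFILL_L2_MODIFIED  = 0x10   /* Modified-state line from L2 */
};

/* Union of every "refill from L2" flag. */
static uint8_t dc_refill_all_l2_mask(void)
{
    return DC_REFILL_L2_SHARED | DC_REFILL_L2_EXCLUSIVE |
           DC_REFILL_L2_OWNED | DC_REFILL_L2_MODIFIED;
}
```

Adding DC_REFILL_SYSTEM to the union would give 0x1f, which counts refills from every source in the table.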

The problem is the same for collect/analyzer, but the syntax for specifying the unit mask is slightly different. As the documentation explains, it uses the hardware counter syntax:

which translates the example to the following:
    collect -h DC_refill_from_L2~umask=0x1e,hi application
This issue isn't a problem for most uses of collect which use the well known profiling counters like: cycles, insts, icm, etc; however, you need to pay attention when using the list of CPU-specific flags.

Thursday Jun 28, 2007

Fixing a 10X Slowdown

Another tuning situation where I was involved as a sounding-board for my colleague Dan Souder dealt with an application which ran at only 1/10th its expected rate. [Rather than executing a particular test case in 18 seconds, it took 200 seconds.]

The first interesting clue was that when the code was compiled for debug, it ran as expected, but when compiled for production, it ran quite slowly. Unfortunately, collect and analyzer showed that the extra time was spread over a broad set of functions (rather than being concentrated in a few misbehaving hot-spots).

There are plenty of problems (like cache conflicts or code scheduling) which can cause a mostly CPU-bound application to slow down by a few percent (or even several tens of percent). But when that kind of application slows down by a factor of ten, it often means that it's having some kind of bad interaction with the system (i.e. paging, I/O bottleneck, TLB thrashing, FP underflow handling, etc). Rather than immediately jumping in with DTrace, it was simpler to first check for obvious problems using vmstat, truss, and trapstat.

trapstat revealed that the application was generating a huge number of lddf-unalign and stdf-unalign traps (for floating point loads and stores of incorrectly aligned addresses). It looked something like:

    # /usr/sbin/trapstat ./some_application
    vct name                |     cpu0
     20 fp-disabled         |        5
     24 cleanwin            |      794
     35 lddf-unalign        |  1085199
     36 stdf-unalign        |  1012567
     41 level-1             |       66
    ... rest of output elided ...

The time spent in the trap handler easily explained the performance symptom. Then some follow-on forensics showed that an application specific memory management layer was returning memory blocks which were only aligned to a 4-byte boundary (when compiled for production). Once that was reconfigured, the performance returned to normal.

Still, I was surprised by this because my experience had been that SPARC generated a SIGBUS in response to a misaligned access. The few times that I had ever tried to work around misalignment (rather than actually fixing it), I had to resort to either specifying the -misalign compiler option or issuing a ST_FIX_ALIGN trap to turn on the kernel trap handler.

A review of the compiler docs reminded me of the "i" (for interpret) variations to the -xmemalign option. This suggested that the kernel trapping behavior could be explained if the application had been compiled with the option "-xmemalign=8i". A more subtle explanation, though, is that the compilers have the following defaults:

  • -xmemalign=8i for all v8 architectures
  • -xmemalign=8s for all v9 architectures
which means that for 64-bit builds, the default behavior would be what I expected (a misaligned access would signal and cause a SIGBUS). But for 32-bit compiles, the default would have the kernel interpret any misaligned access (and thus account for the trap handling).

But even this didn't explain some of the tests I tried. In particular, for 64-bit executables, misaligned loads and stores of type double (on 4-byte boundaries) were never generating a SIGBUS, even when compiled with "-xmemalign=8s". The explanation for this is in Section A.25, "Load Floating-Point" of the SPARC Architecture Manual which contains the following note:

    LDDF requires only word alignment.  However, if the effective
    address is word-aligned but not double word-aligned, LDDF may
    cause an LDDF_mem_address_not_aligned exception.  In this case
    the trap handler software shall emulate the LDDF instruction
    and return.
So for the special case of 8-byte floating-point loads on a 4-byte boundary, the SPARC V9 architecture (not just a particular implementation) requires the misalignment to be handled. As far as I know, in all other cases on SPARC, the operand size should be no larger than the operand alignment (and this still holds true for integer accesses).

This has no impact on most applications because the compiler allocates type double on a double word boundary, and memory returned from malloc (or new) will also be at least double word aligned.
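
An application-specific memory layer can guard against this with a couple of small helpers. The following is a sketch under the assumption that alignments are powers of two; the function names are hypothetical and are not taken from the application discussed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p is a multiple of alignment (alignment must be a power of two). */
static bool is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Round offset up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

A pool allocator that rounds each block offset with align_up(offset, 8) hands out blocks suitable for type double, and a debug build could assert is_aligned(block, 8) before returning each block.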

One interesting aspect to this is that if there had been fewer misaligned loads, the performance impact might not have been enough to trigger an investigation (thus leaving an undiagnosed performance problem). So, from a performance analysis perspective, it might be better if misaligned loads would signal, since that would immediately alert the developer that something was wrong.

Monday Jun 18, 2007

Spurious NaN's

I will usually blog about my own experiences, but my co-workers Dan Souder and Bogdan Vasiliu included me on the periphery of a couple cases that I think were interesting and worth discussing.

The first of these cases involved a large application which correctly worked on multiple platforms (including Solaris/SPARC and Solaris/x64). But when it was compiled for Solaris 10 for 32-bit x86 using Sun Studio compilers, it generated floating point NaN's (IEEE format "Not A Number" values). The mystery was that when the calculation which exhibited the problem was traced in the debugger, the input values were correct and the assembly instructions looked right, but the result was a NaN. To compound the mystery, when the same code sequence was transplanted into a small test program, it worked correctly.

After a bit of thrashing around on theories which proved baseless, one of the engineers did a Google search and discovered a bug report which described a similar failure signature. The test case was something like the following:

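  #include <stdio.h>
  #include <math.h>

  typedef void (*ptr_to_void_function_t)();

  /* Returns its result on the x87 register stack (st0). */
  double double_function()
  {
     return 0.0;
  }

  int main(int argc, char* argv[])
  {
      /* The cast hides the fact that the function returns a double. */
      ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;

      for (int i=0; i<6; i++) {
        pfunc();   /* return value is never popped by the caller */
        double dValue = exp(-5.0);
        printf("Iteration %d -> Value %g\r\n", i, dValue);
      }
      return 0;
  }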
The main program calls double_function() (via the function pointer pfunc) as well as the math library function exp(). Since it always uses a constant argument, there doesn't appear to be much of an opportunity for failure. Surprisingly, though, it produces:
  $ cc bug.c -lm
  $ a.out
  Iteration 0 -> Value 0.00673795
  Iteration 1 -> Value 0.00673795
  Iteration 2 -> Value 0.00673795
  Iteration 3 -> Value 0.00673795
  Iteration 4 -> Value 0.00673795
  Iteration 5 -> Value -NaN
Despite the location of the symptom (the "NaN"), there's nothing wrong with the exp() computation. The underlying problem is unique to the 32-bit ABI on the x86 architecture, which uses the 8087 floating point register stack to return floating point function results. The return value mechanism involves two parts: (1) the called function places the return value in st0 and (2) the calling function is responsible for consuming the value and removing it from the register stack. In this example, however, the user tricked the compiler by hiding the real function prototype behind a pointer cast (with no return value specified).
  ptr_to_void_function_t pfunc = (ptr_to_void_function_t)double_function;
Because of this, the compiler doesn't place any code in the calling function to clean up the register stack, so it "leaks" a value onto the stack each time it is called. By the sixth iteration, the register stack overflows.
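
For comparison, here is a sketch of the same call made through a pointer type that keeps the real return type, so the compiler knows a double comes back from each call. The typedef and wrapper names are hypothetical and are not part of the original test case.

```c
/* Pointer type that preserves the real return type (hypothetical name). */
typedef double (*ptr_to_double_function_t)(void);

static double double_function(void)
{
    return 0.0;
}

/* The pointer type says "returns double", so the compiler can account
 * for the returned value at every call, even when the caller ignores it. */
static double call_through_typed_pointer(void)
{
    ptr_to_double_function_t pfunc = double_function;

    pfunc();            /* result deliberately ignored */
    return pfunc();     /* result used */
}
```

No cast is needed in this version, which is a useful hint in itself: a cast on a function pointer is often the only thing hiding a prototype mismatch from the compiler.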

So that explains what is happening, but how do you find where this occurs in a multi-million line application? To find this specific situation, it might be helpful for the compilers to issue a warning on this kind of cast. However, the example is syntactically legal C, so the tools are certainly not required to complain.

The compiler group explained that the compiler maintains the invariant that the floating-point-register stack must be empty upon entry to any function and should contain no more than one value upon exit. This suggested using DTrace to perform a consistency check on the FP stack at each function entry (or exit). The idea was to check the x86 ftag register, which holds a mask of the valid (active) floating-point-stack registers. In order to check for, say, 1 active entry, you would need to look for the binary mask 10000000 (hex 0x80). In theory, the DTrace command might have been something like:

  dtrace -n 'pid$target:a.out::entry /uregs[R_FTAG] == 0xF0/ { ustack(); }' -c application
This doesn't work, though, because the floating point registers (and corresponding status registers) are not available from within DTrace.

A work-around for this might have been to invoke a program (via "system()") which would inspect the FP status register (via /proc) on each function entry:

  dtrace -wn 'pid$target:a.out::entry {system("fpcheck %d", $pid); }' -c application
The downside to this is that aside from being quite slow (invoking a program for every function call), there also appears to be a conflict inspecting a process via /proc when it is also being controlled by DTrace.

The actual solution Dan used was to manually narrow the interval between the last known point where the FP stack was OK and the first known point where it contained a leaked value. Then he used dbx to look for the problem with some brute-force single-stepping. Because this section of code was not floating point intensive, he checked for any case where the FP stack contained multiple values. This corresponded to the ftag register containing a binary mask of 11000000, so this search was accomplished with a dbx conditional breakpoint of:

  stop cond $ftag==0xC0
As it turns out, the problem was not caused by an abusive cast in the source but rather a bug in a compiler optimization (since fixed) which incorrectly ignored an unused return value.
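
The mask values used above (0x80 for one active entry, 0xC0 for two) can be turned into a stack depth by counting set bits. This is a minimal sketch that assumes the one-bit-per-register representation described in this entry; the function name is hypothetical.

```c
#include <stdint.h>

/* Count the active x87 stack entries in an 8-bit ftag mask, assuming
 * one bit per floating-point register as described in the text. */
static int fp_stack_depth(uint8_t ftag)
{
    int depth = 0;

    while (ftag != 0) {
        depth += ftag & 1;   /* add the lowest bit */
        ftag >>= 1;          /* move on to the next register */
    }
    return depth;
}
```

Under that assumption, a depth greater than one at a function exit (or greater than zero at a function entry) would violate the compiler invariant described earlier.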

I found this case intriguing because it highlighted some of the strengths and limitations of DTrace, /proc, and dbx (not to mention demonstrating a bit of quirkiness about the x86).

Thursday Jan 04, 2007

Decoding Symbol+Offset Addresses

I've been learning to use DTrace in situations where I used to write quick-and-dirty interpose libraries. DTrace doesn't provide for the full generality of what an interposer can do, but a DTrace script is so much faster to develop that it allows me to explore things I might not be willing to investigate if I had to develop C code to do it.

Greg Nakhimovsky recently told me about an interpose library that he wrote to track down small-sized allocations (which can cause heap bloat since the minimum allocation from libc's malloc is 8 bytes for 32-bit binaries and 16 bytes for 64-bit binaries). I suggested that a DTrace script could do much of the same work and offered the following:

   #!/usr/sbin/dtrace -s

   pid$target::malloc:entry
   / arg0 <= 16 /
   { @[ustack(2), arg0] = count(); }
   END { trunc(@, 10); }

This tracks all calls to malloc() for 16 or fewer bytes and prints out the top 10 most frequently executed call sites. By passing the argument "2" to ustack(), it only keeps two levels for each stack frame (one for malloc() and one for the call site). The output looks like:

     `malloc
     12              109485
     `malloc
     `delta+0x235
     16              160086
     `malloc
     `chi+0x451
     8              250510
Usually this kind of stack trace is sufficient because methods/functions are often short and the thing I'm looking for is distinct. For example, to find the first location in the previous listing, I'd just look near the beginning of function epsilon() for a call to malloc(). However, sometimes the location isn't obvious because the method might be quite large or there might be multiple call sites for the function in question.

In that case, what's the easiest way to convert function+offset into a source line location?

If the code is compiled with -g (either for use with dbx during development or with optimization for use with collect), I just use dbx to do the mapping. I suppose I could try to execute the program (using dbx) up to the same point that triggered the DTrace probe, but that's not always easy. A short cut which is inelegant but often successful is to:

  1. Invoke the debugger on the binary.

    $ dbx a.out

  2. Execute up to _start() (in order to load the shared libraries).

    (dbx) stop in _start
    (dbx) run

  3. Set the program counter to function+offset address.

    (dbx) assign $pc=epsilon+0x18

  4. Have the debugger print out the current location.

    (dbx) where
    =>[1] epsilon(sz = 0), line 1012 in "testprog.c"

    (dbx) list +1
    1012 int_p = (int*) malloc(3*sizeof(int));

Unfortunately, this doesn't consistently work. However, if I advance the PC by one machine instruction (using "stepi"), then it does. I admit that this is not a particularly reasonable thing to do: execute up to the beginning of _start() and then execute a single instruction in an arbitrary method of the application. Despite the illogic of it all, it generally provides the source line information that I want.

When I try this on a SPARC system, the initial where command almost never works; however, if I set $npc (as opposed to $pc), and perform the machine level single-step, then it does provide the source information that I want.

The whole thing is a kludge, and the extra stepi command is a kludge on top of a kludge; however, I still find it sufficiently useful to have it encapsulated in a short script (called "lineinfo"):

    $ lineinfo testprog chi+0x27c
    =>[1] chi(sz = ), line 219 in "libtc.c"
       219          p = (char*)malloc(sz);

This isn't exactly ready for prime time, but I'd like to hear if anyone has a better solution (either more robust or more elegant).

The script is:


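    #!/bin/sh
    # Usage: lineinfo executable symbol+offset
    executable=$1
    shift

    if [ ! -x "$executable" ] ; then
        echo "Usage: $0 executable symbol+offset"
        exit 0
    fi

    # On SPARC, set $npc rather than $pc before the single step.
    case `uname -p` in
        sparc) PC='$npc';;
        *)     PC='$pc';;
    esac

    dbx -q $executable 2> /dev/null <<%
    >/dev/null stop in _start
    >/dev/null run
    >/dev/null assign $PC=$*
    >/dev/null stepi
    where
    list +1
    %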

Tuesday Nov 07, 2006

Applications Breaking the Rules

I tend to take for granted that applications will be compatible between successive Solaris releases, but this behavior also depends on the applications "playing by the rules". On the theory that it's sometimes useful to see what went wrong, I thought I might describe a few situations where software failed on Solaris 10 (because of faulty assumptions in the applications).

Abusing the ABI: One of the first portability failures I experienced on Solaris 10 was when an application core dumped because it used the SPARC register %g7. This register has always been reserved for system use by the SPARC ABI, but for well over a decade developers have known that it was only used by the thread library (to point to the user-level thread structure). This led to the assumption that it could freely be used by single-threaded programs. However, because of the Process Model Unification in Solaris 10, even single-threaded applications now have to obey the restriction.

Fortunately, this problem wasn't too difficult to diagnose because the instruction which caused the segmentation violation was using register %g7. Still, this illustrates the point that code should be written based on the documentation (using guaranteed interfaces) rather than on how things are currently implemented (even if they've been that way for a long time).

Assuming Implementation Details: One of the software vendors I've been working with develops on Solaris 8 and then qualifies their applications for Solaris 8, 9, and 10. Because of that trailing development environment, I only recently ran across a change which has been in the Solaris linker for a few years. As an optimization, the linker compresses an object file's symbol table by merging symbols which have the same tail string. My understanding is that this is a significant optimization for the system libraries because many symbols are nearly duplicated, for example, "memcpy" and "_memcpy".

This linker change was invisible to most applications, but ran afoul of a particular vendor's obfuscation utility which is used to make it more difficult to reverse engineer binaries. When every symbol still had a unique string table entry, the utility could simply replace a name like "get_license" with some undecipherable string like "xr56j". Unfortunately, when the string table is compressed, this technique also garbles any symbol which has the same tail string, for example, "license", or even "nse".
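
The effect is easy to reproduce with a toy string table. In the sketch below, three symbol names share the bytes of a single entry (the way a tail-merged table would store them), and renaming the longest one in place garbles the others. The helper name and the same-length replacement string are assumptions for illustration; the vendor's actual utility was not shown.

```c
#include <string.h>

/* Overwrite the name stored at byte offset off with replacement, which
 * must not be longer than the name it replaces. The terminating NUL of
 * the original entry is left in place. */
static void obfuscate_in_place(char *strtab, size_t off, const char *replacement)
{
    memcpy(strtab + off, replacement, strlen(replacement));
}
```

Given a table holding "get_license", the symbols "license" and "nse" are just offsets 4 and 8 into the same bytes. After obfuscate_in_place(strtab, 0, "xr56jq2k9zw"), those offsets read "jq2k9zw" and "9zw", so every symbol that shared the tail has silently changed its name.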

Fortunately, the linker provides an easy workaround in the form of the -z nocompstrtab option (which inhibits the default compression of the string table).

Neglecting Optimizations: Another optimization-related failure occurred with an improvement in the Sun Studio compilers. In this case, the vendor includes some support functions which are not called during normal operation (but are available, for example, to customer support engineers from within "dbx"). [This is not exactly how the software vendor uses these symbols, but this explanation will still illustrate the problem.] The support functions are encapsulated into an archive library and pulled into the main application with references like:

    static int DEBUG=0;
    if (DEBUG==1) {
In all of their previous build environments (including Solaris 8 with Sun Studio 8), this successfully referenced the functions and linked them in from the archive library. However, on Solaris 10 using Sun Studio 11, the compiler correctly notes that "DEBUG" will always logically be zero, so the references are optimized away (thus the functions are not even pulled in from the archive library). This happened to be a quiet failure because everything seemed fine during the build, but failed later when the diagnostic functions were needed.
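
One possible way to keep such references alive without relying on a particular optimization level is to store the function addresses in an externally visible table, which the compiler has to emit because other objects could use it. This is only a sketch: the function and table names are hypothetical, and the stand-in definition exists just to make the example self-contained (in the real case the function would live in the archive library).

```c
#include <stddef.h>

/* Stand-in for a support routine that would normally be defined in the
 * archive library (hypothetical name). */
void support_dump_state(void)
{
}

typedef void (*support_function_t)(void);

/* The table has external linkage, so the object file keeps a reference
 * to each listed function even though nothing calls them here. */
support_function_t const support_function_table[] = {
    support_dump_state,
};
```

Whether a reference survives should still be confirmed with nm on the final binary, since link-time options can also discard unused sections.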

A bit of inspection with /usr/ccs/bin/nm revealed the problem, and the software vendor agreed that the compiler was well within its rights to remove these references. The point, though, is that the code could only work by depending on the lower level of optimization provided by the older compiler releases.

So What? These are just cautionary examples showing how easy it is to (inadvertently or deliberately) slip off the righteous path of chip/OS/language standards and end up with a non-portable application. In each case, the software had worked for several years on prior releases of Solaris (and/or other platforms), so they superficially looked like problems with the new OS release.

Thursday Oct 19, 2006

Debugging Symbol Collisions

I was assisting an engineer in debugging a large application (consisting of millions of lines of code in dozens of libraries) and we suspected that the problem we were working on might have been caused by (inadvertent) symbol interposition.

With enough analysis of the output of /usr/ccs/bin/nm (applied across all of the libraries), I knew I could find symbols which were duplicated in multiple libraries. And by poring over the diagnostics of the linker and/or runtime loader, I could even determine which symbols were incorrectly bound. So I started by dumping the runtime loader help message (to remind myself how to turn on the linker/loader diagnostics):

$ LD_DEBUG=help /bin/echo
[By the way, this is also supported by the loader on Linux.]

Even with instructions, I usually have to experiment for a couple of minutes to discover if I want to look at the diagnostics from bindings or symbols and whether or not I need to add the detail specifier. Pretty quickly, though, I settled on:

$ LD_DEBUG=bindings LD_DEBUG_OUTPUT=/tmp/dump myapplication 
This generated a lot of output, but with some filtering, I eventually saw something like the following:
08764: 1: binding file=./ to file=./ symbol `lookup_symbol'
... many lines omitted ...
08764: 1: binding file=./ to file=./ symbol `lookup_symbol'
Because the calling library was expected to use its own version of lookup_symbol, this was the symbol problem we were looking for.

The real point of this story, however, is that I was reading a two-year-old entry from Rod Evans' blog and (re)discovered that Solaris 10 provides a tool, /usr/ccs/bin/lari, which would have uncovered this problem automatically. Going back and trying this on my original problem:

$ /usr/ccs/bin/lari myapplication | c++filt 
[2:0]: std::bad_cast::~bad_cast()(): /opt/SUNWspro/lib/
[2:1EP]: std::bad_cast::~bad_cast()(): myapplication
[2:0]: htonl(): /lib/
[2:3ES]: htonl(): /lib/
[2:2ES]: lookup_symbol(): ./
[2:0]: lookup_symbol(): ./
The syntax of each line is "[symbol count, bindings] symbol name: object"

I still need to figure out what is causing the messages for bad_cast and htonl, but the last two lines immediately show the problem that we were looking for. There are two definitions of lookup_symbol, but only one of them is being used.

I don't have much experience with lari (yet), but just for good linker hygiene, I think I might run it proactively to check for lurking problems.

Monday Oct 16, 2006

Replacing assembly code with atomic_ops

Occasionally I run across small bits of assembly code buried within large software projects. Usually, the assembly code is there to implement an atomic spin lock or possibly to gain access to a special register or instruction which isn't available via a compiler.

Along those lines, I was recently asked to translate some assembly instructions which were used to atomically increment a global counter (rather than incrementing the counter between calls to mutex_lock and mutex_unlock). However, I was reminded by a co-worker that Solaris 10 provides implementations of several low-level atomic ops. See the man pages for:

So rather than propagating more assembly into this port, the engineers are instead considering calls to atomic_add_64.

This solution avoids application-level assembly code and therefore should be easier to maintain than the original version (since it is identical between both SPARC and x86, as well as for 32-bit and 64-bit). However, because this implementation uses function calls, it appears to be very slightly slower than inlined assembly code (which can be injected using either a parameterized asm() statement with the GNU compilers or with an inline assembly template with the Sun compilers).
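
A counter built on atomic_add_64 might look like the following sketch. The counter and wrapper names are hypothetical. The fallback branch, which uses C11 <stdatomic.h>, is my own addition so that the example also compiles on systems without the Solaris <atomic.h> header; it is not part of the project described here.

```c
#include <stdint.h>

#if defined(__sun)
#include <atomic.h>

static volatile uint64_t global_counter = 0;

/* Atomically add delta to the counter using the Solaris atomic ops. */
static void counter_add(int64_t delta)
{
    atomic_add_64(&global_counter, delta);
}

/* Plain read; adequate for reporting once the writers are quiescent. */
static uint64_t counter_read(void)
{
    return global_counter;
}
#else
#include <stdatomic.h>

static _Atomic uint64_t global_counter = 0;

/* Same operation expressed with C11 atomics (fallback assumption). */
static void counter_add(int64_t delta)
{
    atomic_fetch_add(&global_counter, (uint64_t)delta);
}

static uint64_t counter_read(void)
{
    return atomic_load(&global_counter);
}
#endif
```

Hiding the call behind counter_add() keeps the callers unchanged if the project later decides to switch to inline assembly for speed.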

I'll revisit this topic if the engineers on the project decide that the extra complexity of inline assembly is worth the maintenance burden.

Wednesday Oct 11, 2006

The Rare Platform Difference

I've learned to expect almost complete upward compatibility when moving to a new release of Solaris and to expect nearly identical behavior when moving between platforms running Solaris. Only occasionally will a platform difference show through -- for example, the different VM page sizes on x86 as compared to SPARC.

Occasionally, I'll run across a difference which is not so obviously forced by the platform. One example of this showed up when a simple interpose routine (used to gather some runtime statistics) would not compile on 32-bit x86 Solaris even though there were no problems compiling it for x64 or for SPARC:

#include <sys/types.h>
#include <sys/stat.h>
#include <dlfcn.h>

typedef int (*stat_functype)(const char *path, struct stat *buf);

int
stat(const char *path, struct stat *buf)
{
    /* Look up the next definition of stat() only once. */
    static stat_functype stat_handle = 0;
    if (stat_handle == 0)
        stat_handle = (stat_functype)dlsym(RTLD_NEXT, "stat");

    int retvalue = stat_handle(path, buf);
    /* interposed instrumentation code elided for brevity */
    return retvalue;
}
The problem turned out to be that in the x86-specific header file, <sys/stat_impl.h>, stat is defined (not just declared) as a file-local "static". Something like:
static int
stat(const char *_path, struct stat *_buf)
{
    return (_xstat(2, _path, _buf));
}
When the application code tried to define the interpose routine, the compiler complained about seeing multiple definitions for the symbol.

I couldn't see a platform-specific reason for this implementation difference, but others explained to me that it was probably a historical difference. If I understand correctly, Solaris on x86 used to provide some level of binary compatibility with a legacy Unix (maybe SCO Unix or Interactive Unix). By using this implementation trick, Solaris-compiled binaries would have SVR4 semantics (by using _xstat()), but non-Solaris binaries would get their legacy semantics from stat(). Please take my version of this rationale with a grain of salt; however, it's the best explanation I've heard.

As of today, one of the kernel engineers submitted a request to remove this feature (i.e. treat it like a bug). However, until/unless it is changed, how do you work around it?

The most common suggestion I received was to compile the 32-bit code large file aware. Since this only requires a tweak to the compiler options, it may be the easiest solution:

$ cc -D_FILE_OFFSET_BITS=64 interpose_stat.c

However, if compiling for large files is not acceptable (for example if you're also using /proc), another work-around would be to interpose on the _xstat function. Don't settle on this solution lightly though, because it is NOT guaranteed to be supported or portable. Even the header file warns:

   * NOTE: Application software should NOT program
   * to the _xstat interface.

Thursday Sep 28, 2006

Porting Should Be Easy

Solaris provides stable APIs which are consistent across hardware platforms, so a port from Solaris on SPARC to Solaris on x86 should mostly reduce to a recompile. Right?

Obviously there are some details which might get in the way: (un)availability of x86 versions of the third-party libraries, SPARC assembly code, architecture-specific compiler options in the Makefiles, and low-level data endian-ness.

However, on a project I was recently involved with, the problems of byte order and assembly code had already been solved because the product had been ported to Linux/x86. It might seem that the Solaris/x86 porting work would just require a little tweaking of the existing #ifdef'd code. [Then again, doing anything manually across several million lines of code is more than "just a little tweaking".]

One problem that I hadn't really expected was that we had to manually inspect a large number of #ifdef sequences which were sprinkled throughout the source. Consider the following:

#if defined(SOLARIS)
This code sequence appears to be specific to Solaris, but because "Solaris" and "Sun" have been so intimately associated with SPARC for so long, this conditional (as well as a couple of dozen variations) was also used for protecting big-endian vs. little-endian data accesses, some SPARC specific details, and even occasionally, Sun Studio vs. GNU compiler differences. Similarly, variations of "Linux" could also be used for checking endianness and x86 details.

My first preference would have been to define preprocessor symbols to differentiate OS, platform, and maybe compiler (and to install them consistently throughout the project). However, for this port we chose a "short cut" by using pre-defined symbols provided by the compilers.

As you might expect, there were some differences in what the different compilers provided (for example, the GNU C/C++ compilers define "__sun__", but the Sun compilers do not). However, all compilers define both "sun" and "__sun", so either could be used to recognize when building on "Solaris". I wasn't able to find an argument for using one versus the other, although I'd be interested if anyone has an opinion. Because of some existing uses, we decided to use the pre-defined symbol "sun".

Similarly, all of the SPARC compilers define the symbols "sparc" and "__sparc", and for 32-bit, the x86 compilers define "i386" and "__i386". For 64-bit compiles, the single commonly defined symbol is "__x86_64__".

In retrospect it seems obvious that there should have been distinct discriminants for hardware architecture vs. operating system; however, maybe that was less clear when the code was originally written (when Solaris meant SPARC, HP-UX meant Precision, and AIX meant the Power architecture). That model has changed with Linux running on many platforms (including SPARC), Solaris running on x86, and even Sun Studio tools running on Linux.
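
As an illustration of keeping those discriminants separate, the sketch below answers the OS question, the architecture question, and the byte-order question independently. The function names are hypothetical. It tests "__sun" rather than "sun" (both are defined, as noted above), and the "__linux__" check is my own addition for the non-Solaris case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Operating system: decided only by OS symbols. */
static const char *target_os(void)
{
#if defined(__sun)
    return "solaris";
#elif defined(__linux__)
    return "linux";
#else
    return "unknown";
#endif
}

/* Hardware architecture: decided only by CPU symbols. */
static const char *target_arch(void)
{
#if defined(__sparc)
    return "sparc";
#elif defined(__x86_64__)
    return "x86_64";
#elif defined(__i386)
    return "i386";
#else
    return "unknown";
#endif
}

/* Byte order: checked at run time instead of inferred from OS or CPU. */
static bool is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}
```

With helpers like these, code that cares about byte order never needs to mention Solaris or Linux, and code that cares about the OS never needs to mention SPARC or x86.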

For a more in-depth discussion of this topic, see: Lessons Drawn from a Non-trivial Solaris x86 Port

Sunday Aug 13, 2006

What You'll Find Here

I work in an engineering group at Sun which is chartered with helping software vendors adopt, support, and optimize for Sun technologies. Sometimes the projects are obviously Sun-oriented (like porting to Solaris or performance-tuning an application for current UltraSPARC hardware), but sometimes they are less directly Sun-focused (like migrating an application to Java or re-engineering one to support multithreading). I would like to use this blog to discuss tricks and tips that I run across that I think might be useful for other developers.


