Thursday Mar 20, 2008

Performance tuning recipe

Dan Berger posted a comment about the compiler flags we'd used for Ruby. Basically, we've not done compiler flag tuning yet, so I'll write a quick outline of the first steps in tuning an application.

  • First of all profile the application with whatever flags it usually builds with. This is partly to get some idea of where the hot code is, but it's also useful to have some idea of what you're starting with. The other benefit is that it's tricky to debug build issues if you've already changed the defaults.
  • It should be pretty easy at this point to identify the build flags. Probably they will flash past on the screen, or in the worst case, they can be extracted (from non-stripped executables) using dumpstabs or dwarfdump. It can be necessary to check that the flags you want to use are actually the ones being passed into the compiler.
  • Of course, I'd certainly use spot to get all the profiles. One of the features spot has that is really useful is to archive the results, so that after multiple runs of the application with different builds it's still possible to look at the old code, and identify the compiler flags used.
  • I'd probably try -fast, which is a macro flag, meaning it enables a number of optimisations that typically improve application performance. I'll probably post later about this flag, since there's quite a bit to say about it. Performance under the flag should give an idea of what to expect with aggressive optimisations. If the flag does improve the application's performance, then you probably want to identify the exact options that provide the benefit and use those explicitly.
  • In the profile, I'd be looking for routines which seem to be taking too long, and I'd be trying to figure out what was causing the time to be spent. For SPARC, the execution count data from bit that's shown in the spot report is vital for distinguishing between code that runs slowly and code that runs fast but is executed many times. I'd probably try various other options based on what I saw. Memory stalls might make me try prefetch options, TLB misses would make me tweak the -xpagesize options.
  • I'd also want to look at using crossfile optimisation, and profile feedback, both of which tend to benefit all codes, but particularly codes which have a lot of branches. The compiler can make pretty good guesses on loops ;)

These directions are more a list of possible experiments than necessarily an item-by-item checklist, but they form a good basis. And they are not an exhaustive list...

Cross-linking support in Nevada

Ali Bahrami has just written an interesting post about cross-linking support going into Nevada. This is the facility to enable the linking of SPARC object files to produce SPARC executables on an x86 box (or the other way around).

Friday Mar 14, 2008

32-bits good, 64-bits better?

One of the questions people ask is when to develop 64-bit apps vs 32-bit apps. The answer is not totally clear-cut; it depends on the application and the platform. So here's my take on it.

First, let's discuss SPARC. The SPARC V8 architecture defines the 32-bit SPARC ISA (Instruction Set Architecture). This was found on a selection of SPARC processors that appeared before the UltraSPARC I, i.e. quite a long time ago. Later the SPARC V9 architecture appeared. (As an aside, these are open standards, so anyone can download the specs and make one; of course not everyone has the time and resources to do that, but it's nice in theory ;)

The SPARC V9 architecture added a few instructions, but mainly added the capability to use 64-bit addressing. The ABI (Application Binary Interface) was also improved (floating point values passed in FP registers rather than the odd V8 choice of using the integer registers). The UltraSPARC I and onwards have implemented the V9 architecture, which means that they can execute both V8 and V9 binaries.

One of the things it's easy to be taken in by is the Animal Farm idea that if 32 bits are good, 64 bits must be better. The trouble with a 64-bit address space is that it takes more instructions to set up an address, pointers go from 4 bytes to 8 bytes, and the memory footprint of the application increases.

A hybrid mode was also defined, which took the 32-bit address space, together with a selection of the new instructions. This was called v8plus, or more recently sparcvis. This has been the default architecture for SPARC binaries for quite a while now, and it combines the smaller memory footprint of SPARC V8 with the more recent instructions from SPARC V9. For applications that don't require 64-bit address space, v8plus or sparcvis is the way to go.

Moving to the x86 side, things are slightly more complex. The high-level view is similar. You have the x86 ISA, or IA32 as it's been called. Then you have the 64-bit ISA, called AMD64 or EM64T. This gives you 64-bit addressing, a new ABI, a number of new instructions, and, perhaps most importantly, a bundle of new registers. The x86 has impressively few registers; AMD64/EM64T fixes that quite nicely.

In the same way as on SPARC, moving to a 64-bit address space costs some performance due to the increased memory footprint. However, x64 gains a number of additional registers, which usually more than make up for this loss. So the general rule is that 64-bit is better, unless the application makes extensive use of pointers.

Unlike SPARC, AMD64/EM64T does not currently provide a mode which gives the instruction set extensions and registers with a 32-bit address space.

Thursday Mar 13, 2008

SPARC64 VI docs

Programmer's guide for the Fujitsu SPARC64 VI.

Friday Mar 07, 2008

Ruby performance gains on SPARC

The programming language Ruby is run on a VM. So the VM is responsible for context switches as well as garbage collection. Consequently, the code contains calls to flush register windows. A colleague of mine, Miriam Blatt, has been examining the code and we think we've found some places where the calls to flush register windows are unnecessary. The code appears in versions 1.8/1.9 of Ruby, but I'll focus on 1.8.* in this discussion.

As outlined in my blog entry on register windows, the act of flushing them is both high cost and rarely needed. The key points at which it is necessary to flush the register windows to memory are on context switches and before garbage collection.

Ruby defines a macro called FLUSH_REGISTER_WINDOWS in defines.h. The macro only does something on IA64 and SPARC, so the changes I'll discuss here are defined so that they leave the behaviour on IA64 unchanged. My suspicion is that the changes are equally valid for IA64, but I lack an IA64 system to check them on.

The FLUSH_REGISTER_WINDOWS macro gets used in eval.c in the EXEC_TAG macro, THREAD_SAVE_CONTEXT macro, rb_thread_save_context routine, and rb_thread_restore_context routine. (There's also a call in gc.c for the garbage collection.)

The first thing to notice is that the THREAD_SAVE_CONTEXT macro calls rb_thread_save_context, so the FLUSH_REGISTER_WINDOWS call in the THREAD_SAVE_CONTEXT macro is unnecessary (the register windows have already been flushed). However, we've not seen this particular flush cause any performance issues in our tests (although it's possible that the tests didn't stress multithreading).

The more important call is the one in EXEC_TAG. This is executed very frequently in Ruby codes, but the flush does not appear to be at all necessary: it is neither a context switch nor the start of garbage collection. Removing this call to flush register windows leads to significant performance gains (upwards of 10% when measured on an older V880 box; some of the benchmarks nearly doubled in performance).

The source code modifications for 1.8.6 are as follows:

$ diff defines.h.orig defines.h.mod
> #  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)

The change to defines.h adds new variants of the FLUSH_REGISTER_WINDOWS macro to be used for the EXEC_TAG and THREAD_SAVE_CONTEXT macros. To preserve the current behaviour, they are defined as ((void)0) on all architectures except IA64, where they are defined as FLUSH_REGISTER_WINDOWS.

$ diff eval.c.orig eval.c.mod
< #define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
> #define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
<     (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
>     (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

The changes to eval.c just use the new macros instead of the old FLUSH_REGISTER_WINDOWS call.

These code changes have worked on all the tests we've used (including `gmake test-all`). However, I can't be certain that there is not a workload which requires these flushes. This appears to be the putback that added the flush call to EXEC_TAG, and the comment suggests that the change may not be necessary. I'd love to hear comments either agreeing with the analysis or pointing out why the flushes are necessary.

Update: to add diff -u output
$ diff -u defines.h.orig defines.h.mod
--- defines.h.orig      Tue Mar  4 16:32:05 2008
+++ defines.h.mod       Wed Mar  5 14:22:06 2008
@@ -226,12 +226,18 @@
 #  define FLUSH_REGISTER_WINDOWS flush_register_windows()
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
+#  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
 #elif defined(__ia64)
 void *rb_ia64_bsp(void);
 void rb_ia64_flushrs(void);
 #  define FLUSH_REGISTER_WINDOWS rb_ia64_flushrs()
+#  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
+#  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
 #else
 #  define FLUSH_REGISTER_WINDOWS ((void)0)
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
+#  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
 #endif

 #if defined(DOSISH)
$ diff -u eval.c.orig eval.c.mod
--- eval.c.orig Tue Mar  4 16:32:00 2008
+++ eval.c.mod  Wed Mar  5 14:22:13 2008
@@ -1022,7 +1022,7 @@
 #define PROT_LAMBDA INT2FIX(2) /* 5 */
 #define PROT_YIELD  INT2FIX(3) /* 7 */

-#define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
+#define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))

 #define JUMP_TAG(st) do {              \
     ruby_frame = prot_tag->frame;      \
@@ -10287,7 +10287,7 @@

 #define THREAD_SAVE_CONTEXT(th) \
-    (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
+    (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

 NORETURN(static void rb_thread_restore_context _((rb_thread_t,int)));
 NORETURN(NOINLINE(static void rb_thread_restore_context_0(rb_thread_t,int,void*)));

Flush register windows

The SPARC architecture has an interesting feature called Register Windows. The idea is that the processor should contain multiple sets of registers on chip. When a new routine is called, the processor can give a fresh set of registers to the new routine, preserving the value of the old registers. When the new routine completes and control returns to the calling routine, the register values for the old routine are also restored. The idea is for the chip not to have to save and load the values held in registers whenever a routine is called; this reduces memory traffic and should improve performance.

The trouble with register windows is that each chip can only hold a finite number of them. Once all the register windows are full, the processor has to spill a complete set of registers to memory. This is in contrast with the situation where the program is responsible for spilling and filling registers - the program need only spill a single register if that is all that the routine requires.

Most SPARC processors have about seven sets of register windows, so if the program remains in a call stack depth of about seven, there is no register spill/fill cost associated with calls of other routines. Beyond this stack depth, there is a cost for the spills and fills of the register windows.

The SPARC architecture book contains a more detailed description of register windows in section 5.2.2.

Most of the time software is completely unaware of this architectural decision; in fact, user code should never have to be aware of it. There are two situations where software does need to know about register windows, and these really only impact virtual machines or operating systems:

  • Context switches. In a context switch the processor changes to executing another software thread, so all the state from that thread needs to be saved for the thread to later resume execution. Note that setjmp and longjmp which are sometimes used as part of code to implement context switching already have the appropriate flushes in them.
  • Garbage collection. Garbage collection involves inspecting the state of the objects held in memory and determining whether each object is live or dead. Live objects are identified by having other live objects point to them. So all the registers need to be stored in memory so that they can be inspected to check whether they point to any objects that should be considered live.

The SPARC V9 instruction flushw causes the processor to store all the register windows of a thread to memory. On SPARC V8, the same effect is obtained through trap 0x03. Either way, the cost can be quite high, since the processor may need to store up to about seven sets of register windows to memory. Each set is 16 8-byte registers, which results in potentially a lot of memory traffic and cycles.

Tuesday Feb 12, 2008

CMT related books

There's a sidebar featuring books that are relevant to CMT. Mine is featured as one of the links. The other two books are Computer Architecture - A Quantitative Approach, which features the UltraSPARC T1, and Chip-Multiprocessor Architecture, which is by members of the team responsible for the UltraSPARC T1 processor.

Friday Feb 01, 2008

The meaning of -xmemalign

I made some comments on a thread on the forums about memory alignment on SPARC and the -xmemalign flag. I've talked about memory alignment before, but this time the discussion was more about how the flag works. In brief:

  • The flag has two parts -xmemalign=[1|2|4|8][i|s]
  • The number specifies the alignment that the compiler should assume when compiling an object file. If the compiler is not certain that the current variable is correctly aligned (say it's accessed through a pointer), then it will assume the alignment given by the flag. Take a single-precision floating-point value that occupies four bytes. Under -xmemalign=1[i|s] the compiler will assume that it is unaligned, and so will issue four single-byte loads to load the value. If the alignment is specified as -xmemalign=2[i|s], the compiler will assume two-byte alignment, and so will issue two loads to get the four-byte value.
  • The suffix [i|s] tells the compiler how to behave if there is a misaligned access. For 32-bit codes the default is i which fixes the misaligned access and continues. For 64-bit codes the default is s which causes the app to die with a SIGBUS error. This is the part of the flag that has to be specified at link time because it causes different code to be linked into the binary depending on the desired behaviour. The C documentation captures this correctly, but the C++ and Fortran docs will be updated.

Tuesday Oct 16, 2007

Building shared libraries for SPARCV9

By default, the SPARC compiler assumes that SPARCV9 objects are built with -xcode=abs44, which means that 44 bits are used to hold the absolute address of any object. Shared libraries should be built using position-independent code, either -xcode=pic13 or -xcode=pic32 (replacing the deprecated -Kpic and -KPIC options).

If one of the object files in a library is built with abs44, then the linker will report the following error:

ld: fatal: relocation error: R_SPARC_H44: file file.o: symbol <unknown>:
relocations based on the ABS44 coding model can not be used in building
a shared object
Further details on this can be found in the compiler documentation.

Tuesday Sep 25, 2007

Solaris Application Programming book

I'm thrilled to see that my book is being listed for pre-order on Amazon in the US. It seems to take about a month for it to travel the Atlantic to Amazon UK.

Friday Aug 31, 2007

Register windows and context switches

An interesting paper on register windows and context switching.

Total Store Ordering (TSO)

Total Store Ordering is the name for the ordering implemented on multi-processor SPARC systems. The SPARC architecture also defines a number of weaker orderings. TSO basically guarantees that loads are visible between processors in the order in which they occurred, the same for stores, and that loads will see the values of earlier stores. There's a very helpful description of this in the SPARC architecture documentation on pages 418-427 (section 9.5). Particularly useful is table 9-2 on page 422, which gives a useful summary of where membar operations are required.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

