Thursday Mar 05, 2009

Peak rate of window spill and fill traps

I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?

I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:

extern int jump2(int count);

int jump1(int count)
  if (count==0)
    return 1;
    return 1+jump2(count);
int jump1(int count);

int jump2(int count)
  if (count==0)
    return 1;
    return 1+jump1(count);

I can then turn these into libraries:

$ cc -O -G -o jump1.c
$ cc -O -G -o jump2.c

Done in this way, both libraries have hanging dependencies:

$ ldd -d
        symbol not found: jump2         (./

So the main executable will have to resolve these. The main executable looks like:

#include <stdio.h>

#define RPT 100
#define SIZE 600

extern int jump1(int count);

int main()
  int index,count,links,tmp;
  #pragma omp parallel for default(none) private(count,index,tmp,links)
  for(links=1; links<10000; links++)
   for (count=0; count<RPT; count++)
     for (index=0;index<SIZE;index++) 
       if (tmp!=links) {printf("mismatch\\n");}

This needs to be compiled in such a way as to resolve the unresolved dependencies

$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2

Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.

I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...

I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.

The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:

vct name           |   cpu13  
 84 spill-user-32       |  3005899  
 c4 fill-user-32        |  3005779 

So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:

         21c:  9d e3 bf a0  save        %sp, -96, %sp
         220:  90 86 3f ff  addcc       %i0, -1, %o0
         224:  12 40 00 04  bne,pn      %icc,jump1+0x18 ! 0x234
         228:  01 00 00 00  nop       

         22c:  81 c7 e0 08  ret       
         230:  91 e8 20 01  restore     %g0, 1, %o0

         234:  40 00 00 00  call        jump2
         238:  01 00 00 00  nop       
         23c:  81 c7 e0 08  ret       
         240:  91 ea 20 01  restore     %o0, 1, %o0

So you can come up with an estimate of 300 ns/trap.

The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:

vct name                |    cpu21    cpu22    cpu23    cpu24    cpu25    cpu26 
 84 spill-user-32       |  1024589  1028081  1027596  1174373  1029954  1028695
 c4 fill-user-32        |   996739   989598   955669  1169058  1020349  1021877

So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.

 vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3022  427  812 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3020  428  797 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 2945  457  760 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3147  429 1025 99  1  0
 0 0 0 64800040 504168 0 15  0  0  0  0  0  0  0  0  0 3049  666  820 99  1  0
 0 0 0 64800040 504168 0  1  0  0  0  0  0  0  0  0  0 3044  543  866 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3021  422  798 100 0  0

So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.

One final thing to look at is the instruction issue rate. We can use cpustat to do this:

  2.009   6  tick  63117611 
  2.009  12  tick  69622769 
  2.009   7  tick  62118451 
  2.009   5  tick  64784126 
  2.009   0  tick  67341237 
  2.019  17  tick  62836527 

As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.

If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.

The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.

Friday Mar 07, 2008

Ruby performance gains on SPARC

The programming language Ruby is run on a VM. So the VM is responsible for context switches as well as garbage collection. Consequently, the code contains calls to flush register windows. A colleague of mine, Miriam Blatt, has been examining the code and we think we've found some places where the calls to flush register windows are unnecessary. The code appears in versions 1.8/1.9 of Ruby, but I'll focus on 1.8.\* in this discussion.

As outlined in my blog entry on register windows, the act of flushing them is both high cost and rarely needed. The key points at which it is necessary to flush the register windows to memory are on context switches and before garbage collection.

Ruby defines a macro called FLUSH_REGISTER_WINDOWS in defines.h. The macro only does something on IA64 and SPARC, so the changes I'll discuss here are defined so that they leave the behaviour on IA64 unchanged. My suspicion is that the changes are equally valid for IA64, but I lack an IA64 system to check them on.

The FLUSH_REGISTER_WINDOWS macro gets used in eval.c in the EXEC_TAG macro, THREAD_SAVE_CONTEXT macro, rb_thread_save_context routine, and rb_thread_restore_context routine. (There's also a call in gc.c for the garbage collection.)

The first thing to notice is that the THREAD_SAVE_CONTEXT macro calls rb_thread_save_context, so the FLUSH_REGISTER_WINDOWS call in the THREAD_SAVE_CONTEXT macro is unnecessary (the register windows have already been flushed). However, we've not seen this particular flush cause any performance issues in our tests (although it's possible that the tests didn't stress multithreading).

The more important call is the one in EXEC_TAG. This is executed very frequently in Ruby codes, but this flush does not appear to be at all necessary. It is neither a context switch or the start of garbage collection. Removing this call to flush register windows leads to significant performance gains (upwards of 10% when measured in an older v880 box. Some of the benchmarks nearly doubled in performance).

The source code modifications for 1.8.6 are as follows:

$ diff defines.h.orig defines.h.mod
> #  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)

The change to defines.h adds new variants of the FLUSH_REGISTER_WINDOWS macro to be used for the EXEC_TAG and THREAD_SAVE_CONTEXT macros. To preserve the current behaviour on IA64, they are left as defined as ((void)0) on all architectures but IA64 where they are defined as FLUSH_REGISTER_WINDOWS.

$ diff eval.c.orig eval.c.mod
< #define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
> #define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
<     (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
>     (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

The changes to eval.c just use the new macros instead of the old FLUSH_REGISTER_WINDOWS call.

These code changes have worked on all the tests we've used (including `gmake test-all`). However, I can't be certain that there is not a workload which requires these flushes. This appears to be putback that added the flush call to EXEC_TAG, and the comment suggests that the change may not be necessary. I'd love to hear comments either agreeing with the analysis, or pointing out why the flushes are necessary.

Update: to add diff -u output
$ diff -u defines.h.orig defines.h.mod
--- defines.h.orig      Tue Mar  4 16:32:05 2008
+++ defines.h.mod       Wed Mar  5 14:22:06 2008
@@ -226,12 +226,18 @@
 #  define FLUSH_REGISTER_WINDOWS flush_register_windows()
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
 #elif defined(__ia64)
 void \*rb_ia64_bsp(void);
 void rb_ia64_flushrs(void);
 #  define FLUSH_REGISTER_WINDOWS rb_ia64_flushrs()
 #  define FLUSH_REGISTER_WINDOWS ((void)0)

 #if defined(DOSISH)
$ diff -u eval.c.orig eval.c.mod
--- eval.c.orig Tue Mar  4 16:32:00 2008
+++ eval.c.mod  Wed Mar  5 14:22:13 2008
@@ -1022,7 +1022,7 @@
 #define PROT_LAMBDA INT2FIX(2) /\* 5 \*/
 #define PROT_YIELD  INT2FIX(3) /\* 7 \*/

-#define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
+#define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))

 #define JUMP_TAG(st) do {              \\
     ruby_frame = prot_tag->frame;      \\
@@ -10287,7 +10287,7 @@

 #define THREAD_SAVE_CONTEXT(th) \\
-    (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
+    (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

 NORETURN(static void rb_thread_restore_context _((rb_thread_t,int)));
 NORETURN(NOINLINE(static void rb_thread_restore_context_0(rb_thread_t,int,void\*)));

Flush register windows

The SPARC architecture has an interesting feature called Register Windows. The idea is that the processor should contain multiple sets of registers on chip. When a new routine is called, the processor can give a fresh set of registers to the new routine, preserving the value of the old registers. When the new routine completes and control returns to the calling routine, the register values for the old routine are also restored. The idea is for the chip not to have to save and load the values held in registers whenever a routine is called; this reduces memory traffic and should improve performance.

The trouble with register windows, is that each chip can only hold a finite number of them. Once all the register windows are full, the processor has to spill a complete set of registers to memory. This is in contrast with the situation where the program is responsible for spilling and filling registers - the program only need spill a single register if that is all that the routine requires.

Most SPARC processors have about seven sets of register windows, so if the program remains in a call stack depth of about seven, there is no register spill/fill cost associated with calls of other routines. Beyond this stack depth, there is a cost for the spills and fills of the register windows.

The SPARC architecture book contains a more detailed description of register windows in section 5.2.2.

Most of the time software is completely unaware of this architectural decision, in fact user code should never have to be aware of it. There are two situations where software does need to know about register windows, these really only impact virtual machines or operating systems:

  • Context switches. In a context switch the processor changes to executing another software thread, so all the state from that thread needs to be saved for the thread to later resume execution. Note that setjmp and longjmp which are sometimes used as part of code to implement context switching already have the appropriate flushes in them.
  • Garbage collection. Garbage collection involves inspecting the state of the objects held in memory and determining whether each object is live or dead. Live objects are identified by having other live objects point to them. So all the registers need to be stored in memory so that they can be inspected to check whether they point to any objects that should be considered live.

The SPARC V9 instruction flushw will cause the processor to store all the register windows in a thread to memory. For SPARC V8, the same effect is attained through trap x03. Either way, the cost can be quite high since the processor needs to store up to about 7 sets of register windows to memory Each set is 16 8-byte registers, which results in potentially a lot of memory traffic and cycles.

Friday Aug 31, 2007

Register windows and context switches

Interesting paper on register windows and context switching


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming