Ruby performance gains on SPARC

The programming language Ruby is run on a VM. So the VM is responsible for context switches as well as garbage collection. Consequently, the code contains calls to flush register windows. A colleague of mine, Miriam Blatt, has been examining the code and we think we've found some places where the calls to flush register windows are unnecessary. The code appears in versions 1.8/1.9 of Ruby, but I'll focus on 1.8.\* in this discussion.

As outlined in my blog entry on register windows, the act of flushing them is both high cost and rarely needed. The key points at which it is necessary to flush the register windows to memory are on context switches and before garbage collection.

Ruby defines a macro called FLUSH_REGISTER_WINDOWS in defines.h. The macro only does something on IA64 and SPARC, so the changes I'll discuss here are defined so that they leave the behaviour on IA64 unchanged. My suspicion is that the changes are equally valid for IA64, but I lack an IA64 system to check them on.

The FLUSH_REGISTER_WINDOWS macro gets used in eval.c in the EXEC_TAG macro, THREAD_SAVE_CONTEXT macro, rb_thread_save_context routine, and rb_thread_restore_context routine. (There's also a call in gc.c for the garbage collection.)

The first thing to notice is that the THREAD_SAVE_CONTEXT macro calls rb_thread_save_context, so the FLUSH_REGISTER_WINDOWS call in the THREAD_SAVE_CONTEXT macro is unnecessary (the register windows have already been flushed). However, we've not seen this particular flush cause any performance issues in our tests (although it's possible that the tests didn't stress multithreading).

The more important call is the one in EXEC_TAG. This is executed very frequently in Ruby codes, but this flush does not appear to be at all necessary. It is neither a context switch or the start of garbage collection. Removing this call to flush register windows leads to significant performance gains (upwards of 10% when measured in an older v880 box. Some of the benchmarks nearly doubled in performance).

The source code modifications for 1.8.6 are as follows:

$ diff defines.h.orig defines.h.mod
228a229,230
> #  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
> #  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
232a235,236
> #  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
> #  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
234a239,240
> #  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
> #  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS

The change to defines.h adds new variants of the FLUSH_REGISTER_WINDOWS macro to be used for the EXEC_TAG and THREAD_SAVE_CONTEXT macros. To preserve the current behaviour on IA64, they are left as defined as ((void)0) on all architectures but IA64 where they are defined as FLUSH_REGISTER_WINDOWS.

$ diff eval.c.orig eval.c.mod
1025c1025
< #define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
---
> #define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
10290c10290
<     (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
---
>     (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

The changes to eval.c just use the new macros instead of the old FLUSH_REGISTER_WINDOWS call.

These code changes have worked on all the tests we've used (including `gmake test-all`). However, I can't be certain that there is not a workload which requires these flushes. This appears to be putback that added the flush call to EXEC_TAG, and the comment suggests that the change may not be necessary. I'd love to hear comments either agreeing with the analysis, or pointing out why the flushes are necessary.

Update: to add diff -u output
$ diff -u defines.h.orig defines.h.mod
--- defines.h.orig      Tue Mar  4 16:32:05 2008
+++ defines.h.mod       Wed Mar  5 14:22:06 2008
@@ -226,12 +226,18 @@
        ;
 }
 #  define FLUSH_REGISTER_WINDOWS flush_register_windows()
+#  define EXEC_FLUSH_REGISTER_WINDOWS ((void)0)
+#  define SWITCH_FLUSH_REGISTER_WINDOWS ((void)0)
 #elif defined(__ia64)
 void \*rb_ia64_bsp(void);
 void rb_ia64_flushrs(void);
 #  define FLUSH_REGISTER_WINDOWS rb_ia64_flushrs()
+#  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
+#  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
 #else
 #  define FLUSH_REGISTER_WINDOWS ((void)0)
+#  define EXEC_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
+#  define SWITCH_FLUSH_REGISTER_WINDOWS FLUSH_REGISTER_WINDOWS
 #endif

 #if defined(DOSISH)
$ diff -u eval.c.orig eval.c.mod
--- eval.c.orig Tue Mar  4 16:32:00 2008
+++ eval.c.mod  Wed Mar  5 14:22:13 2008
@@ -1022,7 +1022,7 @@
 #define PROT_LAMBDA INT2FIX(2) /\* 5 \*/
 #define PROT_YIELD  INT2FIX(3) /\* 7 \*/

-#define EXEC_TAG()    (FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))
+#define EXEC_TAG()    (EXEC_FLUSH_REGISTER_WINDOWS, ruby_setjmp(((void)0), prot_tag->buf))

 #define JUMP_TAG(st) do {              \\
     ruby_frame = prot_tag->frame;      \\
@@ -10287,7 +10287,7 @@
 }

 #define THREAD_SAVE_CONTEXT(th) \\
-    (rb_thread_switch((FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))
+    (rb_thread_switch((SWITCH_FLUSH_REGISTER_WINDOWS, ruby_setjmp(rb_thread_save_context(th), (th)->context))))

 NORETURN(static void rb_thread_restore_context _((rb_thread_t,int)));
 NORETURN(NOINLINE(static void rb_thread_restore_context_0(rb_thread_t,int,void\*)));
Comments:

Thank you for the information. Could you show us your patch in unified diff format from diff -u, if you don't mind? I am very interested in merging the patch.

Posted by matz on March 12, 2008 at 01:45 AM PDT #

Hi Darryl,

Thanks for this, very interesting.

BTW, can you please share what compiler flags you use when building Ruby?

Regards,

Dan

Posted by Daniel Berger on March 20, 2008 at 12:34 AM PDT #

Dan,

We used -xO4, which is quite low optimisation really. The issue being that the the flush register windows code was taking sufficient time that the compiler flags were not particularly important.

I'll write up something about tuning shortly.

Darryl.

Posted by Darryl Gove on March 20, 2008 at 02:44 AM PDT #

Post a Comment:
Comments are closed for this entry.
About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs