Monday Apr 17, 2006

The compiler bug that wasn't...

People hanging out in the #opensolaris IRC channel last week may have seen a discussion between myself, slowhog, and dprice as we tried to debug why gaim was hanging on slowhog's machine - it seemed the compiler had generated an impossible-to-exit infinite loop: [Note: IRC transcripts below have had some portions dropped and/or reordered to make the discussion easier to read]

<slowhog> dbx shows interesting code
<slowhog> 0xd208b301: _XFlushInt+0x00e5:  testl    %eax,%eax
<slowhog> (dbx) stepi
<slowhog> t@1 (l@1) stopped in _XFlushInt at 0xd208b303
<slowhog> 0xd208b303: _XFlushInt+0x00e7:  je       _XFlushInt+0xe5        [ 0xd208b301, .-2 ]
<slowhog> (dbx) stepi
<slowhog> t@1 (l@1) stopped in _XFlushInt at 0xd208b301
<slowhog> 0xd208b301: _XFlushInt+0x00e5:  testl    %eax,%eax
<slowhog> as you suggested, it's a loop
<slowhog> curious from where is the eax register got updated.

Unfortunately, I've never learned the x86 assembly opcodes. I know the basic concepts from doing 6502 assembly on the Apple II years ago and MIPS assembly in CS classes in college, so I can mostly follow along, but I couldn't tell exactly what was going on without a bit of help, which Dan and Henry were able to provide.

<dprice_at_sun> alanc, it is asking if eax & eax != 0, basically
<alanc> so it's just checking to see if any bits in eax are set?
<alanc> i.e. eax != 0 ?
<slowhog> testl src, dest (an AND that does not change dest, only flags)
<dprice_at_sun> sorry.  so slowhog, the sequence is:  testl eax,eax;  je 
<dprice_at_sun> is that right, just those two?
<slowhog> it's like while (eax!=0)
<slowhog> yup
<dprice_at_sun> right except that eax is a register
<slowhog> exactly
<dprice_at_sun> we're not reloading it right?
<slowhog> not from the two lines of assembly
<dprice_at_sun> you often see this kind of thing when someone forgets a "volatile"
<slowhog> dprice_at_sun, that's what I suspect
<alanc> is this from Solaris 10?
<slowhog> on nv b35
<alanc> X software on Solaris 10 x86 was built with a Studio 10 build that didn't
         always honor volatile when it should
<alanc> Nevada switched to a newer build long ago though
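
For those who, like me, needed the translation: testl %eax,%eax sets the zero flag if %eax is zero, and je branches back to the testl whenever that flag is set - and nothing between the two instructions ever reloads %eax. In C terms it's the classic symptom of the missing "volatile" Dan mentioned; a minimal sketch of the pattern (hypothetical names, not the actual Xlib source):

    /* "ready" is shared with another thread, but not declared
     * volatile, so the compiler may load it into a register once... */
    int ready;

    void wait_until_ready(void)
    {
        /* ...and then re-test that register forever:
         *      testl   %eax,%eax
         *      je      <back to the testl>
         */
        while (ready == 0)
            ;
    }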

So the first step is to narrow down where in the code this loop was being hit:

<alanc> that's probably in the code that's checking to see if the other thread
         holding the lock has released it yet
<alanc> sadly cc -S doesn't give helpfully inlined source on x86 like it does on
         SPARC to more easily match to the right spot
<slowhog> I am building my lib/app with Sun Studio 11
<slowhog> dprice_at_sun, gaim on gtk on gdk on XPending
<alanc> I don't see any loops in _XFlushInt with no bodies
<alanc> the closest is: while (dpy->flags & XlibDisplayWriting) {
<alanc>             ConditionWait(dpy, dpy->lock->writers);
<alanc>         }
<dprice_at_sun> alanc, I looked too...
<dprice_at_sun> hmm

<dprice_at_sun> I see the same loop in my disassembly of nv_36
<alanc> the code in question is xc/lib/X11/XlibInt.c

For those of you following along at home, you can't see the exact version of the code we were looking at, but it's similar to the X11R6.6 version you can see here.

<dprice_at_sun> so, this looks pretty weird to me
<dprice_at_sun> the loop comes after _XSmeAllocateHeapBuffer() in the assembly
<dprice_at_sun> So yeah, it does look like the loop alan mentioned.  line 814 of XlibInt.c
<dprice_at_sun> alanc, that sure doesn't look correct to my tired eyes.
<alanc> ConditionWait should be a function call though, not just jump back to the test
<dprice_at_sun> alanc, yes.  which raises a spectre....
<dprice_at_sun> on SPARC, you can clearly see the test, backwards branch loop happening.
<dprice_at_sun> on the binary bits I have on x86, I don't (yet) see it
<dprice_at_sun> actually, it does look there is a loop here: libX11.so.4`_XFlushInt+0x102:   jne    -0x23    
<alanc> now the cc -S matches
<dprice_at_sun> the comparison to 0x40 is the check for XlibDisplayWriting ; the 
                 comparison with 1 is the test for XlibDisplayIOError
<alanc> .CGB7.195:
<alanc>         movl       1220(%esi),%eax
<alanc> .CGA1.173:
<alanc>         testl      %eax,%eax
<alanc>         je         .CGA1.173
<dprice_at_sun> it'd be nice to know what ConditionWait is expanding to, once the
                 CPP is done with it
<alanc> #define ConditionWait(d,c) if ((d)->lock) \
<alanc>         (*(d)->lock->condition_wait)(c, (d)->lock->mutex)
<dprice_at_sun> well I'm not seeing it do the sort of  'call   *%eax' that you 
                 would need for this code to be working
<dprice_at_sun> there is one later, at libX11.so.4`_XFlushInt+0xf1:    call   *%ecx
<dprice_at_sun> but I think that is the subsequent
                       for (ext = dpy->flushes; ext; ext = ext->next_flush)  loop
<dprice_at_sun> maybe anyway
<alanc> with no -xO flags it gives code that doesn't appear to loop back on itself
<dprice_at_sun> shocker :)
<alanc> instead it's a much longer loop with a call in the middle
<alanc> I see studio 11 also nicely added line number comments to x86 -S output

We did a lot of comparing of the different code outputs with the different optimizer levels that I'll spare you from here (mainly because they were done via pastebin.com pages that have now expired and I didn't archive them).

<dprice_at_sun> alanc, although one wonders why we're not seeing this 
                 problem with lots of X apps?
<alanc> not lots of multithreaded X apps
<dprice_at_sun> I guess if while (dpy->flags & XlibDisplayWriting)  fails 
                 the first time through the loop, we never hit this
<alanc> or rather, not lots of X apps that initialize the threading support 
         so they can call X from multiple threads
<alanc> instead of just having only one thread ever call X

So I filed a bug with our compiler team. I provided the output from cc -E so they had a single source file to debug and didn't need our full build environment. They determined that turning off the conditional eliminator made the loop go away and sent it to the engineers who know that part of the optimizer best. Unfortunately, within a few days they were able to prove that it was not a compiler bug - the compiler was simply being very clever. It saw this code (expanded after cpp preprocessing):

  12710         while (dpy->flags &  ( 1L << 6 )) {
  12711              if ( ( dpy ) -> lock )
  12712                 ( * ( dpy ) -> lock -> condition_wait ) 
                            ( dpy -> lock -> writers , ( dpy ) -> lock -> mutex );
  12713         }

and translated it to (line numbers thanks to the improved cc -S output for x86 in the Sun Studio 11 compilers):

.CGA0.172:
        movl       %ecx,%eax                            ;/ line : 12710
        andl       $64,%eax                             ;/ line : 12710
        je         .CGA3.175                            ;/ line : 12710
.LP5.1903:
.CGB7.195:
        xorl       %eax,%eax                            ;/ line : 12711
        movl       1220(%esi),%ecx                      ;/ line : 12710
.CGA1.173:
        cmpl       %eax,%ecx                            ;/ line : 12711
        je         .CGA1.173                            ;/ line : 12711
.LX4.1905:
.LE4.1906:
        movl       72(%ecx),%eax                        ;/ line : 12712
        movl       52(%ecx),%edx                        ;/ line : 12712
        push       (%ecx)                               ;/ line : 12712
        push       %edx                                 ;/ line : 12712
        call       *%eax                                ;/ line : 12712
        addl       $8,%esp                              ;/ line : 12712
        movl       148(%esi),%ecx                       ;/ line : 12710
        movl       %ecx,%eax                            ;/ line : 12710
        andl       $64,%eax                             ;/ line : 12710
        jne        .CGB7.195                            ;/ line : 12710

It turns out the conditional eliminator had noticed that neither dpy->flags nor dpy->lock was declared volatile, so neither could change during the loop in the case where the condition_wait function was never called, and it rewrote the code effectively to:

 if (dpy->flags & ( 1L << 6 )) {
   if (dpy->lock) {
    do {
     ( * ( dpy ) -> lock -> condition_wait ) 
           ( dpy -> lock -> writers , ( dpy ) -> lock -> mutex );
    } while ((dpy->flags & ( 1L << 6 )) && (dpy->lock));
   } else {
     while (1) {}
   }
 }

After further digging, we found out that Henry was using a gaim that was making X11 calls from multiple threads without calling XInitThreads() first, which violates one of the two most important rules of X11 programming:

  1. Never call Xlib from a signal handler.
  2. Never call Xlib from multiple threads without calling XInitThreads() first.

(Most OSes actually require XInitThreads() to be the very first Xlib call made in the application - Solaris allows you to call it after other functions in some cases - see the Solaris Xlib threading changes I made available to Xorg last year.)
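
In code, rule 2 looks like this minimal sketch (a hypothetical program, with error handling trimmed to the essentials):

    #include <X11/Xlib.h>

    int main(void)
    {
        /* Initialize Xlib's locking before any other Xlib call -
         * required to be first on most OSes, as noted above. */
        if (!XInitThreads())
            return 1;   /* Xlib has no thread support compiled in */

        Display *dpy = XOpenDisplay(NULL);
        if (dpy == NULL)
            return 1;

        /* ... threads may now safely share dpy for Xlib calls ... */

        XCloseDisplay(dpy);
        return 0;
    }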

Code that violates these rules is known and expected to run off the rails and do unpredictable things, but simply going into an infinite loop that's impossible to exit is something we still want to avoid. Two simple changes got rid of this. The first was simply to declare the flags member of the Display struct to be volatile, since it clearly expects to be accessed from multiple threads. (Though since it's not using any sort of atomic test-and-set, I'm not sure how it was ever expected to be free of race conditions - fortunately, the Xlib internals are in the process of being replaced in the open source community by a new model which, through the magic of grad student labor [Correction: undergrad student & professor labor - see comment #1 below], has been formally proven to handle threading correctly - see the XCB project site, especially their conference papers, for more details on this effort.)
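
The shape of that first volatile fix, sketched against a heavily simplified version of the struct (illustrative only - the real Display structure has many more members than shown here):

    struct _XDisplay {
        /* ... many members omitted ... */
        volatile unsigned long flags;   /* volatile: forces a fresh load on
                                         * every iteration of the wait loop */
        struct _XLockInfo *lock;        /* multi-threading support state */
        /* ... */
    };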

The second change we brought in from X11R6.7: instead of just looping when called with no locks initialized, it adds a call to _XWaitForWritable there:

 	while (dpy->flags & XlibDisplayWriting) {
+	    if (dpy->lock) {
 	        ConditionWait(dpy, dpy->lock->writers);
+	    } else {
+		_XWaitForWritable (dpy, cv);
+	    }
 	}

Between these two changes, the compiler can no longer reduce this code down to an infinite loop, and while we can't promise correct, or even very reasonable, behaviour when you fail to call XInitThreads() in your multi-threaded X code, at least we won't chew up an entire CPU when you do.


Tuesday Nov 15, 2005

Which compiler to build Xorg with on Solaris

In yesterday's post, I mentioned how to tell the Xorg monolith which compiler to use, Sun Studio cc or GNU gcc. Figuring out which one you wanted to use used to be somewhat clear - if you had paid all that money for Sun Studio cc, you probably wanted to use it, otherwise gcc was free. Things got a bit murkier a few months ago, when Sun Studio compilers were made available for free to members of the OpenSolaris community, and now this week, Sun Studio 11 was released with a free-to-use license for everyone.

So now both compilers can be used without having to pay anyone - which one should you use to compile Xorg? First the easy answer - though the Sun compilers now support Linux, neither the Xorg monolith nor the modular builds work with them yet on Linux. To get the 6.9 monolith working, much of the #if HasSunCC bits from sun.cf would probably need to be replicated into the linux.cf Imake configuration, and gcc-specific bits replaced in linux.cf and xorg.cf. For 7.0, I tried a couple of months ago with the Sun Studio cc beta and found that autoconf couldn't yet handle building with the Sun compilers - it was able to recognize that it wasn't gcc or any other known compiler, but couldn't quite get the options set right. This patch to autoconf's config.guess got me a little farther, but I still couldn't build it all:

--- config.guess.orig	2005-08-17 15:56:41.000000000 -0700
+++ config.guess	2005-08-17 15:57:31.000000000 -0700
@@ -964,7 +964,7 @@
 	LIBC=gnulibc1
 	# endif
 	#else
-	#ifdef __INTEL_COMPILER
+	#if defined(__INTEL_COMPILER) || defined(__SUNPRO_C)
 	LIBC=gnu
 	#else
 	LIBC=gnuaout

On Solaris though, both the Sun & GNU compilers should work well. As mentioned before, I build the Xorg CVS trees regularly with both and try to fix problems as I find them, but I haven't had a chance to do any comparative performance benchmarking yet. (I know conventional wisdom used to be that gcc had better performance than Sun compilers on x86, and Sun compilers made faster binaries than gcc on SPARC - but a lot of work has gone into the x86/x64 compilers in recent Sun Studio releases, and the Sun Studio team have posted a number of record benchmark scores with the Studio 11 version.)

gcc has an advantage in some areas because it's the compiler most of the Xorg developers use and most other vendors build with, so gcc extensions are commonly used - for instance, the MMX optimization code in libfb for accelerating alpha-blending and anti-aliasing in Xorg still requires gcc (the new MMX intrinsics support in Studio 11 comes close, but isn't completely compatible, though the code probably could be ported with a little effort). A number of areas use #ifdefs and support both Sun Studio cc & gcc - for instance, the new macros in Xfuncproto.h to control symbol visibility use gcc attributes and Sun Studio's __hidden/__global keywords. (There are basically three compilers we know of that are still used somewhat regularly and well supported in Xorg - in order of support/popularity, they'd be gcc, Sun Studio, and USLC for Unixware/OpenServer. There's support in the Imake configs for Intel's compiler, but there are reported bugs against it, and I don't know if it currently works.)
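
As a sketch of that dual-compiler #ifdef style, the visibility macros follow roughly this pattern (modeled on Xfuncproto.h - the exact macro names and version tests in the real header may differ):

    #if defined(__GNUC__) && (__GNUC__ >= 4)
    # define _X_EXPORT __attribute__((visibility("default")))
    # define _X_HIDDEN __attribute__((visibility("hidden")))
    #elif defined(__SUNPRO_C) && (__SUNPRO_C >= 0x550)
    # define _X_EXPORT __global
    # define _X_HIDDEN __hidden
    #else
    # define _X_EXPORT
    # define _X_HIDDEN
    #endif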

So which compiler should you use? I guess that depends on what you're trying to do. Of course, we use Sun Studio tools for the Solaris builds (though we're still on Studio 10 for the official Nevada builds, but are testing with Studio 11 and preparing to switch to those soon), so if you're trying to match our builds, that's the way to go. Otherwise, the best answer I can give is to try both out, play with the different optimization tuning flags, and see which works best for you.

On the other hand, for the rest of the toolchain, I definitely find I'm more productive with the Sun Studio tools - perhaps that's partially because I know Sun dbx much better than gdb, but dbx features like the integrated Korn shell and memory leak/access checking (though access checking works better on SPARC than on x86) are ones I miss when I have to use gdb. I haven't seen anything in the GNU toolchain to even compare to the Performance Analyzer.


