The compiler bug that wasn't...

People hanging out in the #opensolaris IRC channel last week may have seen a discussion between myself, slowhog, and dprice as we tried to debug why gaim was hanging on slowhog's machine - it seemed to be the compiler had generated an impossible-to-exit infinite loop: [Note: IRC transcripts below have had some portions dropped and/or reordered to make the discussion easier to read]

<slowhog> dbx shows interesting code
<slowhog> 0xd208b301: _XFlushInt+0x00e5:  testl    %eax,%eax
<slowhog> (dbx) stepi
<slowhog> t@1 (l@1) stopped in _XFlushInt at 0xd208b303
<slowhog> 0xd208b303: _XFlushInt+0x00e7:  je       _XFlushInt+0xe5        [ 0xd208b301, .-2 ]
<slowhog> (dbx) stepi
<slowhog> t@1 (l@1) stopped in _XFlushInt at 0xd208b301
<slowhog> 0xd208b301: _XFlushInt+0x00e5:  testl    %eax,%eax
<slowhog> as you suggested, it's a loop
<slowhog> curious from where is the eax register got updated.

Unfortunately, I've never learned x86 assembly opcodes, and while I know the basic concepts from doing 6502 assembly on the Apple II years ago, and MIPS assembly in CS classes in college, so can mostly follow, I don't know what exactly is going on without a bit of help which Dan and Henry were able to provide for me.

<dprice_at_sun> alanc, it is asking if eax & eax != 0, basically
<alanc> so it's just checking to see if any bits in eax are set?
<alanc> i.e. eax != 0 ?
<slowhog> testl src, dest (an AND that does not change dest, only flags)
<dprice_at_sun> sorry.  so slowhog, the sequence is:  testl eax,eax;  je 
<dprice_at_sun> is that right, just those two?
<slowhog> it's like while (eax!=0)
<slowhog> yup
<dprice_at_sun> right except that eax is a register
<slowhog> exactly
<dprice_at_sun> we're not reloading it right?
<slowhog> not from the two lines of assembly
<dprice_at_sun> you often see this kind of thing when someone forgets a "volatile"
<slowhog> dprice_at_sun, that's what I suspect
<alanc> is this from Solaris 10?
<slowhog> on nv b35
<alanc> X software on Solaris 10 x86 was built with a Studio 10 build that didn't
         always honor volatile when it should
<alanc> Nevada switched to a newer build long ago though

So the first step is to narrow down where in the code this loop was being hit:

<alanc> that's probably in the code that's checking to see if the other thread
         holding the lock has released it yet
<alanc> sadly cc -S doesn't give helpfully inlined source on x86 like it does on
         SPARC to more easily match to the right spot
<slowhog> I am building my lib/app with Sun Studio 11
<slowhog> dprice_at_sun, gaim on gtk on gdk on XPending
<alanc> I don't see any loops in _XFlushInt with no bodies
<alanc> the closest is: while (dpy->flags & XlibDisplayWriting) {
<alanc>             ConditionWait(dpy, dpy->lock->writers);
<alanc>         }
<dprice_at_sun> alanc, I looked too...
<dprice_at_sun> hmm

<dprice_at_sun> I see the same loop in my disassembly of nv_36
<alanc> the code in question is xc/lib/X11/XlibInt.c

For those of you following along at home, you can't see the exact version of the code we were looking at, but it's similar to the X11R6.6 version you can see here.

<dprice_at_sun> so, this looks pretty weird to me
<dprice_at_sun> the loop comes after _XSmeAllocateHeapBuffer() in the assembly
<dprice_at_sun> So yeah, it does look like the loop alan mentioned.  line 814 of XlibInt.c
<dprice_at_sun> alanc, that sure doesn't look correct to my tired eyes.
<alanc> ConditionWait should be a function call though, not just jump back to the test
<dprice_at_sun> alanc, yes.  which raises a spectre....
<dprice_at_sun> on SPARC, you can clearly see the test, backwards branch loop happening.
<dprice_at_sun> on the binary bits I have on x86, I don't (yet) see it
<dprice_at_sun> actually, it does look there is a loop here:`_XFlushInt+0x102:   jne    -0x23    
<alanc> now the cc -S matches
<dprice_at_sun> the comparison to 0x40 is the check for XlibDisplayWriting ; the 
                 comparison with 1 is the test for XlibDisplayIOError
<alanc> .CGB7.195:
<alanc>         movl       1220(%esi),%eax
<alanc> .CGA1.173:
<alanc>         testl      %eax,%eax
<alanc>         je         .CGA1.173
<dprice_at_sun> it'd be nice to know what ConditionWait is expanding to, once the
                 CPP is done with it
<alanc> #define ConditionWait(d,c) if ((d)->lock) \\
<alanc>         (\*(d)->lock->condition_wait)(c, (d)->lock->mutex)
<dprice_at_sun> well I'm not seeing it do the sort of  'call   \*%eax' that you 
                 would need for this code to be working
<dprice_at_sun> there is one later, at`_XFlushInt+0xf1:    call   \*%ecx
<dprice_at_sun> but I think that is the subsequent
                       for (ext = dpy->flushes; ext; ext = ext->next_flush)  loop
<dprice_at_sun> maybe anyway
<alanc> with no -xO flags it gives code that doesn't appear to loop back on itself
<dprice_at_sun> shocker :)
<alanc> instead it's a much longer loop with a call in the middle
<alanc> I see studio 11 also nicely added line number comments to x86 -S output

We did a lot of comparing of the different code outputs with the different optimizer levels that I'll spare you from here (mainly because they were done via pages that have now expired and I didn't archive them).

<dprice_at_sun> alanc, although one wonders why we're not seeing this 
                 problem with lots of X apps?
<alanc> not lots of multithreaded X apps
<dprice_at_sun> I guess if while (dpy->flags & XlibDisplayWriting)  fails 
                 the first time through the loop, we never hit this
<alanc> or rather, not lots of X apps that initialize the threading support 
         so they can call X from multiple threads
<alanc> instead of just having only one thread ever call X

So I filed a bug with our compiler team. I provided the output from cc -E so they had a single source file to debug and didn't need our full build environment. They determined that turning off the conditional eliminator made the loop go away and sent it to the engineers who know that part of the optimizer best. Unfortunately, they were able to prove in a few days that it was not a compiler bug, but the compiler was simply being very clever. It saw this code (expanded after cpp preprocessing):

  12710         while (dpy->flags &  ( 1L << 6 )) {
  12711              if ( ( dpy ) -> lock )
  12712                 ( \* ( dpy ) -> lock -> condition_wait ) 
                            ( dpy -> lock -> writers , ( dpy ) -> lock -> mutex );
  12713         }

and translated it to (line numbers thanks to the improved cc -S output for x86 in the Sun Studio 11 compilers):

        movl       %ecx,%eax                            ;/ line : 12710
        andl       $64,%eax                             ;/ line : 12710
        je         .CGA3.175                            ;/ line : 12710
        xorl       %eax,%eax                            ;/ line : 12711
        movl       1220(%esi),%ecx                      ;/ line : 12710
        cmpl       %eax,%ecx                            ;/ line : 12711
        je         .CGA1.173                            ;/ line : 12711
        movl       72(%ecx),%eax                        ;/ line : 12712
        movl       52(%ecx),%edx                        ;/ line : 12712
        push       (%ecx)                               ;/ line : 12712
        push       %edx                                 ;/ line : 12712
        call       \*%eax                                ;/ line : 12712
        addl       $8,%esp                              ;/ line : 12712
        movl       148(%esi),%ecx                       ;/ line : 12710
        movl       %ecx,%eax                            ;/ line : 12710
        andl       $64,%eax                             ;/ line : 12710
        jne        .CGB7.195                            ;/ line : 12710

It turns out the conditional eliminator had noticed that neither dpy->flags nor dpy->lock were declared volatile, so they could not change in the running of this loop in the case in which the condition_wait function was not called, so it rewrote it effectively to:

 if (dpy->flags & ( 1L << 6 ))
   if (dpy->lock) {
    do {
     ( \* ( dpy ) -> lock -> condition_wait ) 
           ( dpy -> lock -> writers , ( dpy ) -> lock -> mutex );
    } while (dpy->flags &  ( 1L << 6 )) && (dpy->lock));
   } else {
     while (1) {}

After further digging, we found out that Henry was using a gaim that was making X11 calls from multiple threads without calling XInitThreads() first, which violates one of the two most important rules of X11 programming:

  1. Never call Xlib from a signal handler.
  2. Never call Xlib from multiple threads without calling XInitThreads() first.

(Most OS'es actually require XInitThreads() be the very first Xlib call made in the application - Solaris allows you to call it after other functions in some cases - see the Solaris Xlib threading changes I made available to Xorg last year.)

Code that violates these is known and expected to run off the rails and do unpredictable things, but simply going into a infinite loop that's impossible to exit is something we still want to avoid. Two simple changes got rid of this. The first was simply to declare the flags member of the Display struct to be volatile, since it clearly expects to be accessed from multiple threads. (Though since it's not using any sort of atomic test-and-set, I'm not sure how it was ever expected to be free of race conditions - fortunately, the Xlib internals are in the process of being replaced in the open source community by a new model which through the magic of grad student labor [Correction: undergrad student & professor labor - see comment #1 below] has been formally proven to handle threading correctly - see the XCB project site, especially their conference papers, for more details on this effort.)

The second change we brought in from X11R6.7 which instead of just looping when called when no locks were initialized instead adds a call to _XWaitForWritable there instead:

 	while (dpy->flags & XlibDisplayWriting) {
+	    if (dpy->lock) {
 	        ConditionWait(dpy, dpy->lock->writers);
+	    } else {
+		_XWaitForWritable (dpy, cv);
+	    }

Between these two changes, the compiler can no longer reduce down to an infinite loop, and while we can't promise correct, or even very reasonable, behaviour when you fail to call XInitThreads() in your multi-threaded X code, at least we won't chew up the output of an entire CPU when you do.

[Technorati Tags: , , Sun Studio, , ]


Thanks for the kind words about XCB! Any app that is using multiple threads should probably switch to XCB; its API is completely thread-safe and thread-transparent.

The Xlib-with-XCB-internals is currently in pretty good shape, but it is really just intended as a transition strategy (since calls to Xlib/XCB and XCB can be freely mixed). We expect that everything will eventually move to libraries built directly over XCB, such as Cairo, Pango, GDK and Qt.

Wasn't grad students that built XCB, BTW; it was me, my undergrads, and some great volunteers. Many grad students are ineffective for this kind of thing; they don't code well and don't much care to learn. :-)

Any and all XCB help is greatly appreciated. Thanks much for an entertaining post.

Posted by Bart Massey on April 17, 2006 at 05:52 PM PDT #

As an XCB developer, thank you very much for your post and your pointer to XCB. We have worked quite hard on it, and comments like yours really make our day. I really enjoy the idea that our software will soon end up on every new graphical GNU/Linux box.

Posted by Josh Triplett on April 18, 2006 at 08:48 AM PDT #

Post a Comment:
Comments are closed for this entry.

Engineer working on Oracle Solaris and with the X.Org open source community.


The views expressed on this blog are my own and do not necessarily reflect the views of Oracle, the X.Org Foundation, or anyone else.

See Also
Follow me on twitter


« April 2014