On Bugs and Reproducibility

On Bugs and Reproducibility

I love bugs -- specifically, I love putting an end to bugs[1]. I find myself a bit tongue-twisted when asked to articulate just what makes root-causing a bug so engrossing. In brief: it's the thrill of the hunt, the pull of a mystery, the search for a deeper understanding of how the system works, and -- of course -- the gripping fear that some customer somewhere will hit it again if we don't nail the bloody thing[2].

One way to subclass bugs is by reproducibility. That is, can the bug be reproduced on demand? However, as this blog entry will make clear, almost[3] any bug can be reproduced on demand once it is sufficiently understood. Thus, when someone states "this bug is not reproducible", they're likely really saying "this bug is not sufficiently understood". Sadly, one established way a bug can be "resolved" at Sun is for it to be closed out as "not reproducible" -- not just a shameful admission of utter defeat, but a near-guarantee to customers that they will be inconvenienced by the problem yet again. Thus, many of us simply refuse to close out bugs as "not reproducible"[4].

Nonetheless, sometimes bugs are indeed closed out as "not reproducible". In fact, if a bug takes up residence in a subsystem that is poorly understood but widely used, it can thump around for years -- or even decades -- before it is finally smoked out by a particularly determined engineer. Further, often the manifestations of the bug can be so varied that even after diagnosing, it can prove impossible to track down all of the duplicates that are either still open or have been closed out over the years as "not reproducible". Here I discuss my dealings with one such bug -- 4927647.

The Sewer of STREAMS

A classic example of a poorly-understood but widely-used area is the pseudo-terminal subsystem[5]. While the basics of how pseudo-terminals work is well-documented in several texts[6], the STREAMS code that implements it is a arcane and delicate piece of machinery from AT&T that has undergone little change since its initial impact with SVR3.2 some 18 years ago.

Again, that's not to say the code is unimportant. On the contrary, it is a core building block upon which many critical utilities, such as xterms, remote logins (ssh/telnet), and tools like expect(1) are built. Thus, when something goes wrong, someone has to be called in to diagnose it -- and since it's STREAMS code, these days that often means me[7].

Enter 4927647.

Specifically, I got called in because a high-profile customer filed the following bug:

    In expect 5.38.0 on Solaris 9, run the following script: 

        for { set i 1 } { 1 } { incr i } {
            spawn echo foobarbaz
            expect {
                "foobarbaz" {
                   puts "ok"
                   expect eof wait
                eof {
                    puts "$i passes"
                    exit 1

    It should loop forever; instead, it exits after a few iterations on
    multiprocessor systems.  It appears to loop "forever" on uniprocessors.

Indeed, running this test I was able to confirm that it did fail on multiprocessor systems -- though usually after thousands of iterations (each iteration taking close to a second), rather than the "few" indicated in the bug report. Without a reproducible test case, this was surely going to be a painful battle.

The Hunt for Reproducibility

Using truss(1), it had already been confirmed that the problem was not with expect, but with the pseudo-terminal implementation itself. Specifically, something was causing the poll(2) to occasionally wake up with POLLERR, and the subsequent read(2) to return EINVAL rather than the number of bytes in the expected message:

    25258:  open("/dev/ptmx", O_RDWR)                       = 4
    [ ... lots happens ... ]
    25258:  poll(0xFFBFEBC0, 1, 10000)                      = 1
    25258:          fd=4  ev=POLLRDNORM|POLLRDBAND rev=POLLERR
    25258:  read(4, 0x00044FE8, 3999)                       Err#22 EINVAL

Of course, with this came the sobering realization that this bug was undoubtedly responsible for countless other sporadic "not reproducible" pseudo-terminal-related bugs, and an accompanying growing list of suspected duplicates. So, the key question was: could we intuit enough about this bug to make it reproducible and finally smoke this thing out?

At the time, DTrace had not yet integrated into Solaris, but was available for internal use and had an ever-growing buzz, so I was eager to see if it could shed some light on this bug. In particular, based on the truss output, my suspicion was that a toxic message was arriving on the /dev/ptmx stream that was causing STRDERR to be set, triggering POLLERR to be returned from strpoll():

        if (sd_flags & (STPLEX | STRDERR | STWRERR)) {
                if (sd_flags & STPLEX) {
                        \*reventsp = POLLNVAL;
                        return (EINVAL);
                if (((events & (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) &&
                    (sd_flags & STRDERR)) ||
                    ((events & (POLLOUT | POLLWRNORM | POLLWRBAND)) &&
                    (sd_flags & STWRERR))) {
                        if (!(events & POLLNOERR)) {
                                \*reventsp = POLLERR;
                                return (0);

Further, my suspicion was that most of the time, the application called poll() before this toxic message arrived, thus sidestepping the problem. However, occasionally, the application would get delayed (e.g., by other system activity), causing the toxic message to set STRDERR before the application was able to call poll().

Thus, I used the following simple D script to "force" the application to become stalled:

      #!/usr/sbin/dtrace -ws

And just like that, the problem became reproducible on the first try! That is, by simply widening the failure window, the problem became immediately reproducible, simultaneously confirming my hypothesis and significantly easing further analysis. (It should be noted that the proper value to chill(), which is specified in nanoseconds, is highly dependent on the underlying hardware -- for my hardware, delaying by 30ms was sufficient. Of course, to prevent widespread system misbehavior, the amount of time one can chill() is bounded.)

The Analysis

With the problem reproducible, further analysis with DTrace and other tools proceeded quickly. In less than an hour, the specifics behind the bug were well-understood, though still extraordinarily twisted.

For those who are interested in such sordid details, here's a crude diagram which shows the data flow (indicated by arrows) and the relevant streams modules (shown in brackets; more about the the modules can be gleaned from their manpages -- e.g., pts(7D)). Note that pckt(7D), while critical to the problem, is not actually on the stream -- hence the dotted lines going through it:

                  expect                   echo
                [ stream ]              [ stream ]
                [  head  ]              [  head  ]
                   |  \^                    |  \^
                   |  |                    v  |
                   |  |                 [ ldterm ]
                   |  |                    |  \^
                   :  :                    v  |
                [  pckt  ]              [  ptem  ]
                   :  :                    |  \^
                   v  |                    v  |
                [  ptm   ]   <------>   [  pts   ]

                master side             slave side
Using the above diagram, here are the steps that lead to the problem [8]:
  1. The message (e.g., "foobarbaz" or "something") is sent down the pseudo-terminal slave stream by the echo command.
  2. The pseudo-terminal slave is closed by echo (e.g., through exit(2)) and the dismantling of the stream begins; the message sent in (1) is still in-flight, perhaps even still on the slave side.
  3. Eventually, ldterm's close routine (ldtermclose()) is called, which causes a TCSBRK M_IOCTL to be sent downstream (see Jim Carlson's blog for lots more on that misery).
  4. The ptem module receives the M_IOCTL, acks it, and then passes a copy of of the M_IOCTL downstream for pckt(7M), as documented in ptem(7M).
  5. Shortly after ldtermrput() receives the M_IOCACK, ldtermclose() is allowed to complete, and the following processing elements (ptem and pts) quickly follow suit, being careful to drain (if necessary) the message and the M_IOCTL copy to the master side stream.
  6. Both the M_IOCTL and original data message reach the master side stream. Eventually, the data message makes it to the stream head and a pollwakeup() is issued to notify the application (expect) that data is available. However, as discussed before, for whatever reason the application is delayed.
  7. Since the pckt(7M) module is actually not on the master side, the M_IOCTL reaches the master side's stream head. Since it's not expecting to receive this message, strrput_nondata() sends an M_IOCNAK back downstream.
  8. The M_IOCNAK reaches ptm. However, since the slave-side has been closed, ptmwput() frees the message and sends an M_ERROR upstream to indicate the problem.
  9. The stream head receives the M_ERROR and strrput_nondata() sets STRDERR on the stream.
  10. The application thread finally receives the notification from step (6) and is awoken, but since STRDERR is set, the block of code shown earlier causes strpoll() to return POLLERR rather than POLLIN. Thus, the observed behavior.

All That ... For This?

Despite the tortuous nature of the bug, the fix itself proved to be trivial[9]. In particular, while there are several possible simple fixes, the least risky is to change the ptmwput() logic in step (8) to use the obscure 2-byte form of the M_ERROR message to only mark the write-side as having an error. That is, to change:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\\n"));
                 (void) putnextctl1(RD(qp), M_ERROR, EINVAL);

... to[10]:

         if ((ptmp->pt_state & PTLOCK) || (ptmp->pts_rdq == NULL)) {
                 DBG(("got msg. but no slave\\n"));
                 mp = mexchange(NULL, mp, 2, M_ERROR, -1);
                 if (mp != NULL) {
                         mp->b_rptr[0] = NOERROR;
                         mp->b_rptr[1] = EINVAL;
                         qreply(qp, mp);

         [ See the putnextctl(9F), qreply(9F), and mexchange(9F)
           manpages to get a better handle on this code. ]

This seems eminently reasonable since ptm is unable to forward a message through the write-side, but the read-side is still usable -- so why wasn't it done this way from day one? The reason is simple: the 2-byte form of the M_ERROR message post-dates the original code.

Again, thanks to the reproducible test case, it was quick and easy to verify that this fixed the problem.

Due Diligence

As I previously mentioned, most of the pseudo-terminal code dates back to AT&T and has simply been hitched to the nearest wrecker and gingerly towed to the latest release. Indeed, inspecting an internal gates, it's clear this bug goes all the way back to the introduction of the STREAMS-based pseudo-terminal subsystem in 1987! While this is certainly an extreme case, it is not rare for apparently-intermittent bugs to terrorize subsystems for years before they are finally root-caused.

In this case, scanning the bug database suggests the conditions needed to reproduce this bug did not become widespread until Solaris 9 -- at which point, numerous bizarre application failures began to be reported, such as:

    4530041 Unable to restore/save backup during DSR upgrade to S9/S10 due to internal error
    4654418 ripv2 test routed.perf.3 failed intermittently
    4695559 spurious failures of tcp.serverclose
    4813669 ripv2 test routed.perf.3 failed intermittently
    4897163 ppp tests often return with "FAIL: Unable to ping"
    4892148 unexpected POLLERR causing early EOF from spawned processes in "expect"
    4926624 ssh exits with -1 if stdin is not a terminal
    4985311 rdisc.3 testcase randomly fails
    5012511 ripv2: some test cases in rtquery subdir failed randomly and intermittently

Of course, there were surely many more anomalies due to this bug which were either never filed, "fixed" by other means (workaround), or closed out as "not reproducible" -- we will never know the full scope of the chaos caused by this persistent little bug.


Hopefully, this treatise has clarified three key points regarding hard-to-reproduce bugs:

  1. Almost any bug can be reproduced once it is properly understood. Resist the temptation to close a bug out as non-reproducible -- it will come back to bite someone!
  2. DTrace is a powerful tool for investigating hard-to-reproduce bugs, and the chill() action is particularly ideal for investigating bugs that you suspect are timing-related.
  3. Once the problem is understood, scanning the bug database for similar symptoms can often reveal years -- or even decades -- of other bits of inexplicable behavior. Like nabbing a serial killer, the search through the cold cases can prove tedious but ultimately provides a sense of closure to others who have been victimized. Moreover, the search provides a basis for understanding the true impact of the bug and thus how many releases the fix needs to be patched back to.

Happy hunting!


[1] I literally put an end to bugs during a recent trip to China -- though sadly I was not able to dine on anything quite this large. (Upon seeing that photo, Jim Carlson quipped: "I guess if you can put a stick through it, it's food." -- a disturbingly accurate observation.)
[2] Of course, as a developer, much of my time is spent on the other side of the fence: introducing new code, which will invariably have a new crop of challenging bugs. Thus, there is another element to bugfixing: paying my dues to society.
[3] There exists a very small subset of bugs that are tied to factors that are beyond software's ability to control, and thus cannot be reproduced on demand, even when they are sufficiently understood.
[4] Unfortunately, our internal "list" of reasons for closing out a bug is currently too limited, forcing "not reproducible" to be pressed into service for "fixed as a side effect of X", or "insufficient information available to evaluate", among others.
[5] Actually, we have two pseudo-terminal subsystems on Solaris: an System V one, and a BSD one. However, the BSD flavor is provided only for compatibility and for purposes of this discussion can be ignored.
[6] An excellent description of pseudo-terminals can be found in the recently-revised Advanced Programming in the Unix Environment.
[7] For the record, I've never formally worked on STREAMS -- but Sasha Kolbasov and myself have become the unofficial caretakers of the much-maligned STREAMS core.
[8] To keep the description from growing too bloated, I have left out the code snippets, but inspecting the linked functions will allow more detailed study of the problem. Of course, the bug has been fixed, so the broken code in ptmwsrv() no longer exists.
[9] It is surprising how often hard bugs have trivial fixes. Surely there is a reason for this, but it's not obvious to me.
[10] Note that this fix makes use of mexchange(9F), which has long been available internally, but which I added to the Solaris DDI (along with a collection of other STREAMS utility routines) for Solaris 10. Hopefully these routines will make writing new STREAMS code a little less painful. Perhaps worthy of a future blog entry.

Technorati Tag:
Technorati Tag:
Technorati Tag:


its... ewww........... bugs!!!! i hate eat a bug
if i touch a bug i feel my hand is so very dirty...yak!!!! i hate a bug....ever!!

Posted by nicole on September 04, 2007 at 04:53 PM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014

No bookmarks in folder


No bookmarks in folder