Sendmail Died in a Two SIGALRM Fire

In honor of today's full release of OpenSolaris I thought I would retell the story of one of my favorite bugs that I've debugged while working at Sun on Solaris. This problem is Sun bug 4278156 and was originally debugged during development of Solaris 8 during the first week of October, 1999.

The investigation itself began with a single core file found on one of our internal test servers from a sendmail daemon that was running there. At the time, we were testing out the use of a new Solaris facility called coreadm(1M) that Roger Faulkner and I designed. If you haven't used it before, coreadm permits administrators and developers to create different core file patterns (i.e. other than ./core) to be used when creating core files, and to create a global core file repository where a copy of all core dumps will appear on a system. This is an essential valuable part of our strategy of first-fault debugging at Sun, how we try to ensure a single failure can be debugged to root-cause.

Aside from illustrating how we do this with a good story which ultimately benefited everyone using Sendmail, I'll also go into some details about:

  • How we debug core files at a low-level with mdb
  • How signal handling works between the kernel and user processes
  • How sometimes one bug helps you find another

The problem began simply enough: a single core file from sendmail sitting in our global core file directory /var/core on a machine running build 31 of Solaris 8. The stack trace was as follows:

$ mdb core
> $c
sfgets+0x80(0, 0, 0, 0, f0878, d1800)
That is, the stack was truncated at a single frame in sfgets(). Before launching into a detailed analysis of how exactly sendmail died, here are a few pieces of background information to help understand the analysis:

  • Our sendmail is configured (in conf.h) such that in sendmail's source code, jmp_bufs are #defined to be sigjmp_bufs, setjmp() is #defined to be sigsetjmp(env, 1), and longjmp() is #defined to be siglongjmp(). This is important to keep in mind when reading the sendmail code snippets.

  • This bug report presumes a reasonably detailed knowledge of how signal delivery (in this case on SPARC 32-bit) is implemented by the kernel and libc. To understand how this works and how a debugger can locate signal frames on the stack in Solaris, refer to my comments in libproc which are found here.

  • To debug the problem, I wrote several new mdb dcmds and walkers. It's always worth the effort to do so: as part of this debugging effort, we not only got sendmail fixed, we also shipped these new debugging features, thereby benefitting all future development efforts. You can learn more about mdb by reading the Solaris Modular Debugger Guide. I'll go into further details about mdb and how to write commands for it in a future blog.

By looking at the core file I was able to observe the immediate cause of death: the process died from a SIGBUS at sfgets+0x80 because it attempted to store to %fp - 0xc, and somehow its frame pointer has been corrupted (it was set to 0x1, a clearly bogus value):


> $c
sfgets+0x80(0, 0, 0, 0, f0878, d1800)
> <pc/i
sfgets+0x80:    st        %o0, [%fp - 0xc]
> <fp=K
                1

Specifically, sendmail was attempting to store the pointer to the string "local" to the stack for a call to sm_syslog(), as show below:
char \*
sfgets(buf, siz, fp, timeout, during)
{
        /\* ... \*/

        if (timeout != 0)
        {
                if (setjmp(CtxReadTimeout) != 0)
                {
                        if (LogLevel > 1)
                                sm_syslog(LOG_NOTICE, CurEnv->e_id,
                                       "timeout waiting for input from %.100s during %s",
                                       CurHostName ? CurHostName : "local",
                                       during);

        /\* ... \*/
}

You can observe that it was executing code following a read timeout, which caused a siglongjmp back to the CtxReadTimeout sigjmp_buf, and caused the setjmp (remember, really a sigsetjmp) to return non-zero. Somehow when sendmail returned from an interrupt (a SIGALRM used to timeout sfgets()), the fram e pointer ended up as 0x1. The invalid frame pointer is also the reason why the stack backtrace only shows the single sfgets() frame. The readtimeout() routine used as a SIGALRM callback is quite simple, and I disassembled it to determine the address of CtxReadTimeout (since the binary was stripped). Here's the source code for readtimeout():

static void
readtimeout(time_t timeout)
{
        longjmp(CtxReadTimeout, 1);
}
mdb permits developers to add a custom symbol missing from a stripped symbol table to a private symbol table so you can use its name for later debugger queries. So I added a symbol for CtxReadTimeout at 0xcc800+304 with size 76 bytes. Once a symbol is added, the address-to-symbol conversions elsewhere in the debugger can reference it.
> 0xcc800+304::nmadd -s 0t76 CtxReadTimeout
added CtxReadTimeout, value=ccb04 size=4c
> ::nm -P
Value      Size       Type  Bind  Other Shndx    Name
0x000ccb04|0x0000004c|NOTY |GLOB |0x0  |UNDEF   |CtxReadTimeout
In order to confirm that sendmail came through readtimeout(), I decided to verify this hypothesis in two ways: (1) by confirming that the current %sp matched the %sp saved inside of CtxReadTimeout, and (2) by confirming that %o7 was still set to somewhere inside of siglongjmp(). Here is the annotated debugger output:
> CtxReadTimeout/3p
CtxReadTimeout:
CtxReadTimeout: 1               0xffbe8630      sfgets+0x50
                \^ flags         \^ saved %sp     \^ saved %pc <--[ for reference ]
> <sp=K 
                ffbe8630        <-- same as the saved %sp above
> <o7=p
                libc.so.1`_libc_siglongjmp+0x1c
With this confirmed, my next move was to figure out what the %sp was before we made the siglongjmp of death, so I could figure out what the stack looked like before sendmail died. To do this, I wrote a new mdb walker called oldcontext to produce each lwp's lwpstatus_t.pr_oldcontext value, which is the address of the last ucontext_t saved on the stack by a signal handler. Again, to learn about how signal frames are saved on a stack, look here in the Solaris source code.
> ::walk oldcontext
ffbe82c0
Using the knowledge of how signal frames are set up, I knew that prior to sendmail's death, its stack basically looked like this:
   ucontext_t   \\__ signal                               \^ high addresses
   struct frame /   frame                                |
   struct frame <-- frame saved by libc`sigacthandler    |
   struct frame <-- frame saved by siglongjmp            v low addresses
In 32-bit SPARC Solaris, sizeof (struct frame) = 0x60, so the old stack pointer should have been located at the address ffbe82c0 - (60 \* 3) = ffbe81a0. And sure enough:
> 0xffbe81a0$C
00000000 0(0, 0, 0, 0, 0, 0)
ffbe81a0 0(d0800, 654c6, 6386c, 0, 0, 0)
ffbe8200 libc.so.1`sigacthandler+0x28(e, 0, ffbe82c0, cfb60, 21858, 24cf4)
ffbe8578 libc.so.1`_wait+0xc(654d7, f07c8, d00c8, f07c8, 654d7, 0)
ffbe85e0 endmailer+0x7c(188c6c, d0ae0, 0, d0ae0, a9c08, 5)
ffbe8640 smtpquit+0x88(f0878, d1800, 1, d0ae0, 188c6c, 54)
ffbe86a0 reply+0x3fc(cf400, cfc00, cfc00, 1, 0, 188c6c)
ffbe9700 smtpquit+0x64(f0878, d1800, 0, d0ae0, 188c6c, 78)
ffbe9760 mci_flush+0x58(2, 4, cfb28, 0, 1, 0)
ffbe97c0 finis+0xdc(cec00, cfabe, ffbe98e8, ff1ba000, 0, 0)
ffbe9828 libc.so.1`sigacthandler+0x28(f, ffbe9ba0, ffbe98e8, ff1ba000, 0, 0)
ffbe9c20 libc.so.1`sigacthandler+0x18(1, ffbe9f98, ffbe9ce0, 0, 21d2c, 245e0)
ffbea018 libc.so.1`_read+0xc(cfba0, 18a0f4, cfba0, 0, 213a8, 21b94)
ffbea078 libc.so.1`fgets+0x24c(ff1c164c, 0, cfba0, ff1ba000, 18a0f4, 7ff)
ffbea0d8 sfgets+0x164(d0000, cfba0, cfba0, e10, c44a0, a9b8f)
ffbea148 reply+0xbc(cf400, 1, ce0d9, d019c, 0, 188c6c)
ffbeb1a8 smtpgetstat+0x20(f0878, 188c6c, cf5dc, 162, 188c6c, cf5dc)
ffbeb208 deliver+0x2490(c1de0, 1000, 130208, ffbecb3c, c1cf0, d00c8)
ffbecff8 sendenvelope+0x1e0(cf5dc, 69, d98e8, a03b0, cfb60, 0)
ffbed170 sendall+0xddc(cf5dc, cfac9, ce6d0, 0, 0, 0)
ffbed1d8 dowork+0x200(ce400, 6d, 0, cfae4, 1, d98e8)
ffbed238 smtp+0x1a14(1, d98e8, 0, f1b70, cf4f8, ff00)
ffbee338 main+0x2ffc(cec00, cfc00, d0ae0, cf4f8, ce400, cfc00)
ffbefd68 _start+0xb8(0, 0, 0, 0, 0, 0)
Notice that there are 3 (!) observed instances of libc's sigacthandler() function on the stack, with signals 0x1 (HUP), 0xf (TERM), and 0xe (ALRM) pending respectively. To walk backward through the list of signals that should have been delivered, I wrote another mdb command: a ucontext walker which follows the uc_link chain. Here is the list of signal ucontext_t's I found by following the ucontext_t.uc_link field and my notes on each:
> ::walk oldcontext | ::walk ucontext
ffbe82c0 <-- SIGALRM that caused siglongjmp of death    [1]
ffbe82b8 <-- ?                                          [2]
ffbe98e8 <-- siginfo ffbe9ba0 = SIGTERM from PID 424186 [3]
ffbe9ce0 <-- siginfo ffbe9f98 = SIGHUP  from PID 424186 [4]
I've annotated each ucontext_t address with what I knew so far, and a bit more. I knew that ucontext 1 was the most recent, put there by the SIGALRM that caused sendmail's death. Looking back on the stack, I saw that there were siginfo_t's (the 2nd parameter to sigacthandler()) corresponding to the third and fourth ucontext_t's. I then decoded the contents of each one and determined that they were externally generated SIGTERM and SIGHUP signals respectively, both from process 424186. Interestingly, the SIGTERM was received at sigacthandler+0x18; that is, while we were still in libc setting up to handle the SIGHUP, but before we called sendmail's SIGHUP handler, we took a SIGTERM. Finally, the second ucontext_t was a mystery to me at this point: it didn't have a corresponding sigacthandler() frame on the stack.

In investigating this oddity, I discovered an interesting (and useful) bug in the Solaris SPARC siglongjmp implementation: normally, at the completion of a signal handler routine, a setcontext(2) to the interrupted context is performed. This setcontext() call effectively restores the interrupted context's uc_link value to lwp_oldcontext in the kernel, thus "deleting" the ucontext_t from the "list" of saved signal contexts. However, at that time, Solaris did not store the uc_link obtained from a getcontext(2) call in sigsetjmp() into the sigjmp_buf. Furthermore, libc's siglongjmp() performed another getcontext(), modified the resulting context, and then performed a setcontext() after updating the %pc and %sp, all without touching uc_link. Thus, at this time, Solaris SPARC callers of siglongjmp from a signal handler never delete ucontexts from the uc_link list, even when these signals are retired. (After the sendmail problem had been fixed, a bug on this was also filed and fixed).

At first glance this uc_link bug seemed bad to me, until I considered that it gave me a critical piece of data I wouldn't have had otherwise: I knew that a signal was delivered and retired after the SIGTERM but before the SIGALRM that caused sendmail's death, and I knew that its ucontext_t was saved at ffbe82b8 on the stack. So I was now ready to make a couple of observations and a hypothesis:

  • The initial signal (SIGHUP) was delivered while sendmail was blocked inside of read(2), as called from sfgets(), and thus the sfgets() SIGALRM had not yet fired. I also knew that much later, after sendmail decided to quit because of the subsequent SIGTERM, that reply() called smtpquit():
       ...
       ffbe8640 smtpquit+0x88(f0878, d1800, 1, d0ae0, 188c6c, 54)
       ffbe86a0 reply+0x3fc(cf400, cfc00, cfc00, 1, 0, 188c6c)
       ffbe9700 smtpquit+0x64(f0878, d1800, 0, d0ae0, 188c6c, 78)
       ...
       
    Notice, using the saved frame pointers, that the mystery ucontext_t pointer (ffbe82b8) was placed on the stack prior to reply()'s call to smtpquit(). I also knew (from groveling inside of CtxReadTimeout) that the signal mask (sigset_t) restored by our siglongjmp of death was:
       0x00004001 0x00000000 0x00000000 0x00000000
       
    which is the mask consisting of SIGTERM and SIGHUP blocked. This told me that SIGTERM and SIGHUP had not yet completed processing (as I had guessed from the stack trace), but that SIGALRM was unblocked (that is, its processing was done). I also knew from sendmail's source code that reply() itself calls sfgets() before calling smtpquit(), thus installing another SIGALRM event.

  • I now hypothesized that the mystery ucontext_t represented another SIGALRM, namely the SIGALRM corresponding to the first sfgets(), the one still pending on the stack. This SIGALRM was handled as if it was the timeout for the second sfgets (called from reply(), no longer on the stack) because sfgets always uses (and thus overwrites) the same global sigjmp_buf, and caused reply() to abort its sfgets() and call back to smtpquit(). At this point, a second SIGALRM (the one corresponding to reply()'s sfgets() call) was delivered, and this caused a second siglongjmp() back to the same stack location, a stack location we had now re-used and overwritten with local variable information. I then attempted to prove this hypothesis using sendmail's event queue and free event list.

When sendmail wishes to set up an sfgets() timeout, it uses an internal setevent() function that keeps a linked list of pending event structures. alarm() is then used to install a SIGALRM event for the time of the next event that requires a timeout. The event structure looked like this:

struct event {
        time_t          ev_time;        /\* time of the function call \*/
        void            (\*ev_func)__P((int)); /\* function to call \*/
        int             ev_arg;         /\* argument to ev_func \*/
        int             ev_pid;         /\* pid that set this event \*/
        struct event    \*ev_link;       /\* link to next item \*/
};

At the time of death, the EventQueue list was empty:
> EventQueue/K
0xd09ec:        0
but luckily, retired events were not immediately freed; instead, they were placed on the head of the FreeEventList (linked again using ev_link). Dumping out this list showed that two events had been retired:
> FreeEventList/K
0xcf4ec:        f1cd0
> f1cd0/YpnXDX
0xf1cd0:        1999 Sep 15 15:04:10    readtimeout
                0               414918          f1b50
> f1b50/YpnXDX
0xf1b50:        1999 Sep 15 15:02:20    readtimeout
                0               414918          0
Notice that (1) they both were set up to call readtimeout(), the aforementioned SIGALRM handler, and (2) sendmail conveniently records the time_t of expiration, which we have formatted using /Y in the output above. The code for reply() leading up to smtpquit() looks like this (trimmed for simplicity):
int
reply(m, mci, e, timeout, pfunc)
{
        /\*  ... \*/

                p = sfgets(bufp, MAXLINE, mci->mci_in, timeout, SmtpPhase);
                mci->mci_lastuse = curtime();

        /\* ... \*/

                if (p == NULL)
                {
                        if (errno == 0)
                                errno = ECONNRESET;

                        mci->mci_errno = errno;
                        mci->mci_state = MCIS_ERROR;
                        smtpquit(m, mci, e);
                }

        /\* ... \*/
}
Notice that when sfgets() returns, sendmail records the current time in an mci structure. Then if p is NULL, indicating sfgets() failure, it also records errno, and calls smtpquit(). Again from the stack, I was able retrieve the pointer to the struct mci in question and dump it out:
> 188c6c/6xnXXnppDnXXXXXnYp
0x188c6c:       2048    83      0       0       0       0
                6c              0
                0               0               414935
                c44a0           f0878           130920
                0               0
                1999 Sep 15 15:02:20    0
If you haven't debugged enough core files and crash dumps to have format characters wired into your brain and fingers, use ::formats in mdb to list them out. In the output above I observed two things: (1) the 0x83 at the top is mci_errno, which is decimal 131 which is ECONNRESET, and (2) the formatted time_t at the bottom (mci_lastuse) exactly matches the expiry time of the first SIGALRM event structure (0xf1b50 shown above). (1) told me that errno was zero when we came back from sfgets() with p == NULL, which means sendmail timed out, as opposed to a read() error. (2) told me that indeed the earlier SIGALRM event (last on the FreeEventList) triggered the later siglongjmp back to sfgets(), confirming my hypothesis.

So the complete chain of events leading to the core dump was:

  • sfgets() was pending, SIGHUP interrupted
  • SIGHUP interrupted by SIGTERM, causing a call to finis()
  • reply() called sfgets(), overwriting global CtxReadTimeout with new sigsetjmp
  • sfgets() was pending, SIGALRM from earlier sfgets() interrup ted
  • readtimeout() called siglongjmp(), sfgets() returned to reply() instead of to caller of earlier sfgets()
  • reply() called into smtpquit(), overwriting the portion of stack used by second sfgets()'s saved register window (including %fp) and pointed to by CtxReadTimeout's saved %sp value
  • SIGALRM from second sfgets() now delivered, causing another siglongjmp() to the same CtxReadTimeout sigjmp_buf values
  • %sp was restored to ffbe8630, but now this pointed to more recent s tack data instead of a saved register window, so %fp was restored to 0x1
  • sfgets() now managed to run along from sfgets+0x50 to sfgets+0x80, the first instruction following return from the sigsetjmp call at sfgets+0x48 that references %fp, and then died from a SIGBUS trying to perform a stor e.

Essentially the bug illustrated a pretty serious design flaw in sendmail's timeout mechanism -- with an sfgets() timeout pending, it responded to a SIGTERM and begin to execute a long sequence of cleanup code with other calls back into sfgets(). Since every sfgets() timeout re-used the same global sigjmp_buf, this buffer was then overwritten and the earlier timeout triggers a logical return to the wrong stack location. The sendmail code needed to potentially clear previously pending timeouts once it decided it was on a one-way path to quit (or block HUP and TERM while waiting on read in sfgets()), and perhaps the setevent() mechanism needed to have a per-event sigjmp_buf. In any case, the designers of this code needed to rethink the mechanism based on the evidence uncovered here. And they did just that: fixing the problem in the 8.9.3 release.

Looking back, this problem is a great illustration of how much can be discerned from a single core dump, whether it be from the kernel or a complex user application, and the value in investing the time in root-causing every failure, even those that happen once. Without coreadm this core file would likely never have been seen. And without root-causing the first time this fault was seen, this type of subtle signal condition would likely have gone unreproducible and undebugged for months or even years, until biting one of our customers. Sun's investment in a first-fault debugging strategy helps to ensure that Solaris and other OpenSource software we leverage like Sendmail are of the highest possible quality. As we move into the OpenSolaris era, I know we'll benefit from the debugging skills and brains of those who join the effort, and I hope that our ideas, tools, and techniques for debugging will spread to other Open efforts where developers believe in and value a high-quality product.

$q

Technorati Tag:
Technorati Tag:
Technorati Tag:

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

mws

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks