Brokenness Hides Itself

When engineers get together and talk, one of the things they like to bring out and share is war stories; tales of debugging daring-do and the amazing brokenness that can be found in the process. I recently went through an experience that makes good war story material, and I thought I'd share it.

A couple weeks ago, there were multiple reports of svc.configd(1M) failing repeatedly with one of:

svc.configd: Fatal error: invalid integer "10" in field "id"
svc.configd: Fatal error: invalid integer in database

Since I'm one of the main developers of svc.configd(1M), I started to investigate. I first had the people hitting it send me their repositories, but they all checked out as having no problems. The problem was only being seen on prototype Niagara machines and some Netra T1s; the first reproducible machine I got console access to was a Niagara box.

Figuring out what happened

Unfortunately, the box was running old firmware, which significantly restrained its usability; I spent more time fighting the machine than working on tracking down the problem. I finally boot neted the machine, mounted the root filesystem, and added a line:

sulg::sysinit:/sbin/sulogin </dev/console 2<>/dev/console >&2
to /etc/inittab to get a shell very early in boot. After booting and logging into that shell, I was able attach mdb(1) to svc.configd(1M):
# mdb -p `pgrep svc.configd`
Loading modules: [ svc.configd ]
Now, I knew that svc.configd uses a utility function, uu_strtouint(), to do all of its integer-to-number conversions, and that the most likely cause of the failure seen was some kind of unprintable character in its first argument. So I set a breakpoint there and started it running:
> uu_strtouint::bp
> :c
mdb: stop at`uu_strtouint
mdb: target stopped at:
uu_strtouint:     save      %sp, -0x68, %sp
> ::step[1]
mdb: target stopped at:
uu_strtouint+4:   ld        [%fp + 0x60], %l6
> $C 2
feefb968`uu_strtouint+4(1cb4e4, feefba9c)
feefb9d0 string_to_id+0x24(1cb4e4, feefba9c)
feefba38 fill_child_callback+0x20(feefbbe4, 2)
feefbaa0 sqlite_exec+0xd8(13de08, 2)
feefbb18 backend_run+0x74(89b48, 169848)
feefbb80 scope_fill_children+0x38(5e9f00, 1000)
So the first argument to uu_strtouint() is 1cb4e4. I attempted to print out the string there:
> 1cb4e4/s
mdb: failed to read data from target: no mapping for address
"failed to read data from target"? A reported address of 0x2e instead of 0x1cb4e4? Clearly, the problem is affecting mdb(1) as well. Some additional tests in mdb(1) revealed the magnitude of the problem:
> 1=J
> 10=J
> 11=J
> 111=J
> 101=J
i.e. the results were completely busted, and had a lot of high-word bits set unexpectedly. Interestingly, it looked like only multiple-digit numbers were affected.

So I proceeded to investigate uu_strtouint(). The first step was to see how the function was failing; there are a number of different ways to get to uu_set_error(), which sets up libuutil's equivalent of errno. A simple breakpoint led to the following code segment:

   269         if (strtoint(s, &val, base, 0) == -1)
   270                 return (-1);
   272         if (val < min) {
   273                 uu_set_error(UU_ERROR_UNDERFLOW);
   274                 return (-1);
   275         } else if (val > max) {
   276                 uu_set_error(UU_ERROR_OVERFLOW);
   277                 return (-1);
   278         }
The failure was occurring at line 276; i.e. an overflow. The value (for an input string of "10") was 0x10000000a, that is, 1\*2\^32 + 10. So something was going terribly wrong in strtoint(). Since mdb(1) was also failing, the problem was probably in some shared library routine. Looking at the disassembly, there are very few external calls:
> strtoint::dis ! grep call | sed 's/`//g'
strtoint+0xc:     call      +8            <strtoint+0x14>
strtoint+0x204:   call      +0x12a30      <PLT:__udiv64>
strtoint+0x2b8:   call      +0x12988      <PLT:__umul64>
strtoint+0x404:   call      -0xd30        <uu_set_error>
strtoint+0x414:   call      -0xd40        <uu_set_error>
strtoint+0x424:   call      -0xd50        <uu_set_error>
strtoint+0x440:   call      -0xd6c        <uu_set_error>
strtoint+0x450:   call      -0xd7c        <uu_set_error>
strtoint+0x460:   call      -0xd8c        <uu_set_error>
The first call is part of the PIC[2] calling sequence, and uu_set_error() is only called in the failure paths, which we knew weren't being hit. So __umul64() and __udiv64() are the next suspects. These are runtime support routines in libc, which the compiler inserts calls to when it wants to do 64-bit multiplications and divisions. The code for strtoint() only has one multiply and one divide, so it's easy to see where they occur:
    103         multmax = (uint64_t)UINT64_MAX / (uint64_t)base;
    105         for (c = \*++s; c != '\\0'; c = \*++s) {
    116                 if (val > multmax)
    117                         overflow = 1;
    119                 val \*= base;
    120                 if ((uint64_t)UINT64_MAX - val < (uint64_t)i)
    121                         overflow = 1;
    123                 val += i;
    124         }
The division always occurs, so I looked at the multiply routine first; disassembling it showed the following suspicious section:
> __umul64::dis ! sed 's/`//g'
__umul64+0x38:        cmp       %l7, 0
__umul64+0x3c:        call      +0xc95e4      <PLT:.umul>
__umul64+0x40:        mov       %i3, %i0
__umul64+0x44:        mov       %l6, %i1
For a function call, %o0-%o6 hold the arguments to the function, and afterwards, %o0 and %o1 hold the results. But here, there's no manipulation of the %os. In fact, the function doesn't reference them anywhere:
> __umul64::dis ! grep '%o'
Here's the smoking gun; we've hit some kind of compiler bug. The relevant source code is:
36 extern unsigned long long __umul32x32to64(unsigned, unsigned);
70 unsigned long long
71 __umul64(unsigned long long i, unsigned long long j)
72 {
81                 if (i1)
82                         result = __umul32x32to64(i1, j1);
88         return (result);
89 }

     29         .inline __umul32x32to64,8
     30         call    .umul,2
     31         nop
     32         mov     %o0, %o2
     33         mov     %o1, %o0
     34         mov     %o2, %o1
     35         .end
From previous experience with the compiler's inlining (I codereviewed the fix for another inliner-related bug, 6225876) I knew that there is an optimization stage after the inline code is generated; from the looks of it, that stage thought that the call had no effect on register state, and so optimized away the argument set up. As a result, the final return value of __umul64 is just some junk register state. So I updated the original bug, and sent it over to the compiler people:

6323803 compiler bug causes __\*mul64 failure; svc.configd dies

After some investigation, the compiler folks accepted the bug, and noted that it only effects call statements in inline assembly. Checking the rest of the '.il' files in OS/Net, I verified that this was the only place where we used call.

I still needed to figure out why this was only effecting certain platforms, and how we were going to deal with the problem in the short-term, so that we weren't waiting on a compiler patch.

Why only sun4v and Netras?

I first wrote a small C program, to test:

% cat > math_test.c <<EOF
#include <stdio.h>

main(int argc, char \*argv[])
        unsigned long long x, y, z;

        x = 10;
        y = 1;
        z = x \* y;
        printf("%llx\\n", z);

        return (0);
% cc -o math_test math_test.c
% ./math_test
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1:   -> libc_psr:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1:   <- libc_psr:__mul64() = 0
... (moving over to niagara machine) ...
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1:   -> libc:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1:   <- libc:__mul64() = 1
A hah! For sun4u, libc_psr is overriding the libc definitions. So sun4v won't work for two reasons:
  1. none of the sun4v libc_psrs override __\*mul64
  2. even if they did, sun4v has a wacky "mount the correct libc_psr over /platform/sun4v/lib/ which wouldn't occur until later in boot anyway.
It ends up that the reason __\*mul64 are overriden in libc_psr is a hold-over from Solaris 9 and earlier, where 32-bit Solaris was supported on pre-UltraSPARC chips. In fact, the call to .umul is a hold-over from the days of sparcv7, where the umul instruction wasn't guaranteed to be there. The sun4u libc_psr version has sparcv8plus versions of __mul64, which take advantage of the 64-bit processing that sparcv9 adds, in a form compatible with 32-bit code.

The fact that these weren't included in the sun4v libc_psr is an oversight, but #2 means that it wouldn't have mattered if they did. The Netra T1's running into this problem are explained by the fact that there are a set of missing /platform/\*/lib symlinks for the following platforms:

Which are all various "Netra T1" varieties. Since the link to the sun4u libc_psr is missing, they exhibit the same problem as the sun4v does. I filed 6324790 to cover adding the missing links, since it is a performance problem. (programs will not be able to take advantage of the faster mem{cpy,cmp,move,set} versions contained there)

The final question is "Why now?". What changed to make this a problem? Four days before the first reported incident, 6316914 was putback, which switched the build from the Studio 8 to the Studio 10 compilers. Because of the libc_psr masking, no-one noticed the problem until they put the bits on the (much rarer) platforms with the bug.

The Fix

To fix this, you simply move the __{u,}{mul,div}64 functions from libc_psr back into libc, using the v8plus versions that were in libc_psr. libc's assembly files are already being compiled in v8plus mode due to atomic_ops(3C), so it just required shuffling around some code, removing stuff from makefiles, and deleting the old, out of date code. This was done under bugid 6324631, integrated in the same build as the compiler switch, so only a limited number of OS/Net developers were effected. Life was all better. Well, almost.

The follow-on

In testing my fix, I did a full build, bfued a machine, and just dumped fixed binaries on other machines. The one thing I didn't test was BFUing from broken bits to fixed bits. And, of course, there was an unforeseen problem (bugid 6327152). To understand what went wrong, I'm going to have to do some background on how BFU works.

bfu is a power-tool whose essential job is to dump a full set of binaries over a running system, even if the new bits are incompatible with the ones running the system. To do this, it copies binaries, libraries, and the dynamic linker into a set of subdirectories of /tmp: /tmp/bfubin, /tmp/bfulib, and /tmp/bl. It then uses a tool called "bfuld" to re-write the "interepreter" information for the binaries in /tmp/bfubin, to point at the copied It then sets LD_LIBRARY_PATH in the environment, to re-direct any executed programs to the copied libraries, and sets PATH=/tmp/bfubin. This gives BFU a protected environment to run in.

The problem is that auxiliary filters (like libc_psr) were not disabled, so programs running in the BFU environment were picking up the libc_psr from /platform. Once the \*new\* libc_psr was extracted, programs were no longer protected from the broken __\*mul64() routines. Since things like scanf(3C) use __mul64 internally, this caused breakage all over the place, most noticeably in cpio(1).

The fix for this is reasonably simple; set LD_NOAUXFLTR=1 in the environment to prevent auxiliary filters from being used,[3] make a copy of into /tmp/bfulib, and use LD_PRELOAD=/tmp/bfulib/ to override the bad libc functions. The latter part of this can be removed once we're sure no broken libcs are running around.


I hope you've enjoyed this. The bug ended up being surprisingly subtle (as many compiler bugs are), but luckily the fix was relatively simple. The Law of Unintended Consequences applies, as always.


[1] ::steping over the "save" instruction is a standard SPARC debugging trick; it makes the arguments to the function and the stack trace correct.
[2] Position-Independent Code, which is how shared libraries are compiled.
[3] Ironically, if we had done this \*before\* the compiler switch was done, BFU would have immediately failed when run on the broken bits, and the whole problem would have been noticed much more quickly.

[ Technorati ]


Post a Comment:
  • HTML Syntax: NOT allowed



Top Tags
« February 2015