X

Brokenness Hides Itself

When engineers get together and talk, one of the things they like to
bring out and share is war stories; tales of debugging daring-do and
the amazing brokenness that can be found in the process. I recently
went through an experience that makes good war story material, and I
thought I'd share it.

A couple weeks ago, there were multiple reports of svc.configd(1M)
failing repeatedly with one of:


svc.configd: Fatal error: invalid integer "10" in field "id"

svc.configd: Fatal error: invalid integer in database



Since I'm one of the main developers of svc.configd(1M), I started to
investigate. I first had the people hitting it send me their repositories,
but they all checked out as having no problems. The problem was only being
seen on prototype Niagara machines and some Netra T1s; the first
reproducible machine I got console access to was a Niagara box.

Figuring out what happened

Unfortunately, the box was running old firmware, which significantly
restrained its usability; I spent more time fighting the machine than
working on tracking down the problem. I finally boot neted the
machine, mounted the root filesystem, and added a line:

sulg::sysinit:/sbin/sulogin </dev/console 2<>/dev/console >&2

to /etc/inittab to get a shell very early in boot. After booting
and logging into that shell, I was able attach mdb(1) to
svc.configd(1M):
# mdb -p `pgrep svc.configd`
Loading modules: [ svc.configd ld.so.1 libumem.so.1 libuutil.so.1
libc.so.1 ]
>

Now, I knew that svc.configd uses a utility function,
uu_strtouint(),
to do all of its integer-to-number conversions, and that the most likely cause
of the failure seen was some kind of unprintable character in its first
argument. So I set a breakpoint there and started it running:
> uu_strtouint::bp
> :c
mdb: stop at libuutil.so.1`uu_strtouint
mdb: target stopped at:
uu_strtouint: save %sp, -0x68, %sp
> ::step[1]
mdb: target stopped at:
uu_strtouint+4: ld [%fp + 0x60], %l6
> $C 2
feefb968 libuutil.so`uu_strtouint+4(1cb4e4, feefba9c)
feefb9d0 string_to_id+0x24(1cb4e4, feefba9c)
feefba38 fill_child_callback+0x20(feefbbe4, 2)
feefbaa0 sqlite_exec+0xd8(13de08, 2)
feefbb18 backend_run+0x74(89b48, 169848)
feefbb80 scope_fill_children+0x38(5e9f00, 1000)
...
>

So the first argument to uu_strtouint() is 1cb4e4. I attempted
to print out the string there:
> 1cb4e4/s
mdb: failed to read data from target: no mapping for address
0x2e:
>

"failed to read data from target"? A reported address of 0x2e instead
of 0x1cb4e4? Clearly, the problem is affecting mdb(1) as well.
Some additional tests in mdb(1) revealed the magnitude of the problem:
> 1=J
1
> 10=J
1000000001
> 11=J
1000000002
> 111=J
11000000003
> 101=J
11000000002

i.e. the results were completely busted, and had a lot of high-word bits set
unexpectedly. Interestingly, it looked like only multiple-digit numbers were
affected.

So I proceeded to investigate
uu_strtouint().
The first step was to see how
the function was failing; there are a number of different ways to get
to uu_set_error(), which sets up libuutil's equivalent
of errno. A simple breakpoint led to the following code
segment:

269         if (strtoint(s, &val, base, 0) == -1)270                 return (-1);271 272         if (val < min) {273                 uu_set_error(UU_ERROR_UNDERFLOW);274                 return (-1);275         } else if (val > max) {276                 uu_set_error(UU_ERROR_OVERFLOW);277                 return (-1);278         }

The failure was occurring at line 276; i.e. an overflow. The value (for
an input string of "10") was 0x10000000a, that is, 1\*2\^32
+ 10
. So something was going terribly wrong in
strtoint().
Since mdb(1) was also failing, the problem was probably in some shared
library routine. Looking at the disassembly, there are very few external
calls:
> strtoint::dis ! grep call | sed 's/libuutil.so.1`//g'
strtoint+0xc: call +8 <strtoint+0x14>
strtoint+0x204: call +0x12a30 <PLT:__udiv64>
strtoint+0x2b8: call +0x12988 <PLT:__umul64>
strtoint+0x404: call -0xd30 <uu_set_error>
strtoint+0x414: call -0xd40 <uu_set_error>
strtoint+0x424: call -0xd50 <uu_set_error>
strtoint+0x440: call -0xd6c <uu_set_error>
strtoint+0x450: call -0xd7c <uu_set_error>
strtoint+0x460: call -0xd8c <uu_set_error>
>

The first call is part of the PIC[2]
calling sequence, and uu_set_error() is only called in the
failure paths, which we knew weren't being hit. So __umul64()
and __udiv64() are the next suspects. These
are runtime support routines in libc, which the compiler inserts calls
to when it wants to do 64-bit multiplications and divisions. The code
for strtoint() only has one multiply and one divide, so it's easy to
see where they occur:
103         multmax = (uint64_t)UINT64_MAX / (uint64_t)base;104 105         for (c = \*++s; c != '\\0'; c = \*++s) {
...116 if (val > multmax)117 overflow = 1;118 119 val \*= base;120 if ((uint64_t)UINT64_MAX - val < (uint64_t)i)121 overflow = 1;122 123 val += i;124 }

The division always occurs, so I looked at the multiply routine first;
disassembling it showed the following suspicious section:
> __umul64::dis ! sed 's/libc.so.1`//g'
...
__umul64+0x38: cmp %l7, 0
__umul64+0x3c: call +0xc95e4 <PLT:.umul>
__umul64+0x40: mov %i3, %i0
__umul64+0x44: mov %l6, %i1
...

For a function call, %o0-%o6 hold the arguments to
the function, and afterwards, %o0 and %o1 hold the
results. But here, there's no manipulation of the %os. In
fact, the function doesn't reference them anywhere:
> __umul64::dis ! grep '%o'
>

Here's the smoking gun; we've hit some kind of compiler bug. The relevant
source code is:
usr/src/lib/libc/sparc/crt/mul64.c:
36 extern unsigned long long __umul32x32to64(unsigned, unsigned);
...
70 unsigned long long
71 __umul64(unsigned long long i, unsigned long long j)
72 {
...
81 if (i1)
82 result = __umul32x32to64(i1, j1);
88 return (result);
89 }
usr/src/lib/libc/sparc/crt/muldiv64.il:
29 .inline __umul32x32to64,8
30 call .umul,2
31 nop
32 mov %o0, %o2
33 mov %o1, %o0
34 mov %o2, %o1
35 .end

From previous experience with the compiler's inlining (I codereviewed
the fix for another inliner-related bug,

6225876
)
I knew that there is an optimization stage after the inline code is generated;
from the looks of it, that stage thought that the call had no effect on
register state, and so optimized away the argument set up. As a result,
the final return value of __umul64 is just some junk register state. So
I updated the original bug, and sent it over to the compiler people:



6323803
compiler bug causes __\*mul64 failure; svc.configd dies


After some investigation, the compiler folks accepted the bug, and noted
that it only effects call statements in inline assembly.
Checking the rest of the '.il' files in OS/Net, I verified that this was
the only place where we used call.

I still needed to figure out why this was only effecting certain platforms,
and how we were going to deal with the problem in the short-term, so that
we weren't waiting on a compiler patch.

Why only sun4v and Netras?

I first wrote a small C program, to test:

% cat > math_test.c <<EOF
#include <stdio.h>
int
main(int argc, char \*argv[])
{
unsigned long long x, y, z;
x = 10;
y = 1;
z = x \* y;
printf("%llx\\n", z);
return (0);
}
EOF

% cc -o math_test math_test.c
% ./math_test
a
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1: -> libc_psr:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1: <- libc_psr:__mul64() = 0
a
%
... (moving over to niagara machine) ...
% truss -t \\!all -u '::__\*mul64' ./math_test
/1@1: -> libc:__mul64(0x0, 0xa, 0x0, 0x1)
/1@1: <- libc:__mul64() = 1
10000000a
%

A hah! For sun4u, libc_psr is overriding the libc definitions. So sun4v
won't work for two reasons:
  1. none of the sun4v libc_psrs override __\*mul64
  2. even if they did, sun4v has a wacky "mount the correct libc_psr over/platform/sun4v/lib/libc_psr.so.1 which wouldn't occur until
    later in boot anyway.


It ends up that the reason __\*mul64 are overriden in libc_psr is a hold-over
from Solaris 9 and earlier, where 32-bit Solaris was supported on
pre-UltraSPARC chips. In fact, the call to .umul is a hold-over from
the days of sparcv7, where the umul instruction wasn't
guaranteed to be there. The sun4u libc_psr version has sparcv8plus
versions of __mul64, which take advantage of the 64-bit processing that
sparcv9 adds, in a form compatible with 32-bit code.

The fact that these weren't included in the sun4v libc_psr is an oversight,
but #2 means that it wouldn't have mattered if they did. The Netra T1's
running into this problem are explained by the fact that there are a set of
missing /platform/\*/lib symlinks for the following platforms:

SUNW,Ultra-1-Engine
SUNW,UltraAX-MP
SUNW,UltraAX-e
SUNW,UltraAX-e2
SUNW,UltraSPARC-IIi-Engine
SUNW,UltraSPARC-IIi-cEngine
SUNW,UltraSPARCengine_CP-20
SUNW,UltraSPARCengine_CP-40
SUNW,UltraSPARCengine_CP-60
SUNW,UltraSPARCengine_CP-80

Which are all various "Netra T1" varieties. Since the link to the sun4u
libc_psr is missing, they exhibit the same problem as the sun4v does. I
filed

6324790
to cover adding the missing links, since it is a performance
problem. (programs will not be able to take advantage of the faster
mem{cpy,cmp,move,set} versions contained there)

The final question is "Why now?". What changed to make this a problem? Four
days before the first reported incident,

6316914
was putback, which switched the build from the Studio 8 to
the Studio 10 compilers. Because of the libc_psr masking, no-one noticed the
problem until they put the bits on the (much rarer) platforms with the bug.

The Fix

To fix this, you simply move the __{u,}{mul,div}64 functions from
libc_psr back into libc, using the v8plus versions that were in
libc_psr. libc's assembly files are already being compiled in v8plus
mode due to atomic_ops(3C), so it just required shuffling around
some code, removing stuff from makefiles, and deleting the old, out of date
code. This was done under bugid

6324631
, integrated in the same build as the compiler switch, so only a
limited number of OS/Net developers were effected. Life was all better.
Well, almost.

The follow-on

In testing my fix, I did a full build, bfued a machine, and just dumped fixed
binaries on other machines. The one thing I didn't test was BFUing from
broken bits to fixed bits. And, of course, there was an unforeseen
problem (bugid

6327152
). To understand what went wrong, I'm going to have to do some
background on how BFU works.

bfu is a power-tool whose essential job is to dump a full set of binaries over
a running system, even if the new bits are incompatible with the ones running
the system. To do this, it copies binaries, libraries, and the dynamic
linker into a set of subdirectories of /tmp: /tmp/bfubin,
/tmp/bfulib, and /tmp/bl. It then uses a tool called "bfuld"
to re-write the "interepreter" information for the binaries in /tmp/bfubin,
to point at the copied ld.so.1(1). It then sets LD_LIBRARY_PATH in
the environment, to re-direct any executed programs to the copied libraries,
and sets PATH=/tmp/bfubin. This gives BFU a protected environment
to run in.

The problem is that auxiliary filters (like libc_psr) were not disabled, so
programs running in the BFU environment were picking up the libc_psr from
/platform. Once the \*new\* libc_psr was extracted, programs were no longer
protected from the broken __\*mul64() routines. Since things like
scanf(3C) use __mul64 internally, this caused breakage all over the place,
most noticeably in cpio(1).

The fix for this is reasonably simple; set LD_NOAUXFLTR=1 in the
environment to prevent auxiliary filters from being
used,[3] make a copy of
libc_psr.so.1 into /tmp/bfulib, and use
LD_PRELOAD=/tmp/bfulib/libc_psr.so.1 to override the bad libc
functions. The latter part of this can be removed once we're sure no broken
libcs are running around.

Conclusion

I hope you've enjoyed this. The bug ended up being surprisingly subtle (as
many compiler bugs are), but luckily the fix was relatively simple. The
Law of Unintended Consequences applies, as always.

Footnotes:

[1]
::steping over the "save" instruction is a standard SPARC debugging
trick; it makes the arguments to the function and the stack trace correct.


[2]
Position-Independent Code, which is how shared libraries are compiled.


[3]
Ironically, if we had done this \*before\* the compiler switch was done, BFU
would have immediately failed when run on the broken bits, and the whole
problem would have been noticed much more quickly.

[ Technorati


]

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.