Two bad Solaris bugs that affect dbx users

Updated information about both bugs.  See below:


Two bugs have popped up in Solaris 10 since it went to FCS. One causes dbx to hang, and the other causes dbx to crash. The hang bug is 6329593 (pr_wait_die() can hang while waiting for SIGKILL to be processed) and the crash bug is 6283570 (misaligned ELF64 section heads). You can find information about these bugs in the bug database on the opensolaris.org web site.

The hang bug is described in more detail on the What's New page for Sun Studio. It doesn't have any workaround that I know of, other than getting the bug fixed. The bug is supposed to be fixed in Nevada build 25. I don't know what that translates to for OpenSolaris releases.

The crash bug will have a workaround in a dbx patch coming out for Sun Studio 11. Watch this space for availability, or check the SunSolve web site for Sun Studio 11 patches. I wrote a Perl script to detect whether any of your sparcv9 libraries suffer from this problem. The bad section alignment shows up in both SPARC and x86 64-bit libraries, but it only causes dbx to crash on SPARC machines.

#!/usr/bin/perl

# This script looks for sparcv9 libraries that will make
# dbx crash.  They result from Solaris bug:
#   6283570 misaligned ELF64 section heads
#
# This script only looks in /lib and /usr/lib. But libraries
# in other directories might also suffer from the same bug.

use strict;
use warnings;
use File::Find;

sub wanted {
  # Only consider executable regular files named like libfoo.so.1
  return unless -x && -f;
  return unless /.*\.so\.[0-9]$/;
  return unless `/bin/file $_ | grep 64-bit`;
  # e_shoff is the file offset of the section header table
  my $out = `elfdump -e $_ | grep e_shoff`;
  return unless $out =~ m/e_shoff:\s+(0x[0-9a-f]+)\s/;
  # An offset ending in 4 or c is 4-byte aligned but not 8-byte aligned
  if ($1 =~ m/[4c]$/) {
     print "bad alignment in file: $File::Find::name\n";
  }
}

print "Looking for bad ELF section header table alignment in 64-bit files\n";
find(\&wanted, ( "/lib", "/usr/lib" ));
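
For readers who would rather not depend on parsing elfdump output, the same check can be sketched in C by reading e_shoff straight out of the ELF64 header. This is only a minimal sketch under stated assumptions, not part of the original script: the function name elf64_shoff_misaligned and the enum names are made up for illustration, and it flags any offset that is not a multiple of 8, which is slightly broader than the script's "ends in 4 or c" test.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte positions and values taken from the ELF64 header layout.
 * The identifiers themselves are illustrative, not from <elf.h>. */
enum {
    IDX_EI_CLASS    = 4,   /* e_ident[EI_CLASS] */
    IDX_EI_DATA     = 5,   /* e_ident[EI_DATA] */
    VAL_CLASS64     = 2,   /* ELFCLASS64 */
    VAL_DATA_MSB    = 2,   /* ELFDATA2MSB (big-endian, e.g. SPARC) */
    OFF_E_SHOFF     = 40,  /* byte offset of e_shoff in an ELF64 header */
    SIZE_ELF64_EHDR = 64   /* size of the ELF64 header */
};

/*
 * Returns 1 if hdr holds an ELF64 header whose e_shoff is not 8-byte
 * aligned, 0 if it is aligned, and -1 if hdr is not a complete ELF64 header.
 */
int elf64_shoff_misaligned(const unsigned char *hdr, size_t len)
{
    uint64_t shoff = 0;
    int i;

    if (len < SIZE_ELF64_EHDR || memcmp(hdr, "\177ELF", 4) != 0)
        return -1;
    if (hdr[IDX_EI_CLASS] != VAL_CLASS64)
        return -1;

    /* Assemble the 8-byte field most-significant byte first,
     * honoring the byte order recorded in the header. */
    for (i = 0; i < 8; i++) {
        int idx = (hdr[IDX_EI_DATA] == VAL_DATA_MSB) ? i : 7 - i;
        shoff = (shoff << 8) | hdr[OFF_E_SHOFF + idx];
    }
    return (shoff % 8 != 0) ? 1 : 0;
}
```

To use it, read the first 64 bytes of a library (for example with fread()) and pass the buffer and its length to the function.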

Late-breaking update: I've figured out how to use the mediacast server, and I built a temporary, hacked, unsupported (well, you get the idea) dbx binary that doesn't fall over dead when it sees a misaligned ELF section header table. If you are running into this problem, you can download the bootleg binary and try it out. The usual caveats apply. I wouldn't recommend pasting it on top of the real dbx in a SS11 install directory unless you have to. If you do have to, then save the original dbx binary and put it back before you apply the next Sun Studio patch. Okay, I'm done. That satisfies my "common sense" paranoia coefficient. You can find the binary here: bootleg dbx
(Don't use this link, get the patches below!)

Note:

Addendum as of Mar 14th, 2006: The patches for dbx to work around the crash bug are now available for Sun Studio 10 and Sun Studio 11, for both SPARC and x86 platforms.

Note:

Addendum about the hang bug. Dave Ford wrote up a good summary, and I'll repost the information here:

Description

There is a kernel bug for Solaris 10 that causes dbx to hang immediately after loading program information for the program the user is debugging. The bug initially was found in build 18 of S10U1, but was also released in kernel patches for SPARC and x86.

Response

After we detected this bug in build 18 of S10U1, we ensured that the bug was fixed by FCS, so that dbx would work with S10U1.

For users who applied bad patches to S10 and are experiencing this problem, use at least:

  • SPARC kernel patch 118822-23
  • x86 kernel patch 118844-27

Note that 118844-27 requires several other patches:

  • 119255
  • 117435
  • 118344
  • 113000

Workaround

When dbx hangs, you can type control-c twice, or you can use the prun command on the dbx process ID.

Note:

The hang bug shows up as a side effect of the fix for Solaris bug 6272865 (race condition ...), and the hang itself is tracked as 6329593. So if your version of Solaris has patches that "fix" 6272865 but lacks the fix for 6329593, then you need to get some more patches. On Solaris 9, the patch that fixes the first bug also includes the fix for the regression that it causes, so you shouldn't see the bug happen on Solaris 9. For Solaris 9, the patches in question are: 120884-02 (for x86) or 117125-03 (for SPARC).

Note:

The Solaris 10 patches needed to fix the crash bug (for older versions of dbx) are:

  • sparc 118371-04 or newer (currently at -06)
  • intel 118372-04 or newer (currently at -06)


Comments:

Happy New Year, Chris:

Problem not solved!

My dbx woes apparently arise from Solaris 10's AMD-Athlon related implementation of cpc() and associated functionality.

The problem seems to me to stem from the coding of /usr/kernel/pcbe/pcbe.AuthenticAMD.15.

While I look forward to reading your far more cogent analysis (if you have time), it seems for the moment clear to me that the associated code does not recognize and use the AMD "performance counters." Thus, as you would expect, cpustat doesn't work either, and delivers an error message which says "cannot access performance counters."

Applying truss to the failing software reveals cpc() to be the point of failure.

That can be fixed. Some of the Open Solaris folks have reported that they have interim fixes and have others on their list.

Question is, what can I do here, other than acquire new hardware?

Still frustrated,

George

Posted by George Frink on February 01, 2007 at 07:08 AM PST #

It sounds like you are running into this bug:
6335196 dbx 7.5 dumps core in cpc_close upon startup, when cpc driver is not available

We are working on a patch for dbx. The bug is in code related to the performance data collector. The code which uses the CPC library doesn't check for errors properly, so if the driver fails to initialize, then dbx crashes. The bug report doesn't list a workaround.

From reading the bug report, I think the code in question is called "cpc_close()" with a bogus handle. If you write a shared library with a function called cpc_close() which just returns, and then put that shared library into LD_PRELOAD before you run dbx, this might short-circuit the problem until we get a patch out there.

Posted by Chris Quenelle on February 01, 2007 at 07:08 AM PST #

Oops. I said "code in question is called cpc_close". I meant to say:

...is callING cpc_close() with a bogus handle.

If you supply a dummy cpc_close routine which ignores its parameters, then dbx might start up normally. It's worth a shot.

Posted by Chris Quenelle on February 01, 2007 at 07:08 AM PST #
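
The dummy cpc_close() suggested in the comment above could be sketched roughly as follows. This is a minimal, untested illustration rather than a supported workaround: the real libcpc function takes a cpc_t * handle, and void * is used here only as an assumption so the file builds without <libcpc.h>. The file names in the usage note are hypothetical as well.

```c
/*
 * Hypothetical stand-in for libcpc's cpc_close(), intended only to be
 * loaded through LD_PRELOAD so that it interposes on the real function.
 * It ignores the (possibly bogus) handle and reports success.
 *
 * Assumption: the handle is declared as void * instead of cpc_t * so
 * that this file does not depend on <libcpc.h>.
 */
int cpc_close(void *cpc)
{
    (void)cpc;   /* deliberately unused */
    return 0;
}
```

With the Sun Studio compiler, something like `cc -G -Kpic -o cpc_stub.so cpc_stub.c` should produce the shared object, and `LD_PRELOAD=./cpc_stub.so dbx` would then start dbx with the stub in place. Whether that is enough to avoid the crash is exactly as uncertain as the original suggestion.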

Very impressive, Chris.
I have the most recent/patched Sun Studio 11 supported by Solaris 10 on a 1.1 GHz i386/i387 fpu (so says psrinfo -v). My lovely dbx is, well, suicidal. It always concludes with "cpc() Err#89 ENOSYS" followed by the sad news of a segmentation fault (signals #12, #16, #11).
Any wise advice and remedies?
Frustrated,
George

Posted by George Frink on February 01, 2007 at 07:08 AM PST #

About

Chris Quenelle
