Wednesday Jul 16, 2008

Solaris Crash Analysis Tool 5.0 released

Well, it's been five long years since an update of Solaris Crash Analysis Tool (CAT) was released to the public, and I'm happy to report that the Solaris CAT Development Team (John, Paul, and me) was finally given the time to work through the red tape and get a new release out. Yes! Solaris CAT 5.0 is available for immediate download from here. This new version not only supports the newer releases of Solaris, namely Solaris 10 and OpenSolaris/Nevada, it also supports both SPARC and x86/x64 architectures and includes commands that support zones, the Solaris Volume Manager, ZFS, Sun Cluster, plus many more features. I'd recommend reading the Release Notes (/opt/SUNWscat/docs/index.html) for both this release and Release 4.2 to get yourself up to date on everything that has been changed, fixed, and added in 5.0. Enjoy, and please let the Solaris CAT Team know how you like the tool by commenting on our blog or sending email to SolarisCAT_Feedback@Sun.COM. I'm also happy to report that we'll be trying to release a new version once every six months.

What is Solaris CAT? It's a Solaris kernel crash dump (and live kernel) access tool that provides simple, intuitive commands which can be used to quickly analyze crash dumps. It's developed by Sun kernel engineers (those who support customers) who analyze kernel core files for a living. It's different from mdb because it's geared more towards analysis than debugging. It's also different from mdb because its development is done as a hobby by a handful of people and it is officially an "unsupported" tool (though if you find a bug and let us know, we'll likely fix it quickly). And in case you need reassurance, Solaris CAT is used thousands of times a month here at Sun, so it gets plenty of testing :)

As an example of Solaris CAT's power, the following is from a system that was hanging and where the user interrupted the kernel with a "break". As you can see below, the dev busy command not only isolated the devices that were "hanging", it also discovered that an interface card, /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3, is likely the culprit. Without that analysis, the engineer working on that crash dump might have wasted time chasing "failing" devices before the interface card or the lpfc driver was identified as the culprit.

SolarisCAT(vmcore.1/8U)> dev busy

Scanning for busy devices:
sd321 @ 0x3000537db30(scsi_disk), DEVICE BUSY, un_ncmds: 20
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,27
sd322 @ 0x3000537d370(scsi_disk), DEVICE BUSY, un_ncmds: 20
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,28
sd323 @ 0x3000537cbb0(scsi_disk), DEVICE BUSY, un_ncmds: 1
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,29
sd324 @ 0x3000537c3f0(scsi_disk), DEVICE BUSY, un_ncmds: 1
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,2a
sd327 @ 0x3000542cbb8(scsi_disk), DEVICE BUSY, un_ncmds: 20
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,26
sd343 @ 0x3000546abd8(scsi_disk), DEVICE BUSY, un_ncmds: 1
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,1b
sd347 @ 0x30005456be0(scsi_disk), DEVICE BUSY, un_ncmds: 3
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,1f
sd348 @ 0x30005456040(scsi_disk), DEVICE BUSY, un_ncmds: 2
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,20
sd349 @ 0x300054493a8(scsi_disk), DEVICE BUSY, un_ncmds: 4
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,21
sd351 @ 0x30005436050(scsi_disk), DEVICE BUSY, un_ncmds: 9
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,23
sd353 @ 0x300054a8fd8(scsi_disk), DEVICE BUSY, un_ncmds: 5
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,3f
sd358 @ 0x30005476450(scsi_disk), DEVICE BUSY, un_ncmds: 3
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,40
sd432 @ 0x300054a8818(scsi_disk), DEVICE BUSY, un_ncmds: 9
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@a,5
sd433 @ 0x300054a8058(scsi_disk), DEVICE BUSY, un_ncmds: 1
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@a,6
sd436 @ 0x30005496c00(scsi_disk), DEVICE BUSY, un_ncmds: 1
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@a,9
sd476 @ 0x30005534850(scsi_disk), DEVICE BUSY, un_ncmds: 6
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@1,5a

By aggregating "busy" device paths, the following devices 
should also be investigated:
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3

Scanning for threads in biowait:

   103 threads in biowait() found.

threads in biowait() by device:
count   device (thread: max idle time)
   36   32,2570(sd321,2) (0x3001c0a5ba0: 1 hours 35 minutes 13.49 seconds) /dev/dsk/c1t0d39s2(swap)
   25   32,2578(sd322,2) (0x2a1002cbd20: 1 hours 35 minutes 13.95 seconds) /dev/dsk/c1t0d40s2(swap)
   19   32,2618(sd327,2) (0x300979fa460: 1 hours 35 minutes 13.49 seconds) /dev/dsk/c1t0d38s2(swap)
    4   237(vxio),47006 (0x300171d9ca0: 1 hours 35 minutes 13.04 seconds) /play/oradata003
    4   237(vxio),47008 (0x30085682e60: 1 hours 35 minutes 13.26 seconds) /play/oradata005
    2   32,3812(sd476,4) (0x30015f12600: 1 hours 35 minutes 13.48 seconds) /opt/tools
    2   237(vxio),99003 (0x30014a40180: 1 hours 35 minutes 13.47 seconds) /test/archive
    2   237(vxio),10000 (0x3001c0a45a0: 1 hours 35 minutes 13.31 seconds) /appl/oradata001
    2   237(vxio),47004 (0x30099665b80: 1 hours 35 minutes 11.33 seconds) /play/oradata001
    1   237(vxio),10004 (0x3001de53900: 1 hours 35 minutes 13.31 seconds) /appl/oradata005
    1   237(vxio),51000 (0x3005c6d83c0: 1 hours 35 minutes 13.49 seconds) /bcv/prod/transfer
    1   32,3808(sd476,0) (0x3001a310c00: 49 minutes 16.60 seconds) /crash
    1   237(vxio),99001 (0x3009c625bc0: 9 minutes 4.67 seconds) /test/peace
    1   237(vxio),99002 (0x30015f36e20: 1 hours 26 minutes 42.69 seconds) /test/oracle
    1   237(vxio),8003 (0x30005fa1620: 1 hours 34 minutes 52.51 seconds) /test/redo4
    1   237(vxio),10002 (0x3001405ce40: 1 hours 35 minutes 13.31 seconds) /appl/oradata003

Scanning for procs with aio:
proc             PID       fd                  dev   state       count
====           ======      ==                  ===   =====       =====
0x3001710d520   25667     408      237(vxio),99008   pending         1
SolarisCAT(vmcore.1/8U)>
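
To make the "aggregating busy device paths" step in the output above a little more concrete, here is a minimal Python sketch of the idea. It is not how Solaris CAT implements dev busy; it only assumes the output has been saved as text, and the names SAMPLE, busy_parents, and pending_commands are made up for this illustration. It groups each busy device path by its parent node in the device tree and totals the un_ncmds counters.

```python
import re
from collections import Counter

# Three entries copied from the dev busy output above (hypothetical saved text).
SAMPLE = """\
sd321 @ 0x3000537db30(scsi_disk), DEVICE BUSY, un_ncmds: 20
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,27
sd322 @ 0x3000537d370(scsi_disk), DEVICE BUSY, un_ncmds: 20
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,28
sd476 @ 0x30005534850(scsi_disk), DEVICE BUSY, un_ncmds: 6
    /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@1,5a
"""


def busy_parents(text):
    """Count busy devices under each parent node of the device tree."""
    parents = Counter()
    for line in text.splitlines():
        path = line.strip()
        if path.startswith("/"):
            # Drop the last path component (the sd@... leaf) to get the parent.
            parent, _, _leaf = path.rpartition("/")
            parents[parent] += 1
    return parents


def pending_commands(text):
    """Sum the un_ncmds counters reported for the busy devices."""
    return sum(int(n) for n in re.findall(r"un_ncmds:\s*(\d+)", text))


if __name__ == "__main__":
    for parent, count in busy_parents(SAMPLE).most_common():
        print(count, "busy devices under", parent)
    print("outstanding commands:", pending_commands(SAMPLE))
```

When every busy path collapses onto a single parent, as it does here with lpfc@3, that shared node is a much better suspect than any individual disk.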