Tuesday Nov 29, 2011

Oracle Solaris Crash Analysis Tool 5.3 now available

I am pleased to announce that we've been able to release the latest version of the Oracle Solaris Crash Analysis Tool. I'm also happy to let everyone know that the tool has a new home in Oracle and will continue to be supported.  Details on the release are available on the Oracle Solaris Crash Analysis Tool blog.

Friday Feb 06, 2009

new release of Solaris CAT available

We just released the latest version of the Solaris Crash Analysis Tool.  For more information, see the Solaris CAT Blog. Many folks have asked why we are developing and supporting Solaris CAT when Solaris provides mdb. There are several reasons: mdb kind of assumes the user has some knowledge of the Solaris kernel; mdb is more of a debugger than an analysis tool; Solaris CAT provides aggregation features (thread scanners, device aggregators, streams checkers, etc.) that mdb does not provide; and, frankly, Solaris CAT makes analyzing Solaris crashes easier. I look at it like the editor wars of the '80s and '90s - I prefer vi but have no problem with folks using Emacs. Use what you like but don't complain about those who use something different.

Tuesday Aug 19, 2008

hey, I have a new patent

I got a surprise on Saturday.  I opened my mailbox and discovered a letter from a patent certificate company congratulating me on my latest patent.  What?  I submitted that puppy four years ago!  Man!  That's a long time to wait.  What's the patent on?  It protects the idea I developed for aggregating devices to find a common cause for hardware problems.  It's the method I implemented in the Solaris Crash Analysis Tool's dev busy command and documented in this article. The patent is available for review on Free Patents Online.

Now to bug the powers that be at Sun for my certificate and $$$ reward :) 

Wednesday Jul 16, 2008

Solaris Crash Analysis Tool 5.0 released

Well, it's been five long years since an update of Solaris Crash Analysis Tool (CAT) was released to the public and I'm happy to report that the Solaris CAT Development Team (John, Paul, and I) were finally given the time to work through the red tape and get a new release out. Yes!  Solaris CAT 5.0 is available for immediate download from here.  This new version not only supports the newer releases of Solaris, namely Solaris 10 and OpenSolaris/Nevada, it also supports both SPARC and x86/x64 architectures and includes commands that support zones, the Solaris Volume Manager, ZFS, Sun Cluster, plus many more features.  I'd recommend reading the Release Notes (/opt/SUNWscat/docs/index.html) for both this release and Release 4.2 to get yourself up to date on everything that has been changed, fixed, and added in 5.0.  Enjoy and please let the Solaris CAT Team know how you like the tool by commenting on our blog or sending email to SolarisCAT_Feedback@Sun.COM.  I'm also happy to report that we'll be trying to release versions once every six months.

What is Solaris CAT?  It's a Solaris kernel crash dump (and live kernel) access tool that provides simple, intuitive commands which can be used to quickly analyze crash dumps.  It's developed by Sun kernel engineers (those who support customers) who analyze kernel core files for a living.  It's different from mdb because it's geared more towards analysis instead of debugging.  It's also different from mdb because its development is done as a hobby by a handful of people and is officially an "unsupported" tool (though if one finds a bug and lets us know, we'll likely fix it quickly).  And so you are confident, Solaris CAT is used thousands of times a month here at Sun.  Therefore, it gets plenty of testing :)

As an example of Solaris CAT's power, the following is from a system that was hanging and where the user interrupted the kernel with a "break".  As you can see below, the dev busy command not only isolated the devices that were "hanging", it also discovered that an interface card, /SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3, is likely the culprit.  Without that analysis, the engineer working on that crash dump might waste time trying to chase "failing" devices before the interface card or lpfc was identified as the culprit.

SolarisCAT(vmcore.1/8U)> dev busy

Scanning for busy devices:
sd321 @ 0x3000537db30(scsi_disk), DEVICE BUSY, un_ncmds: 20
sd322 @ 0x3000537d370(scsi_disk), DEVICE BUSY, un_ncmds: 20
sd323 @ 0x3000537cbb0(scsi_disk), DEVICE BUSY, un_ncmds: 1
sd324 @ 0x3000537c3f0(scsi_disk), DEVICE BUSY, un_ncmds: 1
sd327 @ 0x3000542cbb8(scsi_disk), DEVICE BUSY, un_ncmds: 20
sd343 @ 0x3000546abd8(scsi_disk), DEVICE BUSY, un_ncmds: 1
sd347 @ 0x30005456be0(scsi_disk), DEVICE BUSY, un_ncmds: 3
sd348 @ 0x30005456040(scsi_disk), DEVICE BUSY, un_ncmds: 2
sd349 @ 0x300054493a8(scsi_disk), DEVICE BUSY, un_ncmds: 4
sd351 @ 0x30005436050(scsi_disk), DEVICE BUSY, un_ncmds: 9
sd353 @ 0x300054a8fd8(scsi_disk), DEVICE BUSY, un_ncmds: 5
sd358 @ 0x30005476450(scsi_disk), DEVICE BUSY, un_ncmds: 3
sd432 @ 0x300054a8818(scsi_disk), DEVICE BUSY, un_ncmds: 9
sd433 @ 0x300054a8058(scsi_disk), DEVICE BUSY, un_ncmds: 1
sd436 @ 0x30005496c00(scsi_disk), DEVICE BUSY, un_ncmds: 1
sd476 @ 0x30005534850(scsi_disk), DEVICE BUSY, un_ncmds: 6

By aggregating "busy" device paths, the following devices 
should also be investigated:

Scanning for threads in biowait:

   103 threads in biowait() found.

threads in biowait() by device:
count   device (thread: max idle time)
   36   32,2570(sd321,2) (0x3001c0a5ba0: 1 hours 35 minutes 13.49 seconds) /dev/dsk/c1t0d39s2(swap)
   25   32,2578(sd322,2) (0x2a1002cbd20: 1 hours 35 minutes 13.95 seconds) /dev/dsk/c1t0d40s2(swap)
   19   32,2618(sd327,2) (0x300979fa460: 1 hours 35 minutes 13.49 seconds) /dev/dsk/c1t0d38s2(swap)
    4   237(vxio),47006 (0x300171d9ca0: 1 hours 35 minutes 13.04 seconds) /play/oradata003
    4   237(vxio),47008 (0x30085682e60: 1 hours 35 minutes 13.26 seconds) /play/oradata005
    2   32,3812(sd476,4) (0x30015f12600: 1 hours 35 minutes 13.48 seconds) /opt/tools
    2   237(vxio),99003 (0x30014a40180: 1 hours 35 minutes 13.47 seconds) /test/archive
    2   237(vxio),10000 (0x3001c0a45a0: 1 hours 35 minutes 13.31 seconds) /appl/oradata001
    2   237(vxio),47004 (0x30099665b80: 1 hours 35 minutes 11.33 seconds) /play/oradata001
    1   237(vxio),10004 (0x3001de53900: 1 hours 35 minutes 13.31 seconds) /appl/oradata005
    1   237(vxio),51000 (0x3005c6d83c0: 1 hours 35 minutes 13.49 seconds) /bcv/prod/transfer
    1   32,3808(sd476,0) (0x3001a310c00: 49 minutes 16.60 seconds) /crash
    1   237(vxio),99001 (0x3009c625bc0: 9 minutes 4.67 seconds) /test/peace
    1   237(vxio),99002 (0x30015f36e20: 1 hours 26 minutes 42.69 seconds) /test/oracle
    1   237(vxio),8003 (0x30005fa1620: 1 hours 34 minutes 52.51 seconds) /test/redo4
    1   237(vxio),10002 (0x3001405ce40: 1 hours 35 minutes 13.31 seconds) /appl/oradata003

Scanning for procs with aio:
proc             PID       fd                  dev   state       count
====           ======      ==                  ===   =====       =====
0x3001710d520   25667     408      237(vxio),99008   pending         1
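
For the curious, here is a rough Python sketch of the general idea of aggregating busy device paths to look for a shared parent. It is only an illustration, not Solaris CAT's actual code and not the patented method. The function names and the sd@... child paths are hypothetical; only the lpfc@3 parent path comes from the example above.

```python
from collections import Counter


def ancestor_counts(paths):
    """Count how many device paths pass through each ancestor node."""
    counts = Counter()
    for path in paths:
        parts = path.strip("/").split("/")
        # Every proper prefix of a device path is one of its ancestors.
        for depth in range(1, len(parts)):
            counts["/" + "/".join(parts[:depth])] += 1
    return counts


def deepest_shared_ancestor(paths):
    """Return the deepest node that every path runs through, or None."""
    counts = ancestor_counts(paths)
    shared = [node for node, n in counts.items() if n == len(paths)]
    return max(shared, key=lambda node: node.count("/"), default=None)


# Hypothetical busy-device paths (the sd@... leaves are made up for illustration).
busy = [
    "/SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,27",
    "/SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,28",
    "/SUNW,Sun-Fire/ssm@0,0/pci@1c,700000/lpfc@3/sd@0,26",
]

print(deepest_shared_ancestor(busy))
```

If every busy device hangs off the same interface card, that card is the deepest node they share, which makes it a better first suspect than any individual disk.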

Don't ya just love toys and taking them apart to see how they work? To me it doesn't matter if it's an iPod, a laptop, or the biggest baddest thing a company makes. And nothing makes me more happy than showing how easy it is to develop stuff on Oracle Solaris.

