Wednesday, January 31, 2018

More adventures in Software FMA

By: Chris Beal | Senior Principal Software Engineer

Those of you that have ever read my own blog (Ghost Busting) will know I've a long standing interest in trying to get the systems we all use and love to be easier to fix, and ideally tell you themselves what's wrong with them.

Back in Oracle Solaris 11 we added the concept of Software Fault Management Architecture (SWFMA), with two types of event modelled as FMA defects. One was panic events, the other SMF service state transitions. This also allowed for notification of all FMA events via a new service facility (svccfg setnotify) over SNMP and email.

With a brief diversion to make crash dumps smaller, and faster, and easier to diagnose, we've come back to the SWFMA concept and extended it in two ways.

Corediag

It's pretty easy to see that the same concept for modelling System panic as FMA events could be applied to user level process core dumps. So that's what we've done. coreadm(8) has been extended so that by default a diagnostic core file is created for every Solaris binary that crashes. This is smaller than a regular core dump. We then have a service (svc:/system/coremon:default) which runs a daemon (coremond) that will monitor for these being created, and summarize them,. By default the summary file is kept, though you can use coreadm to remove them. coremond then turns these in to FMA alerts. These are considered more informational than full on defects, but are still present in the FMA logs. You can run fmadm(8) like in the screen shot below to see any alerts. This was one I got when debugging a problem with a core file.

 

Stackdiag

Over the years we've learned that for known problems, there is a very good correlation between the raw stack trace and an existing bug. We've had tools internally to do this for years. We've mined our bug database and extracted stack information, this is bundled up in to a file delivered by pkg:/system/diagnostic/stackdb. Any time FMA detects there is stack telemetry in an FMA event, and the FMA message indicates we could be looking at a software bug, it'll trigger a lookup and try to add the bug id for significant threads to the FMA message.

So if you look in the message above you'll see the description that says:

Description : A diagnostic core file was dumped in
              /var/diag/e83476f7-104d-4c85-9de4-bf7e45f261d1 for RESOURCE
              /usr/bin/pstack whose ASRU is . The ASRU is the Service FMRI for
              the resource and will be NULL if the resource is not part of a
              service. The following are potential bugs.
              stack[0] - 24522117 

The stack[0] shows which thread within the process caused in the core dump, and obviously the bugid following it is one you can search for in MOS to find any solution records. Alternatively you can see if you've already got the fix in your Oracle Solaris package repository, by using the pkg search option to search for bugid.

Hang on. If I've got the fix in the package repository why isn't it installed?

The stackdb package is what we refer to as "unincorporated". This is just a data file, and has no code within it, and the latest version you have available will be installed (not just the one matching the SRU you're on). So you can update it regularly to get the latest diagnosis, without updating the SRU on your system or rebooting it. This means you may get information about bugs which are already fixed and the fix is available when you are on older SRUs.

Software FMA

We believe that these are the first steps to a self diagnosing system, and will reduce the need to log SRs for bugs we already know about. Hopefully this will mean you can get the fixes quicker, with minimal process or fuss.

 

Join the discussion

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
 

Visit the Oracle Blog

 

Contact Us

Oracle

Integrated Cloud Applications & Platform Services