Tuesday May 19, 2015

Oracle Solaris Crash Analysis Tool 5.5 Release

The Oracle Solaris Crash Analysis Engineering Team is happy to announce that Oracle Solaris CAT 5.5 is available for download.

The 5.5 release patches are available on MOS, and can be found by searching on the patchIDs 21099218 (Combined package supporting SPARC and X86/X64) and 21099215 (platform specific packages).

Go to MOS: https://support.oracle.com and log in
Click on the tab entitled Patches and Updates at the top
Enter 21099218 and 21099215 for patch numbers
Click on Search

From here, click on the patchID link you want, and then click on the Readme or Download button.

Release Notes

These release notes include all changes to the tool since the 5.4 release.


Oracle Solaris 12 Support

As of Oracle Solaris Crash Analysis Tool 5.5, support for Oracle Solaris 12 has been added.

Version 11 dumphdr Support

The new multi-part version 11 crashdump format and its dumphdr are now supported. The tool will open all available vmcore files with the same number by default.

Running with No Crashdump or Live Kernel

The tool can now be run without opening a crashdump or live kernel. This is useful for running the calc command or some of the data conversion or display commands.

To open the tool with no crashdump, use the --nocore command-line option. It will also open in this mode if no arguments are provided and /dev/kmem cannot be opened (typically due to not running the tool as root and thus not having permissions to open /dev/kmem).

Only a limited set of commands are available when run this way:

clock (partially)

When running with no crashdump or live kernel, the scatenv calc_ishex setting is enabled by default.

Branch Targets

Branch targets are now displayed on x86 if the dis_br_label setting is enabled.


The --scat_explore option has been renamed to just --explore. The files will still be written into a directory that starts with the name scat_explore.

Write-locked Page Locks

When printing a thread waiting for a write-locked page, the tool can find the owner only in some situations.

The lock is marked by part of the lock owner thread's address. So to find the matching owner thread, a list of all the threads is required. That list is created the first time the tlist command is run. So if a thread is displayed prior to running tlist, that lock owner cannot be determined.

For example:

CAT(vmcore.0/11V)> thread 0x1000a7e08aa0
==== user (LWP_SYS) thread: 0x1000a7e08aa0  PID: 2203 ====
fmri: lrc:/etc/rc3_d/XXXXXXXX
t_wchan: 0x3002a631144  sobj: condition var (from unix:page_lock_es+0x27c) 
t_procp: 0x1000a6677020

CAT(vmcore.0/11V)> tlist pagelock      
  thread         pri  pctcpu                     idle   PID          wchan command
  0x10006c34fc20  60   0.000          3m17.784433135s  2201  0x3002a631144 XXXXXXXXXXXXXX
  0x1000a7e08aa0  60   0.000          3m17.828439725s  2203  0x3002a631144 XXXXXXXXXXXXXX
  0x1000a23f4020 101   0.000          3m17.828442535s  2205  0x3002a631144 XXXXXXXXXXXXXX
  0x1000a8c9bc00 101   0.000          3m17.828442215s  2209  0x3002a631144 XXXXXXXXXXXXXX
  0x100078da7c20 101   0.000          3m17.784432255s  2213  0x3002a631144 XXXXXXXXXXXXXX
  0x1000a6400e00 101   0.000          3m17.828441575s  2217  0x3002a631144 XXXXXXXXXXXXXX
  0x1000a8c9b880 101   0.000          3m17.784441635s  2225  0x3002a631144 XXXXXXXXXXXXXX

   7 threads in page_lock_es() found.

threads in page_lock_es() by page:
7 threads: 0x10006c34fc20 0x1000a7e08aa0 0x1000a23f4020 0x1000a8c9bc00 0x100078da7c20...
page @ 0x3002a631100
  vnode:           0x10006b0d3740(*genunix(bss):swap_vnodeops)
  offset:          0x20015a3e000
  state:           !FREE|SWAP
  p_selock:        0x87e083a0 (EXCL owner:0x1000a7e083a0)
  p_lckcnt:        1
  p_slckcnt:       0
  p_cowcnt:        0
  p_mapping:       0x30009c639e8
  p_szc:           0 (8K)
  p_nrm:           0x3 MOD|REF
  calculated PA:   0x78f944000 (pagenum: 0x3c7ca2)

CAT(vmcore.0/11V)> thread 0x1000a7e08aa0
==== user (LWP_SYS) thread: 0x1000a7e08aa0  PID: 2203 ====
fmri: lrc:/etc/rc3_d/XXXXXXXX
t_wchan: 0x3002a631144  sobj: condition var (from unix:page_lock_es+0x27c)  p_selock owner: 0x1000a7e083a0
top owner (0x1000a7e083a0) is waiting for read-locked rwlock 0x10009987a798
t_procp: 0x1000a6677020

Piping sdump/sarray/slist


The rationale for scat pipes was mainly performance. The added convenience is a bonus, but not enough to justify the work of implementing it.

The main problem for writing scripts for scat is the time it takes to fork(). Even though only the page tables are copied, the time spent for a single fork() can be greater than 500ms for a large crashdump.

As an example, take a crashdump with 500 disks multipathed with 8 paths each. We want to check one attribute per path that can be looked up directly. This requires 500 commands (fork()s) for the disks and 4000 for the paths. With a fork() time of 500ms, the 4500 fork()s alone take 2250 seconds, plus all the time spent computing. If awk, sed and other utilities are used for each entry, the time is multiplied for each step. This is clearly not feasible, as the runtime quickly moves into hours.

The approach to solve this is to minimize the calls to fork(). This is done by allowing sdump and other commands to take an array of values to dump on standard input. This way the number of fork()s is unrelated to the number of elements to parse, but only related to the number of steps required to access the data for one single path.

For the above example, the script would be something like:

sdump ... | sarray ... | filter | sdump ... | slist ... | filter | sdump ...
With a few more steps we get ~20 fork()s and a runtime of roughly 10s. Each additional step adds 0.5s, as opposed to 2250s with the old approach.

The complexity of the steps adds very little additional time, as all steps of the pipeline run in parallel and can be executed on different CPUs.
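The arithmetic behind these figures can be checked with a short Python sketch. It only restates the numbers quoted above (500ms per fork(), 500 disks with 8 paths each, roughly 20 pipeline stages); the function and variable names are invented for this illustration.

```python
FORK_SECONDS = 0.5  # per-fork() cost quoted above for a large crashdump


def fork_overhead(forks: int, fork_seconds: float = FORK_SECONDS) -> float:
    """Return the wall-clock time spent in fork() alone."""
    return forks * fork_seconds


disks = 500
paths_per_disk = 8

# Old approach: one command (one fork) per disk and per path.
per_entry_forks = disks + disks * paths_per_disk  # 500 + 4000 = 4500

# Piped approach: the fork count depends on the pipeline depth only.
pipeline_forks = 20

print(fork_overhead(per_entry_forks))  # 2250.0 seconds
print(fork_overhead(pipeline_forks))   # 10.0 seconds
```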


One major goal in the implementation was to keep everything working exactly as before unless the new features are used. This prevents any issues with existing scripts. For most things that was easy to implement; the only issues appear with slist and sarray, but more about this later.

To tell a command to read one of its arguments from stdin, the argument is replaced by a "-". For example:

CAT(live/10U)> echo "sd_state\nssd_state" | sdump - long
dumps the values for sd_state and ssd_state. The requested action will always be repeated for each input line.

Some commands like slist and sarray can take multiple values from stdin, like:

echo <addr> <start> <stop> | sarray - - - <type>
Lines starting with "#" are treated as comments and are passed through. This may seem unimportant, but by manipulating comments you can implement if/then/else constructs. It also allows drilling down different paths in the structures in one go.
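As a rough illustration of how comment manipulation can steer a pipeline, the following Python sketch shows a filter stage that passes existing comments through and turns any line containing a marker into a comment, so that later stages skip it. The function name comment_out and the sample lines are hypothetical; this is not part of the tool.

```python
from typing import Iterable, Iterator


def comment_out(lines: Iterable[str], marker: str = "NULL") -> Iterator[str]:
    """Pass comments through; comment out lines that contain `marker`."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line        # existing comments are passed through untouched
        elif marker in line:
            yield "#" + line  # disable this line for the following stages
        else:
            yield line


# Made-up sample of what one stage might hand to the next.
sample = [
    "# sdump 0x60015ff1ce8 dev_info devi_mdi_client",
    "   devi_mdi_client = NULL",
    "   devi_mdi_client = 0x60015f3fc80",
]
for out in comment_out(sample):
    print(out)
```

A stage like this acts as the "if" of an if/then/else: entries that fail the test stay visible in the output as comments but are no longer processed.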

All commands that take values from stdin will accept a -d switch that prints out the command-line equivalent for the stdin line that is printed. Besides helping debugging scripts, this makes all intermediate steps and results available to steps later in the pipeline. This can be made default by:

scatenv typedb_dump_comment on

Type info, offsets and field names from sdump output are ignored. Trailing text after the number of fields requested is also ignored.

Setting this parameter is required, if you want to parse the output of slist or sarray.


As an example, let's print the svl_transient field of the scsi_vhci_lun_t structures of all ssd devices. First we need some helper functions:

# we may encounter NULL pointers while drilling down the tree, comment
# these out
function comnull {
  nawk '
        /^#/    {print $0; next}
        /NULL/  {print "#" $0; next}
                {print $0; next} '
}

# for sarray, we may need <address> and <count> on one line, but sdump
# will print the requested fields on two separate lines, so we merge them
function merge {
  nawk '
        /^#/            { print $0 }
        /'$1' =/        { printf("%s",$3) }
        /'$2' =/        { printf(" %s\n",$3)}'
}

# print the ssd or sd state array
function statearray {
  [ $# -ne 1 ] && echo "Usage: statearray <statearray>" && return
  sdump *$1 i_ddi_soft_state array,n_items | merge array n_items | sarray - - "unsigned long";
}

# drill down from the [s]sd device to the vhci structure and print
# either the whole structure or the fields passed as argument
function vhcilun {
  sdump - sd_lun un_sd |
  sdump - scsi_device sd_dev |
  sdump - dev_info devi_mdi_client |
  comnull |
  sdump - mdi_client ct_vprivate |
  sdump - scsi_vhci_lun $1;
}

With these functions we can now write:

statearray ssd_state | vhcilun svl_transient
# sarray 0x60015e2e000 0x100 unsigned long
# sarray 0x60015e2e000 0x100 unsigned long
# ssd18, address 0x60015e2e090:
# sdump 0x60015ff2cc0 sd_lun un_sd
# sdump 0x60015ed0108 scsi_device sd_dev
# sdump 0x60015ff1ce8 dev_info devi_mdi_client
# sdump 0x60015f3fc80 mdi_client ct_vprivate
# sdump 0x60015e58bc0 scsi_vhci_lun svl_transient
   svl_transient = 0
# ssd19, address 0x60015e2e098:
# sdump 0x60015ffb340 sd_lun un_sd
# sdump 0x60015ff7e00 scsi_device sd_dev
# sdump 0x60015ff1888 dev_info devi_mdi_client
# sdump 0x60015f3fa40 mdi_client ct_vprivate
# sdump 0x60015e5eb40 scsi_vhci_lun svl_transient
   svl_transient = 0

The script executes 9 fork()s. Done in a conventional way with 500 LUNs, this would involve a minimum of 4500 fork()s.

CAT(vmcore.0/12X)> sdump 0xffffc100229fc9c0 kthread_t t_procp | sdump - proc u_psargs
    char [0x50] u_psargs = [ '/' 'u' 's' 'r' '/' 's' 'b' 'i' 'n' '/' 'z' 'p' 'o' 'o' 'l' ' ' 'c' 'r' 'e' 'a' 't' 'e' ' ' 't' 'e' 's' 't' 'p' 'o' 'o' 'l' '.' '2' '2' '9' '2' '.' '1' ' ' '/' 'd' 'e' 'v' '/' 'z' 'v' 'o' 'l' '/' 'd' 's' 'k' '/' 't' 'e' 's' 't' 'p' 'o' 'o' 'l' '.' '2' '2' '9' '2' '/' 't' 'e' 's' 't' 'v' 'o' 'l' '.' '9' '3' '3' '3' '\0' ]

CAT(vmcore.0/12X)> sdump *cpu_list cpu cpu_thread | sdump - kthread_t t_procp | sdump - proc u_psargs
    char [0x50] u_psargs = [ 'z' 'p' 'o' 'o' 'l' '-' 't' 'e' 's' 't' 'p' 'o' 'o' 'l' '.' '2' '2' '9' '2' '.' '1' '\0' … ]

CAT(vmcore.0/12X)> sdump *cpu_list cpu disp_q | sarray - 100 dispq_t

Array element #0, address 0xffffc100000ce080:
dispq_t {
  kthread_t *dq_first = NULL
  kthread_t *dq_last = NULL
  int dq_sruncnt = 0
Array element #60, address 0xffffc100000ce620:
dispq_t {
  kthread_t *dq_first = 0xfffffffc80251bc0
  kthread_t *dq_last = 0xfffffffc80251bc0
  int dq_sruncnt = 1

CAT(vmcore.0/12X)> sdump *cpu_list cpu disp_q | sarray -d - 60 1 dispq_t dq_first | sdump -d - kthread_t t_procp | sdump -d - proc u_psargs
# sarray 0xffffc100000ce080 60 1 dispq_t dq_first
# Array element #60, address 0xffffc100000ce620:
# sdump 0xfffffffc80251bc0 kthread_t t_procp
# sdump 0xfffffffffc039eb0 proc u_psargs
    char [0x50] u_psargs = [ 's' 'c' 'h' 'e' 'd' '\0' … ]

Recycled Frames

Recycled frames are now displayed on x86 if the stk_recycled setting is enabled. They are now also searched by tlist call.

System Duty Cycle Scheduling Class

Support has been added for the System Duty Cycle (SDC) scheduling class. This includes output of the thread command with the scatenv thr_cldata setting enabled, and the classtbl command now has an SDC subcommand.

STABS Support in Oracle Solaris 9+

Because of the availability of CTF, support for STABS files has been removed from the tool for Oracle Solaris 9+. This includes the typedb command, which was used to load and reorder type databases.

Startup Files

Previously, the tool included a scat_env file to set aliases for all users of the tool. That file is now named scatstartup and is run just before the user's ~/.scatstartup file.

Additionally, the tool provides for a scatinit file, but doesn't include one. Like scatstartup, the scatinit file is run just before the user's ~/.scatinit file, but it is run earlier during the tool's initialization process. It is intended to store scatenv settings so that they can affect how the tool starts up and what is displayed in the tool's banner.

Thus, the startup files are processed in this order:

(banner and sanity checks)

Note that previous names of the ~/.scatinit (.scatrc and .fmrc) and ~/.scatstartup (.fmstartup and .fmlogin) files will still be run if present, but only if the files with the new names are not present.

New Commands

ereport [-raw|-dump]

This command works similarly to memerr, but only displays the ereport errorqs.

softstate [-c] [<softstate addr>]

softstate [-aosmntig] <softstate addr> [<module>:]<type> [[!]<field>[,<field>[,...]]]

This command dumps the contents of an i_ddi_softstate.

It can be used to display all such structures it can find in the symbol table which contain "_soft_state" or "_softstate" and are part of any module's data or bss segments. The i_ddi_softstate is displayed, along with any non-NULL array elements.

With a single argument, it is treated as the address of an i_ddi_softstate structure and displayed as above.

If the -c flag is used, instead of a list of non-NULL elements, a count of them is printed.

If an address and a type are provided, each element of the array is displayed as the specified type, just as if sdump was run on it. An optional list of fields can be used to specify which fields of the structure to display or exclude, just as with sdump.

This version of the command also allows the -aosmntig flags of sdump to be used to control whether the address, offset, size, module, type, array index, and gaps are displayed, and whether ntoh*() is performed on the numeric structure fields.

vnode <vnode_addr>

This command displays a vnode in the same format as findfiles.

vnode summary

This command displays a summary of vnodes open or mapped by processes. The list is sorted by the number of references seen, and those whose reference count is less than the scatenv vnode_summary_min setting are not shown. vnode_summary_min defaults to 10.

It is not pedantic about this, and if there's just one left that's got less than vnode_summary_min, it will be displayed anyway.
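One possible reading of this threshold rule is sketched below in Python. It is an interpretation, not the tool's code: entries are sorted by count, those at or above the minimum are shown, and if exactly one entry would otherwise be hidden it is shown as well. The function name summarize and the sample data are invented for this illustration.

```python
from typing import Dict, List, Tuple


def summarize(counts: Dict[str, int],
              minimum: int = 10) -> Tuple[List[Tuple[str, int]], int]:
    """Return (entries to show, number of entries hidden)."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    shown = [kv for kv in ordered if kv[1] >= minimum]
    hidden = ordered[len(shown):]
    # "Not pedantic": a single leftover entry is displayed anyway.
    if len(hidden) == 1:
        shown += hidden
        hidden = []
    return shown, len(hidden)


print(summarize({"vnode_a": 50, "vnode_b": 12, "vnode_c": 3}))
print(summarize({"vnode_a": 50, "vnode_b": 3, "vnode_c": 2}))
```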

This only works on Oracle Solaris 9+.

vnode vn_cache

This command displays a summary of vnodes which are allocated in the vn_cache. The list is sorted by the vnodes' v_count, and any with a v_count less than the scatenv vnode_summary_min setting are not shown. vnode_summary_min defaults to 10.

It is not pedantic about this, and if there's just one left that's got less than vnode_summary_min, it will be displayed anyway.

This only works on Oracle Solaris 9+.

Previously Undocumented Commands

These commands were present in the tool previously, but never documented.


This command displays the specified value(s) as an IP address.

An IPv4 address is a 32b value, and thus if one argument is provided, it is treated as an IPv4 address. An IPv6 address is a 128b value, and thus two separate 64b values must be provided, one representing the upper 64b, and the second the lower 64b.

The <value> will have ntoh() performed on it, which will reverse the byte order on x86.
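The conversion described here can be approximated in Python as follows. This is a sketch under stated assumptions rather than the tool's implementation: the helper names and the little_endian_host parameter are invented, the byte swap is done explicitly so the result does not depend on the machine running the example, and no byte swapping is modeled for the IPv6 halves.

```python
import ipaddress


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value (what ntohl() does on x86)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def ipv4_text(value: int, little_endian_host: bool = False) -> str:
    """Format one 32-bit value as a dotted-quad IPv4 address."""
    if little_endian_host:
        value = swap32(value)
    return str(ipaddress.IPv4Address(value))


def ipv6_text(upper: int, lower: int) -> str:
    """Format two 64-bit values (upper half, lower half) as an IPv6 address."""
    return str(ipaddress.IPv6Address((upper << 64) | lower))


print(ipv4_text(0xC0A80001))                           # 192.168.0.1
print(ipv4_text(0x0100A8C0, little_endian_host=True))  # 192.168.0.1
print(ipv6_text(0x20010DB800000000, 0x1))              # 2001:db8::1
```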

2neg <value>

This command performs a two's complement operation on the <value> and displays the results.
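For reference, a two's complement negation can be sketched in Python as below. The 64-bit default width is an assumption made for this example; the width the tool actually uses is not stated here.

```python
def two_neg(value: int, bits: int = 64) -> int:
    """Return the two's complement of `value` within a `bits`-wide word."""
    mask = (1 << bits) - 1
    return (-value) & mask


print(hex(two_neg(1)))          # 0xffffffffffffffff
print(hex(two_neg(0x10, 32)))   # 0xfffffff0
```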

ire <ire_addr>

This command displays the ire_t structure at the specified <ire_addr> and decodes the fields therein.

pctcpu <value>

This command converts that <value> to a percent and displays it.

The value stored in a kthread_t's t_pctcpu is a 32-bit scaled integer less than or equal to 1 with the binary point to the right of the high-order bit.
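With the binary point to the right of the high-order bit, 0x80000000 represents 1.0, that is, 100%. A minimal Python sketch of the conversion, assuming that scaling, is shown below; the function name is made up and the tool's own rounding and output format are not modeled.

```python
def pctcpu_percent(value: int) -> float:
    """Convert a 32-bit scaled t_pctcpu-style value to a percentage."""
    return value / (1 << 31) * 100.0


print(pctcpu_percent(0x80000000))  # 100.0
print(pctcpu_percent(0x40000000))  # 50.0
```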


Displays the var structure v. It is equivalent to running the command sdump v var. Note that the configuration information therein in most cases has been deprecated.

Interface Changes

base [dec|hex|<number>]

This command has been un-deprecated and changed slightly. While you can still change the input base and output base with scatenv ibase and scatenv obase respectively, this command changes both simultaneously.

The base can be specified as dec to get decimal (base 10), hex to get hexadecimal (base 16), or any number from 2 to 36.

The default input and output base is 16. The default base for <number> is decimal.

buf -a|-b

The buf command now has a -a flag to follow the bufs' av_forw field, or a -b flag to follow the bufs' b_forw field, to print the list of bufs.

bufs with zero b_flags are considered the end of the list since the hbuf and dwbuf structures end lists for the hbuf and dwbuf arrays, and have a zero b_flags.

calc Base Specification

A number in the input may include a base specifier by following the number by a hash symbol (#) followed by a decimal base.

This base is ignored if the number starts with 0 which specifies an octal number, 0x which specifies a hexadecimal number, or 0b which specifies a binary number.

An output base may also now be specified in the calc command by ending the expression with two hash symbols (##) followed by a decimal base.

In the result, the output base is indicated by a leading 0b for binary, a leading 0x for hexadecimal, no specifier for decimal, or a trailing ##<base> for all other bases.

The input and output bases may be any number in the range 2 to 36.


CAT> calc 'ffff&1010##2'
ffff&1010#2 = 0b1000000010000
CAT> calc 'ffff&1010##3'
ffff&1010#3 = 12122022##3
CAT> calc 'ffff&1010##32'
ffff&1010#32 = 40g##32
CAT> calc 'ffff&1010##16'
ffff&1010#16 = 0x1010 
CAT> calc 'ffff&1010##10'
ffff&1010#10 = 4112 

This example converts 332 base 5 to base 8:

CAT> calc '332#5##8'
332#5##8 = 0134

This example converts 12 (default base 16) plus 12 base 12 to default base 16:

CAT> calc '12+12#12'
12+12#12 = 0x20 

In effect, this allows the calc command to supersede the 2hex, 2dec, and 2base commands.
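The conversions in the examples above can be cross-checked with a short Python sketch. It is not the tool's parser: it handles only non-negative numbers with an optional #<base> suffix, ignores the 0, 0x and 0b prefixes, and the function names are invented. The default input base of 16 is taken from the text.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # digits for bases 2..36


def parse_number(text: str, default_base: int = 16) -> int:
    """Parse '<digits>' or '<digits>#<decimal base>'."""
    if "#" in text:
        digits, base = text.split("#", 1)
        return int(digits, int(base))
    return int(text, default_base)


def to_base(value: int, base: int) -> str:
    """Render a non-negative integer in any base from 2 to 36."""
    if not 2 <= base <= 36:
        raise ValueError("base must be in the range 2 to 36")
    if value == 0:
        return "0"
    out = []
    while value:
        value, rem = divmod(value, base)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


masked = parse_number("ffff") & parse_number("1010")
print(to_base(masked, 2))                    # 1000000010000
print(to_base(masked, 3))                    # 12122022
print(to_base(masked, 32))                   # 40g
print(to_base(parse_number("332#5"), 8))     # 134
print(hex(parse_number("12") + parse_number("12#12")))  # 0x20
```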

Due to the hash symbol also being the shell comment character, it must be used inside quotes, similar to the multiplication (*), division (/), and AND (&) symbols.

calc offset() Specification

A new operator has been added to calc to obtain the offset of a field within a structure. The format is:

offset(<type>, <field>)
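The idea of a field offset can be illustrated outside the tool with Python's ctypes module. The structure below is hypothetical and exists only for this example; in the tool, the type and field come from the type information of the crashdump or live kernel.

```python
import ctypes


class Example(ctypes.Structure):
    """A made-up structure used only to demonstrate field offsets."""
    _fields_ = [
        ("first", ctypes.c_uint32),
        ("second", ctypes.c_uint32),
        ("third", ctypes.c_uint16),
    ]


def offset(struct_type, field: str) -> int:
    """Return the byte offset of `field` within `struct_type`."""
    return getattr(struct_type, field).offset


print(offset(Example, "first"))   # 0
print(offset(Example, "second"))  # 4
print(offset(Example, "third"))   # 8
```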

classtbl SDC

The classtbl command now includes an SDC subcommand. This displays the contents of the System Duty Cycle scheduling table, including the current and target duty cycles (DutyC), the minimum, current, and maximum priority (min:pri:max), and whether the thread is asleep (s).

codepath on x86

The codepath command is now supported on x86. The -o (tail-recursion or leaf optimization) and -j (jmpl) flags are not supported.

dev snode

This new subcommand displays the table of special device files (snodes) in the system.

dis on x86

The dis command now works on x86/x64. Because an x86/x64 instruction may be up to 14 bytes in length, it will not fit into a C type. Thus, the instruction must be expressed as a hexadecimal string, with or without the leading 0x. Using a leading 0t for decimal or a leading 0 for octal will not be recognized and may give unexpected results.

Since the instruction size on x86/x64 is variable length, the length of the instruction that was disassembled is also displayed.

dispq -t

A new -t flag was added to the dispq command to display how long a thread waiting in a dispatch queue has been waiting. This is obtained from the t_waitrq field in the kthread_t structure.

door data <addr>
door node <addr>

A new data subcommand was added to allow displaying a door_data structure. For consistency, a node subcommand was also added to display door_node structure, although with no subcommand, the <addr> is still displayed as a door_node.

findfiles -n <vnode>
findfiles -v <vfs>

In addition to open files, these commands will now also search memory segments for the specified <vnode> or <vfs>. This only works on Oracle Solaris 9+.

findfiles -l

The -l flag has been removed. Structure addresses are now always displayed.

findfiles -s

A new -s flag was added. If this flag is included, memory segments' vnodes are also displayed. This only works on Oracle Solaris 9+.

flip -c on x86

The flip command now works on x86/x64 instructions. Because an x86/x64 instruction may be up to 14 bytes in length, it will not fit into a C type. Thus, the instruction must be expressed as a hexadecimal string, with or without the leading 0x. Using a leading 0t for decimal or a leading 0 for octal will not be recognized and may give unexpected results.

Since the instruction size on x86/x64 is variable length, the length of the instruction that was disassembled is also displayed.

The instruction provided is padded to 14 bytes with zeroes, so warnings are displayed if the resulting disassembled instruction is longer than the instruction provided, or if a bit was flipped beyond the provided instruction that resulted in a valid disassembled instruction.

ipc -a

This new flag for the ipc command causes the address of the relevant structure to be included in the short/table output.


This command was renamed from meminfo.

A new -s option has been added to allow sorting of the output for the user subcommand. The supported sort fields are:

sort field  description
pid         sort by PID
command     sort by the command. This is either the u_psargs or u_comm, depending on the scatenv proc_comm setting.
assize      sort by the proc.p_as.as_size
rss         sort by the calculated rss
swresv      sort by the swap reserved
anon        sort by the anonymous memory in use
file        sort by the memory in use by files
swap        sort by the amount of data on swap

If no sort type is specified, the default is anon.

This command does not work on Oracle Solaris 8.

mem log

This new subcommand for mem displays the log of writes to /dev/mem or /dev/kmem, or reads or writes to /dev/allkmem. This is typically run when a sanity check such as:

WARNING: 26 writes to /dev/kmem (run "mem log")
is seen. The RW+ column is used to indicate whether the memory was read from or written to, and a + indicates the operation was successful.

nfs rnode

This subcommand displays the NFS rnodes in memory.

pdump [rpc|smb]

The pdump command now has basic support for rpc and smb headers.

The never-implemented tr and fddi command-line options were removed.

pkma streams_dblk

If the cache name provided is streams_dblk, then all caches whose names start with streams_dblk are scanned instead of an individual cache.

pkma -p <cachename> <filename>

The pkma command can now write buffers to a file instead of trying to interpret/summarize them itself. To specify this, use the -p flag, and specify the <filename>. The file is written in libpcap format. Since there are no timestamps in the caches, all times will be written as zero.

As described above, using a <cachename> of streams_dblk will process all of the caches whose name starts with streams_dblk.

By default, at most the first 1024 bytes of the buffer are written. This value may be changed using the -m flag.

By default, an offset of two bytes from the start of the buffer is used to start copying, based on the alignment of the size of an ether_header, which is 14 bytes. This can be altered using the -o option.

If the -a flag is used, the file is appended to instead of overwritten.

The -f flag is also supported to write free buffers in addition to allocated buffers.
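To make the file layout concrete, here is a minimal Python sketch of writing buffers in the classic libpcap format (a 24-byte global header followed by a 16-byte header per record) with the defaults described above: a 1024-byte cap, a two-byte starting offset, and zero timestamps. It is not the tool's code. The function name, the little-endian byte order, the Ethernet link type, and setting the original length equal to the captured length are all assumptions of this example.

```python
import io
import struct
from typing import BinaryIO, Iterable

PCAP_MAGIC = 0xA1B2C3D4    # classic pcap magic, microsecond timestamps
LINKTYPE_ETHERNET = 1      # assumed link type for this sketch


def write_pcap(out: BinaryIO, buffers: Iterable[bytes],
               max_bytes: int = 1024, offset: int = 2) -> None:
    """Write each buffer as one pcap record with a zero timestamp."""
    # Global header: magic, version 2.4, thiszone, sigfigs, snaplen, linktype.
    out.write(struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0,
                          max_bytes, LINKTYPE_ETHERNET))
    for buf in buffers:
        data = buf[offset:offset + max_bytes]
        # Record header: ts_sec, ts_usec, captured length, original length.
        out.write(struct.pack("<IIII", 0, 0, len(data), len(data)))
        out.write(data)


capture = io.BytesIO()
write_pcap(capture, [bytes(range(20)), bytes(2000)])
print(len(capture.getvalue()))
```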

proc -s

Processes may now be sorted by p_lwpcnt by supplying lwpcnt as the sort type.

If no sort type is provided, it now defaults to sorting by PID (equivalent to -s pid).

rd/rdh/rd16/rd32/rd64/rdf/rdd/rdw -n

The new -n flag for these commands causes the values read to be converted from network byte order to host byte order prior to display.
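The effect of such a conversion can be pictured with a small Python sketch. The function and its host_byteorder parameter are hypothetical; the point is only that the same four bytes yield different numbers depending on the byte order used to interpret them.

```python
def read_value(raw: bytes, network_order: bool,
               host_byteorder: str = "little") -> int:
    """Interpret raw bytes in host order, or in network (big-endian) order."""
    order = "big" if network_order else host_byteorder
    return int.from_bytes(raw, order)


raw = bytes.fromhex("00000050")  # e.g. a 32-bit field holding 80 in network order
print(hex(read_value(raw, network_order=False)))  # 0x50000000 on a little-endian host
print(hex(read_value(raw, network_order=True)))   # 0x50
```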

rdi -m

The ability to read all instructions in a kernel module has been removed.

rwlock -n <krwnumalock_addr>

A new option has been added to the rwlock command. If the -n flag is given, then the <krwnumalock_addr> is treated as the address of a krwnumalock instead.

This is normally unnecessary, as the tool will detect a rwlock being a krwnumalock and display it appropriately.

Sanity Check Changes

The following sanity checks were added:

  • zfs_arc_max limited
  • entries in the mm_*mem*_log of writes to memory
  • ZFS dataset over or near quota
  • DTrace error injection in use
  • tmpfs using more than 1GB and more than 10% of memory
  • str_ftnever is 0
  • freebs_list is non-NULL
  • squeues with non-zero sq_count (Oracle Solaris 10+ only)

stack find Selectors

stack find now has a set of selector subcommands available. The now-deprecated selection flags have been replaced with strings similar to tlist's selectors.

selector            flag           description
align 4|8|16        -i 4|8|16      select threads whose stack frames are aligned by the specified value - 4|8 on 32-bit crashdumps, or 8|16 on 64-bit crashdumps
arg <value>         -a <value>     select thread stacks which include <value> as an argument to a stack function
call <function>     -c <function>  select thread stacks which include <function> (exact match)
frames <value>      -f <value>     select thread stacks which include at least <value> frames
module <module>     -m <module>    select thread stacks which include <module> (exact match)
stkbase             -g             select threads whose final frame is at the thread's t_stkbase

stack summary Selectors

stack summary now uses selectors similar to tlist. The selector flags have been replaced with strings similar to tlist's selectors and more selectors added.

selector            flag           description
call <function>     -f <function>  select thread stacks which include <function> (exact match)
callstr <function>  -F <function>  select thread stacks which include <function> (substring match)
module <module>     -m <module>    select thread stacks which include <module> (exact match)
modulestr <module>  -M <module>    select thread stacks which include <module> (substring match)
proc <proc>         (none)         select threads whose t_procp matches the specified <proc>. The <proc> may be specified as a decimal PID or the process address.
cmd <string>        (none)         select threads whose process's command contains <string> (substring match). This matches u_comm or u_psargs depending on the scatenv proc_comm setting.
state <state>       (none)         select threads whose t_state matches the specified state. The <state> may be any one of free, sleep, run, onproc, zomb, stopped, or wait.
sobj <sobj>         (none)         select threads which are waiting on the specified type of synchronization object. The <sobj> may be any one of none, mutex, reader, writer, cv, sema, rwlock, locks, user, or shuttle.
wchan <wchan>       (none)         select threads whose t_wchan or t_wchan0 matches the specified <wchan>.

An optional ! before a specifier causes it to exclude stacks which match that specifier.

These specifiers may be chained together to further specify which thread stacks to display in the summary, and those are ANDed together.

CAT(vmcore.0/11X)> stacks call zfs_write !call zfs_range_lock
would summarize the stacks of threads which include a call to zfs_write but don't have a call to zfs_range_lock.
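The way chained specifiers combine can be modeled as a list of predicates that must all hold, with ! inverting a single predicate. The Python sketch below illustrates that logic only; the helper names and the sample stacks (lists of function names) are invented and do not reflect how the tool stores stacks.

```python
from typing import Callable, List, Sequence

Stack = Sequence[str]
Selector = Callable[[Stack], bool]


def call(function: str) -> Selector:
    """Match stacks that contain `function` (exact match)."""
    return lambda stack: function in stack


def negate(selector: Selector) -> Selector:
    """Model the optional '!' prefix."""
    return lambda stack: not selector(stack)


def select(stacks: List[Stack], selectors: List[Selector]) -> List[Stack]:
    """Keep the stacks that satisfy every selector (logical AND)."""
    return [s for s in stacks if all(sel(s) for sel in selectors)]


stacks = [
    ["zfs_write", "zfs_range_lock"],
    ["zfs_write", "dmu_tx_wait"],
    ["zfs_read"],
]
print(select(stacks, [call("zfs_write"), negate(call("zfs_range_lock"))]))
```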

stream [-s] [-c] [-n] mblk

The stream mblk command interface was changed to allow for summarizing mblks as well as to provide more control as to whether the b_cont and b_next fields are followed.

The -l flag previously was used to cause stream mblk to follow the b_next fields for more mblks. This no longer works. Instead, the -c flag causes the b_cont field to be followed, and the -n flag causes the b_next field to be followed. If neither is provided, a single mblk is displayed. If both are provided, the b_cont is followed first.

The new -s flag to stream mblk, instead of displaying the mblks, will cause a summary of the mblks seen by count and size between db_base and db_lim of the linked dblks.

This flag will also give a summary by count and size of the db_frtnp->free_func if db_frtnp is set.
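One way to picture the traversal order for the -c and -n flags is the Python sketch below: for each message reached through b_next, its b_cont chain is walked first. This is an interpretation of the description above, not the tool's code; the Mblk class is a stand-in with only the two link fields, and the names are made up.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Mblk:
    """Stand-in for an mblk with only the link fields relevant here."""
    name: str
    b_cont: Optional["Mblk"] = None
    b_next: Optional["Mblk"] = None


def walk(head: Mblk, follow_cont: bool = False,
         follow_next: bool = False) -> Iterator[Mblk]:
    """Yield mblks, following b_cont before moving along b_next."""
    msg: Optional[Mblk] = head
    while msg is not None:
        blk: Optional[Mblk] = msg
        while blk is not None:
            yield blk
            blk = blk.b_cont if follow_cont else None
        msg = msg.b_next if follow_next else None


second = Mblk("b")
first = Mblk("a", b_cont=Mblk("a2"), b_next=second)
print([m.name for m in walk(first, follow_cont=True, follow_next=True)])
```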

stream [-d] summary

The stream summary command will now handle the addition of the -d flag. If -d is included, only streams, queues and syncqs with data in them will be displayed.

tlist Subcommands

tlist subcommands can now be chained together to indicate that the specifiers be ANDed together. For example:
tlist module zfs call cdev_ioctl
will find all threads that have the zfs module in their stack AND the function cdev_ioctl.

As part of this change, aliases were added as follows:

subcommand alias
arg arg
call call
cmd cmd
module module

The original subcommands will be removed in a future release in favor of the aliases.

Additionally, the subcommands may be preceded by a !, which causes the selection to be reversed, e.g. !swapped would select threads which are not swapped out instead of those that are.

This can also be useful for further specifying a call subcommand, such as:

CAT(vmcore.0/11X)> tlist call zfs_write !call zfs_range_lock
  52 matching threads found
    with function "zfs_write" in its stack AND
    without function "zfs_range_lock" in its stack
Thus it selects threads which have zfs_write in their stacks but not zfs_range_lock.

tlist -f

The -f flag was removed from tlist in favor of the global scatenv thr_ignore_free setting.

tlist arg on x86

This command now works on x86 using the function arguments stored in the stack. Since stack arguments are only saved on Oracle Solaris 11+, it only works on those versions.

On x86, there is no concept of "local" registers, so the "-l" flag does not enable searching local registers stored in the stack as it does on SPARC.

sdump/sarray/slist/slistt/savl/stree/shash/skma/softstate -n

This new flag for these commands causes the data dumped to be converted from network byte order to host byte order via ntoh*(). The flag may be enabled for all commands by enabling the scatenv typedb_dump_ntoh setting.

Note that this is only performed on numeric fields, and is ignored for fields which are known pointers.

sdump/stype/sarray/slist/slistt/savl/stree/shash/skma/softstate and fields

Fields may be specified by name or offset, and multiple entries are specified with a comma-separated list. If any field specification starts with a '!', the whole list is treated as a list of fields to exclude.
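The include/exclude rule can be sketched in Python as follows. This models field names only (offsets are not handled), and the function names and sample field lists are hypothetical.

```python
from typing import List, Tuple


def parse_fields(spec: str) -> Tuple[bool, List[str]]:
    """Return (exclude, names) for a comma-separated field specification."""
    names = [part.strip() for part in spec.split(",") if part.strip()]
    # A '!' on any entry turns the whole list into an exclusion list.
    exclude = any(name.startswith("!") for name in names)
    return exclude, [name.lstrip("!") for name in names]


def apply_fields(all_fields: List[str], spec: str) -> List[str]:
    """Pick the fields of a structure that the specification selects."""
    exclude, names = parse_fields(spec)
    if exclude:
        return [f for f in all_fields if f not in names]
    return [f for f in all_fields if f in names]


fields = ["dq_first", "dq_last", "dq_sruncnt"]
print(apply_fields(fields, "dq_first,dq_last"))  # ['dq_first', 'dq_last']
print(apply_fields(fields, "!dq_sruncnt"))       # ['dq_first', 'dq_last']
```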

thread locks

This new subcommand displays a list of locks being waited for by threads. Only mutexes, rwlocks, UPI mutexes, and page locks are included.

The threadlist is walked, and any threads waiting to get a lock are summarized by lock address. The output includes any locks for which more threads are waiting than the scatenv thr_locks_min setting.

After each displayed lock, a list of caller functions is included, showing the function which called the locking code. This list is also limited by the thr_locks_min setting. It is not pedantic about this, and if there's just one left that's got less than thr_locks_min, it will be displayed anyway.

This command does not work on Oracle Solaris 8.

thread summary

A count of threads reading or writing a vnode and vfs is now included in the output. vnodes and vfses with fewer threads than the scatenv thr_rw_min setting are not displayed individually. It is not pedantic about this, and if there's just one left that's got fewer than thr_rw_min, it will be displayed anyway.

thread tree

A new subcommand for thread was added to help display the interrelationships between threads. This tree shows dependencies that come from locks, being pinned by an interrupt, doors client/server relationships, and threads in dispatch queues to onproc threads.

This command does not work on Oracle Solaris 8.

zone -L

Some of the information previously displayed with zone -l is now only displayed if the new -L flag is used. This was changed to reduce the output from zone -l.

The zone_zsd data is currently the only part added by -L.

zfs -lz <zio_addr> zio

The new -l flag for zfs -z <zio_addr> zio causes the single zio at <zio_addr> to be displayed in more detail.

scatenv Changes


If a command is given which is not a shell builtin, your PATH is searched for the command. If the new scatenv builtins_only setting is enabled, such non-builtins result in an error instead.

If builtins_only is enabled, this can be bypassed by starting a command with an exclamation point (!) to indicate that the external command is intentional.

This setting is disabled by default.


This flag was renamed from alternate_cpu_walk.

func_arg_types

Function argument types are now used for more than the stack, such as the callout, taskq, iommu, intr, clock cyclic, and syscall displays in thread output.

Therefore, the setting was renamed from stk_arg_types to func_arg_types. The old name will still work for now, but will be removed in a future release.


In kma stat output, the allocation columns make the output wider than the typical 80-column screen, yet add very little useful information.

This new setting allows hiding those columns in the output, and allows the other relevant columns to be widened so they are not misaligned for large caches.

This setting is disabled by default.

pdump_max_scan

Since the packet headers should be in the early part of a dblk's data, a limit has been added on how much of the data to scan for network headers. This both provides a limited sanity check and improves performance when large buffers are scanned.

At most, the first pdump_max_scan bytes of a buffer will be scanned for network headers. This defaults to 200.
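To make the effect of the limit concrete, here is a minimal Python sketch. It only illustrates the idea of bounding a scan. The function name and the byte-signature matching are hypothetical and are not how the tool recognizes network headers.

```python
def find_header(buf: bytes, signature: bytes, pdump_max_scan: int = 200) -> int:
    """Return the offset of signature within the first pdump_max_scan bytes.

    Returns -1 if the signature does not appear entirely inside that window.
    The signature-based match is a stand-in for real header detection.
    """
    window = buf[:pdump_max_scan]  # never look past the scan limit
    return window.find(signature)
```

A match that begins beyond the limit is simply not found, which is both the sanity check and the performance gain on large buffers.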


This scatenv setting was renamed from pdump_min_pkt since it controls pkma -s behavior rather than pdump, and limits all results rather than just packets.

This value is no longer applied strictly: if printing one more line of results would print all of the remaining results, they are printed instead of a line with a count of how many more there are.

Additionally, it is not adhered to if there is room to print a few more results on a results line already started.


This flag causes the -c flag to be forced on for the proc command, meaning the contract information is always displayed.

By default, this is disabled.


This flag causes the time column in proc output to be replaced by the process's p_lwpcnt.

It is enabled by default.


This flag causes the -j flag to be forced on for the proc command, meaning the project information is always displayed.

By default, this is disabled.


This flag causes the -k flag to be forced on for the proc command, meaning the task information is always displayed.

By default, this is disabled.


This flag causes the -z flag to be forced on for the proc command, meaning the zone information is always displayed.

By default, this is disabled.


Any ZFS filesystems using more than this percentage of their quota are reported. This is in addition to reporting any filesystems that are over their quota.

The default is 95%.
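The check amounts to a simple percentage comparison. The following Python sketch shows one way to express it. The function name, the parameter name warn_pct, and the returned labels are hypothetical, and treating a quota of 0 as "no quota set" is an assumption.

```python
def quota_status(used: int, quota: int, warn_pct: int = 95):
    """Classify a filesystem by how much of its quota is in use.

    Returns "over quota", "near quota", or None. A quota of 0 is
    assumed here to mean that no quota is set.
    """
    if quota == 0:
        return None
    if used > quota:
        return "over quota"
    # Integer arithmetic avoids rounding: used/quota > warn_pct/100
    if used * 100 > quota * warn_pct:
        return "near quota"
    return None
```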


Flags in stream previously were displayed one per line with an explanation of each. With this flag set, all the values are displayed on a single line with no explanation. This is a way to shorten the overall stream output.

It is enabled by default.


This scatenv setting was renamed from near_symbol.


Flags in thread/tlist output previously were displayed one per line with an explanation of each. With this flag set, all the values are displayed on a single line with no explanation. This is a way to shorten the overall thread/tlist output.

It is enabled by default.


This new setting causes tlist, thread [summary|locks|tree], and stack summary to ignore threads which have their t_state set to TS_FREE.

It is enabled by default.


Each CPU has an assigned idle and pause thread. Most of the time, these threads are uninteresting, and can distract from threads which might be relevant. If this flag is set, CPU idle and pause threads are ignored by tlist, thread [summary|locks|tree], and stack summary.

It is disabled by default.

thr_locks_min

The new thread locks command only displays locks which have more than thr_locks_min threads trying to acquire them. It also limits the list of caller functions shown with any displayed lock to those with this number or more threads at that point in the function.

The limit is not applied strictly: if only one entry remains that has fewer than thr_locks_min threads, it is displayed anyway.

The default value for this setting is 10.

thr_rw_min

This setting changes how thread summary displays its count of threads in read/write per vnode/vfs. If a given vnode or vfs has fewer than thr_rw_min threads reading or writing it, it is not displayed individually, but is instead included in a summary count at the end of the list.

It can be set to a high value to only see the summary, or to 0 to see all vnodes/vfses being read from or written to. The default is 10.

The limit is not applied strictly: if only one entry remains that has fewer than thr_rw_min threads, it is displayed anyway.

thr_use_panic

Much of the tool checks whether it's displaying the panic_thread, and changes its behavior to use panic-related data to display its stack. However, in some cases, this obscures important data for that thread. Disabling the thr_use_panic flag causes the tool to treat the panic_thread as a regular thread for the purposes of displaying its stack.

It is enabled by default.


The typedb_charp flag was replaced with this one, and now controls whether pointer types have further information displayed about them when available.

This includes:

type information
char * string
refstr_t * string
struct vnode * v_path string
struct vfs * vfs_resource and vfs_mntpt strings
dev_info_t * devi_node_name strings up through devi_parents

It is enabled by default.


This setting forces the sdump/sarray/slist/slistt/savl/stree/shash/skma/softstate -n flag to be turned on.

It is disabled by default.


This setting changes whether vnode summary and vnode vn_cache display a given vnode. If a given vnode is seen fewer times than this setting, it is not displayed.

It can be set to a high value to only see the summary, or to 0 to see all vnodes. The default is 10.

The limit is not applied strictly: if only one entry remains that falls below this setting, it is displayed anyway.

Deprecated Command Removal

The following previously-deprecated commands have been removed:

Removed command    Replacement
base               scatenv base
bigdump bufc       buf list
bigdump inode      inode list
bigdump idleq      inode idleq
bigdump dnlc       dnlc list
bigdump dwbuf      buf dw
bigdump tmpfs      tmpfs
bigdump ssfcplog   ssfcplog
bigdump fplog      qlcfc
bigdump vfs        findfiles


Monday Dec 12, 2011

Dumping Stacks

Stack dumping seems like it should be an easy thing.  After all, you're just walking a linked list.  The truth is that stack dumping is one of the most complicated parts of Oracle Solaris Crash Analysis Tool.

Sure, generating a list of function/frame pointers is pretty easy, but it turns out there are vast amounts of useful data and clues in that stack that it's useful to dig up and display (cue the digging pirate).

The first thing to consider is that when Oracle Solaris takes a trap (any kind of fault, exception, or interrupt, which includes system calls), it has to save all of the registers to the stack.  This happens on system calls, interrupts, and of course, traps caused by approaching system crashes.  The saved registers record the state of the current thread before the system switches contexts to a different stack, so that if it ever returns, it can restore that state and continue.

That's the reason you will often see register dumps in the stack.  Getting the register values at a precise point in time is valuable information, particularly when they contain state data about something going wrong as we prepare to crash the operating system.

You can see this trap information by enabling the stk_trap scatenv setting, which is on by default.  You can also see the user thread registers for userland threads in a core by displaying the full stack data with stack -l.

Other things which arise are architecture-specific, and related to how the compiler tries to optimize things for a particular architecture.  On SPARC, for example, a useful optimization is re-using stack space for functions we'll never return to.

For example, functions which look like:

  funcA() {
    return (funcB());
  }
  callerFunc() {
    funcA();
  }

have no reason to return to funcA ever again.  In such situations, the compiler releases the stack space it reserved for funcA so that funcB can use it.  We coined the term "recycled frame" for this.

The way this would normally appear in the stack is this:

  funcB()
  callerFunc()

which means the person looking at this stack has no information about funcA ever having been called.  If they then went to look at callerFunc, they'd see it never calls funcB.  But, if you look at callerFunc+offset, you can see a call to funcA, which isn't the next thing in the stack.  Hence you can see why the tool was enhanced to show:

  funcA() - frame recycled

for these cases.  Display of the stack with recycled frames is controlled by the stk_recycled scatenv setting, which is on by default.

Note that if there are two such optimizations in the calling sequence, there's no useful way to dig up anything but the first step.  The codepath command was written to search for call linkages between such functions.

Something similar is also done for leaf functions.  Leaf functions are functions which never call anything else.  One optimization for them is that they don't necessarily have to do a save if they're short enough to operate in the volatile global and output registers (%g and %o on SPARC).

Knowledge of this is used when walking the stack as the "next" function down the stack will use the same stack pointer as we are using now.

There are other interesting bits done in the stack which are worth noting.  For example, as part of the ABI on SPARC 64-bit, stack pointers are offset by a number so as to make them easily distinguishable from 32-bit stack pointers.  That is the STACK_BIAS, which is 2047.  So if you find an address that's in the range of stack addresses, but is misaligned (in this case, it ends up with a "1" as the last digit), it probably needs the STACK_BIAS added to it to get the number you actually want.  Note that 32-bit SPARC, and x86/x64 use a STACK_BIAS of 0.

For convenience, the frame pointers that Oracle Solaris Crash Analysis Tool uses already have the STACK_BIAS applied. 
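The bias arithmetic is easy to get wrong by hand, so here is a small Python sketch of it. The helper names and the example address are hypothetical. Only the value 2047 and the "odd address" clue come from the description above.

```python
STACK_BIAS_SPARC_V9 = 2047  # 0x7ff, as described above for 64-bit SPARC


def looks_biased(sp: int) -> bool:
    """A stack-range address with the low bit set is probably biased."""
    return sp & 1 == 1


def unbias(sp: int, bias: int = STACK_BIAS_SPARC_V9) -> int:
    """Add the bias back to get the address the frame actually lives at."""
    return sp + bias


# Hypothetical biased stack pointer taken from a register dump.
raw_sp = 0x2A100506801
if looks_biased(raw_sp):
    frame_addr = unbias(raw_sp)  # 0x2a100507000, which is aligned again
```

On 32-bit SPARC and x86/x64 the bias is 0, so the same helper with bias=0 leaves the address unchanged.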

Another useful concept to know is MINFRAME.  The ABI defines how functions pass arguments between functions, and on SPARC, when a save instruction is run, it saves some space for input (%i) and local (%l) registers of the caller to be stored, plus space for output registers (%o) for any functions it calls (note that for leaf functions, this is sometimes optimized to skip the output registers - something we've tagged a MINIFRAME).

Knowing that a function uses MINFRAME is a good clue that this is a short function, and makes no use of local variables beyond what it can get away with using the local and input registers.

Knowing when we switch between stacks is also useful information.  This means we've switched from the userland stack space to kernel, or from one kernel stack to another.  This happens on traps, and also happens when we've detected a stack overflow - if we're out of stack space where we are, we still need stack space to deal with it.  In Oracle Solaris Crash Analysis Tool you can see stack switches with the stk_switch scatenv setting.  It also tells you details about whose stack it is, such as the thread's kernel stack, a CPU's interrupt or idle thread's stack, or the ptl1_stk.

The ptl1_stk is a special space used for dealing with panics when we're already processing a trap.  When we process a trap, we first switch to the kernel's nucleus, which is low-level code for handling all the details required for switching between userland and kernel.  However, when we are already in the nucleus on SPARC, and we take a trap, we could be processing a kernel stack overflow, which means we're in trouble in the low-level code, and Oracle Solaris sets aside a special stack space for dealing with those on SPARC - the ptl1_stk.

Stack overflows are another interesting area where the stack dumper can help.  Kernel stack space is a limited resource - typically only 1-2 pages of memory.  Kernel code needs to be aware of this, and not allocate too much stack space for local variables.  Problems still arise here, and you can examine stack space usage by enabling scatenv settings stk_s_fromend and stk_s_size (both disabled by default).  stk_s_fromend shows how far each frame is from the end of the stack.  stk_s_size shows the size of each frame. Each kernel stack also has an unmapped page of vmem assigned to it at the end so that any accesses past the end of the stack trigger a page fault which can't be resolved and thus results in a panic.  That page is referred to as the redzone.  This prevents stack overflows from corrupting a neighboring thread stack.
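To show what those two numbers mean, here is a minimal Python sketch that derives them from a list of frame addresses. It is not the tool's code: the function name, the input layout, and the assumption that the stack grows toward lower addresses with "the end" being its lowest usable address are all illustrative.

```python
def frame_report(frame_ptrs, stack_end):
    """Compute per-frame size and distance from the end of the stack.

    frame_ptrs: frame addresses, innermost (most recent) frame first,
                so addresses increase along the list (assumed layout).
    stack_end:  lowest usable stack address, next to the redzone (assumed).
    The outermost frame is skipped because its size needs a caller frame.
    """
    rows = []
    for fp, caller_fp in zip(frame_ptrs, frame_ptrs[1:]):
        rows.append({
            "fp": fp,
            "size": caller_fp - fp,     # what stk_s_size would describe
            "fromend": fp - stack_end,  # what stk_s_fromend would describe
        })
    return rows
```

A frame whose "fromend" value is small relative to typical frame sizes is a hint that the thread is close to running into the redzone.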

One of the most useful things the stack dumper can do is display arguments passed into a function.  On SPARC, arguments are passed in registers. However, those registers are re-used by the callee, and thus can't be relied upon to determine what was passed to the function. The passed-in values can often be determined by examining the assembly code in the caller to see what it put in the output registers (input registers for leaf functions).  Doing that manually is time-consuming, even if you've had a lot of practice.

The scatenv setting stk_args causes the stack dumper to attempt to calculate those passed-in values for you, and display them in the stack. It isn't perfect, and can't always determine the arguments, but saves a lot of time in most cases.  It only works for SPARC at this time.

There are a few other more obscure scatenv settings which control how the stack dumper behaves. 

  • stk_l_sym - decodes any numbers to a kernel symbol if possible in the long stack output
  • stk_l_symonly - any numbers that can be decoded as a kernel symbol are displayed as only the kernel symbol - without the  number
  • stk_s_addr - displays the address of each frame in the normal stack output
  • stk_s_regs - displays the values of the input registers (%i0 - %i5) in the stack output (less useful with the stack arguments available)
  • stk_s_sym - decodes any numbers to a kernel symbol if possible in the normal stack output
  • stk_trap_mmu_sfsr - display and decode the mmu_sfsr information available in SPARC trap frames
  • stk_trap_tstate - display and decode the tstate information available in SPARC trap frames

There is a vast amount of useful information available in a thread stack.  The Oracle Solaris Crash Analysis Tool stack dumper has many options to control how much and which information is displayed - probably more than anyone will ever need.  However, new pieces come up all the time and will be added.

Monday Dec 05, 2011


Have you ever wondered if there was a way to get the Oracle Solaris Crash Analysis Tool to always run a set of commands when it starts?  And have you ever seen one user's command output look different from the output printed when you run the same command? 


When the tool starts, it checks your home directory for a file named .scatrc.  If it finds one, it runs all the commands listed.  For example, here's the author's .scatrc:

export HISTFILE=.scathist$USER
alias less=more
set -o vi
scatenv human_readable on
alias ibase="base -i"
alias obase="base -o"
scatenv minsize 0x1000000
scatenv dis_synth_only on
scatenv dis_synth_cc on
scatenv dis_br_label on
scatenv stk_switch on
scatenv str_syncq on
scatenv str_data on
scatenv sym_size_full on
scatenv thr_stkdata off
scatenv thr_pri on
scatenv thr_lwp off
scatenv thr_cpu off
scatenv thr_age on
scatenv thr_syscall off
scatenv thr_flags off
scatenv table_whitespace off
scatenv scroll 0
color background light
alias cpuc='cpu | grep "cpu id"'

As you can see, the purpose is to initialize Oracle Solaris Crash Analysis Tool with the settings I'm accustomed to using.  Given that the tool is based on ksh, you can also set ksh environment settings, such as the editing mode, set ksh environment variables for later use, or create command aliases.


In the .scatrc, you see many settings made using the scatenv command, which is used to query and change the environment settings used by the tool.  To see the complete list of available settings, simply issue the scatenv command with no options. Since scatenv settings can be boolean, numeric, or string values, the variable type is provided in the command output.  You can also search for applicable settings using the -? option.  For example, to find all settings that involve threads, one would use:

 CAT(vmcore.0)> scatenv -? threads   
    Flag Name    Current  Type  Description
    dispq_empty  on       on    When displaying dispatch queues, also show  
                                CPUs that have no threads in their dispatch 
    thr_age      on       on    Show age information when dumping threads  
    thr_cpu      off      on    Show CPU information when dumping threads
    thr_flags    off      on    Show flag information when dumping threads
    thr_idle     on       on    Show idle time when dumping threads
    thr_lwp      off      on    Show lwp information when dumping threads
    thr_pri      on       on    Show priority information when dumping threads
    thr_proc     on       on    Show process information when dumping threads
    thr_stime    off      on    Show t_stime when dumping threads.
    thr_stkdata  off      on    Show stack related information when dumping
    thr_syscall  off      on    Show syscall information when dumping threads
    thr_wchan    on       on    Show wchan information when dumping threads

 To set an environment flag, simply enter:

scatenv flag_name setting


  • flag_name  - the name of the flag in question.
  • setting - the value to assign to the flag.

Friday Aug 28, 2009

Crash Dump Info Extractor

Though the Solaris Crash Analysis Tool (CAT) script scat_explore has existed in prior releases of Solaris CAT, releases 5.1 and 5.2 include a stand-alone mode that allows users to extract crash data without having to run Solaris CAT first, as well as a send_scat_explore feature which automates the process of data collection and transmits that data to Sun.  The following describes how to use send_scat_explore and how to run scat_explore in "stand-alone" mode.


Sending crash data to Sun requires an open Sun Service Request (SR). Once a valid SR is open, the customer can then run /opt/SUNWscat/bin/send_scat_explore to send the crash data.

send_scat_explore usage is as follows:

send_scat_explore [-n sr_number] [-e email] [unix.X] vmcore.X


        * -n sr_number      - sets the Sun Service Request number
        * -e email          - sets the reply-to email address that Sun
                              should use to acknowledge the receipt of the
                              crash data
        * [unix.X] vmcore.X - the crash dump from which crash data should
                              be gathered. Please note that unix.X need not
                              be supplied and the core number, X, can be
                              specified with or without the vmcore. prefix.

If the above -n and -e options are not specified, the user is prompted for the Sun SR number and reply-to address.

Note that if the reply-to address is not specified on the command line, send_scat_explore checks whether a reply-to address has been saved in the Sun Explorer configuration and, if so, offers that address as the default response, as shown in Example 1.

Example 1: Using send_scat_explore without options:

# /opt/SUNWscat/bin/send_scat_explore vmcore.0
   Found a reply address in Sun Explorer settings...
   Email address for replies? [someone@a.com]: me@a.com
   Sun Service Request number: XXXXXXXX
   Collecting scat_explore from core file instance 0...
   #Extracting crash data...
   #Successful extraction
   Sending scat_explorer data file ./oomph_1775b1b7_0xf93_vmcore.0 for SR XXXXXXXX to dreap@sun.com...

Example 2: Using send_scat_explore with command line options:

# /opt/SUNWscat/bin/send_scat_explore -n XXXXXXXX -e me@a.com vmcore.0
  Collecting scat_explore from core file instance 0...
  #Extracting crash data...
  #Successful extraction
  Sending scat_explorer data file ./oomph_1775b1b7_0xf93_vmcore.0 for SR XXXXXXXX to dreap@sun.com...


scat_explore is a script included with Solaris CAT which extracts crash data from a crash dump.  When the --scat_explore option is passed to Solaris CAT, the crash dump is opened and scat_explore is run. The collected crash data is saved in a directory with the crash dump and the directory name is displayed. scat_explore also saves a compressed tar archive of the crash data in this directory.

The scat_explore usage is:

scat --scat_explore [-v] [-a] [-d dest] [unix.N] [vmcore.]N

-v Verbose Mode: The command will print messages highlighting what it's doing.
-a Auto Mode: The command does not prompt for input from the user as it runs.
-d dest Instructs scat_explore to save its output in the directory dest instead of the present working directory.
N The number of the crash dump. The unix.N argument and the vmcore. prefix are optional.


$ scat --scat_explore -a -v 0
#Output directory: ./scat_explore_ebsmro2_808cc87b_0xde2d09d_vmcore.0
#Tar filename:     scat_explore_ebsmro2_808cc87b_0xde2d09d_vmcore.0.tar
#Extracting crash data...
#Gathering standard crash data collections...
#Panic string indicates a possible hang...
#Gathering Hang Related data...
#Creating tar file...
#Compressing tar file...
#Successful extraction

Tuesday Mar 03, 2009

Release 5.1 re-spun

A few folks inside Sun asked if there was anything we could do to make the Solaris Crash Analysis Tool package(s) smaller.  Sure! We just left the package "fat" to ease debugging, but folks don't always have the disk space to install a monster analysis tool.  We have therefore re-spun release 5.1 as 5.1b with a much smaller footprint.  The bonus is that since we had to spin new packages, we decided to include the bug fixes made since the original release.

Therefore, if you pulled a copy of 5.1, you might want to revisit the download page and pull a copy of 5.1b. 

The installed packages are now:

Solaris CAT 5.1b x86/x64
Solaris CAT 5.1b SPARC
Solaris CAT 5.1b Combined



