Solaris serviceability and nifty tools

Deferred dump in Solaris

Kernels of even the most robust operating systems crash from time to time, and getting a crash dump image of kernel and process memory is usually indispensable for root-causing the problem. With growing memory sizes (the M6-32 SPARC server can have up to 32TB of RAM) the amount of data that needs to be written to disk in order to store a crash dump can be pretty big: for larger systems, uncompressed crash dumps tens of gigabytes in size are not exceptional. Even with the crash dump restructured into pieces (see my last entry about crash dump restructuring for more info) the kernel portion can still be significant.

In Solaris, a ZFS zvol is dedicated as a dump device. During panic processing the crash dump is stored in compressed form on the dump device, and once the system comes up again savecore(1M) extracts it from there into the crash directory. This is all managed by dumpadm(1M). The problems with the dump device in the context of large systems are manifold: the dump device can be too small to hold the dump (usually it resides on the root pool, which is designed to store just the operating system); it can reside on an iSCSI volume (which the kernel cannot write to during panic, as it is very limited in what it can do at that time); or it can simply be too slow to write a big dump (kernel I/O is restricted to polled operation during panic processing), and dedicating a multi-gigabyte SSD to the dump device would be a waste of resources.

To overcome these limitations I've been working with my colleagues Sriman, Brian, Nick and Chris on storing the dump in memory, rebooting the system and then extracting the dump to disk. This technology is called deferred dump. It first appears in Oracle Solaris 11.2 SRU 8.4.0 (you can read more about this SRU in My Oracle Support document Doc ID 1672221.1, or generically about SRUs on Gerry's blog about Solaris 11 lifecycle management and SRUs).

In practice there are only a few hints that give away the fact that the dump device was not used for storing the dump. Here's a capture of a panic triggered in an LDOM:

root@s11:~# reboot -d
Apr 24 06:40:02 s11 reboot: initiated by root on /dev/console
panic[cpu7]/thread=c4003b9d7320: forced crash dump initiated at user request
000002a102519930 genunix:kadmin+650 (fc, 0, c400308f5e38, 4, 5, 1)
%l0-3: 0000000020895800 00000000102be000 0000000000000004 0000000000000004
%l4-7: 0000000000000600 0000000000000010 0000000000000004 0000000000000004
000002a102519a00 genunix:uadmin+1d0 (1, c4003bd72460, 0, 6d7000, ff00, 5)
%l0-3: 000000000000852e 000003000000c000 0000000000000004 0000000000000000
%l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
syncing file systems... done
Preserving kernel image in RAM, content: kernel sections: proc, zfs
0:07 91% done (kernel)
0:08 97% done (proc)
0:08 100% done (zfs)
100% done: 331865 (kernel) + 21023 (proc) + 10523 (zfs) pages dumped, dump succeeded
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.
SPARC T5-8, No Keyboard
Copyright (c) 1998, 2015, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.37.0-nightly_01.13.2015, 23.2500 GB memory available, Serial #83489682.
Ethernet address 0:14:4f:f9:f3:92, Host ID: 84f9f392.
Boot device: /virtual-devices@100/channel-devices@200/disk@0:a File and args:
SunOS Release 5.11 Version 11.2 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
NOTICE: Verified previous kernel image
Reconciling deferred dump: 99%..100% done.
Hostname: s11
Apr 24 06:40:58 s11 savecore: Until crashdump is saved to disk the multi-user-server milestone is delayed.
Apr 24 06:40:58 s11 savecore: Saving decompressed system crash dump files in directory /var/crash/data/61a392f8-3336-4d88-a0bb-d4ea84c11e60
Storing crash dump section kernel: 0:00 5% done (16956 pages)
Storing crash dump section kernel: 0:51 100% done (331865 pages)
Storing crash dump section proc: 0:08 100% done (21023 pages)
Storing crash dump section zfs: 0:00 100% done (10523 pages)

There are a couple of things to notice. First, the message about preserving the kernel image in RAM. Then the reconcile phase, where the deferred dump pieces are reunited with VM, and most importantly the message about the delayed multi-user-server milestone, which is necessary to prevent memory-hungry apps from fighting with the deferred dump while it still resides in memory.
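To get a feel for how much memory such a dump covers, the page counts from the summary line of the capture can be converted into a size. The small Python sketch below does that arithmetic. It assumes an 8 KB base page size (my assumption for this SPARC guest, not something the console output states), and the helper name dumped_pages is made up for illustration.

```python
import re

PAGE_SIZE = 8192  # assumption: 8 KB base page size on this SPARC guest


def dumped_pages(summary_line):
    """Return {section: pages} parsed from the panic summary line."""
    # Matches "<count> (<section>)" pairs; "100% done" is skipped because
    # the digits there are followed by "%" rather than " (".
    return {name: int(count)
            for count, name in re.findall(r"(\d+) \((\w+)\)", summary_line)}


line = ("100% done: 331865 (kernel) + 21023 (proc) + 10523 (zfs) "
        "pages dumped, dump succeeded")
sections = dumped_pages(line)
total_pages = sum(sections.values())
total_gib = total_pages * PAGE_SIZE / 2**30

print(sections)
print(f"{total_pages} pages, roughly {total_gib:.2f} GiB uncompressed")
```

Under that page size assumption the 363411 pages come to a bit under 3 GiB of uncompressed data, most of it in the kernel section.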

Now, you may ask: how can this work, given that the data being dumped resides in the same memory we are dumping into? Well, this is a bit like switching feet when climbing a wall: for a moment the hands take the weight of the body while the feet are swapped on a single foothold. For deferred dump, the aid is preallocated memory (this is what you see in mdb(1) in the output of the ::memstat dcmd as Defdump prealloc), careful page allocation and compression. The technique for preserving memory across reboot differs between SPARC and x86. On x86, fast reboot is a necessary prerequisite, while on SPARC this is a result of Hypervisor and OBP cooperation.
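The feet-switching idea can be illustrated with a toy model in Python. This is only a sketch of the general principle and not how Solaris implements deferred dump. It assumes that each page is compressed on its own with zlib, that compressed output is packed into page frames taken from a free pool, and that a source frame can be reused as soon as its contents have been compressed. The function name dump_in_place and the 8 KB PAGE_SIZE are hypothetical.

```python
import zlib

PAGE_SIZE = 8192  # assumed frame size for this toy model


def dump_in_place(pages, prealloc_frames):
    """Compress pages into frames drawn from a free pool.

    Returns the number of frames that end up holding compressed data.
    Raises MemoryError if the free pool runs dry along the way.
    """
    free = prealloc_frames  # frames available for compressed output
    used = 0                # frames already claimed for output
    stored = 0              # total compressed bytes produced so far
    for page in pages:
        stored += len(zlib.compress(page))
        needed = -(-stored // PAGE_SIZE)  # ceiling division
        while used < needed:
            if free == 0:
                raise MemoryError("free pool exhausted")
            free -= 1
            used += 1
        # The source page is now preserved in compressed form,
        # so its frame joins the free pool.
        free += 1
    return used


# 100 highly compressible pages, starting with a single preallocated frame.
pages = [bytes([i % 7]) * PAGE_SIZE for i in range(100)]
frames = dump_in_place(pages, prealloc_frames=1)
print(frames, "frame(s) hold", len(pages), "compressed pages")
```

Without any preallocated frame the very first page has nowhere to go and the model fails, which is the role the preallocated memory plays as the "hands" in the climbing analogy.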

Overall, deferred dump should be faster than the classic on-disk dump. This is because storing the dump to the crash directory happens when the system can use its capabilities fully (namely DMA and full ZFS). Of course, the crash directory can reside on a network volume too, since the system is almost fully initialized at that point.

Sometimes the system performs two successive reboots: one to get from the panicked kernel to the new kernel, and another to clear the residue in case the new kernel deems memory fragmented enough to cause performance issues. This happens on x86 systems only.

There are a few limitations which can prevent the system from performing deferred dump, namely:

  • fast reboot support on x86 - the system has to support switching kernels without having to go through BIOS
  • uptime on x86 - by default the system has to be up for at least 10 minutes
  • memory size - there is a minimum system memory requirement, currently about 9.5GB
  • firmware version on SPARC has to be sufficiently new to support deferred dump (9.4.x on T5/M6 and above, 8.7.x on T4)
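The conditions in this list can be written down as a small checklist. The Python sketch below is purely illustrative: the System record, its field names and the deferred_dump_blockers function are hypothetical and do not correspond to any Solaris interface. It reads "9.4.x" and "8.7.x" as "at least that major.minor version", which is my interpretation, and it treats any SPARC platform not named in the table as unknown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_MEMORY_GB = 9.5      # minimum system memory from the list
MIN_UPTIME_MINUTES = 10  # default minimum uptime on x86
# Minimum firmware (major, minor) per SPARC platform family from the list.
# Platforms newer than T5/M6 are not modeled here.
MIN_FIRMWARE = {"T4": (8, 7), "T5": (9, 4), "M6": (9, 4)}


@dataclass
class System:
    arch: str                              # "x86" or "sparc"
    memory_gb: float
    uptime_minutes: float = 0.0            # only checked on x86
    fast_reboot: bool = False              # only checked on x86
    platform: Optional[str] = None         # only checked on SPARC
    firmware: Optional[Tuple[int, int]] = None


def deferred_dump_blockers(s: System) -> List[str]:
    """Return the reasons why deferred dump would not be attempted."""
    blockers = []
    if s.memory_gb < MIN_MEMORY_GB:
        blockers.append("not enough memory")
    if s.arch == "x86":
        if not s.fast_reboot:
            blockers.append("fast reboot not supported")
        if s.uptime_minutes < MIN_UPTIME_MINUTES:
            blockers.append("uptime too short")
    elif s.arch == "sparc":
        required = MIN_FIRMWARE.get(s.platform)
        if required is None or s.firmware is None or s.firmware < required:
            blockers.append("firmware too old or unknown")
    return blockers


# A T5 system with 9.3 firmware versus the same system with 9.4 firmware.
old_fw = System(arch="sparc", memory_gb=23.25, platform="T5", firmware=(9, 3))
new_fw = System(arch="sparc", memory_gb=23.25, platform="T5", firmware=(9, 4))
print(deferred_dump_blockers(old_fw))
print(deferred_dump_blockers(new_fw))
```

Returning a list of reasons rather than a single yes/no makes it easy to see which of the prerequisites is the one getting in the way.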

Deferred dump is enabled by default, so it will become the de facto standard way of dumping on most systems.

Comments ( 4 )
  • SANTOSH LOKE Monday, April 27, 2015

    Hi, nice information on this new feature.

    However I noticed that this deferred dump is not available in kernel zones and I was wondering as to why?



  • vlad Monday, April 27, 2015

    Hi Santosh, this stems from the implementation of kernel zones. When a kz reboots, the process representing the machine goes away and with it all its pages. So there is no way to preserve page contents across reboots in a kz, which is a necessary condition for defdump.

  • Arc C. Thursday, May 7, 2015

    Very interesting feature, I like it. However the very last note is puzzling:

    >Firmware version on SPARC has to be sufficiently new to support deferred dump (9.4.x on T5/M6 and above, 8.7.x on T4)

    Just did a search on MOS today and the latest firmware I could find for T5-8 was 9.3.0.D

    Is the 9.4.x firmware an experimental one?

  • vlad Friday, May 8, 2015

    9.4.x firmware is in the process of being built, please stay tuned.
