Friday Apr 24, 2015

Deferred dump in Solaris

Kernels of even the most robust operating systems crash from time to time, and getting a crash dump image of kernel and process memory is usually indispensable for root causing the problem. With growing memory sizes (the M6-32 SPARC server can have up to 32TB of RAM) the amount of data which needs to be written to disk in order to store a crash dump can be pretty big - for larger systems uncompressed crash dumps tens of gigabytes in size are not exceptional. Even with the crash dump restructured into pieces (see my last entry about crash dump restructuring for more info) the kernel portion can still be significant. In Solaris, a ZFS zvol is dedicated as the dump device; during panic processing the crash dump is stored in compressed form to the dump device and extracted from there by savecore(1M) into the crash directory once the system comes up again. This is all managed by dumpadm(1M). The problems with the dump device in the context of large systems are multifold - the dump device can be too small to hold the dump (usually it resides on the root pool, which is designed to store just the operating system), it can reside on an iSCSI volume (which cannot be written to during panic, as the kernel is very limited in what it can do at that time), or it can simply be too slow to write a big dump (kernel I/O is restricted to polling operation during panic processing), and dedicating a multi-gigabyte SSD to the dump device would be a waste of resources.
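
If you want to see how dumps are currently set up on a machine, running dumpadm(1M) without arguments prints the active configuration (dump content, dump device, savecore directory and whether savecore runs automatically); the exact output varies by release, so take this as a quick sketch:

  # print the current crash dump configuration
  dumpadm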

To overcome this limitation I've been working with my colleagues Sriman, Brian, Nick and Chris on storing the dump in memory, rebooting the system and extracting the dump to disk afterwards. This technology is called Deferred dump. It first appears in Oracle Solaris 11.2 SRU 8.4.0 (you can read more about this SRU in My Oracle Support document Doc ID 1672221.1, or generically about SRUs on Gerry's blog about Solaris 11 lifecycle management and SRUs).

In reality there are a few hints which give away the fact that the dump device was not used for storing the dump; here's a capture of a panic triggered in an LDOM:

root@s11:~# reboot -d
Apr 24 06:40:02 s11 reboot: initiated by root on /dev/console

panic[cpu7]/thread=c4003b9d7320: forced crash dump initiated at user request

000002a102519930 genunix:kadmin+650 (fc, 0, c400308f5e38, 4, 5, 1)
  %l0-3: 0000000020895800 00000000102be000 0000000000000004 0000000000000004
  %l4-7: 0000000000000600 0000000000000010 0000000000000004 0000000000000004
000002a102519a00 genunix:uadmin+1d0 (1, c4003bd72460, 0, 6d7000, ff00, 5)
  %l0-3: 000000000000852e 000003000000c000 0000000000000004 0000000000000000
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000000000000

syncing file systems... done
Preserving kernel image in RAM, content: kernel sections: proc, zfs
 0:07  91% done (kernel)
 0:08  97% done (proc)
 0:08 100% done (zfs)
100% done: 331865 (kernel) + 21023 (proc) + 10523 (zfs) pages dumped, dump succeeded
NOTICE: Entering OpenBoot.
NOTICE: Fetching Guest MD from HV.
NOTICE: Starting additional cpus.
NOTICE: Initializing LDC services.
NOTICE: Probing PCI devices.
NOTICE: Finished PCI probing.

SPARC T5-8, No Keyboard
Copyright (c) 1998, 2015, Oracle and/or its affiliates. All rights reserved.
OpenBoot 4.37.0-nightly_01.13.2015, 23.2500 GB memory available, Serial #83489682.
Ethernet address 0:14:4f:f9:f3:92, Host ID: 84f9f392.

Boot device: /virtual-devices@100/channel-devices@200/disk@0:a  File and args: 
SunOS Release 5.11 Version 11.2 64-bit
Copyright (c) 1983, 2015, Oracle and/or its affiliates. All rights reserved.
NOTICE: Verified previous kernel image
Reconciling deferred dump: 99%..100% done.
Hostname: s11
Apr 24 06:40:58 s11 savecore: Until crashdump is saved to disk the multi-user-server milestone is delayed.
Apr 24 06:40:58 s11 savecore: Saving decompressed system crash dump files in directory /var/crash/data/61a392f8-3336-4d88-a0bb-d4ea84c11e60
Storing crash dump section kernel: 0:00   5% done (16956 pages)
Storing crash dump section kernel: 0:51 100% done (331865 pages)
Storing crash dump section proc: 0:08 100% done (21023 pages)
Storing crash dump section zfs: 0:00 100% done (10523 pages)

There are a couple of things to notice. First, the message about preserving the kernel image in RAM. Then the reconcile phase, where the deferred dump pieces are reunited with the VM subsystem. And most importantly the message about the delayed multi-user-server milestone, which is necessary to prevent memory-hungry applications from fighting with the deferred dump while it still resides in memory.

Now, you may ask: how can this work, given that the data being dumped resides in the same memory we are dumping into? Well, this is a bit like switching feet when climbing a wall - for a moment the hands take the weight of the body until the feet are swapped on a single step. For deferred dump, the aid is preallocated memory (this is what you see in mdb(1) in the output of the ::memstat dcmd as Defdump prealloc), careful page allocation and compression. The technique used to preserve the memory across reboot differs between SPARC and x86. On x86, fast reboot is a necessary prerequisite, while on SPARC it is a result of Hypervisor and OBP cooperation.
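
To get an idea of how much memory is set aside, you can look at the ::memstat summary from the running kernel; on releases with deferred dump support it includes the Defdump prealloc line mentioned above (a minimal sketch, run as root):

  # print the kernel memory usage summary, including the deferred dump preallocation
  echo ::memstat | mdb -k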

Overall, deferred dump should be faster than the classic on-disk dump. This is because storing the dump to the crash directory happens when the system can use its capabilities fully (namely DMA and full ZFS). Of course, the crash directory can reside on a network volume too, since the system is almost fully initialized at that point.
Sometimes the system performs two successive reboots - one to get from the panicked kernel to the new kernel, and another to clear the residue in case the new kernel deems memory fragmented enough to cause performance issues. This happens on x86 systems only.
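
To take advantage of that, the savecore directory can be pointed elsewhere with dumpadm -s; a hedged sketch, assuming an NFS share already mounted at the hypothetical path /net/dumphost/export/crash:

  # point the savecore directory at a (hypothetical) network-backed location
  dumpadm -s /net/dumphost/export/crash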

There are a few limitations which can prevent the system from performing a deferred dump, namely:

  • fast reboot support on x86 - the system has to support switching kernels without having to go through BIOS (see the sketch after this list for one way to check the setting)
  • uptime on x86 - by default the system has to be up for at least 10 minutes
  • memory size - there is a minimum amount of system memory required, currently circa 9.5GB
  • firmware version on SPARC - the firmware has to be sufficiently new to support deferred dump (9.4.x on T5/M6 and above, 8.7.x on T4)
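
On x86 the fast reboot behavior is driven by the boot-config SMF service; a quick, hedged way to check whether fast reboot is enabled by default (property names may differ slightly between releases):

  # query the default fast reboot setting of the boot-config service
  svcprop -p config/fastreboot_default svc:/system/boot-config:default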

Deferred dump is enabled by default, so it will become the de-facto standard way of dumping on most systems.

Thursday Jul 24, 2014

Crashdump restructuring in Solaris

In Solaris 11.2 the crashdump restructuring project changed the way dump data are stored. Data which are not pure kernel pages now go into separate files. Together with my colleague Sriman we made it happen.

The first noticeable change is the layout of the crash directory. The files are now stored under the /var/crash/data/uuid/ directory. The long hexadecimal string (uuid) was added to better align with FMA - it is actually the UUID (universally unique ID) of the crash event, which can be found in the fmadm faulty output. In fact, if you look at FMA panic events from earlier versions you can see that the resource string for the event was already designed this way; it just materialized with this project.
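
Since the directory name is the event UUID, a crash directory can be correlated with its panic event straight from fmadm(1M); a minimal sketch:

  # list FMA events; the UUID of the panic event matches the directory name
  # under /var/crash/data/
  fmadm faulty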

For example, after 2 panic events the /var/crash directory will look like this:

0 -> data/404778fb-88da-4188-f222-8da174d44fa4
1 -> data/6e50417e-95fc-4ab4-e9a8-abbe80bc6b48

The 0, 1 symlinks maintain the sequential ordering of the old layout.

The example reflects a configuration where savecore is not run automatically after boot (i.e. dumpadm -n is in effect) and the administrator has extracted the first crash dump by hand (running savecore 0 in the /var/crash/0/ directory). If you take a look at the console after the system reboots from a panic, there are commands you can copy and paste into the terminal to perform the extraction.
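
For the hands-on case above, the manual extraction boils down to something like the following (a sketch based on the layout described here; see savecore(1M) for the exact options on your release):

  # extract crash dump instance 0 by hand when dumpadm -n is in effect
  cd /var/crash/0
  savecore 0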

The other change is the new vmcore-zfs.N file which now appears next to vmcore.N in the crash directory. This is not the only new file which can appear. Depending on the dumpadm(1M) configuration there can be files like:

  • vmcore.N - core kernel pages
  • vmcore-zfs.N - ZFS metadata (ZIO buffers)
  • vmcore-proc.N - process pages
  • vmcore-other.N - other pages (ZFS data, free pages)

By splitting the dump into multiple files it is possible to transfer just the vmcore.N file for analysis, quickly assess what caused the panic, and transfer the rest of the files later on.

If any of the "auxiliary" files is missing, mdb will report it:

root@va64-v20zl-prg06:/var/crash/0# mdb 0
mdb: failed to locate file ./vmcore-zfs.0. Contents of ZFS metadata (ZIO buffers) 
will not be available
Loading modules: [ unix genunix specfs dtrace mac cpu.generic 
cpu_ms.AuthenticAMD.15 uppc pcplusmp zvpsm scsi_vhci zfs mpt sd ip hook neti arp
 usba kssl sockfs lofs idm cpc nfs fcip fctl ufs logindmux ptm sppp ipc ]
> ::status
debugging crash dump vmcore.0 (64-bit) from va64-v20zl-prg06
operating system: 5.12 dump.on12 (i86pc)
usr/src version: 23877:f2e76e2d0329:on12_51+48
usr/closed version: 1961:cad017e4c7e4:on12_51+4
image uuid: 404778fb-88da-4188-f222-8da174d44fa4
panic message: forced crash dump initiated at user request
complete: yes, all pages present as configured
dump content: kernel [LOADED,UNVERIFIED] (core kernel pages)
              zfs [MISSING] (ZFS metadata (ZIO buffers))
panicking PID: 101575 (not dumped)

Another choice is not to dump some of the sections at all. E.g. to dump only kernel pages and the pages of the process running at the time of panic, but not ZFS metadata, the system can be configured as:

dumpadm -c curproc-zfs

Also, the unix.N file is no longer extracted automatically (this can be done with the -u option of savecore(1M) if you need the file for some reason) since it is embedded in the vmcore.N file; mdb will find it there automatically.
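
If you do need the standalone unix.N file, the -u option mentioned above re-extracts it from an already saved dump; a hedged example (check savecore(1M) for the exact argument form on your release):

  # re-extract unix.0 from the saved crash dump in the current directory
  savecore -u 0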

How do you load the dump into mdb with all these files around? The easiest way to access an extracted dump is to use just the suffix, i.e.

  mdb N

which will pick up all the files based on metadata in the main vmcore.N file. This worked even before this change, except there was just one file (two if counting unix.N).

It is still possible to specify the files by hand, just remember to put the main file (vmcore.N) as the first argument:

  mdb vmcore.N vmcore-zfs.N ...

The other change (which is hard to notice unless you're dealing with lots of crash dump files) is that we laid out the infrastructure in kernel/mdb/libkvm to be properly backwards compatible with respect to the on-disk crash dump format. As a result mdb will automatically load crash dump files produced in earlier Solaris versions; currently it supports the three latest versions.

