You may have noticed the new system crash dump file vmdump.N that
was introduced in Solaris 10 9/10. However, you
perhaps did not notice that the crash dump is now generated much more quickly
than before, by harnessing parallelism and high compression at dump time,
which reduces your downtime by many minutes on large-memory systems.
In this entry, I describe the Fast Crash optimizations that Dave Plauger
and I added to Solaris.
In the previous implementation, if a system panics, the panic thread
freezes all other CPUs and proceeds to copy system memory to the dump
device. By default, only kernel pages are saved, which is usually
sufficient to diagnose a kernel bug, but that can be changed with the
dumpadm(1M) command. I/O is the bottleneck, so the panic thread
compresses pages on the fly to reduce the data to be written. It
uses lzjb compression, which provides decent compression at a
reasonable CPU utilization. When the system reboots, the
single-threaded savecore(1M) process reads the dump device,
uncompresses the data, and creates the crash dump files vmcore.N and
unix.N, for a small integer N.
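In outline, the old serial path is a single loop: find the next page that
belongs in the dump, compress it, write it, and move on. The sketch below is
purely schematic; every name in it (page_in_dump, dump_map_page,
lzjb_compress_page, dump_write) is an invented stand-in for the kernel's
internal routines, with made-up signatures.

    #include <stddef.h>

    #define PAGESIZE 8192   /* example value; the real page size is platform-dependent */

    /* Invented stand-ins for kernel-internal routines. */
    extern int    page_in_dump(size_t pfn);      /* should this page be dumped? */
    extern void  *dump_map_page(size_t pfn);     /* map the physical page */
    extern size_t lzjb_compress_page(const void *src, void *dst, size_t len);
    extern void   dump_write(const void *buf, size_t len);

    /*
     * Serial scheme: the panic thread does everything itself, so the dump
     * device is idle while a page is being compressed, and the CPU is idle
     * while the previous buffer is being written.
     */
    void
    serial_dump(size_t npages)
    {
        static char cbuf[2 * PAGESIZE];          /* generous output buffer */
        size_t pfn, clen;

        for (pfn = 0; pfn < npages; pfn++) {
            if (!page_in_dump(pfn))
                continue;
            clen = lzjb_compress_page(dump_map_page(pfn), cbuf, PAGESIZE);
            dump_write(cbuf, clen);
        }
    }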
Even with lzjb compression, writing a crash dump on systems with gigabytes
to terabytes of memory takes a long time. What if we use stronger
compression to further reduce the amount of data to write? We
compared the compression ratios of lzjb and bzip2 for 42 crash dumps
picked at random from our internal support site. bzip2 compresses
roughly 2X better than lzjb in most cases, and in the extreme case
achieves 39X compression versus 9X for lzjb. (We also tested gzip
levels 1 through 9, and they fall in between the two.)
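For a rough sense of the I/O at stake (illustrative arithmetic, not a
measurement from the dumps above): a 100 GB kernel image that compresses 9X
under lzjb still means about 11 GB of writes, while an 18X ratio would cut
that to roughly 5.5 GB, halving the write time on the same device, provided
the compressor can keep the device busy.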
Thus we could reduce the disk I/O time for a crash dump
by using bzip2. The catch is that bzip2 requires significantly more
CPU time than lzjb per byte compressed, some 20X to 40X more
on the SPARC and x86 CPUs we tested, so introducing bzip2 in a
single-threaded dump would be a net loss. However, we hijack the
frozen CPUs to compress different ranges of physical memory in parallel.
The panic thread traverses physical memory in 4 MB chunks, mapping
each chunk and passing its address to a helper CPU. The helper compresses
the chunk to an output buffer, and passes the result
back to the panic thread, which writes it to disk. This is
implemented in a pipelined, dataflow fashion such that the helper CPUs
are kept busy compressing the next batch of data while the
panic thread writes the previous batch of data to disk.
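As a concrete illustration, here is a minimal userland analogue of that
pipeline, using POSIX threads to stand in for the frozen CPUs and C11 atomics
for the handoff. It is only a sketch of the structure described above, not the
actual Solaris code: the names (mailbox_t, fake_compress, NHELP) are invented,
the "compression" is an identity copy, and the device write is reduced to
counting bytes.

    /*
     * Sketch of the parallel dump pipeline (illustrative only).
     * Build: cc -std=c11 -O2 pipeline.c -lpthread
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CHUNK  (4UL * 1024 * 1024)   /* 4 MB chunks, as in the dump code */
    #define NHELP  4                     /* helper "CPUs" */

    enum { IDLE, FULL, DONE };           /* mailbox states */

    typedef struct {
        _Atomic int    state;            /* handoff flag in shared memory */
        unsigned char *in;               /* chunk of "physical memory" */
        size_t         inlen;
        unsigned char *out;              /* compressed result */
        size_t         outlen;
    } mailbox_t;

    static mailbox_t mbox[NHELP];

    /* Stand-in for the real compressor (bzip2 in the kernel). */
    static size_t
    fake_compress(const unsigned char *src, size_t len, unsigned char *dst)
    {
        memcpy(dst, src, len);           /* identity "compression" */
        return (len);
    }

    /* Helper CPU: spin for work, compress it, hand it back.  No mutexes,
     * no condition variables, nothing that needs kernel services. */
    static void *
    helper(void *arg)
    {
        mailbox_t *m = arg;

        for (;;) {
            while (atomic_load(&m->state) != FULL)
                ;                        /* busy-wait */
            m->outlen = fake_compress(m->in, m->inlen, m->out);
            atomic_store(&m->state, DONE);
        }
    }

    int
    main(void)
    {
        size_t memsize = 64UL * 1024 * 1024;     /* pretend physical memory */
        unsigned char *mem = calloc(1, memsize);
        size_t off = 0, outstanding = 0, written = 0;
        pthread_t tid;
        int i;

        for (i = 0; i < NHELP; i++) {
            mbox[i].out = malloc(CHUNK);
            pthread_create(&tid, NULL, helper, &mbox[i]);
        }

        /* "Panic thread": keep every helper busy and write out finished
         * chunks, so compression and "I/O" overlap. */
        while (off < memsize || outstanding > 0) {
            for (i = 0; i < NHELP; i++) {
                int st = atomic_load(&mbox[i].state);

                if (st == DONE) {
                    written += mbox[i].outlen;   /* the real code writes to disk */
                    atomic_store(&mbox[i].state, IDLE);
                    outstanding--;
                } else if (st == IDLE && off < memsize) {
                    mbox[i].in = mem + off;
                    mbox[i].inlen = (memsize - off < CHUNK) ? memsize - off : CHUNK;
                    off += mbox[i].inlen;
                    atomic_store(&mbox[i].state, FULL);
                    outstanding++;
                }
            }
        }
        printf("wrote %zu bytes\n", written);
        return (0);                      /* process exit reaps the spinning helpers */
    }

In the real dump code the writes go to the dump device and the helpers run on
the CPUs that were frozen at panic time, but the shape of the pipeline is the
same.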
We dealt with several practical problems to make this work.
Each helper CPU needs several MB of buffer space to run the
bzip2 algorithm, which really adds up on systems with hundreds of CPUs,
and we did not want to statically reserve that much memory per domain. Instead,
we scavenge memory that is not included in the dump, such as userland
pages in a kernel-only dump. Also, during a crash dump, only the
panic thread is allowed to use kernel services, because the state of
kernel data structures is suspect and concurrency is not safe. Thus the
panic thread and helper CPUs must communicate using shared memory and
spin locks only.
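The handoff flags in the pipeline sketch above are one example of this; more
generally, a spin lock of the kind the dump code can afford is nothing more
than a word in shared memory that is polled until it is acquired. The sketch
below shows the generic pattern in C11; it illustrates the idea, not the
kernel's actual lock implementation.

    #include <stdatomic.h>

    /*
     * A minimal spin lock: one flag in shared memory, busy-waiting, and no
     * kernel services.  Initialize with:  spinlock_t lk = { ATOMIC_FLAG_INIT };
     * Generic illustration only, not the Solaris dump code's primitive.
     */
    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    static inline void
    spin_lock(spinlock_t *lk)
    {
        /* Loop until the flag was previously clear, i.e., we acquired it. */
        while (atomic_flag_test_and_set_explicit(&lk->locked,
            memory_order_acquire))
            ;
    }

    static inline void
    spin_unlock(spinlock_t *lk)
    {
        atomic_flag_clear_explicit(&lk->locked, memory_order_release);
    }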
The speedup of a parallel crash dump versus a serial dump depends on
the compression factor, CPU speed, and disk speed, but here are a few
examples. These are kernel-only dumps, and the dumpsize column
below is the uncompressed kernel size. The disk is either a raw disk
or a simple ZFS zvol, with no striping. Before is the time for a
serial dump, and after is the time for a parallel dump, measured from
the "halt -d" command to the last write to the dump device.
system   NCPU   disk   dumpsize (GB)   compression   before (mm:ss)   after (mm:ss)   speedup
------   ----   ----   -------------   -----------   --------------   -------------   -------
M9000     512   zvol            90.6         42.0X            28:30            2:03     13.8X
T5440     256   zvol            19.4          7.2X            27:21            4:29      6.1X
x4450      16   raw              4.9          6.7X             0:22            0:07      3.2X
x4600      32   raw             14.6          3.1X             3:47            1:46      2.1X
The higher compression, performed concurrently with I/O, gives
a significant speedup, but we are still I/O limited, and future speedup
will depend on improvements in I/O. For example, a striped zvol is
not supported as a dump device, but that would help. You can use
hardware RAID to configure a faster dump device.
We also optimized live dumps, which are crash dumps that are
generated with the "savecore -L" command, without stopping the
system. This is useful for diagnosing systems that are misbehaving
in some way but are still performing adequately, without interrupting
service. We fixed a bug (CR 6878030) in which live dump writes were
broken into 8K physical transfers, giving terrible I/O
throughput. Live dumps could take hours on large systems, making
them practically unusable. We obtained these speedups for various systems:
system   disk    before   after   speedup
------   ----   -------   -----   -------
M9000    zvol   3:28:10   15:06     13.8X
T5440    zvol     23:29    1:19     18.6X
T5440    raw       9:17    0:55     10.5X
Lastly, we optimized savecore, which runs when the system boots after
a panic, and copies data from the dump device to a persistent file,
such as /var/crash/hostname/vmdump.N. savecore is 3X to 5X faster
because the dump device contents are more highly compressed, so there
are fewer bytes to copy, and because it no longer uncompresses the
dump by default. Keeping the dump compressed makes more sense if you need
to send the vmdump across the internet. To have savecore uncompress by
default, use "dumpadm -z off".
To uncompress a vmdump.N file and produce
vmcore.N and unix.N, which are required for using tools such as mdb,
run the "savecore -f" command. We multi-threaded savecore so that it
uncompresses the dump in parallel.
Having said all that, I sincerely hope you never see a crash dump on
Solaris! The best way to reduce downtime is for the system to stay
up. The Sun Ray server I am working on now has been up for 160 days
since its previous planned downtime.