Tuesday Jul 19, 2011

Fast Crash Dump

You may have noticed the new system crash dump file vmdump.N that was introduced in Solaris 10 9/10. However, you perhaps did not notice that the crash dump is generated much more quickly than before, reducing your down time by many minutes on large memory systems, by harnessing parallelism and high compression at dump time. In this entry, I describe the Fast Crash optimizations that Dave Plauger and I added to Solaris.

In the previous implementation, if a system panics, the panic thread freezes all other CPUs and proceeds to copy system memory to the dump device. By default, only kernel pages are saved, which is usually sufficient to diagnose a kernel bug, but that can be changed with the dumpadm(1M) command. I/O is the bottleneck, so the panic thread compresses pages on the fly to reduce the data to be written. It uses lzjb compression, which provides decent compression at a reasonable CPU utilization. When the system reboots, the single-threaded savecore(1M) process reads the dump device, uncompresses the data, and creates the crash dump files vmcore.N and unix.N, for a small integer N.

Even with lzjb compression, writing a crash dump on systems with gigabytes to terabytes of memory takes a long time. What if we use stronger compression to further reduce the amount of data to write? The following chart compares the compression ratio of lzjb vs bzip2 for 42 crash dumps picked at random from our internal support site.

bzip2 compresses 2x more than lzjb for most cases, and in the extreme case, bzip2 achieves a 39X compression vs 9X for lzjb. (We also tested gzip levels 1 through 9, and they fall in between the two.) Thus we could reduce the disk I/O time for crash dump by using bzip2. The catch is that bzip2 requires significantly more CPU time than lzjb per byte compressed, some 20X to 40X more on the SPARC and x86 CPUs we tested, so introducing bzip2 in a single threaded dump would be a net loss. However, we hijack the frozen CPUs to compress different ranges of physical memory in parallel. The panic thread traverses physical memory in 4 MB chunks, mapping each chunk and passing its address to a helper CPU. The helper compresses the chunk to an output buffer, and passes the result back to the panic thread, which writes it to disk. This is implemented in a pipelined, dataflow fashion such that the helper CPUs are kept busy compressing the next batch of data while the panic thread writes the previous batch of data to disk.

We dealt with several practical problems to make this work. Each helper CPU needs several MB of buffer space to run the bzip2 algorithm, which really adds up for 100's of CPUs, and we did not want to statically reserve that much memory per domain. Instead, we scavenge memory that is not included in the dump, such as userland pages in a kernel-only dump. Also, during a crash dump, only the panic thread is allowed to use kernel services, because the state of kernel data structures is suspect and concurrency is not safe. Thus the panic thread and helper CPUs must communicate using shared memory and spin locks only.

The speedup of parallel crash dump versus the serial dump depends on compression factor, CPU speed, and disk speed, but here are a few examples. These are "kernel only" dumps, and the dumpsize column below is the uncompressed kernel size. The disk is either a raw disk or a simple ZFS zvol, with no striping. Before is the time for a serial dump, and after is the time for a parallel dump, measured from the "halt -d" command to the last write to the dump device.

    system  NCPU  disk  dumpsize  compression  before  after  speedup
                          (GB)                 mm:ss   mm:ss
    ------  ----  ----  --------  -----------  -----   -----  -------
    M9000   512   zvol    90.6      42.0       28:30    2:03    13.8X
    T5440   256   zvol    19.4       7.2       27:21    4:29     6.1X
    x4450    16   raw      4.9       6.7        0:22    0:07     3.2X
    x4600    32   raw     14.6       3.1        3:47    1:46     2.1X

The higher compression, performed concurrently with I/O, gives a significant speedup, but we are still I/O limited, and future speedup will depend on improvements in I/O. For example, a striped zvol is not supported as a dump device, but that would help. You can use hardware raid to configure a faster dump device.

We also optimized live dumps, which are crash dumps that are generated with the "savecore -L" command, without stopping the system. This is useful for diagnosing systems that are misbehaving in some way but are still performing adequately, without interrupting service. We fixed a bug (CR 6878030) in which live dump writes were broken into 8K physical transfers, giving terrible I/O throughput. The live dumps could take hours on large systems, making them practically unusable. We obtained these speedups for various dump sizes:

    system disk  before   after  speedup
                 h:mm:ss  mm:ss
    ------ ----  -------  -----  -------
    M9000  zvol  3:28:10  15:06  13.8X
    T5440  zvol    23:29   1:19  18.6X
    T5440  raw      9:17   0:55  10.5X

Lastly, we optimized savecore, which runs when the system boots after a panic, and copies data from the dump device to a persistent file, such as /var/crash/hostname/vmdump.N. savecore is 3X to 5X faster faster because the dump device contents are more compressed, so there are fewer bytes to copy, and because it no longer uncompresses the dump by default. That makes more sense if you need to send the vmdump across the internet. To uncompress by default, use "dumpadm -z off". To uncompress a vmdump.N file and produce vmcore.N and unix.N, which are required for using tools such as mdb, run the "savecore -f" command. We multi-threaded savecore to uncompress in parallel.

Having said all that, I sincerely hope you never see a crash dump on Solaris! The best way to reduce downtime is for the system to stay up. The Sun Ray server I am working on now has been up for 160 days since its previous planned downtime.


Steve Sistare


« July 2011 »