Why is the crash dump running really slowly?

December 23, 2016 2 minute read

Solaris Server OS development engineer

Periodically I get pinged about a crash dump that is running really slowly, where the real issue turns out to be the way in which the dump was started. When a SPARC system panics, if the obpdebug variable is set non-zero (which is the default on DEBUG kernels), the panic sequence will drop to the OBP before taking the dump. On recent systems, especially if running within an LDom, you’ll get this prompt:

c)ontinue, s)ync, r)eset?

Now let’s consider what each of these three options actually does:

Continue: resumes the running kernel (which is currently in the middle of a panic and about to produce a crash dump).
Sync: calls the sync_handler callback installed at boot time. This is an OBP hook into the kernel for forcing a kernel panic with the string “sync initiated”.
Reset: immediately resets the LDom / platform.

Selecting “continue” will allow the kernel to continue the panic sequence and generate the dump. So what’s the problem with selecting “sync”, which seems like a reasonable choice?

First, some background: back in 2009, code was added to the dump subsystem to make use of otherwise idle CPUs for parallel compression of the dump. This improved throughput significantly and, of course, reduced the downtime resulting from a panic.

The way the OBP implements the sync handler callback is to jump back into the kernel on that one CPU only. The rest of the CPUs are left spinning in the OBP code, and therefore cannot join in the parallel compression of the crash dump. This results in a single-threaded dump, with the single CPU responsible both for compression and for polling disk I/O (which also slows things down).
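To make the difference concrete, here is a toy user-space sketch of the idea, not the Solaris dump code. It assumes the dump is split into fixed-size chunks that can be compressed independently; the names `CHUNK_SIZE`, `split_chunks`, `compress_dump` and `restore` are all hypothetical, and zlib stands in for whatever compressor the kernel actually uses. With `workers=1` a single thread does everything, which is roughly the situation after “sync”; with more workers the chunks are compressed in parallel, as they can be after “continue”.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # hypothetical 1 MiB chunk size, chosen for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Split the 'dump' into independently compressible chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_dump(data, workers):
    """Compress every chunk, using `workers` helper threads.

    workers <= 1 models the single-CPU case: one thread compresses
    each chunk in turn. Larger values model idle CPUs joining in.
    """
    chunks = split_chunks(data)
    if workers <= 1:
        return [zlib.compress(c) for c in chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the output
        # layout does not depend on which worker finishes first.
        return list(pool.map(zlib.compress, chunks))


def restore(compressed):
    """Decompress and reassemble the chunks (for checking only)."""
    return b"".join(zlib.decompress(c) for c in compressed)
```

Both paths produce identical output; only the elapsed time differs. Timing `compress_dump(data, 1)` against `compress_dump(data, 4)` on a multi-core machine gives a feel for the effect, although the real dump path also has to drive disk I/O, which this sketch ignores.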

So, if faced with the “c)ontinue, s)ync, r)eset?” prompt in a panic context, “continue” is normally the best option.

Brian Ruthven

Solaris Server OS development engineer

Brian Ruthven joined Sun Microsystems in 2000 as a front-line support engineer, dealing with either hardware or software problems. He moved to an engineering position in 2008, diagnosing, fixing and providing patches to customers for networking bugs in the kernel. Later projects include the design and implementation of the x86 prototype for the Deferred Dump feature, driving through to completion on both architectures. Currently he holds the position of ON Tech Lead for Solaris.
