Periodically I get pinged about a crash dump that is running really slowly, when the real issue is the way in which the dump was started. When a SPARC system panics, if the obpdebug variable is set non-zero (which is the default on DEBUG kernels), the panic sequence will drop to the OBP before taking the dump. On recent systems, especially if running within an LDom, you’ll get this prompt:
c)ontinue, s)ync, r)eset?
Now let’s consider what each of these three options actually does:
- Continue: resumes the running kernel (which is currently in the middle of a panic and about to produce a crash dump).
- Sync: calls the sync_handler callback installed at boot time. This is an OBP hook into the kernel for forcing a kernel panic with the string “sync initiated”.
- Reset: Immediate reset of the LDom / platform.
Selecting “continue” will allow the kernel to continue the panic sequence and generate the dump. So what’s the problem with selecting “sync”, which seems like a reasonable choice?
First, some background: Back in 2009, code was added to the dump subsystem to make use of otherwise idle CPUs for parallel compression of the dump. This improved throughput significantly, and of course reduced the downtime caused by a panic.
The way the OBP implements the sync handler callback is to jump back into the kernel on that one CPU only. The rest of the CPUs are left spinning in the OBP code, and therefore cannot join in the parallel compression of the crash dump. This results in a single-threaded dump, with that single CPU responsible both for compression and for polling disk I/O (which also slows things down).
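To make the difference concrete, here is a rough user-level sketch in Python. It is not the kernel's dump code: the function names, the chunk size, and the use of zlib are all assumptions chosen purely for illustration. With ncpus set to 1 (roughly the “sync” case) a single thread compresses every chunk in turn; with more CPUs available (the “continue” case) the same chunks are handed out to helper threads.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # 1 MiB; an arbitrary value picked for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Split the buffer into fixed-size chunks that can be compressed independently."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def compress_dump(data, ncpus):
    """Compress data chunk by chunk using ncpus workers (hypothetical helper).

    ncpus <= 1 models the single-threaded dump: one CPU does all the work.
    ncpus > 1 models the parallel case, with helper threads sharing the chunks.
    """
    chunks = split_chunks(data)
    if ncpus <= 1:
        # Only one CPU came back into the kernel: compress everything serially.
        return [zlib.compress(c) for c in chunks]
    # All CPUs are available: spread the chunks across worker threads.
    with ThreadPoolExecutor(max_workers=ncpus) as pool:
        # map() returns results in input order, so the output layout is unchanged.
        return list(pool.map(zlib.compress, chunks))
```

Both paths produce the same compressed chunks; the only thing that changes is how many workers share the job, which is exactly what is lost when the other CPUs are left spinning in the OBP.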
So, if faced with the “c)ontinue, s)ync, r)eset?” prompt in a panic context, “continue” is normally the best option.
