Using Kernel Crash Dumps for Performance Analysis
By clive on May 27, 2008
Kernel Crash Dumps are a point-in-time snapshot of the Solaris Kernel state. The aim is to allow post-mortem analysis of the system state at the point the crash dump was taken. For system panics and hangs, the ability to look at the system state is the primary failure analysis tool and one of the reasons Solaris is as reliable as it is.
I think of system failures as a two-dimensional problem. The interaction of data and code at the point in time of the failure can be analyzed with tools such as MDB, which are designed for this type of post-mortem analysis.
Performance adds the third dimension of time, which a single point-in-time snapshot cannot capture.
Autopsy is not commonly used as a tool for determining the root cause of individual productivity issues. In a small subset of cases, poor individual productivity may be the result of a medical condition requiring a CAT scan (the medical version of a live Kernel Crash Dump). However, these cases are very rare and such techniques would only be used with a significant body of supporting evidence.
Kernel Crash Dumps are useful for a very small subset of performance cases. A performance problem rooted in a memory shortfall caused by a memory leak would be one example, but such cases are quite rare in the grand scheme of things, and supporting evidence would be needed to justify the Kernel Crash Dump approach.
I have come across a number of performance cases in the last few months where a crash dump was requested, and in only one of them was the request possibly valid.
Before collecting the CAT scan equivalent of your system (with the associated cost) in the hope that it shows up the cause of a performance problem, check the pulse, breathing and circulation first. If you do collect a live crash dump, make sure the supporting evidence and rationale are sound.