Kernel fault injection – testing the diagnostics codepaths

Within Solaris, we test for conformance – spec says “do X when Y happens” – and we test for regression – “X used to work but now it doesn’t”. However, when working on deferred dump, I wanted a way to create different panics, partly to make sure I was handling the panic and crash dump that I thought I was, and partly to test the various codepaths which took the system through the panic mechanism.

So, I created a loadable module called kfaultinj – the “Kernel FAULT INJector”. Obviously I’m not the first to have ever written a module like this which deliberately panics the system, but I tried to include as many common faults as I could. Here are some examples of what it can test for:

Direct call to panic()
Read from a NULL pointer
Execute code from a NULL pointer
Kernel stack overflow
Invalid instruction
Divide-by-zero
Read from the VA hole present in most current CPUs

I also threw in some kmem_alloc / kmem_free tests to exercise the detection of leaks, double frees, use-after-free and other common memory errors.
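To make the list above a little more concrete, here is a minimal user-space sketch of how a module like this might catalogue its faults so that a single request can select one by name. It is not the kfaultinj source: the enum values, the selector strings and fault_lookup() are all hypothetical names invented for illustration, and nothing here actually injects a fault.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical selectors, one per fault in the list above. */
enum fault_kind {
    FAULT_PANIC_CALL,
    FAULT_NULL_READ,
    FAULT_NULL_EXEC,
    FAULT_STACK_OVERFLOW,
    FAULT_BAD_INSTRUCTION,
    FAULT_DIV_ZERO,
    FAULT_VA_HOLE_READ,
    FAULT_COUNT                 /* number of entries, not a fault itself */
};

struct fault_desc {
    enum fault_kind kind;
    const char *name;           /* hypothetical selector string */
    const char *what;           /* description, taken from the list above */
};

static const struct fault_desc fault_table[] = {
    { FAULT_PANIC_CALL,      "panic",          "Direct call to panic()" },
    { FAULT_NULL_READ,       "null-read",      "Read from a NULL pointer" },
    { FAULT_NULL_EXEC,       "null-exec",      "Execute code from a NULL pointer" },
    { FAULT_STACK_OVERFLOW,  "stack-overflow", "Kernel stack overflow" },
    { FAULT_BAD_INSTRUCTION, "bad-instr",      "Invalid instruction" },
    { FAULT_DIV_ZERO,        "div-zero",       "Divide-by-zero" },
    { FAULT_VA_HOLE_READ,    "va-hole",        "Read from the VA hole" },
};

/* Return the table entry for a selector string, or NULL if it is unknown. */
static const struct fault_desc *fault_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(fault_table) / sizeof(fault_table[0]); i++) {
        if (strcmp(fault_table[i].name, name) == 0)
            return &fault_table[i];
    }
    return NULL;
}
```

Keeping the faults in one table means an unknown selector is rejected in a single place, before anything destructive runs.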

Using this, I found a bizarre fault in the x86 trap handling codepath in some newly-added code which corrupted the stack during the panic sequence. This code was only present in the development source code, so there is no impact to production systems. The problem was that an illegal instruction fault injection did indeed result in a panic, but not the panic I was expecting. This was compounded by the ensuing stack corruption, but eventually I figured out where it was going wrong. The fix was simple, and the bug was caught before it got out to any customer systems.

Sure, it was already in the panic codepath, so you get a panic when you were en route to a panic anyway, but the resulting mess would have confused several people before it was figured out. I had the advantage that I knew what panic I was expecting, so I already knew which codepath to look in. Had this been a panic in the wild, with even the reason for the original panic trashed on the stack, it would have been nigh on impossible to figure out what happened.

One other type of testing we do is fuzz testing – feeding garbage into a program to check that it handles the input gracefully. Since I’d just written destructive kernel code and wrapped it up in an easy-to-use loadable module, I wanted to make sure it wouldn’t be triggered accidentally by fuzz testing. So I used a simple write-to-arm mechanism, where a pre-agreed block of text has to be written into the device to arm it before the ioctl to do the damage is issued.

The text I chose was this, adapted from a combination of Psalm 130 and the standard Requiem Mass text, in Latin, of course:

De profundis clamavit ad te,
Libera animas omnium fidelium defunctorum
de poenis inferni et de profundo lacu.
Fac eas de morte transire ad vitam

which (very!) roughly translates as:

Out of the deep I have cried out to you,
Save the souls of the faithful departed
from the pains of hell and the bottomless pit.
Let them pass from death to life.

It seemed appropriate for a panic and crash dump scenario!
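The write-to-arm idea described above can be sketched in a few lines of user-space C. This is only a model of the logic, not the module's real code: struct injector, injector_write(), injector_ioctl() and the placeholder ARM_TEXT are hypothetical, and two behaviours are my own assumptions rather than anything stated here, namely that a non-matching write disarms the device and that arming is good for one ioctl only.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Placeholder arming text (assumption); the real module used the Latin text. */
static const char ARM_TEXT[] = "example arming phrase";

/* Hypothetical per-device state. */
struct injector {
    int armed;                  /* nonzero once the exact text was written */
};

/*
 * Models the device's write entry point. Only an exact, full-length match
 * arms the device; anything else (fuzzer garbage, say) leaves it disarmed.
 */
static int injector_write(struct injector *inj, const void *buf, size_t len)
{
    inj->armed = (len == sizeof(ARM_TEXT) - 1 &&
        memcmp(buf, ARM_TEXT, len) == 0);
    return inj->armed ? 0 : EINVAL;
}

/*
 * Models the ioctl entry point. The request is refused unless the device
 * is armed, and the device is disarmed again before the fault "fires".
 */
static int injector_ioctl(struct injector *inj, int fault, int *fired)
{
    if (!inj->armed)
        return EPERM;
    inj->armed = 0;             /* assumption: one shot per arming */
    *fired = fault;             /* stand-in for really injecting the fault */
    return 0;
}
```

With this shape, a fuzzer throwing random ioctls at the device only ever sees EPERM, because it would first have to write the exact arming text byte for byte.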

Brian Ruthven

Solaris Server OS development engineer