Error Handling Philosophy: It Happens
By user12652883 on Mar 10, 2005
There is, of course, an informed minority who recognise the value and necessity of proper error and fault handling. But it seems that too many people subscribe to the quaint fallacy that "hardware errors should not happen, and where they do they indicate a vendor design or manufacturing flaw". Hmmm, even given a perfect chip design it's more than a little counter-intuitive to believe that we can squeeze a few gazillion transistors onto a tiny chip and expect perfect operation from what is a physical process. The reality is that electronics, like software, is imperfect both in design and in implementation. Moreover it is realized in a physical medium and is subject to the laws of physics more than the laws of your data centre or desktop! Now take all these components and imagine assembling them even into a simple system - opportunities abound!
So rather than stick your head in the sand, accept that hardware errors do happen, that they are expected, and that (whatever your hardware, from whomever) they have almost certainly happened to you. Whether or not you or anything else actually noticed and, if so, did anything about it is quite another question. We're into the domain of detection, correction, and remedial action.
Good hardware components and systems are designed to, at the very least, detect the presence of errors they have suffered (they detect the consequence of the event, say a bit flip, rather than the event itself, which may have occurred some time earlier). I'm always amused by the existence of "non-parity memory" for many PC systems (especially those sometimes quoted in the "build your own server for a few hundred bucks and run your business on it - why pay a vendor" articles). "Non-parity" makes it sound like a feature, not an omission; "non-fattening" foods are a good thing, so "non-parity" must be good too. Lacking even parity protection means that your data in memory is completely unprotected - if it is corrupted by a bit flip the only way you'll ever know is if you notice "cheap" spelled as "nasty" in your precious document (and many would blame the application), or if higher-level application software is performing checksumming on your data for you (not common). The system has silently corrupted your data and allowed use of that corrupted data as if it had been good, and you're none the wiser!
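To make the parity point concrete, here is a minimal sketch of what a single even-parity bit per word buys you. The function names (`parity_bit`, `store`, `check`) are made up for illustration and do not model any particular memory controller; real hardware does this in logic, not software.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Model a memory write: keep the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Model a memory read: True if the word still matches its parity bit."""
    return parity_bit(word) == stored_parity


word, parity = store(0b01100011)
one_flip = word ^ 0b00000100   # a single bit flipped after the write
two_flips = word ^ 0b00000110  # two bits flipped after the write

single_detected = not check(one_flip, parity)   # parity catches this
double_detected = not check(two_flips, parity)  # parity misses this
```

Parity only detects an odd number of flipped bits and cannot say which bit is wrong, so it can flag the corruption but not repair it. Without it, both cases above would pass silently.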
Of course any self-respecting system nowadays has ECC on memory (typically single-bit-error correction and dual-bit-error detection without correction). OK, so my cheapo home PC doesn't but that's only used for private email so I'll get over it - but I wouldn't run anything important on there. A self-respecting system also has data protection, and increasingly correction, on datapaths, ASICs, CPU caches etc.
At its simplest, system error-handling software need only do what is required by the architecture to correctly survive the error. This may range from nothing at all (hardware detected and corrected a single bit error and, optionally, let you know about its good deed) to cache flushes, memory rewrites etc. But we want to do a great deal more than that. We want to log the event, look for patterns and trends, predict impending uncorrectable failures and "head them off at the pass" (apologies) through preemptive actions, classify errors (e.g., transient or not going away) and so on.
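The "look for patterns and trends" idea can be sketched as a per-component count of correctable errors over a sliding time window. Everything here is hypothetical: the `ErrorTracker` class, the component names, and the window and threshold values are assumptions chosen for illustration, not anything a real system is known to use.

```python
from collections import defaultdict, deque


class ErrorTracker:
    """Flag components whose correctable errors cluster in time."""

    def __init__(self, window_s: float = 86400.0, threshold: int = 3):
        self.window_s = window_s      # assumed: how far back we look
        self.threshold = threshold    # assumed: events that make a trend
        self.events = defaultdict(deque)

    def record(self, component: str, timestamp: float) -> str:
        """Log one correctable error; return 'ok' or 'suspect'."""
        recent = self.events[component]
        recent.append(timestamp)
        # Forget events that have aged out of the window.
        while recent and timestamp - recent[0] > self.window_s:
            recent.popleft()
        return "suspect" if len(recent) >= self.threshold else "ok"


def classify(recurs_after_rewrite: bool) -> str:
    """A transient error goes away once the location is rewritten;
    one that reappears on the next read is not going away."""
    return "persistent" if recurs_after_rewrite else "transient"
```

A component that comes back as `"suspect"` is a candidate for preemptive action before it produces an error that cannot be corrected; what that action is depends entirely on the platform.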
Furthermore, if we accept that "it happens" - errors and faults will occur - we should also accept that they're not always going to be neat and tidy. For example a particle strike that upsets a single memory cell is easily handled - it's an isolated event and the overhead of handling it (correction, logging, collating in case this is a part of a pattern) is trivial - but a damaged or compromised memory chip (manufacturing defect, electrostatic handling failures, nuts and bolts loose in the system, poor memory slot insertion, coffee in the memory slot etc) may produce a "storm" of errors - do you want the system to spend all its time logging and collating those or would you prefer it also do some useful work on your application?
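One possible answer to the storm question is to log only the first few events in each time window in full and merely count the rest. The sketch below assumes exactly that policy; the `StormThrottle` name and its limits are invented for illustration and a real handler might equally disable the error trap or retire the faulty resource instead.

```python
class StormThrottle:
    """Fully log at most max_logged events per window; count the rest."""

    def __init__(self, max_logged: int = 5, window_s: float = 1.0):
        self.max_logged = max_logged  # assumed per-window logging budget
        self.window_s = window_s      # assumed window length in seconds
        self.window_start = None
        self.logged = 0
        self.suppressed = 0

    def on_error(self, now: float) -> bool:
        """Return True if this event should be logged in full."""
        if self.window_start is None or now - self.window_start >= self.window_s:
            # Start a fresh window with a fresh logging budget.
            self.window_start = now
            self.logged = 0
        if self.logged < self.max_logged:
            self.logged += 1
            return True
        self.suppressed += 1          # cheap path: just bump a counter
        return False
```

The suppressed count is still useful evidence for diagnosis, but the expensive work per event is bounded, which leaves the system time for your application.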
While a single memory chip problem will only affect accesses that involve that memory, think what happens when such problems beset the CPU - say a "stuck" bit (always reads 0 even after you write a 1 to it, or vice-versa) is in a primary (on-chip) cache such as the data cache. Such a failure is going to generate many, many error events - how do you make sure nobody notices (i.e., no data is corrupted, everybody runs at full speed etc.)?
Of course some failures are simply not correctable. They may not involve extremes of storms and stuck bits - just 2 flipped bits in a memory word cannot be corrected with most ECC codes currently in use. Data may be lost (e.g., if it can't be recovered from another copy elsewhere) and process or even operating system state may be compromised. How do you contain the damage done with least interruption to system service?
I hope I've made the beginnings of a convincing case for why error and fault handling as a first-class citizen is essential in modern systems and operating systems. In followup posts (probably after I get some such software putback-ready) I'll continue to make the case, describe where Sun is at and where we're going in the arena etc. It's perhaps less glamorous working in the cesspit of things that we'd prefer did not happen, but since they do happen and will continue to with current technology it's certainly sexy when you can handle them with barely a glitch to the system or predict the occurrence and have already taken steps to contain the damage.