ZFS saves the day(-ta)!
By elowe on Nov 16, 2005
I've been using ZFS internally for awhile now. For someone who used to administer several machines with Solaris Volume Manager (SVM), UFS, and a pile of aging JBOD disks, my experience so far is easily summed up: "Dude this so @#%& simple, so reliable, and so much more powerful, how did I never live without it??"
So, you can imagine my excitement when ZFS finally hit the gate. The very next day I BFU'ed my workstation, created a ZFS pool, setup a few filesystems and (four commands later, I might add) started beating on it.
Imagine my surprise when my machine stayed up less than two hours!!
No, this wasn't a bug in ZFS... it was a fatal checksum error. One of those "you might want to know that your data just went away" sort of errors. Of course, I had been running UFS on this disk for about a year, and apparently never noticed the silent data corruption. But then I reached into the far recesses of my brain, and I recalled a few strange moments -- like the one time when I did a bringover into a workspace on the disk, and I got a conflict on a file I hadn't changed. Or the other time after a reboot I got a strange panic in UFS while it was unrolling the log. At the time I didn't think much of these things -- I just deleted the file and got another copy from the golden source -- or rebooted and didn't see the problem recur -- but it makes sense to me now. ZFS, with its end-to-end checksums, had discovered in less than two hours what I hadn't known for almost a year -- that I had bad hardware, and it was slowly eating away at my data.
Figuring that I had a bad disk on my hands, I popped a few extra SATA drives in, clobbered the disk and this time set myself up a three-disk vdev using raidz. I copied my data back over, started banging on it again, and after a few minutes, lo and behold, the checksum errors began to pour in:
elowe@oceana% zpool status pool: junk state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool online' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: none requested config: NAME STATE READ WRITE CKSUM junk ONLINE 0 0 0 raidz ONLINE 0 0 0 c0d0 ONLINE 0 0 0 c3d0 ONLINE 0 0 1 c3d1 ONLINE 0 0 0
A checksum error on a different disk! The drive wasn't at fault after all.I emailed the internal ZFS interest list with my saga, and quickly got a response. Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? A faulty power supply.
Since my data is still intact, and the performance isn't hampered at all, I haven't bothered to fix the problem yet. I've been running over a week now with a faulty setup which is still corrupting data on its way to the disk, and have yet to see a problem with my data, since ZFS handily detects and corrects these errors on the fly.
Eventually I suppose I'll get around to replacing that faulty power supply...