ZFS saves the day(-ta)!

I've been using ZFS internally for a while now. For someone who used to administer several machines with Solaris Volume Manager (SVM), UFS, and a pile of aging JBOD disks, my experience so far is easily summed up: "Dude, this is so @#%& simple, so reliable, and so much more powerful, how did I ever live without it??"

So, you can imagine my excitement when ZFS finally hit the gate. The very next day I BFU'ed my workstation, created a ZFS pool, set up a few filesystems and (four commands later, I might add) started beating on it.

Imagine my surprise when my machine stayed up less than two hours!!

No, this wasn't a bug in ZFS... it was a fatal checksum error. One of those "you might want to know that your data just went away" sort of errors. Of course, I had been running UFS on this disk for about a year, and apparently never noticed the silent data corruption. But then I reached into the far recesses of my brain, and I recalled a few strange moments -- like the one time when I did a bringover into a workspace on the disk, and I got a conflict on a file I hadn't changed. Or the other time after a reboot I got a strange panic in UFS while it was unrolling the log. At the time I didn't think much of these things -- I just deleted the file and got another copy from the golden source -- or rebooted and didn't see the problem recur -- but it makes sense to me now. ZFS, with its end-to-end checksums, had discovered in less than two hours what I hadn't known for almost a year -- that I had bad hardware, and it was slowly eating away at my data.
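To make the "end-to-end checksum" idea concrete, here is a minimal sketch in Python. It is not how ZFS is implemented; it only illustrates the principle that the checksum is kept with the pointer to a block rather than with the block itself, so a block that comes back from the disk silently damaged is caught on read. The names `BlockStore`, `write_block`, `read_block` and `ChecksumError` are made up for this illustration, and SHA-256 is used simply because it ships with Python.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block read back from storage does not match its checksum."""


class BlockStore:
    """Toy 'disk' (hypothetical): stores raw bytes and performs no integrity checks."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for whatever checksum the filesystem uses.
    return hashlib.sha256(data).digest()


def write_block(store, block_id, data):
    """Write data and return a 'block pointer': (location, checksum of contents)."""
    store.write(block_id, data)
    return (block_id, checksum(data))


def read_block(store, pointer):
    """Read through a block pointer, verifying the contents against its checksum."""
    block_id, expected = pointer
    data = store.read(block_id)
    if checksum(data) != expected:
        raise ChecksumError(f"block {block_id}: checksum mismatch")
    return data


if __name__ == "__main__":
    store = BlockStore()
    ptr = write_block(store, 7, b"source file contents")

    # Simulate bad hardware flipping a byte behind the filesystem's back.
    store.blocks[7] = b"sourcE file contents"

    try:
        read_block(store, ptr)
    except ChecksumError as err:
        print("detected:", err)
```

A filesystem that reads the toy disk directly would hand the damaged bytes to the application without complaint, which matches the year of unnoticed corruption described above.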

Figuring that I had a bad disk on my hands, I popped a few extra SATA drives in, clobbered the disk and this time set myself up a three-disk vdev using raidz. I copied my data back over, started banging on it again, and after a few minutes, lo and behold, the checksum errors began to pour in:

elowe@oceana% zpool status
  pool: junk
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        junk        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     1
            c3d1    ONLINE       0     0     0

A checksum error on a different disk! The drive wasn't at fault after all.

I emailed the internal ZFS interest list with my saga, and quickly got a response. Another user, also running a Tyan 2885 dual-Opteron workstation like mine, had experienced data corruption with SATA disks. The root cause? A faulty power supply.

Since my data is still intact, and the performance isn't hampered at all, I haven't bothered to fix the problem yet. I've been running over a week now with a faulty setup which is still corrupting data on its way to the disk, and have yet to see a problem with my data, since ZFS handily detects and corrects these errors on the fly.
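The "detects and corrects" part can be sketched the same way. The Python below is a heavily simplified stand-in for a three-disk raidz: two data chunks plus one XOR parity chunk, with a checksum over the whole block. The real on-disk layout and repair logic are considerably more involved, and `write_stripe` and `read_stripe` are hypothetical names for this illustration only. On a checksum mismatch, the reader assumes each data disk in turn is the bad one, rebuilds its chunk from the other chunk and the parity, and keeps whichever combination matches the checksum.

```python
import hashlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_stripe(block: bytes):
    """Split an even-length block over two data 'disks' plus one XOR parity 'disk'.

    Returns the three disks and the checksum of the whole block.
    """
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    disks = [bytearray(d0), bytearray(d1), bytearray(xor_bytes(d0, d1))]
    return disks, hashlib.sha256(block).digest()


def read_stripe(disks, expected):
    """Return (block, repaired_disk); repaired_disk is None if no data disk was rewritten.

    Note: this sketch does not rewrite a damaged parity chunk; a real
    implementation would repair that as well.
    """
    d0, d1, parity = (bytes(d) for d in disks)
    candidates = [
        (None, d0, d1),                   # trust both data disks
        (0, xor_bytes(d1, parity), d1),   # assume disk 0 is bad, rebuild it
        (1, d0, xor_bytes(d0, parity)),   # assume disk 1 is bad, rebuild it
    ]
    for bad, c0, c1 in candidates:
        if hashlib.sha256(c0 + c1).digest() == expected:
            if bad is not None:
                # Self-heal: write the reconstructed chunk back over the bad copy.
                disks[bad][:] = (c0, c1)[bad]
            return c0 + c1, bad
    raise IOError("unrecoverable: no combination matches the checksum")


if __name__ == "__main__":
    block = b"golden source!!!"
    disks, cksum = write_stripe(block)
    disks[1][0] ^= 0xFF  # simulate corruption on its way to one disk
    data, repaired = read_stripe(disks, cksum)
    print(data == block, "repaired disk:", repaired)
```

The checksum is what makes the repair trustworthy: parity alone can show that the chunks disagree, but not which one is wrong.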

Eventually I suppose I'll get around to replacing that faulty power supply...

Comments:

So I guess the new term you just coined is "Throw software at the problem" ?? :-)

Posted by Asgeir S. Nilsen on November 29, 2005 at 06:45 AM CST #

That all sounds great. When do us "normals" get to play with it? -brian

Posted by Brian Hechinger on January 11, 2006 at 12:35 AM CST #

Do you think ZFS can integrate with EVMS ?

Posted by Humberto Ramirez on August 03, 2006 at 05:19 PM CDT #
