When ZFS first started, it was just Jeff trying to pair old problems with new solutions in margins too small to contain either. Then Matt joined up to bring some young blood to the project. By the time the project putback, the team had grown to more than a dozen. And now I've been pulled in -- if only for a cameo.
When ZFS first hit the streets, Jeff wrote about RAID-Z, an implementation of RAID designed for ZFS. RAID-Z improves upon previous RAID schemes primarily in that it eliminates the so-called "write hole" by using a full (and variable-sized) stripe for all write operations. It's worth noting that RAID-Z exploits the fact that ZFS is an end-to-end solution such that metadata (traditionally associated with the filesystem layer) is used to interpret the RAID layout on disk (an operation usually ascribed to a volume manager). In that post, Jeff mentioned that a double-parity version of RAID-Z was in the works. What he actually meant is that he had read a paper, and thought it might work out -- you'd be forgiven for inferring that actual code had been written.
Over lunch, Bill -- yet another elite ZFS hacker -- mentioned double-parity RAID-Z and their plans for implementing it. I pressed for details, read the paper, got interested in the math, and started yakking about it enough for Bill to tell me to put up or shut up.
The basic notion behind double-parity RAID or RAID-6 is that a stripe can survive two failures without losing data where RAID-5 can survive only a single failure. There are a number of different ways of implementing double-parity RAID; the way Jeff and Bill had chosen (due to its computational simplicity and lack of legal encumbrance) was one described by H. Peter Anvin in this paper. It's a nice read, but I'll attempt to summarize some of the math (warning: this summary is going to be boring and largely unsatisfying so feel free to skip it).
For a given stripe of n data blocks, D_{0} .. D_{n-1}, RAID-5 computes the contents of the parity disk P by taking the bitwise XOR of those data blocks. If any D_{n} is corrupted or missing, we can recover it by taking the XOR of all other data blocks with P. With RAID-6, we need to compute another prity disk Q using a different technique such that Q alone can reconstruct any D_{n} and P and Q together can reconstruction any two data blocks.
To talk about this, it's easier -- believe it or not -- to define a Galois field (or a finite field as I learned it) over the integers [0..255] -- the values that can be stored in a single byte. The addition field operation (+) is just bitwise XOR. Multiplication (x) by 2 is given by this bitwise operation for x x 2 = y:
y_{7} | = | x_{6} |
y_{6} | = | x_{5} |
y_{5} | = | x_{4} |
y_{4} | = | x_{3} + x_{7} |
y_{3} | = | x_{2} + x_{7} |
y_{2} | = | x_{1} + x_{7} |
y_{1} | = | x_{0} |
y_{0} | = | x_{7} |
A couple of simple things worth noting: addition (+) is the same as subtraction (-), 0 is the additive identity and the multiplicative annihilator, 1 is the multiplicative identity. Slightly more subtle: each element of the field except for 0 (i.e. [1..255]) can be represented as 2^{n} for some n. And importantly: x^{-1} = x^{254}. Also note that x x y can be rewritten as 2^{log x} x 2^{log y} or 2^{log x + log y} (where + in that case is normal integer addition).
We compute Q as
2^{n-1} D_{0} + 2^{n-2} D_{1} ... + D_{n-1}
or equivalently
((...(((D_{0} x 2 + D_{1} + ...) x 2 + D_{n-2}) x 2 + D_{n-1}.
Computing Q isn't much slower than computing P since we're just dealing with a few simple bitwise operations.
With P and Q we can recover from any two failures. If D_{x} fails, we can repair it with P. If P also fails, we can recover D_{x} by computing Q_{x} where Q_{i} = Q + 2^{n - 1 - x} x D_{x} (easily done by performing the same computation as for generating Q but with D_{x} set to 0); D_{x} is then (Q_{x} + Q) / 2^{n - 1 - x} = (Q_{x} + Q) x 2^{x + 1 - n}. Once we solve for D_{x}, then we recompute P as we had initially.
When two data disks are missing, D_{x} and D_{y}, that's when the rubber really meets the road. We compute P_{xy} and Q_{xy} such that P_{xy} + D_{x} + D_{y} = P and Q_{xy} + 2^{n - 1 - x} x D_{x} + 2^{n - 1 - y} x D_{y} = Q (as before). Using those two expressions and some basic algebra, we can solve for D_{x} and then plug that in to solve for D_{y}. The actual expressions are a little too hairy for HTML, but you can check out equation 16 in the paper or the code for the gory details.
As of build 42 of OpenSolaris, RAID-Z comes in a double-parity version to complement the existing single-parity version -- and it only took about 400 additional lines of code. Check out the code here. Of particular interest are the code to generate both parity blocks and the code to do double block reconstruction. What's especially cool about ZFS is that we don't just blithely reconstruct data, but we can verify it against the known checksum. This means, for example, that we could get back seemingly valid data from all disks, but fail the checksum; in that case we'd first try reconstructing each individual block, and then try reconstructing every pair of blocks until we've found something that checksums. You can see the code for combinatorial reconstruction here.
To make a double-parity RAID-Z vdev, specify raidz2 to
zpool(1M):
# zpool create pool raidz2 c1t0d0 c1t0d1 c1t0d2 c1t0d3 c1t0d4
This will create a pool with a double-parity RAID-Z vdev of width 5 where all data can sustain up to two failures be they corrupt data coming off the drives or drives that are failed or missing. The raidz vdev type continues to mean single-parity RAID-Z as does the new alias raidz1.
Double-parity RAID-Z is probably going to supplant the use of its single-parity predecessor in many if not most cases. As Dave Hitz of NetApp helpfully noted in a recent post double-parity RAID doesn't actually cost you any additional space because you'll typically have wider stripes. Rather than having two single-parity stripes of 5 disks each, you'll have one double-parity stripe with 10 disks -- the same capacity with extra protection against failures. It also shouldn't cost you in terms of performance because the total number of disk operations will be the same and the additional math, while slightly more complex, is still insignificant compared with actually getting bits on disk. So enjoy the extra parity.
Technorati Tags:
OpenSolaris
ZFS
A key component of the ZFS Hybrid Storage Pool is Logzilla, a very fast device to accelerate synchronous writes. This component hides the...
Thank you for this interesting blog entry.
I think you're referring to Roch's
recent post.
Roch's analysis is strictly for random reads (and largely ignores the
common case of sequential reads); since parity blocks aren't
normally read (a fact I think Roch didn't fully consider) the usage
calculations will be the same modulo the disk space consumed for
extra parity.
zmx,
A way to rephrase your question would be: can one convert between
raidz1 and raidz2 vdevs? The answer is no. While that capability would
be nice, it's a natural consequence of the way RAID-Z avoids the
"RAID-5 write hole". Recall that RAID-Z always makes full-stripe
writes; it does this by varying the size of the stripe. If, for example,
you have a 7+1 RAID-Z configuration a write of 3 sectors of data will
result in a stripe of width 4. Two such writes could coexist on the same
logical row of 8 sectors across the 8 disks in the RAID-Z vdev. With
traditional RAID-5 you can toss in an additional disk, generate parity, and
call it RAID-6, but with RAID-Z an individual row might contain several
parity blocks so a similar gimmick isn't possible. For what it's worth ZFS
has facilities for migrating filesystems between pools if you want to
change the underlying fault tolerance.
But still in case of small random reads two raidz1 in one group will be faster then one big raidz2 pool, right? (in terms of IOPS).
Yes, for example RAID-Z 8+2 would be more comparable -- in terms of Roch's analysis -- to RAID-Z 8+1 than to 2 x 4+1.
How to migrating filesystems between pools? The ZFS Administration Guide only mention "Migrating Storage Pools" (between machine)
You can use 'zfs send' and 'zfs receive'. See the zfs(1M) man page for details.