
The Wonders of ZFS Storage
Performance for your Data

  • ZFS | May 31, 2006

WHEN TO (AND NOT TO) USE RAID-Z

Roch Bourbonnais
Principal Performance Engineer
  

RAID-Z is the technology used by ZFS to implement a data-protection scheme which is less costly than mirroring in terms of block overhead. Here, I'd like to go over, from a theoretical standpoint, the performance implications of using RAID-Z.

The goal of this technology is to allow a storage subsystem to deliver the stored data in the face of one or more disk failures. This is accomplished by joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z groups can be dynamically striped to form a larger storage pool.

To store file data onto a RAID-Z group, ZFS will spread a filesystem (FS) block onto the N devices that make up the group. So for each FS block, (N - 1) devices will hold file data and 1 device will hold parity information. This information would eventually be used to reconstruct (or resilver) data in the face of any device failure. We thus have 1 / N of the available disk blocks used to store the parity information. A 10-disk RAID-Z group has 9/10th of its blocks effectively available to applications.

A common alternative for data protection is the use of mirroring. In this technology, a filesystem block is stored onto 2 (or more) mirror copies. Here again, the system will survive a single disk failure (or more with N-way mirroring). So a 2-way mirror actually delivers similar data protection at the expense of providing applications access to only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z group achieves its protection by spreading a ZFS block onto the N underlying devices. That means that a single ZFS block I/O must be converted to N device I/Os. To be more precise, in order to access a ZFS block, we need N device I/Os for output and (N - 1) device I/Os for input, as the parity data need not generally be read in.

Now after a request for a ZFS block has been spread this way, the I/O scheduling code will take control of all the device I/Os that need to be issued. At this stage, the ZFS code is capable of aggregating adjacent physical I/Os into fewer ones. Because of the ZFS Copy-On-Write (COW) design, we actually do expect this reduction in the number of device-level I/Os to work extremely well for just about any write-intensive workload. We also expect it to help streaming input loads significantly.

The situation of random inputs is one that needs special attention when considering RAID-Z. Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices each capable of 200 IOPS will globally act as a single 200-IOPS-capable RAID-Z group. This is the price to pay to achieve proper data protection without the 2X block overhead associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices. Half of the available IOPS are thus lost to mirroring. However, for inputs each side of a mirror can service read calls independently from the other, since each side holds the full information. Given a proper software implementation that balances the inputs between the sides of a mirror, the FS blocks delivered by a mirrored group are actually no less than what a simple non-protected RAID-0 stripe would give.
So looking at a random-access input load, and given N devices to be grouped either as RAID-Z, as a 2-way mirror, or simply striped (a.k.a. RAID-0, no data protection!), the delivered FS blocks per second (FSBPS) would be as follows (where dev represents the capacity of a single device, in blocks or in IOPS depending on the column):

  Config      Blocks Available    Random FS Blocks / sec
  ----------  ------------------  ----------------------
  RAID-Z      (N - 1) * dev       1 * dev
  Mirror      (N / 2) * dev       N * dev
  Stripe      N * dev             N * dev
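To make the arithmetic concrete, here is a minimal sketch of the model above in Python. The function names and units are my own, purely for illustration, not anything from the ZFS code base: each function takes the group size, the per-device random IOPS, and the per-device capacity, and returns the usable capacity and the delivered random-read FS blocks per second for that layout.

```python
# Minimal sketch of the FSBPS model above; names and units are mine, not ZFS code.

def raidz(n, dev_iops, dev_gb):
    # Each FS block spans the whole group ((N - 1) data columns + 1 parity),
    # so a random read keeps every device in the group busy: ~1 device of IOPS.
    return {"blocks_gb": (n - 1) * dev_gb, "random_fsbps": 1 * dev_iops}

def mirror(n, dev_iops, dev_gb):
    # Each side holds a full copy: half the blocks, but every device can
    # serve random reads independently.
    return {"blocks_gb": (n // 2) * dev_gb, "random_fsbps": n * dev_iops}

def stripe(n, dev_iops, dev_gb):
    # RAID-0: no protection, all blocks and all random-read IOPS available.
    return {"blocks_gb": n * dev_gb, "random_fsbps": n * dev_iops}

# Example: a 10-disk group of 100 GB, 200-IOPS devices.
for layout in (raidz, mirror, stripe):
    print(layout.__name__, layout(10, 200, 100))
```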

Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and look at different possible configurations. In the table below, the configuration labeled "Z 5 x (19+1)" refers to a dynamic stripe of 5 RAID-Z groups, each group made of 20 disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a simple dynamic stripe.

  Config          Blocks Available    Random FS Blocks / sec
  --------------  ------------------  ----------------------
  Z  1 x (99+1)        9900 GB             200
  Z  2 x (49+1)        9800 GB             400
  Z  5 x (19+1)        9500 GB            1000
  Z 10 x  (9+1)        9000 GB            2000
  Z 20 x  (4+1)        8000 GB            4000
  Z 33 x  (2+1)        6600 GB            6600
  M  2 x (50)          5000 GB           20000
  S  1 x (100)        10000 GB           20000
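For what it's worth, the RAID-Z rows of this table fall straight out of the model. The short Python sketch below is my own illustration (the helper name and constants are assumptions for this example, not a ZFS tool): it walks the RAID-Z configurations above as dynamic stripes of identical groups, assuming 100 disks of 100 GB at 200 random IOPS each.

```python
# My own sketch reproducing the RAID-Z rows of the table above;
# assumes 100 disks of 100 GB, each capable of 200 random IOPS.

DISKS, GB_PER_DISK, IOPS_PER_DISK = 100, 100, 200

def raidz_stripe(groups, data_disks_per_group):
    disks_per_group = data_disks_per_group + 1        # + 1 parity column per group
    assert groups * disks_per_group <= DISKS
    blocks_gb = groups * data_disks_per_group * GB_PER_DISK
    random_fsbps = groups * IOPS_PER_DISK             # ~1 device of IOPS per group
    return blocks_gb, random_fsbps

for groups, data in [(1, 99), (2, 49), (5, 19), (10, 9), (20, 4), (33, 2)]:
    gb, fsbps = raidz_stripe(groups, data)
    print(f"Z {groups} x ({data}+1): {gb} GB, {fsbps} FS blocks/sec")
```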

So RAID-Z gives you at most 2X the number of blocks that mirroring provides, but hits you with far fewer delivered IOPS. That means that, as the number of devices N in a group increases, the expected gain over mirroring (in disk blocks) is bounded (to at most 2X), but the expected cost in IOPS is not bounded (a cost in the range of [N/2, N] fewer IOPS).

Note that for wide RAID-Z configurations, ZFS takes into account the sector size of devices (typically 512 bytes) and dynamically adjusts the effective number of columns in a stripe. So even if you request a 99+1 configuration, the actual data will probably be stored on many fewer data columns than that. Hopefully this article will contribute to steering deployments away from those types of configurations.

In conclusion, when preserving IOPS capacity is important, the size of RAID-Z groups should be restrained to smaller sizes and one must accept some level of disk block overhead. When performance matters most, mirroring should be highly favored. If mirroring is considered too costly but performance is nevertheless required, one could proceed like this:

  Given N devices each capable of X IOPS.
  Given a target of Y delivered FS blocks per second for the storage pool.
  Build your storage using dynamically striped RAID-Z groups of (Y / X) devices.

For instance:

  Given 50 devices each capable of 200 IOPS.
  Given a target of 1000 delivered FS blocks per second for the storage pool.
  Build your storage using dynamically striped RAID-Z groups of (1000 / 200) = 5 devices.

In that system we would then have 20% block overhead lost to maintaining RAID-Z parity.

RAID-Z is a great technology not only when disk blocks are your most precious resource but also when your available IOPS far exceed your expected needs. But beware: if you get your hands on fewer, very large disks, IOPS capacity can easily become your most precious resource. Under those conditions, mirroring should be strongly favored, or alternatively a dynamic stripe of RAID-Z groups each made up of a small number of devices.
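As a rough illustration of that recipe (the function name and structure below are my own, not a ZFS tool), one could compute the group size, the parity overhead, and the resulting random-read rate like this:

```python
# My own sketch of the sizing rule above: RAID-Z groups of (Y / X) devices.

def plan_pool(n_devices, dev_iops, target_fsbps):
    group_size = max(2, target_fsbps // dev_iops)   # (Y / X), at least 1 data + 1 parity
    n_groups = n_devices // group_size              # dynamically striped RAID-Z groups
    return {
        "devices_per_group": group_size,
        "groups": n_groups,
        "parity_overhead": 1 / group_size,          # 1 parity column per group
        "random_fsbps": n_groups * dev_iops,        # ~1 device of IOPS per group
    }

# The article's example: 50 devices at 200 IOPS, target of 1000 FS blocks/sec.
print(plan_pool(50, 200, 1000))
# -> groups of 5 devices, 10 groups, 20% parity overhead
```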

Comments (11)
  • Marc Wednesday, May 31, 2006
    This is a very important article, and thank you for writing it. I have to admit that I did not understand everything. In particular, it always seemed to me that in a first approximation, raid5 (and raidz) was just (n-1) striped disks and the last one used for parity (except that the disk used for parity is not always the same). But if that were the case, there would not be such a huge performance hit between raid0 and raidz, at least for input. If you get the time some day, could you expand a bit on this please? Anyway, thank you, this will definitely change the way I am using raidz.
  • Roch Wednesday, May 31, 2006
    Because in raid5, a file system block goes to a single disk. So a re-read hits only that disk. The downside of that design is that to write the block to that one disk, you need to read the previous block, read the parity, update the parity, then write both the new block and the parity. And since those 2 writes cannot usually be done atomically, you have the write hole mess and potential silent data corruption.
  • ux-admin Wednesday, May 31, 2006
    Great article, simply great! As soon as 10 6/06 is out, I will dearly need this information, since I have an enterprise storage server to configure. This is exactly what was needed; I even printed this article, just in case. Keep 'em coming!
  • Bill Todd Wednesday, May 31, 2006
    So in order to avoid the need for no more than 4 disk I/Os on each block write (two reads - one or both of which may be satisfied by cached data - plus two writes), you instead force a write to *all* the disks in the stripe (often more than 4 disk I/O operations, though all can be performed in parallel) *plus* degrade parallel read performance from N * dev IOPS to 1 * dev IOPS. Some people would suggest that that's a lousy trade-off, especially given the option to capture the modifications efficiently elsewhere (say, mirrored, with the new locations logged as look-aside addresses) and perform the stripe update lazily later. ZFS has some really neat innovations, but this does not appear to be one of them. - bill
  • Roch Thursday, June 1, 2006
    ZFS has other technologies that will affect the number of IOPS that effectively happen (the mitigating factors), both for inputs and outputs. For streaming purposes, RAID-Z performance will be on par with anything else. Now, the article highlighted the tradeoffs one is faced with given a bunch of disks and the need for data protection: mirror or RAID-Z. That is the question that many of the small people out there will be facing. The question that Bill Todd raises is a different interesting issue: given a RAID-5 controller, 1GB of NVRAM and a bunch of disks, should I throw away the controller or keep it? That is a much more complex question...
  • Roch quoting Jeff B Thursday, June 1, 2006
    On : zfs-discuss@opensolaris.org There's an important caveat I want to add to this. When you're doing sequential I/Os, or have a write-mostly workload, the issues that Roch explained so clearly won't come into play. The trade-off between space-efficient RAID-Z and IOP-efficient mirroring only exists when you're doing lots of small random reads. If your I/Os are large, sequential, or write-mostly, then ZFS's I/O scheduler will aggregate them in such a way that you'll get very efficient use of the disks regardless of the data replication model. It's only when you're doing small random reads that the difference between RAID-Z and mirroring becomes significant.... Jeff
  • Bill Todd Thursday, June 1, 2006
    While streaming/large-sequential writes won't be any worse than in more conventional RAID-5, a write-mostly workload using *small* (especially single-block) writes will be if the stripe width exceeds 4 disks: all the disks will be written, vs. just two disk writes (plus at most two disk reads) in a conventional RAID-5 implementation - at least if the updates must be applied synchronously. If the updates can be deferred until several accumulate and can be written out together (assuming that revectoring them to new disk locations - even if they are updates to existing file data rather than appended data - is part of ZFS's bag of tricks), then Jeff's explanation makes more sense. And ISTR some mention of a 'log' for small synchronous updates that might function in the manner I suggested (temporarily capturing the update until it could be applied lazily - of course, the log would need to be mirrored to offer equivalent data protection until that occurred). The impact on small, parallel reads remains a major drawback, and suggesting that this is necessary for data integrity seems to be a red herring if indeed ZFS can revector new writes at will, since it can just delay logging the new locations until both data and parity updates have completed. If there's some real problem doing this, I'm curious as to what it might be. - bill
  • Roch Saturday, June 3, 2006
    So yes, the synchronous writes go through the ZFS Intent Log (ZIL), and Jeff mentioned this week that mirroring those tmp blocks N-way seems a good idea. ZFS does revector all writes to new locations (even block updates), and that allows it to stream just about any write-intensive workload at top speed. It does seem possible, as you suggest, to implement raid-5 with ZFS; I suggest that would lead to a 4X degradation for all output-intensive workloads. Since we won't overwrite live data, to output X MB of data, ZFS would have to read X MB of freed blocks, read X MB of parity, then write those blocks with new data. Maybe some of the reads could already be cached, but it's not clear that they commonly would be. Maybe what this is saying is that RAID-5 works nicely for filesystems that allow themselves to overwrite live data. That may be OK, but it does seem to require NVRAM to work. This is just not the design point of ZFS. It appears that for ZFS, RAID-5 would pessimise all write-intensive workloads and RAID-Z pessimises non-cached random-read type loads.
  • Bill Todd Monday, June 5, 2006
    Don't be silly. You'd output X MB in full-stripe writes (possibly handling the last stripe specially), just as you do now, save that instead of smearing each block across all the drives you'd write them to individual drives (preferably in contiguous groups within a single file) such that they could be read back efficiently later. In typical cases (new file creation or appends to files rather than in-file updates to existing data) you wouldn't even experience any free-space checkerboarding, but where you did it could be efficiently rectified during your subsequent scrubbing passes (especially within large files where all the relevant metadata would be immediately at hand; in small files you'd need to remember some information about what needed reconsolidating, though given that in typical environments large files consume most of the space - though not necessarily most of the access activity - using parity RAID for small files is of questionable utility anyway). And there's no need for NVRAM with the full-stripe writes you'd still be using. (Hope this doesn't show up as a double post - there seemed to be some problem with the first attempt.) - bill
  • Bill Todd Monday, June 5, 2006
    On free-space checkerboarding and file defragmentation: You don't have to remember anything if you're really doing not-too-infrequent integrity scrubs, since as you scrub each file you're reading it all in anyway and have a golden opportunity to write it back out in large, efficient, contiguous chunks, whereas with small files that are fragmenting the free space you can just accumulate them until you've got a large enough batch to write out (again, efficiently) to a location better suited to their use. As long as you don't get obsessive about consolidating the last few small chunks of free space (i.e., are willing to leave occasional unusable free holes because they're negligible in total size), this should work splendidly. - bill
  • Bill Todd Monday, June 5, 2006
    Rats: given your snapshot/cloning mechanisms, you can't rearrange even current data as cavalierly as that, I guess. In fact, how *do* you avoid rather significant free-space fragmentation issues (and, though admittedly in atypical files, severe performance degradation in file access due to extreme small-update-induced fragmentation) in the presence of (possibly numerous and long-lived) snapshots (or do you just let performance degrade as large areas of free space become rare and make up what would have been large writes - even of single blocks, in the extreme case - out of small pieces of free space)? Possibly an incentive to have explored metadata-level snapshotting similar to, e.g., Interbase's versioning, since adding an entire new level of indirection at the block level just to promote file and free-space contiguity gets expensive as the system size increases beyond the point where all the indirection information can be kept memory-resident. - bill