RAID-Z is the technology used by ZFS to implement a data-protection scheme
which is less costly than mirroring in terms of disk blocks consumed.

Here, I'd like to go over, from a theoretical standpoint, the
performance implications of using RAID-Z. The goal of this technology
is to allow a storage subsystem to deliver the stored data
in the face of one or more disk failures. This is accomplished by
joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z
groups can be dynamically striped to form a larger storage pool.

To store file data onto a RAID-Z group, ZFS will spread a filesystem
(FS) block onto the N devices that make up the group. So for each FS
block, (N - 1) devices will hold file data and 1 device will hold
parity information. This information will eventually be used to
reconstruct (or resilver) data in the face of any device failure. We
thus have 1 / N of the available disk blocks used to store the
parity information. A 10-disk RAID-Z group has 9/10ths of its
blocks effectively available to applications.
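As a quick check of the arithmetic above, the usable fraction can be computed directly (a minimal Python sketch; the function name is my own):

```python
def raidz_usable_fraction(n_disks: int) -> float:
    """Fraction of blocks in an N-disk RAID-Z group left for data:
    (N - 1) data columns out of N (one column holds parity)."""
    if n_disks < 2:
        raise ValueError("a RAID-Z group needs at least 2 disks")
    return (n_disks - 1) / n_disks

# A 10-disk group keeps 9/10 of its blocks for application data.
print(raidz_usable_fraction(10))  # → 0.9
```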

A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored onto 2 (or more) mirror
copies. Here again, the system will survive a single disk failure (or
more with N-way mirroring). So a 2-way mirror delivers similar
data protection at the expense of providing applications access to
only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted to N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, as the parity data need not generally be read in.

Now, after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually expect this reduction in the
number of device-level I/Os to work extremely well for just about any
write-intensive workload. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.

Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices, each capable of 200 IOPS, will
globally act as a single 200-IOPS-capable RAID-Z group. This is the
price to pay to achieve proper data protection without the 2X block
overhead associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
inputs each side of a mirror can service read calls independently from
the other, since each side holds the full information. Given a
proper software implementation that balances the inputs between sides
of a mirror, the FS blocks delivered by a mirrored group are actually
no less than what a simple non-protected RAID-0 stripe would give.

So let's look at random-access input load in terms of the number of FS
blocks per second (FSBPS). Given N devices to be grouped either in
RAID-Z, 2-way mirrored, or simply striped (a.k.a. RAID-0, no data
protection!), the equations would be (where dev represents the
capacity, in blocks or in IOPS, of a single device):

		Blocks Available	FS Blocks / sec
		----------------	---------------
RAID-Z		(N - 1) * dev		1 * dev
Mirror		(N / 2) * dev		N * dev
Stripe		N * dev			N * dev
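These three rows can be written as a small function (a sketch; `dev_blocks` and `dev_iops` are hypothetical names for a single device's block capacity and random-read IOPS):

```python
def capacity_and_fsbps(n: int, dev_blocks: int, dev_iops: int, layout: str):
    """Return (blocks available, random-read FS blocks/sec) for N devices."""
    if layout == "raidz":       # one group, one parity column
        return (n - 1) * dev_blocks, 1 * dev_iops
    if layout == "mirror":      # 2-way mirror; both sides serve reads
        return (n // 2) * dev_blocks, n * dev_iops
    if layout == "stripe":      # RAID-0, no data protection
        return n * dev_blocks, n * dev_iops
    raise ValueError(f"unknown layout: {layout}")
```

For 10 devices of 100 blocks and 200 IOPS each, this gives (900, 200) for RAID-Z, (500, 2000) for a mirror, and (1000, 2000) for a stripe.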

Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below, the
configuration labeled:
	"Z 5 x (19+1)"

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.

	 Config		Blocks Available	FS Blocks / sec
	 ------------	----------------	---------------
	 Z 1  x (99+1)	9900 GB			  200
	 Z 2  x (49+1)	9800 GB			  400
	 Z 5  x (19+1)	9500 GB			 1000
	 Z 10 x (9+1)	9000 GB			 2000
	 Z 20 x (4+1)	8000 GB			 4000
	 Z 33 x (2+1)	6600 GB			 6600

	 M  2 x (50)	5000 GB			20000
	 S  1 x (100)	10000 GB		20000
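The RAID-Z rows of this table can be reproduced with a short script (a sketch of the model used in the article: each group delivers the random-read IOPS of a single device, and dynamically striped groups add up; names are my own):

```python
DISKS, GB_PER_DISK, IOPS_PER_DISK = 100, 100, 200

def raidz_config(groups: int, data_disks: int, parity: int = 1):
    """(GB available, FS blocks/sec) for a dynamic stripe of
    `groups` RAID-Z groups, each of (data_disks + parity) disks."""
    assert groups * (data_disks + parity) <= DISKS
    return groups * data_disks * GB_PER_DISK, groups * IOPS_PER_DISK

for groups, data in [(1, 99), (2, 49), (5, 19), (10, 9), (20, 4), (33, 2)]:
    gb, fsbps = raidz_config(groups, data)
    print(f"Z {groups:2d} x ({data}+1)  {gb:5d} GB  {fsbps:5d}")
```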

So RAID-Z gives you at most 2X the number of blocks that mirroring
provides, but hits you with much fewer delivered IOPS. That means
that, as the number of devices in a group N increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (a cost in the range of [N/2, N]
times fewer IOPS).

Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of devices (typically 512 bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on many
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configurations.

In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups should be restrained to smaller sizes, and one must
accept some level of disk-block overhead.

When performance matters most, mirroring should be highly favored.  If
mirroring  is considered too   costly but performance  is nevertheless
required, one could proceed like this:

	Given N devices each capable of X IOPS.

	Given a target of Y delivered FS blocks per second
	for the storage pool.

	Build your storage using dynamically  striped RAID-Z groups of
	(Y / X) devices.

For instance: 

	Given 50 devices each capable of 200 IOPS.

	Given a target of delivered 1000 FS blocks per second
	for the storage pool.

	Build your storage using dynamically striped RAID-Z groups of
	(1000 / 200) = 5 devices.

In that system we would then have a 20% block overhead lost to
maintaining RAID-Z parity.
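That recipe can be captured in a tiny helper (a sketch of the (Y / X) rule of thumb above; rounding up keeps the delivered rate at or above the target, and the function name is my own):

```python
def raidz_group_size(dev_iops: int, target_fsbps: int) -> int:
    """Devices per RAID-Z group per the (Y / X) rule of thumb,
    where X is per-device IOPS and Y the target FS blocks/sec."""
    size = -(-target_fsbps // dev_iops)   # ceiling division
    return max(2, size)                   # at least 1 data + 1 parity disk

print(raidz_group_size(200, 1000))  # → 5, matching the worked example
```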

RAID-Z is a great technology not only when disk blocks are your most
precious resource but also when your available IOPS far exceed your
expected needs. But beware that if you get your hands on a small
number of very large disks, IOPS capacity can easily become your most
precious resource. Under those conditions, mirroring should be
strongly favored, or alternatively a dynamic stripe of RAID-Z groups,
each made up of a small number of devices.


This is a very important article, and thank you for writing it. I have to admit that I did not understand everything. In particular, it always seemed to me that in a first approximation, raid5 (and raidz) was just (n-1) striped disks and the last one used for parity (except that the disk used for parity is not always the same). But if that were the case, there would not be such a huge performance hit between raid0 and raidz, at least for input. If you get the time some day, could you expand a bit on this please? Anyway, thank you, this will definitely change the way I am using raidz.

Posted by Marc on May 31, 2006 at 06:54 AM MEST #

Because in raid5, a file system block goes to a single disk. So a re-read hits only that disk. The downside of that design is that to write the block to that one disk, you need to read the previous block, read the parity, update the parity, then write both the new block and parity. And since those 2 writes cannot usually be done atomically, you have the write hole mess and potential silent data corruption.

Posted by Roch on May 31, 2006 at 09:48 AM MEST #

Great article, simply great! As soon as 10 6/06 is out, I will dearly need this information, since I have an enterprise storage server to configure. This is exactly what was needed; I even printed this article, just in case. Keep 'em coming!

Posted by ux-admin on May 31, 2006 at 11:00 AM MEST #

So in order to avoid the need for no more than 4 disk I/Os on each block write (two reads - one or both of which may be satisfied by cached data - plus two writes), you instead force a write to *all* the disks in the stripe (often more than 4 disk I/O operations, though all can be performed in parallel) *plus* degrade parallel read performance from N * dev IOPS to 1 * dev IOPS. Some people would suggest that that's a lousy trade-off, especially given the option to capture the modifications efficiently elsewhere (say, mirrored, with the new locations logged as look-aside addresses) and perform the stripe update lazily later. ZFS has some really neat innovations, but this does not appear to be one of them. - bill

Posted by Bill Todd on May 31, 2006 at 06:33 PM MEST #

ZFS has other technologies that will affect the number of IOPS that effectively happens (the mitigating factors) both for Inputs and Outputs. For Streaming purposes RAID-Z performance will be on par with anything else. Now, the article highlighted the tradeoffs one is faced with given a bunch of disks and the need for data protection: mirror or RAID-Z. That is the question that many of the small people will be facing out there. The question that Bill Todd raises is a different interesting issue: Given a RAID-5 controller, 1GB of NVRAM and a bunch of disks, should I throw away the controller or keep it. That is a much more complex question...

Posted by Roch on June 01, 2006 at 02:31 AM MEST #

On zfs-discuss@opensolaris.org: There's an important caveat I want to add to this. When you're doing sequential I/Os, or have a write-mostly workload, the issues that Roch explained so clearly won't come into play. The trade-off between space-efficient RAID-Z and IOP-efficient mirroring only exists when you're doing lots of small random reads. If your I/Os are large, sequential, or write-mostly, then ZFS's I/O scheduler will aggregate them in such a way that you'll get very efficient use of the disks regardless of the data replication model. It's only when you're doing small random reads that the difference between RAID-Z and mirroring becomes significant.... Jeff

Posted by Roch quoting Jeff B on June 01, 2006 at 03:27 AM MEST #

While streaming/large-sequential writes won't be any worse than in more conventional RAID-5, a write-mostly workload using *small* (especially single-block) writes will be if the stripe width exceeds 4 disks: all the disks will be written, vs. just two disk writes (plus at most two disk reads) in a conventional RAID-5 implementation - at least if the updates must be applied synchronously. If the updates can be deferred until several accumulate and can be written out together (assuming that revectoring them to new disk locations - even if they are updates to existing file data rather than appended data - is part of ZFS's bag of tricks), then Jeff's explanation makes more sense. And ISTR some mention of a 'log' for small synchronous updates that might function in the manner I suggested (temporarily capturing the update until it could be applied lazily - of course, the log would need to be mirrored to offer equivalent data protection until that occurred). The impact on small, parallel reads remains a major drawback, and suggesting that this is necessary for data integrity seems to be a red herring if indeed ZFS can revector new writes at will, since it can just delay logging the new locations until both data and parity updates have completed. If there's some real problem doing this, I'm curious as to what it might be. - bill

Posted by Bill Todd on June 01, 2006 at 05:04 PM MEST #

So yes, the synchronous writes go through the ZFS Intent Log (zil) and Jeff mentioned this week that mirroring those tmp blocks N-way seems a good idea. ZFS does revector all writes to new locations (even block updates) and that allows it to stream just about any write-intensive workload at top speed. It does seem possible, as you suggest, to implement raid-5 with ZFS; I suggest that would lead to a 4X degradation to all output-intensive workloads; since we won't overwrite live data, to output X MB of data, ZFS would have to read X MB of freed blocks, X MB of parity, then write those blocks with new data. Maybe some of the reads could already be cached but it's not quite clear that they commonly would be. Maybe what this says is that RAID-5 works nicely for filesystems that allow themselves to overwrite live data. That may be OK but does seem to require NVRAM to work. This is just not the design point of ZFS. It appears that for ZFS, RAID-5 would pessimise all write-intensive workloads and RAID-Z pessimises non-cached random-read type loads.

Posted by Roch on June 03, 2006 at 08:20 AM MEST #

Don't be silly. You'd output X MB in full-stripe writes (possibly handling the last stripe specially), just as you do now, save that instead of smearing each block across all the drives you'd write them to individual drives (preferably in contiguous groups within a single file) such that they could be read back efficiently later. In typical cases (new file creation or appends to files rather than in-file updates to existing data) you wouldn't even experience any free-space checkerboarding, but where you did it could be efficiently rectified during your subsequent scrubbing passes (especially within large files where all the relevant metadata would be immediately at hand; in small files you'd need to remember some information about what needed reconsolidating, though given that in typical environments large files consume most of the space - though not necessarily most of the access activity - using parity RAID for small files is of questionable utility anyway). And there's no need for NVRAM with the full-stripe writes you'd still be using. (Hope this doesn't show up as a double post - there seemed to be some problem with the first attempt.) - bill

Posted by Bill Todd on June 05, 2006 at 06:43 AM MEST #

On free-space checkerboarding and file defragmentation: You don't have to remember anything if you're really doing not-too-infrequent integrity scrubs, since as you scrub each file you're reading it all in anyway and have a golden opportunity to write it back out in large, efficient, contiguous chunks, whereas with small files that are fragmenting the free space you can just accumulate them until you've got a large enough batch to write out (again, efficiently) to a location better suited to their use. As long as you don't get obsessive about consolidating the last few small chunks of free space (i.e., are willing to leave occasions unusable free holes because they're negligible in total size), this should work splendidly. - bill

Posted by Bill Todd on June 05, 2006 at 06:26 PM MEST #

Rats: given your snapshot/cloning mechanisms, you can't rearrange even current data as cavalierly as that, I guess. In fact, how *do* you avoid rather significant free-space fragmentation issues (and, though admittedly in atypical files, severe performance degradation in file access due to extreme small-update-induced fragmentation) in the presence of (possibly numerous and long-lived) snapshots (or do you just let performance degrade as large areas of free space become rare and make up what would have been large writes - even of single blocks, in the extreme case - out of small pieces of free space)? Possibly an incentive to have explored metadata-level snapshotting similar to, e.g., Interbase's versioning, since adding an entire new level of indirection at the block level just to promote file and free-space contiguity gets expensive as the system size increases beyond the point where all the indirection information can be kept memory-resident. - bill

Posted by Bill Todd on June 05, 2006 at 10:46 PM MEST #
