128K Suffice

Note: the throughput numbers for the raw device have been corrected since my initial post.

The question put forth is whether the ZFS 128K blocksize is sufficient
to saturate a regular disk. There is a great body of evidence showing
that bigger write sizes, with a matching large FS clustersize, lead
to more throughput. The counterpoint is that ZFS schedules its I/O
like nothing else seen before and manages to saturate a single disk
using enough concurrent 128K I/Os.

I first measured the throughput of a write(2) to the raw device, using
for instance this:

	dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead from reading the block from
/dev/zero and then issuing the write call. The tightest function that
fences the I/O is default_physio(). That function issues the I/O to
the device and then waits for it to complete. If we take the elapsed
time spent in this function and count the bytes that are I/O-ed, this
should give a good hint as to the throughput the device is
providing. The above dd command issues a single I/O at a time
(the d-script used to measure is attached).
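
The arithmetic behind that measurement can be sketched in a few lines. This Python is only an illustration of the integer math in the d-script's summary line; the function name physio_report and the sample numbers (226 one-megabyte writes) are picked for the example.

```python
MIB = 1024 * 1024

def physio_report(total_bytes, io_count, elapsed_ns):
    """Mirror the integer arithmetic of the d-script's summary line."""
    mb = total_bytes // MIB
    ms = elapsed_ns // 1_000_000
    avg_kb = total_bytes // io_count // 1024
    # bytes over nanoseconds, scaled to MB per second and truncated
    # the same way integer division truncates in the script
    mb_per_s = (total_bytes * 1_000_000_000) // (elapsed_ns * MIB)
    return mb, ms, avg_kb, mb_per_s

# 226 writes of 1 MB each, 4.824 s spent waiting inside physio
print(physio_report(226 * MIB, 226, 4_824_000_000))
# -> (226, 4824, 1024, 46)
```

Because everything is truncated to integers, the reported MB/s can read up to 1 MB/s lower than the real rate.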

Trying different blocksizes I see:

   Bytes;  elapsed time in physio; average I/O size; throughput

   8 MB;   3576 ms of phys; avg sz : 16 KB; throughput 2 MB/s
   9 MB;   1861 ms of phys; avg sz : 32 KB; throughput 4 MB/s
   31 MB;  3450 ms of phys; avg sz : 64 KB; throughput 8 MB/s
   78 MB;  4932 ms of phys; avg sz : 128 KB; throughput 15 MB/s
   124 MB; 4903 ms of phys; avg sz : 256 KB; throughput 25 MB/s
   178 MB; 4868 ms of phys; avg sz : 512 KB; throughput 36 MB/s
   226 MB; 4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
   226 MB; 4816 ms of phys; avg sz : 2048 KB; throughput 54 MB/s (was 46)
   32 MB;  686 ms of phys; avg sz : 4096 KB; throughput 58 MB/s (was 46)
   224 MB; 4741 ms of phys; avg sz : 8192 KB; throughput 59 MB/s (was 47)
   272 MB; 4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
   288 MB; 4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

The data was corrected after it was pointed out that physio will be
throttled by maxphys. The new data was obtained after setting

	/etc/system: set maxphys=8388608
	/kernel/drv/sd.conf: sd_max_xfer_size=0x800000
	/kernel/drv/ssd.conf: ssd_max_xfer_size=0x800000

	And setting un_max_xfer_size in "struct sd_lun".
	That address was figured out using dtrace and knowing that
	sdmin() calls ddi_get_soft_state (details available upon request).
	And of course disabling the write cache (using format -e).

	With this in place I verified that each sdwrite() of up to 8M
	leads to a single biodone interrupt, using this:

	dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

	Note that for 16M and 32M raw device writes, each default_physio
	will issue a series of 8M I/Os, so we don't
	expect any more throughput from that.
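
To make the last note concrete, here is a simplified model of how one raw write gets broken up once it exceeds the transfer limit. It is not the actual physio code, just a sketch assuming the request is cut into pieces of at most 0x800000 bytes; split_request is a name made up for the example.

```python
MAX_XFER = 0x800000  # 8 MB, the transfer limit configured above

def split_request(nbytes, max_xfer=MAX_XFER):
    """Return the sizes of the device I/Os one raw write is broken into."""
    sizes = []
    remaining = nbytes
    while remaining > 0:
        chunk = min(remaining, max_xfer)
        sizes.append(chunk)
        remaining -= chunk
    return sizes

for mb in (4, 8, 16, 32):
    parts = split_request(mb * 1024 * 1024)
    print(mb, "MB write ->", len(parts), "device I/O(s)")
```

Under that assumption a 16M write becomes two 8M I/Os and a 32M write becomes four, which is why the rates flatten out past 8M.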

The script used to measure the rates (phys.d) was also
modified, since I was counting the bytes before the I/O had
completed, and that made a big difference for the very large
I/O sizes.

If you take the 8M case, the above rates correspond to the
time it takes to issue and wait for a single 8M I/O to the
sd driver. So this time certainly includes 1 seek and about
0.13 seconds of data transfer, then the time to respond to
the interrupt, and finally the wakeup of the thread waiting in
default_physio(). Given that the data transfer rate using 4
MB is very close to the one using 8 MB, I'd say that at 60
MB/sec all the fixed-cost elements are well amortized. So I
would conclude from this that the limiting factor is now
the device itself or the data channel between the disk
and the host.
My disk is
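
The amortization argument can be sketched with a toy model: each one-at-a-time I/O pays a fixed cost (seek, interrupt, wakeup) plus a transfer time proportional to its size. Both constants below are assumptions chosen to roughly resemble the table, not measured values: 7.5 ms of fixed cost per I/O and a 60 MB/s media rate.

```python
MB = 1024 * 1024
FIXED_COST_S = 0.0075   # assumed: seek + interrupt + wakeup per I/O
MEDIA_RATE = 60 * MB    # assumed: sustained transfer rate in bytes/s

def modeled_throughput(io_bytes, fixed_cost_s=FIXED_COST_S,
                       media_rate=MEDIA_RATE):
    """Bytes per second when I/Os are issued strictly one at a time."""
    return io_bytes / (fixed_cost_s + io_bytes / media_rate)

for kb in (16, 128, 1024, 8192):
    rate = modeled_throughput(kb * 1024)
    print(f"{kb:5d} KB -> {rate / MB:5.1f} MB/s")
```

With these assumed constants the model climbs from about 2 MB/s at 16 KB to the mid 50s at 8 MB, the same shape as the measurements: once the transfer time dwarfs the fixed cost, a bigger I/O buys almost nothing.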


Now let's see what ZFS gets. I measure using a single dd process. ZFS
will chunk up the data in 128K blocks. Here the dd command only
interacts with memory; the I/Os are scheduled under the control of
spa_sync(). So in the d-script (attached) I check for the start of an
spa_sync and time it based on elapsed time. At the same time I gather
the number of bytes and keep a count of the I/Os (bdev_strategy) that
are being issued. When the spa_sync completes we are sure that all of
those are on stable storage. The script is a bit more complex because
there are 2 threads that issue spa_sync, but only one of them actually
becomes active. So the script will print out some spurious lines of
output at times. I measure I/O with the script while this runs:

	dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
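
As a sanity check, the throughput column can be recomputed from the first two columns. This small Python sketch parses the lines above and redoes the division; recompute is a name made up for the example, and since the MB and ms values are already truncated, an off-by-one against the script's own figure would not be surprising in general.

```python
import re

LINE = re.compile(
    r"(\d+) MB; (\d+) ms of spa_sync; avg sz : (\d+) KB; throughput (\d+) MB/s"
)

def recompute(line):
    """Return (reported, recomputed) MB/s for one line of script output."""
    mb, ms, _avg_kb, reported = map(int, LINE.search(line).groups())
    return reported, mb * 1000 // ms

output = """\
1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s
"""

for line in output.splitlines():
    print(recompute(line))
```

For these five lines the recomputed values agree with the reported ones.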

OK, I cheated. Here, ZFS is given a full disk to play with. In this
case ZFS enables the write cache. Note that even with the write cache
enabled, when the spa_sync() completes, it will be after a flush of
the cache has been executed. So the 60MB/sec does correspond to data
sent to the platter. I just tried disabling the cache (with format -e)
but I am not sure if that is taken into account by ZFS; the results
are the same 60MB/sec. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon
as we are issuing 16KB I/Os. Here, clearly, the data is not on the
platter when the timed function completes.

Another variable not fully controlled is the physical (cylinder)
location of the I/Os. It could be that some of the differences come
from that.

What do I take away?

	A single 2MB physical I/O gets 54 MB/sec out of my disk
	(46 before the maxphys correction).

	35 concurrent 128K I/Os, sustained, followed by metadata I/O
	followed by a flush of the write cache, allow ZFS to get 60
	MB/sec out of the same disk.
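
A rough way to see why many small I/Os can match one large one is to count the bytes outstanding at the device. This is only a back-of-the-envelope reading of the numbers above, not something measured, and outstanding_mb is a made-up helper name.

```python
def outstanding_mb(concurrent_ios, io_kb):
    """Megabytes in flight when concurrent_ios I/Os of io_kb KB are queued."""
    return concurrent_ios * io_kb / 1024

print(outstanding_mb(35, 128))   # 4.375
print(outstanding_mb(1, 2048))   # 2.0
```

So 35 queued 128K writes keep a bit over 4 MB in front of the disk at all times, in the same range as the single multi-megabyte writes of the raw device test.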

This is what underwrites my belief that the 128K blocksize is
sufficiently large. Now, nothing here proves that 256K would not give
more throughput, so nothing is really settled. But I hope this helps
put us on common ground.

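#!/usr/sbin/dtrace -qs

/* Measure throughput going through physio (dd to raw) */
/* Assumes timing runs from default_physio entry to return, as described above. */

BEGIN {
	b = 0; /* Bytecount */
	cnt = 0; /* phys iocount */
	delta = 0; /* time delta */
	tt = 0; /* timestamp */
}

/* Remember the start time and the size of the request */
default_physio:entry {
	tt = timestamp;
	self->b = (args[5]->uio_iov->iov_len);
}

/* Count the bytes only once the I/O has completed */
default_physio:return
/tt != 0/ {
	cnt ++;
	b += self->b;
	delta += (timestamp - tt);
}

/* Print a summary line for the interval if any I/O completed */
tick-5s
/delta != 0/ {
	printf("%d MB; %d ms of phys; avg sz : %d KB; throughput %d MB/s\n",
		b / 1048576,
		delta / 1000000,
		b / cnt / 1024,
		(b * 1000000000) / (delta * 1048576));
}

/* Start a fresh 5 second interval */
tick-5s {
	b = 0; delta = 0; cnt = 0; tt = 0;
}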

#!/usr/sbin/dtrace -qs

/*
 * Measure I/O throughput as generated by spa_sync.
 * Between the spa_sync entry and return probe
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN {
	tt = 0; /* timestamp */
	b = 0; /* Bytecount */
	cnt = 0; /* iocount */
}

/* Only the first thread to enter spa_sync starts the clock */
spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
	b = 0; /* reset the I/O byte count */
	cnt = 0;
	tt = timestamp;
	self->t = 1;
}

/* The thread that started the clock prints the summary on return */
spa_sync:return
/(self->t == 1) && (tt != 0)/
{
	this->delta = (timestamp-tt);
	this->cnt = (cnt == 0) ? 1 : cnt; /* avoid divide by 0 */
	printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
		b / 1048576,
		this->delta / 1000000,
		b / this->cnt / 1024,
		(b * 1000000000) / (this->delta * 1048576));
	tt = 0;
	self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
	cnt ++;
	b += (args[0]->b_bcount);
}

What kind of disk is it? I mean, did you use FC or SCSI (or something else)? I ask because using FC disks I see different results than yours: 8MB I/Os using dd do produce much better throughput than ZFS with its 128KB I/Os (on the same disk). Maybe FC introduces more overhead in terms of protocol (SCSI inside FC), so it benefits more from larger I/Os?

Posted by Robert Milkowski on May 23, 2006 at 03:45 AM MEST
