Throughput numbers to a raw device were corrected since my initial post. The question put forth is whether the ZFS 128K blocksize is sufficient to saturate a regular disk. There is a great body of evidence showing that bigger write sizes, with a matching large filesystem cluster size, lead to more throughput. The counterpoint is that ZFS schedules its I/O like nothing seen before and manages to saturate a single disk using enough concurrent 128K I/Os.

So I am proposing this for review by the community. I first measured the throughput of a write(2) to a raw device using, for instance, this:

dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024

On Solaris we would see some overhead from reading the block from /dev/zero and then issuing the write call. The tightest function that fences the I/O is default_physio(). That function issues the I/O to the device and then waits for it to complete. If we take the elapsed time spent in this function and count the bytes that are transferred, this should give a good hint of the throughput the device is providing. The above dd command issues a single I/O at a time (the D script used to measure this is attached).
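
As a quick sanity check, the size of each write going through default_physio can be eyeballed with a one-liner; this is just a sketch reusing the same argument that phys.d (attached below) reads:

dtrace -n 'default_physio:entry { @sz["write size (bytes)"] = quantize(args[5]->uio_iov->iov_len); }'

With the dd above, the distribution should collapse into a single bucket matching the bs= value.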

Trying different blocksizes I see:

   Bytes sent; elapsed time in phys I/O; avg I/O size; throughput
   8 MB;   3576 ms of phys; avg sz : 16 KB; throughput 2 MB/s
   9 MB;   1861 ms of phys; avg sz : 32 KB; throughput 4 MB/s
   31 MB;  3450 ms of phys; avg sz : 64 KB; throughput 8 MB/s
   78 MB;  4932 ms of phys; avg sz : 128 KB; throughput 15 MB/s
   124 MB; 4903 ms of phys; avg sz : 256 KB; throughput 25 MB/s
   178 MB; 4868 ms of phys; avg sz : 512 KB; throughput 36 MB/s
   226 MB; 4824 ms of phys; avg sz : 1024 KB; throughput 46 MB/s
   226 MB; 4816 ms of phys; avg sz : 2048 KB; throughput 54 MB/s (was 46)
    32 MB;  686 ms of phys; avg sz : 4096 KB; throughput 58 MB/s (was 46)
   224 MB; 4741 ms of phys; avg sz : 8192 KB; throughput 59 MB/s (was 47)
   272 MB; 4336 ms of phys; avg sz : 16384 KB; throughput 58 MB/s (new data)
   288 MB; 4327 ms of phys; avg sz : 32768 KB; throughput 59 MB/s (new data)

Data was corrected after it was pointed out that physio will be throttled by maxphys. New data was obtained after setting:

 /etc/system: set maxphys=8388608
 /kernel/drv/sd.conf: sd_max_xfer_size=0x800000
 /kernel/drv/ssd.conf: ssd_max_xfer_size=0x800000

And setting un_max_xfer_size in “struct sd_lun”. That address was figured out using dtrace and knowing that sdmin() calls ddi_get_soft_state (details available upon request). And, of course, disabling the write cache (using format -e). With this in place I verified that each sdwrite() up to 8M leads to a single biodone interrupt, using this:

dtrace -n 'biodone:entry,sdwrite:entry{@a[probefunc, stack(20)]=count()}'

Note that for 16M and 32M raw-device writes, each default_physio will issue a series of 8M I/Os, so we don't expect any more throughput from those.
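
To double-check the sizes that actually complete at the driver with these settings in place, one can aggregate the buffers seen at biodone; a small sketch reusing the b_bcount field that spa_sync.d (attached below) also reads:

dtrace -n 'biodone:entry { @sz["completed I/O size (bytes)"] = quantize(args[0]->b_bcount); }'

For the 16M and 32M dd runs, this should show completions capped at 8 MB, consistent with the note above.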

The script used to measure the rates (phys.d) was also modified, since I was counting the bytes before the I/O had completed, and that made a big difference for the very large I/O sizes. If you take the 8M case, the above rates correspond to the time it takes to issue and wait for a single 8M I/O to the sd driver. So this time certainly includes one seek and ~0.13 seconds of data transfer (8 MB at ~60 MB/s), then the time to respond to the interrupt, and finally the wakeup of the thread waiting in default_physio(). Given that the data transfer rate using 4 MB is very close to the one using 8 MB, I'd say that at 60 MB/s all the fixed-cost elements are well amortized. So I would conclude that the limiting factor is now the device itself or the data channel between the disk and the host.

My disk is hitachi-dk32ej36nsun36g-pq08-33.92gb.

Now let's see what ZFS gets. I measure using a single dd process. ZFS will chunk up the data in 128K blocks. The dd command interacts with memory, but the I/Os are scheduled under the control of spa_sync(). So in the D script (attached) I check for the start of an spa_sync and time its elapsed duration. At the same time I gather the number of bytes and keep a count of the I/Os (bdev_strategy) being issued. When the spa_sync completes, we are sure that all of those are on stable storage. The script is a bit more complex because there are two threads that issue spa_sync, but only one of them actually becomes activated, so the script will print some spurious lines of output at times. I measure I/O with the script while this runs:

dd if=/dev/zero of=/zfs2/roch/f1 bs=1024k count=8000

And I see:

   1431 MB; 23723 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1387 MB; 23044 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   2680 MB; 44209 ms of spa_sync; avg sz : 127 KB; throughput 60 MB/s
   1359 MB; 24223 ms of spa_sync; avg sz : 127 KB; throughput 56 MB/s
   1143 MB; 19183 ms of spa_sync; avg sz : 126 KB; throughput 59 MB/s

OK, I cheated. Here, ZFS is given a full disk to play with, and in that case ZFS enables the write cache. Note that even with the write cache enabled, when spa_sync() completes, it will be after a flush of the cache has been executed. So the 60 MB/s does correspond to data sent to the platter. I just tried disabling the cache (with format -e), but I am not sure if that is taken into account by ZFS; the results are the same 60 MB/s. This will have to be confirmed.

With the write cache enabled, the physio test reaches 66 MB/s as soon as we are issuing 16 KB I/Os. Here, clearly, data is not on the platter when the timed function completes.

Another variable not fully controlled is the physical (cylinder) location of the I/Os. It could be that some of the differences come from that.
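
A rough way to at least observe where the I/Os land would be to aggregate the starting block number of each buffer; this is only a sketch, and quantize gives very coarse power-of-two buckets:

dtrace -n 'biodone:entry { @blk["starting block (b_blkno)"] = quantize(args[0]->b_blkno); }'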

What do I take away?

A single 2 MB physical I/O gets 46 MB/s out of my disk. Thirty-five concurrent 128K I/Os sustained, followed by metadata I/O, followed by a flush of the write cache, allow ZFS to get 60 MB/s out of the same disk.
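
For what it's worth, the degree of concurrency can be eyeballed with something like this sketch (not what produced the numbers above): it counts I/Os issued through bdev_strategy minus those completed by biodone and reports the maximum outstanding every 5 seconds.

dtrace -qn 'bdev_strategy:entry { pending++; @out["max outstanding I/Os"] = max(pending); }
biodone:entry /pending > 0/ { pending--; }
tick-5s { printa(@out); clear(@out); }'

Being based on a global variable it is only approximate, but it is enough to see whether ZFS keeps a few dozen 128K I/Os in flight during an spa_sync.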

This is what underwrites my belief that the 128K blocksize is sufficiently large. Now, nothing here proves that 256K would not give more throughput, so nothing is really settled. But I hope this helps put us on common ground.

--------------phys.d-------------------
#!/usr/sbin/dtrace -qs
/* Measure throughput going through physio (dd to raw) */

BEGIN
{
        b = 0;          /* byte count */
        cnt = 0;        /* phys I/O count */
        delta = 0;      /* time delta */
        tt = 0;         /* timestamp */
}

default_physio:entry
{
        tt = timestamp;
        self->b = (args[5]->uio_iov->iov_len);  /* size of this I/O */
}

default_physio:return
/tt != 0/
{
        cnt++;
        b += self->b;
        delta += (timestamp - tt);      /* time spent waiting for the I/O */
}

tick-5s
/delta != 0/
{
        printf("%d MB; %d ms of phys; avg sz : %d KB; throughput %d MB/s\n",
            b / 1048576,
            delta / 1000000,
            b / cnt / 1024,
            (b * 1000000000) / (delta * 1048576));
}

tick-5s
{
        b = 0;
        delta = 0;
        cnt = 0;
        tt = 0;
}
--------------phys.d-------------------
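
For reference, running the raw-device test with it looks like this (assuming the script is saved as phys.d and made executable); it prints one line every 5 seconds:

./phys.d &
dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=8192k count=1024
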
--------------spa_sync.d-------------------
#!/usr/sbin/dtrace -qs
/*
 * Measure I/O throughput as generated by spa_sync.
 * Between the spa_sync entry and return probes
 * I count all I/O and bytes going through bdev_strategy.
 * This is a lower bound on what the device can do since
 * some aspects of spa_sync are non-concurrent I/Os.
 */

BEGIN
{
        tt = 0;         /* timestamp */
        b = 0;          /* byte count */
        cnt = 0;        /* I/O count */
}

spa_sync:entry
/(self->t == 0) && (tt == 0)/
{
        b = 0;          /* reset the I/O byte count */
        cnt = 0;
        tt = timestamp;
        self->t = 1;
}

spa_sync:return
/(self->t == 1) && (tt != 0)/
{
        this->delta = (timestamp - tt);
        this->cnt = (cnt == 0) ? 1 : cnt;       /* avoid divide by 0 */

        printf("%d MB; %d ms of spa_sync; avg sz : %d KB; throughput %d MB/s\n",
            b / 1048576,
            this->delta / 1000000,
            b / this->cnt / 1024,
            (b * 1000000000) / (this->delta * 1048576));

        tt = 0;
        self->t = 0;
}

/* We only count I/O issued during an spa_sync */
bdev_strategy:entry
/tt != 0/
{
        cnt++;
        b += (args[0]->b_bcount);
}
--------------spa_sync.d-------------------