

ZFS write throttle observations

Guest Author

The new ZFS write throttle feature, which was integrated in Nevada build 87, specifically addresses write-intensive workloads. Today, we take a closer look at the write throttle in action. Our test system is a Sun Fire X4500 running Nevada build 94 with a single ZFS pool of 42 striped disks.

blog@x4500> zpool list
NAME   SIZE    USED   AVAIL   CAP   HEALTH   ALTROOT
h      19.0T   620K   19.0T    0%   ONLINE   -
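
For reference, a plain striped pool like this is created simply by listing the disks on the zpool create command line. The device names below are hypothetical and only the first few of the 42 disks are shown; this is an illustration, not the command actually used on the test system:

blog@x4500> zpool create h c0t0d0 c0t1d0 c0t2d0 c0t3d0 ...   # plus the remaining disks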

The zfs_write_throttle.d DTrace script is used to observe the write throttle. In a first test, we start generating write I/O load using a couple of “dd if=/dev/zero of=/h/<file> bs=1024k” commands.
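
Such a load can be launched, for example, along the following lines; the file names and the number of dd streams are illustrative assumptions, not the exact invocation used for this test:

blog@x4500> for i in 1 2 3 4; do dd if=/dev/zero of=/h/zero.$i bs=1024k & done

With this load running, here's an extract of the script output: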

--- 2008 Jul 28 14:04:17
Sync rate (/s)
h 1
MB/s
h 1540
Delays/s
h 47
h Sync time (ms)
value ------------- Distribution ------------- count
80 | 0
100 |@@@@@@@@@@@ 3
120 |@@@@ 1
...snip...
260 |@@@@ 1
...snip...
580 |@@@@ 1
...snip...
780 |@@@@ 1
...snip...
1320 |@@@@@@@ 2
1340 | 0
1360 |@@@@ 1
...snip...
1520 |@@@@ 1
1540 | 0
h Written (MB)
value ------------- Distribution ------------- count
< 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 9
...snip...
3000 |@@@@ 1
...snip...
>= 4000 |@@@@ 1
h Write limit (MB)
value ------------- Distribution ------------- count
7750 | 0
>= 8000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 11

The output has been shortened for clarity. With the default settings in place, one can observe that synchronizing data to disks can take well over a second [range 100 ms to 1540 ms] (please refer to the Sync time distribution).
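
For readers who want to reproduce the sync-time measurement without the full script, here is a minimal DTrace sketch. It is not zfs_write_throttle.d itself, just a simplified stand-in that quantizes the time spent in the kernel's spa_sync() routine, which roughly corresponds to the Sync time histogram above:

#!/usr/sbin/dtrace -s
/* Minimal sketch: time spent in spa_sync(), reported every 10 seconds. */
fbt::spa_sync:entry
{
        self->start = timestamp;
}

fbt::spa_sync:return
/self->start/
{
        @["spa_sync time (ms)"] = quantize((timestamp - self->start) / 1000000);
        self->start = 0;
}

tick-10sec
{
        printa(@);
        trunc(@);
}

Run as root, it prints a quantized histogram of spa_sync() times every ten seconds.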

In a second test, we reduce the target time for synchronizing data to disk from the default of five seconds to one second, using the zfs_txg_synctime variable.
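
One common way to change such a kernel variable on a live system is with mdb -kw, or with an /etc/system entry for a setting that persists across reboots. The commands below are a sketch of how the one-second target could be applied, not a transcript from the test system:

blog@x4500> echo zfs_txg_synctime/W0t1 | mdb -kw

or, in /etc/system:

set zfs:zfs_txg_synctime = 1

Here's again an extract of the script output: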

--- 2008 Jul 28 14:08:27
Sync rate (/s)
h 1
MB/s
h 1681
Delays/s
h 56
h Sync time (ms)
value ------------- Distribution ------------- count
340 | 0
360 |@@@ 1
...snip...
460 |@@@ 1
480 | 0
500 |@@@ 1
...snip...
600 |@@@ 1
...snip...
660 |@@@ 1
...snip...
740 |@@@ 1
760 |@@@ 1
780 |@@@ 1
800 | 0
820 |@@@ 1
840 |@@@@@@ 2
860 |@@@@@@ 2
...snip...
1040 |@@@ 1
1060 | 0
h Written (MB)
value ------------- Distribution ------------- count
< 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
...snip...
2400 |@@@ 1
2600 | 0
2800 |@@@ 1
...snip...
>= 4000 |@@@@@@ 2
h Write limit (MB)
value ------------- Distribution ------------- count
2500 | 0
2750 |@@@@@@ 2
...snip...
4750 |@@@ 1
5000 |@@@@@@ 2
5250 | 0
5500 |@@@@@@@@@@@ 4
...snip...
6500 |@@@ 1
6750 | 0
7000 |@@@@@@@@@ 3
...snip...
>= 8000 |@@@ 1


Two things can be seen when comparing with the first test:

a) The time for synchronizing data to disks has gone down [range 360 ms to 1060 ms].

b) The pool's write limit moved around over time (please refer to the Write limit distribution), dynamically throttling the incoming application write rate to the available I/O bandwidth.

More parameters are available for tuning (please see the source code), but as usual, use them with caution. To wrap up, in one last test the parameter zfs_write_limit_override was set to 800 MB, which enforces the write limit at the specified value. This can be beneficial for applications that generate a continuous, well-paced write stream but are sensitive to write delays.
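
As a sketch of how such an override could be put in place (800 MB is 838860800 bytes; again, this is illustrative rather than a transcript from the test system):

blog@x4500> echo zfs_write_limit_override/Z0t838860800 | mdb -kw

or, in /etc/system:

set zfs:zfs_write_limit_override = 0x32000000

Here's the corresponding output extract: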

--- 2008 Jul 28 14:54:49
Sync rate (/s)
h 4
MB/s
h 677
Delays/s
h 1
h Sync time (ms)
value ------------- Distribution ------------- count
120 | 0
140 |@@@@@@ 6
160 |@@@@@@@@@@@@@@@ 15
180 |@@@@@@@@@@@@@@@ 15
200 |@@@@@ 5
220 | 0
h Written (MB)
value ------------- Distribution ------------- count
< 200 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31
200 | 0
400 |@@@@@ 5
600 | 0
800 |@@@@@ 5
1000 | 0
h Write limit (MB)
value ------------- Distribution ------------- count
1250 | 0
1500 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 41
1750 | 0

Hopefully, you have enjoyed these little observations!


 

