vdev cache and ZFS database performance

A little while back, Neel did some nice evaluation of DB-on-ZFS performance. One issue that he correctly noted was 6437054 (vdev_cache wises up: increase DB performance by 16%).

The major issue for DB performance was that the vdev cache would inflate reads under 16KB (zfs_vdev_cache_max) to 64KB (1 << zfs_vdev_cache_bshift).
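For reference, those two tunables and their defaults look roughly like this (the values match the text above; the declarations are a sketch, not a verbatim copy of vdev_cache.c):

    /* reads smaller than this are candidates for inflation... */
    int zfs_vdev_cache_max = 1 << 14;       /* 16KB */
    /* ...and get inflated to a read of this size */
    int zfs_vdev_cache_bshift = 16;         /* 1 << 16 = 64KB */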

As you can guess, this inflation would really hurt typical databases, as they do lots of record-aligned random I/Os - and those random I/Os are typically under 16KB (Oracle and PostgreSQL are usually configured with 8KB, JavaDB with 4KB, etc.). So why do we have this inflation in the first place? It turns out it's really important for pre-fetching metadata. One workload that demonstrates this is the multi-stream sequential read workload in FileBench. We can also use the oltp workload in FileBench to test database performance.

What we changed in order to fix 6437054 was to make the vdev cache only inflate I/Os for *metadata* - not *user* data. You can now see that logic in vdev_cache_read(). This logically makes sense, as we can now rely on zfetch to correctly pre-fetch user data (which depends more on what the application is doing), and the vdev cache to pre-fetch metadata (which depends more on where it was located on disk).
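To give a feel for the fix, here's a minimal sketch of the gating that vdev_cache_read() now performs - illustrative pseudocode only, with io_is_metadata standing in for the flag the real code checks:

    /*
     * Sketch of the post-6437054 logic (not the literal source):
     * only small metadata reads get inflated to 64KB; user data
     * bypasses the vdev cache and relies on zfetch for prefetch.
     */
    static int
    vdev_cache_read_sketch(zio_t *zio)
    {
            if (!zio->io_is_metadata)       /* hypothetical flag */
                    return (EINVAL);        /* user data: don't inflate */
            if (zio->io_size > zfs_vdev_cache_max)
                    return (EINVAL);        /* big read: leave it alone */
            /* metadata under 16KB: issue an aligned 64KB read instead */
            return (inflate_and_issue(zio));        /* placeholder */
    }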

Ok, yeah, theory is nice, but let's see some measurements...

OLTP workload

Below are the results from using this profile (named 'db.prof'). This was on a Thumper, non-debug bits, ZFS configured as a 46-disk RAID-0, and the recordsize set to 8KB.
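Matching the recordsize to the DB block size is a one-liner; using the pool name from the sequential read logs later in this post as an example, it would look something like:

    diskmonster# zfs set recordsize=8K bigIO

Note that it has to be set before the data files are created, since recordsize only affects newly written files.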

OpenSolaris results without the fix for 6437054 (onnv-gate:2007-07-11)

diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/thisrun.f
 1698: 0.013: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
 1698: 0.014: Creating/pre-allocating files
 1698: 0.014: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
 1698: 0.014: Creating fileset logfile...
 1698: 0.118: Preallocated 1 of 1 of fileset logfile in 1 seconds
 1698: 0.118: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
 1698: 0.118: Creating fileset datafiles...
 1698: 341.433: Preallocated 10 of 10 of fileset datafiles in 342 seconds
 1698: 341.434: Starting 200 shadow instances
...

 1698: 345.768: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 1698: 345.774: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k
 1698: 345.776: Running...
 1698: 466.858: Run took 120 seconds...
 1698: 466.913: Per-Operation Breakdown
random-rate                 0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
shadow-post-dbwr         4237ops/s   0.0mb/s      0.5ms/op       37us/op-cpu
shadow-post-lg           4237ops/s   0.0mb/s      0.0ms/op        7us/op-cpu
shadowhog                4237ops/s   0.0mb/s      0.6ms/op      257us/op-cpu
shadowread               4255ops/s  33.2mb/s     45.7ms/op       67us/op-cpu
dbwr-aiowait               42ops/s   0.0mb/s      0.6ms/op       83us/op-cpu
dbwr-block                 42ops/s   0.0mb/s    215.7ms/op     2259us/op-cpu
dbwr-hog                   42ops/s   0.0mb/s      0.0ms/op       15us/op-cpu
dbwrite-a                4244ops/s  33.2mb/s      0.1ms/op        9us/op-cpu
lg-block                    1ops/s   0.0mb/s    745.3ms/op     6225us/op-cpu
lg-aiowait                  1ops/s   0.0mb/s      8.0ms/op       25us/op-cpu
lg-write                    1ops/s   0.3mb/s      0.3ms/op       26us/op-cpu

 1698: 466.913: 
IO Summary:      1034439 ops 8543.3 ops/s, (4255/4245 r/w)  66.7mb/s,    188us cpu/op,  45.7ms latency
 1698: 466.913: Stats dump to file 'stats.large_db_oltp_8k.out'
 1698: 466.913: in statsdump stats.large_db_oltp_8k.out
 1698: 466.922: Shutting down processes
^C 1698: 601.568: Aborting...
 1698: 601.568: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_14m_35s/large_db_oltp_8k/stats.large_db_oltp_8k.out

diskmonster# 

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18)

diskmonster# filebench db
parsing profile for config: large_db_oltp_8k
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/thisrun.f
 1083: 0.037: OLTP Version 1.16 2005/06/21 21:18:52 personality successfully loaded
 1083: 0.037: Creating/pre-allocating files
 1083: 0.057: Fileset logfile: 1 files, avg dir = 1024.0, avg depth = 0.0, mbytes=10
 1083: 0.057: Creating fileset logfile...
 1083: 0.194: Preallocated 1 of 1 of fileset logfile in 1 seconds
 1083: 0.194: Fileset datafiles: 10 files, avg dir = 1024.0, avg depth = 0.3, mbytes=102400
 1083: 0.194: Creating fileset datafiles...
 1083: 335.203: Preallocated 10 of 10 of fileset datafiles in 336 seconds
 1083: 335.203: Starting 200 shadow instances
...
 1083: 339.484: Creating 221249536 bytes of ISM Shared Memory...
 1083: 339.649: Allocated 221249536 bytes of ISM Shared Memory... at fffffd7f8f600000
 1083: 339.650: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 1083: 339.725: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k
 1083: 339.729: Running...
 1083: 460.683: Run took 120 seconds...
 1083: 460.724: Per-Operation Breakdown
random-rate                 0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
shadow-post-dbwr         5224ops/s   0.0mb/s      0.8ms/op       30us/op-cpu
shadow-post-lg           5224ops/s   0.0mb/s      0.1ms/op        6us/op-cpu
shadowhog                5223ops/s   0.0mb/s      0.7ms/op      255us/op-cpu
shadowread               5240ops/s  40.9mb/s     36.3ms/op       56us/op-cpu
dbwr-aiowait               52ops/s   0.0mb/s      0.9ms/op       86us/op-cpu
dbwr-block                 52ops/s   0.0mb/s    167.6ms/op     1605us/op-cpu
dbwr-hog                   52ops/s   0.0mb/s      0.0ms/op       15us/op-cpu
dbwrite-a                5226ops/s  40.8mb/s      0.2ms/op        9us/op-cpu
lg-block                    2ops/s   0.0mb/s    600.5ms/op     4580us/op-cpu
lg-aiowait                  2ops/s   0.0mb/s      4.0ms/op       23us/op-cpu
lg-write                    2ops/s   0.4mb/s      1.2ms/op       24us/op-cpu

 1083: 460.724: 
IO Summary:      1272557 ops 10520.9 ops/s, (5240/5228 r/w)  82.2mb/s,    156us cpu/op,  36.3ms latency
 1083: 460.724: Stats dump to file 'stats.large_db_oltp_8k.out'
 1083: 460.724: in statsdump stats.large_db_oltp_8k.out
 1083: 460.731: Shutting down processes
 1083: 1060.645: Deleting ISM...
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-db-Jul_18_2007-10h_34m_46s/large_db_oltp_8k/stats.large_db_oltp_8k.out

diskmonster# 

10520.9 ops/s vs. 8543.3 ops/s - that's over 23%! A nice out-of-the-box improvement!



Multi-Stream Sequential Read workload

A previously mentioned workaround to get better DB performance was to set 'zfs_vdev_cache_max' to 1 byte (which essentially disables the vdev cache, as random I/Os are never going to be smaller than 1 byte). The problem with this approach is that it really hurts other workloads, such as the multi-stream sequential read workload. Below are the results using the same Thumper, non-debug bits, ZFS in a 46-disk RAID-0, checksums turned off, NCQ disabled via 'set sata:sata_func_enable = 0x5' in /etc/system, and using this profile (named 'sqread.prof').
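Both settings go in /etc/system; the NCQ line was used for both runs below, while the zfs_vdev_cache_max line is the workaround itself and applies only to the second run (syntax as I'd expect it, so double-check against your build):

    * disable vdev cache read inflation (the old workaround)
    set zfs:zfs_vdev_cache_max = 1
    * disable NCQ
    set sata:sata_func_enable = 0x5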

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' left as its default value

diskmonster# filebench sqread
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/thisrun.f
 2135: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
 2135: 0.005: Creating/pre-allocating files
 2135: 55.235: Pre-allocated file /bigIO/largefile4
 2135: 118.147: Pre-allocated file /bigIO/largefile3
 2135: 184.602: Pre-allocated file /bigIO/largefile2
 2135: 251.991: Pre-allocated file /bigIO/largefile1
 2135: 263.341: Starting 1 seqread instances
 2136: 264.348: Starting 1 seqread4 threads
 2136: 264.348: Starting 1 seqread3 threads
 2136: 264.348: Starting 1 seqread2 threads
 2136: 264.348: Starting 1 seqread1 threads
 2135: 267.358: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 2135: 267.362: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m
 2135: 267.362: Running...
 2135: 388.128: Run took 120 seconds...
 2135: 388.130: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread4                  469ops/s 468.7mb/s      2.1ms/op     1391us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread3                  454ops/s 454.1mb/s      2.2ms/op     1412us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread2                  444ops/s 443.8mb/s      2.2ms/op     1400us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread1                  421ops/s 421.0mb/s      2.4ms/op     1414us/op-cpu

 2135: 388.130: 
IO Summary:      215878 ops 1787.6 ops/s, (1788/0 r/w) 1787.6mb/s,   1638us cpu/op,   2.2ms latency
 2135: 388.130: Stats dump to file 'stats.seqread1m.out'
 2135: 388.130: in statsdump stats.seqread1m.out
 2135: 388.136: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-16h_50m_34s/seqread1m/stats.seqread1m.out

diskmonster# 

OpenSolaris results with the fix for 6437054 (onnv-gate:2007-07-18), 'zfs_vdev_cache_max' set to 1 (disabled vdev cache)

diskmonster# ./do_sqread 
cannot open 'bigIO': no such pool
parsing profile for config: seqread1m
Running /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/thisrun.f
 4110: 0.005: Multi Stream Read Version 1.9 2005/06/21 21:18:52 personality successfully loaded
 4110: 0.005: Creating/pre-allocating files
 4110: 55.681: Pre-allocated file /bigIO/largefile4
 4110: 119.324: Pre-allocated file /bigIO/largefile3
 4110: 182.188: Pre-allocated file /bigIO/largefile2
 4110: 245.260: Pre-allocated file /bigIO/largefile1
 4110: 255.216: Starting 1 seqread instances
 4113: 256.222: Starting 1 seqread4 threads
 4113: 256.222: Starting 1 seqread3 threads
 4113: 256.222: Starting 1 seqread2 threads
 4113: 256.222: Starting 1 seqread1 threads
 4110: 259.232: Running '/usr/lib/filebench/scripts/fs_flush zfs'
 4110: 259.236: Change dir to /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m
 4110: 259.236: Running...
 4110: 380.112: Run took 120 seconds...
 4110: 380.115: Per-Operation Breakdown
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread4                  369ops/s 369.5mb/s      2.7ms/op     1034us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread3                  375ops/s 375.2mb/s      2.7ms/op     1047us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread2                  369ops/s 369.2mb/s      2.7ms/op     1042us/op-cpu
limit                       0ops/s   0.0mb/s      0.0ms/op        0us/op-cpu
seqread1                  296ops/s 296.2mb/s      3.4ms/op     1066us/op-cpu

 4110: 380.115: 
IO Summary:      170443 ops 1410.1 ops/s, (1410/0 r/w) 1410.1mb/s,   1325us cpu/op,   2.8ms latency
 4110: 380.115: Stats dump to file 'stats.seqread1m.out'
 4110: 380.115: in statsdump stats.seqread1m.out
 4110: 380.121: Shutting down processes
Generating html for /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s
file = /var/tmp/ekstarz/fb/stats/diskmonster-zfs-sqread-Jul_18_2007-17h_27m_03s/seqread1m/stats.seqread1m.out

diskmonster#

So by disabling the vdev cache, the throughput drops from 1787.6MB/s to 1410.1MB/s - a 21% regression. Disabling the vdev cache really hurts here. The nice thing is that with the fix for 6437054, we don't have to - we keep the sequential read performance and get great DB performance too. My cake is tasty.

Future Work

Future work to improve DB-on-ZFS performance includes:
6457709 vdev_knob values should be determined dynamically
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops

P.S.: If you want the oltp workload of FileBench to run correctly, you'll need this minor fix to 'flowop_library.c':

@@ -600,10 +600,11 @@ 
                 aiocb->aio_fildes = filedesc;
                 aiocb->aio_buf = threadflow->tf_mem + memoffset;
                 aiocb->aio_nbytes = *flowop->fo_iosize;
                 aiocb->aio_offset = offset;
+                aiocb->aio_reqprio = 0;
 
 
                 filebench_log(LOG_DEBUG_IMPL,
                                 "aio fd=%d, bytes=%lld, offset=%lld",
                                 filedesc, *flowop->fo_iosize, offset);


happy databasing!
Comments:

Great - can't wait to get build 70. That *is* where we get this fix, isn't it?

Posted by James Mansion on July 19, 2007 at 05:33 AM PDT #

Yep, snv_70 is the build. No official plans yet, but most likely this will end up in s10u5 (which doesn't have an official schedule yet).

enjoy!

Posted by eric kustarz on July 19, 2007 at 05:33 AM PDT #

Will this also improve iSCSI-exported zvol workloads?

Posted by kosh on July 19, 2007 at 01:40 PM PDT #

I didn't do any specific zvol measurements. But it will partly be based on *what* workload you're talking about.

That being said, try it out and let us know what you find out!

Posted by eric kustarz on July 20, 2007 at 02:37 AM PDT #

For customers who want to try out ZFS on their development (OLTP) DBs, what would you recommend, in addition to the current best practices for general ZFS (vdev or any other) tuning, while waiting for S10U5? Thanks and keep up the good work!

Posted by T. McPhillips on July 31, 2007 at 11:13 PM PDT #

Check out this site:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Database_Recommendations

enjoy!

Posted by eric kustarz on August 01, 2007 at 07:19 AM PDT #
