Avoiding I/O inflation

As I mentioned in an earlier blog entry, I've been looking at MySQL over ZFS performance. That work went on the back burner for a while due to some re-prioritization and I shifted focus to 10GbE network performance on Solaris 10. I'll blog about that work after S10 Update 8 is GA (ETA is ~end of this year). Anyway, my fixes went into build 3 of Solaris 10 Update 8 so all of my work on S10 networking performance is now complete.

So now, I'm back to looking at MySQL over ZFS performance. In the meantime, Neel has completed his own work on MySQL over ZFS performance, and he has a blog entry for those who need a quick tuning guide. In this blog entry, though, I'm going to delve a little deeper into the first tuning tip Neel listed, which can be summarized as: avoid I/O inflation.

Now, I/O inflation on most systems is a good example of a "known unknown" - and this classification could well be true for most performance issues on a system. For example, how can one know that there is a lurking performance problem? For starters, how do you even know that I/O inflation is occurring on your system? The steps below will help you determine this:

Start by looking at the output of iostat, a tool that reports I/O activity on disks, NFS mounts, iSCSI targets, etc. I usually run it as iostat -xcnz 1, which prints extended per-device statistics (-x) along with CPU usage (-c) every second, using descriptive device names (-n) and suppressing devices with no activity (-z) - so the output shows only the I/O deltas for each time period.

On my test system with 4 disks, running MySQL over ZFS exercised using a sysbench read-only test, a sample iostat delta looks like this:

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   33.0    0.0 4224.2    0.0  0.0  0.4    0.0   13.1   0  27 c0t2d0
   40.0    0.0 5120.3    0.0  0.0  0.7    0.0   18.7   0  29 c0t3d0
   48.0    0.0 6144.3    0.0  0.0  0.9    0.0   19.4   0  31 c0t4d0
   38.0    0.0 4864.2    0.0  0.0  0.7    0.0   18.2   0  31 c0t5d0
If we pick just the first line, and calculate the average size of the read from disk, we get (4224.2/33.0) = 128 KBytes. This matches the default block size of ZFS.
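The same per-device arithmetic can be checked for all four disks at once; here is a minimal awk sketch with the r/s and kr/s values hard-coded from the sample iostat output above:

```shell
# Average read size per device (kr/s divided by r/s), using the
# numbers from the sample iostat output. Every disk comes out at
# roughly 128 KBytes - the ZFS default block size.
awk 'BEGIN {
    printf "c0t2d0 %.0f KB\n", 4224.2 / 33.0
    printf "c0t3d0 %.0f KB\n", 5120.3 / 40.0
    printf "c0t4d0 %.0f KB\n", 6144.3 / 48.0
    printf "c0t5d0 %.0f KB\n", 4864.2 / 38.0
}'
```

All four devices report an average read size of 128 KB, so this is not a quirk of one disk.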

Next, we can look at the average size of a read (issued by the applications running on the system) to ZFS using this dtrace command:

# dtrace -c 'sleep 20' -n 'zfs_read:entry{@[((uio_t *)arg1)->uio_resid]=count()}'
The tail end of the output from the dtrace might look like this:
...lots of preceding data removed for the sake of clarity...
            81920              100
            65536              145
            49152              238
            32768              391
            16384            36568
This tells us that in a 20 second interval, for an overwhelming majority of reads (36568 of them, to be precise), a userland application was reading 16 KBytes of data at a time.
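To put a rough number on that majority, we can sum just the buckets shown above (the smaller buckets elided from the output would only nudge this figure down slightly):

```shell
# Share of zfs_read calls that fell in the 16 KByte bucket,
# counting only the tail-end buckets shown in the dtrace output.
awk 'BEGIN {
    total = 100 + 145 + 238 + 391 + 36568
    printf "%.1f%% of these reads were 16 KBytes\n", 36568 / total * 100
}'
```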

Since the only work happening on the system is from MySQL, I could speculate that these are database blocks being read. Or I could just as easily confirm it with yet another dtrace one-liner:

# dtrace -c 'sleep 20' -n 'zfs_read:entry{@[execname]=count()}'
dtrace: description 'zfs_read:entry' matched 1 probe
dtrace: pid 1011 has exited
  mysqld                                                   35700
which tells me that over a 20 second period, 35700 calls to zfs_read were issued - all of them from the mysqld executable.

So, from all of this data, we can conclude that we have an application (which happens to be MySQL in this case) issuing reads that are 16 KBytes in length which get morphed into 128KByte reads by ZFS.

This effectively causes 8x the amount of work in terms of memory used, system bus bandwidth consumed, disk bandwidth used, CPU cycles spent, and so on. Now, this may very well be useful work if the application ends up using the data that ZFS has prefetched. But in this particular case, MySQL happens to be accessing random blocks (since sysbench is running an OLTP-type workload), so all of the data prefetching that happens due to the larger ZFS block size is wasted time and effort. Reducing the ZFS block size to match the MySQL block size avoids all of this wasted effort.
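The 8x figure is simply the ratio of the two block sizes:

```shell
# ZFS default block size (128 KBytes) divided by the read size
# MySQL was observed issuing above (16 KBytes) gives the
# I/O inflation factor.
echo $(( (128 * 1024) / (16 * 1024) ))    # prints 8
```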

See Neel's blog entry for the details on how this can be done.
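For reference, the heart of that tuning is a one-line property change on the ZFS dataset holding the database files. The dataset name tank/mysql below is a made-up example; also note that recordsize only affects files written after the change, so it should be set before the database files are created (or the files recopied afterwards):

```shell
# Hypothetical dataset name - substitute your own.
# Match the ZFS recordsize to InnoDB's 16 KByte page size.
zfs set recordsize=16K tank/mysql

# Confirm the new value:
zfs get recordsize tank/mysql
```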

Even if you are running a database other than MySQL, or some other kind of transactional workload entirely, feel free to use the iostat and dtrace commands listed above to see if I/O inflation is happening on your system. If it is, check whether avoiding the additional I/O overhead improves your overall system performance.

BTW, if you use UFS, you will need to change the 'zfs_read' in the dtrace commands listed above to 'ufs_read'.

And finally, here's some fine print in case you got your expectations up too high: overall system performance tends to be the result of a complex interplay of many things - I/O is just one of them. Reducing (presumably unnecessary) I/O may help, but it is not guaranteed to be a silver bullet.



Charles Suresh