OLTP Improvements in Sun Storage 7000 2010.Q1
By user12610824 on Mar 11, 2010
For OLTP databases, we have seen as much as:
- 50% increase in average throughput
- 70% reduction in variability
This is based on transaction rates measured over a series of identically configured benchmark runs. Roch Bourbonnais provides a detailed discussion on his blog of the engineering work that went into this improvement, and I will highlight the aspects specific to Oracle and other OLTP database configurations.
In general, if you have configured your Unified Storage appliance to have shares or LUNs with recordsize/blocksize less than 128K, you are strongly encouraged to upgrade to the latest software release for enhanced overall performance.
For the details of how these gains were achieved, read on...
As Roch describes in his blog, this improvement relates to metaslab and block allocation in ZFS, and was tracked as CR 6869229. To store data in a ZFS pool, ZFS first selects a vdev (a physical block device such as a disk, or a logical grouping of physical block devices comprising a RAID group), then selects a metaslab (a region of space) within that vdev, and finally a block within that metaslab. I refer you to Roch's blog for more details on this and on the changes being introduced, and to Jeff Bonwick's older posts on ZFS Block Allocation and Space Maps for further background.
As you may know, ZFS supports multiple record sizes, from 512 bytes to 128 kilobytes. In most cases, we recommend the default record size of 128K for ZFS file systems, unless you have an application that manages large files using small random reads and writes. The best-known example is database files, where it can be beneficial to match the ZFS record size to the database block size. This also applies to iSCSI LUNs, which have a default block size of 8K. In both cases, you may have a large amount of data that is randomly updated in small units of space.
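To see why matching record size to database block size matters, here is a back-of-envelope sketch (my own arithmetic, not from the original testing): when a small write lands inside a larger record, ZFS must read, modify, and rewrite the whole record, so the bytes physically moved scale with the record size rather than the write size.

```python
# Back-of-envelope model of write amplification for small random writes:
# a partial-record update forces a read-modify-write of the whole record.

def write_amplification(recordsize_kb: float, write_kb: float) -> float:
    """Bytes physically rewritten per byte the application writes."""
    if write_kb >= recordsize_kb:
        return 1.0  # full-record writes need no read-modify-write
    return recordsize_kb / write_kb

# 4K database writes against the 128K default vs. a matched 4K recordsize
print(write_amplification(128, 4))  # 32.0 - each 4K update rewrites 128K
print(write_amplification(4, 4))    #  1.0 - record size matches block size
```

This is why the data-file shares in the testing described below were given a 4KB record size to match the 4KB database block size.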
The OLTP testing that contributed to CR 6869229 was for an Oracle database consisting of roughly 350GB of data and log files, stored on a Unified Storage appliance and accessed using NFSv4 with direct I/O. The workload was an OLTP environment simulating an order entry system for a wholesale supplier. The database block size was set to 4KB to minimize block contention, and the shares containing data files were configured with a matching 4KB record size. The database log files, which are accessed in a primarily sequential manner and with a relatively large I/O size, were configured with the default 128KB record size. In addition, the log file shares were configured with log bias set to latency, and the data file shares were configured with log bias set to throughput.
Initial testing consisted of repeated benchmark runs with the number of active users scaled from 1 to 256, with three runs completed at each user count before increasing to the next level. This testing revealed an anomaly: a high degree of variability among runs with the same user count, where a group of runs with relatively low throughput could be followed by a sudden jump to relatively high throughput. To better understand the variability, testing was altered to focus on multiple, repeated runs with 64 active users, with all other factors held constant. This testing continued to exhibit a high degree of variability, and also revealed a cyclic pattern: periods of high throughput followed by slow degradation over several runs, followed by a sudden return to the previous high.
To identify the cause of the variation in throughput, we collected a broad range of statistics from Oracle, from Solaris, and from Analytics in the Unified Storage appliance. Some examples include:
- Oracle buffer pool miss rates, top waiters and their contribution to run time, and user and system CPU consumption
- OS-level reads and writes per second, kilobytes read and written per second, and average service time
- Appliance-level NFSv4 reads and writes per second, disk reads and writes per second, and disk kilobytes read and written per second
These data were loaded into an OpenOffice spreadsheet, processed to generate additional derived statistics, and finally analyzed for correlation with the observed transaction rate in the database. This analysis highlighted I/O size in the appliance as the statistic with the strongest correlation (R^2 = 0.83) to database transaction rate: transaction rates seemed to increase with increased I/O size in the appliance, which also corresponded to lower read and write service times as seen by the database server.
Conversely, as average I/O size in the appliance dropped, database transaction rates would tend to drop as well. The question was, what was triggering changes in I/O size in the appliance, given a consistent I/O size in the application?
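The correlation step itself is simple enough to sketch in a few lines of Python (the actual analysis used an OpenOffice spreadsheet; the per-run numbers below are invented purely for illustration):

```python
# Compute R^2 between average appliance I/O size and transaction rate.
# The data here is hypothetical; only the method mirrors the analysis.

def r_squared(xs, ys):
    """Square of Pearson's correlation coefficient between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

avg_io_kb = [12, 14, 9, 16, 11, 18, 8, 15]        # hypothetical per-run stats
txn_rate  = [410, 450, 330, 500, 380, 540, 300, 470]
print(round(r_squared(avg_io_kb, txn_rate), 3))
```

An R^2 near 1 means run-to-run changes in one statistic track the other closely; 0.83 was strong enough to make appliance I/O size the prime suspect.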
As Roch describes in his blog, metaslab and block allocation in ZFS were ultimately found to contribute heavily to the observed variability in OLTP throughput. When a given metaslab (a region of space within a vdev) became 70% full, ZFS would switch from a first-fit to a best-fit block allocation strategy within that metaslab, to improve the compactness of the on-disk layout. Note that this refers to a single metaslab within a vdev, not the entire vdev or storage pool. With a random rewrite workload to a share with a small record size, like the 4KB OLTP database workload in our tests, the random updates tended to free up individual records within a given metaslab. When ZFS switched to best-fit allocation, new 4KB write requests would prefer these "best fit" locations over other, possibly larger areas of free space. This inhibited the ability of ZFS to do write aggregation, resulting in more IOPS required to move the same amount of data.
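A toy allocator (a simplification I wrote for illustration, not ZFS source) makes the effect concrete: when random frees leave exact-fit 4K holes, best-fit scatters new writes into those holes, while first-fit can place them contiguously in a larger free extent, where adjacent writes aggregate into a single device I/O.

```python
# Toy model (not ZFS code) of why best-fit allocation hurt write
# aggregation. Free space: one large contiguous extent plus scattered
# 4K holes left by random frees. We place eight 4K writes with each
# strategy, then count device I/Os after aggregating adjacent blocks.

def place(free, size, strategy):
    """Allocate eight blocks; 'free' is a list of (offset, length)."""
    allocs = []
    for _ in range(8):
        fits = [s for s in free if s[1] >= size]
        seg = (min(fits, key=lambda s: s[1]) if strategy == "best"
               else min(fits, key=lambda s: s[0]))  # smallest hole vs. lowest offset
        free.remove(seg)
        allocs.append(seg[0])
        if seg[1] > size:                           # return the remainder
            free.append((seg[0] + size, seg[1] - size))
    return sorted(allocs)

def device_ios(offsets, size=4):
    """Count I/Os after aggregating physically adjacent writes."""
    ios = 1
    for a, b in zip(offsets, offsets[1:]):
        if b != a + size:
            ios += 1
    return ios

segments = [(0, 1024)] + [(2000 + i * 100, 4) for i in range(8)]

print(device_ios(place(list(segments), 4, "best")))   # 8 - scattered, no aggregation
print(device_ios(place(list(segments), 4, "first")))  # 1 - contiguous, fully aggregated
```

Same eight 4K writes, eight device I/Os under best-fit versus one under first-fit: that is the IOPS inflation described above.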
Two related space allocation issues were identified and ultimately improved. The first fix raised the threshold for the transition to best-fit allocation from 70% full to 96% full; the second changed the weighting factors applied to metaslab selection so that a higher level of free space is maintained per metaslab. The latter avoids metaslabs that may soon transition to best-fit allocation, and switches away from a metaslab more quickly once it does make that transition. This tends to spread a random write workload across more metaslabs, each with more free space, permitting a higher degree of write aggregation.
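In sketch form, the two changes look something like the following (my simplified model of the behavior described above, with assumed names, not actual ZFS source):

```python
# Simplified model of the two tuning changes: a raised best-fit
# threshold, and metaslab selection biased toward free space so that
# nearly full metaslabs are avoided or abandoned quickly.

BEST_FIT_THRESHOLD = 0.96   # raised from the old 0.70

def allocation_strategy(pct_full):
    """Within a metaslab, stay on first-fit until nearly full."""
    return "best-fit" if pct_full >= BEST_FIT_THRESHOLD else "first-fit"

def pick_metaslab(metaslabs):
    """metaslabs: list of (name, pct_full); prefer the emptiest one."""
    return min(metaslabs, key=lambda m: m[1])[0]

slabs = [("ms0", 0.80), ("ms1", 0.40), ("ms2", 0.97)]
print(pick_metaslab(slabs))       # ms1 - most free space wins
print(allocation_strategy(0.80))  # first-fit (would have been best-fit at the old 70%)
print(allocation_strategy(0.97))  # best-fit only when nearly full
```

Real metaslab selection weighs more factors than free space alone; the point is simply that both changes keep the allocator in first-fit territory, where write aggregation works well, for as long as possible.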
As mentioned already, the end result of these changes and other enhancements in the new software update was a 50% improvement in average OLTP throughput for this workload, and a 70% reduction in variability from run to run. Roch also reports a 200% improvement in MS Exchange performance, and others have reported substantial improvements in performance consistency on iSCSI LUNs.
The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.