Thursday Jul 01, 2010

Partition Alignment Guidelines for Unified Storage

If you create and access logical disks (aka LUNs) from your Sun Unified Storage appliance, whether over iSCSI or Fibre Channel, you should be aware that client-side partition alignment can have a big impact on performance. This is a generic issue that applies to any virtual disk interface, not just Unified Storage, and relates to how client-generated virtual disk I/O maps to actual I/O in the appliance. The good news is that it can be quite easy to properly align partitions.

Background

The reason we care about alignment is that most of the storage industry is based on the historical abstraction of 512 byte sectors, however the consumers of those sectors (filesystems and applications managing raw storage) and the sophisticated storage arrays that provide block storage generally organize their data internally in larger units, typically 4KB or multiples thereof. This includes LUNs in a Sun Unified Storage appliance, which use a default volume block size of 8KB. Without proper care, filesystem blocks can end up non-aligned with the natural block size of the storage, impacting performance.

With proper alignment, a single client block that is the same size as or smaller than the volume block size of a LUN will be contained entirely within a single volume block in the LUN. Without proper alignment, that same client block may span multiple volume blocks in the LUN. That could result in 2 appliance reads for a single client read, and 2 appliance reads plus 2 appliance writes for a single client write. This will obviously have a big impact on performance if ignored.

The graphic below illustrates partition misalignment:

What we see in the graphic above is that the LUN is divided into fixed sized volume blocks, which are 8KB by default in the Unified Storage appliance. A given volume block in the LUN will always be read or written as a whole. When a LUN is imported by the client, it is presented as if it were a physical disk drive, with its own virtual sector size of 512 bytes. The client generally allocates sectors to one or more partitions or slices, which are then made available for file systems or raw application data. The block addresses of application or file system I/O are relative to the first sector in the partition, and the partition may be allowed to start on an arbitrary 512 byte sector.

In the example above, partition block P0 starts on an arbitrary sector, and spans LUN volume blocks L0 and L1. If we issue an 8KB read to P0, which matches the default 8KB volume block size of the LUN, we will have to read both L0 and L1 to get the data. If we issue an 8KB write to P0, we will have to read both L0 and L1 to get the data that is not being changed, then write back both L0 and L1 to store the combined new and old data.
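
The arithmetic behind this example can be sketched with simple integer division; here is a small shell snippet (the byte offsets are illustrative):

```shell
# Which 8KB volume blocks does an 8KB client block touch, given its
# starting byte offset within the LUN? Integer division gives the index
# of the first and last volume block covered.
vbs=8192                               # volume block size in bytes
span() {
  start=$1
  end=$((start + vbs - 1))
  echo "$((start / vbs)) $((end / vbs))"
}
span 4096    # misaligned start: spans volume blocks 0 and 1
span 8192    # aligned start: contained entirely in volume block 1
```

A start of 4096 spans two volume blocks (the misaligned case above), while a start of 8192 touches only one.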

The graphic below illustrates a properly aligned partition:

We now see that partition block P0 starts on a 512 byte sector that coincides with the start of LUN volume block L1. If we issue an 8KB read to P0, which matches the default 8KB volume block size of the LUN, we only have to read L1 to get the data. If we issue an 8KB write to P0, we simply replace LUN volume block L1 and do not need to do any reads at all. We have skipped a small amount of space in the LUN, but the result is a potentially large reduction in I/O.

There are three primary issues that lead to partition misalignment:

  • Most platforms consume some number of sectors at the beginning of a disk for a disk label, and actual data storage must skip these sectors to avoid overwriting the label.
  • Most partition management software, such as format, parted, and diskpart, was written to manage physical disks with 512 byte sectors. If there are constraints on how sectors are grouped into partitions, they typically relate to disk characteristics like heads, tracks, and cylinders, not virtual characteristics like user selected volume block sizes.
  • Most platforms allow you to create multiple partitions or slices within a disk, so even if the first partition is aligned, subsequent partitions may not be aligned.
Taken together, these factors mean that block zero of a given partition may map to an arbitrary 512 byte sector on a virtual disk, and for most platforms your partitions will not be aligned on LUN volume block size boundaries by default.

General Recommendations

If possible, use a disk label that allows sector addressing rather than cylinder addressing for partition/slice locations. This allows for simple math when calculating alignment.

If possible, create a single data partition/slice on the LUN, especially if you must use cylinder addressing. This avoids having to calculate alignment at multiple points within the LUN.

If you are offered an explicit alignment option by your disk partitioning software, use it. This currently only applies to Windows 2003 SP1 or later, where the diskpart.exe utility allows an "align=X" option on the create partition command, where X is the desired alignment in kilobytes. You should specify an alignment that either matches the volume block size in the LUN, or is a power of two larger than the volume block size.

Aligning by Sector

To manually calculate partition alignment by sector, make sure that the starting sector number of each partition is a multiple of the number of sectors in a LUN volume block. For example, with 512 byte sectors, there are 16 sectors in a default 8KB LUN volume block. In that case, the starting sector of each partition/slice should be a multiple of 16. The maximum volume block size for LUNs in the Sun Unified Storage appliance is currently 128KB, and there are 256 sectors in a 128KB volume block. For a 128KB volume block size, the starting sector of each partition/slice should be a multiple of 256.
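
A quick way to check a candidate starting sector is to test divisibility directly; a minimal sketch in shell (the sector numbers are examples):

```shell
# A partition starting sector is aligned if it is a multiple of the
# sectors per volume block (16 for 8KB, 256 for 128KB).
aligned() {
  start=$1; spb=$2
  [ $((start % spb)) -eq 0 ] && echo aligned || echo misaligned
}
aligned 63 16      # classic MBR default start of 63: misaligned
aligned 256 256    # multiple of 256: aligned for up to 128KB blocks
```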

If you are aligned for a power of two volume block size, you are also aligned for any smaller power of two volume block size. All supported volume block sizes in the Sun Unified Storage appliance are powers of two, so aligning for the maximum 128KB volume block size (i.e. starting partitions on multiples of 256 sectors) ensures alignment for all currently supported LUN volume block sizes.

Aligning by Cylinder

If you use a disk label that requires partitions/slices to begin on a cylinder boundary (for example, Solaris SMI labels), make sure that the starting cylinder number multiplied by the number of sectors per cylinder is a multiple of the number of sectors per LUN volume block.

The following Least Common Multiple (LCM) method can simplify the process:

  • Determine sectors per cylinder. In Solaris format, this is nhead * nsect. In Linux fdisk, this is heads * sectors/track.
  • Determine sectors per LUN volume block. There are two 512 byte sectors per kilobyte, so an 8KB volume block is 16 sectors, and a 128KB volume block is 256 sectors.
  • Find the LCM of the sectors per cylinder and the sectors per LUN volume block, for example by using a tool like http://www.mathsisfun.com/least-common-multiple-tool.html
  • Divide the LCM by the number of sectors per cylinder
  • The result is the first non-zero cylinder that is aligned for your volume block size. Any cylinder that is a multiple of this number is also aligned.
For example, with 255 heads and 63 sectors per track, we have 16065 sectors per cylinder. With an 8KB LUN volume block size, we have 16 sectors per volume block. The LCM of 16065 and 16 is 257040. Dividing the LCM by 16065 (sectors per cylinder) gives us 16. Cylinder 16 is the first non-zero cylinder that is aligned for an 8KB LUN volume block, and any cylinder that is a multiple of 16 is also aligned.
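
The steps above can be sketched in shell; the geometry values reproduce the worked example:

```shell
# Find the first non-zero aligned cylinder via the LCM method.
gcd() {
  a=$1; b=$2
  while [ "$b" -ne 0 ]; do t=$((a % b)); a=$b; b=$t; done
  echo "$a"
}
spc=$((255 * 63))      # sectors per cylinder = 16065
spb=$((8192 / 512))    # sectors per 8KB volume block = 16
lcm=$((spc * spb / $(gcd "$spc" "$spb")))
echo "first aligned cylinder: $((lcm / spc))"
```

For 255 heads and 63 sectors per track with an 8KB volume block, this prints 16, matching the example above.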

Caveats

  • Do not use sector 0 of an MBR/msdos labeled LUN, or sectors 0 through 33 of an EFI/gpt labeled LUN, to avoid overwriting the label.
  • Do not trust cylinder numbers reported by Linux fdisk or parted, because both may be rounded to the nearest cylinder. As described in the Linux specific section below, set units to sectors in both tools to verify alignment.
  • Do not trust KB offsets reported by Windows diskpart.exe, because they may be rounded to the nearest KB. As described in the Windows specific section below, you can use the wmic.exe utility to display actual byte offsets.
  • Do not trust cylinder numbers reported by Solaris fdisk on x86/amd64/x86_64 in interactive mode, because they may be rounded to the nearest cylinder. As described in the Solaris on x86/amd64/x86_64 section below, you can run "fdisk -W - {raw_device}" and use the reported Rsect (relative starting sector) to verify alignment. Note that Solaris fdisk will only create cylinder aligned partitions, so this issue relates primarily to reporting the location of partitions created by another mechanism.
  • If you use an SMI label with Solaris on x86/amd64/x86_64, keep in mind that the SMI label subdivides a partition within an MBR/msdos labeled LUN, so there are two levels of alignment to consider. See the Solaris on x86/amd64/x86_64 section for details.

Platform Specific Recommendations

Solaris on SPARC

If possible, use an EFI label (requires "format -e") which allows sector addressing. Configure data slices with a starting sector that is a multiple of the number of 512 byte sectors per LUN volume block. With a default volume block size of 8KB, the starting sector of each slice should be a multiple of 16. With any currently supported volume block size up to 128KB, a slice can begin on a sector that is a multiple of 256.

If you use an EFI label, ensure that sectors 0 through 33 are not assigned to any slice, to avoid overwriting the label.

If you use an SMI label, you will be constrained to begin all slices on a cylinder boundary. To determine whether a cylinder is aligned on a LUN volume block boundary, multiply the cylinder number by the number of 512 byte sectors per cylinder. The result should be a multiple of the number of sectors per LUN volume block.

Refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment.

Solaris on x86/amd64/x86_64

If possible, use an EFI label (requires "format -e") which allows sector addressing. However, be aware that unlike an SMI label, which subdivides a partition within an MBR/msdos labeled LUN when used on x86/amd64/x86_64, an EFI label replaces any existing MBR/msdos label, destroying any existing non-Solaris partitions.

If using an EFI label, use the same EFI guidelines as those described above in the Solaris on SPARC section.

If you use an SMI label with Solaris on x86/amd64/x86_64, keep in mind that the SMI label subdivides a Solaris2 partition within an MBR/msdos labeled LUN, so there are two levels of alignment to consider. The Solaris fdisk utility will report partitions relative to the beginning of the disk/LUN, and the Solaris format utility will report slices relative to the beginning of the Solaris2 partition.

One caveat with fdisk is that in interactive mode it will only create cylinder aligned partitions, but it will also report partition starting points rounded to the nearest cylinder if they were created by another mechanism and are not actually cylinder aligned.

To confirm that a Solaris2 fdisk partition starts on a cylinder boundary, run "fdisk -W - {raw device}" and verify that the reported Rsect (relative starting sector) is a multiple of the number of sectors per cylinder.

The simplest alignment method for SMI on x86/amd64/x86_64 is to ensure that the Solaris2 partition created/reported by fdisk is on a non-zero cylinder boundary that is aligned for your LUN volume block size. You can then use the same guidelines as those described above in the Solaris on SPARC section to align slices within the Solaris2 partition using the format utility.

Refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment.

Linux

Make sure that units is set to sectors when creating or displaying partitions in fdisk and/or parted. If using fdisk in interactive mode, the "u" command toggles units back and forth between sectors and cylinders. If using parted in interactive mode, the "units s" command sets units to sectors.

If you use either tool with units set to cylinders, the reported cylinder numbers may be rounded. Even if you do the math to determine a cylinder that should be aligned, you can not be sure that you are actually aligned unless you set units to sectors.

To ensure alignment, configure data partitions with a starting sector that is a multiple of the number of 512 byte sectors per LUN volume block. With a default volume block size of 8KB, the starting sector of each partition should be a multiple of 16. With any currently supported volume block size up to 128KB, a partition can begin on a sector that is a multiple of 256.

If you would like to choose a sector that is aligned for your LUN volume block and is also on a cylinder boundary, refer to the Aligning by Cylinder section above for a Least Common Multiple method you can use to determine cylinder alignment. After determining an aligned cylinder, multiply the cylinder number by the sectors per cylinder, and use that as your starting sector number.
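
As a sketch of that conversion, using the example geometry from the Aligning by Cylinder section:

```shell
# Convert an aligned cylinder number to a starting sector for fdisk/parted.
cyl=16                     # first aligned cylinder from the LCM method
spc=16065                  # sectors per cylinder (255 heads x 63 sectors/track)
start=$((cyl * spc))
echo "starting sector: $start"
echo "remainder mod 16: $((start % 16))"   # 0 means aligned for 8KB blocks
```

Substitute your own cylinder count and geometry; the printed remainder should always be zero.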

If you use a gpt label (equivalent to an EFI label in Solaris), ensure that sectors 0 through 33 are not assigned to any partition, to avoid overwriting the label.

If you use an MBR (aka msdos) label, ensure that sector 0 is not assigned to any partition, to avoid overwriting the label.

Windows

For Windows 2003 SP1 and later, the diskpart.exe utility can be used to create aligned partitions by including the align=X option on the create partition command, where X is the desired alignment in kilobytes. To create an aligned partition, simply specify a power of two alignment that is greater than or equal to the LUN volume block size. For example, use align=128 to align for any LUN volume block size up to 128 KB. The default in Windows Vista and Windows 2008 is align=1024, which is correctly aligned for any power of two LUN volume block size up to 1MB, and does not need to be changed.
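
As a sketch, a diskpart session creating a 128KB-aligned partition might look like the following (the disk number is illustrative; check your own with "list disk"):

```
DISKPART> select disk 1
DISKPART> create partition primary align=128
DISKPART> assign letter=E
```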

A caveat with the diskpart.exe utility is that it displays the offset in KB, but this is a rounded value. For example, a default Windows 2003 partition offset of 63 sectors is actually 31.5 KB, but will be displayed by diskpart.exe as 32 KB.

To determine the actual byte offset of partitions in Windows, you can use the wmic.exe utility, with a command like:

wmic partition get StartingOffset, Name, Index

This will show partition information for all of the basic disks/LUNs in the system, with StartingOffset specified in bytes. For proper alignment, StartingOffset should be a multiple of the number of bytes (not sectors) in the LUN volume block. For example, with a default 8KB LUN volume block size, StartingOffset should be a multiple of 8192.
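
Checking an offset reported by wmic is again simple modulo arithmetic; a small sketch (the sample offsets are the 1MB Vista/2008 default and the old 63-sector Windows 2003 default):

```shell
# A StartingOffset (in bytes) is aligned if it is a multiple of the LUN
# volume block size in bytes (8192 for the default 8KB block).
check() {
  [ $(($1 % 8192)) -eq 0 ] && echo aligned || echo misaligned
}
check 1048576    # 1MB default offset (Vista/2008): aligned
check 32256      # 63 sectors x 512 bytes (Windows 2003): misaligned
```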

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Thursday Mar 11, 2010

OLTP Improvements in Sun Storage 7000 2010.Q1

The 2010.Q1 Software Update for the Sun Storage 7000 Unified Storage Systems product line is now available. Among the many enhancements and new features included in this update is an important performance improvement for configurations using shares or LUNs with small record sizes. The two most likely configurations to benefit from this change are OLTP databases, which typically configure the record size of a share to match the block size of the database, and iSCSI LUNs, which have a default block size of 8KB.

For OLTP databases, we have seen as much as:

  • 50% increase in average throughput
  • 70% reduction in variability

This is based on transaction rates measured over a series of identically configured benchmark runs. Roch Bourbonnais provides a detailed discussion on his blog of the engineering work that went into this improvement, and I will highlight the aspects specific to Oracle and other OLTP database configurations.

In general, if you have configured your Unified Storage appliance to have shares or LUNs with recordsize/blocksize less than 128K, you are strongly encouraged to upgrade to the latest software release for enhanced overall performance.

For the details of how these gains were achieved, read on....

As Roch describes in his blog, this improvement relates to metaslab and block allocation in ZFS, and was tracked as CR 6869229. As he describes, to store data in a ZFS pool, ZFS first selects a vdev (a physical block device like a disk, or a logical grouping of physical block devices comprising a RAID group), then selects a metaslab (a region of space) within that vdev, and finally a block within that metaslab. I refer you to Roch's blog for more details on this and on the changes being introduced, and to Jeff Bonwick's older blogs on ZFS Block Allocation and Space Maps for further background.

As you may know, ZFS supports multiple record sizes, from 512 bytes to 128 kilobytes. In most cases, we recommend that you use the default record size of 128K for ZFS file systems, unless you have an application that manages large files using small random reads and writes. The most well known example is for database files, where it can be beneficial to match the ZFS record size to the database block size. This also applies to iSCSI LUNs, which have a default block size of 8K. In both cases, you may have a large amount of data that is randomly updated in small units of space.

The OLTP testing that contributed to CR 6869229 was for an Oracle database consisting of roughly 350GB of data and log files, stored on a Unified Storage appliance and accessed using NFSv4 with direct I/O. The workload was an OLTP environment simulating an order entry system for a wholesale supplier. The database block size was configured at 4KB, to minimize block contention, and the recordsize of the shares containing data files was configured with a matching 4KB record size. The database log files, which are accessed in a primarily sequential manner and with a relatively large I/O size, were configured with the default 128KB record size. In addition, the log file shares were configured with log bias set to latency, and the data file shares were configured with log bias set to throughput.

Initial testing consisted of repeated benchmarks runs with the number of active users scaled from 1 to 256. Three runs were completed at each user count before increasing to the next level. This testing revealed an anomaly, in that there was a high degree of variability among runs with the same user count, and that a group of runs with relatively low throughput could be followed by a sudden jump to relatively high throughput. To better understand the variability, testing was altered to focus on multiple, repeated runs with 64 active users, with all other factors held constant. This testing continued to exhibit a high degree of variability, and also revealed a cyclic pattern, with periods of high throughput followed by slow degradation over several runs, followed by a sudden return to the previous high. To identify the cause of the variation in throughput, we collected a broad range of statistics from Oracle, from Solaris, and from Analytics in the Unified Storage appliance. Some examples include Oracle buffer pool miss rates, top waiters and their contribution to run time, user and system CPU consumption, OS level reads and writes per second, kilobytes read and written per second, average service time, Appliance level NFSv4 reads and writes per second, disk reads and writes per second, and disk kilobytes read and written per second. These data were loaded into an OpenOffice spreadsheet, processed to generate additional derived statistics, and finally analyzed for correlation with the observed transaction rate in the database. This analysis highlighted I/O size in the appliance as the statistic having the strongest correlation (R\^2 = 0.83) to database transaction rates. What this showed is that database transaction rate seemed to increase with increased I/O size in the appliance, which also related to lower read and write service times as seen by the database server. 
Conversely, as average I/O size in the appliance dropped, database transaction rates would tend to drop as well. The question was, what was triggering changes in I/O size in the appliance, given a consistent I/O size in the application?

As Roch describes in his blog, metaslab and block allocation in ZFS were ultimately found to contribute heavily to the observed variability in OLTP throughput. When a given metaslab (a region of space within a vdev) became 70% full, ZFS would switch from a first fit to a best fit block allocation strategy within that metaslab, to help with the compactness of the on disk layout. Note that this refers to a single metaslab within a vdev, not the entire vdev or storage pool. With a random rewrite workload to a share with a small record size, like the 4KB OLTP database workload in our tests, the random updates tended to free up individual records within a given metaslab. When ZFS switched to best fit allocation, new 4KB write requests would prefer to use these "best fit" locations rather than other, possibly larger areas of free space. This inhibited the ability of ZFS to do write aggregation, resulting in more IOPS required to move the same amount of data.

Two related space allocation issues were identified and ultimately improved. The first was to raise the threshold for transition to best fit allocation from 70% full to 96% full, and the second was to change the weighting factors applied to metaslab selection so that a higher level of free space would be maintained per metaslab. The latter avoids using metaslabs that might transition soon to best fit allocation, and more quickly switches away from a metaslab once it does make that transition. This will tend to spread a random write workload among more metaslabs, and each will have more free space and will permit a higher degree of write aggregation.

As mentioned already, the end result of these changes and other enhancements in the new software update was a 50% improvement in average OLTP throughput for this workload, and a 70% reduction in variability from run to run. Roch also reports a 200% improvement in MS Exchange performance, and others have reported substantial improvements in performance consistency on iSCSI LUNs.

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Monday Nov 30, 2009

Maximizing NFS client performance on 10Gb Ethernet

I generally agree with the opening statement of the ZFS Evil Tuning Guide, which says "Tuning is often evil and should rarely be done." That said, tuning is sometimes necessary, especially when you need to push the envelope. At the moment, achieving peak performance for NFS traffic between a single client and a single server, running over 10 Gigabit Ethernet (10GbE), is one of those cases. I will outline below the tunings I used to achieve a ~3X throughput improvement in NFS IOPS over 10GbE, on a Chip Multithreading (CMT) system running Solaris 10 Update 7 (S10u7).

The default values for the tunables outlined below are all either being reviewed, or have already changed since the release of S10u7. Some of these tunings are unnecessary if you are running S10u8, and they should all be unnecessary in the future. Consider these settings a workaround to achieve maximum performance, and plan to revisit them in the future. A good place to monitor for future developments is the Networks page on the Solaris Internals site. You can also review the NFS section of the Solaris Tunable Parameters Reference Manual.

If you want to fine tune these settings beyond what is outlined here, a reasonable technique is to start from your current default settings and double the value until no observable improvement is seen.

For the time being, consider the following settings if you plan to run NFS between a single client and a single server over 10GbE:

Step 1 - TCP window sizes

The TCP window size defines how much data a host is willing to send/receive without an acknowledgment from its communication partner. Window size is a central component of the TCP throughput formula, which can be simplified to the following if we assume no packet loss:

max throughput (per second) = window size / round trip time (in seconds)

For example, with 1ms RTT and the current default window size of 48k, we have:

49152 / 0.001 = ~50 MB/sec per communication partner

This is obviously too low for NFS over 10GbE, so the send and receive window sizes should be increased. A setting of 1MB provides a max bandwidth of ~1 GB/sec with an RTT of 1ms.
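
Rearranging the formula gives the window size implied by a target throughput; here is a quick sanity check in shell, using the 1MB window and 1ms RTT figures above:

```shell
# max throughput = window / RTT, so a bigger window or a smaller RTT
# raises the ceiling. Integer math in microseconds avoids fractions.
window=1048576                            # 1MB window, in bytes
rtt_us=1000                               # 1ms round trip time, in microseconds
tput=$((window * (1000000 / rtt_us)))     # bytes per second
echo "$((tput / 1048576)) MB/sec"
```

With these values it prints 1000 MB/sec, consistent with the ~1 GB/sec figure in the text.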

Solaris 10 Update 8 and earlier

	ndd -set /dev/tcp tcp_xmit_hiwat 1048576
	ndd -set /dev/tcp tcp_recv_hiwat 1048576
TCP window size has been the subject of a number of CRs, has changed several times over the years, and the default is likely to change again in the near future. Use a command like
	ndd -get /dev/tcp tcp_xmit_hiwat
on your system to check the current default value before tuning, to make sure that you do not inadvertently lower the values.

Note: if you want to increase TCP window sizes beyond 1MB, you should also increase tcp_max_buf and tcp_cwnd_max, which currently default to 1MB.

Step 2 - IP software rings

A general heuristic for network bandwidth is that we need approximately 1GHz of CPU bandwidth to handle 1Gb (gigabit) per second of network bandwidth. That means that we need to use multiple CPUs to match the bandwidth of a 10GbE interface. Software Rings are used in Solaris as a mechanism to spread the incoming load from a network interface across multiple CPU strands, so that we have enough aggregate CPU bandwidth to match the network interface bandwidth. The default value for the number of soft rings in Solaris 10 Update 7 and earlier is too low for 10GbE, and must be increased:

Solaris 10 Update 7 and earlier on Sun4v

In /etc/system
	set ip:ip_soft_rings_cnt=16

Solaris 10 Update 7 and earlier on Sun4u, x86-64, etc

In /etc/system
	set ip:ip_soft_rings_cnt=8

Solaris 10 Update 8 and later

Thanks to the implementation of CR 6621217 in S10 u8, the default value for the number of soft rings should be fine for network interface speeds up to and including 10GbE, so no tuning should be necessary.

The changes introduced by CR 6621217 highlight why tuning is often evil. It was found that it is difficult to find an optimal, system wide setting for the number of soft rings if the system contains multiple network interfaces of different types. This resulted in the addition of a new tunable, ip_soft_rings_10gig_cnt, which applies to 10GbE interfaces. The old tunable, ip_soft_rings_cnt, applies to 1GbE interfaces. Both tunables have good defaults at this point, so it is best not to tune either on S10u8 and later.

Step 3 - RPC client connections

Now that we have enough IP software rings to handle the network interface bandwidth, we need to have enough IP consumer threads to handle the IP bandwidth. In our case the IP consumer is NFS, and at the time of this writing, its default behavior is to open a single network connection from an NFS client to a given NFS server. This results in a single thread on the client that handles all of the data coming from that server. To maximize throughput between a single NFS client and server over 10GbE, we need to increase the number of network connections on the client:

Solaris 10 Update 8 and earlier

In /etc/system
	set rpcmod:clnt_max_conns = 8
Note: for this to be effective, you must have the fix for CR 2179399, which is available in snv_117, s10u8, or s10 patch 141914-02

A new default value for rpcmod:clnt_max_conns is being investigated as part of CR 6887770, so it should be unnecessary to tune this value in the future.

Step 4 - Allow for multiple pending I/O requests

The IOPS rate of a single thread issuing synchronous reads or writes over NFS will be bound by the round trip network latency between the client and server. To get the most out of the available bandwidth you should have a workload that generates multiple pending I/O requests. This can be from multiple processes each generating an individual I/O stream, a multi-threaded process generating multiple I/O streams, or a single or multi-threaded process using asynchronous I/O calls.
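
A minimal illustration of keeping multiple requests in flight, using background dd processes against a scratch file (on a real system you would point the readers at a file on the NFS mount):

```shell
# Create a 16MB scratch file, then launch 8 concurrent readers so that
# several I/O requests are pending at once instead of one at a time.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=65536 count=256 2>/dev/null
for i in 1 2 3 4 5 6 7 8; do
  dd if="$f" of=/dev/null bs=65536 2>/dev/null &
done
wait
rc=$?
rm -f "$f"
```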

Conclusion

Once you have verified/tuned TCP window sizes, IP soft rings, and RPC client connections, and you have a workload that can capitalize on the available bandwidth, you should see excellent NFS throughput on your 10GbE network interface. There are a few more tunings that might add a few percentage points of performance, but the tunings shown above should suffice for the majority of systems.

As I mentioned at the start, these tunables are all either under investigation or already adjusted in Solaris 10 Update 8. Our goal is always to provide excellent performance out of the box, and these tunings should be unnecessary in the near future.

Thursday Jun 18, 2009

MySQL Performance on Sun Storage 7000

If you saw Don MacAskill's keynote (The Smugmug Tale) at the recent MySQL Conference and Expo, you know that he had lots of positive things to say about his experience running MySQL on the Sun Storage 7410 at Smugmug. The 7410 marks the current high end of the Sun Storage 7000 Unified Storage line of network attached storage devices. For the past few months, I have been investigating MySQL database performance on the entry level and mid-range models, to see whether they might provide similar benefits to smaller MySQL sites. I have to admit that I was more than a little surprised at just how well things performed, even on the entry level 7110. For the whole story, read on...

The Sun Storage 7000 Unified Storage product line currently ranges from the entry level Sun Storage 7110, which provides up to 4TB of storage in a 2U form factor, to the high end Sun Storage 7410, which scales up to half a petabyte, and can be configured with Flash accelerated Hybrid Storage Pools and/or a High-availability Cluster option. The entire line, which also includes the mid-range Sun Storage 7210 and 7310, features a common, easy to use management interface and, of course, killer-app DTrace Analytics. My investigation has focused primarily on the Sun Storage 7110, with some additional work on the Sun Storage 7210, which has more CPU, memory, and spindle capacity, and supports optional Write Flash Accelerator components.

Key Benefits for MySQL

There are a couple of aspects of the Sun Storage 7000 line that are really interesting for MySQL. As Don pointed out in his keynote, the underlying use of ZFS in the appliance means that you can get many of the benefits of ZFS even without migrating your OS to Solaris. The snapshot and clone features, for example, make automated backups and replication slave creation a snap, and the way that cloning is implemented means that a slave created by this method may require less physical storage than a slave created by traditional methods. That is because clones share unmodified disk blocks, so anything that remains unchanged after slave creation is never actually duplicated. You can also enable compression in the appliance, if you choose, which could mean even more space savings.

Analytics, Analytics, Analytics!

As a performance engineer, I can tell you that I have been frustrated many, many times by the lack of observability in storage devices. Even the best interfaces I have seen are like looking through frosted glass compared to the DTrace Analytics interface. As Don MacAskill says, "Drill down on everything. Correlate anything. God-like power." But don't take my word for it (or Don's), go to the Sun Storage 7000 Software page and download a simulator that you can run on your laptop or desktop, so you can see first hand.

Performance Highlights

OK, nobody would care about DTrace Analytics if all it did was tell you why your system was running so darn slow. The good news is that MySQL performance on the Sun Storage 7000 line is excellent, even on the entry level Sun Storage 7110. There are a couple of issues that I will point out later, but for MySQL on Linux over NFS, performance is great, right out of the box.

Huge read cache vs traditional HW Raid

The current Sun Storage 7000 lineup is based on standard Sun x64 systems and 64-bit OpenSolaris. That means lots of addressable memory, most of which is used as cache. On a standard Sun Storage 7110 with 8GB of memory, that can translate to over 6GB of usable cache. This can grow all the way to 128GB of memory on the 7410. Compare that to typical entry level hardware RAID devices that often come with as little as 256MB of cache, or even relatively high end hardware RAID devices that often come with 2GB of cache or less.

Blazingly fast reads from cache

With a database working set (the subset of data that clients are actively using) that fits in the cache of the Unified Storage appliance, I found that MySQL read-only performance on Linux and Solaris NFS was primarily bound by network bandwidth. For example, I observed over 6400 reads per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC. The sustained data throughput at this rate was a little over 80MB/sec, and with protocol overhead we had a sustained rate of a little over 90MB/sec on the wire, with peaks near 100MB/sec. That means we were running very near network line rate for the 1Gb NIC.

This resulted in 2x to 3x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set). The read-only results were strong across the board for Linux and Solaris over both NFS and iSCSI.

On the 7210, we saw ~90% linear scaling for 1 to 6 MySQL/sysbench servers over NFS, when configured with a 10Gb NIC on the Unified Storage appliance and a 1Gb NIC in each DB server. Analytics showed that CPU and network in the 7210 were both at approximately 50% utilization during this test, and our working set used 36GB out of the available ~60GB of cache, so it is likely that we could have seen similar scaling to 8 or 10 DB servers.

The only tuning we applied at the Unified Storage appliance end was to set the record size of the NFS share to 16KB for the InnoDB tablespace storage. We left the record size at its default of 128KB for the NFS share that contained the InnoDB logs. There was no special network or NFS tuning applied to either the DB server or the appliance.

Strong IOPS for large working sets

If you have an active working set that exceeds the cache capacity in the Unified Storage appliance, your random read performance will eventually be bound by the IOPS rate of the underlying storage. For the Sun Storage 7110 and 7210, which do not have Read Flash Accelerator options like the 7310 and 7410, that means the IOPS rate of the disks. In a MySQL/InnoDB/sysbench test on the 7110, with an aggregate working set of 36GB (180GB aggregate table space), resulting in a 40% cache hit rate in the appliance, I observed roughly 4000 NFS reads per second and roughly 2700 resulting disk IOPS. That translates to 230 IOPS for each of the 12 data drives. In this test we were using only about 15% of the drive capacity, so we were "short stroking" the disks and getting better IOPS rates than we would if we had longer average seeks. For example, at 80% capacity we might see something on the order of 150 to 180 IOPS per drive.
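The arithmetic behind those drive-level numbers can be sketched as a quick back-of-the-envelope check (this is a standalone illustration using the figures quoted above, not output from any appliance tool):

```python
# Back-of-the-envelope check of the IOPS figures quoted above.
nfs_reads_per_sec = 4000       # observed NFS reads/sec
cache_hit_rate = 0.40          # appliance cache hit rate
disk_iops_observed = 2700      # observed aggregate disk IOPS
data_drives = 12

# Cache misses that must be satisfied from disk
misses_per_sec = round(nfs_reads_per_sec * (1 - cache_hit_rate))
print(misses_per_sec)          # 2400

# Prefetch and metadata I/O account for the gap up to the observed 2700
per_drive_iops = disk_iops_observed / data_drives
print(round(per_drive_iops))   # 225, i.e. roughly 230 per data drive
```

The miss rate alone accounts for 2400 of the ~2700 disk IOPS; read-ahead and metadata I/O make up the rest.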

For additional background on IOPS and other metrics for Sun Storage 7000 storage devices, check out Roch Bourbonnais' blogs on Sun Storage 7000 Performance invariants and Compared Performance of Sun 7000 Unified Storage Array Line and Amitabha Banerjee's blog on Analyzing the Sun Storage 7000.

Excellent MySQL read-write performance on Linux over NFS

I was very pleasantly surprised by MySQL read-write performance on Linux NFS to the entry level Sun Storage 7110. I went into the study expecting good read-write performance to the 7210, but I thought that the lack of Write Flash Accelerators in the 7110 would be an issue for MySQL. Not so! In my testing, I observed over 2300 reads plus 400 writes per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC.

This resulted in 1.5x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set).

As with the read-only test, the large read cache in the 7110 compared to the HW RAID array probably played a big role here. The 85% cache hit rate in the appliance provided an advantage on read performance, which offset the potential advantage of the battery backed write cache in the HW RAID array. In addition, the multi-threaded MySQL/InnoDB/sysbench workload appears to have benefited from group commit on writes, since the 7110 started with lower throughput with 1 or 2 active threads, but began to outperform the HW RAID array at 4 threads and higher.

I was surprised enough by the MySQL read-write performance over Linux NFS that I felt compelled to confirm that fsync() calls in Linux were actually resulting in correct NFS data commits. Based on a combination of DTrace Analytics and Wireshark analysis of NFS traffic, I can confirm that the data was correctly going to disk.

Good MySQL read-write performance over iSCSI without write flash accelerators

The testing I have done so far with MySQL over iSCSI has been exclusively on the entry level Sun Storage 7110. Unlike the 7210, 7310, and 7410, this model does not currently include a write flash accelerator option, which would probably help MySQL read-write performance over iSCSI. Even so, we observed good performance over iSCSI, especially at higher thread counts. For example, using the same database and network configurations described earlier, MySQL/InnoDB/Sysbench throughput over iSCSI, at 32 to 64 active threads, was comparable to what was seen on the same server when it was attached via fiber channel to a traditional HW RAID array with 2GB memory for both Solaris and Linux.

Current Issues

MySQL read-write over Solaris NFS

MySQL read-only performance on Solaris NFS is currently excellent, but read-write performance is impacted by the lack of VMODSORT support in the Solaris NFS client (CR 6213799). This affects fsync() performance for large files using a default, buffered NFS mount. The normal workaround for this is to eliminate client-side buffering of file data by mounting the file system with the forcedirectio option, or enabling directio on a per-file basis. For example, the MySQL/InnoDB option "innodb_flush_method = O_DIRECT" enables directio on the InnoDB tablespace files. That is likely to work fine on an appliance that includes write flash accelerators like the Sun Storage 7210 and higher, but write flash accelerators are not currently available on the Sun Storage 7110. For the 7110 without write flash accelerators, we did not see a MySQL read-write performance gain by using directio instead of default, buffered NFS file access.

For Solaris, MySQL read-write performance on ZFS over iSCSI currently exceeds its performance over buffered or directio-enabled NFS on the Sun Storage 7110, provided that fast-write-ack is enabled on the iSCSI LUNs.

iSCSI fast-write-ack persistence

A Solaris system running ZFS over iSCSI can realize performance gains by enabling fast-write-ack on the iSCSI LUNs in the storage appliance, because ZFS is known to correctly issue SCSI cache flush commands when fsync() is called (that is not currently known to be true for any Linux file system). The fast-write-ack option is activated by enabling the "write cache enabled" option in the appliance for a given iSCSI LUN. However, due to CR 6843533, the write cache enabled setting on an iSCSI LUN will be silently disabled any time an iSCSI login occurs for a target, although the Unified Storage BUI and CLI will still indicate that it is enabled. Examples of iSCSI target login triggers include a reboot of the Unified Storage appliance, a reboot of the client, a client "iscsiadm modify target-param -p ..." command to modify negotiated parameters, or an "iscsiadm remove ..." followed by an "iscsiadm add ..." for the affected target. The workaround for CR 6843533 is to manually disable and then reenable write cache following an iSCSI target login.

Recommendations

General

For an NFS share that will be used to store MySQL tablespace files, match the record size of the share to the block size of the storage engine. For example, this should be 16k for InnoDB tablespace files. This can be configured on a per-share basis by setting the "Database record size" (in the BUI) or "recordsize" (in the CLI) for the share. This must be done before creating the tablespace files in the share.
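As a concrete sketch, setting the record size from the appliance CLI looks roughly like this (the project and share names here are hypothetical, and prompt details may vary by software release):

```
appliance:> shares select mysql select innodb-data
appliance:shares mysql/innodb-data> set recordsize=16K
                      recordsize = 16K (uncommitted)
appliance:shares mysql/innodb-data> commit
```

Remember that the setting only affects files created after the change, which is why it must be done before creating the tablespace files.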

For an NFS share that will store transaction logs or other non-tablespace files, the record size of the share can be left at its default of 128k.

For more information on Sun Storage 7000 Unified Storage network attached storage devices, including lots more performance data, check out these other blogs:

You will also find Sun Storage 7000 information on the new Sun System and Application Performance, Monitoring, and Tuning wiki, which includes performance information for a broad range of technologies.

Linux

We saw our best MySQL read-only and read-write performance on Linux using NFS. Read-only performance over iSCSI approached that of NFS, but read-write over NFS substantially outperformed iSCSI. Based on the read-write result, we recommend using NFS for MySQL on Linux. All of our testing on Linux used NFSv3, which was the default in the version of Linux we tested.

Solaris

For the Sun Storage 7110 without write flash accelerators, we saw our best MySQL read-write performance on Solaris by running ZFS over iSCSI. This avoided the performance impact we would have seen with MySQL over NFS, due to the lack of VMODSORT support in the Solaris NFS client (CR 6213799). The data integrity guarantees of ZFS also allowed us to disable the InnoDB double write buffer, and ZFS cache management semantics allowed us to enable fast-write-ack on the iSCSI LUNs in the storage appliance. Until CR 6213799 has been addressed, we recommend using ZFS over iSCSI for MySQL on Solaris if you do not have write flash accelerators in your appliance. However, be sure to review the iSCSI fast-write-ack persistence discussion in the "Current Issues" section above if you use this feature.
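For illustration, the client-side pieces of that recommendation might look like the following sketch (the pool and dataset names are hypothetical; adjust to your environment):

```
# Match the ZFS recordsize to the 16KB InnoDB page size before
# creating tablespace files on the iSCSI-backed pool:
zfs set recordsize=16K dbpool/innodb-data

# In my.cnf, disable the InnoDB doublewrite buffer, since ZFS
# already guarantees that torn pages cannot reach the tablespace:
[mysqld]
skip-innodb_doublewrite
```

As with the NFS record size, the recordsize property only affects newly created files, so set it before loading data.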

For the Sun Storage 7210 and higher, write flash accelerators should benefit MySQL read-write performance over NFS using either the forcedirectio mount option or the "innodb_flush_method = O_DIRECT" MySQL/InnoDB option, and will also benefit MySQL read-write performance over iSCSI. We did not have an appliance with write flash accelerators available for this test, so at this time we can not recommend one configuration over the other.

Wednesday Apr 29, 2009

Performance Modeling Slides

Last week I gave a presentation titled "Predicting Performance with Queuing Models" at the MySQL Camp section of the MySQL User Conference. For anyone who would like a copy, I have uploaded the slides for a longer version of the presentation than could be covered in the 45 minute time slot. I've also uploaded a version that includes speaker notes.

Presentation slides:

Tuesday Apr 21, 2009

Concurrent Commit Revisited

Today Sun announced the availability of MySQL 5.4, which contains a number of performance and scalability enhancements to the base MySQL 5.1 code. For the full run down on all of the enhancements, and the resulting scalability improvements, check out these blogs from Mikael Ronstrom and Allan Packer. You can also read Neelakanth Nadgir's blog, where he describes some performance fixes that were tried but rejected. As he says, sometimes you can learn a lot from the things that don't work out!

This release contains a range of improvements, both from the community (especially Google) and from work that has taken place in the MySQL performance team. As you can guess, we aren't stopping here, and there are a number of additional enhancements in the pipeline for future releases. Among those is a bug fix that is near and dear to my heart, and you can read some of the background in my blog post from last November. In essence, if you enable binlog logging, you currently lose concurrent commits in InnoDB, and this results in a loss of scalability. After languishing for far too long, an enhanced version of the fix I developed for this, and submitted to InnoDB in August 2008, is being tested for inclusion in a future release. The fix relates to a very small change in the use of the prepare_commit_mutex lock, which controls the order of commits to the binary log and the InnoDB logs, for the benefit of InnoDB hot backups (ibbackup). This small change results in a big improvement in scalability, as I demonstrated in my previous blog.

Based on a review by some of the top MySQL development engineers, including Mikael Ronstrom, Sergei Golubchik, and Mats Kindahl, the current plan is to introduce a new MySQL configuration option to enable variations on my fix. The new option, innodb_support_ibbackup, will accept values of 0, 1, or 2. These values will mean:

  • innodb_support_ibbackup=0 - no prepare_commit_mutex is ever used, which is unsafe for ibbackup but otherwise is ok and gives the best scalability.
  • innodb_support_ibbackup=1 - use my optimization of prepare_commit_mutex, which is safe for ibbackup, and dramatically reduces the performance impact.
  • innodb_support_ibbackup=2 - use the old behavior, which is safe for ibbackup, but breaks group commit and causes a huge scalability impact.
The default value will be 1, which uses the optimized locking and gives good out of the box performance while still supporting ibbackup. If you know you don't use ibbackup, and don't have some other custom, in-house requirement that unrelated commits in the binlog appear in the same order as commits in the InnoDB logs, you can improve scalability further by setting innodb_support_ibbackup to 0.
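Once the option ships, selecting the behavior would be a one-line my.cnf setting (a hypothetical sketch, since the fix is still being tested for inclusion in a future release):

```
[mysqld]
# Safe for ibbackup with minimal scalability impact (the planned default):
innodb_support_ibbackup = 1
```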

There are plenty of other performance and scalability enhancements that are already available in the new MySQL 5.4 release, so I urge you to check out the blogs by Mikael Ronstrom and Allan Packer for the full story. This is just one example of the additional improvements that are still in the pipeline!

Thursday Apr 09, 2009

Elements of Response Time

In a couple of weeks, I will be presenting a session on simple Queuing Models at the free MySQL Camp section of the MySQL User Conference. As luck would have it, I happened to come across a great example today of how a very simple back-of-the-envelope model can help us understand why a system performs the way it does.

My friend and co-worker, Tim Cook, has recently been doing some performance testing with the new Pool-of-Threads scheduler in MySQL 6.0. You can read about the results here and here on his blog.

As Tim showed in the first blog, with sysbench (the load driver) running on the same system as MySQL, throughput with the pool-of-threads scheduler was lower than with the default, thread-per-connection scheduler. However, with sysbench running on a remote system, accessing MySQL over the network (arguably a more realistic case), throughput with the two schedulers was quite similar. While reviewing response time data, we noted that the ramp in response time (i.e. the response time curve) for pool-of-threads was very similar to that for thread-per-connection when communicating over the network. The question was, why?

As Sherlock Holmes would say, it's elementary! (pun intended)

In a closed queuing network, the lower bound on average response time, once you reach Nsat (the number of consumers at which some queuing is guaranteed to happen), is (N*Dmax) - Z, where N is the number of consumers, Dmax is the service demand at the bottleneck device, and Z is the think time (the delay between requests from the same consumer).

If the response time curve is the same for both implementations, it suggests that Dmax is also the same and Nsat is comparable.

If you look at just CPU, network, and disk, you will have the following elements of response time:

  • The service demand at each device:
    • Dcpu = (%usr+%sys)/TPS
    • Ddisk = %busy/TPS
    • Dnet = (network utilization)/TPS
    • Note: TPS is Transactions Per Second reported by the application.
  • The total service demand:
    • D = Dcpu + Ddisk + Dnet
  • The service demand at the bottleneck device:
    • Dmax = max(Dcpu, Ddisk, Dnet)
  • The number of consumers at which some queuing is guaranteed to occur:
    • Nsat = (D + Z)/Dmax
    • Note: Nsat marks the approximate knee in the throughput and response time curves. It is often written as N*, but this can be confusing in plain text equations where "*" indicates multiplication. Nsat may also be referred to as Nopt, meaning the optimal number of consumers in the system, because it marks the approximate point where throughput levels out and response time starts to climb.
  • The lower bound on average response time is then:
    • for N < Nsat: D
    • for N >= Nsat: (N*Dmax)-Z

When sysbench was running locally with MySQL, Dmax was probably the CPU service demand, which differed between implementations. With sysbench remote, Dmax is probably now either Dnet or Ddisk. Some simple hand calculations will tell which, and that component will need to be addressed to reduce the average response time and increase the throughput.
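To make the hand calculation concrete, the bounds above can be sketched in a few lines (the utilization and TPS numbers below are made up for illustration, not measurements from Tim's study):

```python
# Lower bound on average response time in a closed queuing network,
# using the service demand formulas above. All inputs are illustrative.
def service_demands(cpu_pct, disk_pct, net_pct, tps):
    """Per-transaction service demand (seconds) at each device."""
    return (cpu_pct / 100.0 / tps,    # Dcpu = (%usr+%sys)/TPS
            disk_pct / 100.0 / tps,   # Ddisk = %busy/TPS
            net_pct / 100.0 / tps)    # Dnet = (network utilization)/TPS

def response_time_bound(n, demands, z=0.0):
    """Return (lower bound on average response time, Nsat)."""
    d = sum(demands)                  # D = Dcpu + Ddisk + Dnet
    dmax = max(demands)               # bottleneck device
    nsat = (d + z) / dmax             # approximate knee of the curves
    return (d if n < nsat else n * dmax - z), nsat

# Example: a network-bound system at 500 TPS with no think time
demands = service_demands(cpu_pct=60.0, disk_pct=30.0, net_pct=80.0, tps=500.0)
bound, nsat = response_time_bound(n=50, demands=demands)
print(round(nsat, 3))   # just a couple of consumers saturate the network
print(round(bound, 4))  # response time bound grows linearly with N above Nsat
```

With numbers like these, Dmax is clearly Dnet, so reducing the per-transaction network demand is what would lower the response time curve.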

I hope to see you at my MySQL Camp Session, at 2pm on Thursday, April 23rd, where we will discuss other uses of simple queuing models to answer questions about performance. MySQL Camp is free and you do not need to be registered for the main conference to attend, so drop by!

Friday Mar 06, 2009

MySQL Query Cache Sizing

I recently helped to analyze a performance issue with one of our MySQL customers, which highlighted the need to be cautious when sizing the query cache, and to validate whether it works well with your workload. The customer was evaluating a new 2-socket, 8-core, X64 system running Solaris, as a replication slave in an environment consisting of older systems running Linux. When they introduced the new slave, they found that it was unable to keep up with its replication load during their peak periods, and it took hours to catch up once the traffic dropped. The older systems running Linux had no trouble keeping up, so the obvious assumption was that there was something wrong with the new system, with Solaris, or with ZFS. In reality, the issue turned out to be an oversized MySQL query cache, and once that was addressed on the new system, it performed as expected.

The query cache is an interesting component of MySQL. It caches result sets of queries, which has the potential to improve performance by eliminating the entire parse/optimize/execute cycle for a repeated query. This comes at a cost though, and that cost tends to increase with the size of the cache. Due to the current implementation, the cost may also increase with the number of hardware threads and with the number of active connections, due to the cache's current coarse grained locking implementation. When the query cache helps, it can help a lot. When it hurts, it can hurt a lot. If you are currently using or considering using the query cache, you should definitely test your assumptions about how it interacts with your particular workload. You should also plan to revisit the topic from time to time, since changes to your application or workload may change the interactions with the query cache.

This particular case also highlights an issue related to migrations. If you are doing a "major" migration, for example between systems with different architectures, significantly different configurations, or, as in this case, with different operating systems, be sure to keep track of any other changes that are made along with the migration. It is natural to assume that any negative impacts are due to the big changes, but many times it will turn out to be a seemingly small change that causes grief. In this case, we knew that a major performance drop due to a Linux to Solaris migration was completely unexpected, which kept us from being led too far astray from finding the real root cause.

Now back to the query cache... In most cases, the query cache should be sized in the tens of megabytes and not larger. In the particular case we investigated, the customer had been running with a 20MB query cache on the older systems, and configured the new system with a 500MB query cache. This was presumably done because the new system had substantially more memory than the older systems, and it seemed to make sense to use some of that memory to boost the size of the query cache. However, due to a number of factors related to the implementation of the query cache, this actually hurt performance by a large margin. This was due to the level of overhead that was required to maintain the large cache, which was much higher than any benefit that it could provide.

To identify the problem, the customer generated a dtrace profile of mysqld with the DTraceToolkit "/opt/DTT/hotuser" script, or one similar to it. When piped through c++filt to demangle C++ names, the output looked like this:

Output truncated to the top 4 entries:

libc.so.1`clear_lockbyte 361 1.2%
libc.so.1`mutex_lock_impl 511 1.7%
libc.so.1`mutex_trylock_adaptive 1127 3.8%
mysqld`Query_cache::insert_into_free_memory_sorted_list(Query_cache_block*, Query_cache_block**) 21702 73.1%

This shows that mysqld was spending 73% of its time in the Query_cache::insert_into_free_memory_sorted_list() method. As soon as this was seen, we knew that the issue was related to updating the query cache, and a quick comparison of configurations confirmed that this system had a dramatically larger query cache than the systems that were performing well. It is likely that plockstat would have shown substantial contention on the query cache structure_guard_mutex as well, but that wasn't checked. Decreasing the query cache size, by setting the query_cache_size tunable, resolved the problem instantly.

The issue here was that the customer had a moderate level of write traffic, and the current query cache implementation invalidates all result sets for a given table whenever that table is updated. As the query cache grows in size, the number of entries that must be invalidated for a given table may grow as well. In addition, the coarse locking on the cache can lead to lock contention that can kill performance, particularly on multi-core hardware.

This leads to some general recommendations:

  1. If you plan to use the query cache, start small, perhaps 10MB to 20MB, and monitor the impact on your query/transaction throughput and response time. If performance drops, the query cache may not be a good match for your workload.
  2. Consider the read/write mix of your workload, as well as the rate of data updates. Query cache invalidations due to write traffic can be quite expensive, and even moderate levels can counteract the benefit of the cache.
  3. Verify whether the query cache is enabled, and if so, verify its currently configured size. You can do this by running "SHOW VARIABLES LIKE '%query_cache%';".  If query_cache_size (which defines the size of the cache, in bytes) is non-zero, and have_query_cache is TRUE, the cache is active. If query_cache_size is set to zero or have_query_cache is FALSE, the query cache is disabled.
    Note:
    • query_cache_size defaults to zero (i.e. disabled) on UNIX/Linux, but is non-zero in some of the example my.cnf files included with the MySQL distribution and on Windows.
    • You can change query_cache_size dynamically via "SET GLOBAL query_cache_size = XXX;" or permanently by adding a "query_cache_size = XXX" statement to your my.cnf file.
  4. Verify your query cache statistics via "SHOW GLOBAL STATUS;" for example
    • Hit rate = Qcache_hits / (Qcache_hits + Com_select)
    • Insert rate = Qcache_inserts / (Qcache_hits + Com_select)
    • Prune rate = Qcache_lowmem_prunes / Qcache_inserts
      • This approximates the fraction of inserts that did not fit in the cache's available free memory and forced older entries to be pruned.
      • Updates to housekeeping data may also result in a prune, so this is not an exact correlation.

You should try to allow all of the caches on your system to warm up before checking your cache statistics. This means allowing your application to run for at least 10 to 30 minutes, which is usually enough time for the system to reach a steady state.

When looking at hit rates, keep in mind that relatively low query cache hit rates may provide substantial benefit due to the potentially high cost of generating a result set for a query. For example, we usually try to maintain InnoDB buffer pool hit rates in the high 90% range, while hit rates below 50% in the query cache may be fine. On the other hand, if you can't get hit rates out of the single digits, you should really question the benefit of enabling the cache, since looking in the cache adds overhead to every select. In the end, query/transaction throughput and response time should be your ultimate indicator.

When looking at insert rates, keep in mind that relatively low insert rates may have a substantial negative impact on performance due to the high cost of updating the cache. I would be cautious of anything higher than single digit insert rates, and test the application with the query cache disabled (by setting query_cache_size to zero) to see if the cache is helping or hurting performance. However, keep in mind that a high prune rate might result in the need to repeatedly insert the same result set in the cache, so you need to consider that as well.

A non-zero prune rate may be an indication that you should increase the size of your query cache. However, keep in mind that the overhead of maintaining the cache is likely to increase with its size, so do this in small increments and monitor the result. If you need to dramatically increase the size of the cache to eliminate prunes, there is a good chance that your workload is not a good match for the query cache.

As you consider how your workload interacts with the query cache, keep in mind that you can use the SQL_NO_CACHE hint in a select statement if you find that a specific query has a tendency to negatively interact with the query cache.

As I said earlier, when the query cache helps, it can help a lot, but when it hurts, it can hurt a lot. It all depends on your particular workload, so it requires testing and is not something you should just blindly enable. In the end, the goal is to increase throughput and/or reduce response time, so those are the metrics you should ultimately consider. For more information about configuring and tuning the query cache, see the Query Cache section of the MySQL documentation. There is also an excellent write-up in the book "High Performance MySQL" by Baron Schwartz et al. that touches on a number of issues I haven't mentioned.

I'd like to acknowledge my colleague Simon Vobes, who did a great job working with the customer to resolve this performance issue!

Friday Jan 09, 2009

Memory Leak Detection with libumem

I recently had the opportunity to do some memory leak detection with libumem, so I decided to share some thoughts and examples on its use.  The issue I was working on was related to a call for help from a colleague who was working primarily on Linux and OS X.  His application had a memory footprint that was growing over time, and he had used Valgrind and dtrace (on OS X) to try to find a leak, but had reached a dead end.  I offered to run the application on Solaris and use the libumem(3LIB) library and mdb(1) to search for leaks, and was able to quickly find a leak in the open source SIGAR library that he was using with his application.  For more details and current status on the specific leak, check out bug SIGAR-132.  For this discussion, I'll focus primarily on a simple example program to highlight libumem.

What is libumem?

The libumem(3LIB) library is a highly scalable memory allocation library that supports the standard malloc(3C) family of functions as well as its own umem_alloc(3MALLOC) functions and umem_cache_create(3MALLOC) object caching services.  It also provides debugging support, including detection of memory leaks and many other common programming errors.  The debugging capabilities are described in umem_debug(3MALLOC).  This discussion will focus primarily on using the debugging capabilities with standard malloc(3C).  For a performance comparison between libumem and several other memory allocators, have a look at Tim Cook's memory allocator bake-off from a few weeks back.

What is a memory leak?

Before I get started, let me clarify what I mean by a memory leak.  To me, a pure memory leak occurs when you allocate memory but then fail to retain a pointer to that memory.  For example, you might overwrite a pointer with a new value, or allow an automatic variable to be discarded without first freeing the memory that it references.  Without a pointer to the memory, you can't use it any more or free it, and it has leaked out of your control.  Some people also refer to situations where memory is held longer than necessary as a memory leak, but to me that is a memory hog, not a memory leak.  The debugging tools in libumem can help with both issues, but the techniques are different.  I will focus on what I consider a pure memory leak for today.

How do I enable libumem?

If you are compiling a new application and want libumem as your default memory allocator, just add -lumem to your compile or link command.  If you want to use any of the libumem specific functions, you should also #include <umem.h> in your program.  If you want to enable libumem on an existing application, you can use the LD_PRELOAD environment variable (or LD_PRELOAD_64 for 64 bit applications) to interpose the library on the application and cause it to use the malloc() family of functions from libumem instead of libc.

For example with sh/ksh/bash:

LD_PRELOAD=libumem.so your_command

with csh/tcsh:

(setenv LD_PRELOAD libumem.so; your_command)

To confirm that you are using libumem, you can use the pldd(1) command to list the dynamic libraries being used by your application.  For example:

$ pgrep -l my_app
 2239 my_app
$ pldd 2239
2239:    my_app
/lib/libumem.so.1
/usr/lib/libc/libc_hwcap2.so.1
$

How do I enable libumem debugging?

As described in umem_debug(3MALLOC), the activation of run-time debugging features is controlled by the UMEM_DEBUG and UMEM_LOGGING environment variables.  For memory leak detection, all we need to enable is the audit feature of UMEM_DEBUG.

For example, with sh/ksh/bash:

LD_PRELOAD=libumem.so UMEM_DEBUG=audit your_command

with csh/tcsh:

(setenv LD_PRELOAD libumem.so; setenv UMEM_DEBUG audit; your_command)

How do I access the debug data?

The libumem library provides a set of mdb(1) dcmds to inspect the debug data collected while the program runs.  To use the dcmds, you can either run your program under the control of mdb, attach to the program with mdb, or generate a core dump (for example with gcore(1)) and examine the dump with mdb.  The latter is the simplest, and looks like this:

$ pgrep -l my_app
1603 my_app
$ gcore 1603
gcore: core.1603 dumped
$ mdb core.1603
Loading modules: [ libumem.so.1 ld.so.1 ]
>

The commands above assume that your program runs long enough for you to generate the core dump, and that the memory leak has been triggered before the core dump is generated.  For a fast running program or to examine the image just before program exit, you can do the following:

$ LD_PRELOAD=libumem.so UMEM_DEBUG=audit mdb ./your_app
> ::sysbp _exit
> ::run
mdb: stop on entry to _exit
mdb: target stopped at:
0xfee3301a: addb %al,(%eax)
> ::load libumem.so.1
>

Once you are in mdb, you can get a listing of the libumem dcmds by running ::dmods -l libumem.so.1 and can get help on an individual dcmd with ::help dcmd.  For example:

> ::dmods -l libumem.so.1

libumem.so.1
dcmd allocdby - given a thread, print its allocated buffers
dcmd bufctl - print or filter a bufctl
dcmd bufctl_audit - print a bufctl_audit
dcmd findleaks - search for potential memory leaks
...
> ::help findleaks

NAME
findleaks - search for potential memory leaks

SYNOPSIS
[ addr ] ::findleaks [-dfv]

DESCRIPTION

Does a conservative garbage collection of the heap in order to find
potentially leaked buffers. Similar leaks are coalesced by stack
trace, with the oldest leak picked as representative. The leak
table is cached between invocations.
...

You can now use the various dcmds to look for memory leaks and other common problems with memory allocation, or to simply better understand how your application uses memory.

A complete example

The attached mem_leak.c program includes three simple memory leaks.  The first is within main(), where we overwrite a pointer after allocating memory.  The second is within a function, where we allow an automatic variable to be discarded before freeing memory that it references.  The last is a nested function call that includes a logic bug that causes it to return early, also allowing an automatic variable to be discarded before freeing memory that it references.
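The attached source is not reproduced here, but the three leak patterns can be sketched in portable C roughly as follows. This is a hypothetical reconstruction; the function names, buffer sizes, and the call nesting in the real mem_leak.c differ:

```c
#include <stdlib.h>

/* Leak 1: overwriting a pointer.  The second malloc() overwrites p,
   so the first allocation becomes unreachable and can never be freed. */
char *leak_by_overwrite(void)
{
    char *p = malloc(1024);
    p = malloc(1024);       /* first buffer is now leaked */
    return p;               /* caller can free only the second buffer */
}

/* Leak 2: discarding an automatic variable.  buf goes out of scope
   when the function returns, taking the only reference with it. */
void leak_by_scope(void)
{
    char *buf = malloc(1024);
    if (buf == NULL)
        return;
    /* ... use buf ... missing free(buf) before return */
}

/* Leak 3: a logic bug causes an early return that skips the cleanup
   at the bottom of the function, again discarding the only reference
   held in an automatic variable. */
void leak_by_early_return(int error)
{
    char *buf = malloc(1024);
    if (buf == NULL)
        return;
    if (error)
        return;             /* bug: skips the free below */
    free(buf);
}
```

In each case the common thread is the same: the last pointer to a live allocation is lost, which is exactly the condition ::findleaks detects.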

To get started, compile the program and start it up with libumem and its audit feature enabled:

$ /opt/SunStudioExpress/bin/cc -o mem_leak mem_leak.c
$ LD_PRELOAD=libumem.so UMEM_DEBUG=audit ./mem_leak
Memory allocated, hit enter to continue:
Memory freed, hit enter to exit:

With the program waiting at the second prompt, go to another window to generate a core dump and examine the results with mdb:

$ pgrep -l mem_leak
1714 mem_leak
$ gcore 1714
gcore: core.1714 dumped
$ mdb core.1714
Loading modules: [ libumem.so.1 ld.so.1 ]
> ::findleaks
CACHE LEAKED BUFCTL CALLER
08072c90 1 0807dd08 buf_create+0x12
08072c90 1 0807dca0 func_leak+0x12
08072c90 1 0807dbd0 main+0x12
------------------------------------------------------------------------
Total 3 buffers, 3456 bytes
>

The output from ::findleaks shows that we have leaked three memory buffers, as expected, and we can now obtain a stack trace for each by running ::bufctl_audit against each bufctl address:

> 0807dbd0::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dbd0 807bb00 f5c5bb73837 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
main+0x12
_start+0x7d

> 0807dca0::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dca0 807b180 f5c5bb74120 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
func_leak+0x12
main+0x2f
_start+0x7d

> 0807dd08::bufctl_audit
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
807dd08 807acc0 f5c5bb7446e 1
8072c90 0 0
libumem.so.1`umem_cache_alloc_debug+0x144
libumem.so.1`umem_cache_alloc+0x19a
libumem.so.1`umem_alloc+0xcd
libumem.so.1`malloc+0x2a
buf_create+0x12
nested_leak_l3+0xb
nested_leak_l2+8
nested_leak_l1+8
nested_leak+8
main+0x34
_start+0x7d

>

Note that if you have leaked any "oversized" allocations (currently anything over 16k) the output will include a list of these leaked buffers including a byte count and vmem_seg address.  You can obtain the stack traces for these buffer allocations by running ::vmem_seg -v against each vmem_seg address.

Looking at the stack traces, the entry just below libumem.so.1`malloc in each stack is the function that allocated the leaked buffer.  If it isn't clear which malloc() got leaked, it may help to use the ::dis dcmd to disassemble the code.  For example:

> main+0x12::dis
main: pushl %ebp
main+1: movl %esp,%ebp
main+3: subl $0x28,%esp
main+6: pushl $0x0
main+8: pushl $0x400
main+0xd: call -0x256 <PLT=libumem.so.1`malloc>
main+0x12: addl $0x8,%esp
main+0x15: movl %eax,-0x8(%ebp)
main+0x18: pushl $0x0
main+0x1a: pushl $0x400
main+0x1f: call -0x268 <PLT=libumem.so.1`malloc>
main+0x24: addl $0x8,%esp
main+0x27: movl %eax,-0x8(%ebp)
main+0x2a: call -0xff <func_leak>
main+0x2f: call -0x44 <nested_leak>
main+0x34: pushl $0x0
main+0x36: pushl $0x8050e70
>

The example above shows that there were two calls to malloc() near the beginning of main(), and we have leaked the memory allocated by the first one.  Note that the second malloc() is not reported as a leak even if the core is generated while the buffer is still active.  That is because we still have a reference to the buffer and it has not actually been leaked.  Whether the buffer is eventually freed doesn't really matter.  As long as we have a reference to the buffer at the time the core is generated or mdb examines the running program, it will not be reported as a leak.

Even with the information obtained from libumem and mdb, you will still have some detective work to do to determine exactly why you have leaked a particular buffer.  However, knowing which buffer has been leaked, and the point in the code where it was allocated, is more than half the battle.

Keep in mind that the allocation of the leaked memory may occur in a system library, not in the code for your program. This could mean you have found a leak in a system library, but more likely it means that you requested an object from the library and were supposed to call another function to discard that object when you were finished with it. For example, in the SIGAR leak that I mentioned at the start of this discussion, the leaks were related to buffers allocated by libnsl, but the real bug was a failure by sigar_rpc_ping() to call clnt_destroy(3NSL) to clean up a CLIENT handle it had created with clntudp_create(3NSL).

Tuesday Nov 04, 2008

Toward a More Scalable MySQL Replication Master

If you are a MySQL 5.x/6.0 InnoDB replication user, right now you take a significant performance hit on the replication master simply by turning on the binlog. The good news is that we've taken a big step toward eliminating that performance gap. I'll describe the problem, how I was able to track down the root cause, and point to a patch that fixes the problem. Since the changes are in the InnoDB code, right now we're waiting on Oracle/Innobase to review the fix and formally commit it. Once that happens, you should see it show up in binary releases. In the meantime, if you build your own binaries you can test the prototype patch yourself.

One of the things I have been working on quite a bit over the past several months is scalability of the nodes within a MySQL scale-out replication environment.  The reason is that there has been a rapid increase in CPU cores and hardware thread counts per server in recent years, across all system types.  Multi-core processors are now the norm, and regardless of the processor family you choose, each node in your environment will soon have dozens or hundreds of hardware threads available.  This trend is causing the notion of horizontal scaling (scale out) to be replaced by that of diagonal scaling (scale out and up), where a horizontally scaled architecture is made up of vertically scaled nodes.  This provides for increased performance while maintaining or even reducing the number of required nodes.  The key to this trend is that we need software that can scale both vertically and horizontally.  We also need to avoid negative scaling on multi-core systems, which refers to a drop in performance as thread counts increase, and is often caused by coarse-grained locking in the code.

With this in mind, I set out to assess and improve the vertical scalability of nodes in a MySQL scale-out replication environment.  The first step was to analyze scalability of the master node, which accepts update requests from clients, applies them to the master copy of the database, then makes the updates available to slave nodes by way of the binlog file.  For the purposes of this exercise, I will claim that enabling binlog logging is the primary differentiator between a replication master and a "regular" MySQL database server (yes, I know that binlog logging can also be used to support backups).  So the initial task was to measure the scalability impact of enabling binlog logging.

The basic methodology that I used to assess the impact of binlog logging was to drive a multi-threaded load against MySQL both with and without binlog logging enabled, gradually increasing the number of threads driving the load in each case.  Throughput was measured in transactions per second, and the chosen workload was a sysbench OLTP test.  This is a fairly simplistic benchmark, but has proven to be effective in this case.  The results of this initial test are shown below.

What the graph above shows is the difference in throughput between a baseline version of MySQL and the same version with binlog logging enabled. The baseline shown here is from a patched version of MySQL 6.0.5-alpha, which is roughly equivalent to 6.0.7-alpha. The results through 32 threads are similar to those I have obtained using 5.1.28-rc. The system used was a Sun SPARC Enterprise T5240, and psradm/psrset commands were used to allocate 2 cores (16 threads) to sysbench and 6 cores (48 threads) to mysqld. The remaining cores were taken offline. I want to mention that I did not see continued scaling beyond 48 sysbench threads, and have truncated the data as shown in order to focus on a specific scalability issue.

The striking result of this test was that, with the binlog enabled, both throughput and scalability were much worse than without.  Increasing the number of load generator threads did not result in any increase in throughput beyond 12 threads.  In order to find out why, I needed to determine what resource became a bottleneck as we approached and exceeded 12 load generator threads.  Since lock contention is a common reason for lack of scalability in a multi-threaded application, I turned next to the Solaris plockstat utility.  This DTrace consumer often makes it extremely easy, and sometimes quite obvious, to identify the root cause of a scalability issue in a multi-threaded program.  A sample plockstat command and the output from a run against MySQL 6.0.5-alpha with 32 sysbench threads is shown below.  The output has been truncated to save space.

plockstat -C -n 10 -e 5 -x bufsize=10k -x aggsize=2m -p 3491 | c++filt

Mutex block
Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
 2038 54747571 mysqld`$XAJzoBKpX4GIETk.prepare_commit_mutex mysqld`int innobase_xa_prepare(handlerton*,THD*,bool)+0x4c
  282   447655 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event*)+0xa4
   83   471277 mysqld`mysql_bin_log+0x8     mysqld`bool MYSQL_BIN_LOG::write(Log_event*)+0xa4

The plockstat options used above include:

-C              look for contention events
-n 10           report the top 10 locks
-e 5            sample for 5 seconds
-x bufsize=10k
-x aggsize=2m   increase the size of the DTrace buffers due to the large number of events
-p 3491         monitor process 3491, which was mysqld

Note that the output from plockstat can include mangled C++ names, so c++filt, which is part of Sun Studio, was used to convert them back to the original C++ names.

The output above shows that the hottest lock is prepare_commit_mutex and the hottest caller is innobase_xa_prepare().  Searching through the code for this mutex and function call showed that the mutex was acquired in innobase_xa_prepare() and released in innobase_commit(). This code is located in storage/innobase/handler/ha_innodb.cc in 5.1 and 6.0 source distributions.  A Google search showed that the code was added during the 4.1 to 5.0 upgrade, as part of XA transaction support.  Incidentally, this specific code section is a very tiny part of a major code upgrade.

Figuring out what happens in MySQL from the time the prepare_commit_mutex lock is acquired until it is released was an interesting exercise, and I may talk about some of the techniques I used in future blogs.  As you can guess, our old friend dtrace came in pretty handy, but I also did a lot of good old fashioned code reading and code diffing between 4.1 and 5.0 code.  My main goal was to determine how things worked in 4.1, what had changed in 5.0, whether we could restore the 4.1 behavior, and whether that would improve scalability.

The first question to ask was, what is the purpose of the prepare_commit_mutex lock?  In essence, what is this lock protecting? Comments in the code indicate that the lock is there to ensure that commits to the binlog file occur in the same order as commits to the InnoDB logs, and that this is required for InnoDB hot backup. Subsequent discussions with the MySQL Replication team also indicate that this ordering is required for replication, so that commits on the slave will occur in the same order as commits on the master.

One of the first things I noticed about the prepare_commit_mutex lock is that it is acquired before the InnoDB prepare phase takes place. With innodb_flush_log_at_trx_commit set to 1 in my.cnf, this includes a synchronous flush of the InnoDB logs, which means that this will be a very long time to hold a lock.  The lock is then held while the binlog file is committed, which will also include a synchronous flush if sync_binlog is set to 1.  It is finally released after the InnoDB commit phase, which will include another synchronous flush of the InnoDB logs if innodb_flush_log_at_trx_commit is set to 1.  This means that we will have two or possibly three synchronous flushes to disk while holding this lock.  This results in an extremely long hold time on the lock, and is the main reason that the lock becomes so hot and puts a ceiling on scalability.

So what could we do about it?  The goal was to reduce the hold time on the lock by acquiring it later, releasing it earlier, or both.  To track this effort, I opened MySQL bug#38501 (hold times on prepare_commit_mutex limit read/write scalability).  This was later determined to be a duplicate of bug#13669 (Group commit is broken in 5.0).

The first improvement I tried was to acquire the lock at the end of the innobase_xa_prepare() method instead of at the beginning, so the lock was not held during the prepare phase.  This resulted in a 20%+ throughput improvement at 32 threads.  There has been a lot of discussion about whether this is a valid change, and at this point I believe that it will be found to be ok.  The replication team has indicated that replication has no ordering requirement between the prepare phase and the binlog commit, and I am waiting for feedback from Oracle/Innobase about the constraints for InnoDB hot backups.

The second improvement was to restore the 4.1 behavior for synchronous flushes of the InnoDB logs at commit time.  In the 5.0 code, the lock is held while the logs are both written and flushed.  In the 4.1 code, the logs were written but not flushed, the protective lock was released, then the logs were flushed to disk.  This meant the lock was not held during the flush to disk, allowing for increased scalability. This was the basis for "group commit" behavior in 4.1.  As it turns out, this behavior can be returned to MySQL with relatively few changes to the code, utilizing methods that are already present.  This change, combined with the first, resulted in more than double the throughput at high thread counts, as shown below.
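The idea behind that 4.1-style group commit can be sketched in pthread-style C. This is an illustrative model only, not the actual InnoDB code; the function names and the single LSN counter are stand-ins:

```c
#include <pthread.h>

static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_lsn = 0;

/* Stand-in for an expensive synchronous flush (e.g. an fsync()). */
static void flush_log_to_disk(unsigned long up_to_lsn)
{
    (void)up_to_lsn;
}

/* 5.0-style behavior: the flush happens while the lock is held,
   so only one transaction at a time can make progress in commit. */
unsigned long commit_flush_inside(void)
{
    pthread_mutex_lock(&log_mutex);
    unsigned long lsn = ++next_lsn;   /* append record, fix commit order */
    flush_log_to_disk(lsn);           /* long disk wait under the lock */
    pthread_mutex_unlock(&log_mutex);
    return lsn;
}

/* 4.1-style group commit: commit order is still fixed under the lock,
   but the flush happens after release, so other transactions can
   append their records while this one waits on the disk. */
unsigned long commit_flush_outside(void)
{
    pthread_mutex_lock(&log_mutex);
    unsigned long lsn = ++next_lsn;   /* write the record, not yet flushed */
    pthread_mutex_unlock(&log_mutex);
    flush_log_to_disk(lsn);           /* one flush can cover everything up to lsn */
    return lsn;
}
```

The key design point is that the ordering guarantee is preserved, because commit order is still assigned while the mutex is held; only the disk wait is moved outside the critical section, so a single flush can satisfy several queued commits.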

The graph above is the same as the one shown earlier, but with a new set of data generated using a prototype patch for bug#38501 and bug#13669.  This shows that with the prototype patch, the impact of enabling binlog logging is dramatically reduced.  At high thread counts the throughput is more than double the Baseline+Binlog number, and is only about 9% below the throughput without binlog enabled.  This is exactly the kind of improvement I was hoping to achieve!

So what's next?  The prototype patch for this issue still needs to go through formal code review, but I am hopeful that it will require only minor changes.  Based on conversations with the MySQL Replication team, I believe that the basic changes retain the functionality that is required for replication.  Oracle/Innobase also needs to review the functionality related to InnoDB hot backups, and hopefully that will happen in the near future.

If you would like to try out the patch yourself, you can download the current prototype here and apply it to a 5.1 or 6.0 source distribution. Follow the regular process to install MySQL from a source distribution, for example these instructions for a 5.1 install, but apply the source code patch before the configure/make steps. For example, if you saved the patch to a file named "patchfile" you would apply it like this:

shell> gunzip < mysql-VERSION.tar.gz | tar -xvf -
shell> cd mysql-VERSION
shell> patch -p0 < patchfile

Keep in mind that this is a prototype patch, and should be considered pre-alpha code.

Incidentally, the work I have described here primarily helps the case where innodb_flush_log_at_trx_commit is set to 1 and we synchronously flush the InnoDB logs at the end of the prepare and commit phases of each transaction.  It does not fully address the case where sync_binlog is set to 1 and we also synchronously flush the binlog file for each transaction, although you will get some scalability improvement even then.  The sync_binlog case is still under investigation, and is certainly solvable.  We are also looking at a number of other areas for increased scalability of both the master and slave nodes.

Sunday Aug 24, 2008

Introduction

A bit about me to get things started... 

I am a Performance Engineer in Sun's Performance Applications Engineering (PAE) organization, where I conduct research on performance issues across the hardware/software stack. I plan to use this blog to share insights and experiences that I gain through my research, including tips and techniques for performance analysis, interesting findings and specific performance gains, and general musings on the state and future direction of computer systems performance. I will also share anything I come across that seems generally interesting, but the majority of entries will most likely be performance related.

I have worked in the computer field for 26 years, having spent the first 16 at the University of Rochester (UofR) and the last 10 at Sun. At the UofR, I started as a computer operator and worked/studied my way up to Technical Lead for their UNIX Group, which maintained campus infrastructure systems as well as many of the research computers. At Sun, prior to joining PAE, I spent several years on site at Xerox's primary North American data center, serving as a senior resource for performance and availability issues on their Sun products. Since joining the PAE organization, I have focused primarily on scalability on CMT/CMP systems, and currently have a major focus on MySQL Replication environments.

I am a lifelong learner and seemingly perpetual student, and completed my academic studies over many years. Most recently, I completed a graduate certificate in computer architecture at Stanford University, where I focused my energies on emerging CMT/CMP technologies and parallel processing. Prior work included a B.Sc. in computer operating systems at the State University of New York, Empire State College, and associate-level studies in computer systems at the Rochester Institute of Technology.

I am very happily married and have a 13 year old daughter in middle school. My wife Mette is from Norway, and I have visited there with my family many times. When I'm not in front of a computer, I enjoy astronomy, bike riding, hiking, movies, and reading classic literature to my daughter, including everything from Gulliver's Travels to Frankenstein and Don Quixote.

My next post will most likely be related to MySQL replication, particularly focusing on recent work I have done to try to improve InnoDB replication master scalability when using XA transactions in conjunction with binlog replication.  Check back in a couple of weeks if that sounds interesting!
