MySQL Performance on Sun Storage 7000

If you saw Don MacAskill's keynote (The SmugMug Tale) at the recent MySQL Conference and Expo, you know that he had lots of positive things to say about his experience running MySQL on the Sun Storage 7410 at SmugMug. The 7410 marks the current high end of the Sun Storage 7000 Unified Storage line of network attached storage devices. For the past few months, I have been investigating MySQL database performance on the entry level and mid-range models, to see whether they might provide similar benefits to smaller MySQL sites. I have to admit that I was more than a little surprised at just how well things performed, even on the entry level 7110. For the whole story, read on...

The Sun Storage 7000 Unified Storage product line currently ranges from the entry level Sun Storage 7110, which provides up to 4TB of storage in a 2U form factor, to the high end Sun Storage 7410, which scales up to half a petabyte and can be configured with Flash accelerated Hybrid Storage Pools and/or a High-availability Cluster option. The entire line, which also includes the mid-range Sun Storage 7210 and 7310, features a common, easy-to-use management interface and, of course, the killer-app DTrace Analytics. My investigation has focused primarily on the Sun Storage 7110, with some additional work on the Sun Storage 7210, which has more CPU, memory, and spindle capacity, and supports optional Write Flash Accelerator components.

Key Benefits for MySQL

There are a couple of aspects of the Sun Storage 7000 line that are really interesting for MySQL. As Don pointed out in his keynote, the underlying use of ZFS in the appliance means that you can get many of the benefits of ZFS even without migrating your OS to Solaris. The snapshot and clone features, for example, make automated backups and replication slave creation a snap, and the way that cloning is implemented means that a slave created by this method may require less physical storage than a slave created by traditional methods. That is because clones share unmodified disk blocks, so anything that remains unchanged after slave creation is never actually duplicated. You can also enable compression in the appliance, if you choose, which could mean even more space savings.

Analytics, Analytics, Analytics!

As a performance engineer, I can tell you that I have been frustrated many, many times by the lack of observability in storage devices. Even the best interfaces I have seen are like looking through frosted glass compared to the DTrace Analytics interface. As Don MacAskill says, "Drill down on everything. Correlate anything. God-like power." But don't take my word for it (or Don's): go to the Sun Storage 7000 Software page and download a simulator that you can run on your laptop or desktop, so you can see it first hand.

Performance Highlights

OK, nobody would care about DTrace Analytics if all it did was tell you why your system was running so darn slow. The good news is that MySQL performance on the Sun Storage 7000 line is excellent, even on the entry level Sun Storage 7110. There are a couple of issues that I will point out later, but for MySQL on Linux over NFS, performance is great right out of the box.

Huge read cache vs. traditional HW RAID

The current Sun Storage 7000 lineup is based on standard Sun x64 systems and 64-bit OpenSolaris. That means lots of addressable memory, most of which is used as cache. On a standard Sun Storage 7110 with 8GB of memory, that can translate to over 6GB of usable cache. This can grow all the way to 128GB of memory on the 7410. Compare that to typical entry level hardware RAID devices that often come with as little as 256MB of cache, or even relatively high end hardware RAID devices that often come with 2GB of cache or less.

Blazingly fast reads from cache

With a database working set (the subset of data that clients are actively using) that fits in the cache of the Unified Storage appliance, I found that MySQL read-only performance on Linux and Solaris NFS was primarily bound by network bandwidth. For example, I observed over 6400 reads per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC. The sustained data throughput at this rate was a little over 80MB/sec, and with protocol overhead we had a sustained rate of a little over 90MB/sec on the wire, with peaks near 100MB/sec. That means we were running very near network line rate for the 1Gb NIC.
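A little arithmetic confirms those numbers hang together. This sketch uses only the figures quoted above; the average payload per read is derived from them, not separately measured:

```python
# Rough arithmetic on the 7110 cached-read numbers reported above.
reads_per_sec = 6400        # NFS reads/sec observed
data_rate = 80e6            # ~80 MB/s sustained data throughput
wire_rate = 90e6            # ~90 MB/s on the wire, including protocol overhead
line_rate = 1e9 / 8         # raw 1Gb NIC capacity = 125 MB/s

avg_read_bytes = data_rate / reads_per_sec   # average payload per NFS read
wire_fraction = wire_rate / line_rate        # fraction of raw line rate in use

print(round(avg_read_bytes))     # 12500, i.e. ~12.5KB per read on average
print(round(wire_fraction, 2))   # 0.72 sustained, with peaks nearer 0.8
```

In practice, TCP and NFS framing overhead keep achievable throughput on a 1Gb NIC well below the raw 125MB/s, so a sustained 90MB/s with peaks near 100MB/s is close to the practical ceiling.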

This resulted in 2x to 3x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via Fibre Channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set). The read-only results were strong across the board for Linux and Solaris over both NFS and iSCSI.

On the 7210, we saw ~90% linear scaling for 1 to 6 MySQL/sysbench servers over NFS, when configured with a 10Gb NIC on the Unified Storage appliance and 1Gb NICs in each DB server. Analytics showed that CPU and network in the 7210 were both at approximately 50% utilization during this test, and our working set used 36GB out of the available ~60GB of cache, so it is likely that we could have seen similar scaling to 8 or 10 DB servers.

The only tuning we applied at the Unified Storage appliance end was to set the record size of the NFS share to 16KB for the InnoDB tablespace storage. We left the record size at its default of 128KB for the NFS share that contained the InnoDB logs. There was no special network or NFS tuning applied to either the DB server or the appliance.

Strong IOPS for large working sets

If you have an active working set that exceeds the cache capacity in the Unified Storage appliance, your random read performance will eventually be bound by the IOPS rate of the underlying storage. For the Sun Storage 7110 and 7210, which do not have Read Flash Accelerator options like the 7310 and 7410, that means the IOPS rate of the disks. In a MySQL/InnoDB/sysbench test on the 7110, with an aggregate working set of 36GB (180GB aggregate table space), resulting in a 40% cache hit rate in the appliance, I observed roughly 4000 NFS reads per second and roughly 2700 resulting disk IOPS. That translates to 230 IOPS for each of the 12 data drives. In this test we were using only about 15% of the drive capacity, so we were "short stroking" the disks and getting better IOPS rates than we would if we had longer average seeks. For example, at 80% capacity we might see something on the order of 150 to 180 IOPS per drive.
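The relationship between the NFS read rate, the cache hit rate, and the per-drive IOPS in that test works out roughly as follows (all inputs are the figures quoted in the paragraph above):

```python
# IOPS arithmetic for the 36GB working set test on the 7110.
nfs_reads_per_sec = 4000    # NFS reads/sec observed at the appliance
cache_hit_rate = 0.40       # appliance cache hit rate
disk_iops_observed = 2700   # disk IOPS observed via Analytics
data_drives = 12

# Reads that miss the appliance cache must be serviced from disk.
expected_disk_reads = nfs_reads_per_sec * (1 - cache_hit_rate)
iops_per_drive = disk_iops_observed / data_drives

print(round(expected_disk_reads))   # 2400, in the ballpark of the 2700 observed
print(round(iops_per_drive))        # 225, i.e. roughly 230 IOPS per data drive
```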

For additional background on IOPS and other metrics for Sun Storage 7000 storage devices, check out Roch Bourbonnais' blogs on Sun Storage 7000 Performance invariants and Compared Performance of Sun 7000 Unified Storage Array Line and Amitabha Banerjee's blog on Analyzing the Sun Storage 7000.

Excellent MySQL read-write performance on Linux over NFS

I was very pleasantly surprised by MySQL read-write performance on Linux NFS to the entry level Sun Storage 7110. I went into the study expecting good read-write performance to the 7210, but I thought that the lack of Write Flash Accelerators in the 7110 would be an issue for MySQL. Not so! In my testing, I observed over 2300 reads plus 400 writes per second on the 7110, with a single MySQL/InnoDB/sysbench server accessing a 6GB working set (20GB total table space) via NFS over a 1Gb NIC.

This resulted in 1.5x the Sysbench throughput for the MySQL/InnoDB/sysbench server using the Sun Storage 7110, compared to the same server when it was attached via Fibre Channel to a traditional HW RAID array with 2GB memory (providing enough cache for less than 1/3 of our working set).

As with the read-only test, the large read cache in the 7110 compared to the HW RAID array probably played a big role here. The 85% cache hit rate in the appliance provided an advantage on read performance, which offset the potential advantage of the battery backed write cache in the HW RAID array. In addition, the multi-threaded MySQL/InnoDB/sysbench workload appears to have benefited from group commit on writes, since the 7110 started with lower throughput with 1 or 2 active threads, but began to outperform the HW RAID array at 4 threads and higher.

I was surprised enough by the MySQL read-write performance over Linux NFS that I felt compelled to confirm that fsync() calls in Linux were actually resulting in correct NFS data commits. Based on a combination of DTrace Analytics and Wireshark analysis of NFS traffic, I can confirm that the data was correctly going to disk.
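For reference, the pattern being verified is the write-then-fsync sequence InnoDB performs on commit. A minimal probe in Python (the file path here is hypothetical; pointed at a file on an NFS mount, the os.fsync() call is what should produce the NFS COMMIT, or stable WRITE, visible in a packet trace):

```python
import os
import tempfile

# Write one 16KB InnoDB-sized page and force it to stable storage.
# Run against a file on an NFS mount and watch the wire with Wireshark
# to confirm the fsync() turns into an NFS commit, as described above.
path = os.path.join(tempfile.mkdtemp(), "fsync_probe")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    os.write(fd, b"\0" * 16384)
    os.fsync(fd)   # must not return until the data is on stable storage
finally:
    os.close(fd)
print(os.path.getsize(path))   # 16384
```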

Good MySQL read-write performance over iSCSI without write flash accelerators

The testing I have done so far with MySQL over iSCSI has been exclusively on the entry level Sun Storage 7110. Unlike the 7210, 7310, and 7410, this model does not currently include a write flash accelerator option, which would probably help MySQL read-write performance over iSCSI. Even so, we observed good performance over iSCSI, especially at higher thread counts. For example, using the same database and network configurations described earlier, MySQL/InnoDB/Sysbench throughput over iSCSI, at 32 to 64 active threads, was comparable to what was seen on the same server when it was attached via Fibre Channel to a traditional HW RAID array with 2GB memory, for both Solaris and Linux.

Current Issues

MySQL read-write over Solaris NFS

MySQL read-only performance on Solaris NFS is currently excellent, but read-write performance is impacted by the lack of VMODSORT support in the Solaris NFS client (CR 6213799). This affects fsync() performance for large files using a default, buffered NFS mount. The normal workaround for this is to eliminate client side buffering of file data by mounting the file system with the forcedirectio option, or enabling directio on a per-file basis. For example, the MySQL/InnoDB option "innodb_flush_method = O_DIRECT" enables directio on the InnoDB tablespace files. That is likely to work fine on an appliance that includes write flash accelerators like the Sun Storage 7210 and higher, but write flash accelerators are not currently available on the Sun Storage 7110. For the 7110 without write flash accelerators, we did not see a MySQL read-write performance gain by using directio instead of default, buffered NFS file access.
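For reference, the per-file directio approach is enabled with a my.cnf setting like the following (a minimal excerpt, not a complete configuration):

```
# my.cnf excerpt: open InnoDB tablespace files with O_DIRECT,
# bypassing client-side buffering (per-file directio on Solaris NFS)
[mysqld]
innodb_flush_method = O_DIRECT
```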

For Solaris, MySQL read-write performance on ZFS over iSCSI currently exceeds its performance over buffered or directio enabled NFS on the Sun Storage 7110, provided that fast-write-ack is enabled on the iSCSI LUNs.

iSCSI fast-write-ack persistence

A Solaris system running ZFS over iSCSI can realize performance gains by enabling fast-write-ack on the iSCSI LUNs in the storage appliance, because ZFS is known to correctly issue SCSI cache flush commands when fsync() is called (that is not currently known to be true for any Linux file system). The fast-write-ack option can be activated by enabling the "write cache enabled" option in the appliance for a given iSCSI LUN. However, due to CR 6843533 the write cache enabled setting on an iSCSI LUN will be silently disabled any time an iSCSI login occurs for a target, although the Unified Storage BUI and CLI will still indicate that it is enabled. Examples of iSCSI target login triggers include a reboot of the Unified Storage appliance, a reboot of the client, a client "iscsiadm modify target-param -p ..." command to modify negotiated parameters, or an "iscsiadm remove ..." followed by an "iscsiadm add ..." for the affected target. The workaround for CR 6843533 is to manually disable and then re-enable write cache following an iSCSI target login.

Recommendations

General

For an NFS share that will be used to store MySQL tablespace files, match the record size of the share to the block size of the storage engine. For example, this should be 16k for InnoDB tablespace files. This can be configured on a per-share basis by setting the "Database record size" (in the BUI) or "recordsize" (in the CLI) for the share. This must be done before creating the tablespace files in the share.
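From the appliance CLI, the setting looks roughly like this (a sketch only; the project and share names are hypothetical, and the exact prompts and syntax may vary by appliance software release):

```
appliance:> shares select mysql select innodb-data
appliance:shares mysql/innodb-data> set recordsize=16K
appliance:shares mysql/innodb-data> commit
```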

For an NFS share that will store transaction logs or other non-tablespace files, the record size of the share can be left at its default of 128k.

For more information on Sun Storage 7000 Unified Storage network attached storage devices, including lots more performance data, check out these other blogs:

You will also find Sun Storage 7000 information on the new Sun System and Application Performance, Monitoring, and Tuning wiki, which includes performance information for a broad range of technologies.

Linux

We saw our best MySQL read-only and read-write performance on Linux using NFS. Read-only performance over iSCSI approached that of NFS, but read-write over NFS substantially outperformed iSCSI. Based on the read-write result, we recommend using NFS for MySQL on Linux. All of our testing on Linux used NFSv3, which was the default in the version of Linux we tested.

Solaris

For the Sun Storage 7110 without write flash accelerators, we saw our best MySQL read-write performance on Solaris by running ZFS over iSCSI. This avoided the performance impact we would have seen with MySQL over NFS, due to the lack of VMODSORT support in the Solaris NFS client (CR 6213799). The data integrity guarantees of ZFS also allowed us to disable the InnoDB double write buffer, and ZFS cache management semantics allowed us to enable fast-write-ack on the iSCSI LUNs in the storage appliance. Until CR 6213799 has been addressed, we recommend using ZFS over iSCSI for MySQL on Solaris if you do not have write flash accelerators in your appliance. However, be sure to review the iSCSI fast-write-ack persistence discussion in the "Current Issues" section above if you use this feature.

For the Sun Storage 7210 and higher, write flash accelerators should benefit MySQL read-write performance over NFS using either the forcedirectio mount option or the "innodb_flush_method = O_DIRECT" MySQL/InnoDB option, and will also benefit MySQL read-write performance over iSCSI. We did not have an appliance with write flash accelerators available for this test, so at this time we cannot recommend one configuration over the other.

Comments:

What mount options did you use for NFS?

Posted by Justin Swanhart on June 19, 2009 at 01:50 AM PDT #

Hey David,

Nice write-up, good to see you cover some spots I haven't looked at myself.

You might want to update the post, though, about a serious MySQL problem: replication info files (relay-log.info, master.info) aren't sync'd to disk in shipping versions of MySQL. This means if your slave crashes, it could very well be up-to-date - but you wouldn't know it because your *.info files are out-of-date. We had a slave crash and the *.info files were more than *2 months* out of date.

The only sane solution is to patch your MySQL with something like Google's Transactional Replication patch ( http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication ) or Global Transaction IDs patch ( http://code.google.com/p/google-mysql-tools/wiki/GlobalTransactionIds ). We had Percona patch 5.4 with transactional replication for us. Looking forward to using Global Transaction IDs at some point, though.

Once that's done, and assuming you're only using InnoDB, you can take snapshots as often as you'd like, no commands required in MySQL. Pretty much a dream.

Hope that helps!

Don

Posted by Don MacAskill on June 19, 2009 at 02:38 AM PDT #

I should clarify that the problem I describe above isn't unique to NFS, but it is massively amplified by NFS. With local storage, the *.info logs still aren't transactional, but they're usually close to within a few transactions, maybe even one.

With NFS, though, for some reason Linux and MySQL conspire to not even send any writes over the wire at all to those two files. We suspect those files are mmap'd and that's why - so if the NFS mount disappears, the server crashes, whatever - those files' contents are totally lost.

Running a MySQL replica on NFS without a patch is a very very bad idea unless you can just throw the slave away easily. If it dies in a number of different ways, it's totally lost.

Posted by Don MacAskill on June 19, 2009 at 03:08 AM PDT #

@Don -- someone really should answer your question for you. It is in the best interest of the sales team to make that question go away and/or fix the code.

@David -- can I connect one MySQL server to several of these boxes? Assume I need more IOPS and cache than one box can supply.

Posted by Mark Callaghan on June 19, 2009 at 06:45 AM PDT #

Hi Don, Mark,

Yes, this was an issue (and still is in 5.1), however we've fixed this with a set of variables much like sync_binlog in 6.0 (which will be 5.4) - I wrote the initial patch (which was crap for performance :P), and the replication team cleaned it up and took it further:

http://bugs.mysql.com/bug.php?id=31665
http://bugs.mysql.com/bug.php?id=40337

Hopefully you should be seeing this in the Azalea release real soon.

Posted by Mark Leith on June 21, 2009 at 08:47 PM PDT #

@Justin,

I used default NFS mount options for both the Solaris and Linux tests. For Solaris this gave me NFSv4, for Linux it gave me NFSv3. I did not use direct I/O for either case, and had innodb_flush_log_at_trx_commit=1 for all of the runs. There was no special network tuning for any of the tests.

Posted by David Lutz on June 22, 2009 at 04:41 AM PDT #

@Don,

Thanks for pointing out the issue with relay log syncing. I will do some research to see if I can tell what is going on there. I suspect a bug or at least a mis-feature in Linux bdflush, since it appears that there is no guarantee that a dirty page will ever be flushed to disk. You may have encountered a corner case for the bdflush parameters that influence when pages are flushed.

As Mark Leith pointed out, explicit syncing of the relay log files is being added to MySQL, so this issue will go away when those patches are integrated.

Hopefully there is a simple workaround in the meantime.

Posted by David Lutz on June 22, 2009 at 05:00 AM PDT #

@Mark Callaghan,

The short answer is yes, you can connect one MySQL server to multiple Unified Storage appliances. If you are using NFS, you will end up with a minimum of one NFS mount per appliance, so you would have to work out how you would spread your files across multiple mount points. For iSCSI, I can imagine creating software stripes across multiple iSCSI targets, so you could spread the I/O for a single mount point across multiple appliances.

Keep in mind that an active-active Sun Storage 7410C clustered appliance, with a Hybrid Storage Pool and multiple 10GbE network interfaces, provides more IOPS and bandwidth than most single servers could swallow, but that doesn't mean there aren't cases where this might apply.

For examples of current single appliance metrics, check out:

http://blogs.sun.com/roch/entry/compared_performance_of_sun_7000

Posted by David Lutz on June 22, 2009 at 05:34 AM PDT #

@Don,

I've done some testing and didn't reproduce the relay log issue you mentioned. With a 2.6.18-128.el5 Linux kernel, NFSv3, and default bdflush tunings, a write() to a file will be flushed to disk after a short period of time, even without an explicit fsync(). I tested this with relay log files on a replication slave, and also with some simple C code doing writes to a file. In both cases, I could see the NFS transactions in analytics and in network traces, and could see the updates hit the disks.

The lack of explicit fsync() calls for the slave's relay logs is an issue, and it is being addressed, but it doesn't appear to be worse for NFS than for local disks.

I'm not sure how you encountered the problem you described. I will drop you an e-mail to see if you would like to follow up.

Posted by David Lutz on June 26, 2009 at 02:21 AM PDT #

Don,

We saw your point about the .info files at your conference presentation and here. In trying a 7410 we used:

relay-log-info-file =
master-info-file =

to place the files on local disks (though if your mysql servers are diskless obviously you couldn't do that). At this point we are no worse off than with our DAS setup, but in testing slave crash recovery we also noticed InnoDB prints out its known replication information:

InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 371045067, file name master-host-bin.000728

This is the same as what it prints for InnoDB hot backup recovery, and in two tests was a few statements later than the information included in master.info. Manual review of the queries between the two points showed InnoDB was correct both times. In cases with only InnoDB tables, doesn't this effectively solve the issue?

Posted by Ware Adams on July 05, 2009 at 11:55 AM PDT #

The slave SQL thread does the following:
1) commit transaction to storage engine
2) update relay-log.info

A mysqld crash after step 1 and before step 2 means that the slave replays the transaction from step 1. This has been a known problem for a long time, but I am not sure if the MySQL manual highlights the issue. Bugs have been open for years. The Google patches have 2 fixes for this that Percona may have in an easier to use form -- rpl_transaction_enabled (only for innodb) and global group IDs (works for all storage engines, although it can't do anything for engines that don't do crash recovery correctly such as MyISAM).

Posted by Mark Callaghan on July 05, 2009 at 12:01 PM PDT #

Mark,

We hope to use the Google patches via Percona, but we are still using 4.1 for the next couple of months. Transactional replication is the biggest interest in this for us.

Still, given what you said isn't InnoDB printing the result of the latest commit? Can its output during crash recovery be relied upon for updating slave position?

Thanks.

Posted by Ware Adams on July 05, 2009 at 12:10 PM PDT #

I don't know if it can be relied upon. There was a lot of discussion about this (forgot where) from a bug found/filed by Baron of Maatkit/Percona fame.

Posted by Mark Callaghan on July 09, 2009 at 03:44 AM PDT #

I think there is an issue when mixed storage engines are used. If I understand the problem correctly, InnoDB is only aware of replication changes made to InnoDB storage engine backed tables.

The problem happens when:

insert into INNODB (); -- position updated in innodb redo log
insert into MYISAM ();
CRASH!

The second insert would be repeated at slave startup if the information in the redo log is used.

I think the v4 "crash safe replication" addresses this (and any other) problems.

Posted by Justin Swanhart on July 09, 2009 at 04:17 AM PDT #

Yes that is a problem with rpl_transaction_enabled (transactional slave replication from the Google patch). And it most likely shows up during changes to the mysql database (permissions, ...). Because you shouldn't be using MyISAM for anything else if you care about availability (or multicore performance).

global group IDs in the v4 patch fixes that and has features to replace rpl_transaction_enabled

Posted by Mark Callaghan on July 09, 2009 at 04:25 AM PDT #
