Tuesday Apr 21, 2009

MySQL 5.4 Sysbench Scalability on 64-way CMT Servers

As a follow-up to my MySQL 5.4 Scalability on 64-way CMT Servers blog, I'm posting MySQL 5.4 Sysbench results on the same platform. The tests were carried out using the same basic approach (i.e. turning off entire cores at a time) - see my previous blog for more details.

The Sysbench version used was 0.4.8, and the read-only runs were invoked with the following command:

sysbench --max-time=300 --max-requests=0 --test=oltp --oltp-dist-type=special --oltp-table-size=10000000 \
   --oltp-read-only=on --num-threads=[NO_THREADS] run
The "oltp-read-only=on" parameter was omitted for the read-write tests. The my.cnf file listed in my previous blog was also used unchanged for these tests.
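For anyone repeating the sweep, the per-run commands can be generated rather than typed. Here is a minimal Python sketch that only builds the argument lists and does not launch sysbench. It uses only the flags shown above. The thread counts come from the raw data tables in this post. MySQL connection options (user, socket and so on) are assumed to come from sysbench's defaults. The names THREAD_COUNTS and sysbench_argv are illustrative and are not part of Sysbench.

```python
from __future__ import annotations

import shlex

# Thread counts taken from the raw data tables in this post.
THREAD_COUNTS = [1, 8, 16, 32, 48, 64, 80, 96, 128]


def sysbench_argv(num_threads: int, read_only: bool) -> list[str]:
    """Build the argument list for one 300-second sysbench 0.4.8 oltp run.

    Only the flags shown in the post are included; MySQL connection options
    are assumed to come from sysbench's defaults.
    """
    argv = [
        "sysbench",
        "--max-time=300",
        "--max-requests=0",
        "--test=oltp",
        "--oltp-dist-type=special",
        "--oltp-table-size=10000000",
    ]
    if read_only:
        argv.append("--oltp-read-only=on")  # omitted for the read-write runs
    argv += [f"--num-threads={num_threads}", "run"]
    return argv


if __name__ == "__main__":
    for read_only in (True, False):
        for n in THREAD_COUNTS:
            # shlex.join (Python 3.8+) quotes arguments for pasting into a shell.
            print(shlex.join(sysbench_argv(n, read_only)))
```

You can also pass each list to subprocess.run(argv, check=True) to execute the runs one after another.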

Here is the data presented graphically. Note that the number of vCPUs equals the number of active threads up to 64. Runs with more than 64 threads still use only 64 vCPUs.

And here is some raw data:

Read-only

User Threads   Txns/sec   User %   System %   Idle %
1              73.06      69       31         0
8              436.23     84       16         0
16             855.64     83       16         0
32             1674.69    83       17         0
48             2402.12    82       17         1
64             2727.45    70       19         11
80             2524.69    64       19         16
96             2491.10    27       18         56
128            2131.09    22       14         64

Read-write

User Threads   Txns/sec   User %   System %   Idle %
1              41.89      49       30         21
8              269.62     62       15         23
16             486.14     58       13         29
32             867.88     54       12         34
48             1121.87    49       12         39
64             1453.00    48       14         38
80             1509.09    49       15         36
96             1612.11    54       18         29
128            1595.75    52       18         30

A few observations:

  • Throughput increases 63% from 32-way to 64-way for read-only and 67% for read-write. That is not quite as good as the 71% for the OLTP test reported in my earlier blog, but not at all bad.
  • Beyond 48 vCPUs, idle CPU is holding back scaling for the read-only test: idle climbs from 1% at 48 vCPUs to 11% at 64 vCPUs and keeps rising with more threads.
  • There's quite a bit of CPU left on the table for the read-write tests, with idle between 21% and 39% at every thread count.
There's still a lot more work to be done, but we're definitely seeing progress.
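The percentages in the first observation can be recomputed directly from the Txns/sec columns of the two tables. Here is a small Python sketch. The data is copied from the tables above, and the helper name pct_gain is ours.

```python
# Txns/sec by number of user threads, copied from the raw data tables above.
READ_ONLY = {1: 73.06, 8: 436.23, 16: 855.64, 32: 1674.69, 48: 2402.12,
             64: 2727.45, 80: 2524.69, 96: 2491.10, 128: 2131.09}
READ_WRITE = {1: 41.89, 8: 269.62, 16: 486.14, 32: 867.88, 48: 1121.87,
              64: 1453.00, 80: 1509.09, 96: 1612.11, 128: 1595.75}


def pct_gain(before: float, after: float) -> float:
    """Percentage increase going from `before` to `after`."""
    return (after / before - 1.0) * 100.0


for name, txns in (("read-only", READ_ONLY), ("read-write", READ_WRITE)):
    gain = pct_gain(txns[32], txns[64])
    peak = max(txns, key=txns.get)  # thread count with the highest throughput
    print(f"{name}: +{gain:.0f}% from 32 to 64 threads, peak at {peak} threads")
```

It prints +63% and +67%, matching the text. It also shows that read-only throughput peaks at 64 threads, while read-write keeps climbing until 96.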

Allan

MySQL 5.4 Scalability on 64-way CMT Servers

Today Sun Microsystems announced MySQL 5.4, a release that focuses on performance and scalability. For a long time it's been possible to escape the confines of a single system with MySQL, thanks to scale-out technologies like replication and sharding. But it ought to be possible to scale up efficiently as well, fully utilizing the CPU resources of a server with a single instance.

MySQL 5.4 takes a stride in that direction. It features a number of performance and scalability fixes, including the justifiably famous Google SMP patch. And there's plenty more to come in future releases. For specifics about the MySQL 5.4 fixes, check out Mikael Ronstrom's blog.

So how well does MySQL 5.4 scale? To help answer the question I'm going to take a look at some performance data from one of Sun's CMT systems based on the UltraSPARC T2 chip. This chip is an interesting test case for scalability because it amounts to a large SMP server in a single piece of silicon. The UltraSPARC T2 chip features 8 cores, each with 2 integer pipelines, one floating point unit, and 8 hardware threads. So it presents a total of 64 CPUs (hardware threads) to the operating system. Since there are only 8 cores and 16 integer pipelines, you can't actually carry out 64 concurrent operations. But the CMT architecture is designed to hide memory latency: while some threads on a core are blocked waiting for memory, another thread on the same core can be executing. So you would expect the chip to achieve total throughput that's better than 8x or even 16x the throughput of a single thread.

The data in question is based on an OLTP workload derived from an industry standard benchmark. The driver ran on the system under test as well as the database, and I used enough memory to adequately cache the active data and enough disks to ensure there were no I/O bottlenecks. I applied the scaling method we've used for many years in studies of this kind. Unused CPUs were turned off at each data point (this is easily achieved on Solaris with the psradm command). That ensured that "idle" CPUs did not actually participate in the benchmark (e.g. the OS assigns device drivers across the full set of CPUs, so they routinely handle interrupts if you leave them enabled). I turned off entire cores at a time, so an 8 "CPU" result, for example, is based on 8 hardware threads in a single core with the other seven cores turned off, and the 32-way result is based on 4 cores each with 8 hardware threads. This approach allowed me to simulate a chip with fewer than 8 cores. Sun does in fact ship 4- and 6-core UltraSPARC T2 systems, and the 32- and 48-way results reported here should correspond with those configurations. For completeness, I've also included results with only one hardware thread turned on.
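The core-at-a-time method can be scripted. The sketch below prints the psradm command that takes the unwanted hardware threads offline. On Solaris, psradm -f takes processors offline and psradm -n brings them back online. The sketch assumes, without confirmation from this post, that processor IDs 0 to 63 are numbered so that core c owns IDs 8c through 8c+7. Check the real mapping with psrinfo -pv before relying on it. The function names offline_ids and psradm_offline_command are hypothetical.

```python
from __future__ import annotations

THREADS_PER_CORE = 8  # UltraSPARC T2: 8 hardware threads per core
NUM_CORES = 8
TOTAL_VCPUS = NUM_CORES * THREADS_PER_CORE  # 64


def offline_ids(active_vcpus: int) -> list[int]:
    """Processor IDs to take offline so that only IDs 0..active_vcpus-1 stay online.

    ASSUMPTION: core c owns processor IDs c*8 .. c*8+7, so any multiple of 8
    keeps whole cores online. A value of 1 leaves a single hardware thread.
    """
    if active_vcpus != 1 and (active_vcpus % THREADS_PER_CORE != 0
                              or not 0 < active_vcpus <= TOTAL_VCPUS):
        raise ValueError("use 1 or a multiple of 8 between 8 and 64")
    return list(range(active_vcpus, TOTAL_VCPUS))


def psradm_offline_command(active_vcpus: int) -> str | None:
    """Return the psradm command for one data point, or None if nothing to do."""
    ids = offline_ids(active_vcpus)
    if not ids:
        return None
    return "psradm -f " + " ".join(str(i) for i in ids)


if __name__ == "__main__":
    for vcpus in (1, 8, 16, 24, 32, 48, 64):
        print(f"{vcpus:>2} vCPUs: {psradm_offline_command(vcpus)}")
```

Between data points, the same IDs can be brought back with psradm -n. CPU 0 is never taken offline, so at least one processor always stays online.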

Since the driver kit for this workload is not open source, I'm conscious that community members will be unable to reproduce these results. With that in mind, I'll separately blog Sysbench results on the same platform. So why not just show the Sysbench results and be done with it? The SQL used in the Sysbench oltp test is fairly simple (e.g. it's all single-table queries with no joins). The OLTP workload reported here rounds out the picture by demonstrating scalability with a more complex and more varied SQL mix.

It's worth noting that any hardware threads that are turned on will get access to the entire L2 cache of the T2. So, as the number of active hardware threads increases, you might expect that contention for the L2 would impact scalability. In practice we haven't found it to be a problem.

Here is the my.cnf:

[mysqld]
table_open_cache = 4096
innodb_log_buffer_size = 128M
innodb_log_file_size = 1024M
innodb_log_files_in_group = 3
innodb_buffer_pool_size = 6G
innodb_additional_mem_pool_size = 20M
innodb_thread_concurrency = 0
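As a quick sanity check on the memory and log settings, the file can be read with Python's standard configparser. In this sketch, to_bytes is our own helper. It assumes MySQL's K/M/G suffixes mean powers of 1024. The sketch derives the total InnoDB redo log space (innodb_log_file_size times innodb_log_files_in_group) and the buffer pool size.

```python
import configparser

MY_CNF = """\
[mysqld]
table_open_cache = 4096
innodb_log_buffer_size = 128M
innodb_log_file_size = 1024M
innodb_log_files_in_group = 3
innodb_buffer_pool_size = 6G
innodb_additional_mem_pool_size = 20M
innodb_thread_concurrency = 0
"""

_MULTIPLIER = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Convert a size such as '128M' or '6G' (or a plain integer) to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in _MULTIPLIER:
        return int(value[:-1]) * _MULTIPLIER[suffix]
    return int(value)


# allow_no_value=True tolerates bare option names, which other my.cnf files often contain.
cfg = configparser.ConfigParser(allow_no_value=True)
cfg.read_string(MY_CNF)
mysqld = cfg["mysqld"]

redo_bytes = to_bytes(mysqld["innodb_log_file_size"]) * int(mysqld["innodb_log_files_in_group"])
pool_bytes = to_bytes(mysqld["innodb_buffer_pool_size"])

print(f"total redo log: {redo_bytes // 1024 ** 2} MiB")  # 3072 MiB
print(f"buffer pool:    {pool_bytes // 1024 ** 3} GiB")  # 6 GiB
```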

Enough background - on to the data. Here are the results presented graphically.

And here is some raw data:

MySQL 5.4.0

ThroughputvCPUsUsSyIdsmtxicswsycl
1.0016238003714566
4.768772302205936159
9.2416772302655716456
13.4524772303455656606
17.4232762223804776621
24.2948762315574966731
29.7964742426744576736

Abbreviations: "vCPUs" refers to hardware threads; "Us", "Sy", and "Id" are CPU utilization (%) for User, System, and Idle respectively; "smtx" refers to mutex spins, "icsw" to involuntary context switches, and "sycl" to system calls. The data in each of these last three columns is normalized per transaction.
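The post doesn't spell out how the per-transaction figures are derived. One plausible method, sketched below under that assumption, starts from a Solaris mpstat interval, where smtx, icsw and syscl (presumably what the table labels "sycl") are events per second for each CPU. Sum those per-CPU rates, then divide by transactions per second over the same interval. The sample mpstat lines and the throughput value are invented for illustration, and per_txn_metrics is a hypothetical helper.

```python
from __future__ import annotations

# Hypothetical mpstat interval for two CPUs (numbers invented for illustration).
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    5   0   10   300  100  900  150   20  400    0  6000   75  23   0   2
  1    4   0   12   280   90  880  140   18  380    0  5800   76  22   0   2
"""


def per_txn_metrics(mpstat_interval: str, txns_per_sec: float,
                    columns: tuple[str, ...] = ("smtx", "icsw", "syscl")) -> dict[str, float]:
    """Sum the per-CPU rate of each column and divide by transaction throughput."""
    rows = [line.split() for line in mpstat_interval.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    result = {}
    for name in columns:
        col = header.index(name)  # locate the column by name, not position
        total_per_sec = sum(float(row[col]) for row in data)
        result[name] = total_per_sec / txns_per_sec
    return result


print(per_txn_metrics(SAMPLE, txns_per_sec=10.0))
```

With the invented sample this gives 78 mutex spins, 29 involuntary context switches and 1180 system calls per transaction.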

Some interesting takeaways:

  • The throughput increases 71% from 32 to 64 vCPUs. It's encouraging to see a significant increase in throughput beyond 32 vCPUs.
  • The scaleup from 1 to 64 vCPUs is almost 30x. As I noted earlier, the UltraSPARC T2 chip does not have 64 cores - it only has 8 cores and 16 integer pipelines, so this is a good outcome.
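Both takeaways follow from the relative Throughput column, where the 1-vCPU run is 1.00. The sketch below recomputes them. It also divides the 64-vCPU result by the chip's 8 cores and 16 integer pipelines, which shows why a 30x scaleup on 8 cores is a good outcome. The per-core and per-pipeline figures are our own illustration, not metrics used in the post.

```python
# Relative throughput (1 vCPU = 1.00) by active vCPUs, from the MySQL 5.4.0 table above.
RELATIVE = {1: 1.00, 8: 4.76, 16: 9.24, 24: 13.45, 32: 17.42, 48: 24.29, 64: 29.79}

CORES = 8
INTEGER_PIPELINES = 16  # 2 per core on the UltraSPARC T2

gain_32_to_64 = (RELATIVE[64] / RELATIVE[32] - 1.0) * 100.0
print(f"32 -> 64 vCPUs: +{gain_32_to_64:.0f}%")                                  # +71%
print(f"1 -> 64 vCPUs: {RELATIVE[64]:.1f}x")                                      # 29.8x
print(f"per core: {RELATIVE[64] / CORES:.2f}x a single thread")                    # 3.72x
print(f"per pipeline: {RELATIVE[64] / INTEGER_PIPELINES:.2f}x a single thread")    # 1.86x
```

Each integer pipeline delivers well over one thread's worth of throughput. This is the latency hiding the CMT design aims for: while one thread waits on memory, another thread issues instructions.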

To put these results into perspective, there are still many high-volume, high-concurrency environments that will require replication and sharding to scale acceptably. And while MySQL 5.4 scales better than previous releases, we're not quite seeing scalability equivalent to that of proprietary databases. It goes without saying, of course, that we'll be working hard to further improve scalability and performance in the weeks and months to come.

But the clear message today is that MySQL 5.4 single-server scalability is now good enough for many commercial transaction-based application environments. If your organization hasn't yet taken the plunge with MySQL, now is definitely the time. And you certainly won't find yourself complaining about the price-performance.

Allan

Thursday Oct 11, 2007

CMT Comes Of Age

Sun engineers give the inside scoop on the new UltraSPARC T2 systems

[ Update Jan 2008: Sun SPARC Enterprise T5120 and T5220 servers were awarded Product of the Year 2007. ]

Sun launched the Chip-Level MultiThreading (CMT) era back in December 2005 with the release of the highly successful UltraSPARC T1 (Niagara) chip, featured in the Sun Fire T2000 and T1000 systems. With 8 cores, each with 4 hardware strands (or threads), these systems presented 32 CPUs and delivered an unprecedented amount of processing power in compact, eco-friendly packaging. The systems were referred to as CoolThreads servers because of their low power and cooling requirements.

Today Sun introduces the second generation of Niagara systems: the Sun SPARC Enterprise T5120 and T5220 servers and the Sun Blade T6320. With 8 cores of 8 hardware strands each, plus networking, PCI, and cryptographic capabilities, all packed into a single chip, these new 64-CPU systems raise the bar even higher.

The new systems can probably be best described by some of the engineers who have developed them, tested them, and pushed them to their limits. Their blogs will be cross-referenced here, so if you're interested in learning more, come back from time to time. New blogs should appear in the next 24 hours, and more over the next few weeks.

Here's what the engineers have to say.

  • UltraSPARC T2 Server Technology. Dwayne Lee gives us a quick overview of the new systems. Denis Sheahan blogs about UltraSPARC T2 floating point performance, offers a detailed T5120 and T5220 system overview, and shares insights into lessons learned from the UltraSPARC T1. Josh Simons offers us a glimpse under the covers. Stephan Hoerold gives us an illustration of the UltraSPARC T2 chip. Paul Sandhu gives us some insight into the MMU and shared contexts. Tim Bray blogs about the interesting challenges posed by a many-core future. Darryl Gove talks about T2 threads and cores. Tim Cook compares the UltraSPARC T2 to other recent SPARC processors. Phil Harman tests memory throughput on an UltraSPARC T2 system. Ariel Hendel, musing on CMT and evolution, evidences a philosophical bent.
  • Performance. The inimitable bmseer gives us a bunch of good news about benchmark performance on the new systems - no shortage of world records, apparently! Peter Yakutis offers detailed PCI-E I/O performance data. Ganesh Ramamurthy muses on the implications of UltraSPARC T2 servers from the perspective of a senior performance engineering director.
  • System Management. Find out about Integrated Lights Out Manager (ILOM) from Tushar Katarki's blog.
  • Networking. Alan Chiu gives us some insights into 10 Gigabit Ethernet performance and tuning on the UltraSPARC T2 systems.
  • RAS. Richard Elling carries out a performability analysis of the T5120 and T5220 servers.
  • Clusters. Ashutosh Tripathi discusses Solaris Cluster support in LDoms I/O domains.
  • Virtualization. Learn about Logical Domains (LDoms) and the release of LDoms 1.0.1 from Honglin Su. Eric Sharakan has some more to say about LDoms and the UltraSPARC T2. Ashley Saulsbury presents a flash demo of 64 Logical Domains booting on an UltraSPARC T2 system. Find out why Sun xVM and Niagara 2 are the ultimate virtualization combo from Marc Hamilton.
  • Security Performance. Ning Sun discusses Cryptography Acceleration on T2 systems. Glenn Brunette offers us a Security Geek's point of view on the T5x20 systems. Lawrence Spracklen has several posts on UltraSPARC T2 cryptographic acceleration. Martin Mueller proposes an UltraSPARC T2 system deployment designed to deliver a high performance, high security environment.
  • Application Performance. Dileep Kumar talks about WebSphere Application Server performance with UltraSPARC T2 systems. Tim Bray shares some hands-on experiences testing a T5120.
  • Java Performance. Dave Dagastine offers us insights into the HotSpot JVM on the T2 and Java performance on the new T2 servers.
  • Web Applications. Murthy Chintalapati talks about web server performance. Constantin Gonzalez explores the implications of UltraSPARC T2 for Web 2.0 workloads. Shanti Subramanyam tells us that Cool Stack applications (including the AMP stack packages) are pre-loaded on all UltraSPARC T2-based servers.
  • Open Source Community. Barton George explores the implications of UltraSPARC T2 servers for the Ubuntu and Open Source community.
  • Open Source Databases. Luojia Chen discusses MySQL tuning for Niagara servers.
  • Customer Use Cases. Stephan Hoerold gives us some insight into experiences of Early Access customers. Stephan also shares what happened when STRATO put a T5120 to the test. It seems like STRATO also did some experimentation with the system.
  • Sizing. I've posted an entry on Sizing UltraSPARC T2 Servers.
  • Solaris features. Scott Davenport blogs on Predictive Self-Healing on the T5120. Steve Sistare gives us a lot of insight into features in Solaris to optimize the UltraSPARC T2 platforms. Walter Bays salutes the folks who reliably deliver consistent interfaces on the new systems.
  • HPC & Compilers. Darryl Gove talks about compiler flags for T2 systems. Josh Simons talks about the relevance of the new servers to HPC applications. Ruud van der Pas measures T2 server performance with a suite of single-threaded technical-scientific applications. In another blog entry, Darryl Gove introduces us to performance counters on the T1 and T2.
  • Tools. Darryl Gove points to the location of free pre-installed developer tools on UltraSPARC T2 systems. Nicolai Kosche describes the hardware features added to UltraSPARC T2 to improve the DProfile Architecture in Sun Studio 12 Performance Analyzer. Ravindra Talashikar brings us Corestat for UltraSPARC T2, a tool that measures core utilization to help users better understand processor utilization on UltraSPARC T2 systems.

Finally

Go check out the new UltraSPARC T2 systems, and save energy and rack space in the process.

Enjoy!

About

I'm a Principal Engineer in the Performance Technologies group at Sun. My current role is team lead for the MySQL Performance & Scalability Project.
