Saturday Mar 26, 2011

Life at Oracle

It's now been 12 months since Oracle absorbed Sun. A lot has happened in that time, and the good news is that Sun's technology has a new lease on life. Reviewing my blog entries over the last 2 or 3 years, most of them were focussed on MySQL and T-series systems. Lots of good things are happening with both product lines.

As for me, I've been working on SPARC Supercluster. More on this topic later... ;-)

Tuesday Apr 21, 2009

MySQL 5.4 Scaling on Nehalem with Sysbench

As a final followup to my MySQL 5.4 Scalability on 64-way CMT Servers blog, I'm posting MySQL 5.4 Sysbench results on a Sun Fire X4270 platform using the Intel x86 Nehalem chip (2 sockets/8 cores/16 threads). All CPUs were turned on during the runs. The my.cnf was the same as described in the previous blog.

The Sysbench version used was 0.4.12, and the read-only runs were invoked with the following command:

sysbench --max-time=300 --max-requests=0 --test=oltp --oltp-dist-type=special --oltp-table-size=10000000 \
   --oltp-read-only=on --num-threads=[NO_THREADS] run

The "oltp-read-only=on" parameter was omitted for the read-write tests. The my.cnf file listed in my previous blog was also used unchanged for these tests.

Here are the results graphically running on Linux.

The read-only results for MySQL 5.4 show a higher peak throughput than MySQL 5.1, but throughput tails off beyond 32 threads. The drop-off is clearly an issue that needs to be tracked down and resolved. By contrast, the read-write results show better throughput across the board for MySQL 5.4 compared to MySQL 5.1, and significantly better scalability.

Here are the MySQL 5.4 results graphically on Solaris (Sysbench 0.4.8). Note that in this case unused CPUs were turned off. The number of vCPUs is the same as the number of active threads up to 16. Active threads beyond 16 are using only 16 vCPUs.

Solaris shows better throughput than Linux for the read-only tests and good scalability across the range of active threads.

The Sun Fire X4270 platform with its Nehalem CPUs offers a clear demonstration of the improved scalability of MySQL 5.4. The results suggest that users can expect to make effective use of the available CPU resources when running MySQL 5.4 instances on this platform.

Allan

MySQL 5.4 Sysbench Scalability on 64-way CMT Servers

As a followup to my MySQL 5.4 Scalability on 64-way CMT Servers blog, I'm posting MySQL 5.4 Sysbench results on the same platform. The tests were carried out using the same basic approach (i.e. turning off entire cores at a time) - see my previous blog for more details.

The Sysbench version used was 0.4.8, and the read-only runs were invoked with the following command:

sysbench --max-time=300 --max-requests=0 --test=oltp --oltp-dist-type=special --oltp-table-size=10000000 \
   --oltp-read-only=on --num-threads=[NO_THREADS] run
The "oltp-read-only=on" parameter was omitted for the read-write tests. The my.cnf file listed in my previous blog was also used unchanged for these tests.

Here is the data presented graphically. Note that the number of vCPUs is the same as the number of active threads up to 64. Active threads beyond 64 are using only 64 vCPUs.

And here is some raw data:

Read-only

User Threads    Txns/sec    User    System    Idle
   1               73.06      69        31       0
   8              436.23      84        16       0
  16              855.64      83        16       0
  32             1674.69      83        17       0
  48             2402.12      82        17       1
  64             2727.45      70        19      11
  80             2524.69      64        19      16
  96             2491.10      27        18      56
 128             2131.09      22        14      64

Read-write

User Threads    Txns/sec    User    System    Idle
   1               41.89      49        30      21
   8              269.62      62        15      23
  16              486.14      58        13      29
  32              867.88      54        12      34
  48             1121.87      49        12      39
  64             1453.00      48        14      38
  80             1509.09      49        15      36
  96             1612.11      54        18      29
 128             1595.75      52        18      30

A few observations:

  • Throughput scales 63% from 32-way to 64-way for read-only and 67% for read-write. Not quite as good as for the OLTP test reported in my earlier blog, but not at all bad.
  • Beyond 48 vCPUs, idle CPU is preventing optimal scaling for the read-only test.
  • There's quite a bit of CPU left on the table for the read-write tests.

There's still a lot more work to be done, but we're definitely seeing progress.

Allan

MySQL 5.4 Scalability on 64-way CMT Servers

Today Sun Microsystems announced MySQL 5.4, a release that focuses on performance and scalability. For a long time it's been possible to escape the confines of a single system with MySQL, thanks to scale-out technologies like replication and sharding. But it ought to be possible to scale-up efficiently as well - to fully utilize the CPU resource on a server with a single instance.

MySQL 5.4 takes a stride in that direction. It features a number of performance and scalability fixes, including the justifiably-famous Google SMP patch along with a range of other fixes. And there's plenty more to come in future releases. For specifics about the MySQL 5.4 fixes, check out Mikael Ronstrom's blog.

So how well does MySQL 5.4 scale? To help answer the question I'm going to take a look at some performance data from one of Sun's CMT systems based on the UltraSPARC T2 chip. This chip is an interesting test case for scalability because it amounts to a large SMP server in a single piece of silicon. The UltraSPARC T2 chip features 8 cores, each with 2 integer pipelines, one floating point unit, and 8 hardware threads. So it presents a total of 64 CPUs (hardware threads) to the Operating System. Since there are only 8 cores and 16 integer pipelines, you can't actually carry out 64 concurrent operations. But the CMT architecture is designed to hide memory latency - while some threads on a core are blocked waiting for memory, another thread on the same core can be executing - so you expect the chip to achieve total throughput that's better than 8x or even 16x the throughput of a single thread.

The data in question is based on an OLTP workload derived from an industry standard benchmark. The driver ran on the system under test as well as the database, and I used enough memory to adequately cache the active data and enough disks to ensure there were no I/O bottlenecks. I applied the scaling method we've used for many years in studies of this kind. Unused CPUs were turned off at each data point (this is easily achieved on Solaris with the psradm command). That ensured that "idle" CPUs did not actually participate in the benchmark (e.g. the OS assigns device drivers across the full set of CPUs, so they routinely handle interrupts if you leave them enabled). I turned off entire cores at a time, so an 8 "CPU" result, for example, is based on 8 hardware threads in a single core with the other seven cores turned off, and the 32-way result is based on 4 cores each with 8 hardware threads. This approach allowed me to simulate a chip with less than 8 cores. Sun does in fact ship 4- and 6-core UltraSPARC T2 systems; the 32- and 48-way results reported here should correspond with those configurations. For completeness, I've also included results with only one hardware thread turned on.
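
If you'd like to reproduce this kind of core-at-a-time scaling study on your own UltraSPARC T2 system, here's a minimal sketch using psradm and psrinfo. It assumes the usual CMT numbering, where CPU IDs 0-7 belong to the first core, 8-15 to the second, and so on:

# take the last core's 8 hardware threads offline; repeat for further cores
# to simulate smaller configurations
psradm -f 56 57 58 59 60 61 62 63

# confirm how many vCPUs remain online
psrinfo | grep -c on-line

# bring the threads back online once the runs are complete
psradm -n 56 57 58 59 60 61 62 63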

Since the driver kit for this workload is not open source, I'm conscious that community members will be unable to reproduce these results. With that in mind, I'll separately blog Sysbench results on the same platform. So why not just show the Sysbench results and be done with it? The SQL used in the Sysbench oltp test is fairly simple (e.g. it's all single table queries with no joins); this test rounds out the picture by demonstrating scalability with a more complex and more varied SQL mix.

It's worth noting that any hardware threads that are turned on will get access to the entire L2 cache of the T2. So, as the number of active hardware threads increases, you might expect that contention for the L2 would impact scalability. In practice we haven't found it to be a problem.

Here is the my.cnf:

[mysqld]
table_open_cache = 4096
innodb_log_buffer_size = 128M
innodb_log_file_size = 1024M
innodb_log_files_in_group = 3
innodb_buffer_pool_size = 6G
innodb_additional_mem_pool_size = 20M
innodb_thread_concurrency = 0

Enough background - on to the data. Here are the results presented graphically.

And here is some raw data:

MySQL 5.4.0

Throughput    vCPUs    Us    Sy    Id    smtx    icsw    sycl
  1.00            1    62    38     0       0     371    4566
  4.76            8    77    23     0     220     593    6159
  9.24           16    77    23     0     265     571    6456
 13.45           24    77    23     0     345     565    6606
 17.42           32    76    22     2     380     477    6621
 24.29           48    76    23     1     557     496    6731
 29.79           64    74    24     2     674     457    6736

Abbreviations: "vCPUs" refers to hardware threads; "Us", "Sy", and "Id" are the percentages of User, System, and Idle CPU time respectively; "smtx" refers to mutex spins, "icsw" to involuntary context switches, and "sycl" to system calls. The data in each of these last three columns is normalized per transaction.
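
For anyone repeating this exercise, figures of this kind are typically gathered with the standard Solaris tools during the measurement interval - a sketch (normalizing per transaction is then just a matter of dividing by the transaction count over the same interval):

# per-vCPU detail every 10 seconds; the smtx, icsw, and syscl columns
# correspond to the mutex spins, involuntary context switches, and
# system calls discussed above
mpstat 10

# overall user/system/idle breakdown over the same interval
vmstat 10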

Some interesting takeaways:

  • The throughput increases 71% from 32 to 64 vCPUs. It's encouraging to see a significant increase in throughput beyond 32 vCPUs.
  • The scaleup from 1 to 64 vCPUs is almost 30x. As I noted earlier, the UltraSPARC T2 chip does not have 64 cores - it only has 8 cores and 16 integer pipelines, so this is a good outcome.

To put these results into perspective, there are still many high-volume, high-concurrency environments that will require replication and sharding to scale acceptably. And while MySQL 5.4 scales better than previous releases, we're not quite seeing scalability equivalent to that of proprietary databases. It goes without saying, of course, that we'll be working hard to further improve scalability and performance in the weeks and months to come.

But the clear message today is that MySQL 5.4 single-server scalability is now good enough for many commercial transaction-based application environments. If your organization hasn't yet taken the plunge with MySQL, now is definitely the time. And you certainly won't find yourself complaining about the price-performance.

Allan

Wednesday Nov 05, 2008

MySQL Performance Optimizations

You might be wondering what's been happening with MySQL performance since Sun arrived on the scene. The good news is that we haven't been idle. There's been general recognition that MySQL could benefit from some performance and scalability enhancements, and Sun assembled a cross-organizational team immediately after the acquisition to get started on it. We've enjoyed excellent cooperation between the engineers from both organizations.

This kind of effort is not new for Sun - we've been working with proprietary database companies on performance for years, with source code for each of the major databases on site to help the process. In this case, the fact that the MySQL engineers are working for the same company certainly simplifies a lot of things.

If you'd like to get some insight into what's been happening, a video has just been posted online that features brief interviews with some of the engineers involved in various aspects of the exercise.

There are also screencasts available that offer a brief Introduction to MySQL Deployments, some basic pointers on Tuning and Monitoring MySQL, and a bit of a look at Future Directions for MySQL Performance. Other videos that might be of interest are also available at http://www.sun.com/software/.

The obvious question is when customers are going to see performance improvements in shipping releases. Actually, it's already happened. One example is a bug fix that made it into MySQL 5.1.28 and 6.0.7. The fix makes a noticeable difference to peak throughput and scalability with InnoDB. And that's only the beginning. You can expect to see a lot more in the not-too-distant future. Watch this space!

Monday Oct 13, 2008

Sun's 4-chip CMT system raises the bar

Find out about Sun's new 4-chip UltraSPARC T2 Plus system direct from the source: Sun's engineers.

Sun today announced the 4-chip variant of its UltraSPARC T2 Plus system, the Sun SPARC Enterprise T5440. This new system is the big brother of the 2-chip Sun SPARC Enterprise T5140 and T5240 systems released in April 2008. Each UltraSPARC T2 Plus chip offers 8 hardware strands in each of 8 cores. With up to four UltraSPARC T2 Plus chips delivering a total of 32 cores and 256 hardware threads and up to 512Gbytes of memory in a compact 4U package, the T5440 raises the bar for server performance, price-performance, energy efficiency, and compactness. And with Logical Domains (LDoms) and Solaris Containers, the potential for server consolidation is compelling.

Standard configurations of the Sun SPARC Enterprise T5440 include 2- and 4-chip systems at 1.2 GHz, and a 4-chip system at 1.4 GHz. All of these configurations come with 8 cores per chip.

The blogs posted today by various Sun engineers offer a broad perspective on the new system. The system design, the various hardware subsystems, the performance characteristics, the application experiences - it's all here! And if you'd like some background on how we arrived at this point, check out the earlier UltraSPARC T2 blogs (CMT Comes of Age) and the first release of the UltraSPARC T2 Plus (Sun's CMT goes multi-chip).

Let's see what the engineers have to say (and more will be posted throughout the day):

For more information on the new Sun SPARC Enterprise T5440 server, check out this web page.

Sunday Oct 12, 2008

Sizing a Sun SPARC Enterprise T5440 Server

Today Sun released the Sun SPARC Enterprise T5440 server, a wild beast caged in a tiny 4U package. Putting it into perspective, this system has roughly the same performance potential as four Enterprise 10000 (Starfire) systems. Compared to the T5440, the floor space, energy consumption, and cooling required by the four older Starfire systems don't bear thinking about, either.

In more modern terms, the T5440 will handily outperform two Sun Fire E2900 systems with 12 dual-core UltraSPARC IV+ chips. Not bad for just four UltraSPARC T2 Plus chips. And when you add in up to 512Gbytes of memory and plenty of I/O connectivity, that's a lot of system.

Using up all that horsepower gets to be an interesting challenge. There are some applications that can consume the entire system, as demonstrated by some of the benchmarks published today. But for the most part, end users will need to find other ways of deploying the considerable resources delivered by the system. Let's take a brief look at some of the issues to consider.

The first important factor is that the available resource is delivered in the form of many hardware threads. A four-chip T5440 delivers 32 cores with a whopping 256 hardware threads, and the Operating System in turn sees 256 CPUs. Each "CPU" has lower single-thread performance than many other current CPU chip implementations. But the capacity to get work done is enormous. For a simple analogy, consider water. One drop of water can't do a lot of damage. But that same drop together with a bunch of its friends carved out the Grand Canyon. (The UltraSPARC T1 chip was not codenamed "Niagara" for nothing!)

For applications that are multi-threaded, multi-process, or multi-user, you can spread the work across the available threads/CPUs. But don't expect an application to show off the strengths of this platform if it has heavy single-threaded dependencies. This is true for all systems based on the UltraSPARC T1, T2, and T2 Plus chips.

The good news is that people are starting to understand this message. When Sun first released the UltraSPARC T1 chip back in December 2005, it was a bit of a shock to many people. The Sun Fire T1000 and T2000 systems were the first wave of a new trend in CPU design, and some took a while to get their heads around the implications. Now, with Intel, AMD, and others jumping on the bandwagon, the stakes have become higher. And the rewards will flow quickest to those application developers who had the foresight to get their act together earliest.

The second factor is virtualization. Once again, people today understand the benefits of consolidating a larger number of smaller systems onto a fewer number of larger systems using virtualization technologies. The T5440 is an ideal platform for such consolidations. With its Logical Domain (LDom) capabilities and with Solaris Containers, there are many effective ways to carve the system up into smaller logical pieces.
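
As an illustration of the Containers side of that story, here's a minimal sketch of carving out a zone with a dedicated set of hardware threads; the zone name, path, and CPU count are hypothetical, and LDoms would be configured separately with the ldm command:

# define a zone with 64 dedicated CPUs (hardware threads)
zonecfg -z appzone <<'EOF'
create
set zonepath=/zones/appzone
add dedicated-cpu
set ncpus=64
end
commit
EOF

# install and boot the zone
zoneadm -z appzone install
zoneadm -z appzone boot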

And then there's the even simpler strategy of just throwing a bunch of different applications onto the system and letting Solaris handle the resource management. Solaris actually does an excellent job of managing such complex environments.

Summing it up, the T5440 is made to handle large workloads. As long as you don't expect great throughput if you're running a single thread, you should find it has a lot to offer in a small package.

Wednesday Apr 16, 2008

DTrace with MySQL 6.0.5 - on a Mac

For the first time, MySQL includes DTrace probes in the 6.0 release. On platforms that support DTrace you can still find out a lot about what's happening, both in the Operating System kernel and in user processes, even without probes in the application. But carefully placed DTrace probes inserted into the application code can give you a lot more information about what's going on, because they can be mapped to the application functionality. So far only a few probes have been included, but expect more to be added soon.

I decided to take the new probes for a spin. Oh, and rather than do it on a Solaris system, I figured I'd give it a shot on my Intel Core 2 Duo MacBook Pro, since Mac OS X 10.5 (Leopard) supports DTrace.

To begin with I pulled down and built MySQL 6.0.5 from BitKeeper, thanks to some help from Brian Aker. Here's where, and here's how. I needed to make a couple of minor changes to the dtrace commands in the Makefiles to get it to compile - hopefully that will be fixed pretty soon in the source. Before long I had mysqld running. A quick "dtrace -l | grep mysql" confirmed that I did indeed have the DTrace probes.

There are lots of interesting possibilities, but for now I'll just run a couple of simple scripts to illustrate the basics.

First, I ran this D script:

#!/usr/sbin/dtrace -s

:mysqld::
{
	printf("%d\\n", timestamp);
}

Running "select count(\*) from latest" from mysql in another window yielded the following:

dtrace: script './all.d' matched 16 probes
CPU     ID                    FUNCTION:NAME
  0  18665 _ZN7handler16ha_external_lockEP3THDi:external_lock 26152931955098

  0  18673 _Z13handle_selectP3THDP6st_lexP13select_resultm:select_start 26152931997414

  0  18665 _ZN7handler16ha_external_lockEP3THDi:external_lock 26153878060162

  0  18672 _Z13handle_selectP3THDP6st_lexP13select_resultm:select_finish 26153878082583

So the count(*) took out a couple of locks, and we got to see the start and end of the select with a timestamp (in nanoseconds). Just for interest, I ran the count(*) a second time, and this time none of the probes fired - the query was being satisfied from the query cache.

Next I decided to try "show indexes from latest;". The result was as follows:

mysql> show indexes from latest;
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table  | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_Comment |
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| latest |          0 | latest_x |            1 | hostid      | A         |      356309 |     NULL | NULL   | YES  | BTREE      |         |               | 
| latest |          0 | latest_x |            2 | exdate      | A         |      356309 |     NULL | NULL   | YES  | BTREE      |         |               | 
| latest |          0 | latest_x |            3 | extime      | A         |      356309 |     NULL | NULL   | YES  | BTREE      |         |               | 
+--------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.15 sec)

Here's the result from DTrace:

  0  18673 _Z13handle_selectP3THDP6st_lexP13select_resultm:select_start 27114155991557

  0  18670 _ZN7handler12ha_write_rowEPh:insert_row_start 27114308499688

  0  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 27114308553968

  0  18670 _ZN7handler12ha_write_rowEPh:insert_row_start 27114308565086

  0  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 27114308588605

  0  18670 _ZN7handler12ha_write_rowEPh:insert_row_start 27114308598685

  0  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 27114308622164

  0  18672 _Z13handle_selectP3THDP6st_lexP13select_resultm:select_finish 27114308714705

So three rows were inserted, presumably into a temporary table, corresponding to the three index columns. DTrace shows that the query cache isn't used when you rerun this particular query.

Next I ran the following D script:

#!/usr/sbin/dtrace -s

:mysqld::*_start
{
	self->ts = timestamp;
}

:mysqld::*_finish
/self->ts/
{
	printf("%d\\n", timestamp - self->ts);
}

and reran the "show indexes" command. Here's the result:

CPU     ID                    FUNCTION:NAME
  1  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 61634

  1  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 22040

  1  18669 _ZN7handler12ha_write_rowEPh:insert_row_finish 21058

  1  18672 _Z13handle_selectP3THDP6st_lexP13select_resultm:select_finish 88933

This time we see just one line for each insert and select, including the time taken to complete the operation (in nanoseconds) rather than a timestamp.

Even in the more compact form, this would get a bit verbose for some operations, though, so I ran the following D script:

#!/usr/sbin/dtrace -s

:mysqld::*_start
{
	self->ts = timestamp;
}

:mysqld::*_finish
/self->ts/
{
	@completion_time[probename] = quantize(timestamp - self->ts);
}

After repeating the "show indexes", DTrace returned the following after a Control-C:

dtrace: script './all3.d' matched 12 probes
^C

  insert_row_finish                                 
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2        
           32768 |@@@@@@@@@@@@@                            1        
           65536 |                                         0        

  select_finish                                     
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
          131072 |                                         0        

This actually gives more info, but in a fairly compact form. Thanks to the quantize() function, we see a histogram for each operation showing the distribution of completion times: two of the inserts fell into the 16384 nanosecond bucket, and one into the 32768 nanosecond bucket.

Finally, I ran a "show tables;", which returned 48 rows, and the following from DTrace:

dtrace: script './all3.d' matched 12 probes
^C

  select_finish                                     
           value  ------------- Distribution ------------- count    
           32768 |                                         0        
           65536 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
          131072 |                                         0        

  insert_row_finish                                 
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@@@                             14       
           16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@             33       
           32768 |@                                        1        
           65536 |                                         0        

The report shows the 48 inserts. It would have been a lot more difficult to read with the first couple of D scripts, but the quantize() function offers a nice summary for no effort.
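
And if all you want is a quick count of how often each probe fires, a dtrace one-liner with an aggregation does the job - a sketch using the same mysqld probes (it runs for ten seconds and prints the totals on exit):

# count probe firings in the mysqld module over a 10 second window
sudo dtrace -n ':mysqld:: { @fires[probename] = count(); } tick-10s { exit(0); }'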

This is barely scratching the surface of what DTrace can do, of course. But hopefully I've whetted your appetite. Why don't you check it out for yourself?

Tuesday Apr 08, 2008

Sun's CMT goes multi-chip

Sun engineers blog on the new multi-chip UltraSPARC T2 Plus systems

Today Sun is announcing new CMT-based systems, hard on the heels of the UltraSPARC T2 systems launched in October 2007 (the Sun SPARC Enterprise T5120 and T5220 systems). Whereas previous Sun CMT systems were based around a single-socket UltraSPARC T1 or T2 processor, the new systems incorporate two processors, doubling the number of cores and the number of hardware threads compared to UltraSPARC T2-based systems. Each UltraSPARC T2 Plus chip includes 8 hardware strands in each of 8 cores, so the Operating System sees a total of 128 CPUs. The new systems deliver an unprecedented amount of CPU capacity in a package this size, as evidenced by the very impressive benchmark results published today.

Systems come in both 1U and 2U packaging: the 1U Sun SPARC Enterprise T5140 ships with two UltraSPARC T2 Plus chips, each with 4, 6, or 8 cores at 1.2 GHz, and the 2U Sun SPARC Enterprise T5240 ships with two UltraSPARC T2 Plus chips, each with 6 or 8 cores at 1.2 GHz, or 8 cores at 1.4 GHz. For more information about the systems, a whitepaper is available which provides details on the processors and systems.

Once again, some of the engineers who have worked on these new systems have shared their experiences and insights in a series of wide-ranging blogs (for engineers' perspectives on the UltraSPARC T2 systems, check out the CMT Comes of Age blog). These blogs will be cross referenced here as they are posted. You should expect to see more appear in the next day or two, so plan on visiting again later to see what's new.

Here's what the engineers have to say:

  • UltraSPARC T2 Plus Server Technology. Tim Cook offers insights into what drove processor design toward CMT. Marc Hamilton serves up a brief overview of CMT for those less familiar with the technology. Dwayne Lee touches on the UltraSPARC T2 Plus chip. Josh Simons offers us a look under the hood of the new servers. Denis Sheahan provides an overview of the hardware components of the UltraSPARC T2 Plus systems, then follows it up with details of the memory and coherency of the UltraSPARC T2 Plus processor. Lawrence Spracklen introduces the crypto acceleration on the chip. Richard Elling talks about RAS (Reliability, Availability, and Serviceability) in the systems, and Scott Davenport describes their predictive self-healing features.
  • Virtualization. Honglin Su announces the availability of the Logical Domains 1.0.2 release, which supports the UltraSPARC T2 Plus platforms. Eric Sharakan offers further observations on LDoms on T5140 and T5240. Ning Sun discusses a study designed to show how LDoms with CMT can improve scalability and system utilization, and points to a Blueprint on the issue.
  • Solaris Features. Steve Sistare outlines some of the changes made to Solaris to support scaling on large CMT systems.
  • System Performance. Peter Yakutis offers insights into PCI-Express performance. Brian Whitney shares some Stream benchmark results. Alan Chiu explains 10 Gbit Ethernet performance on the new systems. Charles Suresh gives some fascinating background into how line speed was achieved on the 10 Gbit Ethernet NICs.
  • Application Performance. What happens when you run Batch Workloads on a Sun CMT server? Giri Mandalika's blog shares experiences running the Oracle E-Business Suite Payroll 11i workload. You might also find Satish Vanga's blog interesting - it focusses on SugarCRM running on MySQL (on a single-socket T5220). Josh Simons reveals the credentials of the new systems for HPC applications and backs it up by pointing to a new SPEComp2001 world record. Joerg Schwarz considers the applicability of the UltraSPARC T2 Plus servers for Health Care applications.
  • Web Tier. CVR explores the new World Record SPECweb2005 result on the T5220 system, and Walter Bays teases out the subtleties of the SPEC reporting process.
  • Java Performance. Dave Dagastine announces a World Record SPECjbb2005 result.
  • Benchmark Performance. The irrepressible bmseer details a number of world record results, including SPECjAppServer2004, SAP-SD 2 tier, and SPECjbb2005.
  • Open Source Community. Josh Berkus explores the implications for PostgreSQL using virtualization on the platform. Jignesh Shah discusses the possibilities with Glassfish V2 and PostgreSQL 8.3.1 on the T5140 and T5240 systems.
  • Sizing. Walter Bays introduces the CoolThreads Selection Tool (cooltst) v3.0 which is designed to gauge how well workloads will run on UltraSPARC T2 Plus systems.

Check out also the Sun CMT Wiki.

Monday Feb 25, 2008

Tuning MySQL on Linux

In this blog I'm sharing the results of a series of tests designed to explore the impact of various MySQL and, in particular, InnoDB tunables. Performance engineers from Sun have previously blogged on this subject - the main difference in this case is that these latest tests were based on Linux rather than Solaris.

It's worth noting that MySQL throughput doesn't scale linearly as you add large numbers of CPUs. This hasn't been a big issue for most users, since there are ways of deploying MySQL successfully on systems with only modest CPU counts. Technologies that are readily available and widely deployed include replication, which allows horizontal scale-out using query slaves, and memcached, which is very effective at reducing the load on a MySQL server. That said, scalability is likely to become more important as people increasingly deploy systems with quad-core processors, with the result that even two-processor systems will need to scale eight ways to fully utilize the available CPU resources.

The obvious question is whether performance and scalability is going to attract the attention of a joint project involving the performance engineering groups at MySQL and Sun. You bet! Fruitful synergies should be possible as the two companies join forces. And in case you're wondering, Linux will be a major focus, not just Solaris - regard this blog as a small foretaste. Stay tuned in the months to come...

Test Details

On to the numbers. The tests were run on a Sun Fire X4150 server with two quad-core Intel Xeon processors (8 cores in total) and a Sun Fire X4450 server with four quad-core Intel Xeon processors (16 cores in total) running Red Hat Enterprise Linux 5.1. The workload was Sysbench with 10 million rows, representing a database about 2.5Gbytes in size, using the current 64-bit Community version of MySQL, 5.0.51a. My colleague Neel has blogged on the workload and how we used it. The graphs below do not list throughput values, since the goal was only to show relative performance improvements.

The first test varied innodb_thread_concurrency. In MySQL 5.0.7 and earlier, a value greater than 500 was required to allocate an unlimited number of threads. As of MySQL 5.0.18, a value of zero means unlimited threads. In the graph below, a value of zero clearly delivers better throughput beyond 4 threads for the read-only test.

The read-write test, however, benefits from a setting of 8 threads. These graphs show the throughput on the 8-core system, although both the 8- and the 16-core systems showed similar behavior for each of the read-only and the read-write tests.

The following graphs show the effect of increasing the InnoDB buffer cache with the innodb_buffer_pool_size parameter. The first graph shows read-only performance and the second shows read-write performance. As you would expect, throughput increases significantly as the cache increases in size, but eventually reaches a point where no benefit is derived from further increases.

Finally, we've seen that throughput is affected by the amount of memory assigned to the InnoDB buffer cache. But since the default Linux file system, ext3, also caches pages, why not let Linux do the caching rather than InnoDB? To find out, we compared throughput with and without Linux file system caching; setting the innodb_flush_method parameter to O_DIRECT causes MySQL to bypass the file system cache. The results are shown in the graph below. The file system cache clearly makes a difference: throughput with a 1024 Mbyte InnoDB buffer cache backed by the file system cache was as good as throughput with a 2048 Mbyte InnoDB buffer cache and no file system caching. But while the Linux file system cache can protect you somewhat if you undersize the InnoDB buffer cache, for optimal performance it's important to give the InnoDB buffer cache as much memory as it needs. Bypassing the Linux file system cache may not be a good idea unless you have properly sized the InnoDB buffer cache - disk read activity was very high when the buffer cache was too small and the file system cache was bypassed. We also found that the CPU cost per transaction was higher when the InnoDB buffer cache was too small. That's not surprising, since the code path is longer when MySQL has to go outside the buffer cache to retrieve a block.

We tested a number of other parameters but found that none were as significant for this workload.

So to summarize, two key parameters to focus on are innodb_buffer_pool_size and innodb_thread_concurrency. Appropriate settings for these parameters are likely to help you ensure optimal throughput from your MySQL server.
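
As a starting point, here's a sketch of the corresponding my.cnf fragment; the file location and values are illustrative only - size the buffer pool to your working set, and consider a small fixed innodb_thread_concurrency such as 8 if your workload looks like the read-write results above:

# append the key settings to the server's option file (path may differ)
cat >> /etc/my.cnf <<'EOF'
[mysqld]
innodb_buffer_pool_size = 2G
innodb_thread_concurrency = 0   # 0 = no limit on threads inside InnoDB
EOF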

Allan

Thursday Jan 17, 2008

MySQL in Safe Hands

Given the timing of my recent blog, Are Proprietary Databases Doomed?, I've been asked if I knew in advance about Sun's recent MySQL acquisition. Not at all! I was just as surprised and delighted as most others in the industry when I saw the news.

In the blog I outlined counter strategies that proprietary database companies might use to respond to the rise of Open Source Databases (OSDBs). One strategy was acquisition, and I noted that MySQL, being privately held, was probably the most vulnerable.

The good news is that MySQL is no longer vulnerable. Sun has an unparalleled commitment to open source. No other organization has contributed anything like the quantity and quality of code, with Solaris, Java, OpenOffice, GlassFish, and other software now freely available under open source licenses. Sun also has an established track record with OSDBs such as PostgreSQL, and JavaDB, Sun's distribution of Derby. The MySQL acquisition does not represent a change of direction for Sun, rather the extension of an existing strategy.

The real surprise is that one of Oracle, IBM, or Microsoft didn't get there first. Any of them could have swallowed MySQL without burping. I'm betting there are people today wondering why on earth they let Sun steal a march on them.

Whatever the reason, the final outcome is great news for anyone who appreciates the value of free open source software. MySQL couldn't be in safer hands.

Allan

Thursday Oct 11, 2007

CMT Comes Of Age

Sun engineers give the inside scoop on the new UltraSPARC T2 systems

[ Update Jan 2008: Sun SPARC Enterprise T5120 and T5220 servers were awarded Product of the Year 2007. ]

Sun launched the Chip-Level MultiThreading (CMT) era back in December 2005 with the release of the highly successful UltraSPARC T1 (Niagara) chip, featured in the Sun Fire T2000 and T1000 systems. With 8 cores, each with 4 hardware strands (or threads), these systems presented 32 CPUs and delivered an unprecedented amount of processing power in compact, eco-friendly packaging. The systems were referred to as CoolThreads servers because of their low power and cooling requirements.

Today Sun introduces the second generation of Niagara systems: the Sun SPARC Enterprise T5120 and T5220 servers and the Sun Blade T6320. With 8 hardware strands in each of 8 cores plus networking, PCI, and cryptographic capabilities, all packed into a single chip, these new 64-CPU systems raise the bar even higher.

The new systems can probably be best described by some of the engineers who have developed them, tested them, and pushed them to their limits. Their blogs will be cross-referenced here, so if you're interested to learn more, come back from time to time. New blogs should appear in the next 24 hours, and more over the next few weeks.

Here's what the engineers have to say.

  • UltraSPARC T2 Server Technology. Dwayne Lee gives us a quick overview of the new systems. Denis Sheahan blogs about UltraSPARC T2 floating point performance, offers a detailed T5120 and T5220 system overview, and shares insights into lessons learned from the UltraSPARC T1. Josh Simons offers us a glimpse under the covers. Stephan Hoerold gives us an illustration of the UltraSPARC T2 chip. Paul Sandhu gives us some insight into the MMU and shared contexts. Tim Bray blogs about the interesting challenges posed by a many-core future. Darryl Gove talks about T2 threads and cores. Tim Cook compares the UltraSPARC T2 to other recent SPARC processors. Phil Harman tests memory throughput on an UltraSPARC T2 system. Ariel Hendel, musing on CMT and evolution, evidences a philosophical bent.
  • Performance. The inimitable bmseer gives us a bunch of good news about benchmark performance on the new systems - no shortage of world records, apparently! Peter Yakutis offers detailed PCI-E I/O performance data. Ganesh Ramamurthy muses on the implications of UltraSPARC T2 servers from the perspective of a senior performance engineering director.
  • System Management. Find out about Lights Out Management (ILOM) from Tushar Katarki's blog.
  • Networking. Alan Chiu gives us some insights into 10 Gigabit Ethernet performance and tuning on the UltraSPARC T2 systems.
  • RAS. Richard Elling carries out a performability analysis of the T5120 and T5220 servers.
  • Clusters. Ashutosh Tripathi discusses Solaris Cluster support in LDoms I/O domains.
  • Virtualization. Learn about Logical Domains (LDoms) and the release of LDoms 1.0.1 from Honglin Su. Eric Sharakan has some more to say about LDoms and the UltraSPARC T2. Ashley Saulsbury presents a flash demo of 64 Logical Domains booting on an UltraSPARC T2 system. Find out why Sun xVM and Niagara 2 are the ultimate virtualization combo from Marc Hamilton.
  • Security Performance. Ning Sun discusses Cryptography Acceleration on T2 systems. Glenn Brunette offers us a Security Geek's point of view on the T5x20 systems. Lawrence Spracklen has several posts on UltraSPARC T2 cryptographic acceleration. Martin Mueller proposes an UltraSPARC T2 system deployment designed to deliver a high performance, high security environment.
  • Application Performance. Dileep Kumar talks about WebSphere Application Server performance with UltraSPARC T2 systems. Tim Bray shares some hands-on experiences testing a T5120.
  • Java Performance. Dave Dagastine offers us insights into the HotSpot JVM on the T2 and Java performance on the new T2 servers.
  • Web Applications. Murthy Chintalapati talks about web server performance. Constantin Gonzalez explores the implications of UltraSPARC T2 for Web 2.0 workloads. Shanti Subramanyam tells us that Cool Stack applications (including the AMP stack packages) are pre-loaded on all UltraSPARC T2-based servers.
  • Open Source Community. Barton George explores the implications of UltraSPARC T2 servers for the Ubuntu and Open Source community.
  • Open Source Databases. Luojia Chen discusses MySQL tuning for Niagara servers.
  • Customer Use Cases. Stephan Hoerold gives us some insight into experiences of Early Access customers. Stephan also shares what happened when STRATO put a T5120 to the test. It seems like STRATO also did some experimentation with the system.
  • Sizing. I've posted an entry on Sizing UltraSPARC T2 Servers.
  • Solaris features. Scott Davenport blogs on Predictive Self-Healing on the T5120. Steve Sistare gives us a lot of insight into features in Solaris to optimize the UltraSPARC T2 platforms. Walter Bays salutes the folks who reliably deliver consistent interfaces on the new systems.
  • HPC & Compilers. Darryl Gove talks about compiler flags for T2 systems. Josh Simons talks about the relevance of the new servers to HPC applications. Ruud van der Pas measures T2 server performance with a suite of single-threaded technical-scientific applications. In another blog entry, Darryl Gove introduces us to performance counters on the T1 and T2.
  • Tools. Darryl Gove points to the location of free pre-installed developer tools on UltraSPARC T2 systems. Nicolai Kosche describes the hardware features added to UltraSPARC T2 to improve the DProfile Architecture in Sun Studio 12 Performance Analyzer. Ravindra Talashikar brings us Corestat for UltraSPARC T2, a tool that measures core utilization to help users better understand processor utilization on UltraSPARC T2 systems.

Finally

Go check out the new UltraSPARC T2 systems, and save energy and rack space in the process.

Enjoy!

Wednesday Oct 10, 2007

Sizing UltraSPARC T2 Servers

The newly released Sun SPARC Enterprise T5120 and T5220 servers and the Sun Blade T6320 present an interesting challenge to the end user: how do you deploy a system that looks like an entry-level server (it's either one or two rack units in size and comes with a single processor chip), yet has more processing power than the fastest 64-CPU Enterprise 10000 (Starfire) shipped by Sun? Oh, and in case you're wondering, the Starfire comparison is good for delivered application throughput, too, not just theoretical speeds and feeds.

The first issue is to figure out how you're going to use up all the CPU. There are a number of possibilities, including:

  1. You deploy a single application that consumes the entire system. This single application might have multiple threads, such as the Java VM, or multiple processes, like Oracle. When a T2000 is dedicated to a single application such as Oracle, best practice is to treat it like a standard 12-16 CPU system and tune accordingly. So a good starting point is probably to tune a T5120 or T5220 as a 24-32 CPU system. You will want to monitor the proportion of idle CPU with vmstat or mpstat (or corestat if you'd like more information about how busy the cores are) - see the monitoring sketch after this list. If there's a lot of idle CPU, then you might need to tune for more CPUs.

    A single application wasn't the most common way of consuming 32-thread UltraSPARC T1 servers like the Sun Fire T2000, though. And it's even less likely to be typical on the 64-thread T2 servers, which are a little more than twice as powerful as T1 servers.

    Why isn't it typical to consume a T1-based system with a single application? The most common reason is that a single application often doesn't require that much CPU. Sometimes, too, a single application instance doesn't scale well enough to consume all 32 CPUs. We've particularly seen this with open source applications, which have mostly been deployed on 1- or 2-CPU systems. Configuring multiple application instances can sometimes overcome this limitation.

    It's worth noting that application developers will increasingly find themselves needing to solve this issue in the future. With all chip suppliers moving to quad-core implementations, it will soon be necessary for applications to perform well with 4- to 8-CPUs just to consume the CPU resource of a 1- or 2-chip system. Early adopters of T1000 and T2000 systems are in good shape, because it's likely they've already made this transition.

  2. You consume the entire system by deploying multiple applications. These applications can, in turn, be multi-threaded, multi-process, or multi-user. Virtualization can be an attractive way of managing multiple applications, and there are two available technologies on T2-based servers: Solaris Zones and Logical Domains (LDoms). They are complementary technologies, too, so you can use either, or both together. Domains will already be familiar to many - Sun users have been carving up their systems into multiple domains since the days of the Starfire. The LDom implementation is different, but the concepts are very similar. Check out this link for pointers to blogs on LDoms.
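
Whichever deployment model you choose, keep an eye on how much of the machine you're actually using. Here's the monitoring sketch referred to in the first scenario above (the intervals are arbitrary):

# overall CPU utilization at 5-second intervals - watch the us/sy/id columns
vmstat 5

# per-vCPU detail; persistent idle time spread across many vCPUs usually
# means the applications aren't presenting enough concurrent work
mpstat 5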

Caveats?

In my blog on Sizing T1 Servers back in 2005, I made a number of suggestions about sizing and consolidation that also apply to the new systems. I also noted two caveats related to performance. The first related to floating-point intensive workloads. This caveat no longer applies on T2 servers - the floating point units included in each of the 8 cores deliver excellent floating point performance. The second caveat related to single-thread performance and the importance of understanding whether an application would run well in a multi-threaded environment. Is there, for example, a significant dependence on long-running single-threaded batch jobs? This question must still be asked of T2 servers, although the single-threaded performance of the T2 is improved relative to the T1. The Cooltst tool was created to help identify single-threaded behavior with applications running on existing Solaris (SPARC and x64/x86) and Linux systems (x64/x86). A new version of Cooltst will soon be available that supports T2 systems as well. For optimal throughput with T2-based servers, single-threaded applications should either be broken up, deployed as multiple application instances, or mixed with other applications.

The bottom line is that T5x20 servers will soon be replacing much larger systems, and delivering significant reductions in energy, cooling, and storage requirements.

Monday Dec 19, 2005

Hands-on Consolidation on CoolThreads Servers

A couple of Sun Blueprints have recently been released that offer some helpful insights into application consolidation on CoolThreads Servers.

The first, Consolidating the Sun Store onto Sun Fire T2000 Servers, documents the process of migrating the online Sun Store from a Sun Enterprise 10000 with 38 400MHz CPUs onto a pair of Sun Fire T2000 servers (they used two for high availability). The resulting environment took advantage of Solaris 10 Containers. They saw an overall reduction of approximately 90 percent in both input power and heat output! Pretty cool (literally)! The space savings were even more significant.

The second blueprint, Web Consolidation on the Sun Fire T1000 using Solaris Containers, gives a detailed hands-on description of the process of building a web-tier consolidation platform on a Sun Fire T1000 with Containers.

If you're planning a CoolThreads consolidation, go check them out. I think you'll find both papers useful.

Tuesday Dec 06, 2005

Sizing CoolThreads Servers

The Sun Fire T1000/T2000 (aka "CoolThreads") server offers a lot of horsepower in a single chip: up to eight cores running at either 1000MHz or 1200MHz, each core with four hardware threads. But how should this SMP-in-a-chip be sized appropriately for real-world applications?

The published benchmarks show that the application throughput delivered by a single T2000 server is equivalent to the throughput delivered by multiple Xeon systems. And this isn't just marketing hype, either; the UltraSPARC T1 processor is a genuine breakthrough technology. But what are the practical considerations involved in replacing several Xeon servers with a single T1000 or T2000?

Preparing for CoolThreads

For starters, it's important to understand the design point of the UltraSPARC T1. If you need blazing single-thread performance, this isn't the system for you - the chip simply wasn't designed that way. And if you think that's bad, then I'm sorry to say your future is looking a little bleak. Every processor designer in the industry is moving to multiple cores, and one implication is that single thread performance will no longer be getting all the attention. Performance will be served up in smaller packages.

The UltraSPARC T1 is a chip oriented for throughput computing. With the multi-threading capabilities of this chip, Sun has done two things. The first is to push the envelope much further than anyone else anticipated. Not everyone will applaud this strategy, of course. (And just for fun, note the reactions carefully, and deduct points from competitors who bad-mouth Sun's strategy now, and later end up copying it!) More importantly, though, Sun has issued notice about the way applications need to be designed. In a world that increasingly delivers CPU power through multiple cores and threads, single-threaded applications don't make a whole lot of sense any more. The sooner you multi-thread your applications, the better off you'll be, regardless of your hardware vendor of choice.

That doesn't mean you'll be forced to rearchitect your applications before you can use the T1000/T2000, though. You can proceed provided your planned deployment has one or more of the following characteristics, any of which will allow it to take advantage of UltraSPARC T1's multiple cores and threads:

  • Multiple applications
  • Multiple user processes
  • Multi-threaded applications
  • Multi-process applications

In general, commercial software that runs well on SMP (Symmetric Multi-Processor) systems will run well on the T1000/T2000 (because one or more of the above already apply). Note that the Java VM is already multi-threaded.

When to Walk Away

The other major consideration is floating point performance. The UltraSPARC T1 is not designed for floating-point intensive applications. This isn't as disastrous as it might sound. It turns out that a vast range of commercial applications, from ERP software like SAP through Java application servers, do very little floating point and run just fine on the T1000/T2000. If you're in any doubt about how to figure out the proportion of floating point instructions in your application, help is on the way. More on this in a future blog.

Sizing

If you made it past the single-threaded and floating point questions, you're ready for some serious sizing. The first step is to see how busy your current servers are. Suppose you plan to consolidate applications from six Xeon servers onto a Sun Fire T2000 server. If the CPUs on each system are typically 30% busy and peak at 50%, then you will be migrating a peak load equivalent to three fully-utilized servers.

By far the best way to test the relative performance of the T1000/T2000 and your current servers is to run your own application on both. If that isn't possible, a crude starting point might be to compare published performance on a real-world workload. Check out the published T1000/T2000 benchmarks for further information. If you can't directly compare your intended applications, try to find something as close as possible (e.g. the CPU, network, and storage I/O resource usage should look at least vaguely similar to your actual workload). Benchmarks that use real ISV application code (e.g. SAP and Oracle Applications) are going to be more relevant to a throughput platform like the T1000/T2000 than artificial benchmarks designed to measure the performance of a traditional CPU. One important warning: don't try to draw final conclusions if you're not comparing the same application on both platforms! Extrapolations don't work well when the technologies are radically different (and the UltraSPARC T1 is simply different to anything else out there).

The next step is to figure out how to deploy the applications. You have four, six, or eight cores at your disposal (depending on the T1000/T2000 platform you've chosen). Should you simply let Solaris worry about the scheduling? Or should you figure out your resource management priorities in advance and carve up the available resources before deploying the applications? You might want to refer to my blog about Consolidating Applications onto a CoolThreads Server for more information on this topic.

Once you're ready to deploy, make sure you do some serious load testing before going live. Don't make the mistake of rushing into production without first finding out how well your application scales on the T1000/T2000 platform. I don't know about you, but I hate nasty surprises! And if you do encounter scaling issues, don't forget that Solaris 10 DTrace is your friend. And check out DProfile, too.

Once you get your head around this technology, you're going to enjoy it! And that's even without mentioning the power, cooling, and rack space savings...

Allan

PS. If you're looking for more CoolThreads info direct from Sun engineers, Richard McDougall has put together an excellent overview of other relevant blogs.

About

I'm a Principal Engineer in the Performance Technologies group at Sun. My current role is team lead for the MySQL Performance & Scalability Project.
