Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7
By mrbenchmark on Nov 10, 2008
A new era
In the past few years, I published many articles using Oracle as a database server. As a former Sybase system administrator and former Informix employee, it was obviously not a matter of personal choice. It was just because the large majority of Sun's customers running databases were also Oracle customers.
This summer, in our 26 Sun Solution Centers worldwide, I observed a shift. Yes, we were still seeing older solutions based on DB2, Oracle, Sybase or Informix being evaluated on new Sun hardware. But every customer project manager, every partner, every software engineer working on a new information system design asked us : Can we architect this solution with MySQL ?
In many cases, if you dared to reply YES to this question, the next interrogation would be about the scalability of the MySQL engine.
This is why I decided to write this article.
Please find below my initial goals :
Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440
Do it 21st century style i.e. with MySQL and ZFS , not 20th century style i.e with OraSybInf... and VxFS
Do it with minimal tuning i.e as close as possible as out-of-the-box
This article is describing how I achieved this goals. It has two main parts : a short description of the technologies used, then a showing of the results obtained.
Sun SPARC Enterprise T5440 server
The T5440 server is the first quad-socket server proposing 256 hardware threads in just four rack units. Each socket host a UltraSPARC T2 Plus processor which propose eight cores and 64 simultaneous threads into a single piece of silicon. While a lot of customers are interested in the capacity of this system to be divided into 128 two-way domains, this article explores the database capacity of a single 256-way Solaris 10 domain.
The Zettabyte file system
Announced in 2004 and introduced part of OpenSolaris build 27 in November 2005, ZFS is the one-and-only 128-bit file system. It includes many innovative features like a copy-on-write transactional model, snapshots and clones, dynamic striping and variable block sizes. Since July 2006, ZFS is also a key part of the Solaris operating system . A key difference between UFS and ZFS is the usage of the ARC [Adaptive Replacement Cache] instead of the traditional virtual memory page cache. To obtain the performance level shown in this article, we only had to tune the size of the ARC cache and turn off atime management on the file systems to optimize ZIL I/O latency. The default ZFS recordsize is commonly changed for database workload. For this study, we kept the default value of 128k.
The MySQL database server is the leading Open Source database for Web 2.0 environment. MySQL was introduced in May 1995 and has never stopped to be enriched with features. The 5.1 release is an important milestone as it introduces support for partitioning, event scheduling, XML functions and row based replication. While Sun is actively working on implementing a single instance highly scalable storage engine, this article is showing how one can reach a very high level of SQL query throughput using MySQL 5.1.29 64-bit on a 256-way server.
SLAMD and iGenOLTP
The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.
iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read/10% write workload simulating a global order system. For this exercise, we are using a maximum of 24 milllion customers and 240 million orders in the databases. The database is divided “sharded” in as many pieces as the number of MySQL instances on the system. [See this great article on database sharding]. For example, for 24 database instances, database 1 store customers 1 to 1 million, database 2 store customers 1 milion to 2 million and so on. The Java threads simulating the workload are aware of the database partitioning scheme and simulate the traffic accordingly.
This approach can be called “Application partitioning” as opposed to “Database partitioning”. Because it is based on a shared-nothing architecture, it it natively more scalable than a shared-everything approach (as in Oracle RAC).
Java Platform Standard Edition 7
Initially released in 1995, the programming language Java started a revolution in computer languages because of the concept of Java Virtual Machine causing instant portability across computer architectures. While the 1.7 JVM is still in beta release, it is the base of my iGenOltpMysql Java class performing the workload shown in this article. The key enhancement of the JVM 1.6 was the introduction of native Dtrace probes. The 1.7 JDK is an update packed with performance related enhancements including an improved Adaptive Compiler, optimized Rapid Memory Allocation , finely tuned garbage collector algorithms and finally a lighter thread synchronization capability causing better scalability. For this article we used the JDK7 build 38.
Software and Hardware summary
This study is using Solaris 10 Update 6 (released October 31st,2008), Java 1.7 build 38 (released Otober 23rd,2008), SLAMD 1.8.2, iGenOLTP v4.2 for MySQL and MySQL 5.1.29. The hardware tested is a T5440 with 4xUltraSPARC T2 Plus 1.2Ghz and 64 GB of RAM . A Sun Blade 8000 with 10 blades each with 2xAMD Opteron 8220 2.8Ghz and 8GB RAM is used as a client system. Finally a Sun ST6140 storage array [with 10x146GB 15k RPM drives] is configured in RAID-1 [2 HS], with two physical volumes and connected to the T54440 with two 4GB/s controllers.
Scaling vertically first
This is a matter of methodology. The first step is to determine the peak throughput of a single instance of MySQL with iGenOLTP using InnoDB then use approximately 75% of this throughput as the basis for the horizontal scalability test. ZFS and MySQL current best practices guided the choice of all the tunables used. [available upon request] The test is done in stabilized load with each simulation thread executing 10 transactions per second. Please find below the throughput and response time scalability curves :
Note that the peak throughput is 725 transactions per second which corresponds to 4350 SQL statements per second. We are caching the entire 1 Gbyte database. The only I/Os happening are due to the delete/insert statements, the MySQL log and the ZFS Intent Log. We will be using 75% of the peak workload simulation as the base workload per instance for the horizontal scalability exercise. Why 75% ? Our preliminary tests showed that it the was the best compromise to reach maximum multi-instance throughput.
The next step was to increase the number of instances while increasing proportionally the database size (number of customer ids). We will have the same 600 TPS workload requested on each instance but querying a different range within the global data set. The beauty of the setup is that we do not have to reinstall the MySQL binaries multiple times : we could just use soft links. The main thing to do was to configure 32 ZFS file systems on our ZFS pool and then to create & load the databases. This was easily automated with ksh scripts. Finally, we had to customize the Java workload to query all the database instances accurately...
Here are the results :
As you can see, we were able to reach a peak of more than 79,000 SQL queries per second on a single 4 RU server. The transaction throughput is still increasing after 28 instances but this is the sweet spot for this benchmark on the T5440 as guided by the transactions average response time. At 28 instances, we observed less than 30ms average response time. However, for 32 instances, response times jumped to an average of 95ms.
The main trick to achieve horizontal scalability: Optimize thread scheduling
Solaris is using the timeshare class as the default scheduling class. The scheduler needs to always make sure that the thread priorities are adequately balanced. For this test, we are running thousand of threads running this workload and can get critical CPU User Time back by avoiding unnecessary work by the scheduler. To achieve this, we are running the MySQL engines and Java processes in the Fixed Priority class. This is achieved easily using the Solaris priocntl command.
As I mentioned in introduction, an architecture shift is happening. Database sharding and application partitioning are the foundation of future information systems as pioneered by companies like Facebook [see this interesting blog entry]. This article prove that Sun Microsystems servers with CoolThread technology are an exceptional foundation for this change. And they will also considerably lower your Total Cost of Ownership as illustrated in this customer success story.
A very special thank you to the following experts who helped in the process or reviewed this article : Huon Sok, Allan Packer, Phil Morris, Mark Mulligan, Linda Kateley, Kevin Figiel and Patrick Cyril.
See you next time in the wonderful world of benchmarking....