Thursday May 21, 2009

OpenSolaris beats Linux on Memcached!

Following on the heels of our memcached performance tests on the SunFire X2270 (Sun's Nehalem-based server) running OpenSolaris, we ran the same tests on the same server, but this time on RHEL5. As mentioned in the post presenting the first memcached results, a 10 GbE Intel Oplin card was used in order to achieve the high throughput rates possible with these servers. It turned out that using this card on Linux involved a bit of work, resulting in driver and kernel rebuilds.

  • With the default ixgbe driver from the Red Hat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
  • This led to downloading the driver from the Intel site (1.3.56.11-2-NAPI) and recompiling it. This version does work, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this version of the kernel does not have support for multiple rings.
  • Kernel version 2.6.29 includes support for multiple rings but still doesn't have the latest ixgbe driver, which is 1.3.56-2-NAPI. So we downloaded, built and installed these versions of the kernel and driver (a rough sketch of the driver build steps follows this list). This worked well, giving a maximum throughput of 280K ops/sec with some tuning.
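
For reference, rebuilding the Intel driver from its source tarball generally looks like the sketch below. The archive name is illustrative for the version mentioned above, and the exact steps can vary between driver releases.

# Build and install the out-of-tree ixgbe driver for the running kernel
tar xzf ixgbe-1.3.56.11-2.tar.gz
cd ixgbe-1.3.56.11-2/src
make install
# Reload the driver so the new module is used (drops the interface briefly)
rmmod ixgbe && modprobe ixgbe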

Results Comparison

The system running OpenSolaris and memcached 1.3.2 gave us a maximum throughput of 350K ops/sec, as previously reported. The same system running RHEL5 (with kernel 2.6.29) and the same version of memcached resulted in 280K ops/sec. OpenSolaris outperforms Linux by 25%!

Linux Tuning

The following Linux tunables were changed to try to get the best performance:

net.ipv4.tcp_timestamps = 0
net.core.wmem_default = 67108864
net.core.wmem_max = 67108864
net.core.optmem_max = 67108864
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_window_scaling = 0
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_syn_backlog = 200000
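
For reference, tunables like these are typically applied with sysctl, either at run time or by adding the lines above to /etc/sysctl.conf; a minimal sketch:

# Apply a single setting at run time (repeat for each tunable above)
sysctl -w net.ipv4.tcp_timestamps=0
# Or persist them in /etc/sysctl.conf and reload
sysctl -p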
  

Here are the ixgbe-specific settings that were used (2 transmit, 2 receive rings):

RSS=2,2 InterruptThrottleRate=1600,1600
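
One common way to pass these options to the out-of-tree ixgbe module is through /etc/modprobe.conf (or a file under /etc/modprobe.d/), followed by a module reload; a sketch:

# /etc/modprobe.conf
options ixgbe RSS=2,2 InterruptThrottleRate=1600,1600

# Reload the driver so the options take effect
modprobe -r ixgbe && modprobe ixgbe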

OpenSolaris Tuning

The following settings in /etc/system were used to set the number of MSI-X interrupts:

set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1

For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance:

tx_queue_number=4, rx_queue_number=4
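
These ring counts are ixgbe driver properties; one place they can be set is the driver configuration file, shown here as a sketch (the driver needs to be re-attached, or the system rebooted, for the change to take effect):

# /kernel/drv/ixgbe.conf
tx_queue_number = 4;
rx_queue_number = 4;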

Finally, we bound the Crossbow threads:

dladm set-linkprop -p cpus=12,13,14,15 ixgbe0
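
The resulting binding can be checked with:

dladm show-linkprop -p cpus ixgbe0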

Monday Apr 27, 2009

Multi-instance memcached performance

As promised, here are more results running memcached on Sun's X2270 (Nehalem-based server). In my previous post, I mentioned that we got 350K ops/sec running a single instance of memcached, at which point throughput was hampered by the scalability issues of memcached. So we ran two instances of memcached on the same server, each using 15GB of memory, and tested both the 1.2.5 and 1.3.2 versions. Here are the results:

[Chart: 2 instances performance - throughput of two memcached instances, 1.2.5 vs. 1.3.2]

The maximum throughput was 470K ops/sec using 4 threads in memcached 1.3.2. Throughput with 1.2.5 was only slightly lower. At this rate, the network capacity of the single 10 GbE card was reached, as the benchmark does a lot of small packet transfers. See my earlier post for a description of the server configuration and the benchmark. At the maximum throughput, the CPU was still only 62% utilized (73% in the case of 1.2.5). Note that with a single instance we were using the same amount of CPU but reaching a lower throughput rate, which once again points to memcached's scalability issues.
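
For illustration, two such instances can be started along the following lines; the ports and user are assumptions, -m is the cache size in MB and -t the number of worker threads:

memcached -d -u nobody -m 15360 -t 4 -p 11211
memcached -d -u nobody -m 15360 -t 4 -p 11212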

These are really exciting results. Stay tuned - there is more to come.

Saturday Apr 18, 2009

Memcached Performance on Sun's Nehalem System

Memcached is the de facto distributed caching server used to scale many web2.0 sites today. As sites grow and need to support very large numbers of users, memcached aids scalability by effectively cutting down on MySQL traffic and improving response times.

Memcached is a very light-weight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta). With the new Intel Nehalem-based systems and their improved hyper-threading providing twice as much performance as current systems, we were curious to see how memcached would perform on them. So we ran some tests, the results of which are shown below:

[Chart: memcached 1.2.5 vs. 1.3.2 throughput as the number of threads increases]

memcached 1.3.2 does scale slightly better than 1.2.5 beyond 4 threads. However, both versions reach their peak at 8 threads, with 1.3.2 giving about 14% better throughput at 352,190 operations/sec.

The improvements made to per-thread stats have certainly helped, as we no longer see stats_lock at the top of the profile. That honor now goes to cache_lock. With the increased performance of new systems making 350K ops/sec possible, breaking up this lock (and others) in memcached is necessary to improve scalability further.
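
On OpenSolaris, one quick way to look at this kind of user-level lock contention is plockstat; a sketch that samples a running memcached process for 30 seconds:

plockstat -A -e 30 -p `pgrep memcached`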

Test Details

A single instance of memcached was run on a SunFire X2270 (two-socket Nehalem) with 48GB of memory and an Oplin 10G card. Several external client systems were used to drive load against the server using an internally developed Memcached benchmark. More on the benchmark later.

The clients connected to the server over a single 10 Gigabit Ethernet link. At the maximum throughput of 350K ops/sec, the network was about 52% utilized and the server was 62% utilized. So there is plenty of headroom on this system to handle a much higher load if memcached could scale better. Of course, it is possible to run multiple instances of memcached to get better performance and better utilize the system resources, and we plan to do that next. It is important to note that utilizing these high-performance systems effectively for memcached will require the use of 10 GbE interfaces.
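
As a rough sketch, utilization numbers like these can be tracked with nicstat (a separate download) for the NIC and mpstat for the CPUs:

nicstat 5     # per-interface throughput and %Util
mpstat 5      # per-CPU utilization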

Benchmark Details

The Memcached benchmark we ran is based on Apache Olio - a web2.0 workload. I recently showcased results from Olio on Nehalem systems as well. Since Olio is a complex multi-tier workload, we extracted the memcached part to more easily test it in a stand-alone environment. This gave rise to our Memcached benchmark.

The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached:

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures the throughput and response times once the system reaches steady state. The workload is implemented using Faban, an open-source benchmark development framework. Faban not only speeds benchmark development, but its harness is also a great way to queue, monitor and archive runs for analysis.
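
As a rough illustration of that operation mix (and not of the Faban-based benchmark itself), here is a minimal bash sketch that speaks the memcached text protocol directly; the server name, key space and value size are made up:

#!/bin/bash
# Toy 90% get / 10% set mix over the memcached text protocol,
# using bash's /dev/tcp support.
exec 3<>/dev/tcp/memcached-host/11211
for i in $(seq 1 1000); do
    if [ $((RANDOM % 10)) -eq 0 ]; then
        printf 'set obj:%d 0 0 5\r\nhello\r\n' "$i" >&3    # 10% sets (5-byte value)
        read -r reply <&3                                   # expect STORED
    else
        printf 'get obj:%d\r\n' "$((RANDOM % 1000 + 1))" >&3    # 90% gets
        while read -r line <&3; do
            [ "$line" = $'END\r' ] && break                 # read until END
        done
    fi
done
exec 3<&-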

Stay tuned for further results.

About

I'm a Senior Staff Engineer in the Performance & Applications Engineering Group (PAE). This blog focuses on tips to build, configure, tune and measure performance of popular open source web applications on Solaris.
