By shanti on May 21, 2009
Following on the heels of our memcached performance tests
on the SunFire X2270 (Sun's Nehalem-based server) running OpenSolaris, we
ran the same tests on the same server but this time on RHEL5. As
mentioned in the post presenting the first memcached results,
a 10GbE Intel Oplin card was used in order to achieve the high
throughput rates possible with these servers. It turned out that using
this card on Linux involved a bit of work, resulting in driver and kernel rebuilds:
- With the default ixgbe driver from the Red Hat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
- We therefore downloaded the driver from the Intel site (126.96.36.199-2-NAPI) and recompiled it. This version works, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this kernel version does not support multiple rings.
- Kernel version 2.6.29 includes support for multiple rings but still doesn't have the latest ixgbe driver, which is 1.3.56-2-NAPI. So we downloaded, built, and installed these versions of the kernel and driver. This worked well, giving a maximum throughput of 280K ops/sec with some tuning.
The system running
OpenSolaris and memcached 1.3.2 gave us a maximum throughput of 350K
ops/sec as previously reported. The same system running RHEL5 (with
kernel 2.6.29) and the same version of memcached resulted in 280K
ops/sec. OpenSolaris outperforms Linux by 25%!
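As a quick check of the percentages implied by the numbers above, here is a small sketch. The helper name `pct_gain` is made up for illustration; the throughput figures (in K ops/sec) are the ones quoted in this post.

```shell
#!/bin/sh
# pct_gain A B: percentage by which A exceeds B, to one decimal place.
# (Hypothetical helper, not part of the benchmark kit.)
pct_gain() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", (a - b) * 100 / b }'
}

# OpenSolaris (350K) versus RHEL5 on kernel 2.6.29 (280K)
echo "OpenSolaris over Linux: $(pct_gain 350 280)%"

# Kernel 2.6.29 with multiple rings (280K) versus kernel 2.6.18 (232K)
echo "2.6.29 over 2.6.18:     $(pct_gain 280 232)%"
```

The first line reproduces the 25% figure; the second shows what the move to the newer kernel and driver bought on Linux.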
The following Linux tunables were changed to try and get the best performance:
net.ipv4.tcp_timestamps = 0
net.core.wmem_default = 67108864
net.core.wmem_max = 67108864
net.core.optmem_max = 67108864
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_window_scaling = 0
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_syn_backlog = 200000
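One way to keep such tunables together is to write them to a sysctl-format file and load it in one step. The sketch below is only an illustration: the file name `memcached-net.conf` is an assumption, and the post does not say how the settings were actually applied.

```shell
#!/bin/sh
# Write the tunables listed in this post to a sysctl-format file.
# CONF is a hypothetical path chosen for this example.
CONF="${CONF:-./memcached-net.conf}"

cat > "$CONF" <<'EOF'
net.ipv4.tcp_timestamps = 0
net.core.wmem_default = 67108864
net.core.wmem_max = 67108864
net.core.optmem_max = 67108864
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_window_scaling = 0
net.core.netdev_max_backlog = 300000
net.ipv4.tcp_max_syn_backlog = 200000
EOF

# Loading the file requires root, so the command is only printed here.
echo "To load (as root): sysctl -p $CONF"
```

Keeping the values in a file makes it easy to revert or to compare runs with and without the tuning.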
Here are the ixgbe specific settings that were used (2 transmit, 2 receive rings):
RSS=2,2 InterruptThrottleRate=1600,1600
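For readers who want to reproduce this, module parameters like these are typically passed when the driver is loaded. The fragment below is a sketch that assumes the settings are placed in /etc/modprobe.conf, the usual location on RHEL5; the post does not say where they were actually set. In Intel's out-of-tree drivers, comma-separated values conventionally apply to successive ports.

```
# /etc/modprobe.conf (assumed location)
options ixgbe RSS=2,2 InterruptThrottleRate=1600,1600
```

The driver has to be reloaded for new values to take effect.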
On the OpenSolaris side, the following settings in /etc/system were used to set the number of MSI-X interrupts:
set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1
For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance.
Finally, we bound the Crossbow threads:
dladm set-linkprop -p cpus=12,13,14,15 ixgbe0