Out-of-the-Box Network Performance on Nehalem Servers

With the upcoming OpenSolaris 2009.06 release, how does OpenSolaris networking perform on Intel Nehalem servers compared to Linux? One of the network performance questions customers ask, after they have characterized a workload and found that networking may be the limiting factor, is "Can a system achieve X Gbit/s?" or "Can a system send/receive Y packets/sec?" I'll take a look at out-of-box performance using the micro-benchmark tool uperf, which shows the server's capabilities.

Benchmark Considerations

Intel Nehalem servers have sufficient CPU cycles and I/O bandwidth to drive one 10GbE port at line rate for large packet sizes. To see a meaningful difference between Solaris and Linux, I chose a four-port configuration that can show differences in throughput. When PCIe gen. 2 10 Gigabit Ethernet is widely available, the four ports can come from two NICs and take only two PCIe slots. Since I use PCIe gen. 1 10 Gigabit Ethernet, the four ports come from four NICs to avoid PCIe bandwidth becoming the bottleneck.
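
As a rough back-of-the-envelope check of why that matters (approximate numbers, ignoring the finer points of protocol overhead):

  PCIe gen. 1 x8: 8 lanes x 2.5 GT/s x 8b/10b encoding = 16 Gbit/s raw per direction,
                  roughly 12-13 Gbit/s of usable payload -- not enough for a dual-port 10 GbE card.
  PCIe gen. 2 x8: 8 lanes x 5.0 GT/s x 8b/10b encoding = 32 Gbit/s raw per direction,
                  comfortably enough for two 10 GbE ports on one card.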

Which Linux?

SUSE Linux Enterprise 11 is chosen because SUSE is widely used and it is based on a more recent Linux kernel than the latest Red Hat distribution. Compared to SLES 10 or RHEL 5.3, SLES 11 includes TX multiqueue support, which helps TX scale better with more connections. For simplicity, no tuning is done for either Linux or OpenSolaris. Tuning Linux will improve some results.
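
As a quick sanity check that the Linux driver really brought up multiple queues, you can look at the MSI-X vectors and per-queue counters it registered (the interface name and exact statistic names below are examples and vary by driver and kernel version):

  # per-queue MSI-X vectors registered by the 10 GbE interface
  grep eth2 /proc/interrupts

  # per-queue counters exposed by the ixgbe driver
  ethtool -S eth2 | grep -E 'tx_queue|rx_queue'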

Test Results

[Chart: Nehalem TX throughput]

[Chart: Nehalem RX throughput]

Engineering Notes

S10 Update 7 throughput drops significantly with more connections because of a high rate of interrupts. Tuning to spread interrupts across more CPUs:

/etc/system
* allow each device to allocate up to 4 MSI-X interrupt vectors
set ddi_msix_alloc_limit=4
* change the interrupt distribution policy so vectors land on more CPUs
set pcplusmp:apic_intr_policy=1
* cap MSI-X and multi-message MSI vectors at 4 per device
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_multi_msi_max=4

/kernel/drv/ixgbe.conf
# use 8 RX rings per 10 GbE port
rx_queue_number=8;

brings performance more in line with OpenSolaris. TCP RX throughput more than doubles to 20.6 Gbit/s, and TCP TX throughput is 26.6 Gbit/s at 1000 connections.
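
One way to verify that the tuning actually spreads interrupt load across CPUs (intrstat and mpstat are standard Solaris observability tools; 5 is just a sampling interval in seconds):

  # per-CPU interrupt activity, broken out by driver instance
  # (look for the ixgbe instances spread over several CPUs after tuning)
  intrstat 5

  # per-CPU utilization; before tuning, a single CPU tends to be
  # saturated in system time handling interrupts
  mpstat 5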

Uperf Profiles

uperf uses XML to describe the workload. The profile that describes TCP TX throughput is here. The profile that describes TCP RX throughput is here. The profiles are written for four clients; if a different number of clients is used, modify the number of groups to match the number of clients.
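
For readers who don't have the linked profiles handy, here is a minimal sketch of the shape of a TX-style uperf profile; the hostname, thread count, write size, and duration below are placeholders, not the exact values used in these tests, and the real profiles repeat one group per client:

  <?xml version="1.0"?>
  <profile name="tcp-tx-sketch">
    <group nthreads="8">
      <transaction iterations="1">
        <flowop type="connect" options="remotehost=client1 protocol=tcp"/>
      </transaction>
      <transaction duration="120s">
        <flowop type="write" options="size=64k"/>
      </transaction>
      <transaction iterations="1">
        <flowop type="disconnect"/>
      </transaction>
    </group>
  </profile>

To run it, each client starts uperf in slave mode (uperf -s) and the SUT runs the profile as master (uperf -m profile.xml); an RX-style profile has the same shape with read flowops in place of the writes.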

Test Set-up

  • SUT: Sun Fire X4270, dual-socket Intel Nehalem @ 2.66 GHz. Hyper-threading is on (i.e., changed from the BIOS default). 24 GB of memory.
  • OS: OpenSolaris 2009.06; Solaris 10 5/09; SUSE Linux Enterprise 11 (2.6.27.19-5 kernel)
  • Four X1106A-Z Sun 10 GbE cards with the Intel 82598EB 10 GbE controller installed on the SUT. ixgbe driver version 1.3.56.17-NAPI compiled on SLES 11.
  • Clients: Four Sun Fire X4150, dual-socket Intel Xeon @ 3.16 GHz, 8 GB main memory. Three of the X4150s run Solaris Nevada 109 with the Sun Multithreaded 10 GbE Networking Card. One X4150 runs SUSE 10 SP2 with an X1106A-Z Sun 10 GbE card (Intel 82598EB 10 GbE controller). Each 10 GbE port is connected back-to-back with one 10GbE port on the SUT.
    • Tuning for Solaris clients:
      /etc/system
        set ddi_msix_alloc_limit=8
        set pcplusmp:apic_multi_msi_max=8
        set pcplusmp:apic_msix_max=8
        set pcplusmp:apic_intr_policy=1
        set nxge:nxge_msi_enable=2
      /kernel/drv/nxge.conf
        soft-lso-enable = 1;
Comments:

I was wondering if you have conducted a similar test for smaller packet sizes (9KB). The reason I ask this question is two-fold:

(1) 64KB packets will see router fragmentation due to the ethernet framing limitations (max 9KB for jumbo frames), so perhaps 9KB sizes would model the end-to-end limitations better

(2) Smaller 9KB packet sizes will result in more packets, and this will result in more "Overheads"

Question: What impact do 9KB MTUs have on Nehalem raw (pre-network) TCP throughputs, and on the number of connections?

I would be glad to elaborate, or engage in a discussion as needed. My email address is anand@maxtier.com

Thanks,

Posted by Anand on June 12, 2009 at 03:02 AM PDT #

"Out of the box" - except when tuning S10U7 :) While it appears this entry has answered one of the two questions posed in the introduction, namely "Can a system achieve X Gbit/s?" it doesn't address the second question mentioned, "Can a system send/receive y packets/sec?" - as the old adage goes "Those who just measure bulk throughput are condemned to run FTP servers" :)

Which of the TX configurations did TSO/LSO? Which of the RX configurations did LRO/GRO?

Why was one of the load generators running SuSE? I would have thought you would want all the load generators running the same bits at least for foolish consistency if nothing else?

Posted by rick jones on June 22, 2009 at 12:04 PM PDT #
