Wednesday May 27, 2009

Out-of-the-Box Network Performance on Nehalem servers

With the upcoming OpenSolaris 2009.06 release, how does networking on Intel Nehalem servers perform vs. Linux? One of the network performance questions customers ask is "Can a system achieve X Gbit/s?" or "Can a system send/receive Y packets/sec?" after they have characterized their workload and found that networking may be the limiting factor. I'll take a look at out-of-the-box performance using the micro-benchmark tool uperf, which shows the server's capabilities.

Benchmark Considerations

Intel Nehalem servers have sufficient CPU cycles and I/O bandwidth to drive one 10GbE port at line rate for large packet sizes. To see a meaningful difference between Solaris and Linux, I chose a four-port configuration that can show differences in throughput. When PCIe gen. 2 10 Gigabit Ethernet is widely available, the four ports can come from two NICs and take only two PCIe slots. Since I use PCIe gen. 1 10 Gigabit Ethernet, the four ports come from four NICs to avoid PCIe bandwidth becoming the bottleneck.

Which Linux?

I chose SUSE Linux Enterprise 11 because SUSE is widely used and it is based on a more recent Linux kernel than the latest Red Hat distro. Compared to SLES 10 or RHEL 5.3, SLES 11 includes TX multiqueue support, which helps TX scale better with more connections. For simplicity, no tuning is done for either Linux or OpenSolaris. Tuning Linux would improve some results.

Test Results

[Chart: Nehalem TX throughput]

[Chart: Nehalem RX throughput]

Engineering Notes

S10 Update 7 throughput drops significantly with more connections because of a high rate of interrupts. Tuning to spread interrupts across more CPUs:

set ddi_msix_alloc_limit=4
set pcplusmp:apic_intr_policy=1
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_multi_msi_max=4


improves performance to be more in line with OpenSolaris. TCP RX throughput more than doubles to 20.6 Gbit/s, and TCP TX throughput is 26.6 Gbit/s at 1000 connections.

Uperf Profiles

uperf uses XML to describe the workload. The profile that describes TCP TX throughput is here. The profile that describes TCP RX throughput is here. The profiles are written for four clients; if a different number of clients is used, modify the number of groups to match the number of clients.
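For readers who have not seen a uperf profile, a minimal TCP TX sketch in uperf's documented XML format is shown below. This is an illustration only, not the exact profile used for these tests: the profile name, thread count, window size, and duration are assumptions, and `$h` stands for the remote host set via an environment variable. The real profiles have one group per client, each pointing at a different remote host.

```xml
<?xml version="1.0"?>
<profile name="tcp-tx-sketch">
  <group nthreads="8">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$h protocol=tcp wndsz=256k"/>
    </transaction>
    <transaction duration="120s">
      <flowop type="write" options="size=8k"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
```

An RX-style profile would use a read flowop in place of the write, so the system under test receives the traffic.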

Test Set-up

  • SUT: Sun Fire X4270, dual-socket Intel Nehalem @ 2.66 GHz. Hyper-Threading is on (i.e. changed from the BIOS default). 24 GB of memory.
  • OS: OpenSolaris 2009.06; Solaris 10 5/09; SUSE Linux Enterprise 11 ( kernel)
  • Four X1106A-Z Sun 10 GbE cards with the Intel 82598EB 10 GbE Controller installed on the SUT. ixgbe driver version compiled on SUSE 11.
  • Clients: Four Sun Fire X4150, dual-socket Intel Xeon @ 3.16 GHz, 8 GB main memory. Three of the X4150s run Solaris Nevada 109 with the Sun Multithreaded 10 GbE Networking Card. One X4150 runs SUSE 10 SP2 with an X1106A-Z Sun 10 GbE card with the Intel 82598EB 10 GbE Controller. Each 10 GbE port is connected back-to-back with one 10GbE port on the SUT.
    • Tuning for Solaris clients (/etc/system):
      set ddi_msix_alloc_limit=8
      set pcplusmp:apic_multi_msi_max=8
      set pcplusmp:apic_msix_max=8
      set pcplusmp:apic_intr_policy=1
      set nxge:nxge_msi_enable=2
    • /kernel/drv/nxge.conf:
      soft-lso-enable = 1;

Tuesday Apr 14, 2009

Network Performance on Nehalem servers


Today Sun announced several Intel Nehalem based servers. I had early access to a (2 socket) Lynx 2 (Sun Fire X4270 server) as well as two Virgo (Sun Blade X6270 server module) to study network performance. For a quick introduction, Lynx 2 is a 2 socket 2U server. Virgo is a 2 socket server blade module that can be used in Sun Blade 6000 and 6048 chassis. I tested Virgo in a Sun Blade 6000. Both Lynx 2 and Virgo support a maximum of 144 GB of memory. Each socket is a quad-core Nehalem. With Hyper-Threading turned on in the BIOS, the operating system sees 16 virtual CPUs. I had Hyper-Threading turned on for all my tests.

Lynx 2 has 4 Gigabit Ethernet ports (code name 'Zoar') on-board, and 6 PCI Express slots that can take PCIe 1.1 or 2.0 cards. The Solaris driver for the on-board Gigabit Ethernet is called igb. I also tested the Sun 10 GbE card with the Intel 82598EB 10 Gigabit Ethernet Controller (code name 'Oplin'; part number X1106A-z for the single-port version, X1107A-z for the dual-port version) for 10 Gigabit Ethernet performance. Oplin is PCIe 1.1 compliant. Its Solaris driver is called ixgbe, which has been available since Solaris 10 Update 6 and OpenSolaris 2008.11. The latest Linux and Windows drivers for Oplin can be downloaded from Intel.

Virgo has 2 Gigabit Ethernet ports on-board, and can either take up to 2 PCI ExpressModules (EMs) to expand dedicated I/O, or up to 2 Network Express Modules (NEMs) to expand common I/O. I tested the Sun Dual 10GbE ExpressModule with the Intel 82598EB 10GE controller (code name 'Oplin EM'; part number X1108A-z) for dedicated 10 Gigabit Ethernet performance, and the Sun Blade 6000 Virtualized Multi-Fabric 10GbE NEM (code name 'NEMHydra') for shared 10 Gigabit Ethernet performance.

Placement of 10 Gigabit Ethernet Cards or ExpressModules

Although the Nehalem servers support PCIe 2.0, Oplin is an x8 PCIe 1.1 card, so the rated bandwidth is 16 Gbit/s per card (from 2 ports) in each direction. After protocol overhead, the measured bandwidth is ~12 Gbit/s in each direction per card. In comparison, two 10GbE ports on separate cards can achieve line rate. So for maximum throughput over multiple 10GbE ports, use 2N single-port Oplin cards rather than N dual-port cards. This also applies to Oplin EM: maximum two-port throughput is achieved with two Oplin EMs, one port per EM, not with both ports of one Oplin EM.

If you use a PCIe 2.0 10GbE card on Nehalem servers, the usable bandwidth is 4 Gbit/s per lane instead of 2, so it is possible to achieve line rate with dual-port cards.
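The lane arithmetic behind these numbers can be sketched as a back-of-the-envelope calculation. It accounts only for 8b/10b line encoding, not for PCIe packet headers or TCP/IP overhead, which is why the measured ~12 Gbit/s per gen. 1 card falls short of the 16 Gbit/s rated figure:

```python
# Back-of-the-envelope PCIe data bandwidth for an x8 card, accounting
# only for 8b/10b line encoding (8 data bits per 10 signal bits).
def pcie_gbps(transfer_rate_gts, lanes=8):
    return transfer_rate_gts * (8 / 10) * lanes  # Gbit/s of data

gen1 = pcie_gbps(2.5)  # PCIe 1.1: 2.5 GT/s/lane -> 2 Gbit/s usable per lane
gen2 = pcie_gbps(5.0)  # PCIe 2.0: 5.0 GT/s/lane -> 4 Gbit/s usable per lane
print(gen1)  # 16.0 -- the rated per-card figure, < 2 x 10GbE line rate
print(gen2)  # 32.0 -- headroom for a dual-port card at line rate
```

This is why a gen. 1 dual-port card tops out below 2x10 Gbit/s, while a gen. 2 x8 card has the headroom for both ports.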

Lynx 2 has PCIe slots on both an active riser and a passive riser. You can place 10GbE cards in any slot and won't see a material performance difference.

Tuning ixgbe on Solaris

The ixgbe driver supports both Oplin and the Oplin ExpressModule. Hardware LSO is enabled by default. For Solaris Nevada 110 or later, only the bcopy threshold and the number of MSI-X interrupts per port need to be tuned:

set ddi_msix_alloc_limit=4

Multiple transmit DMA channels are supported in Solaris Nevada but not in Solaris 10 yet, so there is more contention for transmit DMA channels on S10 with a large number of connections. For Solaris 10 Update 7, more tunables are required to target interrupts at different CPUs (apic_intr_policy) and to allow more MSI-X interrupts per port:

set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1

There was a bug in ixgbe for Solaris 10 Update 6 that prevents multiple receive DMA channels from working correctly, so the only recommended tuning on S10U6 is:

Oplin Performance on Lynx 2

I tested a 2.66 GHz Lynx 2 with 24 GB main memory on Solaris Nevada 106. Oplin can transmit or receive 64KByte messages at line rate over one socket connection. Scaling performance is examined in two ways: with the number of ports and with the number of connections. The maximum throughput using 4 Oplin cards, one port per card, is 36.3 Gbit/second TCP transmit and 32.1 Gbit/second TCP receive, so throughput scales very well to 4 ports.

The data below shows that throughput scales well with the number of connections for TCP transmit, and less so for TCP receive.

TCP TX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 31981.64 41 (1/40)
32 256k 35972.53 59 (2/57)
100 256k 36380.51 67 (2/65)
1000 32k 34716.86 97 (3/94)
4000 32k 31180.33 94 (3/91)

TCP RX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 23947.91 61 (2/58)
32 256k 32127.24 100 (3/96)
100 256k 22793.04 100 (1/99)
1000 32k 22844.36 100 (3/97)
4000 32k 22079.65 100 (3/97)

For UDP traffic, 4 ports can transmit an impressive 4.3 million 64-byte-payload packets/second, or 1460-byte datagrams at 25 Gbit/second. TCP throughput is higher than UDP because Solaris supports Large Segment Offload (LSO) for TCP, but not for UDP at this time.
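The relationship between packet rate and payload throughput in these figures is simple payload-only arithmetic (Ethernet/IP/UDP header overhead is ignored here):

```python
# Convert between packet rate and payload throughput.
# Payload bytes only; header and framing overhead are ignored.
def payload_gbps(pps, payload_bytes):
    return pps * payload_bytes * 8 / 1e9

def pps_for_gbps(gbps, payload_bytes):
    return gbps * 1e9 / (payload_bytes * 8)

# 4.3M 64-byte packets/s carries only ~2.2 Gbit/s of payload:
# small-packet tests are limited by per-packet cost, not wire bandwidth.
print(payload_gbps(4.3e6, 64))       # ~2.2
# 25 Gbit/s of 1460-byte datagrams is ~2.1M packets/s.
print(pps_for_gbps(25, 1460) / 1e6)  # ~2.1
```

The contrast between the two cases shows why packets/second, not Gbit/s, is the interesting metric for small-packet workloads.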

To measure latency, I connected Lynx 2 to Virgo back-to-back and ran netpipe. Single-thread round-trip latency for small TCP packets is measured at 25 microseconds.

Oplin Express Module Performance on Virgo

I tested a 2.8 GHz Virgo with 48 GB of memory on Solaris Nevada 111. Oplin EM can transmit or receive 64KByte messages at line rate over one socket connection. Throughput remains at line rate with an increased number of connections for 1 port. 2 port scaling with the number of connections is shown below.

TCP TX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 16533.43 25 (0/25)
32 256k 18266.85 34 (1/34)
100 256k 18354.36 37 (1/36)
1000 32k 18478.93 51 (2/49)
4000 32k 18296.30 64 (2/63)

TCP RX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 18034.17 52 (2/50)
32 256k 18369.19 69 (3/66)
100 256k 18324.59 81 (3/78)
1000 32k 18425.31 91 (4/87)
4000 32k 17670.74 100 (5/95)

For UDP traffic, 2 ports can transmit 3.9 million 64-byte packets/second, a little higher than 1 port at 3.5 million packets/second. UDP transmit throughput with 1460-byte datagrams is 18.7 Gbit/second.


Single-thread throughput can reach 9+ Gbit/second on Nehalem servers. Lynx 2 can scale throughput across four 10GbE ports, and Virgo across two 10GbE ports, near line rate on Solaris. They provide excellent network performance for web and HPC applications.



