Network Performance on Nehalem servers
By user12608924 on Apr 14, 2009
Today Sun announced several Intel Nehalem based servers. I had early access to a (2 socket) Lynx 2 (Sun Fire X4270 server) as well as two Virgos (Sun Blade X6270 server modules) to study network performance. For a quick introduction, Lynx 2 is a 2 socket 2U server. Virgo is a 2 socket server blade module that can be used in the Sun Blade 6000 and 6048 chassis; I tested Virgo in a Sun Blade 6000. Both Lynx 2 and Virgo support a maximum of 144GB of memory. Each socket holds a quad-core Nehalem processor. With Hyper-Threading turned on in the BIOS, the operating system sees 16 virtual CPUs. I had Hyper-Threading turned on for all my tests.
Lynx 2 has 4 Gigabit Ethernet ports (code name 'Zoar') on-board, and 6 PCI Express slots that accept PCIe 1.1 or 2.0 cards. The Solaris driver for the on-board Gigabit Ethernet is called igb. I also tested the Sun 10 GbE card with the Intel 82598EB 10 Gigabit Ethernet controller (code name 'Oplin'; part number X1106A-z for the single-port version, X1107A-z for the dual-port version) for 10 Gigabit Ethernet performance. Oplin is PCIe 1.1 compliant. Its Solaris driver is called ixgbe, which has been available since Solaris 10 Update 6 and OpenSolaris 2008.11. The latest Linux and Windows drivers for Oplin can be downloaded from Intel.
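To double-check which Solaris driver instance is behind each port before testing, a couple of standard commands suffice (a sanity check, not part of the measurements):

    # list data links and their driver instances
    dladm show-link
    # confirm the ixgbe module is loaded
    modinfo | grep ixgbe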
Virgo has 2 Gigabit Ethernet ports on-board, and can either plug in up to 2 PCI ExpressModules (EMs) to expand dedicated I/O, or add up to 2 Network Express Modules (NEMs) to expand common I/O. I tested the Sun Dual 10GbE ExpressModule with the Intel 82598EB 10GE controller (code name 'Oplin EM'; part number X1108A-z) for dedicated 10 Gigabit Ethernet performance, and the Sun Blade 6000 Virtualized Multi-Fabric 10GbE NEM (code name 'NEMHydra') for shared 10 Gigabit Ethernet performance.
Placement of 10 Gigabit Ethernet Cards or ExpressModules
Although the Nehalem servers support PCIe 2.0, Oplin is an x8 PCIe 1.1 card, so the rated bandwidth is 16 Gbit/s per card (from 2 ports) in each direction. After protocol overhead, the measured bandwidth is ~12 Gbit/s in each direction per card. In comparison, two 10GbE ports on separate cards can achieve line rate. So for maximum throughput over multiple 10GbE ports, use 2N single-port Oplin cards rather than N dual-port cards. This also applies to Oplin EM: maximum two-port throughput is achieved with two Oplin EMs, one port per EM, not by using both ports of one Oplin EM.
If you use a PCIe 2.0 10GbE card on Nehalem servers, the usable bandwidth is 4 Gbit/s per lane instead of 2, so it's possible to achieve line rate with dual-port cards.
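For reference, the per-slot arithmetic behind those numbers (both PCIe 1.1 and 2.0 use 8b/10b encoding, so usable bandwidth is 80% of the signaling rate):

    PCIe 1.1 x8: 8 lanes x 2.5 GT/s x 8/10 = 16 Gbit/s in each direction
    PCIe 2.0 x8: 8 lanes x 5.0 GT/s x 8/10 = 32 Gbit/s in each direction

Two 10GbE ports need 20 Gbit/s, which fits within a PCIe 2.0 x8 slot but not within a PCIe 1.1 slot once packet and protocol overhead is taken out of the 16 Gbit/s figure.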
Lynx 2 has PCIe slots on both the active riser and the passive riser. You can place 10GbE cards in any slot and won't see a material performance difference.
Tuning ixgbe on Solaris
The ixgbe driver supports both Oplin and the Oplin ExpressModule. Hardware LSO is enabled by default. For Solaris Nevada 110 or later, only the bcopy threshold and the number of MSI-X interrupts per port need to be tuned:
/etc/system:
    set ddi_msix_alloc_limit=4

/kernel/drv/ixgbe.conf:
    tx_copy_threshold=1024;
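As a minimal sketch of applying these (the /etc/system change requires a reboot to take effect):

    # append the tunables, then reboot so the settings are picked up
    echo 'set ddi_msix_alloc_limit=4' >> /etc/system
    echo 'tx_copy_threshold=1024;' >> /kernel/drv/ixgbe.conf
    reboot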
Multiple transmit DMA channels are supported in Solaris Nevada but not in Solaris 10 yet, so there is more contention for the transmit DMA channel on S10 with a large number of connections. For Solaris 10 Update 7, more tunables are required to target interrupts at different CPUs (apic_intr_policy) and to allow more MSI-X interrupts per port:
/etc/system:
    set ddi_msix_alloc_limit=4
    set pcplusmp:apic_multi_msi_max=4
    set pcplusmp:apic_msix_max=4
    set pcplusmp:apic_intr_policy=1

/kernel/drv/ixgbe.conf:
    tx_copy_threshold=1024;
    rx_queue_number=4;

There was a bug in ixgbe for Solaris 10 Update 6 that prevents multiple receive DMA channels from working correctly, so the only recommended tuning on S10U6 is:

/kernel/drv/ixgbe.conf:
    tx_copy_threshold=1024;
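After rebooting with the Solaris 10 Update 7 settings above, a quick sanity check, sketched here, is to confirm the MSI-X limit took effect and that ixgbe interrupt activity is spread over several CPUs:

    # print the current value of the kernel tunable
    echo 'ddi_msix_alloc_limit/D' | mdb -k
    # watch per-CPU interrupt activity for the ixgbe instances
    intrstat 5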
Oplin Performance on Lynx 2
I tested a 2.66 GHz Lynx 2 with 24 GB of main memory on Solaris Nevada build 106. Oplin can transmit or receive 64 KByte messages at line rate over one socket connection. Scaling is examined in two ways: with the number of ports and with the number of connections. The maximum throughput using 4 Oplin cards, one port per card, is 36.3 Gbit/second TCP transmit and 32.1 Gbit/second TCP receive, so throughput scales very well to 4 ports.
The data below shows that throughput scales well with the number of connections for TCP transmit, and less so for TCP receive.
TCP TX tests using uperf with msg. size = 8192
TCP RX tests using uperf with msg. size = 8192
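For readers who want to reproduce this kind of run, a uperf transmit profile looks roughly like the sketch below; the hostname, thread count, and duration are placeholders rather than the exact values I used:

tcp-tx-8k.xml:
    <?xml version="1.0"?>
    <profile name="tcp-tx-8k">
      <group nthreads="8">
        <transaction iterations="1">
          <flowop type="connect" options="remotehost=192.168.1.2 protocol=tcp"/>
        </transaction>
        <transaction duration="60s">
          <flowop type="write" options="size=8k"/>
        </transaction>
        <transaction iterations="1">
          <flowop type="disconnect"/>
        </transaction>
      </group>
    </profile>

Start 'uperf -s' on the remote system and 'uperf -m tcp-tx-8k.xml' on the system under test; for a receive test the writes come from the remote side instead (or the write flowop is replaced with a read).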
For UDP traffic, 4 ports can transmit an impressive 4.3 million 64-byte payload packets/second, or 1460-byte datagrams at 25 Gbit/second. TCP throughput is higher than UDP because Solaris supports Large Segment Offload for TCP, but not for UDP at this time.
To measure latency, I connected Lynx 2 to Virgo back-to-back and ran netpipe. Single-thread round-trip latency for small TCP packets was measured at 25 microseconds.
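The NetPIPE run itself is simple; a sketch, with the hostname as a placeholder:

    # on one host (the receiver), start NetPIPE's TCP test with no arguments:
    NPtcp
    # on the other host, point the transmitter at the receiver:
    NPtcp -h receiver-hostname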
Oplin Express Module Performance on Virgo
I tested a 2.8 GHz Virgo with 48 GB of memory on Solaris Nevada build 111. Oplin EM can transmit or receive 64 KByte messages at line rate over one socket connection. Throughput remains at line rate as the number of connections increases on 1 port; 2-port scaling with the number of connections is shown below.
TCP TX tests using uperf with msg size = 8192
TCP RX tests using uperf with msg size = 8192
For UDP traffic, 2 ports can transmit 3.9 million 64-byte packets/second, a little higher than 1 port at 3.5 million packets/second. UDP transmit throughput with 1460-byte datagrams is 18.7 Gbit/second.
Single-thread throughput can reach 9+ Gbit/second on Nehalem servers. Lynx 2 can scale throughput to four 10GbE ports, and Virgo can scale to two 10GbE ports at near line rate, on Solaris. They provide excellent network performance for web and HPC applications.