Tuesday Apr 14, 2009

Network Performance on Nehalem servers


Today Sun announced several Intel Nehalem based servers. I had early access to a (2 socket) Lynx 2 (Sun Fire X4270 server) as well as two Virgo (Sun Blade X6270 server module) to study network performance. For a quick introduction, Lynx 2 is a 2 socket 2U server. Virgo is a 2 socket server blade module that can be used on Sun Blade 6000 and 6048 chassis. I tested Virgo on Sun Blade 6000. Both Lynx 2 and Virgo support a maximum of 144GB of memory. Each socket is a quad core Nehalem. With Hyper-Threading turned on from BIOS, the operating system sees 16 virtual CPUs. I had Hyper-Threading turned on for all my tests.

Lynx 2 has 4 Gigabit Ethernet ports (code name 'Zoar') on-board, and 6 PCI Express slots that accept PCIe 1.1 or 2.0 cards. The Solaris driver for on-board Gigabit Ethernet is called igb. I also tested Sun 10 GbE with Intel 82598EB 10 Gigabit Ethernet Controller (code name 'Oplin'; part number X1106A-z for single-port version, X1107A-z for dual-port version) for 10 Gigabit Ethernet performance. Oplin is PCIe 1.1 compliant. Its Solaris driver is called ixgbe, which has been available since Solaris 10 Update 6 and OpenSolaris 2008.11. The latest Linux and Windows drivers for Oplin can be downloaded from Intel.

Virgo has 2 Gigabit Ethernet ports on-board, and can either plug in up to 2 PCI ExpressModules (EM) to expand dedicated I/O, or add up to 2 NEM (Network Express Modules) to expand common I/O. I tested Sun Dual 10GbE ExpressModule with Intel 82598EB 10GE controller (code name 'Oplin EM'; part number x1108A-z) for dedicated 10 Gigabit Ethernet performance, and Sun Blade 6000 Virtualized Multi-Fabric 10GbE NEM (code name 'NEMHydra') for shared 10 Gigabit Ethernet performance.

Placement of 10 Gigabit Ethernet Cards or ExpressModules

Although the Nehalem servers support PCIe 2.0, Oplin is an x8 PCIe 1.1 card, so the rated bandwidth is 16 Gbit/s per card (shared by both ports) in each direction. After protocol overhead, the measured bandwidth is ~12 Gbit/s in each direction per card. In comparison, two 10GbE ports on separate cards can achieve line rate. So for maximum throughput over multiple 10GbE ports, use 2N single-port Oplin cards rather than N dual-port Oplin cards. This also applies to Oplin EM: maximum two-port throughput is achieved with two Oplin EMs, one port per EM, and not with both ports of one Oplin EM.

If you use a PCIe 2.0 10GbE card on Nehalem servers, the usable bandwidth is 4 Gbit/s per lane instead of 2, so it's possible to achieve line rate with dual-port cards.
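
The slot arithmetic above can be sketched in a few lines of Python. The per-lane figures (2 Gbit/s for PCIe 1.1, 4 Gbit/s for PCIe 2.0) come from the text. The 0.75 efficiency factor is an assumption derived from the ~12 of 16 Gbit/s measured on Oplin; applying it to a PCIe 2.0 card is an extrapolation, not a measurement, and the function names are made up for illustration.

```python
# Usable payload bandwidth per lane, per direction, in Gbit/s (from the text).
GBITS_PER_LANE = {"1.1": 2.0, "2.0": 4.0}

# Assumption: ~12 of 16 Gbit/s was measured on Oplin, about 75% of rated.
MEASURED_EFFICIENCY = 12.0 / 16.0


def rated_gbits(pcie_gen, lanes=8):
    """Rated bandwidth of a card in one direction."""
    return GBITS_PER_LANE[pcie_gen] * lanes


def expected_gbits(pcie_gen, ports, lanes=8, port_rate=10.0):
    """Estimated throughput: limited by either the ports or the slot."""
    slot_limit = rated_gbits(pcie_gen, lanes) * MEASURED_EFFICIENCY
    return min(ports * port_rate, slot_limit)


if __name__ == "__main__":
    print(rated_gbits("1.1"))            # 16.0
    print(expected_gbits("1.1", 2))      # 12.0: dual-port Oplin is slot-limited
    print(expected_gbits("1.1", 1) * 2)  # 20.0: two single-port cards
    print(expected_gbits("2.0", 2))      # 20.0: dual-port PCIe 2.0 card
```

Under these assumptions, a dual-port PCIe 1.1 card tops out at the slot limit, while two single-port cards (or one dual-port PCIe 2.0 card) are limited only by the ports.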

Lynx 2 has PCIe slots on either an active riser or a passive riser. You can place 10GbE cards in any slot and won't see a material performance difference.

Tuning ixgbe on Solaris

The ixgbe driver supports both Oplin and Oplin ExpressModule. Hardware LSO is enabled by default. For Solaris Nevada build 110 or later, only the bcopy threshold and the number of MSI-X interrupts per port need to be tuned.

set ddi_msix_alloc_limit=4

Multiple transmit DMA channels are supported in Solaris Nevada but not yet in Solaris 10, so there is more contention for the transmit DMA channel on S10 with a large number of connections. For Solaris 10 Update 7, more tunables are required to target interrupts to different CPUs (apic_intr_policy) and to allow more MSI-X interrupts per port.

set ddi_msix_alloc_limit=4
set pcplusmp:apic_multi_msi_max=4
set pcplusmp:apic_msix_max=4
set pcplusmp:apic_intr_policy=1

There was a bug in ixgbe for Solaris 10 Update 6 that prevented multiple receive DMA channels from working correctly, so the only recommended tuning on S10U6 is:

Oplin Performance on Lynx 2

I tested a 2.66 GHz Lynx 2 with 24 GB of main memory on Solaris Nevada build 106. Oplin can transmit or receive 64 KByte messages at line rate over one socket connection. Scaling performance is examined in two ways: with the number of ports and with the number of connections. The maximum throughput using 4 Oplin cards, one port per card, is 36.3 Gbit/second TCP transmit and 32.1 Gbit/second TCP receive, so the throughput scales very well to 4 ports.

The data below shows throughput scales well with number of connections for TCP transmit, and less so with TCP receive.

TCP TX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 31981.64 41 (1/40)
32 256k 35972.53 59 (2/57)
100 256k 36380.51 67 (2/65)
1000 32k 34716.86 97 (3/94)
4000 32k 31180.33 94 (3/91)

TCP RX tests using uperf with msg. size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 23947.91 61 (2/58)
32 256k 32127.24 100 (3/96)
100 256k 22793.04 100 (1/99)
1000 32k 22844.36 100 (3/97)
4000 32k 22079.65 100 (3/97)
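
One way to read the two tables together is throughput per percent of CPU. The sketch below copies the rows from the tables above and derives that ratio; the metric and the helper names are illustrative additions, not something uperf reports.

```python
# Rows copied from the Lynx 2 tables above: (connections, Mbps, total CPU %).
TX = [(8, 31981.64, 41), (32, 35972.53, 59), (100, 36380.51, 67),
      (1000, 34716.86, 97), (4000, 31180.33, 94)]
RX = [(8, 23947.91, 61), (32, 32127.24, 100), (100, 22793.04, 100),
      (1000, 22844.36, 100), (4000, 22079.65, 100)]


def mbps_per_cpu_percent(rows):
    """Map connection count to throughput per 1% of total CPU."""
    return {conn: mbps / cpu for conn, mbps, cpu in rows}


def peak(rows):
    """Return the (connections, Mbps) pair with the highest throughput."""
    conn, mbps, _ = max(rows, key=lambda r: r[1])
    return conn, mbps


if __name__ == "__main__":
    for name, rows in (("TX", TX), ("RX", RX)):
        print(name, "peak:", peak(rows))
        for conn, eff in mbps_per_cpu_percent(rows).items():
            print(f"  {conn:>5} conn: {eff:7.1f} Mbps per CPU %")
```

Transmit peaks at 100 connections and receive at 32, and receive is already CPU-bound from 32 connections up, which matches the observation that receive scales less well.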

For UDP traffic, 4 ports can transmit an impressive 4.3 million 64-byte payload packets/second, or 1460-byte datagrams at 25 Gbit/second. TCP throughput is higher than UDP because Solaris supports Large Segment Offload for TCP, but not for UDP at this time.
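
To put the small-packet rate in perspective, the following sketch converts packet rate to bandwidth. The header sizes are assumptions (UDP over IPv4 on untagged Ethernet, no IP options), and treating the 25 Gbit/second figure as payload throughput is also an assumption; neither is stated in the measurements.

```python
# Assumed per-packet overhead in bytes for UDP over IPv4 on Ethernet
# (no IP options, no VLAN tag); these sizes are not stated in the post.
UDP_HDR, IPV4_HDR, ETH_HDR, ETH_FCS = 8, 20, 14, 4
PREAMBLE_AND_GAP = 8 + 12  # preamble/SFD plus inter-frame gap


def payload_gbits(pps, payload_bytes):
    """Payload throughput in Gbit/s for a given packet rate."""
    return pps * payload_bytes * 8 / 1e9


def wire_gbits(pps, payload_bytes):
    """Bandwidth consumed on the wire, including all assumed overhead."""
    frame = (payload_bytes + UDP_HDR + IPV4_HDR + ETH_HDR + ETH_FCS
             + PREAMBLE_AND_GAP)
    return pps * frame * 8 / 1e9


def packets_per_second(gbits, payload_bytes):
    """Packet rate needed to carry a given payload throughput."""
    return gbits * 1e9 / (payload_bytes * 8)


if __name__ == "__main__":
    print(payload_gbits(4.3e6, 64))      # payload carried by 64-byte packets
    print(wire_gbits(4.3e6, 64))         # same traffic measured on the wire
    print(packets_per_second(25, 1460))  # packet rate for 1460-byte datagrams
```

With these assumptions, 4.3 million 64-byte packets/second carry about 2.2 Gbit/s of payload, so the small-packet test is bounded by per-packet cost rather than by link bandwidth.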

To measure latency, I connected Lynx 2 to Virgo back-to-back and ran netpipe. Single-thread round-trip latency for small TCP packets is measured at 25 microseconds.

Oplin Express Module Performance on Virgo

I tested 2.8 GHz Virgo with 48GB of memory on Solaris Nevada 111. Oplin EM can transmit or receive 64KByte messages at line rate over one socket connection. Throughput remains at line rate with increased number of connections for 1 port. 2 port scaling with number of connections is shown below.

TCP TX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 16533.43 25 (0/25)
32 256k 18266.85 34 (1/34)
100 256k 18354.36 37 (1/36)
1000 32k 18478.93 51 (2/49)
4000 32k 18296.30 64 (2/63)

TCP RX tests using uperf with msg size = 8192

#conn wndsz Mbps cpu(usr/sys)
8 256k 18034.17 52 (2/50)
32 256k 18369.19 69 (3/66)
100 256k 18324.59 81 (3/78)
1000 32k 18425.31 91 (4/87)
4000 32k 17670.74 100 (5/95)

For UDP traffic, 2 ports can transmit 3.9 million 64 byte packets/second, a little higher than 1 port at 3.5 million packets/second. UDP transmit throughput with 1460 byte datagram is 18.7 Gbit/second.


Single-thread throughput can reach 9+ Gbit/second on Nehalem servers. Lynx 2 can scale throughput to 4 10GbE ports, and Virgo can scale to 2 10GbE ports near line rate on Solaris. They provide excellent network performance for web and HPC applications.

Monday Nov 10, 2008

Network Performance on Open Storage Appliance

Sun Storage 7000 Unified Storage Systems launched this Monday. This is the industry's first open storage appliance and will deliver breakthrough performance, speed and simplicity. I will share an inside view of how network performance is optimized for it. Most of the optimizations are generic and can be applied to other storage systems running Solaris.

1 GBytes/sec and Beyond

Early on there was the question of whether networking performance work should focus on 10GbE or GigE. If the system can sustain 1 GBytes/sec throughput or more, then it makes sense to focus on 10GbE performance first, and multiple GigE performance will be easy. 1 GBytes/sec is a nice round number to remember. Can it be done? Using uperf as a micro-benchmark tool to model an NFS workload, throughput on a 2 socket quad core AMD Barcelona system can reach 900 MBytes/s, and on a 4 socket quad core Barcelona system it can reach 10GbE line rate, so the answer is 'yes'. Results from uperf were later validated with NFS cached read throughput that exceeds 1 GBytes/sec. Later work focused on reducing CPU utilization and latency while not missing the 1 GBytes/sec mark.

A lot of hard work goes into reaching 1 GBytes/sec on one NAS node. The key metrics from a network performance perspective are throughput, CPU efficiency, and latency. Packet rate is not a key metric because IOPS for storage are much lower than packets/sec for network. I'll explain how the tuning decisions are made.

CPU Efficiency

Networking is CPU intensive, and leaving enough CPU cycles to do other work is important for NAS. CPU efficiency measures the amount of work done divided by the amount of CPU used to do the work. If the same amount of work is done using less CPU, then it's more CPU efficient and more desirable. CPU efficiency for networking can be measured by bytes/cycle, i.e. amount of bytes transferred divided by CPU cycles. Roch has written a DTrace script that measures the inverse of bytes/cycle - cycles/byte, and it is used to help make tuning decisions.
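
As a rough illustration of the metric, cycles/byte can be estimated from system-wide figures. This is not Roch's DTrace script, which measures the cost directly; it is a back-of-the-envelope sketch, and the function names and the numbers in the example are hypothetical.

```python
def cycles_per_byte(bytes_per_second, cpu_util_percent, n_cpus, cpu_hz):
    """Estimate CPU cost as busy cycles spent per byte transferred."""
    busy_cycles_per_second = cpu_util_percent / 100 * n_cpus * cpu_hz
    return busy_cycles_per_second / bytes_per_second


def bytes_per_cycle(bytes_per_second, cpu_util_percent, n_cpus, cpu_hz):
    """CPU efficiency: the inverse of cycles/byte."""
    return 1 / cycles_per_byte(bytes_per_second, cpu_util_percent,
                               n_cpus, cpu_hz)


if __name__ == "__main__":
    # Hypothetical system: 1 GByte/s at 50% utilization of 8 CPUs at 2.3 GHz.
    print(cycles_per_byte(1e9, 50, 8, 2.3e9))
```

For the hypothetical system in the example this works out to roughly 9.2 cycles/byte; a lower number means the same work is done with less CPU.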

Large Segment Offload

Large Segment Offload works by sending large buffers (buffer size can be up to 64 KBytes in Solaris) down the protocol stack and letting the lower layer fragment it into packets, instead of fragmenting at the IP layer. Hardware LSO relies on network interface card to split the buffer into separate packets, while software LSO relies on the network driver to fragment into packets. The 10GbE for Sun Storage 7000 Unified Storage Systems (the 10GbE driver is called nxge) supports software LSO, and it helps CPU efficiency and throughput for transmit. LSO is enabled by default for nxge, and may be incorporated later for GigE (e1000g).
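
The segmentation step that LSO pushes down to the driver or NIC can be sketched as follows. This only illustrates how a large buffer is split; a real implementation also replicates and adjusts the TCP/IP headers for every packet. The 1460-byte segment size is an assumption (1500-byte MTU with IPv4 and TCP headers and no options), and the function name is made up.

```python
from typing import List


def segment(buffer: bytes, mss: int = 1460) -> List[bytes]:
    """Split one large send buffer into MSS-sized payload chunks."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [buffer[i:i + mss] for i in range(0, len(buffer), mss)]


if __name__ == "__main__":
    big = bytes(64 * 1024)          # one 64 KByte buffer handed down the stack
    chunks = segment(big)
    print(len(chunks))              # number of packets produced
    print(len(chunks[-1]))          # size of the final, partial chunk
```

With LSO the upper layers traverse the stack once per 64 KByte buffer instead of once per packet, which is where the CPU savings come from.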

Jumbo Frames 

If your network infrastructure allows jumbo frames, they help improve CPU efficiency. Jumbo frames are enabled by default, as auto-negotiation for MTU between client and server should just work.

Interrupts vs. Soft Rings 

At 10GbE speed a NIC can generate more interrupts than a single CPU can handle. Solaris provides 2 mechanisms to offload the interrupt CPU so that it's not a performance bottleneck: interrupting multiple CPUs using MSI or MSI-X (Message Signaled Interrupts and its successor, MSI-X), and soft rings. Interrupting multiple CPUs relies on the multiple DMA channels on a NIC to generate interrupts; the interrupts from each DMA channel can be directed to a different CPU, so packet receive processing is done in interrupt context on multiple CPUs. Soft rings don't rely on NIC hardware having multiple DMA channels; they fan out packets from the interrupt service routine to kernel threads for receive processing. Three combinations were compared for an NFS workload: MSI-X only, soft rings only, and MSI-X + soft rings. The results are as expected: MSI-X only provides the best CPU efficiency (~8% better) at a similar throughput.
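
The fan-out idea behind soft rings can be sketched in user-level Python. This is a conceptual model only: the hash below (CRC32 over the connection 4-tuple) and the function names are assumptions for illustration and are not how the Solaris kernel selects a soft ring.

```python
import zlib
from collections import defaultdict


def ring_for(src_ip, src_port, dst_ip, dst_port, n_rings=16):
    """Pick a ring so that all packets of one connection land together."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_rings


def fan_out(packets, n_rings=16):
    """Distribute (src_ip, src_port, dst_ip, dst_port, payload) tuples."""
    rings = defaultdict(list)
    for pkt in packets:
        rings[ring_for(*pkt[:4], n_rings=n_rings)].append(pkt)
    return rings


if __name__ == "__main__":
    pkts = [("10.0.0.%d" % (i % 8), 40000 + i % 8, "10.0.1.1", 2049, b"x")
            for i in range(64)]
    for ring, queued in sorted(fan_out(pkts).items()):
        print(ring, len(queued))
```

Keeping each connection on one ring preserves packet order within the connection while letting different connections be processed on different CPUs.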

Number of Transmit DMA Channels

One anomaly that stood out in our performance study is that CPU efficiency for nxge drops significantly as the number of clients increases. mpstat shows a significant increase in context switches, and CPU performance counters show HyperTransport utilization above the recommended 60%. Although multiple transmit DMA channels (aka TX rings, as they are ring buffers) are needed for maximum packet rate, reducing the number of TX rings reduces context switches and thread migration. Experiments show that 1 TX ring limits throughput, while 2 TX rings reach maximum throughput and improve CPU efficiency by up to 50%.

Configuration   conn   Gbits/s
all TX rings    12     4.9       11.1
                40     5.8       15.7
2 TX rings      12     6.3       9.0
                40     8.2       10.8


Latency matters for storage clients because it can become a limiting factor for client throughput. One important component of network latency is interrupt blanking: instead of interrupting the CPU immediately, a NIC can delay generating an interrupt so that if more packets arrive during that interval, they can be processed together, which is more CPU efficient. Reducing the interrupt interval to the minimum value for nxge (~25us) improves throughput by up to 30% for 8 or fewer connections; CPU cost (cycles/byte) can increase by up to 10%.
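
The trade-off behind interrupt blanking can be made concrete with a simple model. The 25 microsecond interval comes from the text; the 1538 bytes per frame on the wire (1500-byte MTU plus Ethernet header, FCS, preamble and inter-frame gap) and the function names are assumptions for illustration.

```python
def max_interrupts_per_second(blanking_us):
    """Upper bound on interrupt rate for one DMA channel."""
    return 1e6 / blanking_us


def packets_per_interrupt(link_gbits, frame_bytes_on_wire, blanking_us):
    """Packets that can arrive at line rate during one blanking interval."""
    packets_per_second = link_gbits * 1e9 / (frame_bytes_on_wire * 8)
    return packets_per_second * blanking_us / 1e6


if __name__ == "__main__":
    print(max_interrupts_per_second(25))        # interrupts/s at 25 us
    print(packets_per_interrupt(10, 1538, 25))  # packets batched per interrupt
```

A shorter interval means a single packet waits less before it is processed, which helps when only a few connections are active, but fewer packets are batched per interrupt, which is consistent with the higher cycles/byte reported above.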

Wednesday Apr 09, 2008

10 Gigabit Ethernet on UltraSPARC T2 Plus

10 Gigabit Ethernet on UltraSPARC T2 Plus Systems


Sun has launched the industry's first dual-socket chip multi-threading technology (CMT) platforms -- a new family of virtualized servers based on the UltraSPARC T2 Plus (Victoria Falls) processor. The Sun SPARC Enterprise T5140 and T5240 servers - code-named "Maramba" - are Sun's third-generation CMT systems, extending the product family beyond the previously-announced Sun SPARC Enterprise T5120 and T5220 servers, with which they share numerous features.

The UltraSPARC T2 Plus based systems have a Neptune chip integrated on-board for flexible GbE/10GbE deployment. 4 Gigabit Ethernet ports are ready to use out of the box. If one optional XAUI card is added, there will be one 10 Gigabit Ethernet port, and 3 of the 4 Gigabit Ethernet ports will be available. When two XAUI cards are added, there will be two 10 Gigabit Ethernet ports and 2 Gigabit Ethernet ports available.

The two XAUI slots connect to the same x8 PCI Express switch, so aggregate throughput from both XAUI cards is limited by PCI Express bandwidth. According to the PCI Express® 1.1 specification, bandwidth for payload on x8 lanes is 16 Gbits/s per direction. In reality, the bandwidth of a single x8 PCI Express switch is about 12 Gbits/s transmit, 13 Gbits/s receive, and 18.2 Gbits/s bi-directional, measured on T5240 under Solaris 10 Update 5. Since T5140/T5240 systems have two PCI Express switches, the recommended configuration for maximum aggregate throughput with two 10GbE ports is as follows: one XAUI card, plus one Atlas (Sun Dual 10GbE PCI Express Card) on a different PCI Express switch, which would be one of PCI Express slot 1, 2, or 3 (T5240 only).

Software Large Segment Offload

Large segment offload (LSO) is an offloading technique to increase transmit throughput by reducing CPU overhead. The technique is also called TCP segmentation offload (TSO) when it applies to TCP. It works by sending large buffers (buffer size can be up to 64 KBytes in Solaris) down the protocol stack and letting the lower layer fragment it into packets, instead of fragmenting at the IP layer. Hardware LSO relies on network interface card to split the buffer into separate packets, while software LSO relies on the network driver to fragment into packets.

The benefits of LSO are:

  • better single-thread transmit throughput (+30% on T5240 for 64KB TCP messages)
  • better transmit throughput (+20% on multi-port T5240)
  • lower CPU utilization (-10% on T5220 NIU)

Software LSO is a new feature in the nxge driver. It has been available in Solaris Nevada since build 82. It is available as patch 138048 for Solaris 10 Update 5, and is part of Solaris 10 Update 6.


10GbE requires minimal tuning for throughput in Solaris 10 Update 7 or earlier. In /etc/system,

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don't want to reboot the system, you can use ndd to change the number of soft rings

ndd -set /dev/ip ip_soft_rings_cnt 16

and plumb nxge for the tuning to take effect. The default for Solaris 10 Update 8 or later is 16 soft rings for 10GbE, so the tuning is no longer needed. You can read more about soft rings in my blog about 10GbE on UltraSPARC T2.

Software LSO is disabled by default. To maximize throughput and/or reduce CPU utilization, it should be enabled. To enable it, edit /platform/sun4v/kernel/drv/nxge.conf and uncomment the line

soft-lso-enable = 1; 


The system used is a 1.4 GHz, 8 core T5240 with 128 GB of memory, 4 Sun Dual 10GbE PCI Express Cards, and Solaris 10 Update 4 with patches, with software LSO enabled. Performance is measured with uPerf 0.2.6.

T5240 10GbE throughput (single thread)

TX (MTU 1500)   RX (MTU 1500)   TX (jumbo frame)   RX (jumbo frame)
1.40            1.50            2.40               2.01

Maximum network throughput is achieved with multiple threads (i.e. multiple socket connections):

T5240 10GbE throughput (multiple threads)

# ports   metric          TX      RX
1         Gbit/s          9.30    9.46
          cpu util. (%)
2         Gbit/s          18.47   17.44
          cpu util. (%)   46
4         Gbit/s          21.78   20.44
          cpu util. (%)

There is no material difference between the performance of on-board XAUI card and Sun Multi-threaded 10GbE PCI Express Card.

4 GB DIMM gives higher receive performance and memory bandwidth

Different TCP receive throughput was observed on two different T5240 systems, and it was due to the different memory bandwidth of different DIMM sizes. The 2 GB DIMMs currently shipping with T5240 have 4 banks, while the 4 GB DIMMs have 8 banks, and more banks give more memory bandwidth. Comparing STREAM benchmark results between 16 x 2 GB and 32 x 4 GB memory configurations on the same T5240, the 4 GB DIMM configuration gives 18-25% higher memory bandwidth. The increased memory bandwidth from 4 GB DIMMs gives the TCP receive throughput above. Using 2 GB DIMMs, with the same or higher memory interleave factor (16 x 2 GB, 32 x 2 GB vs. 16 x 4 GB), leads to a 30% throughput drop to 14.2 Gb/s for 4 10GbE ports. So 4 GB DIMMs are recommended for maximum TCP receive throughput.
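
The quoted drop can be checked against the 4-port figures. The sketch assumes that 20.44 Gbit/s in the multiple-thread results is the 4-port TCP receive throughput with 4 GB DIMMs; the helper name is made up.

```python
def percent_drop(baseline, reduced):
    """Relative throughput loss, in percent of the baseline."""
    return (baseline - reduced) / baseline * 100


if __name__ == "__main__":
    # Assumed baseline: 4-port receive with 4 GB DIMMs; reduced: 2 GB DIMMs.
    print(round(percent_drop(20.44, 14.2), 1))
```

The result is close to the 30% stated above, which supports reading 20.44 Gbit/s as the 4 GB DIMM receive figure.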

The size of DIMM and memory bandwidth have little impact on TCP transmit throughput: less than 2%.


Index of blogs about UltraSPARC T2 Plus

T5140 architecture whitepaper

10 Gigabit Ethernet on UltraSPARC T2

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification



