Wednesday Apr 09, 2008

10 Gigabit Ethernet on UltraSPARC T2 Plus

Sun has launched the industry's first dual-socket chip multi-threading technology (CMT) platforms -- a new family of virtualized servers based on the UltraSPARC T2 Plus (Victoria Falls) processor. The Sun SPARC Enterprise T5140 and T5240 servers - code-named "Maramba" - are Sun's third-generation CMT systems, extending the product family beyond the previously announced Sun SPARC Enterprise T5120 and T5220 servers, with which they share numerous features.

The UltraSPARC T2 Plus based systems have the Neptune chip integrated on board for flexible GbE/10GbE deployment. Four Gigabit Ethernet ports are ready to use out of the box. If one optional XAUI card is added, there will be one 10 Gigabit Ethernet port, and 3 of the 4 Gigabit Ethernet ports will remain available. When two XAUI cards are added, there will be two 10 Gigabit Ethernet ports and 2 Gigabit Ethernet ports available.

The two XAUI slots connect to the same x8 PCI Express switch, so aggregate throughput from both XAUI cards is limited by PCI Express bandwidth. According to the PCI Express 1.1 specification, payload bandwidth on x8 lanes is 16 Gbit/s per direction. In reality, a single x8 PCI Express switch delivers about 12 Gbit/s transmit, 13 Gbit/s receive, and 18.2 Gbit/s bi-directional, measured on a T5240 under Solaris 10 Update 5. Since T5140/T5240 systems have two PCI Express switches, the recommended configuration for maximum aggregate throughput with two 10GbE ports is one XAUI card plus one Atlas (Sun Dual 10GbE PCI Express Card) on a different PCI Express switch, i.e. in one of PCI Express slots 1, 2, or 3 (T5240 only).
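The 16 Gbit/s figure can be sanity-checked from the link parameters: PCI Express 1.1 signals at 2.5 GT/s per lane with 8b/10b encoding, leaving 2 Gbit/s of payload per lane. A quick calculation (an illustrative sketch, not from the original post):

```shell
# Payload bandwidth of an x8 PCI Express 1.1 link, per direction:
# 2.5 GT/s per lane, 8b/10b encoding => 8 payload bits per 10 wire bits
awk 'BEGIN {
  lane_rate = 2.5                  # raw signaling rate per lane, GT/s
  payload   = lane_rate * 8 / 10   # 2 Gbit/s usable per lane
  lanes     = 8
  printf "x8 payload bandwidth: %.0f Gbit/s per direction\n", payload * lanes
}'
```

The measured 12-13 Gbit/s per direction is lower still, largely because packet (TLP) header and flow-control overhead eat into the raw payload rate.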

Software Large Segment Offload

Large segment offload (LSO) is an offloading technique that increases transmit throughput by reducing CPU overhead. The technique is also called TCP segmentation offload (TSO) when it applies to TCP. It works by sending large buffers (buffer size can be up to 64 KB in Solaris) down the protocol stack and letting a lower layer split them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do so.
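The win comes from amortizing per-packet stack overhead: with a standard 1500-byte MTU, one 64 KB buffer handed down the stack replaces dozens of individual trips through the TCP/IP layers. A rough count (an illustrative sketch; the 1460-byte MSS is an assumed typical value, not from the original post):

```shell
# Segments produced from one 64 KB LSO buffer, assuming a 1460-byte
# TCP MSS (1500-byte MTU minus 40 bytes of IP + TCP headers)
awk 'BEGIN {
  buf = 64 * 1024                     # LSO buffer size, bytes
  mss = 1460                          # assumed maximum segment size
  segs = int((buf + mss - 1) / mss)   # ceiling division
  printf "%d segments per 64 KB buffer\n", segs
}'
```

So the stack does the per-buffer work once instead of roughly 45 times; only the final segmentation is done by the NIC (hardware LSO) or the driver (software LSO).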

The benefits of LSO are:

  • better single-thread transmit throughput (+30% on T5240 for 64 KB TCP messages)
  • better transmit throughput (+20% on multi-port T5240)
  • lower CPU utilization (-10% on T5220 NIU)
Software LSO is a new feature in the nxge driver. It has been available in Solaris Nevada since build 82, is available as patch 138048 for Solaris 10 Update 5, and is part of Solaris 10 Update 6.


10GbE requires minimal tuning for throughput in Solaris 10 Update 7 or earlier. In /etc/system

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don't want to reboot the system, you can use ndd to change the number of soft rings

ndd -set /dev/ip ip_soft_rings_cnt 16

and then plumb nxge for the tuning to take effect. The default in Solaris 10 Update 8 or later is 16 soft rings for 10GbE, so this tuning is no longer needed. You can read more about soft rings in my blog about 10GbE on UltraSPARC T2.

Software LSO is disabled by default. To maximize throughput and/or reduce CPU utilization, enable it by editing /platform/sun4v/kernel/drv/nxge.conf and uncommenting the line

soft-lso-enable = 1;


The system under test is a 1.4 GHz, 8-core T5240 with 128 GB of memory and 4 Sun Dual 10GbE PCI Express Cards, running Solaris 10 Update 4 with patches, with software LSO enabled. Performance is measured with uPerf 0.2.6.

T5240 10GbE throughput (single thread)

                  TX (Gbit/s)    RX (Gbit/s)
  MTU 1500            1.40           1.50
  Jumbo frame         2.40           2.01

Maximum network throughput is achieved with multiple threads (i.e. multiple socket connections):

T5240 10GbE throughput (multiple threads)

  # ports    TX (Gbit/s)    RX (Gbit/s)    cpu util. (%)
  1              9.30           9.46
  2             18.47          17.44             46
  4             21.78          20.44
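The scaling pattern matches the PCI Express limits discussed earlier: transmit throughput nearly doubles from 1 to 2 ports, then flattens at 4 ports as the bus becomes the bottleneck. Checking the ratios from the TX numbers (an illustrative sketch using the figures above):

```shell
# Port-scaling ratios for the multi-thread TX results
awk 'BEGIN {
  tx1 = 9.30; tx2 = 18.47; tx4 = 21.78       # Gbit/s at 1, 2, 4 ports
  printf "1 -> 2 ports: %.2fx\n", tx2 / tx1  # near-linear scaling
  printf "2 -> 4 ports: %.2fx\n", tx4 / tx2  # bus-limited
}'
```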

There is no material difference in performance between the on-board XAUI card and the Sun Multi-threaded 10GbE PCI Express Card.

4 GB DIMMs give higher receive performance and memory bandwidth

Different TCP receive throughput was observed on two different T5240 systems, and the cause was different memory bandwidth from different DIMM sizes. The 2 GB DIMMs currently shipping with the T5240 have 4 banks, while 4 GB DIMMs have 8 banks, and more banks give more memory bandwidth. Comparing STREAM benchmark results between 16 x 2 GB and 32 x 4 GB memory configurations on the same T5240, the 4 GB DIMM configuration gives 18-25% higher memory bandwidth. The increased memory bandwidth from 4 GB DIMMs gives the TCP receive throughput above. Using 2 GB DIMMs, even with the same or higher memory interleave factor (16 x 2 GB or 32 x 2 GB vs. 16 x 4 GB), leads to a 30% throughput drop, to 14.2 Gbit/s for 4 10GbE ports. So 4 GB DIMMs are recommended for maximum TCP receive throughput.
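The 30% figure lines up with the 4-port receive number in the table above (20.44 Gbit/s with 4 GB DIMMs): dropping to 14.2 Gbit/s is roughly a 31% reduction. A quick check (an illustrative sketch; pairing these two particular numbers is my assumption):

```shell
# Receive throughput drop when using 2 GB DIMMs instead of 4 GB DIMMs
awk 'BEGIN {
  rx_4gb = 20.44   # 4-port TCP receive, 4 GB DIMMs (table above)
  rx_2gb = 14.2    # 4-port TCP receive, 2 GB DIMMs
  printf "drop: %.0f%%\n", (1 - rx_2gb / rx_4gb) * 100
}'
```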

DIMM size and memory bandwidth have little impact on TCP transmit throughput: less than 2%.


Index of blogs about UltraSPARC T2 Plus

T5140 architecture whitepaper

10 Gigabit Ethernet on UltraSPARC T2

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification

Friday Oct 19, 2007

10 Gigabit Ethernet on UltraSPARC T2

Tuning on-chip 10GbE (NIU) on T5120/T5220

The UltraSPARC T2 has an integrated dual 10GbE Network Interface Unit (NIU) on-chip. It requires minimal tuning for throughput. In /etc/system

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don't want to reboot the system, you can use ndd to change the number of soft rings

ndd -set /dev/ip ip_soft_rings_cnt 16

and plumb nxge for the tuning to take effect. Solaris 10 Update 8 or later uses 16 soft rings for 10GbE by default, so the tuning is no longer needed.

Soft rings are kernel threads that offload processing of received packets from the interrupt CPU, thus preventing the interrupt CPU from becoming the bottleneck. The trade-off is added latency from handing off work from the interrupt thread to a soft ring thread. If your workload is latency sensitive, you may want to check whether turning off soft rings helps meet your latency needs while still delivering the required packet rate or throughput.

Soft rings can be used with any GLDv3 network driver, such as nxge, e1000g, or bge. A little-known trick about soft rings is that they can be configured on a per-port basis, so you can, for example, configure NIU with 16 soft rings and on-board 1GbE with 2 soft rings. To continue the example:

ndd -set /dev/ip ip_soft_rings_cnt 16
ifconfig nxge0 plumb
ndd -set /dev/ip ip_soft_rings_cnt 2
ifconfig e1000g1 plumb

Tuning Sun Multi-threaded 10GbE PCI-E NIC on T5120/T5220

If your T5120/T5220 has a Sun Multi-threaded 10GbE PCI Express NIC plugged in, then two tunables (in /etc/system) are recommended for throughput:
set ip:ip_soft_rings_cnt=16
set ddi_msix_alloc_limit=8
ddi_msix_alloc_limit is a system-wide limit on how many MSI (Message Signaled Interrupts) or MSI-X interrupts a PCI device can allocate. The default allows a maximum of 2 MSIs per device. Since each 10GbE port of the Sun Multi-threaded 10GbE PCI-E NIC has 8 receive DMA channels, and each channel can generate one interrupt, a port can generate up to 8 interrupts. To avoid the interrupt CPU becoming the performance bottleneck, it is recommended to set ddi_msix_alloc_limit to 8 so that network receive interrupts can target 8 different CPUs. This tunable will become unnecessary with a patch to Solaris 10 Update 4.

Which CPUs are taking interrupts?

If your application threads are pinned too often by interrupts and this becomes a problem, you can create a processor set to dedicate those CPUs to interrupt processing. To find out which CPUs are taking interrupts, use intrstat. In the intrstat output below, you can see CPUs 27-34 are interrupted by the NIU (niumx is the name of the nexus driver for NIU):
      device |     cpu24 %tim     cpu25 %tim     cpu26 %tim     cpu27 %tim
     niumx#0 |         0  0.0         0  0.0         0  0.0      2086 80.9

      device |     cpu28 %tim     cpu29 %tim     cpu30 %tim     cpu31 %tim
     niumx#0 |      1885 82.4      2057 80.1      2189 77.2      2019 79.6

      device |     cpu32 %tim     cpu33 %tim     cpu34 %tim     cpu35 %tim
     niumx#0 |      1993 81.8      2073 79.7      1948 81.7         0  0.0
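The CPU list for such a processor set can be pulled straight out of intrstat output: collect every CPU column where the niumx count is non-zero. A sketch using the sample output above (the psrset invocation printed at the end is my assumption of how you would then bind them, not from the original post):

```shell
# Extract the CPUs taking niumx interrupts from intrstat-style output
# and print a psrset command that would dedicate them to a processor set.
intrstat_sample='
      device |     cpu24 %tim     cpu25 %tim     cpu26 %tim     cpu27 %tim
     niumx#0 |         0  0.0         0  0.0         0  0.0      2086 80.9

      device |     cpu28 %tim     cpu29 %tim     cpu30 %tim     cpu31 %tim
     niumx#0 |      1885 82.4      2057 80.1      2189 77.2      2019 79.6

      device |     cpu32 %tim     cpu33 %tim     cpu34 %tim     cpu35 %tim
     niumx#0 |      1993 81.8      2073 79.7      1948 81.7         0  0.0
'
echo "$intrstat_sample" | awk '
  $1 == "device" {      # header row: remember which CPU each column is
    n = 0
    for (i = 3; i <= NF; i += 2) ids[++n] = substr($i, 4)   # strip "cpu"
  }
  $1 ~ /^niumx/ {       # data row: keep CPUs with non-zero interrupt counts
    c = 0
    for (i = 3; i <= NF; i += 2) { c++; if ($i + 0 > 0) cpus = cpus " " ids[c] }
  }
  END { print "psrset -c" cpus }
'
```

For the sample above this prints a psrset command covering CPUs 27-34; running it as root would create the processor set.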
You can see interrupts from the Sun Multi-threaded 10GbE PCI-E NIC using an mdb command as well:
echo ::interrupts | mdb -k
but interrupts from NIU are not included at this time.


UltraSPARC T2 comes with an on-chip dual 10GbE Network Interface Unit (NIU), but you can also plug a 10GbE NIC into a PCI-E slot. Here is a summary of performance features:

                                    NIU (on-chip)    Sun Multi-threaded 10GbE PCI-E NIC
  # 10GbE ports                     2                2
  # transmit DMA channels/port      8                12
  # receive DMA channels/port       8                8
  integrated on-chip?               yes              no
  bus interface                     n/a (on-chip)    8 lane PCI Express
  bus bandwidth limit               n/a              16 Gbit/s each direction*
  transmit packet classification
  receive packet classification

* Note: the PCI Express 1.1 specification specifies 2 Gbit/s payload per lane full-duplex, so 8 lanes can reach 8 x 2 = 16 Gbit/s. Actual bus bandwidth on T5120/S10U4 is measured at 12 Gbit/s from host to device.

The same driver, nxge, is used for both NIU and the Sun Multi-threaded 10GbE PCI-E NIC. If you have both installed, you can tell which instance is NIU by looking in /etc/path_to_inst: instances with the /niu prefix are NIU.

$ grep nxge /etc/path_to_inst
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0" 2 "nxge"
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0,1" 3 "nxge"
"/niu@80/network@0" 0 "nxge"
"/niu@80/network@1" 1 "nxge"
In the example above, nxge0 and nxge1 are NIU, and nxge2 and nxge3 are PCI-E NIC.
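The instance-to-device mapping can also be derived mechanically from those lines. A sketch (using the sample path_to_inst entries above; on a live system you would read /etc/path_to_inst instead of the embedded sample):

```shell
# Label each nxge instance as NIU or PCI-E NIC from its device path:
# paths under /niu are the on-chip NIU, paths under /pci are the NIC.
path_to_inst_sample='
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0" 2 "nxge"
"/pci@0/pci@0/pci@8/pci@0/pci@8/network@0,1" 3 "nxge"
"/niu@80/network@0" 0 "nxge"
"/niu@80/network@1" 1 "nxge"
'
echo "$path_to_inst_sample" | awk '
  /"nxge"/ {
    kind = ($1 ~ /^"\/niu/) ? "NIU" : "PCI-E NIC"
    printf "nxge%s: %s\n", $2, kind
  }
' | sort
```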

You may wonder: is there performance difference between NIU and PCI-E NIC? The short answer is: NIU wins in all micro-benchmarks except one - single 10GbE port transmitting UDP small packets.

The Sun Multi-threaded 10GbE PCI-E NIC can transmit an impressive 2.1 million 64-byte UDP packets per second out of one port, 50% more than NIU (1.4 million pps), due to the fact that it has 50% more transmit DMA channels than NIU (12 vs. 8 per port). So if your workload mostly sends small UDP packets, the Sun Multi-threaded 10GbE PCI-E NIC may deliver a higher packet rate than NIU on UltraSPARC T2 systems.

In all other scenarios, NIU gives higher throughput, or lower CPU utilization at similar throughput. In a 2-port throughput test, the two NIU ports achieve an impressive 14.6 Gbit/s on TCP transmit, or 18.2 Gbit/s on TCP receive, using 8 KB messages and 145 connections on a 1.4 GHz T5120. For TCP transmit, the CPU efficiency (measured in Gbps/GHz) of NIU is 23% higher than the PCI-E NIC at maximum throughput for 2 ports, and 46% higher at maximum throughput for 1 port. This clearly demonstrates the performance advantage of integrating 10GbE on-chip.


T5120, T5220, T6320 System and blades Launch blogs

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification



