By user12608924 on Apr 09, 2008
10 Gigabit Ethernet on UltraSPARC T2 Plus Systems
Sun has launched the industry's first dual-socket chip multi-threading technology (CMT) platforms -- a new family of virtualized servers based on the UltraSPARC T2 Plus (Victoria Falls) processor. The Sun SPARC Enterprise T5140 and T5240 servers - code-named "Maramba" - are Sun's third-generation CMT systems, extending the product family beyond the previously-announced Sun SPARC Enterprise T5120 and T5220 servers, with which they share numerous features.
The UltraSPARC T2 Plus based systems have the Neptune chip integrated on board for flexible GbE/10GbE deployment. Four Gigabit Ethernet ports are ready to use out of the box. If one optional XAUI card is added, there is one 10 Gigabit Ethernet port, and three of the four Gigabit Ethernet ports remain available. With two XAUI cards, there are two 10 Gigabit Ethernet ports and two Gigabit Ethernet ports available.
The two XAUI slots connect to the same x8 PCI Express switch, so aggregate throughput from both XAUI cards is limited by PCI Express bandwidth. According to the PCI Express 1.1 specification, payload bandwidth on x8 lanes is 16 Gbits/s per direction. In practice, a single x8 PCI Express switch delivers about 12 Gbits/s transmit, 13 Gbits/s receive, and 18.2 Gbits/s bi-directional, measured on a T5240 under Solaris 10 Update 5. Since T5140/T5240 systems have two PCI Express switches, the recommended configuration for maximum aggregate throughput with two 10GbE ports is one XAUI card plus one Atlas (Sun Dual 10GbE PCI Express Card) on a different PCI Express switch, which would be one of PCI Express slots 1, 2, or 3 (T5240 only).
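The 16 Gbits/s figure follows directly from the PCI Express 1.1 link parameters; a quick back-of-the-envelope check:

```shell
# PCI Express 1.1 runs each lane at 2.5 Gb/s raw per direction and
# uses 8b/10b encoding, so 8/10 of the raw rate carries data.
# For an x8 link: 2.5 Gb/s * 8 lanes * 0.8 = 16 Gb/s per direction.
# (Integer arithmetic below works in tenths of a Gb/s.)
echo "$(( 25 * 8 * 8 / 10 / 10 )) Gb/s per direction"   # prints: 16 Gb/s per direction
```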
Software Large Segment Offload
Large segment offload (LSO) is an offloading technique that increases transmit throughput by reducing CPU overhead. The technique is also called TCP segmentation offload (TSO) when applied to TCP. It works by sending large buffers (up to 64 KBytes in Solaris) down the protocol stack and letting a lower layer fragment them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do the fragmentation.
The benefits of LSO are:
- better single-thread transmit throughput (+30% on T5240 for 64 KB TCP messages)
- better transmit throughput (+20% on multi-port T5240)
- lower CPU utilization (-10% on T5220 NIU)
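To see why handing a large buffer down the stack once helps, count how many wire packets one such buffer becomes; a rough sketch, assuming a standard 1460-byte TCP MSS at MTU 1500:

```shell
# One 64 KB LSO buffer makes a single trip down the protocol stack;
# without LSO, each MSS-sized packet makes its own trip.
BUF=65536    # LSO buffer size in bytes (Solaris maximum)
MSS=1460     # TCP payload per packet at MTU 1500 (1500 - 40 header bytes)
SEGS=$(( (BUF + MSS - 1) / MSS ))
echo "1 stack traversal instead of $SEGS"   # prints: 1 stack traversal instead of 45
```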
The 10GbE requires minimal tuning for throughput in Solaris 10 Update 7 or earlier. In /etc/system,
set ip:ip_soft_rings_cnt=16
uses 16 soft rings per 10GbE port. If you don't want to reboot the system, you can use ndd to change the number of soft rings:
ndd -set /dev/ip ip_soft_rings_cnt 16
and re-plumb nxge for the tuning to take effect. The default in Solaris 10 Update 8 or later is 16 soft rings for 10GbE, so this tuning is no longer needed. You can read more about soft rings in my blog about 10GbE on UltraSPARC T2.
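Putting the no-reboot path together, the sequence looks like this (a sketch; nxge0 and the address are placeholders for your actual interface and configuration):

```shell
# Set 16 soft rings per 10GbE port without rebooting.
ndd -set /dev/ip ip_soft_rings_cnt 16

# Re-plumb the nxge interface so the new soft ring count takes effect.
# Replace nxge0 and the address/netmask with your own values.
ifconfig nxge0 unplumb
ifconfig nxge0 plumb
ifconfig nxge0 192.0.2.10 netmask 255.255.255.0 up
```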
Software LSO is disabled by default. To maximize throughput and/or reduce CPU utilization, it should be enabled. To enable it, edit /platform/sun4v/kernel/drv/nxge.conf and uncomment the line
soft-lso-enable = 1;
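One way to make that edit from the command line, sketched below; the sed pattern assumes the shipped file comments the line with a leading `#`, so verify the result (and keep the backup) before rebooting for the change to take effect:

```shell
CONF=/platform/sun4v/kernel/drv/nxge.conf

# Keep a backup, then activate the soft-lso-enable line by stripping
# the leading comment marker.
cp $CONF $CONF.bak
sed 's/^#[[:space:]]*soft-lso-enable = 1;/soft-lso-enable = 1;/' $CONF.bak > $CONF

# Confirm the property is now active (should print: soft-lso-enable = 1;)
grep '^soft-lso-enable' $CONF
```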
The system under test is a 1.4 GHz, 8-core T5240 with 128 GB of memory and 4 Sun Dual 10GbE PCI Express Cards, running Solaris 10 Update 4 with patches and with software LSO enabled. Performance is measured with uPerf 0.2.6.
[Table: T5240 10GbE throughput (single thread), with columns for TX (MTU 1500), RX (MTU 1500), TX (jumbo frame), and RX (jumbo frame)]
Maximum network throughput is achieved with multiple threads (i.e. multiple socket connections):
[Table: T5240 10GbE throughput (multiple threads), with throughput and CPU utilization (%) per configuration]
There is no material difference in performance between the on-board XAUI card and the Sun Multi-threaded 10GbE PCI Express Card.
4 GB DIMMs give higher receive performance and memory bandwidth
Different TCP receive throughput was observed on two different T5240 systems, and the cause was different memory bandwidth from different DIMM sizes. The 2 GB DIMMs currently shipping with the T5240 have 4 banks, while the 4 GB DIMMs have 8 banks, and more banks give more memory bandwidth. Comparing STREAM benchmark results between the 16 x 2 GB and 32 x 4 GB memory configurations on the same T5240, the 4 GB DIMM configuration gives 18-25% higher memory bandwidth. That extra memory bandwidth is what produces the TCP receive throughput above. Using 2 GB DIMMs, even with the same or higher memory interleave factor (16 x 2 GB or 32 x 2 GB vs. 16 x 4 GB), leads to a 30% throughput drop, to 14.2 Gb/s across 4 10GbE ports. So 4 GB DIMMs are recommended for maximum TCP receive throughput.
DIMM size and memory bandwidth have little impact on TCP transmit throughput: less than 2%.