10 Gigabit Ethernet on UltraSPARC T2 Plus Systems

Overview 

Sun has launched the industry's first dual-socket chip multi-threading technology (CMT) platforms: a new family of virtualized servers based on the UltraSPARC T2 Plus ("Victoria Falls") processor. The Sun SPARC Enterprise T5140 and T5240 servers, code-named "Maramba", are Sun's third-generation CMT systems, extending the product family beyond the previously announced Sun SPARC Enterprise T5120 and T5220 servers, with which they share many features.

The UltraSPARC T2 Plus based systems have the Neptune chip integrated on-board for flexible GbE/10GbE deployment. Four Gigabit Ethernet ports are ready to use out of the box. If one optional XAUI card is added, the system provides one 10 Gigabit Ethernet port, and three of the four Gigabit Ethernet ports remain available. With two XAUI cards, there are two 10 Gigabit Ethernet ports and two Gigabit Ethernet ports available.
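A quick way to see how these ports show up in Solaris is to list the nxge data links; the interface names (nxge0, nxge1, ...) are just an example and depend on the configuration:

# each GbE/10GbE port handled by the nxge driver appears as an nxge link
dladm show-dev | grep nxge
dladm show-link | grep nxge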

The two XAUI slots connect to the same x8 PCI Express switch, so aggregate throughput from both XAUI cards is limited by PCI Express bandwidth. According to the PCI Express 1.1 specification, payload bandwidth on x8 lanes is 16 Gbit/s per direction (each lane runs at 2.5 GT/s with 8b/10b encoding, i.e. 2 Gbit/s of payload, and 8 lanes x 2 Gbit/s = 16 Gbit/s). In reality, a single x8 PCI Express switch delivers about 12 Gbit/s transmit, 13 Gbit/s receive, and 18.2 Gbit/s bi-directional, as measured on a T5240 under Solaris 10 Update 5. Since T5140/T5240 systems have two PCI Express switches, the recommended configuration for maximum aggregate throughput with two 10GbE ports is one XAUI card plus one Atlas card (Sun Dual 10GbE PCI Express Card) on a different PCI Express switch, which would be PCI Express slot 1, 2, or 3 (T5240 only).

Software Large Segment Offload

Large segment offload (LSO) is an offloading technique that increases transmit throughput by reducing CPU overhead. The technique is also called TCP segmentation offload (TSO) when it applies to TCP. It works by sending large buffers (up to 64 KB in Solaris) down the protocol stack and letting a lower layer split them into packets, instead of fragmenting at the IP layer. Hardware LSO relies on the network interface card to split the buffer into separate packets, while software LSO relies on the network driver to do the splitting.

The benefits of LSO are:

  • better single-thread transmit throughput (+30% on T5240 for 64 KB TCP messages)
  • better transmit throughput (+20% on multi-port T5240)
  • lower CPU utilization (-10% on T5220 NIU)

Software LSO is a new feature in the nxge driver. It has been available in Solaris Nevada since build 82. It is available as patch 138048 for Solaris 10 Update 5, and is part of Solaris 10 Update 6.
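To check whether that patch is already on a Solaris 10 system (the revision suffix of the patch will vary):

# list installed patches and look for the software LSO patch
showrev -p | grep 138048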

Tuning

10GbE requires minimal tuning for throughput on Solaris 10 Update 7 or earlier. In /etc/system,

set ip:ip_soft_rings_cnt=16

uses 16 soft rings per 10GbE port. If you don't want to reboot the system, you can use ndd to change the number of soft rings:

ndd -set /dev/ip ip_soft_rings_cnt 16

and then plumb nxge for the tuning to take effect (a minimal sequence is sketched below). The default for Solaris 10 Update 8 or later is 16 soft rings for 10GbE, so the tuning is no longer needed. You can read more about soft rings in my blog about 10GbE on UltraSPARC T2.
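For example, on a system where the 10GbE port is nxge0 (the interface name and address below are placeholders, not from the test setup):

# change the soft ring count; it applies to interfaces plumbed afterwards
ndd -set /dev/ip ip_soft_rings_cnt 16
# re-plumb the 10GbE interface so it picks up the new count
ifconfig nxge0 unplumb
ifconfig nxge0 plumb 192.0.2.10 netmask 255.255.255.0 up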

Software LSO is disabled by default. To maximize throughput and/or reduce CPU utilization, it should be enabled. To enable it, edit /platform/sun4v/kernel/drv/nxge.conf and uncomment the line

soft-lso-enable = 1; 
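The nxge driver reads nxge.conf when it attaches, so a reboot (or unloading and reloading the driver) is typically needed before the setting takes effect. A quick check that the line really is uncommented:

grep soft-lso-enable /platform/sun4v/kernel/drv/nxge.conf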

Performance

The system used is a 1.4 GHz, 8-core T5240 with 128 GB of memory and 4 Sun Dual 10GbE PCI Express Cards, running Solaris 10 Update 4 with patches and with software LSO enabled. Performance is measured with uPerf 0.2.6.
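For reference, uPerf runs as a slave on one host and as a master driving a workload profile on the other; the profile name below is only a placeholder for the actual connection-count and message-size mix used in each test:

# on the receiving host: start the uperf slave
uperf -s
# on the sending host: run the workload described in the profile
uperf -m profile.xml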

T5240 10GbE throughput (single thread)

              TX (MTU 1500)   RX (MTU 1500)   TX (jumbo frame)   RX (jumbo frame)
  Gbit/s      1.40            1.50            2.40               2.01

Maximum network throughput is achieved with multiple threads (i.e. multiple socket connections):

T5240 10GbE throughput (multiple threads)

  # ports   metric          TX      RX
  1         Gbit/s          9.30    9.46
            cpu util. (%)   24      25
  2         Gbit/s          18.47   17.44
            cpu util. (%)   46      49
  4         Gbit/s          21.78   20.44
            cpu util. (%)   77      83

There is no material difference between the performance of the on-board XAUI card and the Sun Multi-threaded 10GbE PCI Express Card.

4 GB DIMMs give higher receive performance and memory bandwidth

Different TCP receive throughput was observed on two different T5240 systems, and the difference came down to memory bandwidth, which depends on DIMM size. The 2 GB DIMMs currently shipping with the T5240 have 4 banks, while 4 GB DIMMs have 8 banks, and more banks give more memory bandwidth. Comparing STREAM benchmark results between 16 x 2 GB and 32 x 4 GB memory configurations on the same T5240, the 4 GB DIMM configuration gives 18-25% higher memory bandwidth. That extra memory bandwidth is what delivers the TCP receive throughput shown above. Using 2 GB DIMMs, even with the same or higher memory interleave factor (16 x 2 GB or 32 x 2 GB vs. 16 x 4 GB), leads to a 30% throughput drop, to 14.2 Gbit/s for 4 10GbE ports. So 4 GB DIMMs are recommended for maximum TCP receive throughput.
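One way to check which DIMM sizes a system actually has (the exact output format varies with the firmware release, so treat this only as a sketch):

# the memory configuration section of the output lists the installed DIMMs
prtdiag -v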

DIMM size and memory bandwidth have little impact on TCP transmit throughput: less than 2%.

Links

Index of blogs about UltraSPARC T2 Plus

T5140 architecture whitepaper

10 Gigabit Ethernet on UltraSPARC T2

Sun Multi-threaded 10GbE Tuning Guide

Neptune and NIU Hardware Specification

Comments:

What happened to the on chip 10Gig-E ports that the original T2 had?

It seems that the T2+ had them removed in favor of an on-board (vs. on-chip) Neptune chip.

Do you have similar benchmarks for the T5220 using the two on-chip 10Gig-E's?

Posted by John on April 09, 2008 at 01:20 AM PDT #

[Trackback] Today Sun and Fujitsu announced the first new multiproc Victoria Falls systems: Sun Microsystems And Fujitsu Expand SPARC Enterprise Server Line With New UltraSPARC T2 Plus Processor-Based Systems. The benchmark results for this new system are rea...

Posted by c0t0d0s0.org on April 09, 2008 at 02:36 AM PDT #

John: they were removed (along with half of memory controllers) to make room for coherency SMP links, it seems.

Posted by Tomasz on April 09, 2008 at 09:05 PM PDT #

John,
See my blog http://blogs.sun.com/puresee/entry/10_gigabit_ethernet_on_ultrasparc about the T5220 using the two on-chip 10Gig-E's. The results in that blog were without LSO. With LSO, T5220 TX is about 16 Gbit/s using write(), and 18 Gbit/s using sendfilev(). RX was unchanged.

Posted by Pure See on April 10, 2008 at 03:22 AM PDT #

[Trackback] In Solaris 10 Update 6, the nxge driver (for 10 Gbit Ethernet) will support "Software Large Segment Offload". This increases network throughput while at the same time reducing the CPU load it requires. What perf...

Posted by Die Kernspalter on September 24, 2008 at 02:06 AM PDT #
