Monday Oct 13, 2008

T5440 PCI-E I/O Performance

T5440 PCI-Express Performance
Sun's latest CMT-based server is the four-way Sun SPARC Enterprise T5440. As with the previous two-way Sun SPARC Enterprise T5140 and T5240 servers, the T5440 is built around the UltraSPARC T2 Plus processor (an SMP version of the UltraSPARC T2). Whereas the T5140 and T5240 used glueless coherency links to connect two T2 Plus processors in 1U and 2U form factors, the T5440 uses four coherency hubs to connect up to four processors in a 4U form factor. (The coherency hub was presented at the 2008 IEEE Symposium on High-Performance Interconnects; you can find more detail here: slides, paper.) With four T2 Plus processors, the T5440 provides 256 hardware threads. As with previous CMT servers, the T5440 utilizes PCI-Express (PCI-E) for Input/Output (IO). With the UltraSPARC T2 and T2 Plus processors, a PCI-E root complex is brought directly on-chip, reducing latency between IO devices and memory.

T5440 PCI-Express Topology
The T5440 uses four PCI-E switches to connect to onboard devices and eight PCI-E slots for external device connections. All eight slots are x8 PCI-E electrically, though two are physically x16. Two of the eight slots become unavailable if the co-located XAUI slots are used.

The T5440 can be configured with fewer than four CPU modules. In that case, x8 PCI-E crosslinks between the PCI-E switches are enabled, providing full access to all IO components. (Note: crosslinks are not shown in the diagram above.) This is discussed further here.

Bandwidth

Let's take a look at the DMA bandwidth performance for the T5440. The T5440 has four root complexes (one on-chip per T2 Plus processor). These measurements were made using multiple internally developed PCI-E exerciser cards. (How we made these measurements.) We used multiple load generating modules and two Sun External IO Expansion Units. As shown in the section on latency below, any expansion unit will add latency for IO; however, for IO devices that support sufficient outstanding IO requests, full PCI-E bandwidth is achievable. (For a logical view of the IO Expansion Unit PCI-E configuration, look here.)


                          1 PCI-E      2 PCI-E      3 PCI-E      4 PCI-E
100% DMA Read             1520 MB/s    3050 MB/s    4580 MB/s    6100 MB/s
100% DMA Write            1720 MB/s    3440 MB/s    5170 MB/s    6890 MB/s
Bi-Directional
(DMA read + DMA write)    2940 MB/s    5900 MB/s    8850 MB/s    11800 MB/s

Bandwidth for the T5440 reaches 97% of the theoretical maximum for uni-directional traffic and 93% for bi-directional traffic for the given PCI-E configuration. Peak DMA bandwidth on the T5440 scales very well across root complexes.
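As a rough sanity check on the per-link numbers, here is a back-of-envelope sketch of what an x8 PCI-E Gen1 link can deliver. The 64-byte payload matches the 64B completions discussed in the latency section; the ~20-byte per-TLP overhead is an assumed typical value, not a measurement from this article.

```python
# Back-of-envelope effective bandwidth for one PCI-E Gen1 x8 link.
# The 20-byte per-TLP overhead (header + sequence number + LCRC +
# framing) is an assumed typical value, not a number from this post.

LANES = 8
LINE_RATE = 2.5e9            # Gen1: 2.5 GT/s per lane
ENCODING = 8 / 10            # 8b/10b encoding efficiency
PAYLOAD = 64                 # bytes per 64B completion TLP
TLP_OVERHEAD = 20            # assumed header/CRC/framing bytes per TLP

raw_mb_s = LANES * LINE_RATE * ENCODING / 8 / 1e6     # bits -> bytes -> MB/s
effective_mb_s = raw_mb_s * PAYLOAD / (PAYLOAD + TLP_OVERHEAD)

print(f"raw per direction:       {raw_mb_s:.0f} MB/s")        # 2000 MB/s
print(f"effective (64B payload): {effective_mb_s:.0f} MB/s")  # 1524 MB/s
```

Under these assumptions the effective ceiling lands right in the neighborhood of the measured 1520 MB/s per-root-complex read figure.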


Latency
The table below shows latency for DMA Read operations. (The time is from the upstream request by the device to first data bytes of the downstream 64B completion.)

                                       T5140      T5240      T5440      T5440      T5440 from
                                                                                   IO Expander
CPU                                    T2 Plus    T2 Plus    T2 Plus    T2 Plus    T2 Plus
                                       @ 1.2GHz   @ 1.4GHz   @ 1.2GHz   @ 1.4GHz   @ 1.4GHz
One DW (4 Bytes) Satisfied
from L2 Cache                          653 ns     641 ns     698 ns     662 ns     1900 ns
                                                  (est.)
First DW of 64 Byte MemRd
Satisfied from Local Memory            820 ns     808 ns     1047 ns    954 ns     2200 ns
                                                  (est.)
First DW of 64 Byte MemRd
Satisfied from Remote Memory           916 ns     904 ns     1242 ns    1143 ns    2400 ns
                                                  (est.)

For a given IO slot, memory is either local or remote. (Local versus Remote Latency.) Both the T5240 and the T5440 are SMP architectures and incur additional delays due to coherency protocol overhead. The added latency due to coherence on the T5240 is lower than on the T5440, since the T5240 is glueless while the T5440 uses a coherency hub. The last column shows latency for a device in the IO Expansion Unit. The extra levels of PCI-E switch add approximately 1.25 us of latency to the IO path, but with sufficient outstanding IO requests to memory, devices can still achieve full PCI-E x8 bandwidth.
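The claim that enough outstanding requests recover full bandwidth despite the added latency follows from Little's law: requests in flight = throughput x latency. A small sketch, using the ~1520 MB/s x8 read figure and latencies from the table above (the 64-byte read size matches the 64B completions; the resulting counts are illustrative):

```python
import math

def outstanding_reads(bandwidth_mb_s, latency_ns, payload_bytes=64):
    """Little's law: in-flight requests = throughput x latency."""
    bytes_per_s = bandwidth_mb_s * 1e6
    return math.ceil(bytes_per_s * latency_ns * 1e-9 / payload_bytes)

# Direct slot (~954 ns) vs. a slot in the IO Expansion Unit (~2200 ns),
# both targeting the ~1520 MB/s x8 read bandwidth measured above.
print(outstanding_reads(1520, 954))    # 23 in-flight 64B reads
print(outstanding_reads(1520, 2200))   # 53 in-flight 64B reads
```

A device behind the expansion unit therefore needs roughly twice as many requests in flight to sustain the same x8 bandwidth, which is why only devices supporting many outstanding requests reach full rate there.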

Wednesday Apr 09, 2008

T5240 PCI-E I/O Performance

T5240 PCI-Express Performance
Sun is launching two servers using the UltraSPARC T2 Plus processor (an SMP version of the UltraSPARC T2 processor). The T2 Plus extends the capabilities of the UltraSPARC T2 with a glueless coherency link, doubling the number of threads available to the system. This processor is used in the two-way Sun SPARC Enterprise T5140 (a 1U server) and T5240 (a 2U server). As with the T2 servers, these T2 Plus based servers utilize PCI-Express (PCI-E) for Input/Output (IO). Earlier we discussed PCI-E performance of the T5220 server. Now we'll take a closer look at the two-way T5240 server's PCI-E performance.

T5240 PCI-Express Topology
The T5240 uses two PCI-E switches to connect to onboard devices and slots for external device connections. In the case of the T5140 there are three x8 PCI-E slots; two of the three can be either x8 PCI-E or XAUI. The T5240 doubles these capabilities, providing six x8 PCI-E slots, two of which are dual-flavor x8 PCI-E or XAUI combo slots. (Logical view of combo slots.)

Bandwidth
Let's take a look at the DMA bandwidth performance for these systems. For comparison, we show data measured on the earlier T2000 and T5220 servers, using the T1 and T2 processors respectively. (Slot topology for the T2000, slot topology for the T5220.) Note that the T2000 architecture has two PCI-E root complexes (called Leaf A and Leaf B), while the T5220 has a single PCI-E root complex. The T5240 has two root complexes. These measurements were made using one or more internally developed PCI-E exerciser cards. (How we made these measurements.)

Peak DMA Single x8 Root Complex Bandwidth measured by PCI-E exerciser card:


                  T2000                  T5220                     T5240 2WG
                  (Single Leaf, dual     (Single Level 8533,       (Single 8548, dual
                  PCI-E exerciser)       single PCI-E exerciser)   PCI-E exerciser)
CPU               T1                     T2                        T2 Plus
DMA Read          1650 MB/s              1430 MB/s                 1510 MB/s
DMA Write         1440 MB/s              1480 MB/s                 1720 MB/s
Bi-Directional    2200 MB/s              2470 MB/s                 2400 MB/s

Peak DMA Dual x8 Root Complex Bandwidth measured by PCI-E exerciser card:


                  T2000 Fire A+B leaf    T5220 (2 x8 slots)        T5240 2-Way Glueless
CPU               T1                     T2                        T2 Plus
DMA Read          2000 MB/s              1430 MB/s                 3020 MB/s
DMA Write         1600 MB/s              1480 MB/s                 3450 MB/s
Bi-Directional    2200 MB/s              2470 MB/s                 4800 MB/s


Bandwidth for the T5240 is as expected for the PCI-E link widths. Peak DMA bandwidth on the T5240 is considerably better than on the T2000, despite the fact that the T2000 also has two x8 PCI-E root complexes: T2000 bandwidth is ultimately limited by the peak effective JBus capacity of 2.5 GB/s. The T5240, with two x8 root complexes, provides essentially twice the PCI-E bandwidth capacity of the T5220.
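A quick back-of-envelope comparison shows why JBus, not PCI-E, is the T2000 choke point. The ~2000 MB/s per-direction figure for an x8 Gen1 link after 8b/10b encoding is an assumption of this sketch, not a number from the tables.

```python
# Why the T2000's dual x8 root complexes cannot scale: their combined
# PCI-E capacity far exceeds the JBus segment feeding them.
# 2000 MB/s per direction per x8 Gen1 link assumes 8b/10b encoding.

X8_GEN1_MB_S = 2000          # per direction, per x8 link (assumed)
JBUS_CAP_MB_S = 2500         # peak effective JBus capacity (from the text)

dual_x8_bidir = 2 * 2 * X8_GEN1_MB_S    # 2 links x (read + write)
print(dual_x8_bidir)                    # 8000 MB/s of PCI-E capacity
print(dual_x8_bidir > JBUS_CAP_MB_S)    # True: JBus is the bottleneck
```

This is consistent with the measured data: T2000 bi-directional peaks near 2.2 GB/s regardless of how many leaves are driven, while the T5240's on-chip root complexes reach 4800 MB/s.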

Latency
The table below shows latency for DMA Read operations. (The time is from the upstream request by the device to first data bytes of the downstream 64B completion.) Again, for comparison we include the latency for the T2000 and T5220. The table shows, for each PCI-E slot, the width and the number of intervening bridge ASIC or PCI-E switch levels between the slot and the processor. Each additional level adds latency to the path for IO.


                               T2000     T5220      T5220      T5240     T5240
                                         (1 8533)   (2 8533)   (local)   (remote)
CPU                            T1        T2         T2         T2 Plus   T2 Plus
CPU MHz                        1000      1167       1167       1079      1079
One DW (4 Bytes) Satisfied
from L2 Cache                  1329 ns   629 ns     894 ns     653 ns    n/a
First DW of 64 Byte MemRd
Satisfied from Memory          1474 ns   798 ns     1066 ns    820 ns    916 ns


We should expect to see improved latency for both the T5140 and T5240, as well as the T5120 and T5220, as compared with the T2000 for a couple of reasons. First, the T1 processor communicated with the outside world via its JBus port. The T2000 utilized a dedicated JBus segment and a dedicated JBus to PCI-E bridge ASIC called Fire for all IO. (Documentation for T1, JBus, and Fire.) In contrast, the T2 and T2 Plus processors bring the PCI-E root complex functionality from Fire directly on-chip, with much lower latency to host memory. The JBus and Fire ASIC are not present on the T2 and T2 Plus based servers. Second, the PCI-E switches used for the T2 and T2 Plus based servers are cut-through, as opposed to the earlier store-and-forward versions used on the T2000. In fact, during development of the T5120/T5220 servers, we determined that just changing the T5220 switches to cut-through reduced latency by 35-40%. Comparing local memory latency on the T5220 and T5240, we note that the T5220 latency with a single switch level is slightly better than the T5240's. This is because the T5240, being an SMP architecture, has an additional delay due to coherency protocol overhead. Remote latency on the T5240 imposes further delay. In its favor, however, the T5240 has a flat switch hierarchy, whereas the T5220 has two switch levels for five out of its six PCI-E slots.



Monday Nov 05, 2007

T5220 PCI-E IO Performance

Sun recently launched three servers based on the UltraSPARC T2 processor. The T2 processor is the next generation of CMT following the UltraSPARC T1 and is used in the Sun SPARC Enterprise T5120 (a 1U server) and T5220 (a 2U server). It's also used in the Sun Blade T6320 (a blade server). Both T1 and T2 based servers utilize PCI-Express (PCI-E) for Input/Output (IO). We'll take a closer look at the PCI-E performance, with focus on the T5220 server. 

T5220 PCI-Express Topology
The T5220 makes use of three PCI-E switches to connect to onboard devices and slots for external device connections. In the case of the T5120 there is one x8 PCI-E slot, and two slots that can be either x4 PCI-E or XAUI. The T5220 doubles these capabilities, providing two x8 PCI-E, two x4 PCI-E, and two dual-flavor x4 PCI-E or XAUI combo slots. (Logical view of combo slots.)  



Bandwidth

Let's take a look at the DMA bandwidth performance for these systems. For comparison, we show data measured on the earlier T2000 server using the T1 (aka Niagara) processor. (Slot topology for the T2000.) Note that the T2000 architecture has two PCI-E root complexes (called Leaf A and Leaf B). We make a direct comparison between the T5220 and a single leaf of the T2000, but also show the T2000 dual-leaf performance.

Peak DMA Bandwidth measured by exerciser:   (How we made these measurements.)
Server       Slot                               Width of     DMA Read    DMA Write   Simultaneous
                                                limiting     (only)      (only)      DMA Read
                                                PCI-E Link               and Write
T2000        J2100, J2201, or J2202             x8           1.2 GB/s    1.2 GB/s    2.2 GB/s
1.0 GHz      (individually)
             J2201 + J2202                      x8           1.6 GB/s    1.4 GB/s    2.2 GB/s
             Single Leaf (combined)
             J2100 + J2201 + J2202              x8           1.6 GB/s    1.4 GB/s    2.2 GB/s
             Dual Leaf (combined)
T5220        PCI-E2 or PCI-E5                   x8           1.5 GB/s    1.6 GB/s    2.5 GB/s
1.4 GHz      (individually)
             PCI-E0, PCI-E1, PCI-E3,            x4           0.76 GB/s   0.80 GB/s   1.3 GB/s
             or PCI-E4 (individually)
             PCI-E2 and PCI-E5 (combined)       x8           1.5 GB/s    1.7 GB/s    2.5 GB/s
             Two or more x4 slots               x8           1.5 GB/s    1.6 GB/s    2.6 GB/s
             (simultaneously)
             All 6 slots (simultaneously)       x8           1.5 GB/s    1.7 GB/s    2.7 GB/s


Bandwidth for the T5220 is as expected for PCI-E link widths. Peak DMA Bandwidth on T5220 is somewhat better than T2000. This is despite the fact that the T2000 actually has two x8 PCI-E root complexes. T2000 bandwidth is ultimately limited by the peak effective JBus capacity of 2.5 GB/s.

Latency

The table below shows latency for DMA Read operations. (The time is from the upstream request by the device to first data bytes of the downstream 64B completion.) Again, for comparison we include the T2000 latency. The table shows, for each PCI-E slot, the width and the number of intervening bridge ASIC or PCI-E switch levels between the slot and the processor. Each additional level adds latency to the path for IO.
Server     Slot Designation          PCI-E Link   Number of                DMA Read Latency
                                     Width        Intervening Levels       (request to start
                                                                           of data return)
T2000      J2100, J2201,             x8           1 JBus/PCI-E Bridge      1470 ns
1.0 GHz    or J2202                               plus 1 PCI-E Switch
T5220      PCI-E2                    x8           1 PCI-E Switch           805 ns
1.4 GHz    PCI-E0, PCI-E1,           x4           2 PCI-E Switches         1227 ns
           PCI-E3, or PCI-E4
           PCI-E5                    x8           2 PCI-E Switches         990 ns


We should expect to see improved latency for the T5120 and T5220 as compared with the T2000 for a couple of reasons. First, the T1 processor communicated with the outside world via its JBus port. The T2000 utilized a dedicated JBus segment and a dedicated JBus to PCI-E bridge ASIC called Fire for all IO. (Documentation for T1, JBus, and Fire.) In contrast, the T2 processor brings the PCI-E root complex functionality from Fire directly on-chip, with much lower latency to host memory. The JBus and Fire ASIC are not present on the T5220 and T5120 servers. Second, the PCI-E switches used for the T5120 and T5220 are cut-through, as opposed to the earlier store-and-forward versions used on the T2000. In fact, during development of the T5120/T5220 servers, we determined that just changing the T5220 switches to cut-through reduced latency by 35-40%.
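Using the measured T5220 numbers from the table above, we can back out the approximate round-trip cost of one extra cut-through switch level (comparing the two x8 slots, which differ only in switch depth):

```python
# Approximate round-trip cost of one extra cut-through switch level,
# taken from the two x8 T5220 rows in the latency table above.

pcie2_latency_ns = 805   # PCI-E2: one switch level
pcie5_latency_ns = 990   # PCI-E5: two switch levels

per_level_ns = pcie5_latency_ns - pcie2_latency_ns
print(per_level_ns)      # 185 ns per additional switch level
```

That per-level cost is why flat switch hierarchies matter for latency-sensitive devices, even when bandwidth is unaffected.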



About: pyakutis
