T5240 PCI-E I/O Performance

T5240 PCI-Express Performance
Sun is launching two servers based on the UltraSPARC T2 Plus processor, an SMP version of the UltraSPARC T2. The T2 Plus extends the UltraSPARC T2 with a glueless coherency link, doubling the number of threads available to the system. The processor is used in the two-way Sun SPARC Enterprise T5140 (a 1U server) and T5240 (a 2U server). As with the T2 servers, these T2 Plus based servers use PCI-Express (PCI-E) for input/output (IO). Earlier we discussed the PCI-E performance of the T5220 server. Now we'll take a closer look at the PCI-E performance of the two-way T5240 server.

T5240 PCI-Express Topology
The T5240 uses two PCI-E switches to connect to onboard devices and to the slots for external devices. The T5140 has three x8 PCI-E slots, two of which can be either x8 PCI-E or XAUI. The T5240 doubles these capabilities, providing six x8 PCI-E slots, two of which are dual-purpose x8 PCI-E or XAUI combo slots. (Logical view of combo slots.)

Bandwidth
Let's take a look at the DMA bandwidth performance for these systems. For comparison, we show data measured on the earlier T2000 and T5220 servers, which use the T1 and T2 processors respectively. (Slot topology for the T2000, Slot topology for the T5220.) Note that the T2000 architecture has two PCI-E root complexes (called Leaf A and Leaf B), while the T5220 has a single PCI-E root complex. The T5240 has two root complexes. These measurements were made using one or more internally developed PCI-E exerciser cards. (How we made these measurements.)

Peak DMA Single x8 Root Complex Bandwidth measured by PCI-E exerciser card:


                 T2000                   T5220                     T5240 2WG
                 (Single Leaf,           (Single Level 8533,       (Single 8548,
                 dual PCI-E exerciser)   single PCI-E exerciser)   dual PCI-E exerciser)
CPU              T1                      T2                        T2 Plus
DMA Read         1650 MB/s               1430 MB/s                 1510 MB/s
DMA Write        1440 MB/s               1480 MB/s                 1720 MB/s
Bi-Directional   2200 MB/s               2470 MB/s                 2400 MB/s

Peak DMA Dual x8 Root Complex Bandwidth measured by PCI-E exerciser card:


                 T2000 Fire A+B leaf   T5220 (2 X8 slots)   T5240 2Way Glueless
CPU              T1                    T2                   T2 Plus
DMA Read         2000 MB/s             1430 MB/s            3020 MB/s
DMA Write        1600 MB/s             1480 MB/s            3450 MB/s
Bi-Directional   2200 MB/s             2470 MB/s            4800 MB/s


Bandwidth for the T5240 is as expected for the PCI-E link widths. Peak DMA bandwidth on the T5240 is considerably better than on the T2000, even though the T2000 also has two x8 PCI-E root complexes: T2000 bandwidth is ultimately limited by the peak effective JBus capacity of 2.5 GB/s. With two x8 root complexes, the T5240 provides essentially twice the PCI-E bandwidth capacity of the T5220.
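As a rough sanity check on "as expected for PCI-E link widths", the sketch below compares the measured T5240 peaks with the raw per-direction capacity of an x8 link, and computes how the dual root complex figures scale against the T5220. It assumes first-generation PCI-E signaling (2.5 GT/s per lane with 8b/10b encoding), which this post does not state explicitly. The function and variable names are illustrative only.

```python
# Assumption (not stated in the post): PCI-E 1.x signaling,
# i.e. 2.5 GT/s per lane with 8b/10b encoding.
TRANSFERS_PER_S_PER_LANE = 2_500_000_000


def link_capacity_mb_s(lanes: int) -> float:
    """Raw per-direction capacity in MB/s, before packet and protocol overhead."""
    data_bits_per_s = lanes * TRANSFERS_PER_S_PER_LANE * 8 // 10  # 8b/10b encoding
    return data_bits_per_s / 8 / 1e6


# Measured peaks in MB/s, copied from the two bandwidth tables.
t5240_single_rc = {"DMA Read": 1510, "DMA Write": 1720}
t5220_dual = {"DMA Read": 1430, "DMA Write": 1480, "Bi-Directional": 2470}
t5240_dual = {"DMA Read": 3020, "DMA Write": 3450, "Bi-Directional": 4800}

x8_capacity = link_capacity_mb_s(8)

# Fraction of the raw x8 capacity reached by a single root complex.
for op, measured in t5240_single_rc.items():
    print(f"{op}: {measured} of {x8_capacity:.0f} MB/s ({measured / x8_capacity:.0%})")

# Scaling of the T5240 dual root complex figures against the T5220.
scaling = {op: t5240_dual[op] / t5220_dual[op] for op in t5240_dual}
for op, ratio in scaling.items():
    print(f"{op}: T5240 dual root complex = {ratio:.2f}x T5220")
```

Under that assumption an x8 link carries at most 2000 MB/s in each direction, so the single root complex T5240 peaks sit at roughly 75% (read) and 86% (write) of raw capacity, and the dual root complex figures come out at about 1.9x to 2.3x the T5220 numbers.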

Latency
The table below shows latency for DMA read operations, measured from the upstream request by the device to the first data bytes of the downstream 64B completion. Again, for comparison we include the latency for the T2000 and T5220. The table shows, for each PCI-E slot, the width and the number of intervening bridge ASIC or PCI-E switch levels between the slot and the processor. Each additional level adds latency to the IO path.


                                                  T2000     T5220      T5220      T5240     T5240
                                                            (1 8533)   (2 8533)   (local)   (remote)
CPU                                               T1        T2         T2         T2 Plus   T2 Plus
CPU MHz                                           1000      1167       1167       1079      1079
One DW (4 Bytes) Satisfied from L2 Cache          1329 ns   629 ns     894 ns     653 ns    n/a
First DW of 64 Byte MemRd Satisfied from Memory   1474 ns   798 ns     1066 ns    820 ns    916 ns


We should expect to see improved latency for both the T5140 and T5240, as well as the T5120 and T5220, compared with the T2000, for a couple of reasons.

First, the T1 processor communicated with the outside world via its JBus port. The T2000 used a dedicated JBus segment and a dedicated JBus to PCI-E bridge ASIC called Fire for all IO. (Documentation for T1, JBus, and Fire.) In contrast, the T2 and T2 Plus processors bring the PCI-E root complex functionality from Fire directly on-chip, with much lower latency to host memory. The JBus and the Fire ASIC are not present on the T2 and T2 Plus based servers.

Second, the PCI-E switches used for the T2 and T2 Plus based servers are cut-through, as opposed to the earlier store-and-forward versions used on the T2000. In fact, during development of the T5120/T5220 servers, we determined that just changing the T5220 switches to cut-through reduced latency by 35-40%.

Comparing the T5220 with the T5240 local memory latency, we note that the T5220 latency with a single switch level is slightly better than the T5240's. This is because the T5240, being an SMP architecture, has an additional delay due to coherency protocol overhead. Remote latency on the T5240 imposes a further delay. However, in its favor, the T5240 has a flat switch hierarchy, whereas the T5220 has two switch levels for five of its six PCI-E slots.
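To put numbers on those comparisons, here is a small sketch that subtracts columns of the latency table. The labels attached to each difference (extra switch level, coherency overhead, remote penalty) follow the explanation above and are only a rough attribution: the systems also differ in other ways, such as CPU clock frequency, so treat the results as indicative rather than as isolated measurements. The helper name `delta` is made up for illustration.

```python
# DMA read latencies in ns, copied from the latency table.
# Tuple order: (satisfied from L2 cache, satisfied from memory); None marks "n/a".
latency_ns = {
    "T2000": (1329, 1474),
    "T5220 (1 8533)": (629, 798),
    "T5220 (2 8533)": (894, 1066),
    "T5240 (local)": (653, 820),
    "T5240 (remote)": (None, 916),
}
L2, MEM = 0, 1


def delta(a: str, b: str, column: int) -> int:
    """Latency of configuration a minus configuration b, in ns, for one column."""
    return latency_ns[a][column] - latency_ns[b][column]


# Second switch level on the T5220.
extra_switch = delta("T5220 (2 8533)", "T5220 (1 8533)", MEM)
# T5240 local vs. single-switch T5220 (attributed above to coherency overhead).
coherency = delta("T5240 (local)", "T5220 (1 8533)", MEM)
# Additional cost of reaching remote memory on the T5240.
remote = delta("T5240 (remote)", "T5240 (local)", MEM)
# Relative improvement of the T5240 (local) over the T2000.
improvement = 1 - latency_ns["T5240 (local)"][MEM] / latency_ns["T2000"][MEM]

print(f"Extra T5220 switch level: +{extra_switch} ns")
print(f"T5240 local vs T5220:     +{coherency} ns")
print(f"T5240 remote vs local:    +{remote} ns")
print(f"T5240 local vs T2000:     {improvement:.0%} lower")
```

On the memory column, the second T5220 switch level costs 268 ns, far more than the 22 ns gap between the T5240 (local) and the single-switch T5220 or the 96 ns remote penalty, which is why the flat switch hierarchy of the T5240 counts in its favor.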



About

pyakutis
