T5220 PCI-E IO Performance

Sun recently launched three servers based on the UltraSPARC T2 processor. The T2 is the next generation of CMT following the UltraSPARC T1, and it powers the Sun SPARC Enterprise T5120 (a 1U server), the T5220 (a 2U server), and the Sun Blade T6320 (a blade server). Both T1- and T2-based servers use PCI-Express (PCI-E) for input/output (IO). We'll take a closer look at PCI-E performance, with a focus on the T5220 server.

T5220 PCI-Express Topology
The T5220 uses three PCI-E switches to connect the onboard devices and the slots for external devices. The T5120 provides one x8 PCI-E slot and two slots that can be configured as either x4 PCI-E or XAUI. The T5220 doubles this, providing two x8 PCI-E slots, two x4 PCI-E slots, and two dual-flavor x4 PCI-E or XAUI combo slots. (Logical view of combo slots.)



Bandwidth

Let's take a look at DMA bandwidth for these systems. For comparison, we show data measured on the earlier T2000 server, which uses the T1 (aka Niagara) processor. (Slot topology for the T2000.) Note that the T2000 architecture has two PCI-E root complexes (called Leaf A and Leaf B). We compare the T5220 directly against a single leaf of the T2000, but also show the T2000 dual-leaf performance.
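
The write-up of the measurement method itself is behind the link in the table caption below; the arithmetic behind each GB/s figure, though, is just bytes moved by the exerciser divided by elapsed time. A minimal sketch of that calculation (the byte count and interval are made-up illustrative values, not output from any real exerciser interface):

    /* Sketch only: turning an exerciser's byte counters into a GB/s figure.
     * The byte count and interval below are illustrative values, not real
     * measurements; on Solaris the interval would typically come from
     * gethrtime(), which returns nanoseconds.
     */
    #include <stdio.h>
    #include <stdint.h>

    static double gbytes_per_sec(uint64_t bytes, uint64_t elapsed_ns)
    {
        /* bytes per nanosecond is numerically equal to GB/s (1 GB = 1e9 bytes) */
        return (double)bytes / (double)elapsed_ns;
    }

    int main(void)
    {
        uint64_t bytes_moved = 22000000000ULL;   /* e.g. 22 GB of DMA traffic */
        uint64_t elapsed_ns  = 10000000000ULL;   /* over a 10-second interval */

        printf("%.1f GB/s\n", gbytes_per_sec(bytes_moved, elapsed_ns));
        return 0;
    }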

Peak DMA Bandwidth measured by exerciser:   (How we made these measurements.)

Server      Slot                                    Width of     DMA Read   DMA Write   Simultaneous DMA
                                                    limiting     (only)     (only)      Read and Write
                                                    PCI-E link
----------  --------------------------------------  -----------  ---------  ----------  ----------------
T2000       J2100, J2201, or J2202 (individually)   x8           1.2 GB/s   1.2 GB/s    2.2 GB/s
1.0 GHz     J2201 + J2202,
            single leaf (combined)                  x8           1.6 GB/s   1.4 GB/s    2.2 GB/s
            J2100 + J2201 + J2202,
            dual leaf (combined)                    x8           1.6 GB/s   1.4 GB/s    2.2 GB/s
T5220       PCI-E2 or PCI-E5 (individually)         x8           1.5 GB/s   1.6 GB/s    2.5 GB/s
1.4 GHz     PCI-E0, PCI-E1, PCI-E3, or PCI-E4
            (individually)                          x4           0.76 GB/s  0.80 GB/s   1.3 GB/s
            PCI-E2 and PCI-E5 (combined)            x8           1.5 GB/s   1.7 GB/s    2.5 GB/s
            Two or more x4 slots (simultaneously)   x8           1.5 GB/s   1.6 GB/s    2.6 GB/s
            All six slots (simultaneously)          x8           1.5 GB/s   1.7 GB/s    2.7 GB/s

Bandwidth on the T5220 is in line with what the PCI-E link widths allow. Peak DMA bandwidth on the T5220 is somewhat better than on the T2000, even though the T2000 actually has two x8 PCI-E root complexes; T2000 bandwidth is ultimately limited by the peak effective JBus capacity of 2.5 GB/s.
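
As a sanity check on the "as expected" claim: each gen-1 PCI-E lane signals at 2.5 GT/s with 8b/10b encoding, i.e. 250 MB/s of raw bandwidth per lane per direction before protocol overhead. A quick sketch of that arithmetic (the overhead remark in the comments is an estimate, not a measured figure):

    /* Raw (pre-overhead) bandwidth of gen-1 PCI-E links at the widths in the
     * table above.  TLP headers, DLLPs, and flow control consume part of this,
     * which is why measured DMA numbers land below the raw figures.
     */
    #include <stdio.h>

    int main(void)
    {
        const double lane_gtps = 2.5;        /* GT/s per lane, PCI-E 1.x */
        const double encoding  = 8.0 / 10.0; /* 8b/10b line encoding     */
        const int    widths[]  = { 4, 8 };

        for (int i = 0; i < 2; i++) {
            double per_dir_gbs = widths[i] * lane_gtps * encoding / 8.0; /* GB/s */
            printf("x%d: %.1f GB/s per direction, %.1f GB/s both directions\n",
                   widths[i], per_dir_gbs, 2.0 * per_dir_gbs);
        }
        /* Prints 1.0/2.0 GB/s for x4 and 2.0/4.0 GB/s for x8, so measured peaks
         * of ~0.8 GB/s (x4) and ~1.5-1.7 GB/s (x8) per direction are plausible
         * once protocol overhead is subtracted. */
        return 0;
    }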

Latency

The table below shows latency for DMA read operations. (The time is measured from the upstream read request issued by the device to the first data bytes of the downstream 64B completion.) Again, we include the T2000 latency for comparison. For each PCI-E slot, the table shows the link width and the number of intervening bridge-ASIC or PCI-E-switch levels between the slot and the processor; each additional level adds latency to the IO path.
Server      Slot designation                    PCI-E link  Intervening levels     DMA Read latency (request
                                                width                              to start of data return)
----------  ----------------------------------  ----------  ---------------------  -------------------------
T2000       J2100, J2201, or J2202              x8          1 JBus/PCI-E bridge    1470 ns
1.0 GHz                                                     plus 1 PCI-E switch
T5220       PCI-E2                              x8          1 PCI-E switch          805 ns
1.4 GHz     PCI-E0, PCI-E1, PCI-E3, or PCI-E4   x4          2 PCI-E switches       1227 ns
            PCI-E5                              x8          2 PCI-E switches        990 ns

We should expect improved latency for the T5120 and T5220 compared with the T2000 for a couple of reasons. First, the T1 processor communicated with the outside world via its JBus port: the T2000 used a dedicated JBus segment and a dedicated JBus-to-PCI-E bridge ASIC called Fire for all IO. (Documentation for T1, JBus, and Fire.) In contrast, the T2 processor brings the PCI-E root-complex functionality of Fire directly on-chip, with much lower latency to host memory; the JBus and the Fire ASIC are not present on the T5220 and T5120 servers. Second, the PCI-E switches used in the T5120 and T5220 are cut-through, as opposed to the earlier store-and-forward versions used on the T2000. In fact, during development of the T5120/T5220 servers, we determined that just changing the T5220 switches to cut-through reduced latency by 35-40%.
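
The per-level and per-width costs can be read straight out of the latency table; the sketch below just makes that subtraction explicit, using only the numbers reported above:

    /* Differences taken directly from the DMA read latency table above. */
    #include <stdio.h>

    int main(void)
    {
        const double t5220_x8_1sw = 805.0;   /* ns: PCI-E2, one switch level    */
        const double t5220_x8_2sw = 990.0;   /* ns: PCI-E5, two switch levels   */
        const double t5220_x4_2sw = 1227.0;  /* ns: x4 slots, two switch levels */
        const double t2000_x8     = 1470.0;  /* ns: T2000, bridge plus switch   */

        printf("extra cut-through switch level (x8):  %.0f ns\n",
               t5220_x8_2sw - t5220_x8_1sw);
        printf("x8 -> x4 at the same switch depth:    %.0f ns\n",
               t5220_x4_2sw - t5220_x8_2sw);
        printf("T5220 PCI-E2 vs T2000:                %.0f ns (%.0f%% lower)\n",
               t2000_x8 - t5220_x8_1sw,
               100.0 * (t2000_x8 - t5220_x8_1sw) / t2000_x8);
        return 0;
    }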



Comments:

What about performance without DMA?

And should there be multi-threaded versions of old Solaris tools such as cp, dd etc?

On the T2000 we had a throughput to disk of about 60-70MB/s when using a single dd.

Posted by Mika on November 06, 2007 at 12:44 AM EST #

T2000 disks were 10K RPM and ~60-70 MB/s is what you would expect to get for single stream sequential transfers (dd is inherently single-threaded). I may have a chance to get some PIO (peripheral IO, non-DMA) latency numbers. If so, I'll post them here.

Posted by peter on November 06, 2007 at 02:38 AM EST #

The 60MB/s were measured on an EMC.

We were then running three dd jobs in parallel towards the same LUN. We got about 3x60MB/s. It would be interesting to know if the 60MB/s is the limitation of one strand.

I still believe new multithreaded versions of ancient unix commands would be a good thing (and should be easy to implement).

Posted by Mika on November 06, 2007 at 09:01 AM EST #
