Monday Oct 13, 2008

T5440 PCI-E I/O Performance

T5440 PCI-Express Performance
Sun's latest CMT-based server is the four-way Sun SPARC Enterprise T5440. Like the earlier two-way Sun SPARC Enterprise T5140 and T5240 servers, the T5440 is built around the UltraSPARC T2 Plus processor (an SMP version of the UltraSPARC T2). Whereas the T5140 and T5240 use glueless coherency links to connect two T2 Plus processors in 1U and 2U form factors, the T5440 uses four coherency hubs to connect up to four processors in a 4U form factor. (The coherency hub was presented at the 2008 IEEE Symposium on High-Performance Interconnects. You can find more detail here: slides, paper.) With four T2 Plus processors, the T5440 provides 256 hardware threads. As with previous CMT servers, the T5440 uses PCI-Express (PCI-E) for input/output (IO). On the UltraSPARC T2 and T2 Plus processors, the PCI-E root complex is on-chip, which reduces latency between IO devices and memory.

T5440 PCI-Express Topology
The T5440 uses four PCI-E switches to connect to onboard devices and eight PCI-E slots for external device connections. All the slots are x8 PCI-E electrically, though two are physically x16. Another two of the eight slots are unavailable if the co-located XAUI is used. 

The T5440 can be configured with fewer than four CPU modules. In that case, x8 PCI-E crosslinks between the PCI-E switches are enabled, so that all IO components remain fully accessible. (Note: crosslinks are not shown in the diagram above.) This is discussed further here.

Bandwidth

Let's take a look at the DMA bandwidth performance for the T5440. The T5440 has four root complexes (one on-chip per T2 Plus processor). These measurements were made using multiple internally developed PCI-E exerciser cards. (How we made these measurements.) We used multiple load-generating modules and two Sun External IO Expansion Units. As shown in the section on latency below, any expansion unit adds latency to IO. However, IO devices that support enough outstanding IO requests can still achieve full PCI-E bandwidth. (For a logical view of the IO Expansion Unit PCI-E configuration, look here.)


                                        1 PCI-E      2 PCI-E      3 PCI-E      4 PCI-E
100% DMA Read                           1520 MB/s    3050 MB/s    4580 MB/s    6100 MB/s
100% DMA Write                          1720 MB/s    3440 MB/s    5170 MB/s    6890 MB/s
Bi-Directional (DMA read + DMA write)   2940 MB/s    5900 MB/s    8850 MB/s    11800 MB/s

The T5440 achieves 97% of the theoretical maximum bandwidth for the given PCI-E configuration in the uni-directional case and 93% in the bi-directional case. Peak DMA bandwidth also scales almost linearly: for every workload, the 4 PCI-E figure is about four times the 1 PCI-E figure.
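To make the scaling claim concrete, here is a minimal Python sketch that divides each measured figure by ideal linear scaling from the 1 PCI-E column. The numbers are copied from the bandwidth table; the names `bandwidth_mb_s` and `scaling_efficiency` are made up for this illustration and are not part of any measurement tool.

```python
# DMA bandwidth in MB/s, copied from the bandwidth table in this post.
# Position 0 is the "1 PCI-E" column, position 3 is the "4 PCI-E" column.
bandwidth_mb_s = {
    "100% DMA read": [1520, 3050, 4580, 6100],
    "100% DMA write": [1720, 3440, 5170, 6890],
    "bi-directional": [2940, 5900, 8850, 11800],
}


def scaling_efficiency(series):
    """Return measured bandwidth divided by ideal linear scaling.

    Ideal scaling for column n is n times the first (1 PCI-E) figure,
    so a value of 1.0 means perfectly linear scaling.
    """
    single = series[0]
    return [value / (single * (n + 1)) for n, value in enumerate(series)]


for workload, series in bandwidth_mb_s.items():
    ratios = ", ".join(f"{r:.3f}" for r in scaling_efficiency(series))
    print(f"{workload}: {ratios}")
```

Every ratio comes out within half a percent of 1.0. Some land slightly above 1.0, which is most likely an effect of rounding in the published figures rather than super-linear scaling.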


Latency
The table below shows latency for DMA Read operations. (The time is measured from the upstream request by the device to the first data bytes of the downstream 64B completion.)

                                                        T5140             T5240             T5440             T5440             T5440 from IO Expander
CPU                                                     T2 Plus @ 1.2GHz  T2 Plus @ 1.4GHz  T2 Plus @ 1.2GHz  T2 Plus @ 1.4GHz  T2 Plus @ 1.4GHz
One DW (4 Bytes) Satisfied from L2 Cache                653 ns            641 ns (est.)     698 ns            662 ns            1900 ns
First DW of 64 Byte MemRd Satisfied from Local Memory   820 ns            808 ns (est.)     1047 ns           954 ns            2200 ns
First DW of 64 Byte MemRd Satisfied from Remote Memory  916 ns            904 ns (est.)     1242 ns           1143 ns           2400 ns

For a given IO slot, memory is either local or remote. (Local versus Remote Latency.) Both the T5240 and the T5440 are SMP architectures and incur additional delays due to coherency protocol overhead. The added latency due to coherence is lower on the T5240 than on the T5440, since the T5240 is glueless while the T5440 uses a coherency hub. The last column shows latency for a device in the IO Expansion Unit. The extra levels of PCI-E switch add approximately 1.25 us of latency to the IO path, but with sufficient outstanding IO requests to memory, devices can still achieve full PCI-E x8 bandwidth.
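As a rough illustration of the two points above, the Python sketch below first subtracts the direct T5440 (1.4GHz) latencies from the IO Expander latencies, then uses Little's law (bytes in flight = bandwidth x latency) to estimate how many read requests a device would need to keep outstanding to sustain a given bandwidth. Several things here are assumptions rather than measurements: MB is taken as 10^6 bytes, every request is assumed to be 64 bytes, the time to first data is treated as the full round-trip time, and the 1520 MB/s target is simply the 1 PCI-E DMA read figure from the bandwidth table. The function and variable names are hypothetical.

```python
# DMA read latency in ns for the T5440 @ 1.4GHz, copied from the latency table.
direct_ns = {"L2 cache": 662, "local memory": 954, "remote memory": 1143}
expander_ns = {"L2 cache": 1900, "local memory": 2200, "remote memory": 2400}


def added_latency_ns(source):
    """Extra latency seen by a device in the IO Expansion Unit."""
    return expander_ns[source] - direct_ns[source]


def requests_in_flight(bandwidth_mb_s, latency_ns, request_bytes=64):
    """Estimate outstanding requests needed to sustain a bandwidth.

    Little's law: bytes in flight = bandwidth * latency.
    With MB = 10**6 bytes, MB/s * ns = bytes / 1000, so the arithmetic
    stays in integers. The result is rounded up to whole requests.
    Assumes fixed-size requests (64 bytes by default) in steady state.
    """
    numerator = bandwidth_mb_s * latency_ns
    denominator = 1000 * request_bytes
    return -(-numerator // denominator)  # integer ceiling division


for source in direct_ns:
    print(f"{source}: +{added_latency_ns(source)} ns via IO Expander")

target_mb_s = 1520  # assumed target: the 1 PCI-E DMA read figure
print("direct slot:", requests_in_flight(target_mb_s, direct_ns["local memory"]))
print("IO Expander:", requests_in_flight(target_mb_s, expander_ns["local memory"]))
```

The differences come out at 1238 ns, 1246 ns, and 1257 ns, in line with the roughly 1.25 us quoted above. Under the stated assumptions, reads from local memory need about 23 outstanding 64-byte requests from a direct slot and about 53 from the IO Expansion Unit; a device that issues larger read requests would need proportionally fewer.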