T5440 PCI-E I/O Performance

T5440 PCI-Express Performance
Sun's latest CMT-based server is the four-way Sun SPARC Enterprise T5440. As with the previous two-way Sun SPARC Enterprise T5140 and T5240 servers, the T5440 is built around the UltraSPARC T2 Plus processor (an SMP version of the UltraSPARC T2). Whereas the T5140 and T5240 used glueless coherency links to connect two T2 Plus processors in 1U and 2U form factors, the T5440 uses four coherency hubs to connect up to four processors in a 4U form factor. The coherency hub was presented at the 2008 IEEE Symposium on High-Performance Interconnects (you can find more detail here: slides, paper). With four T2 Plus processors, the T5440 provides 256 hardware threads. As with previous CMT servers, the T5440 uses PCI-Express (PCI-E) for Input/Output (IO). On the UltraSPARC T2 and T2 Plus processors, the PCI-E root complex is brought directly on-chip, reducing latency between IO devices and memory.

T5440 PCI-Express Topology
The T5440 uses four PCI-E switches to connect to the onboard devices and to eight PCI-E slots for external device connections. All eight slots are electrically x8 PCI-E, though two are physically x16. Another two of the eight slots are unavailable if the co-located XAUI slots are used.

The T5440 can be configured with fewer than four CPU modules. In that case, x8 PCI-E crosslinks between the PCI-E switches are enabled, providing full access to all IO components. (Note: the crosslinks are not shown in the diagram above.) This is discussed further here.

Bandwidth

Let's take a look at DMA bandwidth performance for the T5440. The T5440 has four PCI-E root complexes (one on-chip per T2 Plus processor). These measurements were made using multiple internally developed PCI-E exerciser cards. (How we made these measurements.) We used multiple load-generating modules and two Sun External IO Expansion Units. As shown in the latency section below, an expansion unit adds latency to the IO path; however, IO devices that support enough outstanding IO requests can still achieve full PCI-E bandwidth. (For a logical view of the IO Expansion Unit PCI-E configuration, look here.)


                                       1 PCI-E      2 PCI-E      3 PCI-E      4 PCI-E
100% DMA Read                          1520 MB/s    3050 MB/s    4580 MB/s    6100 MB/s
100% DMA Write                         1720 MB/s    3440 MB/s    5170 MB/s    6890 MB/s
Bi-Directional (DMA read + DMA write)  2940 MB/s    5900 MB/s    8850 MB/s    11800 MB/s

Measured bandwidth on the T5440 reaches 97% of the theoretical maximum for the given PCI-E configuration for uni-directional traffic, and 93% for bi-directional traffic. Peak DMA bandwidth on the T5440 scales essentially linearly from one to four PCI-E links.
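
As a quick check on the scaling claim, here is a small Python sketch that recomputes, from the numbers in the table above, how close each multi-link result comes to N times the single-link result. Treating the columns as one to four x8 PCI-E links driven in parallel (one per root complex) is an assumption based on the surrounding text, not something verified here.

    # Scaling check for the DMA bandwidth table above (values in MB/s).
    read  = [1520, 3050, 4580, 6100]    # 100% DMA Read
    write = [1720, 3440, 5170, 6890]    # 100% DMA Write
    bidir = [2940, 5900, 8850, 11800]   # DMA read + DMA write

    def scaling(series):
        """Ratio of each measurement to N copies of the single-link result."""
        base = series[0]
        return [round(v / (base * n), 3) for n, v in enumerate(series, start=1)]

    for name, series in [("read", read), ("write", write), ("bidir", bidir)]:
        print(name, scaling(series))
    # read  [1.0, 1.003, 1.004, 1.003]
    # write [1.0, 1.0, 1.002, 1.001]
    # bidir [1.0, 1.003, 1.003, 1.003]

All three traffic patterns land within about half a percent of perfectly linear scaling.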


Latency
The table below shows latency for DMA Read operations. (The time is measured from the device's upstream read request to the first data bytes of the downstream 64B completion.)

                                          T5140              T5240              T5440              T5440              T5440 from IO Expander
CPU                                       T2 Plus @ 1.2GHz   T2 Plus @ 1.4GHz   T2 Plus @ 1.2GHz   T2 Plus @ 1.4GHz   T2 Plus @ 1.4GHz
One DW (4 Bytes) Satisfied from L2 Cache  653 ns             641 ns (est.)      698 ns             662 ns             1900 ns
First DW of 64 Byte MemRd Satisfied
  from Local Memory                       820 ns             808 ns (est.)      1047 ns            954 ns             2200 ns
First DW of 64 Byte MemRd Satisfied
  from Remote Memory                      916 ns             904 ns (est.)      1242 ns            1143 ns            2400 ns

For a given IO slot, memory is either local or remote. (Local versus Remote Latency.) Both the T5240 and the T5440 are SMP architectures and incur additional delay from coherency protocol overhead. The added coherence latency on the T5240 is lower than on the T5440, since the T5240 is glueless while the T5440 goes through a coherency hub. The last column shows latency for a device in the IO Expansion Unit: the extra levels of PCI-E switching add approximately 1.25 us to the IO path, but with sufficient outstanding IO requests to memory, devices can still achieve full PCI-E x8 bandwidth.
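
A rough way to quantify "sufficient outstanding IO requests" is Little's law: sustained bandwidth is approximately outstanding requests times request size divided by round-trip latency, capped at the link maximum. The Python sketch below estimates how many 512B reads must be kept in flight to saturate one x8 link; treating the read latencies from the table above as the full round-trip time is an approximation, and the 1520 MB/s ceiling is the measured single-link read bandwidth from the bandwidth table.

    # Outstanding requests needed to sustain full x8 PCI-E DMA read
    # bandwidth, using Little's law:
    #   bandwidth ~= outstanding * request_size / round_trip_latency
    # The latencies are the local-memory read latencies from the table
    # above, used as an approximation of the full round-trip time.
    import math

    LINK_MAX_MB_S = 1520    # measured single-link DMA read bandwidth
    REQUEST_BYTES = 512     # read request size used in these measurements

    slots = {
        "T5440 internal slot (local memory)": 954e-9,   # seconds
        "IO Expansion Unit slot": 2200e-9,
    }

    for slot, latency_s in slots.items():
        needed = LINK_MAX_MB_S * 1e6 * latency_s / REQUEST_BYTES
        print(f"{slot}: ~{math.ceil(needed)} outstanding 512B reads")
    # T5440 internal slot (local memory): ~3 outstanding 512B reads
    # IO Expansion Unit slot: ~7 outstanding 512B reads

This is consistent with the comment discussion below, where eight 512B reads in flight are enough to reach maximum bandwidth from either slot location.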
Comments:

How much impact does the 1.25 us have with the external expansion units? On the face of it, this seems to be less than rounding error if the expansion unit has an HBA connected to a storage array. Even fast disk IO request times are measured in ms, so I am having a hard time seeing much relevance in the extra latency for this type of use. A bigger concern would probably be the bandwidth of the uplink if a person is tempted by the ability to plug a bunch of x8 cards into the expansion unit.

Also, I noticed that the http://mysales.central.sun.com/public/systems/enterprise/external_io.html link is invalid outside of Sun.

Posted by Mike Gerdts on October 13, 2008 at 05:13 AM EDT #

The round-trip latency can affect bandwidth depending on how many IO requests your device can keep outstanding and how large those requests are. Devices that make large requests may see no impact at all, but a network device, where request sizes tend to be smaller, might see one.

Here's an example:
Suppose a device is making 512B memory read requests from a slot in the IO expansion unit.
Just to make things easy, let's assume the round-trip latency seen by the device is 2 us. If it makes just one request at a time, and waits for the data to return before issuing another request, then it will achieve bandwidth of (512B / 2 us) = 256 MB/s. Now suppose that device is located in an internal T5440 slot and sees 1 us of round-trip latency. For the same one-request-at-a-time behavior, it will get bandwidth of 512 MB/s, a 2x improvement just from changing slot location.

If the device makes larger requests, even one at a time, its bandwidth will go up. In this example, if the device made 4KB requests it would easily achieve the maximum for the x8 PCI-E link, and the performance would be the same from either slot location.

Also, if it keeps more than one request in flight at a time, its bandwidth goes up. So if the device still makes only 512B memory read requests, but can keep 8 in flight, then it will also achieve the maximum bandwidth of the PCI-E x8 link and performance will be the same in any slot location.

So the latency matters to devices that make relatively small requests or can't keep many requests in flight. A network card might perform better for transmits (transmits use DMA read) if it is located in an internal slot rather than in the IO Expansion Unit.
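
The same arithmetic in a few lines of Python (the 2 us and 1 us round trips are the illustrative values from this example, not measured numbers):

    # Bandwidth for a device that keeps only one 512B DMA read in flight,
    # for the two illustrative round-trip latencies used above.
    REQUEST_BYTES = 512

    for slot, round_trip_s in [("IO Expansion Unit slot", 2e-6),
                               ("internal T5440 slot", 1e-6)]:
        bw_mb_s = REQUEST_BYTES / round_trip_s / 1e6
        print(f"{slot}: {bw_mb_s:.0f} MB/s with one request in flight")
    # IO Expansion Unit slot: 256 MB/s with one request in flight
    # internal T5440 slot: 512 MB/s with one request in flight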

Hope that helps to explain.

Thanks for catching the url, I'll fix that!

Posted by Peter Yakutis on October 14, 2008 at 04:46 AM EDT #

The content at the link to how you made the measurements would be stronger if it included information about the size(s) of the DMA requests and the number outstanding from any one device. Even stronger would be to include some information about "typical" (yeah, how long is a piece of string :) sizes one would expect to see from I/O cards.

Posted by rick jones on October 14, 2008 at 03:06 PM EDT #

"Typical transfer sizes" from IO cards is probably hard to generalize. PCI-E allows cards to make read requests up to 4KB in size (but no larger). The size of write requests and read completions is limited in PCI-E by the maximum packet size. 128B must be supported, but 256B and 512B are probably common. Above 512B, the performance benefit is not great, and the expense in hardware grows.

I'll add detail to the text at that link, but for these measurements, reads were 512B and MPS was 128B (which means writes were 128B). Our PCI-E exerciser cards used two DMA engines each, and each engine can keep up to four read requests outstanding (and there are several hundred ns of turn-around time for the card to issue the next request after receiving all the data). I could max out any single root complex for reads or writes with two cards in on-board slots. For bi-directional, I could achieve the max with three cards: one on-board and two in IO Expansion slots.

I would venture to say that most networks don't use jumbo frames. So with MTU of 1500B, that would be the largest request size for network cards (and many, many network packets are small). For disks, the HBA probably has large buffers. There you may see larger request sizes, with 4KB being the maximum.

Posted by Peter Yakutis on October 15, 2008 at 02:47 AM EDT #

I'm curious: for the 4x gig-e interfaces, does the Neptune chip act as a PCIe bridge to an e1000g controller?

Populated XAUI slots will obviously show up as nxge0 and nxge1, but I'm wondering what the quad gig-e ports are controlled by.

Posted by Dale Ghent on October 17, 2008 at 05:36 AM EDT #

The gig-e interfaces also show up as nxge. I'm not an expert on the internals of Neptune, but it's the same chip as was used on the T5240.

Posted by Peter Yakutis on October 20, 2008 at 06:01 AM EDT #
