Friday Apr 11, 2008

Memory and coherency on the UltraSPARC T2 Plus Processor


Coherency Architecture


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the silicon layout of the UltraSPARC T2 Plus is the same as the original UltraSPARC T2. Both use Texas Instruments 65nm technology and have similar packaging and pin layout. There are still 8 cores, with 8 floating-point and crypto units and 16 integer pipelines on each processor, and an x8 PCI-E link integrated. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus.


The Coherency Unit (CU) sits between the L2 cache and the Memory Controller Units (MCUs) that drive the FBDIMM channels. The CU snoops all requests for memory and communicates with the CU on the second UltraSPARC T2 Plus processor to resolve all accesses. If data is on the remote processor, its CU will forward the data.


The coherency does not use a traditional bus such as on a V890 but instead uses multiple high speed SERDES links running between the chips to achieve optimal scaling. We already had high speed SERDES links coming from the T2 to connect to FBDIMM, and after simulating numerous scenarios we decided to use half of these links for coherency. To achieve this we reduced the MCUs from four on the T2 to two on the T2 Plus. As each of the MCUs was driving two FBDIMM channels, that freed up 4 channels for our coherency links, creating 4 coherency planes.


When the UltraSPARC T2 Plus is running at 1.2GHz the coherency links run at 4.0Gbps; for a 1.4GHz processor the links run at 4.8Gbps. There are 14 transmit and 14 receive links per channel. When running at 4.8Gbps the aggregate bandwidth is 8.4GB/s per channel per direction. The total across all 4 channels is therefore 33.6GB/s in each direction. Bottom line: this is a big, wide, low latency pipe between the UltraSPARC T2 Plus processors dedicated only to memory transfers.
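As a quick sanity check of that arithmetic, here is a minimal sketch in Python (my only assumption is that the quoted per-link Gbps figures are usable data rates, so bytes are simply bits divided by 8):

# Back-of-the-envelope check of the coherency link bandwidth quoted above.
LINKS_PER_CHANNEL = 14   # 14 transmit (or receive) links per channel
CHANNELS = 4             # the 4 coherency planes

def channel_gbytes_per_s(link_gbps):
    return LINKS_PER_CHANNEL * link_gbps / 8.0

for cpu_ghz, link_gbps in [(1.2, 4.0), (1.4, 4.8)]:
    per_channel = channel_gbytes_per_s(link_gbps)
    print(f"{cpu_ghz}GHz part: {per_channel:.1f} GB/s per channel, "
          f"{per_channel * CHANNELS:.1f} GB/s total per direction")

# 1.2GHz part: 7.0 GB/s per channel, 28.0 GB/s total per direction
# 1.4GHz part: 8.4 GB/s per channel, 33.6 GB/s total per direction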


The L2 cache on the UltraSPARC T2 Plus is organized the same as on the T2: 4MB, 16-way associative, in 8 banks with 64-byte cache lines. All 8 cores still see all of the L2 cache; just the number of MCUs has been reduced. Four banks of the L2 cache drive one MCU.
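For those who like to check the geometry, the same figures expressed as standard cache arithmetic (a sketch; nothing here beyond the numbers above):

# L2 geometry implied by the figures above: 4MB, 16-way, 8 banks, 64B lines.
CACHE_BYTES = 4 * 1024 * 1024
LINE_BYTES = 64
WAYS = 16
BANKS = 8

lines = CACHE_BYTES // LINE_BYTES   # 65536 cache lines
sets = lines // WAYS                # 4096 sets
print(lines, sets, sets // BANKS)   # 65536 4096 512 (sets per bank)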


The diagram below illustrates the architecture.



Memory Latency


So we have introduced the concept of local and remote memory, where the latency to access local memory is less than that of remote memory. The CU adds 76ns for a remote access and about 15ns for a local access.

This gives latency numbers from lmbench of:

Local = 150 nsec
Remote = 225 nsec

Bottom line: the architecture is NUMA, but not highly so. For instance, Solaris NUMA optimizations give an 8% - 10% improvement in performance.

Memory Bandwidth


There has been some discussion on the Web about the reduction in MCUs “dramatically” reducing the bandwidth of the UltraSPARC T2 Plus. This is pretty bogus.

Each UltraSPARC T2 Plus has a max theoretical bandwidth of:

Read: 21 GBytes/s
Write: 10.67 GBytes/s
Total (Read + Write): 32 GBytes/s

That's a theoretical max of 64GBytes/s across the two processors on a T5140 or T5240. We have tested over 30GBytes/s of read bandwidth using the STREAM benchmark on a T5240. That's roughly 75% of the theoretical read maximum.
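For reference, the arithmetic behind those numbers (a sketch; my reading is that the 75% figure compares STREAM read bandwidth against the two-socket read maximum):

# Two-socket totals from the per-processor figures above.
READ_PER_CPU = 21.0     # GBytes/s
TOTAL_PER_CPU = 32.0    # GBytes/s, read + write (rounded, as quoted)
SOCKETS = 2

read_max = READ_PER_CPU * SOCKETS     # 42 GB/s read across both sockets
total_max = TOTAL_PER_CPU * SOCKETS   # 64 GB/s read + write
measured_read = 30.0                  # "over 30 GB/s" via STREAM
print(read_max, total_max, f"{measured_read / read_max:.0%}")
# 42.0 64 71%  -- with "over 30" GB/s measured this approaches the ~75% quoted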

What's far more important for maximum bandwidth is the number of FBDIMMs available in a channel. As the number of internal banks/DRAMs increases, scheduling associativity increases and so does bandwidth. For instance, 4 DIMMs in a channel give significantly better bandwidth than just 2. On the T5240 a riser (called the Memory Mezzanine Kit) is available that enables a second row of 16 DIMMs. This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs. This configuration gives the maximum bandwidth.

Bottom line: the processor has a large pipe to memory and we can utilize most of this bandwidth.

Memory Interleaving

The T5140 / T5240 present a single memory address range to Solaris or any other OS. All memory on both processors is accessible through the CU. As mentioned previously, the L2 cache line is still 64 bytes and we interleave these lines across the 8 banks of the cache to achieve maximum memory bandwidth. When actually accessing memory, the MCU interleaves the 64-byte request across its two channels: 32 bytes of the data come from channel 0 and 32 bytes from channel 1.

The next interleave factor is 512 bytes, which is 8 banks x 64 bytes. In this mode the first 512 bytes will be placed in processor zero's memory, the next 512 bytes come from the memory attached to processor one, and so on. All memory accesses will be distributed evenly across both UltraSPARC T2 Plus processors. The effect on memory latency is to make it the average of 150ns and 225ns, at about 190ns. The upside of this is avoiding hotspots; the downside is that all applications pay the same latency penalty and there is a lot of interconnect traffic.

To take advantage of the NUMA optimizations in the Solaris OS we added another interleave factor of 1GB. In this mode memory is allocated in 1GB chunks per processor. On the T5140 and T5240 this is the default interleaving mode. The mapping information is passed to Solaris, which can then optimize memory placement. For instance, the stack and heap allocations can be placed on the same UltraSPARC T2 Plus as the process, thus taking advantage of the lower latency to local memory.

All these interleave modes are handled automatically by the hardware. When a memory request reaches the CU after missing the L2, a decision is made as to which node services it, based on the interleave factor; this is a very simple compare on the address.
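A sketch of that compare in Python (the bit positions are my assumption of the natural power-of-two slicing implied by the 64-byte lines, 8 banks, and the 512-byte and 1GB factors above, not a documented address map):

# Illustrative decode of which node and bank service a physical address.
def home_node(paddr, interleave="1GB"):
    if interleave == "512B":
        return (paddr >> 9) & 1    # nodes alternate every 512 bytes
    return (paddr >> 30) & 1       # nodes alternate every 1GB chunk

def l2_bank(paddr):
    return (paddr >> 6) & 7        # 64-byte lines spread over the 8 banks

def mcu_channel(byte_offset_in_line):
    return byte_offset_in_line // 32   # 32 bytes from each of the 2 channels

print(home_node(0x200, "512B"))    # 1: the second 512-byte block
print(home_node(1 << 30, "1GB"))   # 1: the second 1GB chunk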

The hardware has the ability to run a mixed environment as well, with some portion 512-byte interleaved and the rest 1GB interleaved, but this is not used in current systems.


Memory Configuration


To determine the configuration of the memory on your system, log in to the system controller as user admin and enter showcomponent. This will dump a line for each DIMM in the system, such as:


/SYS/MB/CMP0/MR0/BR0/CH0/D2
/SYS/MB/CMP0/MR0/BR0/CH0/D3
/SYS/MB/CMP0/MR0/BR0/CH1/D2

The key to this cryptic output is as follows:

MB is Motherboard

CMP0 is the first UltraSPARC T2 Plus processor
CMP1 is the second UltraSPARC T2 Plus processor

MR is the Memory Riser (Mezzanine)
MR0 extends the channels attached to the first processor
MR1 extends the channels attached to the second processor

On the ground floor there are 2 DIMMs per channel, D0 and D1
On the riser each channel is extended by 2 more DIMMs, D2 and D3

Each processor has 2 memory controllers, confusingly called Branches: BR0 and BR1
Each memory controller has 2 channels: CH0 and CH1

So..... (still with me)


4 DIMMs on a CPU - 1 DIMM on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0



8 DIMMs on a CPU - 2 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1




All 16 DIMMs on a CPU - 4 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0 - 8 on the ground floor

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1

MB/CMP0/MR0/BR0/CH0/D2 - head upstairs to the riser

MB/CMP0/MR0/BR0/CH0/D3

MB/CMP0/MR0/BR0/CH1/D2

MB/CMP0/MR0/BR0/CH1/D3

MB/CMP0/MR0/BR1/CH0/D2

MB/CMP0/MR0/BR1/CH0/D3

MB/CMP0/MR0/BR1/CH1/D2

MB/CMP0/MR0/BR1/CH1/D3
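If you want to decode these paths programmatically, here is a minimal sketch in Python (the path grammar is inferred purely from the examples above, with the MRn component present only for riser DIMMs):

import re

# Parse showcomponent DIMM paths such as /SYS/MB/CMP0/MR0/BR0/CH0/D2.
DIMM_RE = re.compile(
    r"/SYS/MB/CMP(?P<cpu>\d)(?:/MR(?P<riser>\d))?"
    r"/BR(?P<branch>\d)/CH(?P<channel>\d)/D(?P<dimm>\d)"
)

def parse_dimm(path):
    m = DIMM_RE.fullmatch(path)
    if not m:
        raise ValueError(f"not a DIMM path: {path}")
    d = {k: (None if v is None else int(v)) for k, v in m.groupdict().items()}
    d["location"] = "motherboard" if d["riser"] is None else "riser"
    return d

print(parse_dimm("/SYS/MB/CMP0/MR0/BR0/CH0/D2"))
# {'cpu': 0, 'riser': 0, 'branch': 0, 'channel': 0, 'dimm': 2, 'location': 'riser'}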




Thursday Apr 10, 2008

Overview of T2 Plus systems

Sun on Wednesday launched the next generation of CMT servers based on the UltraSPARC T2 Plus processor. The T2 Plus is the first CMT processor with built-in coherency links that enable 2-way servers. The new servers are called the T5140 and T5240. The T5140 is a 1U server and the T5240 is a 2U server. A picture of the T5240 is on the right. Each server comes with two T2 Plus processors linked together through the motherboard.


This is really exciting for us as it doubles the capacity of these servers while maintaining the same 1U and 2U form factors. This represents the absolute highest compute density of any server in the market.


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the processors are the same; both use Texas Instruments 65nm technology. There are still 8 cores, with 8 floating-point and crypto units and 16 integer pipelines on each processor, and an x8 PCI-E link integrated. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus. The coherency does not use a traditional bus but instead uses multiple high speed SERDES links running at 4.8Gbps between the chips to achieve optimal scaling.


First the stats on these servers:

The T5140 is a 1U server with 16 FBDIMM slots. There are 1GB, 2GB and 4GB DIMMs available, and we offer the following options:


2x 4 Core, 1.2GHz, 8GB memory

2x 6 Core, 1.2GHz, 16GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory


A T5140 can have up to 4 hot plug disks, either 73GB or 146GB SAS.



At the rear are up to 3 PCI-Express slots. Two of these can be 10Gig ports via Sun's XAUI cards shown on the right. Note that unlike the T2 based servers, the 10Gig on the T5140 is not integrated on the processor but is provided by a Neptune NIC on the motherboard.


As with all our CMT servers there are 4 inbuilt GigE ports, which are provided by the same onboard Neptune NIC. This is different from the T5120 and T5220, which used two Intel “Ophir” NICs for the 1Gig connectivity.


The server has 2 Hot plug power supplies at 720 Watts each.


Below is a photo of the T5140. Note the two T2 Plus processors under the larger copper heat sinks. The interconnect between the processors is wired through the motherboard. The other copper heat sink, in the bottom right hand corner, is the Neptune NIC.



The T5240 is a 2U server with up to 32 FBDIMM slots.


We offer the following options with 8 or 16 DIMMs:


2x 6 Core, 1.2GHz, 8GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory

2x 8 Core, 1.4GHz, 64GB memory


In addition you can buy a riser (called the Memory Mezzanine Kit) that will accommodate another 16 DIMMs; there is a photo of the riser on the right.

This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs.


By default the T5240 can have up to 8 hot plug disks, again either 73GB or 146GB SAS. In addition when ordering the system a different disk backplane can be selected that can accommodate up to 16 disks.


The system has 6 PCI-Express slots, and just like the T5140, two of these slots can be 10Gig ports via Sun's XAUI cards using the onboard Neptune NIC. Also like the T5140, the Neptune provides the 4 built-in GigEs.


By default the T5240 has two hot plug power supplies at 1100 watts each. For the 1.4GHz and 16-disk configs, however, you need to use 220V AC input.





Below is a photo of the T5240 without a riser card; notice the motherboard is exactly the same as the T5140's. The server has larger fans, however, to accommodate the extra memory and potentially more disks and PCIE cards. Note the larger power supplies are stacked one on top of the other, as opposed to side by side on the T5140.




As mentioned, the T5140 and T5240 have bigger power supplies than the corresponding T5120 and T5220 as they need to support two processors. Power supplies from the T5120 and T5220 should not be used for these servers.


The T5140 and T5240 do, however, use the exact same FBDIMM memory as the T5120 / T5220, and support the exact same PCI-E and XAUI cards.


As noted earlier, networking on the T5140 and T5240 is interesting, as the two 10Gig XAUI ports and the 4x GigE share the same Neptune NIC. If you use a XAUI card, one of the GigE ports is disabled. Look at the diagram below, which highlights the two XAUI and GigE ports.


If you insert a XAUI card into XAUI 0 then the GigE port net1 will be disabled.


If you insert a XAUI card into XAUI 1 then the GigE port net0 will be disabled.


The Solaris driver for all these ports is nxge, so device nxge0 can be either a 10Gig or a 1Gig port.
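Those two rules are simple enough to state as code; a small sketch of the mapping (the slot and net port names are as used in this post):

# Active built-in GigE ports given the installed XAUI cards, per the rules
# above: a card in XAUI 0 disables net1, a card in XAUI 1 disables net0.
XAUI_DISABLES = {0: "net1", 1: "net0"}

def active_gige_ports(installed_xaui_slots):
    disabled = {XAUI_DISABLES[slot] for slot in installed_xaui_slots}
    return [p for p in ("net0", "net1", "net2", "net3") if p not in disabled]

print(active_gige_ports([0]))     # ['net0', 'net2', 'net3']
print(active_gige_ports([0, 1]))  # ['net2', 'net3']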





One neat feature of the T5240 and T5140 is that each of the UltraSPARC T2 Plus processors has its own integrated x8 PCI-E link. Having two processors therefore automatically doubles the PCI-Express bandwidth: the T5x40 servers have twice the I/O bandwidth of their T5x20 equivalents. The PCI-E slots are wired so that the I/O is spread evenly across the x8 ports of both UltraSPARC T2 Plus processors.


For more information check out our whitepaper at http://www.sun.com/servers/coolthreads/t5140/wp.pdf







