Memory and coherency on the UltraSPARC T2 Plus Processor
By denissheahan on Apr 11, 2008
As the name suggests the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the layout in silicon of UltraSPARC T2 Plus is the same as the original UltraSPARC T2. Both use Texas Instruments 65nm technology and have similar packaging and pin layout. There are still 8 cores with 8 Floating point and crypto units and 16 integer pipelines on each processor and a x8 PCI-E link integrated. The main differences are the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus.
The coherency unit (CU) sits between the L2 cache and the memory controller Units (MCU) that drive the FBDIMM channels. The CU snoops all requests for memory and communicates with the CU on the second UltraSPARC T2 Plus processor to resolve all accesses. If data is on the remote processor its CU will forward the data
The coherency does not use a traditional bus such as on a v890 but instead uses multiple high speed SERDES links running between chips to achieve optimal scaling. We already had high speed SERDES links coming from T2 to connect to FBDIMM and after simulating numerous scenarios we decided to use half of these links for the coherency. To achieve this we reduced the MCUs from four on T2 to two on the T2 Plus. As each of the two MCUs were driving two FBDIMM channels that gave us 4 channels for our coherency links creating 4 coherency planes.
When the UltraSPARC T2 Plus is running at 1.2GHz the coherency links run at 4.0Gbps, for a 1.4GHz processor the links run at 4.8Gbps. There are 14 transmit and 14 receive links per channel. When running at 4.8Gbp the aggregate bandwidth is 8.4GB/s per channel per direction. The total across all 4 channels is therefore 33.6GB/s in each direction. Bottom line this is a big wide low latency pipe between the UltraSPARC T2 Plus processors dedicated only to memory transfers.
The L2 cache on the UltraSPARC T2 Plus is organized the same as the T2, 4MB, 16 way associative in 8 banks with 64 byte cache lines. All 8 cores still see all the L2 cache, just the number of MCUs have been reduced. Four banks from the L2 cache drive one MCUs.
The diagram below illustrates the architecture
So we have introduced the concept of local and remote memory where latency to access local memory is less than that of remote memory. The CU adds 76ns for the remote access and about 15ns for a local access
This gives a latency number from lmbench of :
Local = 150 nsec
Remote = 225 nsec
Bottom line the architecture is NUMA but not highly so. For instance Solaris NUMA optimizations give 8% - 10% improvement in performance.
There has been some discussion on the Web about the reduction of MCUs “dramatically” reducing the bandwidth of the UltraSPARC T2 Plus. This is pretty bogus.
Each UltraSPARC T2 Plus has a max theoretical bandwidth of:
Read 21 GBytes/s
Total Read + Write 32 GBytes/s
Thats a theoretical max 64GBytes/s across the two processors on a T5140 or T5240. We have tested over 30GBytes a second read bandwidth using the STREAM benchmark on a T5240. Thats 75% of the max theoretical bandwidth.
What's far more important for maximum bandwidth are the number of FBDIMMs available in a channel. As the number of internal-banks/drams increase scheduling associativity increase and so does bandwidth. For instance 4 DIMMs in a channel gives significantly better bandwidth than just 2 . On the T5240 a riser (called the Memory Mezzanine Kit) is available that enables a second row of 16 DIMMs. This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs. This configuration gives the maximum bandwidth.
Bottom line the processor has a large pipe to memory and we can utilize most of this bandwidth
The T5120 / T5240 present a single memory address range to Solaris or any other OS. All memory on both processors is accessible through the CU. As mentioned previously the L2 cache line is still 64 bytes and we interleave these lines across the 8 banks on the cache to achieve maximum memory bandwidth. When actually accessing memory the MCU interleaves the 64 byte request across its two channels, 32 bytes of the data comes from channel 0 and 32 bytes from channel 1
The next interleave factor is 512 bytes which is 8 banks x 64 bytes. In this mode the first 512 bytes will be placed in processor zero's memory, next 512 bytes come from the memory attached to processor one etc. All memory accesses will be distributed evenly across both the UltraSPARC T2 Plus processors. The effect on the memory latency is to make it the average of 150ns and 225ns at about 190ns. The upside of this is avoiding hotspots, the downside is all applications pay the same latency penalty and there is a lot of interconnect traffic.
To take advantage of NUMA optimizations in the Solaris OS we added another interleave factor of 1GB. In this mode memory is allocated in 1GB chunks per processor. On T5140 and T5240 this is the default interleaving mode. The mapping information is passed to Solaris which can then optimize memory placement. For instance the stack and heap allocations can be placed on the same UltraSPARC T2 Plus as the process thus taking advantage of lower latency to local memory.
All these interleaf modes are handled automatically by the hardware. When a memory request reaches the CU after missing the L2 a decision is made which node to get it serviced from based on the interleaving factor which is a very simple compare based on address.
The hardware has the ability to have a mixed environment as well, some portion 512 byte interleaved and the rest 1GB interleaved but this is not used in current systems.
To determine the configuration of the memory on your system login to the system controller as user admin and enter showcomponent. This will dump a line for each DIMM in the system such as
The key to this cryptic output is as follows
CMP0 is the first UltraSPARC T2 Plus processor
CMP1 is the second UltraSPARC T2 Plus processor
MR is the Memory Riser (Mezzaine)
MR0 extends the channels attached to the first processor
MR1 extends the channels attached to the first processor
On the ground floor there are 2 DIMMs per channel, D0 and D1
On the riser each channel is extended 2 more DIMMs D2 and D3
Each processor has 2 Memory controllers called confusingly Branches BR0 and BR1
Each memory Controller has 2 channels CH0 and CH1
So..... (still with me)
4 DIMMs on a CPU 0 - 1 DIMM on each channel of each controller
8 DIMMS on a CPU - 2 DIIMMs on each channel of each controller
All 16 on a CPU 4 DIIMMs on each channel of each controller
MB/CMP0/BR0/CH0/D0 - 8 on the ground floor
MB/CMP0/MR0/BR0/CH0/D2 - head upstairs to the riser
MB/CMP0/MR0/BR1/CH0/D2 MB/CMP0/MR0/BR1/CH0/D3 MB/CMP0/MR0/BR1/CH1/D2 MB/CMP0/MR0/BR1/CH1/D3
MB/CMP0/MR0/BR1/CH0/D3 MB/CMP0/MR0/BR1/CH1/D2 MB/CMP0/MR0/BR1/CH1/D3