Monday Oct 13, 2008

Zambezi Architecture

Today we are launching our newest CMT server, the T5440. This server is a monster of performance and scalability. It can have up to four UltraSPARC T2 Plus processors, at 1.2GHz or 1.4GHz, in a 4 Rack Unit (RU) chassis.

The glue for this system is a new ASIC from Sun code-named Zambezi. The Zambezi is a coherency hub that enables the UltraSPARC T2 Plus processor to scale from 1-socket to 4-socket systems. The official name of the ASIC is the UltraSPARC T2 Plus XBR, but that is a bit of a mouthful so we will stick with Zambezi.

The main functions of the Zambezi are:
  • Broadcasts snoop requests to all sockets

  • Serializes requests to the same address

  • Consolidates all snoop responses


The Zambezi uses the same coherence protocol as the T5140/T5240, which I described in a previous blog at http://blogs.sun.com/deniss/. Communication is over point-to-point serial links that are implemented on top of the FBDIMM low-level protocol. The full implementation was described in a paper at the 2008 IEEE Symposium on High-Performance Interconnects: http://www.hoti.org/hoti16/program/2008slides/Session_Presentation/Feehrer_CoherencyHub_2008-08-27-09-51.pdf

CMT systems require a large amount of memory bandwidth in order to scale. This was one of the main rationales for moving to FBDIMM with the UltraSPARC T2, as it provides 3x the memory bandwidth of the DDR2 interface used in the UltraSPARC T1. The biggest challenge for Zambezi was to deliver this bandwidth across 4 sockets and avoid becoming the bottleneck to scaling.

Coming out of each UltraSPARC T2 Plus processor are 4 independent coherence planes. The T2 Plus has 8 banks of L2 cache and each plane is responsible for the traffic from two of these banks. The plane is identified by two bits (12 and 13) of the physical address. There are 4 Zambezi hubs in the system, each handling a single coherence plane. Each Zambezi is connected to each of the four T2 Plus processors over four separate point-to-point serial coherence links, one per processor. Because the planes are independent there are no connections between the Zambezi chips.
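
To make the plane selection concrete, here is a minimal C sketch (my own illustration; the helper name is hypothetical) of how a physical address maps to one of the four planes using bits 12 and 13:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: select the coherence plane (and hence the Zambezi hub)
 * from bits 13:12 of the physical address, as described above. Each plane
 * also covers two of the eight L2 banks. */
static unsigned plane_of(uint64_t pa)
{
    return (unsigned)((pa >> 12) & 0x3);   /* bits 13:12 -> plane 0..3 */
}

int main(void)
{
    uint64_t samples[] = { 0x0000, 0x1000, 0x2000, 0x3000, 0x4000 };
    int n = sizeof(samples) / sizeof(samples[0]);

    for (int i = 0; i < n; i++)
        printf("PA 0x%04llx -> plane %u\n",
               (unsigned long long)samples[i], plane_of(samples[i]));
    return 0;
}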

A diagram of the architecture is shown on the right.












The primary goal for Zambezi was to add the minimum possible latency for data crossing the ASIC. The final Zambezi latency achieved was a mere 33.1ns, which exceeded our expectations. Approximately a third of the 33ns is serialization/deserialization overhead, and another third is link-layer overhead (framing composition/decomposition, CRC check/generate, etc.).

Note the Zambezi is involved even for local access to memory in order to resolve conflicts. The local memory latency for the T5440, at 229ns, is therefore slightly higher than on the T5140 and T5240. The remote memory latency for the T5440 is 311ns. This makes the T5440 a NUMA machine, but not highly so.

The bandwidth achieved by the Zambezi is extremely impressive. The theoretical bandwidth from our simulations is 84GB/s read, 42GB/s write and 126GB/s combined.
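
As a rough cross-check (my own back-of-the-envelope arithmetic, assuming each socket contributes the per-processor theoretical figures quoted in the T2 Plus post further down), the system numbers line up with four sockets' worth of read and write bandwidth:

#include <stdio.h>

int main(void)
{
    /* Back-of-the-envelope: assume each of the 4 sockets contributes the
     * UltraSPARC T2 Plus per-processor theoretical bandwidth (21 GB/s read,
     * 10.67 GB/s write), quoted in the T2 Plus post further down. */
    double read_per_socket  = 21.0;    /* GB/s */
    double write_per_socket = 10.67;   /* GB/s */
    int    sockets          = 4;

    double read  = sockets * read_per_socket;
    double write = sockets * write_per_socket;

    /* Prints roughly 84 / 43 / 127, close to the 84 / 42 / 126 GB/s above. */
    printf("read %.1f GB/s, write %.1f GB/s, combined %.1f GB/s\n",
           read, write, read + write);
    return 0;
}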

Translating these speeds and feeds to the real world, the T5440 has achieved many world record benchmark results. These results all demonstrate the highest levels of throughput and scalability, and Zambezi is THE key component in that scalability.




The ASIC is built in 65nm technology from Texas Instruments, has about 3.6 million gates and a die size of about 79 square mm, and runs at 800MHz. The photo on the right shows the Zambezi floorplan.













T5440 Architecture

Today we are launching our newest CMT server, the T5440. This server is a monster of performance and scalability. It can have up to four UltraSPARC T2 Plus processors, at 1.2GHz or 1.4GHz, in a 4 Rack Unit (RU) chassis. The design is modular, enabling 1, 2, 3 and 4 processor configurations. The system scales to 32 cores, 256 threads and 512 GB of memory.

The diagram on the right shows the architecture. Each UltraSPARC T2 Plus processor has 4 coherency planes.

On a T5440 there are four high speed hub chips, called Zambezi, running at 800MHz, one for each of the four coherency planes. Each UltraSPARC T2 Plus processor communicates with all the others in the system via the Zambezis. Each processor has its own local memory and can access all the remote memory of the other processors via the Zambezis, so the memory in the system scales as we add processors. Each processor also has its own integrated x8 PCI-Express link, so the I/O bandwidth also scales as we add processors.

The scalability advantage of this architecture was highlighted at the 2008 IEEE Symposium on High-Performance Interconnects: http://www.hoti.org/archive/2008papers/2008_S2_1.pdf. There is also a set of slides at http://www.hoti.org/hoti16/program/2008slides/Session_Presentation/Feehrer_CoherencyHub_2008-08-27-09-51.pdf

The design of the T5440 system itself is very different from our previous systems, which have all been traditional 1U and 2U servers.

The T5440 is 4 RU and the processors and memory are on daughter cards that plug into a motherboard. The motherboard contains the Zambezis and the I/O subsystem. The daughter cards are configured in groups of two: one cpu card and one memory card per group. The memory cards are optional, however, as the cpu card has enough DIMM slots for a minimum memory configuration.

The photo on the right shows a cpu daughter card, which contains one UltraSPARC T2 Plus processor and slots for 4 FBDIMMs. The UltraSPARC T2 Plus has 2 memory controllers, each driving 2 FBDIMM channels, and the DIMM slots on the cpu daughter card are directly connected to these 4 channels.

The memory board, shown in the photo on the left, contains 12 slots for FBDIMM memory. The memory daughter card extends the 4 FBDIMM channels from the associated UltraSPARC T2 Plus processor; each channel can be extended by up to 3 more DIMMs.

DIMMs come in 2GB, 4GB and 8GB sizes and run at 667MHz just like the T5140 and T5240 servers.

The minimum requirement is 4 DIMMs per processor. Other options are 8 or 16 DIMMs per processor.

Currently all processors must have the same amount of memory, but each processor can reach that amount with different sized DIMMs.
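
A quick bit of arithmetic (mine, assuming the largest 8GB DIMMs throughout) shows how these DIMM counts translate into capacity:

#include <stdio.h>

int main(void)
{
    /* DIMM counts per processor from the text above: the cpu card has 4
     * slots and its optional memory card adds 12 more, for a max of 16. */
    int dimms_per_cpu[] = { 4, 8, 16 };
    int max_dimm_gb     = 8;     /* largest DIMM size offered */
    int processors      = 4;

    for (int i = 0; i < 3; i++)
        printf("%2d DIMMs/CPU x %dGB x %d CPUs = %3d GB max\n",
               dimms_per_cpu[i], max_dimm_gb, processors,
               dimms_per_cpu[i] * max_dimm_gb * processors);
    /* 16 DIMMs/CPU x 8GB x 4 CPUs = 512 GB, the system maximum. */
    return 0;
}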



From these photos you can see that the connectors on the cpu and memory daughter cards are different, so the cards can only go in their preassigned slots. This can be seen in the photo of the motherboard: the cpus plug into the longer slots and the memory cards into the shorter ones.







Slots must be filled in the following order: CPU/MEM pair 0 and pair 1, followed by pair 2 and pair 3. Standing in front of the server, the slots are numbered:

front left 3 - 1 - 2 - 0 front right.

The motherboard photograph also shows the 8 PCI-E slots. Although all are x8 electrically, two of the slots (the longer ones) have x16 physical connectors.

Note that PCI-E cards on the T5440 plug in vertically. This is a change from the T5120, T5220, T5140 and T5240 servers, which used risers and installed PCI-E cards horizontally.

Note also that two of the PCI-E slots have a second, smaller slot next to them. These are the XAUI slots. Sun-supplied XAUI cards plug into these slots and provide 10Gig networking.








The diagram on the right shows a fully configured T5440 with all its daughter cards and how they are connected to the Zambezis and the I/O subsystem. The beauty of this physical design is upgradability: you can start with a 1-cpu configuration and add extra compute and memory as required.

'Blanks' need to be installed in place of empty slots to ensure appropriate airflow.











A top view of a fully loaded T5440 is shown in the photo on the right. From bottom to top are the 4 large fans for cooling, the row of cpu and memory daughter cards, and the 8 PCI-E I/O slots. The separate circuit board is the Service Processor for the system, which also plugs in as a daughter card.








In the photo on the left we see the T5440 from the rear. Here you see the 4 power supplies, each of which is 1,120 watts. To power a fully loaded T5440 with 4 cpus, 4 disks and 512GB of memory requires 2 supplies. Four supplies are required for full redundancy in this configuration.

Also note the metal fillers that cover the 8 PCI-E slots, the built-in 4x1Gig copper network connections, the 100Mb and serial links to the Service Processor, USB etc. Also in this photo you can see one of the 4 fans pulled out of the system.












Finally looking at the front of the system we see the 4 built-in disks. These are SAS and can be either 73GB or 146GB. There is also a DVD if required.

Friday Apr 11, 2008

Memory and coherency on the UltraSPARC T2 Plus Processor


Coherency Architecture


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the silicon layout of the UltraSPARC T2 Plus is the same as the original UltraSPARC T2. Both use Texas Instruments 65nm technology and have similar packaging and pin layout. There are still 8 cores with 8 floating point and crypto units and 16 integer pipelines on each processor, and an x8 PCI-E link is integrated. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus.


The Coherency Unit (CU) sits between the L2 cache and the Memory Controller Units (MCUs) that drive the FBDIMM channels. The CU snoops all requests for memory and communicates with the CU on the second UltraSPARC T2 Plus processor to resolve all accesses. If the data is on the remote processor, its CU will forward the data.


The coherency does not use a traditional bus such as on a v890 but instead uses multiple high speed SERDES links running between the chips to achieve optimal scaling. We already had high speed SERDES links on the T2 to connect to FBDIMM, and after simulating numerous scenarios we decided to use half of these links for coherency. To achieve this we reduced the number of MCUs from four on the T2 to two on the T2 Plus. As each of the removed MCUs was driving two FBDIMM channels, that freed 4 channels for our coherency links, creating 4 coherency planes.


When the UltraSPARC T2 Plus is running at 1.2GHz the coherency links run at 4.0Gbps; for a 1.4GHz processor the links run at 4.8Gbps. There are 14 transmit and 14 receive links per channel. When running at 4.8Gbps the aggregate bandwidth is 8.4GB/s per channel per direction, so the total across all 4 channels is 33.6GB/s in each direction. Bottom line, this is a big, wide, low latency pipe between the UltraSPARC T2 Plus processors dedicated only to memory transfers.
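
For the curious, here is the raw arithmetic behind those numbers (my own sketch; it ignores any link-layer framing overhead):

#include <stdio.h>

int main(void)
{
    /* Raw arithmetic for the coherence link bandwidth quoted above
     * (1.4GHz part; ignores link-layer framing overhead). */
    double lane_gbps = 4.8;   /* per-lane signalling rate */
    int    lanes     = 14;    /* transmit (or receive) lanes per channel */
    int    channels  = 4;     /* one channel per coherence plane */

    double per_channel = lane_gbps * lanes / 8.0;   /* Gbits -> GBytes */

    printf("per channel: %.1f GB/s per direction\n", per_channel);            /* 8.4  */
    printf("all planes : %.1f GB/s per direction\n", per_channel * channels); /* 33.6 */
    return 0;
}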


The L2 cache on the UltraSPARC T2 Plus is organized the same as on the T2: 4MB, 16-way associative, in 8 banks with 64 byte cache lines. All 8 cores still see all of the L2 cache; just the number of MCUs has been reduced. Four banks of the L2 cache drive each MCU.


The diagram below illustrates the architecture

















Memory Latency


So we have introduced the concept of local and remote memory, where the latency to access local memory is less than that of remote memory. The CU adds 76ns for a remote access and about 15ns for a local access.

This gives latency numbers from lmbench of:

Local = 150 nsec
Remote = 225 nsec

Bottom line, the architecture is NUMA but not highly so. For instance, Solaris NUMA optimizations give an 8% - 10% improvement in performance.

Memory Bandwidth


There has been some discussion on the Web about the reduction of MCUs “dramatically” reducing the bandwidth of the UltraSPARC T2 Plus. This is pretty bogus.

Each UltraSPARC T2 Plus has a max theoretical bandwidth of:

Read 21 GBytes/s
Write 10.67GBytes/s
Total Read + Write 32 GBytes/s

That's a theoretical max of 64GBytes/s across the two processors on a T5140 or T5240. We have tested over 30GBytes a second of read bandwidth using the STREAM benchmark on a T5240; that's roughly 75% of the max theoretical read bandwidth.

What's far more important for maximum bandwidth is the number of FBDIMMs available in a channel. As the number of internal banks/DRAMs increases, scheduling associativity increases and so does bandwidth. For instance, 4 DIMMs in a channel give significantly better bandwidth than just 2. On the T5240 a riser (called the Memory Mezzanine Kit) is available that enables a second row of 16 DIMMs. This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs. This configuration gives the maximum bandwidth.

Bottom line, the processor has a large pipe to memory and we can utilize most of this bandwidth.

Memory Interleaving

The T5140 / T5240 present a single memory address range to Solaris or any other OS. All memory on both processors is accessible through the CU. As mentioned previously, the L2 cache line is still 64 bytes and we interleave these lines across the 8 banks of the cache to achieve maximum memory bandwidth. When actually accessing memory the MCU interleaves the 64 byte request across its two channels: 32 bytes of the data come from channel 0 and 32 bytes from channel 1.

The next interleave factor is 512 bytes, which is 8 banks x 64 bytes. In this mode the first 512 bytes are placed in processor zero's memory, the next 512 bytes come from the memory attached to processor one, and so on. All memory accesses are distributed evenly across both UltraSPARC T2 Plus processors. The effect on memory latency is to make it the average of 150ns and 225ns, about 190ns. The upside of this is avoiding hotspots; the downside is that all applications pay the same latency penalty and there is a lot of interconnect traffic.

To take advantage of NUMA optimizations in the Solaris OS we added another interleave factor of 1GB. In this mode memory is allocated in 1GB chunks per processor. On the T5140 and T5240 this is the default interleaving mode. The mapping information is passed to Solaris, which can then optimize memory placement. For instance the stack and heap allocations can be placed on the same UltraSPARC T2 Plus as the process, taking advantage of the lower latency to local memory.

All these interleave modes are handled automatically by the hardware. When a memory request reaches the CU after missing the L2, a decision is made about which node services it based on the interleaving factor; this is a very simple compare based on the address.

The hardware also has the ability to support a mixed environment, with some portion of memory 512-byte interleaved and the rest 1GB interleaved, but this is not used in current systems.
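
Here is a minimal sketch (my own illustration, not the actual CU logic) of that address compare for the two-socket T5140/T5240 case, plus the expected latency under 512-byte interleaving:

#include <stdio.h>
#include <stdint.h>

/* Minimal sketch (hypothetical helper, two-socket T5140/T5240 case) of the
 * address compare described above: which processor's memory services a
 * given physical address under each interleave factor. */
static int home_node(uint64_t pa, uint64_t interleave_bytes)
{
    return (int)((pa / interleave_bytes) & 0x1);   /* 2 nodes: 0 or 1 */
}

int main(void)
{
    uint64_t samples[] = { 0x200ULL, 0x40000000ULL };   /* 512 bytes, 1GB */

    for (int i = 0; i < 2; i++)
        printf("PA 0x%09llx: 512B interleave -> node %d, 1GB interleave -> node %d\n",
               (unsigned long long)samples[i],
               home_node(samples[i], 512),
               home_node(samples[i], 1ULL << 30));

    /* With 512-byte interleaving, accesses land on either processor with
     * equal probability, so the expected latency is roughly the average of
     * the local and remote figures quoted above. */
    printf("expected latency ~ %.0f ns\n", (150.0 + 225.0) / 2.0);   /* ~190ns */
    return 0;
}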


Memory Configuration


To determine the memory configuration of your system, log in to the system controller as user admin and enter showcomponent. This will dump a line for each DIMM in the system, such as:


/SYS/MB/CMP0/MR0/BR0/CH0/D2
/SYS/MB/CMP0/MR0/BR0/CH0/D3
/SYS/MB/CMP0/MR0/BR0/CH1/D2

The key to this cryptic output is as follows

MB is Motherboard

CMP0 is the first UltraSPARC T2 Plus processor
CMP1 is the second UltraSPARC T2 Plus processor

MR is the Memory Riser (Mezzanine)
MR0 extends the channels attached to the first processor
MR1 extends the channels attached to the second processor

On the ground floor there are 2 DIMMs per channel, D0 and D1
On the riser each channel is extended by 2 more DIMMs, D2 and D3

Each processor has 2 memory controllers, confusingly called Branches: BR0 and BR1
Each memory controller has 2 channels: CH0 and CH1

So..... (still with me)


4 DIMMs on a CPU - 1 DIMM on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0



8 DIMMs on a CPU - 2 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1




All 16 DIMMs on a CPU - 4 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0 - 8 on the ground floor

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1

MB/CMP0/MR0/BR0/CH0/D2 - head upstairs to the riser

MB/CMP0/MR0/BR0/CH0/D3

MB/CMP0/MR0/BR0/CH1/D2

MB/CMP0/MR0/BR0/CH1/D3

MB/CMP0/MR0/BR1/CH0/D2

MB/CMP0/MR0/BR1/CH0/D3

MB/CMP0/MR0/BR1/CH1/D2

MB/CMP0/MR0/BR1/CH1/D3
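
For fun, here is a small C sketch (entirely my own, with simplified parsing; the real showcomponent output has more fields) that decodes a DIMM path using the key above:

#include <stdio.h>

/* Minimal sketch: decode a showcomponent DIMM path using the key above.
 * Parsing is simplified and purely illustrative; the riser (MRx) is optional. */
static void decode(const char *path)
{
    int cmp = -1, mr = -1, br = -1, ch = -1, d = -1;

    if (sscanf(path, "/SYS/MB/CMP%d/MR%d/BR%d/CH%d/D%d",
               &cmp, &mr, &br, &ch, &d) == 5)
        printf("%s -> processor %d, riser %d, branch %d, channel %d, DIMM %d\n",
               path, cmp, mr, br, ch, d);
    else if (sscanf(path, "/SYS/MB/CMP%d/BR%d/CH%d/D%d",
                    &cmp, &br, &ch, &d) == 4)
        printf("%s -> processor %d, on the ground floor, branch %d, channel %d, DIMM %d\n",
               path, cmp, br, ch, d);
    else
        printf("%s -> unrecognised path\n", path);
}

int main(void)
{
    decode("/SYS/MB/CMP0/BR0/CH0/D0");
    decode("/SYS/MB/CMP0/MR0/BR0/CH0/D2");
    return 0;
}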




Thursday Apr 10, 2008

Overview of T2 Plus systems

Sun on Wednesday launched the next generation of CMT servers based on the UltraSPARC T2 Plus processor. The T2 Plus is the first CMT processor with built-in coherency links, which enable 2-way servers. The new servers are called the T5140 and T5240. The T5140 is a 1U server and the T5240 is a 2U server. A picture of the T5240 is on the right. Each server comes with two T2 Plus processors linked together through the motherboard.


This is really exciting for us as it doubles the capacity of these servers while maintaining the same 1U and 2U form factors. This represents the absolute highest compute density of any server on the market.


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The two processors are largely the same; both use Texas Instruments 65nm technology. There are still 8 cores with 8 floating point and crypto units and 16 integer pipelines on each processor, and an x8 PCI-E link is integrated. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus. The coherency does not use a traditional bus but instead uses multiple high speed SERDES links running at 4.8Gbps between the chips to achieve optimal scaling.


First, the stats on these servers.

The T5140 is a 1U server with 16 FBDIMM slots. There are 1GB, 2GB and 4GB DIMMs available and we offer the following options:


2x 4 Core, 1.2GHz, 8GB memory

2x 6 Core, 1.2GHz, 16GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory


A T5140 can have up to 4 hot plug disks, either 73GB or 146GB SAS



At the rear there are up to 3 PCI-Express slots. Two of these can be 10Gig ports via Sun's XAUI cards, shown on the right. Note that unlike the T2 based servers, the 10Gig on the T5140 is not integrated on the processor but is provided by a Neptune NIC on the motherboard.


As with all our CMT servers there are 4 inbuilt GigE ports, which are provided by the same onboard Neptune NIC. This is different from the T5120 and T5220, which used two Intel “Ophir” NICs for the 1Gig connectivity.


The server has 2 Hot plug power supplies at 720 Watts each.


Below is a photo of the T5140. Note the two T2 Plus processors under the larger copper heat sinks. The interconnect between the processors is wired through the motherboard. The other copper heat sink, in the bottom right hand corner, is the Neptune NIC.



The T5240 is a 2U server with up to 32 FBDIMM slots.


We offer the following options with 8 or 16 DIMMs


2x 6 Core, 1.2GHz, 8GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory

2x 8 Core, 1.4GHz, 64GB memory


In addition you can buy a riser (called the Memory Mezzanine Kit) that will accommodate another 16 DIMMs; there is a photo of the riser on the right.

This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs


By default the T5240 can have up to 8 hot plug disks, again either 73GB or 146GB SAS. In addition when ordering the system a different disk backplane can be selected that can accommodate up to 16 disks.


The system has 6 PCI-Express slots and, just like the T5140, two of these slots can be 10Gig ports via Sun's XAUI cards using the onboard Neptune NIC. Also like the T5140, the Neptune provides the 4 built-in GigE ports.


By default the T5240 has two hot plug power supplies of 1100 watts each. For the 1.4GHz and 16 disk configs, however, you need to use 220V AC input.





Below is a photo of the T5240 without a riser card; notice the motherboard is exactly the same as the T5140's. The server has larger fans, however, to accommodate the extra memory and potentially more disks and PCI-E cards. Note the larger power supplies are stacked one on top of the other, as opposed to side by side on the T5140.




As mentioned the T5140 and T5240 have bigger power supplies than the corresponding T5120 and T5220 as they need to support two processors. Power supplies from T5120 and T5220 should not be used for these servers.


The T5140 and T5240 do, however, use the exact same FBDIMM memory as the T5120 / T5220, and support the exact same PCI-E and XAUI cards.


As noted earlier, networking on the T5140 and T5240 is interesting as the two 10Gig XAUI ports and the 4 GigE ports share the same Neptune NIC. If you use a XAUI card, one of the GigE ports is disabled, as the diagram below of the XAUI and GigE ports highlights.


If you insert a XAUI card into XAUI 0 then the GigE port net1 will be disabled


If you insert a XAUI card into XAUI 1 then the GigE port net0 will be disabled


The Solaris Driver for all these ports is nxge, so device nxge0 can be either a 10Gig or a 1Gig port





One neat feature of the T5240 and T5140 is that each of the UltraSPARC T2 Plus processors has its own integrated x8 PCI-E link. Having two processors therefore automatically doubles the PCI-Express bandwidth. The T5x40 servers have twice the I/O bandwidth of their T5x20 equivalents. The PCI-E slots are wired so that the I/O is spread evenly across the x8 ports on both UltraSPARC T2 Plus processors.


For more information check out our whitepaper at http://www.sun.com/servers/coolthreads/t5140/wp.pdf








Tuesday Oct 09, 2007

Floating Point performance on the UltraSPARC T2 processor

UltraSPARC T1 Floating Point Performance

The first generation of CMT processor, the UltraSPARC T1, had a single floating point unit shared between 8 cores. This tradeoff was made to keep the chip a reasonable size in 90nm silicon technology. We also knew that most commercial applications have little or no floating point content so it was a calculated risk.

The T1 floating point unit has some other limitations: it is on the other side of the crossbar from all the cores, so there is a 40 cycle penalty to access the unit; only one thread can use it at a time; and some of the more obscure FP and VIS instructions were not implemented in silicon but emulated in software. Even though there are few FP instructions in commercial applications, lack of floating point performance is a difficult concept for folks to grasp; they don't usually give it much thought. To help ease the transition to T1 we created a tool, cooltst, available from www.opensparc.net/cooltools, that gives a simple indication of the percentage of floating point in an application using the hardware performance counters available on most processors.

The limited floating point capability on T1 did have a number of downsides

  • It gave our major competitors FUD ammunition

  • It was confusing to customers who had never had to previously test FP content in their applications

  • It excluded T1 systems from many of the major Financial applications such as Monte Carlo and Risk Management

  • It also excluded the T1 from the High Performance Computing (HPC) space

UltraSPARC T2 Floating Point Performance

The next generation of CMT, the UltraSPARC T2, that we are releasing in systems today, is built in 65nm technology. This process gives nearly 40% more circuits in the same die size. One of the priorities for the design was to fix the limited floating point capability. This was achieved as follows:

  • There is now a single Floating Point Unit (FPU) per core, for a total of 8 on chip. Each FPU is shared by all 8 threads in the core

  • Each FPU actually has 3 separate pipelines, Add/Multiply which handles most of the floating point operations, a graphics pipeline that implements VIS 2.0 and a Divide/Square root pipeline (see diagram below)

  • The first 2 pipelines are 12 stage (see diagram on the right) and fully pipelined to accommodate many threads at the same time. The Divide/Sqrt has a variable latency depending on the operation

  • Access to the FPU is in 6 cycles (compared to 40 on the T1)

  • All FP and VIS 2.0 instructions are implemented in silicon, no more software emulation

  • The FPU also enhances integer performance by performing integer multiply, divide and popc

  • The FPU also enhances Cryptographic performance by providing arithmetic functions to the per core Crypto units.

The theoretical max raw performance of the 8 floating point units is 11 GFlops/s. A huge advantage over other implementations, however, is that 64 threads can share the units and thus we can achieve an extremely high percentage of the theoretical peak. Our experiments have achieved nearly 90% of the 11 GFlops/s.
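
The peak figure falls out of a simple calculation (my reading of it, assuming one floating point result per FPU per cycle at 1.4GHz; not an official breakdown):

#include <stdio.h>

int main(void)
{
    /* Assumes one floating point result per FPU per cycle at 1.4GHz. */
    int    fpus      = 8;
    double clock_ghz = 1.4;

    printf("peak = %d FPUs x %.1f GHz = %.1f GFlops/s\n",
           fpus, clock_ghz, fpus * clock_ghz);   /* ~11 GFlops/s, the figure above */
    return 0;
}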

From a Floating point perspective the other advantage T2 has is the huge memory bandwidth from the on-chip memory controllers that are connected directly to FBDIMM memory. This memory subsystem can deliver a theoretical max of 60GB/s of combined memory bandwidth.

For example, during early testing utilizing the floating-point rate component of the benchmark suite SPECfp_rate2000, some of the codes achieved over 20 Gbytes per second of memory bandwidth.


This combination of FP and memory bandwidth can be seen in our SPEC CPU2006 floating-point rate result (1). The Sun SPARC Enterprise T5220 server, driven by the UltraSPARC T2 processor running at 1.4GHz, achieved a world record result of 62.3 SPECfp_rate2006 running with 63 threads. Note this number will be published at http://www.spec.org/cpu2006/results/res2007q4/#SPECfp_rate after the systems are launched.


Another proof point is a set of floating point applications called the Performance Evaluation Application Suite (PEAS), developed by a Sun engineer, Ruud van der Pas, and described in his blog http://blogs.sun.com/ruud/

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single threaded user programs derived from real applications. Ruud has run all twenty of these codes in parallel on a T2 system and then two sets of the codes, 40 in parallel. There was still plenty of spare CPU on the box.


At the UltraSPARC T2 launch http://www.sun.com/featured-articles/2007-0807/feature/index.jsp Dave Patterson and Sam Williams from Berkeley presented their findings using sparse matrix codes on the T2. Their tests consist of a variety of FP intensive HPC codes that have large memory bandwidth requirements. Sam will also present his findings at SC07 http://sc07.supercomputing.org/schedule/event_detail.php?evid=11090

We have also had a number of Early Access customers testing floating point applications. One customer can now fully load his T5120 where previously the work would fail to complete in the desired time on a T2000.

In conclusion, with the UltraSPARC T2 we have solved the limited floating point capability that was an issue on the original T1. Our guidance to avoid running applications with more than 2% FP instructions on the T1 does not apply to the T2. No changes are required by developers to take advantage of the 8 floating point units as they are fully integrated into the pipeline of each core.



(1) SPEC® and the benchmark name SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation. The presented result has been submitted to SPEC. For the latest SPECcpu2006 benchmark results, visit http://www.spec.org/cpu2006. Competitive claims are based on results at www.spec.org/cpu2006 as of 8 October 2007.



T5120 and T5220 system overview

Sun today launches the new line of servers based on the UltraSPARC T2 processor. The T2 processor is the next generation of CMT following on from the very successful UltraSPARC T1. The new servers are called the T5120 and T5220. The T5120 is a 1U server and the T5220 is a 2U server.


The T5120/T5220 systems differ greatly from the T1000 and T2000 systems of today, having a completely redesigned motherboard.

Answers to common FAQs are

  • An UltraSPARC T1 processor cannot be put in a T5120/T5220 Motherboard

  • An UltraSPARC T2 cannot be put in a T1000/T2000 system

  • A T1000/T2000 cannot be upgraded to a T5120/T5220


The T5120 and T5220 use the exact same motherboard. In fact the only differences between the two systems are:

  • Height: 1U vs 2U.

  • Power supplies: 650 Watts for the T5120, 750 Watts for the T5220. Note the power supplies between the two systems are physically different and cannot be interchanged. Also both systems have two hot pluggable power supplies.

  • Number of PCI-E slots: 3 for the T5120 versus 6 for the T5220

  • Max number of disks: 4 for the T5120 versus 8 for the T5220

The first thing you will notice is that the T5120 and T5220 are longer than the UltraSPARC T1 based T1000 and T2000 systems. The T5120/T5220 are 28.1 inches long versus 24 inches on the previous generation.

To open the system press down on the button in the center of the lid and push towards the end. Note that just like the T2000 there is an intrusion switch in the top right hand corner of the system. If the lid is opened, power is cut to the system, so make sure the OS has been gracefully shut down before opening the lid.

On the T2000 systems today the service controller (SC) is on a small PCB in a special slot. For the T5120/T5220 we have integrated the SC on the motherboard.


CPU

Looking at the system, the processor is under the large copper heatsink in the middle of the Motherboard between the 16 memory DIMM slots. The UltraSPARC T2 processor frequency is either 1.2GHz or 1.4GHz and comes in 4/6/8-core options.


Memory

The memory in T5120/T5220 systems is FBDIMM and different from the DDR2 used in T1000/T2000. There is a photo of an FBDIMM memory stick on the right.

Note you cannot use current memory from T1000/T2000 in the new systems

The T5120/T5220 have 16 DIMM slots and currently 3 DIMM size options: 1GB, 2GB and 4GB. So the maximum memory today is 64GB in these systems.

The DIMMs can be half populated. Insert the first DIMM in the slot closest to the CPU and every second one after that


Fans

The fans are accessed through the smaller lid in the cover. The fan unit now consists of 2 fans connected together. The T5120/T5220 only require 1 row of fans for cooling; the second row will be empty. Note the shaped plastic on top of the processor and memory in the T5220, which is used to force air from the fans over the CPU and DIMMs. The fans are hot pluggable and we have implemented variable fan speed control to reduce noise. Under nominal conditions the fans will run at half speed.


Disks

The T5120/T5220 systems use the same 73GB and 146GB SAS drives available on T2000 based systems today but use a different bracket. Thus using current T2000 disks requires swapping the bracket.

T5120/T5220 systems use the LSI 1068E RAID controller in order to support 8 physical disks. Apart from the extra disks this is functionally equivalent to the controller used in the T2000 today. RAID 0+1 is available via the Solaris raidctl command. Note that mirroring a root partition still needs to be done before the OS is installed.


One of the big differences between the T5120/T5220 and the T1 based systems is the I/O configuration. As mentioned previously the T5120 (1U) has 3 PCI-E slots and the T5220 (2U) has 6. There are no PCI-X slots on the new systems. Since the launch of the T1 systems we have seen increasing availability of low profile PCI-Express cards for all the major HBA applications: FCAL, GigE and 10Gig networking, IB etc.

I/O

The UltraSPARC T2 has two 10Gig network ports built into the processor. These ports provide superior 10Gig performance and require fewer CPU cycles. The 2 interfaces on the chip are industry standard XAUI. There is a card available from Sun to convert XAUI to a fiber 10Gig connection. An example of the XAUI card and its optics is shown on the right. The electrical connector on a XAUI card (the gold connector in the photo) is towards the back of the card, which is a different position from standard PCI-E cards.

Unlike the T2000 the PCI-E and XAUI slots on T5120/T5220 are on their side and plug into 3 riser cards that then plug into the Motherboard. The 2U riser can be seen in the photo on the right. The two bigger connectors on the left of the riser are for PCI-E and the small one on the right is for XAUI.

The 1U has a single layer of 3 slots and the 2U has 2 layers. All the PCI-E connectors are either x8 or x16 in size, but are actually wired x4 or x8.

Both T5120 and T5220 systems have 4 Gigabit Ethernet ports integrated into the motherboard. Like the T2000 these ports use 2 Intel Ophir chips and the e1000g driver in the Solaris OS.


Rear

Looking at the back of either the T5120 or T5220 you will see that there are two redundant power supplies.

There is a serial port and a 100Mb network port for the System Controller as well as the 4 Gigabit ports. At the right hand corner is an RS232 serial port designated as TTYA by the operating system. There are two USB 2.0 capable ports at the rear of both systems.


Front

In the front there are slots for 4 disks on a T5120 and 8 on a T5220. On both systems there is also a DVD in the top right hand corner. There are two USB 2.0 capable ports at the front of both systems as well.


Front / Rear LEDs

On the rear of the system we have 3 LEDs, which are, from left to right:

  • Locator LED

  • Service Required LED

  • Power OK LED

On the front of the system we have the same 3 LEDs on the left hand side. On the right hand side we have 3 more:

  • TOP which indicates a fan needs attention

  • PS LED indicates a power supply has an issue

  • Over temp indicator


Lessons learned from T1



We created UltraSPARC T1 and launched it into the world in November 2005. Adoption was slow at the start as CMT was so different. We spent many hours patiently explaining how it worked to customers, partners etc. We also did Proof of Concepts (POCs) with many of Sun's major customers around the world. As we progressed and the product ramped we gathered a body of knowledge on applications, how they worked on CMT and how to tune them. We also wrote whitepapers, available at http://www.sun.com/blueprints, and we created tools that we posted at http://www.opensparc.net/cool-tools.html

Now here we are two years later about to launch UltraSPARC T2. We have learned many lessons along the way.

1. We have much more experience in helping customers to migrate to Solaris

Migrations involve a number of steps:


  • Recertifying customers' application stacks on Solaris 10
  • Identifying legacy applications that cannot be moved to Solaris 10

To help the deployment of legacy applications we have developed the Solaris 8 Migration Assistant, internal Sun codename Etude (http://blogs.sun.com/dp/entry/project_etude_revealed). Etude allows a user to run a Solaris 8 application inside a BrandZ zone in Solaris 10.



2. Customers often evaluate CMT performance with a single threaded application.

Many times a customer has a standard benchmark that they have used for years to evaluate all hardware. They have collected a body of results that they use to rank servers. These tests are often single threaded and viewed as a “power” test of the server. The customer feels such a test has the effect of “leveling” the performance playing field. This test is often the first door that needs to be passed to enable further evaluation.

I have received many mails from folks that start “the performance on the T2000 was very poor, it was 50% of a v440”. My first question is always: was the CMT server 96% idle at the time of the test? If so, this usually points to a single threaded test.

Single thread performance is not the design point for CMT. The pipeline is simple and designed above all else to be shared, thus masking the memory stalls of threads. This design leads to its extremely low power and its extremely high throughput. In UltraSPARC T2 we have continued this focus. There are now twice the integer pipelines (16 in total), which doubles the throughput of the chip. We have added a couple more pipeline stages and increased the size of some of the caches, which increases single thread performance by about 25%.

With CMT we need to ask whether the test really reflects the true nature of the customer's workload. If the customer truly needs single thread performance then CMT is not for them. In many situations, however, the reality is that they really require throughput. As many customers have found, in a throughput environment CMT is a far superior architecture.



3. The 1/4 frequency argument is a common misconception of CMT performance

The ¼ Frequency argument goes as follows:


  • A CMT pipeline runs at say 1.2GHz and has 4 threads sharing it
  • Therefore each thread only gets 1/4 the cycles and runs 300MHz
  • This makes it less performant than an old US II chip

This line of argument doesn't hold because most commercial code chases pointers and is constantly loading data structures. On average a commercial application stalls every 100 instructions for a variety of reasons such as a TLB miss, an I-cache miss, a Level 2 cache miss etc. When a thread stalls it is usually delayed for many cycles; an I-cache miss for instance is 23 cycles. So even though a thread is running at 1.2GHz it usually spends 70% of its time stalled. This is why major processor manufacturers create ever deeper out-of-order pipelines in an effort to avoid this stall.

All this stalling is perfect for CMT. The hardware automatically switches out a thread when it stalls and shares its cycles amongst the other 3 threads on the pipeline, masking the stall. With this technique we can utilize the pipeline 75% - 80% of the time, provided there are enough threads to absorb the stall.
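
A toy model (my own simplification; the real hardware scheduler is more subtle) illustrates why four threads are enough to hide most of that stall:

#include <stdio.h>

int main(void)
{
    /* Toy model: a thread is ready (not stalled) about 30% of the time, per
     * the ~70% stall figure above. With n threads sharing a pipeline, the
     * pipe is busy whenever at least one thread is ready. */
    double p_ready = 0.30;

    for (int threads = 1; threads <= 4; threads++) {
        double p_idle = 1.0;
        for (int t = 0; t < threads; t++)
            p_idle *= (1.0 - p_ready);
        printf("%d thread(s): pipeline busy ~%2.0f%% of the time\n",
               threads, 100.0 * (1.0 - p_idle));
    }
    /* 1 thread ~30%, 4 threads ~76%, close to the 75% - 80% quoted above.
     * The model ignores scheduling details and correlated stalls. */
    return 0;
}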



4. Most Commercial applications have little or no floating point instructions.

As part of the UltraSPARC T1 rollout we introduced a tool called cooltst (http://cooltools.sunsource.net/cooltst/index.html) that gives an indication of the percentage of floating point in a current deployment environment. In reviewing the output from many hundreds of customer cooltst runs we rarely see a large floating point indication.

"Most" commercial application developers would rather not deal with floating point as it is more difficult to program. There are of course exceptions in the commercial space such as SAS and portions of the SAP stack. One big exception is Wall Street with such applications as Monte Carlo.

In UltraSPARC T2 we added a fully pipelined floating point unit per core, shared by 8 threads. These FP units can deliver over 11 GFlops of floating point performance. So the floating point issue has been completely eliminated in T2.



5. One of the biggest gains in Java apps is moving to 1.5 or 1.6 JVM

Many Java applications today are running on JVM 1.4.2. The last build of this version was created in Dec 2003 and it has now officially entered the Sun EOL transition period described at
http://java.sun.com/j2se/1.4.2/.

One of the best ways to improve the throughput performance of a Java application on CMT servers is to upgrade the JVM to at least the latest 1.5 JVM, or preferably 1.6. These versions of the JVM have a host of new features and performance optimizations, many targeted specifically at CMT. I have seen a 15% - 30% increase in the performance of Java applications when migrated from 1.4.2 to 1.6.

The issue is complicated slightly by older versions of ISV software that are only supported on 1.4.2. Again we encourage customers to migrate to the newer version of the ISV stack.



6. There are a set of Applications where CMT really shines

We have worked with over 200 customers in the last 2 years and the following list covers a large portion of the applications where CMT showed excellent performance.

Webservers.

  • Sunone
  • Apache

J2SE Appservers.

  • BEA Weblogic
  • IBM Websphere
  • Glassfish (formerly SunOne Appserver)

Database Servers.

  • Oracle including RAC
  • DB2
  • Sybase
  • mySQL

Mail Servers.


  • Sendmail
  • Domino
  • Brightmail including Spamguard

Java Throughput Applications

Traditional Appservers.


  • Siebel
  • Peoplesoft

Net Backup

JES Stack


  • Directory, Portal, Access Manager





Tuesday Dec 06, 2005

UltraSPARC T1 - low power and the SWAP metric

Low power was a key design point of the UltraSPARC T1 "Niagara" processor from the
ground up. Unlike other processors we took a simpler approach to the pipeline.
Avoiding deep and complex Out of Order pipelines reduced power consumption to less
than 5 watts per core. We have also made a smaller more efficient L2 cache which
is great for throughput and consumes a lot less power. The total power consumption
for the chip is typically less than 70 watts, about half of comparable x86 processors.

Soon after first silicon we started to measure the power drawn by the processor
and DDR memory. Using our Fireball bringup system we measured the current drawn
by modifying the motherboard to attach test points to the memory and cpu voltage
regulators. The result was a very ugly setup involving an oscilloscope.

We ran Oracle, Java and SAP workloads on the box while taking the measurements.
None of these really drove the memory bandwidth, however, which has a theoretical
max of 25GB/s. We switched to a memory exerciser that could push about 14GB/s to
the memory subsystem.

With the arrival of p0.0 Ontarios, power measurement began in earnest. We used a
number of different memory exercisers. We took measurements with 8GB, 16GB and
32GB of memory, with and without the PCI-X and PCI-E slots occupied. We also got
a breakdown of the power costs of the major components of the system:


  • UltraSPARC T1 processor
  • The DDR2 memory
  • The 4 internal drives
  • The fans
  • The power supply efficiency
  • The PCI-E and PCI-X slots

All these components make up the total system power.

These tests proved that even under the heaviest load the T2000 did not require the
initial 550 Watt power supplies we had specified. The most we measured was 340 Watts
fully loaded. A system running at 340 Watts is not in the efficient range of a
550 Watt supply. We decided to change the supply to a 450 Watt version, which
is far more efficient in the 340 Watt range. This represented a significant power
saving. This transition is currently in progress.
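
For context, a quick calculation (mine; the efficiency point is a general property of switching supplies, not a measured curve for these particular units) of how loaded each supply size is at the 340 Watt peak:

#include <stdio.h>

int main(void)
{
    /* How loaded each supply size is at the measured 340 Watt peak.
     * Switching supplies are generally most efficient in the upper part of
     * their load range (a general assumption, not a measured efficiency
     * curve for these particular supplies). */
    double peak_watts = 340.0;
    double supplies[] = { 550.0, 450.0 };

    for (int i = 0; i < 2; i++)
        printf("%3.0fW supply: ~%.0f%% loaded at peak\n",
               supplies[i], 100.0 * peak_watts / supplies[i]);
    /* 550W -> ~62% loaded, 450W -> ~76% loaded */
    return 0;
}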

These initial T2000 power measurements were done with a Voltech AC Power Analyzer
and a Tektronix 745 oscilloscope after modifying the power cord. This mechanism
was not practical for most benchmark environments and did not scale as the number
of systems increased.

The solution we found was a simple Power Analyzer and Watt
meter called "Watts Up" which is available from Frys for about $70.
The meter is simple to use, it provides a socket for the T2000 power cable and
then the meter is plugged into the wall socket. You need two for the dual power
supplies of the T2000.

We bought a bunch of these meters and put two on each of the benchmark configurations
we were testing. We also used these meters on comparable Xeon systems. When running
a workload on a T2000 we often run it in parallel on a Xeon system and take
performance and power measurements for both.

All our measurements on T2000 showed huge reductions in power consumption and
heat generated relative to other servers. At the same time the performance on
the T2000 was better than these systems.

We were also presenting our technology to many customers at the time and were
hearing again and again how power and cooling were becoming critical factors in
server deployment. Datacenters are at the limits of how much power can be brought
in and how much hot air can be extracted. Racks can only be deployed half full or
require a large amount of space around them.

Customers were extremely interested in our low power technology. But we needed a
metric that combined performance with power consumption and the space taken up by
racks of systems in the datacenter.

To achieve this, Rick Hetherington, Distinguished Engineer and Chief Architect for
the UltraSPARC T1, together with Mat Keep and his team, developed the very elegant
"Space, Watts and Performance (SWaP)" metric, calculated using the following formula:



           Performance
    SWaP = ---------------------------
           Space x Power Consumption


  • Performance is any benchmark such as an industry standard or customer specific workload
  • Space is the height of the server in rack units (RU)
  • Power is the Watts consumed by the system either by specs or even better by measuring the draw
    with a meter during the actual workload.

An example:

If two systems deliver the same throughput performance, say a benchmark score of
500, but one is 2RU and draws only 300 Watts while the other is 3RU and draws
800 Watts:

The SWaP rating of the first would be 0.83 and of the second 0.21.

The first system has a 4X advantage over the second when deployed in the datacenter.
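
The arithmetic behind that example (assuming the benchmark score of 500 used above):

#include <stdio.h>

/* SWaP = Performance / (Space x Power), as defined above. */
static double swap_rating(double perf, double rack_units, double watts)
{
    return perf / (rack_units * watts);
}

int main(void)
{
    /* The example above, assuming both systems score 500 on some benchmark. */
    double a = swap_rating(500.0, 2.0, 300.0);   /* ~0.83 */
    double b = swap_rating(500.0, 3.0, 800.0);   /* ~0.21 */

    printf("system A SWaP = %.2f\n", a);
    printf("system B SWaP = %.2f\n", b);
    printf("A is %.1fx better than B\n", a / b);  /* 4.0x */
    return 0;
}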

For a full description see:

www.sun.com/servers/coolthreads/swap

We really believe SWaP is good for customers as it shows the true cost of deploying
a system in a datacenter. We hope it becomes an industry standard approach to
power and space management in the datacenter.



UltraSPARC T1 large page projects in Solaris

A Translation Lookaside Buffer (TLB) is a hardware cache that is used
to translate a process's virtual address to a physical memory address.
UltraSPARC T1 has a 64 entry Instruction TLB and a 64 entry Data TLB per
core. The unit of translation is the page size; UltraSPARC T1 supports
4 page sizes: 8k (the default), 64k, 4M and 256MB. When memory is
accessed and the mapping is not in the TLB this is termed a TLB miss.
Excessive TLB misses are bad for performance.

The 64 entry TLBs are relatively small compared to current SPARC processors.
They have the advantage, however, that you can mix and match all page
sizes in the TLB, i.e. the TLB does not have to be programmed to a particular
page size. One entry can be 8k, for instance, and the next one can be 256MB.
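
A quick worked example (my own arithmetic) of why page size matters so much here - the total address range, or "reach", that a 64 entry TLB can map at each supported page size:

#include <stdio.h>

int main(void)
{
    /* Total address range ("reach") that a 64 entry TLB can map at each of
     * the page sizes supported by UltraSPARC T1. */
    const char        *name[] = { "8k", "64k", "4M", "256M" };
    unsigned long long size[] = { 8ULL << 10, 64ULL << 10,
                                  4ULL << 20, 256ULL << 20 };
    int entries = 64;

    for (int i = 0; i < 4; i++)
        printf("%4s pages: reach = %8.1f MB\n",
               name[i], entries * size[i] / (1024.0 * 1024.0));
    /* 8k -> 0.5 MB, 64k -> 4 MB, 4M -> 256 MB, 256M -> 16384 MB (16 GB),
     * which is why the large page projects below matter so much. */
    return 0;
}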

We knew that TLB performance was going to be critical for UltraSPARC T1. Early on
we started a number of Solaris projects to provide optimal TLB performance
for the processor.

The first project was MPSS for Vnodes (VMPSS) also known as large
pages for text and libraries. Before this project binary text and library
segments were always placed on 8k pages. For large binaries such as Oracle
or SAP, or applications with a large number of libraries, this results in a
high number of ITLB misses per second.

VMPSS provides in kernel infrastructure and mechanisms so that large pages
can be used with file mappings that are text and initdata segments of binaries
and libs. Text and libraries are placed on the largest page size possible.
For smaller binaries and libraries this is usually 64k pages but for bigger
binaries such as Oracle it is 4MB mappings.

Use pmap -xs to see the pagesize that has been allocated. The first entry in
the output is the binary itself; libraries are usually towards the end of the listing.

14148: /usr/sap/SSO/SYS/exe/run/saposcol
Address Kbytes RSS Anon Locked Pgsz Mode Mapped File
0000000100000000 320 320 - - 64K r-x-- saposcol

The performance gains on UltraSPARC T1 were significant, up to 10% on some Oracle
workloads.

The second TLB related project was Large Pages for Kernel Memory, which provides
large pages for the kernel heap. The kernel is a particularly bad TLB miss
offender: code generally spends less time in the kernel and so
on entry the TLB is usually cold. Prior to this project the kernel heap
had been mapped on 8k pages. We saw moderate performance gains with this project.

The third project added was Large Page OOB (out-of-the-box) Performance.
The Multiple Page Size Support (MPSS) project in Solaris 9 added support
for page sizes other than 8k. MPSS environment variables needed
to be set and a library, mpss.so.1, preloaded prior to running an application.
The aim of the MPOOB project was to bring the benefits of large pages to a
broader range of applications out-of-the-box, without the need for
the MPSS variables.

MPOOB affects the allocation of heap, stack and anon pages.

Again, check whether large pages were obtained using pmap -xs:

0000000104858000 32 32 32 - 8K rwx-- [ heap ]
0000000104860000 3712 3712 3712 - 64K rwx-- [ heap ]
0000000104C00000 8192 8192 8192 - 4M rwx-- [ heap ]

This is a huge win for our customers, freeing them from the need to set environment
variables to tune the TLBs.

If for some reason pages are not allocated correctly by default, the MPSS variables
can still be used to override this. See the manual entry for mpss.so.1 for details.

The fourth large page project was support for 256MB, aka Giant, pages on
UltraSPARC T1 systems. This project was actually added as part of the
UltraSPARC IV+ project, however the TLB programming is different on Niagara.

For an allocation to be a candidate for 256MB pages it must have the following
characteristics:

- Be at least 256MB in size
- Be aligned on a 256MB address boundary

Giant pages can only be allocated on a 256MB address boundary. If the
allocation is larger than 256MB, Solaris will attempt to allocate 8k, 64k and
4MB pages until the next 256MB boundary is reached, and then use 256MB pages
from that boundary onwards.

One of the biggest performance gains from giant pages is in the Oracle SGA
which is allocated as System V shared memory. If the SGA is large it should
be allocated on giant pages. Again use pmap -xs to confirm

0000000380000000 25427968 25427968 - 25427968 256M rwxsR [ ism shmid=0x3 ]
0000000990000000 16384 16384 - 16384 4M rwxsR [ ism shmid=0x3 ]
0000000991000000 56 56 - 56 8K rwxsR [ ism shmid=0x3 ]

In the previous example the first 25GB of the SGA is allocated on 256MB pages. There
is a tail at the end that is first allocated on 4MB pages; the residue is 56k, which
is allocated on 8k pages.
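
Here is a small sketch (my own illustration of the rules above, not the real kernel allocator) that reproduces the page size breakdown of that SGA segment:

#include <stdio.h>
#include <stdint.h>

#define KB (1024ULL)
#define MB (1024ULL * KB)

/* Greedy illustration of the placement rules above (not the real kernel
 * allocator): at each point use the largest supported page size that fits
 * the remaining length and the current alignment. Length is assumed to be
 * a multiple of 8k. */
static void breakdown(uint64_t addr, uint64_t len)
{
    const uint64_t sizes[] = { 256 * MB, 4 * MB, 64 * KB, 8 * KB };
    uint64_t count[4] = { 0, 0, 0, 0 };

    while (len > 0) {
        for (int i = 0; i < 4; i++) {
            if (len >= sizes[i] && addr % sizes[i] == 0) {
                count[i]++;
                addr += sizes[i];
                len  -= sizes[i];
                break;
            }
        }
    }
    printf("256M x %llu, 4M x %llu, 64k x %llu, 8k x %llu\n",
           (unsigned long long)count[0], (unsigned long long)count[1],
           (unsigned long long)count[2], (unsigned long long)count[3]);
}

int main(void)
{
    /* Roughly the ISM segment from the pmap output above:
     * base 0x380000000, 25427968k + 16384k + 56k in total. */
    breakdown(0x380000000ULL, (25427968ULL + 16384ULL + 56ULL) * KB);
    /* Prints: 256M x 97, 4M x 4, 64k x 0, 8k x 7 */
    return 0;
}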

The final project added to Solaris was Large Page Availability. The aim of this
project was to increase the number of large pages in the system and to improve the
efficiency of creating large pages. This project is largely hidden from the end user.
It is key, however, to ensuring applications can allocate large pages.

To determine how well the TLBs are doing, use the trapstat command. The trapstat -T
option breaks down the data as follows:

- Per hardware strand
- User and kernel mode
- Page size within each mode

On an 8 core, 32 thread UltraSPARC T1 system the output is very long. The example
below gives the last strand's output plus the total.

cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
31 u 8k| 989 0.1 5 0.0 | 28050 1.2 3 0.0 | 1.3
31 u 64k| 2510 0.2 0 0.0 | 139354 5.4 4 0.0 | 5.6
31 u 4m| 2768 0.2 0 0.0 | 94936 4.5 0 0.0 | 4.7
31 u 256m| 0 0.0 0 0.0 | 79590 3.6 0 0.0 | 3.6
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
31 k 8k| 1921 0.1 0 0.0 | 35701 1.3 6 0.0 | 1.4
31 k 64k| 0 0.0 0 0.0 | 330 0.0 0 0.0 | 0.0
31 k 4m| 0 0.0 0 0.0 | 71 0.0 0 0.0 | 0.0
31 k 256m| 0 0.0 0 0.0 | 3388 0.2 4 0.0 | 0.2
==========+===============================+===============================+====
ttl | 278212 0.6 68 0.0 | 12334583 16.4 368 0.0 |16.9

Note the difference from traditional SPARC processors - 512k pages have been dropped
and new 256MB entries added.

In this example we see hardly any ITLB misses; this is because of large pages for text
and libraries. There are also 256MB page misses in the kernel, indicating large pages
for kernel heap are also in operation.



UltraSPARC T1 in a Fireball - the ugly duckling

Just after I arrived in the Niagara Arch group we taped out UltraSPARC T1, and
10 weeks later we had first silicon. The guys in the bringup team spent
many late nights and got Solaris booted in a couple of weeks.

Now what? We knew what the performance should be in theory but we needed
to prove it as quickly as possible. Silicon verification was ongoing
in Sunnyvale using all the available systems, but we managed to beg one
from the team to do some performance evaluation.

The system that we received was without exception one of the ugliest I have
ever seen. It was called a Fireball and was an UltraSPARC T1 board jammed sideways
into a 4U deskside server. It had an industrial 6 inch fan on it that
was irritatingly loud. There were cables and wires everywhere to aid debug.
In the front were slots for 8 old full height SCSI disks. It looked like
no mechanical engineer had been involved in its creation. A picture of this
would later appear in Jonathan's blog.

Little did we know that we would come to love these systems and that
they would still be involved in performance testing a year later.

The system had limited I/O capabilities so we decided to initially test
a throughput cpu/memory Java benchmark. Initial chips were only
rated for 800MHz but you cannot keep a good performance engineer down.
We worked out a way to hack the reset code to drive the chips to 1.2GHz
by increasing the core voltage. As these were initial silicon samples
we didn't know what to expect. We tested a number of chips until we
found 3 that could run at 1.2GHz.

After that it was mostly software. When working with very early systems,
firmware and OS bits are hand built and panics and powercycles are common.
Because Niagara is such an exciting new technology, however, people were
prepared to take a lot of pain to run early workloads. It took about
a week to get the right software stack in place but it was worth the wait.

The initial number at 800MHz was nearly 100k Ops/sec. When we
cranked it up to 1.2GHz we got 129k with minimal tuning. We were astounded
at how the UltraSPARC T1 threads absorbed work.

The software to access hardware performance counters was not yet in Solaris,
so we scrambled to add this functionality. What the counters revealed was that the
utilization of the Niagara pipe was nearly 70%. In 12 years of SPARC performance work
I had never seen a number that high. Not only was the silicon working
beautifully but the pre-silicon simulations had been right.

We rushed to run other CPU/memory benchmarks including an internal XML
test and got similar results.

A few weeks later I gave a presentation where I first showed the standard
CMT/Niagara slide with the 4 threads on a core and the 8 cores absorbing
all the stall. I'm sure most folks in the room were moaning to themselves
"here we go again". Then I put up a slide that simply stated "And now
it is real", with the Java throughput and XML results. The age of CMT had
truly arrived.


