Monday Oct 13, 2008

T5440 Architecture

Today we are launching our newest CMT server, the T5440. This server is a monster of performance and scalability. It can have up to four UltraSPARC T2 Plus processors, at 1.2GHz or 1.4GHz, in a 4 Rack Unit (RU) chassis. The design is modular, enabling 1-, 2-, 3- and 4-processor configurations. The system scales to 32 cores, 256 threads and 512 GB of memory.

The diagram on the right shows the architecture. Each UltraSPARC T2 Plus processor has 4 coherency planes.

On a T5440 there are four high speed hub chips, called Zambezi, running at 800MHz which connect each of the four coherency planes. Each UltraSPARC T2 Plus processor communicates with all others in the system via the Zambezis. Each processor has its own local memory and can access all the remote memory of the other processors via the Zambezis. Thus the memory in the system scales as we add processors. Each processor also has its own integrated PCI-Express x8 link so the I/O bandwidth also scales as we add processors.
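To make the scaling concrete, here is a quick Python sketch (illustrative arithmetic only, using the per-processor figures described in this post: 8 cores of 8 threads, up to 16 FBDIMM slots per processor with 8GB DIMMs, and one x8 PCI-E link per processor):

CORES_PER_CPU = 8            # 8 cores per UltraSPARC T2 Plus
THREADS_PER_CORE = 8         # 8 hardware threads per core
MAX_DIMMS_PER_CPU = 16       # 4 slots on the CPU card + 12 on the memory card
MAX_DIMM_GB = 8              # largest FBDIMM offered
PCIE_LANES_PER_CPU = 8       # one integrated x8 PCI-Express link per processor

for cpus in (1, 2, 3, 4):
    print(f"{cpus} CPU(s): {cpus * CORES_PER_CPU} cores, "
          f"{cpus * CORES_PER_CPU * THREADS_PER_CORE} threads, "
          f"up to {cpus * MAX_DIMMS_PER_CPU * MAX_DIMM_GB} GB memory, "
          f"{cpus * PCIE_LANES_PER_CPU} PCI-E lanes")

With four processors this works out to the 32 cores, 256 threads and 512 GB quoted above.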

The scalability advantage of this architecture was highlighted at the 2008 IEEE Symposium on High-Performance Interconnects (http://www.hoti.org/archive/2008papers/2008_S2_1.pdf); there is also a set of slides at http://www.hoti.org/hoti16/program/2008slides/Session_Presentation/Feehrer_CoherencyHub_2008-08-27-09-51.pdf

The design of the T5440 system itself is very different from our previous systems, which have all been traditional 1U and 2U servers.

The T5440 is 4 RU, and the processors and memory are on daughter cards. These daughter cards plug into a motherboard, which contains the Zambezis and the I/O subsystem. The daughter cards are configured in groups of two: one CPU card and one memory card per group. The memory cards are optional, however, as the CPU cards have enough DIMM slots for a minimum memory configuration.

The photo on the right shows a CPU daughter card, which contains one UltraSPARC T2 Plus processor and slots for 4 FBDIMMs. The UltraSPARC T2 Plus has 2 FBDIMM branches with 2 channels per branch, and the DIMM slots on the CPU daughter card are directly connected to these 4 channels.

The memory board, as shown in the photo on the left, contains 12 slots for FBDIMM memory. The memory daughter card extends the 4 FBDIMM channels from the associated UltraSPARC T2 Plus processor. Each channel can be extended by up to 3 more DIMMs.

DIMMs come in 2GB, 4GB and 8GB sizes and run at 667MHz just like the T5140 and T5240 servers.

The minimum requirement is 4 DIMMs per processor. Other options are 8 or 16 DIMMs per processor.

Currently all processors must have the same amount of memory, but each processor can achieve this with different-sized DIMMs.



From these photos you can see that the connectors on the CPU and memory daughter cards are different. The cards can only go in preassigned slots. This can be seen from the photo of the motherboard: the CPU cards plug into the longer slots and the memory cards into the shorter ones.







Slots must be filled in the following order: CPU/MEM pair 0 and pair 1, followed by pair 2 and pair 3. Standing in front of the server, the slots are numbered:

front left 3 - 1 - 2 - 0 front right.

The motherboard photograph also shows the 8 PCI-E slots. Although all are x8 electrically, two of the slots (the longer ones) have x16 physical connectors.

Note that PCI-E cards on the T5440 plug in vertically; this is a change from the T5120, T5220, T5140 and T5240 servers, which used risers and installed PCI-E cards horizontally.

Note also that two of the PCI-E slots have a second, smaller slot next to them. These are the XAUI slots. Sun-supplied XAUI cards plug into these slots and provide 10Gig networking.








The diagram on the right shows a fully configured T5440 with all its daughter cards and how these are connected to the Zambezis and the I/O system. The beauty of this physical design is upgradability: you can start with a 1-CPU configuration and add extra compute and memory as required.

Blanks need to be fitted in place of empty slots to ensure appropriate airflow.











A top view of a fully loaded T5440 is shown in the photo at the right. From bottom to top are the 4 large fans for cooling, the row of CPU and memory daughter cards, and the 8 PCI-E I/O slots. The separate circuit board is the Service Processor for the system, which also plugs in as a daughter card.








In the photo on the left we see the T5440 from the rear. Here you see the 4 power supplies, each of which is 1,120 watts. Powering a fully loaded T5440 with 4 CPUs, 4 disks and 512GB of memory requires 2 supplies. Four supplies are required for full redundancy in this configuration.

Also note the metal fillers that cover the 8 PCI-E slots, the built-in 4x 1Gig copper network connections, the 100Mb and serial links to the Service Processor, the USB ports, etc. In this photo you can also see one of the 4 fans pulled out of the system.












Finally, looking at the front of the system we see the 4 built-in disks. These are SAS and can be either 73GB or 146GB. There is also a DVD drive if required.

Friday Apr 11, 2008

Memory and coherency on the UltraSPARC T2 Plus Processor


Coherency Architecture


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the silicon layout of the UltraSPARC T2 Plus is the same as the original UltraSPARC T2. Both use Texas Instruments 65nm technology and have similar packaging and pin layout. There are still 8 cores with 8 floating point and crypto units and 16 integer pipelines on each processor, plus an integrated x8 PCI-E link. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus.


The Coherency Unit (CU) sits between the L2 cache and the Memory Controller Units (MCUs) that drive the FBDIMM channels. The CU snoops all requests for memory and communicates with the CU on the second UltraSPARC T2 Plus processor to resolve all accesses. If the data is on the remote processor, its CU will forward the data.


The coherency does not use a traditional bus, such as on a V890, but instead uses multiple high speed SERDES links running between the chips to achieve optimal scaling. We already had high speed SERDES links on the T2 to connect to FBDIMM, and after simulating numerous scenarios we decided to use half of these links for coherency. To achieve this we reduced the MCUs from four on the T2 to two on the T2 Plus. As each of the two removed MCUs had been driving two FBDIMM channels, that freed up 4 channels' worth of links for coherency, creating 4 coherency planes.


When the UltraSPARC T2 Plus is running at 1.2GHz the coherency links run at 4.0Gbps; for a 1.4GHz processor the links run at 4.8Gbps. There are 14 transmit and 14 receive links per channel. When running at 4.8Gbps the aggregate bandwidth is 8.4GB/s per channel per direction. The total across all 4 channels is therefore 33.6GB/s in each direction. Bottom line: this is a big, wide, low-latency pipe between the UltraSPARC T2 Plus processors, dedicated only to memory transfers.
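As a back-of-the-envelope check of those numbers (a sketch, assuming each of the 14 lanes per direction runs at its full line rate):

lanes_per_channel = 14      # transmit (or receive) lanes per coherency channel
line_rate_gbps = 4.8        # per-lane rate with a 1.4GHz UltraSPARC T2 Plus
channels = 4                # four coherency planes

per_channel_gbytes = lanes_per_channel * line_rate_gbps / 8   # bits -> bytes
print(f"{per_channel_gbytes:.1f} GB/s per channel per direction")       # 8.4
print(f"{per_channel_gbytes * channels:.1f} GB/s total per direction")  # 33.6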


The L2 cache on the UltraSPARC T2 Plus is organized the same as on the T2: 4MB, 16-way associative, in 8 banks with 64-byte cache lines. All 8 cores still see all the L2 cache; only the number of MCUs has been reduced. Four banks of the L2 cache drive each MCU.


The diagram below illustrates the architecture.

















Memory Latency


So we have introduced the concept of local and remote memory, where the latency to access local memory is less than that of remote memory. The CU adds 76ns for a remote access and about 15ns for a local access.

This gives latency numbers from lmbench of:

Local = 150 nsec
Remote = 225 nsec

Bottom line: the architecture is NUMA, but not highly so. For instance, Solaris NUMA optimizations give an 8% - 10% improvement in performance.

Memory Bandwidth


There has been some discussion on the Web about the reduction of MCUs “dramatically” reducing the bandwidth of the UltraSPARC T2 Plus. This is pretty bogus.

Each UltraSPARC T2 Plus has a max theoretical bandwidth of:

Read 21 GBytes/s
Write 10.67GBytes/s
Total Read + Write 32 GBytes/s

That's a theoretical max of 64 GBytes/s across the two processors on a T5140 or T5240. We have measured over 30 GBytes/s of read bandwidth using the STREAM benchmark on a T5240; that's roughly 75% of the theoretical read maximum.
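A quick back-of-the-envelope check of that result (a sketch; I am measuring the percentage against the combined read maximum of the two sockets):

read_max_per_socket = 21.0     # GBytes/s theoretical read bandwidth per T2 Plus
measured_read = 30.0           # GBytes/s, "over 30" measured with STREAM on a T5240

read_max = 2 * read_max_per_socket                   # 42 GBytes/s for two sockets
print(f"{measured_read / read_max:.0%} of the {read_max:.0f} GBytes/s read maximum")
# exactly 30 GBytes/s is ~71%; "over 30" puts it in the region of 75%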

What's far more important for maximum bandwidth is the number of FBDIMMs available in a channel. As the number of internal banks/DRAMs increases, scheduling associativity increases and so does bandwidth. For instance, 4 DIMMs in a channel give significantly better bandwidth than just 2. On the T5240 a riser (called the Memory Mezzanine Kit) is available that enables a second row of 16 DIMMs. This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs. This configuration gives the maximum bandwidth.

Bottom line: the processor has a large pipe to memory, and we can utilize most of this bandwidth.

Memory Interleaving

The T5140 / T5240 present a single memory address range to Solaris or any other OS. All memory on both processors is accessible through the CU. As mentioned previously, the L2 cache line is still 64 bytes and we interleave these lines across the 8 banks of the cache to achieve maximum memory bandwidth. When actually accessing memory, the MCU interleaves the 64-byte request across its two channels: 32 bytes of the data come from channel 0 and 32 bytes from channel 1.

The next interleave factor is 512 bytes, which is 8 banks x 64 bytes. In this mode the first 512 bytes are placed in processor zero's memory, the next 512 bytes come from the memory attached to processor one, and so on. All memory accesses are distributed evenly across both UltraSPARC T2 Plus processors. The effect on memory latency is to make it the average of 150ns and 225ns, about 190ns. The upside of this is avoiding hotspots; the downside is that all applications pay the same latency penalty and there is a lot of interconnect traffic.

To take advantage of the NUMA optimizations in the Solaris OS, we added another interleave factor of 1GB. In this mode memory is allocated in 1GB chunks per processor. On the T5140 and T5240 this is the default interleaving mode. The mapping information is passed to Solaris, which can then optimize memory placement. For instance, the stack and heap allocations can be placed on the same UltraSPARC T2 Plus as the process, thus taking advantage of the lower latency to local memory.

All these interleave modes are handled automatically by the hardware. When a memory request reaches the CU after missing the L2, a decision is made as to which node it should be serviced from, based on the interleaving factor; this is a very simple compare based on the address.
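The decision can be pictured with a small sketch (a simplification for illustration, not the actual hardware logic; the function and addresses are mine):

def home_node(phys_addr: int, interleave_bytes: int, nodes: int = 2) -> int:
    """Return the processor whose local memory services this physical address."""
    return (phys_addr // interleave_bytes) % nodes

FINE = 512              # 512-byte interleave: lines alternate between the processors
COARSE = 1 << 30        # 1GB interleave: the default, enables Solaris NUMA placement

for addr in (0x000, 0x200, 0x400, 0x600):
    print(hex(addr), "-> node", home_node(addr, FINE))              # 0, 1, 0, 1

print(hex(0x38000000), "-> node", home_node(0x38000000, COARSE))    # first GB -> node 0
print(hex(0x68000000), "-> node", home_node(0x68000000, COARSE))    # second GB -> node 1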

The hardware has the ability to support a mixed environment as well, with some portion 512-byte interleaved and the rest 1GB interleaved, but this is not used in current systems.


Memory Configuration


To determine the configuration of the memory on your system, log in to the system controller as user admin and enter showcomponent. This will dump a line for each DIMM in the system, such as:


/SYS/MB/CMP0/MR0/BR0/CH0/D2
/SYS/MB/CMP0/MR0/BR0/CH0/D3
/SYS/MB/CMP0/MR0/BR0/CH1/D2

The key to this cryptic output is as follows:

MB is the Motherboard

CMP0 is the first UltraSPARC T2 Plus processor
CMP1 is the second UltraSPARC T2 Plus processor

MR is the Memory Riser (Mezzanine)
MR0 extends the channels attached to the first processor
MR1 extends the channels attached to the second processor

On the ground floor there are 2 DIMMs per channel, D0 and D1
On the riser each channel is extended by 2 more DIMMs, D2 and D3

Each processor has 2 memory controllers, confusingly called Branches, BR0 and BR1
Each memory controller has 2 channels, CH0 and CH1
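If you have to decode a lot of these paths, a small helper along these lines can do it (illustrative only; the field names simply follow the key above):

def decode_dimm(path):
    """Decode a showcomponent DIMM path such as /SYS/MB/CMP0/MR0/BR0/CH0/D2."""
    info = {"riser": None}                      # riser stays None for on-board DIMMs
    for part in path.strip("/").split("/"):
        if part.startswith("CMP"):
            info["processor"] = int(part[3:])   # CMP0 / CMP1
        elif part.startswith("MR"):
            info["riser"] = int(part[2:])       # MR0 / MR1 (mezzanine)
        elif part.startswith("BR"):
            info["branch"] = int(part[2:])      # memory controller (branch)
        elif part.startswith("CH"):
            info["channel"] = int(part[2:])     # channel within the branch
        elif part.startswith("D"):
            info["dimm"] = int(part[1:])        # D0/D1 on board, D2/D3 on the riser
    return info

print(decode_dimm("/SYS/MB/CMP0/MR0/BR0/CH0/D2"))
# {'riser': 0, 'processor': 0, 'branch': 0, 'channel': 0, 'dimm': 2}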

So..... (still with me)


4 DIMMs on a CPU - 1 DIMM on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0



8 DIMMs on a CPU - 2 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1




All 16 DIMMs on a CPU - 4 DIMMs on each channel of each controller


MB/CMP0/BR0/CH0/D0 - 8 on the ground floor

MB/CMP0/BR0/CH1/D0

MB/CMP0/BR1/CH0/D0

MB/CMP0/BR1/CH1/D0

MB/CMP0/BR0/CH0/D1

MB/CMP0/BR0/CH1/D1

MB/CMP0/BR1/CH0/D1

MB/CMP0/BR1/CH1/D1

MB/CMP0/MR0/BR0/CH0/D2 - head upstairs to the riser

MB/CMP0/MR0/BR0/CH0/D3

MB/CMP0/MR0/BR0/CH1/D2

MB/CMP0/MR0/BR0/CH1/D3

MB/CMP0/MR0/BR1/CH0/D2

MB/CMP0/MR0/BR1/CH0/D3

MB/CMP0/MR0/BR1/CH1/D2

MB/CMP0/MR0/BR1/CH1/D3




Thursday Apr 10, 2008

Overview of T2 Plus systems

Sun on Wednesday launched the next generation of CMT servers based on the UltraSPARC T2 Plus processor. The T2 Plus is the first CMT processor with built-in coherency links that enable 2-way servers. The new servers are called the T5140 and T5240. The T5140 is a 1U server and the T5240 is a 2U server. A picture of the T5240 is on the right. Each server comes with two T2 Plus processors linked together through the motherboard.


This is really exciting for us as it doubles the capacity of these servers while maintaining the same 1U and 2U form factors. This represents the absolute highest compute density of any server on the market.


As the name suggests, the UltraSPARC T2 Plus is a derivative of the UltraSPARC T2. The vast majority of the two processors is the same; both use Texas Instruments 65nm technology. There are still 8 cores with 8 floating point and crypto units and 16 integer pipelines on each processor, plus an integrated x8 PCI-E link. The main difference is the replacement of the T2's 10Gig interface with silicon to perform coherency on the T2 Plus. The coherency does not use a traditional bus but instead uses multiple high speed SERDES links running at 4.8Gbps between the chips to achieve optimal scaling.


First, the stats on these servers.

The T5140 is a 1U server with 16 FBDIMM slots. There are 1GB, 2GB and 4GB DIMMs available, and we offer the following options:


2x 4 Core, 1.2GHz, 8GB memory

2x 6 Core, 1.2GHz, 16GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory


A T5140 can have up to 4 hot-plug disks, either 73GB or 146GB SAS.



At the rear are up to 3 PCI-Express slots. Two of these can host 10Gig ports via Sun's XAUI cards, shown on the right. Note that unlike the T2-based servers, the 10Gig on the T5140 is not integrated on the processor but is provided by a Neptune NIC on the motherboard.


As with all our CMT servers, there are 4 built-in GigE ports, which are provided by the same onboard Neptune NIC. This is different from the T5120 and T5220, which used two Intel “Ophir” NICs for the 1Gig connectivity.


The server has 2 hot-plug power supplies at 720 watts each.


Below is a photo of the T5140. Note the two T2 Plus processors under the larger copper heat sinks. The interconnect between the processors is wired through the motherboard. The other copper heat sink, in the bottom right hand corner, is the Neptune NIC.



The T5240 is a 2U server with up to 32 FBDIMM slots.


We offer the following options with 8 or 16 DIMMs:


2x 6 Core, 1.2GHz, 8GB memory

2x 8 Core, 1.2GHz, 32GB memory

2x 8 Core, 1.2GHz, 64GB memory

2x 8 Core, 1.4GHz, 64GB memory


In addition you can buy a riser (called the Memory Mezzanine Kit) that will accommodate another 16 DIMMs; there is a photo of the riser on the right.

This riser snaps on top of the layer of DIMMs on the motherboard and extends the FBDIMM channels so each can have up to 4 DIMMs


By default the T5240 can have up to 8 hot plug disks, again either 73GB or 146GB SAS. In addition when ordering the system a different disk backplane can be selected that can accommodate up to 16 disks.


The system has 6 PCI-Express slots and, just like on the T5140, two of these slots can host 10Gig ports via Sun's XAUI cards using the onboard Neptune NIC. Also like the T5140, the Neptune provides the 4 built-in GigE ports.


By default the T5240 has two hot-plug power supplies of 1,100 watts each. For the 1.4GHz and 16-disk configs, however, you need to use 220V AC input.





Below is a photo of the T5240 without a riser card; notice the motherboard is exactly the same as the T5140's. The server has larger fans, however, to accommodate the extra memory and potentially more disks and PCI-E cards. Note the larger power supplies are stacked one on top of the other, as opposed to side by side on the T5140.




As mentioned the T5140 and T5240 have bigger power supplies than the corresponding T5120 and T5220 as they need to support two processors. Power supplies from T5120 and T5220 should not be used for these servers.


The T5140 and T5240 use exactly the same FBDIMM memory as the T5120/T5220, however, and support exactly the same PCI-E and XAUI cards.


As noted earlier, networking on the T5140 and T5240 is interesting, as the two 10Gig XAUI ports and the 4x GigE share the same Neptune NIC. If you use a XAUI card, one of the GigE ports is disabled. The diagram below highlights the two XAUI and four GigE ports.


If you insert a XAUI card into XAUI 0 then the GigE port net1 will be disabled


If you insert a XAUI card into XAUI 1 then the GigE port net0 will be disabled


The Solaris driver for all these ports is nxge, so device nxge0 can be either a 10Gig or a 1Gig port.
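The sharing rule is easy to capture in a few lines (a toy lookup for illustration, not anything the system provides):

GIGE_DISABLED_BY_XAUI = {0: "net1", 1: "net0"}      # XAUI slot -> GigE port lost

def active_gige_ports(xaui_slots_in_use):
    """Built-in GigE ports still usable given which XAUI slots hold cards."""
    ports = {"net0", "net1", "net2", "net3"}
    for slot in xaui_slots_in_use:
        ports.discard(GIGE_DISABLED_BY_XAUI[slot])
    return sorted(ports)

print(active_gige_ports([]))        # ['net0', 'net1', 'net2', 'net3']
print(active_gige_ports([0]))       # ['net0', 'net2', 'net3']
print(active_gige_ports([0, 1]))    # ['net2', 'net3']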





One neat feature of the T5240 and T5140 is that each of the UltraSPARC T2 Plus processors has its own integrated x8 PCI-E link. Having two processors therefore automatically doubles the PCI-Express bandwidth. The T5x40 servers have twice the I/O bandwidth of their T5x20 equivalents. The PCI-E slots are wired so that the I/O is spread evenly across the x8 ports on both UltraSPARC T2 Plus processors.


For more information check out our whitepaper at http://www.sun.com/servers/coolthreads/t5140/wp.pdf








Tuesday Oct 09, 2007

Floating Point performance on the UltraSPARC T2 processor

UltraSPARC T1 Floating Point Performance

The first generation of CMT processor, the UltraSPARC T1, had a single floating point unit shared between 8 cores. This tradeoff was made to keep the chip a reasonable size in 90nm silicon technology. We also knew that most commercial applications have little or no floating point content so it was a calculated risk.

The T1 floating point unit has some other limitations: it is at the other side of the crossbar from all the cores, so there is a 40 cycle penalty to access the unit; only one thread can use it at a time; and some of the more obscure FP and VIS instructions were not implemented in silicon but emulated in software.

Even though there are few FP instructions in commercial applications, lack of floating point performance is a difficult concept for folks to grasp. They don't usually give it much thought. To help ease the transition to T1 we created a tool, cooltst, available from www.opensparc.net/cooltools, that gives a simple indication of the percentage of floating point in an application using the hardware performance counters available on most processors.

The limited floating point capability on T1 did have a number of downsides:

  • It gave FUD to our major competitors

  • It was confusing to customers who had never had to previously test FP content in their applications

  • It excluded T1 systems from many of the major Financial applications such as Monte Carlo and Risk Management

  • It also excluded the T1 from the High Performance Computing (HPC) space

UltraSPARC T2 Floating Point Performance

The next generation of CMT, the UltraSPARC T2, which we are releasing in systems today, is built in 65nm technology. This process gives nearly 40% more circuits in the same die size. One of the priorities for the design was to fix the limited floating point capability. This was achieved as follows:

  • There is now a single Floating Point Unit (FPU) per core, for a total of 8 on chip. Each FPU is shared by all 8 threads in the core

  • Each FPU actually has 3 separate pipelines: an Add/Multiply pipeline which handles most of the floating point operations, a graphics pipeline that implements VIS 2.0, and a Divide/Square-root pipeline (see diagram below)

  • The first 2 pipelines are 12 stage (see diagram on the right) and fully pipelined to accommodate many threads at the same time. The Divide/Sqrt has a variable latency depending on the operation

  • Access to the FPU is in 6 cycles (compared to 40 on the T1)

  • All FP and VIS 2.0 instructions are implemented in silicon, no more software emulation

  • The FPU also enhances integer performance by performing integer multiply, divide and popc

  • The FPU also enhances Cryptographic performance by providing arithmetic functions to the per core Crypto units.

The theoretical max raw performance of the 8 floating point units is 11 GFlops/s. A huge advantage over other implementations, however, is that 64 threads can share the units, and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 GFlops/s.
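For reference, here is where the figure comes from (a sketch, assuming one floating point result per FPU per cycle at the 1.4GHz clock):

fpus = 8                 # one FPU per core, 8 cores
clock_ghz = 1.4          # top UltraSPARC T2 frequency

peak_gflops = fpus * clock_ghz                       # ~11.2 GFlops/s theoretical peak
print(f"peak ~ {peak_gflops:.1f} GFlops/s")
print(f"90%  ~ {0.9 * peak_gflops:.1f} GFlops/s")    # roughly what our experiments achieved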

From a Floating point perspective the other advantage T2 has is the huge memory bandwidth from the on-chip memory controllers that are connected directly to FBDIMM memory. This memory subsystem can deliver a theoretical max of 60GB/s of combined memory bandwidth.

For example, during early testing utilizing the floating-point rate component of the benchmark suite SPECfp_rate2000, some of the codes achieved over 20 Gbytes per second of memory bandwidth.


This combination of FP and memory bandwidth can be seen in our SPEC CPU2006 floating-point rate result (1). The Sun SPARC Enterprise T5220 server, driven by the UltraSPARC T2 processor running at 1.4GHz, achieved a world record result of 62.3 SPECfp_rate2006 running with 63 threads – note this number will be published at http://www.spec.org/cpu2006/results/res2007q4/#SPECfp_rate after the systems are launched.


Another proof point is a set of floating point applications called the Performance Evaluation Application Suite (PEAS), developed by a Sun engineer, Ruud van der Pas, and described in his blog http://blogs.sun.com/ruud/

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single threaded user programs derived from real applications. Ruud has run all twenty of these codes in parallel on a T2 system, and then two sets of the codes – 40 in parallel. There was still plenty of spare CPU on the box.


At the UltraSPARC T2 launch (http://www.sun.com/featured-articles/2007-0807/feature/index.jsp) Dave Patterson and Sam Williams from Berkeley presented their findings using sparse matrix codes on the T2. Their tests consist of a variety of FP-intensive HPC codes that have large memory bandwidth requirements. Sam will also present his findings at SC07: http://sc07.supercomputing.org/schedule/event_detail.php?evid=11090

We have also had a number of Early Access customers testing floating point applications. One customer can now fully load his T5120 with work that previously would fail to complete in the desired time on a T2000.

In conclusion, with the UltraSPARC T2 we have solved the limited floating point capability that was an issue on the original T1. Our guidance to avoid running applications with more than 2% FP instructions on the T1 does not apply to the T2. No changes are required by developers to take advantage of the 8 floating point units as they are fully integrated into the pipeline of each core.



(1) SPEC® and the benchmark name SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation. The presented result has been submitted to SPEC. For the latest SPECcpu2006 benchmark results, visit http://www.spec.org/cpu2006. Competitive claims are based on results at www.spec.org/cpu2006 as of 8 October 2007.



T5120 and T5220 system overview

Sun today launches the new line of servers based on the UltraSPARC T2 processor. The T2 processor is the next generation of CMT following on from the very successful UltraSPARC T1. The new servers are called the T5120 and T5220. The T5120 is a 1U server and the T5220 is a 2U server.


The T5120/T5220 systems differ greatly from the T1000 and T2000 systems of today, having a completely redesigned motherboard.

Answers to common FAQs are

  • An UltraSPARC T1 processor cannot be put in a T5120/T5220 Motherboard

  • An UltraSPARC T2 cannot be put in a T1000/T2000 system

  • A T1000/T2000 cannot be upgraded to a T5120/T5220


The T5120 and T5220 use the exact same motherboard. In fact the only differences between the two systems are:

  • Height: 1U vs 2U.

  • Power supplies: 650 Watts for the T5120, 750 Watts for the T5220. Note the power supplies between the two systems are physically different and cannot be interchanged. Also both systems have two hot pluggable power supplies.

  • Number of PCI-E slots: 3 for the T5120 versus 6 for the T5220

  • Max number of disks: 4 for the T5120 versus 8 for the T5220

The first thing you will notice is that the T5120 and T5220 are longer than the UltraSPARC T1 based T1000 and T2000 systems. The T5120/T5220 are 28.1 inches long versus 24 inches on the previous generation.

To open the system, press down on the button in the center of the lid and push towards the end. Note that just like the T2000 there is an intrusion switch in the top right hand corner of the system. If the lid is opened, power is cut to the system, so make sure the OS has been gracefully shut down before opening the lid.

On the T2000 systems today the service controller (SC) is on a small PCB in a special slot. For the T5120/T5220 we have integrated the SC on the motherboard.


CPU

Looking at the system, the processor is under the large copper heatsink in the middle of the Motherboard between the 16 memory DIMM slots. The UltraSPARC T2 processor frequency is either 1.2GHz or 1.4GHz and comes in 4/6/8-core options.


Memory

The memory in T5120/T5220 systems is FBDIMM and different from the DDR2 used in T1000/T2000. There is a photo of an FBDIMM memory stick on the right.

Note you cannot use current memory from a T1000/T2000 in the new systems.

The T5120/T5220 have 16 DIMM slots and currently 3 DIMM size options: 1GB, 2GB and 4GB. So the maximum memory today is 64GB in these systems.

The DIMM slots can be half populated: insert the first DIMM in the slot closest to the CPU and then fill every second slot after that.


Fans

The fans are accessed through the smaller lid in the cover. The fan unit now consists of 2 fans connected together. The T5120/T5220 only require 1 row of fans for cooling; the second row will be empty. Note the shaped plastic on top of the processor and memory in the T5220, which is used to force air from the fans over the CPU and DIMMs. The fans are hot-pluggable and we have implemented variable fan speed control to reduce noise. Under nominal conditions the fans will run at half speed.


Disks

The T5120/T5220 systems use the same 73GB and 146GB SAS drives available on T2000-based systems today, but with a different bracket. Thus using current T2000 disks requires swapping the bracket.

T5120/T5220 systems use the LSI 1068E RAID controller in order to support 8 physical disks. Apart from the extra disks, this is functionally equivalent to the controller used in the T2000 today. RAID 0+1 is available via the Solaris raidctl command. Note that mirroring a root partition still needs to be done before the OS is installed.


One of the big differences between the T5120/T5220 and T1-based systems is the I/O configuration. As mentioned previously, the T5120 (1U) has 3 PCI-E slots and the T5220 (2U) has 6. There are no PCI-X slots on the new systems. Since the launch of the T1 systems we have seen increasing availability of low-profile PCI-Express cards for all major HBA applications: FC-AL, GigE and 10Gig networking, IB, etc.

I/O

The UltraSPARC T2 has two 10Gig network ports built into the processor. These ports provide superior 10Gig performance and require fewer CPU cycles. The 2 interfaces on the chip are industry standard XAUI. There is a card available from Sun to convert XAUI to a fiber 10Gig connection. An example of the XAUI card and its optics are shown on the right. The electrical connector on a XAUI card (the gold connector in the photo) is towards the back of the card which is a different position to standard PCI-E cards.

Unlike on the T2000, the PCI-E and XAUI cards on the T5120/T5220 lie on their side and plug into 3 riser cards that then plug into the motherboard. The 2U riser can be seen in the photo on the right. The two bigger connectors on the left of the riser are for PCI-E and the small one on the right is for XAUI.

The 1U has a single layer of 3 slots and the 2U has 2 layers. All the PCI-E connectors are either x8 or x16 in size, but are actually wired x4 or x8.

Both T5120 and T5220 systems have 4 Gigabit Ethernet ports integrated into the motherboard. Like the T2000 these ports use 2 Intel Ophir chips and the e1000g driver in the Solaris OS.


Rear

Looking at the back of either the T5120 or T5220 you will see that there are two redundant power supplies.

There is a serial port and a 100Mb network port for the System Controller, as well as the 4 Gigabit ports. At the right hand corner is an RS232 serial port designated as TTYA by the operating system. There are two USB 2.0-capable ports at the rear of both systems.


Front

In the front there are slots for 4 disks on a T5120 and 8 on a T5220. On both systems there is also a DVD in the top right hand corner. There are two USB 2.0-capable ports at the front of both systems as well.


Front / Rear LEDs

On the rear of the system we have 3 LEDs, which are, from left to right:

  • Locator LED

  • Service Required LED

  • Power OK LED

On the front of the system we have the same 3 LEDs on the left hand side. On the right hand side we have 3 more:

  • TOP which indicates a fan needs attention

  • PS LED indicates a power supply has an issue

  • Over temp indicator


Lessons learned from T1



We created UltraSPARC T1 and launched it into the world in November 2005. Adoption was slow at the start as CMT was so different. We spent many hours patiently explaining how it worked to customers, partners, etc. We also did Proof of Concepts (POCs) with many of Sun's major customers around the world. As we progressed and the product ramped, we gathered a body of knowledge on applications, how they worked on CMT and how to tune them. We also wrote whitepapers available at http://www.sun.com/blueprints and we created tools that we posted at http://www.opensparc.net/cool-tools.html

Now here we are two years later about to launch UltraSPARC T2. We have learned many lessons along the way.

1. We have much more experience in helping customers to migrate to Solaris

Migrations involve a number of steps:


  • Recertifying customers' application stacks on Solaris 10
  • Identifying legacy applications that cannot be moved to Solaris 10

To help the deployment of legacy applications we have developed the Solaris 8 Migration Assistant, internal Sun codename Etude (http://blogs.sun.com/dp/entry/project_etude_revealed). Etude allows a user to run a Solaris 8 application inside a BrandZ zone in Solaris 10.



2. Customers often evaluate CMT performance with a single threaded application.

Many times a customer has a standard benchmark that they have used for years to evaluate all hardware. They have collected a body of results that they use to rank servers. These tests are often single threaded and viewed as a “power” test of the server. The customer feels such a test has the effect of “leveling” the performance playing field. This test is often the first door that needs to be passed to enable further evaluation.

I have received many mails from folks that start “the performance on the T2000 was very poor, it was 50% of a v440”. My first question is always: was the CMT server 96% idle at the time of the test? If so, this usually points to a single threaded test.

Single thread performance is not the design point for CMT. The pipeline is simple and designed above all else to be shared, thus masking the memory stalls of threads. This design leads to its extremely low power and its extremely high throughput. In UltraSPARC T2 we have continued this focus. There are now twice the integer pipelines (16 in total), which doubles the throughput of the chip. We have added a couple more pipeline stages and increased the size of some of the caches, which increases single thread performance by about 25%.

With CMT we need to ask the question: does this test really reflect the true nature of the customer's workload? If the customer truly needs single thread performance then CMT is not for them. In many situations, however, the reality is that they really require throughput. As many customers have found, in a throughput environment CMT is a far superior architecture.



3. The 1/4 frequency argument is a common misconception of CMT performance

The ¼ Frequency argument goes as follows:


  • A CMT pipeline runs at say 1.2GHz and has 4 threads sharing it
  • Therefore each thread only gets 1/4 of the cycles and effectively runs at 300MHz
  • This makes it less performant than an old US II chip

This line of argument doesn't hold because most commercial code chases pointers and is constantly loading data structures. On average a commercial application stalls every 100 instructions for a variety of reasons, such as a TLB miss, an I-cache miss, a Level 2 cache miss, etc. When a thread stalls it is usually delayed for many cycles; an I-cache miss, for instance, is 23 cycles. So even though a thread is running at 1.2GHz it usually spends 70% of its time stalled. This is why major processor manufacturers create ever deeper out-of-order pipelines in an effort to avoid these stalls.

All this stalling is perfect for CMT. The hardware automatically switches out a thread when it stalls and shares its cycles amongst the other 3 threads on the pipeline, masking the stall. With this technique we can utilize the pipeline 75% - 80% of the time, provided there are enough threads to absorb the stalls.
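A toy model shows why (my simplification, not a simulation of the real pipeline): if each thread is independently stalled 70% of the time, the pipeline only sits idle when every thread is stalled at once.

def pipeline_utilization(threads, stall_fraction=0.70):
    """Probability that at least one of the threads is ready to issue."""
    return 1.0 - stall_fraction ** threads

for threads in (1, 2, 4):
    print(f"{threads} thread(s): ~{pipeline_utilization(threads):.0%} utilized")
# 1 thread:  ~30%
# 2 threads: ~51%
# 4 threads: ~76%  -- in line with the 75% - 80% figure above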



4. Most commercial applications have little or no floating point content.

As part of the UltraSPARC T1 rollout we introduced a tool called cooltst (http://cooltools.sunsource.net/cooltst/index.html) that gives an indication of the percentage of floating point in a current deployment environment. In reviewing the output from many hundreds of customer cooltst runs, we rarely see a large floating point indication.

"Most" commercial application developers would rather not deal with floating point as it is more difficult to program. There are of course exceptions in the commercial space such as SAS and portions of the SAP stack. One big exception is Wall Street with such applications as Monte Carlo.

In UltraSPARC T2 we added a fully pipelined floating point unit per core, shared by 8 threads. These FP units can deliver over 11 GFlops of floating point performance. So the floating point issue has been completely eliminated in T2.



5. One of the biggest gains in Java apps is moving to 1.5 or 1.6 JVM

Many Java applications today are running on JVM 1.4.2. The last build of this version was created in Dec 2003 and it has now officially entered the Sun EOL transition period described at
http://java.sun.com/j2se/1.4.2/.

One of the best ways to improve the throughput performance of a Java application on CMT servers is to upgrade the JVM to at least the latest 1.5 JVM or preferably 1.6. These versions of the JVM have a host of new features and performance optimizations, many targeted specifically at CMT. I have seen a 15% - 30% increase in the performance of Java applications when migrated from 1.4.2 to 1.6.

The issue is complicated slightly by older versions of ISV software that are only supported on 1.4.2. Again we encourage customers to migrate to the newer version of the ISV stack.



6. There is a set of applications where CMT really shines

We have worked with over 200 customers in the last 2 years and the following list covers a large portion of the applications where CMT showed excellent performance.

Webservers.

  • Sunone
  • Apache

J2EE Appservers.

  • BEA Weblogic
  • IBM Websphere
  • Glassfish (formerly SunOne Appserver)

Database Servers.

  • Oracle including RAC
  • DB2
  • Sybase
  • mySQL

Mail Servers.


  • Sendmail
  • Domino
  • Brightmail including Spamguard

Java Throughput Applications

Traditional Appservers.


  • Siebel
  • Peoplesoft

Net Backup

JES Stack


  • Directory, Portal, Access Manager




