Tuesday Oct 09, 2007

Floating Point performance on the UltraSPARC T2 processor

UltraSPARC T1 Floating Point Performance

The first generation CMT processor, the UltraSPARC T1, had a single floating point unit shared by all 8 cores. This tradeoff was made to keep the chip a reasonable size in 90nm silicon technology. We also knew that most commercial applications have little or no floating point content, so it was a calculated risk.

The T1 floating point unit has some other limitations: it sits on the other side of the crossbar from all the cores, so there is a 40 cycle penalty to access it; only one thread can use it at a time; and some of the more obscure FP and VIS instructions were not implemented in silicon but emulated in software. Even though there are few FP instructions in commercial applications, lack of floating point performance is a difficult concept for folks to grasp, as they don't usually give it much thought. To help ease the transition to T1 we created a tool, cooltst, available from http://www.opensparc.net/cooltools, that gives a simple indication of the percentage of floating point instructions in an application using the hardware performance counters available on most processors.

The limited floating point capability on T1 had a number of downsides:

  • It gave our major competitors ammunition for FUD

  • It was confusing to customers, who had never previously had to test the FP content of their applications

  • It excluded T1 systems from many of the major financial applications such as Monte Carlo simulation and risk management

  • It also excluded the T1 from the High Performance Computing (HPC) space

UltraSPARC T2 Floating Point Performance

The next generation of CMT, the UltraSPARC T2, which we are releasing in systems today, is built in 65nm technology. This process gives nearly 40% more circuits in the same die size. One of the priorities for the design was to fix the limited floating point capability. This was achieved as follows:

  • There is now a single Floating Point Unit (FPU) per core, for a total of 8 on the chip. Each FPU is shared by all 8 threads in its core

  • Each FPU actually has 3 separate pipelines: an Add/Multiply pipeline which handles most of the floating point operations, a graphics pipeline that implements VIS 2.0, and a Divide/Square-root pipeline

  • The first 2 pipelines are 12 stages deep and fully pipelined to accommodate many threads at the same time. The Divide/Square-root pipeline has a variable latency depending on the operation

  • Access to the FPU takes 6 cycles (compared to 40 on the T1)

  • All FP and VIS 2.0 instructions are implemented in silicon; there is no more software emulation

  • The FPU also enhances integer performance by performing integer multiply, divide and popc (population count)

  • The FPU also enhances Cryptographic performance by providing arithmetic functions to the per core Crypto units.

The theoretical maximum raw performance of the 8 floating point units is 11 GFlop/s: at 1.4GHz, 8 fully pipelined FPUs each completing one floating point operation per cycle amount to roughly 8 × 1.4GHz ≈ 11.2 GFlop/s. A huge advantage over other implementations, however, is that 64 threads can share the units, so we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 GFlop/s.
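To make this concrete, below is a minimal sketch of the style of multithreaded microbenchmark that can approach such a peak. It is purely illustrative (it is not the code from our experiments, and the thread and iteration counts are arbitrary): each software thread runs a dependent chain of floating point adds and multiplies, so no single thread can fill the 12-stage pipeline on its own, but with 8 threads sharing each core's FPU the units stay busy.

    /*
     * A minimal sketch (illustrative only) of a multithreaded floating
     * point microbenchmark.  Thread and iteration counts are arbitrary.
     * Compile with: cc -O2 fpbench.c -lpthread -o fpbench
     */
    #include <stdio.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define NTHREADS   64          /* one software thread per T2 hardware strand */
    #define ITERATIONS 100000000L  /* loop iterations per thread (arbitrary)     */

    static double results[NTHREADS];   /* per-thread sinks so the work survives */

    static void *fp_worker(void *arg)
    {
        long id = (long)arg;
        double a = 1.000000001, sum = 0.0;
        long i;

        /* Two FP operations (one multiply, one add) per iteration. */
        for (i = 0; i < ITERATIONS; i++) {
            a *= 1.000000001;
            sum += a;
        }
        results[id] = sum;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct timeval start, end;
        double secs, flop;
        long t;

        gettimeofday(&start, NULL);
        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, fp_worker, (void *)t);
        for (t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
        flop = 2.0 * ITERATIONS * NTHREADS;   /* add + multiply per iteration */
        printf("%.2f GFlop/s\n", flop / secs / 1e9);
        return 0;
    }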

From a floating point perspective, the other advantage the T2 has is the huge memory bandwidth from the on-chip memory controllers, which are connected directly to FBDIMM memory. This memory subsystem can deliver a theoretical maximum of 60GB/s of combined memory bandwidth.

For example, during early testing with SPECfp_rate2000, the floating-point rate component of the SPEC CPU2000 benchmark suite, some of the codes achieved over 20GB/s of memory bandwidth.
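Memory bandwidth at this level is commonly measured with a STREAM-style kernel. The sketch below is again only illustrative (the 20GB/s figure above came from SPECfp_rate2000 codes, not this program, and the array size, repetition and thread counts are arbitrary): each thread streams through its slice of three large arrays, counting 2 reads and 1 write of 8 bytes per element.

    /*
     * A minimal STREAM-style triad sketch (illustrative only).
     * Compile with: cc -O2 triad.c -lpthread -o triad
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <sys/time.h>

    #define NTHREADS 32                 /* arbitrary; the T2 offers 64 strands */
    #define N (8L * 1024 * 1024)        /* doubles per array: 64MB, beyond L2  */
    #define REPS 20                     /* passes over the arrays              */

    static double *a, *b, *c;

    static void *triad(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
        long i, r;

        for (r = 0; r < REPS; r++)
            for (i = lo; i < hi; i++)
                a[i] = b[i] + 3.0 * c[i];   /* 2 reads + 1 write per element */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct timeval start, end;
        double secs;
        long t, i;

        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        gettimeofday(&start, NULL);
        for (t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, triad, (void *)t);
        for (t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
        /* 24 bytes of memory traffic per element per pass */
        printf("%.2f GB/s\n", 24.0 * N * REPS / secs / 1e9);
        return 0;
    }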


This combination of FP and memory bandwidth can be seen in our SPEC CPU2006 floating-point rate result (1). The Sun SPARC Enterprise T5220 server, driven by the UltraSPARC T2 processor running at 1.4GHz, achieved a world record result of 62.3 SPECfp_rate2006 running with 63 threads – note this number will be published at http://www.spec.org/cpu2006/results/res2007q4/#SPECfp_rate after the systems are launched.


Another proof point is a set of floating point applications called the Performance Evaluation Application Suite (PEAS), developed by a Sun engineer, Ruud van der Pas, and described in his blog http://blogs.sun.com/ruud/

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single threaded user programs derived from real applications. Ruud has run all twenty of these codes in parallel on a T2 system, and then two sets of the codes – 40 in parallel. There was still plenty of spare CPU on the box.


At the UltraSPARC T2 launch (http://www.sun.com/featured-articles/2007-0807/feature/index.jsp) Dave Patterson and Sam Williams from Berkeley presented their findings using sparse matrix codes on the T2. Their tests consist of a variety of FP intensive HPC codes that have large memory bandwidth requirements. Sam will also present his findings at SC07: http://sc07.supercomputing.org/schedule/event_detail.php?evid=11090

We have also had a number of Early Access customers testing floating point applications. One customer can now fully load his T5120, whereas previously his workload would fail to complete in the desired time on a T2000.

In conclusion, with the UltraSPARC T2 we have solved the limited floating point capability that was an issue on the original T1. Our guidance to avoid running applications with more than 2% FP instructions on the T1 does not apply to the T2. No changes are required by developers to take advantage of the 8 floating point units, as they are fully integrated into the pipeline of each core.



(1) SPEC® and the benchmark name SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation. The presented result has been submitted to SPEC. For the latest SPEC CPU2006 benchmark results, visit http://www.spec.org/cpu2006. Competitive claims are based on results at www.spec.org/cpu2006 as of 8 October 2007.



T5120 and T5220 system overview

Sun today launches the new line of servers based on the UltraSPARC T2 processor. The T2 processor is the next generation of CMT following on from the very successful UltraSPARC T1. The new servers are called the T5120 and T5220. The T5120 is a 1U server and the T5220 is a 2U server.


The T5120/T5220 systems differ greatly from the T1000 and T2000 systems of today, having a completely redesigned motherboard.

Answers to some frequently asked questions:

  • An UltraSPARC T1 processor cannot be put in a T5120/T5220 motherboard

  • An UltraSPARC T2 cannot be put in a T1000/T2000 system

  • A T1000/T2000 cannot be upgraded to a T5120/T5220


The T5120 and T5220 use the exact same motherboard. In fact the only differences between the two systems are:

  • Height: 1U vs 2U.

  • Power supplies: 650 Watts for the T5120, 750 Watts for the T5220. Note the power supplies of the two systems are physically different and cannot be interchanged. Both systems have two hot-pluggable power supplies.

  • Number of PCI-E slots: 3 for the T5120 versus 6 for the T5220

  • Max number of disks: 4 for the T5120 versus 8 for the T5220

The first thing you will notice is that the T5120 and T5220 are longer than the UltraSPARC T1-based T1000 and T2000 systems: the T5120/T5220 are 28.1 inches long versus 24 inches for the previous generation.

To open the system, press down on the button in the center of the lid and push towards the end. Note that, just like the T2000, there is an intrusion switch in the top right hand corner of the system. If the lid is opened, power is cut to the system, so make sure the OS has been gracefully shut down before opening the lid.

On the T2000 systems today the service controller (SC) is on a small PCB in a special slot. For the T5120/T5220 we have integrated the SC onto the motherboard.


CPU

Inside the system, the processor sits under the large copper heatsink in the middle of the motherboard, between the 16 memory DIMM slots. The UltraSPARC T2 processor runs at either 1.2GHz or 1.4GHz and comes in 4, 6 and 8-core options.


Memory

The memory in T5120/T5220 systems is FBDIMM, different from the DDR2 used in the T1000/T2000.

Note you cannot use existing memory from a T1000/T2000 in the new systems.

The T5120/T5220 have 16 DIMM slots and currently 3 DIMM size options: 1GB, 2GB and 4GB. So the maximum memory today is 64GB (16 × 4GB) in these systems.

The DIMM slots can be half populated: insert the first DIMM in the slot closest to the CPU and then populate every second slot after that.


Fans

The fans are accessed through the smaller lid in the cover. A fan unit now consists of 2 fans connected together. The T5120/T5220 only require 1 row of fans for cooling; the second row will be empty. Note the shaped plastic duct on top of the processor and memory in the T5220, which is used to force air from the fans over the CPU and DIMMs. The fans are hot-pluggable, and we have implemented variable fan speed control to reduce noise. Under nominal conditions the fans will run at half speed.


Disks

The T5120/T5220 systems use the same 73GB and 146GB SAS drives available on T2000-based systems today, but with a different bracket. Thus using current T2000 disks requires swapping the bracket.

T5120/T5220 systems use the LSI 1068E RAID controller in order to support 8 physical disks. Apart from the extra disks, this is functionally equivalent to the controller used in the T2000 today. RAID 0+1 is available via the Solaris raidctl utility. Note that mirroring the root partition still needs to be done before the OS is installed.


I/O

One of the big differences between the T5120/T5220 and T1-based systems is the I/O configuration. As mentioned previously, the T5120 (1U) has 3 PCI-E slots and the T5220 (2U) has 6. There are no PCI-X slots on the new systems. Since the launch of the T1 systems we have seen increasing availability of low-profile PCI-Express cards for all the major HBA applications: FCAL, GigE and 10Gig networking, IB, etc.

The UltraSPARC T2 has two 10Gig network ports built into the processor. These ports provide superior 10Gig performance and require fewer CPU cycles. The 2 interfaces on the chip are industry standard XAUI. A card is available from Sun to convert XAUI to a fiber 10Gig connection. The electrical connector on a XAUI card is towards the back of the card, which is a different position from standard PCI-E cards.

Unlike the T2000, the PCI-E and XAUI slots on the T5120/T5220 are on their side and plug into 3 riser cards that in turn plug into the motherboard. On the 2U riser, the two bigger connectors are for PCI-E and the small one is for XAUI.

The 1U has a single layer of 3 slots and the 2U has 2 layers. All the PCI-E connectors are physically x8 or x16 in size, but are actually wired as x4 or x8.

Both T5120 and T5220 systems have 4 Gigabit Ethernet ports integrated into the motherboard. Like the T2000 these ports use 2 Intel Ophir chips and the e1000g driver in the Solaris OS.


Rear

Looking at the back of either the T5120 or T5220 you will see that there are two redundant power supplies.

There is a serial port and a 100Mb network port for the System Controller, as well as the 4 Gigabit ports. At the right hand corner is an RS232 serial port, designated TTYA by the operating system. There are two USB 2.0-capable ports at the rear of both systems.


Front

At the front there are slots for 4 disks on a T5120 and 8 on a T5220. On both systems there is also a DVD drive in the top right hand corner. There are two USB 2.0-capable ports at the front of both systems as well.


Front / Rear LEDs

On the rear of the system there are 3 LEDs, which are, from left to right:

  • Locator LED

  • Service Required LED

  • Power OK LED

On the front of the system the same 3 LEDs appear on the left hand side. On the right hand side there are 3 more:

  • TOP LED, which indicates a fan needs attention

  • PS LED, which indicates a power supply has an issue

  • Over-temperature LED, which indicates an over-temperature condition


Lessons learned from T1



We created the UltraSPARC T1 and launched it into the world in November 2005. Adoption was slow at the start, as CMT was so different. We spent many hours patiently explaining how it worked to customers, partners, etc. We also did Proofs of Concept (POCs) with many of Sun's major customers around the world. As we progressed and the product ramped, we gathered a body of knowledge on applications, how they worked on CMT and how to tune them. We also wrote whitepapers, available at http://www.sun.com/blueprints, and we created tools that we posted at http://www.opensparc.net/cool-tools.html

Now here we are two years later about to launch UltraSPARC T2. We have learned many lessons along the way.

1. We have much more experience in helping customers to migrate to Solaris

Migrations involve a number of steps:


  • Recertifying customers' application stacks on Solaris 10
  • Identifying legacy applications that cannot be moved to Solaris 10

To help the deployment of legacy applications we have developed the Solaris 8 Migration Assistant, internal Sun codename Etude (http://blogs.sun.com/dp/entry/project_etude_revealed). Etude allows a user to run a Solaris 8 application inside a BrandZ zone on Solaris 10.



2. Customers often evaluate CMT performance with a single threaded application.

Many times a customer has a standard benchmark that they have used for years to evaluate all hardware, and they have collected a body of results that they use to rank servers. These tests are often single threaded and viewed as a “power” test of the server. The customer feels such a test has the effect of “leveling” the performance playing field, and it is often the first door that needs to be passed to enable further evaluation.

I have received many mails from folks that start “the performance on the T2000 was very poor, it was 50% of a v440”. My first question is always: was the CMT server 96% idle at the time of the test? If so, this usually points to a single threaded test, since one busy hardware thread out of the T1's 32 leaves roughly 31/32 of the chip idle.

Single thread performance is not the design point for CMT. The pipeline is simple and designed above all else to be shared, thus masking the memory stalls of threads. This design leads to its extremely low power and its extremely high throughput. In UltraSPARC T2 we have continued this focus. There are now twice as many integer pipelines (16 in total), which doubles the throughput of the chip. We have also added a couple more pipeline stages and increased the size of some of the caches, which increases single thread performance by about 25%.

With CMT we need to ask whether such a test really reflects the true nature of the customer's workload. If the customer truly needs single thread performance then CMT is not for them. In many situations, however, the reality is that they really require throughput. As many customers have found, in a throughput environment CMT is a far superior architecture.



3. The 1/4 frequency argument is a common misconception of CMT performance

The 1/4 frequency argument goes as follows:


  • A CMT pipeline runs at, say, 1.2GHz and has 4 threads sharing it
  • Therefore each thread only gets 1/4 of the cycles and effectively runs at 300MHz
  • This makes it less performant than an old UltraSPARC II chip

This line of argument doesn't hold because most commercial code chases pointers and is constantly loading data structures. On average a commercial application stalls every 100 instructions for a variety of reasons, such as a TLB miss, an I-cache miss or a Level 2 cache miss. When a thread stalls it is usually delayed for many cycles; an I-cache miss, for instance, costs 23 cycles. So even though a thread is running at 1.2GHz, it usually spends 70% of its time stalled. This is why major processor manufacturers create ever deeper out-of-order pipelines in an effort to avoid these stalls.
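To make the pointer chasing pattern concrete, here is a minimal sketch of the kind of linked-list traversal that dominates commercial code (illustrative only; the node count is arbitrary). Every iteration issues a load that depends on the previous one and usually misses the caches, so a single thread spends the bulk of its cycles stalled, exactly the gap a CMT pipeline fills with work from its other threads.

    /*
     * A minimal pointer-chasing sketch (illustrative only).  Each step is
     * a dependent load that usually misses the caches, so a single thread
     * spends most of its cycles stalled.
     * Compile with: cc -O2 chase.c -o chase
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define NODES (1024 * 1024)    /* 64MB of nodes, far beyond the L2 cache */

    struct node { struct node *next; long pad[7]; };  /* 64 bytes with 8-byte pointers */

    int main(void)
    {
        struct node *pool = malloc(NODES * sizeof(struct node));
        long *order = malloc(NODES * sizeof(long));
        struct node *p;
        struct timeval start, end;
        long i;
        double ns;

        /* Link the nodes in a shuffled order so prefetching can't help. */
        for (i = 0; i < NODES; i++)
            order[i] = i;
        for (i = NODES - 1; i > 0; i--) {
            long j = rand() % (i + 1), tmp = order[i];
            order[i] = order[j]; order[j] = tmp;
        }
        for (i = 0; i < NODES - 1; i++)
            pool[order[i]].next = &pool[order[i + 1]];
        pool[order[NODES - 1]].next = &pool[order[0]];

        gettimeofday(&start, NULL);
        for (p = &pool[order[0]], i = 0; i < NODES; i++)
            p = p->next;               /* dependent load: likely a cache miss */
        gettimeofday(&end, NULL);

        ns = ((end.tv_sec - start.tv_sec) * 1e9 +
              (end.tv_usec - start.tv_usec) * 1e3) / NODES;
        printf("%.1f ns per pointer hop (final node %p)\n", ns, (void *)p);
        return 0;
    }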

All this stalling is perfect for CMT. The hardware automatically switches out a thread when it stalls and shares its cycles amongst the other 3 threads on the pipeline, masking the stall. If a single thread can keep the pipeline busy only about 30% of the time, four such threads together offer more work than the pipeline has slots, so with this technique we can utilize the pipeline 75% to 80% of the time, provided there are enough threads to absorb the stalls.



4. Most commercial applications have few or no floating point instructions.

As part of the UltraSPARC T1 rollout we introduced a tool called cooltst (http://cooltools.sunsource.net/cooltst/index.html) that gives an indication of the percentage of floating point instructions in a current deployment environment. In reviewing the output from many hundreds of customer cooltst runs, we rarely see a large floating point indication.

"Most" commercial application developers would rather not deal with floating point as it is more difficult to program. There are of course exceptions in the commercial space such as SAS and portions of the SAP stack. One big exception is Wall Street with such applications as Monte Carlo.

In UltraSPARC T2 we added a fully pipelined floating point unit per core, each shared by 8 threads. Together these FP units can deliver 11 GFlop/s of floating point performance. So the floating point issue has been completely eliminated in T2.



5. One of the biggest gains for Java apps is moving to a 1.5 or 1.6 JVM

Many Java applications today are running on JVM 1.4.2. The last build of this version was created in December 2003 and it has now officially entered the Sun EOL transition period described at http://java.sun.com/j2se/1.4.2/.

One of the best ways to improve the throughput performance of a Java application on CMT servers is to upgrade the JVM to at least the latest 1.5 JVM, or preferably 1.6. These versions of the JVM have a host of new features and performance optimizations, many targeted specifically at CMT. I have seen a 15% to 30% increase in performance of Java applications when migrated from 1.4.2 to 1.6.

The issue is complicated slightly by older versions of ISV software that are only supported on 1.4.2. Again, we encourage customers to migrate to the newer version of the ISV stack.



6. There is a set of applications where CMT really shines

We have worked with over 200 customers in the last 2 years and the following list covers a large portion of the applications where CMT showed excellent performance.

Webservers.

  • Sun ONE
  • Apache

J2EE Appservers.

  • BEA WebLogic
  • IBM WebSphere
  • GlassFish (formerly the Sun ONE Appserver)

Database Servers.

  • Oracle including RAC
  • DB2
  • Sybase
  • MySQL

Mail Servers.


  • Sendmail
  • Domino
  • Brightmail including Spamguard

Java Throughput Applications

Traditional Appservers.


  • Siebel
  • PeopleSoft

NetBackup

JES Stack


  • Directory, Portal, Access Manager




