Floating Point performance on the UltraSPARC T2 processor

UltraSPARC T1 Floating Point Performance

The first-generation CMT processor, the UltraSPARC T1, had a single floating point unit shared between 8 cores. This tradeoff was made to keep the chip a reasonable size in 90nm silicon technology. We also knew that most commercial applications have little or no floating point content, so it was a calculated risk.

The T1 floating point unit has some other limitations: it sits on the other side of the crossbar from all the cores, so there is a 40 cycle penalty to access the unit; only one thread can use it at a time; and some of the more obscure FP and VIS instructions were not implemented in silicon but emulated in software.

Even though there are few FP instructions in commercial applications, limited floating point performance is a difficult concept for folks to grasp. They don't usually give it much thought. To help ease the transition to T1 we created a tool, cooltst, available from www.opensparc.net/cooltools, that gives a simple indication of the percentage of floating point instructions in an application, using the hardware performance counters available on most processors.
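As a rough illustration of the kind of check described above, the sketch below turns two counter readings into a floating point percentage and compares it with the 2% guidance for the T1 mentioned at the end of this post. This is not how cooltst itself is implemented; the function name, the workload names, and the counter values are all hypothetical.

```python
# Minimal sketch: percentage of FP instructions from two counter readings.
# Not cooltst's implementation; names and sample values are hypothetical.

T1_FP_GUIDANCE_PERCENT = 2.0  # guidance for the T1 quoted in this post


def fp_percent(fp_instructions: int, total_instructions: int) -> float:
    """Return FP instructions as a percentage of all instructions executed."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return 100.0 * fp_instructions / total_instructions


def within_t1_guidance(fp_instructions: int, total_instructions: int) -> bool:
    """True when the FP share is at or below the 2% guidance."""
    return fp_percent(fp_instructions, total_instructions) <= T1_FP_GUIDANCE_PERCENT


if __name__ == "__main__":
    # Hypothetical (fp, total) instruction counts for two workloads.
    samples = {
        "web_tier": (1_200_000, 900_000_000),
        "risk_model": (85_000_000, 700_000_000),
    }
    for name, (fp, total) in samples.items():
        verdict = "within" if within_t1_guidance(fp, total) else "above"
        print(f"{name}: {fp_percent(fp, total):.2f}% FP, {verdict} the T1 guidance")
```

In practice the two counts would come from the processor's hardware performance counters sampled while the application runs under a representative load.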

The limited floating point capability on T1 did have a number of downsides:

  • It gave our major competitors material for FUD

  • It was confusing to customers who had never previously had to test the FP content of their applications

  • It excluded T1 systems from many of the major financial applications, such as Monte Carlo simulation and risk management

  • It also excluded the T1 from the High Performance Computing (HPC) space

UltraSPARC T2 Floating Point Performance

The next generation of CMT, the UltraSPARC T2, which we are releasing in systems today, is built in 65nm technology. This process gives nearly 40% more circuits in the same die size. One of the priorities for the design was to fix the limited floating point capability. This was achieved as follows:

  • There is now one Floating Point Unit (FPU) per core, for a total of 8 on the chip. Each FPU is shared by all 8 threads in the core

  • Each FPU actually has 3 separate pipelines: an Add/Multiply pipeline that handles most of the floating point operations, a graphics pipeline that implements VIS 2.0, and a Divide/Square root pipeline (see diagram below)

  • The first 2 pipelines are 12 stages deep (see diagram on the right) and fully pipelined to accommodate many threads at the same time. The Divide/Sqrt pipeline has a variable latency depending on the operation

  • Access to the FPU takes 6 cycles (compared to 40 on the T1)

  • All FP and VIS 2.0 instructions are implemented in silicon, so there is no more software emulation

  • The FPU also enhances integer performance by performing integer multiply, divide and population count (popc)

  • The FPU also enhances cryptographic performance by providing arithmetic functions to the per-core Crypto units

The theoretical max raw performance of the 8 floating point units is 11 GFlop/s (billions of floating point operations per second). A huge advantage over other implementations, however, is that 64 threads can share the units, and thus we can achieve an extremely high percentage of theoretical peak. Our experiments have achieved nearly 90% of the 11 GFlop/s.
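The peak figure can be reproduced with simple arithmetic. The sketch below assumes each FPU completes one floating point operation per cycle (an assumption, not stated in this post) and uses the 1.4GHz clock of the T5220 result quoted later; the function name is made up for illustration.

```python
# Minimal sketch of the peak-throughput arithmetic.
# Assumption (not stated in the post): one FP operation per cycle per FPU.

FPUS_PER_CHIP = 8            # one FPU per core, 8 cores
CLOCK_GHZ = 1.4              # clock of the T5220 result quoted in this post
FLOPS_PER_CYCLE_PER_FPU = 1  # assumed


def peak_gflops(fpus: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlop/s: units x clock x operations per cycle."""
    return fpus * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    peak = peak_gflops(FPUS_PER_CHIP, CLOCK_GHZ, FLOPS_PER_CYCLE_PER_FPU)
    sustained = 0.90 * peak  # "nearly 90%" of peak, treated as an upper bound
    print(f"theoretical peak:   {peak:.1f} GFlop/s")
    print(f"at 90% of the peak: {sustained:.1f} GFlop/s")
```

Under those assumptions the product comes to a little over 11 GFlop/s, which matches the rounded figure above.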

From a floating point perspective, the other advantage the T2 has is the huge memory bandwidth from the on-chip memory controllers, which are connected directly to FBDIMM memory. This memory subsystem can deliver a theoretical max of 60 GB/s of combined memory bandwidth.

For example, during early testing with the SPECfp_rate2000 floating-point rate benchmarks, some of the codes achieved over 20 GB/s of memory bandwidth.
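To put the bandwidth numbers next to the floating point peak, the sketch below works out two derived ratios from the figures quoted in this post: the theoretical bytes of memory traffic available per floating point operation, and the share of the theoretical bandwidth that a 20 GB/s code uses. The ratios are my own illustration rather than published numbers, and the function names are hypothetical.

```python
# Minimal sketch: machine balance from the figures quoted in this post.
# The derived ratios are illustrative, not published results.

PEAK_FLOPS = 11e9      # theoretical peak, flop/s
PEAK_BANDWIDTH = 60e9  # theoretical combined memory bandwidth, bytes/s
OBSERVED_BANDWIDTH = 20e9  # lower bound seen in early SPECfp_rate2000 testing


def bytes_per_flop(bandwidth: float, flops: float) -> float:
    """Bytes of memory traffic available per floating point operation."""
    return bandwidth / flops


def fraction_of_peak(observed: float, peak: float) -> float:
    """Share of a theoretical peak that an observed rate represents."""
    return observed / peak


if __name__ == "__main__":
    balance = bytes_per_flop(PEAK_BANDWIDTH, PEAK_FLOPS)
    share = fraction_of_peak(OBSERVED_BANDWIDTH, PEAK_BANDWIDTH)
    print(f"theoretical balance: {balance:.2f} bytes per flop")
    print(f"observed bandwidth:  at least {share:.0%} of theoretical peak")
```

A high bytes-per-flop ratio is what lets bandwidth-hungry FP codes keep the floating point units fed.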

This combination of FP and memory bandwidth can be seen in our SPEC CPU2006 floating-point rate result (1). The Sun SPARC Enterprise T5220 server, driven by the UltraSPARC T2 processor running at 1.4GHz, achieved a world record result of 62.3 SPECfp_rate2006 running with 63 threads. Note that this number will be published at http://www.spec.org/cpu2006/results/res2007q4/#SPECfp_rate after the systems are launched.

Another proof point is a set of floating point applications called the Performance Evaluation Application Suite (PEAS), developed by a Sun engineer, Ruud van der Pas, and described in his blog at http://blogs.sun.com/ruud/

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single-threaded user programs derived from real applications. Ruud has run all twenty of these codes in parallel on a T2 system, and then two sets of the codes, 40 in parallel. There was still plenty of spare CPU on the box.

At the UltraSPARC T2 launch http://www.sun.com/featured-articles/2007-0807/feature/index.jsp Dave Patterson and Sam Williams from Berkeley presented their findings using sparse matrix codes on the T2. Their tests consist of a variety of FP-intensive HPC codes that have large memory bandwidth requirements. Sam will also present his findings at SC07 http://sc07.supercomputing.org/schedule/event_detail.php?evid=11090

We have also had a number of Early Access customers who are testing floating point applications. One customer can now fully load his T5120, where previously the same workload would fail to complete in the desired time on a T2000.

In conclusion, with the UltraSPARC T2 we have solved the limited floating point capability that was an issue on the original T1. Our guidance to avoid running applications with more than 2% FP instructions on the T1 does not apply to the T2. Developers do not need to make any changes to take advantage of the 8 floating point units, as they are fully integrated into the pipeline of each core.

(1) SPEC® and the benchmark name SPECfp® are registered trademarks of the Standard Performance Evaluation Corporation. Presented results have been submitted to SPEC. For the latest SPECcpu2006 benchmark results, visit http://www.spec.org/cpu2006. Competitive claims are based on results at www.spec.org/cpu2006 as of 8 October 2007.