Thursday Jul 31, 2008

UltraSPARC T2 benchmarks

I was presenting to a customer this morning, and the question came up of which applications the UltraSPARC T2 excels at. A very similar question came up when I was presenting at CommunityOne, so it's probably something I should talk about.

I do work on benchmarks - I've described before some of the analysis I did for the development of CPU2006 - but I don't get involved in the "marketing" of benchmark results - I spend my time at the disassembly level. So this is a bit of a departure for me.

Anyway, the answer to the question of "What's a T2 good for?" is basically "Everything". The only exception is that it's not going to be much use for a customer who only wants to run a single copy of a single-threaded application "and that is all". But most people want to run a bunch of codes, or multiple copies of the same code, or some multithreaded app. All of these are very suitable for the UltraSPARC T2 or UltraSPARC T2+.

In terms of evidence, here's a bunch of benchmarks for the UltraSPARC T2 and the UltraSPARC T2+. There are also a number of testimonials from customers. In terms of the benchmark numbers, I think it's important to see that the processor does well on "commercial" workloads, but also does well on SPEC CPU2006 rate, both floating point and integer.

Moving back to a more technical discussion: an UltraSPARC T2 has 8 cores, each with a floating point unit. Each core can execute 2 instructions per cycle, or one floating point operation per cycle. Assuming the chip is clocked at 1.4GHz, each chip can execute:

  • 8 cores * 2 instructions * 1.4 GHz = 22.4 billion instructions per second
  • or 8 cores * 1.4 GHz = 11.2 billion FPops per second.

The T5240 puts two of these chips into a single system, so the numbers double. So in terms of raw instruction issue performance the chips are pretty much unbeatable. Of course, the performance of a chip is normally hampered by memory stall time. But because these are CMT processors, one thread's stall time gives other threads the opportunity to execute, so stalls are much less of a limiting factor for the T2.
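The arithmetic above can be written out as a short Python sketch. It uses only the figures quoted in this entry (8 cores, 2 instructions or 1 floating point operation per core per cycle, and an assumed 1.4GHz clock); the function name `peak_rates` is made up for illustration.

```python
# Peak issue rates for the UltraSPARC T2, using the figures quoted in the text.
CORES_PER_CHIP = 8
INSTRUCTIONS_PER_CYCLE = 2  # per core
FPOPS_PER_CYCLE = 1         # per core
CLOCK_GHZ = 1.4             # assumed clock speed, as in the text


def peak_rates(chips=1):
    """Return (billions of instructions/s, billions of FPops/s) for `chips` chips."""
    instructions = chips * CORES_PER_CHIP * INSTRUCTIONS_PER_CYCLE * CLOCK_GHZ
    fpops = chips * CORES_PER_CHIP * FPOPS_PER_CYCLE * CLOCK_GHZ
    return instructions, fpops


if __name__ == "__main__":
    for chips, label in ((1, "one T2 chip"), (2, "T5240, two chips")):
        instructions, fpops = peak_rates(chips)
        print(f"{label}: {instructions:.1f} G instr/s, {fpops:.1f} G FPops/s")
```

These are upper bounds on issue rate, not measured throughput; a real workload will land somewhere below them.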

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

and are (unfortunately) not on the default path.

Threads and cores

The UltraSPARC T2 has eight cores, and each core is capable of executing 8 threads, making a total of 64 virtual processors. The threads on a core are grouped into two groups of four threads. Each core can execute at most 2 instructions per cycle. The load/store unit and the floating point unit are shared between the two groups of threads, but each group has its own integer pipeline.

Usually, threads 0-7 are assigned to core 0, threads 8-15 to core 1, and so on. The exact mapping is reported by the service processor (and prtdiag), but the mapping is only likely to be different if the system is configured with LDoMs. Taking core 0 as an example, threads 0-3 are assigned to one group and threads 4-7 are assigned to the other group.
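As a rough illustration of the usual mapping, here is a minimal Python sketch that converts a virtual processor number into a core and a thread group. It assumes the default layout described above (no LDoMs remapping); the function name `locate` is hypothetical, and the real mapping should be checked against what the service processor or prtdiag reports.

```python
THREADS_PER_CORE = 8   # 8 threads per core on the UltraSPARC T2
THREADS_PER_GROUP = 4  # two groups of four threads per core


def locate(virtual_cpu):
    """Return (core, group) for a virtual processor id under the usual mapping."""
    if not 0 <= virtual_cpu < 64:
        raise ValueError("a single T2 has virtual processors 0-63")
    core, thread_in_core = divmod(virtual_cpu, THREADS_PER_CORE)
    group = thread_in_core // THREADS_PER_GROUP
    return core, group


if __name__ == "__main__":
    for cpu in (0, 3, 4, 7, 8, 63):
        core, group = locate(cpu)
        print(f"virtual processor {cpu}: core {core}, group {group}")
```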

To disable a core using psradm it is necessary to disable all the threads on that core. On the other hand if the objective is just to reduce the total number of active threads, keeping the core enabled, then best performance will be attained if the threads are disabled across all groups rather than just disabling threads within a single group. The reason for this is that each group gets to execute a single instruction per cycle, so disabling all the threads within a group will reduce the maximum number of instructions that can be executed in a cycle.
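To make the "spread across groups" advice concrete, here is a small Python sketch that picks which virtual processors on a core to take offline, alternating between the two groups, and prints a matching `psradm -f` command line. It assumes the usual thread numbering (no LDoMs remapping); the helper name `threads_to_disable` and the choice to disable the highest-numbered threads in each group first are my own illustration, not a recommendation from the tools.

```python
THREADS_PER_CORE = 8
THREADS_PER_GROUP = 4


def threads_to_disable(core, count):
    """Pick `count` virtual processor ids on `core`, alternating between the two groups."""
    if not 0 <= count <= THREADS_PER_CORE:
        raise ValueError("count must be between 0 and 8")
    base = core * THREADS_PER_CORE
    group0 = [base + i for i in range(THREADS_PER_GROUP)]
    group1 = [base + THREADS_PER_GROUP + i for i in range(THREADS_PER_GROUP)]
    # Interleave the groups (highest-numbered thread first) so that each group
    # loses threads at the same rate and neither is emptied before the other.
    interleaved = [cpu
                   for pair in zip(reversed(group0), reversed(group1))
                   for cpu in pair]
    return sorted(interleaved[:count])


if __name__ == "__main__":
    ids = threads_to_disable(core=0, count=4)
    # psradm -f takes the listed processors offline.
    print("psradm -f " + " ".join(str(cpu) for cpu in ids))
```

With four threads disabled on core 0 this leaves two active threads in each group, so both groups can still issue an instruction each cycle.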

Compiling for the UltraSPARC T2

Today, Sun launched systems based on the UltraSPARC T2. A question that is bound to come up is what compiler flags should be used for the processor?

Sun Studio 12 has the flag -xtarget=ultraT2 to specifically target the UltraSPARC T2. But before jumping off and using this flag, let's take the flag apart and see what it actually means. There are three components that are set by the -xtarget flag:

  • -xcache flag. This flag tells the compiler to target a particular cache configuration. The flag will have an impact on floating point code where the loops can be tiled to fit into cache. Obviously not all codes are amenable to this optimisation, so the -xcache setting is usually unimportant.
  • -xchip flag. This sets the instruction latencies and instruction selection preferences. The UltraSPARC T2 (in common with the UltraSPARC T1) has a simple pipeline so there is nothing much to gain from accurately modelling the instruction latencies. There are also no real situations where it will do better with one instruction sequence in preference to another (unless one is longer than the other). So for the UltraSPARC T2 this flag has little impact on the generated code.
  • -xarch flag. The -xarch flag controls the target architecture. Traditionally this was used principally to control whether 32-bit or 64-bit binaries are generated. However, Sun Studio 12 introduced the flags -m32 and -m64 to separate the address size of the binary from the instruction set selection. There are no UltraSPARC T2 specific instructions which the compiler currently generates, so the default of the SPARC V9 ISA is fine.

To summarise, there is an UltraSPARC T2 specific compiler flag, but for most situations the best target to use would be -xtarget=generic, which should give good performance over a wide range of processors.

Wednesday Sep 26, 2007

Interpreting the performance counters on the UltraSPARC T1 and UltraSPARC T2

I've previously written up a short entry on using the UltraSPARC T1 performance counters to determine what the processor is doing and where effort might be spent in improving performance. I've just completed a follow up article for the developer portal which discusses this concept in more depth, and covers both the UltraSPARC T1 and the UltraSPARC T2.

A quick refresher here is that it's simple to calculate the utilisation of the processor. These processors have a fixed maximum number of instructions per second, and cpustat can easily determine what proportion of that instruction budget is being utilised. Where it gets interesting is looking at the bottlenecks on the system - such as the memory stalls. On a traditional system memory stall time is all potential performance gain; but on a CMT system one thread's stall is another thread's instruction issue opportunity. Basically, stall will increase the latency of a thread, but reducing stall may not necessarily improve throughput.

This comes down to a few interesting observations:

  • A processor can tolerate a lot of stall cycles before the stall cycles start reducing the throughput of the application.
  • Traditional optimisations, such as a developer eliminating memory stall time, are not necessarily going to be the most productive use of developer time for CMT systems.
  • The factor that limits processor throughput is often instruction count, not stalls. Fortunately we have tools like BIT for getting instruction count data.
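
The utilisation calculation mentioned above can be sketched in a few lines of Python. This is only an illustration: it assumes you already have a total instruction count for the whole chip over some interval (for example, summed from cpustat output), and the defaults are the UltraSPARC T2 figures of 8 cores, 2 instructions per core per cycle, and an assumed 1.4GHz clock. The function name `issue_utilisation` and the sample numbers are hypothetical.

```python
def issue_utilisation(instructions, seconds,
                      cores=8, issue_per_cycle=2, clock_hz=1.4e9):
    """Fraction of the peak instruction budget used over the interval.

    `instructions` is the total count across all virtual processors.
    """
    if seconds <= 0:
        raise ValueError("interval must be positive")
    peak = cores * issue_per_cycle * clock_hz * seconds
    return instructions / peak


if __name__ == "__main__":
    # Made-up sample: 5.6 billion instructions counted over one second.
    used = issue_utilisation(5.6e9, 1.0)
    print(f"instruction budget used: {used:.0%}")
```

A low figure here, even with plenty of stall cycles, suggests the chip still has room to run more threads, which is why stall time alone is a poor guide on a CMT system.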

Tuesday Aug 07, 2007

UltraSPARC T2 documentation available

The documentation for the UltraSPARC T2 is available from the OpenSPARC website.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge