How do you evaluate the performance of the heavily threaded UltraSPARC T2 multicore processor?

Some time ago I had the pleasure of having access to an early engineering system with the UltraSPARC T2 processor. I used this opportunity to run my private PEAS (Performance Evaluation Application Suite) test suite.

It turned out that the throughput benchmarks conducted with this suite revealed some interesting aspects of the architecture. In this blog I'll write about that, but a word of warning is in order: this is rather early work. I see it as a start and plan to gather and analyze more results in the near future.

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single-threaded user programs, or computational kernels derived from real applications. In this sense it reflects quite well what our users typically run.

PEAS has been used in the past to evaluate the performance of upcoming processors. Access to an early system provided a good opportunity to take a closer first look at the performance of the UltraSPARC T2 processor.

PEAS is about throughput. I typically run all 20 jobs simultaneously. I call this a "STREAM", which by the way has no relationship with the STREAM memory bandwidth benchmark. It is just a name I picked quite some time ago.

I can of course run multiple STREAMs simultaneously, cranking up the load. Due to limitations of the system I had access to, I could only run one and two STREAMs, but that still means I was running 20 and 40 jobs simultaneously. The maximum memory usage on the system was around 6 GB per STREAM.

The question is how to interpret the performance results. To this end, I use the single core performances as a reference. Prior to running my throughput tests, I run all jobs sequentially. For each job this gives me a reference timing. With these timings I can then estimate what the elapsed time of a throughput experiment will be.

Let me give a simple example to illustrate this. Let's say I have two programs, A and B. The single core elapsed times for A and B are 100 and 200 seconds respectively. If I run A and B simultaneously on one core, the respective elapsed times are then 200 and 300 seconds. This is because for 200 seconds, both A and B execute on a single core simultaneously and therefore get 50% of the core on average. After these 200 seconds, B still has 100 more seconds to go. Because A has finished, B has 100% of the core to itself and therefore needs an additional 100 seconds to finish.

This idea can easily be extended to multiple cores.
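To make the estimation model concrete, here is a small sketch of how such an estimator could be written. The blog does not show its actual code, so the function name and details are my own; the model is the ideal fair-share scheduling described above: with n jobs left on C cores, each job runs at a rate of min(1, C/n) of a core, and time advances to the next completion.

```python
# Sketch of the throughput-time estimator described in the text.
# Assumes ideal fair sharing: with n jobs left on C cores, each job
# progresses at rate min(1, C/n). Advance time to the next completion
# and repeat until all jobs are done.

def estimate_elapsed(single_core_times, cores):
    remaining = {j: t for j, t in enumerate(single_core_times)}
    finish = {}
    now = 0.0
    while remaining:
        rate = min(1.0, cores / len(remaining))   # share of a core per job
        # Wall-clock time until the job with the least work left completes.
        step = min(remaining.values()) / rate
        now += step
        done = []
        for j in remaining:
            remaining[j] -= step * rate
            if remaining[j] <= 1e-9:
                done.append(j)
        for j in done:
            finish[j] = now
            del remaining[j]
    return [finish[j] for j in range(len(single_core_times))]

# The A/B example from the text: 100 s and 200 s jobs on one core.
print(estimate_elapsed([100, 200], cores=1))   # [200.0, 300.0]
```

Running the two-job example on two cores instead of one returns the single-core reference times, as expected when there is no contention.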

In the past I've used this approach to evaluate the performance of more conventional processor designs. Because the estimates assume ideal hardware and software behavior, the measured times were typically higher than the estimates. This is what one would expect.

When I applied the same methodology to my results on the UltraSPARC T2 processor, however, a big difference showed up: the measured times were actually better than the estimates! This is exactly the opposite of what one might expect.

The explanation is that the hardware threading within each core increases the effective capacity of that core.

The question is how to attach a number to that. In other words, how much computational capacity does an UltraSPARC T2 processor represent?

To answer this question I introduce what I call a "Core Equivalent", or "CE" for short. A CE is a hypothetical core with the capacity of the actual core, but without the additional hardware threading.

On multicore designs without additional hardware threading, a CE then corresponds to the core; there is a one to one mapping between the two.

On threaded designs, like UltraSPARC T2, a CE might be more than one core. The question is how much more.

This leads me to introduce the following metric. Define the Average(CE) metric as follows:

Average(CE) := (1/20) * sum{j = 1 to 20} (measured(j) - estimated(j,CE)) / measured(j)

It is easy to see that this function increases monotonically with the number of CEs: a higher CE count lowers the estimated times, so each term in the sum grows. I then define the "best" CE as the CE for which Average(CE) has the smallest positive value.

The motivation for this metric is that it averages the relative differences between the measured and the estimated elapsed times. Individual terms could of course cancel, but as long as the average is negative, I am underestimating the total throughput capacity of the system.
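The metric itself is a one-liner. Here is a self-contained sketch with made-up timings for three jobs (the real PEAS measurements are not shown in this post), illustrating how Average(CE) flips from negative to positive as the assumed CE count grows:

```python
def average_ce(measured, estimated):
    # Average of the relative differences (measured - estimated) / measured,
    # taken over all jobs in the suite.
    return sum((m - e) / m for m, e in zip(measured, estimated)) / len(measured)

# Hypothetical timings for three jobs. A higher assumed CE count lowers
# the estimates, so Average(CE) increases with CE; the "best" CE is the
# first one for which the average turns positive.
measured = [210.0, 180.0, 300.0]
for ce, estimated in [(2, [230.0, 200.0, 330.0]),
                      (3, [205.0, 175.0, 295.0])]:
    print(ce, average_ce(measured, estimated))
```

In this toy example the average is negative at 2 CEs and slightly positive at 3 CEs, so 3 would be the reported CE value.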

I applied this metric to my PEAS suite and found one STREAM (20 jobs executed simultaneously) to deliver 15 CEs, with Average(15) = 0.56%. For two STREAMs (40 simultaneous jobs), the throughput corresponds to 25 CEs with Average(25) = 1.21%.

In a subsequent blog I'll give more details so you can see for yourself, but these percentages indicate a really good match between the measured and estimated timings.

These numbers clearly reflect the observation that the UltraSPARC T2 delivers more performance than 8 non-threaded cores. It is also interesting that the CE count goes up when the load increases, which suggests an even higher workload might yield a higher CE value.

As mentioned in the introduction, this is early work. In a future blog I'll go into much more detail regarding the above. I also plan to gather more results. In particular I would like to see how the number of CEs changes if I increase the load.

So, stay tuned!


Ruud van der Pas is a Senior Principal Software Engineer in the SPARC Microelectronics organization at Oracle. His focus is on application performance, for both single-threaded and multi-threaded programs. He is also a co-author of the book Using OpenMP.