Tuesday Nov 11, 2008

New SPEC CPU search programme announced

SPEC has announced the search programme for the follow-up to the CPU2006 benchmark suite. They are looking for compute-intensive codes covering a broad range of subject domains: real production applications, not microkernels or artificial benchmarks. Submissions need to include both the code and its workloads, and there are financial rewards for clearing the various selection hurdles.

Friday Jul 18, 2008

SIPEW 2008 conference slides online

A bunch of slides from the SIPEW 2008 conference are now online. An interesting one uses performance counters to look at the correspondence between SPEC CPU2006 and a selection of commercial workloads.

Tuesday May 27, 2008

CPU2006 monitor wrapper for instrumentation

The monitor_wrapper infrastructure in SPEC CPU2006 is one of the best-kept secrets of the suite. It allows you to run the benchmarks under various analysis tools (such as spot). With the X.X release of the kit, the monitor_wrapper infrastructure has finally been documented. An example of using it to run spot is:

monitor_wrapper=/opt/SUNWspro/extra/bin/spot -o${benchmark}.${size} ${command}

This runs each command under spot, and puts the results in a directory named benchmarkname.workloadsize. I also used the monitor_wrapper infrastructure to gather the data necessary for the papers on the correspondence between the training and reference workloads.

Thursday Mar 27, 2008

SPEC CPU2006 discussions

I recently read a couple of posts about SPEC CPU2006. As you can tell from the papers linked on this blog, I was quite busy helping prepare the suite - which was considerable fun. The first post is by Tom Yager, where he praises the suite for raising awareness of the components of a system that actually contribute to performance: "I added a practical angle to my scientific understanding of compiler optimizations, processor scheduling, CPU cache utilization".

On the other hand, Neil Gunther (the second time I've mentioned him) condemns "bogus SPECxx_rate benchmarks which simply run multiple instances of a single-threaded benchmark". I hope he's joking, but taking his comments at face value...

Interestingly, he suggests SPEC SDM as a good choice. I'd not heard of this suite, but reading up on it, it looks like it tests the impact of multiple users typing and executing commands on the system at the same time, and it hasn't been touched in over 10 years. I guess SDM would be a good match for the SunRay that I use daily, but I'm certain that the suite doesn't include the 22 copies of Firefox that I currently see running on the server I'm using. On only slightly less rocky ground he talks about TPC-C, which appeared in 1992!

CPU2006 represents the CPU-intensive portion very well, but deliberately avoids hitting the disk or network. Since disk and network do play significant roles in system performance, I probably wouldn't recommend choosing a machine purely on its SPECcpu2006 or SPECcpu2006_rate scores. However, the mix of applications in the benchmark suite is representative of most of the codes that are out there (and I believe some are codes that appeared less than 10 years ago ;). So whatever application is being run on a system, there is probably a code in CPU2006 that is not that dissimilar.

Tackling his core beef with the suite, that the rate metric "simply" runs multiple copies of the same code: this is actually harder work for the system than running a heterogeneous mix of applications. Running many copies of a code that stresses memory bandwidth hammers the memory system far more than mixing it with other codes that are resident in the L1 cache, so I'd suggest it is a better test of system capability. IMO, far from being "bogus", specrate is a very good indicator of potential system throughput.

Thursday Apr 19, 2007

CPU2006 Working Set Size

The paper on CPU2006 working set size is meant to provide estimates of the realistic memory footprint of an application. The OS can report SIZE (how much memory is reserved for an application) and RSS (how much of an application is resident in memory), but not how much of that memory actually holds data touched during the run of the application.

A way of envisioning this is to consider a program that allocates an array sufficient to handle the largest data set that might be input. In typical runs of the application most of that array will not be used, so looking at the RSS and SIZE metrics provided by the OS will give a very different indication of the memory required than a more careful inspection of what the code touches.
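As a toy model of that gap (the helper, its inputs, and the assumed 4KB page size are purely illustrative, not an OS interface):

```python
PAGE = 4096  # assumed page size in bytes

def footprint(reserved_bytes, touched_offsets):
    """Toy model: SIZE reflects the whole reservation, while the
    active footprint is only the pages the run actually writes."""
    touched_pages = {off // PAGE for off in touched_offsets}
    return reserved_bytes, len(touched_pages) * PAGE

# Reserve 1MB for the worst-case input, but touch just two elements:
print(footprint(1 << 20, [0, 500000]))  # (1048576, 8192)
```

The reservation shows up as 1MB, but only two pages (8KB) hold live data: the measurement problem the paper is addressing.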

The approach used in the paper was to use the Shade emulator to capture the addresses of memory accesses, and then record the use of each cacheline. Two definitions were used. The working set size is the number of cachelines touched in a billion instructions (a processor should be able to execute a billion instructions in a time that can be measured in seconds). The core working set size is the number of cachelines that were touched both in this interval and the preceding one (i.e. the line was reused).
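A minimal sketch of those two definitions, assuming a 64-byte cacheline and taking a trace of (instruction count, address) pairs as input; this illustrates the counting, not the Shade-based tooling itself:

```python
CACHELINE = 64  # assumed line size in bytes

def working_sets(trace, interval=1_000_000_000):
    """For each interval of 'interval' instructions, return a pair
    (working set, core working set), counted in cachelines. The core
    working set keeps only lines also touched in the preceding
    interval, i.e. lines that were reused across intervals."""
    results = []
    prev_lines, cur_lines = set(), set()
    boundary = interval
    for icount, addr in trace:
        while icount >= boundary:      # close out finished intervals
            results.append((len(cur_lines), len(cur_lines & prev_lines)))
            prev_lines, cur_lines = cur_lines, set()
            boundary += interval
        cur_lines.add(addr // CACHELINE)
    if cur_lines:                      # final partial interval
        results.append((len(cur_lines), len(cur_lines & prev_lines)))
    return results

# Tiny trace with a 10-instruction interval: three lines touched in the
# first interval, two of them reused in the second, one new line last.
print(working_sets([(1, 0), (2, 64), (3, 128),
                    (11, 0), (12, 64), (21, 256)], interval=10))
# [(3, 0), (2, 2), (1, 0)]
```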

Again, the focus of this methodology was to provide something not tied to a particular hardware implementation. An alternative would have been to use a cache simulator and identify cache miss rates for a given configuration, but that would have been 'contaminated' by the choice of cache configuration. The second issue with a cache simulator is that cache miss rates typically drop by orders of magnitude as the cache size increases: even very small caches can be very effective. The object of the study was not to find the most effective cache size, but to find out how much memory the benchmarks were actively using.

I found the results, initially, quite surprising. The working set sizes for CPU2000 and CPU2006 are not that dissimilar. The floating point workloads in CPU2006 do have greater working set sizes, as do a few of the integer workloads. However, the memory requirements for CPU2006 are significantly greater: 1GB vs 256MB. So the codes hold a lot more data in memory, but often do not aggressively exercise all of that data.

I recently found this more traditional cache miss rate study for the CPU2006 benchmarks. Unfortunately there seem to be pop-up ads on the site.

Thursday Apr 12, 2007

Training workload quality in CPU2006

The paper on training workload quality in CPU2006 is basically an update of previous work on training workloads in CPU2000.

The training workloads in CPU2006 are pretty good. Surprisingly, as pointed out in the paper, SPEC didn't have to change many of the workloads to make this happen. This supports the hypothesis that generally programs run through the same code paths regardless of the input.

What I like about this work is that it provides a way of assessing training workload quality. This directly addresses one of the concerns of some people about using profile feedback for optimisation: whether the selected training workload is going to be a bad choice for their actual workload.

In terms of the methodology, it's tempting to think that the best test would be to measure performance before and after. That particular approach was rejected because the performance gain is a function of both the quality of the training workload and the ability of the compiler to exploit that workload (together with whether there is anything the compiler can actually do to improve performance even given that knowledge). So a performance gain doesn't necessarily mean a good-quality training workload, and the absence of a gain doesn't necessarily mean a poor-quality one.

The final metrics of code coverage and branch behaviour are about as hardware- and compiler-independent as possible. It should be possible to break down any program on any hardware into very similar blocks, even if the instructions in those blocks end up being different. So the approach seemed particularly well suited to evaluating cross-platform benchmarks.
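As a rough sketch of metrics in this spirit (the function names and inputs are my own invention, assuming per-block execution counts and per-branch taken ratios gathered by a profiler, not the paper's exact tooling):

```python
def coverage_overlap(train_counts, ref_counts):
    """Fraction of basic blocks executed by the reference workload
    that the training workload also exercises."""
    ref_blocks = {b for b, n in ref_counts.items() if n > 0}
    train_blocks = {b for b, n in train_counts.items() if n > 0}
    if not ref_blocks:
        return 1.0
    return len(ref_blocks & train_blocks) / len(ref_blocks)

def branch_agreement(train_taken, ref_taken):
    """Fraction of common branches whose dominant direction
    (taken vs not taken) matches across the two workloads."""
    common = train_taken.keys() & ref_taken.keys()
    if not common:
        return 1.0
    same = sum((train_taken[b] >= 0.5) == (ref_taken[b] >= 0.5)
               for b in common)
    return same / len(common)

# A training run that reaches two of the three blocks the reference
# run uses, and agrees on one of two dominant branch directions:
print(coverage_overlap({'b1': 5, 'b2': 0, 'b3': 2},
                       {'b1': 9, 'b2': 3, 'b3': 1}))  # 0.666...
print(branch_agreement({'br1': 0.9, 'br2': 0.2},
                       {'br1': 0.8, 'br2': 0.6}))     # 0.5
```

A high-quality training workload scores close to 1.0 on both: it visits the same code and biases the same branches the same way as the real run.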

For those interested in using this method, there's a pretty detailed how-to guide on the developer portal.

Wednesday Apr 11, 2007

Papers on CPU2006

The March issue of Computer Architecture News has a section on the SPEC CPU2006 suite. SPEC has conveniently put together a set of the submitted papers. There are 12 papers, of which 8 are at least co-authored by Sun folks.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge