I've previously written up a short entry on using the UltraSPARC T1 performance counters to determine what the processor is doing and where effort might be spent in improving performance. I've just completed a follow-up article for the developer portal that discusses this concept in more depth, and covers both the UltraSPARC T1 and the UltraSPARC T2.
As a quick refresher, it's simple to calculate the utilisation of the processor. The processor has a fixed maximum number of instructions it can issue per second, and cpustat can easily determine what proportion of that instruction budget is being used. Where it gets interesting is looking at the bottlenecks on the system - such as memory stalls. On a traditional system, memory stall time is all potential performance gain; but on a CMT system one thread's stall is another thread's instruction issue opportunity. Basically, stalls will increase the latency of a thread, but reducing stall time may not necessarily improve throughput.
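As a rough illustration of the utilisation calculation, here is a minimal Python sketch. It assumes the per-thread instruction counts have already been collected (for example from cpustat output) for the hardware threads that share one core, and that the core is single-issue, so its budget is one instruction per cycle. The function name, the clock speed, and the counts are all made up for the example; `issue_width` is a parameter so the assumption can be changed for a core that issues more than one instruction per cycle.

```python
def core_utilisation(instr_counts, clock_hz, interval_s, issue_width=1):
    """Fraction of one core's instruction budget used in a sample interval.

    instr_counts: instructions executed by each hardware thread on the core
    clock_hz:     core clock frequency in Hz (assumed, not measured here)
    interval_s:   length of the sample interval in seconds
    issue_width:  instructions the core can issue per cycle (assumption)
    """
    if clock_hz <= 0 or interval_s <= 0 or issue_width <= 0:
        raise ValueError("clock_hz, interval_s and issue_width must be positive")
    # The budget is the most instructions the core could have issued.
    budget = clock_hz * interval_s * issue_width
    return sum(instr_counts) / budget


# Hypothetical one-second sample for four threads on a 1.2GHz core.
counts = [150_000_000, 120_000_000, 90_000_000, 60_000_000]
util = core_utilisation(counts, clock_hz=1_200_000_000, interval_s=1.0)
print(f"Core utilisation: {util:.0%}")  # prints "Core utilisation: 35%"
```

The counts are summed per core rather than per thread because the threads on a core share the same issue slots; a core at 35% still has room for more work even if one of its threads is spending most of its time stalled.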
This comes down to a few interesting observations:
- A processor can tolerate a lot of stall cycles before the stall cycles start reducing the throughput of the application.
- Traditional optimisations, such as eliminating memory stall time, are not necessarily the most productive use of developer time on CMT systems.
- The factor that limits processor throughput is often instruction count, not stalls. Fortunately we have tools like BIT for getting instruction count data.
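These observations can be made concrete with a deliberately simple model, sketched below in Python. It is not a description of how the hardware schedules threads: it assumes identical threads, a core that issues at most `issue_width` instructions per cycle, and that each thread needs one issue cycle plus a fixed average number of stall cycles per instruction, with stalls on different threads overlapping perfectly. The function name and the numbers are hypothetical.

```python
def core_ipc(threads, stall_cycles_per_instr, issue_width=1):
    """Idealised instructions per cycle for one core.

    Each thread alone could sustain 1 / (1 + stall) instructions per cycle.
    The core total is the sum over threads, capped by the issue width.
    """
    if threads < 0 or stall_cycles_per_instr < 0 or issue_width <= 0:
        raise ValueError("invalid model parameters")
    per_thread_ipc = 1.0 / (1.0 + stall_cycles_per_instr)
    return min(float(issue_width), threads * per_thread_ipc)


# Four threads sharing a single-issue core, with increasing stall per instruction.
for stall in (0, 1, 3, 7):
    print(f"stall={stall} cycles/instr -> core IPC {core_ipc(4, stall):.2f}")
```

In this model, four threads keep the core fully busy until each one stalls for more than three cycles per instruction. Below that point, removing stall cycles shortens the latency of an individual thread but leaves core throughput unchanged, and the only way to get more work done per second is to execute fewer instructions for that work.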