Sunday Sep 28, 2014

SPARC Processor Documentation

I'm pretty excited, we've now got documentation up for the SPARC processors. Take a look at the SPARC T4 supplement, the SPARC T4 performance instrumentation supplement, the SPARC M5 supplement, or the familiar SPARC 2011 Architecture.

Tuesday Sep 23, 2014

Comparing constant duration profiles

I was putting together my slides for Open World, and in one of them I'm showing profile data from a server-style workload. ie one that keeps running until stopped. In this case the profile can be an arbitrary duration, and it's the work done in that time which is the important metric, not the total amount of time taken.

Profiling for a constant duration is a slightly unusual situation. We normally profile a workload that takes N seconds, do some tuning, and it now takes (N-S) seconds, and we can say that we improved performance by S/N percent. This is represented by the left pair of boxes in the following diagram:

In the diagram you can see that the routine B got optimised and therefore the entire runtime, for completing the same amount of work, reduced by an amount corresponding to the performance improvement for B.

Let's run through the same scenario, but instead of profiling for a constant amount of work, we profile for a constant duration. In the diagram this is represented by the outermost pair of boxes.

Both profiles run for the same total amount of time, but the right hand profile has less time spent in routine B() than the left profile, because the time in B() has reduced more time is spent in A(). This is natural, I've made some part of the code more efficient, I'm observing for the same amount of time, so I must spend more time in the part of the code that I've not optimised.

So what's the performance gain? In this case we're more likely to look at the gain in throughput. It's a safe assumption that the amount of time in A() corresponds to the amount of work done - ie that if we did T units of work, then the average cost per unit work A()/T is the same across the pair of experiments. So if we did T units of work in the first experiment, then in the second experiment we'd do T * A'()/A(). ie the throughput increases by S = A'()/A() where S is the scaling factor. What is interesting about this is that A() represents any measure of time spent in code which was not optimised. So A() could be a single routine or it could be all the routines that are untouched by the optimisation.

Friday Sep 05, 2014

Fun with signal handlers

I recently had a couple of projects where I needed to write some signal handling code. I figured it would be helpful to write up a short article on my experiences.

The article contains two examples. The first is using a timer to write a simple profiler for an application - so you can find out what code is currently being executed. The second is potentially more esoteric - handling illegal instructions. This is probably worth explaining a bit.

When a SPARC processor hits an instruction that it does not understand, it traps. You typically see this if an application has gone off into the weeds and started executing the data segment or something. However, you can use this feature for doing something whenever the processor encounters an illegal instruction. If it's a valid instruction that isn't available on the processor, you could write emulation code. Or you could use it as a kind of break point that you insert into the code. Or you could use it to make up your own instruction set. That bit's left as an exercise for you. The article provides the template of how to do it.

Thursday Sep 04, 2014

C++11 Array and Tuple Containers

This article came out a week or so back. It's a quick overview, from Steve Clamage and myself, of the C++11 tuple and array containers.

When you take a look at the page, I want you to take a look at the "about the authors" section on the right. I've been chatting to various people and we came up with this as a way to make the page more interesting, and also to make the "see also" suggestions more obvious. Let me know if you have any ideas for further improvements.

Monday Aug 25, 2014

My schedule for JavaOne and Oracle Open World

I'm very excited to have got my schedule for Open World and JavaOne:

CON8108: Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware
Venue / Room: Intercontinental - Grand Ballroom C
Date and Time: 10/1/14, 16:45 - 17:30

CON2654: Java Performance: Hardware, Structures, and Algorithms
Venue / Room: Hilton - Imperial Ballroom A
Date and Time: 9/29/14, 17:30 - 18:30

The first talk will be about some of the techniques I use when performance tuning software. We get very involved in looking at how Oracle software works on Oracle hardware. The things we do work for any software, but we have the advantage of good working relationships with the critical teams.

The second talk is with Charlie Hunt, it's a follow on from the talk we gave at JavaOne last year. We got Rock Star awards for that, so the pressure's on a bit for this sequel. Fortunately there's still plenty to talk about when you look at how Java programs interact with the hardware, and how careful choices of data structures and algorithms can have a significant impact on delivered performance.

Anyway, I hope to see a bunch of people there, if you're reading this, please come and introduce yourself. If you don't make it I'm looking forward to putting links to the presentations up.

Friday Aug 15, 2014

Providing feedback on the Solaris Studio 12.4 Beta

Obviously, the point of the Solaris Studio 12.4 Beta programme was for everyone to try out the new version of the compiler and tools, and for us to gather feedback on what was working, what was broken, and what was missing. We've had lots of useful feedback - you can see some of it on the forums. But we're after more.

Hence we have a Solaris Studio 12.4 Beta survey where you can tell us more about your experiences. Your comments are really helpful to us. Thanks.

Friday Jul 11, 2014

Studio 12.4 Beta Refresh, performance counters, and CPI

We've just released the refresh beta for Solaris Studio 12.4 - free download. This release features quite a lot of changes to a number of components. It's worth calling out improvements in the C++11 support and other tools. We've had few comments and posts on the Studio forums, and a bunch of these have resulted in improvements in this refresh.

One of the features that is deserving of greater attention is default hardware counters in the Performance Analyzer.

Default hardware counters

There's a lot of potential hardware counters that you can profile your application on. Some of them are easy to understand, some require a bit more thought, and some are delightfully cryptic (for example, I'm sure that op_stv_wait_sxmiss_ex means something to someone). Consequently most people don't pay them much attention.

On the other hand, some of us get very excited about hardware performance counters, and the information that they can provide. It's good to be able to reveal that we've made some steps along the path of making that information more generally available.

The new feature in the Performance Analyzer is default hardware counters. For most platforms we've selected a set of meaningful performance counters. You get these if you add -h on to the flags passed to collect. For example:

$ collect -h on ./a.out

Using the counters

Typically the counters will gather cycles, instructions, and cache misses - these are relatively easy to understand and often provide very useful information. In particular, given a count of instructions and a count of cycles, it's easy to compute Cycles per Instruction (CPI) or Instructions per Cycle(IPC).

I'm not a great fan of CPI or IPC as absolute measurements - working in the compiler team there are plenty of ways to change these metrics by controlling the I (instructions) when I really care most about the C (cycles). But, the two measurements have a very useful purpose when examining a profile.

A high CPI means lots cycles were spent somewhere, and very few instructions were issued in that time. This means lots of stall, which means that there's some potential for performance gains. So a good rule of thumb for where to focus first is routines that take a lot of time, and have a high CPI.

IPC is useful for a different reason. A processor can issue a maximum number of instructions per cycle. For example, a T4 processor can issue two instructions per cycle. If I see an IPC of 2 for one routine, I know that the code is not stalled, and is limited by instruction count. So when I look at a code with a high IPC I can focus on optimisations that reduce the instruction count.

So both IPC and CPI are meaningful metrics. Reflecting this, the Performance Analyzer will compute the metrics if the hardware counter data is available. Here's an example:

This code was deliberately contrived so that all the routines had ludicrously high CPI. But isn't that cool - I can immediately see what kinds of opportunities might be lurking in the code.

This is not restricted to just the functions view, CPI and/or IPC are presented in every view - so you can look at CPI for each thread, line of source, line of disassembly. Of course, as the counter data gets spread over more "lines" you have less data per line, and consequently more noise. So CPI data at the disassembly level is not likely to be that useful for very short running experiments. But when aggregated, the CPI can often be meaningful even for short experiments.

Wednesday Jun 25, 2014

Guest post on the OTN Garage

Contributed a post on how compilers handle constants to the OTN Garage. The whole OTN blog is worth reading because as well as serving up useful info, Rick has a good irreverent style of writing.

Monday Jun 23, 2014

Presenting at JavaOne and Oracle Open World

Once again I'll be presenting at Oracle Open World, and JavaOne. You can search the full catalogue on the web. The details of my two talks are:

Engineering Insights: Best Practices for Optimizing Oracle Software for Oracle Hardware [CON8108]

Oracle Solaris Studio is an indispensable toolset for optimizing key Oracle software running on Oracle hardware. This presentation steps through a series of case studies from real Oracle applications, illustrating how the various Oracle Solaris Studio development tools have proven instrumental in ensuring that Oracle software is fully tuned and optimized for Oracle hardware. Learn the secrets of how Oracle uses these powerful compilers and performance, memory, and thread analysis tools to write optimal, well-tested enterprise code for Oracle hardware, and hear about best practices you can use to optimize your existing applications for the latest Oracle systems.

Java Performance: Hardware, Structures, and Algorithms [CON2654]

Many developers consider the deployment platform to be a black box that the JVM abstracts away. In reality, this is not the case. The characteristics of the hardware do have a measurable impact on the performance of any Java application. In this session, two Java Rock Star presenters explore how hardware features influence the performance of your application. You will not only learn how to measure this impact but also find out how to improve the performance of your applications by writing hardware-friendly code.

Friday Jun 20, 2014

What's happening

Been isolating a behaviour difference, used a couple of techniques to get traces of process activity. First off tracing bash scripts by explicitly starting them with bash -x. For example here's some tracing of xzless:

$ bash -x xzless
+ xz='xz --format=auto'
+ version='xzless (XZ Utils) 5.0.1'
+ usage='Usage: xzless [OPTION]... [FILE]...

Another favourite tool is truss, which does all kinds of amazing tracing. In this instance all I needed to do was to see what other commands were started using -f to follow forked processes and -t execve to show calls to execve:

$ truss -f -t execve jcontrol
29211:  execve("/usr/bin/bash", 0xFFBFFAB4, 0xFFBFFAC0)  argc = 2

Friday Jun 13, 2014

Enabling large file support

For 32-bit apps the "default" maximum file size is 2GB. This is because the interfaces use the long datatype which is a signed int for 32-bit apps, and a signed long long for 64-bit apps. For many apps this is insufficient. Solaris already has huge numbers of large file aware commands, these are listed under man largefile.

For a developer wanting to support larger files, the obvious solution is to port to 64-bit, however there is also a way to remain with 32-bit apps. This is to compile with large file support.

Large file support provides a new set of interfaces that take 64-bit integers, enabling support of files greater than 2GB in size. In a number of cases these interfaces replace the existing ones, so you don't need to change the source. However, there are some interfaces where the long type is part of the ABI; in these cases there is a new interface to use.

The way to find out what flags to use is through the command getconf LFS_CFLAGS. The getconf command returns environment settings, and in this case we're asking it to provide the C flags needed to compile with large file support. It's useful to take a look at the other information that getconf can provide.

The documentation for compiling with large file support talks about both the flags that are needed, and what functions need to be changed. There are two functions that do not map directly onto large file equivalents because they have a long data type in their prototypes. These two functions are fseek and ftell; calls to these two functions need to be replaced by calls to fseeko and ftello

Wednesday Jun 11, 2014

Article in Oracle Scene magazine

Oracle Scene is the quarterly for the UK Oracle User Group. For the current issue, I've contributed an article on developing with Solaris Studio.

Wednesday Jun 04, 2014

Pretty printing using indent

If you need to pretty-print some code, then the compiler comes with indent for exactly this purpose!

Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI - which is probably as close as we can get to a de-facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information, I actually want to know the names of the raw counters used. Fortunately this is provided in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

     generic_events - generic performance counter events

     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event          Event Mask
     PAPI_ca_shr          l2_ifetch                 0xf
     PAPI_ca_cln          bus_tran_rfo              0x0
     PAPI_ca_itv          bus_tran_inval            0x0
     PAPI_tlb_im          itlb_miss                 0x0
     PAPI_btac_m          btb_misses                0x0
     PAPI_hw_int          hw_int_rx                 0x0

Tuesday May 06, 2014

Introduction to OpenMP

Recently, I had the opportunity to talk with Nawal Copty, our OpenMP representative, about writing parallel applications using OpenMP, and about the new features in the OpenMP 4.0 specification. The video is available on youtube.

The set of recent videos can be accessed either through the Oracle Learning Library, or as a playlist on youtube.

Tuesday Apr 29, 2014

Unsigned integers considered annoying

Let's talk about unsigned integers. These can be tricky because they wrap-around from big to small. Signed integers wrap-around from positive to negative. Let's look at an example. Suppose I want to do something for all iterations of a loop except for the last OFFSET of them. I could write something like:

  if (i < length - OFFSET) {}

If I assume OFFSET is 8 then for length 10, I'll do something for the first 2 iterations. The problem occurs when the length is less than OFFSET. If length is 2, then I'd expect not to do anything for any of the iterations. For a signed integer 2 minus 8 is -6 which is less than i, so I don't do anything. For an unsigned integer 2 minus 8 is 0xFFFFFFFA which is still greater than i. Hence we'll continue to do whatever it is we shouldn't be doing in this instance.

So the obvious fix for this is that for unsigned integers we do:

  if (i + OFFSET < length) {}

This works over the range that we might expect it to work. Of course we have a problem with signed integers if length happens to be close to INT_MAX, at this point adding OFFSET to a large value of i may cause it to overflow and become a large negative number - which will continue to be less than length.

With unsigned ints we encounter this same problem at UINT_MAX where adding OFFSET to i could generate a small value, which is less than the boundary.

So in these cases we might want to write:

  if (i < length - OFFSET) {}


So basically to cover all the situations we might want to write something like:

  if ( (length > OFFSET) && (i < length - OFFSET) ) {}

If this looks rather complex, then it's important to realise that we're handling a range check - and a range has upper and lower bounds. For signed integers zero - OFFSET is representable, so we can write:

  if (i < length - OFFSET) {}

without worrying about wrap-around. However for unsigned integers we need to define both the left and right ends of the range. Hence the more complex expression.

Thursday Apr 17, 2014

What's new in C++11

I always enjoy chatting with Steve Clamage about C++, and I was really pleased to get to interview him about what we should expect from the new 2011 standard.

Wednesday Apr 16, 2014

Lambda expressions in C++11

Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.

Friday Apr 11, 2014

New set and map containers in the C++11 Standard Library

We've just published a short article on the std::unordered_map, std::unordered_set, std::multimap, and std::multiset containers in the C++ Standard Library.

Monday Apr 07, 2014

RAW hazards revisited (again)

I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are situation where a small tweak to the code can avoid the problem.

In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying, the results from this test, compiled with Solaris Studio 12.3, on my system look like:

Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns

However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:

Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns

All three variants of the code produce the same performance - the RAW hazards have been eliminated!

Friday Apr 04, 2014

Interview with Don Kretsch

As well as filming the "to-camera" about the Beta program, I also got the opportunity to talk with my Senior Director Don Kretsch about the next compiler release.

About the Studio 12.4 Beta Programme

Here's a short video where I talk about the Solaris Studio 12.4 Beta programme.

Thursday Apr 03, 2014

Discovering the Code Analyzer

We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either facebook or twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code. Something that you would run before you deliver an application to customers.

SPARC roadmap

A new SPARC roadmap has been published. We have some very cool stuff coming :)

Monday Mar 31, 2014

Socialising Solaris Studio

I just figured that I'd talk about studio's social media presence.

First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.

We also have a presence on Facebook and Twitter.

Moving to the broader Oracle community, these pages list social media presence for a number of products.

Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords, for example here's posts tagged with Solaris.

Thursday Mar 27, 2014

Solaris Studio 12.4 documentation

The preview docs for Solaris Studio 12.4 are now available.

Tuesday Mar 25, 2014

Solaris Studio 12.4 Beta now available

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!

There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

  • C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard library, lambda expressions etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++
  • Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.

There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

Friday Mar 21, 2014

JavaOne award

I was thrilled to get a JavaOne 2013 Rockstar Award for Charlie Hunt's and my talk "Performance tuning where Java meets the hardware".

Getting the award was a surprise and a great honour. It's based on audience feedback, so it's really nice to find out that the audience enjoyed hearing the presentation as much as I enjoyed giving it.

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
Free Download


« March 2015
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming