Wednesday Jun 04, 2014

Pretty printing using indent

If you need to pretty-print some code, then the compiler comes with indent for exactly this purpose!

Friday May 23, 2014

Generic hardware counter events

A while back, Solaris introduced support for PAPI - which is probably as close as we can get to a de facto standard for performance counter naming. For performance counter geeks like me, this is not quite enough information: I actually want to know the names of the raw counters used. Fortunately this is provided in the generic_events man page:

$ man generic_events
Reformatting page.  Please Wait... done

CPU Performance Counters Library Functions   generic_events(3CPC)

     generic_events - generic performance counter events

     The Solaris  cpc(3CPC)  subsystem  implements  a  number  of
     predefined, generic performance counter events. Each generic
     ...

   Intel Pentium Pro/II/III Processor
       Generic Event          Platform Event            Event Mask
     PAPI_ca_shr            l2_ifetch                 0xf
     PAPI_ca_cln            bus_tran_rfo              0x0
     PAPI_ca_itv            bus_tran_inval            0x0
     PAPI_tlb_im            itlb_miss                 0x0
     PAPI_btac_m            btb_misses                0x0
     PAPI_hw_int            hw_int_rx                 0x0

Tuesday May 06, 2014

Introduction to OpenMP

Recently, I had the opportunity to talk with Nawal Copty, our OpenMP representative, about writing parallel applications using OpenMP, and about the new features in the OpenMP 4.0 specification. The video is available on youtube.

The set of recent videos can be accessed either through the Oracle Learning Library, or as a playlist on youtube.

Tuesday Apr 29, 2014

Unsigned integers considered annoying

Let's talk about unsigned integers. These can be tricky because they wrap around from big to small; signed integers wrap around from positive to negative. Let's look at an example. Suppose I want to do something for all iterations of a loop except for the last OFFSET of them. I could write something like:

  if (i < length - OFFSET) {}

If I assume OFFSET is 8 then for length 10, I'll do something for the first 2 iterations. The problem occurs when the length is less than OFFSET. If length is 2, then I'd expect not to do anything for any of the iterations. For a signed integer 2 minus 8 is -6 which is less than i, so I don't do anything. For an unsigned integer 2 minus 8 is 0xFFFFFFFA which is still greater than i. Hence we'll continue to do whatever it is we shouldn't be doing in this instance.

So the obvious fix for this is that for unsigned integers we do:

  if (i + OFFSET < length) {}

This works over the range that we might expect it to work. Of course, we have a problem with signed integers if length happens to be close to INT_MAX: at this point adding OFFSET to a large value of i may cause it to overflow and become a large negative number - which will continue to be less than length.

With unsigned ints we encounter this same problem at UINT_MAX where adding OFFSET to i could generate a small value, which is less than the boundary.

So in these cases we might want to write:

  if (i < length - OFFSET) {}


So basically to cover all the situations we might want to write something like:

  if ( (length > OFFSET) && (i < length - OFFSET) ) {}

If this looks rather complex, then it's important to realise that we're handling a range check - and a range has both upper and lower bounds. For signed integers, zero minus OFFSET is still representable, so we can write:

  if (i < length - OFFSET) {}

without worrying about wrap-around. However for unsigned integers we need to define both the left and right ends of the range. Hence the more complex expression.

Thursday Apr 17, 2014

What's new in C++11

I always enjoy chatting with Steve Clamage about C++, and I was really pleased to get to interview him about what we should expect from the new 2011 standard.

Wednesday Apr 16, 2014

Lambda expressions in C++11

Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.

Friday Apr 04, 2014

Interview with Don Kretsch

As well as filming the "to-camera" about the Beta program, I also got the opportunity to talk with my Senior Director Don Kretsch about the next compiler release.

About the Studio 12.4 Beta Programme

Here's a short video where I talk about the Solaris Studio 12.4 Beta programme.

Thursday Apr 03, 2014

Discovering the Code Analyzer

We're doing something different with the Studio 12.4 Beta programme: as well as the software, we're putting together some material about the compiler and its features - videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either facebook or twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code - something that you would run before you deliver an application to customers.

Monday Mar 31, 2014

Socialising Solaris Studio

I just figured that I'd talk about Studio's social media presence.

First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.

We also have a presence on Facebook and Twitter.

Moving to the broader Oracle community, these pages list social media presence for a number of products.

Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords; for example, here are the posts tagged with Solaris.

Thursday Mar 27, 2014

Solaris Studio 12.4 documentation

The preview docs for Solaris Studio 12.4 are now available.

Tuesday Mar 25, 2014

Solaris Studio 12.4 Beta now available

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!

There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

  • C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard library, lambda expressions etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++
  • Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.

There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Wednesday Sep 11, 2013

Presenting at UK Oracle User Group meeting

I'm very excited to have been invited to present at the UK Oracle User Group conference in Manchester, UK on 1-4 December.

Currently I'm down for two presentations:

As you might expect, I'm very excited to be over there, I've not visited Manchester in about 20 years!

Thursday Aug 29, 2013

Timezone troubles when dual booting

I have a laptop that dual boots Solaris and Windows XP. When I switched between the two OSes I would have to reset the clock because the time would be eight hours out. This had been nagging at me for a while, so I dug into what was going on.

It seems that Windows assumes that the Real-Time Clock (RTC) in the BIOS is using local time. So it will read the clock and display whatever time is shown there.

Solaris, on the other hand, assumes that the clock is in Coordinated Universal Time (UTC), so you have to apply a time zone transformation in order to get to the local time.

Obviously, if you adjust the clock to make one correct, then the other is wrong.

To me, it seems more natural to have the clock in a laptop work on UTC - because when you travel the local time changes. There is a registry setting in Windows that, when set to 1, tells the machine to use UTC:


However, it has some problems and is potentially not robust over sleep. So we have to work the other way, and get Solaris to use local time. Fortunately, this is a relatively simple change running the following as root (pick the appropriate timezone for your location):

rtc -z US/Pacific

Tuesday Aug 27, 2013

My Oracle Open World and JavaOne schedule

I've got my schedule for Oracle Open World and JavaOne:

Note that on Thursday I have about 30 minutes between my two talks, so expect me to rush out of the database talk in order to get to the Java talk.

Monday Jul 08, 2013

JavaOne and Oracle Open World 2013

I'll be up at both JavaOne and Oracle Open World presenting. I have a total of three presentations:

  • Mixed-Language Development: Leveraging Native Code from Java
  • Performance Tuning Where Java Meets the Hardware
  • Developing Efficient Database Applications with C and C++

I'm excited by these opportunities - particularly working with Charlie Hunt diving into Java Performance.

Thursday May 30, 2013

Binding to the current processor

Just hacked up a snippet of code to stop a thread migrating to a different CPU while it runs. This should help the thread get, and keep, local memory. This in turn should reduce run-to-run variance.

#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>

void bindnow()
{
  processorid_t proc = getcpuid();
  if (processor_bind(P_LWPID, P_MYID, proc, 0))
    { printf("Warning: Binding failed\n"); }
  else
    { printf("Bound to CPU %i\n", proc); }
}

Tuesday May 28, 2013

One executable, many platforms

Different processors have different optimal sequences of code. Fortunately, most of the time the differences are minor, and we can easily accommodate them by generating generic code.

If you needed more than this, then the "old" model was to use dynamic string tokens to pick the best library for the platform. This works well, and was the mechanism traditionally used. However, the downside is that you now need to ship a bundle of libraries with the application; this can get (and look) a bit messy.

There's a "new" approach that uses a family of capability functions. The idea here is that multiple versions of the routine are linked into the executable, and the runtime linker picks the best for the platform that the application is running on. The routines are denoted with a suffix, after a percentage sign, indicating the platform. For example here's the family of memcpy() implementations in libc:

$ elfdump -H /usr/lib/ 2>&1 |grep memcpy
      [10]  0x0010627c 0x00001280  FUNC LOCL  D    0 .text          memcpy%sun4u
      [11]  0x001094d0 0x00000b8c  FUNC LOCL  D    0 .text          memcpy%sun4u-opl
      [12]  0x0010a448 0x000005f0  FUNC LOCL  D    0 .text          memcpy%sun4v-hwcap1

It takes a bit of effort to produce a family of implementations. Imagine we want to print something different when an application is run on a sun4v machine. First of all we'll have a bit of code that prints out the compile-time defined string that indicates the platform we're running on:

#include <stdio.h>

static char name[] = PLATFORM;

void platform()
{
  printf("Running on %s\n", name);
}

To compile this code we need to provide the definition for PLATFORM - suitably escaped. We will need to provide two versions, a generic version that can always run, and a platform specific version that runs on sun4v platforms:

$ cc -c -o generic.o p.c -DPLATFORM=\"Generic\"
$ cc -c -o sun4v.o   p.c -DPLATFORM=\"sun4v\"

Now we have a specialised version of the routine platform() but it has the same name as the generic version, so we cannot link the two into the same executable. So what we need to do is to tag it as being the version we want to run on sun4v platforms.

This is a two-step process. The first step is to tag the object file as being a sun4v object file. This step is only necessary if the compiler has not already tagged the object file. The compiler will tag the object file appropriately if it uses instructions from a particular architecture - for example, if you compiled explicitly targeting T4 using -xtarget=t4. However, if you need to tag the object file, then you can use a mapfile to add the appropriate hardware capabilities:

$mapfile_version 2
CAPABILITY sun4v {
        MACHINE = sun4v;
};
We can then ask the linker to apply these hardware capabilities from the mapfile to the object file:

$ ld -r -o sun4v_1.o -Mmapfile.sun4v sun4v.o

You can see that the capabilities have been applied using elfdump:

$ elfdump -H sun4v_1.o

Capabilities Section:  .SUNW_cap

 Object Capabilities:
     index  tag               value
       [0]  CA_SUNW_ID       sun4v
       [1]  CA_SUNW_MACH     sun4v

The second step is to take these capabilities and apply them to the functions. We do this using the linker option -z symbolcap:

$ ld -r -o sun4v_2.o -z symbolcap sun4v_1.o

You can now see that the platform function has been tagged as being for sun4v hardware:

$ elfdump -H sun4v_2.o

Capabilities Section:  .SUNW_cap

 Symbol Capabilities:
     index  tag               value
       [1]  CA_SUNW_ID       sun4v
       [2]  CA_SUNW_MACH     sun4v

     index    value      size      type bind oth ver shndx          name
      [24]  0x00000010 0x00000070  FUNC LOCL  D    0 .text          platform%sun4v

And finally you can combine the object files into a single executable. The main() routine of the executable calls platform() which will print out a different message depending on the platform. Here's the source to main():

extern void platform();

int main()
{
  platform();
  return 0;
}

Here's what happens when the program is compiled and run on a non-sun4v platform:

$ cc -o main -O main.c sun4v_2.o generic.o
$ ./main
Running on Generic

Here's the same executable running on a sun4v platform:

$ ./main
Running on sun4v

Wednesday Dec 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives the best-performing binary, but with a few caveats: it assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • The SPARC64 processors, T3, and T4 implement floating point multiply-accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Tuesday Dec 04, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++) { write(file, "a", 1); } /* one byte per write */
  close(file);
  endtime(SIZE);
}

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison, since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer measure is the number of I/O operations per second, which is about 65 - pretty much what I'd expect for this system.

It's also interesting to examine the profiles for the two cases. When the write() was trapping into the OS, the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication of how to interpret profiles from apps doing I/O: it's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, whereas write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time = 1.0 * gethrtime();
}

void endtime(long its)
{
  double e_time = 1.0 * gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
}

#define SIZE 10*1024*1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++) { write(file, "a", 1); }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++) { fprintf(file, "a"); }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++) { fprintf(file, "a"); fflush(file); }
  fclose(file);
  endtime(SIZE);
}

int main()
{
  test_write();
  test_fprintf();
  test_flush();
  return 0;
}

Compiling and running, I get 0.2MB/s for write() and 6MB/s for fprintf() - a large difference. There are three tests in this example; the third test uses fprintf() followed by fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffered I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.

Thursday Nov 01, 2012

Rick Hetherington on the T5

There's an interview with Rick Hetherington about the new T5 processor. Well worth a quick read.

Thursday May 17, 2012

Solaris Developer talk next week

Vijay Tatkar will be talking about developing on Solaris next week Tuesday at 9am PST.

Monday Apr 02, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
    int mytemplate(int);

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions. We know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that, it may be able to produce even better code. We can do this by adding the "throw()" exception specification to the template declaration:

  extern "C"
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c = 0;

class myclass
{
  public:
    int routine();
};

int myclass::routine()
{
  try
  {
    for (a = 0; a < 1000; a++) { c = lzd(c); }
  }
  catch (...)
  {
    std::cout << "Something happened" << std::endl;
  }
  return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler only produces the better code when given both the "throw()" specification and the no_side_effect pragma.

Friday Mar 30, 2012

Inline template efficiency

I like inline templates, and use them quite extensively. Whenever I write code with them I'm always careful to check the disassembly to see that the resulting output is efficient. Here's a potential cause of inefficiency.

Suppose we want to use the mis-named Leading Zero Detect (LZD) instruction on T4 (this instruction does a count of the number of leading zero bits in an integer register - so it should really be called leading zero count). So we put together an inline template looking like:

.inline lzd
  lzd %o0,%o0
.end

And we throw together some code that uses it:

int lzd(int);

int a;
int c=0;

int main()
{
  for (a=0; a<1000; a++)
  {
    c = lzd(c);
  }
  return 0;
}

We compile the code with some amount of optimisation, and look at the resulting code:

$ cc -O -xtarget=T4 -S lzd.c
$ more lzd.s
/* 0x001c         11 */         lzd     %o0,%o0
/* 0x0020          9 */         ld      [%i1],%i3
/* 0x0024         11 */         st      %o0,[%i2]
/* 0x0028          9 */         add     %i3,1,%i0
/* 0x002c            */         cmp     %i0,999
/* 0x0030            */         ble,pt  %icc,.L77000018
/* 0x0034            */         st      %i0,[%i1]

What is surprising is that we're seeing a number of loads and stores in the code. Everything could be held in registers, so why is this happening?

The problem is that the code is only inlined at the code generation stage - when the actual instructions are generated. Earlier compiler phases see a function call. The called functions can do all kinds of nastiness to global variables (like 'a' in this code) so we need to load them from memory after the function call, and store them to memory before the function call.

Fortunately we can use a #pragma directive to tell the compiler that the routine lzd() has no side effects - meaning that it does not read or write to memory. The directive to do that is #pragma no_side_effect(<routine name>), and it needs to be placed after the declaration of the function. The new code looks like:

int lzd(int);
#pragma no_side_effect(lzd)

int a;
int c=0;

int main()
{
  for (a=0; a<1000; a++)
  {
    c = lzd(c);
  }
  return 0;
}

Now the loop looks much neater:

/* 0x0014         10 */         add     %i1,1,%i1

!   11                !  {
!   12                !    c=lzd(c);

/* 0x0018         12 */         lzd     %o0,%o0
/* 0x001c         10 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000018
/* 0x0024            */         nop

Friday Mar 23, 2012

Malloc performance

My co-conspirator Rick Weisner has written up a nice summary of malloc performance, including evaluating the mtmalloc changes that we got into S10 U10.

Webinar for developers - 27th March

There's a webinar for developers scheduled for next week. It looks like an interesting agenda.

Thursday Mar 22, 2012

Tech day at the Santa Clara campus

The next OTN Sys Admin day is at the Santa Clara campus on 10th April. It has a half hour talk on Studio. More information at the OTN Garage.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge