Thursday Apr 03, 2014

Discovering the Code Analyzer

We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either Facebook or Twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code; it's something that you would run before you deliver an application to customers.

Monday Mar 31, 2014

Socialising Solaris Studio

I just figured that I'd talk about Studio's social media presence.

First off, we have our own forums. One for the compilers and one for the tools. This is a good place to post comments and questions; posting here will get our attention.

We also have a presence on Facebook and Twitter.

Moving to the broader Oracle community, these pages list social media presence for a number of products.

Looking at Oracle blogs, the first stop probably has to be the entertaining The OTN Garage. It's also probably useful to browse the blogs by keywords, for example here's posts tagged with Solaris.

Tuesday Aug 27, 2013

My Oracle Open World and JavaOne schedule

I've got my schedule for Oracle Open World and JavaOne:

Note that on Thursday I have about 30 minutes between my two talks, so expect me to rush out of the database talk in order to get to the Java talk.

Thursday Mar 14, 2013

The pains of preprocessing

Ok, so I've encountered this twice in 24 hours. So it's probably worth talking about it.

The preprocessor does a simple text substitution as it works its way through your source files. Sometimes this has "unanticipated" side-effects. When this happens, you'll normally get a "hey, this makes no sense at all" error from the compiler. Here's an example:

$ more c.c
#include <ucontext.h>
#include <stdio.h>

int main()
{
  int FS;
  FS = 0;
  printf("FS=%i",FS);
}

$ CC c.c
"c.c", line 6: Error: Badly formed expression.
"c.c", line 7: Error: The left operand must be an lvalue.
2 Error(s) detected.

A similar thing happens with g++:

$  /pkg/gnu/bin/g++ c.c
c.c: In function 'int main()':
c.c:6:7: error: expected unqualified-id before numeric constant
c.c:7:6: error: lvalue required as left operand of assignment

The Studio C compiler gives a bit more of a clue what is going on. But it's not something you can rely on:

$ cc c.c
"c.c", line 6: syntax error before or at: 1
"c.c", line 7: left operand must be modifiable lvalue: op "="

As you can guess the issue is that FS gets substituted. We can find out what happens by examining the preprocessed source:

$ CC -P c.c
$ tail c.i
int main ( )
{
int 1 ;
1 = 0 ;
printf ( "FS=%i" , 1 ) ;
}

You can confirm this using -xdumpmacros to dump out the macros as they are defined. You can combine this with -H to see which header files are included:

$ CC -xdumpmacros c.c 2>&1 |grep FS
#define _SYS_ISA_DEFS_H
#define _FILE_OFFSET_BITS 32
#define REG_FSBASE 26
#define REG_FS 22
#define FS 1

If you're using gcc you should use the -E option to get preprocessed source, and the -dD option to get definitions of macros and the include files.

Thursday Mar 22, 2012

Tech day at the Santa Clara campus

The next OTN Sys Admin day is at the Santa Clara campus on 10th April. It has a half hour talk on Studio. More information at the OTN Garage.

Friday Feb 03, 2012

Using prtpicl to get cache sizes

If you are on a SPARC system you can get cache size information using the command fpversion, which is provided with Studio:

$ fpversion
 A SPARC-based CPU is available.
 Kernel says main memory's clock rate is 1012.0 MHz.

 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.

 Use "-xtarget=sparc64vii -xcache=64/64/2:5120/256/10" code-generation option.

The cache parameters are output exactly as you would want to pass them into the compiler - for each cache it describes the size in KB, the line size in bytes, and the associativity.

fpversion doesn't exist on x86 systems. The next best thing is to use prtpicl to output system configuration information, and inspect that output for cache size. Here's the cache output for the same SPARC system using prtpicl.

$ prtpicl -v |grep cache
              :l1-icache-size    0x10000
              :l1-icache-line-size       0x40
              :l1-icache-associativity   0x2
              :l1-dcache-size    0x10000
              :l1-dcache-line-size       0x40
              :l1-dcache-associativity   0x2
              :l2-cache-size     0x500000
              :l2-cache-line-size        0x100
              :l2-cache-associativity    0xa

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that I didn't mention in the article, it's implied but not stated, is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing

And here's some code that calls it:

void nothing();

int main()
{
  nothing();
}

The code works when compiled as C, but not as C++:

$ cc i.c
$ ./a.out
$ CC i.c
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C" {
#endif
    void nothing();
#ifdef __cplusplus
  }
#endif

int main()
{
  nothing();
}

Wednesday Dec 14, 2011

Oracle Solaris Studio 12.3

Oracle Solaris Studio 12.3 was released today. You can download it here.

There's a bundle of exciting stuff that goes into every new release. The headlines are probably the introduction of the Code Analyzer tool, which does dynamic and static error reporting on an application, and the ability of the IDE to be run on a remote system while the builds are done on the host.

I have a couple of other favourite areas of change. First of all we've got spot running on a bunch of recent processors - in particular the SPARC T4 (I'll write more about this later). Secondly, the filtering in the Performance Analyzer has been pushed to the foreground. Let's discuss filtering now.

Filtering is one of those technologies that is very powerful, but has been quite hard to use in previous releases. The change in this release has been that the filters have been placed on the right-click menu. Here's an example:

Adding and removing filters is now just a matter of right clicking. This allows you to rapidly drill down on the profile data. For example filtering out activity by processor, call stack, and so on.

Wednesday Nov 02, 2011

Welcome to the (System) Developer's Edge

The Developer's Edge went out of print a while back. This was obviously frustrating, not just for me, but for the folks who contacted me asking what happened. Well, I'm thrilled to be able to announce that it's available as a pdf download.

This is essentially the same book as was previously available. I've not updated the links back to the original articles; that would have been problematic, as in some instances the original articles no longer exist. There are only two significant changes. The first is that the branding has been changed (there's no cover art, which keeps the download small). The second is that the title of the book has been modified to include the word "system" to indicate that it's focused towards the hardware end of the stack.

I hope you enjoy the System Developer's Edge.

Friday Oct 21, 2011


SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
  {
    pcout[i] = pcin[sizeof(T)-1-i];  // last input byte becomes first output byte
  }
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8; out|=(in&255); in>>=8;
  }
  return out;
}

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them. In which case you will only see a problem on some particular hardware. Which means that your application will run well on one machine, but poorly on another.

Tuesday Aug 09, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often it's mmap, but sometimes it's something else. Every time I think that I should write something up. Well, it's finally happened: a short paper on how to write portable code using the standard headers.

Monday Aug 01, 2011

Oracle Solaris Studio 12.3 Beta Programme

Last week, we started the beta programme for Oracle Solaris Studio 12.3. You can participate by downloading the software and reporting any issues.

As with any release, there's a lot of incremental improvements wherever we find opportunities, and there's a couple of new features. The two most interesting new features are:

  • The Code Analyzer, which reports possible errors in your application, both dynamic (i.e. memory access errors) and static. The static error detection is the newest feature; it goes beyond the compile-time warnings or lint messages, and does much more detailed compile-time analysis of your code.
  • Remote development on Windows. I'm yet to try out this feature, but the IDE has the ability to run remotely on a Windows box seamlessly compiling and running on a remote server. In fact the improvements in the IDE are well worth a look.

Some of the Studio team are giving a webcast on Thursday 4th August at 9am PDT.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
{
  double d=0.0;
  for (long long i=0;i<10000000000LL; i++) {d+=d;}
}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         <Total>

To start gathering data we send SIGUSR1 to the application; sending the signal again stops data collection. By sending the signal twice, two seconds apart, we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      <Total>
2.001     2.001      main
0.        2.001      _start

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory, and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find . -name 'log.xml'`|grep process | sed 's/\:.*//' > myapp_profiles

Once we have a list of every time my application was run, I can extract the times from that list and sort them using the following line:

$ grep exit `cat myapp_profiles`|sed 's/.*tstamp=\"//'|sed 's/\".*//'|sort -n

Once I have the list of times, I can then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat myapp_profiles`|grep 75.9

Monday Apr 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt level (function). This needs to be inserted into the code before the start of the function definition, but after the function prototype.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet. This contains two identical functions, both return the square of a global variable. However, we are using #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
{
  return d*d;
}

int g()
{
  return d*d;
}

The code is compiled with the flag -xmaxopt=5, this specifies the maximum optimisation level that can be applied to any functions in the file.

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data.

/* 000000          0 */         sethi   %hi(d),%o5

!   10                !  return d*d;

/* 0x0004         10 */         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/* 0x0008            */         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o4,%o3,%o0

/* 000000         14 */         sethi   %hi(d),%o5
/* 0x0004            */         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d*d;

/* 0x0008         15 */         sra     %o4,0,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o3,%o3,%o0

Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ export SP_COLLECTOR_SKIP_CHECKEXEC=1
bash-3.00$ collect which
Creating experiment database ...

Thursday Jan 27, 2011

Don't initialise local strings

Consider the following code:

#include <stdio.h>

void s(int i)
{
  char string[2048]="";
  sprintf(string,"Value = %i",i);
  printf("String = %s\n",string);
}

The C standards require that if any elements of the character array string are initialised, then all of them should be. We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
        .type   s, #function
        .proc   020
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov    2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().

Wednesday Dec 01, 2010


So the spot user's guide has been added to the Solaris Studio 12.2 documentation. There's also another collection of older articles.

Sunday Oct 03, 2010

Memory ordering

Just had a couple of white papers published on memory ordering. This is a topic which is quite hard to find documentation on, and also quite complex. Fortunately, it's also rarely encountered.

In Oracle Solaris Studio 12.2 we introduced the file mbarrier.h. This defines some intrinsics which allow the developer to enforce memory ordering.

The first paper covers avoiding the reordering of memory operations that the compiler may perform when compiling an application. The second paper covers the more complex issue of avoiding the reordering of memory operations that the processor may do at runtime.

Thursday Sep 09, 2010

Book update

I've just handed over the final set of edits to the manuscript. These are edits to the laid-out pages. Some are those last few grammatical errors you only catch after reading a sentence twenty times. Others are tweaks to the figures. There's still a fair amount of production work to do, but my final input will be a review of the indexing - probably next week.

So it's probably a good time to talk about the cover. This is a picture that my wife took last year. It's a picture of the globe at the cliff tops at Durlston Head near Swanage in England. It's 40 tonnes and over 100 years old. It's also surrounded by stone tablets some containing contemporary educational information, and a couple of blank ones that are there just so people can draw on them.

Wednesday Sep 08, 2010

Oracle Solaris Studio 12.2

It's been just over a year since the release of Studio 12 Update 1. Today we are releasing the first Oracle-branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The graph in the chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC, SPECint, reg tm of Standard Performance Evaluation Corporation. Results from as of 6 September 2010 and this report.

Parallelisation white paper

An interesting white paper on various approaches to writing parallel programs has just been released. Covers OpenMP, Threading Building Blocks, MPI, and a bunch of others.

Monday Jul 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application. Sometimes as quite a time consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.

#include <stdlib.h>
#include <pthread.h>

void *work( void* param )
{
  while ( 1 ) { free( malloc(100) ); }
}

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
  {
    free ( malloc (100) );
  }
}

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
14.210     do_exit_critical
14.210    *take_deferred_signal

do_exit_critical() doesn't tell us anything new; we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       *do_exit_critical
14.210     take_deferred_signal

This is more useful: we now know that the time is spent in mutex locks, but we don't know who is using those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
17.062     mutex_lock_impl
 2.872    *mutex_trylock_adaptive
14.190     do_exit_critical

So we're still in the mutex lock code; we need to find who is calling the mutex locks:

Attr.      Name
User CPU
11.938     free
 9.327     malloc
 4.203    *mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Friday Jul 09, 2010

White paper on using Oracle Solaris Studio

I contributed a fair amount of material to a recent white paper about Oracle Solaris Studio. The paper is available for download from the developer portal.

Optimizing Applications with Oracle Solaris Studio Compilers and Tools

Oracle Solaris Studio delivers a fully integrated development platform for generating robust high-performance applications for the latest Oracle Sun systems (SPARC and x86). In order to take full advantage of the latest multicore systems, applications must be compiled for optimal performance and tuned to exploit the capabilities of the hardware. Learn how Oracle Solaris Studio helps you generate the highest performance applications for your target platform, from selecting the right compiler flags and optimization techniques to simplifying development with advanced multicore tools.

Tuesday Jun 08, 2010

Calltrees in analyzer

The Performance Analyzer has also had a number of new features and improvements. The most obvious one of these is the new call tree tab. This allows you to drill down into the call tree for an application and see exactly how the time is divided between the various call stacks.

Monday Jun 07, 2010

Checking for memory access errors with discover

The latest Solaris Studio Express release contains the tool discover, which tests for memory access errors. These are errors like reading past the end of an array or freeing a pointer twice. The best part of the tool is that it does not require a special build of the application. The sequence is:

$ discover a.out
$ a.out

The discover command adds instrumentation to the executable, and you then run the resulting binary in the same way that you would normally run your program. The output from discover is an html file containing details of any memory access errors that the tool discovered.

Sunday Jun 06, 2010

Solaris Studio Express

The latest Solaris Studio Express release is out, there's also a feedback programme for submitting bugs and posting questions.

One of the first things I did with it was to launch the solstudio IDE. It has the expected functionality. Code completion, and hints on the parameters that are expected by a function:

There's also integrated debugging:

I'll add a couple more posts over the next few days showing some other features.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
