Friday Apr 04, 2014

Interview with Don Kretsch

As well as filming the "to-camera" about the Beta program, I also got the opportunity to talk with my Senior Director Don Kretsch about the next compiler release.

Tuesday Mar 25, 2014

Solaris Studio 12.4 Beta now available

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!

There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

  • C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard library, lambda expressions etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++
  • Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.

There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

Thursday Jan 12, 2012

Please mind the gap

I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

Timeline view

One of my colleagues pointed out that it is possible to figure out what is causing the gaps. The call stack is indicated by the event after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.

#include <stdio.h>

void write_file()
  char block[8192];
  FILE * file = fopen("./text.txt", "w");
  for (int i=0;i<1024; i++)
    fwrite(block, sizeof(block), 1, file);

void read_file()
  char block[8192];
  FILE * file = fopen("./text.txt", "rw");
  for (int i=0;i<1024; i++)
    fwrite(block, sizeof(block), 1, file);

int main()
  for (int i=0; i<100; i++)

This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we determine that the gaps are caused by the application either opening or closing the file.


But that is not all that is going on, if we look at the information shown in the Timeline details panel for the Duration of the event we can see that it spent 210ms in the "Other Wait" micro state. So we've now got a pretty clear idea of where the time is coming from.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
  double d=0.0;
  for (int i=0;i<10000000000; i++) {d+=d;}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application, sending the signal again stops data collection. Sending the signal twice we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory ( and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find -name 'log.xml'`|grep process | sed 's/\\:.\*//' > myapp_profiles

Once we have a list of every time my application was run, I can now extract the times from that list, and sort them using the following line:

$ grep exit `cat <myapp_files`|sed 's/.\*tstamp=\\"//'|sed 's/\\".\*//'|sort -n 

Once I have the list of times I can use then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat <myapp_files`|grep 75.9 

Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ collect which
Creating experiment database ...

Friday Jul 09, 2010

White paper on using Oracle Solaris Studio

I contributed a fair amount of material to a recent white paper about Oracle Solaris Studio. The paper is available for download from the developer portal.

Optimizing Applications with Oracle Solaris Studio Compilers and Tools

Oracle Solaris Studio delivers a fully integrated development platform for generating robust high-performance applications for the latest Oracle Sun systems (SPARC and x86). In order to take full advantage of the latest multicore systems, applications must be compiled for optimal performance and tuned to exploit the capabilities of the hardware. Learn how Oracle Solaris Studio helps you generate the highest performance applications for your target platform, from selecting the right compiler flags and optimization techniques to simplifying development with advanced multicore tools.

Tuesday Jun 08, 2010

Calltrees in analyzer

The Performance Analyzer has also had a number of new features and improvements. The most obvious one of these is the new call tree tab. This allows you to drill down into the call tree for an application and see exactly how the time is divided between the various call stacks.

Wednesday Nov 25, 2009

Viewing thread activity in the Performance Analyzer

The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.

The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:

This results in a display of the samples in each routine, by default this can look very colourful. You can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, and blue for another, then used a yellow to colour all the time waiting for more work to be assigned:

The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab, it's also useful to select the 'cpus' tab at the same time.

The threads tab shows the activity of each thread for the currently displayed metrics. This is useful to see if one thread is doing more work than another. The cpus tab shows time that the app spends on each CPU in the machine - this can indicate whether a particular CPU is over subscribed. The thread activity looks like:

This confirms what we thought earlier that some of the threads are much more active than other threads. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff', the middle chart shows the time that each thread spent running useful work, the lower chart shows the time spent in overhead. The exercise now is to try and improve the distribution of work across the threads......

Friday Oct 09, 2009

Webcast: Improving the performance of parallel codes using the Performance Analyzer

Earlier in the summer I recorded a slidecast on using the Performance Analyzer on parallel codes, it's just come out on the HPC portal.

Wednesday Sep 02, 2009

Profiling a rate

Sometimes it's the rate of doing something which is the target that needs to be improved through optimisation. ie increase the widgets per second of some application. I've just been looking at a code that estimated performance by counting the number of computations completed in a known constant length of time. The code was showing a performance regression, and I wanted to find out what changed. The analysis is kind of counter intuitive, so I thought I'd share an example with you.

Here's an example code that does a computation for a fixed length of time, in this case about 30 seconds:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double f1(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

double f2(int i)
  double t=0;
  while (i-->0) {t+=t;}
  return t;

void main(int argc,char\*\*args)
  struct timeval now;
  long startsecs;
  long count=0;
  int vcount;
  if (argc!=2){ printf("Needs a number to be passed in\\n"); exit(0);}


  } while (now.tv_sec<startsecs+30);

  printf("Iterations %i duration %i rate %f\\n",count, now.tv_sec-startsecs, 1.0\*count/(now.tv_sec-startsecs));


The code takes a command line parameter to indicate the number of iterations to do in function f2, function f1 always does 100 iterations.

If I compile and run this code under the performance analyzer with 50 and 70 as the commandline parameters I get the following profile:

Description50 Iterations70 Iterations
Total time26.6s25.87s
Total iterations942,684841,921

We can make the following observation when we go from 70 down to 50 for parameter passed to f2, we see a 12% gain in the total rate. This is to be expected as we are reducing the total number of iterations of the pair of loops in f1 and f2 will reduce from 170 down to 150, which is the same ~12% gain.

Where it gets counter intuitive is that for the run which achieves the higher rate, the time spent in the routines f1 and gettimeofday increases - by the same 12%. This is counter intuitive because increased time in a routine normally indicates that the routine is the one to be investigated, but for a 'rate' situation the opposite is true. These routines are being well behaved. The way to think about it is that each unit of work needs a smidgeon of time in both of these routines, if the number of units of work increases, then the absolute amount of time in these two routines needs to increase linearly with the increase in rate.

However, the time in routine f2 decreases as the rate increases. This is the routine which has been "improved" to get the better rate. The other thing to note is that the time went from ~6s to ~4.5s, but the rate went from 841k to 941k, so the time per unit work dropped further than that - this makes comparing the profiles of the two runs more tricky.

Note that Amdahl's law would still tell us that the routines that need to be optimised are the ones where the time is spent - so in one sense nothing has changed. But my particular scenario today is figuring out what has changed in the executable when compiled in two different ways that leads to the performance gain. In this context, I now know the routine, and I can dig into the assembly code to figure out the why.

Friday Jan 16, 2009

Out of memory in the Performance Analyzer

I've been working on an Analyzer experiment from a long running multithreaded application. Being MT I really needed to see the Timeline view to make sense of what was happening. However, when I switched to the Timeline I got a Java Out of Memory error (insufficient heap space).

Tracking this down, I used prstat to watch the Java application run and the memory footprint increase. I'd expected it to get to 4GB and die at that point, so I was rather surprised when the process was only consuming 1.1GB when the error occurred.

I looked at the commandline options for the Java process using pargs, and spotted the flag -Xmx1024m; which sets the max memory to be 1GB. Ok, found the culprit. You can use the -J option to analyzer to pass flags to the JVM. The following invocation of the analyzer sets the limit to 4GB:

$ analyzer -J-Xmx4096m

If you need more memory than that, you'll have to go to the 64-bit JVM, and allocate an appropriate amount of memory:

$ analyzer -J-d64 -J-Xmx8192m

Wednesday Jul 23, 2008

Hot topics presentation: "Compilers, Tools, and Performance"

Just put together the slides for tomorrow's presentation: "Compilers, Tools, and Performance". One of my colleagues has kindly agreed to translate the presentation into Japanese, so I don't expect to get much material into half an hour, since we'll be sharing the time.

Monday Nov 26, 2007

Other Sun Studio videos

There were two other videos posted to the hpc portal. Marty Itzkowitz talking about the Performance Analyzer and Vijay Tatkar talking about portable applications.

Multi-threading webcast

A long while back I was asked to contribute a video that talked about parallelising applications. The final format is a webcast (audio and slides) rather than the expected video. This choice ended up being made to provide the clearest visuals of the slides, plus the smallest download.

I did get the opportunity to do the entire presentation on video - which was an interesting experience. I found it surprisingly hard to present to just a camera - I think the contrast with presenting to an audience is that you can look around the room and get feedback as to the appropriate level of energy to project. A video camera gives you no such feedback, and worse, there's no other place to look. Still I was quite pleased with the final video. The change to a webcast was made after this, so the audio from the video was carried over, and you still get to see about 3 seconds of the original film, but the rest has gone. I also ended up reworking quite a few of the slides - adding animation to clarify some of the topics.

The topics covered at a break-neck pace are, parallelising using Pthreads and OpenMP. Autoparallelisation by the compiler. Profiling parallel applications. Finally, detecting data races using the thread analyzer.

Wednesday Sep 12, 2007

Sun Studio 12 Performance Analyzer docs available

The documentation for the Sun Studio 12 version of the Performance Analyzer has gone live. The Sun Studio 12 docs collection is also available.

Tuesday Sep 04, 2007

Recording Analyzer experiments over long latency links (-S off)

I was recently collecting a profile from an app running on a machine in Europe, but writing the data back to a machine here in CA. The application normally ran in 5 minutes, so I was surprised that it had made no progress after 3 hours when run under collect.

The Analyzer experiment looked like:

Dir : archives 	 	08/28/07  	23:47:10
File: dyntext 		08/28/07 	23:47:12
File: log.xml 	598 KB 	08/29/07 	03:02:52 
File: map.xml 	3 KB 	08/28/07 	23:47:22
File: overview 	4060 KB 08/29/07 	03:02:51 
File: profile 	256 KB 	08/28/07 	23:47:56

Two of the files (log.xml and overview) had accumulated data since the start of the application, the other files had not. truss output showed plenty of writes to these files:

 0.0001 open("/net/remotefs/.../", O_WRONLY|O_APPEND) = 4
 0.0000 write(4, " < e v e n t   k i n d =".., 74)      = 74
 0.0004 close(4)                                        = 0

In fact it looked rather like opening and closing these remote files were taking all the time away from running the application. One of the Analyzer team suggested passing -S off to collect to switch off periodic sampling. Periodic sampling is collecting application state at one second intervals. Using this flag, the application terminated in the usual 5 minutes and produced a valid profile.

Thursday Aug 16, 2007

Presenting at Stanford HPC conference

I'll be presenting at Stanford next week as part of their HPC conference (Register here). I plan to cover:

Wednesday Aug 15, 2007

Comparing analyzer experiments

When performance tuning an application it is really helpful to be able to compare the performance of the current version of the code with an older version. This was one of the motivators for the performance reporting tool spot. spot takes great care to capture information about the system that the code was run on, the flags used to build the application, together with the obvious things like the profile of the application. The tool spot_diff, which is included in the latest release of spot, pulls out the data from multiple experiments and produces a very detailed comparison between them - indicating if, for example, one version had more TLB misses than another version.

However, there are situations where it's necessary to compare two analyzer experiments, and er_xcmp is a tool which does just that.

er_xcmp extracts the time spent in each routine for the input data that is passed to it, and presents this as a list of functions together with the time spent in each function from each data set. er_xcmp handles an arbitrary number of input files, so it's just as happy comparing three profiles as it is two. It's also able to handle data from bit, so comparisons of instruction count as well as user time are possible.

The input formats can be Analyzer experiments, fsummary output from er_print, or output directories from er_html - all three formats get boiled down to the same thing and handled in the same way by the script.

Here's some example output:

% er_xcmp
    2.8     8.8 <Total>
    1.8     2.6 foo
    1.0     6.2 main
    N/A     N/A _start

Thursday Apr 19, 2007

Locating DTLB misses using the Performance Analyzer

DTLB misses typically appear in the Performance Analyzer as loads with significant user time. The following code strides through memory in blocks of 8192 bytes, and so encounters many DTLB misses

void main()
  double \*a;
  double total=0;
  int i;
  int j;
  for (i=0;i<10000;i++)

A profile can be gathered as follows:

cc -g -O -xbinopt=prepare -o tlb tlb.c
collect tlb

Viewing the profile for the main loop using er_print produces the following snippet:

   User CPU  
   0.230                [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   3.693                [11]    10c78:  ldd         [%i5], %f12
## 7.685                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   4.183                [11]    10c84:  ldd         [%o3], %f2
## 6.935                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   4.123                [11]    10c94:  ldd         [%o1], %f0
## 7.065                [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

This can be compared with the situation where has been preloaded to enable the application to get large pages:

   0.                   [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
   0.                   [11]    10c74:  faddd       %f4, %f2, %f8
   0.                   [11]    10c78:  ldd         [%i5], %f12
## 7.445                [10]    10c7c:  add         %i5, %i3, %o3
   0.                   [11]    10c80:  faddd       %f8, %f0, %f10
   0.220                [11]    10c84:  ldd         [%o3], %f2
## 6.955                [10]    10c88:  add         %o3, %i3, %o1
   0.                   [11]    10c8c:  faddd       %f10, %f12, %f4
   0.                   [10]    10c90:  inc         3072, %i4
   0.340                [11]    10c94:  ldd         [%o1], %f0
   0.                   [10]    10c98:  cmp         %i4, %i0
   0.                   [10]    10c9c:  ble,pt      %icc,0x10c70
   0.                   [10]    10ca0:  add         %o1, %i3, %i5

The difference between the two profiles is the appearance of user time attributed directly to the load instruction (and not the normal instruction after the load).

It is possible to confirm that these are DTLB misses using the Performance Analyzer's ability to profile an application using the hardware performance counters:

collect -h DTLB_miss tlb
          0             [11]    10c70:  prefetch    [%i5 + %i1], #n_reads
          0             [11]    10c74:  faddd       %f4, %f2, %f8
   30000090             [11]    10c78:  ldd         [%i5], %f12
          0             [10]    10c7c:  add         %i5, %i3, %o3
          0             [11]    10c80:  faddd       %f8, %f0, %f10
## 42000126             [11]    10c84:  ldd         [%o3], %f2
          0             [10]    10c88:  add         %o3, %i3, %o1
          0             [11]    10c8c:  faddd       %f10, %f12, %f4
          0             [10]    10c90:  inc         3072, %i4
   30000090             [11]    10c94:  ldd         [%o1], %f0
          0             [10]    10c98:  cmp         %i4, %i0
          0             [10]    10c9c:  ble,pt      %icc,0x10c70
          0             [10]    10ca0:  add         %o1, %i3, %i5

The events are reported on the load instructions that are causing the DTLB misses.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming