Friday Oct 21, 2011

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS so does not require a separate library to be shipped with the application. However, does not support the standard.
  • -library=stlport4 Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.

Tuesday Oct 04, 2011

Slides from Oracle Open World

The slides from my presentation are now available for download.

Sunday Aug 14, 2011

Comments

It used to be possible to have comments automatically appear without me having to moderate them. I preferred approach since otherwise it feels a bit like censorship when I need to explicitly approve them. Unfortunately that option has been removed, so I'm now having to manually approve any comments. This may mean a bit of a delay before I notice that I have comments to approve.

Tuesday Aug 09, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often its mmap, but sometimes it's something else. Every time I think that I should write something up. Well, it's finally happened, a short paper on how to write portable code using the standard headers.

Monday Aug 01, 2011

Standard header files

Interesting (but old) blog post about the standard header files included with Solaris.

Thursday Jul 14, 2011

Best practices for libraries and linkers (part 8)

Part 8 is the conclusion of the series on the best practices for libraries and linking. The core set of best practices are:

  • Ensure at link time that all symbols are resolved.
  • Minimise the number of symbols of global scope.
  • Specify the library search paths at link time.

Putting this series of articles together turned out to be a fair amount of work. Hopefully you can see from the scale of the topics why we chose to break it down into bite-sized chunks. I'll be happy to hear feedback on whether you found it useful, or what other topics you would like discussed.

Using symbol scoping. Libraries and linker best practices part 7

In general the compiler is going to scope symbols declared in object files as being global. This means that they can be seen and bound to by any object. There are two other settings for symbol scope - "symbolic" and "hidden".

Hidden scope is easiest to describe as it just means that the symbol can only be seen within the module and is not exported for applications or libraries to use. This is basically a locally defined symbol. There are multiple advantages to using hidden scoping when possible, it reduces the number of symbols that the linker needs to handle at runtime, so reduces start up time. It also reduces the number of names, so reduces the chance of duplicate names. Finally hidden symbols cannot be bound to externally, so they cannot cause a link order problem. This makes hidden scope a good choice for all those symbols that don't need to be exported.

The other option is symbolic scope. A symbol with symbolic scope is still available for other modules to bind to - so it is like a global symbol in that respect. However, a symbolic symbol can only be satisfied from within the library or application. So if I have an unresolved symbolic symbol foo() then that symbol can only bind within the library or application. So symbolic-scoped symbols avoid the cross-library issue that causes link order problems.

Symbols can be declared with their scoping; __global,__symbolic, or __hidden. We can also use the compiler flag -xldscope=<scope> to set the default scoping for all the symbols not otherwise scoped.

The details of all this are discussed much more thoroughly in Part 7 of the series.

The best practices for symbol scoping come in two flavours:

The easiest way of handling scoping is to declare all the defined symbols to have symbolic scoping (-xldscope=symbolic). This ensures that these symbols end up with local binding rather than pulling in definitions that are present in other libraries. The downside of this is that it could cause multiple definitions for the same symbol to become present in the address space of an application.

The other approach is to carefully define interfaces by declaring exported symbols to be __symbolic, so that other libraries can bind to them, but this library will bind to the local versions in preference. Then to declare imported symbols as __global which will ensure that the library can bind to an external definition for the symbol. Then finally use -xldscope=hidden to avoid further pollution of the name space. This is time consuming but reduces runtime link costs, and also increases the robustness of the application.

Setting the initialisation order for libraries (Best practices for libraries and linking part 6)

Part 5 of the series talked about diagnosing initialisation problems. These are situations where the libraries are loaded in the wrong order and this causes the application not to function correctly (or at all). Part 6 discusses how to resolve this problem.

The easiest, but the least reliable approach is to reorder the libraries on the link line until they get initialised in the right order. This is an easy fix since it is just a matter of changing the link line, but it's not reliable. There are various reasons why this is a poor fix. It is limited to just fixing the one application, and does not fix the root of the problem. It is not robust as a change in one of the libraries may cause the whole problem to recur. etc. Better fixes involve avoiding the duplicate symbol problem that causes the library load order to be indeterminate.

If the symbols are introduced because of C++ templates, then the -instlib=<library> flag causes the compiler not to generate symbols that are defined in the listed libraries.

Direct binding is another approach which records the exact library dependencies at link time so that the linker knows exactly which libraries are required, and hence can determine the appropriate load order. This has the downside that it enables different libraries to bind to different definitions of the same symbol, this could be a useful feature, but could also introduce problems.

Wednesday Jul 13, 2011

Presenting at Oracle Open World

I'll be presenting at Oracle Open World in October. The full searchable catalog is on-line, or you can browse the speakers by name.

Tuesday Jul 12, 2011

Feature Test Macros

Feature test macros are a set of macros that are either:

  • Defined by the development environment indicating that the environment conforms to a particular standard

or

  • Defined by the source code for the application before the header files are included to indicate that the application requires a particular environment to build

The macros define what APIs are available, and what parameters are passed through the APIs. Adherence to a particular standard (like POSIX) will define a particular set of APIs, and define their parameters. A good example of this is on Solaris where munmap changes definition depending on what standards have been requested:

$ grep munmap /usr/include/sys/*.h
/usr/include/sys/mman.h:extern int munmap(void *, size_t);
/usr/include/sys/mman.h:extern int munmap(caddr_t, size_t);

The Linux man page for feature_test_macros includes useful source code (ftm.c) for reporting which feature test macros are set by default. This changes depending on the the OS and compiler used. One of the big differences between Linux and Solaris are the feature test macros that are set by default. Here's the output from the program compiled on a Linux box and a Solaris box - both using gcc.

Linux

$ gcc ftm.c
$ ./a.out
_POSIX_SOURCE defined
_POSIX_C_SOURCE defined: 200809L
_BSD_SOURCE defined
_SVID_SOURCE defined
_ATFILE_SOURCE defined

Solaris

$ gcc ftm.c
$ ./a.out
_LARGEFILE64_SOURCE defined
_FILE_OFFSET_BITS defined: 32

The list of standards that Solaris 10 adheres to is documented under man standards, the list for Linux is documented under man feature_test_macros.

Monday Jul 11, 2011

OpenMP 3.1 specification released

OpenMP is a great way to produce parallel applications with the minimal amount of work. The 3.1 specification came out a couple of days ago. As should be apparent from the version number, its more incremental than significant. The significant changes I see are:

  • Support for min and max reductions in C/C++. This was a frustrating omission from the previous versions, so I'm pleased to see that fixed here.
  • Support for thread binding. The specification introduces OMP_PROC_BIND which binds threads to cores. This is rather similar to the original SUNW_MP_PROCBIND in Studio, which only took true or false, more recent compilers allow a much finer granularity of control. Still "true" or "false" is a good start!

Wednesday Jun 29, 2011

Library initialisation in C++ - libraries and linking part 5

Part 5 of the series of articles on linking and libraries is up. This one gets into the details of what can go wrong when writing libraries in C++. The key take aways from the article are to use:

  • LD_DEBUG=init to view runtime initialisation
  • LD_DEBUG=bindings to examine how symbols are bound to libraries at runtime

Wednesday Jun 01, 2011

Avoiding problems at linktime (part 4 in series)

Part 4 in the series on best practices for linking is available. The key takeaways are:

  • Avoid defining duplicate symbols. The Solaris tool lari will produce a report on this issue (besides doing a bundle of other stuff). The problem with multiple definitions of symbols is that it is not predictable which definition will be picked at runtime. This is often deterministic on a particular platform, but could change on a different platform.
  • Always define libraries as a hierarchy, with no circular dependencies. If there are circular dependencies the libraries may get loaded in an unpredictable order.

Friday May 27, 2011

Using LD_DEBUG to examine application startup (linking best practices part 3)

Part 3 of the series on best practices for linking C/C++ applications is up. This sections focuses on using LD_DEBUG to examine application startup.

The paper talks about the options LD_DEBUG=init which shows the initialisation and finalisation stages of an applications run, and LD_DEBUG=bindings which shows how the symbols are bound between the application and libraries.

Tuesday May 24, 2011

Best practices for linking - part 2

Part 2 of the article on library linking best practices is up on OTN. This is a relatively short read about ensuring that the library records its dependencies.

The relevant options are:

  • -z defs which will cause the linker to report any unresolved symbols found in the library. This is the default for applications, but is not the default for libraries. Using this flag requires that all the libraries that are required for successful linking are listed on the link line. Doing this will ensure that the library will fail to link rather than fail at runtime.
  • The command ldd -U -r lib.so will report if the library (or executable) is linked to libraries that it does not use. This is helpful in ensuring that the minimal number of libraries are loaded in order for an application to run.

Wednesday May 18, 2011

Profiling running applications

Sometimes you want to profile an application, but you either want to profile it after it has started running, or you want to profile it for part of a run. There are a couple of approaches that enable you to do this.

If you want to profile a running application, then there is the option (-P <pid>) for collect to attach to a PID:

$ collect -P <pid>

Behind the scenes this generates a script and passes the script to dbx, which attaches to the process, starts profiling, and then stops profiling after about 5 minutes. If your application is sensitive to being stopped for dbx to attach, then this is not the best way to go. The alternative approach is to start the application under collect, then collect the profile over the period of interest.

The flag -y <signal> will run the application under collect, but collect will not gather any data until profiling is enabled by sending the selected signal to the application. Here's an example of doing this:

First of all we need an application that runs for a bit of time. Since the compiler doesn't optimise out floating point operations unless the flag -fsimple is used, we can quickly write an app that spends a long time doing nothing:

$ more slow.c
int main()
{
  double d=0.0;
  for (int i=0;i<10000000000; i++) {d+=d;}
}

$ cc -g slow.c

The next step is to run the application under collect with the option -y SIGUSR1 to indicate that collect should not start collecting data until it receives the signal USR1.

$ collect -y SIGUSR1 ./a.out &
[1] 1187
Creating experiment database test.1.er ...

If we look at the generated experiment we can see that it exists, but it contains no data.

$ er_print -func test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
0.        0.         

To start gathering data we send SIGUSR1 to the application, sending the signal again stops data collection. Sending the signal twice we can collect two seconds of data:

$ kill -SIGUSR1 1187;sleep 2;kill -SIGUSR1 1187
$ er_print -func test.1.er
Functions sorted by metric: Exclusive User CPU Time

Excl.     Incl.      Name
User CPU  User CPU
 sec.      sec.
2.001     2.001      
2.001     2.001      main
0.        2.001      _start

Friday May 06, 2011

Calling functions

I was looking at some code today and it reminded me of a very common performance issue - reloading data around calls. Suppose I have some code like:

int variable;

void function(int *array)
{
  for (i=0; i<1000; i++)
  {
     if (variable==1) 
     { 
       func1(a[i]); 
     } 
     else 
     { 
       func2(array[i]); 
     }
  }
}

You might be surprised to find that "variable" is reloaded very iteration of the loop. The reason for this is that the loop calls another function - either func1() or func2() and the compiler knows that the function might change the value of "variable" - so to be correct it needs to be reloaded.

This problem can be fixed by caching a local copy of the variable. The compiler "knows" that local (or stack based) variables don't get modified by function calls.

However, the problem is more general than this, in C++ you might observe a reloading of variables that are members of objects - for similar reasons. The general rule for avoiding this is to examine every load or store in the hot region of code to check whether it is necessary, or whether it has been introduced because of a function call.

Thursday Apr 28, 2011

Exploring Performance Analyzer experiments

I was recently profiling a script to see where the time went, and I ended up wanting to extract the profiles for just a single component. The structure of an analyzer experiment is that there's a single root directory (test.1.er) and inside that there's a single level of subdirectories representing all the child processes. Each subdirectory is a profile of a single process - these can all be loaded as individual experiments. Inside every experiment directory there is a log.xml file. This file contains a summary of what the experiment contains.

The name of the executable that was run is held on an xml "process" line. So the following script can extract a list of all the profiles of a particular application.

$ grep myapp `find test.1.er -name 'log.xml'`|grep process | sed 's/\\:.\*//' > myapp_profiles

Once we have a list of every time my application was run, I can now extract the times from that list, and sort them using the following line:

$ grep exit `cat <myapp_files`|sed 's/.\*tstamp=\\"//'|sed 's/\\".\*//'|sort -n 

Once I have the list of times I can use then locate an experiment with a particular runtime - it's probably going to be the longest runtime:

$ grep exit `cat <myapp_files`|grep 75.9 

Monday Apr 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt /level/ (/function/). This needs to be inserted into the code before the start of the function definition, but after the function header.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet. This contains two identical functions, both return the square of a global variable. However, we are using #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
{
  return d\*d;
}

int g()
{
  return d\*d;
}

The code is compiled with the flag -xmaxopt=5, this specifies the maximum optimisation level that can be applied to any functions in the file.

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data.

                        f:
/\* 000000          0 \*/         sethi   %hi(d),%o5

!   10                !  return d\*d;

/\* 0x0004         10 \*/         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/\* 0x0008            \*/         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/\* 0x000c            \*/         retl    ! Result =  %o0
/\* 0x0010            \*/         mulx    %o4,%o3,%o0

                        g:
/\* 000000         14 \*/         sethi   %hi(d),%o5
/\* 0x0004            \*/         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d\*d;

/\* 0x0008         15 \*/         sra     %o4,0,%o3
/\* 0x000c            \*/         retl    ! Result =  %o0
/\* 0x0010            \*/         mulx    %o3,%o3,%o0

Tuesday Apr 05, 2011

Porting from 32-bits to 64-bits

Article on porting from 32-bit to 64-bit just posted up on OTN. It's well worth a read as it covers all those corner cases which could result in errors in the 64-bit version.

Friday Apr 01, 2011

Profiling scripts

One feature that crept into the Oracle Solaris Studio 12.2 release was the ability for the performance analyzer to follow scripts. It is necessary to set the environment variable SP_COLLECTOR_SKIP_CHECKEXEC to use this feature - as shown below.

bash-3.00$ file `which which`
/bin/which:     executable /usr/bin/csh script
bash-3.00$ collect which
Target `which' is not a valid ELF executable
bash-3.00$ export SP_COLLECTOR_SKIP_CHECKEXEC=1
bash-3.00$ collect which
Creating experiment database test.1.er ...

Monday Feb 14, 2011

Interview with Jim Mauro

I was really pleased that Jim Mauro agreed to interview about developing for multicore processors. The interview has just gone live on the informit site.

Thursday Jan 27, 2011

Don't initialise local strings

Consider the following code:

void s(int i)
{
  char string[2048]="";
  sprinf(string,"Value = %i",i);
  printf("String = %s\\n",string);
}

The C standards require that if any elements of the character array string are initialised, then all of them should be. We can demonstrate this by compiling with gcc:

$ gcc -O -S f.c
$ more f.s
        .file   "f.c"
...
        .type   s, #function
        .proc   020
s:
        save    %sp, -2160, %sp
        stx     %g0, [%fp-2064]
        add     %fp, -2056, %o0
        mov     0, %o1
        call    memset, 0
        mov    2040, %o2

You can see that explicitly initialising string caused all elements of string to be initialised with a call to memset(). Removing the explicit initialisation of string (the ="") avoids the call to memset().

Thursday Jan 20, 2011

Interview about Multicore book

An interview I did with Christy Confetti-Higgins has just gone up on OTN.

Saturday Jan 15, 2011

pginfo & pgstat

A couple of very welcome commands crept into Solaris 11. They are pginfo and pgstat (these are links to the Oracle documentation site).

The two commands deal with "processor groups", this seems a bit of a misnomer to me as they are really about CPU topology. They report information and utilisation stats demonstrating the resource sharing going on on the system, and how threads are using those resources. It's probably easiest to use a couple of examples from the man pages to show this. First off pginfo:

$ pginfo -p -T
0 (System) CPUs: 0-31
`-- 3 (Data_Pipe_to_memory [system,chip]) CPUs: 0-31
    `-- 2 (Floating_Point_Unit [system,chip]) CPUs: 0-31
        |-- 1 (Integer_Pipeline [core]) CPUs: 0-3
        |-- 4 (Integer_Pipeline [core]) CPUs: 4-7
        |-- 5 (Integer_Pipeline [core]) CPUs: 8-11
        |-- 6 (Integer_Pipeline [core]) CPUs: 12-15
        |-- 7 (Integer_Pipeline [core]) CPUs: 16-19
        |-- 8 (Integer_Pipeline [core]) CPUs: 20-23
        |-- 9 (Integer_Pipeline [core]) CPUs: 24-27
        `-- 10 (Integer_Pipeline [core]) CPUs: 28-31

This shows a processor with 32 virtual CPUs, sharing a single floating point pipeline, 8 cores with a single integer pipe each - looks like an UltraSPARC T1 to me.

The output from pginfo shows the utilisation of the processor:

$ pgstat 1 2
PG  RELATIONSHIP            HW    SW  CPUS
 0  System                   -  0.4%  0-31
 3   Data_Pipe_to_memory     -  0.4%  0-31
 2    Floating_Point_Unit   0%  0.4%  0-31
 1     Integer_Pipeline     0%    0%  0-3
 4     Integer_Pipeline     0%    0%  4-7
 5     Integer_Pipeline     0%    0%  8-11
 6     Integer_Pipeline     0%  0.2%  12-15
 7     Integer_Pipeline     0%    0%  16-19
 8     Integer_Pipeline   2.8%  2.7%  20-23
 9     Integer_Pipeline   0.1%  0.2%  24-27
10     Integer_Pipeline     0%    0%  28-31

It reports both software utilisation - meaning what work the operating system has assigned to the cores, plus it can report pipeline utilisation using the hardware counters. Pipeline utilisation indicates whether the core is saturated or not - each pipeline can be fully utilised before the core is running the maximal number of threads.

I'm pleased to see these tools appear. It is useful to have tools that report the topology of the system, and it is great to see tools that report actual hardware utilisation. On earlier releases of Solaris you can always use corestat to get similar data.

Tuesday Jan 11, 2011

RAW pipeline hazards

When a processor stores an item of data back to memory it actually goes through quite a complex set of operations. A sketch of the activities is as follows. The first thing that needs to be done is that the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored there is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.

This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.

It could take some considerable number of cycles if the loads had to wait for the stores to exit the queue before they could fetch the data. So many processors implement some kind of bypassing. If a load finds the data it needs in the store queue, then it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size to the same address. i.e. you cannot bypass a byte from a store of a word. If the bypass fails, then the situation is referred to as a "RAW" hazard, meaning "Read-After-Write". If the bypass fails, then the load has to wait until the store has completed before it can retrieve the new value - this can take many cycles.

As a general rule it is best to avoid potential RAWs. It is hardware, and runtime situation dependent whether there will be a RAW hazard or not, so avoiding the possibility is the best defense. Consider the following code which uses loads and stores of bytes to construct an integer.

#include <stdio.h>
#include <sys/time.h>

void tick()
{
  hrtime_t now = gethrtime();
  static hrtime_t then = 0;
  if (then>0) printf("Elapsed = %f\\n", 1.0\*(now-then)/100000000.0);
  then = now;
}

int func(char \* value)
{
  int temp;
  ((char\*)&temp)[0] = value[3];
  ((char\*)&temp)[1] = value[2];
  ((char\*)&temp)[2] = value[1];
  ((char\*)&temp)[3] = value[0];
  return temp;
}

int main()
{
  int value = 0x01020304;
  tick();
  for (int i=0; i<100000000; i++) func((char\*)&value);
}

In the above code we're reversing the byte order by loading the bytes one-by-one, and storing them into an integer in the correct position, then loading the integer. Running this code on a test machine it reports 12ns per iteration.

However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:

int func2(char\* value)
{
  return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[0];
}

This modified routine takes about 8ns per iteration. Which is significantly faster than the original code.

The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more observation is that the speed up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, whereever possible, to avoid passing data through the stack.

Wednesday Dec 01, 2010

Documentation

So the spot user's guide has been added to the Solaris Studio 12.2 documentation. There's also another collection of older articles.

Sunday Nov 28, 2010

Multicore Application Programming: Source code available

I've just uploaded all the source code to the examples in Multicore Application Programming. About 160 files.

Saturday Nov 13, 2010

Multicore Application Programming arrived!

It was an exciting morning - my copy of Multicore Application Programming was delivered. After reading the text countless times, it's great to actually see it as a finished article. It starting to become generally available. Amazon lists it as being available on Wednesday, although the Kindle version seems to be available already. It's also available on Safari books on-line. Even turned up at Tesco!

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs