Thursday Apr 03, 2014

Discovering the Code Analyzer

We're doing something different with the Studio 12.4 Beta programme. We're also putting together some material about the compiler and features: videos, whitepapers, etc.

One of the first videos is now officially available. You might have seen the preproduction "leak" if you happen to follow Studio on either Facebook or Twitter.

This first video is an interview with Raj Prakash, the project lead for the Code Analyzer.

The Code Analyzer is our suite for checking the correctness of code: something that you would run before you deliver an application to customers.

Tuesday Feb 25, 2014

OpenMP, macros, and #define

Sometimes you need to include directives in macros. The classic example would be putting OpenMP directives into macros. The "obvious" way of doing this is:

#define BARRIER \
#pragma omp barrier

void foo()
{
  BARRIER
}

This produces the following errors:

"test.c", line 6: invalid source character: '#'
"test.c", line 6: undefined symbol: pragma
"test.c", line 6: syntax error before or at: omp

Fortunately C99 introduced the _Pragma mechanism to solve this problem. So the functioning code looks like:

#define BARRIER \
_Pragma("omp barrier")

void foo()
{
  BARRIER
}

Wednesday Sep 18, 2013

SPARC processor documentation

The SPARC processor documentation can be found here. What is really exciting though is that you can finally download the Oracle SPARC Architecture 2011 spec, which describes the current SPARC instruction set.

Tuesday Aug 27, 2013

My Oracle Open World and JavaOne schedule

I've got my schedule for Oracle Open World and JavaOne:

Note that on Thursday I have about 30 minutes between my two talks, so expect me to rush out of the database talk in order to get to the Java talk.

Friday Aug 09, 2013

How to use a lot of threads ....

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that much resource and a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all it is quite likely that such a large system will not end up running a single instance of a single application. Therefore it is useful to understand the various options for virtualising the system. The second observation is that there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Tuesday Jun 11, 2013

SPARC family

This is a nice table showing the various SPARC processors being shipped by Oracle.

Tuesday May 28, 2013

One executable, many platforms

Different processors have different optimal sequences of code. Fortunately, most of the time the differences are minor, and we can easily accommodate them by generating generic code.

If you needed more than this, then the "old" model was to use dynamic string tokens to pick the best library for the platform. This works well, and was the mechanism that libc.so used. However, the downside is that you now need to ship a bundle of libraries with the application; this can get (and look) a bit messy.

There's a "new" approach that uses a family of capability functions. The idea here is that multiple versions of the routine are linked into the executable, and the runtime linker picks the best for the platform that the application is running on. The routines are denoted with a suffix, after a percent sign, indicating the platform. For example here's the family of memcpy() implementations in libc:

$ elfdump -H /usr/lib/libc.so.1 2>&1 |grep memcpy
      [10]  0x0010627c 0x00001280  FUNC LOCL  D    0 .text          memcpy%sun4u
      [11]  0x001094d0 0x00000b8c  FUNC LOCL  D    0 .text          memcpy%sun4u-opl
      [12]  0x0010a448 0x000005f0  FUNC LOCL  D    0 .text          memcpy%sun4v-hwcap1
...

It takes a bit of effort to produce a family of implementations. Imagine we want to print something different when an application is run on a sun4v machine. First of all we'll have a bit of code that prints out the compile-time defined string that indicates the platform we're running on:

#include <stdio.h>
static char name[]=PLATFORM;

void platform()
{
  printf("Running on %s\n",name);
}

To compile this code we need to provide the definition for PLATFORM - suitably escaped. We will need to provide two versions, a generic version that can always run, and a platform specific version that runs on sun4v platforms:

$ cc -c -o generic.o p.c -DPLATFORM=\"Generic\"
$ cc -c -o sun4v.o   p.c -DPLATFORM=\"sun4v\"

Now we have a specialised version of the routine platform() but it has the same name as the generic version, so we cannot link the two into the same executable. So what we need to do is to tag it as being the version we want to run on sun4v platforms.

This is a two step process. The first step is that we tag the object file as being a sun4v object file. This step is only necessary if the compiler has not already tagged the object file. The compiler will tag the object file appropriately if it uses instructions from a particular architecture - for example if you compiled explicitly targeting T4 using -xtarget=t4. However, if you need to tag the object file, then you can use a mapfile to add the appropriate hardware capabilities:

$mapfile_version 2

CAPABILITY sun4v {
        MACHINE=sun4v;
};

We can then ask the linker to apply these hardware capabilities from the mapfile to the object file:

$ ld -r -o sun4v_1.o -Mmapfile.sun4v sun4v.o

You can see that the capabilities have been applied using elfdump:

$ elfdump -H sun4v_1.o

Capabilities Section:  .SUNW_cap

 Object Capabilities:
     index  tag               value
       [0]  CA_SUNW_ID       sun4v
       [1]  CA_SUNW_MACH     sun4v

The second step is to take these capabilities and apply them to the functions. We do this using the linker option -z symbolcap:

$ ld -r -o sun4v_2.o -z symbolcap sun4v_1.o

You can now see that the platform function has been tagged as being for sun4v hardware:

$ elfdump -H sun4v_2.o

Capabilities Section:  .SUNW_cap

 Symbol Capabilities:
     index  tag               value
       [1]  CA_SUNW_ID       sun4v
       [2]  CA_SUNW_MACH     sun4v

  Symbols:
     index    value      size      type bind oth ver shndx          name
      [24]  0x00000010 0x00000070  FUNC LOCL  D    0 .text          platform%sun4v

And finally you can combine the object files into a single executable. The main() routine of the executable calls platform() which will print out a different message depending on the platform. Here's the source to main():

extern void platform();

int main()
{
  platform();
}

Here's what happens when the program is compiled and run on a non-sun4v platform:

$ cc -o main -O main.c sun4v_2.o generic.o
$ ./main
Running on Generic

Here's the same executable running on a sun4v platform:

$ ./main
Running on sun4v

Monday Apr 01, 2013

OpenMP and language level parallelisation

The C11 and C++11 standards introduced some very useful features into the language. In particular they provided language-level access to threading and synchronisation primitives. So using the new standards we can write multithreaded code that compiles and runs on standards-compliant platforms. I've tackled translating Windows and POSIX threads before, but not having to use a shim is fantastic news.

There are some ideas afoot to do something similar for higher level parallelism. I have a proposal for consideration at the April meetings - leveraging the existing OpenMP infrastructure.

Pretty much all compilers support OpenMP, and a large chunk of shared memory parallel programs are written using OpenMP. So, to me, it seems a good idea to leverage the existing OpenMP library code, and existing developer knowledge. The paper is not arguing that we need to take the OpenMP syntax - that is something that can be altered to fit the requirements of the language.

What do you think?

Thursday Mar 14, 2013

The pains of preprocessing

Ok, so I've encountered this twice in 24 hours. So it's probably worth talking about it.

The preprocessor does a simple text substitution as it works its way through your source files. Sometimes this has "unanticipated" side-effects. When this happens, you'll normally get a "hey, this makes no sense at all" error from the compiler. Here's an example:

$ more c.c
#include <ucontext.h>
#include <stdio.h>

int main()
{
  int FS;
  FS=0;
  printf("FS=%i",FS);
}

$ CC c.c
"c.c", line 6: Error: Badly formed expression.
"c.c", line 7: Error: The left operand must be an lvalue.
2 Error(s) detected.

A similar thing happens with g++:

$  /pkg/gnu/bin/g++ c.c
c.c: In function 'int main()':
c.c:6:7: error: expected unqualified-id before numeric constant
c.c:7:6: error: lvalue required as left operand of assignment

The Studio C compiler gives a bit more of a clue what is going on. But it's not something you can rely on:

$ cc c.c
"c.c", line 6: syntax error before or at: 1
"c.c", line 7: left operand must be modifiable lvalue: op "="

As you can guess the issue is that FS gets substituted. We can find out what happens by examining the preprocessed source:

$ CC -P c.c
$ tail c.i
int main ( )
{
int 1 ;
1 = 0 ;
printf ( "FS=%i" , 1 ) ;
}

You can confirm this using -xdumpmacros to dump out the macros as they are defined. You can combine this with -H to see which header files are included:

$ CC -xdumpmacros c.c 2>&1 |grep FS
#define _SYS_ISA_DEFS_H
#define _FILE_OFFSET_BITS 32
#define REG_FSBASE 26
#define REG_FS 22
#define FS 1
....

If you're using gcc you should use the -E option to get preprocessed source, and the -dD option to get definitions of macros and the include files.

Wednesday Dec 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processors, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Wednesday Dec 05, 2012

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out

real           6.06
user           6.04
sys            0.01

Now most people will have heard of the optimised maths library which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions. In this instance, using the library more than doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out

real           2.70
user           2.69
sys            0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line?

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out

real           6.02
user           6.01
sys            0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out 

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used to resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, then the linker has an unresolved symbol referenced from fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

  • Object files
  • Archive libraries
  • Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libraries.

Tuesday Dec 04, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second. Each write is a single byte, so the 0.0000656 MB/s reported for write() corresponds to about 65 operations per second - pretty much what I'd expect for this system.

It's also interesting to look at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication of how to interpret profiles from apps doing I/O: it's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, whereas write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE (10*1024*1024)

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}


int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There are three tests in this example; the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffering I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.

Thursday Oct 18, 2012

Mixing Java and native code

This was a bit of a surprise to me. The slides are available from my presentation at JavaOne on mixed language development. What I wasn't expecting was that there would also be a video of the presentation.

Friday Sep 14, 2012

Current SPARC Architectures

Different generations of SPARC processors implement different architectures. The architecture that the compiler targets is controlled implicitly by the -xtarget flag and explicitly by the -xarch flag.

If an application targets a recent architecture, then the compiler gets to play with all the instructions that the new architecture provides. The downside is that the application won't work on older processors that don't have the new instructions. So for developers there is a trade-off between performance and portability.

The way we have solved this in the compiler is to assume a "generic" architecture, and we've made this the default behaviour of the compiler. The only flag that doesn't make this assumption is -fast which tells the compiler to assume that the build machine is also the deployment machine - so the compiler can use all the instructions that the build machine provides.

The -xtarget=generic flag tells the compiler explicitly to use this generic model. We work hard on making generic code work well across all processors. So in most cases this is a very good choice.

It is also of interest to know what processors support the various architectures. The following Venn diagram attempts to show this:


A textual description is as follows:

  • The T1 and T2 processors, in addition to most other SPARC processors that were shipped in the last 10+ years, supported V9b, or sparcvis2.
  • The SPARC64 processors from Fujitsu, used in the M-series machines, added support for the floating point multiply accumulate instruction in the sparcfmaf architecture.
  • Support for this instruction also appeared in the T3 - this is called sparcvis3.
  • Later SPARC64 processors added the integer multiply accumulate instruction; this architecture is sparcima.
  • Finally the T4 includes support for both the integer and floating point multiply accumulate instructions in the sparc4 architecture.

So the conclusion should be:

  • Floating point multiply accumulate is supported in both the T-series and M-series machines, so it should be a relatively safe bet to start using it.
  • The T4 is a very good machine to deploy to because it supports all the current instruction sets.

Thursday Aug 30, 2012

SPARC Architecture 2011

With what appears to be minimal fanfare, an update of the SPARC Architecture has been released. If you ever look at SPARC disassembly code, then this is the document that you need to bookmark. If you are not familiar with it, then it basically describes how a SPARC processor should behave - it doesn't describe a particular implementation, just the "generic" processor. As with all revisions, it supersedes the SPARC v9 book published back in the 90s, having both corrections, and definitions of new instructions. Anyway, should be an interesting read :)

Monday Aug 27, 2012

Monday, 1st October: Presenting at JavaOne and Oracle Open World

On Monday 1 October I will be presenting at both JavaOne and Oracle Open World. The full conference schedule is available from here. The logistics for my sessions are as follows:


  • JavaOne: 8:30am Monday 1 October. CON6714: "Mixed-Language Development: Leveraging Native Code from Java". San Francisco Hilton - Continental Ballroom 6
  • Oracle OpenWorld: 10:45am Monday 1 October. CON6382: "Maximizing Your SPARC T4 Oracle Solaris Application Performance". Marriott Marquis - Golden Gate C3

Hope to see you there!

Thursday May 17, 2012

Solaris Developer talk next week

Vijay Tatkar will be talking about developing on Solaris next week Tuesday at 9am PST.

Friday Apr 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44, the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bit or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. They do not restrict the address range for pointers (dynamically allocated memory) - they remain 64-bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64-bits, hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44 bit address can be generated in four instructions, so slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0

Monday Apr 02, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, and I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions. We know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" exception specification to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Friday Feb 03, 2012

Using prtpicl to get cache sizes

If you are on a SPARC system you can get cache size information using the command fpversion, which is provided with Studio:

$ fpversion
 A SPARC-based CPU is available.
 Kernel says main memory's clock rate is 1012.0 MHz.

 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.

 Use "-xtarget=sparc64vii -xcache=64/64/2:5120/256/10" code-generation option.

The cache parameters are output exactly as you would want to pass them into the compiler - for each cache it describes the size in KB, the line size in bytes, and the associativity.

fpversion doesn't exist on x86 systems. The next best thing is to use prtpicl to output system configuration information, and inspect that output for cache size. Here's the cache output for the same SPARC system using prtpicl.

$ prtpicl -v |grep cache
              :l1-icache-size    0x10000
              :l1-icache-line-size       0x40
              :l1-icache-associativity   0x2
              :l1-dcache-size    0x10000
              :l1-dcache-line-size       0x40
              :l1-dcache-associativity   0x2
              :l2-cache-size     0x500000
              :l2-cache-line-size        0x100
              :l2-cache-associativity    0xa

Tuesday Jan 17, 2012

Separation of debug and executable

To reduce the size of shipped binaries it can be useful to separate the debug information into a separate file. This procedure is covered in the dbx manual. We can use objcopy to extract the debug information and then to link the executable with the extracted data.

Here's a short example executable:

#include <stdio.h>
#include <math.h>

int main()
{
  double d=1.0;
  d = sin(d);
  printf("sin(1.0) = %f\n",d);
}

Compiled with debug:

$ cc -g hello.c -lm
$ ./a.out
sin(1.0) = 0.841471

We can debug this executable with dbx. Note that, in this case, we compiled without optimisation in order to get the best debug information. Doing this does potentially sacrifice some performance. We can follow the same procedure with optimised code.

$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
(2) stop in main
(dbx) run
Running: a.out
(process id 53296)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) step
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471

First of all we are going to use objcopy to extract the debug information from ./a.out and place it into ./a.out.debug:

$ /usr/sfw/bin/gobjcopy --only-keep-debug ./a.out ./a.out.debug

Now we can strip a.out of debug information:

$ strip ./a.out

To prove that this has removed the debug information we can try running under dbx:

$ dbx  ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx) quit

Now we want to use objcopy to make a link between the executable and its debug information:

$ /usr/sfw/bin/gobjcopy --add-gnu-debuglink=./a.out.debug ./a.out

Now when we debug the executable we are back to full debug:

$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop  in main
(2) stop in main
(dbx) run
Running: a.out
(process id 58837)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) next
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471

execution completed, exit code is 0
(dbx) quit

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that the article implies but does not state is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing
  nop
.end

And here's some code that calls it:

void nothing();

int main()
{
  nothing();
}

The code works when compiled as C, but not as C++:

$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
  {
#endif
    void nothing();
#ifdef __cplusplus
  }
#endif

int main()
{
  nothing();
}

Thursday Jan 12, 2012

Please mind the gap

I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

Timeline view

One of my colleagues pointed out that it is possible to figure out what is causing the gaps. The call stack is indicated by the event after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.

#include <stdio.h>

void write_file()
{
  char block[8192];
  FILE * file = fopen("./text.txt", "w");
  for (int i=0;i<1024; i++)
  {
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
}

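void read_file()
{
  char block[8192];
  /* "r+" opens an existing file for both reading and writing */
  FILE * file = fopen("./text.txt", "r+");
  for (int i=0;i<1024; i++)
  {
    fread(block,sizeof(block),1,file);
    /* Cast before negating: sizeof yields an unsigned type */
    fseek(file,-(long)sizeof(block),SEEK_CUR);
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
}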

int main()
{
  for (int i=0; i<100; i++)
  {
    write_file();
    read_file();
  }
}

This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we determine that the gaps are caused by the application either opening or closing the file.

_close

But that is not all that is going on. If we look at the Duration of the event in the Timeline details panel, we can see that it spent 210ms in the "Other Wait" micro state. So we've now got a pretty clear idea of where the time is coming from.

Wednesday Jan 11, 2012

A static function, an inline function, and a static variable walked into a bar....

... well, not really. I was hacking around with some library code, so I thought I'd write up a quick refresher on scoping. Steve Clamage and I cover scoping in more detail in the series on libraries and linking. For the code I was working on today, the problem was much more limited.

I had a single file containing all the source code. I wanted to export only the minimal number of symbols that were needed to act as an interface for the library. You can imagine it being something like:

#include <stdio.h>

int count=0;

inline void printcount()
{
  printf("Count = %i\n",count);
  asm("nop");
}

void next()
{
  count++;
  printcount();
}

If I compile this, and then use nm to inspect the resulting library, I can see a global symbol for count. The function printcount() is defined with local scope. However, the only interface I want to export is next().

bash-3.00$ cc -g -G -O -o libt.so t.c
bash-3.00$ nm libt.so|grep GLOB
...
[45]    |     66468|       4|OBJT |GLOB |0    |11     |count
[43]    |       724|      40|FUNC |GLOB |0    |5      |next
[42]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printf
bash-3.00$ nm libt.so |grep count
[44]    |     66460|       4|OBJT |GLOB |0    |11     |count
[32]    |       672|      52|FUNC |LOCL |0    |5      |printcount

So I can define count as a static variable, and that reduces its scope to the file in which it is defined. However, this does not actually make it disappear; it is still there, but with a mangled name:

bash-3.00$ nm libt.so|grep count
[40]    |     66476|       4|OBJT |GLOB |0    |11     |$XAS4IkBuA_CPGtc.count
[33]    |       688|      52|FUNC |LOCL |0    |5      |printcount

The reason for this is that I'm building with debug (-g). With debug, I get a local version of the routine printcount(), and I get a globalised version of the variable count. If I remove -g, I get the following output from nm:

bash-3.00$ nm libt.so|grep count
[29]    |     66316|       4|OBJT |LOCL |0    |11     |count
[36]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printcount

The variable count has local scope, which is what we expected - it is no longer exported from the file, so we have avoided possible name conflicts there. However, printcount() is now no longer defined. That might be ok so long as we don't actually call the routine:

bash-3.00$ dis libt.so|grep printcount
printcount()
         2e4:  7f ff ff ef  call        printcount      ! 0x2a0

Oops. We've hit the rule about needing to provide an extern version of any inline functions. Once again, I suggest parsing Douglas Walls' discussion of the topic for the gory details. Anyhow, the upshot is that this library wouldn't work. The fix is trivial: declare printcount() to be static inline, and the compiler will generate the local version of the function:

bash-3.00$ cc -G -O -o libt.so t.c
bash-3.00$ nm libt.so |grep count
[29]    |     66448|       4|OBJT |LOCL |0    |11     |count
[30]    |       664|      52|FUNC |LOCL |0    |5      |printcount

With these fixes the library no longer exports any functions but the ones I left with external linkage. This substantially reduces the risk of "undefined behaviour".
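For reference, here is a minimal sketch of what the source file might look like with both fixes applied. It is my reconstruction from the description above rather than the exact file used for the nm output, and it leaves out the asm("nop") from the original listing to keep the sketch portable.

```c
#include <stdio.h>

/* static: visible only within this file, so it is not exported */
static int count = 0;

/* static inline: the compiler emits a local copy if it needs one,
   so no extern definition of printcount() is required */
static inline void printcount(void)
{
  printf("Count = %i\n", count);
}

/* The only symbol intended to be part of the library interface */
void next(void)
{
  count++;
  printcount();
}
```

Building this as a shared library and running nm on the result is the way to confirm which symbols remain global on your own compiler and platform.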

Tuesday Jan 10, 2012

What's inlined by -xlibmil

The compiler flag -xlibmil provides inline templates for some critical maths functions, but it comes with the optimisation that it does not set errno for these functions. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and determine whether you care that they don't set errno. You can see the list of functions using the command:

grep inline /compilerpath/prod/lib/libm.il
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1
...

From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it sets errno, then don't use -xlibmil.

Monday Jan 09, 2012

Understanding binary size

One of my colleagues, Miriam Blatt, has written a great article about understanding the size of binary objects. This is worth a read because it describes both what goes into the objects and what tools you can use to discover this information.

Wednesday Dec 14, 2011

Oracle Solaris Studio 12.3

Oracle Solaris Studio 12.3 was released today. You can download it here.

There's a bundle of exciting stuff that goes into every new release. The headlines are probably the introduction of the Code Analyzer tool which does dynamic and static error reporting on an application, and the ability of the IDE to be run on a remote system while the builds are done on the host.

I have a couple of other favourite areas of change. First of all we've got spot running on a bunch of recent processors - in particular the SPARC T4 (I'll write more about this later). Secondly, the filtering in the Performance Analyzer has been pushed to the foreground. Let's discuss filtering now.

Filtering is one of those technologies that is very powerful, but has been quite hard to use in previous releases. The change in this release has been that the filters have been placed on the right-click menu. Here's an example:

Adding and removing filters is now just a matter of right clicking. This allows you to rapidly drill down on the profile data, for example by filtering activity by processor, call stack, and so on.

Friday Oct 21, 2011

Endianness

SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers as we would normally do. For example one thousand, one hundred and twenty would be written as 1120 using a big-endian format. However, writing as little endian it would be 0211 - the least significant digits would be recorded first.
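To make the two layouts concrete, here is a small sketch in C. The function names are made up for illustration. The two store functions write a 32-bit value into a byte array in a fixed order regardless of the host, and the third probes which order the host itself uses.

```c
#include <stdint.h>
#include <string.h>

/* Most significant byte first, like writing 1120 as "1120" */
void store_big_endian(uint32_t value, unsigned char out[4])
{
  out[0] = (unsigned char)(value >> 24);
  out[1] = (unsigned char)(value >> 16);
  out[2] = (unsigned char)(value >> 8);
  out[3] = (unsigned char)value;
}

/* Least significant byte first, like writing 1120 as "0211" */
void store_little_endian(uint32_t value, unsigned char out[4])
{
  out[0] = (unsigned char)value;
  out[1] = (unsigned char)(value >> 8);
  out[2] = (unsigned char)(value >> 16);
  out[3] = (unsigned char)(value >> 24);
}

/* Returns 1 if the host keeps the least significant byte at the
   lowest address, 0 otherwise */
int host_is_little_endian(void)
{
  uint32_t probe = 1;
  unsigned char first;
  memcpy(&first, &probe, 1);
  return first == 1;
}
```

For the value 0x01020304, the big-endian store produces the bytes 01 02 03 04 and the little-endian store produces 04 03 02 01.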

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.
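On POSIX systems the usual way to do this conversion is with htonl() and ntohl() from <arpa/inet.h>. The sketch below uses them to write and read a 32-bit field in a packet buffer. The helper names put_u32 and get_u32 are hypothetical, chosen just for this example.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Write a 32-bit value into a packet in network (big-endian) order */
void put_u32(unsigned char *packet, uint32_t value)
{
  uint32_t wire = htonl(value);   /* no-op on big-endian hosts */
  memcpy(packet, &wire, sizeof(wire));
}

/* Read a 32-bit network-order value back into host order */
uint32_t get_u32(const unsigned char *packet)
{
  uint32_t wire;
  memcpy(&wire, packet, sizeof(wire));
  return ntohl(wire);
}
```

The bytes in the packet come out in the same order on SPARC and x86. Only the amount of work done by htonl() and ntohl() differs.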

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
  {
    // The last byte is at index sizeof(T)-1, so mirror around that
    pcout[i] = pcin[sizeof(T)-1-i];
  }
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8;
    out|=(in&255);
    in>>=8;
  }
  return out;
} 

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. This is higher cost than the load and store per byte. However, we can usually do more logical operations per cycle and the operations normally take a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. Obviously RAW hazards are hardware dependent - some processors may be engineered to avoid them - in which case you will only see a problem on some particular hardware. That means your application can run well on one machine, but poorly on another.
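As a sketch of the point that the same observations hold for a C version, here are both approaches written for a fixed 32-bit unsigned input. The function names are my own. Note that an optimising compiler may reduce either loop to something quite different, so it is worth checking the generated code before drawing conclusions about performance.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-at-a-time reversal through memory: store the int, load and
   store four bytes, then load the int again */
uint32_t swap32_memory(uint32_t in)
{
  uint32_t out;
  unsigned char *pcin = (unsigned char *)&in;
  unsigned char *pcout = (unsigned char *)&out;

  for (size_t i = 0; i < sizeof(in); i++)
  {
    pcout[i] = pcin[sizeof(in) - 1 - i];
  }
  return out;
}

/* Reversal using shifts and masks, keeping the data in registers */
uint32_t swap32_shift(uint32_t in)
{
  uint32_t out = 0;

  for (size_t i = 0; i < sizeof(in); i++)
  {
    out <<= 8;            /* make room for the next byte */
    out |= (in & 255u);   /* append the lowest byte of the input */
    in >>= 8;             /* move on to the next input byte */
  }
  return out;
}
```

Both functions map 0x11223344 to 0x44332211, and applying either one twice returns the original value.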

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
