Wednesday Apr 16, 2014

Lambda expressions in C++11

Lambda expressions are, IMO, one of the interesting features of C++11. At first glance they do seem a bit hard to parse, but once you get used to them you start to appreciate how useful they are. Steve Clamage and I have put together a short paper introducing lambda expressions.

Monday Apr 07, 2014

RAW hazards revisited (again)

I've talked about RAW hazards in the past, and even written articles about them. They are an interesting topic because they are situation where a small tweak to the code can avoid the problem.

In the article on RAW hazards there is some code that demonstrates various types of RAW hazard. One common situation is writing code to copy misaligned data into a buffer. The example code contains a test for this kind of copying, the results from this test, compiled with Solaris Studio 12.3, on my system look like:

Misaligned load v1 (bad) memcpy()
Elapsed = 16.486042 ns
Misaligned load v2 (bad) byte copy
Elapsed = 9.176913 ns
Misaligned load good
Elapsed = 5.243858 ns

However, we've done some work in the compiler on better identification of potential RAW hazards. If I recompile using the 12.4 Beta compiler I get the following results:

Misaligned load v1 (bad) memcpy()
Elapsed = 4.756911 ns
Misaligned load v2 (bad) byte copy
Elapsed = 5.005309 ns
Misaligned load good
Elapsed = 5.597687 ns

All three variants of the code produce the same performance - the RAW hazards have been eliminated!

Tuesday Mar 25, 2014

Solaris Studio 12.4 Beta now available

The beta programme for Solaris Studio 12.4 has opened. So download the bits and take them for a spin!

There's a whole bunch of features - you can read about them in the what's new document, but I just want to pick a couple of my favourites:

  • C++ 2011 support. If you've not read about it, C++ 2011 is a big change. There's a lot of stuff that has gone into it - improvements to the Standard library, lambda expressions etc. So there is plenty to try out. However, there are some features not supported in the beta, so take a look at the what's new pages for C++
  • Improvements to the Performance Analyzer. If you regularly read my blog, you'll know that this is the tool that I spend most of my time with. The new version has some very cool features; these might not sound much, but they fundamentally change the way you interact with the tool: an overview screen that helps you rapidly figure out what is of interest, improved filtering, mini-views which show more slices of the data on a single screen, etc. All of these massively improve the experience and the ease of using the tool.

There's a couple more things. If you try out features in the beta and you want to suggest improvements, or tell us about problems, please use the forums. There is a forum for the compiler and one for the tools.

Oh, and watch this space. I will be writing up some material on the new features....

Wednesday Feb 26, 2014

Multicore Application Programming available in Chinese!

This was a complete surprise to me. A box arrived on my doorstep, and inside were copies of Multicore Application Programming in Chinese. They look good, and have a glossy cover rather than the matte cover of the English version.

Article on RAW hazards

Feels like it's been a long while since I wrote up an article for OTN, so I'm pleased that I've finally got around to fixing that.

I've written about RAW hazards in the past. But I recently went through a patch of discovering them in a number of places, so I've written up a full article on them.

What is "nice" about RAW hazards is that once you recognise them for what they are (and that's the tricky bit), they are typically easy to avoid. So if you see 10 seconds of time attributable to RAW hazards in the profile, then you can often get the entire 10 seconds back by tweaking the code.

Wednesday Sep 11, 2013

Presenting at UK Oracle User Group meeting

I'm very excited to have been invited to present at the UK Oracle User Group conference in Manchester, UK on 1-4 December.

Currently I'm down for two presentations:

As you might expect, I'm very excited to be over there, I've not visited Manchester in about 20 years!

Tuesday Aug 27, 2013

My Oracle Open World and JavaOne schedule

I've got my schedule for Oracle Open World and JavaOne:

Note that on Thursday I have about 30 minutes between my two talks, so expect me to rush out of the database talk in order to get to the Java talk.

Friday Aug 09, 2013

How to use a lot of threads ....

The SPARC M5-32 has 1,536 virtual CPUs. There is a lot that you can do with that much resource and a bunch of us sat down to write a white paper discussing the options.

There are a couple of key observations in there. First of all it is quite likely that such a large system will not end up running a single instance of a single application. Therefore it is useful to understand the various options for virtualising the system. The second observation is that there are a number of details to take care of when writing an application that is expected to scale to large numbers of threads.

Anyhow, I hope you find it a useful read.

Monday Jul 08, 2013

JavaOne and Oracle Open World 2013

I'll be up at both JavaOne and Oracle Open World presenting. I have a total of three presentations:

  • Mixed-Language Development: Leveraging Native Code from Java
  • Performance Tuning Where Java Meets the Hardware
  • Developing Efficient Database Applications with C and C++

I'm excited by these opportunities - particularly working with Charlie Hunt diving into Java Performance.

Wednesday Dec 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processor, T3, and T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advise is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Wednesday Dec 05, 2012

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out

real           6.06
user           6.04
sys            0.01

Now most people will have heard of the optimised maths library which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions, in this instance, using the library doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out

real           2.70
user           2.69
sys            0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line:

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out

real           6.02
user           6.01
sys            0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out 

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, then the linker has an unresolved symbol defined in fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

  • Object files
  • Archive libraries
  • Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libaries.

Tuesday Dec 04, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second; which is about 65 - pretty much what I'd expect for this system.

It's also interesting to examine at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication how to interpret profiles from apps doing I/O. It's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, where as write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas, for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE 10*1024*1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}


int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There's three tests in this example, the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffering I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.

Thursday Nov 01, 2012

Rick Hetherington on the T5

There's an interview with Rick Hetherington about the new T5 processor. Well worth a quick read.

Thursday Oct 18, 2012

Mixing Java and native code

This was a bit of surprise to me. The slides are available from my presentation at JavaOne on mixed language development. What I wasn't expecting was that there would also be a video of the presentation.

Maximising T4 performance

My presentation from Oracle Open World is available for download.

Wednesday Aug 29, 2012

CON6714 - Mixed-Language Development: Leveraging Native Code from Java

Here's the abstract from my JavaOne talk:

There are some situations in which it is necessary to call native code (C/C++ compiled code) from Java applications. This session describes how to do this efficiently and how to performance-tune the resulting applications.

The objectives for the session are:

  • Explain reasons for using native code in Java applications
  • Describe pitfalls of calling native code from Java
  • Discuss performance-tuning of Java apps that use native code

I'll cover how to call native code from Java, debugging native code, and then I'll dig into performance tuning the code. The talk is not going too deep on performance tuning - focusing on the JNI specific topics; I'll do a bit more about performance tuning in my OpenWorld talk later in the day.

Monday Aug 27, 2012

Monday, 1st October: Presenting at JavaOne and Oracle Open World

On Monday 1 October I will be presenting at both JavaOne and Oracle Open World. The full conference schedule is available from here. The logistics for my sessions are as follows:


  • JavaOne: 8:30am Monday 1 October. CON6714: "Mixed-Language Development: Leveraging Native Code from Java". San Francisco Hilton - Continental Ballroom 6
  • Oracle OpenWorld: 10:45am Monday 1 October. CON6382: "Maximizing Your SPARC T4 Oracle Solaris Application Performance". Marriott Marquis - Golden Gate C3

Hope to see you there!

Thursday May 17, 2012

Solaris Developer talk next week

Vijay Tatkar will be talking about developing on Solaris next week Tuesday at 9am PST.

Monday Apr 23, 2012

sincos()

If you are computing both the sine and cosine of an angle, then you will be twice as quick if you call sincos() than if you call cos() and sin() independently:

#include 

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { b=sin(a); c=cos(a); }
}

$ cc -O sc.c -lm
$ timex ./a.out
real          19.13

vs

#include 

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { sincos(a,&b,&c); }
}
$ cc -O sc.c -lm
$ timex ./a.out
real           9.80

Friday Apr 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44, the default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bit or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. It does not restrict the address range for pointers (dynamically allocated memory) - they remain 64-bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64-bits, hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44 bit address can be generated in four instructions, so slightly cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0

Monday Apr 02, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions, well we know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" keyword to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Friday Mar 30, 2012

Inline template efficiency

I like inline templates, and use them quite extensively. Whenever I write code with them I'm always careful to check the disassembly to see that the resulting output is efficient. Here's a potential cause of inefficiency.

Suppose we want to use the mis-named Leading Zero Detect (LZD) instruction on T4 (this instruction does a count of the number of leading zero bits in an integer register - so it should really be called leading zero count). So we put together an inline template called lzd.il looking like:

.inline lzd
  lzd %o0,%o0
.end

And we throw together some code that uses it:

int lzd(int);

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

We compile the code with some amount of optimisation, and look at the resulting code:

$ cc -O -xtarget=T4 -S lzd.c lzd.il
$ more lzd.s
                        .L77000018:
/* 0x001c         11 */         lzd     %o0,%o0
/* 0x0020          9 */         ld      [%i1],%i3
/* 0x0024         11 */         st      %o0,[%i2]
/* 0x0028          9 */         add     %i3,1,%i0
/* 0x002c            */         cmp     %i0,999
/* 0x0030            */         ble,pt  %icc,.L77000018
/* 0x0034            */         st      %i0,[%i1]

What is surprising is that we're seeing a number of loads and stores in the code. Everything could be held in registers, so why is this happening?

The problem is that the code is only inlined at the code generation stage - when the actual instructions are generated. Earlier compiler phases see a function call. The called functions can do all kinds of nastiness to global variables (like 'a' in this code) so we need to load them from memory after the function call, and store them to memory before the function call.

Fortunately we can use a #pragma directive to tell the compiler that the routine lzd() has no side effects - meaning that it does not read or write to memory. The directive to do that is #pragma no_side_effect(<routine name>), and it needs to be placed after the declaration of the function. The new code looks like:

int lzd(int);
#pragma no_side_effect(lzd)

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

Now the loop looks much neater:

/* 0x0014         10 */         add     %i1,1,%i1

!   11                !  {
!   12                !    c=lzd(c);

/* 0x0018         12 */         lzd     %o0,%o0
/* 0x001c         10 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000018
/* 0x0024            */         nop

Friday Mar 23, 2012

Malloc performance

My co-conspirator Rick Weisner has written up a nice summary of malloc performance, including evaluating the mtmalloc changes that we got into S10 U10.

Friday Jan 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that I didn't mention in the article, it's implied but not stated, is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing
  nop
.end

And here's some code that calls it:

void nothing();

int main()
{
  nothing();
}

The code works when compiled as C, but not as C++:

$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++ we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
  {
#endif
    void nothing();
#ifdef __cplusplus
  }
#endif

int main()
{
  nothing();
}

Thursday Jan 12, 2012

Please mind the gap

I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

Timeline view

One of my colleagues pointed out that it is possible to figure out what is causing the gaps. The call stack is indicated by the event after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.

#include <stdio.h>

void write_file()
{
  char block[8192];
  FILE * file = fopen("./text.txt", "w");
  for (int i=0;i<1024; i++)
  {
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
}

void read_file()
{
  char block[8192];
  FILE * file = fopen("./text.txt", "rw");
  for (int i=0;i<1024; i++)
  {
    fread(block,sizeof(block),1,file);
    fseek(file,-sizeof(block),SEEK_CUR);
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
}

int main()
{
  for (int i=0; i<100; i++)
  {
    write_file();
    read_file();
  }
}

This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we determine that the gaps are caused by the application either opening or closing the file.

_close

But that is not all that is going on, if we look at the information shown in the Timeline details panel for the Duration of the event we can see that it spent 210ms in the "Other Wait" micro state. So we've now got a pretty clear idea of where the time is coming from.

Tuesday Jan 10, 2012

What's inlined by -xlibmil

The compiler flag -xlibmil provides inline templates for some critical maths functions, but it comes with the optimisation that it does not set errno for these functions. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and determine whether you care that they don't set errno. You can see the list of functions using the command:

grep inline /compilerpath/prod/lib/libm.il
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1
...

From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it set errno, then don't use -xlibmil.

Wednesday Nov 02, 2011

Welcome to the (System) Developer's Edge

The Developer's Edge went out of print a while back. This was obviously frustrating, not just for me, but for the folks who contacted me asking what happened. Well, I'm thrilled to be able to announce that it's available as a pdf download.

This is essentially the same book as was previously available. I've not updated the links back to the original articles. It would have been problematic, in some instances the original articles no longer exist. There are only two significant changes, the first is the branding has been changed (there's no cover art, which keeps the download small). The second is the title of the book has been modified to include the word "system" to indicate that its focused towards the hardware end of the stack.

I hope you enjoy the System Developer's Edge.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs