Thursday Mar 05, 2009

Peak rate of window spill and fill traps

I've been looking at the performance of a code recently. Written in C++ using many threads. One of the things with C++ is that as a language it encourages developers to have lots of small routines. Small routines lead to many calls and returns; and in particular they lead to register window spills and fills. Read more about register windows. Anyway, I wondered what the peak rate of issuing register window spills and fills was?

I'm going to use some old code that I used to examine the cost of library calls a while back. The code uses recursion to reach a deep call depth. First off, I define a couple of library routines:

extern int jump2(int count);

int jump1(int count)
{
  count--;
  if (count==0)
  {
    return 1;
  }
  else
  {
    return 1+jump2(count);
  }
}
and
int jump1(int count);

int jump2(int count)
{
  count--;
  if (count==0)
  {
    return 1;
  }
  else
  {
    return 1+jump1(count);
  }
}

I can then turn these into libraries:

 
$ cc -O -G -o libjump1.so jump1.c
$ cc -O -G -o libjump2.so jump2.c

Done in this way, both libraries have hanging dependencies:

$ ldd -d libjump1.so
        symbol not found: jump2         (./libjump1.so)

So the main executable will have to resolve these. The main executable looks like:

#include <stdio.h>

#define RPT 100
#define SIZE 600

extern int jump1(int count);

int main()
{
  int index,count,links,tmp;
  tmp=jump1(100);
  #pragma omp parallel for default(none) private(count,index,tmp,links)
  for(links=1; links<10000; links++)
  {
   for (count=0; count<RPT; count++)
   {
     for (index=0;index<SIZE;index++) 
     {
       tmp=jump1(links);
       if (tmp!=links) {printf("mismatch\\n");}
     }
   }
  }
}

This needs to be compiled in such a way as to resolve the unresolved dependencies

$ cc -xopenmp=noopt -o par par.c -L. -R. -ljump1 -ljump2

Note that I'm breaking rules again by making the runtime linker look for the dependent libraries in the current directory rather than use $ORIGIN to locate them relative to the executable. Oops.

I'm also using OpenMP in the code. The directive tells the compiler to make the outer loop run in parallel over multiple processors. I picked 10,000 for the trip count so that the code would run for a bit of time, so I could look at the activity on the system. Also note that the outer loop defines the depth of the call stack, so this code will probably cause stack overflows at some point, if not before. Err...

I'm compiling with -xopenmp=noopt since I want the OpenMP directive to be recognised, but I don't want the compiler to use optimisation, since if the compiler saw the code it would probably eliminate most of it, and that would leave me with nothing much to test.

The first thing to test is whether this generates spill fill traps at all. So we run the application and use trapstat to look at trap activity:

vct name           |   cpu13  
----------------------------------------
...
 84 spill-user-32       |  3005899  
...
 c4 fill-user-32        |  3005779 
...

So on this 1.2GHz UltraSPARC T1 system, we're getting 3,000,000 traps/second. The generated code is pretty plain except for the save and restore instructions:

jump1()
         21c:  9d e3 bf a0  save        %sp, -96, %sp
         220:  90 86 3f ff  addcc       %i0, -1, %o0
         224:  12 40 00 04  bne,pn      %icc,jump1+0x18 ! 0x234
         228:  01 00 00 00  nop       

         22c:  81 c7 e0 08  ret       
         230:  91 e8 20 01  restore     %g0, 1, %o0

         234:  40 00 00 00  call        jump2
         238:  01 00 00 00  nop       
         23c:  81 c7 e0 08  ret       
         240:  91 ea 20 01  restore     %o0, 1, %o0

So you can come up with an estimate of 300 ns/trap.

The reason for using OpenMP is to enable us to scale the number of active threads. Rerunning with 32 threads, by setting the environment variable OMP_NUM_THREADS to be 32, we get the following output from trapstat:

vct name                |    cpu21    cpu22    cpu23    cpu24    cpu25    cpu26 
------------------------+-------------------------------------------------------
...
 84 spill-user-32       |  1024589  1028081  1027596  1174373  1029954  1028695
...
 c4 fill-user-32        |   996739   989598   955669  1169058  1020349  1021877

So we're getting 1M traps per thread, with 32 threads running. Let's take a look at system activity using vmstat.

 vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s2 s3 s4   in   sy   cs us sy id
...
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3022  427  812 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3020  428  797 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 2945  457  760 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3147  429 1025 99  1  0
 0 0 0 64800040 504168 0 15  0  0  0  0  0  0  0  0  0 3049  666  820 99  1  0
 0 0 0 64800040 504168 0  1  0  0  0  0  0  0  0  0  0 3044  543  866 100 0  0
 0 0 0 64800040 504168 0  0  0  0  0  0  0  0  0  0  0 3021  422  798 100 0  0

So there's no system time being recorded - the register spill and fill traps are fast traps, so that's not a surprise.

One final thing to look at is the instruction issue rate. We can use cpustat to do this:

  2.009   6  tick  63117611 
  2.009  12  tick  69622769 
  2.009   7  tick  62118451 
  2.009   5  tick  64784126 
  2.009   0  tick  67341237 
  2.019  17  tick  62836527 

As might be expected from the cost of each trap, and the sparse number of instructions between traps, the issue rate of the instructions is quite low. Each of the four threads on a core is issuing about 65M instructions per second. So the core is issuing about 260M instructions per second - that's about 20% of the peak issue rate for the core.

If this were a real application, what could be done? Well, obviously the trick would be to reduce the number of calls and returns. At a compiler level, that would using flags that enable inlining - so an optimisation level of at least -xO4; adding -xipo to get cross-file inlining; using -g0 in C++ rather than -g (which disables front-end inlining). At a more structural level, perhaps the way the application is broken into libraries might be changed so that routines that are frequently called could be inlined.

The other thing to bear in mind, is that this code was designed to max out the register window spill/fill traps. Most codes will get nowhere near this level of window spills and fills. Most codes will probably max out at about a tenth of this level, so the impact from register window spill fill traps at that point will be substantially reduced.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
5
6
8
9
10
12
13
14
15
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
Bookmarks
The Developer's Edge
Solaris Application Programming
Publications
Webcasts
Presentations
OpenSPARC Book
Multicore Application Programming
Docs