Friday Oct 21, 2011

Endianness

SPARC and x86 processors have different endianness. SPARC is big-endian and x86 is little-endian. Big-endian means that numbers are stored with the most significant data earlier in memory. Conversely little-endian means that numbers are stored with the least significant data earlier in memory.

Think of big endian as writing numbers the way we normally do. For example, one thousand one hundred and twenty would be written as 1120 in a big-endian format. Written as little endian it would be 0211 - the least significant digits are recorded first.

For machines, this relates to which bytes are stored first. To make data portable between machines, a format needs to be agreed. For example in networking, data is defined as being big-endian. So to handle network packets, little-endian machines need to convert the data before using it.
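For 32-bit and 16-bit quantities the sockets API already provides this conversion through htonl()/ntohl() and htons()/ntohs(). Here is a minimal sketch of their use - the value is purely illustrative:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main()
{
  uint32_t wire = htonl(0x12345678);  /* value as it would travel on the network */
  uint32_t host = ntohl(wire);        /* convert back to host byte order */
  printf("%08x\n", host);             /* prints 12345678 on either endianness */
  return 0;
}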

Converting the bytes is a trivial matter, but it has some performance pitfalls. Let's start with a simple way of doing the conversion.

template <class T>
T swapslow(T in)
{
  T out;
  char * pcin = (char*)&in;
  char * pcout = (char*)&out;

  for (int i=0; i<sizeof(T); i++)
  {
    pcout[i] = pcin[sizeof(T)-1-i];
  }
  return out;
}

The code uses templates to generalise it to different sizes of integers. But the following observations hold even if you use a C version for a particular size of input.

First thing to look at is instruction count. Assume I'm dealing with ints. I store the input to memory, then I access the input one byte at a time, storing each byte to a new location in memory, before finally loading the result. So for an int, I've got 10 memory operations.

Memory operations can be costly. Processors may be limited to only issuing one per cycle. In comparison most processors can issue two or more logical or integer arithmetic instructions per cycle. Loads are also costly as they have to access the cache, which takes a few cycles.

The other issue is more subtle, and I've discussed it in the past. There are RAW issues in this code. I'm storing an int, but loading it as four bytes. Then I'm storing four bytes, and loading them as an int.

A RAW hazard is a read-after-write hazard. The processor sees data being stored, but cannot convert that stored data into the format that the subsequent load requires. Hence the load has to wait until the result of the store reaches the cache before the load can complete. This can be multiple cycles of wait.

With endianness conversion, the data is already in the registers, so we can use logical operations to perform the conversion. This approach is shown in the next code snippet.

template <class T>
T swap(T in)
{
  T out=0;
  for (int i=0; i<sizeof(T); i++)
  {
    out<<=8;
    out|=(in&255);
    in>>=8;
  }
  return out;
} 

In this case, we avoid the stores and loads, but instead we perform four logical operations per byte. That is more work per byte than a load and a store. However, we can usually issue more logical operations per cycle, and each one normally takes a single cycle to complete. Overall, this is probably slightly faster than loads and stores.

However, you will usually see a greater performance gain from avoiding the RAW hazards. RAW hazards are hardware dependent - some processors may be engineered to avoid them - so you may only see the problem on particular hardware. Which means that your application will run well on one machine, but poorly on another.
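For completeness, here is a minimal (assumed) harness exercising the logical-operation version of swap(); the test value is purely illustrative:

#include <stdint.h>
#include <stdio.h>

template <class T>
T swap(T in)
{
  T out=0;
  for (size_t i=0; i<sizeof(T); i++)
  {
    out<<=8;
    out|=(in&255);
    in>>=8;
  }
  return out;
}

int main()
{
  uint32_t v = 0x11223344;
  printf("%08x\n", swap(v));  // expect 44332211
  return 0;
}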

Differences between the various STL options on Solaris

Steve Clamage has provided a nice summary of the trade-offs between the various STL options. I'll summarise it here:

  • Default STL. Available as part of the OS so does not require a separate library to be shipped with the application. However, does not support the standard.
  • -library=stlport4 Much better conformance with the standard, but no internationalisation. Must be distributed with applications that use it.
  • -library=stdcxx4 (Apache). Complete implementation of standard. Available on S10U10 and onwards.

I'd also add that stlport4 and stdcxx4 typically have much better performance than the default library.

The other point that bears repetition is that you can only include one STL per application. So you cannot use different implementations for different libraries or for the application.
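As a quick illustration (the file names are made up), you select an implementation by passing the flag consistently when compiling and when linking:

$ CC -library=stlport4 -c module.cpp
$ CC -library=stlport4 -o app main.cpp module.o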

Tuesday Aug 09, 2011

Standards and headers

Every so often I encounter, or hear about, a problem with function definitions when the standard header files are included. Most often it's mmap, but sometimes it's something else. Every time, I think that I should write something up. Well, it's finally happened: a short paper on how to write portable code using the standard headers.

Monday Aug 01, 2011

Standard header files

Interesting (but old) blog post about the standard header files included with Solaris.

Oracle Solaris Studio 12.3 Beta Programme

Last week, we started the beta programme for Oracle Solaris Studio 12.3. You can participate by downloading the software and reporting any issues.

As with any release, there are a lot of incremental improvements wherever we found opportunities, and there are a couple of new features. The two most interesting new features are:

  • The Code Analyzer, which reports possible errors in your application, both dynamic (i.e. memory access errors) and static. The static error detection is the newest feature; it goes beyond the compile-time warnings or lint messages and does much more detailed compile-time analysis of your code.
  • Remote development on Windows. I've yet to try out this feature, but the IDE can run on a Windows box while seamlessly compiling and running on a remote server. In fact, the improvements in the IDE are well worth a look.

Some of the Studio team are giving a webcast on Thursday 4th August at 9am PDT.

Tuesday Jul 12, 2011

Feature Test Macros

Feature test macros are a set of macros that are either:

  • Defined by the development environment indicating that the environment conforms to a particular standard

or

  • Defined by the source code for the application before the header files are included to indicate that the application requires a particular environment to build

The macros define what APIs are available, and what parameters are passed through the APIs. Adherence to a particular standard (like POSIX) will define a particular set of APIs, and define their parameters. A good example of this is on Solaris where munmap changes definition depending on what standards have been requested:

$ grep munmap /usr/include/sys/*.h
/usr/include/sys/mman.h:extern int munmap(void *, size_t);
/usr/include/sys/mman.h:extern int munmap(caddr_t, size_t);
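As a minimal sketch (the function name is made up), an application that wants the standard-conforming definition defines the appropriate macro before including any headers:

#define _POSIX_C_SOURCE 200112L   /* request the POSIX environment */

#include <stddef.h>
#include <sys/mman.h>

int release(void *addr, size_t len)
{
  return munmap(addr, len);       /* picks up the (void *, size_t) prototype */
}

The essential point is that the define has to appear before the first header is included; defining it afterwards has no effect.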

The Linux man page for feature_test_macros includes useful source code (ftm.c) for reporting which feature test macros are set by default. This changes depending on the OS and compiler used. One of the big differences between Linux and Solaris is the set of feature test macros that are defined by default. Here's the output from the program compiled on a Linux box and a Solaris box - both using gcc.

Linux

$ gcc ftm.c
$ ./a.out
_POSIX_SOURCE defined
_POSIX_C_SOURCE defined: 200809L
_BSD_SOURCE defined
_SVID_SOURCE defined
_ATFILE_SOURCE defined

Solaris

$ gcc ftm.c
$ ./a.out
_LARGEFILE64_SOURCE defined
_FILE_OFFSET_BITS defined: 32

The list of standards that Solaris 10 adheres to is documented under man standards, the list for Linux is documented under man feature_test_macros.

Wednesday Jun 29, 2011

Library initialisation in C++ - libraries and linking part 5

Part 5 of the series of articles on linking and libraries is up. This one gets into the details of what can go wrong when writing libraries in C++. The key takeaways from the article are to use:

  • LD_DEBUG=init to view runtime initialisation
  • LD_DEBUG=bindings to examine how symbols are bound to libraries at runtime

Wednesday Jun 01, 2011

Avoiding problems at linktime (part 4 in series)

Part 4 in the series on best practices for linking is available. The key takeaways are:

  • Avoid defining duplicate symbols. The Solaris tool lari will produce a report on this issue (besides doing a bundle of other stuff). The problem with multiple definitions of a symbol is that it is not defined which definition will be picked at runtime. The choice is often deterministic on a particular platform, but could change on a different platform.
  • Always define libraries as a hierarchy, with no circular dependencies. If there are circular dependencies the libraries may get loaded in an unpredictable order.

Friday May 27, 2011

Using LD_DEBUG to examine application startup (linking best practices part 3)

Part 3 of the series on best practices for linking C/C++ applications is up. This section focuses on using LD_DEBUG to examine application startup.

The paper talks about the options LD_DEBUG=init, which shows the initialisation and finalisation stages of an application's run, and LD_DEBUG=bindings, which shows how the symbols are bound between the application and libraries.

Wednesday May 11, 2011

Best practices for linking libraries (part 1)

A while ago I was looking into some application start up problems. The problem turned out to be an issue relating to the order in which the libraries were loaded and initialised. It seemed to me that this was a rather tricky area, and it would be very helpful to document the best practices around it. I thought this would be a quick couple of pages, but it turned out to be a rather high page count, and I ended up working on the document with Steve Clamage (with Rod Evans helping out).

The first part of the document is available. This section covers basic linker good practices: using -L and -R rather than LD_LIBRARY_PATH, generating relocatable code, and so on. The key takeaways are:

  • Use -L to specify the path to where the libraries can be found at compile time.
  • Use -R to specify the location of the libraries at run time.
  • Use the token $ORIGIN to specify a relative path for the libraries' location. This avoids the need to have a hard-coded location where the libraries can be found.
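As a hedged example (the paths and library name are made up), those three points translate into a link line along these lines:

$ cc -o app app.c -L./lib -R'$ORIGIN/lib' -lexample

The quoting matters: $ORIGIN needs to reach the linker literally rather than being expanded by the shell.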

Monday Apr 25, 2011

Using pragma opt

The Studio compiler has the ability to control the optimisation level that is applied to particular functions in an application. This can be useful if the functions are designed to work at a specific optimisation level, or if the application fails at a particular optimisation level, and you need to figure out where the problem lies.

The optimisation levels are controlled through pragma opt. The following steps need to be followed to use the pragma:

  • The directive needs to be inserted into the source file. The format of the directive is #pragma opt level (function). This needs to be placed before the start of the function definition, but after the function's prototype has been declared.
  • The code needs to be compiled with the flag -xmaxopt=level. This sets the maximum optimisation level for all functions in the file - including those tagged with #pragma opt.

We can see this in action using the following code snippet. This contains two identical functions, both of which return the square of a global variable. However, we are using #pragma opt to control the optimisation level of the function f().

int f();
int g();

#pragma opt 2 (f)

int d;

int f()
{
  return d*d;
}

int g()
{
  return d*d;
}

The code is compiled with the flag -xmaxopt=5, which specifies the maximum optimisation level that can be applied to any function in the file.

$ cc -O -xmaxopt=5 -S opt.c

If we compare the disassembly for the functions f() and g(), we can see that g() is more optimal as it does not reload the global data.

                        f:
/* 000000          0 */         sethi   %hi(d),%o5

!   10                !  return d*d;

/* 0x0004         10 */         ldsw    [%o5+%lo(d)],%o4 ! volatile    // First load of d
/* 0x0008            */         ldsw    [%o5+%lo(d)],%o3 ! volatile    // Second load of d
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o4,%o3,%o0

                        g:
/* 000000         14 */         sethi   %hi(d),%o5
/* 0x0004            */         ld      [%o5+%lo(d)],%o4               // Single load of d

!   15                !  return d*d;

/* 0x0008         15 */         sra     %o4,0,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         mulx    %o3,%o3,%o0

Saturday Jan 15, 2011

pginfo & pgstat

A couple of very welcome commands crept into Solaris 11. They are pginfo and pgstat (these are links to the Oracle documentation site).

The two commands deal with "processor groups". This seems a bit of a misnomer to me, as they are really about CPU topology. They report information and utilisation stats demonstrating the resource sharing going on on the system, and how threads are using those resources. It's probably easiest to use a couple of examples from the man pages to show this. First off, pginfo:

$ pginfo -p -T
0 (System) CPUs: 0-31
`-- 3 (Data_Pipe_to_memory [system,chip]) CPUs: 0-31
    `-- 2 (Floating_Point_Unit [system,chip]) CPUs: 0-31
        |-- 1 (Integer_Pipeline [core]) CPUs: 0-3
        |-- 4 (Integer_Pipeline [core]) CPUs: 4-7
        |-- 5 (Integer_Pipeline [core]) CPUs: 8-11
        |-- 6 (Integer_Pipeline [core]) CPUs: 12-15
        |-- 7 (Integer_Pipeline [core]) CPUs: 16-19
        |-- 8 (Integer_Pipeline [core]) CPUs: 20-23
        |-- 9 (Integer_Pipeline [core]) CPUs: 24-27
        `-- 10 (Integer_Pipeline [core]) CPUs: 28-31

This shows a processor with 32 virtual CPUs, sharing a single floating point pipeline, 8 cores with a single integer pipe each - looks like an UltraSPARC T1 to me.

The output from pgstat shows the utilisation of the processor:

$ pgstat 1 2
PG  RELATIONSHIP            HW    SW  CPUS
 0  System                   -  0.4%  0-31
 3   Data_Pipe_to_memory     -  0.4%  0-31
 2    Floating_Point_Unit   0%  0.4%  0-31
 1     Integer_Pipeline     0%    0%  0-3
 4     Integer_Pipeline     0%    0%  4-7
 5     Integer_Pipeline     0%    0%  8-11
 6     Integer_Pipeline     0%  0.2%  12-15
 7     Integer_Pipeline     0%    0%  16-19
 8     Integer_Pipeline   2.8%  2.7%  20-23
 9     Integer_Pipeline   0.1%  0.2%  24-27
10     Integer_Pipeline     0%    0%  28-31

It reports both software utilisation - the work the operating system has assigned to the cores - and pipeline utilisation measured using the hardware counters. Pipeline utilisation indicates whether the core is saturated or not - a pipeline can become fully utilised before the core is running its maximal number of threads.

I'm pleased to see these tools appear. It is useful to have tools that report the topology of the system, and it is great to see tools that report actual hardware utilisation. On earlier releases of Solaris you can always use corestat to get similar data.

Sunday Nov 28, 2010

Multicore Application Programming: Source code available

I've just uploaded all the source code to the examples in Multicore Application Programming. About 160 files.

Friday Nov 12, 2010

Stopping whichs

I was using a tool the other day, and I started it in the background. It didn't come up and, when I looked, it had stopped. When this has happened in the past, foregrounding the job has let it continue working. I've only noticed this on rare occasions, and I'd previously put it down to some misconfiguration of the system. However, one of my colleagues had also noticed it, so this was the ideal opportunity to figure out what was really going on.

The first step was to identify which process was stopped using jobs -l:

$ jobs -l
[1]- 25195 Running                 process1 &
[2]+ 25223 Stopped (tty output)    process2

Having done that, the next step was to find out where the process had actually stopped. This information can be obtained using ptree which prints out the process call tree:

$ ptree 25223
511   /usr/lib/ssh/sshd
   25160 /usr/lib/ssh/sshd
     25161 /usr/lib/ssh/sshd
       25166 -bash
         25223 /bin/sh process2
           25232 sed -n $p
             25233 /usr/bin/csh -f /usr/bin/which java java
               25234 /usr/bin/stty erase \^H 

So the process has stalled in stty setting the erase character to be \^H. The callstack, printed by pstack, was not very enlightening.

$ pstack 25234
25234:  /usr/bin/stty erase \^H
  feef14d7 ioctl    (0, 540f, 8067988)
  080516f8 main     (3, 8047b24, 8047b34, 80511ff) + 40c
  0805125d _start   (3, 8047c08, 8047c16, 8047c1c, 0, 8047c1f) + 7d 

However, the interesting step is from which to stty. It turns out that which is a C-shell script, and the relevant bit is the following:

#! /usr/bin/csh -f
#...
if ( -r ~/.cshrc && -f ~/.cshrc ) source ~/.cshrc

So which sources the .cshrc file, and my .cshrc file happened to contain stty erase \^H. So why does this cause the process to stop?

Well, stty controls the characteristics of the terminal, but when the script is executing in the background, there is no terminal. When there's no terminal, stty stops and waits for one to appear!

The easiest fix is to move the call to stty into my .login file. The .login file is only parsed at login, and not every time a shell is started. Alternatively, it's possible to check for the existence of a prompt:

if ($?prompt) then
  if ("$prompt" =~ ?*) then
    /usr/bin/stty erase \^H
  endif
endif

Wednesday Nov 10, 2010

Slides from Solaris Summit available

I found out about the Solaris Summit too late to be able to attend, but the slides are now on line.

Tuesday Nov 09, 2010

Multicore application programming: sample chapter

No sign of the actual books yet - I expect to see them any day now - but there's a sample chapter up on the informit site. There's also a pdf version which includes preface and table of contents.

This is chapter 3, "Identifying opportunities for parallelism". The opportunities range from the various OS-level approaches, through virtualisation, to multithreaded and multiprocess programming. It's this flexibility that makes multicore processors so appealing: you can take advantage of them through some consolidation of existing applications, or, as a developer, through scaling a single application.

mtmalloc performance

A while back I discussed how the performance of mtmalloc could be improved. Well, Rick Weisner was working on this, so I provided him with a fix for my hot issue. I'm very pleased to see, from the bug status, that this code was integrated last month!

Wednesday Sep 08, 2010

Oracle Solaris Studio 12.2

It's been just over a year since the release of Studio 12 Update 1, and today we are releasing the first Oracle-branded Studio release - Oracle Solaris Studio 12.2. For the previous release I wrote a post for the AMD site looking at the growth in multicore processors. It seemed appropriate to take another look at this.

The chart below shows the cumulative number of SPECint2006 results broken down by the number of cores for each processor. This data does not represent the number of different types of processor that are available, since the same processor can be used in many different results. It is closer to a snapshot of how the market for multicore processors is growing. Each data point represents a system, so the curve approximates the number of different systems that are being released.

It's perhaps more dramatic to demonstrate the change using a stacked area chart. The chart perhaps overplays the number of single core results, but this is probably fair as "single core" represents pretty much all the results prior to the launch of CPU2006. So what is readily apparent is the rapid decline in the number of single core results, the spread of dual, and then quad core. It's also interesting to note the beginning of a spread of more than quad core chips.

If we look at what is happening with multicore processors in the context of what we are releasing with Solaris Studio, there's a very nice fit of features. We continue to refine our support for OpenMP and automatic parallelisation. We've been providing data race (and deadlock) detection through the Thread Analyzer for a couple of releases. The debugger and the performance analyzer have been fine with threads for a long time. The performance analyzer has the time line view which is wonderful for examining multithreaded (or multiprocess) applications.

In addition to these fundamentals Studio 12.2 introduces a bunch of new features. I discussed some of these when the express release came out:

  • For those who use the IDE, integration of support for the analysis of the runtime behaviour of applications has been very useful. It both provides more information directly back to the developer, and raises awareness of the available tools.
  • Understanding call trees is often an important part of interpreting the performance of the application. Being able to drill down the call tree has been a very useful extension to the Performance Analyzer.
  • Memory error checking is critical for all applications. The trouble with memory access errors is that, like data races, the "problem" is visible arbitrarily far from the point where the error occurred.

The release of a new version of a product is always an exciting time. It's a culmination of a huge amount of analysis, development, and testing, and it's wonderful to finally see it available for others to use. So download it and let us know what you think!

Footnote: SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. Results from www.spec.org as of 6 September 2010 and this report.

Monday Jul 26, 2010

What does take_deferred_signal() mean in my profile?

Every so often you'll see take_deferred_signal() appear in the profile of an application, sometimes as quite a time-consuming function. So, what does it mean?

It actually comes from signal handling code in libc. If a signal comes in while the application is in a critical section, the signal gets deferred until the critical section is complete. When the application exits the critical section, all the deferred signals get taken.

Typically, this function becomes hot due to mutex locks in malloc and free, but other library calls can also cause it. The way to diagnose what is happening is to examine the call stack. So let's run through an example. Here is some multithreaded malloc/free heavy code.


#include <stdlib.h>
#include <pthread.h>

void *work( void* param )
{
  while ( 1 ) { free( malloc(100) ); }
}

int main()
{
  pthread_t thread;
  pthread_create( &thread, 0, work, 0 );
  for ( int i=0; i<10000000; i++ )
  {
    free ( malloc (100) );
  }
}

Profiling, we can see that take_deferred_signal() is the hottest function. The other hot functions would probably give us a clue as to the problem, but that is an artifact of the rather simple demonstration code.

Excl.     Incl.      Name
User CPU  User CPU
  sec.      sec.
36.456    36.456     <Total>
14.210    14.210     take_deferred_signal
 4.203    21.265     mutex_lock_impl
 3.082     3.082     clear_lockbyte
 2.872    17.062     mutex_trylock_adaptive

The next thing to look at is the call stack for take_deferred_signal() as this will tell us who is calling the function.

Attr.      Name
User CPU
  sec.
14.210     do_exit_critical
14.210    *take_deferred_signal

do_exit_critical() doesn't tell us much; we already know that it is called when the code exits a critical section. Continuing up the call stack we find:

Attr.      Name
User CPU
  sec.
14.190     mutex_trylock_adaptive
 0.020     mutex_unlock
 0.       *do_exit_critical
14.210     take_deferred_signal

This is more useful: we now know that the time is spent in mutex locks, but we don't yet know the user of those mutex locks. In this case the bulk of the time comes from mutex_trylock_adaptive(), so that is the routine to investigate:

Attr.      Name
User CPU
  sec.
17.062     mutex_lock_impl
 2.872    *mutex_trylock_adaptive
14.190     do_exit_critical

We're still in the mutex lock code, so we need to find out who is calling the mutex locks:

Attr.      Name
User CPU
  sec.
11.938     free
 9.327     malloc
 4.203    *mutex_lock_impl
17.062     mutex_trylock_adaptive

So we finally discover that the time is due to calls to mutex locks in malloc() and free().

Monday May 17, 2010

Multicore application programming: Table of contents

I've uploaded the current table of contents for Multicore Application Programming. You can find all the detail in there, but I think it's appropriate to talk about how the book is structured.

Chapter 1. The design of any processor has a massive impact on its performance. This is particularly true for multicore processors since multiple software threads will be sharing hardware resources. Hence the first chapter provides a whistle-stop tour of the critical features of hardware. It is important to do this up front as the terminology will be used later in the book when discussing how hardware and software interact.

Chapter 2. Serial performance remains important, even for multicore processors. There are two main reasons for this. The first is that a parallel program is really a bunch of serial threads working together, so improving the performance of the serial code will improve the performance of the parallel program. The second reason is that even a parallel program will have serial sections of code. The performance of the serial code will limit the maximum performance that the parallel program can attain.

Chapter 3. One of the important aspects of using multicore processors is identifying where the parallelism is going to come from. If you look at any system today, there are likely to be many active processes. So at one level no change is necessary: systems will automatically use multiple cores. However, we want to get beyond that, so the chapter discusses approaches like virtualisation as well as the more obvious approach of multithreaded or multiprocess programming. One message that needs to be broadcast is that multicore processors do not need a rewrite of existing applications. However, getting the most from a multicore processor may well require that.

Chapter 4. The book discusses Windows native threading, OpenMP, and automatic parallelisation, as well as the POSIX threads that are available on OS-X, Linux, and Solaris. Although the details do sometimes change across platforms, the concepts do not. This chapter discusses synchronisation primitives like mutex locks and so on, which avoids having to repeat the information in each of the implementation chapters.

Chapter 5. This chapter covers POSIX threads (pthreads), which are available on Linux, OS-X, and Solaris, as well as other platforms not covered in the book. The chapter covers multithreaded as well as multiprocess programming, together with methods of communicating between threads and processes.

Chapter 6. This chapter covers Windows native threading. The function names and the parameters that need to be passed to them are different to the POSIX API, but the functionality is the same. This chapter provides the same coverage for Windows native threads that chapter 5 provides for pthreads.

Chapter 7. The previous two chapters provide a low-level API for threading. This gives a great deal of control, but provides more opportunities for errors, and requires a considerable amount of code to be written for even the most basic parallel constructs. Automatic parallelisation and OpenMP place more of the burden of parallelisation on the compiler, and less on the developer. Automatic parallelisation is the ideal situation, where the compiler does all the work. However, there are limitations to this approach, and this chapter discusses the current limitations and how to make changes to the code that will enable the compiler to do a better job. OpenMP is a very flexible technology for writing parallel applications. It is widely supported and provides support for a number of different approaches to parallelism.

Chapter 8. Synchronisation primitives provided by the operating system or compiler can have high overheads. So it is tempting to write replacements. This chapter covers some of the potential problems that need to be avoided. Most applications will be adequately served by the synchronisation primitives already provided, the discussion in the chapter provides insight about how hardware, compilers, and software can cause bugs in parallel applications.

Chapter 9. The difference between a multicore system and a single core system is in its ability to simultaneously handle multiple active threads. The difference between a multicore system and a multiprocessor system is in the sharing of processor resources between threads. Fundamentally, the key attribute of a multicore system is how it scales to multiple threads, and how the characteristics of the application affect that scaling. This chapter discusses what factors impact scaling on multicore processors, and also what the benefits multicore processors bring to parallel applications.

Chapter 10. Writing parallel programs is a growing and challenging field. The challenges come from producing correct code and getting the code to scale to large numbers of cores. Some approaches address scaling to high numbers of cores, while others address the issue of producing correct code. This chapter discusses a large number of other approaches to programming parallelism.

Chapter 11. The concluding chapter of the book reprises some of the key points of the previous chapters, and tackles the question of how to write correct, scalable, parallel applications.

Wednesday Nov 25, 2009

Viewing thread activity in the Performance Analyzer

The Sun Studio Performance Analyzer is one of the two tools that I use most frequently (the other is spot - which is now in SS12U1!). It's a very powerful tool, but a lot of that power is not immediately visible to users. I'm going to discuss a couple of ways I've used the analyzer to view parallel applications.

The most common first step for looking at the performance of parallel apps is to use the timeline. However, the timeline can look a bit cluttered with all of the call stack data. Often you are really just interested in the leaf node. Fortunately this can be configured from the data presentation dialog box. To get the view I want I'm only showing the top leaf in the call stack:

This results in a display of the samples in each routine; by default this can look very colourful. You can make it easier on the eye by selecting the colours used to display the graphic. In the following graphic I've picked green for one parallel routine that I'm interested in, and blue for another, then used yellow to colour all the time waiting for more work to be assigned:

The graphic shows that the work is not evenly spread across all threads. The first few threads spend more time in the hot routines than the later threads. We can see this much more clearly using the 'threads' view of the data. To get this view you need to go back to the data presentation dialog and select the threads tab; it's also useful to select the 'cpus' tab at the same time.

The threads tab shows the activity of each thread for the currently displayed metrics. This is useful to see if one thread is doing more work than another. The cpus tab shows the time that the app spends on each CPU in the machine - this can indicate whether a particular CPU is oversubscribed. The thread activity looks like:

This confirms what we thought earlier: some of the threads are much more active than others. The top chart shows the user time, which indicates that all the threads spent the same amount of time running 'stuff'; the middle chart shows the time that each thread spent running useful work; the lower chart shows the time spent in overhead. The exercise now is to try and improve the distribution of work across the threads...

Monday Oct 19, 2009

Fishing with cputrack

I'm a great fan of the hardware performance counters that you find on most processors. Often you can look at the profile and instantly identify what the issue is. Sometimes though, it is not obvious, and that's where the performance counters can really help out.

I was looking at one such issue last week: the performance of the application was showing some variation, and it wasn't immediately obvious what the issue was. The usual suspects in these cases are:

  • Excessive system time
  • Process migration
  • Memory placement
  • Page size
  • etc.

Unfortunately, none of these seemed to explain the issue. So I hacked together a script, cputrackall, which ran the test code under cputrack for all the possible performance counters. I dumped the output into a spreadsheet and compared the fast and slow runs of the app. This is something of a "fishing trip" script, just gathering as much data as possible in the hope that something leaps out, but sometimes that's exactly what's needed. I regularly get to sit in front of a new chip before tools like ripc have been ported, and in those situations the easiest thing to do is to look for hardware counter events that might explain the runtime performance. In this particular instance, it helped me to confirm my suspicion that there was a difference in branch misprediction rates that was causing the issue.
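I'm not reproducing the actual script here, but a sketch of the same "fishing" idea might look like the following - the counter names are illustrative UltraSPARC ones, and the real list for your processor comes from cpustat -h:

#!/bin/sh
# Run the target command once per hardware counter event and capture the output.
# Edit EVENTS to match the counters reported by cpustat -h on your machine.
EVENTS="Cycle_cnt Instr_cnt DC_miss IC_miss"

for e in $EVENTS
do
  echo "=== $e ==="
  cputrack -c pic0=$e "$@"
done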

Wednesday Sep 30, 2009

Querying locality groups

Locality groups are a mechanism that provides Solaris with information about how the physical hardware is wired together. A locality group is a set of CPU and memory resources that share the same access characteristics - for example, a locality group might be all the CPUs and memory on a single chip.

The command to display the locality group information is lgrpinfo, but this is not available on Solaris 10. Here's an example of the output from that command:

% lgrpinfo
lgroup 0 (root):
        Children: 1 2
        CPUs: 0-7
        Memory: installed 16G, allocated 3.8G, free 12G
        Lgroup resources: 1 2 (CPU); 1 2 (memory)
        Latency: 90
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-3
        Memory: installed 8.0G, allocated 1.8G, free 6.2G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.263
        Latency: 54
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 4-7
        Memory: installed 8.0G, allocated 2.0G, free 6.0G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load:    0
        Latency: 54

It is possible to access this programmatically:

#include <sys/lgrp_user.h>
#include <stdio.h>
#include <stdlib.h>

void explore(lgrp_cookie_t cookie,lgrp_id_t node,int level)
{
  printf("Lgroup level %i\\n",level);
  
  int ncpus=lgrp_cpus(cookie,node,0,0,LGRP_CONTENT_DIRECT);
  
  processorid_t * cpus=(processorid_t*)calloc(ncpus,sizeof(processorid_t));
  lgrp_cpus(cookie,node,cpus,ncpus,LGRP_CONTENT_DIRECT);
  printf("CPUs: ");
  for(int i=0; i<ncpus; i++)
  {
    printf("%i ",cpus[i]);
  }
  printf("\\n");
  
  int nchildren=lgrp_children(cookie,  node, 0,0);

  lgrp_id_t* children=(lgrp_id_t*)calloc(nchildren,sizeof(lgrp_id_t));
  lgrp_children(cookie,  node,children,nchildren);

  for (int i=0; i<nchildren; i++)
  {
    explore(cookie,children[i],level+1);
  }
  free(children);
}

void main()
{
  lgrp_cookie_t cookie =lgrp_init(LGRP_VIEW_CALLER); 
  lgrp_id_t node = lgrp_root(cookie);
  explore(cookie,node,0);
  lgrp_fini(cookie);
}

Which provides the following output:

% cc local.c -llgrp
% ./a.out
Lgroup level 0
CPUs:
Lgroup level 1
CPUs: 0 1 2 3
Lgroup level 1
CPUs: 4 5 6 7

Monday Jul 13, 2009

Lesser known Solaris features

Joerg Moellenkamp has put together a very nice downloadable pdf book on lesser known Solaris features. Well worth skimming through if you're interested in some of the features that Solaris has that have not captured the headlines. The book walks through the features and has examples of how to use them.

Friday Jun 26, 2009

Cost of pointer chasing in libmtmalloc

I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.

The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.

Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.

However, the fundamental problem is not actually the number of objects in each allocated chunk - that is just what reveals the problem. No, the real problem is the pointer-chasing loop used to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two fields that are used in the pointer chasing (mt_next and mt_nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.

We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.

I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks, interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it: we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cache line, reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible). The other gain is that we'd be streaming through memory, so the hardware might be able to prefetch the next cache line, or we could just add prefetches if that were necessary.
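To make that concrete, here's a minimal sketch of the kind of layout I have in mind - the names and types are mine, not libmtmalloc's:

#include <stdint.h>

/* Hypothetical side table: one entry per 72KB chunk, scanned linearly.
   The chunk headers stay where they are; the array just lets the allocator
   find a chunk with free space without chasing pointers through memory. */
typedef struct
{
  void     *chunk;   /* pointer to the chunk's header */
  uint32_t  nfree;   /* cached count of free objects in that chunk */
} chunk_summary_t;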

The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    <Total>
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. When a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not, it goes to the next chunk and so on. This works very well when memory is allocated all at once, but as memory gets freed, these chunks of memory become like Swiss cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
{
  struct s * next;
  char padding[508];
} S;

void main()
{
  struct s * head;
  struct s * keep;
  struct s * current;
  head=0;
  keep=0;
  for (int j=0; j<100; j++)
  {
    for (int i=0; i<100000; i++)
    {
      current=(struct s*)malloc(sizeof(struct s));
      if (i&1)
      {
        current->next=head;
        head=current;
      }
      else
      {
        current->next=keep;
        keep=current;
      }
    }
    current = head;
    while (current!=0)
    {
      struct s * tmp = current;
      current = current -> next;
      free(tmp);
    }
    head = 0;
  }
}

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:

#include <mtmalloc.h>
...
  mallocctl(MTCHUNKSIZE,20);
...

The runtime reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.

Sunday Jun 14, 2009

Audio for JavaOne interview available

A couple of weeks back I recorded an interview where I discussed The Developer's Edge. I've just found the audio up at BlogTalkRadio; it's about 15 minutes in duration.

Wednesday Jun 10, 2009

Utilising a CMT system

I got asked about how to utilise a CMT system. It's probably not an uncommon question, so I'll post my (somewhat brief) answer here:

The CMT processor appears as a system with many CPUs. These virtual CPUs can be provisioned in the same way as you would with any multiprocessor system:

  • The OS will naturally handle positioning multiple active threads so as to get the optimal performance.
  • If you wish to manually tune this then you can use Solaris tools like processor binding (pbind, or processor_bind) to statically allocate a particular thread or process to a particular core. You can use processor sets (psrset) to restrict a set of processes to a particular set of processors (or to exclude particular processes from using those processors). Example commands are sketched after this list.
  • The machine can be divided into multiple virtual machines, either through Solaris Zones, where all zones run the same version of the Solaris operating system, or through logical domains, where multiple different operating systems can be installed onto the same machine. The optimal configuration will depend on the problem to be solved.

I've actually heard someone argue that multicore processors require a redesign of applications. Um, no. Applications will work just fine. However, multicore processors do give you opportunities to throw more threads at a problem - which can be very useful.
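As an example of the manual tuning mentioned in the list above, the commands look something like this - the process IDs and CPU numbers are made up:

$ pbind -b 12 10386       # bind process 10386 to virtual CPU 12
$ psrset -c 16 17 18 19   # create a processor set from CPUs 16-19
$ psrset -b 1 10452       # restrict process 10452 to processor set 1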

Thursday May 21, 2009

Graph of libraries used by firefox and thunderbird

Just gathered library usage charts for firefox and thunderbird. The full charts look like:

Firefox

Thunderbird

Neither of which is particularly telling. The reduced charts look much better:

Firefox

Thunderbird

Friday Mar 27, 2009

The Developer's Edge available in hardcopy

The Developer's Edge is now available as hardcopy!

It is being made available as print-on-demand. You can either order through the publisher Vervante, or you can order through Amazon.

However, I suggest you wait until next week before ordering as the current cover art is not the final version (you can play spot the difference between the image on the left and the one on the Vervante website). I'll post when it gets fixed. Of course, you can order the "limited-edition" version if you want :)

I introduced the book in a previous post. I'll reproduce a bit more of the details in this post. The brief summary is:


The Developer's Edge: Selected Blog Posts and Articles focuses on articles in the following areas:

  • Native language issues
  • Performance and improving performance
  • Specific features of the x86 and SPARC processors
  • Tools that are available in the Solaris OS

The articles provide either a broad overview of a topic or an in-depth discussion. The texts should provide insights into new tools or new methods, or perhaps the opportunity to review a known domain in a new light.


You can get more details from the Safari site, either by reading the preface or by skimming the table of contents.

I would love to hear feedback on this book - feel free to e-mail me directly, leave comments on Amazon, or leave comments on this blog or on the blogs of the other contributors.

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
