Friday Jun 26, 2009

Cost of pointer chasing in libmtmalloc

I thought I'd quickly write up some comments about how to improve the issue with mtmalloc.

The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.

Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing on a linked list of single object blocks - that would be very painful. So hopefully there is some kind of scaling of the requestsize for larger objects.

However, the fundamental problem is not actually the number of objects in each allocated chunk, that's what reveals the problem. No, the real problem is the pointer chasing loop to locate a chunk with free space in it. There are several upfront problems with this. Fortunately the two data structures that are used in the pointer chasing (mt-next and mt-nfree) are on the same cache line - although given their offsets, I'm not convinced that this was a design decision. However, the pointer chasing from block to block guarantees that the next pair of values that need to be inspected are not in the cache. Given the fact that we're jumping at least 72KB, there's a good chance that the mapping isn't even in the TLB.

We could argue at this point that there should be a list of free memory so a single pointer step could get us to the next free block, but this approach almost certainly opens up a lot of issues around ensuring thread safety (ie you probably need mutexes) and getting good performance (ie mutexes cost quite a few cycles). So I don't think that's the solution.

I suspect the easiest thing that could be done would be to take the two key fields and place them into an array. So you would have an array of pointers to the chunks interleaved with the counts of the number of free elements in each of the chunks. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it, we can just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cacheline - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible), and the other gain would be that we'd be streaming through memory so the hardware might be able to prefetch the next cacheline, or we just just add prefetches if that were necessary.

The programming challenge with this approach would be in the situations where we need to increase the size of the array to allocate more chunks. This should happen rarely, but could be tricky to do safely. But not impossible.

mtmalloc vs umem

A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:

Excl.     Incl.      Name
User CPU  User CPU
   sec.      sec.
266.446   266.446    
258.301   263.084    malloc_internal
  1.661     1.951    free
  1.401     1.401    mutex_lock_impl
  0.961     0.961    mutex_unlock

We can dig into the disassembly of malloc_internal to find out what's going on:

    73.201    73.201            [?]     1724:  cmp         %o5, 0
     1.981     1.981            [?]     1728:  bne         0x1740
     0.320     0.320            [?]     172c:  nop
     0.490     0.490            [?]     1730:  ld          [%i2 + 44], %i2
     1.191     1.191            [?]     1734:  cmp         %i2, 0
     0.901     0.901            [?]     1738:  bne,a       0x1724
## 176.443   176.443            [?]     173c:  ld          [%i2 + 32], %o5

It's not hard to visualise what the original C code would look like:

  while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }

Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:

   1032 	while (thiscache != NULL && thiscache->mt_nfree == 0)
   1033 		thiscache = thiscache->mt_next;

So what's going on?

Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB\*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.

It's probably worth revealing the test code, at this point, so that you can see what it does:

#include <stdlib.h>
typedef struct s
  struct s \* next;
  char padding[508];
} S;

void main()
  struct s \* head;
  struct s \* keep;
  struct s \* current;
  for (int j=0; j<100; j++)
    for (int i=0; i<100000; i++)
      current=(struct s\*)malloc(sizeof(struct s));
      if (i&1)
    current = head;
    while (current!=0)
      struct s \* tmp = current;
      current = current -> next;
    head = 0;

The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        4:44.18
user        4:33.80
sys            8.70

However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:


The performance reduces from nearly 5 minutes to about 1 minute:

% cc -O mtm.c -lmtmalloc
% timex a.out
real        1:09.10
user        1:01.09
sys            6.53

If we increase the requestsize to 30, performance improves still further:

% cc -O mtm.c -lmtmalloc
% timex a.out
real          38.36
user          31.41
sys            4.96

Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:

% cc -O mtm.c -lumem
% timex a.out
real          31.06
user          18.10
sys           10.95

So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming