Cost of pointer chasing in libmtmalloc

I thought I'd quickly write up some comments on how the issue with mtmalloc might be improved.

The "obvious" solution is to allocate larger chunks of memory. I was using 512 byte objects, so 72KB can allocate 144 of these. That's quite a few, I'm not sure that in general I'd want to allocate say 1024 objects of 512 bytes just in case I needed them. So increasing the chunk size might be a useful workaround, but I think it's rather a sledgehammer solution.

Of course, if I end up allocating 60KB objects, then I hope the code does the right thing and reserves space for several of them. If I need one, I'm quite likely to need a second, and I don't want to be doing pointer chasing down a linked list of single-object blocks - that would be very painful. So hopefully there is some kind of scaling of the request size for larger objects.

However, the fundamental problem is not actually the number of objects in each allocated chunk - that merely reveals the problem. No, the real problem is the pointer-chasing loop used to locate a chunk with free space in it. There are several problems with this. Fortunately, the two fields used in the pointer chasing (mt_next and mt_nfree) are on the same cache line - although, given their offsets, I'm not convinced that this was a deliberate design decision. However, chasing the pointer from block to block guarantees that the next pair of values that need to be inspected is not in the cache. And given that we're jumping at least 72KB at each step, there's a good chance that the mapping isn't even in the TLB.
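To make this concrete, here is a minimal sketch of that allocation walk. The cache_t layout is illustrative, not the real libmtmalloc definition; only the mt_next and mt_nfree fields matter for the argument.

```c
/* Illustrative chunk header - not the actual libmtmalloc struct. */
typedef struct cache {
	struct cache	*mt_next;	/* next chunk in the list */
	unsigned int	 mt_nfree;	/* free objects left in this chunk */
	/* ... per-chunk bookkeeping and the object storage ... */
} cache_t;

/*
 * Walk the list until a chunk with free space turns up. Every
 * iteration dereferences mt_next, landing at least 72KB away:
 * an almost-guaranteed cache miss and, quite possibly, a TLB miss.
 */
static cache_t *
find_free_chunk(cache_t *head)
{
	cache_t *c;

	for (c = head; c != NULL; c = c->mt_next) {
		if (c->mt_nfree > 0)
			return (c);
	}
	return (NULL);	/* no space - a new chunk must be allocated */
}
```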

We could argue at this point that there should be a list of free memory, so that a single pointer step could get us to the next free block. But this approach almost certainly opens up a lot of issues around ensuring thread safety (i.e. you probably need mutexes) and getting good performance (i.e. mutexes cost quite a few cycles). So I don't think that's the solution.
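For contrast, here's roughly what that dismissed approach would look like - purely illustrative, with names of my own invention. A single shared list head needs a lock around every pop, putting synchronization cost on the hottest path of the allocator.

```c
#include <pthread.h>
#include <stddef.h>

typedef struct free_obj {
	struct free_obj *next;	/* next free object anywhere in the heap */
} free_obj_t;

static free_obj_t	*free_head;
static pthread_mutex_t	 free_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Pop one object off the shared free list. The mutex makes it
 * thread safe, but every allocation now pays for a lock/unlock
 * pair - exactly the cost argued against above.
 */
static void *
free_list_pop(void)
{
	free_obj_t *obj;

	pthread_mutex_lock(&free_lock);
	obj = free_head;
	if (obj != NULL)
		free_head = obj->next;
	pthread_mutex_unlock(&free_lock);
	return (obj);
}
```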

I suspect the easiest improvement would be to take the two key fields and place them into an array: an array of pointers to the chunks, interleaved with the counts of free elements in each chunk. The advantage of this is that we could identify a chunk with free space without having to stride through memory to do it; we could just scan the array. We'd rarely need multiple TLB entries, and we might even be able to fit the details of multiple chunks on the same cache line - reducing cache misses substantially (there is an issue of false sharing here, so it may not be entirely feasible). The other gain would be that we'd be streaming through memory, so the hardware might be able to prefetch the next cache line, or we could just add prefetches if that were necessary.
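Here's a sketch of what that might look like, reusing the illustrative cache_t from above; the directory structure and names are hypothetical. With 8-byte pointers and 4-byte counts, each entry packs into 16 bytes, so a 64-byte cache line covers four chunks.

```c
#include <stddef.h>

/* One directory entry per chunk: the pointer plus a cached free count. */
typedef struct chunk_dir_entry {
	cache_t		*chunk;		/* the chunk itself */
	unsigned int	 nfree;		/* free objects in that chunk */
	unsigned int	 pad;		/* keep entries 16 bytes on LP64 */
} chunk_dir_entry_t;

/*
 * Scan the directory for a chunk with space. The accesses are
 * sequential, so one cache miss covers four entries and the
 * hardware prefetcher can run ahead of the loop - no pointer
 * chasing, and typically only one or two TLB entries needed.
 */
static cache_t *
find_free_chunk_dir(chunk_dir_entry_t *dir, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (dir[i].nfree > 0)
			return (dir[i].chunk);
	}
	return (NULL);
}
```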

The programming challenge with this approach would be the situations where we need to grow the array to track more chunks. This should happen rarely, but could be tricky to do safely. Not impossible, though.
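One plausible way to handle the growth, sketched below with C11 atomics - this is my assumption about an approach, not existing libmtmalloc code, and it reuses the hypothetical chunk_dir_entry_t from the previous sketch. The idea is to build a larger copy of the directory and publish it with a single atomic pointer swap, so readers always see either the old array or the new one. The genuinely tricky part, glossed over here, is deciding when the old array can be freed.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

static _Atomic(chunk_dir_entry_t *)	dir;		/* current directory */
static size_t				dir_capacity;	/* entries in it */

/*
 * Grow the directory to new_capacity entries. Assumes a single
 * grower at a time (e.g. under an existing per-cache lock).
 * Readers that loaded the old pointer keep scanning a consistent,
 * if slightly stale, array; the atomic store publishes the new one.
 * Freeing the old array is deliberately omitted - it must wait
 * until no reader can still be using it.
 */
static int
grow_directory(size_t new_capacity)
{
	chunk_dir_entry_t *old = atomic_load(&dir);
	chunk_dir_entry_t *grown = calloc(new_capacity, sizeof (*grown));

	if (grown == NULL)
		return (-1);
	if (old != NULL)
		memcpy(grown, old, dir_capacity * sizeof (*grown));
	atomic_store(&dir, grown);
	dir_capacity = new_capacity;
	return (0);
}
```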

Comments:

Darryl,

While not related to this blog entry, I was wondering if you could clarify one of your earlier articles at http://developers.sun.com/solaris/articles/t1t2_perf_counter.html
Cycles 1 636,000,000 636,000,000 0.53 100%
How did you calculate the number of cycles as above? Did you base it on a percentage of your run time, or did you use the tick register? If my test program ran for 10s on a 1.2GHz T2, what should the cycle count be in my calculation? Also, if we add up the percentages spent in the other phases, they don't add up to the total cycle time. Thanks for your time.
Ashok

Posted by Ashok Nair on June 29, 2009 at 10:04 PM PDT #

Hi, Ashok,

Cycles = 0.53s * 1.2GHz = 636,000,000

So in your example you'd do 10s * 1.2GHz = 12B cycles.

Yes, the total doesn't add up to 100%. There is no counter for instruction latency, and the calculation is an approximation anyhow. The point of the exercise is to indicate where time is being spent, and to give some idea of the performance being lost.

HTH,

Darryl.

Posted by Darryl Gove on June 30, 2009 at 01:55 AM PDT #

Darryl,

Thanks very much for your reply.

Ashok

Posted by Ashok Nair on June 30, 2009 at 03:54 AM PDT #
