Location, Location, Location

Allocating a bunch of objects together and then using them together seems to be a good thing for cache utilization. Not always true, of course, but, if you're willing to take that as a quasi-truth, then you might be interested in how garbage collections can stir up that good locality.

I'm going to talk first about our serial garbage collector. It is a generational garbage collector with a copying young generation collector and a mark-sweep-compact tenured generation collector. My comments generally apply to all our collectors but allow me to do the simplest one first. After that I'll comment on how doing the collection in parallel makes it even more interesting. By the way the techniques I'll be describing for the serial collector are pretty generic and are used by other garbage collectors.

When a application thread T allocates objects, the allocations are generally done out of the young generation. Those allocations done by T together (i.e., at about the same time) are allocated in close proximity in the heap. This is mostly true even though allocations are being done in parallel by different application threads (different T's). Each T allocates from its own thread local allocation buffer (TLAB). I described recent improvement to TLAB's in an earlier blog with the title "A Little Thread Privacy, Please". If you read the first part of that one, you'll get the idea. At this point after the allocations are done, we have the good locality we've come to expect from the allocate-together, use-together practice. Then a garbage collection happens.

When the young generation is collected, the surviving objects can be copied to either the survivor spaces or the tenured generation. The survivor spaces are areas in the young generation where young generation objects are kept if they survive a few collections. After surviving a few collections objects in the survivor spaces are copied (promoted) to the tenured generation where long lived objects are kept. Let's consider the very first collection and the case where all the objects that survive that first collection are copied to a survivor space.

The young generation collection starts by finding objects being used directly by the application. By "directly" I mean the application has a reference to those objects A. That's as opposed to having a reference to an object (A) that contains a reference to a second object (B). The garbage collector scans the stacks of each application thread and, when it finds a reference to an object A, it copies A to a survivor space. Incidentally, that's also when the reference to A in the thread's stack is updated to point to the new location of A. Only A is copied at this point. If A contains references to another object B, the object B is not touched. The garbage collector continues to scan the stacks of the threads until it's found and copied all the objects directly referenced by the application. At that point the garbage collector starts looking inside the objects A that it has copied. It's then that the garbage collector would find B in A and would copy B to a survivor space. And as before if B contains a reference to object C, nothing is yet done about C.

As an example, suppose an application thread T

\* allocates object C

\* allocates object B and puts a reference to C into B

\* allocates object A and puts a reference to B into A

repeatedly. The layout of objects in the heap would look something like

... C1 B1 A1 C2 B2 A2 C3 B3 A3 ...

If thread T only kept a reference to the objects A (i.e., there are no reference directly to the B's or the C's on the thread stacks), then after the young generation collection the layout in the survivor space would look like

... A1 A2 A3 ... B1 B2 B3 ... C1 C2 C3 ...

where the (...) represent other objects X found directly from T's stack or objects found from those X's. Here I'm being just a little generous in grouping the A's, B's, and C's together. That grouping assumes that A1 is found first, then A2 is found, and then A3 is found.

Before the collection loading A1 into a cache line could reasonably also load B1 and C1 into the same cache line. After the collection B1 and C1 are in a galaxy far, far away. The A's, B's and C's have been separated so that their allocation order is no longer reflected in their location in the heap. The basic damage has been done.

In the case where all the objects A, B, and C are promoted to the tenured generation, the layout in the tenured generation would be similar (just in the tenured generation instead of in a survivor space). In the case where some of the objects are copied to a survivor space and some are promoted, some of each of the A's, B's, and C's can end up in either a survivor space or the tenured generation. You might think that it would tend to look something like

(survivor space) ... A1 A2 A3 ...

(tenured generation) ... B1 B2 B3 ... C1 C2 C3 ...

It might. Or it might be a bit messier. The above example would result if there was room in the survivor spaces for the A's but not for the B's and C's. If the B's were particularly large (so that they did not fit into the space remaining in the survivor space) and the C's were particularly small, then you might see

(survivor space) ... A1 A2 A3 ... C1 C2 C3 ...

(tenured generation) ... B1 B2 B3 ...

If there are references in the application thread stack to the B's and C's as well as to the A's, and the collector found objects in the order

C2 B3 A1 A2 A3

then after the collection the object layout would look like

... C2 B3 A1 A2 A3 ... B1 B2 ... C1 C3 ...

The mixtures are as endless as the ways in which the application references its objects.

How does collecting in parallel affect things? Recall that the serial collector first looks at all the references found in the application thread stacks (in the simpliest case the A's) before it looks at references inside the A's (i.e., the B's). The parallel collector starts by looking at some of the A's, but then may look at some of the B's before all the A's have been found. A parallel GC thread copies an A and puts its new location on a stack. I'll refer to this stack as its grey stack so as not to confuse it with the stacks of the application threads. Each GC thread has its own grey stack. When its grey stack starts to get full, a GC thread will start taking the locations of the A's off its grey stack and looking inside them for B's to copy. It starts looking inside the A's so that its grey stack does not grow too large. In general we prefer to do some copying instead of having to malloc more space for the grey stack. Very quickly there are A's, B's and C's on the grey stacks and the order in which they are copied looks largely random from outside the garbage collector.

When a parallel GC thread promotes an object, it promotes it into its own distinct area (let me call them PLAB's) in the tenured generation. A GC thread that has run out of objects to copy (no more application thread stacks to scan and no more objects on its own grey stack) can steal objects from another GC thread's grey stack. So A1 may be copied by one GC thread into its PLAB while B1 (if it were stolen) may be copied by another GC thread into a different PLAB.

So collections can disrupt good object locality in a variety of different ways. We don't like that. We're thinking about ways to improve it.

Hope this was understandable. Please ask if it was not.



Yes, thanks for that. I kinda suspected it would get messy, especially when concurrency gets involved.

Being "close" once you have large memory pages (4MB) being used by the JVM probably only matters at the cache-line level I'd guess, but maybe you could fold in a post-pass into the tenured- and permanent- generation collectors (serial and CMS please!) for some simple special cases, eg:

1) Where the only direct reference of one object is one other object, eg the wrapped/protected data array of an object like String or ByteArray{In,Out}putStream, then move the "embedded" item to be close (within the same cache line) as the "holder" object. (I believe that some Java GCs, though possibly only research ones, aim to do this already.) You might even have the GCs recognise the signatures of common objects in java.\* rather than count the referents...

2) Objects referred to from the array might be shuffled together when the opportunity arises, maybe on the assumption access may often be small-step linear over the holdign array, and that lifetime might be similar also.

Would it be useful to think of more cases like that examining some real production code? Surely the JVM geeks in simulated-CITES-friendly-ivory towers already have a good list to start with?

(I'm suggesting ternured and perm gens simply because the results of getting prximity right or wrong should last for a while, of course.)



Posted by Damon Hart-Davis on August 15, 2006 at 08:53 PM PDT #

Your GC blogs are very interesting and useful. The next generation language Sun'll put out needs to explicate the latencies important to concurrent and distributed development, all the way from memory latencies to disk latencies to network latencies to data and code migration latencies and partitioning.

Posted by Mikael Gueck on August 15, 2006 at 10:33 PM PDT #

Hi Damon, As Jon may have mentioned in his blog, CMS uses a free list space and does not, currently, move objects concurrently. So, maintaining or creating object locality during promotion or in a post-processing step into the CMS free list space is somewhat difficult currently. Changing the allocator to allow contiguous allocation would probably work, but we found in the past that it caused fragmentation issues with certain allocation streams. But do not lose hope; we are thinking of ways of fixing this. -- ramki.

Posted by ysr on August 17, 2006 at 04:34 AM PDT #

Hi ramki/Jon,

Well I think an HP JVM tried this (eg to get container/wrapper/decorator objects and their contained data on the same cache line, and found it marginally helpful (do you want me to try to dig out the URL?), but not to the extent that I think it would be worth moving Heaven and Earth (and a large chunk of heap) for.

My feeling (with the luxury of not having to write the code) is that at the moment that I was ANYWAY moving a container/warpper object (ie with a suitable type signature) from a short-life heap to a longer-life one (eg new to tenured), I'd see if I \*could\* move the wrapped data at the same time into an adjacent space opportunistickly, ie like a "peephole" optimisation. Who knows, it might often be possible and easy and might buy you a % or so in locality, but it does depend strongly how you happen to maintain the CMS heap, eg buddy lists would make it harder I guess because your wrapper and wrapped objects would rarely be similar sizes.



Posted by Damon Hart-Davis on August 24, 2006 at 06:45 AM PDT #

Post a Comment:
Comments are closed for this entry.



« August 2016