TLAB sizing: an annoying little problem!
By daviddetlefs on Oct 25, 2005
IntroductionOne of our major goals for the HotSpot VM is that users should have to do as little tweaking of parameters as possible -- whenever we can, we should have the system monitor the application's behavior and adjust dynamically. (If you're a user of our system, we'd like it if you were reading this kind of blog because of intellectual interest only, not because you're hoping for a special knob to tweak!)
This week I'm going to talk about a fairly small, nitty-gritty detail in trying to achieve this ambitious long-term goal: automatic sizing of thread-local allocation buffers (aka TLABs).
Why TLABs?TLABs are a solution to the following problem. When a multi-threaded application runs on a multiprocessor, how does it do allocation? In HotSpot, almost all application allocation is performed in the eden space of the young generation, using contiguous (aka "pointer-bumping") allocation. That is, there's a current pointer and an end address, and as long the next allocation fits we return the current pointer value and increment it by the allocation size.
Allocation in a contiguous space
But on a multiprocessor, this isn't safe: two threads could both read the current alloc pointer value, and both update, and think they were using the same memory for different objects. So some form of synchronization is necessary. We could just take a lock, but we can do better by using an atomic hardware instruction, a compare-and-swap, or CAS (in the SPARC architecture; the same thing is called compare-and-exchange on x86). So in our race scenario, one thread's CAS succeeds and the other fails; the latter retries.
This still has two problems: atomic hardware operations are expensive on most architectures, and on machines with many processors this could be a source of cache contention, making them even more expensive.
So we avoid this with TLABs: each thread allocates a medium-sized chunk and saves it, allocating within it with no synchronization required. Only when its TLAB is used does it go back to allocating from the shared space.
This diagram shows the space above divided up into TLAB's, with different TLABs at different degrees of fullness:
Allocation in a contiguous space
How big should they be?This was all background to get to the problem at hand: how big should the TLAB's be? Bigger TLAB's have several advantages:
- The bigger the TLAB, the less often you have to allocate a new one from the relatively expensive shared allocation space.
- The less often you do allocated from the shared space, the less likely cache contention on reading/updating the allocation pointer.
There is an important pressure in the other direction. Let's say we have a 100MB young generation, and 10 mutator threads are each allocating and filling 1MB TLAB. Some thread allocates the last remaining space for its TLAB, and some (probably) other thread is the next to attempt a TLAB allocation. This latter thread fails, causing a young generation GC.
Consider the fullness of the TLABs of the threads at this point. The safest assumption is that this is randomly distributed. (Note that I chose 10 threads, 1MB TLABs, and a 100MB young generation, to make "lock-step" operation seem less plausible). So on average, a thread's TLAB will be about 1/2 full. So we do this young-gen GC when about
1/2 \* 1MB/thread \* 10 threads = 5MBis likely to be unused: about 5% of the young generation.
This is probably fine. If however, we'd used 10 MB TLAB's, then we could have had (by the same calculation) about 50% of the young-generation wasted. This is not fine.
Our SolutionFixing this fell to my co-worker, long-time collaborator, and friend Ross Knippel. Ross called me a couple times as well, and we consulted. Here's what we eventually came up with (and Ross implemented :-)). There's now a parameter (which you shouldn't really ever have to modify) specifying the maximum expected TLAB waste, as a fraction of the total size of the space in which allocation occurs. Let's say we had specified 5%, the result from our 10 thread/100 MB young example, with 1 MB TLABs. We want to work backwards from the given expected waste percentage (5%), the number of threads and the young-gen size, to discover the appropriate TLAB size.
We first translate the waste percentage to a desired number of TLAB refills. The 5% expected waste should be half of the amount of TLAB space is use at the time of the young-gen collection, so the in-use TLABs should occupy 10% of the space, or 10 MB. We know that we have 10 threads, so each thread should have 1 MB TLABs.
This example is oversimplified in one important respect: it assumes that all threads have equal allocation rates. In practice, some threads may allocate much more frequently than others. The frequently-allocating threads should get larger TLABs than slowly-allocating threads -- with equal-sized TLABs, the slow threads are likely to waste too much. (Also, its less important to decrease the allocation cost of the slowly-allocating threads).
So what we actually do is translate the expected waste specification into a desired number of TLAB refills per young-gen collection. Another way to look at things is that 5% expected waste means that the TLAB size for thread T should be about 10% of total amount of allocation for thread T between two young collections. We can invert this percentage to get a desired number of TLAB refills for the thread: 10, in our example.
So this is what the adaptive TLAB sizing scheme actually does: computes the number of desired refills from the waste percentage spec, and then each thread counts the number of TLAB refills between collections, adjusting its desired TLAB size up or down in order to achieve the desired target.
ResultsFor large young generations or small numbers of thread, this work mostly doesn't matter: even if the TLABs are larger than they should be, if they're still a small fraction of the total young-generation size, so the maximum waste is still small. Where we saw considerable improvement was running programs with many threads in small-to-medium young-generations. The Volano benchmark is such a program. The actual Volano product is a chat server (apologies if Volano the company actually makes other products); their benchmark simulates their application. (By the way, developers: packaging up a version of your application as an easily-runnable benchmark is a great way of getting VM vendors to make sure your application runs well on their VM!) Volano has (I think) a thread per simulated user, so a run may have several thousand threads, most of which are waiting on I/O most of the time, and thus have low allocation rates. Our TLAB sizing system infers a desired TLAB size of only a few hundred bytes for these threads, allowing us to scale well to thousands of threads with only a moderate-sized young generation.
There it is: one more thing we worried about, so that you don't have to!