Tuesday Oct 25, 2005

TLAB sizing: an annoying little problem!

Introduction

One of our major goals for the HotSpot VM is that users should have to do as little tweaking of parameters as possible -- whenever we can, we should have the system monitor the application's behavior and adjust dynamically. (If you're a user of our system, we'd like it if you were reading this kind of blog because of intellectual interest only, not because you're hoping for a special knob to tweak!)

This week I'm going to talk about a fairly small, nitty-gritty detail in trying to achieve this ambitious long-term goal: automatic sizing of thread-local allocation buffers (aka TLABs).

Why TLABs?

TLABs are a solution to the following problem. When a multi-threaded application runs on a multiprocessor, how does it do allocation? In HotSpot, almost all application allocation is performed in the eden space of the young generation, using contiguous (aka "pointer-bumping") allocation. That is, there's a current pointer and an end address, and as long as the next allocation fits, we return the current pointer value and increment it by the allocation size.

Allocation in a contiguous space
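To make the mechanism concrete, here's a minimal sketch of pointer-bumping allocation. It's written in Java purely for illustration -- HotSpot's allocator is C++ and hands out real heap addresses -- and the class and field names are made up:

  // A toy bump-pointer allocator: "top" and "end" are offsets into an
  // imaginary eden space rather than real addresses.
  class BumpAllocator {
      private long top;        // current allocation pointer
      private final long end;  // end of the allocatable space

      BumpAllocator(long start, long end) {
          this.top = start;
          this.end = end;
      }

      /** Returns the start of the new object, or -1 if the space is exhausted. */
      long allocate(long size) {
          if (top + size > end) {
              return -1;       // no room: time for a GC
          }
          long result = top;   // return the current pointer value...
          top += size;         // ...and bump it by the allocation size
          return result;
      }
  }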

But on a multiprocessor, this isn't safe: two threads could both read the current allocation pointer value and both try to update it, each thinking it had claimed the same memory for a different object. So some form of synchronization is necessary. We could just take a lock, but we can do better by using an atomic hardware instruction: a compare-and-swap, or CAS, in SPARC terminology (the same operation is called compare-and-exchange on x86). So in our race scenario, one thread's CAS succeeds and the other's fails; the latter retries.
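The shared, synchronized path might look roughly like the following sketch, using java.util.concurrent.atomic.AtomicLong's compareAndSet as a stand-in for the hardware CAS (again, an illustration in Java, not HotSpot's actual C++ code):

  import java.util.concurrent.atomic.AtomicLong;

  // A toy CAS-based shared allocator: read the current allocation pointer,
  // compute the new value, and try to install it atomically, retrying if
  // another thread got there first.
  class SharedBumpAllocator {
      private final AtomicLong top;   // shared allocation pointer
      private final long end;

      SharedBumpAllocator(long start, long end) {
          this.top = new AtomicLong(start);
          this.end = end;
      }

      long allocate(long size) {
          while (true) {
              long oldTop = top.get();
              long newTop = oldTop + size;
              if (newTop > end) {
                  return -1;                        // space exhausted
              }
              if (top.compareAndSet(oldTop, newTop)) {
                  return oldTop;                    // we own [oldTop, newTop)
              }
              // CAS failed: another thread raced us; retry with the fresh value.
          }
      }
  }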

This still has two problems: atomic hardware operations are expensive on most architectures, and on machines with many processors this could be a source of cache contention, making them even more expensive.

So we avoid this with TLABs: each thread allocates a medium-sized chunk from the shared space and saves it, then allocates within that chunk with no synchronization required. Only when its TLAB is used up does it go back to allocating from the shared space.
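Here's a rough sketch of that fast path, building on the SharedBumpAllocator sketch above. The fixed TLAB size and the names are mine, and the real code has to retire TLABs more carefully so the heap stays walkable:

  // A toy thread-local allocator: each thread allocates from its own buffer
  // with no synchronization, and only goes to the shared CAS-based space
  // when the buffer is used up.
  class ThreadLocalAllocator {
      private final SharedBumpAllocator shared;
      private final long tlabSize;                 // fixed here; adaptive in HotSpot
      private long tlabTop = 0, tlabEnd = 0;       // empty until the first refill

      ThreadLocalAllocator(SharedBumpAllocator shared, long tlabSize) {
          this.shared = shared;
          this.tlabSize = tlabSize;
      }

      long allocate(long size) {
          if (tlabTop + size <= tlabEnd) {
              long result = tlabTop;               // fast path: no atomics, no locks
              tlabTop += size;
              return result;
          }
          if (size > tlabSize) {
              return shared.allocate(size);        // very large objects bypass the TLAB
          }
          // Slow path: retire the current TLAB (wasting whatever is left in it)
          // and grab a fresh one from the shared space.
          long newTlab = shared.allocate(tlabSize);
          if (newTlab < 0) {
              return -1;                           // eden is full: time for a young GC
          }
          tlabTop = newTlab;
          tlabEnd = newTlab + tlabSize;
          return allocate(size);                   // now guaranteed to fit
      }
  }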

This diagram shows the space above divided up into TLABs, with different TLABs at different degrees of fullness:

Allocation in a contiguous space divided into TLABs

How big should they be?

This was all background to get to the problem at hand: how big should the TLABs be? Bigger TLABs have several advantages:
  • The bigger the TLAB, the less often you have to allocate a new one from the relatively expensive shared allocation space.
  • The less often you allocate from the shared space, the less likely you are to suffer cache contention when reading/updating the shared allocation pointer.

There is an important pressure in the other direction. Let's say we have a 100MB young generation, and 10 mutator threads each allocating in and filling 1MB TLABs. Some thread allocates the last remaining space for its TLAB, and some (probably other) thread is the next to attempt a TLAB allocation. This latter thread fails, causing a young generation GC.

Consider the fullness of the threads' TLABs at this point. The safest assumption is that their fullness is uniformly distributed. (Note that I chose 10 threads, 1MB TLABs, and a 100MB young generation to make "lock-step" operation seem less plausible.) So on average, a thread's TLAB will be about 1/2 full. So we do this young-gen GC when about

  1/2 * 1MB/thread * 10 threads = 5MB
is likely to be unused: about 5% of the young generation.

This is probably fine. If, however, we'd used 10MB TLABs, then we could have had (by the same calculation) about 50% of the young generation wasted. This is not fine.
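For concreteness, here's the same back-of-the-envelope arithmetic as a tiny (made-up) helper:

  // Expected waste at a young GC, assuming each thread's TLAB is on average
  // half full: (threads * tlabSize / 2) as a fraction of the young generation.
  class TlabWasteEstimate {
      static double expectedWasteFraction(int threads, long tlabSize, long youngGen) {
          return (threads * tlabSize / 2.0) / youngGen;
      }

      public static void main(String[] args) {
          long MB = 1024 * 1024;
          System.out.println(expectedWasteFraction(10, 1 * MB, 100 * MB));   // 0.05 -- fine
          System.out.println(expectedWasteFraction(10, 10 * MB, 100 * MB));  // 0.5  -- not fine
      }
  }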

Our Solution

Fixing this fell to my co-worker, long-time collaborator, and friend Ross Knippel. Ross called me a couple of times along the way, and we consulted. Here's what we eventually came up with (and Ross implemented :-)). There's now a parameter (which you shouldn't really ever have to modify) specifying the maximum expected TLAB waste, as a fraction of the total size of the space in which allocation occurs. Let's say we had specified 5% -- the result from our 10-thread/100MB young-generation example with 1MB TLABs. We want to work backwards from the given expected waste percentage (5%), the number of threads, and the young-gen size to discover the appropriate TLAB size.

Let's first work through this simple case. The 5% expected waste should be half of the amount of TLAB space in use at the time of the young-gen collection, so the in-use TLABs should occupy 10% of the space, or 10MB. We know that we have 10 threads, so each thread should have 1MB TLABs.
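The same backwards calculation, as another small hypothetical helper (still under the equal-allocation-rates assumption the next paragraph relaxes):

  class NaiveTlabSizing {
      // wasteFraction is the maximum expected TLAB waste, e.g. 0.05 for 5%.
      static long naiveTlabSize(double wasteFraction, long youngGen, int threads) {
          // In-use TLABs should occupy 2 * wasteFraction of the young gen,
          // since about half of that space is expected to go unused.
          return (long) (2 * wasteFraction * youngGen) / threads;
      }

      public static void main(String[] args) {
          long MB = 1024 * 1024;
          System.out.println(naiveTlabSize(0.05, 100 * MB, 10) / MB);   // prints 1, i.e. 1MB TLABs
      }
  }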

This example is oversimplified in one important respect: it assumes that all threads have equal allocation rates. In practice, some threads may allocate much more frequently than others. The frequently-allocating threads should get larger TLABs than the slowly-allocating threads -- with equal-sized TLABs, the slow threads are likely to waste too much. (Also, it's less important to decrease the allocation cost of the slowly-allocating threads.)

So what we actually do is translate the expected waste specification into a desired number of TLAB refills per young-gen collection. Another way to look at it is that 5% expected waste means that the TLAB size for thread T should be about 10% of the total amount of allocation done by thread T between two young collections. We can invert this percentage to get a desired number of TLAB refills for the thread: 10, in our example.

So this is what the adaptive TLAB sizing scheme actually does: it computes the desired number of refills from the waste-percentage specification, and then each thread counts its TLAB refills between collections, adjusting its desired TLAB size up or down in order to hit that target.
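Here's a rough sketch of that feedback loop. To be clear, this is my illustration of the idea, not the HotSpot code itself: the real implementation has more machinery (minimum and maximum TLAB sizes, smoothing across collections, per-thread allocation statistics), and the names and the simple proportional adjustment below are assumptions:

  // Per-thread adaptive TLAB sizing sketch: turn the waste target into a
  // desired number of refills per young GC, then nudge the TLAB size so the
  // observed refill count approaches that target.
  class TlabSizePolicy {
      private final int desiredRefills;   // 1 / (2 * wasteFraction), e.g. 10 for a 5% target
      private long tlabSize;              // this thread's current desired TLAB size
      private int refillsThisCycle = 0;

      TlabSizePolicy(double wasteFraction, long initialSize) {
          this.desiredRefills = Math.max(1, (int) Math.round(1.0 / (2.0 * wasteFraction)));
          this.tlabSize = initialSize;
      }

      void noteRefill() {
          refillsThisCycle++;
      }

      // Called at each young GC. More refills than desired means the TLAB is
      // too small (we pay the shared-space allocation cost too often);
      // fewer means it is too large (too much potential waste).
      void noteYoungGc() {
          if (refillsThisCycle > 0) {
              long adjusted = tlabSize * refillsThisCycle / desiredRefills;
              tlabSize = Math.max(1024, adjusted);   // arbitrary floor for this sketch
          }
          refillsThisCycle = 0;
      }

      long desiredTlabSize() {
          return tlabSize;
      }
  }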

Results

For large young generations or small numbers of threads, this work mostly doesn't matter: even if the TLABs are larger than they should be, they're still a small fraction of the total young-generation size, so the maximum waste is still small. Where we saw considerable improvement was in running programs with many threads in small-to-medium young generations. The Volano benchmark is such a program. The actual Volano product is a chat server (apologies if Volano the company actually makes other products); their benchmark simulates their application. (By the way, developers: packaging up a version of your application as an easily-runnable benchmark is a great way of getting VM vendors to make sure your application runs well on their VM!) Volano has (I think) a thread per simulated user, so a run may have several thousand threads, most of which are waiting on I/O most of the time and thus have low allocation rates. Our TLAB sizing system infers a desired TLAB size of only a few hundred bytes for these threads, allowing us to scale well to thousands of threads with only a moderate-sized young generation.

There it is: one more thing we worried about, so that you don't have to!

Friday Oct 07, 2005

"Everybody introduce themselves..."

Greetings! I'm Dave Detlefs. I just joined the HotSpot team, after several years of working with the group and observing the product's development from my role as a researcher in Sun Labs. My work was pretty practical as research goes; my group and I got a fair amount of code transferred into the HotSpot product. We've decided that it makes sense for me to work within the team. Actually, I'll be taking on the role of leader of HotSpot's "technical cabinet," where we make decisions about the future technical direction of the VM.

What do I bring to the table for this job? Well, I hope a pretty broad range of experience in VM implementation techniques. I've been working on various aspects of VM implementation since I joined Sun in 1996. I was a member of the team that developed the Solaris version of Java 1.2, known internally as the "ExactVM." There I worked on GC, locking, and JIT compilation. When Sun made the decision to unify our J2SE VM efforts behind HotSpot, I helped port the ExactVM's internal "GC interface" to HotSpot, paving the way for easy use of multiple collectors, including an adaptation of the low-pause-time garbage collector we had developed. Since then I've had occasion to delve into both the client and server compilers, so I'm now at least somewhat familiar with most of the system.

And an exciting system it is! The HotSpot team's hard work over the past few years has significantly increased the reliability and usability of the system. For example, the "ergonomics" work done by the GC team has pretty much eliminated the need to do painstaking tuning of garbage collection parameters. Because of greater reliability and decreased bug counts, we believe that we now have a really stable base from which to attempt some interesting performance innovations in the next couple of years. I'm really looking forward to helping to make some of these things happen!

In this space, I'll try to tell you interesting things about HotSpot. While I'm fairly familiar with the system, my new role will require me to get more familiar with what we've been doing in the last couple of years. I may also hint, from time to time, about what issues we're worrying about for future development (within the bounds of intellectual property protection :-). I hope you find it useful!
