Wednesday Dec 14, 2005

When Are You Out of Memory?

You know why you get an out-of-memory exception, right? Your live data exceeds the space available in the Java heap. Well, that's very nearly always right. Very, very nearly.

If the Java heap is barely large enough to hold all the live data, the JVM could be doing almost continual garbage collections. For example, if 98% of the data in the heap is live, then only 2% is available for new objects. If the application is using that 2% for temporary objects, it can seem to be humming along quite nicely, but not getting much work done. How can that be? Well, the application runs until it has allocated that 2%, and then a garbage collection happens and recovers that 2%. The application runs along happily allocating and the garbage collector runs along respectfully collecting. Over and over and over. The application will be making forward progress, but maybe oh so slowly. Are you out of memory?

Back in the 1.4.1 days a customer noticed this type of behavior and asked for help in detecting that bad situation. In 1.4.2 the throughput collector started throwing an out-of-memory exception if the VM was spending the vast majority of its time doing garbage collection and not recovering very much space in the Java heap. In 5.0 the implementation was changed some, but the idea was the same. If you are spending way too much time doing garbage collections, you're going to get an out-of-memory. Interestingly enough this identified at least one case in our own usage of Java applications where we were spending most of our time doing garbage collection. We were happy to find it.
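
To make the idea concrete, here is a minimal Java sketch of that kind of check. It is not the VM's implementation. The 98% and 2% thresholds simply echo the example numbers above and are assumptions, and the class and method names are hypothetical. The measurement helper relies on the standard java.lang.management beans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverheadSketch {
    // Assumed thresholds for illustration only; the post gives no exact numbers for the VM's check.
    static final double GC_TIME_LIMIT = 0.98;
    static final double HEAP_FREED_LIMIT = 0.02;

    /** Fraction of elapsed time spent in GC, clamped to [0, 1]. */
    static double gcTimeFraction(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0 || gcMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) gcMillis / elapsedMillis);
    }

    /** True when nearly all time goes to GC and very little heap is recovered. */
    static boolean looksOutOfMemory(double gcFraction, double freedFraction) {
        return gcFraction > GC_TIME_LIMIT && freedFraction < HEAP_FREED_LIMIT;
    }

    /** Cumulative GC time fraction for this JVM since startup. */
    static double measuredGcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return gcTimeFraction(gcMillis, uptimeMillis);
    }

    public static void main(String[] args) {
        System.out.printf("GC time fraction so far: %.4f%n", measuredGcTimeFraction());
    }
}
```

An application that wants to watch for this situation itself could sample measuredGcTimeFraction() periodically and log when it climbs toward 1.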

Why do I bring this up? Well, mostly because it was brought up in our GC meeting this morning. If you're in this situation of spending most of your time in garbage collection, I think you are out of memory and you need a bigger heap. If you don't think that, you can turn off this behavior with the command line flag -XX:-UseGCTimeLimit. May you never need it.

Tuesday Dec 06, 2005

What Are We Thinking?

What's next for GC ergonomics?

Just a friendly warning. This one verges on GC stream-of-consciousness ramblings.

GC ergonomics has been implemented so far in the throughput collector only. We've been thinking about how to extend it to the low pause collector. The low pause collector currently is implemented as a collector that does some of its work while the application continues to run. It's described in

Some of the policies we used in the throughput collector will also be useful for the low pause collector, but because the low pause collector can be running at the same time as the application, there are some intriguing differences. By the way the low pause collector does completely stop the application in order to do some parts of the collection so some of our experience with the throughput collector is directly applicable. On the other hand having this mix of behaviors can be interesting in and of itself.

When we were developing the low pause collector we decided that any parts of the collection that we could do while the application continued to run were good. They were free. If there are spare cycles on the machine, that's almost true. If there aren't spare cycles, then it can get fuzzy. If the collection steals cycles that the application could use, then there is a cost, especially if there is only one processor on the machine. If there is more than one processor on the machine and I'm doing GC, am I stealing cycles from the application? If I steal cycles from another process on the machine, does it become free again? We've been thinking about how to assess the load on a machine and what we should do in different load situations. That type of information may turn out to be input for GC ergonomics.

Another aspect that we have to deal with is the connection between the young generation size and the tenured generation pause times (pauses in collecting the tenured generation, that is). When collecting the tenured generation, we need to be aware of objects in the young generation that can be referencing (and thus keeping alive) objects in the tenured generation. In fact we have to find those objects in the young generation. And the larger the young generation is, the longer it takes to find those objects. With the throughput collector the time to collect the tenured generation is only distantly related to the size of the young generation. With the low pause collector the connection is stronger. If we're trying to meet a pause time goal for a pause that is part of the tenured generation collection, then maybe we should reduce the size of the young generation as well as reduce the size of the tenured generation. But maybe not.

With the throughput collector a collection is started when the application attempts to allocate an object and there is no room left in the Java heap. With the low pause collector we want the collection to finish before we run out of room in the Java heap. So when does the low pause collector start a collection of the tenured generation? Just In Time, hopefully. Starting too early means that some of the capacity of the tenured generation is not used. Starting too late makes the low pause collector not a low pause collector. In the 5.0 release we did some good work to measure how quickly the tenured generation was being filled and used that to decide when to start a collection. It's a nice self contained problem as long as we can start a collection early enough. But if we cannot start a collection in time then we probably need a larger tenured generation. So a failure to JIT/GC needs to feed into GC ergonomics decisions. Well, really we don't actually want to fail to JIT/GC before we expand the tenured generation so there's more to think about. But not right now.
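
The "just in time" decision can be sketched roughly as follows. This is an illustration of the idea, not the low pause collector's code: the class name, the method names, and the safety factor are all hypothetical, and the fill rate and expected collection duration are assumed to come from earlier measurements.

```java
final class JustInTimeStartSketch {
    /** Seconds until the tenured generation fills at the measured fill rate. */
    static double secondsUntilFull(long freeBytes, double fillBytesPerSecond) {
        if (fillBytesPerSecond <= 0) {
            return Double.POSITIVE_INFINITY; // not filling, so no deadline
        }
        return freeBytes / fillBytesPerSecond;
    }

    /**
     * Start a collection once the time left before the generation fills
     * no longer exceeds the expected collection time padded by a safety factor.
     */
    static boolean shouldStartCollection(long freeBytes,
                                         double fillBytesPerSecond,
                                         double expectedCollectionSeconds,
                                         double safetyFactor) {
        return secondsUntilFull(freeBytes, fillBytesPerSecond)
                <= expectedCollectionSeconds * safetyFactor;
    }
}
```

A larger safety factor starts collections earlier and wastes some tenured capacity; a smaller one risks running out of room before the collection finishes, which is the trade-off described above.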

Monday Oct 24, 2005

What Were We Thinking?

There were some decisions made during the development of GC ergonomics that perhaps deserve some explanation. For example,

  • The pause time goal comes first.
  • A pause is a pause is a pause.
  • Ignore the cost of System.gc()'s.

    Why is the pause time goal satisfied first?

    GC ergonomics tries to satisfy a pause time goal before considering any throughput goal. Why not the throughput goal first? I tried both ways with a variety of applications. As one might expect it was not black and white. In the end we chose to consider the goals in this order.

  • Pause time goal
  • Throughput goal
  • Smaller footprint

    The pause time goal definitely has the potential for being the hardest goal to meet. Its dependence on heap size is complicated, and trying to meet the pause time goal without the encumbrances of either of the other goals was easier to think about. If we could meet the pause time goal, then increasing the heap to try to meet a throughput goal felt safer (i.e., the relationship between throughput and heap size is more linear, so it was easier to understand how undoing an increase would get us back to where we started).

    In retrospect it also seems more natural to have the pause time goal (which pushes heap size down) competing with the throughput goal (which pushes heap size up). And only then to have the throughput goal (which again pushes the heap size up) competing with the footprint goal (which, of course, pushes the heap size down).
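
The ordering of the goals can be written down as a small decision function. This is only a sketch of the priority order described above, with hypothetical names and inputs; it says nothing about how much the real policy grows or shrinks the heap or which generation it adjusts.

```java
final class GoalOrderSketch {
    enum Action { SHRINK_FOR_PAUSE, GROW_FOR_THROUGHPUT, SHRINK_FOR_FOOTPRINT }

    /**
     * Considers the goals in priority order: pause time, then throughput,
     * then footprint. Throughput is expressed here as the fraction of time
     * spent in GC, so a smaller fraction means better throughput.
     */
    static Action decide(double avgPauseMillis, double pauseGoalMillis,
                         double gcTimeFraction, double gcTimeFractionGoal) {
        if (avgPauseMillis > pauseGoalMillis) {
            return Action.SHRINK_FOR_PAUSE;      // pause goal missed: push heap size down
        }
        if (gcTimeFraction > gcTimeFractionGoal) {
            return Action.GROW_FOR_THROUGHPUT;   // throughput goal missed: push heap size up
        }
        return Action.SHRINK_FOR_FOOTPRINT;      // both met: try a smaller footprint
    }
}
```

Note that a missed pause goal wins even when throughput is also poor, which is exactly the choice the text explains.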

    A pause is a pause ...

    We talked quite a bit about whether the pause time goal should apply to both the major and minor pause times. The issue was whether it would be effective to shrink the size of the old generation to reduce the major pause times. With a young generation collection you can shrink the heap more easily because there is always some place to put any live objects in the young generation (namely into the old generation). It was clear that reducing the young generation size would reduce the minor collection times (after you've paid the cost of getting the collection started and shutting it down). Well, that's true if you can ignore the fact that more frequent collections give objects less time to die. With the old generation it was much less obvious what would happen. The old generation can only be shrunk down to a size big enough to hold all the live data in the old generation. Also, the amount of free space in the old generation has an effect on the young generation collection, in that a young generation collection may need to copy objects into the old generation. In the end we decided that trying to limit both the major pauses and minor pauses with the pause time goal, while harder, was more meaningful. Would you have accepted the excuse "Yes, we missed the goal but it was a major collection pause not a minor collection pause"?

    System.gc()'s. Just ignore them.

    During development I initially tried to include the costs of System.gc()'s in the calculation of the averages used by GC ergonomics. In calculating the cost of collections, the frequency of collections matters. If you are having collections more often, then the cost of GC is higher. The strategy to reduce that cost is to increase the size of the heap so that collections are less frequent (i.e., since the heap is larger you can do more allocations before having to do another collection). The difficulty with System.gc()'s is that increasing the size of the heap does not in general increase the time between System.gc()'s. I tried to finesse the cost of a System.gc() by considering how full the heap was when the System.gc() happened and extrapolating to how long the interval between collections would have been. After some experimentation I found that picking how to do the extrapolation was basically picking the answer (i.e., what the GC cost would have been). I could tailor an extrapolation to fit one application, but invariably it did not fit some other applications. Basically it was too hard. So GC ergonomics ignores System.gc()'s.
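
A small sketch shows why the extrapolation decides the answer. The cost formula and the linear extrapolation below are assumptions chosen for illustration (the post does not say which formulas were tried), and the names are hypothetical.

```java
final class SystemGcCostSketch {
    /** GC cost as the fraction of wall-clock time spent collecting. */
    static double gcCost(double pauseSeconds, double intervalSeconds) {
        return pauseSeconds / (pauseSeconds + intervalSeconds);
    }

    /**
     * One possible extrapolation: assume the heap fills linearly, so a heap
     * that was only partly full at the System.gc() would have lasted
     * proportionally longer before a collection was really needed.
     * heapFullFraction is expected to be in (0, 1].
     */
    static double linearInterval(double observedIntervalSeconds, double heapFullFraction) {
        return observedIntervalSeconds / heapFullFraction;
    }
}
```

With a 1 second pause and 9 seconds between System.gc() calls, the unadjusted cost is 10%. If the heap was a quarter full and the linear extrapolation is applied, the interval becomes 36 seconds and the cost drops to about 2.7%. A different extrapolation would give yet another number, which is the problem described above.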

Tuesday Oct 04, 2005

Where Are the Sharp Edges?

    In general GC ergonomics works best for an application that has reached a steady state behavior in terms of its allocation pattern. Or at least it is not changing its allocation pattern quickly. GC ergonomics measures the pause times and throughput of the application and changes the size of the heap based on those measurements.

    The measurements of pause time and throughput are kept in terms of a weighted average where (as one would expect) the most recent measurements are weighted more heavily. By using a weighted average GC ergonomics is not going to turn on a dime in response to a change in behavior by the application, but it is also not going to go flying off in a wrong direction because of normal variations in behavior.
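
One common way to build such an average is an exponentially weighted one, sketched below. This is an assumption about the general technique, not a copy of the VM's code; the class name and the weight parameter are hypothetical.

```java
final class WeightedAverageSketch {
    private final double weight; // share given to the newest sample, in (0, 1]
    private double average;
    private boolean hasSample;

    WeightedAverageSketch(double weight) {
        if (!(weight > 0.0 && weight <= 1.0)) {
            throw new IllegalArgumentException("weight must be in (0, 1]");
        }
        this.weight = weight;
    }

    /** Folds a new measurement in; older samples fade geometrically. */
    void sample(double value) {
        average = hasSample ? (1.0 - weight) * average + weight * value : value;
        hasSample = true;
    }

    double average() {
        return average;
    }
}
```

With a weight of 0.25, a pause time that jumps from 100 ms to 200 ms moves the average to 125 ms after one sample and 143.75 ms after two. That is the lag the text describes: a sustained change is followed, while a single outlier is damped.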

    If past behavior is not a good indicator of future performance, then GC ergonomics can lag behind in its decision making. If a change is just an occasional bump in the road, GC ergonomics will catch up. If behavior is all over the map, well, what can I say.

    The easiest way to get into trouble with GC ergonomics is to specify a pause time goal that is not reachable. Typically what happens is that GC ergonomics reduces the pause times by reducing the size of the heap. As the heap is shrunk the frequency of collections goes up and throughput goes down. GC ergonomics is willing to drive throughput to nearly zero (by doing collections nearly all the time) in order to reach the pause time goal. I tell people to run without a pause time goal initially and see how large the collection pauses get. That gives a baseline for experimenting with a pause time goal. Then have a little fun and try some pause times.

    Another thing you should be aware of is that GC ergonomics is going to run at every collection. If your application has settled into a stable steady state, GC ergonomics is still looking to see if anything is changing so it can adjust. It does cost you some cycles, but I don't think it's significant. Let me put it this way. I've never seen GC ergonomics code show up in a significant way on performance profiles. This is probably less a sharp edge than a mild poke in the ribs. If you think that your application is really not going to be changing its behavior after it has settled in and want those last few cycles, run with GC ergonomics until your application has reached its steady state and look to see how the heap is sized. You'll have to pay attention to how the generations are sized also. Then select those sizes on the command line and turn GC ergonomics off. At least for most of you that should be plenty good. If performance is not quite as high and you don't already know about survivor spaces, you may have to learn about them. The document "Tuning Garbage Collection with the 5.0 Java Virtual Machine" should help. It can be found under the URL below (same one as in "Magic"). If performance is actually better, rejoice and let me know how we can be doing better.

Monday Sep 26, 2005

It's Not Magic

    In our J2SE (tm) 1.5.0 release we added a new way of tuning the Java(tm) heap which we call "garbage collector (GC) ergonomics". This was added only to the parallel GC collector. You may also have seen it referred to as "Smart Tuning" or "Simplified Tuning". GC ergonomics allows a user to tune the Java heap by specifying a desired behavior for the application. These behaviors are a maximum pause time goal and a throughput goal.

    So what is GC ergonomics? Prior to J2SE 1.5.0, if you wanted to tune the Java Virtual Machine (JVM)(tm) for an application you typically did it by trial-and-error. You would run the JVM on your application without changing any parameters and see how it ran. If the throughput of the application was not as high as you wanted, the usual solution was to increase the heap size. With a larger heap, collections happen less often, so the cost of garbage collection decreases as a percentage of the total execution time. But as you increase the size of the heap, often the length of the garbage collections increases. Since the garbage collector pauses all application threads to do a collection, the application would see longer and longer pauses as you chose larger and larger heaps. If the pauses became too long for your application, then you would have to reduce the size of the heap. You usually have to choose a compromise between pause times and throughput.

    With GC ergonomics in J2SE 1.5.0 you choose a pause time goal and a throughput goal and let the JVM increase or decrease the size of the heap to try to meet those goals. On big machines a larger maximum heap size is chosen as a default. GC ergonomics only grows the heap enough to meet your goals so the maximum heap size is not necessarily used. Sometimes you might have to increase the maximum size of the heap if the default maximum size is too small.
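
For readers wondering how the two goals are expressed as numbers: the 5.0 tuning documentation, not this post, describes the flags -XX:MaxGCPauseMillis=<nnn> for the pause time goal and -XX:GCTimeRatio=<nnn> for the throughput goal, where the ratio of GC time to application time is 1 / (1 + nnn). The sketch below only performs that conversion; the class and method names are hypothetical.

```java
final class ThroughputGoalSketch {
    /**
     * Converts a GCTimeRatio-style value N into the targeted fraction of
     * total time spent in garbage collection, 1 / (1 + N).
     */
    static double gcTimeFractionGoal(int gcTimeRatio) {
        if (gcTimeRatio < 0) {
            throw new IllegalArgumentException("ratio must be non-negative");
        }
        return 1.0 / (1.0 + gcTimeRatio);
    }
}
```

So a ratio of 99 asks for no more than 1% of time in GC, and a ratio of 19 allows 5%.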

    So how does this work? Actually GC ergonomics does pretty much what you would do to tune the heap. As I say in the title, it's not magic. But it does have the benefit of being able to tune dynamically during the execution of the application. GC ergonomics

  • Measures the performance (both throughput and pause times) of your application.

  • Compares the performance against the goals.

  • Decreases the heap size to shorten pause times, OR

  • Increases the heap size to get fewer collections.
  • If both the pause time goal and the throughput goal are being met, GC ergonomics will decrease the size of the heap to try and minimize the application's footprint.

    GC ergonomics tries to meet your goals but there are no guarantees that it can. For example, a maximum pause time of zero would be nice, but it's not going to happen. Can you tune the heap better than GC ergonomics? Probably yes. Is it worth your time to do it? And to keep it tuned as your circumstances change? You'll have to tell us.

    For more information on GC ergonomics, please see "Ergonomics in the 5.0 Java Virtual Machine" under



