
The Unspoken - The Why of GC Ergonomics

Do you use GC ergonomics, -XX:+UseAdaptiveSizePolicy, with the UseParallelGC collector? The gist of GC ergonomics for that collector is that it tries to grow or shrink the heap to meet a specified goal. The goals that you can choose are maximum pause time and/or throughput. Don't get too excited there. I'm speaking about UseParallelGC (the throughput collector) so there are definite limits to what pause goals can be achieved. When you say out loud "I don't care about pause times, give me the best throughput I can get" and then say to yourself "Well, maybe 10 seconds really is too long", then think about a pause time goal. By default there is no pause time goal and the throughput goal is high (98% of the time doing application work and 2% of the time doing GC work). You can get more details on this in my very first blog, GC ergonomics. G1 (UseG1GC) has its own version of GC ergonomics, but I'll be talking only about the UseParallelGC version.

If you use this option and want to know what it (GC ergonomics) was thinking, try -XX:AdaptiveSizePolicyOutputInterval=1. This will print out information every i-th GC (above, i is 1) about what GC ergonomics is trying to do. For example,

    UseAdaptiveSizePolicy actions to meet *** throughput goal ***
                           GC overhead (%)
        Young generation:     16.10 (attempted to grow)
        Tenured generation:    4.67 (attempted to grow)
        Tenuring threshold: (attempted to decrease to balance GC costs) = 1

GC ergonomics tries to meet (in order)

    1. a pause time goal
    2. a throughput goal
    3. minimum footprint

The first line says that it's trying to meet the throughput goal.

    UseAdaptiveSizePolicy actions to meet *** throughput goal ***

This run has the default pause time goal (i.e., no pause time goal) so it is trying to reach a 98% throughput. The lines

    Young generation:     16.10 (attempted to grow)
    Tenured generation:    4.67 (attempted to grow)

say that we're currently spending about 16% of the time doing young GC's and about 5% of the time doing full GC's.
These percentages are a decaying, weighted average (earlier contributions to the average are given less weight). The source code is available as part of the OpenJDK so you can take a look at it if you want the exact definition. GC ergonomics is trying to increase the throughput by growing the heap (so says the "attempted to grow").

The last line

    Tenuring threshold: (attempted to decrease to balance GC costs) = 1

says that the ergonomics is trying to balance the GC times between young GC's and full GC's by decreasing the tenuring threshold. During a young collection the younger objects are copied to the survivor spaces while the older objects are copied to the tenured generation. Younger and older are defined by the tenuring threshold. If the tenuring threshold is 4, an object that has survived fewer than 4 young collections (and has remained in the young generation by being copied to the part of the young generation called a survivor space) is younger and is copied again to a survivor space. If it has survived 4 or more young collections, it is older and gets copied to the tenured generation. A lower tenuring threshold moves objects more eagerly to the tenured generation and, conversely, a higher tenuring threshold keeps copying objects between survivor spaces longer. The tenuring threshold varies dynamically with the UseParallelGC collector. That is different from our other collectors, which have a static tenuring threshold. GC ergonomics tries to balance the amount of work done by the young GC's and the full GC's by varying the tenuring threshold. Want more work done in the young GC's?
Keep objects longer in the survivor spaces by increasing the tenuring threshold.

This is an example of the output when GC ergonomics is trying to achieve a pause time goal:

    UseAdaptiveSizePolicy actions to meet *** pause time goal ***
                           GC overhead (%)
        Young generation:     20.74 (no change)
        Tenured generation:   31.70 (attempted to shrink)

The pause time goal was set at 50 millisecs and the last GC was

    0.415: [Full GC (Ergonomics) [PSYoungGen: 2048K->0K(26624K)] [ParOldGen: 26095K->9711K(28992K)] 28143K->9711K(55616K), [Metaspace: 1719K->1719K(2473K/6528K)], 0.0758940 secs] [Times: user=0.28 sys=0.00, real=0.08 secs]

The full collection took about 76 millisecs, so GC ergonomics wants to shrink the tenured generation to reduce that pause time. The previous young GC was

    0.346: [GC (Allocation Failure) [PSYoungGen: 26624K->2048K(26624K)] 40547K->22223K(56768K), 0.0136501 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]

so the pause time there was about 14 millisecs and no changes are needed.

If trying to meet a pause time goal, the generations are typically shrunk. With a pause time goal in play, watch the GC overhead numbers and you will usually see the cost of setting a pause time goal (i.e., throughput goes down). If the pause time goal is too low, you won't achieve it and you will spend all your time doing GC.

GC ergonomics is meant to be simple because it is meant to be used by anyone. It was not meant to be mysterious, and so this output was added. If you don't like what GC ergonomics is doing, you can turn it off with -XX:-UseAdaptiveSizePolicy, but be forewarned that you then have to manage the sizes of the generations explicitly. If UseAdaptiveSizePolicy is turned off, the heap does not grow. The size of the heap (and the generations) at the start of execution is always the size of the heap. I don't like that and tried to fix it once (with some help from an OpenJDK contributor) but it unfortunately never made it out the door. I still have hope though.

Just a side note.
With the default throughput goal of 98%, the heap often grows to its maximum value and stays there. Definitely reduce the throughput goal if footprint is important. Start with -XX:GCTimeRatio=4 for a more modest throughput goal (20% of the time spent in GC). A higher value means a smaller amount of time in GC (as the throughput goal).
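To make the arithmetic behind GCTimeRatio concrete: the flag expresses the goal as a ratio of application time to GC time, so the targeted GC fraction works out to 1/(1 + GCTimeRatio). A quick sketch (Python here, since the post itself has no code):

```python
def gc_time_fraction(gc_time_ratio):
    """Targeted fraction of total time spent in GC for
    -XX:GCTimeRatio=N, i.e. 1 / (1 + N)."""
    return 1.0 / (1 + gc_time_ratio)

# GCTimeRatio=4, the starting point suggested above, targets 20% GC time;
# the 2% GC / 98% application split mentioned earlier corresponds to
# GCTimeRatio=49 under this formula.
print(gc_time_fraction(4))   # 0.2
print(gc_time_fraction(49))  # 0.02
```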



The Unspoken - Application Times

Sometimes it is interesting to know how long your application runs between garbage collections. That can be calculated from the GC logs, but a convenient way to see that information is through the command line flags -XX:+PrintGCApplicationStoppedTime and -XX:+PrintGCApplicationConcurrentTime. Adding these to your command line produces output such as this:

    Application time: 0.2863875 seconds
    Total time for which application threads were stopped: 0.0225087 seconds
    Application time: 0.1476791 seconds
    Total time for which application threads were stopped: 0.0255697 seconds

The application ran (reported in the first line) for about 287 milliseconds and then was stopped for about 22 milliseconds (reported in the second line). The flags can be used separately or together.

Add -XX:+PrintGCTimeStamps and -XX:+PrintGCDetails and you get

    Application time: 0.1325032 seconds
    20.149: [GC (Allocation Failure) 20.149: [ParNew: 78656K->8704K(78656K), 0.0221598 secs] 225454K->158894K(253440K), 0.0222106 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
    Total time for which application threads were stopped: 0.0224188 seconds

When the flags were first implemented, the "stopped" time really was the GC stopped time, but it was later changed to include the complete stopped time. There is other work that may be done during a "safepoint" (a period during which the application is halted and the VM can do some work without the application changing things). All of that is now included in the "stopped" time. In the example above I would say that GC was the only thing happening during the safepoint (the GC time being 0.0222106 secs and the complete time being nearly the same at 0.0224188 secs). I can't speak authoritatively about what else can happen because I just don't know, but one example I've heard about is compiled code deoptimization.
When the code for a JIT'ed method needs to be thrown away (typically because some assumption made during the compilation has been violated), the VM has to switch from the compiled code to interpreting that method again. That switch is done at a safepoint (but don't quote me on that because it's not my area).
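Those "Application time" / "stopped" pairs can be scraped to estimate what fraction of wall-clock time the application was stopped. A rough sketch (assuming the exact line format shown above):

```python
import re

def stopped_fraction(log):
    """Return stopped_time / (run_time + stopped_time) from
    PrintGCApplicationStoppedTime / ConcurrentTime output."""
    run = sum(map(float, re.findall(
        r"Application time: ([\d.]+) seconds", log)))
    stopped = sum(map(float, re.findall(
        r"threads were stopped: ([\d.]+) seconds", log)))
    return stopped / (run + stopped)

log = """\
Application time: 0.2863875 seconds
Total time for which application threads were stopped: 0.0225087 seconds
Application time: 0.1476791 seconds
Total time for which application threads were stopped: 0.0255697 seconds
"""
print("stopped %.1f%% of the time" % (100 * stopped_fraction(log)))
```

For the two cycles shown above this works out to roughly 10% of the time spent stopped.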



The Unspoken - CMS and PrintGCDetails

What follows is an example of the GC logging output with CMS (UseConcMarkSweepGC) and PrintGCDetails, plus some explanation of the output. The "CMS-initial-mark" indicates the start of a CMS concurrent collection.

    [GC [1 CMS-initial-mark: 463236K(515960K)] 464178K(522488K), 0.0018216 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

463236K above is the space occupied by objects in the old (CMS) generation at the start of the collection. Not all those objects are necessarily alive.

515960K is the total size of the old (CMS) generation. This value changes if the generation grows or shrinks.

464178K is the sum of the space occupied by objects in the young generation and the old (CMS) generation.

522488K is the total size of the heap (young generation plus old (CMS) generation).

0.0018216 secs is the duration of the initial mark pause. The initial mark is a stop-the-world phase.

After the initial mark completes, the CMS concurrent mark starts. The concurrent mark phase is a concurrent phase and can be interrupted by young generation collections. In this case ParNew (UseParNewGC) is being used to collect the young generation. When a ParNew collection is ready to start, it raises a flag, and the CMS collector yields execution to ParNew and waits for ParNew to finish before resuming.

    [GC[ParNew: 6528K->702K(6528K), 0.0130227 secs] 469764K->465500K(522488K), 0.0130578 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]

6528K is the space in the young generation occupied by objects at the start of the ParNew collection. Not all those objects are necessarily alive.

702K is the space occupied by live objects at the end of the ParNew collection.

6528K is the total space in the young generation.

0.0130227 secs is the pause duration for the ParNew collection.

469764K is the space occupied by objects in the young generation and the old (CMS) generation before the collection starts.

465500K is the space occupied by live objects in the young generation plus all objects in the old (CMS) generation.
For a ParNew collection, only the liveness of the objects in the young generation is known, so the objects in the old (CMS) generation may be live or dead.

522488K is the total space in the heap.

[Times: user=0.05 sys=0.00, real=0.01 secs] is like the output of the time(1) command. The ratio user / real gives you an approximation of the speedup you're getting from the parallel execution of the ParNew collection. The sys time can be an indicator of system activity that is slowing down the collection. For example, if paging is occurring, sys will be high.

    [GC[ParNew: 6526K->702K(6528K), 0.0136447 secs] 471324K->467077K(522488K), 0.0136804 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]
    [GC[ParNew: 6526K->702K(6528K), 0.0161873 secs] 472901K->468830K(522488K), 0.0162411 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
    [GC[ParNew: 6526K->702K(6528K), 0.0152107 secs] 474654K->470569K(522488K), 0.0152543 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
    ...
    [GC[ParNew: 6526K->702K(6528K), 0.0144212 secs] 481073K->476809K(522488K), 0.0144719 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]

This is the completion of the concurrent marking phase. After this point the precleaning starts.

    [CMS-concurrent-mark: 1.039/1.154 secs] [Times: user=2.32 sys=0.02, real=1.15 secs]

The 1.039 is the elapsed time for the concurrent marking.
The 1.154 is the wall clock time. The "Times" output is less meaningful because it is measured from the start of the concurrent marking and includes more than just the work done for the concurrent marking.

This is the end of the precleaning phase.

    [CMS-concurrent-preclean: 0.006/0.007 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

The format of the precleaning phase output is analogous to that of the concurrent marking phase.

    [GC[ParNew: 6526K->702K(6528K), 0.0141896 secs] 482633K->478368K(522488K), 0.0142292 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
    [GC[ParNew: 6526K->702K(6528K), 0.0162142 secs] 484192K->480082K(522488K), 0.0162509 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]

The remark phase is scheduled so that it does not occur back-to-back with a ParNew, so as not to appear to be a pause that is the sum of the ParNew pause and the remark pause. A second precleaning phase is started and is aborted when the remark phase is ready to start. Aborting this second precleaning phase is the expected behavior; that it was aborted is not an indication of an error. Since the remark phase is waiting, why not preclean, but don't delay the remark for the sake of precleaning.

    [CMS-concurrent-abortable-preclean: 0.022/0.175 secs] [Times: user=0.36 sys=0.00, real=0.17 secs]

This is the remark phase.

    [GC[YG occupancy: 820 K (6528 K)][Rescan (parallel) , 0.0024157 secs][weak refs processing, 0.0000143 secs][scrub string table, 0.0000258 secs] [1 CMS-remark: 479379K(515960K)] 480200K(522488K), 0.0025249 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

[YG occupancy: 820 K (6528 K)] shows that at the start of the remark the occupancy (the total space used by allocated objects in the young generation) is 820K out of a total of 6528K. The length of the remark pause depends to some degree on the occupancy of the young generation, so we print it out.

The "Rescan" completes the marking of live objects while the application is stopped.
In this case the rescan was done in parallel and took 0.0024157 secs. "weak refs processing" and "scrub string table" are tasks done during the remark. Those tasks took 0.0000143 secs and 0.0000258 secs, respectively. If those numbers dominate the remark pause time, they can explain unexpectedly large pauses. Not that they cannot legitimately be large; just that generally they are not, and when they are, take note. If the weak refs processing is dominant, you might be able to cut that time down by using parallel reference processing (-XX:+ParallelRefProcEnabled). No comment on the case when scrub string table is dominant; I've never had to deal with it.

The concurrent sweeping phase starts at the end of the remark.

    [GC[ParNew: 6526K->702K(6528K), 0.0133250 secs] 441217K->437145K(522488K), 0.0133739 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
    [GC[ParNew: 6526K->702K(6528K), 0.0125530 secs] 407061K->402841K(522488K), 0.0125880 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
    ...
    [GC[ParNew: 6526K->702K(6528K), 0.0121435 secs] 330503K->326239K(522488K), 0.0121996 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]

The sweep phase ends here.

    [CMS-concurrent-sweep: 0.756/0.833 secs] [Times: user=1.68 sys=0.01, real=0.83 secs]

The format above is analogous to that of the concurrent marking. And lastly, the reset phase.

    [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]

It is not expected that another CMS concurrent collection starts before several ParNew collections have been done. If another CMS collection starts immediately, check how full the old (CMS) generation is.
If the old (CMS) generation is close to being full immediately after the end of a collection, the heap might be too small.

I took this log with an early build of JDK8

    java version "1.8.0-ea"
    Java(TM) SE Runtime Environment (build 1.8.0-ea-b73)
    Java HotSpot(TM) Server VM (build 25.0-b14, mixed mode)

and used the flags

    -server -XX:+UseConcMarkSweepGC -XX:NewRatio=8 -XX:-PrintGCCause -XX:ParallelGCThreads=4 -Xmx1g -XX:+PrintGCDetails

I usually also use -XX:+PrintGCTimeStamps to get time stamps in the logs and use -XX:+PrintGCDateStamps if I want to correlate the GC output to application events.
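As a small worked example of the user / real ratio described above, the "[Times: ...]" entries can be scraped from a log and turned into approximate parallel speedups (a sketch only; the regex assumes the exact format shown in this post):

```python
import re

def parnew_speedups(log):
    """Return user/real for each '[Times: ...]' entry; as described
    above, this approximates the ParNew parallel speedup."""
    times = re.findall(r"user=([\d.]+) sys=[\d.]+, real=([\d.]+) secs", log)
    return [float(u) / float(r) for u, r in times if float(r) > 0]

log = ("[GC[ParNew: 6528K->702K(6528K), 0.0130227 secs] "
       "469764K->465500K(522488K), 0.0130578 secs] "
       "[Times: user=0.05 sys=0.00, real=0.01 secs]")
print(parnew_speedups(log))  # roughly 5x from the parallel GC threads
```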



The Unspoken - Phases of CMS

CMS (-XX:+UseConcMarkSweepGC), or the concurrent mark sweep GC, could have been called the mostly concurrent mark sweep GC, and here's why. These are the phases of a CMS concurrent collection.

1. Initial mark. This is the start of a CMS collection. It is a stop-the-world phase: all the application threads are stopped and CMS scans the application threads (e.g., the thread stacks) for object references.

2. Concurrent mark. This is a concurrent phase. CMS restarts all the application threads and then, starting from the objects found in 1., it finds all the other live objects. Almost (see the remark phase).

3. Precleaning. This is a concurrent phase. This phase is an optimization and you don't really need to understand what it does, so feel free to skip to 4. While CMS is doing concurrent marking (2.), the application threads are running and they can be changing the objects they are using. CMS needs to find any of these changes and ultimately does that during the remark phase. However, CMS would like to discover as much as possible concurrently, and so it has the precleaning phase. CMS has a way of recording when an application thread makes a change to the objects that it is using. Precleaning looks at the records of these changes and marks live objects as live. For example, if thread AA has a reference to an object XX and passes that reference to thread BB, then CMS needs to understand that BB is keeping XX alive now, even if AA no longer is.

4. Remark. The remark phase is a stop-the-world phase. CMS cannot correctly determine which objects are alive (mark them live) if the application is running concurrently and keeps changing what is live. So CMS stops all the application threads during the remark phase so that it can catch up with the changes the application has been making.

5. Sweep. This is a concurrent phase. CMS looks at all the objects and adds newly dead objects to its freelists.
CMS does freelist allocation, and it is during the sweep phase that those freelists get repopulated.

6. Reset. This is a concurrent phase. CMS cleans up its state so that it is ready for the next collection.

That's the important part of this blog. I'm going to ramble now. I recently talked to some users who, of course, knew how their application worked but who also didn't know so much about how garbage collection worked. Why should they? That's my job. During our chat I came to appreciate much more the fact that there were things that I habitually left unspoken. So, here's one of the unspoken truths that I forget to explain: the phases of CMS.
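The six phases above can be summarized in a small table (just a restatement of the description, not HotSpot code):

```python
# (phase, stop-the-world?), in the order described above
CMS_PHASES = [
    ("initial mark",    True),   # scan thread stacks etc. for roots
    ("concurrent mark", False),  # trace live objects from those roots
    ("preclean",        False),  # concurrently catch up on mutator changes
    ("remark",          True),   # finish marking with the app stopped
    ("sweep",           False),  # return dead space to the freelists
    ("reset",           False),  # clean up for the next collection
]

stop_the_world = [name for name, stw in CMS_PHASES if stw]
print(stop_the_world)  # only the initial mark and the remark pause the app
```

That the application is paused only for those two (short) phases is exactly why "mostly concurrent" is the honest name.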



Really? iCMS? Really?

When I use the term iCMS, I'm referring to the incremental mode of CMS (-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode). This is a mode of CMS where the concurrent phases (concurrent marking and concurrent sweeping) are run in short increments (the collector does some work, yields and sleeps for a while, then runs again). It was implemented in CMS primarily for use on platforms with a single hardware thread. The concurrent phases are typically long (think seconds, not milliseconds). If CMS hogged the single hardware thread for several seconds, the application would not execute during those several seconds and would in effect experience a stop-the-world pause. With iCMS the application makes progress during the concurrent phases. That, of course, is good. The down side is that there is more overhead to iCMS (e.g., all that stopping and starting), more floating garbage (objects that die during a CMS collection that we can't tell have died until the next collection) and more flags to tune (just what you wanted, right?). Also, it's just more complex (more things to go wrong). We put quite a bit of effort into teaching iCMS how to do the work in pieces and still finish the collection before the heap fills up, but it's not perfect.

Are you using iCMS? Does your platform only have one hardware thread? Ok, maybe two hardware threads is reason enough if you don't want half your cycles eaten by CMS during a long concurrent phase. But otherwise, I'd suggest you try without the incremental mode. There are exceptions to every rule, but I think you'll be happier. Really.
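The "do some work, yield, run again" pattern is easy to picture with a toy model (an illustration of incremental duty-cycling, not the actual CMS code; all the numbers are made up):

```python
def incremental_phase(total_work, slice_work, duty_cycle):
    """Model a concurrent phase run in increments: each scheduling
    slice does slice_work * duty_cycle units of GC work, then the
    collector yields so the application can run. Returns the number
    of slices needed to finish the phase."""
    done, slices = 0.0, 0
    while done < total_work:
        done += slice_work * duty_cycle  # bounded chunk of GC work
        slices += 1                      # then yield the hardware thread
    return slices

# Halving the duty cycle frees cycles for the app but stretches the phase:
print(incremental_phase(100, 10, 1.0))  # 10 slices, app starved meanwhile
print(incremental_phase(100, 10, 0.5))  # 20 slices, app runs in between
```

The stretched phase is where the extra floating garbage comes from: the longer the collection takes, the more objects die mid-collection.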



Help for the NUMA Weary

When you think about running on a machine with a non-uniform memory architecture (NUMA), do you think, "Cool, some memory is really close"? Or do you think, "Why are my memory latency choices bad, worse and worst"? Or are you like me and try not to think about it at all? Well, for all of the above, this option rocks.

Executive summary: there is an option which attempts to improve the performance of an application on a NUMA machine by increasing the application's use of lower latency memory. This is done by creating per-cpu pools of memory in the young generation. A thread AA that runs on a cpu XX will get objects allocated out of the pool for XX. When cpu XX first touches a new page of memory, Solaris tries to assign to XX memory that is closer to XX. Additionally, if a thread AA has run on XX, Solaris tries to keep AA running on XX. When this all works as we hope, AA is accessing the memory that is closer to it more of the time. This option does not improve GC performance but improves the performance of application threads. I'll say something about support for non-Solaris platforms at the end.

If you're asking yourself whether you should care about NUMA, two examples of Sun servers with NUMA architectures are the Sun Sparc E6900 and the AMD Opteron X4600. Sun's early chip multithreading (CMT) boxes (T1000, T2000, T5120 and T5220) are not NUMA, but the later CMT T5140 and T5240 are.

In the above summary I used the term "per-cpu pools" for brevity when I should really have used the term "per-node pools" to be more precise. Nodes in this context have, at a minimum, one or more cpu's, local memory and interconnect. I'll try to be more precise below, but if you see node and think cpu, it's close enough.

On a NUMA system there is some memory that is closer (local) to the cpu's in a node and some memory that is farther away (remote).
Local and remote are relative to a node. On Solaris 10 the OS tries to improve the performance of a thread AA executing on a node RR by increasing AA's use of memory local to RR. This is achieved by a "first touch" policy (my words, not a technical term). AA can make a call to get memory, but physical memory is not committed to AA until the memory is used (first touched). When AA, executing on RR, first touches memory, Solaris tries to assign it memory that is local to RR. If there is no available local memory, AA will get memory remote from RR. The UseNUMA feature takes advantage of this policy to get better locality for an application thread that allocates objects and then uses them (as opposed to an architecture where one thread allocates objects and some other thread uses them).

In JDK6 update 2 we added -XX:+UseNUMA for the throughput collector (UseParallelGC). When you turn this feature on, the JVM divides the young generation into separate pools, one pool for each node. When AA allocates an object, the JVM looks to see what node AA is on and then allocates the object from the pool for that node. In the diagram, AA is running on RR and RR has its pool in the young generation. Combine this with a first touch policy, and the memory in the pool for RR is first touched by a thread running on RR and so is likely to be local to RR. And, as I mentioned above, if AA has run on RR, Solaris will try to keep AA executing on RR. So the best case is that you have AA accessing local memory most of the time. It may sound a bit like wishful thinking, but we've seen very nice performance improvements on some applications.

Contrast this with allocation without per-node pools. As a thread does allocations, it marches deeper and deeper into the young generation.
A thread actually does allocations out of thread local buffers (TLAB's), but even so, the TLAB's for a thread are generally scattered throughout the young generation, and it is even more wishful thinking to expect those TLAB's to all be mapped to local memory.

This part is extra credit and you don't really need to know about it to use UseNUMA. Solaris has the concept of locality groups, or lgroups. You can read more about lgroups at Locality Group APIs. A node has an lgroup, and within that lgroup are the resources that are closer to the node. There is actually a hierarchy of lgroups, but let's talk as if a node has an lgroup that has its closest resources (local resources) and the resources farther away are just someplace else (remote resources). Thread AA running on RR can ask what lgroup MM it is in and can ask if a page of memory is in MM. This type of information is used by the page scanner that I describe below.

There are a couple of caveats. The young generation is divided into per-node pools. If any of these pools is exhausted, a minor collection is done. That's potentially wasteful of memory, so to ameliorate that, the sizes of the pools are adjusted dynamically so that threads that do more allocation get larger pools.

In situations where memory is tight and there are several processes running on the system, the per-node pools can be a mixture of local and remote memory. That simply comes about when RR first touches a page in its pool and there is no local memory available; it just gets remote memory. To try to increase the amount of local memory in the pool, there is a scanner that looks to see if a page in a pool for RR is in the lgroup MM for RR. If the page is not, the scanner releases that page back to the OS in the hope that, when AA on RR again first touches that page, the OS will allocate memory in MM for that page. Recall that eden in the young generation is usually empty after a minor collection, so these pages can be released.
The scanner also looks for small pages in the pool. On Solaris you can have a mixture of pages of different sizes in a pool, and performance can be improved by using more large pages and fewer small pages. So the scanner also releases small pages in the hope that it will be allocated a large page the next time it uses the memory. This scanning is done after a collection and only scans a certain number of pages (NUMAPageScanRate) per collection so as to bound the amount of scanning done per collection.

To review, if you have

- thread AA running on node RR and the JVM allocating objects for AA in the pool for RR,
- Solaris mapping memory for the pool for RR in the lgroup MM (i.e., local to RR) based on first touch, and
- Solaris keeping thread AA running on node RR,

then your application will run faster. And all you have to do is turn on -XX:+UseNUMA.

An implementation on Linux is in the works and will be in an upcoming update of jdk6. The API's are different for binding the per-node pools to local memory (e.g., the JVM requests that pages be bound rather than relying on first touch), but you really don't need to know about any differences. Just turn it on. We've looked at an implementation for Windows platforms and have not figured out how to do it yet.

If you would like to know a little more about dealing with NUMA machines, you might find this useful: Increasing Application Performance on NUMA Architectures
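The dynamic pool sizing caveat above can be pictured with a toy model: give each node a share of eden proportional to how much its threads allocated since the last collection (an illustration only; the function name and numbers are made up, and this is not the JVM's actual policy):

```python
def resize_pools(eden_bytes, alloc_by_node):
    """Split eden into per-node pools sized in proportion to each
    node's recent allocation, so heavily allocating nodes get more."""
    total = sum(alloc_by_node.values())
    if total == 0:  # no allocation data: split eden evenly
        even = eden_bytes // len(alloc_by_node)
        return {node: even for node in alloc_by_node}
    return {node: eden_bytes * alloc // total
            for node, alloc in alloc_by_node.items()}

# Node 0's threads allocated 3x as much as node 1's, so its pool is 3x larger.
pools = resize_pools(64 * 1024 * 1024, {0: 300, 1: 100})
print(pools)  # {0: 50331648, 1: 16777216}
```

Without some adjustment like this, one busy node could exhaust its pool and force a minor collection while the other pools sit mostly empty.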



Our Collectors

I drew this diagram on a white board for some customers recently. They seemed to like it (or were just being very polite) so I thought I'd redraw it for your amusement.

Each blue box represents a collector that is used to collect a generation. The young generation is collected by the blue boxes in the yellow region and the tenured generation is collected by the blue boxes in the gray region.

"Serial" is a stop-the-world, copying collector which uses a single GC thread.

"ParNew" is a stop-the-world, copying collector which uses multiple GC threads. It differs from "Parallel Scavenge" in that it has enhancements that make it usable with CMS. For example, "ParNew" does the synchronization needed so that it can run during the concurrent phases of CMS.

"Parallel Scavenge" is a stop-the-world, copying collector which uses multiple GC threads.

"Serial Old" is a stop-the-world, mark-sweep-compact collector that uses a single GC thread.

"CMS" is a mostly concurrent, low-pause collector.

"Parallel Old" is a compacting collector that uses multiple GC threads.

Using the -XX flags for our collectors for jdk6:

UseSerialGC is "Serial" + "Serial Old"

UseParNewGC is "ParNew" + "Serial Old"

UseConcMarkSweepGC is "ParNew" + "CMS" + "Serial Old". "CMS" is used most of the time to collect the tenured generation. "Serial Old" is used when a concurrent mode failure occurs.

UseParallelGC is "Parallel Scavenge" + "Serial Old"

UseParallelOldGC is "Parallel Scavenge" + "Parallel Old"

FAQ

1) UseParNewGC and UseParallelGC both collect the young generation using multiple GC threads. Which is faster? There's no one correct answer for this question. Mostly they perform equally well, but I've seen one do better than the other in different situations. If you want to use GC ergonomics, it is only supported by UseParallelGC (and UseParallelOldGC), so that's what you'll have to use.

2) Why don't "ParNew" and "Parallel Old" work together?
"ParNew" is written in a style where each generation being collected offers certain interfaces for its collection. For example, "ParNew" (and "Serial") implements space_iterate(), which will apply an operation to every object in the young generation. When collecting the tenured generation with either "CMS" or "Serial Old", the GC can use space_iterate() to do some work on the objects in the young generation. This makes the mix-and-match of collectors work but adds some burden to the maintenance of the collectors and to the addition of new collectors. And the burden seems to be quadratic in the number of collectors. Alternatively, "Parallel Scavenge" (at least with its initial implementation before "Parallel Old") always knew how the tenured generation was being collected and could call directly into the code in the "Serial Old" collector. "Parallel Old" is not written in the "ParNew" style, so matching it with "ParNew" doesn't just happen without significant work. By the way, we would like to match "Parallel Scavenge" only with "Parallel Old" eventually and clean up any of the ad hoc code needed for "Parallel Scavenge" to work with both. Please don't think too much about the examples I used above. They are admittedly contrived and not worth your time.

3) How do I use "CMS" with "Serial"? -XX:+UseConcMarkSweepGC -XX:-UseParNewGC. Don't use -XX:+UseConcMarkSweepGC and -XX:+UseSerialGC. Although that seems like a logical combination, it will result in a message saying something about conflicting collector combinations and the JVM won't start. Sorry about that. Our bad.

4) Is the blue box with the "?" a typo? That box represents the new garbage collector that we're currently developing called Garbage First, or G1 for short.
G1 will provide

- more predictable GC pauses
- better GC ergonomics
- low pauses without fragmentation
- parallelism and concurrency in collections
- better heap utilization

G1 straddles the young generation - tenured generation boundary because it is a generational collector only in the logical sense. G1 divides the heap into regions and during a GC can collect a subset of the regions. It is logically generational because it dynamically selects a set of regions to act as a young generation, which will then be collected at the next GC (as the young generation would be).

The user can specify a goal for the pauses, and G1 will do an estimate (based on past collections) of how many regions can be collected in that time (the pause goal). That set of regions is called a collection set, and G1 will collect it during the next GC. G1 can choose the regions with the most garbage to collect first (Garbage First, get it?) so it gets the biggest bang for the collection buck.

G1 compacts, so fragmentation is much less of a problem. Why is it a problem at all? There can be internal fragmentation due to partially filled regions.

The heap is not statically divided into a young generation and a tenured generation, so the problem of an imbalance in their sizes is not there.

Along with a pause time goal, the user can specify a goal on the fraction of time that can be spent on GC during some period (e.g., during the next 100 seconds don't spend more than 10 seconds collecting). For such goals (10 seconds of GC in a 100 second period) G1 can choose a collection set that it expects it can collect in 10 seconds and schedule the collection 90 seconds (or more) from the previous collection. You can see how an evil user could specify 0 collection time in the next century, so again, this is just a goal, not a promise.

If G1 works out as we expect, it will become our low-pause collector in place of "ParNew" + "CMS". And if you're about to ask when it will be ready, please don't be offended by my dead silence.
It's the highest priority project for our team, but it is software development so there are the usual unknowns. It will be out by JDK7. The sooner the better as far as we're concerned.

Updated February 4. Yes, I can edit an already posted blog. Here's a reference to the G1 paper if you have ACM portal access.

http://portal.acm.org/citation.cfm?id=1029879
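To make the "garbage first" region-selection idea concrete, here is a toy model: given per-region estimates of reclaimable garbage and predicted evacuation time, greedily pick the garbage-richest regions that fit within a pause budget. Every name and number here is invented for illustration; the real G1 cost model is far more elaborate than this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CollectionSetSketch {
    static class Region {
        final int garbageKb;        // estimated reclaimable space
        final double predictedMs;   // predicted time to evacuate this region
        Region(int garbageKb, double predictedMs) {
            this.garbageKb = garbageKb;
            this.predictedMs = predictedMs;
        }
    }

    // Greedily build a collection set under the pause-time budget,
    // taking regions with the most garbage first.
    static List<Region> chooseCollectionSet(List<Region> regions, double pauseBudgetMs) {
        List<Region> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparingInt((Region r) -> r.garbageKb).reversed());
        List<Region> cset = new ArrayList<>();
        double spent = 0.0;
        for (Region r : sorted) {
            if (spent + r.predictedMs <= pauseBudgetMs) {
                cset.add(r);
                spent += r.predictedMs;
            }
        }
        return cset;
    }

    public static void main(String[] args) {
        List<Region> heap = List.of(
                new Region(900, 6.0),   // mostly garbage: cheap, high payoff
                new Region(700, 5.0),
                new Region(200, 9.0),   // mostly live: expensive to evacuate
                new Region(50, 10.0));
        List<Region> cset = chooseCollectionSet(heap, 12.0);
        int reclaimed = cset.stream().mapToInt(r -> r.garbageKb).sum();
        System.out.println(cset.size() + " regions, ~" + reclaimed + "K reclaimed");
    }
}
```

The greedy choice is why a mostly-live region can sit uncollected for a long time: it costs a lot of pause budget and returns little space.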



GC Errata 2

I know that none of you are still making the transition from jdk1.4.2 to jdk5 (aka jdk1.5.0, yah, we changed the numbering) but if you were, I'd want you to know that you might need a larger permanent generation. In jdk5 class data sharing was implemented (the -Xshare option, see the "java -X" output) when the client VM (-client) is being used. The implementation of class data sharing split some of the information about classes into two parts (the part that can be shared and the part that is specific to an execution of the JVM). The space in the permanent generation for classes that can be shared increased by 10-20%. That doesn't mean that you need a permanent generation that is 10-20% larger! Only some of the classes are affected and only some of the data structures for those classes are affected. But if you're running a tight ship in terms of the size of the permanent generation, you might be affected by this. Of course, none of you are still making the transition to jdk5.

Some users have reported occasional long pauses which occur at the time of a GC but cannot be accounted for by GC times. Such long pauses may be a symptom of a known latency in the use of (Linux's) mprotect(), which is used by the JVM while steering application threads to so-called safepoints. A known workaround at the JVM level to step around the Linux mprotect() issue is the use of the JVM option -XX:+UseMembar. The real fix, unfortunately, lies in the Linux kernel; see http://bugzilla.kernel.org/show_bug.cgi?id=5493. You might wish to pursue this with your Linux support engineer and see if a patch is available for the problem. To avoid running into this bug, make sure there is plenty of physical memory free in your system so that the Linux pager does not start evicting pages for any process from memory to swap.

If you're seeing large variations in your minor GC pauses and are using UseParNewGC (which is on by default with the low pause collector), you might be running into 6433335.
A large variation is a factor of 100. This is a bug that has to do with large objects (or large free blocks) in the heap. If you turn off UseParNewGC and see larger pauses (by a multiplier that is on the order of the number of cpu's) but more regular pauses, then 6433335 is a likely candidate. It's been fixed in jdk 1.4.2 update 14, jdk 5 update 10 and jdk 6 update 1.

I've heard of more than one report recently of crashes with the low-pause collector that were caused by bug 6558100 (CMS crash following parallel work queue overflow). This bug is fixed in 1.4.2u18, 5.0u14 and 6u4. You can work around the bug with the flag -XX:-ParallelRemarkEnabled. You could also run into this bug if you explicitly enable ParallelRefProcEnabled, so if you include -XX:+ParallelRefProcEnabled, remove it. ParallelRefProcEnabled is off by default, so if you don't explicitly turn it on, don't worry about it.

Starting in jdk6 the biased locking optimization is "on" by default (command line option UseBiasedLocking). This optimization reduces the cost of uncontended locks by treating the thread that owns the lock preferentially. It's a nice optimization but could increase the memory usage of the JVM. If you move from jdk5 to jdk6 and start seeing messages such as

java.lang.OutOfMemoryError: requested bytes for GrET in C

you could try turning off biased locking and see if it helps. Improvements in biased locking were made in 6u1 to make this much less of a problem.



Did You Know ...

These are a few esoteric factoids that I never expected users to need, but which have actually come up recently. Most of the text is just background information. If you already recognize the command line flags that I've bolded, you probably already know more than is good for you.

ParallelCMSThreads

The low-pause collector (UseConcMarkSweepGC) does parts of the collection of the tenured generation concurrently with the execution of the application (i.e., not during a stop-the-world). There are principally two concurrent phases of the collection: the concurrent marking phase and the concurrent sweeping phase. In JDK 6 the concurrent marking phase can use more than 1 GC thread (uses parallelism as well as concurrency). This use of parallelism is controlled by the command line flag CMSConcurrentMTEnabled. The number of threads used during a concurrent marking phase is ParallelCMSThreads. If it is not set on the command line it is calculated as (ParallelGCThreads + 3)/4, where ParallelGCThreads is the command line flag for setting the number of GC threads to be used in a stop-the-world parallel collection. Where did this number come from? We added parallelism to the concurrent marking phase because we observed that a single GC thread doing concurrent marking could be overwhelmed by the allocations of many application threads (i.e., while the concurrent marking was happening, lots of application threads doing allocations could exhaust the heap before the concurrent marking finished). We could see this with a few application threads allocating at a furious rate or many application threads allocating at a more modest rate, but whatever the application, we would often see the concurrent marking thread overwhelmed on platforms with 8 or more processors. The above policy provides a second concurrent marking thread at ParallelGCThreads=5 and approaches a fourth of ParallelGCThreads at the higher processor numbers.
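The default calculation quoted above is just integer arithmetic, and it reproduces the thresholds mentioned in the text (a second marking thread at ParallelGCThreads=5, a third at 9). This is the blog's formula restated, not HotSpot source code:

```java
// Default ParallelCMSThreads as described above: (ParallelGCThreads + 3) / 4,
// with integer (truncating) division.
public class CmsThreadDefaults {
    static int defaultParallelCmsThreads(int parallelGcThreads) {
        return (parallelGcThreads + 3) / 4;
    }

    public static void main(String[] args) {
        // Print the mapping for a range of stop-the-world thread counts.
        for (int n = 1; n <= 16; n++) {
            System.out.println("ParallelGCThreads=" + n
                    + " -> ParallelCMSThreads=" + defaultParallelCmsThreads(n));
        }
    }
}
```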
Because we still have the added overhead of parallelism, 2 concurrent marking threads provide only a small boost in concurrent marking over a single concurrent marking thread. We expect that to still be adequate up to ParallelGCThreads=8. At ParallelGCThreads=9 we get a third concurrent marking thread and that's when we expect to need it.

CMSMaxAbortablePrecleanTime

Our low-pause collector (UseConcMarkSweepGC), which we are usually careful to call our mostly concurrent collector, has several phases, two of which are stop-the-world (STW) phases:

- STW initial mark
- Concurrent marking
- Concurrent precleaning
- STW remark
- Concurrent sweeping
- Concurrent reset

The first STW pause is used to find all the references to objects in the application (i.e., object references on thread stacks and in registers). After this first STW pause is the concurrent marking phase, during which the application threads run while GC is doing additional marking to determine the liveness of objects. After the concurrent marking phase there is a concurrent preclean phase (described more below) and then the second STW pause, which is called the remark phase. The remark phase is a catch-up phase in which the GC figures out all the changes that the application threads have made during the previous concurrent phases. The remark phase is the longer of these two pauses. It is also typically the longest of any of the STW pauses (including the minor collection pauses). Because it is typically the longest pause we like to use parallelism wherever we can in the remark phase. Part of the work in the remark phase involves rescanning objects that have been changed by an application thread (i.e., looking at object A to see if A has been changed by the application thread so that A now references another object B and B was not previously marked as live). This includes objects in the young generation, and here we come to the point of these ramblings.
Rescanning the young generation in parallel requires that we divide the young generation into chunks so that we can give chunks out to the parallel GC threads doing the rescanning. A chunk needs to begin on the start of an object, and in general we don't have a fast way to find the starts of objects in the young generation. Given an arbitrary location in the young generation we are likely in the middle of an object, don't know what kind of object it is, and don't know how far we are from the start of the object. We know that the first object in the young generation starts at the beginning of the young generation, and so we could start at the beginning and walk from object to object to do the chunking, but that would be expensive. Instead we piggyback the chunking of the young generation on another concurrent phase, the precleaning phase. During the concurrent marking phase the application threads are running and changing objects, so we don't have an exact picture of what's alive and what's not. We ultimately fix this up in the remark phase as described above (the object-A-gets-changed-to-point-to-object-B example). But we would like to do as much of the collection as we can concurrently, so we have the concurrent precleaning phase. The precleaning phase does work similar to parts of the remark phase but does it concurrently. The details are not needed for this story, so let me just say that there is a concurrent precleaning phase. During the latter part of the concurrent precleaning phase the young generation "top" (the next location to be allocated in the young generation and so at an object start) is sampled at likely intervals and is saved as the start of a chunk. "Likely intervals" just means that we want to create chunks that are not too small and not too large so as to get good load balancing during the parallel remark.
Ok, so here's the punch line for all this. When we're doing the precleaning we do the sampling of the young generation top for a fixed amount of time before starting the remark. That fixed amount of time is CMSMaxAbortablePrecleanTime and its default value is 5 seconds. The best situation is to have a minor collection happen during the sampling. When that happens the sampling is done over the entire region in the young generation from its start to its final top. If a minor collection is not done during that 5 seconds, then the region below the first sample is 1 chunk and it might be the majority of the young generation. Such a chunking doesn't spread the work out evenly to the GC threads and so reduces the effective parallelism. If the time between your minor collections is greater than 5 seconds and you're using parallel remark with the low-pause collector (which you are by default), you might not be getting parallel remarking after all. A symptom of this problem is significant variation in your remark pauses. This is not the only cause of variation in remark pauses, but take a look at the times between your minor collections, and if they are, say, greater than 3-4 seconds, you might need to up CMSMaxAbortablePrecleanTime so that you get a minor collection during the sampling. And finally, why not just have the remark phase wait for a minor collection so that we get effective chunking? Waiting is often a bad thing to do. While waiting, the application is running and changing objects and allocating new objects. The former makes more work for the remark phase when it happens and the latter could cause an out-of-memory before the GC can finish the collection. There is an option CMSScavengeBeforeRemark which is off by default. If turned on, it will cause a minor collection to occur just before the remark. That's good because it will reduce the remark pause.
That's bad because there is a minor collection pause followed immediately by the remark pause, which looks like 1 big fat pause.

DontCompileHugeMethods

We got a complaint recently from a user who said that all his GC pauses were too long. I, of course, take such a statement with a grain of salt, but I still try to go forward with an open mind. And this time the user was right, his GC pauses were way too long. So we started asking the usual questions about anything unusual about the application's allocation pattern. Mostly that boils down to asking about very large objects or large arrays of objects. I'm talking GB size objects here. But, no, there was nothing like that. The user was very helpful in terms of trying experiments with his application, but we weren't getting anywhere until the user came back and said that he had commented out part of his code and the GC's got much smaller. Hmmm. Curiouser and curiouser. Not only that, but the code that was commented out was not being executed. At this point the strain on my brain began to be too much and I lost consciousness. Fortunately, another guy in the group persevered and with some further experiments determined that the code that was being commented out was in a method that was not always being JIT'ed. Methods larger than a certain size will not be JIT'ed in hotspot. Commenting out some code would bring the size of the method below the JIT size limit and the method would get compiled. How did that affect GC, you might ask? When a method is compiled, the compilers generate and save information on where object references live (e.g., where on the stack or in which registers). We refer to these as oop maps, and oop maps are generated to speed up GC. If the method has not been JIT'ed, the GC has to generate the oop maps itself during the GC. We do that by a very laborious means that we call abstract interpretation. Basically, we simulate the execution of the method with regard to where references are stored.
Large methods mean large abstract interpretation times to generate the oop maps. We do save the oop maps for the next GC, but oop maps are different at different locations in the method. If we generate an oop map for PC=200 this time but stop for a GC at PC=300 next time, we have to generate the oop map for PC=300. Anyway, the method in which code was being commented in and out was too large to be JIT'ed with the code commented in, and that led to the long GC's. If you have some huge methods and GC's are taking a very long time, you could try -XX:-DontCompileHugeMethods. This will tell the JIT to ignore its size limit on compilation. I'm told by the compiler guy in my carpool that it's not a good idea to use that flag in general. Refactor your methods down to a less than huge size instead. By the way, the huge method was something like 2500 lines, so it was what I would call huge.



Size Matters

I recently had an interesting exchange with a customer about concurrent mode failures with the low-pause collector. My initial response was with my litany of whys and wherefores of concurrent mode failures. But it got more interesting, because it turned out to be an example of where object sizes and the young generation size made an unexpected difference. Before I get to that, let's talk a little bit about the JVM's behavior with regard to allocating large objects. Large, of course, is a relative term, and what I mean here is large relative to the size of the young generation. My comments here apply to JDK 5 update 10 and later and JDK 1.4.2 update 13 and later. Earlier versions of those JVM's had a bug in them (6369448) that made their behavior different. All versions of JDK 6 are also free from this bug. By the way, the customer was on JDK 5 update 6 so was subject to 6369448. That made it a little more interesting for me but may not be of interest to you if you are running on the later releases. And also the policy I describe below does not apply to the throughput collector (UseParallelGC), which has its own variation on this policy.

Objects are normally allocated out of the young generation and get moved to the tenured generation as they age. But what happens when an object is larger than the young generation (actually larger than the eden space in the young generation)? If the application tries to allocate object AAA and there is not enough free space in eden for the allocation, a garbage collection usually occurs. The exception to this is if AAA is larger than eden. An outline of the policy for allocation and collections follows.

1. Allocate out of the young generation
2. Possibly allocate out of the tenured generation
3. Collect the young generation
4. Possibly collect the tenured generation
5. Heroics

If the allocation in 1. fails, before starting a GC the JVM considers allocating AAA out of the tenured generation at 2. The JVM compares the size of AAA to the current capacity of eden.
By capacity I mean the total size available in an empty eden. If the capacity of eden is too small to hold AAA, then collecting the young generation will not help with the allocation of AAA. So the JVM tries to allocate AAA directly out of the tenured generation. If the allocation out of the tenured generation succeeds, no young generation GC is done and the JVM just continues. If AAA is smaller than eden, then the JVM proceeds to the collection of the young generation at 3. After the young generation collection, a check is done at 4. to see if enough space has been freed to allocate AAA in eden. If AAA still cannot be allocated out of the young generation, then the tenured generation is collected at 5. and an allocation attempt is made out of the tenured generation. If AAA cannot be allocated in the tenured generation, some additional heroics are attempted (e.g., clearing all SoftReferences). Failing those heroics, an out-of-memory is thrown. A few things to note. The initial allocation at 1. is what we refer to as fast-path allocation. Fast-path allocation does not call into the JVM to allocate an object. The JIT compilers know how to allocate out of the young generation, and code for an allocation is generated in-line for object allocation. The interpreter also knows how to do the allocation without making a call to the VM. And yes, both know about thread local allocation buffers (TLAB's) and use them. The allocation out of the tenured generation at 2. could be attempted unconditionally for any sized object. That could avoid a collection of the young generation, but would also defeat the purpose of having the young generation. We want new objects allocated in the young generation because that's where they will probably die. We only want the long-lived objects to make it into the tenured generation. Also, an allocation at 2. is not a fast-path allocation, so the execution of the application could be significantly affected by too much allocation at 2. At 4.
a check is made to see if enough space has been freed in the young generation for AAA, and if not, the tenured generation is collected. Why not attempt the allocation of AAA out of the tenured generation before doing the collection? Attempting the allocation first would probably work lots of the time, but there are pathological cases where allocations would repeatedly be done at 4. If we called allocation at 2. slow-path allocation, we'd have to call allocation at 4. slow-slow-path allocation. At 4. we've already stopped the world and done a young generation collection. The pathological case would execute very slowly. We know. We've tried variations that did the slow-slow-path allocation first, and the report that we usually got was that the JVM was hung! The JVM was not hung. Going slow-slow-path just made it seem so. Doing the collection of the tenured generation actually also collects the young generation, and doing that avoids the pathological case in practice. The final thing to note is that you don't want to be stuck using slow-path allocation (allocation at 2.). It's really slow compared to fast-path allocation. So if you have large objects (and only you would know), try to make your young generation large enough to hold plenty of them (10's would be good, 100's would be better). Or ensure that you have very few of them. The policy for the throughput collector differs from the above at 2. and 5. At 2. the throughput collector will allocate an object out of the tenured generation if it is larger than half of eden. That variation tries to not fill up the young generation with large objects. Is this better? I just don't know. The heroics at 5. are also different.

The interesting exchange with the customer that got me thinking about this blog started with this snippet of a GC log.
Again this was with JDK 5 update 6, so the bug 6369448 was in play.

963236.282: [GC 963236.282: [ParNew: 15343K->0K(21184K), 0.1025429 secs] 8289743K->8274557K(12287936K), 0.1029416 secs]
963236.479: [GC 963236.479: [ParNew: 10666K->0K(21184K), 0.1072986 secs] 963236.587: [CMS (concurrent mode failure): 8285141K->7092744K(12266752K), 91.3603721 secs] 8285224K->7092744K(12287936K), 91.4763098 secs]
963328.194: [GC 963328.194: [ParNew: 21120K->0K(21184K), 0.1455909 secs] 7135030K->7114033K(12287936K), 0.1459930 secs]
963328.434: [GC 963328.435: [ParNew: 7442K->0K(21184K), 0.0745429 secs] 963328.509: [CMS (concurrent mode failure): 7114084K->7073781K(12266752K), 78.1535121 secs] 7121475K->7073781K(12287936K), 78.2286852 secs]
963408.503: [GC 963408.503: [ParNew: 21120K->0K(21184K), 0.0651745 secs] 7116067K->7097487K(12287936K), 0.0656080 secs]

When I first looked at this log I jumped right to the first concurrent mode failure. The first thing to notice was that the available free space in the tenured generation was large (about 4g). The total space in the tenured generation was 12266752K and the occupancy (amount of space currently being used) of the tenured generation was only 8285141K (see the part of the log immediately following the first "concurrent mode failure" message). That's about 4g of space. With that much free space the only thing I could think of was a really bad case of fragmentation. Now a concurrent mode failure is bad for the low-pause collector. If it is due to fragmentation, at least the resulting collection is a compacting collection and you should not have to worry about fragmentation for a while. So why is there another concurrent mode failure after just one minor collection? And there is even more free space in the tenured generation when this second one occurred (about 5g). So now I'm sweating. I fortunately started noticing some other peculiarities in the log.
For example, the first minor collection starts when the young generation had only allocated 15343K out of a total of 21184K. That's not that unusual because the survivor space can be sizable and mostly empty. But the third collection is also a minor collection and the young generation has allocated 21120K by the time the minor collection starts (as does the fifth collection). Hmmm. Looking at the occupancies of the young generation at the point of the two concurrent mode failures shows the pattern: the young generation has a good chunk of free space but is still being collected. So I asked the customer if very large objects are being allocated, objects that are approaching the size of the young generation. To my relief the answer was yes. What follows now is my interpretation of the logs. I didn't have the application to run and really debug the behavior, but I think the pieces fit together. The first collection (a minor collection) starts seemingly early because a large object is being allocated. The object is larger than the available space in eden but is not larger than all of eden, so the minor collection does free up enough space to allocate that large object. The next collection is started when an even larger object allocation is attempted. This is where bug 6369448 comes into play. If not for 6369448, the larger object would have been allocated out of the tenured generation at 2. (the corrected case that I discussed above). There was an inconsistency between the test at 2. and the check at 4. before the collection of the tenured generation. A young generation is divided into an eden and two survivor spaces. The test at 2. compared the object size to the total size of the young generation. The check at 4. only checked the size (as is correct) against eden. So the short circuit at 2. was not taken, the collection at 3. doesn't free up enough space, and a tenured generation collection (concurrent mode failure) is the result.
The third collection is an uninteresting minor collection. The fourth collection again is a collection prompted by the allocation of an object too big to fit into eden, so another concurrent mode failure happens. The fifth collection is also uninteresting, as are the many minor collections that follow. My story for the two concurrent mode failures does depend on the large objects fitting in that gap in size between eden and the total young generation, but I'm willing to assume that there was a mix of object sizes and occasionally there was something in the gap. The two concurrent mode failure pattern above didn't happen often in the entire log and it was a long log. And the customer increased the size of the young generation and these types of concurrent mode failures did not reoccur. I've always expected the occurrence of slow-path allocation to be a rare event. I think it is rare, but that's not the same as never, so I'm thinking about adding a counter to be printed into the GC log to indicate how many slow-path allocations have occurred since the last GC.
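The five-step policy described above can be sketched in code. This is a heavily simplified illustration with invented interfaces (a toy bump-pointer "generation"); real HotSpot allocation, with TLABs and the full set of heroics, is far more involved.

```java
public class AllocationPolicySketch {
    interface Generation {
        long allocate(long size);   // returns an "address", or -1 on failure
        long capacity();            // total capacity of an empty eden
        void collect();
    }

    // The allocation policy from the post, steps 1-5 (heroics omitted).
    static long allocate(long size, Generation young, Generation tenured) {
        long addr = young.allocate(size);                 // 1. try the young gen
        if (addr != -1) return addr;
        if (size > young.capacity()) {                    // 2. too big for eden:
            addr = tenured.allocate(size);                //    try tenured directly
            if (addr != -1) return addr;
        } else {
            young.collect();                              // 3. minor collection
            addr = young.allocate(size);                  // 4. retry in young
            if (addr != -1) return addr;
        }
        tenured.collect();                                // 5. full collection
        addr = tenured.allocate(size);
        if (addr != -1) return addr;
        throw new OutOfMemoryError("after heroics");
    }

    // Trivial bump-pointer generation for demonstration.
    static class SimpleGen implements Generation {
        final long capacity;
        long used = 0;
        SimpleGen(long capacity) { this.capacity = capacity; }
        public long allocate(long size) {
            if (used + size > capacity) return -1;
            long addr = used; used += size; return addr;
        }
        public long capacity() { return capacity; }
        public void collect() { used = 0; }  // pretend everything dies
    }

    public static void main(String[] args) {
        Generation young = new SimpleGen(100);
        Generation tenured = new SimpleGen(1000);
        allocate(90, young, tenured);   // fits in eden
        allocate(90, young, tenured);   // forces a "minor collection" at 3.
        allocate(500, young, tenured);  // larger than eden: goes to tenured at 2.
        System.out.println("done");
    }
}
```

Note how the 6369448 story maps onto this sketch: the buggy test at step 2 compared against the whole young generation rather than eden's capacity, so objects in the gap between the two sizes fell through to steps 3-5.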



Get What You Need

As we all know by now, GC ergonomics exists to automatically adjust the sizes of the generations to achieve a pause time goal and/or a throughput goal while using a minimum heap size. But sometimes that's not what you really want. I was talking to a user recently who was using java.lang.Runtime.totalMemory() and java.lang.Runtime.freeMemory() to monitor how full the heap was. The intent was that the user's application be able to anticipate when a GC was coming and to throttle back on the number of requests being serviced so as to avoid dropping requests due to the reduced throughput when the GC pause did occur. The problem was that the capacity of the heap (the amount returned by totalMemory()) was varying so much that anticipating when a GC was coming was not very reliable. First the pretty picture. Looking at the memory used plot first, I inferred from the graph that there was some processing during initialization that required a lesser amount of memory. After that there was activity that used and released memory until about 6k seconds. At that point the activity died down until about 7.5k seconds, when there was an uptick in activity and then a quiet period again until about 11k seconds. After that there were pretty regular allocations being done. Looking at the total memory, I see what I'm expecting to see. The initial total heap size at zero time is larger than is needed during the first phase and the total memory drops. Then as allocations picked up, the size of the heap grew and, on average, stayed at a higher level until the first quiet period. During the first quiet period the total heap size decayed to a lower value that corresponds to the lesser demand for allocations. Then there is the short burst of activity with a corresponding growth in the heap. The total heap again drops during the second quiet period and then grows again as the activity picks up. I like the behavior I see, but it wasn't what the user wanted.
The amount of variation in the total heap size during the active periods is high. This is varying too quickly for this user. The reason for the variations is that GC ergonomics is trying to react to variations in the behavior of the application and so is making changes to the heap as soon as a variation in the application's behavior is recognized. There are two ways to look at this:

- GC ergonomics was designed for the case where a steady state behavior has been reached, and, as exhibited by the variations in the amount of memory used, this application doesn't reach a steady state (on the timescale of interest), so GC ergonomics is doing the best that it can under the circumstances.
- The user's application is what it is, so what should he do?

The first bullet is the excuse I use for almost everything. Let's consider the second. First a word about what GC ergonomics does by default that provides some damping of variations. GC ergonomics calculates the desired size of a generation (in terms of meeting the goals specified by the user) and then steps toward that size. There is a minimum size for a step, and if the distance to the desired size is less than that minimum step, then the step is not made. The minimum step size is the page size on the platform. This eliminates changes as we get close to the goal and removes some of the jitter. When deciding whether a goal is being met, GC ergonomics uses weighted averages for quantities. For example, a weighted average for each pause time is used. The weighted averages typically change more slowly than the instantaneous values of the quantities, so their use also tends to damp out variations. But if you really want a more stable heap size, here's what you can try. These suggestions limit the range of changes that can be made by GC ergonomics. The more minor limitations are given first, followed by the more serious ones, finally ending with turning off GC ergonomics.
You can just try the suggestions (in bold) and skip over the explanations as you wish.

Reduce the jitter caused by the different sizes of the survivor spaces (and the flip-flopping of the roles of the survivor spaces) by reducing the size of the survivor spaces. Try NNN = 8 as a starter.

-XX:MinSurvivorRatio=NNN
-XX:InitialSurvivorRatio=NNN

The total size of the young generation is the size of eden plus the size of from-space. Only from-space is counted because the space in the young generation that contains objects allocated by the application is only eden and from-space. The other survivor space (to-space) can be thought of as scratch space for the copying collection (the type of collection that is used for the young generation). Each survivor space alternately plays the role of from-space (i.e., during a collection survivor space A is from-space and in the next collection survivor space B is from-space). Since the two survivor spaces can be of different sizes, just the swapping can change the total size of the young generation. In a steady state situation the survivor spaces tend to be the same size, but in situations where the application is changing behavior and GC ergonomics is trying to adjust the sizes of the survivor spaces to get better performance, the sizes of the survivor spaces are often different temporarily. The default value for MinSurvivorRatio is 3 and the default value for InitialSurvivorRatio is 8. Pick something in between. A smaller value puts more space into the survivor spaces. That space might go unused. A larger value limits the size of the survivor spaces and could result in objects being promoted to the tenured generation prematurely. Note that the survivor space sizes are still adjusted by GC ergonomics. This change only puts a limit on how large they can get.
Reduce the variation in the young generation size by setting a minimum and a maximum with -XX:NewSize=NNN and -XX:MaxNewSize=MMM.

GC ergonomics will continue to adjust the size of the young generation within the range specified. If you want to make the minimum and maximum limits of the young generation the same, you can use -XX:NewSize=NNN and -XX:MaxNewSize=NNN, or the flag -XmnNNN will also do that.

Reduce the range over which the generations can change by specifying the minimum and maximum heap size with -Xms and -Xmx, respectively.

If you've already explicitly specified the size of the young generation, then specifying the limit on the entire heap will in effect specify the limits of the tenured generation. You don't have to make the minimum and maximum the same, but making them the same will have the maximum effect in terms of reducing the variations.

Turn off GC ergonomics with the flag -XX:-UseAdaptiveSizePolicy.

The sizes of the generations, the sizes of the survivor spaces in the young generation, and the tenuring threshold stay at their starting values throughout the execution of the VM. You can exercise this level of control, but then you're back to tuning the GC yourself. But sometimes that really is the best solution. The user I talked to liked the more predictable heap size better than the automatic tuning and so turned off ergonomics.
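For reference, a minimal sketch of the kind of monitor the user had in mind, built on the two Runtime methods mentioned above. The 75% threshold is an arbitrary choice for illustration; as this post explains, the estimate is only as stable as the heap capacity itself, which is why pinning the generation sizes helps.

```java
// Sample Runtime.totalMemory()/freeMemory() and flag when occupancy nears
// capacity, so the application can throttle request intake before a GC.
public class HeapWatcher {
    static final double OCCUPANCY_THRESHOLD = 0.75;  // arbitrary for this sketch

    // Fraction of the *current* total heap that is in use. Note that
    // totalMemory() itself moves as ergonomics resizes the heap.
    static double occupancy() {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();
        return (double) (total - rt.freeMemory()) / total;
    }

    static boolean gcLikelySoon() {
        return occupancy() > OCCUPANCY_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.printf("heap occupancy: %.1f%%, throttle=%b%n",
                occupancy() * 100, gcLikelySoon());
    }
}
```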

As we all know by now, GC ergonomics exists to automatically adjust the sizes of the generations to achieve a pause time goal and/or a throughput goal while using a minimum heap size. But sometimes...


GC Errata 1

Just some tidbits of information to fill some of the holes in the documentation.

GC ergonomics is used by default on server class machines. Please see http://java.sun.com/docs/hotspot/gc5.0/ergo5.html for a description of GC ergonomics and server class machines. The flag "-server" causes the server VM to be used. "-server" does not cause the platform on which the VM is running to be considered a server class machine.

Prior to JDK6 the UseParallelGC collector (with GC ergonomics turned on) used only the flags InitialSurvivorRatio and MinSurvivorRatio to set the initial survivor ratio and the minimum survivor ratio, respectively. GC ergonomics dynamically resizes the survivor spaces, but these flags allow the user to specify initial and minimum sizes for the survivor spaces. The flag SurvivorRatio is used by the other collectors (which don't dynamically resize the survivor spaces) to set the size of the survivor spaces. SurvivorRatio was ignored by the UseParallelGC collector. Beginning with JDK6, if SurvivorRatio is set on the command line for the UseParallelGC collector and InitialSurvivorRatio and MinSurvivorRatio are not explicitly set on the command line, then InitialSurvivorRatio and MinSurvivorRatio are set to SurvivorRatio + 2.

If UseAdaptiveSizePolicy is turned off when using the UseParallelGC collector, the tenured generation and the young generation stay at their initial sizes throughout the execution of the VM.

Starting with JDK5 update 06, the maximum tenuring threshold is limited to 15. The GC logging output does not correctly reflect this change. This is a bug that will be fixed under 6521376.

If a minimum heap size is set with -Xms and NewSize is not explicitly set, the minimum young generation size is calculated using NewRatio. The UseParallelGC collector is not correctly using the minimum value calculated from -Xms and NewRatio but is using the default value of NewSize. Explicitly setting NewSize does correctly set the minimum size for the generation.
This is being fixed under bug 6524727.
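The JDK6 SurvivorRatio translation described above can be sketched as a small helper; the method name and the use of -1 for "not set on the command line" are my own conventions for illustration, not HotSpot code:

```java
// Sketch of the JDK6 rule: with UseParallelGC, an explicit
// SurvivorRatio sets InitialSurvivorRatio and MinSurvivorRatio
// to SurvivorRatio + 2, unless those flags were set explicitly.
public class SurvivorRatioRule {
    // explicitValue < 0 means the flag was not set on the command line
    public static int effectiveRatio(int survivorRatio, int explicitValue) {
        if (explicitValue >= 0) {
            return explicitValue;      // an explicit flag wins
        }
        return survivorRatio + 2;      // derived from SurvivorRatio
    }
}
```

So -XX:SurvivorRatio=8 alone would behave like InitialSurvivorRatio=10 and MinSurvivorRatio=10.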



How Does Your Garden Grow?

Actually, the title should be "How Does Your Generation Grow" but then the literary reference would have been totally lost. I've written about how the generations grow if you are using GC ergonomics. http://blogs.sun.com/jonthecollector/entry/it_s_not_magic What if you're not using GC ergonomics? There is a different policy for growing and shrinking the heap that is used by the low pause collector and the serial collector. If you're interested in that policy, this is your lucky blog.

It's actually pretty straightforward. At the end of the collection of a generation, we check the amount of free space in the generation. If the amount of free space is below a certain percentage (specified by MinHeapFreeRatio) of the total size of the generation, then the generation is expanded. Basically, we ask if we have enough free space in the generation so that the VM can run a while before we have to collect again. At the other end, we don't want to have an excessive amount of free space, so if the amount of free space is above a certain percentage (specified by MaxHeapFreeRatio), the generation is shrunk. That's it.

If you decide to play with the free ratio parameters, leave enough distance between MinHeapFreeRatio and MaxHeapFreeRatio so that the generations are not constantly adjusting by small amounts to get to the "perfect" ratio. Also, our experience is that even if a generation does not need the extra free space right now, it will shortly, so don't be too aggressive with MaxHeapFreeRatio.

Will this policy eventually be replaced by GC ergonomics? Actually, I think not. I recently talked to a customer who told me that he was more concerned with bounds on the heap footprint than with achieving some specific throughput. The way he put it was something like "I don't want to be doing GCs all the time, so there should be a good amount of free space in the heap but not too much". This policy is one approximation of what he wants.
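The grow/shrink decision above can be sketched as follows; the class, method, and return values are illustrative only, not the HotSpot implementation:

```java
// Sketch of the MinHeapFreeRatio/MaxHeapFreeRatio policy used by the
// serial and low pause collectors after a collection of a generation.
public class FreeRatioPolicy {
    public enum Action { EXPAND, SHRINK, LEAVE_ALONE }

    public static Action decide(long capacity, long free,
                                int minFreePct, int maxFreePct) {
        double freePct = 100.0 * free / capacity;
        if (freePct < minFreePct) return Action.EXPAND;  // too little headroom
        if (freePct > maxFreePct) return Action.SHRINK;  // too much free space
        return Action.LEAVE_ALONE;
    }
}
```

With, say, MinHeapFreeRatio=40 and MaxHeapFreeRatio=70, a generation that finishes a collection 30% free would expand and one that finishes 80% free would shrink.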



When You're at Your Limit

It's often the case that we advise users of large applications to use large young generations. The rationale is that large applications (very often multithreaded) have large allocation rates, and a large young generation is good because it allows more time between minor collections. One case where that is not a good idea is when you're using the largest heap you can (e.g., close to 4G on a 32 bit system) and the live data is bigger than will fit in the tenured generation.

When the young generation fills up, we want to do a minor collection. Minor collections are in general more productive (the young generation tends to have more garbage) and faster (the time to collect depends largely on the amount of live data in the young generation and not on the amount of garbage). If, however, when we're contemplating a minor collection the tenured generation is full, the garbage collector will think that it's time to do a major collection and collect both the tenured generation and the young generation. If the tenured generation is actually full (or very nearly full) of live data, then the tenured generation may still be full at the end of the major collection. The collection worked, but the next time the young generation fills up, a major collection will again be done.

If you find yourself in this situation, try reducing the size of the young generation and increasing the size of the tenured generation accordingly. The goal is to have enough free space available in the tenured generation after a major collection so that the next collection can be a minor collection. Even if you can only manage enough free space in the tenured generation to do 1 minor collection, it's a win. Sizing the generations so you can get a few minor collections between major collections would be sweeter, but that all depends on how close you are to your limit.
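The sizing argument above can be phrased as a simple check. This is an illustration of the reasoning, not an actual collector heuristic; all names and numbers are hypothetical:

```java
// With a fixed total heap, shrink the young generation until the
// tenured generation keeps enough free space after a major collection
// to absorb at least one minor collection's promotions.
public class GenerationSizing {
    // Free space left in the tenured generation after a major collection.
    public static long tenuredFreeAfterMajor(long totalHeap, long youngSize,
                                             long tenuredLive) {
        long tenuredSize = totalHeap - youngSize;
        return tenuredSize - tenuredLive;
    }

    // Can the next collection be a minor one?
    public static boolean nextCanBeMinor(long totalHeap, long youngSize,
                                         long tenuredLive,
                                         long expectedPromotion) {
        return tenuredFreeAfterMajor(totalHeap, youngSize, tenuredLive)
               >= expectedPromotion;
    }
}
```

For example, with a 4096M heap, 3000M of long lived data, and 100M expected promotion, a 1200M young generation leaves the tenured generation too small, while 900M leaves enough headroom.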



Presenting the Permanent Generation

Have you ever wondered how the permanent generation fits into our generational system? Ever been curious about what's in the permanent generation? Are objects ever promoted into it? Ever promoted out? Well, you're not alone. Here are some of the answers.

Java objects are instantiations of Java classes. Our JVM has an internal representation of those Java objects, and those internal representations are stored in the heap (in the young generation or the tenured generation). Our JVM also has an internal representation of the Java classes, and those are stored in the permanent generation. That relationship is shown in the figure below. The internal representation of a Java object and the internal representation of a Java class are very similar. From this point on let me just call them Java objects and Java classes, and you'll understand that I'm referring to their internal representation. The Java objects and Java classes are similar to the extent that during a garbage collection both are viewed just as objects and are collected in exactly the same way.

So why store the Java classes in a separate permanent generation? Why not just store the Java classes in the heap along with the Java objects? Well, there is a philosophical reason and a technical reason. The philosophical reason is that the classes are part of our JVM implementation and we should not fill up the Java heap with our data structures. The application writer has a hard enough time understanding the amount of live data the application needs, and we shouldn't confuse the issue with the JVM's needs.

The technical reason comes in parts. Firstly, the origins of the permanent generation predate my joining the team, so I had to do some code archaeology to get the story straight (thanks Steffen for the history lesson). Originally there was no permanent generation. Objects and classes were just stored together. Back in those days classes were mostly static.
Custom class loaders were not widely used, and so it was observed that not much class unloading occurred. As a performance optimization, the permanent generation was created and classes were put into it. The performance improvement was significant back then. With the amount of class unloading that occurs with some applications, it's not clear that it's always a win today. It might be a nice simplification to not have a permanent generation, but the recent implementation of the parallel collector for the tenured generation (aka the parallel old collector) has made a separate permanent generation again desirable. The issue with the parallel old collector has to do with the order in which objects and classes are moved. If you're interested, I describe this at the end.

So the Java classes are stored in the permanent generation. What all does that entail? Besides the basic fields of a Java class there are

- Methods of a class (including the bytecodes)
- Names of the classes (in the form of an object that points to a string also in the permanent generation)
- Constant pool information (data read from the class file; see chapter 4 of the JVM specification for all the details)
- Object arrays and type arrays associated with a class (e.g., an object array containing references to methods)
- Internal objects created by the JVM (java/lang/Object or java/lang/exception for instance)
- Information used for optimization by the compilers (JITs)

That's it for the most part. There are a few other bits of information that end up in the permanent generation, but nothing of consequence in terms of size. All these are allocated in the permanent generation and stay in the permanent generation. So now you know.

This last part is really, really extra credit. During a collection the garbage collector needs to have a description of a Java object (i.e., how big is it and what does it contain). Say I have an object X and X has a class K. I get to X in the collection and I need K to tell me what X looks like. Where's K?
Has it been moved already? With a permanent generation, during a collection we move the permanent generation first, so we know that all the K's are in their new locations by the time we're looking at any X's.

How do the classes in the permanent generation get collected while the classes are moving? Classes also have classes that describe their content. To distinguish these classes from those classes, we spell the former klasses. The classes of klasses we spell klassKlasses. Yes, conversations around the office can be confusing. Klasses are instantiations of klassKlasses, so the klassKlass KZ of klass Z has already been allocated before Z can be allocated. Garbage collections in the permanent generation visit objects in allocation order, and that allocation order is always maintained during the collection. That is, if A is allocated before B, then A always comes before B in the generation. Therefore if a Z is being moved, it's always the case that KZ has already been moved.

And why not use the same knowledge about allocation order to eliminate the permanent generation even in the parallel old collector case? The parallel old collector does maintain allocation order of objects, but objects are moved in parallel. When the collection gets to X, we no longer know if K has been moved. It might be in its new location (which is known), or it might be in its old location (which is also known), or part of it might have been moved (but not all of it). It is possible to keep track of where K is exactly, but it would complicate the collector, and the extra work of keeping track of K might make it a performance loser. So we take advantage of the fact that classes are kept in the permanent generation by collecting the permanent generation before collecting the tenured generation. And the permanent generation is currently collected serially.



Why Now?

Are you ever surprised by the occurrence of a full collection (aka major collection)? Generally a full collection only happens when an allocation is attempted and there is not enough space available in any of the generations. But there are exceptions to that rule, and here are some of them. Actually, these are all of them that I could think of. I thought I would have a longer list, so to fill out this blog I've included some bugs that relate to full collections that might interest you. And we're thinking about a policy change for the low pause collector that would affect full collections.

The simplest reason for an unexpected full collection is that you asked for it. System.gc() can be called from most anywhere in the program. Try using -XX:+DisableExplicitGC to have the garbage collector ignore such requests.

The permanent generation is collected only by a full collection. It's very seldom the cause for a full collection, but don't overlook the possibility. If you turn on PrintGCDetails, you can see information about the permanent generation collection.

0.167: [Full GC [PSYoungGen: 504K->0K(10368K)] [ParOldGen: 27780K->28136K(41408K)] 28284K->28136K(51776K) [PSPermGen: 1615K->1615K(16384K)], 0.4558222 secs]

This output is from the throughput collector. The "PSPermGen" denotes the permanent generation. The size of the permanent generation currently is 16384K. Its occupancy before the collection is only 1615K and is probably the same after the collection. If the permanent generation needed collecting, the occupancy before the collection would have been closer to the 16384K.

The garbage collector tracks the average of the rate of promotion of live objects from the young generation into the tenured generation. If the average exceeds the free space in the tenured generation, the next time the young generation fills up, a full collection is done instead of a minor collection.
There is a pathological situation where, after the full collection, the free space in the tenured generation is still not large enough to take all the live objects expected to be promoted from the next minor collection. This will result in another full collection the next time the young generation fills up. The bad part about this is that the average value for the amount promoted only changes when a minor collection is done, and if no minor collections are done, then the average does not change. Increasing the size of the heap usually avoids this problem.

Sometimes it's not really a full collection. There was a bug affecting JDK5 (6432427) which printed "Full" when the collection was not a full collection. When JNI critical sections are in use, GC can be locked out. When the JNI critical sections are exited, if a GC had been requested during the lockout, a GC is done. That GC most likely would be a minor collection but was mistakenly labeled a full collection. A second place where "Full" was also printed erroneously was when -XX:+CMSScavengeBeforeRemark is used. The same bug report explains and fixes that error.

An attempt to allocate an object A larger than the young generation can cause a minor collection. That minor collection will fail to free enough space in the young generation to satisfy the allocation of A. A full collection is then attempted, and A is then allocated out of the tenured generation. This behavior was a bug (6369448). It has been fixed in JDK's 1.4.2_13, 5.0_10, and 6. The correct behavior is for the GC to recognize when an object is too large to be allocated out of the young generation and to attempt the allocation out of the tenured generation before doing a collection.

When an allocation cannot be satisfied by the tenured generation, a full collection is done. That is the correct behavior, but that behavior has an adverse effect on CMS (the low pause collector).
Early in a run, before the CMS (tenured) generation has expanded to a working size, concurrent mode failures (resulting in full collections) happen. These concurrent mode failures perhaps can be avoided. We are thinking about a change to the policy such that CMS expands the generation to satisfy the allocation and then starts a concurrent collection. We're cautious about this approach because similar policies in the past have led to inappropriate expansion of the CMS generation to its maximum size. To avoid these full collections, try specifying a larger starting size for the CMS generation.

So just because these are all the unexpected "Full" collections that I could think of, that doesn't mean that these are all there are. If you have a mysterious "Full" collection happening, submit a description to http://java.sun.com/docs/forms/gc-sendusmail.html I'll post any additional cases of unexpected "Full" collections that I get.

Late addition (I'm allowed to edit my blog even after it's been posted). If you ask about unexpected "Full" collections and send in a gc log, please add the command line flags

-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps

The additional output (of which there will be plenty) may help me see what's happening.
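The promotion-average rule described above can be sketched like this; the exponential-average weight and the class and method names are illustrative, not the HotSpot values:

```java
// Illustration of the heuristic: if the (averaged) amount promoted per
// minor collection exceeds the free space in the tenured generation,
// do a full collection instead of a minor one.
public class PromotionAverage {
    private double avgPromoted;        // exponentially averaged bytes promoted
    private final double weight;       // weight given to the newest sample

    public PromotionAverage(double weight) { this.weight = weight; }

    // Called after each minor collection; note the average only
    // changes when minor collections actually happen.
    public void sample(long promotedBytes) {
        avgPromoted = weight * promotedBytes + (1.0 - weight) * avgPromoted;
    }

    public boolean shouldDoFullCollection(long tenuredFreeBytes) {
        return avgPromoted > tenuredFreeBytes;
    }
}
```

The sketch also shows the pathology in the text: if only full collections are done, sample() is never called, so the average never comes back down.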



The Second Most Important GC Tuning Knob

is NewRatio or its equivalent. Number 1 would be the total heap size. And just so the guys down the hall don't start throwing rocks at me, the 0-th most important GC tuning knob is the choice of the type of garbage collector itself. But let's talk about number 2.

NewRatio is a flag that specifies the amount of the total heap that will be partitioned into the young generation. It's the tenured-generation-size / young-generation-size. A NewRatio value of 2 means that the tenured generation size is 2/3 of the total heap and the young generation size is 1/3. There are other ways of specifying that split. MaxNewSize will specify the maximum size of the young generation. The flag -XmnNNN is equivalent to -XX:NewSize=NNN and -XX:MaxNewSize=NNN. NewSize is the initial size of the young generation.

Say you've chosen the right size for your overall heap. But does the value of NewRatio split the space between the young generation and the tenured generation optimally? If your application has a large proportion of long lived data, the value of NewRatio may not put enough of the space into the tenured generation. I recently was looking at the performance of an application where too many major collections were being done. The problem was that the long lived data overflowed the tenured generation. When a collection was needed, the tenured generation was basically full of live data. Much of the young generation was also filled with long lived data. The result was that a minor collection could not be done successfully (there wasn't enough room in the tenured generation for the anticipated promotions out of the young generation), so a major collection was done. The major collection worked fine, but the result again was that the tenured generation was full of long lived data and there was long lived data in the young generation. There was also free space in the young generation for more allocations, but the next collection was again destined to be a major collection.
By decreasing the space in the young generation and putting that space into the tenured generation (a value of NewRatio larger than the default value was chosen), there was enough room in the tenured generation to hold all the long lived data and also space to support minor collections. This particular application used lots of short lived objects, so after the fix mostly minor collections were done.

Why not just make NewRatio conservatively large to avoid this situation? You want your young generation to be large enough so that objects have a chance to die between collections. Applications with many threads may be allocating at a high rate, so a large young generation might be appropriate.

The point here is that if you choose your overall heap size carefully, also think about the size of the young generation. This is particularly true if the maximum size of your heap is constrained for some reason (e.g., by a limited amount of physical memory).

Something to add here is that the default value of NewRatio is platform dependent and runtime compiler (JIT) dependent. Below are the values as we get ready for the JDK6 release.

-server on amd64: 2
-server on ia32: 8
-server on sparc: 2
-client on ia32: 12
-client on sparc: 8

These reflect the expectation that the server compiler will be used on applications with more threads and so have higher allocation rates.

And, finally, if you're about to ask why there are different ways to set the size of the young generation:

- NewRatio gives you a way to scale the young generation size with the total heap size.
- NewSize and MaxNewSize give you precise control.
- -Xmn is a convenience.
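The NewRatio arithmetic above can be sketched as follows (the class and method names are mine, not HotSpot's):

```java
// NewRatio = tenured-size / young-size, so a total heap H splits into
// young = H / (NewRatio + 1) and tenured = H - young.
public class NewRatioSplit {
    public static long youngSize(long totalHeap, int newRatio) {
        return totalHeap / (newRatio + 1);
    }
    public static long tenuredSize(long totalHeap, int newRatio) {
        return totalHeap - youngSize(totalHeap, newRatio);
    }
}
```

With NewRatio=2 and a 600 MB heap, the young generation gets 200 MB (1/3) and the tenured generation 400 MB (2/3), matching the example in the text.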



More of a Good Thing

The throughput collector is getting more parallel. In our J2SE 6 release there is an option in the throughput collector to collect the tenured generation using multiple GC threads. Yes, the UseParallelGC collector wasn't all that we wanted it to be in terms of parallelism. Without this recent work only the young generation was being collected in parallel. The option to turn on the parallel collection of the tenured generation is -XX:+UseParallelOldGC. That's the punch line for this blog. Oh, also it's off by default.

Below are a few details on the new collector. Actually, first I'll describe the serial collector used for the tenured generation collection and some of the things we learned from it. Then I'll describe the new collector and how we used what we had learned.

The J2SE 5 (and earlier) collector for the tenured generation was a serial mark-sweep-compact collector. There were four phases of that collector.

1. Mark all the live objects
2. Calculate the new locations of the objects
3. Adjust all object references to point to the new locations
4. Move the objects to their new locations

Phase 1 found and marked the live objects. In phase 2 the collector kept a pointer to the first location in the tenured generation to which we could move a live object during the collection. Let's call that location T. T started at the bottom of the generation. The collector scanned forward from the bottom of the generation looking for marked (i.e., live) objects. When a live object A was found, the collector saved T in A (i.e., saved the new location of A in A).
It then calculated the next location T as T + sizeof(A) and restarted the scan looking for the next live object to put at the new T. Phase 3 started at the roots (references to objects known to the application, for the most part) and changed all the references in the roots, and all the references found from the roots, to point to the new locations of the objects (to the T's saved in the A's). Phase 4 again started at the bottom of the tenured generation and scanned forward to find live objects. When a live object was found, it was moved to its new location (the T's).

The end result was that all the live objects were moved to the lower end of the tenured generation and all the free space was in one contiguous region at the higher end of the tenured generation. It simplifies allocations from the tenured generation to have all the free space in a single region. Also, there are no issues of fragmentation as would be the case if the collection resulted in multiple regions of free space separated by live objects. This single region of free space seemed an important property to keep when thinking about how to do a collection of the tenured generation using multiple GC threads.

Another feature of our serial mark-sweep-compact collector that we liked is the flexibility to leave some deadwood at the lower end of the tenured generation. We use the term "deadwood" to refer to garbage at the lower end of the generation that we're not going to collect. We don't collect it in order to reduce the number of objects that we have to move. If we have objects A B C D, and if B is garbage and we collect it, then we have to move C and D. If we treat B as deadwood, then C and D can stay where they are. We make a judgment call to waste a certain amount of space (that occupied by the deadwood) in order to move fewer objects.

You may have noticed that each of the four phases of our mark-sweep-compaction walked over at least all the live objects. In some cases it actually walked over all the objects (live and dead).
When we're scanning the heap (such as in phase 2), we look at a dead object just to find its size so that we can step over it. Any part of the collection that walks over the objects is costly, so with the new parallel collector for the tenured generation we wanted to have fewer walks over the objects. Keeping all this in mind resulted in a parallel collector that

- Marked the live objects using multiple GC threads and maintained some summary information (explained below).
- Used the summary information to determine where each live object would go. Also determined the amount of deadwood to keep.
- Moved live objects so there was one contiguous block of free space in the generation, and updated references to live objects, using multiple GC threads.

I'll refer to these three phases as the marking phase, summary phase, and move-and-update phase. In some of the other documentation on this new collector the last phase is referred to as the compaction phase. I prefer move-and-update because that's the type of name used in the code. Just in case you're ever looking at the code.

Marking phase

Doing the marking in parallel was straightforward. The parallel young generation collector does this, and a similar technique was used. The additional twist to the marking was creating the summary information. The heap is logically divided into chunks. Each object starts within a chunk. As we're doing the marking, when a live object is found, its size is added to the chunk containing the start of the object. At the end this gives us the size of the data for objects that begin in each chunk.

Summary phase

The summary phase first walks over the chunks looking for the extent of the heap that contains a good amount of deadwood. We call that part of the heap the dense prefix. Live objects that are moved will be placed in the heap starting immediately after the dense prefix.
Given that we know the amount of live data for each chunk, we can calculate the new location of any live object using the summary data. Basically, adding up the sizes of the live data in all the chunks before chunk A tells you where the live data in chunk A is going to land. The summary phase is currently done by one thread but can be done in parallel if it turns out to be a scaling bottleneck.

Move-and-update phase

Given the summary information, the move-and-update phase identifies chunks that are ready to be filled and then moves the appropriate objects into those chunks. A chunk is ready to be filled if it is empty or if its objects are going to be compacted into that same chunk. There is always at least one of those. The move-and-update phase is done with multiple GC threads. One thing to note is that it is possible that there is only 1 ready chunk at any given time. If the chunks are A, B, C, D ... and the only dead object is in A and fits entirely in A, then A is the only ready chunk. The move-and-update phase will first fill any remaining space in A. Then B will be the only ready chunk. Obviously an extreme case, but it makes the point. There is a technique to widen this "gap" of ready chunks. We expect to implement it, but it's not in this release.

That's basically it. We're working on some scaling issues with this new collector, but it's working pretty well as is. For more on this new collector, see the slides for the GC presentation done at this year's JavaOne. They can be downloaded from http://java.sun.com/javaone/ The last time I looked, there was a "Click here" link for the technical sessions in the first paragraph. Following that link, download the Java SE bundle, and in there our GC presentation is TS-1168.
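The summary-phase arithmetic above is essentially a prefix sum over the chunks; this sketch uses hypothetical chunk sizes and names, not the HotSpot data structures:

```java
// Sketch of the summary data: given live bytes per chunk, the
// destination offset of the live data in chunk i is the sum of the
// live bytes in all earlier chunks, placed after the dense prefix.
public class ChunkSummary {
    public static long[] destinations(long[] liveBytesPerChunk,
                                      long densePrefixSize) {
        long[] dest = new long[liveBytesPerChunk.length];
        long offset = densePrefixSize;
        for (int i = 0; i < liveBytesPerChunk.length; i++) {
            dest[i] = offset;               // where chunk i's live data lands
            offset += liveBytesPerChunk[i]; // prefix sum of live data
        }
        return dest;
    }
}
```

For example, chunks with 100, 40, 0, and 60 live bytes after a 1000-byte dense prefix land at offsets 1000, 1100, 1140, and 1140 respectively.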



The Real Thing

Here's a real treat dished up by Ross K., who hangs out 3 offices down the hall. A while back I wrote a blog about thread local allocation buffers (TLAB's). The shortcomings of my blog have probably been annoying Ross so much that he's written this one for us. If there are any discrepancies between my earlier blog and this one, believe this one. Thanks, Ross.

A Thread Local Allocation Buffer (TLAB) is a region of Eden that is used for allocation by a single thread. It enables a thread to do object allocation using thread local top and limit pointers, which is faster than doing an atomic operation on a top pointer that is shared across threads.

A thread acquires a TLAB at its first object allocation after a GC scavenge. The size of the TLAB is computed via a somewhat complex process described below. The TLAB is released when it is full (or nearly so), or the next GC scavenge occurs. TLABs are allocated only in Eden, never from From-Space or the OldGen.

Flags (default; description):
- UseTLAB (true): Use thread-local object allocation
- ResizeTLAB (false): Dynamically resize tlab size for threads
- TLABSize (see below): Default (or starting) size of a TLAB (in bytes)
- TLABWasteTargetPercent (1): Percentage of Eden that can be wasted
- PrintTLAB (false): Print various TLAB related information

AggressiveHeap settings:
- TLABSize: 256Kb
- ResizeTLAB: true (Corrected 2007/05/09)

Minor flags:
- MinTLABSize (2K): Minimum allowed TLAB size (in bytes)
- TLABAllocationWeight (35): Weight for exponential averaging of allocation
- TLABRefillWasteFraction (64): Max TLAB waste at a refill (internal fragmentation)
- TLABWasteIncrement (4): Increment allowed waste at slow allocation
- ZeroTLAB (false): Zero out the newly created TLAB

These flags are for tuning the current implementation of TLABs and may disappear or change their initial values in a future release of the jvm.

If it is not specified on the command line (or specified as zero) via the -XX:TLABSize flag, the initial size of a TLAB is computed as:

init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)

where:

a) allocating_thread_count is the expected number of threads which will be actively allocating during the next epoch (an epoch is the mutator time between GC scavenges). At jvm startup this is defined to be one. It is then recomputed at each GC scavenge from the number of threads that did at least one allocation of a tlab during the latest epoch, and it's exponentially averaged over the past epochs.

b) target_refills_per_epoch is the desired number of tlab allocations per thread during an epoch. It is computed from the value of TLABWasteTargetPercent, which is the percentage of Eden allowed to be wasted due to TLAB fragmentation. From a mutator thread's perspective a GC scavenge can occur unexpectedly at any time, so, on average, only half of a thread's current TLAB will be allocated when a GC scavenge occurs.

TLABWasteTargetPercent = 0.5 * (1 / target_refills_per_epoch) * 100

Solving for target_refills_per_epoch:

target_refills_per_epoch = (0.5 * 100) / TLABWasteTargetPercent

With the default value of 1 for TLABWasteTargetPercent:

target_refills_per_epoch = 50

When ResizeTLAB is true, the tlab size is recomputed for each thread that did an allocation in the latest epoch. Threads that did not allocate in the latest epoch do not have their TLABs resized. The resize goal is to get the number of refills closer to the ideal: target_refills_per_epoch (default value 50). For each thread, the number of refills in the latest epoch is exponentially averaged with values from previous epochs. If this average refill number is greater than target_refills_per_epoch, then the tlab size is increased. If the average is less, the tlab size is decreased.
The computation is (approximately):

  new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch

It's actually computed from the fraction of the latest epoch's Eden size used by this thread, because the next epoch may use a resized Eden.

To experiment with a specific TLAB size, two -XX flags need to be set, one to define the initial size, and one to disable the resizing:

  -XX:TLABSize=<size> -XX:-ResizeTLAB

The minimum size of a TLAB is set with -XX:MinTLABSize, which defaults to 2K bytes. The maximum size is the maximum size of an integer Java array, which is used to fill the unallocated portion of a TLAB when a GC scavenge occurs.

Diagnostic Printing Options

-XX:+PrintTLAB

Prints at each scavenge one line for each thread (starting with "TLAB: gc thread: ") and one summary line.

Thread example:

TLAB: gc thread: 0x0004ac00 [id: 2] size: 61KB slow allocs: 5 refill waste: 980B alloc: 0.99996 3072KB refills: 50 waste 0.1% gc: 0B slow: 4144B fast: 0B

The tag "gc" indicates that this information was printed at a GC scavenge, after the TLABs have been filled. The "gc" tag doesn't mean a thread is a GC thread.

Fields:

thread: The address of the thread structure and its system thread id.

size: The size of the TLAB in kilobytes.

slow allocs: The number of allocations too large for the remaining space in the TLAB. These allocations were done directly in Eden.

refill waste: (in HeapWord units) The name is truncated in the dump and should be refill_waste_limit; it is used to limit the amount of wasted space from internal fragmentation. If the remaining space in the TLAB is larger than this amount, and an allocation is requested that is too large to be allocated in the TLAB, then the allocation is done directly in Eden and the TLAB is not retired. If the remaining space is less than refill_waste_limit, then the TLAB is retired, a new TLAB is allocated, and the object allocation is attempted in the new TLAB.
After each allocation outside of the TLAB, refill_waste_limit is incremented by TLABWasteIncrement to prevent an allocation of a size slightly less than refill_waste_limit from continually being allocated outside of the TLAB.

alloc: [fraction] [sizeInBytes] The expected amount of Eden allocated by this thread, as a fraction of Eden and as a number of heap words.

refills: Number of TLAB refills.

waste [percent] gc: [bytes] slow: [bytes] fast: [bytes] Percentage of the Eden allocated to this thread that was wasted. Waste is the sum of three components:
  gc: unused space in the current TLAB when stopped for a scavenge.
  slow: sum of unused space in TLABs when they're retired to allocate a new one.
  fast: the client system can allocate a TLAB with a fast allocator; this is the amount of waste via that method.

Summary example:

TLAB totals: thrds: 1 refills: 50 max: 50 slow allocs: 5 max 5 waste: 0.1% gc: 0B max: 0B slow: 4144B max: 4144B fast: 0B max: 0B

thrds: Number of threads that did an allocation.

refills: [tt] max: [mm] Total number of TLAB refills by all threads, and the maximum number of TLAB refills by a single thread.

slow allocs: [ss] max [mm] Total number of allocations done outside of a TLAB, and the maximum number by a single thread.

waste [percent] gc: [bytes] slow: [bytes] max: [mm] fast: [bytes] max: [mm] Percentage of Eden that was wasted across all threads. Waste is the sum of three components:
  gc: unused space in the current TLABs when the scavenge starts.
  slow: sum of unused space in TLABs when they're retired to allocate new ones.
  fast: the client system can allocate a TLAB with a fast allocator; this is the amount of waste via that method.
For "slow" and "fast", the maximum value by a single thread is printed.

More detail with the addition of the Verbose flag:

-XX:+PrintTLAB -XX:+Verbose

Using both -XX:+PrintTLAB and -XX:+Verbose will print the new TLAB size for each thread when it is resized. Resizing is only done at GC scavenges.
Example:

TLAB new size: thread: 0x001eac00 [id: 19] refills 50 alloc: 0.402570 size: 19684 -> 18996

The new size is 18996; the previous size was 19684.

refills: Number of TLAB refills for this thread.

alloc: The expected fraction of Eden this thread will use.
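The sizing arithmetic above can be sketched in a few lines of Java. This is illustrative only: the class, method, and variable names are mine (HotSpot's internal code differs), and the Eden size, thread count, and refill average in the example are hypothetical.

```java
// Illustrative sketch of the TLAB sizing arithmetic described above.
// Names are mine, not HotSpot's; all example values are hypothetical.
public class TlabSizing {

    // target_refills_per_epoch = (0.5 * 100) / TLABWasteTargetPercent
    public static long targetRefillsPerEpoch(double tlabWasteTargetPercent) {
        return Math.round((0.5 * 100.0) / tlabWasteTargetPercent);
    }

    // init_size = size_of_eden / (allocating_thread_count * target_refills_per_epoch)
    public static long initialTlabSize(long edenBytes, int allocatingThreads, long targetRefills) {
        return edenBytes / ((long) allocatingThreads * targetRefills);
    }

    // new_size = (old_size * avg_refills_per_epoch) / target_refills_per_epoch
    public static long resizedTlabSize(long oldSize, double avgRefillsPerEpoch, long targetRefills) {
        return Math.round(oldSize * avgRefillsPerEpoch / targetRefills);
    }

    public static void main(String[] args) {
        long target = targetRefillsPerEpoch(1.0);                  // default waste target: 50 refills
        long init = initialTlabSize(64L * 1024 * 1024, 8, target); // hypothetical 64M Eden, 8 threads
        long resized = resizedTlabSize(init, 100.0, target);       // refilling twice as often -> TLAB grows
        System.out.println(target + " " + init + " " + resized);
    }
}
```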



Location, Location, Location

Allocating a bunch of objects together and then using them together seems to be a good thing for cache utilization. Not always true, of course, but, if you're willing to take that as a quasi-truth, then you might be interested in how garbage collections can stir up that good locality.

I'm going to talk first about our serial garbage collector. It is a generational garbage collector with a copying young generation collector and a mark-sweep-compact tenured generation collector. My comments generally apply to all our collectors, but allow me to do the simplest one first. After that I'll comment on how doing the collection in parallel makes it even more interesting. By the way, the techniques I'll be describing for the serial collector are pretty generic and are used by other garbage collectors.

When an application thread T allocates objects, the allocations are generally done out of the young generation. Those allocations done by T together (i.e., at about the same time) are allocated in close proximity in the heap. This is mostly true even though allocations are being done in parallel by different application threads (different T's): each T allocates from its own thread local allocation buffer (TLAB). I described recent improvements to TLAB's in an earlier blog with the title "A Little Thread Privacy, Please". If you read the first part of that one, you'll get the idea. At this point, after the allocations are done, we have the good locality we've come to expect from the allocate-together, use-together practice. Then a garbage collection happens.

When the young generation is collected, the surviving objects can be copied to either the survivor spaces or the tenured generation. The survivor spaces are areas in the young generation where young generation objects are kept if they survive a few collections. After surviving a few collections, objects in the survivor spaces are copied (promoted) to the tenured generation where long lived objects are kept.
Let's consider the very first collection and the case where all the objects that survive that first collection are copied to a survivor space. The young generation collection starts by finding objects being used directly by the application. By "directly" I mean the application has a reference to those objects A, as opposed to having a reference to an object (A) that contains a reference to a second object (B). The garbage collector scans the stacks of each application thread and, when it finds a reference to an object A, it copies A to a survivor space. Incidentally, that's also when the reference to A in the thread's stack is updated to point to the new location of A. Only A is copied at this point. If A contains references to another object B, the object B is not touched. The garbage collector continues to scan the stacks of the threads until it has found and copied all the objects directly referenced by the application. At that point the garbage collector starts looking inside the objects A that it has copied. It's then that the garbage collector would find B in A and would copy B to a survivor space. And as before, if B contains a reference to an object C, nothing is yet done about C.

As an example, suppose an application thread T repeatedly
  * allocates object C
  * allocates object B and puts a reference to C into B
  * allocates object A and puts a reference to B into A
The layout of objects in the heap would look something like

  ... C1 B1 A1 C2 B2 A2 C3 B3 A3 ...

If thread T only kept references to the objects A (i.e., there are no references directly to the B's or the C's on the thread stacks), then after the young generation collection the layout in the survivor space would look like

  ... A1 A2 A3 ... B1 B2 B3 ... C1 C2 C3 ...

where the (...) represent other objects X found directly from T's stack or objects found from those X's. Here I'm being just a little generous in grouping the A's, B's, and C's together.
That grouping assumes that A1 is found first, then A2, and then A3. Before the collection, loading A1 into a cache line could reasonably also load B1 and C1 into the same cache line. After the collection, B1 and C1 are in a galaxy far, far away. The A's, B's and C's have been separated so that their allocation order is no longer reflected in their location in the heap. The basic damage has been done.

In the case where all the objects A, B, and C are promoted to the tenured generation, the layout in the tenured generation would be similar (just in the tenured generation instead of in a survivor space). In the case where some of the objects are copied to a survivor space and some are promoted, some of each of the A's, B's, and C's can end up in either a survivor space or the tenured generation. You might think that it would tend to look something like

  (survivor space) ... A1 A2 A3 ...
  (tenured generation) ... B1 B2 B3 ... C1 C2 C3 ...

It might. Or it might be a bit messier. The above example would result if there was room in the survivor spaces for the A's but not for the B's and C's. If the B's were particularly large (so that they did not fit into the space remaining in the survivor space) and the C's were particularly small, then you might see

  (survivor space) ... A1 A2 A3 ... C1 C2 C3 ...
  (tenured generation) ... B1 B2 B3 ...

If there are references in the application thread stacks to the B's and C's as well as to the A's, and the collector found objects in the order C2 B3 A1 A2 A3, then after the collection the object layout would look like

  ... C2 B3 A1 A2 A3 ... B1 B2 ... C1 C3 ...

The mixtures are as endless as the ways in which the application references its objects.

How does collecting in parallel affect things? Recall that the serial collector first looks at all the references found in the application thread stacks (in the simplest case the A's) before it looks at references inside the A's (i.e., the B's).
The parallel collector starts by looking at some of the A's, but then may look at some of the B's before all the A's have been found. A parallel GC thread copies an A and puts its new location on a stack. I'll refer to this stack as its grey stack, so as not to confuse it with the stacks of the application threads. Each GC thread has its own grey stack. When its grey stack starts to get full, a GC thread will start taking the locations of the A's off its grey stack and looking inside them for B's to copy. It starts looking inside the A's so that its grey stack does not grow too large. In general we prefer to do some copying instead of having to malloc more space for the grey stack. Very quickly there are A's, B's and C's on the grey stacks, and the order in which they are copied looks largely random from outside the garbage collector.

When a parallel GC thread promotes an object, it promotes it into its own distinct area (let me call them PLAB's) in the tenured generation. A GC thread that has run out of objects to copy (no more application thread stacks to scan and no more objects on its own grey stack) can steal objects from another GC thread's grey stack. So A1 may be copied by one GC thread into its PLAB while B1 (if it were stolen) may be copied by another GC thread into a different PLAB.

So collections can disrupt good object locality in a variety of different ways. We don't like that. We're thinking about ways to improve it.

Hope this was understandable. Please ask if it was not.
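The reordering described above can be demonstrated with a toy simulation. This is not collector code: it just walks a made-up object graph in the same order the serial copying collector uses (stack roots first, then the objects found inside already-copied objects) and records where each object would land in the survivor space.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy simulation of the copy order described above. Objects are just labeled
// nodes; "copying" records the order in which the collector would place them
// in the survivor space. Not HotSpot code.
public class CopyOrder {
    public static class Obj {
        public final String label;
        public final Obj ref;   // at most one outgoing reference, for simplicity
        public Obj(String label, Obj ref) { this.label = label; this.ref = ref; }
    }

    // Roots are the objects referenced directly from the thread stacks.
    public static List<String> survivorLayout(List<Obj> roots) {
        List<String> order = new ArrayList<>();
        Queue<Obj> copiedNotYetScanned = new ArrayDeque<>();
        for (Obj root : roots) {            // copy everything found on the stacks first
            order.add(root.label);
            copiedNotYetScanned.add(root);
        }
        while (!copiedNotYetScanned.isEmpty()) {   // then look inside copied objects
            Obj scanned = copiedNotYetScanned.remove();
            if (scanned.ref != null) {
                order.add(scanned.ref.label);
                copiedNotYetScanned.add(scanned.ref);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<Obj> roots = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            Obj c = new Obj("C" + i, null);
            Obj b = new Obj("B" + i, c);
            Obj a = new Obj("A" + i, b);    // allocation order: C, B, A
            roots.add(a);                   // only the A's are on the thread stack
        }
        // After the "collection" the A's, B's and C's are grouped, not interleaved:
        System.out.println(survivorLayout(roots));
        // [A1, A2, A3, B1, B2, B3, C1, C2, C3]
    }
}
```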



Top 10 GC Reasons

to upgrade to J2SE 5.0

10) We fixed some bugs.

9) Thread local allocation buffer dynamic tuning. (See "A Little Thread Privacy Please".)

8) Promotion failure handling was implemented for the low pause collector.

Promotion failure handling is the ability to start a minor collection and then back out of it if there is not enough space in the tenured generation to promote all the objects that need to be promoted. In 5.0 the "young generation guarantee" need not apply for the low pause collector because we implemented promotion failure handling for that collector. The net effect of having promotion failure handling is that more minor collections can be done before a major collection is needed. Since minor collections are typically more efficient than major collections, it's better. The "young generation guarantee" is discussed in the document below.

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

It was particularly harsh on the low pause collector because the "young generation guarantee" required one contiguous chunk of space in the tenured generation. For the low pause collector that requirement could be difficult to satisfy because the low pause collector does not compact the tenured generation and so is subject to fragmentation.

7) Parallel remark in the low pause collector.

For the low pause collector there are two stop-the-world pauses associated with each collection of the tenured generation: the initial marking pause and the remarking pause. The latter is typically much longer than the former. In 5.0 the remarking is done with multiple threads in order to shorten it.

6) Parallel reference processing in the low pause collector.

For an application that extensively uses Reference objects (see

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/ref/Reference.html

), the GC work to process the Reference objects can be noticeable.
It's not necessarily worse in the low pause collector than in the other collectors, but it hurts more (because we're trying to keep the pauses low). Parallel reference processing is available for the low pause collector but is not on by default. Unless there are tons of Reference objects, doing the reference processing serially is usually faster. Turn it on with the flag -XX:+ParallelRefProcEnabled if you make extensive use of Reference objects (most applications don't).

5) Server class machine defaults.

On server class machines (machines with 2 or more cpus and 2G or more of physical memory) the default compiler, garbage collector and maximum heap size are different. Before 5.0 the defaults were
  * -client compiler
  * serial garbage collector
  * 64M maximum heap (on most 32 bit platforms)
With 5.0 the defaults for server class machines are
  * -server compiler
  * throughput garbage collector
  * maximum heap size the lesser of 1/4 of physical memory or 1G
For smaller machines the defaults did not change. The intent of this change was to provide better default performance on large machines that are typically dedicated to running one or a few applications and for which throughput is important. Desktops, for which interactivity was expected to be more important, were excluded from being server class machines. See

http://java.sun.com/docs/hotspot/gc5.0/ergo5.html

for more details.

4) Low pause collector schedules remarking between minor collections.

Recall that the pause due to object remarking is the longer of the two stop-the-world pauses. Before 5.0 it could happen that a remark pause would occur immediately after a minor pause. When this back-to-back minor pause and remark pause occurred, it looked like one big fat pause. With 5.0 the remarking is scheduled so as to be about midway between two minor pauses.

3) Better scheduling of the start of a concurrent low pause collection.
Prior to 5.0 the low pause collector started a collection based on the rate of allocation and the amount of free space in the tenured generation. It did the calculation and started a concurrent collection so as to finish before the tenured generation ran dry. This was good, but there were some end cases that needed to be recognized. For example, if the tenured generation had to be expanded in order to support promotions for a minor collection that just finished, a concurrent collection was started right away. Or if the next minor collection might not succeed because of lack of free space in the tenured generation, then a concurrent collection was started. That latter example wouldn't happen if we could perfectly predict the rate at which the tenured generation is filling up. We're not perfect. Also a bump in the allocation rate might mess us up.

2) GC ergonomics. (See "It's Not Magic" at the start of my blogs.)

And the number 1 GC reason for upgrading to 5.0 is

1) ???

You can tell me what you've found to be a good reason to upgrade to 5.0 in a comment to this blog. Don't be shy. And thanks for any responses.
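The server-class-machine rules in item 5 are simple arithmetic, sketched below. The class and method names are mine, not HotSpot's, and the memory sizes in the example are hypothetical.

```java
// Sketch of the 5.0 server-class defaults described in item 5: a server class
// machine has 2+ cpus and 2G+ of physical memory, and its default maximum heap
// is the lesser of 1/4 of physical memory or 1G. Names are mine, not HotSpot's.
public class ServerClassDefaults {
    public static final long ONE_GIG = 1L << 30;

    public static boolean isServerClass(int cpus, long physicalMemoryBytes) {
        return cpus >= 2 && physicalMemoryBytes >= 2L * ONE_GIG;
    }

    public static long defaultMaxHeap(long physicalMemoryBytes) {
        return Math.min(physicalMemoryBytes / 4, ONE_GIG);
    }

    public static void main(String[] args) {
        System.out.println(isServerClass(4, 8L * ONE_GIG));            // big box: true
        System.out.println(defaultMaxHeap(2L * ONE_GIG) / (1 << 20));  // 1/4 of 2G is the lesser: 512 (MB)
        System.out.println(defaultMaxHeap(8L * ONE_GIG) == ONE_GIG);   // capped at 1G: true
    }
}
```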



What the Heck's a Concurrent Mode?

and why does it fail?

If you use the low pause collector, have you ever seen a message that contained the phrase "concurrent mode failure" such as this?

174.445: [GC 174.446: [ParNew: 66408K->66408K(66416K), 0.0000618 secs]174.446: [CMS (concurrent mode failure): 161928K->162118K(175104K), 4.0975124 secs] 228336K->162118K(241520K)

This is from a 6.0 (still under development) JDK, but the same type of message can come out of a 5.0 JDK.

Recall that the low pause collector is a mostly concurrent collector: parts of the collection are done while the application is still running. The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. Recall also that a concurrent collection attempts to start just-in-time to finish before the tenured generation becomes full. The low pause collector measures the rate at which the tenured generation is filling and the expected amount of time until the next collection, and starts a concurrent collection so that it finishes just-in-time (JIT). Three things to note in that last sentence. The "rate" at which the tenured generation is filling is based on historical data, as is the "expected amount of time" until the next collection. Either of those might incorrectly predict the future. Also the JIT is really JIT plus some amount of padding so as to get it right most of the time.

When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the application threads are started again, and life goes on. Except that the STW collection is not very low pause, and there's the rub.

At this point if you're asking "why does this happen", you've come to the right blog. There are several possibilities.

The amount of live data that is in the tenured generation is too large.
Specifically, there is not enough free space in the tenured generation to support the rate of allocation into the tenured generation. For an example in the extreme, if there are only 2 words of free space in the tenured generation after a collection of the tenured generation, chances are those 2 words will be exhausted before another concurrent collection of the tenured generation can be done. If you are seeing lots of concurrent mode failures, chances are your heap is too small.

Your application can change behaviors dramatically such that past behavior does not adequately predict future performance. If this is the problem, you'll see the concurrent mode failures only near the change in behavior. After a few more collections the low pause collector adjusts its expectations to make better decisions. But to deal with the concurrent mode failures in the meantime, you'll usually be trading off better performance. You can tell the low pause collector to start a collection sooner. The flag -XX:CMSInitiatingOccupancyFraction=NN will cause a concurrent collection to start when NN percent of the tenured generation is full. If you use this option to deal with the concurrent mode failures that result from a change in the behavior of your application, much of the time (when the application's behavior is more steady state) you'll be starting collections too early and so doing more collections than necessary. If you set NN to 0, it will cause one concurrent collection to be followed as soon as possible by another. The next collection may not start immediately after the last because the check on when to start a collection is done only at particular points in the code, but the collection will start at the next opportunity.

Your application may not have dramatic changes in behavior, but if it has a large variance in allocation rates, that can cause the JIT GC to not be JIT.
You can add some more padding to the time at which a concurrent collection kicks off by using the flag -XX:CMSIncrementalSafetyFactor=NN. The default value for NN is 10 (i.e., a 10% padding on the start of the concurrent collection). Increasing NN to 100 starts a concurrent collection at the next opportunity.
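The effect of -XX:CMSInitiatingOccupancyFraction described above can be illustrated with a small sketch of the occupancy check. This is an illustration of the policy, not the actual HotSpot logic, and the occupancy numbers in the example are hypothetical (the capacity reuses the tenured generation size from the log line at the top of the post).

```java
// Sketch of the CMSInitiatingOccupancyFraction policy described above: start a
// concurrent collection once the tenured generation is more than NN percent
// full. Illustration of the policy only, not the actual HotSpot code.
public class CmsStartPolicy {
    public static boolean shouldStartConcurrentCycle(long usedBytes, long capacityBytes,
                                                     int initiatingOccupancyFraction) {
        double occupancyPercent = 100.0 * usedBytes / capacityBytes;
        return occupancyPercent > initiatingOccupancyFraction;
    }

    public static void main(String[] args) {
        long capacity = 175104L * 1024;   // tenured size from the log line above
        // With -XX:CMSInitiatingOccupancyFraction=70:
        System.out.println(shouldStartConcurrentCycle(capacity / 2, capacity, 70));     // 50% full: false
        System.out.println(shouldStartConcurrentCycle(3 * capacity / 4, capacity, 70)); // 75% full: true
        // With NN set to 0, a new cycle is requested at the next opportunity:
        System.out.println(shouldStartConcurrentCycle(1, capacity, 0));                 // true
    }
}
```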



The Fault with Defaults

During early work on the low pause collector, some of the applications we used for benchmarking the collector had the characteristic that objects had very short lifetimes or relatively long lifetimes, with not much in between. All our current collectors use a copying collection for the young generation. This copying collection will copy the younger objects back into the young generation and copy the older objects into the tenured generation. The expectation is that more of the younger objects will die in the young generation if they are given a little more time. A consequence of this is that long lived objects are copied repeatedly into the young generation before finally being promoted into the tenured generation. Depending on the mix of lifetimes of your objects, this can be good (more objects die in the young generation) or this can be bad (extra copying into the young generation). Additionally, at about the same time we were discussing the policy of never doing the copying but rather always promoting any objects that survived a young generation collection immediately into the tenured generation. There are some collectors that do that, and there was some anecdotal evidence that it was a good strategy. Given some example applications where never copying back into the young generation seemed to be the right strategy, and the general discussion about what was the right policy for copying, we decided to make the default behavior for the low pause collector to always promote objects surviving a young generation collection into the tenured generation.

This default policy still seems reasonable for some applications, but the old adage about one size not fitting all certainly applies here. So here's what you should consider and what you can do.

The young generation is acting like a filter to dispose of the shorter lived objects.
Any objects that get through the filter are moved to the tenured generation and are henceforth handled by the mostly concurrent collections (which I'll just call the concurrent collections from here on out). If the concurrent collections are starting too often to suit your tastes, one possibility is to filter out more of the short lived objects by doing some copying during the young generation collections. This may slow down the growth in the number of objects in the tenured generation and lessen the need for too frequent concurrent collections. To do this, try the command line flags

  -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8

MaxTenuringThreshold is the number of collections that an object must survive before being promoted into the tenured generation. When objects are copied back into the young generation during a young generation collection, the objects are copied into a part of the young generation that is referred to as a survivor space. SurvivorRatio set to 8 will mean that each survivor space is about 10% of the size of the young generation. The survivor spaces and SurvivorRatio are described in the 5.0 tuning guide that can be found under

http://java.sun.com/docs/hotspot

The FAQ attached to the 5.0 tuning guide has an entry about experimenting with MaxTenuringThreshold.

By the way, having concurrent collections starting "too often" can also be a sign that the tenured generation is smaller than is needed for good performance. You can always try increasing the size of the heap to reduce the frequency of the concurrent collections. The suggestions above may be able to get you the less frequent concurrent collections without the longer concurrent collections that typically come when the heap is increased.

The default setting to always promote objects that survive a young generation collection is only the default for the low pause collector. The other collectors by default will do some copying before promoting an object to the tenured generation.
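The SurvivorRatio arithmetic above can be sketched in one line: the young generation holds Eden plus two survivor spaces, and SurvivorRatio is the ratio of Eden to one survivor space. The names and the young generation size in the example are mine, for illustration only.

```java
// Sketch of how SurvivorRatio divides the young generation, as described above:
// young = eden + two survivor spaces, with eden / survivor = SurvivorRatio.
// Illustrative arithmetic only; names and sizes are mine.
public class SurvivorSizing {
    public static long survivorSpaceSize(long youngGenBytes, int survivorRatio) {
        return youngGenBytes / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        long young = 100L * 1024 * 1024;   // hypothetical 100M young generation
        long survivor = survivorSpaceSize(young, 8);
        // With -XX:SurvivorRatio=8 each survivor space is about 10% of the young gen:
        System.out.println(survivor == young / 10);  // true
    }
}
```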



When the Sum of the Parts

doesn't equal a big enough hole.

Did I mention that the low pause collector maintains free lists for the space available in the tenured generation and that fragmentation can become a problem? If you're using the low pause collector and things are going just peachy for days and days and then there is a huge (relatively speaking) pause, the cause may be fragmentation in the tenured generation.

In 1.4.2 and older releases, in order to do a young generation collection there was a requirement that there be a contiguous chunk of free space in the tenured generation that was big enough to hold all of the young generation. In the GC tuning documents at

http://java.sun.com/docs/hotspot/

this is referred to as the young generation guarantee. Basically, during a young generation collection any data that survives may have to be promoted into the tenured generation, and we just don't know how much is going to survive. Being our usual conservative selves, we assumed all of it would survive, and so there needed to be room in the tenured generation for all of it. How does this cause a big pause? If the young generation is full and needs to be collected but there is not enough room in the tenured generation, then a full collection of both the young generation and the tenured generation is done. This collection is a stop-the-world collection, not a concurrent collection, so you generally see a pause much longer than you want. By the way, this full collection is also a compacting collection, so there is no fragmentation at the end of the full collection.

In 5.0 we added the ability in the low pause collector to start a young generation collection and then to back out of it if there was not enough space in the tenured generation.
Being able to back out of a young generation collection allowed us to make a couple of changes. We now keep an average of the amount of space that is used for promotions and use that (with some appropriate padding to be on the safe side) as the requirement for the space needed in the tenured generation. Additionally, we no longer need a single contiguous chunk of space for the promotions, so we look at the total amount of free space in the tenured generation in deciding if we can do a young generation collection. Not having to have a single contiguous chunk of space to support promotions is where fragmentation comes in (or rather, where it doesn't come in as often). Yes, sometimes using the averages for the amount promoted and the total amount free in the tenured generation tells us to go ahead and do a young generation collection and we get surprised (there really isn't enough space in the tenured generation). In that situation we have to back out of the young generation collection. It's expensive to back out of a collection, but it's doable. That's a very long way of saying that fragmentation is less of a problem in 5.0. It still occurs, but we have better ways of dealing with it.

What should you do if you run into a fragmentation problem?

Try 5.0.

Or you could try a larger total heap and/or a smaller young generation. If your application is on the edge, it might give you just enough extra space to fit all your live data. But often it just delays the problem.

Or you can try to make your application do a full, compacting collection at a time which will not disturb your users. If your application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night. That will compact the heap and you can hopefully go another day without hitting the fragmentation problem. Clearly no help for an application that does not have a logical "middle of the night".
Or if by chance most of the data in the tenured generation is read in when your application first starts up, and you can do a System.gc() after you complete initialization, that might help by compacting all data into a single chunk, leaving the rest of the tenured generation available for promotions. Depending on the allocation pattern of the application, that might be adequate.

Or you might want to start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. If you are doing concurrent collections and freeing enough space, you can try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalesce adjacent free blocks, so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction=NNN, where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time you spend doing GC but may avoid the fragmentation problem. And this will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.

By the way, I've increased the comment period for my blogs. I hadn't realized it was so short.
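The "System.gc() in the middle of the night" idea above amounts to computing the delay until a quiet hour and handing System.gc() to a scheduler. This is a minimal sketch; the 3 AM quiet hour is a hypothetical choice, and the names are mine.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Sketch of scheduling a nightly System.gc() as suggested above: compute the
// delay until a quiet hour (3 AM here, purely hypothetical) and hand the call
// to a scheduler. Assumes DisableExplicitGC is not set.
public class NightlyCompaction {
    // Millis from 'now' until the next occurrence of 'target' (today or tomorrow).
    public static long millisUntil(LocalDateTime now, LocalTime target) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next).toMillis();
    }

    public static void main(String[] args) {
        long delay = millisUntil(LocalDateTime.of(2024, 1, 1, 23, 30), LocalTime.of(3, 0));
        System.out.println(delay);  // 3.5 hours, in millis
        // In a long-running server you would hand this to a scheduler, e.g.:
        // Executors.newSingleThreadScheduledExecutor()
        //     .scheduleAtFixedRate(System::gc, delay, TimeUnit.DAYS.toMillis(1),
        //                          TimeUnit.MILLISECONDS);
    }
}
```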



So What's It Worth to You?

Have you ever gone to your favorite restaurant, had a great meal, and sat back with your second cup of coffee and asked "Why can't those guys do a pauseless garbage collector? How hard can it be?" Well, you're not the only one. That question came up around here recently.

There are techniques for doing pauseless garbage collection. The problem is that it costs in terms of application performance or additional Java heap size. Threads in an application read and write objects. The garbage collector reads and writes (i.e., frees or moves) objects. All those reads and writes have to have some coordination. "Pauseless" GC very often means that the application as a whole is not paused, but individual threads are paused one at a time. Imagine two application threads that are furiously playing hot potato with an object. As soon as thread 1 passes the object A to thread 2, thread 1 forgets about A and thread 2 realizes it has A. Then thread 2 immediately passes it back to thread 1 with the analogous result. A pauseless collector pauses thread 1 to look at what objects it is holding, then restarts thread 1 (so only one thread is paused at a time) and pauses thread 2 to examine it. Did object A get passed from thread 2 to thread 1 in that window during which both were executing? If it did, and that is the only reference to object A, then the collector will not know that object A is live without doing more work. I'm going to skip the details here. The point of this note is not to delve into the particular difficulties of pauseless GC, but rather to give you an idea of why it can be more costly.

Also, I'm going to consider collectors that compact objects (i.e., move them), mostly because it's an easier example to talk about. Some collectors don't move objects during a collection. Not moving objects simplifies some aspects of a collection but has its own complications (e.g., maintaining free lists of objects, splitting and coalescing blocks of free space, and fragmentation).
If the collector does compact the objects, it has to be sure that all references to live objects point to their new locations. A common way to catch object A when it has slipped through the window is to use what is fondly referred to as a read barrier. Any read of an object by the application goes through code (the read barrier) that looks at the object and decides whether something special has to be done. The "something special" depends on the particulars of the collector. Without getting into those particulars, just having to do extra stuff on every read has got to hurt. Yes, I'm really oversimplifying this example, but what can you expect from my really simple mind.

So what's pauseless GC worth to you (i.e., in terms of the extra space and performance costs, the extra complexity, maybe special hardware)? It's definitely worth a bunch to some. But for many it isn't really necessary to have truly pauseless GC. Shorter is good enough.

Why can't those guys do a pauseless garbage collector? Would "pauseless garbage collection" give us the biggest bang for the buck (i.e., most benefits to most users)? When we were deciding what to do for the next release, we asked ourselves that question. And we decided there were things we should do before "pauseless GC". Did I mention that I recently worked on the parallel collection of the tenured generation for the throughput collector? I'll tell you about it some time.
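To make the read-barrier idea concrete, here is a deliberately tiny Java sketch. All the names here (Ref, needsAction, barrierAction) are invented for illustration and bear no relation to any real collector's code; the point is only that every load of a reference funnels through one method where the collector can interpose.

```java
// Illustrative read-barrier sketch; all names are invented.
// The application never touches `target` directly: every load goes
// through read(), which is where a collector would interpose (e.g.,
// to follow a forwarding pointer if the object has been moved).
final class Ref<T> {
    private volatile T target;

    Ref(T initial) { target = initial; }

    T read() {
        T t = target;
        if (needsAction(t)) {      // e.g., object moved, or not yet marked
            t = barrierAction(t);  // do the collector's "something special"
            target = t;            // remember the fixed-up reference
        }
        return t;
    }

    // Placeholder policy: a real collector would check a mark bit or
    // forwarding state here.
    private boolean needsAction(T t) { return false; }
    private T barrierAction(T t) { return t; }
}
```

Even with an empty policy, the cost is visible: every read pays for an extra call and a branch, which is the "has got to hurt" part above.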



Why not a Grand Unified Garbage Collector?

At last count we have three garbage collectors: the parallel collector, the low pause collector, and the serial collector. Why is there more than one? Actually, the usual answer applies. Specialization often results in better performance. If you're interested in more particulars about our garbage collectors, read on.

All three collectors are generational (young generation and tenured generation). Let's do the easy comparison first: why a parallel collector and a serial collector? Parallelism has overhead. 'Nuf said? Yeah, I used to read comic books when I was a kid. If you don't understand that reference, ignore it. I'm just older. As you might infer from the names, the serial collector uses 1 thread to do the GC work and the parallel collector uses multiple threads to do the same. As usual, multiple threads doing the same tasks have to synchronize. That's pretty much it. On a single cpu machine the additional cost of the synchronization means that the parallel collector is slower than the serial collector. On a two cpu machine with a VM that has a small heap the parallel collector is about as fast as the serial collector. With two cpus and large heaps the parallel collector will usually do better. We keep asking ourselves if we can get rid of the serial collector and use the parallel collector in its place. The answer so far keeps coming back no.

More interesting is the case of the low pause collector versus the parallel collector. Above I made the remark about specialization and better performance. This is actually a case of more complexity and lesser performance. These two collectors do the collection of the young generation using almost the exact same techniques. The differences in the collectors have to do with the collections of the tenured generation. The low pause collector does parts of that collection while the application continues to run. One way to do that is to not move the live objects when collecting the dead objects.
The application tends to get confused if the objects it is using move around while the application is running. The other two collectors compact the heap during a collection of the tenured generation (i.e., live objects are moved so as to occupy one contiguous region of the heap). The low pause collector collects the dead objects and coalesces their space into blocks that are kept in free lists. Maintaining free lists and doing allocations from them takes effort, so it's slower than having a heap that is compacted. Having applications run while a collection is happening means that new objects can be allocated during a collection. That leads to more complexity. Also the collection of the tenured generation can be interrupted for a collection of the young generation. More complexity still. The bottom line is that the low pause collector has shorter GC pauses but it costs performance. That performance difference is not huge but it's large enough to keep us from ditching the parallel collector and always using the low pause collector.

And last but not least, can we replace the serial collector with the low pause collector? Very tempting. The serial collector is used by default on desktop machines. We expect those to have 1 or 2 cpus and to be running applications that need 10's of megabytes of Java heap as opposed to 100's of megabytes. With small heaps the differences in collection times tend to matter less. Even if the low pause collector were 10% slower than the serial collector, the difference between, for example, 70 ms and 77 ms often isn't large enough to matter. It would probably be a done deal except that the low pause collector has a larger memory footprint. It has additional data structures that it uses (for example, to keep track of what references are being changed by the running application while a collection is ongoing). It also usually needs a larger Java heap to run an application.
Recall that the low pause collector uses free lists to keep track of the available space in the heap. Fragmentation can become a problem. The best bet is that we'll replace the serial collector with the low pause collector some day, but not just yet.
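As a rough picture of what free-list maintenance involves, here is a minimal first-fit free list in Java. Everything here is invented for illustration; real collectors are far more sophisticated about splitting, coalescing, and block sizing.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal first-fit free list over a notional heap of `heapSize` units.
// Entries are {offset, size} pairs describing free blocks.
class FreeListSketch {
    private final List<int[]> free = new ArrayList<>();

    FreeListSketch(int heapSize) { free.add(new int[]{0, heapSize}); }

    // First fit: take the first block big enough, splitting off the rest.
    // Returns the offset of the allocation, or -1 if no single block fits.
    int allocate(int size) {
        for (int[] b : free) {
            if (b[1] >= size) {
                int offset = b[0];
                b[0] += size;       // split: shrink the block in place
                b[1] -= size;
                if (b[1] == 0) free.remove(b);
                return offset;
            }
        }
        return -1; // enough total space may remain, just not contiguous
    }

    // Return a block to the list. A real collector would coalesce it with
    // adjacent free blocks to fight fragmentation; this sketch doesn't.
    void deallocate(int offset, int size) { free.add(new int[]{offset, size}); }
}
```

Because this sketch never coalesces, a heap with 70 units free in two separate blocks still fails a 50-unit request: fragmentation in miniature.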



A Little Thread Privacy Please

This is not really about garbage collection but hopefully you'll find it interesting. And there is no punch line (i.e., cool command line flag that I can suggest to make your application run better). It's just a story about letting the VM do it.

If you're using multiple threads (or one of the Java libraries is using multiple threads on your behalf) then threads in your application are doing allocations concurrently. All the threads are allocating from the same heap, so some allowances have to be made for simultaneous allocations by 2 or more threads. In the simplest case (other than the case where no allowances are made, which is just plain wrong) each thread would grab a lock, do the allocation and release the lock. That gives the right answer but is just too slow. The slightly more complicated approach would be to use some atomic operation (such as compare-and-swap) plus a little logic to safely do the allocation. Faster, but still too slow. What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the HotSpot JVM we refer to these as thread local allocation buffers (TLAB's). They work well.

But how large should the TLAB's be? Prior to 5.0 that was a difficult question to answer. TLAB's that were too large wasted space. If you had tons of threads and large TLAB's, you could conceivably fill up the heap with buffers that were mostly unused. Creating more threads might force a collection, which would be unfortunate because the heap was mostly empty. TLAB's that were too small would fill quickly, and that would mean having to get a new TLAB, which would require some form of synchronization. There was not a general recommendation that we could make on how large TLAB's should be.
Yup, we were reduced to trial-and-error. Starting with 5.0, living large with TLAB's got much simpler - except for the guy down the hall that did the implementation. Here's what the VM does for you. Each thread starts with a small TLAB. Between the end of the last young generation collection and the start of the next (let me call that period an epoch), we keep track of the number of TLAB's a thread has used. We also know the size of the TLAB's for each thread. Averages for each thread are maintained for these two numbers (number and size of TLAB's). These averages are weighted toward the most recent epoch. Based on these averages the sizes of the TLAB's are adjusted so that a thread gets 50 TLAB's during an epoch. Why 50? All things being equal, we figured that a thread would have used half its TLAB by the end of the epoch. Per thread that gave us 1/2 a TLAB not used (out of the magic 50) for a wastage of 1%. That seemed acceptable. Also, if the young generation was not large enough to provide the desired TLAB's, the size of the young generation would be increased in order to make it so (within the usual heap size constraints, of course). The initial TLAB size is calculated from the number of threads doing allocation and the size of the heap. More threads push down the initial size of the TLAB's and larger heaps push up the initial size.

An allocation that cannot be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization, but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.
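A bump-pointer TLAB can be sketched in a few lines of Java. Everything below is invented for illustration (the sizes, names, and the large-allocation cutoff are made up; HotSpot's real code looks nothing like this), but it shows the shape of the idea: the slow path takes one atomic operation to carve a buffer out of the shared heap, and every allocation after that is a plain, unsynchronized pointer bump.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative TLAB: threads bump-allocate from a private buffer and only
// touch the shared heap pointer (an atomic getAndAdd) to refill the buffer.
class TlabSketch {
    static final AtomicLong heapTop = new AtomicLong(); // shared "heap" pointer
    static final long TLAB_SIZE = 1024;
    static final long LARGE = TLAB_SIZE / 4; // "significant fraction" cutoff

    static final class Tlab {
        private long top;  // next free address in this thread's buffer
        private long end;  // one past the last usable address

        long allocate(long size) {
            if (size > LARGE) {
                // Large allocation: go straight to the shared heap rather
                // than burn most of a fresh TLAB on one object.
                return heapTop.getAndAdd(size);
            }
            if (top + size > end) {            // buffer exhausted: slow path
                top = heapTop.getAndAdd(TLAB_SIZE);
                end = top + TLAB_SIZE;
            }
            long addr = top;                   // fast path: no synchronization
            top += size;
            return addr;
        }
    }
}
```

Consecutive small allocations from one thread come out of the same buffer, so they land at adjacent addresses without any shared-state traffic.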



Since It's Not Magic

So by now you've tried GC ergonomics and things are working pretty well. If you didn't have any GC related command line flags, your reactions could run the gamut from the "big yawn" to an eye-popping "that's cool". The "big yawn" means that GC probably doesn't matter for your application. The "that's cool" means that GC matters but nobody ever told you. If you did have a bunch of GC related command line flags and you threw them out in order to try GC ergonomics, then 80% of you are happy and just glad to forget those command line flags. But 20% of you are thinking "Hmmm, not quite as good". So what else do you need to know about GC ergonomics? By the way, I invoke the 80/20 rule quite often, especially when I have no idea what the real numbers are.

By default ergonomics gives you a larger maximum heap size. "Larger" is relative to the default maximum heap size you got prior to ergonomics in 5.0. This larger maximum heap size depends on the amount of physical memory on the system (the more physical memory you have, the larger the maximum heap size you get), but it is capped at 1G. If you had a command line flag that sets a larger maximum heap size, put it back. Ergonomics just doesn't know you're in the 20% that needs a larger heap. If you set a larger maximum heap size, ergonomics will still grow and shrink the generations for better performance but now will have more space to play with. By the way, if you set a smaller maximum heap size, ergonomics still does its stuff but just within a smaller heap.

Ergonomics has a default for the relative sizes of the young generation and the tenured generation in the heap. If you have command line flags to specify the sizes of the generations (probably something like -XX:NewRatio=ratio or -XX:MaxNewSize=bytes), use them.
GC ergonomics normally does not change the maximum size of a generation, so if you know something about how your application runs such that you know it benefits from a larger or smaller young generation, pass that information along on the command line. Again, GC ergonomics just works within the limits of the heap it's given.
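One easy way to see what heap limits ergonomics actually ended up working with is to ask the standard Runtime API from inside your program; run the same snippet with and without flags like -Xmx or -XX:NewRatio to compare.

```java
// Print the heap sizes the JVM is currently working with.
class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("max heap:  " + rt.maxMemory() / mb + " MB");
        System.out.println("committed: " + rt.totalMemory() / mb + " MB");
        System.out.println("free:      " + rt.freeMemory() / mb + " MB");
    }
}
```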



When Are You Out of Memory?

You know why you get an out-of-memory exception, right? Your live data exceeds the space available in the Java heap. Well, that's very nearly always right. Very, very nearly. If the Java heap is barely large enough to hold all the live data, the JVM could be doing almost continual garbage collections. For example, if 98% of the data in the heap is live, then there is only 2% that is available for new objects. If the application is using that 2% for temporary objects, it can seem to be humming along quite nicely, but not getting much work done. How can that be? Well, the application runs until it has allocated that 2% and then a garbage collection happens and recovers that 2%. The application runs along happily allocating and the garbage collector runs along respectfully collecting. Over and over and over. The application will be making forward progress but maybe oh so slowly. Are you out of memory?

Back in the 1.4.1 days a customer noticed this type of behavior and asked for help in detecting that bad situation. In 1.4.2 the throughput collector started throwing an out-of-memory exception if the VM was spending the vast majority of its time doing garbage collection and not recovering very much space in the Java heap. In 5.0 the implementation was changed some, but the idea was the same. If you are spending way too much time doing garbage collections, you're going to get an out-of-memory. Interestingly enough, this identified at least one case in our own usage of Java applications where we were spending most of our time doing garbage collection. We were happy to find it.

Why do I bring this up? Well, mostly because it was brought up in our GC meeting this morning. If you're in this situation of spending most of your time in garbage collection, I think you are out of memory and you need a bigger heap. If you don't think that, you can turn off this behavior with the command line flag -XX:-UseGCTimeLimit. May you never need it.
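You can watch a similar quantity yourself, the fraction of run time spent collecting, using the standard java.lang.management beans. This reports cumulative figures since JVM start, so it's only a rough proxy for the windowed measurement the VM itself uses, but it's enough to spot an application grinding away in GC.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough estimate of the fraction of elapsed run time spent in GC.
class GcOverhead {
    static double gcFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc :
                 ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if not supported
            if (t > 0) gcMillis += t;
        }
        long upMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return upMillis <= 0 ? 0.0 : (double) gcMillis / upMillis;
    }
}
```

In the 98%-live scenario above, this fraction would hover near 1.0 even though the program appears to be running.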



What Are We Thinking?

What's next for GC ergonomics? Just a friendly warning: this one verges on GC stream-of-consciousness ramblings.

GC ergonomics has been implemented so far in the throughput collector only. We've been thinking about how to extend it to the low pause collector. The low pause collector currently is implemented as a collector that does some of its work while the application continues to run. It's described in http://java.sun.com/docs/hotspot. Some of the policies we used in the throughput collector will also be useful for the low pause collector, but because the low pause collector can be running at the same time as the application, there are some intriguing differences. By the way, the low pause collector does completely stop the application in order to do some parts of the collection, so some of our experience with the throughput collector is directly applicable. On the other hand, having this mix of behaviors can be interesting in and of itself.

When we were developing the low pause collector we decided that any parts of the collection that we could do while the application continued to run were good. It was free. If there are spare cycles on the machine, that's almost true. If there aren't spare cycles, then it can get fuzzy. If the collection steals cycles that the application could use, then there is a cost. Especially if there is only one processor on the machine. If there is more than one processor on the machine and I'm doing GC, am I stealing cycles from the application? If I steal cycles from another process on the machine, does it become free again? We've been thinking about how to assess the load on a machine and what we should do in different load situations. That type of information may turn out to be input for GC ergonomics.

Another aspect that we have to deal with is the connection between the young generation size and the tenured generation pause times (pauses in collecting the tenured generation, that is).
When collecting the tenured generation, we need to be aware of objects in the young generation that can be referencing (and thus keeping alive) objects in the tenured generation. In fact we have to find those objects in the young generation. And the larger the young generation is, the longer it takes to find those objects. With the throughput collector the time to collect the tenured generation is only distantly related to the size of the young generation. With the low pause collector the connection is stronger. If we're trying to meet a pause time goal for a pause that is part of the tenured generation collection, then maybe we should reduce the size of the young generation as well as reduce the size of the tenured generation. But maybe not.

With the throughput collector a collection is started when the application attempts to allocate an object and there is no room left in the Java heap. With the low pause collector we want the collection to finish before we run out of room in the Java heap. So when does the low pause collector start a collection of the tenured generation? Just In Time, hopefully. Starting too early means that some of the capacity of the tenured generation is not used. Starting too late makes the low pause collector not a low pause collector. In the 5.0 release we did some good work to measure how quickly the tenured generation was being filled and used that to decide when to start a collection. It's a nice self-contained problem as long as we can start a collection early enough. But if we cannot start a collection in time, then we probably need a larger tenured generation. So a failure to JIT/GC needs to feed into GC ergonomics decisions. Well, really we don't actually want to fail to JIT/GC before we expand the tenured generation, so there's more to think about. But not right now.
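The "Just In Time" decision can be caricatured in a few lines of Java. Everything here is invented for illustration (the class, the 0.7 weighting, and the comparison are mine, not the 5.0 implementation): track a weighted fill rate for the tenured generation and start a concurrent collection once the projected time-to-full drops below the expected collection time.

```java
// Illustrative start policy for a concurrent collection; all numbers
// and names are invented. Track how fast the tenured generation fills
// and start collecting before it would fill completely.
class ConcurrentStartSketch {
    private double fillRate; // bytes per second, weighted toward recent samples

    void sample(double bytesPerSec) {
        // Weight the newest observation heavily (0.7 is arbitrary).
        fillRate = 0.3 * fillRate + 0.7 * bytesPerSec;
    }

    boolean shouldStart(double freeBytes, double expectedGcSecs) {
        if (fillRate <= 0) return false;            // no allocation pressure yet
        double secsUntilFull = freeBytes / fillRate; // naive extrapolation
        return secsUntilFull <= expectedGcSecs;      // start before we run out
    }
}
```

Starting when secsUntilFull is much larger than the collection time wastes tenured capacity; starting after it has shrunk below the collection time is the "failure to JIT/GC" case discussed above.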



What Were We Thinking?

There were some decisions made during the development of GC ergonomics that perhaps deserve some explanation. For example: the pause time goal comes first; a pause is a pause is a pause; ignore the cost of System.gc()'s.

Why is the pause time goal satisfied first? GC ergonomics tries to satisfy a pause time goal before considering any throughput goal. Why not the throughput goal first? I tried both ways with a variety of applications. As one might expect, it was not black and white. In the end we chose to consider the goals in this order: pause time goal, throughput goal, smaller footprint. The pause time goal definitely has the potential for being the hardest goal to meet. Its dependence on heap size is complicated, and trying to meet the pause time goal without the encumbrances of either of the other goals was easier to think about. If we could meet the pause time goal, then increasing the heap to try and meet a throughput goal felt safer (i.e., the relationship between throughput and heap size is more linear, so it was easier to understand how undoing an increase would get us back to where we started). In retrospect it also seems more natural to have the pause time goal (which pushes heap size down) competing with the throughput goal (which pushes heap size up), and only then to have the throughput goal (which again pushes the heap size up) competing with the footprint goal (which, of course, pushes the heap size down).

A pause is a pause ... We talked quite a bit about whether the pause time goal should apply to both the major and minor pause times. The issue was whether it would be effective to shrink the size of the old generation to reduce the major pause times. With a young generation collection you can shrink the heap more easily because there is always some place to put any live objects in the young generation (namely into the old generation).
It was clear that reducing the young generation size would reduce the minor collection times (after you've paid the cost of getting the collection started and shutting it down). Well, that's true if you can ignore the fact that more frequent collections give objects less time to die. With the old generation it was much less obvious what would happen. The old generation can only be shrunk down to a size big enough to hold all the live data in the old generation. Also, the amount of free space in the old generation has an effect on the young generation collection, in that a young generation collection may need to copy objects into the old generation. In the end we decided that trying to limit both the major pauses and minor pauses with the pause time goal, while harder, was more meaningful. Would you have accepted the excuse "Yes, we missed the goal but it was a major collection pause, not a minor collection pause"?

System.gc()'s. Just ignore them. During development I initially tried to include the costs of System.gc()'s in the calculation of the averages used by GC ergonomics. In calculating the cost of collections, the frequency of collections matters. If you are having collections more often, then the cost of GC is higher. The strategy to reduce that cost is to increase the size of the heap so that collections are less frequent (i.e., since the heap is larger you can do more allocations before having to do another collection). The difficulty with System.gc()'s is that increasing the size of the heap does not in general increase the time between System.gc()'s. I tried to finesse the cost of a System.gc() by considering how full the heap was when the System.gc() happened and extrapolating to how long the interval between collections would have been. After some experimentation I found that picking how to do the extrapolation was basically picking the answer (i.e., what the GC cost would have been).
I could tailor an extrapolation to fit one application, but invariably it did not fit some other applications. Basically it was too hard. So GC ergonomics ignores System.gc()'s.



Where Are the Sharp Edges?

In general GC ergonomics works best for an application that has reached a steady state behavior in terms of its allocation pattern, or at least is not changing its allocation pattern quickly. GC ergonomics measures the pause times and throughput of the application and changes the size of the heap based on those measurements. The measurements of pause time and throughput are kept in terms of a weighted average where (as one would expect) the most recent measurements are weighted more heavily. By using a weighted average, GC ergonomics is not going to turn on a dime in response to a change in behavior by the application, but it is also not going to go flying off in a wrong direction because of normal variations in behavior. If past behavior is not a good indicator of future performance, then GC ergonomics can lag behind in its decision making. If a change is just an occasional bump in the road, GC ergonomics will catch up. If behavior is all over the map, well, what can I say.

The easiest way to get into trouble with GC ergonomics is to specify a pause time goal that is not reachable. Typically what happens is that GC ergonomics reduces the pause times by reducing the size of the heap. As the heap is shrunk, the frequency of collections goes up and throughput goes down. GC ergonomics is willing to drive throughput to nearly zero (by doing collections nearly all the time) in order to reach the pause time goal. I tell people to run without a pause time goal initially and see how large the collection pauses get. That gives a baseline for experimenting with a pause time goal. Then have a little fun and try some pause times.

Another thing you should be aware of is that GC ergonomics is going to run at every collection. If your application has settled into a stable steady state, GC ergonomics is still looking to see if anything is changing so it can adjust. It does cost you some cycles, but I don't think it's significant. Let me put it this way.
I've never seen GC ergonomics code show up in a significant way on performance profiles. This is probably less a sharp edge than a mild poke in the ribs. If you think that your application is really not going to be changing its behavior after it has settled in and you want those last few cycles, run with GC ergonomics until your application has reached its steady state and look to see how the heap is sized. You'll have to pay attention to how the generations are sized also. Then select those sizes on the command line and turn GC ergonomics off. At least for most of you that should be plenty good. If performance is not quite as high and you don't already know about survivor spaces, you may have to learn about them. The document "Tuning Garbage Collection with the 5.0 Java Virtual Machine" should help. It can be found under the URL below (same one as in "Magic"). If performance is actually better, rejoice and let me know how we can be doing better.

http://java.sun.com/docs/hotspot
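The kind of weighted average mentioned above can be sketched as an exponentially decaying average. The 0.25 weight in the example below is an arbitrary illustration, not the value ergonomics actually uses.

```java
// Decaying (exponentially weighted) average: recent samples count more,
// so the estimate adapts without overreacting to one noisy measurement.
class WeightedAverage {
    private double avg;
    private boolean seeded;
    private final double weight; // fraction of each update from the new sample

    WeightedAverage(double weight) { this.weight = weight; }

    void add(double sample) {
        avg = seeded ? (1 - weight) * avg + weight * sample : sample;
        seeded = true;
    }

    double value() { return avg; }
}
```

With a weight of 0.25, a single outlier moves the estimate only a quarter of the way toward it, which is the "not going to turn on a dime" behavior described above.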



It's Not Magic

In our J2SE (tm) 1.5.0 release we added a new way of tuning the Java(tm) heap which we call "garbage collector (GC) ergonomics". This was added only to the parallel GC collector. You may also have seen it referred to as "Smart Tuning" or "Simplified Tuning". GC ergonomics allows a user to tune the Java heap by specifying a desired behavior for the application. These behaviors are a maximum pause time goal and a throughput goal.

So what is GC ergonomics? Prior to J2SE 1.5.0, if you wanted to tune the Java Virtual Machine (JVM)(tm) for an application you typically did it by trial-and-error. You would run the JVM on your application without changing any parameters and see how it ran. If the throughput of the application was not as high as you wanted, the usual solution was to increase the heap size. With a larger heap, collections happen less often, so the cost of garbage collection decreases as a percentage of the total execution time. But as you increase the size of the heap, often the length of the garbage collections increases. Since the garbage collector pauses all application threads to do a collection, the application would see longer and longer pauses as you chose larger and larger heaps. If the pauses became too long for your application, then you would have to reduce the size of the heap. You usually have to choose a compromise between pause times and throughput.

With GC ergonomics in J2SE 1.5.0 you choose a pause time goal and a throughput goal and let the JVM increase or decrease the size of the heap to try to meet those goals. On big machines a larger maximum heap size is chosen as a default. GC ergonomics only grows the heap enough to meet your goals, so the maximum heap size is not necessarily used. Sometimes you might have to increase the maximum size of the heap if the default maximum size is too small.

So how does this work? Actually GC ergonomics does pretty much what you would do to tune the heap. As I say in the title, it's not magic.
But it does have the benefit of being able to tune dynamically during the execution of the application. GC ergonomics measures the performance (both throughput and pause times) of your application, compares the performance against the goals, and then either decreases the heap size to shorten pause times or increases the heap size to get fewer collections. If both the pause time goal and the throughput goal are being met, GC ergonomics will decrease the size of the heap to try and minimize the application's footprint.

GC ergonomics tries to meet your goals, but there are no guarantees that it can. For example, a maximum pause time of zero would be nice, but it's not going to happen. Can you tune the heap better than GC ergonomics? Probably yes. Is it worth your time to do it? And to keep it tuned as your circumstances change? You'll have to tell us.

For more information on GC ergonomics, please see "Ergonomics in the 5.0 Java Virtual Machine" under http://java.sun.com/docs/hotspot
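The measure-compare-adjust loop above can be caricatured in a few lines of Java. The names, thresholds, and the single-action simplification are all invented; the real policy adjusts the generations individually. What it does show is the ordering of the goals: pause time first, then throughput, then footprint.

```java
// Caricature of the ergonomics decision order; all names are invented.
class ErgoDecision {
    enum Action { SHRINK_FOR_PAUSE, GROW_FOR_THROUGHPUT, SHRINK_FOR_FOOTPRINT }

    static Action decide(double avgPauseMs, double pauseGoalMs,
                         double gcTimeFraction, double gcTimeGoal) {
        if (avgPauseMs > pauseGoalMs)
            return Action.SHRINK_FOR_PAUSE;      // 1. meet the pause time goal
        if (gcTimeFraction > gcTimeGoal)
            return Action.GROW_FOR_THROUGHPUT;   // 2. then the throughput goal
        return Action.SHRINK_FOR_FOOTPRINT;      // 3. then minimize footprint
    }
}
```

Note how a missed pause goal always wins: the heap shrinks even when shrinking will make the throughput goal harder to meet, which matches the goal ordering described in these posts.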


