Understanding CMS GC Logs

When run with -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps, the CMS collector logs a lot of information. Understanding these logs helps in fine-tuning the parameters of the application and of CMS to achieve the best performance.

Let's have a look at some of the CMS logs generated with 1.4.2_10:

39.910: [GC 39.910: [ParNew: 261760K->0K(261952K), 0.2314667 secs] 262017K->26386K(1048384K), 0.2318679 secs]

Young generation (ParNew) collection. The young generation capacity is 261952K, and the collection brought its occupancy down from 261760K to 0K. Total heap occupancy dropped from 262017K to 26386K (heap capacity 1048384K). This collection took 0.2318679 secs.
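As a rough illustration of how the fields in this entry relate to each other, here is a minimal Python sketch that parses the line and derives the growth of the tenured generation during the scavenge. The regular expression and the names `PARNEW`, `parse_parnew` and `promoted_kb` are assumptions made for this example; they only cover the exact log shape shown above, and "promoted" here simply means "tenured occupancy after minus before".

```python
import re

# Log line copied from the entry above.
LINE = ("39.910: [GC 39.910: [ParNew: 261760K->0K(261952K), 0.2314667 secs] "
        "262017K->26386K(1048384K), 0.2318679 secs]")

# Hypothetical pattern for this one log shape; other JVM versions print extra fields.
PARNEW = re.compile(
    r"\[ParNew: (?P<young_before>\d+)K->(?P<young_after>\d+)K\((?P<young_cap>\d+)K\), "
    r"(?P<young_secs>[\d.]+) secs\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\), "
    r"(?P<pause_secs>[\d.]+) secs\]"
)

def parse_parnew(line):
    """Return the numeric fields of a ParNew entry, or None if the line does not match."""
    m = PARNEW.search(line)
    if m is None:
        return None
    out = {}
    for key, value in m.groupdict().items():
        out[key] = float(value) if key.endswith("_secs") else int(value)
    # Tenured occupancy is whatever part of the heap is not in the young generation.
    old_before = out["heap_before"] - out["young_before"]
    old_after = out["heap_after"] - out["young_after"]
    out["promoted_kb"] = old_after - old_before
    return out

stats = parse_parnew(LINE)
print(stats["promoted_kb"], "K moved into the tenured generation")
print(stats["pause_secs"], "secs pause")
```

The tenured occupancy after this scavenge (26386K) is the same figure that the CMS-initial-mark entry reports next.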

40.146: [GC [1 CMS-initial-mark: 26386K(786432K)] 26404K(1048384K), 0.0074495 secs]

Beginning of the tenured generation collection with the CMS collector. This is the initial mark phase of CMS, in which all objects directly reachable from the roots are marked. It is done with all mutator threads stopped.

The capacity of the tenured generation is 786432K, and CMS was triggered at an occupancy of 26386K.

40.154: [CMS-concurrent-mark-start]

Start of the concurrent marking phase.
In this phase, the threads stopped in the first phase are started again, and all objects transitively reachable from the objects marked in the first phase are marked.

40.683: [CMS-concurrent-mark: 0.521/0.529 secs]

Concurrent marking took 0.521 seconds of CPU time and 0.529 seconds of wall time, which also includes the time spent yielding to other threads.

40.683: [CMS-concurrent-preclean-start]

Start of precleaning.
Precleaning is also a concurrent phase. In this phase we look at the objects in the CMS heap that were updated by promotions from the young generation, by new allocations, or by mutators while concurrent marking was running. By rescanning those objects concurrently, precleaning reduces the work left for the next stop-the-world “remark” phase.

40.701: [CMS-concurrent-preclean: 0.017/0.018 secs]

Concurrent precleaning took 0.017 secs of CPU time and 0.018 secs of wall time.

40.704: [GC40.704: [Rescan (parallel) , 0.1790103 secs]40.883: [weak refs processing, 0.0100966 secs] [1 CMS-remark: 26386K(786432K)] 52644K(1048384K), 0.1897792 secs]

Stop-the-world phase. This phase rescans any residual updated objects in the CMS heap, retraces from the roots and also processes Reference objects. Here the rescanning work took 0.1790103 secs and weak reference processing took 0.0100966 secs. The whole phase took 0.1897792 secs.
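To see where the remark pause goes, the sub-phase timings can be pulled out of the entry and compared with the total. The following Python sketch does that for the line above; the patterns and the name `remark_breakdown` are illustrative assumptions that match only this log shape.

```python
import re

# Remark entry copied from the log above.
REMARK = ("40.704: [GC40.704: [Rescan (parallel) , 0.1790103 secs]"
          "40.883: [weak refs processing, 0.0100966 secs] "
          "[1 CMS-remark: 26386K(786432K)] 52644K(1048384K), 0.1897792 secs]")

def remark_breakdown(line):
    """Split a CMS remark pause into rescan, weak-reference and remaining time."""
    rescan = float(re.search(r"\[Rescan \(parallel\) , ([\d.]+) secs\]", line).group(1))
    weak = float(re.search(r"\[weak refs processing, ([\d.]+) secs\]", line).group(1))
    total = float(re.search(r"\[1 CMS-remark: .*, ([\d.]+) secs\]$", line).group(1))
    return {
        "rescan_secs": rescan,
        "weak_refs_secs": weak,
        "total_secs": total,
        # Whatever is not accounted for by the two reported sub-phases.
        "other_secs": total - rescan - weak,
    }

parts = remark_breakdown(REMARK)
print("rescan share of pause: %.1f%%" % (100 * parts["rescan_secs"] / parts["total_secs"]))
```

In this entry nearly all of the pause is rescanning, which is the work the concurrent precleaning phase tries to reduce.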

40.894: [CMS-concurrent-sweep-start]

Start of the sweeping of dead (unmarked) objects. Sweeping is a concurrent phase, performed with all other threads running.

41.020: [CMS-concurrent-sweep: 0.126/0.126 secs]

Sweeping took 0.126 secs.

41.020: [CMS-concurrent-reset-start]

Start of reset.

41.147: [CMS-concurrent-reset: 0.127/0.127 secs]

In this phase, the CMS data structures are reinitialized so that a new cycle may begin at a later time. In this case, it took 0.127 secs.
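Putting the whole cycle together, only the initial mark and the remark stop the application; the other phases run concurrently. A small Python sketch, assuming the exact log shapes shown above (the names `CONCURRENT`, `PAUSE` and `summarize` are made up for this example), adds up both kinds of time:

```python
import re

# The CMS cycle from the log entries above.
CYCLE = """\
40.146: [GC [1 CMS-initial-mark: 26386K(786432K)] 26404K(1048384K), 0.0074495 secs]
40.154: [CMS-concurrent-mark-start]
40.683: [CMS-concurrent-mark: 0.521/0.529 secs]
40.683: [CMS-concurrent-preclean-start]
40.701: [CMS-concurrent-preclean: 0.017/0.018 secs]
40.704: [GC40.704: [Rescan (parallel) , 0.1790103 secs]40.883: [weak refs processing, 0.0100966 secs] [1 CMS-remark: 26386K(786432K)] 52644K(1048384K), 0.1897792 secs]
40.894: [CMS-concurrent-sweep-start]
41.020: [CMS-concurrent-sweep: 0.126/0.126 secs]
41.020: [CMS-concurrent-reset-start]
41.147: [CMS-concurrent-reset: 0.127/0.127 secs]
"""

# "<cpu>/<wall> secs" lines mark the end of a concurrent phase.
CONCURRENT = re.compile(r"\[CMS-concurrent-([a-z-]+): ([\d.]+)/([\d.]+) secs\]")
# Initial mark and remark are the two stop-the-world pauses of a cycle.
PAUSE = re.compile(r"\[1 CMS-(initial-mark|remark): .*, ([\d.]+) secs\]$")

def summarize(log_text):
    """Sum stop-the-world pause time and concurrent CPU/wall time for one cycle."""
    summary = {"pause_secs": 0.0, "concurrent_cpu_secs": 0.0, "concurrent_wall_secs": 0.0}
    for line in log_text.splitlines():
        line = line.strip()
        m = CONCURRENT.search(line)
        if m:
            summary["concurrent_cpu_secs"] += float(m.group(2))
            summary["concurrent_wall_secs"] += float(m.group(3))
            continue
        m = PAUSE.search(line)
        if m:
            summary["pause_secs"] += float(m.group(2))
    return summary

summary = summarize(CYCLE)
print("stop-the-world: %.4f secs" % summary["pause_secs"])
print("concurrent (wall): %.3f secs" % summary["concurrent_wall_secs"])
```

For this cycle the application was paused for roughly 0.2 secs in total, while about 0.8 secs of wall time went to phases that ran alongside the application.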

This is how a normal CMS cycle runs. Now let us look at some other CMS log entries:

197.976: [GC 197.976: [ParNew: 260872K->260872K(261952K), 0.0000688 secs]197.976: [CMS197.981: [CMS-concurrent-sweep: 0.516/0.531 secs]
(concurrent mode failure): 402978K->248977K(786432K), 2.3728734 secs] 663850K->248977K(1048384K), 2.3733725 secs]

This shows that a ParNew collection was requested but not attempted, because it was estimated that the CMS generation did not have enough space to promote the worst-case volume of surviving young generation objects. We call this failure a “full promotion guarantee failure”.

Because of this, the concurrent mode of CMS is interrupted (the in-progress concurrent sweep is reported at 197.981) and a Full GC is invoked. This stop-the-world mark-sweep-compact Full GC took 2.3733725 secs, and the CMS generation occupancy dropped from 402978K to 248977K.
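When scanning a long log for such entries, it can help to extract the numbers mechanically. The Python sketch below is a minimal example for the entry above; the pattern and the name `parse_cmf` are assumptions for illustration and do not cover the extra fields (such as CMS Perm or Times) that later JVM versions print.

```python
import re

# The two-line entry from the log above, joined as it appears in the file.
ENTRY = (
    "197.976: [GC 197.976: [ParNew: 260872K->260872K(261952K), 0.0000688 secs]"
    "197.976: [CMS197.981: [CMS-concurrent-sweep: 0.516/0.531 secs]\n"
    "(concurrent mode failure): 402978K->248977K(786432K), 2.3728734 secs] "
    "663850K->248977K(1048384K), 2.3733725 secs]"
)

CMF = re.compile(
    r"\(concurrent mode failure\): "
    r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\] "
    r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]"
)

def parse_cmf(text):
    """Return occupancy and pause figures for a concurrent mode failure, or None."""
    m = CMF.search(text)
    if m is None:
        return None
    old_before, old_after, old_cap = (int(m.group(i)) for i in (1, 2, 3))
    heap_before, heap_after, heap_cap = (int(m.group(i)) for i in (5, 6, 7))
    return {
        "old_freed_kb": old_before - old_after,
        "old_before_pct": 100.0 * old_before / old_cap,
        "young_before_kb": heap_before - old_before,
        "young_after_kb": heap_after - old_after,
        "pause_secs": float(m.group(8)),
    }

cmf = parse_cmf(ENTRY)
print(cmf["old_freed_kb"], "K freed from the CMS generation in", cmf["pause_secs"], "secs")
```

Note that the young generation is empty after the Full GC: the heap and the CMS generation both report 248977K.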

A concurrent mode failure can be avoided either by increasing the tenured generation size or by initiating the CMS collection at a lower heap occupancy, which is done by setting -XX:CMSInitiatingOccupancyFraction to a lower value and enabling -XX:+UseCMSInitiatingOccupancyOnly. Choose the value of CMSInitiatingOccupancyFraction with care: setting it too low results in overly frequent CMS collections.

Sometimes we see these promotion failures even when the logs show that there is enough free space in the tenured generation. The reason is fragmentation: the free space in the tenured generation is not contiguous, and promotions from the young generation need a contiguous free block. The CMS collector is non-compacting, so it can fragment the space for some types of applications. In his blog, Jon talks in detail about how to deal with this fragmentation problem:
http://blogs.sun.com/roller/page/jonthecollector?entry=when_the_sum_of_the

Starting with 1.5, the promotion guarantee check for the CMS collector is done differently. Instead of assuming the worst case, i.e. that all surviving young generation objects get promoted into the old generation, the expected promotion is estimated from the recent history of promotions. This estimate is usually much smaller than the worst case and hence requires less free space in the old generation. If the promotion in a scavenge attempt fails, the young generation is left in a consistent state and a stop-the-world mark-compact collection is invoked. To get the same functionality with UseSerialGC, you need to explicitly specify the switch -XX:+HandlePromotionFailure.
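The difference between the two checks can be sketched in a few lines of Python. This is a simplified model only: the function names are invented, the numbers are hypothetical, and a plain average stands in for the JVM's actual estimator, which the text above does not describe.

```python
def worst_case_guarantee(old_free_kb, young_used_kb):
    """Older check: assume every object in the young generation may be promoted."""
    return old_free_kb >= young_used_kb

def history_based_guarantee(old_free_kb, recent_promoted_kb):
    """Newer check: compare free space with an estimate from recent promotions.

    A plain average is an assumption made for this sketch, not the JVM's formula.
    """
    if not recent_promoted_kb:
        raise ValueError("need at least one past promotion volume")
    expected_kb = sum(recent_promoted_kb) / len(recent_promoted_kb)
    return old_free_kb >= expected_kb

# Hypothetical figures: 100 MB free in the old generation, 250 MB used in the
# young generation, and three recent scavenges that promoted 20-30 MB each.
old_free_kb = 100 * 1024
young_used_kb = 250 * 1024
recent_promoted_kb = [20 * 1024, 30 * 1024, 25 * 1024]

print("worst case allows scavenge:", worst_case_guarantee(old_free_kb, young_used_kb))
print("history based allows scavenge:", history_based_guarantee(old_free_kb, recent_promoted_kb))
```

With these figures the worst-case check refuses the scavenge while the history-based check lets it proceed, which is why the newer scheme needs less free space in the old generation.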

283.736: [Full GC 283.736: [ParNew: 261599K->261599K(261952K), 0.0000615 secs] 826554K->826554K(1048384K), 0.0003259 secs]
GC locker: Trying a full collection because scavenge failed
283.736: [Full GC 283.736: [ParNew: 261599K->261599K(261952K), 0.0000288 secs]

A stop-the-world GC happening when a JNI critical section is released. Here again the young generation collection failed due to “full promotion guarantee failure”, and a Full GC is then invoked.

CMS can also be run in incremental mode (i-cms), enabled with -XX:+CMSIncrementalMode. In this mode, the CMS collector does not hold the processor for the whole of the long concurrent phases; it periodically stops them and yields the processor back to the application's other threads. It divides the work of the concurrent phases into small chunks (called duty cycles) and schedules them between minor collections. This is very useful for applications that need low pause times and run on machines with a small number of processors.

Here are some logs from incremental CMS:

2803.125: [GC 2803.125: [ParNew: 408832K->0K(409216K), 0.5371950 secs] 611130K->206985K(1048192K) icms_dc=4 , 0.5373720 secs]
2824.209: [GC 2824.209: [ParNew: 408832K->0K(409216K), 0.6755540 secs] 615806K->211897K(1048192K) icms_dc=4 , 0.6757740 secs]

Here, the scavenges took about 537 ms and 676 ms respectively. Between these two scavenges, iCMS ran for a brief period, as indicated by the icms_dc value, which is the duty cycle. In this case the duty cycle was 4%. A simple calculation shows that the iCMS incremental step lasted for 4/100 * (2824.209 - 2803.125 - 0.537) ≈ 0.822 s, or about 822 ms, i.e. 4% of the time between the two scavenges.
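The same calculation as a short Python sketch, using the timestamps and the first pause from the two entries above. The function name `icms_step_seconds` is made up for this example.

```python
def icms_step_seconds(prev_start, prev_pause, next_start, duty_cycle_pct):
    """Time iCMS ran between two scavenges, given the icms_dc percentage."""
    # Time between the end of the first scavenge and the start of the second.
    between = next_start - (prev_start + prev_pause)
    return duty_cycle_pct / 100.0 * between

# Values taken from the two ParNew entries above (icms_dc=4).
step = icms_step_seconds(prev_start=2803.125, prev_pause=0.5373720,
                         next_start=2824.209, duty_cycle_pct=4)
print("iCMS step: %.3f secs" % step)
```

This comes to roughly 0.82 secs of concurrent work spread over the 20.5 secs between the two scavenges.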

Starting with 1.5, CMS has one more phase – concurrent abortable preclean. Abortable preclean is run between a 'concurrent preclean' and 'remark' until we have the desired occupancy in eden. This phase is added to help schedule the 'remark' phase so as to avoid back-to-back pauses for a scavenge closely followed by a CMS remark pause. In order to maximally separate a scavenge from a CMS remark pause, we attempt to schedule the CMS remark pause roughly mid-way between scavenges.

There is a second reason why we do this. Immediately following a scavenge there are likely a large number of grey objects that need rescanning. The abortable preclean phase tries to deal with such newly grey objects thus reducing a subsequent CMS remark pause.

The scheduling of the 'remark' phase can be controlled by two JVM options, CMSScheduleRemarkEdenSizeThreshold and CMSScheduleRemarkEdenPenetration. Their defaults are 2m and 50% respectively. The first parameter determines the Eden size below which no attempt is made to schedule the CMS remark pause, because the payoff is expected to be minuscule. The second parameter indicates the Eden occupancy at which a CMS remark is attempted.

After 'concurrent preclean', if the Eden occupancy is above CMSScheduleRemarkEdenSizeThreshold, we start 'concurrent abortable preclean' and continue precleaning until Eden occupancy reaches the CMSScheduleRemarkEdenPenetration percentage; otherwise we schedule the 'remark' phase immediately.
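The rule just described can be written down as a small decision function. This Python sketch is a simplified model of the description above, not the JVM's code: the function names and the Eden sizes in the example are hypothetical, and only the two default values come from the text.

```python
# Defaults quoted above for the two options.
EDEN_SIZE_THRESHOLD_KB = 2 * 1024   # CMSScheduleRemarkEdenSizeThreshold = 2m
EDEN_PENETRATION_PCT = 50           # CMSScheduleRemarkEdenPenetration = 50%

def next_phase_after_preclean(eden_used_kb):
    """Decide what follows 'concurrent preclean' for a given Eden occupancy."""
    if eden_used_kb > EDEN_SIZE_THRESHOLD_KB:
        return "abortable-preclean"
    return "remark"

def should_abort_preclean(eden_used_kb, eden_capacity_kb):
    """Abortable preclean stops once Eden occupancy reaches the penetration percentage."""
    return 100.0 * eden_used_kb / eden_capacity_kb >= EDEN_PENETRATION_PCT

# Hypothetical Eden of 1,000,000K.
print(next_phase_after_preclean(eden_used_kb=1024))      # below 2m: remark at once
print(next_phase_after_preclean(eden_used_kb=300000))    # above 2m: abortable preclean
print(should_abort_preclean(400000, 1000000))            # 40%: keep precleaning
print(should_abort_preclean(500000, 1000000))            # 50%: abort, go to remark
```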

7688.150: [CMS-concurrent-preclean-start]
7688.186: [CMS-concurrent-preclean: 0.034/0.035 secs]
7688.186: [CMS-concurrent-abortable-preclean-start]
7688.465: [GC 7688.465: [ParNew: 1040940K->1464K(1044544K), 0.0165840 secs] 1343593K->304365K(2093120K), 0.0167509 secs]
7690.093: [CMS-concurrent-abortable-preclean: 1.012/1.907 secs]
7690.095: [GC[YG occupancy: 522484 K (1044544 K)]7690.095: [Rescan (parallel) , 0.3665541 secs]7690.462: [weak refs processing, 0.0003850 secs] [1 CMS-remark: 302901K(1048576K)] 825385K(2093120K), 0.3670690 secs]

In the above log, 'abortable preclean' starts after a preclean. The young generation collection at 7688.465 brings the young gen occupancy down from 1040940K to 1464K. When the young gen occupancy reaches 522484K, which is 50% of its capacity (1044544K), precleaning is aborted and the 'remark' phase starts.
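The 50% figure can be checked directly from the remark entry. A minimal Python sketch, assuming the "YG occupancy" format shown in the log above (the name `yg_occupancy_pct` is made up for this example):

```python
import re

# Start of the remark entry from the log above.
REMARK = "7690.095: [GC[YG occupancy: 522484 K (1044544 K)]7690.095: [Rescan (parallel) , 0.3665541 secs]"

def yg_occupancy_pct(line):
    """Young generation occupancy at remark, as a percentage of its capacity."""
    m = re.search(r"\[YG occupancy: (\d+) K \((\d+) K\)\]", line)
    if m is None:
        return None
    used_kb, capacity_kb = int(m.group(1)), int(m.group(2))
    return 100.0 * used_kb / capacity_kb

pct = yg_occupancy_pct(REMARK)
print("young gen occupancy at remark: %.2f%%" % pct)
```

The result is just over 50%, which matches the default CMSScheduleRemarkEdenPenetration value.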

Note that in 1.5, young generation occupancy also gets printed in the final remark phase.

For more detailed information and tips on GC tuning, please refer to the following documents:
http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
http://java.sun.com/docs/hotspot/gc1.4.2/

Comments:

Is there any tool that reads CMS GC output? (similar to HPJtune or HPJmeter)

Posted by Aviran Harom on November 19, 2008 at 07:15 AM IST #

Try GCHisto tool here:
https://gchisto.dev.java.net/

Posted by Poonam Bajaj on November 19, 2008 at 07:50 AM IST #

I'm using incremental CMS. According to your blog entry I can compute the CPU time a duty cycle took by applying your formula.

As far as I have understood, the duty cycles take over the work that is normally performed by the "concurrent mark" phase. So, does the "ordinary line" saying how long the "concurrent-mark" phase took deliver any valid information when using incremental mode?

By "ordinary line" I'm referring to:
40.683: [CMS-concurrent-mark: 0.521/0.529 secs]

Posted by Frank Brüseke on December 03, 2008 at 01:17 PM IST #

Hi Poonam

Thanks for this information, it's very useful.

In the traces
[CMS-concurrent-mark: 0.080/0.080 secs] [Times: user=0.45 sys=0.02, real=0.08 secs]

if the CPU and real time is 80 milliseconds, why is the user time 450 milliseconds? What are user, sys and real time?

Thanks

Posted by guest on March 27, 2012 at 12:32 PM IST #

Hi,
I have a GC log which looks like:
513037.621: [Full GC 513037.621: [CMS2012-03-20T06:02:52.976+0000: 513037.831: [CMS-concurrent-sweep: 0.627/1.042 secs] [Times: user=1.24 sys=0.02, real=1.04 secs]
(concurrent mode failure): 2137896K->229746K(5072512K), 1.5890340 secs] 2275787K->229746K(5225856K), [CMS Perm : 128762K->127494K(220340K)] icms_dc=30 , 1.5892430 secs] [Times: user=1.57 sys=0.00, real=1.59 secs]

My questions are:
1. What could be the reason for concurrent mode failure in this case?
2. What generation does 2137896K->229746K(5072512K) mean? Young Gen or Old Gen?
3. Does 2275787K->229746K(5225856K) mean Heap Size Before GC, After GC and JVM Heap size?

Any help will be appreciated.

Posted by guest on April 04, 2012 at 01:16 AM IST #

Received the following question which I am trying to answer here:
-----------------------------------------------------------
Hi,
I have a GC log which looks like:
513037.621: [Full GC 513037.621: [CMS2012-03-20T06:02:52.976+0000: 513037.831: [CMS-concurrent-sweep: 0.627/1.042 secs] [Times: user=1.24 sys=0.02, real=1.04 secs]
(concurrent mode failure): 2137896K->229746K(5072512K), 1.5890340 secs] 2275787K->229746K(5225856K), [CMS Perm : 128762K->127494K(220340K)] icms_dc=30 , 1.5892430 secs] [Times: user=1.57 sys=0.00, real=1.59 secs]

My questions are:
1. What could be the reason for concurrent mode failure in this case?
2. What generation does 2137896K->229746K(5072512K) mean? Young Gen or Old Gen?
3. Does 2275787K->229746K(5225856K) mean Heap Size Before GC, After GC and JVM Heap size?
-------------------------------------------------------

The logs indicate that there was a Full GC invoked at time stamp 513037.621 which interrupted the CMS cycle leading to 'concurrent mode failure'.

2137896K->229746K(5072512K) tells us the heap occupancy of the tenured generation. 2137896K is the occupancy before collection, 229746K is the occupancy after collection and 5072512K is the CMS generation capacity.

Yes, 2275787K->229746K(5225856K) means the total heap occupancy before GC, after GC and the total heap size.

Posted by Poonam Bajaj on April 05, 2012 at 10:02 AM IST #

Hi,
Could you please explain the Times field in CMS output, i.e. user, sys and real?

--Venkat

Posted by guest on April 14, 2013 at 10:20 PM IST #

Broken link
In his blog, Jon talks in detail -
https://blogs.oracle.com/jonthecollector/entry/when_the_sum_of_the

Posted by guest on April 29, 2013 at 04:42 PM IST #

The fragmentation link is broken.
(http://blogs.sun.com/roller/page/jonthecollector?entry=when_the_sum_of_the)

Find the actual article here (until it moves):
https://blogs.oracle.com/jonthecollector/entry/when_the_sum_of_the

Posted by P Shaw on May 31, 2013 at 01:50 AM IST #

Hi,
my gc log as below:
2013-08-27T16:59:53.952-0500: 201564.214: [GC 201564.214: [ParNew202182.528: [SoftReference, 0 refs, 0.0000110 secs]202182.528: [WeakReference, 1015 refs, 0.0001720 secs]202182.528: [FinalReference, 1625 refs, 0.0025210 secs]202182.531: [PhantomReference, 1 refs, 0.0000050 secs]202182.531: [JNI Weak Reference, 0.0000100 secs] (promotion failed): 3774912K->3774912K(3774912K), 619.3859950 secs]202183.600: [CMS CMS: abort preclean due to time 2013-08-27T17:10:14.937-0500: 202185.199: [CMS-concurrent-abortable-preclean: 7.271/626.671 secs] [Times: user=699.20 sys=98.95, real=626.55 secs]
(concurrent mode failure)202189.179: [SoftReference, 614 refs, 0.0000710 secs]202189.179: [WeakReference, 5743 refs, 0.0007600 secs]202189.180: [FinalReference, 3380 refs, 0.0004430 secs]202189.180: [PhantomReference, 212 refs, 0.0000260 secs]202189.180: [JNI Weak Reference, 0.0000180 secs]: 8328471K->4813636K(8388608K), 18.5940310 secs] 11323577K->4813636K(12163520K), [CMS Perm : 208169K->208091K(346920K)], 637.9802050 secs] [Times: user=705.47 sys=98.79, real=637.85 secs]
Total time for which application threads were stopped: 637.9820120 seconds

Could you explain the log for me? I encountered the application unavailable quite often. Thanks

Posted by HP on August 28, 2013 at 08:31 AM IST #

The log shows that there is a promotion failure:
(promotion failed): 3774912K->3774912K(3774912K)

which means that the attempt to promote tenured objects from the young to the tenured generation failed and that happened because the old generation was completely full.

(concurrent mode failure)202189.179: [SoftReference, 614 refs, 0.0000710 secs]202189.179: [WeakReference, 5743 refs, 0.0007600 secs]202189.180: [FinalReference, 3380 refs, 0.0004430 secs]202189.180: [PhantomReference, 212 refs, 0.0000260 secs]202189.180: [JNI Weak Reference, 0.0000180 secs]: 8328471K->4813636K(8388608K)

The promotion failure caused the 'concurrent mode failure' which invokes the Full GC that collects the heap in stop-the-world mode.

Starting the CMS collection cycle earlier so that CMS can keep up with the allocation rate in the old gen would help. This can be done by setting the option CMSInitiatingOccupancyFraction to a lower value.

Posted by Poonam Bajaj on August 28, 2013 at 11:34 AM IST #

Hi Sir,
Could you advise the value for the CMSInitiatingOccupancyFraction ?

And here is my JVM setting:

-Xmx12288m
-Xms12288m
-XX:MaxPermSize=1024m
-XX:MaxNewSize=4096m
-XX:InitialCodeCacheSize=128m
-XX:ReservedCodeCacheSize=256m
-XX:+UseCodeCacheFlushing
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSClassUnloadingEnabled
-XX:+UseCodeCacheFlushing
-XX:+OptimizeStringConcat
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:TLABSize=512k
-XX:+PrintGC
-XX:+DisableExplicitGC
-XX:+PrintGCDetails
-XX:-PrintConcurrentLocks
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintReferenceGC
-XX:+PrintJNIGCStalls
-XX:+PrintGCApplicationStoppedTime
-XX:+CMSScavengeBeforeRemark
-XX:ConcGCThreads=12
-XX:ParallelGCThreads=12

And
I have 24 CPU Intel(R) Xeon(R) CPU X5650 @ 2.67GHz in my machine
RAM: 24675112 kB
JDK version: java version "1.6.0_32"

Posted by HP on August 28, 2013 at 01:32 PM IST #

Can I know the default value for CMSInitiatingOccupancyFraction in java "1.6.0_32"? Thanks

Posted by HP on August 28, 2013 at 02:41 PM IST #

By default the CMS collection cycle starts at 80% of the heap occupancy. You can start with setting CMSInitiatingOccupancyFraction=50 and then decrease it further if CMS is still not able to keep up with the allocation rate.

Thanks,
Poonam

Posted by Poonam Bajaj on August 28, 2013 at 05:14 PM IST #

Hi,I have GC log like this:
250169.767: [GC 250169.767: [ParNew (promotion failed): 571584K->571584K(571584K), 0.6487910 secs]250170.416: [CMS250173.050: [CMS-concurrent-mark: 2.887/3.777 secs] [Times: user=10.86 sys=0.56, real=3.78 secs]
(concurrent mode failure): 2268975K->2111899K(2516992K), 8.3732150 secs] 2766660K->2111899K(3088576K), [CMS Perm : 562899K->562896K(1048576K)], 9.0223120 secs] [Times: user=9.78 sys=0.28, real=9.02 secs]

My question is how long is stop-the-world time? Is it 9.02 secs or 9.02-3.78=5.24 secs?
Thanks.

Posted by guest on November 19, 2013 at 07:37 AM IST #

The pause time for the Full GC that got invoked due to the concurrent-mode-failure is 9.02 secs. 3.78 secs is the time taken by the CMS-concurrent-mark phase which is a concurrent phase and is performed concurrently with the application threads running.

Thanks,
Poonam

Posted by guest on November 20, 2013 at 06:30 AM IST #
