Java Monitoring and Tuning Example

Introduction:

I've recently been working with a several ISV's who were concerned about application server performance issues. Various application servers have been investigated, including WebLogic, WebSphere, Tomcat and JBoss, however, some of the tools and techniques necessary to monitor and resolve the application server performance were constant. Since Java is the underlying technology, one of the first issues to investigate is Java garbage collection. VisualGC is a good tool to understand the garbage collection patterns.

This blog entry is a case study of a tuning exercise, shown in three stages.  In the first stage, the system is running well under a medium load. In the second stage, the load has been increased causing response times to spike.  In the third stage, the the Java parameters have been modified and the system is able to run well under a heavier load.  

This blog entry is not intended to be an exhaustive and free standing description of Java Garbage Collection, rather, this blog entry is intended to supplement other descriptions such as Tuning Garbage Collection with the 5.0 Java Virtual Machine and Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning.

Environment and tools:

In this case study, a web container is deployed into it's own Solaris Container, and it is the only busy process in the container.  This Solaris Container is using a pool of two quad-core SPARC64 VII CPUs, with 2 virtual CPU's per core, yielding a total of 16 virtual CPUs.

Screen shots show the output from tools including VisualGC, prstat, jstat, and xcpustate:

  • VisualGC is a tool that is available for download from with the jvmstat 3.0 package and displays a Java Virtual Machine's memory usage.
  • xcpustate is a tool that is available on the Solaris Companion CD and shows the status of each virtual CPU: blue indicates idle, green is busy in application code and yellow is busy in Solaris kernel code. 
  • jstat is a text based tool that is included with the JDK and is used to display Java garbage collection statistics.  Not to be confused with jvmstat.
  • prstat is a text based tool that is included with Solaris and shows resource usage per process.

Java Options:

The following Java options where in place as the project began and where left intact:

-server -XX:PermSize=512m -XX:MaxPermSize=720m -Xss256k -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+UseTLAB -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000

The following options where modified during the course of the project:

Initial Java Options
Modified Java Options
-Xms1024m -Xms4096m
-Xmx2048m -Xmx6144m
-XX:NewSize=128m -XX:NewSize=1536m
-XX:MaxNewSize=256m -XX:MaxNewSize=1536m
N/A -XX:SurvivorRatio=4
N/A

-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-Xloggc:/web_dir/log/gc.log

Stage 1: "if it ain't broke, don't fix it"

In the first stage, the system is running well under a medium load.

Here is a screen shot of VisualGC running a scenario where the end-user response times are good. The window on the left shows an instantaneous view of the Java memory statistics.  The window on the right show a strip charts of the statistics over time, where new data is published on the right side and slides to the left over time.  Eventually the historic data falls off the left side of the strip chart. The right window has the following strip charts, from top to bottom:

  • Compile TIme strip chart shows that the Java run time compiler has be kicking in, but this a fairly light load which was subsequently seen to disappear as all of the classes in use where fully compiled and optimized.
  • Class Loader strip chart shows that the class loader has recently been idle.
  • GC Time shows that Java garbage has been active, but the narrow spikes indicate short bursts of garbage collection. The narrow spikes are a good omen. If wide peaks are seen it could be indicative of a problem.
  • Eden Space strip chart shows that the data quickly fills the 128MB Eden space and is then collected. When the frequency of the garbage collections approach 1 collection per second or greater, aliasing can be a problem and you should use jstat to ascertain the actual frequency
  • Survivor space strip charts shows that the survivor space is not being used.  Combined with view of the Old Gen activity, deduce that the tenuring process is not working and that data that survives one collection is promoted directly into the Old Generation.  I don't like this configuration, but the user response times are good, so I wouldn't change it, yet.  As the load increases, this will need to be resolved.
  • Old Gen strip chart shows that the memory is being filled as data is promoted but quickly cleaned up by the CMS collector. 
  • Perm Gen strip chart shows a steady state of used space and some availble space, which is good. 

033_visualgc.jpg

I believe in the idiom "if it ain't broke, don't fix it", so although I'm not in love with the above screen shot, I wouldn't change anything unless there is a tangible performance problem.  With this load, the current settings are working.

That being said, I don't like that Eden so small and that the Survivor space is not being used.  Objects that survive an initial young generation GC are being promoted directly to the Old Generation, bypassing the tenuring process.

The following screen shot was also taken while running the medium load scenario where the end-user response times are good. On the left, prstat shows that there is one busy Java process which has been using 7.5% of this Solaris Container's CPU resource. The xcpustate window shows that all of the CPUs are lightly loaded by the Java process.

 033_prstat.jpg

The following jstat data was gathered while running the medium load scenario where the end-user response times are good. The following columns are displayed:

  • The first ten column describe memory usage:
    • The first four columns, S0C, S1C, S0U, and S1U are about survivor space. Capacity (C) and Utilization (U) for spaces 0 and 1. The capacity is 64 KB but neither space is utilized. 
    • The next two columns are about Eden, showing Capacity (EC) and Utilization (EU).  The capacity is constant at approximately 128MB (130944KB). How much data is in Eden at the instantaneous time that the trace is taken depends on how much data has filled in since the most recent young generation collection and varies wildly.
    • The next two columns are about the Old Generation, showing Capacity (OC) and Utilization (OU).  The capacity is constant at 1.2 GB. How much data is in Eden at the instantaneous time that the trace is taken depends on how much data has filled in since the most recent young generation collection and varies slowly.
    • Then there are two columns are about the Permanent Gerneration, showing Capacity (PC) and Utilization (PU). The capacity is constant at 720 MB (737280 KB) and the utilization is fairly flat near 440 MB (450560 KB)
  • The final 5 columns are about the number of collections and time spent in garbage collection:
    • YGC is the count of the total number of Young Generation Collections since the Java process was launched.

#jstat -gc 15515 10000
  S0C    S1C   S0U    S1U      EC       EU        OC         OU        PC      PU        YGC     YGCT  FGC    FGCT     GCT
 64.0   64.0   0.0    0.0   130944.0 128571.1 1226784.0   657397.4  737280.0 450352.0    497   15.956  158    53.966   69.922
 64.0   64.0   0.0    0.0   130944.0 32841.4  1226784.0   624092.3  737280.0 450367.8    503   16.154  160    54.918   71.072
 64.0   64.0   0.0    0.0   130944.0   0.0    1226784.0   631067.9  737280.0 450251.8    506   16.245  162    54.963   71.208
 64.0   64.0   0.0    0.0   130944.0 76982.1  1226784.0   603735.6  737280.0 450255.0    511   16.383  163    55.725   72.107
 64.0   64.0   0.0    0.0   130944.0 118467.4 1226784.0   606041.5  737280.0 450265.4    515   16.538  165    56.504   73.042
 64.0   64.0   0.0    0.0   130944.0 108089.0 1226784.0   644641.9  737280.0 450273.4    520   16.795  165    56.504   73.298
 64.0   64.0   0.0    0.0   130944.0 65716.7  1226784.0   627548.7  737280.0 449934.1    523   16.901  167    57.483   74.384
 64.0   64.0   0.0    0.0   130944.0 93575.9  1226784.0   602544.6  737280.0 449690.5    526   17.021  168    58.691   75.712
 64.0   64.0   0.0    0.0   130944.0 65022.2  1226784.0   635921.9  737280.0 449698.6    531   17.217  169    58.736   75.953
 64.0   64.0   0.0    0.0   130944.0 23657.6  1226784.0   626730.9  737280.0 449708.1    537   17.471  171    59.584   77.055
\^C

The Verbose GC log file confirms the jstat data.

1398.985: [GC[YG occupancy: 32842 K (131008 K)]1398.985: [Rescan (parallel) , 0.0162640 secs]1399.002: [weak refs processing, 0.3515853 secs]1399.353: [class unloading, 0.3569879 secs]1399.711: [scrub symbol & string tables, 0.1207449 secs] [1 CMS-remark: 757680K(1226784K)] 790522K(1357792K), 0.9462555 secs]
1399.932: [CMS-concurrent-sweep-start]
1400.006: [GC 1400.006: [ParNew: 130904K->0K(131008K), 0.0295234 secs] 887457K->758257K(1357792K), 0.0304601 secs]
1400.485: [GC 1400.486: [ParNew: 130944K->0K(131008K), 0.0344479 secs] 872839K->751757K(1357792K), 0.0351573 secs]
1401.453: [GC 1401.453: [ParNew: 130944K->0K(131008K), 0.0309997 secs] 830358K->704885K(1357792K), 0.0318843 secs]
1402.550: [CMS-concurrent-sweep: 1.110/2.618 secs]
1402.550: [CMS-concurrent-reset-start]
1402.563: [CMS-concurrent-reset: 0.013/0.013 secs]
1403.901: [GC 1403.901: [ParNew: 130944K->0K(131008K), 0.0325028 secs] 835829K->711510K(1357792K), 0.0332487 secs]
1404.156: [GC [1 CMS-initial-mark: 711510K(1226784K)] 744661K(1357792K), 0.0442969 secs]
1404.201: [CMS-concurrent-mark-start]
1404.834: [GC 1404.835: [ParNew: 130941K->0K(131008K), 0.0357494 secs] 842451K->717262K(1357792K), 0.0365797 secs]
1404.953: [GC 1404.954: [ParNew: 130893K->0K(131008K), 0.0183660 secs] 848155K->717517K(1357792K), 0.0190584 secs]
1405.055: [GC 1405.055: [ParNew: 130944K->0K(131008K), 0.0173852 secs] 848461K->717814K(1357792K), 0.0180515 secs]
1405.342: [GC 1405.343: [ParNew: 130944K->0K(131008K), 0.0203568 secs] 848758K->720848K(1357792K), 0.0211302 secs]
1405.763: [GC 1405.763: [ParNew: 130944K->0K(131008K), 0.0204437 secs] 851792K->722671K(1357792K), 0.0211113 secs]
1406.192: [GC 1406.192: [ParNew: 130944K->0K(131008K), 0.0204619 secs] 853615K->724768K(1357792K), 0.0214120 secs]
1407.042: [GC 1407.042: [ParNew: 130944K->0K(131008K), 0.0326238 secs] 855712K->734062K(1357792K), 0.0334577 secs]
1409.453: [GC 1409.453: [ParNew: 130944K->0K(131008K), 0.0340238 secs] 865006K->741292K(1357792K), 0.0347661 secs]
1410.355: [CMS-concurrent-mark: 3.050/6.154 secs]
1410.355: [CMS-concurrent-preclean-start]
1412.357: [GC 1412.358: [ParNew: 130944K->0K(131008K), 0.0334870 secs] 872236K->747988K(1357792K), 0.0344020 secs]
1412.936: [CMS-concurrent-preclean: 0.366/2.581 secs]
1412.936: [CMS-concurrent-abortable-preclean-start]
1412.941: [CMS-concurrent-abortable-preclean: 0.005/0.006 secs]
1412.948: [GC[YG occupancy: 66917 K (131008 K)]1412.949: [Rescan (parallel) , 0.0494186 secs]1412.998: [weak refs processing, 0.4237737 secs]1413.422: [class unloading, 0.2246387 secs]1413.647: [scrub symbol & string tables, 0.1256452 secs] [1 CMS-remark: 747988K(1226784K)] 814906K(1357792K), 0.9224302 secs]
1413.872: [CMS-concurrent-sweep-start]
1417.184: [GC 1417.185: [ParNew: 130944K->0K(131008K), 0.0439868 secs] 878065K->755909K(1357792K), 0.0450652 secs]

Resident Set Size

When memory is allocated to a Solaris process, virtual memory pages to fulfill the request are reserved in swap space, but this is a very light weight event that does not require allocating RAM or writing data to the backing store (virtual memory disk).  The first time that any page is accessed, RAM is allocated and zeroed for the page. Finally, in a scenario where RAM is scarce, Solaris may need to pageout the data to the backing store.

It is not uncommon for a Java users to notice a substantial gap between the swap space that has been reserved for the process compared the amount of RAM that is being used.  Observe the prstat SIZE and RSS columns.  The "RSS" column (Resident Set Size) indicates how much RAM is in use by the process. The "SIZE" column displays the the total virtual size of process, including RAM that is allocated to the process, pages that have been reserved but never accessed and data that has been paged out due to memory pressure.  In this case, there is no virtual memory pressure (seen via vmstat, not shown) so we know that there is more than 2 GB of memory space that has been reserved by the process but not yet accessed. Going back to the VisualGC screen shot above, in the top left window, notice that Old Gen has:

  • Active data, solid orange, 664MB.
  • Empty (but already allocated) space shown with a green hash.  The sum of the active data and the empty space is 1.17 GB.
  • Space that has been reserved but not yet allocated, shown with a  gray hash. The virtual memory footprint of the Old Generation is 1.87 GB, including both active, empty but allocated, and reserved space.

In the jstat Old Capacity "OC" column, (1226784 KB) matches the 1.17 GB in the VisualGC output and indicates sum of the memory that has been allocated to the Old Generation, but not pages that have been reserved and not yet allocated.

More detail about Solaris memory can be found at sources including Solaris 10 Physical Memory Management and The Solaris Memory System - Sizing, Tools and Architecture

The following graph is produced using from vmstat gathered while running the medium load scenario where the end-user response times are good. (See Graphing Solaris Performance Stats with gnuplot) You can see that there is a consistent light CPU load on the system throughout the half hour test.


 033_vmstat.jpg

Stage 2: It's broken

In the second stage, the load has been increased causing response times to spike. The processing alternates between two phases. Garbage collection runs constantly while the test is running, but the Old generation CMS collector can't clear out the data fast enough to clear the In once

What I don't like about the following VisualGC

1) GC is constantly running
2) Survivor space is not being used
3) Two phases:

(a) a phase that can be seen in the Eden Space strip chart with very frequent Eden GC's concurrent with application work, and

(b) a phase that can be seen in the Old Gen strip chart where where the application can not execute due to exclusive Old Generation GC work.

4) CMS is not able to hold Old Gen down

 100_bad_visualgc.jpg

The following jstat data was gather during a run where the load has been increased causing response times to spike. There is one sample every 10 seconds (10000 ms) and twenty samples to cover a total of 200 seconds.  Observing the YGC column, notice that there are 685 (3675 - 2990) Young Generation Collections, for an average rate of 3.4 collections/second.  A tuning goal will be to slow this rate down.

#jstat -gc $PID 10000 20
  S0C    S1C   S0U    S1U      EC       EU        OC         OU        PC      PU        YGC     YGCT  FGC    FGCT     GCT
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1553383.3  737280.0 449596.9   2990  119.803  123   283.914  403.717
 64.0   64.0   0.0    0.0   130944.0 93869.1  1966080.0  1886113.2  737280.0 449599.9   3046  121.996  123   283.914  405.910
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1942614.0  737280.0 449599.9   3055  122.564  124   283.914  406.478
 64.0   64.0   0.0    0.0   130944.0 57856.7  1966080.0  1364918.4  737280.0 449602.5   3093  124.077  125   294.691  418.768
 64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1714499.8  737280.0 449606.8   3149  126.198  125   294.691  420.889
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943757.0  737280.0 449608.7   3186  127.848  126   294.691  422.539
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943757.0  737280.0 449605.8   3186  127.848  126   294.691  422.539
 64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1389764.8  737280.0 449616.0   3231  129.533  127   307.723  437.256
 64.0   64.0   0.0    0.0   130944.0 101211.9 1966080.0  1741419.7  737280.0 449619.9   3286  131.759  127   307.723  439.483
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1943870.4  737280.0 449629.2   3323  133.224  128   307.723  440.947
 64.0   64.0   0.0    0.0   130944.0 91907.9  1966080.0  1116053.3  737280.0 449621.3   3330  133.492  129   319.118  452.610
 64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1541494.5  737280.0 449631.2   3392  135.838  129   319.118  454.956
 64.0   64.0   0.0    0.0   130944.0 53983.1  1966080.0  1907711.7  737280.0 449651.6   3444  137.955  129   319.118  457.073
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1942148.9  737280.0 449659.3   3450  138.658  130   319.118  457.776
 64.0   64.0   0.0    0.0   130944.0 71632.8  1966080.0  1291300.9  737280.0 449672.5   3484  139.935  131   331.870  471.805
 64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1661931.9  737280.0 449687.8   3535  142.096  131   331.870  473.966
 64.0   64.0   0.0    0.0   130944.0 55150.8  1966080.0  1930792.5  737280.0 449693.2   3581  143.826  131   331.870  475.696
 64.0   64.0   0.0    0.0   130944.0 130944.0 1966080.0  1941935.7  737280.0 449693.2   3583  144.291  132   331.870  476.161
 64.0   64.0   0.0    0.0   130944.0 66316.6  1966080.0  1405189.1  737280.0 449701.0   3631  146.142  133   342.789  488.931
 64.0   64.0   0.0    0.0   130944.0   0.0    1966080.0  1744177.7  737280.0 449704.4   3675  148.066  133   342.789  490.855

The Verbose GC log file confirms the jstat data.

1445.781: [GC 1445.781: [ParNew: 130944K->0K(131008K), 0.0528409 secs] 2048880K->1929346K(2097088K), 0.0733931 secs]
1446.057: [GC 1446.077: [ParNew: 130944K->0K(131008K), 0.0507988 secs] 2060290K->1937743K(2097088K), 0.0519872 secs]
1446.328: [GC 1446.330: [ParNew (promotion failed): 130944K->130944K(131008K), 0.3677102 secs]1446.698: [CMS1448.338: [CMS-concurrent-mark: 8.326/20.506 secs]
 (concurrent mode failure): 1943651K->1332161K(1966080K), 13.5947895 secs] 2068687K->1332161K(2097088K), 13.9636616 secs]
1460.373: [GC [1 CMS-initial-mark: 1332161K(1966080K)] 1364958K(2097088K), 0.0570809 secs]
1460.431: [CMS-concurrent-mark-start]
1460.513: [GC 1460.513: [ParNew: 130944K->0K(131008K), 0.0629766 secs] 1463105K->1340611K(2097088K), 0.0638556 secs]
1460.644: [GC 1460.645: [ParNew: 130944K->0K(131008K), 0.0287314 secs] 1471555K->1344385K(2097088K), 0.0295894 secs]
1460.752: [GC 1460.752: [ParNew: 130944K->0K(131008K), 0.0313205 secs] 1475329K->1349195K(2097088K), 0.0325400 secs]
1460.859: [GC 1460.860: [ParNew: 130944K->0K(131008K), 0.0354078 secs] 1480139K->1353073K(2097088K), 0.0363298 secs]
1460.988: [GC 1460.988: [ParNew: 130944K->0K(131008K), 0.0363377 secs] 1484017K->1356257K(2097088K), 0.0374367 secs]
1461.164: [GC 1461.164: [ParNew: 130944K->0K(131008K), 0.0479562 secs] 1487201K->1362016K(2097088K), 0.0488208 secs]
1461.439: [GC 1461.439: [ParNew: 130944K->0K(131008K), 0.0507014 secs] 1492960K->1372934K(2097088K), 0.0515876 secs]
1461.579: [GC 1461.580: [ParNew: 130944K->0K(131008K), 0.0319675 secs] 1503878K->1375415K(2097088K), 0.0332522 secs]
1461.693: [GC 1461.693: [ParNew: 130944K->0K(131008K), 0.0340291 secs] 1506359K->1378926K(2097088K), 0.0348877 secs]
1461.821: [GC 1461.821: [ParNew: 130944K->0K(131008K), 0.0314839 secs] 1509870K->1382392K(2097088K), 0.0323204 secs]
1461.950: [GC 1461.951: [ParNew: 130943K->0K(131008K), 0.0323091 secs] 1513336K->1386662K(2097088K), 0.0331815 secs]
1462.123: [GC 1462.124: [ParNew: 130944K->0K(131008K), 0.0385567 secs] 1517606K->1391853K(2097088K), 0.0398456 secs]
1462.380: [GC 1462.380: [ParNew: 130944K->0K(131008K), 0.0698981 secs] 1522797K->1405319K(2097088K), 0.0708113 secs]
1462.627: [GC 1462.628: [ParNew: 130944K->0K(131008K), 0.0482389 secs] 1536263K->1417126K(2097088K), 0.0493317 secs]
1462.835: [GC 1462.835: [ParNew: 130944K->0K(131008K), 0.0438596 secs] 1548070K->1423124K(2097088K), 0.0447105 secs]
1462.989: [GC 1462.989: [ParNew: 130944K->0K(131008K), 0.0356933 secs] 1554068K->1430934K(2097088K), 0.0365304 secs]
1463.156: [GC 1463.156: [ParNew: 130944K->0K(131008K), 0.0426650 secs] 1561878K->1437914K(2097088K), 0.0438581 secs]
1463.342: [GC 1463.342: [ParNew: 130944K->0K(131008K), 0.0369751 secs] 1568858K->1445730K(2097088K), 0.0378227 secs]
1463.508: [GC 1463.508: [ParNew: 130944K->0K(131008K), 0.0393694 secs] 1576674K->1458926K(2097088K), 0.0402032 secs]
1463.709: [GC 1463.709: [ParNew: 130944K->0K(131008K), 0.0351472 secs] 1589870K->1465137K(2097088K), 0.0364246 secs]
1463.860: [GC 1463.860: [ParNew: 130944K->0K(131008K), 0.0399990 secs] 1596081K->1468331K(2097088K), 0.0408329 secs]
1463.998: [GC 1463.998: [ParNew: 130944K->0K(131008K), 0.0338172 secs] 1599275K->1471847K(2097088K), 0.0346756 secs]
1464.123: [GC 1464.123: [ParNew: 130944K->0K(131008K), 0.0281716 secs] 1602791K->1475013K(2097088K), 0.0292473 secs]
1464.235: [GC 1464.235: [ParNew: 130944K->0K(131008K), 0.0282486 secs] 1605957K->1477824K(2097088K), 0.0293566 secs]
1464.361: [GC 1464.361: [ParNew: 130944K->0K(131008K), 0.0319564 secs] 1608768K->1480943K(2097088K), 0.0328203 secs]
1464.590: [GC 1464.590: [ParNew: 130944K->0K(131008K), 0.0430505 secs] 1611887K->1486579K(2097088K), 0.0439285 secs]
1464.875: [GC 1464.875: [ParNew: 130944K->0K(131008K), 0.0404393 secs] 1617523K->1496962K(2097088K), 0.0415705 secs]
1465.079: [GC 1465.079: [ParNew: 130944K->0K(131008K), 0.0389969 secs] 1627906K->1509873K(2097088K), 0.0398899 secs]
1465.318: [GC 1465.318: [ParNew: 130944K->0K(131008K), 0.0377232 secs] 1640817K->1518156K(2097088K), 0.0386229 secs]
1465.533: [GC 1465.534: [ParNew: 130939K->0K(131008K), 0.0433747 secs] 1649096K->1527800K(2097088K), 0.0446424 secs]
1465.652: [GC 1465.653: [ParNew: 130944K->0K(131008K), 0.0267976 secs] 1658744K->1530747K(2097088K), 0.0276501 secs]
1465.754: [GC 1465.755: [ParNew: 130944K->0K(131008K), 0.0382624 secs] 1661691K->1533772K(2097088K), 0.0390969 secs]
1465.883: [GC 1465.884: [ParNew: 130944K->0K(131008K), 0.0346790 secs] 1664716K->1538809K(2097088K), 0.0355499 secs]
1466.010: [GC 1466.010: [ParNew: 130944K->0K(131008K), 0.0360124 secs] 1669753K->1542202K(2097088K), 0.0368312 secs]
1466.133: [GC 1466.134: [ParNew: 130896K->0K(131008K), 0.0264424 secs] 1673098K->1545037K(2097088K), 0.0276615 secs]
1466.284: [GC 1466.284: [ParNew: 130944K->0K(131008K), 0.0334272 secs] 1675981K->1551352K(2097088K), 0.0342620 secs]
1466.486: [GC 1466.487: [ParNew: 130944K->0K(131008K), 0.0468365 secs] 1682296K->1559392K(2097088K), 0.0476777 secs]
1466.761: [GC 1466.762: [ParNew: 130944K->0K(131008K), 0.0668140 secs] 1690336K->1571022K(2097088K), 0.0679247 secs]
1467.043: [GC 1467.043: [ParNew: 130944K->0K(131008K), 0.0595864 secs] 1701966K->1579608K(2097088K), 0.0604566 secs]
1467.294: [GC 1467.294: [ParNew: 130944K->0K(131008K), 0.0422655 secs] 1710552K->1586731K(2097088K), 0.0433685 secs]
1467.412: [GC 1467.413: [ParNew: 130915K->0K(131008K), 0.0357374 secs] 1717646K->1590261K(2097088K), 0.0365601 secs]
1467.523: [GC 1467.523: [ParNew: 130944K->0K(131008K), 0.0368735 secs] 1721205K->1592832K(2097088K), 0.0376957 secs]
1467.637: [Full GC 1467.637: [ParNew: 130901K->0K(131008K), 0.0387518 secs] 1723877K->1595360K(2097088K), 0.0396553 secs]
1467.755: [GC 1467.755: [ParNew: 130944K->0K(131008K), 0.0328768 secs] 1726304K->1597889K(2097088K), 0.0337467 secs]
1467.913: [GC 1467.913: [ParNew: 130943K->0K(131008K), 0.0321930 secs] 1728833K->1601670K(2097088K), 0.0332656 secs]
1468.115: [GC 1468.115: [ParNew: 130944K->0K(131008K), 0.0378134 secs] 1732614K->1608985K(2097088K), 0.0386423 secs]
1468.259: [GC 1468.260: [ParNew: 130944K->0K(131008K), 0.0389691 secs] 1739929K->1619450K(2097088K), 0.0398194 secs]
1468.378: [GC 1468.379: [ParNew: 130939K->0K(131008K), 0.0346053 secs] 1750390K->1622375K(2097088K), 0.0354541 secs]
1468.509: [GC 1468.509: [ParNew: 130931K->0K(131008K), 0.0329106 secs] 1753307K->1625703K(2097088K), 0.0339618 secs]
1468.627: [GC 1468.636: [ParNew: 130874K->0K(131008K), 0.0315224 secs] 1756578K->1630866K(2097088K), 0.0323764 secs]
1468.751: [GC 1468.752: [ParNew: 130944K->0K(131008K), 0.0316954 secs] 1761810K->1633494K(2097088K), 0.0325242 secs]
1468.926: [GC 1468.933: [ParNew: 130944K->0K(131008K), 0.0361699 secs] 1764438K->1641317K(2097088K), 0.0371095 secs]
1469.145: [GC 1469.146: [ParNew: 130944K->0K(131008K), 0.0367411 secs] 1772261K->1647218K(2097088K), 0.0378685 secs]
1469.380: [GC 1469.380: [ParNew: 130944K->0K(131008K), 0.0446357 secs] 1778162K->1653998K(2097088K), 0.0454686 secs]
1469.688: [GC 1469.689: [ParNew: 130944K->0K(131008K), 0.0390912 secs] 1784942K->1661565K(2097088K), 0.0402177 secs]
1469.951: [GC 1469.951: [ParNew: 130944K->0K(131008K), 0.0486676 secs] 1792509K->1671504K(2097088K), 0.0495607 secs]
1470.248: [GC 1470.249: [ParNew: 130944K->0K(131008K), 0.0414052 secs] 1802448K->1676829K(2097088K), 0.0425514 secs]
1470.447: [GC 1470.447: [ParNew: 130944K->0K(131008K), 0.0342887 secs] 1807773K->1683630K(2097088K), 0.0351706 secs]
1470.610: [GC 1470.611: [ParNew: 130940K->0K(131008K), 0.0319829 secs] 1814571K->1688471K(2097088K), 0.0332171 secs]
1470.722: [GC 1470.722: [ParNew: 130944K->0K(131008K), 0.0311625 secs] 1819415K->1692388K(2097088K), 0.0319923 secs]
1470.852: [GC 1470.852: [ParNew: 130944K->0K(131008K), 0.0314821 secs] 1823332K->1696163K(2097088K), 0.0389738 secs]
1470.967: [GC 1470.979: [ParNew: 130944K->0K(131008K), 0.0318559 secs] 1827107K->1700627K(2097088K), 0.0326812 secs]
1471.085: [GC 1471.086: [ParNew: 130937K->0K(131008K), 0.0328143 secs] 1831564K->1704737K(2097088K), 0.0336577 secs]
1471.229: [GC 1471.229: [ParNew: 130944K->0K(131008K), 0.0335674 secs] 1835681K->1709344K(2097088K), 0.0343768 secs]
1471.385: [GC 1471.386: [ParNew: 130944K->0K(131008K), 0.0406892 secs] 1840288K->1714787K(2097088K), 0.0419120 secs]
1471.516: [GC 1471.516: [ParNew: 130944K->0K(131008K), 0.0295618 secs] 1845731K->1717584K(2097088K), 0.0303950 secs]
1471.625: [GC 1471.625: [ParNew: 130944K->0K(131008K), 0.0376045 secs] 1848528K->1720845K(2097088K), 0.0384246 secs]
1471.750: [GC 1471.750: [ParNew: 130870K->0K(131008K), 0.0260799 secs] 1851715K->1723365K(2097088K), 0.0269050 secs]
1471.839: [GC 1471.840: [ParNew: 130944K->0K(131008K), 0.0255170 secs] 1854309K->1724726K(2097088K), 0.0263177 secs]
1471.988: [GC 1471.988: [ParNew: 130944K->0K(131008K), 0.0285268 secs] 1855670K->1728103K(2097088K), 0.0295660 secs]
1472.143: [GC 1472.143: [ParNew: 130944K->0K(131008K), 0.0392845 secs] 1859047K->1738951K(2097088K), 0.0401130 secs]
1472.362: [GC 1472.362: [ParNew: 130944K->0K(131008K), 0.0379261 secs] 1869895K->1746709K(2097088K), 0.0387573 secs]
1472.554: [GC 1472.554: [ParNew: 130944K->0K(131008K), 0.0431822 secs] 1877653K->1752618K(2097088K), 0.0442519 secs]
1472.672: [GC 1472.673: [ParNew: 130944K->0K(131008K), 0.0434003 secs] 1883562K->1754549K(2097088K), 0.0442311 secs]
1472.795: [GC 1472.796: [ParNew: 130944K->0K(131008K), 0.0295359 secs] 1885493K->1757812K(2097088K), 0.0304006 secs]
1472.904: [GC 1472.904: [ParNew: 130944K->0K(131008K), 0.0386588 secs] 1888756K->1760377K(2097088K), 0.0395649 secs]
1473.035: [GC 1473.036: [ParNew: 130861K->0K(131008K), 0.0297446 secs] 1891238K->1763237K(2097088K), 0.0308653 secs]
1473.188: [GC 1473.201: [ParNew: 130944K->0K(131008K), 0.0319938 secs] 1894181K->1767389K(2097088K), 0.0333206 secs]
1473.401: [GC 1473.401: [ParNew: 130944K->0K(131008K), 0.0387580 secs] 1898333K->1773804K(2097088K), 0.0396182 secs]
1473.522: [GC 1473.538: [ParNew: 130944K->0K(131008K), 0.0312055 secs] 1904748K->1778637K(2097088K), 0.0320711 secs]
1473.650: [GC 1473.650: [ParNew: 130944K->0K(131008K), 0.0346679 secs] 1909581K->1784622K(2097088K), 0.0355099 secs]
1473.769: [GC 1473.780: [ParNew: 130944K->0K(131008K), 0.0325029 secs] 1915566K->1787148K(2097088K), 0.0336199 secs]
1473.889: [GC 1473.890: [ParNew: 130944K->0K(131008K), 0.0306312 secs] 1918092K->1791684K(2097088K), 0.0314725 secs]
1474.023: [GC 1474.025: [ParNew: 130944K->0K(131008K), 0.0346581 secs] 1922628K->1800742K(2097088K), 0.0354589 secs]
1474.219: [GC 1474.219: [ParNew: 130944K->0K(131008K), 0.0431710 secs] 1931686K->1811004K(2097088K), 0.0440093 secs]
1474.392: [GC 1474.393: [ParNew: 130944K->0K(131008K), 0.0384690 secs] 1941948K->1820039K(2097088K), 0.0396068 secs]
1474.572: [GC 1474.584: [ParNew: 130944K->0K(131008K), 0.0413289 secs] 1950983K->1827458K(2097088K), 0.0426382 secs]
1474.801: [GC 1474.813: [ParNew: 130944K->0K(131008K), 0.0385987 secs] 1958402K->1837161K(2097088K), 0.0627487 secs]
1475.046: [GC 1475.047: [ParNew: 130944K->0K(131008K), 0.0488532 secs] 1968105K->1845650K(2097088K), 0.0499528 secs]
1475.262: [GC 1475.262: [ParNew: 130944K->0K(131008K), 0.0512485 secs] 1976594K->1853035K(2097088K), 0.0521045 secs]
1475.458: [GC 1475.458: [ParNew: 130944K->0K(131008K), 0.0461453 secs] 1983979K->1862019K(2097088K), 0.0469998 secs]
1475.652: [GC 1475.653: [ParNew: 130944K->0K(131008K), 0.0447249 secs] 1992963K->1868886K(2097088K), 0.0458052 secs]
1475.763: [GC 1475.780: [ParNew: 130928K->0K(131008K), 0.0324786 secs] 1999815K->1873299K(2097088K), 0.0333689 secs]
1475.882: [GC 1475.882: [ParNew: 130944K->0K(131008K), 0.0340021 secs] 2004243K->1876399K(2097088K), 0.0348220 secs]
1475.993: [GC 1475.994: [ParNew: 130944K->0K(131008K), 0.0299435 secs] 2007343K->1883168K(2097088K), 0.0308629 secs]
1476.097: [GC 1476.097: [ParNew: 130944K->0K(131008K), 0.0343554 secs] 2014112K->1887994K(2097088K), 0.0351879 secs]
1476.218: [GC 1476.218: [ParNew: 130904K->0K(131008K), 0.0331282 secs] 2018898K->1891141K(2097088K), 0.0341797 secs]
1476.375: [GC 1476.376: [ParNew: 130944K->0K(131008K), 0.0460486 secs] 2022085K->1896937K(2097088K), 0.0472445 secs]
1476.522: [GC 1476.522: [ParNew: 130937K->0K(131008K), 0.0305741 secs] 2027875K->1900685K(2097088K), 0.0313665 secs]
1476.630: [GC 1476.630: [ParNew: 130944K->0K(131008K), 0.0354049 secs] 2031629K->1903695K(2097088K), 0.0362252 secs]
1476.749: [GC 1476.750: [ParNew: 130944K->0K(131008K), 0.0332512 secs] 2034639K->1906008K(2097088K), 0.0343826 secs]
1476.861: [GC 1476.882: [ParNew: 130944K->0K(131008K), 0.0302215 secs] 2036952K->1908090K(2097088K), 0.0310552 secs]
1477.030: [GC 1477.030: [ParNew: 130944K->0K(131008K), 0.0308026 secs] 2039034K->1910954K(2097088K), 0.0316133 secs]
1477.250: [GC 1477.251: [ParNew: 130944K->0K(131008K), 0.0385982 secs] 2041898K->1916426K(2097088K), 0.0397542 secs]
1477.451: [GC 1477.452: [ParNew: 130944K->0K(131008K), 0.0355202 secs] 2047370K->1922843K(2097088K), 0.0363764 secs]
1477.641: [GC 1477.641: [ParNew: 130944K->0K(131008K), 0.0374713 secs] 2053787K->1928347K(2097088K), 0.0383333 secs]
1477.864: [GC 1477.865: [ParNew: 130944K->0K(131008K), 0.0420856 secs] 2059291K->1935857K(2097088K), 0.0432463 secs]
1478.088: [GC 1478.089: [ParNew (promotion failed): 130944K->130944K(131008K), 0.2115119 secs]1478.300: [CMS1480.678: [CMS-concurrent-mark: 8.657/20.247 secs]
 (concurrent mode failure): 1943323K->1067360K(1966080K), 12.8686984 secs] 2066801K->1067360K(2097088K), 13.0813570 secs]
1491.257: [GC [1 CMS-initial-mark: 1067360K(1966080K)] 1106126K(2097088K), 0.0660923 secs]
1491.325: [CMS-concurrent-mark-start]
1491.415: [GC 1491.416: [ParNew: 130944K->0K(131008K), 0.0628546 secs] 1198304K->1074342K(2097088K), 0.0642408 secs]
1491.569: [GC 1491.569: [ParNew: 130944K->0K(131008K), 0.0283941 secs] 1205286K->1077410K(2097088K), 0.0292610 secs]  

The following screen shot shows VisualGC, prstat and xcpustate.  There is a sustained period of single threaded behavior. Of the 664 threads in the Java process, only one is able to run. 

 100_bad_visual_gc_single_threaded_gc.jpg

The following screen shot shows VisualGC, prstat and xcpustate.  After the single threaded phase, there is a phase where many of the threads that had been blocked by garbage collection become runnable.  Every virtual CPU's is busy processing runnable threads. 

 100_bad_visual_gc_back_to_work.jpg

The vmstat data shows the stages of single-threaded garage collection (1/16 CPU's = 6.25%) alternating with stages where the the CPU's are extremely busy processing application code.

 100_bad_vmstat.jpg

 The pattern in the vmstat data is more obvious when only 10 minutes of data is displayed.

 100_bad_vmstat_10_min.jpg

Stage 3: Fix it

In the third stage, the Java parameters have been modified and the system able to run well under a heavier load.  

Initial Java Options
Modified Java Options
-Xms1024m -Xms4096m
-Xmx2048m -Xmx6144m
-XX:NewSize=128m -XX:NewSize=1536m
-XX:MaxNewSize=256m -XX:MaxNewSize=1536m
N/A -XX:SurvivorRatio=4
N/A

-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-Xloggc:/web_dir/log/gc.log

What I like about the following VisualGC

1) GC is not constantly running
2) Survivor space is being used (5 generations) but not full
3) CMS is able to hold Old Gen down
4) Eden GCs are about one per second.

 100_big_visualgc.jpg  100_big_prstat.jpg

The following jstat data was gather during a run where the the Java parameters have been modified and the system able to run well under a heavier load.  There is one sample every 10 seconds (10000 ms) and twenty samples to cover a total of 200 seconds.  Observing the YGC column, notice that there are 126 (943 - 817) Young Generation Collections, for an average rate of 0.63 collections/second (one every 1.6 seconds). 

Also notice that the Survivor space is now active. In the VisualGC Survivor Age Histogram, above. observe that data survives 5 generations before being promoted to the Old Generation.  This implied that Java objects that are still referenced after  8 seconds (1.6 seconds X 5 collections).  Most Java objects are not referenced for 8 seconds and therefore are never promoted into the Old Generation,

#jstat -gc $PID 10000 20
  S0C      S1C     S0U     S1U         EC       EU        OC         OU        PC      PU          YGC    YGCT   FGC    FGCT     GCT
262144.0 262144.0  0.0   135491.4   1048576.0  740774.1 2621440.0  1518371.0  737280.0 454756.7    817  204.672  119   105.404  310.076
262144.0 262144.0 143538.3  0.0     1048576.0  429976.2 2621440.0  1667952.6  737280.0 454763.5    826  206.506  119   105.404  311.911
262144.0 262144.0 127788.8  0.0     1048576.0  648808.0 2621440.0  1772166.9  737280.0 454766.5    832  207.969  120   105.404  313.373
262144.0 262144.0 149593.1  0.0     1048576.0  598492.1 2621440.0  1515637.6  737280.0 454766.5    840  209.517  121   107.803  317.320
262144.0 262144.0 162023.5  0.0     1048576.0 1008524.0 2621440.0  1656240.4  737280.0 454768.1    846  210.922  121   107.803  318.725
262144.0 262144.0  0.0   125334.4   1048576.0  953097.1 2621440.0  1836384.7  737280.0 454769.0    853  212.890  121   107.803  320.694
262144.0 262144.0 151573.6 152908.7 1048576.0 1048576.0 2621440.0  1687193.8  737280.0 454769.8    859  214.153  122   109.747  323.900
262144.0 262144.0  0.0   127784.9   1048576.0  170487.4 2621440.0  1597313.5  737280.0 454764.8    865  216.265  123   110.197  326.462
262144.0 262144.0 125183.1  0.0     1048576.0  679868.2 2621440.0  1674807.1  737280.0 454765.9    870  217.484  123   110.197  327.682
262144.0 262144.0  0.0   138781.8   1048576.0  941507.7 2621440.0  1500453.3  737280.0 454766.5    875  218.688  124   112.160  330.848
262144.0 262144.0 154681.9 98652.8  1048576.0 1048576.0 2621440.0  1419845.8  737280.0 454770.0    881  220.185  125   112.576  332.761
262144.0 262144.0 140635.1  0.0     1048576.0  184017.0 2621440.0  1541698.1  737280.0 454774.3    888  222.260  125   112.576  334.836
262144.0 262144.0 126225.1  0.0     1048576.0  800352.6 2621440.0  1664977.6  737280.0 454777.0    894  224.013  125   112.576  336.589
262144.0 262144.0 128930.2  0.0     1048576.0  709898.1 2621440.0  1423721.6  737280.0 454779.2    900  225.276  126   114.513  339.788
262144.0 262144.0  0.0   135929.5   1048576.0  901493.1 2621440.0  1500049.7  737280.0 454783.5    907  227.254  127   114.962  342.216
262144.0 262144.0 78006.9 156657.5  1048576.0 1048576.0 2621440.0  1682226.6  737280.0 454785.8    916  229.098  127   114.962  344.060
262144.0 262144.0 112938.7  0.0     1048576.0  707242.9 2621440.0  1833853.7  737280.0 454790.2    922  230.749  128   114.962  345.712
262144.0 262144.0  0.0   119538.9   1048576.0  466447.4 2621440.0  1587738.4  737280.0 454790.8    929  232.260  129   117.110  349.370
262144.0 262144.0 141025.9 131586.9 1048576.0 1048576.0 2621440.0  1774825.3  737280.0 454791.9    937  233.762  129   117.110  350.872
262144.0 262144.0  0.0   116953.7   1048576.0  562706.0 2621440.0  1886004.9  737280.0 454792.8    943  235.572  129   117.110  352.683

The Verbose GC log file confirms the jstat data.

1063.533: [GC[YG occupancy: 740130 K (1310720 K)]1063.534: [Rescan (parallel) , 0.2932129 secs]1063.827: [weak refs processing, 0.8300947 secs]1064.658: [class unloading, 0.2556826 secs]1064.914: [scrub symbol & string tables, 0.1225031 secs] [1 CMS-remark: 1362667K(2621440K)] 2102798K(3932160K), 1.6268772 secs]
1065.161: [CMS-concurrent-sweep-start]
1065.744: [GC 1065.748: [ParNew: 1199794K->136002K(1310720K), 0.3157680 secs] 2562442K->1538117K(3932160K), 0.3173270 secs]
1067.540: [GC 1067.541: [ParNew: 1183955K->134897K(1310720K), 0.3643224 secs] 2516046K->1487739K(3932160K), 0.3657230 secs]
1069.449: [GC 1069.452: [ParNew: 1183473K->132663K(1310720K), 0.4440330 secs] 2398115K->1372567K(3932160K), 0.4454642 secs]
1071.165: [GC 1071.165: [ParNew: 1181239K->117685K(1310720K), 0.2458847 secs] 2366563K->1322400K(3932160K), 0.2472765 secs]
1073.099: [GC 1073.100: [ParNew: 1166261K->143983K(1310720K), 0.2949013 secs] 2230754K->1208476K(3932160K), 0.2964760 secs]
1074.020: [CMS-concurrent-sweep: 3.145/8.858 secs]
1074.020: [CMS-concurrent-reset-start]
1074.056: [CMS-concurrent-reset: 0.036/0.036 secs]
1074.700: [GC 1074.701: [ParNew: 1192514K->130782K(1310720K), 0.2122237 secs] 2257007K->1216167K(3932160K), 0.2136974 secs]
1075.304: [GC [1 CMS-initial-mark: 1085384K(2621440K)] 1482522K(3932160K), 0.3090166 secs]
1075.614: [CMS-concurrent-mark-start]
1076.299: [GC 1076.300: [ParNew: 1179358K->142731K(1310720K), 0.2360216 secs] 2264743K->1228116K(3932160K), 0.2373256 secs]
1078.147: [GC 1078.148: [ParNew: 1191307K->146389K(1310720K), 0.3161631 secs] 2276692K->1249628K(3932160K), 0.3176056 secs]
1080.368: [GC 1080.369: [ParNew: 1194965K->140667K(1310720K), 0.2221885 secs] 2298204K->1266323K(3932160K), 0.2235858 secs]
1082.719: [GC 1082.720: [ParNew: 1189243K->156618K(1310720K), 0.4543953 secs] 2314899K->1292754K(3932160K), 0.4557551 secs]
1083.887: [GC 1083.888: [ParNew: 1205108K->101152K(1310720K), 0.1484881 secs] 2341243K->1271768K(3932160K), 0.1498135 secs]
1084.462: [CMS-concurrent-mark: 3.962/8.848 secs]
1084.462: [CMS-concurrent-preclean-start]
1085.434: [CMS-concurrent-preclean: 0.859/0.971 secs]
1085.434: [CMS-concurrent-abortable-preclean-start]
1086.081: [GC 1086.082: [ParNew: 1149728K->135612K(1310720K), 0.3098950 secs] 2320344K->1306228K(3932160K), 0.3113219 secs]
1087.294: [CMS-concurrent-abortable-preclean: 0.459/1.860 secs]
1087.342: [GC[YG occupancy: 667129 K (1310720 K)]1087.342: [Rescan (parallel) , 0.3074071 secs]1087.650: [weak refs processing, 0.9052933 secs]1088.555: [class unloading, 0.2571331 secs]1088.813: [scrub symbol & string tables, 0.1228506 secs] [1 CMS-remark: 1170615K(2621440K)] 1837745K(3932160K), 1.7140749 secs]
1089.057: [CMS-concurrent-sweep-start]
1089.396: [GC 1089.397: [ParNew: 1184134K->139422K(1310720K), 0.1621469 secs] 2354726K->1320441K(3932160K), 0.1636681 secs]
1091.280: [GC 1091.281: [ParNew: 1187998K->154229K(1310720K), 0.2434008 secs] 2295339K->1273265K(3932160K), 0.2449105 secs]
1093.547: [GC 1093.548: [ParNew: 1202805K->150707K(1310720K), 0.3279749 secs] 2156871K->1125761K(3932160K), 0.3295486 secs]

 The application is able to run most of the time, but even with the improved Java options, there are short periods of single-threaded computation, which are likely attributable to the "initial mark" or the "remark" phases of CMS.  See this description of CMS for more details.

 100_big_cms_vmstat.jpg

When zooming in on the vmstat data, the short and acceptable phases of single threaded activity are more obvious.  Notice how much better this is than the long phases of single-threaded behavior that were visible with the initial Java options.

 100_big_cms_vmstat._10_min.jpg

Conclusion:

"One-size fits all" Java options are not necessarily a good match for your application.  Monitoring garbage collection with a combination of several tools and using the feedback to adjust the Java options for your application can result is a substantial performance improvement.  


Comments:

Very informative case study.
Excellent visualization.
Having the entire list of options before and after is something I'm always interested in.
- Stefan S.

Posted by Stefan Schneider on April 04, 2010 at 02:28 AM EDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

user12620111

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today