X

Jeff Taylor's Weblog

  • Sun
    April 2, 2010

Java Monitoring and Tuning Example

Introduction:

I've recently been working with a several ISV's who were concerned about application server performance issues. Various application servers have been investigated, including WebLogic, WebSphere, Tomcat and JBoss, however, some of the tools and techniques necessary to monitor and resolve the application server performance were constant. Since Java is the underlying technology, one of the first issues to investigate is Java garbage collection. VisualGC is a good tool to understand the garbage collection patterns.

This blog entry is a case study of a tuning exercise, shown in three stages.  In the first stage, the system is running well under a medium load. In the second stage, the load has been increased causing response times to spike.  In the third stage, the the Java parameters have been modified and the system is able to run well under a heavier load.  

This blog entry is not intended to be an exhaustive and free standing description of Java Garbage Collection, rather, this blog entry is intended to supplement other descriptions such as Tuning Garbage Collection with the 5.0 Java Virtual Machine and Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning.

Environment and tools:

In this case study, a web container is deployed into it's own Solaris Container, and it is the only busy process in the container.  This Solaris Container is using a pool of two quad-core SPARC64 VII
CPUs, with 2 virtual CPU's per core, yielding a total of 16 virtual CPUs.

Screen shots show the output from tools including VisualGC, prstat, jstat, and xcpustate:

  • VisualGC is a tool that is available for download from with the jvmstat 3.0 package and displays a Java Virtual Machine's memory usage.
  • xcpustate is a tool that is available on the Solaris Companion CD and shows the status of each virtual CPU: blue
    indicates idle, green is busy in application code and yellow is busy in
    Solaris kernel code. 
  • jstat is a text based tool that is included with the JDK and is used to display Java garbage collection statistics.  Not to be confused with jvmstat.
  • prstat is a text based tool that is included with Solaris and shows resource usage per process.

Java Options:

The following Java options where in place as the project began and where left intact:

-server -XX:PermSize=512m -XX:MaxPermSize=720m -Xss256k -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled -XX:+UseTLAB -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000

The following options where modified during the course of the project:

Initial Java Options
Modified Java Options
-Xms1024m -Xms4096m
-Xmx2048m -Xmx6144m
-XX:NewSize=128m -XX:NewSize=1536m
-XX:MaxNewSize=256m -XX:MaxNewSize=1536m
N/A -XX:SurvivorRatio=4
N/A

-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-Xloggc:/web_dir/log/gc.log

Stage 1: "if it ain't broke, don't fix it"

In the first stage, the system is running well under a medium load.

Here is a screen shot of VisualGC running a scenario where the end-user response times are good. The window on the left shows an instantaneous view of the Java memory statistics.  The window on the right show a strip charts of the statistics over time, where new data is published on the right side and slides to the left over time.  Eventually the historic data falls off the left side of the strip chart. The right window has the following strip charts, from top to bottom:

  • Compile TIme strip chart shows that the Java run time compiler has be kicking in, but this a fairly light load which was subsequently seen to disappear as all of the classes in use where fully compiled and optimized.
  • Class Loader strip chart shows that the class loader has recently been idle.
  • GC Time shows that Java garbage has been active, but the narrow spikes indicate short bursts of garbage collection. The narrow spikes are a good omen. If wide peaks are seen it could be indicative of a problem.
  • Eden Space strip chart shows that the data quickly fills the 128MB Eden space and is then collected. When the frequency of the garbage collections approach 1 collection per second or greater, aliasing can be a problem and you should use jstat to ascertain the actual frequency
  • Survivor space strip charts shows that the survivor space is not being used.  Combined with view of the Old Gen activity, deduce that the tenuring process is not working and that data that survives one collection is promoted directly into the Old Generation.  I don't like this configuration, but the user response times are good, so I wouldn't change it, yet.  As the load increases, this will need to be resolved.
  • Old Gen strip chart shows that the memory is being filled as data is promoted but quickly cleaned up by the CMS collector. 
  • Perm Gen strip chart shows a steady state of used space and some availble space, which is good. 

033_visualgc.jpg

I believe in the idiom "if it ain't broke, don't fix it", so although I'm not in love with the above screen shot, I wouldn't change anything unless there is a tangible performance problem.  With this load, the current settings are working.

That being said, I don't like that Eden so small and that the Survivor space is not being used.  Objects that survive an initial young generation GC are being promoted directly to the Old Generation, bypassing the tenuring process.

The following screen shot was also taken while running the medium load scenario where the end-user response times are good. On the left, prstat shows that there is one busy Java process which has been using 7.5% of this Solaris Container's CPU resource. The xcpustate window shows that all of the CPUs are lightly loaded by the Java process.

 033_prstat.jpg

The following jstat data was gathered while running the medium load scenario where the end-user response times are good. The following columns are displayed:

  • The first ten column describe memory usage:
    • The first four columns, S0C, S1C, S0U, and S1U are about survivor space. Capacity (C) and Utilization (U) for spaces 0 and 1. The capacity is 64 KB but neither space is utilized. 
    • The next two columns are about Eden, showing Capacity (EC) and Utilization (EU).  The capacity is constant at approximately 128MB (130944KB). How much data is in Eden at the instantaneous time that the trace is taken depends on how much data has filled in since the most recent young generation collection and varies wildly.
    • The next two columns are about the Old Generation, showing Capacity (OC) and Utilization (OU).  The capacity is constant at 1.2 GB. How
      much data is in Eden at the instantaneous time that the trace is taken
      depends on how much data has filled in since the most recent young
      generation collection and varies slowly.
    • Then there are two columns are about the Permanent Gerneration, showing Capacity (PC) and Utilization (PU). The capacity is constant at 720 MB (737280 KB) and the utilization is fairly flat near 440 MB (450560 KB)
  • The final 5 columns are about the number of collections and time spent in garbage collection:
    • YGC is the count of the total number of Young Generation Collections since the Java process was launched.

#jstat -gc 15515 10000
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
64.0 64.0 0.0 0.0 130944.0 128571.1 1226784.0 657397.4 737280.0 450352.0 497 15.956 158 53.966 69.922
64.0 64.0 0.0 0.0 130944.0 32841.4 1226784.0 624092.3 737280.0 450367.8 503 16.154 160 54.918 71.072
64.0 64.0 0.0 0.0 130944.0 0.0 1226784.0 631067.9 737280.0 450251.8 506 16.245 162 54.963 71.208
64.0 64.0 0.0 0.0 130944.0 76982.1 1226784.0 603735.6 737280.0 450255.0 511 16.383 163 55.725 72.107
64.0 64.0 0.0 0.0 130944.0 118467.4 1226784.0 606041.5 737280.0 450265.4 515 16.538 165 56.504 73.042
64.0 64.0 0.0 0.0 130944.0 108089.0 1226784.0 644641.9 737280.0 450273.4 520 16.795 165 56.504 73.298
64.0 64.0 0.0 0.0 130944.0 65716.7 1226784.0 627548.7 737280.0 449934.1 523 16.901 167 57.483 74.384
64.0 64.0 0.0 0.0 130944.0 93575.9 1226784.0 602544.6 737280.0 449690.5 526 17.021 168 58.691 75.712
64.0 64.0 0.0 0.0 130944.0 65022.2 1226784.0 635921.9 737280.0 449698.6 531 17.217 169 58.736 75.953
64.0 64.0 0.0 0.0 130944.0 23657.6 1226784.0 626730.9 737280.0 449708.1 537 17.471 171 59.584 77.055
\^C

The Verbose GC log file confirms the jstat data.

1398.985: [GC[YG occupancy: 32842 K (131008 K)]1398.985: [Rescan (parallel) , 0.0162640 secs]1399.002: [weak refs processing, 0.3515853 secs]1399.353: [class unloading, 0.3569879 secs]1399.711: [scrub symbol & string tables, 0.1207449 secs] [1 CMS-remark: 757680K(1226784K)] 790522K(1357792K), 0.9462555 secs]
1399.932: [CMS-concurrent-sweep-start]
1400.006: [GC 1400.006: [ParNew: 130904K->0K(131008K), 0.0295234 secs] 887457K->758257K(1357792K), 0.0304601 secs]
1400.485: [GC 1400.486: [ParNew: 130944K->0K(131008K), 0.0344479 secs] 872839K->751757K(1357792K), 0.0351573 secs]
1401.453: [GC 1401.453: [ParNew: 130944K->0K(131008K), 0.0309997 secs] 830358K->704885K(1357792K), 0.0318843 secs]
1402.550: [CMS-concurrent-sweep: 1.110/2.618 secs]
1402.550: [CMS-concurrent-reset-start]
1402.563: [CMS-concurrent-reset: 0.013/0.013 secs]
1403.901: [GC 1403.901: [ParNew: 130944K->0K(131008K), 0.0325028 secs] 835829K->711510K(1357792K), 0.0332487 secs]
1404.156: [GC [1 CMS-initial-mark: 711510K(1226784K)] 744661K(1357792K), 0.0442969 secs]
1404.201: [CMS-concurrent-mark-start]
1404.834: [GC 1404.835: [ParNew: 130941K->0K(131008K), 0.0357494 secs] 842451K->717262K(1357792K), 0.0365797 secs]
1404.953: [GC 1404.954: [ParNew: 130893K->0K(131008K), 0.0183660 secs] 848155K->717517K(1357792K), 0.0190584 secs]
1405.055: [GC 1405.055: [ParNew: 130944K->0K(131008K), 0.0173852 secs] 848461K->717814K(1357792K), 0.0180515 secs]
1405.342: [GC 1405.343: [ParNew: 130944K->0K(131008K), 0.0203568 secs] 848758K->720848K(1357792K), 0.0211302 secs]
1405.763: [GC 1405.763: [ParNew: 130944K->0K(131008K), 0.0204437 secs] 851792K->722671K(1357792K), 0.0211113 secs]
1406.192: [GC 1406.192: [ParNew: 130944K->0K(131008K), 0.0204619 secs] 853615K->724768K(1357792K), 0.0214120 secs]
1407.042: [GC 1407.042: [ParNew: 130944K->0K(131008K), 0.0326238 secs] 855712K->734062K(1357792K), 0.0334577 secs]
1409.453: [GC 1409.453: [ParNew: 130944K->0K(131008K), 0.0340238 secs] 865006K->741292K(1357792K), 0.0347661 secs]
1410.355: [CMS-concurrent-mark: 3.050/6.154 secs]
1410.355: [CMS-concurrent-preclean-start]
1412.357: [GC 1412.358: [ParNew: 130944K->0K(131008K), 0.0334870 secs] 872236K->747988K(1357792K), 0.0344020 secs]
1412.936: [CMS-concurrent-preclean: 0.366/2.581 secs]
1412.936: [CMS-concurrent-abortable-preclean-start]
1412.941: [CMS-concurrent-abortable-preclean: 0.005/0.006 secs]
1412.948: [GC[YG occupancy: 66917 K (131008 K)]1412.949: [Rescan (parallel) , 0.0494186 secs]1412.998: [weak refs processing, 0.4237737 secs]1413.422: [class unloading, 0.2246387 secs]1413.647: [scrub symbol & string tables, 0.1256452 secs] [1 CMS-remark: 747988K(1226784K)] 814906K(1357792K), 0.9224302 secs]
1413.872: [CMS-concurrent-sweep-start]
1417.184: [GC 1417.185: [ParNew: 130944K->0K(131008K), 0.0439868 secs] 878065K->755909K(1357792K), 0.0450652 secs]

Resident Set Size

When memory is allocated to a Solaris process, virtual memory pages to fulfill the request are reserved in swap space, but this is a very light weight event that does not require allocating RAM or writing data to the backing store (virtual memory disk).  The first time that any page is accessed, RAM is allocated and zeroed for the page. Finally, in a scenario where RAM is scarce, Solaris may need to pageout the data to the backing store.

It
is not uncommon for a Java users to notice a substantial gap between the swap space that has been reserved for the process compared the amount of RAM that is being used.  Observe the prstat SIZE and RSS columns.  The "RSS" column
(Resident Set Size) indicates how much RAM is in use by the process.
The "SIZE" column displays the the total virtual size of process,
including RAM that is allocated to the process, pages that have been reserved but never accessed and data that has been paged out due to memory
pressure.  In this case, there is no virtual memory pressure (seen via
vmstat, not shown) so we know that there is more than 2 GB of memory space that
has been reserved by the process but not yet accessed. Going back to
the VisualGC screen shot above, in the top left window, notice that Old
Gen has:


  • Active data, solid orange, 664MB.

  • Empty (but already allocated) space shown with a green hash.  The sum of the active data and the empty space is 1.17 GB.

  • Space that has been reserved but not yet allocated, shown with a  gray hash. The virtual memory footprint of the Old Generation is 1.87 GB, including both active, empty but allocated, and reserved space.

In the jstat Old Capacity "OC" column, (1226784 KB) matches the 1.17 GB in the VisualGC output and indicates sum of the memory that has been allocated to the Old Generation, but not pages that have been reserved and not yet allocated.

More detail about Solaris memory can be found at sources including Solaris 10 Physical Memory Management and The Solaris Memory System - Sizing, Tools and Architecture

The following graph is produced using from vmstat gathered while running the medium load scenario where the end-user response times are good. (See Graphing Solaris Performance Stats with gnuplot) You can see that there is a consistent light CPU load on the system throughout the half hour test.


 033_vmstat.jpg

Stage 2: It's broken

In the second stage, the load has been increased causing response times to spike. The processing alternates between two phases. Garbage collection runs constantly while the test is running, but the Old generation CMS collector can't clear out the data fast enough to clear the In once

What I don't like about the following VisualGC

1) GC is constantly running
2) Survivor space is not being used
3) Two phases:

(a) a phase that can be seen in the Eden Space strip chart with very frequent Eden GC's concurrent with application work, and

(b) a phase that can be seen in the Old Gen strip chart where where the application can not execute due to exclusive Old Generation GC work.

4) CMS is not able to hold Old Gen down

 100_bad_visualgc.jpg

The following jstat data was gather during a run where the load has been increased causing response times to spike. There is one sample every 10 seconds (10000 ms) and twenty samples to cover a total of 200 seconds.  Observing the YGC column, notice that there are 685 (3675 - 2990) Young Generation Collections, for an average rate of 3.4 collections/second.  A tuning goal will be to slow this rate down.

#jstat -gc $PID 10000 20
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1553383.3 737280.0 449596.9 2990 119.803 123 283.914 403.717
64.0 64.0 0.0 0.0 130944.0 93869.1 1966080.0 1886113.2 737280.0 449599.9 3046 121.996 123 283.914 405.910
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1942614.0 737280.0 449599.9 3055 122.564 124 283.914 406.478
64.0 64.0 0.0 0.0 130944.0 57856.7 1966080.0 1364918.4 737280.0 449602.5 3093 124.077 125 294.691 418.768
64.0 64.0 0.0 0.0 130944.0 0.0 1966080.0 1714499.8 737280.0 449606.8 3149 126.198 125 294.691 420.889
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1943757.0 737280.0 449608.7 3186 127.848 126 294.691 422.539
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1943757.0 737280.0 449605.8 3186 127.848 126 294.691 422.539
64.0 64.0 0.0 0.0 130944.0 0.0 1966080.0 1389764.8 737280.0 449616.0 3231 129.533 127 307.723 437.256
64.0 64.0 0.0 0.0 130944.0 101211.9 1966080.0 1741419.7 737280.0 449619.9 3286 131.759 127 307.723 439.483
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1943870.4 737280.0 449629.2 3323 133.224 128 307.723 440.947
64.0 64.0 0.0 0.0 130944.0 91907.9 1966080.0 1116053.3 737280.0 449621.3 3330 133.492 129 319.118 452.610
64.0 64.0 0.0 0.0 130944.0 0.0 1966080.0 1541494.5 737280.0 449631.2 3392 135.838 129 319.118 454.956
64.0 64.0 0.0 0.0 130944.0 53983.1 1966080.0 1907711.7 737280.0 449651.6 3444 137.955 129 319.118 457.073
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1942148.9 737280.0 449659.3 3450 138.658 130 319.118 457.776
64.0 64.0 0.0 0.0 130944.0 71632.8 1966080.0 1291300.9 737280.0 449672.5 3484 139.935 131 331.870 471.805
64.0 64.0 0.0 0.0 130944.0 0.0 1966080.0 1661931.9 737280.0 449687.8 3535 142.096 131 331.870 473.966
64.0 64.0 0.0 0.0 130944.0 55150.8 1966080.0 1930792.5 737280.0 449693.2 3581 143.826 131 331.870 475.696
64.0 64.0 0.0 0.0 130944.0 130944.0 1966080.0 1941935.7 737280.0 449693.2 3583 144.291 132 331.870 476.161
64.0 64.0 0.0 0.0 130944.0 66316.6 1966080.0 1405189.1 737280.0 449701.0 3631 146.142 133 342.789 488.931
64.0 64.0 0.0 0.0 130944.0 0.0 1966080.0 1744177.7 737280.0 449704.4 3675 148.066 133 342.789 490.855

The Verbose GC log file confirms the jstat data.

1445.781: [GC 1445.781: [ParNew: 130944K->0K(131008K), 0.0528409 secs] 2048880K->1929346K(2097088K), 0.0733931 secs]
1446.057: [GC 1446.077: [ParNew: 130944K->0K(131008K), 0.0507988 secs] 2060290K->1937743K(2097088K), 0.0519872 secs]
1446.328: [GC 1446.330: [ParNew (promotion failed): 130944K->130944K(131008K), 0.3677102 secs]1446.698: [CMS1448.338: [CMS-concurrent-mark: 8.326/20.506 secs]
(concurrent mode failure): 1943651K->1332161K(1966080K), 13.5947895 secs] 2068687K->1332161K(2097088K), 13.9636616 secs]
1460.373: [GC [1 CMS-initial-mark: 1332161K(1966080K)] 1364958K(2097088K), 0.0570809 secs]
1460.431: [CMS-concurrent-mark-start]
1460.513: [GC 1460.513: [ParNew: 130944K->0K(131008K), 0.0629766 secs] 1463105K->1340611K(2097088K), 0.0638556 secs]
1460.644: [GC 1460.645: [ParNew: 130944K->0K(131008K), 0.0287314 secs] 1471555K->1344385K(2097088K), 0.0295894 secs]
1460.752: [GC 1460.752: [ParNew: 130944K->0K(131008K), 0.0313205 secs] 1475329K->1349195K(2097088K), 0.0325400 secs]
1460.859: [GC 1460.860: [ParNew: 130944K->0K(131008K), 0.0354078 secs] 1480139K->1353073K(2097088K), 0.0363298 secs]
1460.988: [GC 1460.988: [ParNew: 130944K->0K(131008K), 0.0363377 secs] 1484017K->1356257K(2097088K), 0.0374367 secs]
1461.164: [GC 1461.164: [ParNew: 130944K->0K(131008K), 0.0479562 secs] 1487201K->1362016K(2097088K), 0.0488208 secs]
1461.439: [GC 1461.439: [ParNew: 130944K->0K(131008K), 0.0507014 secs] 1492960K->1372934K(2097088K), 0.0515876 secs]
1461.579: [GC 1461.580: [ParNew: 130944K->0K(131008K), 0.0319675 secs] 1503878K->1375415K(2097088K), 0.0332522 secs]
1461.693: [GC 1461.693: [ParNew: 130944K->0K(131008K), 0.0340291 secs] 1506359K->1378926K(2097088K), 0.0348877 secs]
1461.821: [GC 1461.821: [ParNew: 130944K->0K(131008K), 0.0314839 secs] 1509870K->1382392K(2097088K), 0.0323204 secs]
1461.950: [GC 1461.951: [ParNew: 130943K->0K(131008K), 0.0323091 secs] 1513336K->1386662K(2097088K), 0.0331815 secs]
1462.123: [GC 1462.124: [ParNew: 130944K->0K(131008K), 0.0385567 secs] 1517606K->1391853K(2097088K), 0.0398456 secs]
1462.380: [GC 1462.380: [ParNew: 130944K->0K(131008K), 0.0698981 secs] 1522797K->1405319K(2097088K), 0.0708113 secs]
1462.627: [GC 1462.628: [ParNew: 130944K->0K(131008K), 0.0482389 secs] 1536263K->1417126K(2097088K), 0.0493317 secs]
1462.835: [GC 1462.835: [ParNew: 130944K->0K(131008K), 0.0438596 secs] 1548070K->1423124K(2097088K), 0.0447105 secs]
1462.989: [GC 1462.989: [ParNew: 130944K->0K(131008K), 0.0356933 secs] 1554068K->1430934K(2097088K), 0.0365304 secs]
1463.156: [GC 1463.156: [ParNew: 130944K->0K(131008K), 0.0426650 secs] 1561878K->1437914K(2097088K), 0.0438581 secs]
1463.342: [GC 1463.342: [ParNew: 130944K->0K(131008K), 0.0369751 secs] 1568858K->1445730K(2097088K), 0.0378227 secs]
1463.508: [GC 1463.508: [ParNew: 130944K->0K(131008K), 0.0393694 secs] 1576674K->1458926K(2097088K), 0.0402032 secs]
1463.709: [GC 1463.709: [ParNew: 130944K->0K(131008K), 0.0351472 secs] 1589870K->1465137K(2097088K), 0.0364246 secs]
1463.860: [GC 1463.860: [ParNew: 130944K->0K(131008K), 0.0399990 secs] 1596081K->1468331K(2097088K), 0.0408329 secs]
1463.998: [GC 1463.998: [ParNew: 130944K->0K(131008K), 0.0338172 secs] 1599275K->1471847K(2097088K), 0.0346756 secs]
1464.123: [GC 1464.123: [ParNew: 130944K->0K(131008K), 0.0281716 secs] 1602791K->1475013K(2097088K), 0.0292473 secs]
1464.235: [GC 1464.235: [ParNew: 130944K->0K(131008K), 0.0282486 secs] 1605957K->1477824K(2097088K), 0.0293566 secs]
1464.361: [GC 1464.361: [ParNew: 130944K->0K(131008K), 0.0319564 secs] 1608768K->1480943K(2097088K), 0.0328203 secs]
1464.590: [GC 1464.590: [ParNew: 130944K->0K(131008K), 0.0430505 secs] 1611887K->1486579K(2097088K), 0.0439285 secs]
1464.875: [GC 1464.875: [ParNew: 130944K->0K(131008K), 0.0404393 secs] 1617523K->1496962K(2097088K), 0.0415705 secs]
1465.079: [GC 1465.079: [ParNew: 130944K->0K(131008K), 0.0389969 secs] 1627906K->1509873K(2097088K), 0.0398899 secs]
1465.318: [GC 1465.318: [ParNew: 130944K->0K(131008K), 0.0377232 secs] 1640817K->1518156K(2097088K), 0.0386229 secs]
1465.533: [GC 1465.534: [ParNew: 130939K->0K(131008K), 0.0433747 secs] 1649096K->1527800K(2097088K), 0.0446424 secs]
1465.652: [GC 1465.653: [ParNew: 130944K->0K(131008K), 0.0267976 secs] 1658744K->1530747K(2097088K), 0.0276501 secs]
1465.754: [GC 1465.755: [ParNew: 130944K->0K(131008K), 0.0382624 secs] 1661691K->1533772K(2097088K), 0.0390969 secs]
1465.883: [GC 1465.884: [ParNew: 130944K->0K(131008K), 0.0346790 secs] 1664716K->1538809K(2097088K), 0.0355499 secs]
1466.010: [GC 1466.010: [ParNew: 130944K->0K(131008K), 0.0360124 secs] 1669753K->1542202K(2097088K), 0.0368312 secs]
1466.133: [GC 1466.134: [ParNew: 130896K->0K(131008K), 0.0264424 secs] 1673098K->1545037K(2097088K), 0.0276615 secs]
1466.284: [GC 1466.284: [ParNew: 130944K->0K(131008K), 0.0334272 secs] 1675981K->1551352K(2097088K), 0.0342620 secs]
1466.486: [GC 1466.487: [ParNew: 130944K->0K(131008K), 0.0468365 secs] 1682296K->1559392K(2097088K), 0.0476777 secs]
1466.761: [GC 1466.762: [ParNew: 130944K->0K(131008K), 0.0668140 secs] 1690336K->1571022K(2097088K), 0.0679247 secs]
1467.043: [GC 1467.043: [ParNew: 130944K->0K(131008K), 0.0595864 secs] 1701966K->1579608K(2097088K), 0.0604566 secs]
1467.294: [GC 1467.294: [ParNew: 130944K->0K(131008K), 0.0422655 secs] 1710552K->1586731K(2097088K), 0.0433685 secs]
1467.412: [GC 1467.413: [ParNew: 130915K->0K(131008K), 0.0357374 secs] 1717646K->1590261K(2097088K), 0.0365601 secs]
1467.523: [GC 1467.523: [ParNew: 130944K->0K(131008K), 0.0368735 secs] 1721205K->1592832K(2097088K), 0.0376957 secs]
1467.637: [Full GC 1467.637: [ParNew: 130901K->0K(131008K), 0.0387518 secs] 1723877K->1595360K(2097088K), 0.0396553 secs]
1467.755: [GC 1467.755: [ParNew: 130944K->0K(131008K), 0.0328768 secs] 1726304K->1597889K(2097088K), 0.0337467 secs]
1467.913: [GC 1467.913: [ParNew: 130943K->0K(131008K), 0.0321930 secs] 1728833K->1601670K(2097088K), 0.0332656 secs]
1468.115: [GC 1468.115: [ParNew: 130944K->0K(131008K), 0.0378134 secs] 1732614K->1608985K(2097088K), 0.0386423 secs]
1468.259: [GC 1468.260: [ParNew: 130944K->0K(131008K), 0.0389691 secs] 1739929K->1619450K(2097088K), 0.0398194 secs]
1468.378: [GC 1468.379: [ParNew: 130939K->0K(131008K), 0.0346053 secs] 1750390K->1622375K(2097088K), 0.0354541 secs]
1468.509: [GC 1468.509: [ParNew: 130931K->0K(131008K), 0.0329106 secs] 1753307K->1625703K(2097088K), 0.0339618 secs]
1468.627: [GC 1468.636: [ParNew: 130874K->0K(131008K), 0.0315224 secs] 1756578K->1630866K(2097088K), 0.0323764 secs]
1468.751: [GC 1468.752: [ParNew: 130944K->0K(131008K), 0.0316954 secs] 1761810K->1633494K(2097088K), 0.0325242 secs]
1468.926: [GC 1468.933: [ParNew: 130944K->0K(131008K), 0.0361699 secs] 1764438K->1641317K(2097088K), 0.0371095 secs]
1469.145: [GC 1469.146: [ParNew: 130944K->0K(131008K), 0.0367411 secs] 1772261K->1647218K(2097088K), 0.0378685 secs]
1469.380: [GC 1469.380: [ParNew: 130944K->0K(131008K), 0.0446357 secs] 1778162K->1653998K(2097088K), 0.0454686 secs]
1469.688: [GC 1469.689: [ParNew: 130944K->0K(131008K), 0.0390912 secs] 1784942K->1661565K(2097088K), 0.0402177 secs]
1469.951: [GC 1469.951: [ParNew: 130944K->0K(131008K), 0.0486676 secs] 1792509K->1671504K(2097088K), 0.0495607 secs]
1470.248: [GC 1470.249: [ParNew: 130944K->0K(131008K), 0.0414052 secs] 1802448K->1676829K(2097088K), 0.0425514 secs]
1470.447: [GC 1470.447: [ParNew: 130944K->0K(131008K), 0.0342887 secs] 1807773K->1683630K(2097088K), 0.0351706 secs]
1470.610: [GC 1470.611: [ParNew: 130940K->0K(131008K), 0.0319829 secs] 1814571K->1688471K(2097088K), 0.0332171 secs]
1470.722: [GC 1470.722: [ParNew: 130944K->0K(131008K), 0.0311625 secs] 1819415K->1692388K(2097088K), 0.0319923 secs]
1470.852: [GC 1470.852: [ParNew: 130944K->0K(131008K), 0.0314821 secs] 1823332K->1696163K(2097088K), 0.0389738 secs]
1470.967: [GC 1470.979: [ParNew: 130944K->0K(131008K), 0.0318559 secs] 1827107K->1700627K(2097088K), 0.0326812 secs]
1471.085: [GC 1471.086: [ParNew: 130937K->0K(131008K), 0.0328143 secs] 1831564K->1704737K(2097088K), 0.0336577 secs]
1471.229: [GC 1471.229: [ParNew: 130944K->0K(131008K), 0.0335674 secs] 1835681K->1709344K(2097088K), 0.0343768 secs]
1471.385: [GC 1471.386: [ParNew: 130944K->0K(131008K), 0.0406892 secs] 1840288K->1714787K(2097088K), 0.0419120 secs]
1471.516: [GC 1471.516: [ParNew: 130944K->0K(131008K), 0.0295618 secs] 1845731K->1717584K(2097088K), 0.0303950 secs]
1471.625: [GC 1471.625: [ParNew: 130944K->0K(131008K), 0.0376045 secs] 1848528K->1720845K(2097088K), 0.0384246 secs]
1471.750: [GC 1471.750: [ParNew: 130870K->0K(131008K), 0.0260799 secs] 1851715K->1723365K(2097088K), 0.0269050 secs]
1471.839: [GC 1471.840: [ParNew: 130944K->0K(131008K), 0.0255170 secs] 1854309K->1724726K(2097088K), 0.0263177 secs]
1471.988: [GC 1471.988: [ParNew: 130944K->0K(131008K), 0.0285268 secs] 1855670K->1728103K(2097088K), 0.0295660 secs]
1472.143: [GC 1472.143: [ParNew: 130944K->0K(131008K), 0.0392845 secs] 1859047K->1738951K(2097088K), 0.0401130 secs]
1472.362: [GC 1472.362: [ParNew: 130944K->0K(131008K), 0.0379261 secs] 1869895K->1746709K(2097088K), 0.0387573 secs]
1472.554: [GC 1472.554: [ParNew: 130944K->0K(131008K), 0.0431822 secs] 1877653K->1752618K(2097088K), 0.0442519 secs]
1472.672: [GC 1472.673: [ParNew: 130944K->0K(131008K), 0.0434003 secs] 1883562K->1754549K(2097088K), 0.0442311 secs]
1472.795: [GC 1472.796: [ParNew: 130944K->0K(131008K), 0.0295359 secs] 1885493K->1757812K(2097088K), 0.0304006 secs]
1472.904: [GC 1472.904: [ParNew: 130944K->0K(131008K), 0.0386588 secs] 1888756K->1760377K(2097088K), 0.0395649 secs]
1473.035: [GC 1473.036: [ParNew: 130861K->0K(131008K), 0.0297446 secs] 1891238K->1763237K(2097088K), 0.0308653 secs]
1473.188: [GC 1473.201: [ParNew: 130944K->0K(131008K), 0.0319938 secs] 1894181K->1767389K(2097088K), 0.0333206 secs]
1473.401: [GC 1473.401: [ParNew: 130944K->0K(131008K), 0.0387580 secs] 1898333K->1773804K(2097088K), 0.0396182 secs]
1473.522: [GC 1473.538: [ParNew: 130944K->0K(131008K), 0.0312055 secs] 1904748K->1778637K(2097088K), 0.0320711 secs]
1473.650: [GC 1473.650: [ParNew: 130944K->0K(131008K), 0.0346679 secs] 1909581K->1784622K(2097088K), 0.0355099 secs]
1473.769: [GC 1473.780: [ParNew: 130944K->0K(131008K), 0.0325029 secs] 1915566K->1787148K(2097088K), 0.0336199 secs]
1473.889: [GC 1473.890: [ParNew: 130944K->0K(131008K), 0.0306312 secs] 1918092K->1791684K(2097088K), 0.0314725 secs]
1474.023: [GC 1474.025: [ParNew: 130944K->0K(131008K), 0.0346581 secs] 1922628K->1800742K(2097088K), 0.0354589 secs]
1474.219: [GC 1474.219: [ParNew: 130944K->0K(131008K), 0.0431710 secs] 1931686K->1811004K(2097088K), 0.0440093 secs]
1474.392: [GC 1474.393: [ParNew: 130944K->0K(131008K), 0.0384690 secs] 1941948K->1820039K(2097088K), 0.0396068 secs]
1474.572: [GC 1474.584: [ParNew: 130944K->0K(131008K), 0.0413289 secs] 1950983K->1827458K(2097088K), 0.0426382 secs]
1474.801: [GC 1474.813: [ParNew: 130944K->0K(131008K), 0.0385987 secs] 1958402K->1837161K(2097088K), 0.0627487 secs]
1475.046: [GC 1475.047: [ParNew: 130944K->0K(131008K), 0.0488532 secs] 1968105K->1845650K(2097088K), 0.0499528 secs]
1475.262: [GC 1475.262: [ParNew: 130944K->0K(131008K), 0.0512485 secs] 1976594K->1853035K(2097088K), 0.0521045 secs]
1475.458: [GC 1475.458: [ParNew: 130944K->0K(131008K), 0.0461453 secs] 1983979K->1862019K(2097088K), 0.0469998 secs]
1475.652: [GC 1475.653: [ParNew: 130944K->0K(131008K), 0.0447249 secs] 1992963K->1868886K(2097088K), 0.0458052 secs]
1475.763: [GC 1475.780: [ParNew: 130928K->0K(131008K), 0.0324786 secs] 1999815K->1873299K(2097088K), 0.0333689 secs]
1475.882: [GC 1475.882: [ParNew: 130944K->0K(131008K), 0.0340021 secs] 2004243K->1876399K(2097088K), 0.0348220 secs]
1475.993: [GC 1475.994: [ParNew: 130944K->0K(131008K), 0.0299435 secs] 2007343K->1883168K(2097088K), 0.0308629 secs]
1476.097: [GC 1476.097: [ParNew: 130944K->0K(131008K), 0.0343554 secs] 2014112K->1887994K(2097088K), 0.0351879 secs]
1476.218: [GC 1476.218: [ParNew: 130904K->0K(131008K), 0.0331282 secs] 2018898K->1891141K(2097088K), 0.0341797 secs]
1476.375: [GC 1476.376: [ParNew: 130944K->0K(131008K), 0.0460486 secs] 2022085K->1896937K(2097088K), 0.0472445 secs]
1476.522: [GC 1476.522: [ParNew: 130937K->0K(131008K), 0.0305741 secs] 2027875K->1900685K(2097088K), 0.0313665 secs]
1476.630: [GC 1476.630: [ParNew: 130944K->0K(131008K), 0.0354049 secs] 2031629K->1903695K(2097088K), 0.0362252 secs]
1476.749: [GC 1476.750: [ParNew: 130944K->0K(131008K), 0.0332512 secs] 2034639K->1906008K(2097088K), 0.0343826 secs]
1476.861: [GC 1476.882: [ParNew: 130944K->0K(131008K), 0.0302215 secs] 2036952K->1908090K(2097088K), 0.0310552 secs]
1477.030: [GC 1477.030: [ParNew: 130944K->0K(131008K), 0.0308026 secs] 2039034K->1910954K(2097088K), 0.0316133 secs]
1477.250: [GC 1477.251: [ParNew: 130944K->0K(131008K), 0.0385982 secs] 2041898K->1916426K(2097088K), 0.0397542 secs]
1477.451: [GC 1477.452: [ParNew: 130944K->0K(131008K), 0.0355202 secs] 2047370K->1922843K(2097088K), 0.0363764 secs]
1477.641: [GC 1477.641: [ParNew: 130944K->0K(131008K), 0.0374713 secs] 2053787K->1928347K(2097088K), 0.0383333 secs]
1477.864: [GC 1477.865: [ParNew: 130944K->0K(131008K), 0.0420856 secs] 2059291K->1935857K(2097088K), 0.0432463 secs]
1478.088: [GC 1478.089: [ParNew (promotion failed): 130944K->130944K(131008K), 0.2115119 secs]1478.300: [CMS1480.678: [CMS-concurrent-mark: 8.657/20.247 secs]
(concurrent mode failure): 1943323K->1067360K(1966080K), 12.8686984 secs] 2066801K->1067360K(2097088K), 13.0813570 secs]
1491.257: [GC [1 CMS-initial-mark: 1067360K(1966080K)] 1106126K(2097088K), 0.0660923 secs]
1491.325: [CMS-concurrent-mark-start]
1491.415: [GC 1491.416: [ParNew: 130944K->0K(131008K), 0.0628546 secs] 1198304K->1074342K(2097088K), 0.0642408 secs]
1491.569: [GC 1491.569: [ParNew: 130944K->0K(131008K), 0.0283941 secs] 1205286K->1077410K(2097088K), 0.0292610 secs]

The following screen shot shows VisualGC, prstat and xcpustate.  There is a sustained period of single threaded behavior. Of the 664 threads in the Java process, only one is able to run. 

 100_bad_visual_gc_single_threaded_gc.jpg

The following screen shot shows VisualGC, prstat and xcpustate.  After the single threaded phase, there is a phase where many of the threads that had been blocked by garbage collection become runnable.  Every virtual CPU's is busy processing runnable threads. 

 100_bad_visual_gc_back_to_work.jpg

The vmstat data shows the stages of single-threaded garage collection (1/16 CPU's = 6.25%) alternating with stages where the the CPU's are extremely busy processing application code.

 100_bad_vmstat.jpg

 The pattern in the vmstat data is more obvious when only 10 minutes of data is displayed.

 100_bad_vmstat_10_min.jpg

Stage 3: Fix it

In the third stage, the Java parameters have been modified and the system able to run well under a heavier load.  

Initial Java Options
Modified Java Options
-Xms1024m -Xms4096m
-Xmx2048m -Xmx6144m
-XX:NewSize=128m -XX:NewSize=1536m
-XX:MaxNewSize=256m -XX:MaxNewSize=1536m
N/A -XX:SurvivorRatio=4
N/A

-verbose:gc
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-Xloggc:/web_dir/log/gc.log

What I like about the following VisualGC

1) GC is not constantly running
2) Survivor space is being used (5 generations) but not full
3) CMS is able to hold Old Gen down
4) Eden GCs are about one per second.

 100_big_visualgc.jpg  100_big_prstat.jpg

The following jstat data was gather during a run where the the Java parameters have been modified and the system able to run well under a heavier load.  There is one sample
every 10 seconds (10000 ms) and twenty samples to cover a total of 200
seconds.  Observing the YGC column, notice that there are 126 (943 - 817) Young Generation Collections, for an average rate of 0.63
collections/second (one every 1.6 seconds). 

Also notice that the Survivor space is now active. In the VisualGC Survivor Age Histogram, above. observe that data survives 5 generations before being promoted to the Old Generation.  This implied that Java objects that are still referenced after  8 seconds (1.6 seconds X 5 collections).  Most Java objects are not referenced for 8 seconds and therefore are never promoted into the Old Generation,

#jstat -gc $PID 10000 20
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
262144.0 262144.0 0.0 135491.4 1048576.0 740774.1 2621440.0 1518371.0 737280.0 454756.7 817 204.672 119 105.404 310.076
262144.0 262144.0 143538.3 0.0 1048576.0 429976.2 2621440.0 1667952.6 737280.0 454763.5 826 206.506 119 105.404 311.911
262144.0 262144.0 127788.8 0.0 1048576.0 648808.0 2621440.0 1772166.9 737280.0 454766.5 832 207.969 120 105.404 313.373
262144.0 262144.0 149593.1 0.0 1048576.0 598492.1 2621440.0 1515637.6 737280.0 454766.5 840 209.517 121 107.803 317.320
262144.0 262144.0 162023.5 0.0 1048576.0 1008524.0 2621440.0 1656240.4 737280.0 454768.1 846 210.922 121 107.803 318.725
262144.0 262144.0 0.0 125334.4 1048576.0 953097.1 2621440.0 1836384.7 737280.0 454769.0 853 212.890 121 107.803 320.694
262144.0 262144.0 151573.6 152908.7 1048576.0 1048576.0 2621440.0 1687193.8 737280.0 454769.8 859 214.153 122 109.747 323.900
262144.0 262144.0 0.0 127784.9 1048576.0 170487.4 2621440.0 1597313.5 737280.0 454764.8 865 216.265 123 110.197 326.462
262144.0 262144.0 125183.1 0.0 1048576.0 679868.2 2621440.0 1674807.1 737280.0 454765.9 870 217.484 123 110.197 327.682
262144.0 262144.0 0.0 138781.8 1048576.0 941507.7 2621440.0 1500453.3 737280.0 454766.5 875 218.688 124 112.160 330.848
262144.0 262144.0 154681.9 98652.8 1048576.0 1048576.0 2621440.0 1419845.8 737280.0 454770.0 881 220.185 125 112.576 332.761
262144.0 262144.0 140635.1 0.0 1048576.0 184017.0 2621440.0 1541698.1 737280.0 454774.3 888 222.260 125 112.576 334.836
262144.0 262144.0 126225.1 0.0 1048576.0 800352.6 2621440.0 1664977.6 737280.0 454777.0 894 224.013 125 112.576 336.589
262144.0 262144.0 128930.2 0.0 1048576.0 709898.1 2621440.0 1423721.6 737280.0 454779.2 900 225.276 126 114.513 339.788
262144.0 262144.0 0.0 135929.5 1048576.0 901493.1 2621440.0 1500049.7 737280.0 454783.5 907 227.254 127 114.962 342.216
262144.0 262144.0 78006.9 156657.5 1048576.0 1048576.0 2621440.0 1682226.6 737280.0 454785.8 916 229.098 127 114.962 344.060
262144.0 262144.0 112938.7 0.0 1048576.0 707242.9 2621440.0 1833853.7 737280.0 454790.2 922 230.749 128 114.962 345.712
262144.0 262144.0 0.0 119538.9 1048576.0 466447.4 2621440.0 1587738.4 737280.0 454790.8 929 232.260 129 117.110 349.370
262144.0 262144.0 141025.9 131586.9 1048576.0 1048576.0 2621440.0 1774825.3 737280.0 454791.9 937 233.762 129 117.110 350.872
262144.0 262144.0 0.0 116953.7 1048576.0 562706.0 2621440.0 1886004.9 737280.0 454792.8 943 235.572 129 117.110 352.683

The Verbose GC log file confirms the jstat data.

1063.533: [GC[YG occupancy: 740130 K (1310720 K)]1063.534: [Rescan (parallel) , 0.2932129 secs]1063.827: [weak refs processing, 0.8300947 secs]1064.658: [class unloading, 0.2556826 secs]1064.914: [scrub symbol & string tables, 0.1225031 secs] [1 CMS-remark: 1362667K(2621440K)] 2102798K(3932160K), 1.6268772 secs]
1065.161: [CMS-concurrent-sweep-start]
1065.744: [GC 1065.748: [ParNew: 1199794K->136002K(1310720K), 0.3157680 secs] 2562442K->1538117K(3932160K), 0.3173270 secs]
1067.540: [GC 1067.541: [ParNew: 1183955K->134897K(1310720K), 0.3643224 secs] 2516046K->1487739K(3932160K), 0.3657230 secs]
1069.449: [GC 1069.452: [ParNew: 1183473K->132663K(1310720K), 0.4440330 secs] 2398115K->1372567K(3932160K), 0.4454642 secs]
1071.165: [GC 1071.165: [ParNew: 1181239K->117685K(1310720K), 0.2458847 secs] 2366563K->1322400K(3932160K), 0.2472765 secs]
1073.099: [GC 1073.100: [ParNew: 1166261K->143983K(1310720K), 0.2949013 secs] 2230754K->1208476K(3932160K), 0.2964760 secs]
1074.020: [CMS-concurrent-sweep: 3.145/8.858 secs]
1074.020: [CMS-concurrent-reset-start]
1074.056: [CMS-concurrent-reset: 0.036/0.036 secs]
1074.700: [GC 1074.701: [ParNew: 1192514K->130782K(1310720K), 0.2122237 secs] 2257007K->1216167K(3932160K), 0.2136974 secs]
1075.304: [GC [1 CMS-initial-mark: 1085384K(2621440K)] 1482522K(3932160K), 0.3090166 secs]
1075.614: [CMS-concurrent-mark-start]
1076.299: [GC 1076.300: [ParNew: 1179358K->142731K(1310720K), 0.2360216 secs] 2264743K->1228116K(3932160K), 0.2373256 secs]
1078.147: [GC 1078.148: [ParNew: 1191307K->146389K(1310720K), 0.3161631 secs] 2276692K->1249628K(3932160K), 0.3176056 secs]
1080.368: [GC 1080.369: [ParNew: 1194965K->140667K(1310720K), 0.2221885 secs] 2298204K->1266323K(3932160K), 0.2235858 secs]
1082.719: [GC 1082.720: [ParNew: 1189243K->156618K(1310720K), 0.4543953 secs] 2314899K->1292754K(3932160K), 0.4557551 secs]
1083.887: [GC 1083.888: [ParNew: 1205108K->101152K(1310720K), 0.1484881 secs] 2341243K->1271768K(3932160K), 0.1498135 secs]
1084.462: [CMS-concurrent-mark: 3.962/8.848 secs]
1084.462: [CMS-concurrent-preclean-start]
1085.434: [CMS-concurrent-preclean: 0.859/0.971 secs]
1085.434: [CMS-concurrent-abortable-preclean-start]
1086.081: [GC 1086.082: [ParNew: 1149728K->135612K(1310720K), 0.3098950 secs] 2320344K->1306228K(3932160K), 0.3113219 secs]
1087.294: [CMS-concurrent-abortable-preclean: 0.459/1.860 secs]
1087.342: [GC[YG occupancy: 667129 K (1310720 K)]1087.342: [Rescan (parallel) , 0.3074071 secs]1087.650: [weak refs processing, 0.9052933 secs]1088.555: [class unloading, 0.2571331 secs]1088.813: [scrub symbol & string tables, 0.1228506 secs] [1 CMS-remark: 1170615K(2621440K)] 1837745K(3932160K), 1.7140749 secs]
1089.057: [CMS-concurrent-sweep-start]
1089.396: [GC 1089.397: [ParNew: 1184134K->139422K(1310720K), 0.1621469 secs] 2354726K->1320441K(3932160K), 0.1636681 secs]
1091.280: [GC 1091.281: [ParNew: 1187998K->154229K(1310720K), 0.2434008 secs] 2295339K->1273265K(3932160K), 0.2449105 secs]
1093.547: [GC 1093.548: [ParNew: 1202805K->150707K(1310720K), 0.3279749 secs] 2156871K->1125761K(3932160K), 0.3295486 secs]

 The application is able to run most of the time, but even with the improved Java options, there are short periods of single-threaded computation, which are likely attributable to the "initial mark" or the "remark" phases of CMS.  See this description of CMS for more details.

 100_big_cms_vmstat.jpg

When zooming in on the vmstat data, the short and acceptable phases of single threaded activity are more obvious.  Notice how much better this is than the long phases of single-threaded behavior that were visible with the initial Java options.

 100_big_cms_vmstat._10_min.jpg

Conclusion:

"One-size fits all" Java options are not necessarily a good match for your application.  Monitoring garbage collection with a combination of several tools and using the feedback to adjust the Java options for your application can result is a substantial performance improvement.  


Join the discussion

Comments ( 1 )
  • Stefan Schneider Sunday, April 4, 2010

    Very informative case study.

    Excellent visualization.

    Having the entire list of options before and after is something I'm always interested in.

    - Stefan S.


Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.