Tuesday May 27, 2014

Using R to analyze Java G1 garbage collector log files

Introduction

Working in Oracle Platform Integration gives an engineer opportunities to work on a wide array of technologies. My team’s goal is to make Oracle applications run best on the Solaris/SPARC platform. When looking for bottlenecks in a modern application, one needs to be aware of not only how the CPUs and operating system are executing, but also network, storage, and in some cases, the Java Virtual Machine.

I was recently presented with about 1.5 GB of Java Garbage First (G1) Garbage Collector log file data. If you’re not familiar with the subject, you might want to review Garbage First Garbage Collector Tuning by Monica Beckwith. The customer had been running Java HotSpot 1.6.0_31 to host a web application server. The logs are from a very early implementation of the G1 collector, prior to it becoming supported. The G1 collector is supported starting in JDK 7 update 4 and there are substantial performance improvements between the experimental version and the supported version.

I was told that the Solaris/SPARC server was running a Java process launched using a command line that included the following flags: -d64 -Xms9g -Xmx9g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=80 -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintGCDateStamps -XX:+PrintFlagsFinal -XX:+DisableExplicitGC -XX:+UnlockExperimentalVMOptions -XX:ParallelGCThreads=8

Several sources on the internet indicate that if I were to print out the 1.5 GB of log files, it would require enough paper to fill the bed of a pick up truck. Of course, it would be fruitless to try to scan the log files by hand. Tools will be required to summarize the contents of the log files.

Others have encountered large Java garbage collection log files. There are existing tools to analyze the log files:

  • IBM’s GC toolkit
  • The chewiebug GCViewer
  • gchisto
  • HPjmeter

Instead of using one of the other tools listed, I decide to parse the log files with standard Unix tools, and analyze the data with R. From the R project's home page, "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS." If you would like to install R, there are many options for downloading R, including:

  • one of the CRAN mirrors listed at http://cran.r-project.org/mirrors.html
  • the Oracle R Distribution from https://oss.oracle.com/ORD/

Data Cleansing

The log files arrived in two different formats. I guess that the difference is that one set of log files was generated using a more verbose option, maybe -XX:+PrintHeapAtGC, and the other set of log files was generated without that option.

Format 1

In some of the log files, the log files with the less verbose format, a single trace, i.e. the report of a singe garbage collection event, looks like this:

{Heap before GC invocations=12280 (full 61):
 garbage-first heap   total 9437184K, used 7499918K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000)
  region size 4096K, 1 young (4096K), 0 survivors (0K)
 compacting perm gen  total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000)
   the space 262144K,  54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000)
No shared spaces configured.
2014-05-14T07:24:00.988-0700: 60586.353: [GC pause (young) 7324M->7320M(9216M), 0.1567265 secs]
Heap after GC invocations=12281 (full 61):
 garbage-first heap   total 9437184K, used 7496533K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000)
  region size 4096K, 0 young (0K), 0 survivors (0K)
 compacting perm gen  total 262144K, used 144077K [0xffffffff40000000, 0xffffffff50000000, 0xffffffff50000000)
   the space 262144K,  54% used [0xffffffff40000000, 0xffffffff48cb3758, 0xffffffff48cb3800, 0xffffffff50000000)
No shared spaces configured.
}

A simple grep can be used to extract a summary:

$ grep "\[ GC pause (young"  g1gc.log
2014-05-13T13:24:35.091-0700: 3.109: [GC pause (young) 20M->5029K(9216M), 0.0146328 secs]
2014-05-13T13:24:35.440-0700: 3.459: [GC pause (young) 9125K->6077K(9216M), 0.0086723 secs]
2014-05-13T13:24:37.581-0700: 5.599: [GC pause (young) 25M->8470K(9216M), 0.0203820 secs]
2014-05-13T13:24:42.686-0700: 10.704: [GC pause (young) 44M->15M(9216M), 0.0288848 secs]
2014-05-13T13:24:48.941-0700: 16.958: [GC pause (young) 51M->20M(9216M), 0.0491244 secs]
2014-05-13T13:24:56.049-0700: 24.066: [GC pause (young) 92M->26M(9216M), 0.0525368 secs]
2014-05-13T13:25:34.368-0700: 62.383: [GC pause (young) 602M->68M(9216M), 0.1721173 secs]

But that format wasn't easily read into R, so I needed to be a bit more tricky. I used the following Unix command to create a summary file that was easy for R to read.

$ echo "SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime" > summary.txt 
$ grep "\[GC pause (young" g1gc.log | grep -v mark | sed -e 's/[A-SU-z\(\),]/ /g' -e 's/->/  /' -e 's/: /  /g' >> summary.txt 
"SecondsSinceLaunch BeforeSize AfterSize TotalSize RealTime
2014-05-13T13:24:35.091-0700  3.109                    20   5029  9216    0.0146328      
2014-05-13T13:24:35.440-0700  3.459                    9125   6077  9216    0.0086723      
2014-05-13T13:24:37.581-0700  5.599                    25   8470  9216    0.0203820      
2014-05-13T13:24:42.686-0700  10.704                    44   15  9216    0.0288848      
2014-05-13T13:24:48.941-0700  16.958                    51   20  9216    0.0491244      
2014-05-13T13:24:56.049-0700  24.066                    92   26  9216    0.0525368      
2014-05-13T13:25:34.368-0700  62.383                    602   68  9216    0.1721173     

Format 2

In some of the log files, the log files with the more verbose format, a single trace, i.e. the report of a singe garbage collection event, was more complicated than Format 1. Here is a text file with an example of a single G1GC trace in the second format. As you can see, it is quite complicated. It is nice that there is so much information available, but the level of detail can be overwhelming.

I wrote this awk script (download) to summarize each trace on a single line.

#!/usr/bin/env awk -f

BEGIN {
    printf("SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize\n")
}

######################
# Save count data from lines that are at the start of each G1GC trace.
# Each trace starts out like this:
#   {Heap before GC invocations=14 (full 0):
#    garbage-first heap   total 9437184K, used 325496K [0xfffffffd00000000, 0xffffffff40000000, 0xffffffff40000000)
######################
/{Heap.*full/{
    gsub ( "\\)" , "" );
    nf=split($0,a,"=");
    split(a[2],b," ");
    getline;
    if ( match($0, "first") ) { 
  G1GC=1;
  IncrementalCount=b[1];
	FullCount=substr( b[3], 1, length(b[3])-1 ); 
    } else {
        G1GC=0;
    }
}

######################
# Pull out time stamps that are in lines with this format:
#   2014-05-12T14:02:06.025-0700: 94.312: [GC pause (young), 0.08870154 secs]
######################
/GC pause/ {
    DateTime=$1;
    SecondsSinceLaunch=substr($2, 1, length($2)-1);
}

######################
# Heap sizes are in lines that look like this:
#  [ 4842M->4838M(9216M)] 
######################
/\[ .*]$/ {
    gsub ( "\\[" , "" );
    gsub ( "\ \]" , "" );
    gsub ( "->" , " " );
    gsub ( "\\( " , " " );
    gsub ( "\ \)" , " " );
    split($0,a," ");

    if ( split(a[1],b,"M") > 1 ) {BeforeSize=b[1]*1024;}
    if ( split(a[1],b,"K") > 1 ) {BeforeSize=b[1];}

    if ( split(a[2],b,"M") > 1 ) {AfterSize=b[1]*1024;}
    if ( split(a[2],b,"K") > 1 ) {AfterSize=b[1];}

    if ( split(a[3],b,"M") > 1 ) {TotalSize=b[1]*1024;}
    if ( split(a[3],b,"K") > 1 ) {TotalSize=b[1];}
}

######################
# Emit an output line when you find input that looks like this:
#   [Times: user=1.41 sys=0.08, real=0.24 secs]
######################
/\[Times/ {
    if (G1GC==1) {
    gsub ( "," , "" );
    split($2,a,"=");
    UserTime=a[2];
    split($3,a,"=");
    SysTime=a[2];
    split($4,a,"=");
  RealTime=a[2];
    print DateTime,SecondsSinceLaunch,IncrementalCount,FullCount,UserTime,SysTime,RealTime,BeforeSize,AfterSize,TotalSize;
    G1GC=0;
    }
}

The resulting summary is about 25X smaller than the original file, but still difficult for a human to digest.

SecondsSinceLaunch IncrementalCount FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize
...
2014-05-12T18:36:34.669-0700: 3985.744 561 0 0.57 0.06 0.16 1724416 1720320 9437184
2014-05-12T18:36:34.839-0700: 3985.914 562 0 0.51 0.06 0.19 1724416 1720320 9437184
2014-05-12T18:36:35.069-0700: 3986.144 563 0 0.60 0.04 0.27 1724416 1721344 9437184
2014-05-12T18:36:35.354-0700: 3986.429 564 0 0.33 0.04 0.09 1725440 1722368 9437184
2014-05-12T18:36:35.545-0700: 3986.620 565 0 0.58 0.04 0.17 1726464 1722368 9437184
2014-05-12T18:36:35.726-0700: 3986.801 566 0 0.43 0.05 0.12 1726464 1722368 9437184
2014-05-12T18:36:35.856-0700: 3986.930 567 0 0.30 0.04 0.07 1726464 1723392 9437184
2014-05-12T18:36:35.947-0700: 3987.023 568 0 0.61 0.04 0.26 1727488 1723392 9437184
2014-05-12T18:36:36.228-0700: 3987.302 569 0 0.46 0.04 0.16 1731584 1724416 9437184

Reading the Data into R

Once the GC log data had been cleansed, either by processing the first format with the shell script, or by processing the second format with the awk script, it was easy to read the data into R.

g1gc.df = read.csv("summary.txt", row.names = NULL, stringsAsFactors=FALSE,sep="")
str(g1gc.df)
## 'data.frame':    8307 obs. of  10 variables:
##  $ row.names         : chr  "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:" "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:" ...
##  $ SecondsSinceLaunch: num  1.16 1.47 1.97 3.83 6.1 ...
##  $ IncrementalCount  : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ FullCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UserTime          : num  0.11 0.05 0.04 0.21 0.08 0.26 0.31 0.33 0.34 0.56 ...
##  $ SysTime           : num  0.04 0.01 0.01 0.05 0.01 0.06 0.07 0.06 0.07 0.09 ...
##  $ RealTime          : num  0.02 0.02 0.01 0.04 0.02 0.04 0.05 0.04 0.04 0.06 ...
##  $ BeforeSize        : int  8192 5496 5768 22528 24576 43008 34816 53248 55296 93184 ...
##  $ AfterSize         : int  1400 1672 2557 4907 7072 14336 16384 18432 19456 21504 ...
##  $ TotalSize         : int  9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 9437184 ...
head(g1gc.df)
##                       row.names SecondsSinceLaunch IncrementalCount
## 1 2014-05-12T14:00:32.868-0700:              1.161                0
## 2 2014-05-12T14:00:33.179-0700:              1.472                1
## 3 2014-05-12T14:00:33.677-0700:              1.969                2
## 4 2014-05-12T14:00:35.538-0700:              3.830                3
## 5 2014-05-12T14:00:37.811-0700:              6.103                4
## 6 2014-05-12T14:00:41.428-0700:              9.720                5
##   FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize
## 1         0     0.11    0.04     0.02       8192      1400   9437184
## 2         0     0.05    0.01     0.02       5496      1672   9437184
## 3         0     0.04    0.01     0.01       5768      2557   9437184
## 4         0     0.21    0.05     0.04      22528      4907   9437184
## 5         0     0.08    0.01     0.02      24576      7072   9437184
## 6         0     0.26    0.06     0.04      43008     14336   9437184

Basic Statistics

Once the data has been read into R, simple statistics are very easy to generate. All of the numbers from high school statistics are available via simple commands. For example, generate a summary of every column:

summary(g1gc.df)
##   row.names         SecondsSinceLaunch IncrementalCount   FullCount    
##  Length:8307        Min.   :    1      Min.   :   0     Min.   :  0.0  
##  Class :character   1st Qu.: 9977      1st Qu.:2048     1st Qu.:  0.0  
##  Mode  :character   Median :12855      Median :4136     Median : 12.0  
##                     Mean   :12527      Mean   :4156     Mean   : 31.6  
##                     3rd Qu.:15758      3rd Qu.:6262     3rd Qu.: 61.0  
##                     Max.   :55484      Max.   :8391     Max.   :113.0  
##     UserTime        SysTime          RealTime       BeforeSize     
##  Min.   :0.040   Min.   :0.0000   Min.   :  0.0   Min.   :   5476  
##  1st Qu.:0.470   1st Qu.:0.0300   1st Qu.:  0.1   1st Qu.:5137920  
##  Median :0.620   Median :0.0300   Median :  0.1   Median :6574080  
##  Mean   :0.751   Mean   :0.0355   Mean   :  0.3   Mean   :5841855  
##  3rd Qu.:0.920   3rd Qu.:0.0400   3rd Qu.:  0.2   3rd Qu.:7084032  
##  Max.   :3.370   Max.   :1.5600   Max.   :488.1   Max.   :8696832  
##    AfterSize         TotalSize      
##  Min.   :   1380   Min.   :9437184  
##  1st Qu.:5002752   1st Qu.:9437184  
##  Median :6559744   Median :9437184  
##  Mean   :5785454   Mean   :9437184  
##  3rd Qu.:7054336   3rd Qu.:9437184  
##  Max.   :8482816   Max.   :9437184

Q: What is the total amount of User CPU time spent in garbage collection?

sum(g1gc.df$UserTime)
## [1] 6236

As you can see, less than two hours of CPU time was spent in garbage collection. Is that too much? To find the percentage of time spent in garbage collection, divide the number above by total_elapsed_time*CPU_count. In this case, there are a lot of CPU’s and it turns out the the overall amount of CPU time spent in garbage collection isn’t a problem when viewed in isolation.

When calculating rates, i.e. events per unit time, you need to ask yourself if the rate is homogenous across the time period in the log file. Does the log file include spikes of high activity that should be separately analyzed? Averaging in data from nights and weekends with data from business hours may alias problems. If you have a reason to suspect that the garbage collection rates include peaks and valleys that need to be independently analyzed, see the “Time Series” section, below.

Q: How much garbage is collected on each pass?

The amount of heap space that is recovered per GC pass is surprisingly low:

  • At least one collection didn’t recover any data. (“Min.=0”)
  • 25% of the passes recovered 3MB or less. (“1st Qu.=3072”)
  • Half of the GC passes recovered 4MB or less. (“Median=4096”)
  • The average amount recovered was 56MB. (“Mean=56390”)
  • 75% of the passes recovered 36MB or less. (“3rd Qu.=36860”)
  • At least one pass recovered 2GB. (“Max.=2121000”)
g1gc.df$Delta = g1gc.df$BeforeSize - g1gc.df$AfterSize
summary(g1gc.df$Delta)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3070    4100   56400   36900 2120000

Q: What is the maximum User CPU time for a single collection?

The worst garbage collection (“Max.”) is many standard deviations away from the mean. The data appears to be right skewed.

summary(g1gc.df$UserTime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.040   0.470   0.620   0.751   0.920   3.370
sd(g1gc.df$UserTime)
## [1] 0.3966

Basic Graphics

Once the data is in R, it is trivial to plot the data with formats including dot plots, line charts, bar charts (simple, stacked, grouped), pie charts, boxplots, scatter plots histograms, and kernel density plots.

Histogram of User CPU Time per Collection

I don't think that this graph requires any explanation.

hist(g1gc.df$UserTime,
     main="User CPU Time per Collection",
     xlab="Seconds", ylab="Frequency")

plot of chunk unnamed-chunk-6

Box plot to identify outliers

When the initial data is viewed with a box plot, you can see the one crazy outlier in the real time per GC. Save this data point for future analysis and drop the outlier so that it’s not throwing off our statistics. Now the box plot shows many outliers, which will be examined later, using times series analysis. Notice that the scale of the x-axis changes drastically once the crazy outlier is removed.

par(mfrow=c(2,1))
boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime,
        main="Box Plot of Time per GC\n(dominated by a crazy outlier)",
        names=c("usr","sys","elapsed"),
        xlab="Seconds per GC", ylab="Time (Seconds)",
        horizontal = TRUE,
        outcol="red")

crazy.outlier.df=g1gc.df[g1gc.df$RealTime > 400,]
g1gc.df=g1gc.df[g1gc.df$RealTime < 400,]

boxplot(g1gc.df$UserTime,g1gc.df$SysTime,g1gc.df$RealTime,
        main="Box Plot of Time per GC\n(crazy outlier excluded)",
        names=c("usr","sys","elapsed"), 
        xlab="Seconds per GC", ylab="Time (Seconds)",
        horizontal = TRUE,
        outcol="red")
box(which = "outer", lty = "solid")

plot of chunk unnamed-chunk-7

Here is the crazy outlier for future analysis:

crazy.outlier.df
##                          row.names SecondsSinceLaunch IncrementalCount
## 8233 2014-05-12T23:15:43.903-0700:              20741             8316
##      FullCount UserTime SysTime RealTime BeforeSize AfterSize TotalSize
## 8233       112     0.55    0.42    488.1    8381440   8235008   9437184
##       Delta
## 8233 146432

R Time Series Data

To analyze the garbage collection as a time series, I’ll use Z’s Ordered Observations (zoo). “zoo is the creator for an S3 class of indexed totally ordered observations which includes irregular time series.”

require(zoo)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
head(g1gc.df[,1])
## [1] "2014-05-12T14:00:32.868-0700:" "2014-05-12T14:00:33.179-0700:"
## [3] "2014-05-12T14:00:33.677-0700:" "2014-05-12T14:00:35.538-0700:"
## [5] "2014-05-12T14:00:37.811-0700:" "2014-05-12T14:00:41.428-0700:"
options("digits.secs"=3)
times=as.POSIXct( g1gc.df[,1], format="%Y-%m-%dT%H:%M:%OS%z:")
g1gc.z = zoo(g1gc.df[,-c(1)], order.by=times)
head(g1gc.z)
##                         SecondsSinceLaunch IncrementalCount FullCount
## 2014-05-12 17:00:32.868              1.161                0         0
## 2014-05-12 17:00:33.178              1.472                1         0
## 2014-05-12 17:00:33.677              1.969                2         0
## 2014-05-12 17:00:35.538              3.830                3         0
## 2014-05-12 17:00:37.811              6.103                4         0
## 2014-05-12 17:00:41.427              9.720                5         0
##                         UserTime SysTime RealTime BeforeSize AfterSize
## 2014-05-12 17:00:32.868     0.11    0.04     0.02       8192      1400
## 2014-05-12 17:00:33.178     0.05    0.01     0.02       5496      1672
## 2014-05-12 17:00:33.677     0.04    0.01     0.01       5768      2557
## 2014-05-12 17:00:35.538     0.21    0.05     0.04      22528      4907
## 2014-05-12 17:00:37.811     0.08    0.01     0.02      24576      7072
## 2014-05-12 17:00:41.427     0.26    0.06     0.04      43008     14336
##                         TotalSize Delta
## 2014-05-12 17:00:32.868   9437184  6792
## 2014-05-12 17:00:33.178   9437184  3824
## 2014-05-12 17:00:33.677   9437184  3211
## 2014-05-12 17:00:35.538   9437184 17621
## 2014-05-12 17:00:37.811   9437184 17504
## 2014-05-12 17:00:41.427   9437184 28672

Example of Two Benchmark Runs in One Log File

The data in the following graph is from a different log file, not the one of primary interest to this article. I’m including this image because it is an example of idle periods followed by busy periods. It would be uninteresting to average the rate of garbage collection over the entire log file period. More interesting would be the rate of garbage collection in the two busy periods. Are they the same or different?

Your production data may be similar, for example, bursts when employees return from lunch and idle times on weekend evenings, etc. Once the data is in an R Time Series, you can analyze isolated time windows.

Log_with_2_runs.png

Clipping the Time Series data

Flashing back to our test case… Viewing the data as a time series is interesting. You can see that the work intensive time period is between 9:00 PM and 3:00 AM. Lets clip the data to the interesting period:

par(mfrow=c(2,1))
plot(g1gc.z$UserTime, type="h",
     main="User Time per GC\nTime: Complete Log File",
     xlab="Time of Day", ylab="CPU Seconds per GC",
     col="#1b9e77")
clipped.g1gc.z=window(g1gc.z, 
                      start=as.POSIXct("2014-05-12 21:00:00"), 
                      end=as.POSIXct("2014-05-13 03:00:00"))
plot(clipped.g1gc.z$UserTime, type="h", 
     main="User Time per GC\nTime: Limited to Benchmark Execution",
     xlab="Time of Day", ylab="CPU Seconds per GC",
     col="#1b9e77")
box(which = "outer", lty = "solid")

plot of chunk unnamed-chunk-10

Cumulative Incremental and Full GC count

Here is the cumulative incremental and full GC count. When the line is very steep, it indicates that the GCs are repeating very quickly. Notice that the scale on the Y axis is different for full vs. incremental.

plot(clipped.g1gc.z[,c(2:3)],
     main="Cumulative Incremental and Full GC count",
     xlab="Time of Day",
     col="#1b9e77")

plot of chunk unnamed-chunk-11

GC Analysis of Benchmark Execution using Time Series data

In the following series of 3 graphs:

  • The “After Size” show the amount of heap space in use after each garbage collection. Many Java objects are still referenced, i.e. alive, during each garbage collection. This may indicate that the application has a memory leak, or may indicate that the application has a very large memory footprint. Typically, an application's memory footprint plateau's in the early stage of execution. One would expect this graph to have a flat top. On the far right side, the steep decline in the heap space may indicate that the application crashed after 2:00.
  • The second graph shows that the outliers in real execution time, discussed above, occur near 2:00. when the Java heap seems to be quite full.
  • The third graph shows that Full GCs are infrequent during the first few hours of execution. The rate of Full GC's, (the slope of the cumulative Full GC line), changes near midnight.
  • plot(clipped.g1gc.z[,c("AfterSize","RealTime","FullCount")],
         xlab="Time of Day",
         col=c("#1b9e77","red","#1b9e77"))
    

    plot of chunk unnamed-chunk-12

    GC Analysis of heap recovered

    Each GC trace includes the amount of heap space in use before and after the individual GC event. During garbage collection, unreferenced objects are identified, the space holding the unreferenced objects is freed, and thus, the difference in before and after usage indicates how much space has been freed.

    The following box plot and bar chart both demonstrate the same point - the amount of heap space freed per garbage collection is surprisingly low.

    par(mfrow=c(2,1))
    boxplot(as.vector(clipped.g1gc.z$Delta),
         main="Amount of Heap Recovered per GC Pass",
         xlab="Size in KB",
         horizontal = TRUE,
         col="red")
    hist(as.vector(clipped.g1gc.z$Delta),
         main="Amount of Heap Recovered per GC Pass",
         xlab="Size in KB",
         breaks=100,
         col="red")
    box(which = "outer", lty = "solid")
    

    plot of chunk unnamed-chunk-13

    This graph is the most interesting. The dark blue area shows how much heap is occupied by referenced Java objects. This represents memory that holds live data. The red fringe at the top shows how much data was recovered after each garbage collection.

    barplot(clipped.g1gc.z[,c("AfterSize","Delta")],
         col=c("#7570b3","#e7298a"),
         xlab="Time of Day",
         border=NA)
    legend("topleft",
           c("Live Objects","Heap Recovered on GC"),
           fill=c("#7570b3","#e7298a"))
    box(which = "outer", lty = "solid")
    

    plot of chunk unnamed-chunk-14

    When I discuss the data in the log files with the customer, I will ask for an explanation for the large amount of referenced data resident in the Java heap. There are two are possibilities:

    • There is a memory leak and the amount of space required to hold referenced objects will continue to grow, limited only by the maximum heap size. After the maximum heap size is reached, the JVM will throw an “Out of Memory” exception every time that the application tries to allocate a new object. If this is the case, the application needs to be debugged to identify why old objects are referenced when they are no longer needed.
    • The application has a legitimate requirement to keep a large amount of data in memory. The customer may want to further increase the maximum heap size. Another possible solution would be to partition the application across multiple cluster nodes, where each node has responsibility for managing a unique subset of the data.

    Conclusion

    In conclusion, R is a very powerful tool for the analysis of Java garbage collection log files. The primary difficulty is data cleansing so that information can be read into an R data frame. Once the data has been read into R, a rich set of tools may be used for thorough evaluation.

    About

    Jeff Taylor-Oracle

    Search

    Archives
    « March 2015
    SunMonTueWedThuFriSat
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
        
           
    Today