Tuesday Sep 30, 2014

Using JVM Diagnostics (JVMD) to help tune production JVMs

Contributing Author: Shiraz Kanga, Consulting Member of Technical Staff, Oracle

Tuning a production JVM involves more than merely adding RAM to it via the -Xmx parameter. It depends on understanding how your application actually behaves in a production environment. Most JVM tuning is done by developers against a simulated load in their Development or QA environment, which is unlikely to be truly representative of the production load running on production hardware.

One tool that actually captures real-world production data is JVM Diagnostics (JVMD), which makes the data it collects a good basis for tuning your production JVMs. Note that JVMD is a component of Oracle Enterprise Manager, licensed as part of both the Weblogic Management Pack and the Non-Oracle Middleware Management Pack.

Figure 1. Heap Utilization and Garbage Collections for a specific JVM

In this document we primarily address the HotSpot JVM. There are several aspects of tuning this JVM that we will look into:

Tuning Heap Size

The main parameters needed to tune heap size are:
  • -Xms<n>[g|m|k] is the initial and minimum size of the Java heap
  • -Xmx<n>[g|m|k] is the maximum possible size that the Java heap can grow up to

Figure 2. Heap after GC and Garbage Collection Overhead for a specific JVM

The Java heap size here refers to the total size of the young and the old generation spaces. To start, take a look at the heap usage chart (Figure 1) of your production JVM under maximum load in the JVMD Performance Diagnostics page. You should see patterns in the minimum and maximum heap sizes over time, and you can use this data as a rough guide for your choice of -Xms and -Xmx, with a reasonable amount of padding. After setting these, start monitoring the garbage collection charts of your production JVMs (Figure 2) in the JVMD Live Heap Analysis page. The JVMD metric called "Heap use after GC" is particularly useful because it reflects the actual amount of heap memory being used by your application. Ideally this metric should remain relatively steady over time, with only a few full garbage collections occurring. Too many full garbage collections hurt the performance of your production application, since a full collection blocks application threads while the entire heap is scanned. You can monitor this with the JVM GC Overhead% chart on the same page of JVMD; garbage collection overhead is the percentage of total time spent in garbage collection. Increasing -Xmx can make full collections less frequent, but if that is not enough it is time to dig deeper into your tuning options.
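As a concrete illustration (the numbers, log path and jar name are hypothetical, not recommendations), if JVMD showed heap use after GC peaking around 1.5 GB under production load, you might size the heap with some headroom and turn on GC logging to verify the effect:

    java -Xms2g -Xmx3g \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:/var/log/myapp/gc.log \
         -jar myapp.jar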

The key questions that you need to answer are: how frequently does garbage collection take place, how long does each collection take, and what is the actual memory used (i.e. heap after GC)? Also be sure that you NEVER make the heap larger than the available free RAM on your system, because performance will drop sharply once parts of the heap start getting swapped to disk.

The Sun HotSpot JVM relies on generational garbage collection to achieve optimum performance. The -XX:SurvivorRatio command line parameter could further help in tuning garbage collection.

The Java heap has a young generation for newly created objects and an old generation for long-lived objects. The young generation is further subdivided into the Eden space, where new objects are allocated, and the Survivor spaces, where new objects that are still in use can survive their first few garbage collections before being promoted to the old generation. The Survivor Ratio is the ratio of Eden to Survivor space in the young generation area of the heap. Increasing this setting optimizes the JVM for applications with high object creation and low object preservation; in applications that generate more medium- and long-lived objects, this setting should be lowered from the default, and vice versa.

For example, -XX:SurvivorRatio=10 sets the ratio between each survivor space and the Eden space to 1:10. If the survivor spaces are too small, they will overflow directly into the old generation. If they are too large, they will sit mostly empty. At each GC, the JVM determines the number of times an object can be copied before it is tenured, called the tenuring threshold; the JVM tries to choose this threshold so as to keep the survivor space half full.
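As a rough sketch of how these flags fit together (the sizes are illustrative, and -Xmn, which fixes the young generation size, is an assumption added here for clarity):

    java -Xms2g -Xmx2g -Xmn512m \
         -XX:SurvivorRatio=10 \
         -XX:+PrintTenuringDistribution \
         -jar myapp.jar

The -XX:+PrintTenuringDistribution flag logs object ages at each young collection, which helps you judge whether the survivor spaces are overflowing or sitting largely empty.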

Most tuning operations represent a trade-off of some type or another. In the case of garbage collection the trade-off usually involves memory used versus throughput and latency.
  • The throughput of a JVM is measured in terms of the time spent doing garbage collection vs. the time spent outside of garbage collection (referred to as application time). It is the inverse of the GC overhead mentioned above and represents the amount of work done by an application relative to the time spent in GC. Throughput can be tuned with -XX:GCTimeRatio=<N>, where the default of 99 corresponds to a goal of 1% GC overhead (a 1:99 ratio of GC time to application time).
  • Latency is the amount of time delay that is caused by garbage collection. Latency for GC pauses can be tuned by specifying the maximum pause time goal with the command line option -XX:MaxGCPauseMillis=<N>. This is interpreted as a hint that pause times of <N> milliseconds or less are desired; by default, there is no maximum pause time goal. If a pause time goal is specified, the heap size and other garbage collection related parameters are adjusted in an attempt to keep garbage collection pauses shorter than the specified value. Note that these adjustments may cause the garbage collector to reduce the overall throughput of the application, and in some cases the desired pause time goal cannot be met (see the example after this list).
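A minimal sketch combining both goals with the throughput collector (the 2 GB heap and the 200 ms target are placeholders; the pause goal is a hint, not a guarantee):

    java -Xmx2g -XX:+UseParallelGC \
         -XX:GCTimeRatio=99 -XX:MaxGCPauseMillis=200 \
         -jar myapp.jar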
Some lesser-known options concern the permanent generation space, which is used by the JVM itself to hold metadata, class structures and so on, as well as the per-thread stack size:
  • -XX:PermSize=<n>[g|m|k] is the initial and minimum size of the permanent generation space.
  • -XX:MaxPermSize=<n>[g|m|k] is the maximum size of the permanent generation space. If you ever get the message java.lang.OutOfMemoryError: PermGen space then your application is loading a very large number of classes and -XX:MaxPermSize should be raised.
  • -Xss<n>[g|m|k] is the size of the thread stack. Each thread in a Java application has its own stack, which is used to hold return addresses, arguments to functions and method calls, and so on. The default stack size for a thread in Java is 1MB. In a highly multi-threaded system such as an application server there are multiple thread pools and many threads in use at any given time, so this setting may need to be reduced. Because each stack is allocated as a contiguous block of memory outside the heap, a machine running many active threads can hit an OutOfMemory error even when there is plenty of free heap space. Recursive code can quickly exhaust the stack, and if you use such code you may need to increase the -Xss setting. However, if you see java.lang.OutOfMemoryError: unable to create new native thread then you may have too many threads or too large a stack per thread, and you may need to decrease it. An example combining these options follows this list.
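A hedged example for a class-heavy, highly multi-threaded server running on a Java 6/7 HotSpot JVM (the sizes are placeholders to be adapted to your own measurements, not recommendations):

    java -Xmx2g \
         -XX:PermSize=256m -XX:MaxPermSize=512m \
         -Xss256k \
         -jar myapp.jar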

Tuning Garbage Collection Algorithm

Garbage collection is expensive. Generational garbage collectors divide the JVM memory into several spaces:
  • Eden space: All objects are placed here when first created
  • Survivor spaces: One or more regions where objects mature
  • Tenured space: Where long lived objects are stored
  • Permanent generation: This area is only used by the JVM itself to hold metadata, such as data structures for classes, methods, interned strings
One thing that people often forget to try is to lower the amount of garbage being created in the first place. There are many ways to do this, and they are specific to the application and code being written; common techniques include using StringBuilder/StringBuffer instead of String concatenation, lowering the amount of logging, and so on.

There are several GC algorithms available in a Java VM. The following command line options allow you to select a specific GC algorithm:
  • -XX:+UseSerialGC uses a single-threaded garbage collector for both the young and the old generation (normally a poor choice, appropriate only for small Java heap sizes such as -Xmx256m or smaller)
  • -XX:+UseParallelGC uses a multithreaded (parallel) garbage collector for the young generation and a single-threaded garbage collector for the old generation
  • -XX:+UseParallelOldGC uses a multithreaded garbage collector for both the young and the old generation
  • -XX:+UseParNewGC enables a multithreaded young generation garbage collector
  • -XX:+UseConcMarkSweepGC enables the VM's mostly concurrent garbage collector and also auto-enables -XX:+UseParNewGC (use it if you cannot meet your application's worst-case latency requirements because full garbage collection pauses are too long)
  • -XX:+UseG1GC enables the Garbage First (G1) collector (available in Java 7 and in the latest releases of Java 6)
In practice, the default collector in both Java 6 and Java 7 on server-class machines is the parallel collector; G1 only becomes the default in later Java releases. Changing the algorithm requires detailed analysis of the application's behavior. If you see a nice regular sawtooth pattern in the heap usage chart you may not need any changes at all. If not, we recommend trying out each GC algorithm under a realistic load and comparing it to the default algorithm's behavior under the same load (see the sketch below). Usually you will find that the default algorithm outperforms the new setting and that there is no reason to change it.
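One way to run such a comparison (a sketch; the heap size, log file names and jar are placeholders) is to start the application once per collector under the same realistic load and compare the resulting GC logs alongside the JVMD GC overhead charts:

    java -Xmx2g -XX:+UseParallelOldGC   -XX:+PrintGCDetails -Xloggc:gc-parallel.log -jar myapp.jar
    java -Xmx2g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -Xloggc:gc-cms.log      -jar myapp.jar
    java -Xmx2g -XX:+UseG1GC            -XX:+PrintGCDetails -Xloggc:gc-g1.log       -jar myapp.jar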

As you can see, tuning the JVM and its garbage collectors is largely a trade-off between space and time. If you had infinite heap space you would never need to collect garbage; conversely, if you could tolerate infinite time delays, you could run a cleanup as frequently as you like and keep the heap compact. Clearly, both situations are impossible. Finding the middle ground that is right for you requires a careful balancing act based on an understanding of how GC works and what the application requires.


Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

Thursday Sep 11, 2014

Simplify deployment of JVMD Agents to command line Java applications

Contributing Author: Shiraz Kanga, Consulting Member of Technical Staff, Oracle

Most customers of Oracle Enterprise Manager using JVM Diagnostics use the tool to monitor their Java application servers such as Weblogic, Websphere, Tomcat, etc. In this environment it is fairly easy to deploy the JVMD agent: since it is distributed as a war file, you merely deploy the agent into a running application server using the management GUI or command line tools, and you can start monitoring without restarting the app server or modifying any startup commands or scripts. However, other types of Java applications that do not allow for code deployment at runtime, such as AWT/Swing or command line Java applications, do require those steps. Modifying startup scripts is complex because each application comes with its own custom and unique launch script. Additionally, the command that actually launches the runtime needs to combine the java command with its related parameters (like -Xmx), the JVMD agent with its own parameters (like console host/port), and the application itself, which may have some more custom parameters. People often get confused by this complexity.

I've recently had customers who needed to monitor Hadoop, HDFS, Zookeeper, Kafka, Cassandra and Solr with JVMD. In order to reduce some of the complexity discussed above, I created a simple script-based framework that makes things a bit easier. Feel free to use my approach to quickly set up JVMD with these or any other command line Java programs, or use it as the basis for your own modifications. The framework modifies the startup scripts supplied with these tools in order to add the JVMD agent. All the code/scripts are attached in a zip file; both the original and modified versions of every changed script are included, so you can easily see the modifications I made with a simple diff.

Here's how these scripts are set up. Everything is configured using 4 environment variables as shown below:

    export JVMD_AGENT_HOME=/home/skanga/servers
    export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com
    export JVMD_MANAGER_PORT=3800
    export JVMD_UNIQUE_ID=<unique name for each server process>

where the JVMD_AGENT_HOME must contain the jamagent-env.sh (from the attached zip file) and jamagent.war (which can be downloaded from your JVMD console). The first three of these are likely to remain unchanged for all the JVMs being monitored so you can easily add them directly into jamagent-env.sh if needed.
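For example, if you do fold the first three into jamagent-env.sh, a minimal sketch of what you might add near the top of that script could look like this (the values simply reuse the ones above and are not shipped defaults):

    export JVMD_AGENT_HOME=${JVMD_AGENT_HOME:-/home/skanga/servers}
    export JVMD_MANAGER_HOST=${JVMD_MANAGER_HOST:-jvmdconsole.us.oracle.com}
    export JVMD_MANAGER_PORT=${JVMD_MANAGER_PORT:-3800}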

The JVMD_UNIQUE_ID will always be unique per JVM, so it must not be placed there. However, it supports two other modes in which you specify a pointer to the unique ID instead of the ID itself: you can point to either an environment variable or a JVM system property that holds the actual unique ID. If you use one of these indirect modes then you could add this variable to jamagent-env.sh as well.

If JVMD_UNIQUE_ID starts with the string "sysprop-" then the actual unique ID will be read from the JVM system property named by the string following "sysprop-". For example if JVMD_UNIQUE_ID is "sysprop-server_name" and we have a system property -Dserver_name=MyTestingServer then JVMD will use MyTestingServer as the JVM unique identifier.

If JVMD_UNIQUE_ID starts with the string "envvar-" then the actual unique ID will be read from the environment variable named by the string following "envvar-". For example if JVMD_UNIQUE_ID is "envvar-server_name" and we have an environment variable called server_name=MyTestingServer then JVMD will use MyTestingServer as the JVM unique identifier.
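The two indirect modes side by side (server_name and MyTestingServer are just the illustrative names used above):

    # read the unique ID from the server_name environment variable
    export server_name=MyTestingServer
    export JVMD_UNIQUE_ID=envvar-server_name

    # or read it from the server_name JVM system property
    # (the launch command must pass -Dserver_name=MyTestingServer to java)
    export JVMD_UNIQUE_ID=sysprop-server_name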

Caution: Do not use a dash (minus) character in the name of the environment variable that holds the unique ID; use an underscore instead.

Generic Launch Script Modifications

After these four environment variables are set we need to modify our launch scripts. Make sure you have a backup of all files before you proceed. In the main script that you use to launch your Java application, look for the line that invokes java with your application's main class; this is the line that needs to be modified, as illustrated below.
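As a hypothetical illustration (the options, class name and arguments are placeholders; your script's actual launch line will differ), a line such as

    java $JAVA_OPTS -Xmx1g com.example.MyMainClass "$@"

would become

    java $JAVA_OPTS -Xmx1g $JVMD_AGENT_INSERT com.example.MyMainClass "$@"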

So we simply added $JVMD_AGENT_INSERT just before the name of the main class. If there are multiple such lines then you should modify them all in the same way. In order to configure $JVMD_AGENT_INSERT we also need to source jamagent-env.sh (with some error checking), so we insert a snippet like the following on the line just before the java invocation:

# add JVMD Agent Env settings
[[ -e "${JVMD_AGENT_HOME}/jamagent-env.sh" ]] 
&& source "${JVMD_AGENT_HOME}/jamagent-env.sh" ||
{ echo "ERROR: JVMD_AGENT_HOME undefined or does not contain jamagent-env.sh" 1>&2 ; exit 1; } 

NOTE: Everything after the comment above should be on a single line of code in your launch script. This line tends to get mangled by the blogging software, so it is best to cut and paste it from one of the scripts in the attached zip file.

We will now look at how I used these techniques to add JVMD monitoring to Kafka, Hadoop, Zookeeper, Cassandra and Solr. 

1) Kafka 2.8.0-

I used Kafka 2.8.0- and downloaded it directly from the Kafka site. In Kafka, ALL processes are initiated through a common launcher called kafka-run-class.sh in the bin folder. All the other shell scripts (including the built-in Zookeeper) call this one. So this single insertion point is the only place that we will need to modify in order to add JVMD monitoring to Kafka. Pretty simple. Using the modified script (inside the attached zip file) you can run the servers as shown below:

TEST - with mods to use JVMD
cd /home/skanga/servers/kafka_2.8.0-
export JVMD_AGENT_HOME=/home/skanga/servers
export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com

# start a zookeeper server
export JVMD_UNIQUE_ID=zookeeper-server
./zookeeper-server-start.sh ../config/zookeeper.properties

# start a kafka server
export JVMD_UNIQUE_ID=kafka-server
./kafka-server-start.sh ../config/server.properties

2) Hadoop 2.4.1

The scripts called hadoop, hdfs, mapred and yarn in the hadoop bin directory will ALL need to be modified for JVMD monitoring. Using the modified scripts (inside the attached zip file) you can run all the servers as shown below:

TEST - with mods for hadoop command to use JVMD

cd /home/skanga/servers/hadoop-2.4.1
export JVMD_AGENT_HOME=/home/skanga/servers
export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com

# Launch the hdfs nfs gateway
export JVMD_UNIQUE_ID=hdfs-nfs3-gateway
./bin/hdfs nfs3

# Run a mapreduce history server
export JVMD_UNIQUE_ID=mapred-historyserver
./bin/mapred historyserver

# Run a yarn resource manager
export JVMD_UNIQUE_ID=yarn-resourcemanager
./bin/yarn resourcemanager

# Run a hadoop map-reduce job to find the value of PI (QuasiMonteCarlo method)
export JVMD_UNIQUE_ID=hadoop-test-pi-montecarlo
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar pi 1024 100

3) Zookeeper 3.4.6

The standalone version of zookeeper has a common environment setup script called zkEnv.sh where most JVMD setup can be done. After that a minor modification is needed in the java launch command in zkServer.sh after which all JVMD monitoring works fine. The scripts called zkCleanup.sh and zkCli.sh probably do not need monitoring but can be easily added if really needed.

TEST - with mods for zkServer.sh command to use JVMD

cd /home/skanga/servers/zookeeper-3.4.6/bin
export JVMD_AGENT_HOME=/home/skanga/servers
export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com
export JVMD_UNIQUE_ID=zk-server

# start the zookeeper server
./zkServer.sh start
./zkServer.sh status
./zkServer.sh stop

4) Cassandra 2.0.9

The Apache Cassandra data store has a common environment setup script called conf/cassandra-env.sh where we can add the command to source our include script. Then a minor modification is needed to the java launch command in bin/cassandra after which all JVMD monitoring works fine. The other scripts probably do not need monitoring but can be easily added if really needed. 

TEST - with mods for cassandra command to use JVMD

cd /home/skanga/servers/apache-cassandra-2.0.9/bin
export JVMD_AGENT_HOME=/home/skanga/servers
export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com
export JVMD_UNIQUE_ID=cassandra-server

# start cassandra
./cassandra -f

5) Solr 4.9.0

The Solr search server is an interesting case. In production scenarios, users will probably deploy the Solr war file in their own application server; in that scenario the standard JVMD war file can be deployed to the same application server and monitored easily. However, the Solr distribution also includes an embedded mode which may be started by simply running java -jar start.jar. For this scenario we have converted that java command into a simple script called start.sh, placed in the same folder as start.jar. Using this script (inside the attached zip file) you can run a test as shown below:

TEST - with addition of start.sh command to use JVMD with Solr

cd /home/skanga/servers/solr-4.9.0/example
export JVMD_AGENT_HOME=/home/skanga/servers
export JVMD_MANAGER_HOST=jvmdconsole.us.oracle.com
export JVMD_UNIQUE_ID=solr-server

# start solr
./start.sh

After everything is set up properly for your servers you should see all the relevant JVMs in the default pool with the proper ID as shown in the image below.

JVMs in Default Pool (with hostnames & ip addresses blanked out)

Remember to be a bit patient and wait a few seconds until the connections are established and the servers appear in the console.

Monday May 12, 2014

Quickly Diagnose the Root Cause of Stuck Threads using Oracle Enterprise Manager 12c JVM Diagnostics


One of the hidden gems in Oracle Enterprise Manager 12c is JVM Diagnostics. If you purchased the Weblogic Management Pack license then you already own it. JVMD allows administrators to diagnose performance problems in production Java applications. By eliminating the need to reproduce these “production only” problems in QA, it reduces the time required to resolve them. It does not require complex instrumentation or restarting of the application to get in-depth application details. Application administrators will be able to identify Java problems or database issues that are causing application downtime without any detailed knowledge of the application internals. It is also very well suited to diagnosing issues with “Stuck Threads” which will be the focus of this blog.

What is a [STUCK] Thread

In a Weblogic server, all incoming requests are handled by a thread pool which is controlled by a work manager. Worker threads that are taken out of the pool and not returned after a specified time period are marked as [STUCK] by the work manager. This time period is 10 minutes (600 seconds) by default, but it is configurable on a per work manager basis using the "StuckThreadMaxTime" parameter.

Note that it is possible that some of your threads are doing legitimate work for over 10 minutes with no issues. If you have such threads then you should consider placing them in a separate work manager with an appropriate setting for the "StuckThreadMaxTime" parameter.

Why JVMD is Well Suited to Diagnosing [STUCK] Threads

Traditionally, developers use a stack trace generated by jstack or kill -3 and try to determine the cause of a stuck thread. However, in my experience the majority of the time this stack is not even the culprit; the problem often lies in another tier of the application or even in another thread of the same application. JVMD has the ability to provide additional context such as the name of the request and which tier it called out to, e.g. RDBMS servers, LDAP servers, Web servers, RMI servers, etc. Using fine-grained thread states (i.e. DB, Network, IO, CPU, RMI, Lock, etc.) and the ability to see additional details about the thread, JVMD users can quickly pinpoint the root cause of the problem. Since JVMD is always on, it can also debug issues that happened in the past and can proactively notify you about stuck threads, e.g. by sending you an email at 1am when stuck threads occur. And lastly, sometimes developers have no access to the target host because they lack the credentials needed to run command line tools.

On several occasions, the thread may be stuck but doing legitimate work. In such scenarios JVMD allows you to scan back and forth through a large number of samples to see what work is being processed by the thread. In addition, you can look at other threads that serviced the same request to see whether they behaved similarly. This allows you to quickly determine whether there is really a problem.

Real-Time [STUCK] Thread Analysis

With JVMD there are two real use cases for stuck thread analysis. If you get notified about a stuck thread in real time (via email, etc.) then you can perform a real-time stuck thread analysis. Alternatively, if you are investigating a thread that was stuck in the past but is no longer present, then you can perform a historical stuck thread analysis. In either case the first thing to do is to navigate to the JVM (or JVM pool) where the thread is stuck. We do this by clicking on Targets -> Middleware as shown.

From here we can filter the list of targets by target type or by target name. Your most recent filter request will be remembered the next time you visit the page. Select the Target Type of JVM to see all of the JVM targets.

Pick the JVM for the Weblogic server which is having the stuck thread issues and click on it. This will take you to the target home page. Click the button at the top that says “Live Thread Analysis”. Type the word "stuck" into the thread name search box and click on the arrow to filter the table. Now you should see all the stuck threads. In this case we can see a thread that is stuck in the “Network Wait” state. It is stuck on line 358 in the writeBuffer() function of OutputRecord.java, in the package com.sun.ssl.internal.ssl, which makes it clear that this stuck thread has made an SSL call and the remote server has not responded in a reasonable amount of time, so the client thread is stuck.

Here is another example of stuck threads, this time in the “DB Wait” state. Notice how the tool tip over the SQL ID field shows the SQL being executed; click on it to view longer SQL statements. Also try clicking on the DB Wait link, which takes you directly to this specific database session in the Oracle Database Diagnostics section of EM for further analysis. The columns displayed are controlled by the “View” drop-down menu; here we added the “User” column to show the logged-in user who executed the request.

Historical [STUCK] Thread Analysis

In order to start historical stuck thread analysis you need to navigate to the JVM target home page in the same way as discussed in the real-time section. From the target home page, click on the “JVM Performance Diagnostics” button at the top of the page. On the performance diagnostics page you can filter the data to make it more relevant to your task. The first filter to apply is time. If you know the exact time you can use the “Edit Date and Time” button to specify it as shown. Otherwise use the handy shortcut links for Day, 1 Hour or 15 Minutes as needed.

The next thing to filter is the Thread Name. Expand the filter options region if necessary and set the Thread Name filter to “[STUCK]*” so you only see threads whose name starts with [STUCK].

Below the filter region, the “General” tab will show you the Thread States, Top Requests, Top Methods, Top SQLs, Top DB Wait Events and Top Databases – only for the filtered data, i.e. only for threads that are stuck. Try clicking on method names to see the call stack for the method. The charts are all interactive and fetch additional data about the item clicked.

If you want to find a specific thread, move from the “General” tab to the “Threads” tab. This is fine-grained data, with each sample and state transition visible. You can click on any sample to view it in the sample analyzer, which should look familiar if you viewed threads in real time. Details about SQLs, wait states, etc. are all available here as well, along with the complete call stack, which can also be exported to a CSV file.

In conclusion, we can see that JVMD provides a rich set of additional details, only a mouse click away, that help you diagnose the root causes of your stuck threads.

NOTE: Many of the screen shots taken here are using testing & debug code, which deliberately tries to create stuck threads. This does not and should not reflect on the nature of any Oracle products being shipped to customers.

