Tuesday Sep 30, 2014

Using JVM Diagnostics (JVMD) to help tune production JVMs

Contributing Author: Shiraz Kanga, Consulting Member of Technical Staff, Oracle


Tuning a production JVM involves more than merely adding more RAM to it via the -Xmx parameter. It depends upon an understanding of how your application truly behaves in a production environment. Most JVM tuning is done by developers with a simulated load in their Development or QA environment. Such a load is unlikely to be truly representative of the production load running on production hardware, which is what proper JVM tuning requires.

One of the tools that actually contains real-world production data is JVM Diagnostics (JVMD). Hence it is a good idea to use data collected by JVMD for tuning your production JVMs. Note that JVMD is a component of Oracle Enterprise Manager, licensed as part of both the Weblogic Management Pack and the Non-Oracle Middleware Management Pack.


Figure 1. Heap Utilization and Garbage Collections for a specific JVM


In this document we are primarily addressing the HotSpot JVM. There are several aspects of tuning this JVM that we will look into:

Tuning Heap Size

The main parameters needed to tune heap size are:
  • -Xms<n>[g|m|k] is the initial and minimum size of the Java heap
  • -Xmx<n>[g|m|k] is the maximum possible size that the Java heap can grow up to

Figure 2. Heap after GC and Garbage Collection Overhead for a specific JVM


The Java heap size here refers to the total size of the young and the old generation spaces. To start, take a look at the heap usage chart (Figure 1) of your production JVM under maximum load in the JVMD Performance Diagnostics page. You should see some patterns in the minimum and the maximum heap sizes over time. You can use this data as a rough guide for your choice of -Xms and -Xmx, with a reasonable amount of padding.

After setting these you should start monitoring the garbage collection charts of your production JVMs (Figure 2) in the JVMD Live Heap Analysis page. It is useful to look at the JVMD metric called "Heap use after GC", which provides a good reflection of the actual amount of heap memory being used by your application. Ideally this metric should remain relatively steady over time, with only a few full garbage collections occurring. If there are too many full garbage collections then the performance of your production application is impacted, since a full collection pauses application threads while the entire heap is scanned. You can monitor this with the JVM GC Overhead% chart on the same page of JVMD; garbage collection overhead is the percentage of total time spent in garbage collection. Increasing -Xmx can make full collections less frequent, but if they remain a problem it is time to dig deeper into your tuning options.

The key questions that you need to answer are: how frequently does garbage collection take place, how long does each collection take, and how much memory is actually in use (i.e. heap after GC)? Also be sure that you NEVER make the heap size larger than the free RAM available on your system, otherwise performance will suffer as memory starts getting swapped to disk.
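For example, if the charts show heap usage for your application oscillating between roughly 1 GB after collections and 3 GB under peak load, a reasonable first cut might look like the following (the sizes and the myapp.jar name are purely illustrative assumptions, not recommendations):

  java -Xms3500m -Xmx3500m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar myapp.jar

Setting -Xms equal to -Xmx avoids the cost of growing the heap at runtime, and the GC logging flags make it easy to verify the effect of the change against the JVMD charts.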

The Sun HotSpot JVM relies on generational garbage collection to achieve optimum performance. The -XX:SurvivorRatio command line parameter could further help in tuning garbage collection.

The Java heap has a young generation for newly created objects and an old generation for long-lived objects. The young generation is further subdivided into the Eden space, where new objects are allocated, and the Survivor spaces, where new objects that are still in use can survive their first few garbage collections before being promoted to the old generation. The Survivor Ratio is the ratio of Eden to Survivor space in the young object area of the heap. Increasing this setting optimizes the JVM for applications with high object creation and low object preservation; in applications that generate more medium- and long-lived objects, this setting should be lowered from the default, and vice versa.

For example, -XX:SurvivorRatio=10 sets the ratio between each survivor space and the Eden space to 1:10. If the survivor spaces are too small, they will overflow directly into the old generation. If they are too large, they will sit mostly empty. At each GC, the JVM determines the number of times an object can be copied before it is tenured, called the tenuring threshold. This threshold should be set to keep the survivor space half full.
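As an illustration (the heap and young generation sizes below are hypothetical, not recommendations), a survivor ratio experiment might combine an explicit young generation size with tenuring diagnostics so you can see how objects age:

  java -Xms2g -Xmx2g -Xmn512m -XX:SurvivorRatio=10 -XX:+PrintTenuringDistribution -verbose:gc -jar myapp.jar

The tenuring distribution printed at each young collection shows how full the survivor spaces are, which tells you whether the ratio should be raised or lowered.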

Most tuning operations represent a trade-off of some type or another. In the case of garbage collection the trade-off usually involves memory used vs. throughput and latency.
  • The throughput of a JVM is measured in terms of the time spent doing garbage collection vs. the time spent outside of garbage collection (referred to as application time). It is the inverse of the GC overhead mentioned above and represents the amount of work done by an application relative to the time spent in GC. Throughput can be tuned with -XX:GCTimeRatio=<N>; the default of 99 represents a 1% GC overhead goal.
  • Latency is the amount of time delay that is caused by garbage collection. Latency for GC pauses can be tuned by specifying the maximum pause time goal with the command line option -XX:MaxGCPauseMillis=<N>. This is interpreted as a hint that pause times of <N> milliseconds or less are desired. By default, there is no maximum pause time goal. If a pause time goal is specified, the heap size and other garbage collection related parameters are adjusted in an attempt to keep garbage collection pauses shorter than the specified value. Note that these adjustments may cause the garbage collector to reduce the overall throughput of the application, and in some cases the desired pause time goal cannot be met. (A combined example follows this list.)
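For instance (the numbers and the myapp.jar name are illustrative assumptions only), asking the throughput collector to spend at most roughly 5% of its time in GC while keeping pauses under 200 ms would look like:

  java -Xmx4g -XX:+UseParallelGC -XX:GCTimeRatio=19 -XX:MaxGCPauseMillis=200 -jar myapp.jar

Remember that these are goals, not guarantees; the collector may resize the heap or sacrifice throughput while trying to meet them.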
Some lesser-known options concern the permanent generation space, which is used by the JVM itself to hold metadata, class structures and so on:
  • -XX:PermSize=<n>[g|m|k] is the initial and minimum size of the permanent generation space.
  • -XX:MaxPermSize=<n>[g|m|k] is the maximum size of the permanent generation space. If you ever get the message java.lang.OutOfMemoryError: PermGen space then it means that your application is loading a very large number of classes and this setting should be raised.
  • -Xss<n>[g|m|k] is the size of the thread stack. Each thread in a Java application has its own stack, which is used to hold return addresses, arguments to functions and method calls, and so on. The default stack size for a thread in Java is 1MB. In a highly multi-threaded system such as an application server, there are many thread pools and threads in use at any given point in time, so this setting may need to be reduced. Since each stack has to be allocated as a contiguous block of memory, a machine running many threads can encounter an OutOfMemoryError even when there is sufficient heap space. Recursive code can quickly exhaust the stack, and if you use such code then you may need to increase the -Xss setting. However, if you see java.lang.OutOfMemoryError: unable to create new native thread then you may have too many threads, or each thread may have too large a stack, and you may need to decrease it. (See the example after this list.)
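A combined sketch of these options might look like the line below; the sizes and the myapp.jar name are hypothetical, and appropriate values depend entirely on your class count and thread count:

  java -Xmx2g -XX:PermSize=256m -XX:MaxPermSize=512m -Xss256k -jar myapp.jar

Note that -Xss takes its value directly after the option name, with no equals sign.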

Tuning Garbage Collection Algorithm

Garbage collection is expensive. Generational garbage collectors divide the JVM memory into several spaces:
  • Eden space: All objects are placed here when first created
  • Survivor spaces: One or more regions where objects mature
  • Tenured space: Where long lived objects are stored
  • Permanent generation: This area is only used by the JVM itself to hold metadata, such as data structures for classes, methods, interned strings
One thing that people often forget to try is to lower the amount of garbage being created in the first place. There are a lot of ways to do this, which are specific to the application and code that is being written. This often involves techniques such as using StringBuilder/StringBuffer instead of String concatenation, lowering the amount of logging, and so on.
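As a small, self-contained illustration of the first technique (the class and data below are hypothetical, not taken from any real application), repeated String concatenation allocates a new temporary String on every iteration, whereas a StringBuilder reuses a single buffer:

  import java.util.Arrays;
  import java.util.List;

  public class GarbageDemo {
      public static void main(String[] args) {
          List<String> lines = Arrays.asList("alpha", "beta", "gamma");

          // Concatenation builds a new temporary String (and backing char[])
          // on every iteration, filling Eden with short-lived garbage.
          String concatenated = "";
          for (String line : lines) {
              concatenated += line + "\n";
          }

          // A StringBuilder appends into one reusable buffer, so far fewer
          // objects are created for the collector to clean up.
          StringBuilder builder = new StringBuilder();
          for (String line : lines) {
              builder.append(line).append('\n');
          }
          String built = builder.toString();

          System.out.println(concatenated.equals(built)); // prints: true
      }
  }

The difference is negligible for three short strings, but in a hot request path it directly reduces the allocation rate that the young generation collector has to keep up with.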

There are several GC algorithms which are available to be used in a Java VM. The following command line options allow you to select a specific GC algorithm:
  • -XX:+UseSerialGC uses a single threaded, young generation, and old generation garbage collector (Normally this is a poor choice and should be used only for small Java heap sizes such as -Xmx256m or smaller)
  • -XX:+UseParallelGC utilizes a multithreaded (parallel) garbage collector for the young generation and a single-threaded garbage collector for the old generation space.
  • -XX:+UseParallelOldGC uses a multithreaded garbage collector for both the young and old generations.
  • -XX:+UseParNewGC enables a multithreaded, young generation garbage collector
  • -XX:+UseConcMarkSweepGC enables the VM’s mostly concurrent garbage collector. It also auto-enables -XX:+UseParNewGC (use this if you are not able to meet your application’s worst-case latency requirements because full garbage collection durations are too long)
  • -XX:+UseG1GC enables the Garbage First (G1) collector (available in Java 7 and in the latest releases of Java 6)
In practice, the default collector on server-class machines in both Java 6 and Java 7 is the parallel collector. Changing the algorithm requires detailed analysis of the application behavior. If you see a nice regular sawtooth chart in the heap usage you may not need any changes at all. If not, we recommend trying out each GC algorithm under a realistic load and then comparing it to the default algorithm's behavior under the same load. Usually you will find that the default algorithm outperforms the new setting and that there is no reason to change it.
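One way to run such a comparison (the flags, log file names and myapp.jar are illustrative assumptions) is to drive the same load test through each collector and keep a separate GC log per run:

  java -Xmx4g -verbose:gc -XX:+PrintGCDetails -Xloggc:gc-default.log -jar myapp.jar
  java -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -Xloggc:gc-cms.log -jar myapp.jar
  java -Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+UseG1GC -Xloggc:gc-g1.log -jar myapp.jar

Comparing pause times and GC overhead across the three logs (or across the corresponding JVMD charts) tells you whether a switch is actually worthwhile.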

As you can see, tuning the JVM and its garbage collectors is largely a trade-off between space and time. If you had infinite heap space then you would never need to collect garbage. Conversely, if you could tolerate infinite time delays, then you could run a cleanup as frequently as you like and keep the heap compact. Clearly, both of those situations are impossible. Finding the middle ground that is right for you requires a careful balancing act based on an understanding of how GC works and what the application requires.

References:

Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

Monday May 12, 2014

Quickly Diagnose the Root Cause of Stuck Threads using Oracle Enterprise Manager 12c JVM Diagnostics


One of the hidden gems in Oracle Enterprise Manager 12c is JVM Diagnostics. If you purchased the Weblogic Management Pack license then you already own it. JVMD allows administrators to diagnose performance problems in production Java applications. By eliminating the need to reproduce these “production only” problems in QA, it reduces the time required to resolve them. It does not require complex instrumentation or restarting of the application to get in-depth application details. Application administrators will be able to identify Java problems or database issues that are causing application downtime without any detailed knowledge of the application internals. It is also very well suited to diagnosing issues with “Stuck Threads” which will be the focus of this blog.

What is a [STUCK] Thread

In a Weblogic server, all incoming requests are handled by a thread pool which is controlled by a work manager. Worker threads that are taken out of the pool and not returned after a specified time period are marked as [STUCK] by the work manager. This time period is 10 minutes (600 seconds) by default, but it is configurable on a per-work-manager basis using the "StuckThreadMaxTime" parameter.

Note that it is possible that some of your threads are doing legitimate work for over 10 minutes with no issues. If you have such threads then you should consider placing them in another work manager with an appropriate setting for the "StuckThreadMaxTime" parameter.

Why JVMD is Well Suited to Diagnosing [STUCK] Threads

Traditionally, developers will use a stack trace generated by jstack or kill -3 and try to determine the cause of a stuck thread. However, in my experience, a majority of the time this stack is not even the culprit. The problem often lies in another tier of the application, or even in another thread of the same application. JVMD has the ability to provide additional context such as the name of the request and which tier it called out to (e.g. RDBMS servers, LDAP servers, web servers, RMI servers, etc.). Using fine-grained thread states (i.e. DB, Network, IO, CPU, RMI, Lock, etc.) and the ability to see additional details about the thread, JVMD users can quickly pinpoint the root cause of the problem. Since JVMD is always on, it can also debug issues that happened in the past and can proactively notify you about stuck threads (e.g. send you an email at 1am when stuck threads occur). And lastly, sometimes developers have no access to the target host because they lack the credentials needed to run command line applications.

On some occasions, a thread may be marked as stuck but is doing legitimate work. In such scenarios JVMD allows you to scan back and forth through a large number of samples to see what work is being processed by the thread. In addition, you can take a look at other threads that serviced the same request to see whether or not they behaved similarly. This allows you to quickly determine whether there is really a problem.

Real-Time [STUCK] Thread Analysis

With JVMD there are two main use cases for stuck thread analysis. If you get notified about a stuck thread in real time (via email, etc.) then you can perform a real-time stuck thread analysis. Alternatively, if you are investigating a thread that was stuck in the past but is not present any more, then you can perform a historical stuck thread analysis. In either case the first thing to do is to navigate to the JVM (or JVM pool) where the thread is stuck, by clicking on Targets -> Middleware.

From here we can filter the list of targets by target type or by target name. Your most recent filter request will be remembered the next time you visit the page. Select the Target Type of JVM to see all of the JVM targets.

Pick the JVM for the Weblogic server which is having the stuck thread issues and click on it. This will take you to the target home page. Click the button at the top that says “Live Thread Analysis”. Type the word "stuck" into the thread name search box and click on the arrow to filter the table. Now you should see all the stuck threads. In this case we can see a thread that is stuck in the “Network Wait” state. It is stuck on line 358 in the writeBuffer() function of OutputRecord.java, in the package com.sun.ssl.internal.ssl, which makes it clear that this stuck thread has made an SSL call and the remote server has not responded in a reasonable amount of time, so the client thread is stuck.

Here is another example of stuck threads, this time in the “DB Wait” state. Notice how the tool tip over the SQL ID field shows the SQL being executed. Click on it to view longer SQL statements. Also try clicking on the DB Wait link, which takes you directly to this specific database session in the Oracle Database Diagnostics section of EM for further analysis. The columns displayed are controlled by the “View” drop-down menu. Here we added the “User” column to show the logged-in user who executed the request.

Historical [STUCK] Thread Analysis

In order to start historical stuck thread analysis you need to navigate to the JVM target home page in the same way as discussed in the real-time section. From the target home page, click on the “JVM Performance Diagnostics” button at the top of the page. On the performance diagnostics page you can filter the data to make it more relevant to your task. The first filter to apply is time. If you know the exact time you can use the “Edit Date and Time” button to specify it. Otherwise use the handy shortcut links (such as Day, 1 Hour or 15 Minutes) as needed.

The next thing to filter is the Thread Name. Expand the filter options region if necessary and set the Thread Name filter to “[STUCK]*” so that you only see threads whose names start with [STUCK].

Below the filter region, the “General” tab shows the Thread States, Top Requests, Top Methods, Top SQLs, Top DB Wait Events and Top Databases, computed only for the filtered data, i.e. only for threads that are stuck. Try clicking on method names to see the call stack for the method. The charts are all interactive and fetch additional data about the item clicked.

If you want to find a specific thread, move from the “General” tab over to the “Threads” tab. This is fine-grained data, with each sample and state transition visible. You can click on any sample to view it in the sample analyzer, which should look familiar if you have viewed threads in real time. Details about SQLs, wait states, etc. are all available here too, along with the complete call stack, which can also be exported to a CSV file.

In conclusion, we can see that JVMD provides a rich set of additional details, only a mouse click away, that help you to diagnose the root causes of your stuck threads.

NOTE: Many of the screen shots taken here are using testing & debug code, which deliberately tries to create stuck threads. This does not and should not reflect on the nature of any Oracle products being shipped to customers.

