Debugging a low-performance incident within a DPS6 instance

The fully-Java DPS6.x instances are launched by default with a rather small JVM heap (241 MB):


[03/Sep/2008:02:15:56] - STARTUP - INFO - Global log level INFO (from config)
[03/Sep/2008:02:15:56] - STARTUP - INFO - Logging Service configured
[03/Sep/2008:02:15:56] - STARTUP - INFO - Java Version: 1.5.0_09 (Java Home:
/opt/app/foo/apps/dsee63/jre)
[03/Sep/2008:02:15:56] - STARTUP - INFO - Java(TM) 2 Runtime Environment,
Standard Edition (build 1.5.0_09-b03)
[03/Sep/2008:02:15:56] - STARTUP - INFO - Java HotSpot(TM) 64-Bit Server VM
(build 1.5.0_09-b03, mixed mode)
[03/Sep/2008:02:15:56] - STARTUP - INFO - Java Heap Space:
Total Memory (-Xms) = 241MB, Max Memory (-Xmx) = 241MB
[03/Sep/2008:02:15:56] - STARTUP - INFO - Operating System: SunOS/sparcv9 5.9


While this should be fine under normal circumstances, it can become the root cause of performance degradation over time if we are hitting a program bug by which transient objects are mistakenly left referenced (the closest thing in Java to a memory "leak"). When this happens, the heap starts filling up and the Java Garbage Collector enters the scene in a way that can severely impact the product's performance.


An easy workaround when such a situation occurs is to increase the size of the JVM heap at DPS start time. It is actually recommended to use at least 1 GB for the JVM heap (see the sizing documentation at http://docs.sun.com/app/docs/doc/820-0377/hw-size-dps?l=en&a=view), which is achieved as follows:


$ dpadm set-flags instance-path jvm-args="-Xmx1000M -Xms1000M -XX:NewRatio=1"


This must be followed by a DPS instance restart for the new settings to take effect.
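
A minimal sketch of that restart plus a quick check of the new heap size, as I would run it on my lab (instance-path is a placeholder, and I am assuming the default errors log location under the instance directory):

$ dpadm restart instance-path
$ grep "Java Heap Space" instance-path/logs/errors | tail -1

After the restart, the last "Java Heap Space" line of the errors log should reflect the new -Xms/-Xmx settings instead of the default 241MB.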


Of course, if the issue's root cause is indeed memory-consumption related, such an action will only buy us some time before we hit it again. But it will keep us safe for quite a while as the search for the root cause takes place.


Following are some useful tips to apply whenever we want to determine or confirm whether the root cause of a performance degradation over time is an abnormal memory consumption pattern. These tips are useful for debugging the memory consumption of any Java application, DPS being only a particular case.


TIPS FOR DEBUGGING MEMORY CONSUMPTION ON JAVA APPS
==================================================
There are several j* commands that allow us to get interesting information about the garbage collection patterns happening inside a process running under a specific JVM. Such commands include jps, jstat and jmap. We will see below how to use them to gather the memory consumption debugging information that Sun Support will require to dig further into a potential memory-related issue.

Unfortunately, depending on the version of the JVM running the Java application we want to debug and the version of the j* commands being launched, these commands may produce unexpected results.

While it might seem better at first sight to run the j* commands delivered with the latest available JDK 1.6, I have experienced issues when launching a 1.6 jmap or jstack against a process running inside a 1.5 JVM. Hence, the following is an example, executed on my lab, of the strategy behind the same commands we need the customer to run, applied to the analysis of a DPS6 Java process:

The DPS on my lab is running the following (extracted from the DPS logs):

[06/Sep/2008:12:46:21 +0200] - STARTUP - INFO - Java Version: 1.5.0_09 (Java Home: /u2/install/jre)
[06/Sep/2008:12:46:21 +0200] - STARTUP - INFO - Java Heap Space: Total Memory (-Xms) = 241MB, Max Memory (-Xmx) = 241MB
[06/Sep/2008:12:46:21 +0200] - STARTUP - INFO - Operating System: SunOS/sparcv9 5.10

In other words, the DPS is running under a 1.5.0_09 JVM.

Hence, I go directly to the jdk1.5.0_09 bin directory to execute all the j* commands:

mares $ pwd
/usr/lang/JAVA/jdk1.5.0_09/solaris-sparc/bin

jps will work:

mares $ ./jps | grep DistributionServerMain
1374 DistributionServerMain

jstat -gcutil will also work:

mares $ ./jstat -gcutil 1374 1000
  S0     S1     E      O      P     YGC    YGCT   FGC   FGCT     GCT
  0.00  25.21  35.88   0.00  43.65    1    0.032    0   0.000   0.032
  0.00  25.21  35.88   0.00  43.65    1    0.032    0   0.000   0.032
  0.00  25.21  35.88   0.00  43.65    1    0.032    0   0.000   0.032
  0.00  25.21  35.88   0.00  43.65    1    0.032    0   0.000   0.032
^C
mares $
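
For reference, the -gcutil columns report the percentage utilization of the two survivor spaces (S0, S1), eden (E), the old generation (O) and the permanent generation (P), followed by the young and full collection counts and accumulated times (YGC/YGCT, FGC/FGCT) and the total GC time (GCT). When collecting this data for Sun Support it is usually more practical to sample at a longer interval and redirect the output to a file; a rough sketch (the 10-second interval, the sample count and the file name are arbitrary choices of mine):

$ ./jstat -gcutil 1374 10000 360 > jstat.gcutil.`date '+%Y%m%d%H%M'`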

Unfortunately, jmap will not work; it produces the following output:

mares $ ./jmap -heap 1374
Attaching to process ID 1374, please wait...
Error attaching to process: Attach failed : debuggee is 64 bit, use java -d64 for debugger
mares $

This error is due to the fact that the DPS is running in 64-bit mode. If we try to pass the -d64 flag directly to jmap, it will still fail:

mares $ ./jmap -d64 -heap 1374
execve(): No such file or directory
Error trying to exec /usr/lang/JAVA/jdk1.5.0_09/solaris-sparc/bin/sparcv9/jmap.
Check if file exists and permissions are set correctly.
Failed to start a 64-bit JVM process from a 32-bit JVM.
Verify all necessary J2SE components have been installed.
(Solaris SPARC 64-bit components must be installed after 32-bit components.)
mares $

This second error occurs because we cannot use 32-bit debugging tools to analyze 64-bit processes. The solution is to locate the proper 64-bit jmap tool, which typically resides in a parallel directory (solaris-sparcv9 instead of solaris-sparc in my lab):


mares $ pwd
/usr/lang/JAVA/jdk1.5.0_09/solaris-sparc/bin
mares $ cd ../../solaris-sparcv9/bin
mares $ pwd
/usr/lang/JAVA/jdk1.5.0_09/solaris-sparcv9/bin
mares $

If we now run jmap from this directory, it will work as expected:

mares $ ./jmap -d64 -heap 1374
Attaching to process ID 1374, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 1.5.0_09-b03

using thread-local object allocation.
Parallel GC with 2 thread(s)

Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 264241152 (252.0MB)
NewSize = 2883584 (2.75MB)
MaxNewSize = -65536 (-0.0625MB)
OldSize = 1835008 (1.75MB)
NewRatio = 2
SurvivorRatio = 32
PermSize = 21757952 (20.75MB)
MaxPermSize = 83886080 (80.0MB)

Heap Usage:
PS Young Generation
Eden Space:
capacity = 66060288 (63.0MB)
used = 16965328 (16.179397583007812MB)
free = 49094960 (46.82060241699219MB)
25.681583465091766% used

[...]


The data obtained using the tips above will allow Sun Support to move forward in digging into a memory-related issue. Getting several consecutive jstacks will also help (especially if the low-performance issue is a high-CPU event that turns out not to be related to a Java memory "leak").
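
The class histograms used in the analysis example below were presumably produced with jmap's -histo option (the per-class size and instance counts match that format). Here is a rough collection sketch as I would run it from the same 64-bit bin directory as above (the file names, the number of jstacks and the pauses are arbitrary; the second histogram should be taken after the instance has kept serving traffic for a while, so that growth can be compared between the two snapshots):

$ ./jmap -d64 -histo 1374 > jmap1.`date '+%Y%m%d%H%M'`
$ for i in 1 2 3; do ./jstack 1374 > jstack.$i.`date '+%H%M%S'`; sleep 10; done
$ # ... let the instance keep working for a while, then take a second snapshot ...
$ ./jmap -d64 -histo 1374 > jmap2.`date '+%Y%m%d%H%M'`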


Here is the analysis of an example data collection I received from a customer a while ago, after they followed the tips shown above:


** ANALYSIS EXAMPLE **
======================
The jmap-collected data confirms a continuously growing number of BindRequest elements inside the process:


mares $ grep proxy jmap1.time.20080908 | head -10
2089296 11871 com.sun.directory.proxy.ldap.BindRequest
1168680 29217 com.sun.directory.proxy.asn1.ASN1OctetString
701240 17531 com.sun.directory.proxy.asn1.ASN1Element
664320 2076 com.sun.directory.proxy.ldap.SearchRequest
483608 991 com.sun.directory.proxy.extensions.LDAPClientConnection
475544 7081 com.sun.directory.proxy.asn1.ASN1Element[]
464464 2639 com.sun.directory.proxy.server.MultiplexedOpConnectionV2
365904 2541 com.sun.directory.proxy.ldap.LDAPMessage
345752 3929 com.sun.directory.proxy.schema.LDAPSyntaxSchemaElement
339360 7070 com.sun.directory.proxy.asn1.ASN1Sequence


mares $ grep proxy jmap2.time.20080908 | head -10
2280432 12957 com.sun.directory.proxy.ldap.BindRequest
1302880 32572 com.sun.directory.proxy.asn1.ASN1OctetString
755600 18890 com.sun.directory.proxy.asn1.ASN1Element
712960 2228 com.sun.directory.proxy.ldap.SearchRequest
529224 7881 com.sun.directory.proxy.asn1.ASN1Element[]
525576 1077 com.sun.directory.proxy.extensions.LDAPClientConnection
500016 2841 com.sun.directory.proxy.server.MultiplexedOpConnectionV2
394992 2743 com.sun.directory.proxy.ldap.LDAPMessage
378288 7881 com.sun.directory.proxy.asn1.ASN1Sequence
354624 7388 com.sun.directory.proxy.asn1.ASN1Integer
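
To track a single suspect class across the two snapshots side by side, both histogram files can simply be grepped at once (grep prefixes each matching line with its file name; in this histogram format the first column is the total byte count and the second the number of instances). With the data above, that gives roughly:

mares $ grep 'ldap.BindRequest$' jmap1.time.20080908 jmap2.time.20080908
jmap1.time.20080908:2089296 11871 com.sun.directory.proxy.ldap.BindRequest
jmap2.time.20080908:2280432 12957 com.sun.directory.proxy.ldap.BindRequest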


This growth can only mean either a continuously growing influx of requests that are not being consumed, or a Java memory leak (objects mistakenly left referenced) hit after an abrupt error-handling condition. Indeed, the BindRequest elements and their associated ASN1 chunks may be leaking (left-referenced-by-mistake objects, in Java terms) due to such error conditions, as jstat -gcutil shows a steadily growing Young Generation garbage collection count (YGC column) and Old Generation occupation rate (O column):


 S0    S1    E     O     P     YGC   YGCT   FGC  FGCT   GCT
18.42  0.00 60.27 32.10 48.60 37926 217.403 116 72.007 289.410
 0.00 50.00 49.93 32.14 48.60 37927 217.408 116 72.007 289.415
38.89  0.00 31.56 32.17 48.60 37928 217.413 116 72.007 289.420
 0.00 41.18 24.77 32.24 48.60 37929 217.419 116 72.007 289.425
50.00  0.00 10.59 32.33 48.60 37930 217.424 116 72.007 289.431
 0.00 25.00  6.34 32.36 48.60 37931 217.429 116 72.007 289.436
23.33  0.00  0.00 32.41 48.60 37932 217.433 116 72.007 289.440


Ideally, this kind of leaking behavior, if confirmed to be caused by a programmatic "leak" type of bug, should be fixed to avoid uncontrolled GC intrusion.
