Tuesday Oct 27, 2015

JVM hang with CMS collector

This blog post was long due but better late than never. Sometime back I worked on an interesting GC problem which initially looked like an issue with the JVM runtime system. I am talking about 'JDK-8085965: VM hangs in C2Compiler'. This bug has been fixed in JDK 9 and 8u45+. In this post, I want to share interesting symptoms of this bug, and the troubleshooting steps taken that helped in getting to the root of the problem.

This report came in as a JVM hang when running with the CMS collector. Looking at the core file revealed that the JVM was in the '_synchronizing' state at that point.

(gdb) print SafepointSynchronize::_state
$1 = SafepointSynchronize::_synchronizing
(gdb) print SafepointSynchronize::_waiting_to_block
$2 = 1

It was waiting for one thread to come to safe-point state and block.

And the only thread running at that time was the Compiler thread. Stack trace of the Compiler thread:
(gdb) where
#0  Dependencies::find_finalizable_subclass (k=0x101047428)
#1  0x00007fd6804a38ee in ciInstanceKlass::has_finalizable_subclass (this=0x7fd67913e950)
#2  0x00007fd6809c28fb in Parse::call_register_finalizer (this=0x7fd6395405a0)
#3  0x00007fd6809c2fb9 in Parse::return_current (this=0x101047428, value=0x0)
#4  0x00007fd6809cf526 in Parse::do_one_bytecode (this=0x101047428)    
#5  0x00007fd6809c17d8 in Parse::do_one_block (this=0x7fd6395405a0)
#6  0x00007fd6809c1a57 in Parse::do_all_blocks (this=0x7fd6395405a0)
#7  0x00007fd6809c5fe9 in Parse::Parse (this=0x7fd6395405a0, caller=<optimized out>, parse_method=<optimized out>, expected_uses=5377)
#8  0x00007fd6804784e8 in ParseGenerator::generate (this=0x7fd63004dd60, jvms=0x7fd61c402f40)
#9  0x00007fd6805eea36 in Parse::do_call (this=0x7fd6395409c0)
#10 0x00007fd6809cf5aa in Parse::do_one_bytecode (this=0x101047428)
#11 0x00007fd6809c17d8 in Parse::do_one_block (this=0x7fd6395409c0)
#12 0x00007fd6809c1a57 in Parse::do_all_blocks (this=0x7fd6395409c0)
#13 0x00007fd6809c5fe9 in Parse::Parse (this=0x7fd6395409c0, caller=<optimized out>, parse_method=<optimized out>, expected_uses=5377)

The klass* passed to find_finalizable_subclass() was: 0x101047428

(gdb) info frame
Stack level 0, frame at 0x7fd639540240:
 rip = 0x7fd6805955a0 in Dependencies::find_finalizable_subclass

(gdb) print k
$1 = (Klass *) 0x101047428

(gdb) print *$1
$2 = {<Metadata> = {<MetaspaceObj> = {<No data fields>}, _vptr.Metadata =
0x7fd680fc9670 <vtable for InstanceKlass+16>}, _layout_helper = 16,
  _super_check_offset = 64, _name = 0x7fd618009850, _secondary_super_cache =
 0x100009608, _secondary_supers = 0x7fd668848a88, _primary_supers = {
    0x100000f28, 0x100009418, 0x1000097f0, 0x101047428, 0x0, 0x0, 0x0, 0x0},
 _java_mirror = 0xa77b4b40, _super = 0x1000097f0, _subklass = 0x0,
   _next_sibling = 0x101047428, _next_link = 0x0, _class_loader_data =
 0x7fd618003ab0, _modifier_flags = 1, _access_flags = {_flags = 536870913},
   _last_biased_lock_bulk_revocation_time = 0, _prototype_header = 0x5,
 _biased_lock_revocation_count = 0, _trace_id = 1685520384,
   _modified_oops = 0 '\000', _accumulated_modified_oops = 0 '\000',
 _shared_class_path_index = -1}
(gdb) x/s 0x7fd618009850+8
 0x7fd618009858: "sun/reflect/GeneratedMethodAccessor1389"

 _next_sibling = 0x101047428

The next_sibling in this case was the same as the klass itself. So, the compiler thread was stuck in the following loop:

 1187 Klass* Dependencies::find_finalizable_subclass(Klass* k) {
 1188   if (k->is_interface())  return NULL;
 1189   if (k->has_finalizer()) return k;
 1190   k = k->subklass();
 1191   while (k != NULL) {
 1192     Klass* result = find_finalizable_subclass(k);
 1193     if (result != NULL) return result;
 1194     k = k->next_sibling(); 
 1195   }
 1196   return NULL;
 1197 }

Then we did another run with the failing program using TraceClassLoading and TraceClassUnloading JVM options. From the core file of this run, found that VM was stuck while going over a circular siblings list. Note that the subklasses of a Klass are maintained using the _next_sibling field of its first _subklass.

Here's the klass and subklasses chain built from the core file:

super class sun/reflect/SerializationConstructorAccessorImpl - 0x1001acec0
   |--subklasses chain
0x1010eec30-sun/reflect/GeneratedSerializationConstructorAccessor232->
0x1010ef030-sun/reflect/GeneratedSerializationConstructorAccessor231->
0x1010ef430-sun/reflect/GeneratedSerializationConstructorAccessor230->
0x1010ef830-sun/reflect/GeneratedSerializationConstructorAccessor229->
0x1010efc30-sun/reflect/GeneratedSerializationConstructorAccessor228->
0x1010f0030-sun/reflect/GeneratedSerializationConstructorAccessor227->
0x1010f0430-sun/reflect/GeneratedSerializationConstructorAccessor226->
0x1010f0830-sun/reflect/GeneratedSerializationConstructorAccessor225->
0x1010f0c30- sun/reflect/GeneratedSerializationConstructorAccessor224->
0x1010f1030-sun/reflect/GeneratedSerializationConstructorAccessor223->
0x1010f1430-sun/reflect/GeneratedSerializationConstructorAccessor222->
0x1010f1830-sun/reflect/GeneratedSerializationConstructorAccessor221->
0x1010f1c30-sun/reflect/GeneratedSerializationConstructorAccessor220->
0x1010f2030-sun/reflect/GeneratedSerializationConstructorAccessor219->
0x1010f2430- sun/reflect/GeneratedSerializationConstructorAccessor218->
0x1010f2830-sun/reflect/GeneratedSerializationConstructorAccessor217->
0x10110b830-sun/reflect/GeneratedMethodAccessor94->
0x1010e8430-sun/reflect/GeneratedSerializationConstructorAccessor256->
0x1010e8830-sun/reflect/GeneratedSerializationConstructorAccessor255->
0x1010e8c30-sun/reflect/GeneratedSerializationConstructorAccessor254->
0x1010e9830-sun/reflect/GeneratedSerializationConstructorAccessor253->
0x1010e9c30-sun/reflect/GeneratedSerializationConstructorAccessor252->
0x1010ea030-sun/reflect/GeneratedSerializationConstructorAccessor251->
0x1010ea430-sun/reflect/GeneratedSerializationConstructorAccessor250->
0x1010ea830-sun/reflect/GeneratedSerializationConstructorAccessor249->
0x1010eac30-sun/reflect/GeneratedSerializationConstructorAccessor248->
0x1010eb030-sun/reflect/GeneratedSerializationConstructorAccessor247->
0x1010eb430-sun/reflect/GeneratedSerializationConstructorAccessor246->
0x1010eb830-sun/reflect/GeneratedSerializationConstructorAccessor245->
0x1010ebc30-sun/reflect/GeneratedSerializationConstructorAccessor244->
0x1010ec030-sun/reflect/GeneratedSerializationConstructorAccessor243->
0x1010ec430-sun/reflect/GeneratedSerializationConstructorAccessor242->
0x1010ec830-sun/reflect/GeneratedSerializationConstructorAccessor241->
0x1010ecc30-sun/reflect/GeneratedSerializationConstructorAccessor240->
0x1010ed030-sun/reflect/GeneratedSerializationConstructorAccessor239->
0x1010ed430-sun/reflect/GeneratedSerializationConstructorAccessor238->
0x1010ed830-sun/reflect/GeneratedSerializationConstructorAccessor237->
0x1010edc30-sun/reflect/GeneratedSerializationConstructorAccessor236->
0x1010ee030-sun/reflect/GeneratedSerializationConstructorAccessor235->
0x1010ee430-sun/reflect/GeneratedSerializationConstructorAccessor234->
0x1010ee830-sun/reflect/GeneratedSerializationConstructorAccessor233->
0x1010eec30-sun/reflect/GeneratedSerializationConstructorAccessor232

The GC logs generated with TraceClassLoading and TraceClassUnloading options had:

[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor216 0x000000010110b830]


So, GeneratedSerializationConstructorAccessor216 was unloaded from address 0x10110b830, and sun/reflect/GeneratedMethodAccessor94 got loaded at the same address. But the link to GeneratedSerializationConstructorAccessor216 in the siblings list was not updated after this class was unloaded, and that corrupted the subKlasses chain.

That made me look at the other JVM options used for this failing program. hs_err file showed that -Xnoclassgc was being used to turn off the class unloading but the logs indicated that classes were still getting unloaded. From that point, this mystery was not very hard to solve.

In JDK8, CMSClassUnloadingEnabled option which is used to control the class-unloading in CMS collector, was enabled by default. And this problem was occurring due to the conflicting values getting set for CMSClassUnloadingEnabled and ClassUnloading options. One phase of the class-unloading which is common to the other collectors too that updates subkalsses/siblings links is performed only when ClassUnloading option is enabled. So when -Xnoclassgc or -XX:-ClassUnloading are specified on the command line, due to this bug CMSClassUnloadingEnabled stays enabled and classes get unloaded but that common step gets skipped which corrupts the class hierarchy links leading to hangs or crashes.

The fix was simple - disable CMS class unloading options when -Xnoclassgc or -XX:-ClassUnloading are specified.

There is a similar bug open for G1 collector which is being worked and is targeted for JDK9:
    JDK-8114823: G1 doesn't honor request to disable class unloading


Wednesday Dec 17, 2014

Long Class-Unloading Pauses with JDK8

Recently, I came across a report where a user was facing long GC pauses with JDK8. They didn't see this issue while running with JDK7. The problem with jdk8 was that sometimes the class unloading step of CMS-remark phase was taking a very long time to do its job. They were running with large pages enabled with both jdk7 and jdk8.

2014-11-30T01:51:35.632+0000: [GC (CMS Final Remark) [YG occupancy: 292006 K (1179648 K)][Rescan (parallel) , 0.0397261 secs][weak refs processing, 0.0954205 secs][class unloading, 29.2450253 secs][scrub symbol table, 4.9592268 secs][scrub string table, 0.0034572 secs][1 CMS-remark: 1101173K(1835008K)] 1393180K(3014656K), 34.3796360 secs] [Times: user=0.92 sys=0.38, real=34.37 secs]

2014-12-01T02:01:44.452+0000: [GC (CMS Final Remark) [YG occupancy: 550275 K (1179648 K)][Rescan (parallel) , 0.0785401 secs][weak refs processing, 0.0931797 secs][class unloading, 28.0163134 secs][scrub symbol table, 5.2806525 secs][scrub string table, 0.0034023 secs][1 CMS-remark: 1101982K(1835008K)] 1652257K(3014656K), 33.5087749 secs] [Times: user=1.48 sys=0.36, real=33.51 secs]


Now, here if we look closely, there is a huge difference in the times reported for 'user' 'sys' and 'real'. As we all know, 'user' is the amount of CPU time spent in user-mode within the process and 'sys' is the amount of CPU time spent in the kernel within the process. Other processes and the time the process spends blocked does not count towards these two figures. The 'real' time is the total elapsed time including time slices used by other processes and the time the process spends blocked (for example in waiting for I/O to complete, or system is performing memory activities such as paging/swapping).

In this case too, the problem seemed not be with the JVM or more specifically with the class unloading step but more with the GC threads getting blocked out. It turned out that the class unloading phase was causing lot of paging activity to occur at the system level which took away the CPU time from the GC threads.

What changed in the JVM from JDK7 to JDK8 is that in JDK8 we don't have PermGen. Instead we have MetaSpace which is allocated out of native memory space. As I mentioned earlier, the user was running with Large Pages enabled. With JDK7 by using option -XX:+UseLargePages, we get the large pages enabled for the Java Heap as well as for the PermGen. But with jdk8, with the class meta data stored in MetaSpace(native space), we don't get large pages enabled for the MetaSpace by default when we use +UseLargePages.

In this particular case, the system was configured to have higher number of large pages and a limited memory resource was left for the regular small pages. And a limited number of regular pages available for the MetaSpace caused paging activity during the class unloading phase leading to long GC pauses. The solution for this problem was to use -XX:+UseLargePagesInMetaspace that enables the large pages support for the MetaSpace. This avoided the paging activity and the class-unloading times returned to normal.

Please note that UseLargePagesInMetaspace is not enabled by default when UseLargePages is ON. This is intentional and is done to avoid any memory usage regressions. A classloader (and its classes metadata) is stored in a chunk of memory called Metachunk in the MetaSpace. To be able to hand back pages to the system, we have to clean out an entire page. It is possible that there may be many Metachunks of different classloaders present on a page, and that page can be returned to the system only when all the classloaders are dead and their Metachunks are not needed any more. This leads to delayed deallocation of pages and it is a bit more likely with large pages.

Another reason for not turning UseLargePagesInMetaspace option on by default is that we can not use large pages for the CompressedClass space. The problem there is that we can not dynamically grow the number of large pages; we have to pre-commit all the large pages when memory is reserved. CompressedClass space by default is 1GB of reserved memory and committing that 1GB up-front would cause memory usage regressions.

So, if one has the system configured in such a way that a larger part of memory is set aside for large pages and a relatively smaller part for the regular pages, it is likely to observe the above described long GC pauses during the clean up of MetaSpace. In such situation, using UseLargePagesInMetaspace may help resolve the problem.

About

I share my work experiences as a JVM Sustaining Engineer through this blog

Search

Categories
Archives
« January 2017
SunMonTueWedThuFriSat
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
    
       
Today