By David Buck-Oracle on Oct 16, 2015
The following is a guest post by Mattis Castegren, our fearless leader and manager of Oracle’s JVM Sustaining Engineering team:
I recently got a great example of a bug solved using a JFR recording, so I wanted to share it on this blog.
The problem came from another team, and they were seeing that their application froze completely for 5-6 seconds, roughly once every five minutes
I got a bunch of JFR recordings from their runs. So, first thing to look at is the GC, right?
They were running with a Parallel GC, and twice in a 30 minute recording they saw 4-second stop-the-world pause. However, their application froze more often than that. I wondered if the problem was other machines in the cluster seeing GC pauses, and if the machines somehow waited for those, but after running all the machines with G1, the GC pauses went down but they still saw the long pauses.
Right, so what more can we say? This team was running with WLDF, or the WebLogic Diagnostic Framework. This framework generates events for WLS actions like servlet requests, database queries, etc, and is extremely powerful. Looking at the event graph, I only enabled servlet requests, and if you squint a bit, you can kind of see the pauses as bands of longer requests:
To make it clearer, I went to the Log tab, sorted the servlet invocation events on duration, selected the ones longer than 4 seconds, right clicked and selected "Operative Set"->"Set Selection" from the drop-down menu. Now, only the longest servlet invocations are part of our operative set. Going back to the graph and selecting "Show Only Operative Set" gives a pretty clear image:
So yes, it seems the work is frozen at somewhat regular intervals, and it is not caused by the GC (at least not all the pauses)
Is the JVM frozen?
Next question was: Is the entire JVM frozen, or only the worker threads. Could it be some locking issue or something? To answer that, I zoomed in on one of the pauses, enabled all events, and unchecked "Show Only Operative Set". If you are not used to the Graph view, it may look a bit cluttered now. (Note the zoom in the time slider at top):
Basically, each thread has several lines representing different event producers. We can zoom in on one of the threads:
The tan event at the top is the long servlet invocation. The red event under that is a long database request, the long blue event is an I/O event. Holding the mouse over them gives you information about each event.
Here, it seems like this thread is waiting on a long I/O caused by a long database request, but most of the other threads did not have this pattern. Was it a problem that long I/O caused one thread to pause, and that thread held some kind of resource? We did not see any synchronization events, but maybe there were something else? If so, we should be able to see some other events during that time, from threads that did not require that resource. Scrolling further down, I found some threads that gave me the answer:
Zooming out a little bit, I saw several threads that looked like this:
The yellow events are "Wait" events. They all have a timeout of 1s, but suddenly there are two Wait event that are 6 seconds long, even though the timeout is still 1 second. This is a pretty strong indication that the entire JVM is indeed frozen.
Why is it frozen?
So, why is the JVM frozen? Here, I again zoomed in on one of the pauses and went to the Log tab. I wanted to know the last event that was generated before the pause. So, I sorted all events on start time, and scrolling through the list of all events, I looked at the last event before the gap:
So, at 08:46:08, there is a "VM Operation" event that takes 4.8 seconds. After that, no other events gets created until 08:46:13 (roughly 4.8 seconds later). Looking at the event, we see that it is a ThreadDump event, called from the "oracle.webcenter.DefaultTimer" thread. Thread dumps stops the world, so this makes sense!
Putting the pieces together
Let's grow our operative set. I started by zooming out, and selected all the VM Operation events over 4 seconds that were caused by thread dumps (there were also two events with the ParallelGCFailedAllocation operation caused by the two GCs). I added those events to the operative set.
I also went to the events tab, selected the oracle.webcenter.DefaultTimer thread and added all events in that thread to the operative set too
Going back to the graph, enabling all event types and selecting "Show Only Operative Set", I now got this nice image:
The DefaultTimer thread periodically waits for 4m 45s. When it wakes up, and triggers two VMOperations collecting thread dumps in the VMThread. These operations correlate perfectly with the pauses seen in the application (except two that were caused by the GC).
So, what was the problem? Now, I handed the investigation back to the application team who found the stack trace in the DefaultTimer thread that triggered the thread dumps, and in the source code we found:
ThreadInfo threadInfos = threadMXBean.getThreadInfo(threadIds,true, true);
Looking at the Javadoc, we can read:
“This method obtains a snapshot of the thread information for each thread including:
- the entire stack trace,
- the object monitors currently locked by the thread if lockedMonitors is true, and
- the ownable synchronizers currently locked by the thread if lockedSynchronizers is true.
This method is designed for troubleshooting use, but not for synchronisation control. It might be an expensive operation.”
And there we have it. By calling this method with "true, true", they triggered thread dumps, including monitors, that froze the system for several seconds (as the javadoc warns about). By not gathering the monitor information, the problem went away (apart from the slightly long parallel GC pauses, but they seemed fine with that)
In summary, this debugging was made possible by the JFR events. Without them, it would have taken several iterations back and forth to find the right logging to capture that the long pauses was caused by thread dumps, and still more iterations to find what thread triggered the thread dumps. Now, all we needed was the data from one of the runs, taken the first time it failed. A great example showing the power of JFR!