By arnaud on Sep 29, 2009
I was recently faced with the challenge to track down and eliminate outliers from a customer's traffic and I had to come up with some some of tool to help in diagnosing where these long response time transactions originated from. Not really rocket science -hardly anything IS rocket science, even rocket science isn't all that complicated, but I digress- yet nothing that I had in the tool box would quite serve the purpose. So I sat down and wrote a tool that would allow me to visually correlate events in real time. At least that was the idea.
Bird's Eye View
This little tool is only meant for investigations and we are working on delivering something better and more polished (code name Gualicho, shhhhhhh) for production monitoring. The tool I am describing in this article simply correlates the server throughput, peak etime, I/O, CPU, Network and Garbage Collection activity (for OpenDS). It is all presented in a sliding line metric, stacked on top of each other, making visual identification and correlation easy. Later on I will adapt the tool to work on DPS, since it is the other product I like to fine tune for my customers.
When pointed to the access log and the GC log, here is the text output you get. There is one line per second that is displayed with the aggregated information collected from the access log and garbage collection as well as kstats for network, I/O, CPU.
If you looked at it closely, I represented the garbage collection in % which is somewhat unsual but after debating on how to make this metric available, I decided that all I was interested was a relative measure of the time spent in stop-the-world GC operations over the time the application itself is running. As I will show in the snapshot below, this is quite effective to spot correlations with high etimes in most cases. To generate this output in the GC log, all you have to do is add the following to your set of JAVA_ARGS for start-ds.java-args in /path/to/OpenDS/config/java.properties:
-Xloggc:/data/OpenDS/logs/gc.log -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
And then my GUI will show something like:
Don't hesitate to zoom in on this snapshot. The image is barely legible due to blog formatting constraints.
Excuse me if I have not waited 7+ days to take the snapshot for this article but I think this simple snap serves the purpose. You can see that most of the time we spend 2% of the time blocked in GC but sometimes we have spikes up to 8% and when this happens, even though it has little impact on the overall throughput over one second, the peak etime suddenly jumps to 50ms. I will describe in another article what we can do to mitigate this issue, I simply wanted to share this simple tool here since I think it can serve some of our expert community.