Performance tools and realtime Java
By Roland Westrelin on Apr 21, 2008
Performance is usually expressed in terms of mean execution time (often referred to as throughput). By contrast, in the realtime world, it is defined by the worst-case execution time. If an application must respond with a latency of 500 microseconds when a stop button is pressed, then the worst-case time it takes for the application to detect that the button is pressed and take some action is what defines performance. This time must be less than 500 microseconds.
When a developer studies realtime performance, he often has to focus on a few data points and understand why a delay is observed for them: if from time to time the button is pressed but the application fails to respond in less than 500 microseconds, then the developer needs to study the chain of events for those few outliers. Performance tools usually work by taking samples (oprofile on Linux, for instance). Such statistical tools are useful for throughput issues but don't help much in the realtime world: a statistical tool is unlikely to catch a rare event.
One tool that is very efficient for realtime performance investigation on Solaris, in particular with Java RTS, is DTrace. It lets the developer instrument the full chain of events and detect when something unexpected happens. DTrace allows the developer to observe:
- system events. If the application is delayed after a press of the stop button, it could be that the application thread is descheduled. The sched provider will let you detect when that happens.
- user-level activity. Once you know your application thread lost the cpu, and if it was not preempted, you are likely to want to know what it was doing at that time that made it block. User process tracing is here to help.
- Java programs. If your application is a Java application, then you'll want a Java stack at the point where it lost the cpu. jstack() does that for you.
- JVM events. Just as Solaris has static probes to observe scheduling activity, Java RTS has static probes to observe Java RTS-specific activities.
- instrumented Java applications. Java RTS allows the application programmer to insert DTrace probes in the Java RTS application itself.
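As a sketch of how the first and third points combine, here is a small D script (assuming a Solaris box and a Java process traced with dtrace -p <pid>, so that $target holds its pid) that reports when a thread of the traced process is taken off the cpu and prints its Java stack at that point:

```d
/* Sketch: fire whenever a thread of the traced process is
   descheduled, and capture the Java stack at that point.
   sched:::off-cpu and jstack() are standard DTrace features. */
sched:::off-cpu
/pid == $target/
{
        printf("thread %d off cpu at %d ns\n", tid, timestamp);
        jstack(50);     /* Java + native frames, up to 50 deep */
}
```

From there, correlating the timestamps with the latency outliers tells you whether the descheduling is what blew the deadline.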
What about Linux? Linux has systemtap, which is meant to be for Linux what DTrace is for Solaris. Of the five areas listed above where DTrace helps realtime performance investigation, systemtap covers only the first as of today. User-level support, while planned, does not appear to be available.
I had the opportunity to use systemtap to study a few realtime performance issues on Linux and had some success with it. However, the lack of user-level support makes things challenging. On a Linux kernel with the RT patches, I found that it can be very easy to make the system unstable with simple systemtap scripts. Systemtap does not have static probes the way Solaris does. It uses tapsets instead: a way to alias a specific event (the task gets the cpu: probe scheduler.cpu_on) to specific code-level locations (the kernel function finish_task_switch). I found that the tapsets provided with systemtap are often out of sync with the kernel you are running. I guess we can expect these issues to improve as systemtap matures.
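For illustration, a minimal script using that tapset alias might look like this (assuming the scheduler tapset shipped with your systemtap actually matches your running kernel):

```systemtap
# Count how many times each process gets the cpu, via the
# scheduler.cpu_on alias mentioned above, and dump the top
# ten after ten seconds.
global oncpu

probe scheduler.cpu_on {
        oncpu[execname()]++
}

probe timer.s(10) {
        foreach (name in oncpu- limit 10)
                printf("%s: %d\n", name, oncpu[name])
        exit()
}
```

If the tapset is out of sync with the kernel, a script like this fails at translation time because the aliased function cannot be found.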
Another issue I hit is that on an RT kernel, systemtap scripts can induce a large latency in the application (commonly several tens of milliseconds). I found that this is caused by the way systemtap uses locks to protect every global variable in the systemtap script. Whenever a global variable is read in a probe body, it is protected by the read side of a read/write lock; whenever it is written to, it is protected by the write side. Systemtap works by converting the script to C, compiling the C code into a kernel module, and loading that module. Inspecting the C code generated by systemtap for one of my scripts made it clear that lock acquisition is one of the first things done when the probe code runs. This should not be much of a problem as long as a variable is not commonly written to. On an RT kernel, however, because of priority inheritance, read-lock and write-lock operations all come down to the same exclusive lock operations. So even if writes to global variables are uncommon in the systemtap script, contention on the locks happens and greatly disrupts the application being traced, to the point where systemtap is unusable because it causes a higher latency than the problem being debugged.
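To make the problem concrete, even a script as small as this one (hypothetical probe points, for illustration) gets every probe body wrapped in lock operations on the read/write lock guarding the global:

```systemtap
# One global is enough: the generated C takes the write side of
# the lock around the increment and the read side around the
# printf, on every single firing of the probes.
global hits

probe kernel.function("finish_task_switch") {
        hits++                           # write -> write lock
}

probe timer.s(5) {
        printf("switches: %d\n", hits)   # read -> read lock
        exit()
}
```

On a stock kernel the read side is cheap and shared; on an RT kernel both sides contend on the same exclusive lock, which is where the tens of milliseconds come from.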
I tried several workarounds for the global-variable lock problem, including first converting the systemtap script to C code with the stap command, then editing the C code to remove or move the lock operations I knew were useless, and finally building and loading the kernel module. That's a lot of work to do by hand for every change to the script. So I ended up switching to systemtap's guru mode and embedding C code in the systemtap script. That proved very effective at removing the lock contention problem but defeats many of the advantages of using a framework such as systemtap.
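The guru-mode variant looked roughly like this (a sketch in the systemtap syntax of that era; stap -g enables guru mode, and the counter lives in embedded C, outside the translator's lock machinery):

```systemtap
%{
/* Plain C state: the translator generates no locks for it.
   Acceptable here only because lost updates to a statistics
   counter do not matter. */
static unsigned long switches = 0;
%}

function count_switch() %{
        switches++;
%}

function get_switches:long() %{
        THIS->__retvalue = switches;
%}

probe kernel.function("finish_task_switch") {
        count_switch()
}

probe end {
        printf("switches: %d\n", get_switches())
}
```

Of course, with guru mode systemtap no longer checks the embedded C for safety, so a mistake here can crash the kernel rather than abort the script.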
Beyond systemtap, kernel markers look promising, as they should bring real static probe points to the Linux kernel. The framework has just been integrated in kernel 2.6.24, and probes still need to be added.
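For reference, a marker in 2.6.24 is placed in kernel code with the trace_mark() macro from linux/marker.h; a probe module then connects to it by name. A hypothetical marker in the scheduler (the marker name and arguments are made up for illustration) would look like:

```c
#include <linux/marker.h>

/* A named static probe point with a printf-style format
   describing its arguments. It compiles down to almost
   nothing until a probe is actually connected to it. */
trace_mark(scheduler_switch, "prev %d next %d",
           prev->pid, next->pid);
```

Unlike a tapset alias on a function name, such a marker survives kernel refactoring, which is exactly the out-of-sync problem described above.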