Thursday Dec 14, 2006

Java HotSpot: load the VM from a non-primordial thread and effects on stack and heap limits

Part I: Launching the HotSpot VM from a non-primordial thread.

A primordial thread is the first thread created by the Operating System kernel when a process is created.

In versions prior to Java SE 6, a user application has little or no control over the attributes of the primordial thread: once the thread is created, its attributes cannot be modified. Launching HotSpot in the primordial thread therefore poses several issues; see BugId 6316197.

HotSpot requires each thread to be set up correctly with respect to its stack size (which may be specified by the user on the command line), its stackaddr (the address at which the stack starts), and its stack guard size. On Unix, the stack size of the primordial thread can be set from the shell (limit or ulimit); on Windows, the PE header of the executable must be modified using special tools.

To circumvent these issues, the java launcher (and the javaw launcher on Windows) in Java SE 6 launches the HotSpot VM from a non-primordial thread. The following stack trace on Solaris shows the primordial thread. Notice that it is waiting in thr_join to rejoin the new thread, which happens when the process exits; the HotSpot VM is created and started from the ContinueInNewThread method.

6459:   /usr/java/bin/java Sleepy
----------------- lwp# 1 / thread# 1 --------------------
ff2c0fbc lwp_wait (2, ffbfe8cc)
ff2bc9bc _thrp_join (2, 0, ffbfe990, 1, ffbfe8cc, ff2ecbc0) + 34
ff2bcb28 thr_join (2, 0, ffbfe990, ffbfea20, 0, 0) + 10
00018a04 ContinueInNewThread (124c0, 0, 80000, ffbfea20, fffe7ccc, 0) + 30
00012480 main (18000, 2ac40, 10000, 2b1c8, 44c, 10001) + eb0
000111a0 _start (0, 0, 0, 0, 0, 0) + 108







JNI applications that use custom launchers can apply the same strategy, as this simple example illustrates:

#include <stdio.h>
#include <stdlib.h>
#include <jni.h>
#include <sys/types.h>
#include <unistd.h>

#include <pthread.h>

JavaVM* jvm;

/* Create the VM in the calling thread; returns NULL on failure. */
JNIEnv* create_vm() {
    JNIEnv* env;
    JavaVMInitArgs args;
    JavaVMOption options[1];

    args.version = JNI_VERSION_1_4;
    args.nOptions = 1;
    options[0].optionString = "-Djava.class.path=.";
    args.options = options;
    args.ignoreUnrecognized = JNI_FALSE;

    if (JNI_CreateJavaVM(&jvm, (void **)&env, &args) != JNI_OK) return NULL;
    return env;
}

/* Call HelloWorld.main with the process id as its only argument. */
void invoke_class(JNIEnv* env) {
    jclass helloWorldClass;
    jmethodID mainMethod;
    jobjectArray applicationArgs;
    jstring applicationArg0;
    char buf[128];

    sprintf(buf, "%d", (int)getpid());

    helloWorldClass = (*env)->FindClass(env, "HelloWorld");

    mainMethod = (*env)->GetStaticMethodID(env, helloWorldClass, "main", "([Ljava/lang/String;)V");

    applicationArgs = (*env)->NewObjectArray(env, 1, (*env)->FindClass(env, "java/lang/String"), NULL);
    applicationArg0 = (*env)->NewStringUTF(env, buf);

    (*env)->SetObjectArrayElement(env, applicationArgs, 0, applicationArg0);
    (*env)->CallStaticVoidMethod(env, helloWorldClass, mainMethod, applicationArgs);
}

// VM Worker Thread
void* dowork(void* args) {
    JNIEnv* env = create_vm();
    if (env == NULL) exit(1);
    invoke_class( env );
    // Unload the VM
    if (jvm == NULL) exit(-1);

    int retval = (*jvm)->DetachCurrentThread(jvm);
    if (retval != 0) exit(2);

    retval = (*jvm)->DestroyJavaVM(jvm);
    if (retval != 0) exit(3);
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t tid;
    void* status;

    // Create a new thread and launch the vm in that thread
    pthread_create(&tid, NULL, dowork, NULL);

    // Make the primordial wait until the VM worker thread exits
    pthread_join(tid, &status);
    return 0;
}

Ta-dah, easier done than said!

Part II: The Java HotSpot Stack and Heap sizes

Several customers have asked me what sizes can be set for the heap and the stack on the command line.

This got me intrigued! After some digging, I found that a lot of factors limit these settings: the operating system, the available memory, Ergonomics, and various others. So I wrote a little Java program which basically interpolates the minimum and maximum values (it gave me an excuse to use "KoolBeans", aka NetBeans, to quote my friend RK, a committed NetBeans enthusiast).
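The program itself is not shown in this entry. As a minimal sketch of one way such a probe could work (not necessarily what the original did), the following bisects over candidate -Xmx values and, for each candidate, launches a child VM to see whether it starts. The class and method names are hypothetical, the search bounds are arbitrary, the sketch assumes that once a size is rejected all larger sizes are rejected too, and Redirect.DISCARD needs Java 9 or later.

```java
import java.util.function.LongPredicate;

/** Hypothetical helper: finds the largest value a predicate accepts by bisection. */
final class LimitProbe {

    /**
     * Returns the largest value in [lo, hi] for which accepted.test(value) is true,
     * assuming it is true at lo and, once false, stays false for all larger values.
     */
    static long largestAccepted(long lo, long hi, LongPredicate accepted) {
        while (lo < hi) {
            long mid = lo + (hi - lo + 1) / 2;   // round up so the loop always makes progress
            if (accepted.test(mid)) {
                lo = mid;                        // mid works, the answer is mid or larger
            } else {
                hi = mid - 1;                    // mid fails, the answer is smaller
            }
        }
        return lo;
    }

    /** Launches "java -Xmx<mb>m -version" and reports whether the child VM started. */
    static boolean vmStartsWithMaxHeap(long mb) {
        String java = System.getProperty("java.home") + "/bin/java";
        try {
            Process p = new ProcessBuilder(java, "-Xmx" + mb + "m", "-version")
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return p.waitFor() == 0;
        } catch (Exception e) {
            return false;                        // treat any launch failure as "not accepted"
        }
    }

    public static void main(String[] args) {
        // Search between 2 MB and 1 TB; both bounds are illustration values only.
        long maxMb = largestAccepted(2, 1L << 20, LimitProbe::vmStartsWithMaxHeap);
        System.out.println("Largest -Xmx accepted on this machine: " + maxMb + "m");
    }
}
```

The same bisection can be pointed at -Xms or -Xss by swapping the predicate; the result depends on the machine, the shell limits and whatever else is running at the time.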

The Java launcher supports -Xms, -Xmx and -Xss. Briefly, -Xms sets the minimum (initial) heap size, -Xmx sets the maximum heap size, and -Xss sets the thread stack size; please see the man pages for further details. Typically the values for these options are chosen at run time by a feature called Ergonomics, which has been built into the Java launcher since Java SE 5. Based on the available physical and/or virtual memory and the number of processors, it sets the above values and also chooses subsystems such as the garbage collector and the JIT compiler.
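To see what a running VM actually ended up with, whether from Ergonomics or from explicit options, the standard java.lang.Runtime methods can be queried. This is a small sketch; the class name ErgonomicsReport and the toMb helper are made up for illustration.

```java
/** Prints the heap limits and processor count the running VM ended up with. */
final class ErgonomicsReport {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("processors       : " + rt.availableProcessors());
        System.out.println("max heap (MB)    : " + toMb(rt.maxMemory()));   // roughly follows -Xmx
        System.out.println("current heap (MB): " + toMb(rt.totalMemory())); // starts near -Xms
        System.out.println("free heap (MB)   : " + toMb(rt.freeMemory()));
    }
}
```

Running it once with no options and once with, say, -Xms64m -Xmx256m shows how the explicit settings override the defaults; the exact numbers reported vary by VM and collector.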

In most cases the values for -Xms, -Xmx and -Xss have been carefully chosen for typical applications, to give the best out-of-the-box experience. There may be occasions to override these values for specific application needs, and in such cases the acceptable values on a given platform need to be known. For applications that may be redeployed, care must be taken to ensure that the values chosen for one system also work on all the desired target systems. Here is a table of empirical limits based on my experiments; in some cases your mileage may vary. Why?

Well, it must be noted that certain values, such as the -Xss maximum, are severely restricted by the operating system, the limits set by the shell, and the physical and virtual memory actually available on the system. For -Xmx, even when physical and virtual memory are available, the VM may not be able to use all of the free memory, because shared objects or DLLs loaded into the process address space fragment it. For instance, on Windows a heap of about 1.6 GB (for example -Xmx1600m) will probably not work with the Java Plugin, because of the overhead of the web browser and the DLLs it loads into the process address space.
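One rough way to observe stack limits from inside Java is to request a stack size for a single thread and count how deep a trivial recursion gets before StackOverflowError. The sketch below uses the four-argument Thread constructor, whose stackSize argument is only a hint that some platforms round or ignore; the class name and the chosen sizes are illustrative assumptions.

```java
/** Measures how deep a trivial recursion gets in a thread with a requested stack size. */
final class StackDepthProbe {

    private static int depth;   // written only by the one probe thread that is running

    private static void recurse() {
        depth++;
        recurse();
    }

    /** Runs the recursion in a new thread and returns the depth reached before overflow. */
    static int maxDepth(long requestedStackBytes) {
        final int[] result = new int[1];
        Thread t = new Thread(null, () -> {
            depth = 0;
            try {
                recurse();
            } catch (StackOverflowError expected) {
                result[0] = depth;   // record how far we got
            }
        }, "stack-probe", requestedStackBytes);
        t.start();
        try {
            t.join();                // join makes the probe thread's writes visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        for (long kb : new long[] {256, 1024, 4096}) {
            System.out.println(kb + " KB requested -> depth " + maxDepth(kb * 1024));
        }
    }
}
```

The depths are only comparable within one run on one machine, since frame sizes change as the JIT compiles the method.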











[Table of empirical limits not recovered. The four surviving cells all read "Limited by Virtual Memory".]


So if you need a very large heap, you should seriously consider using a 64-bit platform.

Tuesday Sep 27, 2005

We take Java performance very seriously, paying attention to details.

At Sun's Java HotSpot development group we continually look into ways of improving the performance of the HotSpot VM and the JDK. SPECjbb2000 is an industry-standard benchmark that simulates order processing and is typically used to measure and evaluate Java servers. It was noted that SPECjbb2000 creates billions of Date objects, presumably to timestamp transactions. The idea therefore came about to improve the System.currentTimeMillis method and thereby improve the benchmark score on all platforms.

Using a faster javaTimeMillis implementation in the VM.

gettimeofday(3C) vs. time(2)
The method System.currentTimeMillis calls into the VM's javaTimeMillis, which in turn calls the OS's gettimeofday(3C) on Solaris and Linux. A micro benchmark was run to characterize the performance of gettimeofday(3C) and time(2) on identical systems (Intel P4 HT, 800MHz, 256 cache, 512MB) running Solaris 10 x86 and Linux - SMP RH AS4 x86.

Function            Linux: operation time (ms)    Solaris: operation time (ms)
gettimeofday(3C)    0.8216                        0.5347
time(2)             0.7418                        0.8572

It can be inferred from the above table that gettimeofday(3C) performs best on Solaris and that time(2) is only marginally better on Linux; therefore swapping these calls would not yield any better performance.
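The native micro benchmark is not included in this entry. A rough Java-level analogue, which measures the cost of System.currentTimeMillis as a whole rather than the underlying system call, might look like the sketch below. The class name and iteration counts are assumptions, and a loop like this gives only a ballpark figure.

```java
/** Rough estimate of the average cost of System.currentTimeMillis() in nanoseconds. */
final class ClockCallCost {

    static double nanosPerCall(int calls) {
        long sink = 0;                          // accumulate results so the calls are not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.currentTimeMillis();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");   // keeps 'sink' observable to the compiler
        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        nanosPerCall(1_000_000);                // warm-up pass so the JIT compiles the loop
        System.out.printf("System.currentTimeMillis(): %.1f ns per call%n",
                nanosPerCall(10_000_000));
    }
}
```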

Using rdtsc

The rdtsc (Read Time Stamp Counter) instruction on Intel processors appears to be very fast; however, there are several risk factors associated with using it. The processor keeps count of every machine tick since the machine started, and from the cumulative ticks the time can be computed as time = machine ticks / processor frequency. This sounds great; however, a large SMP system may have several processors, and there may be a skew between their rdtsc values, which makes the task of calculating the time very daunting. Additionally, many x86-based processors can be switched into a power-conserving (low-frequency) mode, which makes the time calculation extremely challenging.
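To make the frequency problem concrete, here is the time = ticks / frequency arithmetic with made-up numbers. The tick count and both clock rates are hypothetical, and the sketch assumes the counter advances at the current core frequency, as the discussion above does.

```java
/** Converts a tick count to elapsed time, given an assumed constant tick frequency. */
final class TickMath {

    /** time = ticks / frequency, returned in milliseconds. */
    static double ticksToMillis(long ticks, double frequencyHz) {
        return ticks / frequencyHz * 1000.0;
    }

    public static void main(String[] args) {
        long ticks = 2_800_000_000L;   // hypothetical counter reading
        double nominalHz = 2.8e9;      // hypothetical nominal clock rate
        double throttledHz = 1.4e9;    // the same CPU in a low-frequency mode

        // At the nominal rate these ticks correspond to one second.
        System.out.println(ticksToMillis(ticks, nominalHz));    // prints 1000.0
        // If the CPU really ran at 1.4 GHz, the same ticks took two seconds,
        // so assuming the nominal rate reports only half the elapsed time.
        System.out.println(ticksToMillis(ticks, throttledHz));  // prints 2000.0
    }
}
```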

Since rdtsc is Pentium specific, and given the risk factors noted above, this approach is not feasible.

Caching the date

A safer approach is to cache the date value used in the Date() constructor (typically a Date object requires only a coarse date value), while the value returned by System.currentTimeMillis would remain as accurate as ever. To confirm the performance improvement, a constant value was assigned to the date field, and it was noted that a 3% improvement may be achievable. However, it was required that the date values returned by System.currentTimeMillis and those held by Date objects be monotonic. To clarify this, suppose we run the following code in multiple threads simultaneously:

long t0 = System.currentTimeMillis();
long t1 = (new Date()).getTime();
long t2 = System.currentTimeMillis();

Then t2 >= t1 >= t0 must always be true. Thus two caches are required: one, called "clockTM", to hold the date value returned by System.currentTimeMillis, and the other, called "clockCache". Using these, several approaches were tried:

1. Using the Watcher Thread: The HotSpot VM has a native watcher thread (simulating a timer interrupt) that wakes up every 50ms. In this scheme, the watcher thread stores the date value into the clockCache, and the System.currentTimeMillis method updates the clockTM. The clockCache and the clockTM are defined in the java.util.Date class as static and volatile, and are used to create a Date object. The performance improved by less than 1% at most, hence this approach was discarded.

2. Using the Unsafe mechanism: The clockTM and clockCache were allocated natively and passed in through the JNI interfaces into the VM; sun.misc.Unsafe.getLongVolatile() was then used to retrieve the values. This too had dismal results with respect to SPECjbb2000 performance.

3. Using Java Threading: The last approach was to throttle Date object construction: if the rate of Date object creation exceeded a threshold value, a Thread would be started to update the cache asynchronously. Though this yielded good improvements of 3-4%, the clockTM updater degrades the overall performance by 0.5, due to cache-line bouncing, i.e. each native thread storing to the clockTM invalidates the cache line in the other processors' caches, leading to frequent cache reloads.
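None of the experimental code is shown here, so the following is only a sketch of how the two caches could cooperate to preserve t2 >= t1 >= t0. It is not the code that was tried inside java.util.Date or the VM. The class CachedClock, its method names and the injected LongSupplier time source are all assumptions made for illustration, and the ordering argument relies on the time source itself never going backwards.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Sketch of a two-cache clock: clockTM tracks direct reads, clockCache is refreshed periodically. */
final class CachedClock {
    private final LongSupplier source;                       // the real clock, e.g. System::currentTimeMillis
    private final AtomicLong clockTM = new AtomicLong();     // newest value handed out by currentTimeMillis()
    private final AtomicLong clockCache = new AtomicLong();  // value stored by the periodic refresher

    CachedClock(LongSupplier source) {
        this.source = source;
        refresh();
    }

    /** Meant to be called by a timer or watcher thread, e.g. every 50 ms. */
    void refresh() {
        clockCache.accumulateAndGet(source.getAsLong(), Math::max);
    }

    /** Accurate read; also records the value so cached reads never fall behind it. */
    long currentTimeMillis() {
        return clockTM.accumulateAndGet(source.getAsLong(), Math::max);
    }

    /** Coarse read for Date construction: never older than the last accurate read. */
    long cachedMillis() {
        return Math.max(clockCache.get(), clockTM.get());
    }

    public static void main(String[] args) {
        CachedClock clock = new CachedClock(System::currentTimeMillis);
        long t0 = clock.currentTimeMillis();
        long t1 = clock.cachedMillis();
        long t2 = clock.currentTimeMillis();
        System.out.println("ordering holds: " + (t0 <= t1 && t1 <= t2));
    }
}
```

Note that every accurate read writes to the shared clockTM, which is exactly the store that causes the cache-line bouncing described in the third approach; the sketch shows the invariant, not a way around that cost.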

This is an example of our continual efforts to improve performance; however, not all of these efforts prove to be useful. We do gain many insights that help improve associated features in the future.



