We take Java performance very seriously, paying attention to details.
By ksrini on Sep 27, 2005
At Sun's Java Hotspot Development group we continually look into various ways of improving the performance of the Hotspot VM and the JDK. SpecJBB2000, an industry standard benchmark, simulates order processing and is typically used to measure and evaluate Java servers. It was noted that SpecJBB2000 creates billions of Date objects, presumably to timestamp transactions. Therefore the idea came about to improve the System.currentTimeMillis method and thereby improve the benchmark score on all platforms.
Using a faster javaTimeMillis implementation in the VM.
gettimeofday(3C) vs. time(2)
The method System.currentTimeMillis calls into the VM's javaTimeMillis, which in turn calls the OS's gettimeofday(3C) on Solaris and Linux. A micro benchmark was performed to characterize the performance of gettimeofday(3C) and time(2) using identical systems (Intel P4 HT, 800MHz, 256 cache, 512MB) running Solaris 10 x86 and Linux - SMP RH AS4 x86.
|Function||Linux - operation time in milliseconds||Solaris - operation |
It can be inferred from the above table that gettimeofday(3C) performs the best overall and that time(2) is only marginally better on Linux; therefore swapping these calls would not yield any better performance.
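The measurements above were taken against the native calls. As a rough illustration of the same kind of measurement from the Java side, here is a minimal sketch that times repeated calls to System.currentTimeMillis. The class name ClockCallBenchmark and its helper method are hypothetical and not part of the original experiment, and a simple timing loop like this is subject to JIT and warm-up effects, so treat the numbers as indicative only.

```java
import java.util.function.LongSupplier;

// Hypothetical helper for illustration; not the micro benchmark used in the post.
final class ClockCallBenchmark {
    // Results are published here so the JIT cannot discard the timed calls.
    static volatile long sink;

    /** Returns the average cost of one clock call in nanoseconds. */
    static double averageNanosPerCall(LongSupplier clock, int iterations) {
        if (iterations <= 0) {
            throw new IllegalArgumentException("iterations must be positive");
        }
        long acc = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            acc += clock.getAsLong();
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        int iterations = 5_000_000;
        // First pass warms up the JIT; only the second pass is reported.
        averageNanosPerCall(System::currentTimeMillis, iterations);
        double nanos = averageNanosPerCall(System::currentTimeMillis, iterations);
        System.out.printf("System.currentTimeMillis: %.1f ns per call%n", nanos);
    }
}
```

The clock is passed in as a LongSupplier so that the same loop could time any other time source for comparison.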
Another candidate is the rdtsc (Read Time Stamp Counter) operation on Intel processors. This appears to be very fast; however, there are several risk factors associated with using rdtsc. Intel processors keep track of every machine tick since the start of the machine. Using the cumulative ticks, the time can be computed as time = machine ticks / processor frequency. This sounds great; however, a large SMP system may have several processors and there may be a skew in the rdtsc values between them, making the task of calculating the time very daunting. Additionally, many x86 based processors can be switched into a power conserving (low frequency) mode, which can make the task of time calculation extremely challenging.
Since rdtsc is Pentium specific, and given the risk factors noted above, this approach is not feasible.
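To make the frequency risk concrete, the following small sketch applies the formula time = machine ticks / processor frequency. The class TickTime and the sample frequencies are illustrative assumptions only; Java has no direct access to rdtsc, so the tick counts here are plain numbers.

```java
// Hypothetical helper illustrating time = ticks / frequency; it does not read rdtsc.
final class TickTime {
    /** Converts a tick count to milliseconds for a processor running at frequencyHz. */
    static double ticksToMillis(long ticks, double frequencyHz) {
        if (frequencyHz <= 0) {
            throw new IllegalArgumentException("frequency must be positive");
        }
        return ticks / frequencyHz * 1000.0;
    }

    public static void main(String[] args) {
        long ticks = 800_000_000L;
        // The same tick count maps to different elapsed times at different clock rates.
        System.out.println(ticksToMillis(ticks, 800e6)); // full speed: 1000.0 ms
        System.out.println(ticksToMillis(ticks, 400e6)); // half speed: 2000.0 ms
    }
}
```

If the processor drops to a lower frequency while the calculation still assumes the nominal one, the computed time is off by the ratio of the two frequencies, which is why the frequency would have to be tracked alongside the tick count.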
Caching the date
A safer approach is to cache the date value in the Date() constructor (typically the Date object requires only a coarse date value), while the value returned by System.currentTimeMillis would still be as accurate as ever. In order to confirm the performance improvement, a constant date value was assigned to the date field, and it was noted that a 3% improvement may be achievable. However, it was required that the date values returned by System.currentTimeMillis and those held by the Date object be monotonic. To clarify this, suppose we run the following code in multiple threads simultaneously:
long t0 = System.currentTimeMillis();
long t1 = new Date().getTime();
long t2 = System.currentTimeMillis();
Then t2 >= t1 >= t0 must always be true. Thus two caches are required: one called "clockTM", which holds the date value returned by System.currentTimeMillis, and the other called "clockCache". Using these, several approaches were tried:
1. Using the Watcher Thread: The Hotspot VM has a native watcher thread (simulating a timer interrupt) that wakes up every 50ms. In this scheme, the watcher thread stores the date value into the clockCache and the System.currentTimeMillis method updates the clockTM. The clockCache and the clockTM are defined in the java.util.Date class as static and volatile, and are used to create a Date object. The performance did not improve by a big factor; the gain was less than 1% at most, hence this approach was discarded.
2. Using the Unsafe mechanism: The clockTM and clockCache were allocated natively and passed through the JNI interfaces into the VM, then sun.misc.Unsafe.getLongVolatile() was used to retrieve the values. This too had dismal results with respect to SpecJBB2000 performance.
3. Using Java Threading: The last approach is to throttle the Date object construction: if the creation of Date objects exceeds a threshold value, a Thread is started to update the cache asynchronously. Though this yielded good improvements of 3-4%, the clockTM updater degrades the overall performance by 0.5 due to cache-line bouncing, i.e. each native thread storing to the clockTM leads to invalidation of the cache line, leading to frequent cache restores.
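The two-cache idea and the ordering requirement can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code that was tried in the VM or in java.util.Date: the class CachedClockSketch, its methods, and the injected LongSupplier time source are hypothetical, and only the field names clockTM and clockCache come from the description above. The sketch also assumes the underlying time source never goes backwards.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch of the two-cache scheme; not the actual VM or JDK code.
final class CachedClockSketch {
    private final LongSupplier source;                       // the expensive, precise clock
    private final AtomicLong clockTM = new AtomicLong(Long.MIN_VALUE); // latest precise value handed out
    private volatile long clockCache = Long.MIN_VALUE;       // coarse value refreshed periodically

    CachedClockSketch(LongSupplier source) {
        this.source = source;
        refresh();
    }

    /** Precise read: always consults the source and records the value in clockTM. */
    long currentTimeMillis() {
        long now = source.getAsLong();
        clockTM.accumulateAndGet(now, Math::max); // shared write: the source of cache-line bouncing
        return now;
    }

    /** Coarse read, as a Date constructor might use: never older than the last precise read. */
    long cachedTimeMillis() {
        return Math.max(clockCache, clockTM.get());
    }

    /** Called periodically by an updater thread to refresh the coarse value. */
    void refresh() {
        clockCache = source.getAsLong();
    }

    /** Runs the t0 <= t1 <= t2 check from several threads; true if it never failed. */
    static boolean invariantHolds(CachedClockSketch clock, int threads, int iterations) {
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < iterations; n++) {
                    long t0 = clock.currentTimeMillis();
                    long t1 = clock.cachedTimeMillis();
                    long t2 = clock.currentTimeMillis();
                    if (!(t0 <= t1 && t1 <= t2)) {
                        ok.set(false);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return ok.get();
    }

    public static void main(String[] args) {
        CachedClockSketch clock = new CachedClockSketch(System::currentTimeMillis);
        // Updater thread refreshing the coarse value every 50ms, mirroring the watcher interval.
        Thread updater = new Thread(() -> {
            while (true) {
                clock.refresh();
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        updater.setDaemon(true);
        updater.start();
        System.out.println("invariant held: " + invariantHolds(clock, 4, 100_000));
    }
}
```

Note that every precise read writes to the shared clockTM, so in this sketch the coarse read is cheap while the precise read pays for a contended store. That is the same trade-off described in the third approach.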
This is an example of our continual efforts to improve performance; however, not all of these efforts prove to be useful. We do gain many insights that help improve associated features in the future.