Java vs C: A Brief Performance Comparison
By user12612225 on Nov 26, 2008
There is a widespread perception that Java still lags behind C in performance for numerically intensive applications. However, the last two major releases of the Hotspot implementation of the Java Platform, Standard Edition (Java SE 5 and Java SE 6) have incorporated a large number of performance improvements that have closed the gap and, in some benchmarks intended to measure numerical performance, given Hotspot the edge.
Below we compare the performance of Hotspot and gcc on numerically intensive applications using the SciMark 2.0 benchmark. We use this benchmark because it provides both a Java and a C version. SciMark 2.0 is a benchmark from the National Institute of Standards and Technology (NIST) for scientific and numerical computing that is widely used to measure CPU performance. It measures several computational kernels and reports a composite score in approximate Mflops (millions of floating-point operations per second); higher scores are better. It consists of five computational kernels:
FFT – performs a complex 1D fast Fourier transform
SOR – solves the Laplace equation in 2D by successive over-relaxation
MC – computes π by Monte Carlo integration
MV – performs sparse matrix-vector multiplication
LU – computes the LU factorization of a dense N x N matrix
These kernels represent the types of calculations that commonly occur in numerically intensive scientific applications. Each kernel except MC has small and large problem sizes. The small problems are designed to test raw CPU performance and the effectiveness of the cache hierarchy. The large problems stress the memory subsystem because they do not fit in cache. The MC kernel only uses scalars so there is no distinction between the small and large problems.
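To make the kernel descriptions more concrete, here is a minimal sketch of the idea behind the MC kernel: sample points in the unit square and count how many fall inside the quarter circle. It is not SciMark's code; the class name `MonteCarloPi`, the `estimate` method and the use of `java.util.Random` with a fixed seed are assumptions made for illustration.

```java
import java.util.Random;

class MonteCarloPi {
    // Estimates pi from the fraction of random points in the unit square
    // that land inside the quarter circle of radius 1.
    // Hypothetical helper, not part of SciMark.
    static double estimate(int samples, long seed) {
        Random rng = new Random(seed);   // assumption: SciMark ships its own generator
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        // Quarter-circle area is pi/4, so scale the hit ratio by 4.
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(estimate(1000000, 42L));
    }
}
```

The loop works on a handful of scalars only, which is why the small and large problem sizes make no difference for this kernel.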
The numbers shown below were collected on an Intel Harpertown (2 sockets, 4 cores per socket) machine running Solaris 5.11:
CPU Intel® Xeon® Processor (3.1 GHz Xeon 8 MB L2 cache)
Memory 16 GB DDR2
Operating system: Solaris 5.11 (64-bit)
GNU C compiler gcc version: 3.4.3 (csl-sol210-3_420050802)
gcc Compilation flags: -O3 -march=nocona -ffast-math -mfpmath=sse -m64
JDK version: Hotspot version 6u6-p (Performance JDK) 64-Bit
Figure 1 below shows a SpiderWeb chart with the scores for the Small dataset, and Table 1 shows the same scores in tabular form. Figure 2 and Table 2 do the same for the Large dataset. On the SpiderWeb charts, green represents the gcc score, red the out-of-the-box JDK 6u6p score (BASE, no tuning) and blue the corresponding PEAK (tuned) JDK 6u6p score. Similarly, in each table the baseline is the gcc score, Result 1 is the out-of-the-box JDK 6u6p score (BASE, no tuning) and Result 2 is the PEAK (tuned) JDK 6u6p score.
As the charts and tables show, for both the Large and Small problems the JDK beats gcc on the overall (composite) score, even without any tuning. With the Small dataset, gcc scores better than perfJDK on both the Sparse and FFT workloads, but with the Large dataset the gap in those workloads is much smaller. For the Monte Carlo workload, gcc is better than the untuned performance JDK, but the tuned JDK beats the gcc score.
One could argue that the baseline C scores used here could be greatly improved with further compiler tuning or by using optimized and sometimes parallelized libraries, as is done in the document Optimizing SciMark 2.0 Using Intel® Software Products. (Incidentally, since I used an Intel-based machine to obtain the numbers shown below, I used the exact same baseline as that document.) But that requires changing the benchmark itself. Furthermore, those optimized libraries are usually tightly coupled to specific hardware and/or a specific system. By contrast, most of the performance optimizations on the Java side are incorporated into the virtual machine itself and do not require changes to the benchmark, which makes life easier for the programmer. In addition to those VM improvements, the Java platform also provides a concurrency library that is optimized for every platform where Hotspot is implemented. One could use this library to change the source code of the benchmark as well, but the significant difference is that the resulting code can be executed on any compatible Java virtual machine without any changes and even without recompiling, which is the old and sometimes forgotten advantage that Java has over natively compiled languages such as C.
So the obvious question is: what are the improvements that have allowed Java to catch up with C? First and foremost are the changes related to ergonomics, followed by the runtime and just-in-time compiler optimizations that have been, and are still being, added to Hotspot. Among the most important features added recently are:
Biased locking: this is a class of optimizations that improves uncontended synchronization performance by eliminating the atomic operations associated with the Java language's synchronization primitives. These optimizations rely on the property that not only are most monitors uncontended, they are locked by at most one thread during their lifetime. An object is "biased" toward the thread that first acquires its monitor via a monitorenter bytecode or a synchronized method invocation; subsequent monitor-related operations can be performed by that thread without using atomic operations, resulting in much better performance, particularly on multiprocessor machines. Locking attempts by threads other than the one toward which the object is "biased" cause a relatively expensive operation in which the bias is revoked. The benefit of eliminating atomic operations must exceed the penalty of revocation for this optimization to be profitable. Applications with substantial amounts of uncontended synchronization may attain significant speedups, while others with certain patterns of locking may see slowdowns. Biased locking is enabled by default in Java SE 6 and later.
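The kind of code this optimization targets can be sketched as follows: a synchronized object that only one thread ever locks. The class `UncontendedCounter` and its methods are hypothetical names for illustration. Whether a given JVM actually biases the lock depends on its version and settings; in Java SE 6 the behavior is controlled by the `-XX:+UseBiasedLocking` flag, and much newer JDKs have deprecated the feature, so treat this as an illustration of the code pattern only.

```java
class UncontendedCounter {
    private long value;

    // Every call acquires and releases the object's monitor.
    synchronized void increment() {
        value++;
    }

    synchronized long get() {
        return value;
    }

    // Only the calling thread ever locks the counter, so the monitor
    // is never contended: the case biased locking is designed for.
    static long run(int iterations) {
        UncontendedCounter counter = new UncontendedCounter();
        for (int i = 0; i < iterations; i++) {
            counter.increment();
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000000));
    }
}
```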
Lock coarsening: there are some patterns of locking where a lock is released and then reacquired within a piece of code, with no observable operations occurring in between. The lock coarsening optimization implemented in Hotspot eliminates the unlock and relock operations in those situations. It reduces the amount of synchronization work by enlarging an existing synchronized region. Doing this around a loop could cause a lock to be held for a long period of time, so the technique is only used on non-looping control flow. Lock coarsening is enabled by default in Java SE 6 and later.
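A typical candidate for this optimization is a run of back-to-back calls to synchronized methods on the same object, such as consecutive `StringBuffer.append` calls. The sketch below shows the pattern only; the class and method names are hypothetical, and whether the JIT compiler actually merges the four lock/unlock pairs into one is up to the JVM.

```java
class CoarseningCandidate {
    // StringBuffer.append is synchronized, so each call below locks and
    // unlocks the same buffer with nothing observable happening in between.
    // A JIT compiler may treat the four calls as one synchronized region.
    static void appendResult(StringBuffer out, String kernel, int score) {
        out.append("kernel=");
        out.append(kernel);
        out.append(" score=");
        out.append(score);
    }

    public static void main(String[] args) {
        StringBuffer out = new StringBuffer();
        appendResult(out, "FFT", 7);   // made-up values, not benchmark results
        System.out.println(out);
    }
}
```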
Adaptive spinning: this is an optimization technique in which threads attempting a contended synchronized enter operation use a two-phase spin-then-block strategy. It enables threads to avoid effects that hurt performance, such as context switching and repopulation of Translation Lookaside Buffers (TLBs). It is "adaptive" because the duration of the spin is determined by policy decisions based on factors such as the rate of success or failure of recent spin attempts on the same monitor and the state of the current lock owner.
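Hotspot applies this strategy inside the VM for `synchronized` code, so no application changes are needed. Purely as an illustration of the two phases, the sketch below emulates spin-then-block at the library level with `java.util.concurrent.locks.ReentrantLock`. The class `SpinThenBlock` is hypothetical, and its spin limit is a fixed constant rather than the adaptive policy described above.

```java
import java.util.concurrent.locks.ReentrantLock;

class SpinThenBlock {
    private final ReentrantLock lock = new ReentrantLock();
    private final int maxSpins;   // fixed limit; the VM adapts this at run time

    SpinThenBlock(int maxSpins) {
        this.maxSpins = maxSpins;
    }

    void acquire() {
        // Phase 1: spin, retrying a non-blocking acquire a bounded number of times.
        for (int i = 0; i < maxSpins; i++) {
            if (lock.tryLock()) {
                return;
            }
        }
        // Phase 2: give up spinning and block until the lock is free.
        lock.lock();
    }

    void release() {
        lock.unlock();
    }

    // Several threads increment a shared counter under the lock.
    static int demo(int threads, final int perThread) {
        final SpinThenBlock guard = new SpinThenBlock(100);
        final int[] counter = new int[1];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        guard.acquire();
                        try {
                            counter[0]++;
                        } finally {
                            guard.release();
                        }
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 10000));
    }
}
```

Spinning pays off when the lock is held only briefly, as it is here; with long critical sections the second, blocking phase does most of the work.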
Array Copy Performance Improvements
Background compilation in HotSpot™ (both the server and client VM): in early versions of HotSpot, the compiler did not compile Java methods in the background by default. As a consequence, hyper-threaded or multiprocessor systems couldn't take advantage of spare CPU cycles to optimize Java code execution speed. Currently, both the server and client Hotspot VMs perform compilation in the background (concurrently with the application).
Garbage collection: there have been numerous garbage collection optimizations in the last two major releases of Hotspot. One that deserves to be highlighted is the introduction of the parallel compaction collector. Parallel compaction is a feature that enables the parallel collector to perform major collections in parallel, resulting in lower garbage collection overhead and better application performance, particularly for applications with large heaps. It is best suited to platforms with two or more processors or hardware threads. Prior to Java SE 6, while the young generation was collected in parallel, major collections were performed using a single thread. For applications with frequent major collections, this adversely affected scalability.
Ergonomics: in Java SE 5, platform-dependent default selections for the garbage collector, heap size, and runtime compiler were introduced to better match the needs of different types of applications while requiring less command-line tuning. New tuning flags were also introduced to allow users to specify a desired behavior, which in turn enabled the garbage collector to dynamically tune the size of the heap to meet the specified behavior. In Java SE 6, the default selections have been further enhanced to improve application runtime performance and garbage collector efficiency.
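One way to see what ergonomics selected on a given machine is to ask the running JVM. The sketch below uses the standard `java.lang.management` API to print the processor count, the maximum heap size, the active garbage collectors and any flags passed on the command line; the class name `JvmDefaultsReport` and its helper methods are hypothetical. The output differs by machine, JVM version and command-line flags, so none is shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class JvmDefaultsReport {
    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Names of the garbage collectors active in this JVM.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("processors:    " + Runtime.getRuntime().availableProcessors());
        System.out.println("max heap (MB): " + maxHeapMegabytes());
        System.out.println("collectors:    " + collectorNames());
        System.out.println("JVM flags:     "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Running it once with no flags and once with the tuning flags you normally use gives a quick view of how much the defaults already cover.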
These improvements, along with other non-trivial code generation optimizations in the Hotspot compiler, have improved the SciMark numbers we now obtain with Java. In follow-up posts, I will try to pick specific optimizations and pinpoint how each one helped improve the scores on the SciMark benchmark. In the meantime, for further details on these and other improvements added to Hotspot in the latest releases, I encourage you to check these documents and the pointers they provide: J2SE 5.0 Performance White Paper and Java SE 6 Performance White Paper.
It is important to highlight that, as the scores below show, the difference between the base (out-of-the-box) and peak (tuned) Java scores is not that big. That is mostly due to the ergonomics features in Hotspot, which automatically select appropriate default values for some garbage collection tunable parameters. This is an ongoing source of performance improvements for Hotspot that greatly reduces the burden on the programmer.
The scores below compare Java with gcc based exclusively on the SciMark 2.0 standalone benchmark. It is also worth mentioning that perfJDK shows excellent results with Java benchmarks like SPECjvm2008, which includes a series of sub-benchmarks covering a wider range of applications and would make for a better comparison. SPECjvm2008 includes cryptographic applications, XML processing applications, a database core application and many others.