Parallel Java with Fork/Join on SPARC CMT
By Amit Hurvitz on Apr 22, 2014
Java 7's fork/join framework offers an easy way to perform divisible work by executing parallel tasks on a single machine. This article introduces a fork/join example that counts occurrences of a word in all files and directories under a root directory. I wanted to see how these forked threads scale on a T5-4 server. The Oracle T5-4 server has 4 processors, each with 16 cores. CMT technology provides 8 hardware thread contexts per core (each core includes two out-of-order integer pipelines, a floating-point unit, and level 1 and 2 caches).
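A minimal sketch of such a word counter, assuming a simple recursive decomposition (one subtask per directory entry) rather than the exact code from the referenced article; the class and task names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class WordCounter {
    // RecursiveTask returning the number of occurrences under one path.
    static class CountTask extends RecursiveTask<Long> {
        private final Path path;
        private final String word;
        CountTask(Path path, String word) { this.path = path; this.word = word; }

        @Override
        protected Long compute() {
            try {
                if (Files.isDirectory(path)) {
                    // Fork one subtask per child, then join and sum the results.
                    List<CountTask> subtasks = new ArrayList<>();
                    try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                        for (Path child : children) {
                            CountTask t = new CountTask(child, word);
                            t.fork();
                            subtasks.add(t);
                        }
                    }
                    long sum = 0;
                    for (CountTask t : subtasks) sum += t.join();
                    return sum;
                } else {
                    // Leaf: scan the file token by token.
                    long count = 0;
                    for (String line : Files.readAllLines(path)) {
                        for (String token : line.split("\\W+")) {
                            if (token.equals(word)) count++;
                        }
                    }
                    return count;
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway tree so the example is self-contained.
        Path root = Files.createTempDirectory("fjdemo");
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.write(root.resolve("a.txt"), List.of("import java util", "java java"));
        Files.write(sub.resolve("b.txt"), List.of("no match here", "java"));

        ForkJoinPool pool = new ForkJoinPool();
        long hits = pool.invoke(new CountTask(root, "java"));
        System.out.println("occurrences of 'java': " + hits); // prints 4
    }
}
```

Each directory forks its children eagerly and only then joins them, so independent subtrees proceed in parallel while the pool's work-stealing keeps idle threads busy.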
It took 1131.29 seconds for a single thread to process a root directory containing 1024 files of 1.25 MB each. Doing the same work with Java fork/join (available since Java 7) and raising the "parallelism level" (in Java fork/join pool terminology) up to 2048 took only 7.74 seconds! Clearly it is worth setting the ForkJoinPool parallelism level higher than the number of virtual CPUs, which is easily done with the non-default constructor. In this example, processing time dropped by an additional 15% when I set the parallelism level to 4x the default value; the default is the total number of virtual CPUs, i.e. 512 on a T5-4. The side table reports the times for the different parallelism levels with 1, 2 or 4 CPUs enabled.
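Setting the parallelism level explicitly is a one-line change. A small sketch (the 4x factor reflects the setting that measured best on the T5-4 above; on other hardware the best factor may differ):

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) {
        // Default parallelism equals the number of virtual CPUs, e.g. 512 on a T5-4.
        int defaultLevel = Runtime.getRuntime().availableProcessors();

        // Oversubscribe 4x via the non-default constructor.
        ForkJoinPool pool = new ForkJoinPool(4 * defaultLevel);

        System.out.println("default parallelism: " + defaultLevel);
        System.out.println("pool parallelism:    " + pool.getParallelism());
    }
}
```

Tasks submitted to this pool (via `invoke` or `submit`) run at the chosen parallelism level instead of the default.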
Let's now look at scalability. The side graph plots execution speedup versus the number of physical CPU cores for different parallelism levels (expressed as threads per core). The red line shows the inherent scalability of Java fork/join: linear up to 16 CPU cores, then continued scaling to 60 and beyond; not bad for an automatic parallelism technology.
If we now raise the parallelism level to run multiple threads per core and leverage the hardware Chip Multi Threading (CMT) capability of the SPARC T-Series, we get dramatic speed-ups and a super-linearity effect. More interestingly, past the number of threads physically supported in hardware, the overloaded system is not brought to its knees; on the contrary, it handles the load very well and even extracts a little more performance overall.
This CMT behavior is something we see consistently, and this Java fork/join benchmark was no exception: a great contribution to throughput performance and robustness in the face of overload. CMT threads share the core's execution pipeline very efficiently, which significantly decreases the number of wasted CPU cycles (stalls) and dramatically increases CPU utilization, as measured in instructions per second. When an application reads memory areas far larger than any existing cache can satisfy, its threads frequently stall on cache misses, and CMT lets the CPU switch immediately to a runnable thread with no penalty.