Parallel Java with Fork/Join on SPARC CMT

Java 7's fork/join framework offers an easy way to perform divisible work by executing parallel tasks on a single machine. This article introduces a fork/join example that counts occurrences of a word in all files and directories under a root directory. I wanted to check how these forked threads scale on a T5-4 server. The Oracle T5-4 has 4 processors with 16 cores each, and CMT technology provides 8 thread contexts per core (each core includes two out-of-order integer pipelines, a floating-point unit, and level-1 and level-2 caches; see Oracle's T5 specifications for full details).
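The benchmark code itself is not listed in the post. As a minimal sketch of the same fork/join pattern, the hypothetical `WordCountTask` below counts occurrences of a word across a list of in-memory documents (standing in for the files), splitting the work recursively until each chunk is small enough to scan sequentially:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative sketch: counts occurrences of a word across documents,
// splitting the list in half until chunks fall below a threshold.
class WordCountTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 4; // chunk size for sequential counting
    private final List<String> docs;
    private final String word;

    WordCountTask(List<String> docs, String word) {
        this.docs = docs;
        this.word = word;
    }

    @Override
    protected Long compute() {
        if (docs.size() <= THRESHOLD) {
            long count = 0;
            for (String doc : docs)
                for (String token : doc.split("\\W+"))
                    if (token.equals(word)) count++;
            return count;
        }
        int mid = docs.size() / 2;
        WordCountTask left = new WordCountTask(docs.subList(0, mid), word);
        WordCountTask right = new WordCountTask(docs.subList(mid, docs.size()), word);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute right half, then combine
    }
}

public class ForkJoinDemo {
    public static void main(String[] args) {
        List<String> docs = Collections.nCopies(16, "the quick fox and the dog");
        long hits = new ForkJoinPool().invoke(new WordCountTask(docs, "the"));
        System.out.println(hits); // 2 hits per document x 16 documents = 32
    }
}
```

In the real benchmark each leaf task would read and scan a file rather than a string; the recursive split-and-join structure is the same.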

It took 1131.29 seconds for a single thread to process a root directory with 1024 files of 1.25 MB each. Doing the same work with Java fork/join (available since Java 7) and raising the "parallelism level" (to use ForkJoinPool terminology) up to 2048 took only 7.74 seconds! Clearly it is worth setting the ForkJoinPool parallelism level higher than the number of virtual CPUs, which is easily done with the non-default constructor. In this test, processing time dropped by an additional 15% when I set the parallelism level to 4x the default value; the default is the total number of virtual CPUs, i.e. 512 on a T5-4. The table below reports completion times for the different parallelism levels with 1, 2, or 4 CPUs enabled.
Parallelism level    Seconds to complete
                     1 CPU      2 CPUs     4 CPUs
   1                 1136.26    1126.10    1131.29
  16                   91.58      92.27      96.51
  32                   52.99      47.88      44.61
  64                   33.54      25.67      22.62
 128                   29.24      17.08      13.59
 256                   28.76      15.42       9.30
 512                   28.72      14.73       9.07
1024                   29.04      15.02       8.50
2048                   28.65      14.68       7.74
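Raising the parallelism level above the default via the non-default constructor can be sketched as follows (the 4x multiplier matches the best result above; the class name is illustrative):

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) {
        // The default parallelism equals the number of virtual CPUs
        // (512 on a T5-4; likely far fewer on the machine running this).
        int defaultLevel = Runtime.getRuntime().availableProcessors();

        // Oversubscribe 4x, as in the best-performing run above.
        ForkJoinPool pool = new ForkJoinPool(4 * defaultLevel);
        System.out.println(pool.getParallelism()); // 4 * defaultLevel

        pool.shutdown();
    }
}
```

Tasks submitted to this pool via `invoke()` or `submit()` are then multiplexed across the requested number of worker threads.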

Let's now look at scalability. I have plotted, in the side graph, execution speedup versus the number of physical CPU cores for different parallelism levels (expressed as the number of threads per core). The red line shows the inherent scalability of Java fork/join: linear scalability up to 16 CPU cores, then continued scaling to 60 cores and beyond; not bad for an automatic parallelism technology.

If we now increase the parallelism level to run multiple threads per core, leveraging the hardware Chip Multi Threading (CMT) capability of the SPARC T-Series, we get dramatic speed-ups and a super-linearity effect. More interestingly, past the number of threads physically supported in hardware, the overload does not bring the system to its knees; on the contrary, it handles the load very well and even extracts a little more performance out of the overall system.

This CMT behavior is something we consistently see, and this Java fork/join benchmark was no exception: a great contribution to throughput performance and to robustness in the face of overload. This is because CMT threads share the core's execution pipeline very efficiently, which significantly decreases the number of wasted CPU cycles (stalls) and, as a result, dramatically increases CPU utilization as measured in instructions per second. When reading large memory areas, e.g. far beyond what any existing cache can satisfy, the application's threads often stall on cache misses, and CMT enables the CPU to immediately switch to a runnable thread with no penalty.

Back to the Java world. Java 8 introduces, among other features, streams, parallel streams and lambda expressions. What happens when these new capabilities meet the scalability power of SPARC T5? This will be my next post.
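As a preview of those Java 8 features, the same word count can be expressed declaratively with a parallel stream, which by default executes on the common ForkJoinPool (this is a hypothetical sketch, not the benchmark from the upcoming post):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StreamPreview {
    public static void main(String[] args) {
        // Stand-in corpus: 16 identical documents.
        List<String> docs = Collections.nCopies(16, "the quick fox and the dog");

        long hits = docs.parallelStream()                          // fork/join under the hood
                        .flatMap(d -> Arrays.stream(d.split("\\W+"))) // tokenize each document
                        .filter("the"::equals)                     // keep matching words
                        .count();

        System.out.println(hits); // 2 per document x 16 documents = 32
    }
}
```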
About

How open innovation and technology adoption translates to business value, with stories from our developer support work at Oracle's ISV Engineering.
