MPI meets Multicore

Last month I had the great thrill of being asked to give a talk at the SPIE Astronomy Conference (thanks to Hilton Lewis for inviting me). I gave a broad talk on the state of scientific computing. The issue I got the most feedback on is an odd stress I've noticed between MPI and the proliferation of multicore chips. It goes like this:

MPI, the Message Passing Interface, is the predominant API used in building applications for scientific clusters. There are Java versions; one of my favorites is an MPI-like system called MPJ Express. MPI has been around for years and there's a large body of existing MPI applications in C and Fortran.

On the one hand, MPI is a wonderful thing because it enables scientific applications to take advantage of large clusters. On the other hand it's horrible because it forces applications to ship large quantities of data around, and they often quickly find that communication overheads far overshadow the actual computation. An immense amount of cleverness goes into both making the communication efficient, and minimizing the amount of communication. For example, computational fluid dynamics applications will sometimes do adaptive non-uniform partitioning of the simulation volume so that regions with highly detailed interactions (like vortices) don't get split across computation nodes. Scientific applications generally get much better CPU utilization if they are run in one big address space with all of the CPUs having direct access to it.
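To put a rough number on why partitioning matters so much, here is a back-of-the-envelope sketch. It assumes something much simpler than the adaptive partitioning mentioned above: a uniform grid cut into equal cubic blocks, each of which ships a one-cell-deep layer from every face to its neighbours on each step. The class and method names are made up for illustration.

```java
public class HaloRatio {
    // Cells owned by a cubic block of side n: the computation done per step.
    public static long interiorCells(long n) {
        return n * n * n;
    }

    // Cells on the six faces of the block: the data shipped per step,
    // assuming a one-cell-deep halo and a neighbour on every face.
    public static long haloCells(long n) {
        return 6 * n * n;
    }

    // Communication relative to computation; algebraically this is 6 / n.
    public static double commToComputeRatio(long n) {
        return (double) haloCells(n) / interiorCells(n);
    }

    public static void main(String[] args) {
        // Cutting the same volume into more, smaller blocks raises the ratio.
        for (long n : new long[] {256, 64, 16}) {
            System.out.printf("block side %d: halo/interior = %.4f%n",
                    n, commToComputeRatio(n));
        }
    }
}
```

Under these assumptions the ratio is 6/n, so every time you shrink the blocks to spread the work over more nodes, a larger share of each step goes to shipping data rather than computing on it.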

But there's a huge catch: these applications are mostly single-threaded, having been designed for clusters where each node is a single-CPU machine. But now the shift to multicore is happening and folks are building clusters of SMPs. There is a tendency to run multiple copies of the application on each machine, one per core, using MPI within one machine to communicate between the copies. This is crazy, especially on machines with lots of cores, which will be the norm in not too many years. Even though the communication is often highly optimized within the one machine, there's still overhead.

To get the best CPU utilization the apps have to be written to be multithreaded within a cluster node, and use MPI between nodes. This is the worst of both worlds because you have to architect for threading and clustering at the same time. This is pretty straightforward in the Java world because we have great threading facilities, but folks with bags of Fortran code have trouble (auto-parallelizing matrix code doesn't help nearly enough).
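Here is a minimal sketch of that two-level shape in Java, under some loud assumptions. The intra-node part uses a plain parallel stream over an array that all cores share. The inter-node part is hidden behind a hypothetical `NodeComm` interface that I made up for this example; it is not the API of MPJ Express or any other MPI binding, and in a real cluster an MPI reduction would sit behind it.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class HybridSketch {
    // Hypothetical stand-in for the inter-node layer (not a real MPI API).
    // A real implementation would combine the per-node values over the network.
    public interface NodeComm {
        double sumAcrossNodes(double localValue);
    }

    // Intra-node step: every core works on the same array in one address
    // space, so nothing is copied or shipped between threads.
    public static double localSumOfSquares(double[] slice) {
        return IntStream.range(0, slice.length)
                .parallel()
                .mapToDouble(i -> slice[i] * slice[i])
                .sum();
    }

    // Inter-node step: one value per node crosses the network, not one per core.
    public static double globalSumOfSquares(double[] slice, NodeComm comm) {
        double local = localSumOfSquares(slice);
        return comm.sumAcrossNodes(local);
    }

    public static void main(String[] args) {
        double[] slice = new double[1_000_000];
        Arrays.fill(slice, 2.0);
        // Single-node stand-in: with no other nodes, the "reduction" is the identity.
        NodeComm singleNode = v -> v;
        System.out.println(globalSumOfSquares(slice, singleNode));
    }
}
```

The point of the split is that the threads never touch the message layer: each node does its share in shared memory and then takes part in the cluster-wide exchange exactly once.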

This is (almost) a non-issue for folks writing enterprise applications using the JavaEE frameworks because the containers deal with both threading and clustering.

Comments:

Java already takes care of mapping Java threads to native threads. Is it possible to build all of this into the Java Virtual Machine, so that the JVM takes care of sharing load between cluster nodes and the communication between them? The JVM is a VIRTUAL MACHINE anyway.

Posted by Tom on June 24, 2006 at 03:45 AM PDT #

James, I'm interested in the usage of Java to either interface with old and optimized Fortran code or approach old problems in a novel way. HPC always forces interesting programming problems that transactional processes never have to deal with. Even for a trivial operation like matrix multiply, a lot of work goes into libraries for the client side to optimize the caching flow to the chip. It is a very different type of computing than JEE, which boils down to lots of people simultaneously getting data from a database, but I'm sure that Java could be used in various clever ways to come up with better solutions. I'd love to see more on this topic here! Best regards, Tim

Posted by Tim Burns on June 24, 2006 at 10:57 PM PDT #

There are several different parallel approaches. Maybe Java works well for task-parallel code because we can treat each task as a thread. But Java is too complicated for data-parallel code such as matrix multiplication. We can write matrix multiplication with threads, but we can imagine the complexity. Just as Tim Burns says, it is a very different type of computing than JEE. If we want to develop data-parallel applications, maybe we should choose another road.

Posted by dreamhead on June 25, 2006 at 02:09 AM PDT #

I agree that this is a hole in Java. One of the main reasons why all multimedia applications and codecs are written using a C++ compiler is that you can embed data-parallel vector instructions inside loop code as assembly or "performance primitives". Why are there no "performance primitives" in Java? Only very few compilers make any headway in turning loops into data-parallel code, and none of those are Java compilers. Such primitives should be reasonably simple to construct, using the SIMD instructions of the Intel platform as the least common denominator. At least I don't expect to see any serious games or video systems in Java until that is in place.

Posted by eirikma on June 26, 2006 at 06:08 AM PDT #
