An Oracle blog about Transactional locks

Parallelism Manifesto

Dave Dice
Senior Research Scientist

Java provides threads to express parallelism and locks to restrict or prune that parallelism. This subtractive model seem somewhat awkward: threads give parallelism and locks take it away. (I'm intentionally ignoring lock-free programming and optimistic transactions for the purposes of this rant. Note too, that multithreaded programs and thus locks are still required on uniprocessors, but I'd argue that threads are used on uniprocessor systems only to achieve IO-compute overlap. Better asynchronous IO facilities provide a better solution than threads.) Think of taking one drug [parallelism] as a remedy for a disease [limited single-CPU performance] and then needing a 2nd drug [locks] to counter act the side-effects [races] of that 1st drug [parallelism]. Remember, true parallelism is not a feature -- it's just a remedy, and a problematic one at that. Just as Java removed explicit memory management from C and C++, I expect that the next generation of programming languages may remove threads and locking as we know it today. Explicit threading will still be available, of course, but ideally I'd like to see the world switch to a model where, by default, threads are managed by the runtime environment. Ultimately, we'd disappear threads (sadly, disappear has entered regular usage as a transitive verb). Programmers would describe potential parallelism with new language constructs but avoid explicitly prescribing threads. Such a descriptive model is better than our current prescriptive model as we trade two mechanisms (threads and locks) for one construct used to express potential parallelism. By default, execution would be serial unless the programmer explicitly marked it as potentially parallel. That's inverted from our current model. The managed runtime would automatically provision the process with real threads as needed.

Of course you can approximate a nearly thread-free programming model today with deft use of thread pools in the java.util.concurrent libraries. But instead of a library implementation I'm proposing first-class language integration. Relatedly, one might correctly argue that the descriptive model described above could be emulated today with very light-weight threads. In such a model we'd use threads and implicit cyclic barriers to express the parallel regions. The "body" of the thread's run() method would typically be short and not contain any synchronization operations. While workable, that approach isn't a complete solution. Not all concurrency problems map cleanly or easily onto the fork-join model. See also Cilk and Fork-Join parallelism.

The key economy we're grappling with is really human scalability (developers), not machine scalability. (Programmer scalability is just as important as program scalability). Parallelism has arrived on our desktops and the free lunch is over. Traditional concurrent programming has been error-prone and difficult, making it expensive and best left to a small priesthood of specialists. That situation can't stand. It's probably true that humans are naturally better at sequential reasoning and programming (c.f., dichotic listening). You can even draw a sketchy analogy between human multitasking (NY Times, 2007-3-26) and parallel programming. To the extent possible, we need to provide languages and constructs that allow programmers to "think sequentially" but "execute concurrently". There's no reason that the programmer who has domain expertise in international tax law should also be required in the future to become a concurrency expert.

Sun's Fortress language is an interesting step in the right direction. Fortress provides loop-level parallelism by default. (If you're interested in how concurrency is expressed in modern programming languages then IBM's X10 is also worth a look).

Note that we're not describing automatic parallelization, where the compiler or runtime analyzes the dependencies in a serial instruction stream, converting implicit instruction-level parallelism (ILP) to thread-level parallelism (TLP). That's the holly grail, but it remains an open research topic. It's also likely that for most codes, the benefit is bounded. This is the same reason we're seeing diminishing returns from out-of-order and speculative execution in modern processors.

Since locks are problematic, a useful interim step is to provide transactions in lieu of locks. Think synchronized but without the need to specify an object. Some researchers prefer atomic { ...code...} instead, but it's all syntactic sugar over the same concept. Briefly, transactions provide correctness as a first-order property. The programmer doesn't need to be concerned about which variables are protected by which locks or locking protocol. Deadlock is impossible as well - atomic sections are composable. As a secondary benefit, depending on the application and the implementation of the transactional subsystem, the program might enjoy more potential parallelism and thus improved throughput. There's often no need for the programmer to design complex and error-prone fine-grained locking protocols to extract additional parallelism. (Of course hand code algorithms will always have a performance edge, but we hope that STMs will often be able to archive competitive performance). This particular area is where STM (Software Transactional Memory) research intersects language design. As you'd expect, much of this remains an open -- and very active -- area of research. If you're curious, you might enjoy reading about the atomjava project as well as watching this video interview with Simon Peyton-Jones and Tim Harris of Microsoft research or the following Google TechTalk video on
transactional memory.

As an aside, some have suggested MPI or OpenMPI as a workable parallel programming model. Message passing has a certain aesthetic appeal, but I don't see it as the way forward. See Gosling's blog entry on the topic. Relatedly, a segment of the concurrency community believes that Erlang (or a programming model like Erlang) is the right path for the future of concurrency. Erlang doesn't allow traditional threads and shared memory, but instead provides efficient asynchronous message passing between processes. A process can neither address, read, or mutate the memory of another process. Many of the bugs we face today in a classic threads + shared-memory programming model aren't even expressible. Minimizing shared mutable data is generally a good practice; Erlang simply takes that the philosophy to the extreme and eliminates shared mutable data, providing communication through explicit message passing instead. (Of course you could approximate the Erlang model in Java by partitioning data as follows: Mutable unshared data would be thread-local -- one thread couldn't even reach another thread's unshared data. Shared immutable data is benign. Finally, you could ban shared mutable data and communicate through java.util.concurrent queues with serialized objects.) The Erlang programming model is seductive, but we don't know if it scales out or maps to large problems where state can end up dispersed. Collectively, we have more experience with the traditional threads memory model. Interestingly, David Patterson mentioned in his Feb 2007 Stanford EE380 talk that recent programmer productivity studies suggested message passing might be more difficult to master. (See also The Landscape of Parallel Computing Research: A View from Berkeley.) Distributed programming brings its own benefits as well as problems. I suspect in the future we'll use hybrid environments with message passing between nodes and threads with shared memory for intra-node cooperation.

It's also worth noting that concurrency isn't a panacea and must be used with care. It's entirely possible to encounter negative scalability. In the classic case if plot throughput on the Y-axis and concurrency on the X-axis you'll see either a concave graph or a monotonically descending line. If the communication costs (shared memory coherency traffic, synchronization, etc) don't overcome the benefits derived from compute parallelism then you may encounter this problem. In addition, we'd like to use fine-grain parallelism to maximize system utilization, but fine-grained parallelism typically incurs higher communications costs.

Finally, it's worth commenting on language closures. Normally you wouldn't think there's much connection between closures and parallelism, but in practice closures are an excellent way of expressing certain actions. Think of a closure being passed into a ParallelApply() method that operates on a collection and you'll quickly see the benefits of closures. BTW, the scala programming language provides a rather elegant closure model. I've only played with it briefly, but scala looks promising. It's a nice shiny new language but the compiler generates Java bytecode so you can take advantage of your highly optimized JVM and mature Java libraries. Closures are also a hot topic for Java, of course. For a balanced look at the design choices I recommend John Rose's blog entry.

Join the discussion

Comments ( 2 )
  • Toby Monday, September 25, 2006
    "Programmers would describe potential parallelism with new language constructs but avoid explicitly prescribing threads."
    -- Perl6 already has some interesting constructs that work this way.
  • Emilio Thursday, December 14, 2006
    This blog is timely. I see the same issue with explicit threading and the lack of implicit vertical, horizontal and pipeline parallelism available in Java.
    Turns out that's hard, until you narrow the problem space to a single large class of programs such as batch processing GB's or TB's of data. Then you can provide the Java community with a light-weight, but powerful framework to do this one class of problems in parallel. If a developer can simply write POJO-like components and string them together in a dataflow graph using an XML syntax, the framework could do all the complex threading, memory mgmt, deadlock detection, etc....
    We at Pervasive Software felt so strongly about this that we created a hyper-parallel Java framework called DataRush. IT managers can now just throw CoolThreads and SPARC IV+ cores at the problem without opening the code and/or tuning every time.
    I would love feedback from your readers. Get a tech preview version:
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.