OOo my Threading...

This starts where Stephan's OOoCon talk left off - by taking his problem statement, and maybe a gist of the proposed concepts, into OOo's core (which happens to be written in C++ rather than Haskell - some folks might bemoan that, but still, ~90% of our code consists of that language).

To recap: the free lunch is over for us application software developers, i.e. either OOo gets seriously multi-threaded soon, or we stop adding features and only optimize for speed - otherwise, OOo will get significantly slower within the next few years. But as you, the gurus, and yours truly have noticed by now, becoming concurrent with osl::MutexGuard, osl::Thread & friends is tedious, error-prone, and guarantees neither race nor deadlock freedom for a project of OOo's size. With the expressive power of a general-purpose programming language, it is even theoretically impossible to verify these properties (for a sufficiently large project - OOo probably satisfies this constraint).

So, Stephan already showed what's possible in other languages. In a nutshell, the trick is to raise the level of abstraction above digging-in-the-dirt, hand-made, explicit, and ugly locking or thread fiddling. Wherever possible, parallelism should be detected implicitly, or at least denoted declaratively (instead of manually telling the machine what to do). John Ousterhout has a nice writeup about that. It's basically another instance of the lack of transparency and ease of use that is so ingrained in day-to-day tasks in C++. For stuff like object lifetime management, serialization and so on, boost provides transparent solutions. For large-scale concurrency management, the UNO threading framework does. For local, statement-level parallelization, there ain't much - or is there?

In fact, there actually are several attempts to extend C++ in that direction. To name a few:

    OpenMP, a C/C++ language extension, which provides things like parallel loops, work sharing or master and synchronization constructs. Drawbacks: controlled via pragmas, expressiveness limited, C origins (e.g. there's no proper way to propagate exceptions out of OpenMP threads).

    Boost.Act, a template framework that, together with Boost.Threads, provides functionality on the same conceptual level as OpenMP (it actually builds upon OpenMP internally). Furthermore, Boost.Act has so-called active objects, which provide their results lazily - one can construct a whole tree of calculations using active objects, which internally enqueue those, run them asynchronously, and force a wait on the result only if a non-active object actually requests it.

    Concur, featuring active objects, messages, parallel STL algorithms and futures (with futures being values that are possibly unevaluated at the point of instantiation, and get their actual result only eventually - forcing futures evaluation might block the requester until the value becomes available). Probably served as an inspiration for Boost.Act.

    TBB, or Intel Threading Building Blocks, another (closed-source) C++ template framework with similar scope & features as Concur and Boost.Act.

    Two proposals for extending the C++ standard: Multithreading API for C++0X (and related: Transporting Values and Exceptions between Threads)

    KDE's ThreadWeaver, a threaded job queue that offers some nice concepts, such as explicit job dependencies.

From a cursory inspection, Boost.Act seems appealing (TBB really isn't an option for OOo, no?) - boost is high-quality anyway, and I think that the Action concept is a richer one than Futures.

Basically, the advanced approaches listed above employ declarative semantics, i.e. one no longer needs to create threads by hand, assign a thread function, and manually serialize access to shared state. Instead, one simply states that a loop body shall be executed in parallel:

parallel_transform( src.begin(), src.end(), dest.begin(), _1 * 42 );

This multiplies each element in src by 42 (with _1 * 42 being a boost lambda expression), and stores the result in dest. How many threads are created, and where they are taken from, is completely up to the implementation - a typical threading runtime on a machine with N cores will hold a static pool of N threads, each of them receiving a chunk of the parallel_transform workload from above. In the old way of doing multi-threading, this behaviour would potentially be spread over the whole code base (making it a cross-cutting concern, just like explicit method guarding is today - with thousands of osl::MutexGuard(m_aMutex) invocations all over the place).

One of the crucial points of doing stuff like the above in a safe way is to avoid hidden access to shared state. Access to the container can be serialized by parallel_transform, but if the called functor modifies shared objects concurrently, all bets are off. Thus, the listed frameworks usually have some safeguards in place, most of them offering a way of explicitly annotating free-standing functions and classes as thread-safe (note that this tagging of thread-safety can easily be achieved by means of the UNO threading framework as well - Kay already has different uno::Reference types on his list, to distinguish between thread-safe and thread-unsafe components at compile time). Another noteworthy concept here is that of pure functions, i.e. functions that, given the same input parameters, always return the same result, no matter where or when they are called (related, but not exactly the same, are referentially transparent functions, which could also be exploited by an optimizing parallel execution runtime). It's always safe to call pure functions on disjoint parameters concurrently, with no extra synchronization effort. A safe (because syntactically limited) way of defining pure functions is via boost's bind or lambda library. There, expressions that modify shareable state could simply be banned from the available syntax (they aren't - but it's possible to do that).

Need an example? Good - take the standard one, which asks so naturally for parallelization that it has usually been sped up via SIMD in the past (you might have noticed that those parallel_something algorithms basically apply the same instructions (the functor to run) in parallel to all container elements - that's SIMD), namely operating on image pixels:

// color-convert an image from RGB to CMYK color space
parallel_transform( rgb_image.begin(), rgb_image.end(),
                    cmyk_image.begin(),
                    bind( &rgb2cmyk, _1, conversion_parms ) );

This is the archetype of inherent parallelism: a huge amount of data (image pixels), which all need the same operation applied to them. Disregarding memory bandwidth for the moment, this snippet scales linearly with the number of available cores.

Note the conversion_parms parameter to the bound function, which can pose problems if rgb2cmyk silently const_casts it to something mutable. This is the dark corner of C/C++'s non-enforceable constness, and therefore a minimal amount of developer discipline is necessary: if objects or functions employ magic behind their public interfaces (on-demand creation of data, buffering, caching, copy-on-write), this still needs to be made thread-safe explicitly. Fortunately, for each of the aforementioned items, there's a thread-safe helper available in OOo (use rtl::Static for late initialization of static data, boost::shared_ptr for lazy initialization of class members, caching and buffering, and cow_wrapper for copy-on-write).

To be continued.

