Using OpenMP to parallelize for CMT
By Ruud on Dec 06, 2005
Sun just announced the UltraSPARC T1 processor based T1000 and T2000 servers with massive amounts of parallelism on the chip. Of course, UltraSPARC IV and IV+ also have parallelism at the chip level. This is generally referred to as "Chip Multi-Threading", or CMT for short. Another example of this kind of on-chip hardware parallelism is the AMD dual-core Opteron processor.
And we are only seeing the beginning of this. In the not too distant future, the majority of processors will have some implementation of CMT. The naming may be different, but what matters to the developer is that a single chip is like a mini parallel system.
This brings up the question "How can I exploit this hardware parallelism in my application?"
There are various ways to do this. One can use POSIX Threads ("Pthreads" for short), for example, or the native threading model in Java. These are fairly low-level interfaces, though, and the learning curve is fairly steep.
An alternative is the Message Passing Interface (MPI). MPI is really designed to run on a cluster of systems, but an application parallelized with MPI can run equally well, if not better, on a single parallel system. However, MPI is not always easy to use for someone new to the concepts of parallelization. MPI may also be fairly heavyweight on a single CMT processor, although, to be honest, I have not done any performance measurements to back this up with data. I could be wrong here!
My favourite programming model is OpenMP. The language specifications and much more information can be found at the OpenMP website.
Today I want to give a quick overview of what OpenMP is about and how well it fits an underlying CMT architecture.
Just like MPI, OpenMP is a de-facto standard. The OpenMP Architecture Review Board (ARB) defines the specifications and an implementation is supposed to stick to that. So far, this has worked very well and there are many compilers available that support OpenMP.
Sun Studio has supported OpenMP in C, C++ and Fortran for a long time now. We continue to improve the performance of the implementation, and we have added some features to assist the developer. I strongly recommend using Sun Studio 11 when you plan to give OpenMP a try.
Not least because our compilers also support automatic parallelization in C, C++ and Fortran. This feature is activated through the -xautopar option and is certainly worth trying on an application. We support mixing and matching this with OpenMP, so you can always augment what the compiler did. If you use this feature, please also use the -xloopinfo option. This causes the compiler to print diagnostic messages on the screen, informing you what was parallelized and what wasn't. In the latter case, the compiler tells you why the parallelization did not succeed.
OpenMP uses a directive-based approach to parallelize an application. The one limitation of OpenMP is that an application can only run within a single address space. In other words, you cannot run an OpenMP application on a cluster. This is a difference from MPI.
On the other hand, OpenMP is more lightweight and may therefore be more suitable for a CMT processor.
One question I often get is how OpenMP compares to POSIX Threads and whether developers should rewrite their Pthreads application to use OpenMP instead. To start with the latter, my answer is always "why?". If it works fine, don't modify it.
The first question takes a little more time to answer. Pthreads are rather low level, but also efficient. OpenMP is built on top of a native threading model and therefore adds overhead, but the additional cost is fairly low, unless you use OpenMP in the "wrong" way. One golden rule is to create large portions of parallel work to amortize the cost of the so-called parallel region in OpenMP.
The clear advantage OpenMP has over Pthreads is ease of use. OpenMP provides a fairly extensive set of features, relieving the developer from having to roll their own functionality. You can still do this with OpenMP, but many basic needs are covered already.
Time to look at an example. The code fragment below uses OpenMP directives to parallelize this simple loop:
#pragma omp parallel for shared(n,a,b,c) private(i)
for (i=0; i < n; i++)
   a[i] = b[i] + c[i];
An OpenMP compliant compiler translates this pragma into the appropriate parallel infrastructure. As a side note, this loop will be automatically parallelized by the Sun Studio C compiler.
At run-time, the application will go parallel at this point. The threads are created and the work is distributed over the threads. In this case "work" means the various loop iterations. Each thread will get assigned a chunk out of the total number of iterations that need to be executed.
At the end of the loop, the threads synchronize and one thread (the so-called "master thread") resumes execution.
One of the things one has to do is specify the so-called data-sharing attributes of the variables. OpenMP has defaults for these, but I prefer, and recommend, not to rely on them: think about it yourself. Initially this may take some more time, but the reward is substantial.
At run-time, the OpenMP system distributes the iterations over the threads. For example if n=10 and I use 2 threads, each thread gets 5 iterations to work on. This is why the loop iteration variable "i" needs to be made "private". This implies that each thread gets a local copy of "i" and can modify it independently of the other threads.
All threads need to be able to access (a portion of) "a", "b" and "c". This is why we need to declare those to be "shared". The same is true for "n".
OpenMP supports any data type. Although many OpenMP applications use floating-point, OpenMP works equally well for other data types, such as integer codes. Take the loop above: the vectors can be of any type, including "int", or "char" for that matter.
Although the example above is correct, my explanation is a little simplified; I hope you get the gist of how this works in OpenMP. For example, one thing I can control explicitly is how the various loop iterations are distributed over the threads. If I do not specify this, I get a compiler-dependent default.
Another feature of OpenMP is that I can, and should, create bigger chunks of work than just one parallelized for-loop. Otherwise the overhead may kill the performance gain. OpenMP provides several features to address this.
Activating the OpenMP directives is very easy. For example, on the Sun Studio compilers one needs to add the -xopenmp option to the rest of the options and voila, the compiler will recognize the OpenMP constructs and generate the infrastructure for you. If you do so, please also use the -xloopinfo option. Just as with -xautopar, this will display diagnostic messages on your screen.
OpenMP comes with a fairly compact, yet powerful, set of directives to implement and control the parallelization. On top of that, a run-time library is supplied to query and control the parallel environment (e.g. adjusting the number of threads through omp_set_num_threads). A set of environment variables is also provided, for example OMP_NUM_THREADS to specify the number of threads to be used by the application.
So, what are the main advantages of OpenMP? This is what I think makes it attractive:
- Portable - Every major hardware vendor and several ISVs provide an OpenMP compiler
- Modest programming effort - Implementing OpenMP is usually fairly easy compared to other programming models
- Incremental parallelization - One can implement the parallelization step by step
- Local impact - Often, the OpenMP specific part in an application is relatively small
- Easier to test - By not compiling for OpenMP you "de-activate" it
- Natural mapping onto CMT architectures - OpenMP maps elegantly onto a CMT processor
- Assistance from tools - You get a compiler to help you
- Preserves sequential program - If done correctly, you can have your sequential program still built in
In particular the latter is a powerful feature I think. You can always go back to your original application if you want.
Of course there are also drawbacks to using OpenMP. One is restricted to a single address space, and in that respect OpenMP is not "Grid Ready". With all the parallelism arriving at the chip level, this may be less of an issue in the future though.
On top of that, one can combine MPI and OpenMP. MPI can be used to distribute the work at the system level. Within one SMP system, one can then use OpenMP for the finer-grained parallelization. This kind of "hybrid" parallelization is gaining ground.
The developer also needs to think about data-sharing attributes. Without prior experience in parallelization this is a hurdle to overcome, but it is usually easier than it might seem at first sight.
You also need a compiler, but hey, we have Sun Studio for that!
Interested? One place to start could be the OpenMP tutorial that I presented at the previous IWOMP workshop (International Workshop on OpenMP, June 2005). You can find it through the OpenMP User Group website (under "Trainings"). This is the direct link to the pdf file. As always with such things, I would do it slightly differently today, but hopefully this will get you going.
By the way, the next IWOMP will take place in Reims, France (June 12-15, 2006). A conference will be held on the first two days; the next two days are reserved for the OMPlab and a tutorial. People can bring their own application to the OMPlab, where OpenMP experts will be available to assist with the parallelization. The tutorial will most likely be held on the third day, in parallel with the OMPlab. After that, attendees can join the OMPlab part and work on whatever they want. Preparations have just started. Through this web site you can sign up for the conference newsletter.
Okay, back to CMT. The beauty of OpenMP is that it provides an abstract model. You can develop your OpenMP program on any piece of hardware with an OpenMP compliant compiler and then run it on any parallel system. Possibly you need to recompile if you change architecture, but that is about it.
For example, I travel a lot and routinely work on OpenMP programs on my single processor Toshiba laptop, using the Sun Studio 11 compilers under Solaris. I can run multi-threaded, even using the nested parallelism feature that OpenMP supports. Of course I will not get any performance gain doing so, but it makes for a great development environment!
I hope this article got you excited about OpenMP. If so, I recommend signing up for the OpenMP community mail alias. This is a very good forum to ask questions, discuss issues, etc.
I hope to see you soon on the OpenMP alias!