Monday Sep 02, 2013

READ_ME_FIRST: What Do I Do With All Of Those SPARC Threads?

This is a technical white paper with a fairly challenging title, but it actually describes the contents quite well.

We wrote this paper because the new SPARC T5 and M5 Oracle servers provide so much main memory plus so many cores and threads that one may wonder how to manage and deploy such a kind of system. They're pretty unique in the market.

This is why we joined forces and set ourselves the goal of providing a holistic view on this topic.

The paper is written in a modular way and readers can select the individual topics they're interested in, but of course we hope you'll read it front to back and find it useful. Perhaps the Glossary at the end comes in handy too. 

The first part covers the processor and system architectures, but only to the extent we felt is needed for the remainder of the paper. There are several other white papers that go into an awful lot more detail on this.

The next part targets those (thinking about) developing parallel applications and looking for tips and tricks what choices need to be made and how to make real world codes scale. That is no mean feat, but rewarding and long lasting. Think about it. The trend is upward and the size of systems can be expected to continue to scale up. Any investment done today in improving scalability is going to help. There is a learning curve as well and the sooner one begins, the better.

We feel these chapters are however also of use to those not directly involved writing parallel code. It helps to understand what happens under the hood and may explain things one has observed experimentally. For example why there may be a diminishing return on adding more and more threads.

The third part covers the virtualization features available and how these can be used to configure the system to the needs. Perhaps to run legacy applications that require an older software environment and do this side by side with applications running in a more modern environment. On top of that, each of such applications can be multi-threaded, providing the optimal configuration per application.

The paper concludes with a brief coverage of key Solaris features. Not everybody realizes the importance of a scalable OS and how much on-going engineering investment is needed to continue to improve the scalability of Solaris.

We hope you find this technical white paper useful and of course feedback is encouraged! 

Sunday Dec 05, 2010

Video Lectures on Parallel Programming

A series of 7 short video introductions into parallel programming are now available on line. They can be viewed here. Each video is about 10-15 minutes long and covers a specific topic, or topics. Although it is recommended to view them in sequence, this is not required.

The slides are available as well and can be downloaded from the same web page.

Readers interested in these topics may also want to take a look at the "Parallel Programming with Oracle Developer Tools" white paper. There is little overlap between the two. The videos cover the big picture, whereas the paper goes into much more detail on the programming aspects.

The first video sets the stage. It covers the more general topic of application performance tuning, summarizing the different ways how to optimize an application, with parallelization being one of them.

The second video is about multicore architectures. The main purpose of this talk is to underline that multicore is here today and supported by all major microprocessor manufacturers. Admittedly, the processors covered here have been replaced by their respective successors, but that is in the nature of the beast. This topic is after all a moving target, since new microprocessors appear on the horizon all the time. For the purpose of the talks given here, the processor details are not relevant however. Of course there is an impact on the performance, but when parallelizing an application, the details of the processor architecture need not be taken into account.

The most important thing to realize is that there are three features that are going to stay for a long time to come. First of all, multicore is not going to go away; developers can count on any new general purpose microprocessor to support parallel execution in hardware. Secondly, not only does the number of threads continue to increase, the cost of a thread continues to come down too.

The next video covers parallel architectures. There are even more different architectures than processors. This is why the systems described here are generic and covered at the block diagram level. The goal is to help understand at a high level what the differences between an SMP system, a single/multi core cluster and a cc-NUMA system are.

Video number 4 is another conceptual talk, but with an exclusive focus on parallel programming. This is where you'll find more about topics like threads, parallel overheads and Amdah'ls law.

By then, all necessary information to get started writing a parallel program has been covered and it is high time to dig deeper into parallel programming.

The fifth video covers the Message Passing Interface (MPI) in some detail. This is a widely used distributed memory model, targeting a cluster of systems.  MPI has been around for quite some time, but is still alive and kicking. Many recent other distributed memory programming models (e.g. CUDA) rely on the same principles. This is why the information presented here is of more general use than to those only interested in using MPI.

The next video is about shared memory parallel programming, starting with automatic parallelization. Often overlooked, but absolutely worth giving a try. The mechanism is simply activated by using the appropriate option on the compiler. Success or failure depends on many factors, but compiler technology continues to improve and can handle increasingly complex code structures.

OpenMP is  the second shared memory model covered in this video. It is a mature and easy to use directive based model to explicitly parallelize applications for multicore based systems. Due to the multicore revolution, interest in OpenMP has never been as strong as it is today.

Clusters of multicore based systems are more and more common. The question is how to program them. This is where the Hybrid Parallel Programming model comes into the picture. It is the topic of the short 7-th video. With a Hybrid model, a distributed memory model (like MPI) is used to parallelize the application across the nodes of the cluster. Within the node one can either use MPI (or a similar model) again, but since the overhead of such models tends to be relatively high, a more lightweight model like OpenMP (or a native threading model like POSIX threads) is often more efficient.

This last video only touches upon this important and interesting topic. Those interested to learn much more about it may want to read the appropriate chapters in my white paper on parallel programming.

The title of this 7-th talk includes "What's Next?". It is very hard to predict what's down the road two or more years from now, but it is very safe to assume that parallel computing is here to stay and will only get more interesting. Anybody developing software is strongly advised to look into parallel programming as a way to enhance the performance of his or her application.

I would like to acknowledge Richard Friedman and Steve Kimmey at Oracle, as well as Deirdré Straughan (now at Joyent) for their support creating these videos and to make them available to a wide audience.

Sunday May 09, 2010

Parallel Programming with Oracle Developer Tools

Multicore? Threads? Parallel Programming Models? OpenMP? MPI? Parallel Architectures? Amdahl's Law? Efficiency? Overhead?

If you're interested in what these mean, plus other topics fundamental to parallel programming, read on!

With great pleasure I announce the availability of a comprehensive technical white paper, titled "Parallel Programming with Oracle Developer Tools". It targets the developer new to parallel programming. No background in this topic is assumed. The paper is available through the Oracle Solaris Studio web page and can also be downloaded directly here

Quite often I get asked how to get started with parallel programming. There is a lot of training material in the form of books, online tutorials and blogs available, but most of these focus on a few specific and specialized topics only. Where and how to begin can therefore be an overwhelming problem to the developer who is interested to apply parallelization as a way to further enhance the performance of his or her application.

For a number of years I've given talks that cover the various aspects of parallel programming, targeting the developer who wants to learn more about this topic. What was missing was a write up of these talks. To address this gap, I started working on a comprehensive technical white paper on the basics of parallel programming and am very glad it is out now. The paper will help you to get started with parallel programming and along the way you'll learn how to use the Oracle Solaris Studio Compilers and Tools to get the job done.

I would like to encourage you to download and read the paper, but for those that like to get more detail on the contents first, a fairly extensive summary of the contents can be found below. 

Enjoy the paper and I welcome your feedback!

Summary of "Parallel Programming with Oracle Developer Tools"

The paper starts with a brief overview of multicore technology. This is after all what drives the increased and more widespread interest in parallel computing.

In the next chapter, some important terminology is explained. Since it plays such a crucial role in parallel programming, the concept of a "thread" is covered first. The goal of parallelization is to reduce the execution time of an application. It is the next topic and may seem trivial, but I found that not everybody is aware of the fact that a performance improvement is not a given. Here, the stage is set for a more extensive discussion on parallel performance in a later chapter. The chapter concludes with a definition of parallelization.

The chapter following is about parallel architectures. One can develop and run a parallel program on any computer, even on one with a single core only, but clearly multiple cores are needed if a performance gain is to be expected. Here an overview is given of the types of basic parallel architectures available today.

The choice of a specific platform not only affects the performance, but to some extent is also determined by the parallel programming model chosen to implement the parallelism. That is the topic of the next chapter.

There are many ways to implement parallelism in an application. In order to do so, one has to select a parallel programming model. This choice is driven by several factors, including the programming language used, portability, the type of application and parallelism, the target architecture(s) and personal preferences.

An important distinction is whether the parallel application is to be run on a single parallel computer system ("shared memory'), or across a cluster of systems ("distributed memory"). This choice has a profound impact on the choice of a programming model, since only a few models support a cluster architecture.

In the chapter, several programming models for both types of architectures are presented and discussed, but by no means is this an extensive overview of the entire field. That is a big topic in itself and beyond the scope of the paper. 

The more in-depth part of the chapter starts with Automatic Parallelization by the compiler. Through a compiler switch, the user requests the compiler to identify those parts in the program that can be parallelized. If such an opportunity is found, the compiler generates the parallel code for the user and no extra effort is needed. The Oracle Solaris Studio compilers support this feature.

We then zoom in on OpenMP for shared memory and MPI for distributed memory programming. These are explicit programming models to parallelize an application and have been selected because they are the dominant parallel programming models in technical computing. They are however strong candidates to parallelize other types of applications too. 

The chapter concludes with the Hybrid programming model, combining two parallel programming models. For example, MPI is used to parallelize the application at a fairly high level. The more fine grained parts are then further parallelized with OpenMP, or another shared memory model. In certain cases this is a natural way to parallelize an application. The Hybrid model also provides a natural fit for today's systems, since many consist of a cluster with multicore nodes.

The next chapter is very extensive and covers an example in great detail. The computation of the average of a set of numbers was chosen, since this is a real world type of operation and parallelizing it is not entirely straightforward.

In the first section, Automatic Parallelization by the Oracle Solaris Studio compilers is introduced and demonstrated on this example. The compiler is able to identify the parallelism in this computation and generates a parallel binary without the need for the user to do anything, other than using some compiler switches (-xautopar and -xreduction to be more precise).

Next, a general strategy how to explicitly parallelize this computation is given. This provides the general framework for the various parallel implementations.

Using this framework, the parallel version of the computation is then implemented using OpenMP, MPI and the MPI+OpenMP Hybrid model. Full source code for all 3 implementations is shown and discussed in great detail. Throughout, it is demonstrated how the Oracle Solaris Studio compilers and the Oracle Message Passing Toolkit can be used to compile and run these parallel versions.

Now that the parallel program has been designed and implemented, it is time to consider other, more advanced, aspects of parallel computing. These topics are covered in the next chapter. They are not needed to get started, but are important enough to read up on. For example, how parallelization may affect the round off behavior in case floating-point numbers are used.

The majority of the chapter is however dedicated to performance, since that is after all the goal of parallelization. In addition to parallel overheads, parallel speed up and efficiency, Amdahl's Law is derived and discussed in quite some detail. This formula plays a very important role to understand the measured performance of a parallel application. As shown, it can also be used to assert how well the application has been parallelized and what performance to expect when increasing the number of threads.

The last chapter presents performance results obtained with the 3 different parallel versions discussed earlier. It is demonstrated how the performance of these various implementations depends on the number of threads.

Tuesday Oct 09, 2007

How to evaluate the performance of the heavily threaded UltraSPARC T2 multicore processor?

Some time ago I had the pleasure to have access to an early engineering system with the UltraSPARC T2 processor. I used this opportunity to run my private PEAS (Performance Evaluation Application Suite) test suite.

It turned out that my findings on throughput benchmarks conducted with this suite revealed some interesting aspects of the architecture. In this blog I'll write about that, but a word of warning is also in place. This is rather early work. I'd like to see it as a start and plan to gather and analyze more results in the near future.

PEAS currently consists of 20 technical-scientific applications written in Fortran and C. These are all single threaded user programs, or computational kernels derived from the real application. In this sense it very well reflects what our users are typically running.

PEAS has been used in the past to evaluate the performance of upcoming processors. Access to an early system provided a good opportunity to take a closer first look at the performance of the UltraSPARC T2 processor.

PEAS is about throughput. I typically run all 20 jobs simultaneously. I call this a "STREAM", which by the way has no relationship with the Streams benchmark. It is just a name I picked quite some time ago.

I can of course run multiple STREAMs simultaneously, cranking up the load. Due to the limitations in the system I had access to, I could only run one and two STREAMS, but this still means I was running 20 and 40 jobs simultaneously. The maximum memory usage on the system was around 6 GB per STREAM.

The question is how to interpret the performance results. To this end, I use the single core performances as a reference. Prior to running my throughput tests, I run all jobs sequentially. For each job this gives me a reference timing. With these timings I can then estimate what the elapsed time of a throughput experiment will be.

Let me give a simple example to illustrate this. Let's say I have two programs, A and B. The single core elapsed times for A and B are 100 and 200 seconds respectively. If I run A and B simultaneously on one core, the respective elapsed times are then 200 and 300 seconds. This is because for 200 seconds, both A and B execute on a single core simultaneously and therefore get 50% of the core on average. After these 200 seconds, B still has 100 more seconds to go. Because A has finished, B has 100% of the core to itself and therefore needs an additional 100 seconds to finish.

This idea can easily be extended to multiple cores.

In the past I've used this approach to evaluate the performance of more conventional processor designs. Given the estimates assume ideal hardware and software behavior, the measured values were typically higher than what I estimated. This is actually what one would expect.

When I applied the same methodology to my results on the UltraSPARC T2 processor however, a big difference showed up. The measured times were actually better than what I estimated! This is exactly the opposite of what one might expect.

The explanation is that the threading within one core increases the capacity of that core.

The question is how to attach a number to that. In other words, how much computational capacity does an UltraSPARC T2 processor represent?

To answer this question I introduce what I call a "Core Equivalent", or "CE" for short. A CE is a core that has the capacity of the core used, but without the additional threading.

On multicore designs without additional hardware threading, a CE then corresponds to the core; there is a one to one mapping between the two.

On threaded designs, like UltraSPARC T2, a CE might be more than one core. The question is how much more.

This leads me to introduce the following metric. Define the Average(CE) metric as follows:

Average(CE) := sum{j = 1 to 20}(measured(j)-estimated(j,CE))/measured(j)

It is easy to see that this function increases monotonically as a function of the CE. I then define the "best" CE as the CE for which Average(CE) has the smallest positive value.

The motivation for this metric is that I compute the average over the relative differences of the measured versus the estimated elapsed times. Of course there could be cancellations, but as long as this average is negative, I underestimate the total throughput capacity of the system.

I applied this metric to my PEAS suite and found one STREAM (20 jobs executed simultaneously) to deliver 15 CEs, with Average(15) = 0.56%. For two STREAMs (40 simultaneous jobs), the throughput corresponds to 25 CEs with Average(25) = 1.21%.

In a subsequent blog I'll give more details so you can see for yourself, but these percentages indicate there is a really good match between the measured and estimated timings.

These numbers clearly reflect the observation that UltraSPARC T2 delivers more performance than 8 non-threaded cores. It is interesting to note that the number goes up when increasing the load. This suggests a higher workload might even give a higher CE value.

As mentioned in the introduction, this is early work. In a future blog I'll go into much more detail regarding the above. I also plan to gather more results. In particular I would like to see how the number of CEs changes if I increase the load.

So, stay tuned!


Picture of Ruud

Ruud van der Pas is a Senior Principal Software Engineer in the SPARC Microelectronics organization at Oracle. His focus is on application performance, both for single threaded, as well as for multi-threaded programs. He is also co-author on the book Using OpenMP

Cover of the Using OpenMP book


« July 2016