Wednesday Jan 16, 2008

Throughput computing series: System Concurrency and Parallelism

Most environments have some open source SW that is used as part of the application stack. Depending on the packages, this can take a fair amount of time to configure and compile. To speed the install process, parallelism can easily be used to take advantage of the throughput of CMT servers.

Let us consider the following five open source packages:
  • httpd-2.2.6
  • mysql-5.1.22-rc
  • perl-5.10.0
  • postgresql-8.2.4
  • ruby-1.8.6

The following experiments time the installation of these packages in serial, parallel, and concurrent fashion.

Parallel builds

After the "configure" phase is complete, these packages are all compiled using gmake. This is where parallelism within each job can be used to speed the install process. By using the "gmake -j" option, the level of parallelism can be specified for each of the packages. This can dramatically improve the overall compile time as seen below.

compile time *without* concurrency
    • Jobs were run in a serial fashion but with parallelism within the job itself.
    • 79% reduction in compile time at 32 threads/job.
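
As a rough illustration of the serial-but-parallel case, here is a minimal sketch. It assumes each package is already unpacked and configured in a directory named after the package. The names build_pkg, MAKE, and DRY_RUN are my own placeholders, not part of the original experiment, and DRY_RUN is set so the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: build already-configured packages one after another,
# with parallelism inside each job via "gmake -j".
# build_pkg, MAKE and DRY_RUN are hypothetical names for this example.
MAKE=${MAKE:-gmake}        # GNU make; often installed as plain "make" on Linux

build_pkg() {
    dir=$1                 # source directory, e.g. httpd-2.2.6
    jobs=$2                # number of parallel jobs passed to -j
    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        echo "cd $dir && $MAKE -j $jobs"
        return 0
    fi
    ( cd "$dir" && "$MAKE" -j "$jobs" )
}

# Serial order across packages, 32 jobs within each build.
DRY_RUN=1
for pkg in httpd-2.2.6 mysql-5.1.22-rc perl-5.10.0 postgresql-8.2.4 ruby-1.8.6; do
    build_pkg "$pkg" 32
done
```

Unsetting DRY_RUN would run the real builds, provided the source trees exist and GNU make is on the path.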

Concurrency and Parallelism

Not every package's build process can be parallelized perfectly. In fact, the best overall gain for any of the packages is 6x with 32 threads. This is where concurrency comes into play. If we start all the compiles at the same time and use parallelism within each one as well, the overall build time drops further.
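
Starting all the compiles at once can be sketched with background jobs and "wait". This is only an illustration: run_concurrent and fake_build are hypothetical names, and fake_build stands in for the real "gmake -j" step so the example can be tried without the source trees.

```shell
#!/bin/sh
# Sketch only: start one background job per package, then wait for all.
# run_concurrent and fake_build are hypothetical helpers for this example.
run_concurrent() {
    cmd=$1; shift
    pids=""
    for pkg in "$@"; do
        $cmd "$pkg" &               # launch this package's build in the background
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1     # remember if any build failed
    done
    return $status
}

# Stand-in for "cd $pkg && gmake -j 32": it only reports the package name.
fake_build() { echo "built $1"; }

run_concurrent fake_build httpd-2.2.6 mysql-5.1.22-rc perl-5.10.0 postgresql-8.2.4 ruby-1.8.6
```

The order of the output lines is not fixed, since the jobs run at the same time. Replacing fake_build with a function that runs the real build gives concurrency across packages plus parallelism within each one.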

compile times with concurrency and parallelism
    • All 5 jobs were run concurrently with 1 and 32 (threads/job).
    • 88% overall reduction in compile time from serial to parallel with concurrency.
    • 42% reduction in compile time over parallel jobs run serially.
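
The three percentages above are consistent with each other, which a quick calculation shows. This sketch assumes the 42% reduction applies on top of the 79% reduction; combined_reduction is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: combine two successive reductions (in percent) into one.
# combined_reduction is a hypothetical helper for this example.
combined_reduction() {
    # remaining time = (1 - a) * (1 - b); overall reduction = 1 - remaining
    awk -v a="$1" -v b="$2" \
        'BEGIN { printf "%.0f\n", 100 * (1 - (1 - a/100) * (1 - b/100)) }'
}

combined_reduction 79 42    # prints 88
```

In other words, 21% of the serial time remains after the parallel builds, and 58% of that remains once the jobs also run concurrently, which leaves about 12% of the original time.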

Load it up!

Hopefully, this helps describe how to achieve better system throughput through parallelism and concurrency. Sun's CMT servers are multi-threaded machines which are capable of a high level of throughput. Whether you are building packages from source or installing pre-built packages, you have to load up the machine to see throughput.

Monday Jan 14, 2008

Throughput computing series: Parallel commands (pbzip2)

In this installment of the throughput computing series, I will explore how to get parallelism from the system point of view. The system administrator who first begins to configure the system will start forming impressions from the moment the shrink wrap comes off the server. First impressions and potential parallel options will be explored in this entry.

Off with the shrink-wrap... on with the install

Unfortunately, most installation processes involve a fair number of single-threaded procedures. As mentioned before, the CMT processor is designed to optimize the overall throughput of a server - often to the detriment of single-threaded processes. There are several schools of thought on this one. The first is: why bother? The install process happens but once, so it really doesn't matter. That is true for most typical environments. But the current trend toward grid computing and virtualization often makes "time to provision" a critical factor. To help speed provisioning, there are some things that can be done by using parallelized commands and concurrency.

pbzip2 to the rescue

A very common time-consuming part of provisioning is the packing/unpacking of SW packages. Commonly, gzip or bzip2 is used to unpack data and packages, but neither is a parallel program. Fortunately, a parallel version of bzip2 is available. "pbzip2" allows you to specify the level of parallelism in order to speed the compression/decompression process.
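
For unpacking, one possible approach is to decompress with pbzip2 and pipe the result to tar. The sketch below is an assumption about how this might be wired up, not a description of any particular installer. pick_bunzip, unpack, and the archive name are hypothetical, and the script falls back to plain bzip2 when pbzip2 is not installed.

```shell
#!/bin/sh
# Sketch only: choose a decompressor, preferring pbzip2 when available.
# pick_bunzip and unpack are hypothetical helpers for this example.
pick_bunzip() {
    threads=$1
    if command -v pbzip2 >/dev/null 2>&1; then
        echo "pbzip2 -d -c -p$threads"    # -p sets the number of processors
    else
        echo "bzip2 -d -c"                # single-threaded fallback
    fi
}

unpack() {
    archive=$1
    threads=$2
    # Decompress to stdout and let tar extract from the pipe.
    $(pick_bunzip "$threads") "$archive" | tar xf -
}

# Example (not run here): unpack package.tar.bz2 8
```

Note that this only parallelizes the decompression step; tar itself still extracts files one at a time.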

I spent a little time experimenting with the pbzip2 program after repeated interactions that always seemed to come back to "gzip" performance. I decided to do some quick benchmarks with pbzip2 using both the T2000 (8 cores @ 1.4GHz) and the v20z (2 AMD cores @ 2.2GHz).

pbzip2 benchmark

The setup used a 135M text file. This file was the trade_history.txt created using the egen program distributed by the TPC council for the TPC-E benchmark. This file was compressed using the following simple test script:
    # ksh: time compression and decompression at each thread count
    for i in 1 2 4 8 16 32
    do
      print "pbzip2 compress: ${i} threads\n"
      timex pbzip2 -p${i} small.txt
      print "pbzip2 decompress: ${i} threads\n"
      timex pbzip2 -d -p${i} small.txt.bz2
    done

T2000 pbzip2 throughput

At lower thread counts, the v20z with two AMD cores does better. This is expected since the AMD x64 processor is optimized for single-threaded performance. But as you crank up the thread count, the T2000 starts to really shine. This demonstrates my main point: to push massive throughput within a single application, you need lots of threads and parallelism.

    ...The next entry will explore how concurrency and parallelism can help improve build times.

This blog discusses performance topics as they apply to Sun servers. The main focus is on database performance and architecture, but other topics can and will creep in.

