Friday Jun 19, 2009

SSDs in HPC: Reducing the I/O Bottleneck BluePrint Best Practices

The performance of High-Performance Computing (HPC) applications can be dramatically increased simply by using SSDs instead of traditional hard drives. To read about these findings, see the Sun BluePrint by Larry McIntosh and Michael Burke, "Solid State Drives in HPC: Reducing the I/O Bottleneck".

There was a BestPerf blog posting on the NASTRAN/SSD results at:
http://blogs.sun.com/BestPerf/entry/sun_fire_x2270_msc_nastran

Our BestPerf authors will blog about more of their recent benchmarks in the coming weeks.

Tuesday Jun 16, 2009

Sun Fire X2270 MSC/Nastran Vendor_2008 Benchmarks

Significance of Results

The I/O intensive MSC/Nastran Vendor_2008 benchmark test suite was used to compare the performance on a Sun Fire X2270 server when using SSDs internally instead of HDDs.

The effect on performance of increasing memory to augment I/O caching was also examined. The Sun Fire X2270 server was equipped with Intel QC Xeon X5570 processors (Nehalem). On these Nehalem processors, the positive effect of adding memory to increase I/O caching is offset to some degree by the reduction in memory frequency that occurs when additional DIMM bays are populated on each memory channel of each CPU socket.

  • SSDs can significantly improve NASTRAN performance, especially on runs with larger core counts.
  • Additional memory in the server can also increase performance. However, in some systems additional memory lowers the memory frequency, which may offset the benefits of increased capacity.
  • If SSDs are not used, striped disks will often improve the performance of I/O-bound MCAE applications.
  • To obtain the highest performance, it is recommended that SSDs be used and that servers be configured with the largest memory possible without decreasing memory frequency. One should always look at the workload characteristics and compare against this benchmark to correctly set expectations.

SSD vs. HDD Performance

The performance of two striped 30GB SSDs was compared to two striped 7200 rpm 500GB SATA drives on a Sun Fire X2270 server.

  • At the 8-core level (maximum cores for a single node) SSDs were 2.2x faster for the larger xxocmd2 and the smaller xlotdf1 cases.
  • For 1-core results SSDs are up to 3% faster.
  • On the smaller mdomdf1 test case there was no increase in performance on the 1-, 2-, and 4-core configurations.
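
The speedup figures above are ratios of elapsed times. As a minimal sketch, the following Python snippet recomputes them from the 24GB, 8-core elapsed times listed in the Performance Landscape table of this post. The function name `speedup` and the `runs` dictionary are illustrative names only, not part of any benchmark tooling.

```python
def speedup(hdd_seconds: float, ssd_seconds: float) -> float:
    """Return how many times faster the SSD run is than the HDD run."""
    return hdd_seconds / ssd_seconds


# 8-core elapsed times in seconds at 24GB: (2 x striped SATA, 2 x striped SSD)
runs = {
    "xxocmd2": (1554, 711),
    "xlotdf1": (1562, 712),
}

for name, (hdd, ssd) in runs.items():
    print(f"{name}: {speedup(hdd, ssd):.2f}x faster with SSDs")
```

Both cases come out close to the 2.2x quoted above.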

Performance Enhancement with I/O Memory Caching

Performance for Nastran can often be increased by adding memory, which provides additional in-core space to cache I/O and thereby reduces the I/O demands.

The main memory was doubled from 24GB to 48GB. At the 24GB level one 4GB DIMM was placed in the first bay of each of the 3 CPU memory channels on each of the two CPU sockets on the Sun Fire X2270 platform. This configuration allows a memory frequency of 1333MHz.

At the 48GB level a second 4GB DIMM was placed in the second bay of each of the 3 CPU memory channels on each socket. This reduces the memory frequency to 1066MHz.
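
The two memory configurations follow directly from the DIMM layout described above. A small sketch of the arithmetic, in Python (the constant and function names are ours; the frequency mapping simply records what was observed on this platform and is not a general rule for other systems):

```python
SOCKETS = 2               # CPU sockets on the Sun Fire X2270
CHANNELS_PER_SOCKET = 3   # memory channels per socket
DIMM_GB = 4               # size of each DIMM used in this study

# Memory frequency reported in this study for each DIMMs-per-channel count.
FREQ_MHZ = {1: 1333, 2: 1066}


def total_memory_gb(dimms_per_channel: int) -> int:
    """Total memory when every channel holds the same number of DIMMs."""
    return SOCKETS * CHANNELS_PER_SOCKET * dimms_per_channel * DIMM_GB


for dpc in (1, 2):
    print(f"{dpc} DIMM(s) per channel: {total_memory_gb(dpc)} GB "
          f"at {FREQ_MHZ[dpc]} MHz")
```

Doubling capacity in this way costs roughly 20% of the memory frequency, which is the trade-off examined in the results that follow.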

Adding Memory With HDDs (SATA)

  • The additional server memory increased the performance when running with the slower SATA drives at the higher core levels (e.g. 4 and 8 cores on a single node).
  • At the maximum 8-core level on a single system, the smaller xlotdf1 case was 42% faster and the larger xxocmd2 case was 32% faster (48GB/24GB elapsed-time ratios of 0.58 and 0.68 in the table below).
  • The special I/O intensive getrag case was 8% faster at the 1-core level.

Adding Memory With SSDs

  • At the maximum 8-core level (for a single node) the larger xxocmd2 case was 47% faster in overall run time.
  • The effects were much smaller at lower core counts. In the tests at the 1-core level, most test cases ran from 5% to 14% slower, with the slower CPU memory frequency dominating over the added in-core space available for I/O caching vs. direct transfer to SSD.
  • Only the special I/O intensive getrag case was an exception running 6% faster at the 1-core level.
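
The percentages in these lists appear to be derived from the 48GB/24GB elapsed-time ratios in the table below, as one minus the ratio. A short Python sketch of that conversion, under that assumption (the function name `percent_faster` is hypothetical):

```python
def percent_faster(time_ratio: float) -> float:
    """Convert an elapsed-time ratio (new/old) to a percent reduction in run time.

    A negative result means the new configuration is slower.
    """
    return (1.0 - time_ratio) * 100.0


# 48GB/24GB ratios taken from the SSD columns of the table in this post.
print(round(percent_faster(0.53)))  # xxocmd2, 8 cores
print(round(percent_faster(0.94)))  # getrag, 1 core
print(round(percent_faster(1.14)))  # xlotdf1, 1 core (slower with 48GB)
```

Note that this is a reduction in elapsed time, which is not the same number as a throughput speedup: a ratio of 0.53 is 47% less run time but about 1.9x the speed.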

Increasing performance with Two Striped (SATA) Drives

The performance of multiple striped drives was also compared to a single drive. The study compared two striped internal 7200 rpm 500GB SATA drives to a single internal SATA drive.

  • On a single node with 8 cores, the largest test xx0cmd2 was 40% faster and a smaller test case xl0tdf1 was 33% faster. Even the smallest test case mdomdf1 was 12% faster (at 4 cores, the highest core count reported for it).

  • On 1-core the added boost in performance with striped disks was from 4% to 13% on the various test cases.

  • On 1-core the special I/O-intensive test case getrag was 29% faster.

Performance Landscape

Times in table are elapsed time (sec).

MSC/Nastran Vendor_2008 Benchmark Test Suite
Sun Fire X2270, 2 x X5570 QC 2.93 GHz

                               ----------- 2 x 7200 RPM SATA HDDs -----------    -------- 2 x SSDs --------
Test                  Cores     48 GB     24 GB     24 GB     Ratio     Ratio     48 GB     24 GB     Ratio        Ratio
                              1067MHz    2 SATA    1 SATA  (2xSATA)   2xSATA/   1067MHz   1333MHz 48GB/24GB       (24GB)
                                        1333MHz   1333MHz 48GB/24GB    1xSATA                               2xSATA/2xSSD

vlosst1                   1       133       127       134      1.05      0.95       133       126      1.05         1.01

xxocmd2                   1       946       895       978      1.06      0.87       947       884      1.07         1.01
                          2       622       614       703      1.01      0.87       600       583      1.03         1.05
                          4       466       631       991      0.74      0.64       426       404      1.05         1.56
                          8      1049      1554      2590      0.68      0.60       381       711      0.53         2.18

xlotdf1                   1      2226      2000      2081      1.11      0.96      2214      1939      1.14         1.03
                          2      1307      1240      1308      1.05      0.95      1315      1189      1.10         1.04
                          4       858       833      1030      1.03      0.81       744       751      0.99         1.11
                          8       912      1562      2336      0.58      0.67       674       712      0.95         2.19

xloimf1                   1      1216      1151      1236      1.06      0.93      1228      1290      0.95         0.89

mdomdf1                   1       987       913       983      1.08      0.93       987       911      1.08         1.00
                          2       524       485       520      1.08      0.93       524       484      1.08         1.00
                          4       270       237       269      1.14      0.88       270       250      1.08         0.95

Sol400_1 (xl1fn40_1)      1      2555      2479      2674      1.03      0.93      2549      2402      1.06         1.03

Sol400_S (xl1fn40_S)      1      2450      2302      2481      1.06      0.93      2449      2262      1.08         1.02

getrag (xx0xst0)          1       778       843      1178      0.92      0.71       771       817      0.94         1.03

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X2270
      1 2-socket rack mounted server
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      2 x internal striped SSDs
      2 x internal striped 7200 rpm 500GB SATA drives

Software Configuration:

    O/S: Linux 64-bit SUSE SLES 10 SP 2
    Application: MSC/NASTRAN MD 2008
    Benchmark: MSC/NASTRAN Vendor_2008 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64
    Voltaire OFED-5.1.3.1_5 GridStack for SLES 10

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

  • As of Summer '08 there is an official Solaris X64 version of the MD Nastran 2008 system that is certified and maintained.
  • The memory requirements for the test cases in the new MSC/Nastran Vendor 2008 benchmark test suite range from a few hundred megabytes to no more than 5 GB.

Please go here for a more complete description of the tests.

Key Points and Best Practices

For more on Best Practices of SSD on HPC applications also see the Sun Blueprint:
http://wikis.sun.com/display/BluePrints/Solid+State+Drives+in+HPC+-+Reducing+the+IO+Bottleneck

Additional information on the MSC/Nastran Vendor 2008 benchmark test suite:

  • Based on the maximum physical memory on a platform the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux based systems where the platform has a large amount of memory and where the model does not have large scratch I/O requirements the memory can be allocated to a tmpfs scratch space file system. On Solaris X64 systems advantage can be taken of ZFS for higher I/O performance.

  • The MSC/Nastran Vendor 2008 test cases don't scale very well: a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component, where 15% to 25% of the total run time is associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. Performance will be enhanced by using the fastest available drives and striping more than one of them together, or by using a high performance disk storage system. It can be further enhanced, as indicated here, by implementing the Lustre based I/O system. High performance interconnects such as InfiniBand, used for inter-node cluster message passing as well as for I/O transfer from the storage system, can also enhance performance substantially.
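
As a rough illustration of the tmpfs point above, the following Python sketch checks whether a job's scratch files could plausibly live in memory-backed scratch space. This is only a rule of thumb under stated assumptions: the function name, the `reserve_gb` headroom for the operating system, and the simple subtraction are all hypothetical and are not taken from MSC documentation.

```python
def fits_in_tmpfs(physical_gb: float, job_mem_gb: float,
                  scratch_gb: float, reserve_gb: float = 2.0) -> bool:
    """Return True if scratch could fit in memory left over after the job.

    physical_gb: installed memory on the node
    job_mem_gb:  memory given to the Nastran job (the mem= value)
    scratch_gb:  expected scratch file size for the model
    reserve_gb:  assumed headroom for the OS and other processes
    """
    free_for_scratch = physical_gb - job_mem_gb - reserve_gb
    return scratch_gb <= free_for_scratch


# 48 GB node, 5 GB job memory (the upper end quoted for this suite).
print(fits_in_tmpfs(48, 5, 1))     # smallest scratch size mentioned above
print(fits_in_tmpfs(48, 5, 140))   # largest scratch size mentioned above
```

Models at the large end of the scratch range clearly need SSDs, striped disks, or an external storage system instead.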

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MSC/Nastran Vendor 2008 results from http://www.mscsoftware.com and this report as of June 9, 2009.

Tuesday Jun 09, 2009

Free Compiler Wins Nehalem Race by 2x

Not the Free Compiler That You Thought, No, This Other One.

Nehalem performance measured with several software configurations

Contributed by: John Henning and Karsten Guthridge

Introduction


The GNU C Compiler, GCC, is popular, widely available, and an exemplary collaborative effort.

But how does it do for performance -- for example, on Intel's latest hot "Nehalem" processor family? How does it compare to the freely available Sun Studio compiler?

Using the SPEC CPU benchmarks, we take a look at this question. These benchmarks depend primarily on performance of the chip, the memory hierarchy, and the compiler. By holding the first two of these constant, it is possible to focus in on compiler contributions to performance.

Current Record Holder

The current SPEC CPU2006 floating point speed record holder is the Sun Blade X6270 server module. Using 2x Intel Xeon X5570 processor chips and 24 GB of DDR3-1333 memory, it delivers a result of 45.0 SPECfp_base2006 and 50.4 SPECfp2006. [1]

We used this same blade system to compare GCC vs. Studio. On separate, but same-model disks, this software was installed:

  • SuSE Linux Enterprise Server 11.0 (x86_64) and GCC V4.4.0 built with gmp-4.3.1 and mpfr-2.4.1
  • OpenSolaris2008.11 and Sun Studio 12 Update 1

Tie One Hand Behind Studio's Back

In order to make the comparison more fair to GCC, we took several steps.

  1. We simplified the tuning for the OpenSolaris/Sun Studio configuration. This was done in order to counter the criticism that one sometimes hears that SPEC benchmarks have overly aggressive tuning. Benchmarks were optimized with a reasonably short tuning string:

    For all:  -fast -xipo=2 -m64 -xvector=simd -xautopar
    For C++, add:  -library=stlport4
  2. Recall that SPEC CPU2006 allows two kinds of tuning: "base", and "peak". The base metrics require that all benchmarks of a given language use the same tuning. The peak metrics allow individual benchmarks to have differing tuning, and more aggressive optimizations, such as compiler feedback. The simplified Studio configuration used only the less aggressive base tuning.

Both of the above changes limited the performance of Sun Studio.  Several measures were used to increase the performance of GCC:

  1. We tested the latest released version of GCC, 4.4.0, which was announced on 21 April 2009. In our testing, GCC 4.4.0 provides about 10% better overall floating point performance than V4.3.2. Note that GCC 4.4.0 is more recent than the compiler that is included with recent Linux distributions such as SuSE 11, which includes 4.3.2; or Ubuntu 8.10, which updates to 4.3.2 when one does "apt-get install gcc". It was installed with the math libraries mpfr 2.4.1 and gmp 4.3.1, which are labeled as the latest releases as of 1 June 2009.

  2. A tuning effort was undertaken with GCC, including testing of -O2 -O3 -fprefetch-loop-arrays -funroll-all-loops -ffast-math -fno-strict-aliasing -ftree-loop-distribution -fwhole-program -combine and -fipa-struct-reorg

  3. Eventually, we settled on this tuning string for GCC base:

    For all:  -O3 -m64 -mtune=core2 -msse4.2 -march=core2
    -fprefetch-loop-arrays -funroll-all-loops
    -Wl,-z common-page-size=2M
    For C++, add:  -ffast-math

    Only the C++ benchmarks used -ffast-math because 435.gromacs, which uses C and Fortran, fails validation with this flag. (Note: we verified that the benchmarks successfully obtained 2MB pages.)
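
SPEC's base rules require one tuning string per language, so the GCC base configuration above amounts to a common flag set plus a per-language addition. A minimal Python sketch of that structure follows; the flag text is copied verbatim from this post, while the names `GCC_BASE_ALL`, `GCC_BASE_EXTRA`, and `gcc_base_flags` are illustrative and do not come from the SPEC tools or the actual config file.

```python
# Flags applied to every benchmark, as quoted in this post.
GCC_BASE_ALL = (
    "-O3 -m64 -mtune=core2 -msse4.2 -march=core2 "
    "-fprefetch-loop-arrays -funroll-all-loops "
    "-Wl,-z common-page-size=2M"
)

# Per-language additions; only C++ gets -ffast-math in the base tuning.
GCC_BASE_EXTRA = {"c++": "-ffast-math"}


def gcc_base_flags(language: str) -> str:
    """Return the base tuning string for one language (same for all its benchmarks)."""
    extra = GCC_BASE_EXTRA.get(language.lower(), "")
    return f"{GCC_BASE_ALL} {extra}".strip()


for lang in ("C", "C++", "Fortran"):
    print(f"{lang}: {gcc_base_flags(lang)}")
```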

Studio wins by 2x, even with one hand tied behind its back

At this point, a fair base-to-base comparison can be made, and Sun Studio/OpenSolaris finishes the race while GCC/Linux is still looking for its glasses: 44.8 vs. 21.1 (see Table 1). Notice that Sun Studio provides more than 2x the performance of GCC.

Table 1: Initial Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                                       Base   Peak
Industry FP Record (Sun Studio 12 Update 1, OpenSolaris 2008.11)    45.0   50.4
Studio/OpenSolaris: simplify above (less tuned)                     44.8
GCC V4.4 / SuSE Linux 11                                            21.1

Notes: All results reported are from rule-compliant, "reportable" runs of the SPEC CPU2006 floating point suite, CFP2006. "Base" indicates the metric SPECfp_base2006. "Peak" indicates SPECfp2006. Peak uses the same benchmarks and workloads as base, but allows more aggressive tuning. A base result, may, optionally, be quoted as peak, but the converse is not allowed. For details, see SPEC's Readme1st.

Fair? Did you say "Fair"?

Wait, wait, the reader may protest - this is all very unfair to GCC, because the Studio result used all 8 cores on this 2-chip system, whereas GCC used only one core! You're using trickery!

To this plaintive whine, we respond that:

  • Compiler auto-parallelization technology is not a trick. Rather, it is an essential technology in order to get the best performance from today's multi-core systems. Nearly all contemporary CPU chips provide support for multiple cores. Compilers should do everything possible to make it easy to take advantage of these resources.

  • We tried to use more than one core for GCC, via the -ftree-parallelize-loops=n flag. GCC's autoparallelization appears to be in a much earlier development stage than Studio's, since we did not observe any improvement for any of the values of "n" that we tested. From the GCC wiki, it appears that a new autoparallelization effort is under development, which may improve its results at a later time.

  • But, all right, if you insist, we will make things even harder for Studio, and see how it does.

Tie Another Hand Behind Studio's Back

The earlier section mentioned various ways in which the performance comparison had been made easier for GCC. We took these additional measures:

  1. Removed the autoparallelization from Studio, substituting instead a request for 2MB pagesizes (which the GCC tuning already had).

  2. Added "peak" tuning to GCC: for benchmarks that benefit, add -ffast-math, and compiler profile-driven feedback

At this point, Studio base beats GCC base by 38%, and Studio base beats GCC peak by more than 25% (see table 2).

Table 2: Additional Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                   Base   Peak
Sun Studio/OpenSolaris: base only, noautopar    29.1
GCC V4.4 / SuSE Linux 11                        21.1   23.1

The notes from Table 1 apply here as well.

Bottom line

The freely available Sun Studio 12 Update 1 compiler on OpenSolaris provides more than double the performance of GCC V4.4 on SuSE Linux, as measured by SPECfp_base2006.

If compilation is restricted to avoid using autoparallelization, Sun Studio still wins by 38% (base to base), or by more than 25% (Studio base vs. GCC peak).
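
For readers who want to check the arithmetic, here is a small Python sketch that recomputes these margins from the scores in Tables 1 and 2 (the dictionary keys and the `advantage_percent` helper are our own labels, not SPEC terminology):

```python
# SPECfp2006 scores as reported in Tables 1 and 2 of this post.
scores = {
    "studio_base_autopar": 44.8,
    "studio_base_noautopar": 29.1,
    "gcc_base": 21.1,
    "gcc_peak": 23.1,
}


def advantage_percent(a: float, b: float) -> float:
    """Percent by which score a exceeds score b (higher scores are better)."""
    return (a / b - 1.0) * 100.0


print(f"{scores['studio_base_autopar'] / scores['gcc_base']:.2f}x")
print(f"{advantage_percent(scores['studio_base_noautopar'], scores['gcc_base']):.0f}%")
print(f"{advantage_percent(scores['studio_base_noautopar'], scores['gcc_peak']):.0f}%")
```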

YMMV

Your mileage may vary. It is certain that both GCC and Studio could be improved with additional tuning efforts. Both provide dozens of compiler flags, which can keep the tester delightfully engaged for an unbounded number of days. We feel that the tuning presented here is reasonable, and that additional tuning effort, if applied to both compilers, would not radically alter the conclusions.

Additional Information

The results disclosed in this article are from "reportable" runs of the SPECfp2006 benchmarks, which have been submitted to SPEC.

[1] SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive comparisons are based on data published at www.spec.org as of 1 June 2009. The X6270 result can be found at http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090413-07019.html.

Wednesday Jun 03, 2009

Welcome to BestPerf group blog!

Welcome to BestPerf group blog!  This blog will contain many different performance results and the best practices learned from doing a wide variety of performance work on the broad range of Sun's products.

Over the coming days, you will see many engineers in the Strategic Applications Engineering group posting on a wide variety of topics and providing useful information to the users of Sun's technologies. Some of the areas explored will be:

world-record, performance, $/Perf, watts, watt/perf, scalability, bandwidth, RAS, virtualization, security, cluster, latency, HPC, Web, Application, Database

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
