Thursday Aug 27, 2009

Sun SPARC Enterprise T5240 with 1.6GHz UltraSPARC T2 Plus Beats 4-Chip IBM Power 570 POWER6 System on SPECjbb2005

Significance of Results

A Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.6GHz delivered a result of 422782 SPECjbb2005 bops, 26424 SPECjbb2005 bops/JVM. The Sun SPARC Enterprise T5240 consumed an average of 875 Watts of power during the execution of the benchmark.

  • The Sun SPARC Enterprise T5240 server, running two 1.6 GHz UltraSPARC T2 Plus processors, delivered 5% better performance than an IBM Power 570 with four 4.7 GHz POWER6 processors, as measured by the SPECjbb2005 benchmark.

  • The Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.6GHz demonstrated 10% better performance than the Sun SPARC Enterprise T5240 server equipped with two UltraSPARC T2 Plus processors at 1.4GHz.
  • One Sun SPARC Enterprise T5240 (two 1.6GHz UltraSPARC T2 Plus chips, 2RU) has 2.3 times the power/performance of the IBM Power 570 (8RU), which used four 4.7GHz POWER6 chips.
  • The Sun SPARC Enterprise T5240 used OpenSolaris 2009.06 and the Sun JDK 1.6.0_14 Performance Release to obtain this result.

Performance Landscape

SPECjbb2005 Performance Chart (ordered by performance), select results presented.

bops : SPECjbb2005 Business Operations per Second (bigger is better)

System                       Chips  Cores  Threads  GHz  Processor Type      bops    bops/JVM
Sun SPARC Enterprise T5240   2      16     128      1.6  UltraSPARC T2 Plus  422782  26424
IBM Power 570                4      8      16       4.7  POWER6              402923  100731
Sun SPARC Enterprise T5240   2      16     128      1.4  UltraSPARC T2 Plus  384934  24058

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      64 GB

Software Configuration:

    OpenSolaris 2009.06
    Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java-implemented application tier (server-side Java). The benchmark is based on order processing in a wholesale supplier application. The performance of the user tier and the database tier is not measured in this test. The metrics given are SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Key Points and Best Practices

  • Each JVM was run in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • Each JVM was bound to a separate processor set containing one core, reducing memory access latency by using the physical memory closest to that core (a minimal sketch follows).
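As an illustration, a minimal sketch of how a JVM can be placed in the FX scheduling class and bound to a one-core processor set on Solaris (the CPU IDs and set ID are hypothetical; check psrinfo -pv for the actual topology, and substitute the real benchmark command line for java -version):

    # Create a processor set from the 8 hardware threads of one core;
    # the command prints the ID of the new set (assume it returns 1)
    psrset -c 0 1 2 3 4 5 6 7

    # Launch a JVM inside that set, running in the FX scheduling class
    psrset -e 1 priocntl -e -c FX java -version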

See Also

Disclosure Statement

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 8/25/2009 on http://www.spec.org.
Sun SPARC T5240 (2 chips, 16 cores) 422782 SPECjbb2005 bops, 26424 SPECjbb2005 bops/JVM; Sun SPARC T5240 (2 chips, 16 cores) 384934 SPECjbb2005 bops, 24058 SPECjbb2005 bops/JVM; IBM Power 570 (4 chips, 8 cores) 402923 SPECjbb2005 bops, 100731 SPECjbb2005 bops/JVM.

Sun watts were measured on the system during the test.

IBM p 570 4P (2 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in 'Facts and Features Report': ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Wednesday Aug 26, 2009

Sun SPARC Enterprise T5220 with 1.6GHz UltraSPARC T2 Sets Single Chip World Record on SPECjbb2005

Significance of Results

A Sun SPARC Enterprise T5220 server equipped with one UltraSPARC T2 processor at 1.6GHz delivered a World Record single-chip result of 231464 SPECjbb2005 bops, 28933 SPECjbb2005 bops/JVM. The Sun SPARC Enterprise T5220 consumed an average of 520 Watts of power during the execution of this benchmark.

  • The Sun SPARC Enterprise T5220 server (one 1.6 GHz UltraSPARC T2 chip) demonstrated 3% better performance than the Fujitsu TX100 result of 223691 SPECjbb2005 bops, which used one 3.16 GHz Xeon X3380 processor.
  • The Sun SPARC Enterprise T5220 (one 1.6 GHz UltraSPARC T2 chip) demonstrated 8% better performance than the IBM x3200 result of 214578 SPECjbb2005 bops, which used one 3.16 GHz Xeon X3380 processor.
  • The Sun SPARC Enterprise T5220 server (one 1.6 GHz UltraSPARC T2 chip) demonstrated 10% better performance than the Fujitsu RX100 result of 211144 SPECjbb2005 bops, which used one 3.16 GHz Xeon X3380 processor.
  • The Sun SPARC Enterprise T5220 server (one 1.6 GHz UltraSPARC T2 chip) demonstrated 19% better performance than the IBM x3350 result of 194256 SPECjbb2005 bops, which used one 3 GHz Xeon X3370 processor.
  • The Sun SPARC Enterprise T5220 server (one 1.6 GHz UltraSPARC T2 chip) demonstrated 2.6X the performance of the IBM p570 result of 88089 SPECjbb2005 bops, which used one 4.7 GHz POWER6 processor.
  • One Sun SPARC Enterprise T5220 (one 1.6GHz UltraSPARC T2 chip, 2RU) has 2.1 times the power/performance of the IBM Power 570 (4RU) that used two 4.7GHz POWER6 chips.
  • The Sun SPARC Enterprise T5220 used OpenSolaris 2009.06 and the Sun JDK 1.6.0_14 Performance Release to obtain this result.

Performance Landscape

SPECjbb2005 Performance Chart (ordered by performance)

bops : SPECjbb2005 Business Operations per Second (bigger is better)

System                       Chips  Cores  Threads  GHz   Processor Type  bops    bops/JVM
Sun SPARC Enterprise T5220   1      8      64       1.6   UltraSPARC T2   231464  28933
Sun Blade T6320              1      8      64       1.6   UltraSPARC T2   229576  28697
Fujitsu TX100                1      4      4        3.16  Intel Xeon      223691  111846
IBM x3200 M2                 1      4      4        3.16  Intel Xeon      214578  107289
Fujitsu RX100                1      4      4        3.16  Intel Xeon      211144  105572
IBM Power 570                2      4      8        4.7   POWER6          205917  102959
IBM x3350                    1      4      4        3.0   Intel Xeon      194256  97128
Sun SPARC Enterprise T5120   1      8      64       1.4   UltraSPARC T2   192055  24007
IBM Power 570                1      2      4        4.7   POWER6          88089   88089

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5220
      1x 1.6 GHz UltraSPARC T2 processor
      64 GB

Software Configuration:

    OpenSolaris 2009.06
    Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java-implemented application tier (server-side Java). The benchmark is based on order processing in a wholesale supplier application. The performance of the user tier and the database tier is not measured in this test. The metrics given are SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Key Points and Best Practices

  • Each JVM executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • Each JVM was bound to a separate processor containing 1 core to reduce memory access latency using the physical memory closest to the processor.

See Also

Disclosure Statement

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 8/25/2009 on http://www.spec.org.
Sun SPARC T5220 231464 SPECjbb2005 bops, 28933 SPECjbb2005 bops/JVM Submitted to SPEC for review; IBM p 570 88089 SPECjbb2005 bops, 88089 SPECjbb2005 bops/JVM; Fujitsu TX100 223691 SPECjbb2005 bops, 111846 SPECjbb2005 bops/JVM; IBM x3350 194256 SPECjbb2005 bops, 97128 SPECjbb2005 bops/JVM; Sun SPARC Enterprise T5120 192055 SPECjbb2005 bops, 24007 SPECjbb2005 bops/JVM.

Sun watts were measured on the system during the test.

IBM p 570 2P (1 building block) power specifications calculated as 80% of maximum input power reported 7/8/09 in "Facts and Features Report": ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Tuesday Jul 21, 2009

Sun Blade T6320 Sets World Record SPECjbb2005 Performance

Significance of Results

The Sun Blade T6320 server module equipped with one UltraSPARC T2 processor running at 1.6 GHz delivered a World Record single-chip result while running the SPECjbb2005 benchmark.

  • The Sun Blade T6320 server module powered by one 1.6 GHz UltraSPARC T2 processor delivered a result of 229576 SPECjbb2005 bops, 28697 SPECjbb2005 bops/JVM when running the SPECjbb2005 benchmark.
  • The Sun Blade T6320 server module (with one 1.6 GHz UltraSPARC T2 processor) demonstrated 2.6X better performance than the IBM System p 570 with one 4.7 GHz POWER6 processor.
  • The Sun Blade T6320 server module (with one 1.6 GHz UltraSPARC T2 processor) demonstrated 3% better performance than the Fujitsu TX100 result which used one 3.16 GHz Intel Xeon X3380 processor.
  • The Sun Blade T6320 server module (with one 1.6 GHz UltraSPARC T2 processor) demonstrated 7% better performance than the IBM x3200 result which used one 3.16 GHz Xeon X3380 processor.
  • The Sun Blade T6320 server module running the 1.6 GHz UltraSPARC T2 processor delivered 20% better performance than a Sun SPARC Enterprise T5120 with the 1.4 GHz UltraSPARC T2 processor.
  • The Sun Blade T6320 used the OpenSolaris 2009.06 operating system and the Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release JVM to obtain this leading result.

Performance Landscape

SPECjbb2005 Performance Chart (ordered by performance)

bops: SPECjbb2005 Business Operations per Second (bigger is better)

System                       Chips  Cores  Threads  GHz   Processor Type  bops    bops/JVM
Sun Blade T6320              1      8      64       1.6   UltraSPARC T2   229576  28697
Fujitsu TX100                1      4      4        3.16  Intel Xeon      223691  111846
IBM x3200 M2                 1      4      4        3.16  Intel Xeon      214578  107289
Fujitsu RX100                1      4      4        3.16  Intel Xeon      211144  105572
IBM x3350                    1      4      4        3.0   Intel Xeon      194256  97128
Sun SE T5120                 1      8      64       1.4   UltraSPARC T2   192055  24007
IBM p 570                    1      2      4        4.7   POWER6          88089   88089

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Results and Configuration Summary

Hardware Configuration:

    Sun Blade T6320
      1 x 1.6 GHz UltraSPARC T2 processor
      64 GB

Software Configuration:

    OpenSolaris 2009.06
    Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java-implemented application tier (server-side Java). The benchmark is based on order processing in a wholesale supplier application. The performance of the user tier and the database tier is not measured in this test. The metrics given are SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Key Points and Best Practices

  • Enhancements to the JVM had a major impact on performance.
  • Each JVM executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • Each JVM was bound to a separate processor set containing one core to reduce memory access latency by using the physical memory closest to the processor.

See Also

Disclosure Statement

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 7/17/2009 on http://www.spec.org. SPECjbb2005, Sun Blade T6320 229576 SPECjbb2005 bops, 28697 SPECjbb2005 bops/JVM; IBM p 570 88089 SPECjbb2005 bops, 88089 SPECjbb2005 bops/JVM; Fujitsu TX100 223691 SPECjbb2005 bops, 111846 SPECjbb2005 bops/JVM; IBM x3350 194256 SPECjbb2005 bops, 97128 SPECjbb2005 bops/JVM; Sun SPARC Enterprise T5120 192055 SPECjbb2005 bops, 24007 SPECjbb2005 bops/JVM.

Monday Jul 20, 2009

Sun T5440 SPECjbb2005 Beats IBM POWER6 Chip-to-Chip

A Sun SPARC Enterprise T5440 server equipped with four UltraSPARC T2 Plus processors running at 1.6GHz, delivered a result of 841380 SPECjbb2005 bops, 26293 SPECjbb2005 bops/JVM when running the SPECjbb2005 benchmark.

  • One Sun SPARC Enterprise T5440 (four 1.6GHz UltraSPARC T2 Plus chips, 4RU) demonstrated 5% better performance than the IBM Power 570 (16RU) result of 798752 SPECjbb2005 bops, which used eight 4.7 GHz POWER6 chips. The IBM system requires twice as many processor chips and four times the rack space of the T5440.
  • One Sun SPARC Enterprise T5440 (four 1.6GHz UltraSPARC T2 Plus chips, 4RU) has 2.3 times better power/performance than the IBM Power 570 (16RU) that used eight 4.7 GHz POWER6 chips.
  • Sun's 1.6GHz UltraSPARC T2 Plus processor delivers over 2.3x the performance of the 4.7 GHz IBM POWER6 processor.
  • One Sun SPARC Enterprise T5440 (four 1.6GHz UltraSPARC T2 Plus chips) demonstrated 21% better performance when compared to the Sun SPARC Enterprise T5440 result of 692736 SPECjbb2005 bops using four 1.4GHz UltraSPARC T2 Plus chips.
  • The Sun SPARC Enterprise T5440 used OpenSolaris 2009.06 and the Sun JDK 1.6.0_14 Performance Release to obtain this result.

Performance Landscape

SPECjbb2005 Performance Chart (ordered by performance)

bops : SPECjbb2005 Business Operations per Second (bigger is better)

System                       Chips  Cores  Threads  GHz  Processor Type      bops    bops/JVM
HP DL585 G6                  4      24     24       2.8  AMD Opteron         937207  234302
Sun SPARC Enterprise T5440   4      32     256      1.6  UltraSPARC T2 Plus  841380  26293
IBM Power 570                8      16     32       4.7  POWER6              798752  99844
Sun SPARC Enterprise T5440   4      32     256      1.4  UltraSPARC T2 Plus  692736  21648

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org.

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus processors
      256 GB

Software Configuration:

    OpenSolaris 2009.06
    Java HotSpot(TM) 32-Bit Server, Version 1.6.0_14 Performance Release

Benchmark Description

SPECjbb2005 (Java Business Benchmark) measures the performance of a Java-implemented application tier (server-side Java). The benchmark is based on order processing in a wholesale supplier application. The performance of the user tier and the database tier is not measured in this test. The metrics given are SPECjbb2005 bops (Business Operations per Second) and SPECjbb2005 bops/JVM (bops per JVM instance).

Key Points and Best Practices

  • Enhancements to the JVM had a major impact on performance.
  • Each JVM executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • Each JVM was bound to a separate processor set containing one core to reduce memory access latency by using the physical memory closest to the processor.

See Also

Disclosure Statement:

SPECjbb2005 Sun SPARC Enterprise T5440 (4 chips, 32 cores) 841380 SPECjbb2005 bops, 26293 SPECjbb2005 bops/JVM. Results submitted to SPEC. HP DL585 G5 (4 chips, 24 cores) 937207 SPECjbb2005 bops, 234302 SPECjbb2005 bops/JVM. IBM Power 570 (8 chips, 16 cores) 798752 SPECjbb2005 bops, 99844 SPECjbb2005 bops/JVM. Sun SPARC Enterprise T5440 (4 chips, 32 cores) 692736 SPECjbb2005 bops, 21648 SPECjbb2005 bops/JVM. SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 7/20/09

Sun watts were measured on the system during the test.

IBM p 570 8P (4 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in “Facts and Features Report”: ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Wednesday Jun 24, 2009

I/O analysis using DTrace

Introduction:

In almost any commercial application, the role of storage is critical, in terms of capacity, performance and data availability. The future roadmap of the storage industry shows disk storage getting cheaper, capacity getting larger, and SAS technology acquiring wider adoption.

Disk performance has become more important as CPU power has increased and the CPU is now rarely the bottleneck. Newly available analysis tools provide a simpler way to understand the internal workings of the operating system's I/O subsystem, making it easier to solve I/O performance issues.

Solaris's powerful, open-source DTrace facility allows anyone interested in solving I/O performance issues to observe how I/Os are carried out by the operating system and related drivers. The ever-growing set of DTrace features, and its adoption by various vendors, is strong proof of its usefulness.

Many performance problems can be captured and solved using basic performance tools such as vmstat, mpstat, prstat and iostat. Some problems, however, require deeper analysis to resolve, or live analysis to catch events out of the ordinary.

Environment:

DTrace allows analysis of real or simulated workloads of arbitrary complexity. Several applications can be used to generate the workload; my choice was Oracle database software, in order to generate a more interesting and realistic I/O workload.

An Oracle table was created, similar to a fact table in a data warehousing environment, and data from a flat file was then loaded directly into that fact table. This write-intensive workload generates almost nothing but writes to the underlying disk storage.

Just a few important points about the whole setup:

  • Solaris 10 or OpenSolaris is required to run the DTrace commands shown here.
  • No filesystem was used for this experiment. Since there are a few filesystem options, and their architectures differ, I might address filesystem performance analysis in a future blog.
  • Only one disk was used, to keep things simple. However, the same techniques can be used to diagnose I/O issues on a huge disk farm.
  • Although the workload resembles a real one, the main goal is to monitor the I/O subsystem using DTrace, so let's not dive deep into the tricks used to generate the I/O load. They are simple enough to produce the target workload.
  • The scripts to set up the environment are in the zip-file.
  • An account with admin privileges can run DTrace commands. In fact, an account with just the bare-minimum DTrace privileges can run them as well, as sketched below. For more details, check out the document here.
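A minimal sketch of granting just the DTrace privileges to an ordinary account (the account name is hypothetical):

    # Grant a non-root account the DTrace privileges only
    usermod -K defaultpriv=basic,dtrace_proc,dtrace_user,dtrace_kernel oracle

    # From that account, verify the privileges are in effect
    ppriv $$ | grep dtrace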

Live Analysis:

While the workload was being generated, iostat was started, followed by the DTrace commands that gather more detail. Since iostat without any arguments wouldn't produce the desired details, extended statistics were requested. The options used are:

    iostat -cnxz 2

The quick summary of those arguments:

   -c : display CPU statistics
   -n : display device names in descriptive format
   -x : display extended device statistics
   -z : do not display lines with all zeros
    2 : display I/O statistics every 2 seconds

For more details, check out the man page.

About 6 SQL*Loader sessions were started to load the data, which in turn generated heavy writes to the disk holding the table. Oracle SQL*Loader is a utility used to load flat-file data into a table. The disk was saturated with about 4-6 SQL*Loader sessions. The following iostat output was captured during the run:

 
     cpu
 us sy wt id
 58  8  0 34
                    extended device statistics              
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    3.0    0.0    23.9  0.0  0.0    0.0    4.2   0   1 c0t1d0
    0.0  520.4    0.0 32535.2  0.0  4.9    0.0    9.4   2  98 c2t8d0
    0.0    1.5    0.0   885.1  0.0  0.0    0.0   13.5   0   1 c2t13d0
     cpu
 us sy wt id
 57  7  0 36
                    extended device statistics              
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  546.7    0.0 31004.3  0.0  9.3    0.0   16.9   1  98 c2t8d0
    0.0    1.0    0.0   832.1  0.0  0.0    0.0   11.7   0   1 c2t13d0

The tablespace holding the target table is on the disk device identified as c2t8d0. From the kw/s column we can see that we are writing about 30MB/sec to it. Let's now use DTrace to explore further...

Let's start with very basic info. The following DTrace command prints the number of reads, writes and asynchronous I/Os initiated on the system. [In this example, and in all other examples in this article, white space has been added for readability. In real life, you would either put the whole dtrace command on a single line, or would put it in a file.]

# dtrace -qn 
   'BEGIN {
      rio=0; 
      wio=0; 
      aio=0
   } 
   io:::start {
      args[0]->b_flags&B_READ?rio++:0; 
      args[0]->b_flags&B_WRITE?wio++:0; 
      args[0]->b_flags&B_ASYNC?aio++:0
   } 
   tick-1s {
      printf("%Y: reads :%8d, writes :%8d, async-ios :%8d\\n", walltimestamp, rio, wio, aio); 
      rio=0; 
      wio=0; 
      aio=0
   }'
2009 Jun 15 13:11:39: reads :       0, writes :     578, async-ios :     465
2009 Jun 15 13:11:40: reads :       0, writes :     526, async-ios :     520
2009 Jun 15 13:11:41: reads :       0, writes :     529, async-ios :     528
2009 Jun 15 13:11:42: reads :       0, writes :     533, async-ios :     519
2009 Jun 15 13:11:43: reads :       0, writes :     779, async-ios :     622
2009 Jun 15 13:11:44: reads :       0, writes :     533, async-ios :     472
2009 Jun 15 13:11:45: reads :       0, writes :     511, async-ios :     513
2009 Jun 15 13:11:46: reads :       0, writes :     538, async-ios :     531
2009 Jun 15 13:11:47: reads :       0, writes :     620, async-ios :     503
2009 Jun 15 13:11:48: reads :       0, writes :     512, async-ios :     504
2009 Jun 15 13:11:49: reads :       0, writes :     533, async-ios :     526
2009 Jun 15 13:11:50: reads :       0, writes :     565, async-ios :     523
2009 Jun 15 13:11:51: reads :       0, writes :     662, async-ios :     539
2009 Jun 15 13:11:52: reads :       0, writes :     527, async-ios :     524
2009 Jun 15 13:11:53: reads :       0, writes :     501, async-ios :     503
2009 Jun 15 13:11:54: reads :       0, writes :     587, async-ios :     531
2009 Jun 15 13:11:55: reads :       0, writes :     585, async-ios :     532

The DTrace command used above has a few sections. The BEGIN clause executes only once, when the script starts. The io:::start probe fires every time an I/O is initiated, and tick-1s fires once per second.

The above output clearly shows that the system has been issuing a lot of asynchronous I/O to write the data. It is easy to verify this behavior by running the truss command against one of the Oracle background processes, as sketched below. Since it's confirmed that only writes are issued, not reads, the following commands focus only on writes.
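A minimal sketch of that verification (the PID is hypothetical; truss -c counts system calls until interrupted with Ctrl-C):

    # Attach to one Oracle background process and count its system calls
    truss -c -p 12345
    # A heavily asynchronous writer is dominated by kernel async I/O (kaio)
    # calls rather than synchronous write()/pwrite() calls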

We will now look at a histogram of write I/O sizes, using DTrace's aggregation feature. The aggregation function quantize() is used to generate the histogram data. A quick summary of the DTrace aggregation functions can be found here.

 
# dtrace -qn 
   'io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/ { 
       @[args[1]->dev_pathname] = quantize(args[0]->b_bcount); 
    } 
    tick-10s {
       exit(0)
    }'

  /devices/pci@3,700000/pci@0/scsi@8,1/sd@8,0:a     
           value  ------------- Distribution ------------- count    
            4096 |                                         0        
            8192 |@@@@@@@@@@                               1324     
           16384 |                                         0        
           32768 |@@@@@@@@                                 1032     
           65536 |@@@@@@@@@@@@@@@@@                        2182     
          131072 |@@@@                                     518      
          262144 |                                         6        
          524288 |                                         0        


Before interpreting the output, let's examine the command used. The two new predicates within forward slashes a) enable DTrace only for the executable named "oracle" and b) enable DTrace only if the request is a write. The quantize() aggregation function creates a bucket for each range of I/O sizes and counts the requests that fall into each.

The DTrace output shows a physical device name, which is easy to map back to the device used at the application layer:

# ls -l /dev/rdsk/c2t8d0s0
lrwxrwxrwx   1 root     root          54 Feb  8  2008 
   /dev/rdsk/c2t8d0s0 -> ../../devices/pci@3,700000/pci@0/scsi@8,1/sd@8,0:a,raw
# 

That's it. The device used by the database is in fact a symbolic link to the device that showed up in the DTrace output. The output above suggests that more than half of the I/O requests are between 65536 bytes (64KB) and 131071 bytes (~128KB) in size. Since the database software and the OS are capable of writing up to about 1MB per I/O, there is an opportunity to look closely at the database configuration and OS kernel parameters to optimize this. This is one of the more useful DTrace commands for observing the throughput a device delivers; more importantly, it helps discover the I/O (write) pattern of an application.

We will now look at the latency histogram.

# dtrace -qn 
   'io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/ { 
       io_start[args[0]->b_edev, args[0]->b_blkno] = timestamp; 
    } 
    io:::done / (io_start[args[0]->b_edev, args[0]->b_blkno]) && (args[0]->b_flags & B_WRITE) / { 
       @[args[1]->dev_pathname, args[2]->fi_name] = 
           quantize((timestamp - io_start[args[0]->b_edev, args[0]->b_blkno]) / 1000000); 
       io_start[args[0]->b_edev, args[0]->b_blkno] = 0; 
    } 
    tick-10s {
       exit(0)
    }'

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  orasystem                                         
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_003                                       
           value  ------------- Distribution ------------- count    
                0 |                                         0        
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      7        
               2 |@@@@@                                    1        
               4 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_001                                       
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |@@@@@@@@@@                               2        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           6        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  control_002                                       
           value  ------------- Distribution ------------- count    
               2 |                                         0        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8        
               8 |                                         0        

  /devices/pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0:h  oraundo                                           
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |                                         1        
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  271      
               8 |@                                        5        
              16 |                                         2        
              32 |                                         0        

The preceding DTrace command uses two probes, io:::start followed by io:::done. The io:::start probe fires every time an I/O is initiated, and io:::done fires when an I/O completes. Since these two probes fire for every I/O event, we want to be sure that the I/O tracked at initiation is the same I/O tracked at completion; this is why the command keys each I/O by its device and block number. In most cases this gives us the result we want.

However, the elapsed-time histogram shows data for all files except the device we are interested in. Why? I believe it's due to the effect of asynchronous I/O (expect a future blog about monitoring asynchronous I/O). The output would have been much different had synchronous I/Os been issued to transfer the data. It is still interesting to see that there have been some writes to other database files; for example, it shows that the response time of most writes to the various database files is well under 8 milliseconds.

Finally, let's look at disk blocks (lba) that are being accessed more frequently.

# dtrace -qn 
   'io:::start /execname=="oracle" && args[0]->b_flags & B_WRITE/{ 
       @["block access count", args[1]->dev_pathname] = quantize(args[0]->b_blkno); 
    } 
    tick-10s {
       exit(0)
    }'

  block access count                                  /devices/pci@3,700000/pci@0/scsi@8,1/sd@8,0:a     
           value  ------------- Distribution ------------- count    
         4194304 |                                         0        
         8388608 |@@@                                      324      
        16777216 |                                         8        
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    4237     
        67108864 |                                         0        

# 

We see that the range of blocks between 33554432 and 67108863 is accessed much more frequently than any other block range listed above. This command is needed only occasionally; it is most useful when an application provides mapping information between application blocks and physical blocks, making it possible to figure out which hot blocks at the disk level belong to which application objects.

Conclusion:

This shows how easy it is to take a close look at the OS/physical layer while an application is busy with a disk-based workload. The DTrace collection can be widened or narrowed with various filter options; for example, read-intensive workloads can be observed by flipping a filter flag from B_WRITE to B_READ, as in the sketch below.
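A read-oriented variant of the earlier size histogram, as a sketch (same probes and aggregation, only the flag test changes):

# dtrace -qn 
   'io:::start /execname=="oracle" && args[0]->b_flags & B_READ/ { 
       @[args[1]->dev_pathname] = quantize(args[0]->b_bcount); 
    } 
    tick-10s {
       exit(0)
    }'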

Additional references:

If you are interested in any specific topics similar to above experiment, feel free to drop a note in the comments section.

Tuesday Jun 23, 2009

New CPU2006 Records: 3x better integer throughput, 9x better fp throughput

Significance of Results

A Sun Constellation system, composed of 48 Sun Blade X6440 server modules in a Sun Blade 6048 chassis, running OpenSolaris 2008.11 and using the Sun Studio 12 Update 1 compiler delivered World Record SPEC CPU2006 rate results.

On the SPECint_rate_base2006 benchmark, Sun delivered 4.7 times the performance of the IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

On the SPECfp_rate_base2006 benchmark, Sun delivered 3.9 times the performance of the largest IBM Power 595 (5GHz POWER6); this IBM system requires a slightly larger cabinet than the Sun Blade 6048 chassis (details below).

  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the World Record SPECint_rate_base2006 score of 8840.
  • This SPECint_rate_base2006 score is more than three times the previous record score.
  • The Sun Constellation System equipped with AMD Opteron QC 8384 2.7 GHz processors, running OpenSolaris 2008.11 and using the Sun Studio 12 update 1 compiler, delivered the fastest x86 SPECfp_rate_base2006 score of 6500.
  • This SPECfp_rate_base2006 score is nine times the previous x86 record score.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint_rate2006

System                              Processor Type  GHz   Chips  Cores  Peak  Base  Notes (1)
Sun Blade 6048                      Opteron 8384    2.7   192    768    --    8840  New Record
SGI Altix 4700 Density System       Itanium 9150M   1.66  128    256    3354  2893  Previous Best
SGI Altix 4700 Bandwidth System     Itanium2 9040   1.6   128    256    2971  2715
Fujitsu/Sun SPARC Enterprise M9000  SPARC64 VII     2.52  64     256    2290  2090
IBM Power 595                       POWER6          5.0   32     64     2160  1870  Best POWER6

(1) Results as of 23 June 2009 from www.spec.org.

SPECfp_rate2006

System                              Processor Type  GHz   Chips  Cores  Peak  Base   Notes (2)
SGI Altix 4700 Density System       Itanium 9140M   1.66  512    1024   --    10580
Sun Blade 6048                      Opteron 8384    2.7   192    768    --    6500   New x86 Record
SGI Altix 4700 Bandwidth System     Itanium2 9040   1.6   128    256    3507  3419
IBM Power 595                       POWER6          5.0   32     64     2184  1681   Best POWER6
Fujitsu/Sun SPARC Enterprise M9000  SPARC64 VII     2.52  64     256    2005  1861
SGI Altix 4700 Bandwidth System     Itanium 9150M   1.66  128    256    1947  1832
SGI Altix ICE 8200EX                Intel X5570     2.93  8      32     742   723

(2) Results as of 23 June 2009 from www.spec.org.

Results and Configuration Summary

Hardware Configuration:
    1 x Sun Blade 6048
      48 x Sun Blade X6440, each with
        4 x 2.7 GHz QC AMD Opteron 8384 processors
        32 GB, (8 x 4GB)

Software Configuration:

    O/S: OpenSolaris 2008.11
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

The Sun Blade 6048 chassis can hold a variety of server modules; in this case, Sun Blade X6440 modules were used to provide this capacity solution. This single rack delivered results that have not been seen before in this form factor.

Running this many benchmark copies requires a reasonably good file server to host the benchmark tree. A Sun Fire X4540 server provided the required disk space, which the blades accessed over NFS.

Sun has shown 4.7x greater SPECint_rate_base2006 and 3.9x greater SPECfp_rate_base2006 in a slightly smaller cabinet. IBM specifications are at http://www-03.ibm.com/systems/power/hardware/595/specs.html: one frame (slimline doors) is 79.3"H x 30.5"W x 58.5"D, weight 3,376 lb; one frame (acoustic doors) is 79.3"H x 30.5"W x 71.1"D, weight 3,422 lb. Sun Blade 6048 specifications are at http://www.sun.com/servers/blades/6048chassis/specs.xml: one Sun Blade 6048 is 81.6"H x 23.9"W x 40.3"D, weight 2,300 lb (fully configured).

Disclosure Statement:

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009 and this report. Sun Blade 6048 chassis with Sun Blade X6440 server modules (48 nodes with 4 chips, 16 cores, 16 threads each, OpenSolaris 2008.11, Studio 12 update 1) - 8840 SPECint_rate_base2006, 6500 SPECfp_rate_base2006; IBM p595, 1870 SPECint_rate_base2006, 1681 SPECfp_rate_base2006.

See Also

Sun Blade X6275 results capture Top Places in CPU2006 SPEED Metrics

Significance of Multiple World Records

The Sun Blade X6275 server module, equipped with two Intel QC Xeon X5570 2.93 GHz processors and running the OpenSolaris 2009.06 operating system delivered the best SPECfp2006 and SPECint2006 results to date.
  • The Sun Blade X6275 server module using the Sun Studio 12 update 1 compiler and the OpenSolaris 2009.06 operating system delivered a World Record SPECfp2006 result of 50.8.

  • This SPECfp2006 result beats the best result by the competition, using the same processor type, by 20%.
  • The Sun Blade X6275 server module using the Sun Studio 12 update 1 compiler and the OpenSolaris 2009.06 operating system delivered a World Record SPECint2006 result of 37.4.

  • This SPECint2006 result just tops the best result by the competition, even though that result used a W-series Nehalem chip with a 9% faster clock.
Sun(TM) Studio 12 Update 1 contains new features and enhancements to boost performance and simplify the creation of high-performance parallel applications for the latest multicore x86 and SPARC-based systems running on leading Linux platforms, the Solaris(TM) Operating System (OS) or OpenSolaris(TM). The Sun Studio 12 Update 1 software has set almost a dozen industry benchmark records to date, and in conjunction with the freely available community-based OpenSolaris 2009.06 OS, was instrumental in landing four new ground-breaking SPEC CPU2006 results.

Sun Studio 12 Update 1 includes improvements in the compiler's ability to automatically parallelise code (after all, the easiest way to develop parallel applications is to have the compiler do it for you); improved support for parallelisation specifications such as OpenMP, including the latest OpenMP 3.0 specification; and improvements in the tools' ability to give the developer meaningful feedback about parallel code, for example the Performance Analyzer's ability to profile MPI code.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results.

SPECint2006

System                     Processor Type     GHz   Chips  Cores  Peak  Base  Notes (1)
Sun Blade X6275            Xeon X5570         2.93  2      8      37.4  31.0  New Record
ASUS TS700-E6 (Z8PE-D12X)  Xeon W5580         3.2   2      8      37.3  33.2  Previous Best
Fujitsu R670               Xeon W5580         3.2   2      8      37.2  33.2
Sun Blade X6270            Xeon X5570         2.93  2      8      36.9  32.0
Fujitsu Celsius R570       Xeon X5570         2.93  2      8      36.3  32.2
YOYOtech MLK1610           Intel Core i7-965  3.73  1      4      36.0  32.5
HP ProLiant DL585 G5       Opteron 8393       3.1   1      1      23.4  19.7  Best Opteron
IBM System p570            POWER6             4.70  1      1      21.7  17.8  Best POWER6

(1) Results as of 22 June 2009 from www.spec.org and this report.

SPECfp2006

System                Processor Type  GHz   Chips  Cores  Peak  Base  Notes (2)
Sun Blade X6275       Xeon X5570      2.93  2      8      50.8  44.2  New Record
Sun Blade X6270       Xeon X5570      2.93  2      8      50.4  45.0  Previous Best
Sun Blade X4170       Xeon X5570      2.93  2      8      48.9  43.9
Fujitsu R670          Xeon W5580      3.2   2      8      42.2  39.5
HP ProLiant DL585 G5  Opteron 8393    3.1   2      8      25.9  23.6  Best Opteron
IBM Power 595         POWER6          5.00  1      1      24.9  20.1  Best POWER6

(2) Results as of 22 June 2009 from www.spec.org and this report.

Results and Configuration Summary

Hardware Configuration:
    Sun Blade X6275
      2 x 2.93 GHz QC Intel Xeon X5570 processors, turbo enabled
      24 GB, (6 x 4GB DDR3-1333 DIMM)
      1 x 146 GB, 10000 RPM SAS disk

Software Configuration:

    O/S: OpenSolaris 2009.06
    Compiler: Sun Studio 12 Update 1
    Other SW: MicroQuill SmartHeap Library 9.01 x64
    Benchmark: SPEC CPU2006 V1.1

Key Points and Best Practices

These results show that choosing the right compiler for the job can maximize one's investment in hardware. The Sun Studio compiler teamed with the OpenSolaris operating system allows one to tackle hard problems to get quick solution turnaround.
  • Autoparallelism was used to deliver the fastest time to solution. These results show that autoparallel compilation is a very viable option to consider when one needs the quickest turnaround of results. Note that not all codes benefit from this optimization, just as they can't always take advantage of other compiler optimization techniques. A compile-line sketch follows this list.
  • OpenSolaris 2009.06 was able to fully take advantage of the turbo mode of the Nehalem family of processors.
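As an illustration, here is a minimal sketch of an autoparallel compile with Sun Studio (the source file name is a placeholder; the full flag sets are in the published SPEC configuration files):

    # Compile with high optimization plus automatic loop parallelization;
    # -xloopinfo reports which loops were parallelized and why others were not
    cc -fast -m64 -xautopar -xloopinfo kernel.c -o kernel

    # Choose the number of worker threads at run time
    PARALLEL=8 ./kernel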

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 6/22/2009. Sun Blade X6275 (Intel X5570, 2 chips, 8 cores) 50.8 SPECfp2006, 37.4 SPECint2006; ASUS TS700-E6 (Intel W5580, 2 chips, 8 cores) 37.3 SPECint2006; Fujitsu R670 (Intel W5580, 2 chips, 8 cores) 42.2 SPECfp2006.

Friday Jun 12, 2009

OpenSolaris Beats Linux on memcached on the Sun Fire X2270

OpenSolaris provided 25% better memcached performance than Linux on the Sun Fire X2270 server: memcached 1.3.2 on OpenSolaris delivered a maximum throughput of 352K ops/sec, compared to 281K ops/sec for the same server running RHEL5 (with kernel 2.6.29).

memcached is the de facto distributed caching server used to scale many web 2.0 sites today. As sites grow to support very large numbers of users, memcached aids scalability by effectively cutting down on MySQL traffic and improving response times.

  • memcached is a very lightweight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta).
  • As customers move to the newer, more powerful Intel Nehalem based systems, it is important that they have the ability to use these systems efficiently using appropriate software and hardware components.

Performance Landscape

memcached performance results: ops/sec (bigger is better)

System          C/C/T   GHz   Processor Type  Memory  Operating System                            Ops/Sec
Sun Fire X2270  2/8/16  2.93  Intel X5570 QC  48GB    OpenSolaris 2009.06                         352K
Sun Fire X2270  2/8/16  2.93  Intel X5570 QC  48GB    RedHat Enterprise Linux 5 (kernel 2.6.29)   281K

C/C/T: Chips, Cores, Threads

Results and Configuration Summary

Sun's results used the following hardware and software components.

Hardware:

    Sun Fire X2270
    2 x Intel X5570 QC 2.93 GHz
    48GB of memory
    10GbE Intel Oplin card

Software:

    OpenSolaris 2009.06
    Linux RedHat 5 (on kernel 2.6.29)

Benchmark Description

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. The memcached benchmark was based on Apache Olio - a web2.0 workload.

The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached:

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures throughput and response times once the system reaches steady state. The workload is implemented using Faban, an open-source benchmark development framework. Faban not only speeds benchmark development; its harness is also a great way to queue, monitor and archive runs for analysis.
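For a concrete feel of the operations under test, here is a hand-run set and get against a memcached server using its text protocol (this assumes a server listening on localhost:11211; the key and value are arbitrary):

    # Store a 5-byte object under key k1 (flags 0, no expiry), then read it back
    printf 'set k1 0 0 5\r\nhello\r\nget k1\r\nquit\r\n' | nc localhost 11211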

Key Points and Best Practices

OpenSolaris Tuning

The following /etc/system settings were used to set the number of MSI-X interrupts:

  • set ddi_msix_alloc_limit=4
  • set pcplusmp:apic_intr_policy=1

For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance:

  • tx_queue_number=4, rx_queue_number=4

The crossbow threads were bound:

dladm set-linkprop -p cpus=12,13,14,15 ixgbe0
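Pulled together, those settings might be applied as follows (a sketch: the file placements are the conventional ones for OpenSolaris, and the reboot/driver-reload notes are assumptions):

    # /etc/system additions (take effect after a reboot)
    set ddi_msix_alloc_limit=4
    set pcplusmp:apic_intr_policy=1

    # /kernel/drv/ixgbe.conf additions (take effect when the driver reloads)
    tx_queue_number=4;
    rx_queue_number=4;

    # Bind the crossbow worker threads for ixgbe0 to four dedicated CPUs
    dladm set-linkprop -p cpus=12,13,14,15 ixgbe0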

Linux Tuning

Linux was more complicated to tune; the following tunables were changed in an attempt to get the best performance:

  • net.ipv4.tcp_timestamps = 0
  • net.core.wmem_default = 67108864
  • net.core.wmem_max = 67108864
  • net.core.optmem_max = 67108864
  • net.ipv4.tcp_dsack = 0
  • net.ipv4.tcp_sack = 0
  • net.ipv4.tcp_window_scaling = 0
  • net.core.netdev_max_backlog = 300000
  • net.ipv4.tcp_max_syn_backlog = 200000

Here are the ixgbe-specific settings that were used (2 transmit, 2 receive rings); a sketch of how to apply them follows:

  • RSS=2,2 InterruptThrottleRate=1600,1600
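A sketch of how these might be applied (the sysctl values are those listed above; the modprobe step assumes the Intel out-of-tree ixgbe driver, which accepts RSS and InterruptThrottleRate as module parameters):

    # Apply the network tunables at run time (add to /etc/sysctl.conf to persist)
    sysctl -w net.ipv4.tcp_timestamps=0
    sysctl -w net.core.wmem_default=67108864
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.core.optmem_max=67108864
    sysctl -w net.ipv4.tcp_dsack=0
    sysctl -w net.ipv4.tcp_sack=0
    sysctl -w net.ipv4.tcp_window_scaling=0
    sysctl -w net.core.netdev_max_backlog=300000
    sysctl -w net.ipv4.tcp_max_syn_backlog=200000

    # Reload the ixgbe driver with 2 TX/2 RX rings and throttled interrupts
    modprobe -r ixgbe
    modprobe ixgbe RSS=2,2 InterruptThrottleRate=1600,1600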

Linux Issues

Getting the 10GbE Intel Oplin card working on Linux required the following driver and kernel rebuilds:

  • With the default ixgbe driver from the RedHat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
  • This led to downloading the driver from the Intel site (1.3.56.11-2-NAPI) and re-compiling it. This version does work, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this version of the kernel does not support multiple TX rings.
  • Kernel version 2.6.29 supports multiple TX rings but still doesn't include the 1.3.56.11-2-NAPI ixgbe driver, so we downloaded, built and installed that kernel and driver together. This worked well, giving a maximum throughput of 281K operations/sec with some tuning.

See Also

Disclosure Statement

Sun Fire X2270 server with OpenSolaris 352K ops/sec. Sun Fire X2270 server with RedHat Linux 281K ops/sec. For memcached information, visit http://www.danga.com/memcached. Results as of June 8, 2009.

Tuesday Jun 09, 2009

Free Compiler Wins Nehalem Race by 2x

Not the Free Compiler That You Thought, No, This Other One.

Nehalem performance measured with several software configurations

Contributed by: John Henning and Karsten Guthridge

Introduction

The GNU C Compiler, GCC, is popular, widely available, and an exemplary collaborative effort.

But how does it do for performance -- for example, on Intel's latest hot "Nehalem" processor family? How does it compare to the freely available Sun Studio compiler?

Using the SPEC CPU benchmarks, we take a look at this question. These benchmarks depend primarily on the performance of the chip, the memory hierarchy, and the compiler. By holding the first two constant, it is possible to focus on the compiler's contribution to performance.

Current Record Holder

The current SPEC CPU2006 floating point speed record holder is the Sun Blade X6270 server module. Using 2x Intel Xeon X5570 processor chips and 24 GB of DDR3-1333 memory, it delivers a result of 45.0 SPECfp_base2006 and 50.4 SPECfp2006. [1]

We used this same blade system to compare GCC vs. Studio. On separate, but same-model disks, this software was installed:

  • SuSE Linux Enterprise Server 11.0 (x86_64) and GCC V4.4.0 built with gmp-4.3.1 and mpfr-2.4.1
  • OpenSolaris 2008.11 and Sun Studio 12 Update 1

Tie One Hand Behind Studio's Back

In order to make the comparison more fair to GCC, we took several steps.

  1. We simplified the tuning for the OpenSolaris/Sun Studio configuration. This was done in order to counter the criticism that one sometimes hears that SPEC benchmarks have overly aggressive tuning. Benchmarks were optimized with a reasonably short tuning string:

    For all:  -fast -xipo=2 -m64 -xvector=simd -xautopar
    For C++, add:  -library=stlport4
  2. Recall that SPEC CPU2006 allows two kinds of tuning: "base", and "peak". The base metrics require that all benchmarks of a given language use the same tuning. The peak metrics allow individual benchmarks to have differing tuning, and more aggressive optimizations, such as compiler feedback. The simplified Studio configuration used only the less aggressive base tuning.

Both of the above changes limited the performance of Sun Studio.  Several measures were used to increase the performance of GCC:

  1. We tested the latest released version of GCC, 4.4.0, which was announced on 21 April 2009. In our testing, GCC 4.4.0 provides about 10% better overall floating point performance than V4.3.2. Note that GCC 4.4.0 is more recent than the compiler that is included with recent Linux distributions such as SuSE 11, which includes 4.3.2; or Ubuntu 8.10, which updates to 4.3.2 when one does "apt-get install gcc". It was installed with the math libraries mpfr 2.4.1 and gmp 4.3.1, which are labeled as the latest releases as of 1 June 2009.

  2. A tuning effort was undertaken with GCC, including testing of -O2 -O3 -fprefetch-loop-arrays -funroll-all-loops -ffast-math -fno-strict-aliasing -ftree-loop-distribution -fwhole-program -combine and -fipa-struct-reorg

  3. Eventually, we settled on this tuning string for GCC base:

    For all:  -O3 -m64 -mtune=core2 -msse4.2 -march=core2
    -fprefetch-loop-arrays -funroll-all-loops
    -Wl,-z common-page-size=2M
    For C++, add:  -ffast-math

    The reason that only the C++ benchmarks used the fast math library was that 435.gromacs, which uses C and Fortran, fails validation with this flag. (Note: we verified that the benchmarks successfully obtained 2MB pages.)

Studio wins by 2x, even with one hand tied behind its back

At this point, a fair base-to-base comparison can be made, and Sun Studio/OpenSolaris finishes the race while GCC/Linux is still looking for its glasses: 44.8 vs. 21.1 (see Table 1). Notice that Sun Studio provides more than 2x the performance of GCC.

Table 1: Initial Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                                      Base  Peak
Industry FP Record (Sun Studio 12 Update 1, OpenSolaris 2008.11)   45.0  50.4
Studio/OpenSolaris: simplified as above (less tuned)               44.8  --
GCC V4.4 / SuSE Linux 11                                           21.1  --
Notes: All results reported are from rule-compliant, "reportable" runs of the SPEC CPU2006 floating point suite, CFP2006. "Base" indicates the metric SPECfp_base2006. "Peak" indicates SPECfp2006. Peak uses the same benchmarks and workloads as base, but allows more aggressive tuning. A base result may, optionally, be quoted as peak, but the converse is not allowed. For details, see SPEC's Readme1st.

Fair? Did you say "Fair"?

Wait, wait, the reader may protest - this is all very unfair to GCC, because the Studio result used all 8 cores on this 2-chip system, whereas GCC used only one core! You're using trickery!

To this plaintive whine, we respond that:

  • Compiler auto-parallelization technology is not a trick. Rather, it is an essential technology in order to get the best performance from today's multi-core systems. Nearly all contemporary CPU chips provide support for multiple cores. Compilers should do everything possible to make it easy to take advantage of these resources.

  • We tried to use more than one core for GCC, via the -ftree-parallelize-loops=n flag. GCC's autoparallelization appears to be at a much earlier development stage than Studio's: we did not observe any improvement for any value of "n" that we tested (a hypothetical compile line is sketched after this list). From the GCC wiki, it appears that a new autoparallelization effort is under development, which may improve its results at a later time.

  • But, all right, if you insist, we will make things even harder for Studio, and see how it does.
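For reference, a hypothetical compile line using that flag (the source file name is a placeholder):

    # Ask GCC to split parallelizable loops across 4 threads
    gcc -O3 -m64 -ftree-parallelize-loops=4 hotloop.c -o hotloop
    ./hotloop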

Tie Another Hand Behind Studio's Back

The earlier section mentioned various ways in which the performance comparison had been made easier for GCC. Continuing the paragraph numbering from above, we took these additional measures:

  1. Removed the autoparallelization from Studio, substituting instead a request for 2MB pagesizes (which the GCC tuning already had).

  2. Added "peak" tuning to GCC: for benchmarks that benefit, add -ffast-math, and compiler profile-driven feedback

At this point, Studio base beats GCC base by 38%, and Studio base beats GCC peak by more than 25% (see table 2).

Table 2: Additional Comparisons, SPECfp2006, Sun Studio/Solaris vs. GCC/Linux

Configuration                                    Base  Peak
Sun Studio/OpenSolaris: base only, no autopar    29.1  --
GCC V4.4 / SuSE Linux 11                         21.1  23.1

The notes from Table 1 apply here as well.

Bottom line

The freely available Sun Studio 12 Update 1 compiler on OpenSolaris provides more than double the performance of GCC V4.4 on SuSE Linux, as measured by SPECfp_base2006.

If compilation is restricted to avoid using autoparallelization, Sun Studio still wins by 38% (base to base), or by more than 25% (Studio base vs. GCC peak).

YMMV

Your mileage may vary. It is certain that both GCC and Studio could be improved with additional tuning efforts. Both provide dozens of compiler flags, which can keep the tester delightfully engaged for an unbounded number of days. We feel that the tuning presented here is reasonable, and that additional tuning effort, if applied to both compilers, would not radically alter the conclusions.

Additional Information

The results disclosed in this article are from "reportable" runs of the SPECfp2006 benchmarks, which have been submitted to SPEC.

[1] SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. Competitive comparisons are based on data published at www.spec.org as of 1 June 2009. The X6270 result can be found at http://www.spec.org/cpu2006/results/res2009q2/cpu2006-20090413-07019.html.

Wednesday Jun 03, 2009

Wide Variety of Topics to be discussed on BestPerf

A sample of the various Sun and partner technologies to be discussed:
OpenSolaris, Solaris, Linux, Windows, VMware, gcc, Java, Glassfish, MySQL, Sun Studio, ZFS, DTrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
