Tuesday Mar 26, 2013

SPARC T5-2 Achieves ZFS File System Encryption Benchmark World Record

Oracle continues to lead in enterprise security. Oracle's SPARC T5 processors combined with the Oracle Solaris ZFS file system demonstrate faster file system encryption than equivalent x86 systems using Intel Xeon Processor E5-2600 Sequence chips, which have AES-NI security instructions.

Encryption is the process of encoding data for privacy; the data owner needs a key to access the encoded data.

  • The SPARC T5-2 server is 3.4x faster than a two-processor Intel Xeon E5-2690 server running Oracle Solaris 11.1 using the AES-NI GCM security instructions for creating encrypted files.

  • The SPARC T5-2 server is 2.2x faster than a two-processor Intel Xeon E5-2690 server running Oracle Solaris 11.1 using the AES-NI CCM security instructions for creating encrypted files.

  • The SPARC T5-2 server consumes a significantly smaller percentage of system resources than a two-processor Intel Xeon E5-2690 server.

Performance Landscape

Below are results running two different ciphers for ZFS encryption. Results are presented for runs without any cipher, labeled clear, and a variety of different key lengths. The results represent the maximum delivered values measured for 3 concurrent sequential write operations using 1M blocks. Performance is measured in MB/sec (bigger is better). System utilization is reported as %CPU as measured by iostat (smaller is better).

The results for the x86 server were obtained using Oracle Solaris 11.1 with performance bug fixes.

Encryption Using AES-GCM Ciphers

GCM Encryption: 3 Concurrent Sequential Writes
System                        Clear            AES-256-GCM      AES-192-GCM      AES-128-GCM
                              MB/sec   %CPU    MB/sec   %CPU    MB/sec   %CPU    MB/sec   %CPU
SPARC T5-2 server             3,918    7       3,653    14      3,676    15      3,628    14
SPARC T4-2 server             2,912    11      2,662    31      2,663    30      2,779    31
2-Socket Intel Xeon E5-2690   3,969    42      1,062    58      1,067    58      1,076    57
SPARC T5-2 vs x86 server      1.0x             3.4x             3.4x             3.4x

Encryption Using AES-CCM Ciphers

CCM Encryption: 3 Concurrent Sequential Writes
System                        Clear            AES-256-CCM      AES-192-CCM      AES-128-CCM
                              MB/sec   %CPU    MB/sec   %CPU    MB/sec   %CPU    MB/sec   %CPU
SPARC T5-2 server             3,862    7       3,665    15      3,622    14      3,707    12
SPARC T4-2 server             2,945    11      2,471    26      2,801    26      2,442    25
2-Socket Intel Xeon E5-2690   3,868    42      1,566    64      1,632    63      1,689    66
SPARC T5-2 vs x86 server      1.0x             2.3x             2.2x             2.2x

Configuration Summary

Storage Configuration:

Sun Storage 6780 array
4 CSM2 trays, each with 16 x 83 GB 15K RPM drives
8 x 8 Gb/sec Fibre Channel ports per host
R0 write cache enabled, controller mirroring off for peak write bandwidth
8-drive R0 512K stripe pools mirrored via ZFS to storage

Sun Storage 6580 array
9 CSM2 trays, each with 16 x 136 GB 15K RPM drives
8 x 4 Gb/sec Fibre Channel ports per host
R0 write cache enabled, controller mirroring off for peak write bandwidth
4-drive R0 512K stripe pools mirrored via ZFS to storage

Server Configuration:

SPARC T5-2 server
2 x SPARC T5 3.6 GHz processors
512 GB memory
Oracle Solaris 11.1

SPARC T4-2 server
2 x SPARC T4 2.85 GHz processors
256 GB memory
Oracle Solaris 11.1

Sun Server X3-2L server
2 x Intel Xeon E5-2690, 2.90 GHz processors
128 GB memory
Oracle Solaris 11.1

Switch Configuration:

Brocade 5300 FC switch

Benchmark Description

This benchmark evaluates secure file system performance by measuring the rate at which encrypted data can be written. The Vdbench tool was used to generate the IO load. The test performed 3 concurrent sequential write operations using 1M blocks to 3 separate files.
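
Below is a minimal Python sketch of the I/O pattern just described: three concurrent sequential writers, each streaming 1 MB blocks to its own file, with the aggregate write rate reported at the end. It is not the actual Vdbench configuration; the file paths, per-file size, and use of Python threads are illustrative assumptions.

    # Sketch of the measured I/O pattern: 3 concurrent sequential writes using
    # 1 MB blocks, one file per writer. Paths and sizes are assumptions.
    import os
    import threading
    import time

    BLOCK = 1024 * 1024                      # 1 MB blocks
    BLOCKS_PER_FILE = 4096                   # 4 GB per file (assumption)
    FILES = ["/encpool/fs/f0", "/encpool/fs/f1", "/encpool/fs/f2"]   # hypothetical encrypted dataset

    def writer(path, written, idx):
        buf = os.urandom(BLOCK)              # incompressible data
        with open(path, "wb") as f:
            for _ in range(BLOCKS_PER_FILE):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        written[idx] = BLOCK * BLOCKS_PER_FILE

    def main():
        written = [0] * len(FILES)
        threads = [threading.Thread(target=writer, args=(p, written, i))
                   for i, p in enumerate(FILES)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        elapsed = time.time() - start
        print("aggregate %.0f MB/sec" % (sum(written) / elapsed / 1e6))

    if __name__ == "__main__":
        main()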

Key Points and Best Practices

  • ZFS encryption is integrated with the ZFS command set. Like other ZFS operations, encryption operations such as key changes and re-key are performed online.

  • Data is encrypted using AES (Advanced Encryption Standard) with key lengths of 256, 192, and 128 in the CCM and GCM operation modes.

  • The flexibility of encrypting specific file systems is a key feature.

  • ZFS encryption is inheritable to descendent file systems. Key management can be delegated through ZFS delegated administration.

  • ZFS encryption uses the Oracle Solaris Cryptographic Framework which gives it access to SPARC T5 and Intel Xeon E5-2690 processor hardware acceleration or to optimized software implementations of the encryption algorithms automatically.

  • On modern computers with multiple threads per core, simple statistics like %utilization measured in tools like iostat and vmstat are not "hard" indications of the resources that might be available for other processing. For example, 90% idle may not mean that 10 times the work can be done. So drawing numerical conclusions must be done carefully.

See Also

Disclosure Statement

Copyright 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of March 26, 2013.

Tuesday Oct 02, 2012

Performance of Oracle Business Intelligence Benchmark on SPARC T4-4

Oracle's SPARC T4-4 server configured with four SPARC T4 3.0 GHz processors delivered 25,000 concurrent users on Oracle Business Intelligence Enterprise Edition (BI EE) 11g benchmark using Oracle Database 11g Release 2 running on Oracle Solaris 10.

  • A SPARC T4-4 server running Oracle Business Intelligence Enterprise Edition 11g achieved 25,000 concurrent users with an average response time of 0.36 seconds with Oracle BI server cache set to ON.

  • The benchmark data clearly shows that the underlying SPARC T4 server hardware and the Oracle BI EE 11g (11.1.1.6.0 64-bit) platform scale within a single system, supporting 25,000 concurrent users while executing 415 transactions/sec.

  • The benchmark demonstrated the scalability of Oracle Business Intelligence Enterprise Edition 11g 11.1.1.6.0, which was deployed in a vertical scale-out fashion on a single SPARC T4-4 server.

  • Oracle Internet Directory configured on SPARC T4 server provided authentication for the 25,000 Oracle BI EE users with sub-second response time.

  • A SPARC T4-4 with internal Solid State Drive (SSD) using the ZFS file system showed significant I/O performance improvement over traditional disk for the Web Catalog activity. In addition, ZFS helped get past the UFS limitation of 32767 sub-directories in a Web Catalog directory.

  • The multi-threaded 64-bit Oracle Business Intelligence Enterprise Edition 11g and SPARC T4-4 server proved to be a successful combination by providing sub-second response times for the end user transactions, consuming only half of the available CPU resources at 25,000 concurrent users, leaving plenty of head room for increased load.

  • The Oracle Business Intelligence on SPARC T4-4 server benchmark results demonstrate that comprehensive BI functionality built on a unified infrastructure with a unified business model yields best-in-class scalability, reliability and performance.

  • Oracle BI EE 11g is a newer version of Business Intelligence Suite with richer and superior functionality. Results produced with Oracle BI EE 11g benchmark are not comparable to results with Oracle BI EE 10g benchmark. Oracle BI EE 11g is a more difficult benchmark to run, exercising more features of Oracle BI.

Performance Landscape

Results for the Oracle BI EE 11g version of the benchmark. Results are not comparable to the Oracle BI EE 10g version of the benchmark.

Oracle BI EE 11g Benchmark
System Number of Users Response Time (sec)
1 x SPARC T4-4 (4 x SPARC T4 3.0 GHz) 25,000 0.36

Results for the Oracle BI EE 10g version of the benchmark. Results are not comparable to the Oracle BI EE 11g version of the benchmark.

Oracle BI EE 10g Benchmark
System Number of Users
2 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz) 50,000
1 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz) 28,000

Configuration Summary

Hardware Configuration:

SPARC T4-4 server
4 x SPARC T4 processors, 3.0 GHz
128 GB memory
4 x 300 GB internal SSD

Storage Configuration:

Sun ZFS Storage 7120
16 x 146 GB disks

Software Configuration:

Oracle Solaris 10 8/11
Oracle Solaris Studio 12.1
Oracle Business Intelligence Enterprise Edition 11g (11.1.1.6.0)
Oracle WebLogic Server 10.3.5
Oracle Internet Directory 11.1.1.6.0
Oracle Database 11g Release 2

Benchmark Description

Oracle Business Intelligence Enterprise Edition (Oracle BI EE) delivers a robust set of reporting, ad-hoc query and analysis, OLAP, dashboard, and scorecard functionality with a rich end-user experience that includes visualization, collaboration, and more.

The Oracle BI EE benchmark test used five different business user roles - Marketing Executive, Sales Representative, Sales Manager, Sales Vice-President, and Service Manager. These roles included a maximum of 5 different pre-built dashboards. Each dashboard page had an average of 5 reports in the form of a mix of charts, tables and pivot tables, returning anywhere from 50 rows to approximately 500 rows of aggregated data. The test scenario also included drill-down into multiple levels from a table or chart within a dashboard.

The benchmark test scenario uses a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards using the Service Manager role. The BI user selects the Service Effectiveness dashboard, which shows him four distinct reports, Service Request Trend, First Time Fix Rate, Activity Problem Areas, and Cost Per Completed Service Call, spanning 2002 to 2005. The user then proceeds to view the Customer Satisfaction dashboard, which also contains a set of 4 related reports, and drills down on some of the reports to see the detail data. The BI user continues to view more dashboards – Customer Satisfaction and Service Request Overview, for example. After navigating through those dashboards, the user logs out of the application. The benchmark test is executed against a full production version of the Oracle Business Intelligence 11g Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a real-world customer scenario.

See Also

Disclosure Statement

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 30 September 2012.

Friday Sep 30, 2011

SPARC T4-2 Server Beats Intel (Westmere AES-NI) on ZFS Encryption Tests

Oracle continues to lead in enterprise security. Oracle's SPARC T4 processors combined with Oracle's Solaris ZFS file system demonstrate faster file system encryption than equivalent systems based on the Intel Xeon Processor 5600 Sequence chips, which use AES-NI security instructions.

Encryption is the process where data is encoded for privacy and a key is needed by the data owner to access the encoded data. The benefits of using ZFS encryption are:

  • The SPARC T4 processor is 3.5x to 5.2x faster at creating encrypted files than the Intel Xeon Processor X5670, which has the AES-NI security instructions.

  • ZFS encryption is integrated with the ZFS command set. Like other ZFS operations, encryption operations such as key changes and re-key are performed online.

  • Data is encrypted using AES (Advanced Encryption Standard) with key lengths of 256, 192, and 128 in the CCM and GCM operation modes.

  • The flexibility of encrypting specific file systems is a key feature.

  • ZFS encryption is inheritable to descendent file systems. Key management can be delegated through ZFS delegated administration.

  • ZFS encryption uses the Oracle Solaris Cryptographic Framework which gives it access to SPARC T4 processor and Intel Xeon X5670 processor (Intel AES-NI) hardware acceleration or to optimized software implementations of the encryption algorithms automatically.

Performance Landscape

Below are results running two different ciphers for ZFS encryption. Results are presented for runs without any cipher, labeled clear, and a variety of different key lengths.

Encryption Using AES-CCM Ciphers

MB/sec – 5 File Create* Encryption
Clear AES-256-CCM AES-192-CCM AES-128-CCM
SPARC T4-2 server 3,803 3,167 3,335 3,225
SPARC T3-2 server 2,286 1,554 1,561 1,594
2-Socket 2.93 GHz Xeon X5670 3,325 750 764 773

Speedup T4-2 vs X5670 1.1x 4.2x 4.4x 4.2x
Speedup T4-2 vs T3-2 1.7x 2.0x 2.1x 2.0x

Encryption Using AES-GCM Ciphers

MB/sec – 5 File Create* Encryption
Clear AES-256-GCM AES-192-GCM AES-128-GCM
SPARC T4-2 server 3,618 3,929 3,164 2,613
SPARC T3-2 server 2,278 1,451 1,455 1,449
2-Socket 2.93 GHz Xeon X5670 3,299 749 748 753

Speedup T4-2 vs X5670 1.1x 5.2x 4.2x 3.5x
Speedup T4-2 vs T3-2 1.6x 2.7x 2.2x 1.8x

(*) Maximum delivered values measured over 5 concurrent mkfile operations.

Configuration Summary

Storage Configuration:

Sun Storage 6780 array
16 x 15K RPM drives
RAID 0 pool
Write back cache enabled
Controller cache mirroring disabled for maximum bandwidth for test
Eight 8 Gb/sec ports per host

Server Configuration:

SPARC T4-2 server
2 x SPARC T4 2.85 GHz processors
256 GB memory
Oracle Solaris 11

SPARC T3-2 server
2 x SPARC T3 1.6 GHz processors
Oracle Solaris 11 Express 2010.11

Sun Fire X4270 M2 server
2 x Intel Xeon X5670, 2.93 GHz processors
Oracle Solaris 11

Benchmark Description

The benchmark ran the UNIX command mkfile(1M). Mkfile is a simple, single-threaded program that creates a file of a specified size. The script ran 5 mkfile operations in the background, and the peak bandwidth observed during the test was recorded.
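
A minimal Python sketch of that script is shown below, assuming a hypothetical mount point for the encrypted ZFS file system; it launches the five mkfile(1M) operations in the background and waits for them, with the peak bandwidth observed separately (for example, with iostat).

    # Launch 5 mkfile(1M) operations in the background on an encrypted ZFS
    # file system and wait for completion. The mount point and per-file size
    # are illustrative assumptions; peak bandwidth is observed with iostat.
    import subprocess

    MOUNT = "/encpool/fs"        # hypothetical encrypted dataset mount point
    SIZE = "8g"                  # per-file size (assumption)

    procs = [subprocess.Popen(["mkfile", SIZE, "%s/file%d" % (MOUNT, i)])
             for i in range(5)]
    for p in procs:
        p.wait()
    print("all 5 mkfile operations complete")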

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of December 16, 2011.

Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration's (PSTM) single workflow scalability and multiple workflow throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow has good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster scaling 4.5X. ProMAX® scales to 4.7X on 10 nodes with one input data set and 6.3X with two consecutive input data sets (i.e. twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-thread ProMAX® workflow throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes and per-node scaling of 4x at 6 nodes; performance then flattens quickly, reaching a peak of 60x at 240 processes and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limits scheduling to a maximum of 12 processes per node, effectively restricting it to one process per physical core. The speedups relative to a single process and to a single node are reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes and per-node scaling of 4.3x at 6 nodes. Performance of 55x at 120 processes and per-node scaling of 5x with 10 nodes is reached, and scalability trends upward more strongly than in the case of two processes per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core appear to run between 28% and 64% more efficiently (at 96 and 24 processes, respectively). With a full complement of 10 nodes (120 processes), the peak performance is only 9.5% lower than with 2 processes per vcpu (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bisection bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache are employed, and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3%, suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are over-subscribed 4x and performance degrades 66% per workflow. With six concurrent workflows, over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, performance approaches the non-over-subscribed case.
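
The two placement policies can be illustrated with a short Python sketch; this is not ProMAX, PVM, or Oracle Grid Engine code, and the node and slot counts are assumptions that mirror the cluster described here.

    # Compact packs a workflow's processes onto as few nodes as possible;
    # distributed spreads them round-robin across all nodes.
    SLOTS_PER_NODE = 12                      # one process per physical core (assumption)
    NODES = ["node%02d" % i for i in range(10)]

    def compact(nprocs):
        """Pack processes onto as few nodes as possible."""
        return [NODES[p // SLOTS_PER_NODE] for p in range(nprocs)]

    def distributed(nprocs):
        """Spread processes round-robin across all nodes."""
        return [NODES[p % len(NODES)] for p in range(nprocs)]

    if __name__ == "__main__":
        # A 12-process workflow: compact uses 1 node, distributed touches all 10.
        print(sorted(set(compact(12))))      # ['node00']
        print(sorted(set(distributed(12))))  # ['node00', ..., 'node09']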

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling

141616 traces x 624 samples


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with NFS attached Sun ZFS Storage 7320 appliance and then with NFS attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD Opteron 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently only uses Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software uses only TCP networking and has no internal facility for assigning memory affinity or processor binding. Every compute node runs a PVM daemon.

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows, one per node on 10 nodes, each complete in the same time and deliver the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, the most efficient way to run them is to fully distribute them, one workflow per node (or even across two nodes), and run them concurrently rather than using all nodes for each workflow and running them consecutively, even though the individual time-to-solution will be longer. For example, while the best-case configuration used here runs 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete than ten jobs, one per node, running concurrently (see the short calculation after this list).

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows is similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.
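
As referenced above, a short calculation with the quoted 6.6x ten-node speedup shows where the "over 50% longer" figure for running the workflows consecutively comes from.

    # Back-of-the-envelope check of consecutive (all nodes per job) versus
    # concurrent (one node per job) scheduling, using the quoted 6.6x speedup.
    single_node_time = 1.0                   # one workflow on one node (normalized)
    ten_node_speedup = 6.6

    consecutive = 10 * (single_node_time / ten_node_speedup)   # 10 jobs, one after another
    concurrent = single_node_time                              # 10 jobs at once, one per node

    print("consecutive/concurrent = %.2f" % (consecutive / concurrent))   # ~1.52, i.e. >50% longer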

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production-sized grids, each imaging 800 samples.

  • Both the Reverse Time Migration I/O and combined computation shows near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes
Number Nodes Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup Total Time (sec) Kernel Time (sec) Total Speedup Kernel Speedup
16 504 259 2.0 2.2* 1024 551 1.7 2.0
14 565 279 1.8 2.0 1191 677 1.5 1.6
12 662 343 1.6 1.6 1426 817 1.2 1.4
10 784 394 1.3 1.4 1501 856 1.2 1.3
8 1024 560 1.0 1.0 1745 1108 1.0 1.0

* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads
Number Nodes Threads per Node Grid Size - 1243 x 1151 x 1231 Grid Size - 2486 x 1151 x 1231
Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup Total Time (sec) Kernel Time (sec) Total HT Speedup Kernel HT Speedup
16 24 504 259 1.02 1.08 1024 551 1.06 1.12
16 12 515 279 1.00 1.00 1088 616 1.00 1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance
                                      Grid Size - 1243 x 1151 x 1231                                    Grid Size - 2486 x 1151 x 1231
Number Nodes   Overlap MBytes Read    Time (sec)  Time Relative 8-node  Total GBytes Read  Read Rate GB/s    Time (sec)  Time Relative 8-node  Total GBytes Read  Read Rate GB/s
16             2040                   16.7        1.1                   23.2               1.4               36.8        1.1                   44.3               1.2
8              951                    14.8        1.0                   22.1               1.6               33.0        1.0                   43.2               1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra
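
The pad-overlap formula above can be written as a small Python function; the values it returns reproduce the 2.04 GB and 0.95 GB figures for 16 and 8 nodes (the trace-overlap formula later in this section follows the same pattern).

    # Overlapping-read overhead for the velocity, epsilon, and delta grids:
    # (nodes - 1) x order_in_space x ny x nz x 4 bytes x 3 files
    def grid_overlap_bytes(nodes, order_in_space=8, ny=1151, nz=1231, nfiles=3):
        return (nodes - 1) * order_in_space * ny * nz * 4 * nfiles

    for nodes in (16, 8):
        print("%2d nodes: %.2f GB extra" % (nodes, grid_overlap_bytes(nodes) / 1e9))
    # 16 nodes: 2.04 GB extra; 8 nodes: 0.95 GB extra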

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases, not only by the number of nodes, but as the y dimension and/or the z dimension increases.

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Wednesday Sep 22, 2010

Oracle Solaris 10 9/10 ZFS OLTP Performance Improvements

Oracle Solaris ZFS has seen significant performance improvements in the Oracle Solaris 10 9/10 release compared to the previous release, Oracle Solaris 10 10/09.
  • A 28% reduction in response time, holding the load constant, in an OLTP workload test comparing the Oracle Solaris 10 9/10 release to Oracle Solaris 10 10/09.
  • A 19% increase in IOPS throughput, holding the response time constant at 28 msec, in an OLTP workload test comparing the Oracle Solaris 10 9/10 release to Oracle Solaris 10 10/09.
  • OLTP workload throughput rates of at least 800 IOPS, using Oracle's Sun SPARC Enterprise T5240 server and Oracle's StorageTek 2540 array, were used in calculating the above improvement percentages.

Performance Landscape

8K Block Random Read/Write OLTP-Style Test
IOPS    Response Time (msec), Oracle Solaris 10 9/10    Response Time (msec), Oracle Solaris 10 10/09
100 5.1 8.3
500 11.7 24.6
800 20.1 28.1
900 23.9 32.0
950 28.8 34.4

Results and Configuration Summary

Storage Configuration:

1 x StorageTek 2540 Array
12 x 73 GB 15K RPM HDDs
2 RAID5 5+1 volumes
1 RAID0 host stripe across the volumes

Server Configuration:

1 x Sun SPARC Enterprise T5240 server with
8 GB memory
2 x 1.6 GHz UltraSPARC T2 Plus processors

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris 10 9/10
ZFS
SVM

Benchmark Description

The benchmark is an IOPS test consisting of a mixture of random 8K block reads and writes accessing a significant portion of the available storage. As such, the workload is not very "cache friendly" and hence illustrates the capability of the system to more fully utilize the processing capability of the back-end storage.
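
A minimal Python sketch of that access pattern is shown below; the test file path, working-set size, read/write mix, and duration are illustrative assumptions, not the actual test harness.

    # Random 8K reads and writes over a large region of a file, reporting
    # IOPS and mean response time. Assumes PATH already exists and is at
    # least FILE_SIZE bytes; all parameters below are assumptions.
    import os
    import random
    import time

    PATH = "/zfspool/oltp.dat"               # hypothetical test file on ZFS
    FILE_SIZE = 8 * 1024**3                  # 8 GB working set
    BLOCK = 8192
    READ_FRACTION = 0.6                      # read/write mix
    DURATION = 60                            # seconds

    def run():
        buf = os.urandom(BLOCK)
        latencies = []
        fd = os.open(PATH, os.O_RDWR)
        end = time.time() + DURATION
        while time.time() < end:
            offset = random.randrange(0, FILE_SIZE // BLOCK) * BLOCK
            t0 = time.time()
            if random.random() < READ_FRACTION:
                os.pread(fd, BLOCK, offset)
            else:
                os.pwrite(fd, buf, offset)
            latencies.append(time.time() - t0)
        os.close(fd)
        print("IOPS: %.0f  avg response: %.1f msec"
              % (len(latencies) / DURATION, 1000 * sum(latencies) / len(latencies)))

    if __name__ == "__main__":
        run()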

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system features compared to dedicated local disks. There was no significant difference in run time observed when running up to 8 concurrent 16 thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple job throughput characterization revealed in this benchmark study is key to pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provides valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.
ProMAX single job performance on the 6-node cluster shows near linear speedup node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Threads Per Node Speedup to 1 Thread Speedup to 1 Node
6 16 54.2 5.5
4 16 36.2 3.6
3 16 26.1 2.6
2 16 17.6 1.8
1 16 10.0 1.0
1 14 9.2
1 12 8.6
1 10 7.2*
1 8 7.5
1 6 5.9
1 4 3.9
1 3 3.0
1 2 2.0
1 1 1.0

* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Percent Cluster Used
1 1 16 1.00 1 13
2 1 16 1.00 2 25
4 1 16 1.00 4 50
8 1 16 1.00 8 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results report the performance of various distributed-method resource scheduling levels relative to 1, 2, and 4 concurrent-job compact-method baselines.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 8 Threads Used
1 1 8 1.00 1 8 100
1 4 2 1.01 4 2 25
1 8 1 1.01 8 1 13

2 1 8 1.00 2 8 100
2 4 2 1.01 4 4 50
2 8 1 1.01 8 2 25

4 1 8 1.00 4 8 100
4 4 2 1.00 4 8 100
4 8 1 1.01 8 4 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 16 Threads Used
1 1 16 0.66 1 16 100*
1 2 8 1.00 2 8 50
1 4 4 1.03 4 4 25
1 8 2 1.06 8 2 13

2 1 16 0.70 2 16 100*
2 2 8 1.00 4 8 50
2 4 4 1.07 8 4 25
2 8 2 1.08 8 4 25

4 1 16 0.74 4 16 100*
4 4 4 0.74 4 16 100*
4 2 8 1.00 8 8 50
4 4 4 1.05 8 8 50
4 8 2 1.04 8 8 50

8 1 16 1.00 8 16 100*
8 4 4 1.00 8 16 100*
8 8 2 1.00 8 16 100*

* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives = 466 GB
2 Internal 93 GB read optimized SSD = 186 GB
1 External Sun Storage J4400 array with 22 1TB SATA drives and 2 18GB write optimized SSD
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job on as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedups relative to a single thread and to a single node are reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of a job across multiple nodes. The results report the performance of various distributed-method resource scheduling levels relative to 1, 2, and 4 concurrent-job compact-method baselines.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads. However, 12 threads are 11 percent faster than 8. This can be attributed to the two additional support threads that run when hyperthreading initiates.

  • Single job performance on a single node is near linear up to the number of cores in the node, i.e., 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and causing the reduced speedups.

    Users need to be aware of these performance differences and how they affect their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all to the Sun Storage 7410 system, the performance saw little degradation (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at 75% of the cluster capacity, distributed scheduling outperformed the compact scheduling by up to 34%. Even when running at 100% of the cluster capacity, the distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node to its dedicated direct attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.

First Test

The first test compares the performance of a single MM8 job on a single node using local storage to running a number of jobs across the cluster, showing the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Local HDD Relative Throughput NFSoIB Relative Throughput IB 7410 Relative Throughput
13% 1 1.00 1.00* 0.98
25% 2 0.98 0.97 0.98
50% 4 0.98 0.96 0.97
75% 6 0.98 0.95 0.95
100% 8 0.98 0.95 0.95

* Performance measured on node hosting its local disk to other nodes in the cluster.

Second Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with distributed scheduling. The tests are run on an 8-node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load  Number of MM8 Jobs  Compact Scheduling Relative Throughput  Distributed Scheduling* Relative Throughput
13% 1 1.00 1.34
25% 2 1.00 1.32
50% 4 0.99 1.25
75% 6 0.97 1.10
100% 8 0.97 0.98

* Each distributed job has 1 MPI process per node.

Third Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with distributed scheduling. This test only uses 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load  Number of MM8 Jobs  Compact Scheduling Relative Throughput  Distributed Scheduling* Relative Throughput
25% 1 1.00 1.39
50% 2 1.00 1.28
100% 4 1.00 1.00

* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4 node cluster. It compares the performance of running the cluster fully loaded and changing how the applications are run, either compact or distributed. The comparisons are made against the individual application running the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better job performance than running just one kind of application on a single cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchoff Time Migration (PSTM)

Number of PSTM Jobs   Number of MM8 Jobs   ECLIPSE Compact   ECLIPSE Distributed   PSTM Distributed   PSTM Compact   Cluster Load
0                     1                    1.00              1.40                  -                  -              25%
0                     2                    1.00              1.27                  -                  -              50%
0                     4                    0.99              0.98                  -                  -              100%
1                     2                    -                 1.19                  1.02               -              100%
2                     0                    -                 -                     1.07               0.96           100%
1                     0                    -                 -                     1.08               1.00           50%

ECLIPSE Compact Scheduling: 1 node x 8 processes per job; ECLIPSE Distributed Scheduling: 4 nodes x 2 processes per job
PSTM Distributed Scheduling: 4 nodes x 4 processes per job; PSTM Compact Scheduling: 2 nodes x 8 processes per job

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
24 GB memory (6 x 4 GB memory at 1333 MHz)
1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives (466 GB total)
2 Internal 93 GB read optimized SSD (186 GB total)
1 Sun Storage J4400 with 22 1 TB SATA drives and 2 18 GB write optimized SSD
20 TB RAID-Z2 (double parity) data and 2-way striped write optimized SSD or
11 TB mirrored data and mirrored write optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance between running each job on a separate node using local disk and using a shared network attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running a 2 MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM).

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run on 1333 MHz or greater memory speeds. In order to maintain 1333 MHz memory, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than is available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Oracle Unified Storage 7410 provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ascii input parameter files and creating an initial solution grid and runtime specifications. This is followed by a very long running simulation that writes data, restart files, and generates reports to the 7410. Due to the nature of the small block I/O, the mirrored configuration for the 7410 outperformed the RAID-Z2 configuration.

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, that is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then, it calculates and reports the well connections at an average 260 KB writes/second with 32 operations/second = 32 x 8 KB writes/second. However, the actual size of each I/O operation varies between 2 to 100 KB and there are peaks every 20 seconds. The write cache is on average operating at 8 accesses/second at approximately 61 KB/second (8 x 8 KB writes/sec). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increases.

  • MM8 multiple job startup time is reduced on shared file systems, if each job uses separate input files.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Apr 13, 2010

Oracle Sun Flash Accelerator F20 PCIe Card Accelerates Web Caching Performance

Using Oracle's Sun FlashFire technology, the Sun Flash Accelerator F20 PCIe Card is shown to be a high-performance and cost-effective caching device for web servers. Many current web and application servers are designed with an active cache used for holding items such as session objects, files, and web pages. The Sun F20 card is shown to be an excellent candidate for improving performance over HDD-based solutions.

  • The Sun Flash Accelerator F20 PCIe Card provides 2x better Quality of Service (QoS) at the same load as compared to 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card enables scaling to 3x more users than 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card provides 25% higher Quality of Service (QoS) than 15K RPM high performance disk drives at maximum rate.

  • The Sun Flash Accelerator F20 PCIe Card allows for easy expansion of the webcache. Each card provides an additional 96 GB of storage.

  • The Sun Flash Accelerator F20 PCIe Card used as a caching device offers Bitrate and Quality of Service (QoS) comparable to that provided by memory. While memory also provides excellent caching performance in comparison to disk, memory capacity is limited in servers.

Performance Landscape

Experiment results using three Sun Flash Accelerator F20 PCIe Cards.

Load Factor          No Cache     F20 Webcache   F20 Webcache   Memcache
                     (Max Load)   (@Disk Load)   (Max Load)     (@F20 Load)
Max Connections      7,000        7,000          27,000         27,000
Average Bitrate      445 Kbps     870 Kbps       602 Kbps       678 Kbps
Cache Hit Rate       0%           98%            99%            56%

QoS Bitrates         %Connect     %Connect       %Connect       %Connect
900 Kbps - 1 Mbps    0%           97%            0%             0%
800 Kbps             0%           3%             0%             6%
700 Kbps             0%           0%             64%            70%
600 Kbps             18%          0%             24%            15%
420 Kbps - 500 Kbps  88%          0%             12%            9%

Experiment results using two Sun Flash Accelerator F20 PCIe Cards.

Load Factor          No Cache     F20 Webcache   F20 Webcache   Memcache
                     (Max Load)   (@Disk Load)   (Max Load)     (@F20 Load)
Max Connections      7,000        7,000          22,000         27,000
Average Bitrate      445 Kbps     870 Kbps       622 Kbps       678 Kbps
Cache Hit Rate       0%           98%            80%            56%

QoS Bitrates         %Connect     %Connect       %Connect       %Connect
900 Kbps - 1 Mbps    0%           97%            0%             0%
800 Kbps             0%           3%             1%             6%
700 Kbps             0%           0%             68%            70%
600 Kbps             18%          0%             26%            15%
420 Kbps - 500 Kbps  88%          0%             5%             9%

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4270, 72 GB memory
3 x Sun Flash Accelerator F20 PCIe Cards
Sun Storage J4400 (12 15K RPM disks)

Software Configuration:

Sun Java System Web Server 7
OpenSolaris
Flickr Photo Download Workload
Oracle Solaris Zettabyte File System (ZFS)

Three configurations are compared:

  1. No cache, 12 x high-speed 15K RPM Disks
  2. 3 x Sun Flash Accelerator F20 PCIe Cards as cache device
  3. 64 GB server memory as cache device

Benchmark Description

This benchmark is based upon the description of the Flickr website presented at http://highscalability.com/flickr-architecture. It measures the performance of an HTTP-based photo slide show workload. The workload randomly selects and downloads photos from a set of 80 photos stored in 4 bins (a selection sketch follows the list below):

  • 20 large photos, 1800x1800 pixels, 1 MB, 1% probability
  • 20 medium photos, 1000x1000 pixels, 500 KB, 4% probability
  • 20 small photos, 540x540 pixels, 100 KB, 35% probability
  • 20 thumbnail photos, 100x100 pixels, 5 KB, 60% probability
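The selection logic implied by these bins can be sketched as follows. The bin table mirrors the probabilities and sizes listed above; the function names and structure are purely illustrative, not the benchmark's actual load generator.

    # Minimal sketch of the workload's photo selection, using the bin
    # probabilities and sizes listed above; everything else is illustrative.
    import random

    BINS = [
        ("large",     1_000_000, 0.01),   # 1 MB,   1% probability
        ("medium",      500_000, 0.04),   # 500 KB, 4% probability
        ("small",       100_000, 0.35),   # 100 KB, 35% probability
        ("thumbnail",     5_000, 0.60),   # 5 KB,   60% probability
    ]

    def pick_photo():
        """Pick a bin according to its probability, then one of its 20 photos."""
        name, size, _ = random.choices(BINS, weights=[p for _, _, p in BINS])[0]
        return name, random.randint(0, 19), size

    if __name__ == "__main__":
        for _ in range(5):
            print(pick_photo())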

Benchmark metrics are:

  • Scalability – Number of persistent connections achieved
  • Quality of Service (QoS) – bitrate achieved by each user
    • max speed: 1 Mbps, min speed SLA: 420 Kbps
    • divides bitrates between max and min in 5 bands, corresponding to dial-in, T1, etc.
    • example: 900 Kbps, 800 Kbps, 700 Kbps, 600 Kbps, 500 Kbps
    • reports the percentage of users in each bitrate band (a banding sketch follows this list)
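A simplified sketch of the banding is shown below; the band edges follow the example bitrates above, while the sample data and reporting format are assumptions.

    # Bucket each connection's achieved bitrate into QoS bands and report the
    # percentage of users per band (band edges from the example above).

    BANDS_KBPS = [900, 800, 700, 600, 420]   # highest band first; 420 Kbps is the min SLA

    def qos_report(bitrates_kbps):
        counts = {band: 0 for band in BANDS_KBPS}
        for rate in bitrates_kbps:
            for band in BANDS_KBPS:
                if rate >= band:
                    counts[band] += 1
                    break
        total = len(bitrates_kbps) or 1
        return {f">={band} Kbps": round(100.0 * n / total, 1) for band, n in counts.items()}

    if __name__ == "__main__":
        sample = [950, 870, 640, 450, 430, 910]   # made-up per-connection bitrates
        print(qos_report(sample))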

Three cases were tested:

  • Disk as OverFlow Cache – Contents are served from 12 high-performance 15K RPM disks configured in a ZFS zpool.
  • Sun Flash Accelerator F20 PCIe Card as Cache Device – Contents are served from 2 F20 Cards, with 8 component DOMs configured in a ZFS zpool
  • Memory as Cache – Contents are served from tmpfs

Key Points and Best Practices

See Also

Disclosure Statement

Results as of 4/1/2010.

Thursday Nov 19, 2009

SPECmail2009: New World record on T5240 1.6GHz Sun 7310 and ZFS

The Sun SPARC Enterprise T5240 server running the Sun Java Messaging Server 7.2 achieved a World Record SPECmail2009 result using the Sun Storage 7310 Unified Storage System and the ZFS file system. Sun's Open Storage platforms enable another world record.

  • World record SPECmail2009 benchmark using the Sun SPARC Enterprise T5240 server (two 1.6GHz UltraSPARC T2 Plus), Sun Communications Suite 7, Solaris 10, and the Sun Storage 7310 Unified Storage System achieved 14,500 SPECmail_Ent2009 users at 69,857 Sessions/Hour.

  • This SPECmail2009 benchmark result clearly demonstrates that the Sun Messaging Server 7.2, Solaris 10 and ZFS solution can support a large, enterprise level IMAP mail server environment as a low cost 'Sun on Sun' solution, delivering the best performance and maximizing data integrity and availability of Sun Open Storage and ZFS.

  • The Sun SPARC Enterprise T5240 server supported 2.4 times more users with a 2.4 times better sessions/hour rate than the Apple Xserv3,1 solution on the SPECmail2009 benchmark.

  • There are no IBM Power6 results on this benchmark.

  • The configuration using Sun Open Storage outperformed all previous results, which used traditional direct attached storage and a significantly higher number of disk devices.

SPECmail2009 Performance Landscape (ordered by performance)

System (Processors)                                          Users   Sessions/hour  Disks   OS          Messaging Server
Sun SPARC Enterprise T5240 (2 x 1.6GHz UltraSPARC T2 Plus)   14,500  69,857         58 NAS  Solaris 10  CommSuite 7.2 / Sun JMS 7.2
Sun SPARC Enterprise T5240 (2 x 1.6GHz UltraSPARC T2 Plus)   12,000  57,758         80 DAS  Solaris 10  CommSuite 5 / Sun JMS 6.3
Sun Fire X4275 (2 x 2.93GHz Xeon X5570)                      8,000   38,348         44 NAS  Solaris 10  Sun JMS 6.2
Apple Xserv3,1 (2 x 2.93GHz Xeon X5570)                      6,000   28,887         82 DAS  MacOS 10.6  Dovecot 1.1.14 apple 0.5
Sun SPARC Enterprise T5220 (1 x 1.4GHz UltraSPARC T2)        3,600   17,316         52 DAS  Solaris 10  Sun JMS 6.2

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Users - SPECmail_Ent2009 Users
Sessions/hour - SPECmail2009 Sessions/hour
NAS - Network Attached Storage
DAS - Direct Attached Storage

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB memory
      2 x 146GB, 10K RPM SAS disks, 4 x 32GB SSDs

External Storage:

    2 x Sun Storage 7310 Unified Storage System, each with
      32 GB of memory
      24 x 1 TB 7200 RPM SATA Drives

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communications Suite 7 Update 2
      Sun Java System Messaging Server 7.2
      Directory Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun Storage 7310 Unified Storage System was configured with one J4400 JBOD array, with 22 x 1TB SATA drives arranged as a mirrored device and 4 shared volumes built under the mirrored device. In total, 8 mirrored volumes from the 2 x Sun Storage 7310 systems were mounted on the system under test (SUT) for the messaging mail indexes and mail message file systems using the NFSv4 protocol. Four SSDs were used as the SUT internal disks, each configured as a ZFS file system; these four ZFS file systems were used for the messaging server queue, store metadata, and LDAP. SSDs substantially reduced the store metadata and queue latencies.

  • Each Sun Storage 7310 Unified Storage System was connected to the SUT via a dual 10-Gigabit Ethernet Fiber XFP card.

  • The Sun Storage 7310 Unified Storage System software version is 2009.08.11,1-0.

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • Substantial performance improvement and scalability were observed with Sun Communications Suite 7 Update 2, Java Messaging Server 7.2 and Directory Server 6.2.

  • See the SPEC Report for all OS, network and messaging server tunings.

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 10/22/09 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240, SPECmail_Ent2009 14,500 users at 69,857 SPECmail2009 Sessions/hour. Apple Xserv3,1, SPECmail_Ent2009 6,000 users at 28,887 SPECmail2009 Sessions/hour.

Tuesday Oct 13, 2009

Sun T5440 Oracle BI EE Sun SPARC Enterprise T5440 World Record

Oracle BI EE, a component of Oracle Fusion Middleware, was run on two Sun SPARC Enterprise T5440 servers and achieved world record performance.
  • Two Sun SPARC Enterprise T5440 servers with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best performance of 50K concurrent users on the Oracle BI EE 10.1.3.4 benchmark with Oracle 11g database running on free and open Solaris 10.

  • The two node Sun SPARC Enterprise T5440 servers with Oracle BI EE running on Solaris 10 using 8 Solaris Containers shows 1.8x scaling over Sun's previous one node SPARC Enterprise T5440 server result with 4 Solaris Containers.

  • The two-node SPARC Enterprise T5440 configuration demonstrated the performance and scalability of the UltraSPARC T2 Plus processor, servicing 50K users with a 0.2776 sec response time.

  • The Sun SPARC Enterprise T5220 server was used as an NFS server with 4 internal SSDs and the ZFS file system which showed significant I/O performance improvement over traditional disk for Business Intelligence Web Catalog activity.

  • Oracle Fusion Middleware provides a family of complete, integrated, hot pluggable and best-of-breed products known for enabling enterprise customers to create and run agile and intelligent business applications. Oracle BI EE performance demonstrates why so many customers rely on Oracle Fusion Middleware as their foundation for innovation.

  • IBM has not published any POWER6 processor based results on this important benchmark.

Performance Landscape

System                            Chips  GHz  Processor Type       Users
2 x Sun SPARC Enterprise T5440 8 1.6 UltraSPARC T2 Plus 50,000
1 x Sun SPARC Enterprise T5440 4 1.6 UltraSPARC T2 Plus 28,000
5 x Sun Fire T2000 1 1.2 UltraSPARC T1 10,000

Results and Configuration Summary

Hardware Configuration:

    2 x Sun SPARC Enterprise T5440 (1.6GHz/128GB)
    1 x Sun SPARC Enterprise T5220 (1.2GHz/64GB) and 4 SSDs (used as NFS server)

Software Configuration:

    Solaris10 05/09
    Oracle BI EE 10.1.3.4
    Oracle Fusion Middleware
    Oracle 11gR1

Benchmark Description

The objective of this benchmark is to highlight how Oracle BI EE can support pervasive deployments in large enterprises, using minimal hardware, by simulating an organization that needs to support more than 25,000 active concurrent users, each operating in mixed mode: ad-hoc reporting, application development, and report viewing.

The user population was divided into a mix of administrative users and business users. A maximum of 28,000 concurrent users were actively interacting and working in the system during the steady-state period. The tests executed 580 transactions per second, with think times of 60 seconds per user, between requests. In the test scenario 95% of the workload consisted of business users viewing reports and navigating within dashboards. The remaining 5% of the concurrent users, categorized as administrative users, were doing application development.

The benchmark scenario used a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards, viz. "Service Manager". The user then selects the "Service Effectiveness" dashboard, which shows him four distinct reports, "Service Request Trend", "First Time Fix Rate", "Activity Problem Areas", and "Cost Per Completed Service Call - 2002 till 2005". The user then proceeds to view the "Customer Satisfaction" dashboard, which also contains a set of 4 related reports. He then proceeds to drill down on some of the reports to see the detail data. Then the user proceeds to more dashboards, for example "Customer Satisfaction" and "Service Request Overview". After navigating through these dashboards, he logs out of the application.

This benchmark did not use a synthetic database schema. The benchmark tests were run on a full production version of the Oracle Business Intelligence Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a true customer scenario.

See Also

Disclosure Statement

Oracle BI EE benchmark results 10/13/2009, see

SPEC CPU2006 Results on M-Series Servers With Updated SPARC64 VII Processors

The SPEC CPU2006 benchmarks were run on the new 2.88 GHz and 2.53 GHz SPARC64 VII processors for the Sun SPARC Enterprise M-series servers. The new processors were tested in the Sun SPARC Enterprise M4000, M5000, M8000 and M9000 servers.


  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 20% on the SPECint_rate2006 benchmark.

  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 29% on the SPECint_rate_base2006 benchmark.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII 2.88GHz processors delivered results of 2590 SPECint_rate2006 and 2100 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 13% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII 2.88GHz processors delivered results of 1450 SPECint_rate2006 and 1250 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 17% for SPECint_rate2006 and 13% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII 2.88GHz processors delivered results of 753 SPECint_rate2006 and 666 SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 18% for SPECint_rate2006 and 14% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII 2.53GHz processors delivered results of 296 SPECint_rate2006 and 234 SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 12% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII 2.53GHz processors delivered results of 152 SPECint_rate2006 and 116 SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 13% for SPECint_rate2006 and 4% for SPECfp_rate2006.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 10/07/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - large systems

System                         Cores/Chips  Type          GHz   Base Copies  Base   Peak   Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 9031 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 8845 na
SGI Altix 4700 Density 256/128 Itanium 2 1.66 256 2893 3354
vSMP Foundation 128/32 Xeon X5570 2.93 255 3147 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 2715 2971
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 2400 2590 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 2088 2288
IBM Power 595 64/32 POWER6 5.0 128 1866 2155
HP Superdome 128/64 Itanium 2 1.6 128 1534 1648
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1370 1450 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1111 1294
SPARC Enterprise M9000 128/32 SPARC64 VI 2.52 255 1141 1240
Unisys ES7000 96/16 Xeon X7460 2.66 96 999 1049
SGI Altix ICE 8200EX 32/8 Xeon X5570 2.93 64 931 999
IBM Power 575 32/16 POWER6 4.7 64 812 934
IBM Power 570 32/16 POWER6+ 4.2 64 661 832
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 706 753 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.4 127 553 650
SPARC Enterprise M8000 64/16 SPARC64 VII 2.52 127 565 637

SPECint_rate2006 results - small systems

System                         Cores/Chips  Type          GHz   Base Copies  Base   Peak   Comments
Sun Fire X4440 24/4 Opteron 8435 SE 2.6 24 296 377
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 267 296 New
Sun Blade X6440 16/4 Opteron 8389 2.9 16 226 292
HP ProLiant BL680c G5 24/4 Xeon E7458 2.4 24 247 268
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 232 264
IBM Power 550 8/4 POWER6+ 5.0 16 215 263
Sun Fire X2270 8/2 Xeon X5570 2.93 16 223 260
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 171 183
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 136 152 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 118 135

SPECfp_rate2006 results - large systems

System                         Cores/Chips  Type          GHz   Base Copies  Base   Peak   Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 10583 na
SGI Altix 4700 Density 1024/512 Itanium 2 1.66 1020 10580 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 6502 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 3419 3507
ScaleMP vSMP Foundation 128/32 Xeon X5570 2.93 255 2553 na
IBM Power 595 64/32 POWER6 5.0 128 1681 2184
IBM Power 595 64/32 POWER6 5.0 128 1822 2108
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 1930 2100 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 1861 2005
SGI Altix 4700 Bandwidth 128/64 Itanium 2 1.66 128 1832 1947
HP Superdome 128/64 Itanium 2 1.6 128 1422 1479
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1190 1250 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1160 1225
SPARC Enterprise M9000 128/32 SPARC64 VII 2.52 255 1059 1110
IBM Power 575 32/16 POWER6 4.7 64 730 839
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 616 666 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.52 127 588 636
IBM Power 570 32/16 POWER6+ 4.2 64 517 602
SPARC Enterprise M8000 64/32 SPARC64 VI 2.4 127 538 582

SPECfp_rate2006 results - small systems

System                         Cores/Chips  Type          GHz   Base Copies  Base   Peak   Comments
Supermicro H8QM8-2 24/4 Opteron 8435 SE 2.8 24 261 287
SPARC Enterprise T5440 32/4 UltraSPARC T2 Plus 1.6 255 254 270
IBM Power 560 16/8 POWER6+ 3.6 32 226 263
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 218 234 New
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 208 223
IBM Power 550 8/4 POWER6+ 5.0 16 188 222
ASUS Z8PE-D18 8/2 Xeon X5570 2.93 16 197 203
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 124 133
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 111 116 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 107 112

Results and Configuration Summary

Test Configurations:

Sun SPARC Enterprise M9000
64 x 2.88 GHz SPARC64 VII
1152 GB (448 x 2GB + 64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M9000
32 x 2.88 GHz SPARC64 VII
704 GB (160 x 2GB + 96 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M8000
16 x 2.88 GHz SPARC64 VII
512 GB (128 x 4GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M5000
8 x 2.53 GHz SPARC64 VII
128 GB (64 x 2GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M4000
4 x 2.53 GHz SPARC64 VII
32 GB (32 x 1GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Results Summary:

                        M9000 (64 chips)  M9000 (32 chips)  M8000  M5000  M4000
SPECint_rate_base2006 2400 1370 706 267 136
SPECint_rate2006 2590 1450 753 296 152
SPECfp_rate_base2006 1930 1190 616 218 111
SPECfp_rate2006 2100 1250 666 234 116
SPECint_base2006 - - 12.4 - 12.1
SPECint2006 - - 13.6 - 12.9
SPECfp_base2006 - - 15.6 - 13.3
SPECfp2006 - - 16.5 - 13.9
SPECfp_base2006 - autopar - - 28.2 - -
SPECfp2006 - autopar - - 33.9 - -

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.
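As a rough, unofficial illustration of how a throughput ("rate") score is put together: each benchmark contributes a rate proportional to the number of copies times the ratio of its reference time to its measured time, and the suite score is the geometric mean of those per-benchmark rates. The run times below are invented and are not SPEC reference times or published measurements.

    # Unofficial sketch of composing a rate-style score as a geometric mean of
    # per-benchmark rates; the run times below are invented.
    import math

    def rate_score(per_benchmark, copies):
        """per_benchmark: list of (reference_seconds, measured_seconds) pairs."""
        rates = [copies * ref / run for ref, run in per_benchmark]
        return math.exp(sum(math.log(r) for r in rates) / len(rates))  # geometric mean

    if __name__ == "__main__":
        fake_runs = [(9000, 1800), (10000, 2100), (7100, 1500)]   # illustrative only
        print(round(rate_score(fake_runs, copies=511), 1))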

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise M9000 server were measured on a Fujitsu SPARC Enterprise M9000. The Sun SPARC Enterprise M9000 and Fujitsu SPARC Enterprise M9000 are electronically equivalent. Results for the Sun SPARC Enterprise M8000, M4000 and M5000 were measured on those systems. The similarly named Fujitsu systems are electronically equivalent.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12 Update 1, which is used in these submissions, provides updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

See Also

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 7 October 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun SPARC Enterprise M9000 2400 SPECint_rate_base2006, 2590 SPECint_rate2006, 1930 SPECfp_rate_base2006, 2100 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips) 1370 SPECint_rate_base2006, 1450 SPECint_rate2006, 1190 SPECfp_rate_base2006, 1250 SPECfp_rate2006; Sun SPARC Enterprise M8000 706 SPECint_rate_base2006, 753 SPECint_rate2006, 616 SPECfp_rate_base2006, 666 SPECfp_rate2006; Sun SPARC Enterprise M5000 267 SPECint_rate_base2006, 296 SPECint_rate2006, 218 SPECfp_rate_base2006, 234 SPECfp_rate2006; Sun SPARC Enterprise M4000 136 SPECint_rate_base2006, 152 SPECint_rate2006, 111 SPECfp_rate_base2006, 116 SPECfp_rate2006; Sun SPARC Enterprise M9000 (2.52GHz) 2088 SPECint_rate_base2006, 2288 SPECint_rate2006, 1860 SPECfp_rate_base2006, 2010 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips 2.52GHz) 1140 SPECint_rate_base2006, 1240 SPECint_rate2006, 1060 SPECfp_rate_base2006, 1110 SPECfp_rate2006; Sun SPARC Enterprise M8000 (2.52GHz) 565 SPECint_rate_base2006, 637 SPECint_rate2006, 538 SPECfp_rate_base2006, 582 SPECfp_rate2006; Sun SPARC Enterprise M5000 (2.4GHz) 232 SPECint_rate_base2006, 264 SPECint_rate2006, 208 SPECfp_rate_base2006, 223 SPECfp_rate2006; Sun SPARC Enterprise M4000 (2.4GHz) 118 SPECint_rate_base2006, 135 SPECint_rate2006, 107 SPECfp_rate_base2006, 112 SPECfp_rate2006; IBM Power 595 1866 SPECint_rate_base2006, 2155 SPECint_rate2006,

Wednesday Aug 12, 2009

SPECmail2009 on Sun SPARC Enterprise T5240 and Sun Java System Messaging Server 6.3

Significance of Results

The Sun SPARC Enterprise T5240 server running the Sun Java Messaging server 6.3 achieved World Record SPECmail2009 results using ZFS.

  • A Sun SPARC Enterprise T5240 server powered by two 1.6 GHz UltraSPARC T2 Plus processors running the Sun Java Communications Suite 5 software along with the Solaris 10 Operating System and using six Sun StorageTek 2540 arrays achieved a new World Record 12000 SPECmail_Ent2009 IMAP4 users at 57,758 Sessions/hour for SPECmail2009.
  • The Sun SPARC Enterprise T5240 server achieved twice the number of users and twice the sessions/hour rate of the Apple Xserv3,1 solution equipped with Intel Nehalem processors.
  • The Sun result was obtained using ~10% fewer disk spindles with the Sun StorageTek 2540 RAID controller direct attach storage solution versus Apple's direct attached storage.
  • This benchmark result demonstrates that the Sun SPARC Enterprise T5240 server together with Sun Java Communication Suite 5 component Sun Java System Messaging Server 6.3, Solaris 10 and ZFS on Sun StorageTek 2540 arrays supports a large, enterprise level IMAP mail server environment. This solution is reliable, low cost, and low power, delivering the best performance and maximizing the data integrity with Sun's ZFS file systems.

Performance Landscape

SPECmail2009 (ordered by performance)

System                        Processor Type        GHz   Ch, Co, Th   SPECmail_Ent2009 Users   SPECmail2009 Sessions/hour
Sun SPARC Enterprise T5240 UltraSPARC T2 Plus 1.6 2, 16, 128 12,000 57,758
Sun Fire X4275 Xeon X5570 2.93 2, 8, 16 8,000 38,348
Apple Xserv3,1 Xeon X5570 2.93 2, 8, 16 6,000 28,887
Sun SPARC Enterprise T5220 UltraSPARC T2 1.4 1, 8, 64 3,600 17,316

Notes:

    Number of SPECmail_Ent2009 users (bigger is better)
    SPECmail2009 Sessions/hour (bigger is better)
    Ch, Co, Th: Chips, Cores, Threads

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240

      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB
      8 x 146GB, 10K RPM SAS disks

    6 x Sun StorageTek 2540 Arrays,

      4 arrays with 12 x 146GB 15K RPM SAS disks
      2 arrays with 12 x 73GB 15K RPM SAS disks

    2 x Sun Fire X4600 benchmark manager, load generator and mail sink

      8 x AMD Opteron 8356 2.7 GHz QC processors
      64 GB
      2 x 73GB 10K RPM SAS disks

    Sun Fire X4240 load generator

      2 x AMD Opteron 2384 2.7 GHz DC processors
      16 GB
      2 x 73GB 10K RPM SAS disks

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communication Suite 5
    Sun Java System Messaging Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun StorageTek 2540 array was configured with 6 hardware RAID1 volumes. A total of 36 RAID1 volumes were configured, 24 of size 146GB and 12 of size 73GB. Four ZPOOLs of (6x146GB RAID1 volumes) were mounted as the four primary message stores and ZFS file systems. Four ZPOOLs of (8x73GB RAID1 volumes) were mounted as the four primary message indexes. The hardware RAID1 volumes were created with a 64K stripe size and with read ahead turned off. The 7x146GB internal drives were used to create four ZPOOLs and ZFS file systems for the LDAP, store metadata, queue and the mail server log.

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • See the SPEC Report for all OS, network and messaging server tunings.

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 08/07/2009 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240 (16 cores, 2 chips) SPECmail_Ent2009 12000 users at 57,758 SPECmail2009 Sessions/hour. Apple Xserv3,1 (8 cores, 2 chips) SPECmail_Ent2009 6000 users at 28,887 SPECmail2009 Sessions/hour.

Thursday Jul 23, 2009

World Record Performance of Sun CMT Servers

This week, Sun continues to highlight the record-breaking performance of its latest update to the chip multi-threaded (CMT) Sun SPARC Enterprise server family running Solaris. Some of these benchmarks leverage a variety of Sun's unique technologies, including ZFS, SSDs, and various storage products. These benchmarks were blogged about by various members of our team, and the URLs are shown below.

Messages

  • Sun's CMT is the most powerful CPU regardless of architectural/implementation details (#transistors, #cores, threads, MHz, etc.)!
  • Performance tests show that Sun can outperform IBM Power6 by more than 2x on a variety of benchmarks.
  • Performance tests show Sun's new 1.6GHz CMT systems can be 20% faster than Sun's previous generation 1.4GHz processors, given Sun's continual advancements in both hardware and software.

Benchmark Results Recently Blogged

Sun T5440 Oracle BI EE World Record Performance
http://blogs.sun.com/BestPerf/entry/sun_t5440_oracle_bi_ee

Sun T5440 World Record SAP-SD 4-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode), Beats IBM POWER6 (note1)
http://blogs.sun.com/BestPerf/entry/sun_t5440_world_record_sap

Zeus ZXTM Traffic Manager World Record on Sun T5240
http://blogs.sun.com/BestPerf/entry/top_performance_on_sun_sparc

Sun T5440 SPECjbb2005, Sun 1.6GHz T2 Plus chip is 2.3x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/sun_t5440_specjbb2005_beats_ibm

New SPECjAppServer2004 Performance on the Sun SPARC Enterprise T5440
http://blogs.sun.com/BestPerf/entry/new_specjappserver2004_performance_on_sun

1.6 GHz SPEC CPU2006: World Record 4-chip system, Rate Benchmarks, Beats IBM POWER6
http://blogs.sun.com/BestPerf/entry/1_6_ghz_spec_cpu2006

Sun Blade T6320 World Record 1-chip SPECjbb2005 performance, Sun 1.6GHz T2 Plus chip is 2.6x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/new_specjbb2005_performance_on_the

Comparison Table

Benchmark: Oracle BI EE
Sun CMT: Sun T5440
Tier: Appl, Database
Software: Oracle 11g, Oracle BIEE, ZFS, Solaris
Key Messages:
  • World Record: T5440
  • Achieved 28,000 users
  • Reference

Benchmark: SAP-SD 2-Tier
Sun CMT: Sun T5440
Tier: Appl, Database
Software: SAP ECC 6.0 EP4, Oracle 10g, Solaris
Key Messages:
  • World Record 4-socket: T5440
  • T5440 Beats 4-socket IBM 550 5GHz Power6 by 26% (note1)
  • T5440 Beats HP DL585 G6 4-socket Opteron (note1)
  • Unicode version

Benchmark: SPECjAppServer2004
Sun CMT: Sun T5440
Tier: Appl, Database
Software: Oracle WebLogic, Oracle 11g, JDK 1.6.0_14, Solaris
Key Messages:
  • World Record Single System (Appl Tier): T5440
  • T5440 is 6.4x faster than IBM Power 570 4.7GHz Power6
  • T5440 is 73% faster than HP DL 580 G5 Xeon 6C
  • Oracle Fusion Middleware

Benchmark: SPECjbb2005
Sun CMT: Sun T5440
Tier: Appl
Software: Java HotSpot, OpenSolaris
Key Messages:
  • 1.6GHz US T2 Plus CPU is 2.3x faster than the IBM 4.7GHz Power6 CPU
  • 1.6GHz US T2 Plus CPU is 21% faster than the previous generation 1.4GHz US T2 Plus CPU
  • Sun T5440 has 2.3x better power/perf than the IBM 570 (8 x 4.7GHz Power6)

Benchmark: SPECjbb2005
Sun CMT: Sun Blade T6320
Tier: Appl
Software: Java HotSpot, OpenSolaris
Key Messages:
  • World Record 1-socket: T6320
  • 1.6GHz US T2 Plus CPU is 2.6x faster than the IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

Benchmark: SPEC CPU2006
Sun CMT: Sun T5440, Sun T5240, Sun T5220, Sun T5120, Sun T6320
Tier: all tiers
Software: Sun Studio 12, Solaris, ZFS
Key Messages:
  • World Record 4-socket: T5440
  • 1.6GHz US T2 Plus CPU is 2.6x faster than the IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

Benchmark: Zeus ZXTM Traffic Manager
Sun CMT: Sun T5240
Tier: Web
Software: Zeus ZXTM v5.1r1, Solaris
Key Messages:
  • World Record: T5240
  • T5240 Beats f5 BIG-IP VIPRION by 34%; 2.6x better $/perf
  • T5240 Beats f5 BIG-IP 8800 by 91%; 2.7x better $/perf
  • T5240 Beats Citrix 12000 by 2.2x; 3.3x better $/perf
  • No IBM result

Virtualization

Sun's announcement also included updated virtualization software (LDOMs 1.1). Downloads are available to existing SPARC Enterprise server customers at: http://www.sun.com/servers/coolthreads/ldoms/index.jsp. Also see the blog posting "LDoms for Dummies" at http://blogs.sun.com/PierreReynes/entry/ldoms_for_dummies

Try & Buy Program

Sun is also offering free 60-day trials on Sun CMT servers with a very popular Try and Buy program: http://www.sun.com/tryandbuy.

Benchmark Performance Disclosure Statements (the URLs listed above go into more detail on each of these benchmarks)

Note1: 4-processor world record on the 2-tier SAP SD Standard Application Benchmark with 4720 SD User, as of July 23, 2009, IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. T5440 beats HP new 4-socket Opteron Servers (HPDL585 G6 with 4665 SD User and HP BL685c G6 with 4422 SD User)

Two-tier SAP Sales and Distribution (SD) standard SAP ERP 6.0 2005/EP4 (Unicode) application benchmarks as of 07/21/09: Sun SPARC Enterprise T5440 Server (4 processors, 32 cores, 256 threads) 4,720 SAP SD Users, 4x 1.6 GHz UltraSPARC T2 Plus, 256 GB memory, Oracle10g, Solaris10, Cert# 2009026. HP ProLiant DL585 G6 (4 processors, 24 cores, 24 threads) 4,665 SAP SD Users, 4x 2.8 GHz AMD Opteron Processor 8439 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009025. HP ProLiant BL685c G6 (4 processors, 24 cores, 24 threads) 4,422 SAP SD Users, 4x 2.6 GHz AMD Opteron Processor 8435, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009021. IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. HP ProLiant DL585 G5 (4 processors, 16 cores, 16 threads) 3,430 SAP SD Users, 4x 3.1 GHz AMD Opteron Processor 8393 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009008. HP ProLiant BL685 G6 (4 processors, 16 cores, 16 threads) 3,118 SAP SD Users, 4x 2.9 GHz AMD Opteron Processor 8389, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009007. NEC Express5800 (4 processors, 24 cores, 24 threads) 2,957 SAP SD Users, 4x 2.66 GHz Intel Xeon Processor X7460, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009018. Dell PowerEdge M905 (4 processors, 16 cores, 16 threads) 2,129 SAP SD Users, 4x 2.7 GHz AMD Opteron Processor 8384, 96 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2009017. Sun Fire X4600M2 (8 processors, 32 cores, 32 threads) 7,825 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384, 128 GB memory, MaxDB 7.6, Solaris 10, Cert# 2008070. IBM System x3650 M2 (2 Processors, 8 Cores, 16 Threads) 5,100 SAP SD users,2x 2.93 Ghz Intel Xeon X5570, DB2 9.5, Windows Server 2003 Enterprise Edition, Cert# 2008079. HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 4,995 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2008071. SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark.

Oracle Business Intelligence Enterprise Edition benchmark, see http://www.oracle.com/solutions/business_intelligence/resource-library-whitepapers.html for more. Results as of 7/20/09.

Zeus is TM of Zeus Technology Limited. Results as of 7/21/2009 on http://www.zeus.com/news/press_articles/zeus-price-performance-press-release.html?gclid=CLn4jLuuk5cCFQsQagod7gTkJA.

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 16 July 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun Blade T6320 89.2 SPECint_rate_base2006, 96.7 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 89.1 SPECint_rate_base2006, 97.0 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5240 172 SPECint_rate_base2006, 183 SPECint_rate2006, 124 SPECfp_rate_base2006, 133 SPECfp_rate2006; Sun SPARC Enterprise T5440 338 SPECint_rate_base2006, 360 SPECint_rate2006, 254 SPECfp_rate_base2006, 270 SPECfp_rate2006; Sun Blade T6320 76.4 SPECint_rate_base2006, 85.5 SPECint_rate2006, 58.1 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 76.2 SPECint_rate_base2006, 83.9 SPECint_rate2006, 57.9 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5240 142 SPECint_rate_base2006, 157 SPECint_rate2006, 111 SPECfp_rate_base2006, 119 SPECfp_rate2006; Sun SPARC Enterprise T5440 270 SPECint_rate_base2006, 301 SPECint_rate2006, 212 SPECfp_rate_base2006, 230 SPECfp_rate2006; IBM p 570 53.2 SPECint_rate_base2006, 60.9 SPECint_rate2006, 51.5 SPECfp_rate_base2006, 58.0 SPECfp_rate2006; IBM Power 520 102 SPECint_rate_base2006, 124 SPECint_rate2006, 88.7 SPECfp_rate_base2006, 105 SPECfp_rate2006; IBM Power 550 215 SPECint_rate_base2006, 263 SPECint_rate2006, 188 SPECfp_rate_base2006, 222 SPECfp_rate2006; HP Integrity BL870c 114 SPECint_rate_base2006; HP Integrity rx7640 87.4 SPECfp_rate_base2006, 90.8 SPECfp_rate2006.

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 7/17/2009 on http://www.spec.org. SPECjbb2005, Sun Blade T6320 229576 SPECjbb2005 bops, 28697 SPECjbb2005 bops/JVM; IBM p 570 88089 SPECjbb2005 bops, 88089 SPECjbb2005 bops/JVM; Fujitsu TX100 223691 SPECjbb2005 bops, 111846 SPECjbb2005 bops/JVM; IBM x3350 194256 SPECjbb2005 bops, 97128 SPECjbb2005 bops/JVM; Sun SPARC Enterprise T5120 192055 SPECjbb2005 bops, 24007 SPECjbb2005 bops/JVM.

SPECjAppServer2004, Sun SPARC Enterprise T5440 (4 chips, 32 cores) 7661.16 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 24 cores) 4410.07 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 16 cores) 3339.94 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge 2950 (4 chips, 16 cores) 4794.33 SPECjAppServer2004 JOPS@Standard; Dell PowerEdge R610 (2 chips, 8 cores) 3975.13 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge R610 (4 chips, 16 cores) 7311.50 SPECjAppServer2004 JOPS@Standard; IBM Power 570 (2 chips, 4 cores) 1197.51 SPECjAppServer2004 JOPS@Standard; SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Results from http://www.spec.org as of 7/20/09.

SPECjbb2005 Sun SPARC Enterprise T5440 (4 chips, 32 cores) 841380 SPECjbb2005 bops, 26293 SPECjbb2005 bops/JVM. Results submitted to SPEC. HP DL585 G5 (4 chips, 24 cores) 937207 SPECjbb2005 bops, 234302 SPECjbb2005 bops/JVM. IBM Power 570 (8 chips, 16 cores) 798752 SPECjbb2005 bops, 99844 SPECjbb2005 bops/JVM. Sun SPARC Enterprise T5440 (4 chips, 32 cores) 692736 SPECjbb2005 bops, 21648 SPECjbb2005 bops/JVM. SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 7/20/09.

IBM p 570 8P 4.7GHz (4 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in “Facts and Features Report”: ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Tuesday Jul 21, 2009

Sun T5440 Oracle BI EE World Record Performance

Oracle BI EE Sun SPARC Enterprise T5440 World Record Performance

The Sun SPARC Enterprise T5440 server running the new 1.6 GHz UltraSPARC T2 Plus processor delivered world record performance on Oracle Business Intelligence Enterprise Edition (BI EE) tests using Sun's ZFS.
  • The Sun SPARC Enterprise T5440 server with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best single system performance of 28K concurrent users on the Oracle BI EE benchmark. This result used Solaris 10 with Solaris Containers and the Oracle 11g Database software.

  • The benchmark demonstrates the scalability of Oracle Business Intelligence Cluster with 4 nodes running in Solaris Containers within single Sun SPARC Enterprise T5440 server.

  • The Sun SPARC Enterprise T5440 server with internal SSDs and the ZFS file system showed significant I/O performance improvement over traditional disk for Business Intelligence Web Catalog activity.

Performance Landscape

System                            Chips  Cores  Threads  GHz  Processor Type       Users
1 x Sun SPARC Enterprise T5440 4 32 256 1.6 UltraSPARC T2 Plus 28,000
5 x Sun Fire T2000 1 8 32 1.2 UltraSPARC T1 10,000

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus processors
      256 GB
      STK2540 (6 x 146GB)

Software Configuration:

    Solaris 10 5/09
    Oracle BIEE 10.1.3.4 64-bit
    Oracle 11g R1 Database

Benchmark Description

The objective of this benchmark is to highlight how Oracle BI EE can support pervasive deployments in large enterprises, using minimal hardware, by simulating an organization that needs to support more than 25,000 active concurrent users, each operating in mixed mode: ad-hoc reporting, application development, and report viewing.

The user population was divided into a mix of administrative users and business users. A maximum of 28,000 concurrent users were actively interacting and working in the system during the steady-state period. The tests executed 580 transactions per second, with think times of 60 seconds per user, between requests. In the test scenario 95% of the workload consisted of business users viewing reports and navigating within dashboards. The remaining 5% of the concurrent users, categorized as administrative users, were doing application development.

The benchmark scenario used a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards, viz. "Service Manager". The user then selects the "Service Effectiveness" dashboard, which shows him four distinct reports, "Service Request Trend", "First Time Fix Rate", "Activity Problem Areas", and "Cost Per Completed Service Call - 2002 till 2005". The user then proceeds to view the "Customer Satisfaction" dashboard, which also contains a set of 4 related reports. He then proceeds to drill down on some of the reports to see the detail data. Then the user proceeds to more dashboards, for example "Customer Satisfaction" and "Service Request Overview". After navigating through these dashboards, he logs out of the application.

This benchmark did not use a synthetic database schema. The benchmark tests were run on a full production version of the Oracle Business Intelligence Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a true customer scenario.

Key Points and Best Practices

Since the server has 32 cores, we created 4 Solaris Containers with 8 cores dedicated to each container. A total of four instances of the BI server plus Presentation server (collectively referred to as an 'instance' from here on) were installed, one instance per container. All four BI instances were clustered using the BI Cluster software components.

The ZFS file system was used to overcome the 'Too many links' error seen with ~28,000 concurrent users. With UFS, the file system hit the limitation of 32767 sub-directories per directory (LINK_MAX) with ~28K users online, producing thousands of errors because no new directories could be created beyond that count within a single directory. The Web Catalog stores each user profile on disk by creating at least one dedicated directory per user, so with more than 25,000 concurrent users, ZFS is clearly the way to go.
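The directory-per-user pattern and the UFS limit it runs into can be sketched as follows; the catalog path is hypothetical, and on ZFS the same loop would not hit a LINK_MAX ceiling.

    # Sketch of why per-user Web Catalog directories overflow UFS: each user gets
    # at least one subdirectory, and UFS caps a directory's link count at 32767
    # (LINK_MAX). The path below is hypothetical.
    import errno, os

    CATALOG_ROOT = "/export/webcat"   # hypothetical Web Catalog location

    def create_user_dirs(n_users):
        created = 0
        for uid in range(n_users):
            try:
                os.makedirs(os.path.join(CATALOG_ROOT, f"user{uid}"), exist_ok=True)
                created += 1
            except OSError as e:
                if e.errno == errno.EMLINK:   # "Too many links" on UFS
                    print(f"hit LINK_MAX after {created} user directories")
                    break
                raise
        return created

    # create_user_dirs(28_000)   # on UFS this would stop near 32K subdirectories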

See Also:

Oracle Business Intelligence Website; the BUSINESS INTELLIGENCE page has other results

Disclosure Statement

Oracle Business Intelligence Enterprise Edition benchmark, see http://www.oracle.com/solutions/business_intelligence/resource-library-whitepapers.html for more. Results as of 7/20/09.