Monday Oct 26, 2015

ZFS Encryption: SPARC T7-1 Performance

Oracle's SPARC T7-1 server can encrypt/decrypt at near clear-text throughput: it encrypts and decrypts on the fly and still has CPU cycles left over for the application.

  • The SPARC T7-1 server performed 475,123 clear text 8K read IOPS. With AES-256-CCM enabled on the file system, 8K read IOPS drop only about 3% to 461,038.

  • The SPARC T7-1 server performed 461,038 AES-256-CCM 8K read IOPS and a two-chip x86 E5-2660 v3 server performed 224,360 AES-256-CCM 8K read IOPS. The SPARC M7 processor result is 4.1 times faster per chip.

  • The SPARC T7-1 server performed 460,600 AES-192-CCM 8K read IOPS and a two chip x86 E5-2660 v3 server performed 228,654 AES-192-CCM 8K read IOPS. The SPARC M7 processor result is 4.0 times faster per chip.

  • The SPARC T7-1 server performed 465,114 AES-128-CCM 8K read IOPS and a two chip x86 E5-2660 v3 server performed 231,911 AES-128-CCM 8K read IOPS. The SPARC M7 processor result is 4.0 times faster per chip.

  • The SPARC T7-1 server performed 475,123 clear text 8K read IOPS and a two-chip x86 E5-2660 v3 server performed 438,483 clear text 8K read IOPS. The SPARC M7 processor result is 2.2 times faster per chip (the per-chip arithmetic is sketched below).
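
The per-chip claims above are plain arithmetic: divide the two-socket x86 result by its two chips and compare with the single SPARC M7 chip. A minimal Python sketch of that calculation, using only the IOPS values reported in this post:

    # Per-chip 8K read IOPS comparison, using the values reported in this post.
    # The SPARC T7-1 has one SPARC M7 chip; the x86 server has two E5-2660 v3 chips.
    results = {
        # cipher: (SPARC T7-1 IOPS, 2 x E5-2660 v3 IOPS)
        "Clear":       (475_123, 438_483),
        "AES-256-CCM": (461_038, 224_360),
        "AES-192-CCM": (460_600, 228_654),
        "AES-128-CCM": (465_114, 231_911),
    }
    SPARC_CHIPS, X86_CHIPS = 1, 2

    for cipher, (sparc_iops, x86_iops) in results.items():
        ratio = (sparc_iops / SPARC_CHIPS) / (x86_iops / X86_CHIPS)
        print(f"{cipher:12s} SPARC M7 per-chip advantage: {ratio:.1f}x")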

Performance Landscape

Results presented below are for random read performance with an 8K transfer size. All of the following results were run as part of this benchmark effort.

Read Performance – 8K
Encryption      SPARC T7-1                        2 x E5-2660 v3
                IOPS      Resp Time   % Busy      IOPS      Resp Time   % Busy
Clear           475,123   0.8 msec    43%         438,483   0.8 msec    95%
AES-256-CCM     461,038   0.83 msec   56%         224,360   1.6 msec    97%
AES-192-CCM     460,600   0.83 msec   56%         228,654   1.5 msec    97%
AES-128-CCM     465,114   0.82 msec   57%         231,911   1.5 msec    96%

IOPS – IO operations per second
Resp Time – response time
% Busy – percent cpu usage

Configuration Summary

SPARC T7-1 server
1 x SPARC M7 processor (4.13 GHz)
256 GB memory (16 x 16 GB)
Oracle Solaris 11.3
4 x StorageTek 8 Gb Fibre Channel PCIe HBA

Oracle Server X5-2L system
2 x Intel Xeon Processor E5-2660 V3 (2.60 GHz)
256 GB memory
Oracle Solaris 11.3
4 x StorageTek 8 Gb Fibre Channel PCIe HBA

Storage SAN
2 x Brocade 300 FC switches
2 x Sun Storage 6780 array with 64 disk drives / 16 GB Cache

Benchmark Description

The benchmark tests the performance of running an encrypted ZFS file system compared to the non-encrypted (clear text) ZFS file system. The tests were executed with Oracle's Vdbench tool, Version 5.04.03. Three different encryption methods are tested: AES-256-CCM, AES-192-CCM, and AES-128-CCM.

Key Points and Best Practices

  • The ZFS file system was configured with data cache disabled, metadata cache enabled, 4 pools, 128 LUNs, and 192 file systems with an 8K record size. Data cache was disabled to ensure data would be decrypted as it was read from storage. This is not a recommended setting for normal customer operations.

  • The tests were executed with Oracle's Vdbench tool against 192 file systems. Each file system was run with a queue depth of 2. The script used for testing is listed below.

  • hd=default,jvms=16
    sd=sd001,lun=/dev/zvol/rdsk/p1/vol001,size=5g,hitarea=100m
    sd=sd002,lun=/dev/zvol/rdsk/p1/vol002,size=5g,hitarea=100m
    #
    # sd003 through sd191 statements here
    #
    sd=sd192,lun=/dev/zvol/rdsk/p4/vol192,size=5g,hitarea=100m
    
    # VDBENCH work load definitions for run
    # Sequential write to fill storage.
    wd=swrite1,sd=sd*,readpct=0,seekpct=eof
    
    # Random Read work load.
    wd=rread,sd=sd*,readpct=100,seekpct=random,rhpct=100
    
    # VDBENCH Run Definitions for actual execution of load.
    rd=default,iorate=max,elapsed=3h,interval=10
    rd=seqwritewarmup,wd=swrite1,forxfersize=(1024k),forthreads=(16) 
    
    rd=default,iorate=max,elapsed=10m,interval=10
    
    rd=rread8k-50,wd=rread,forxfersize=(8k),iorate=curve, \
    curve=(95,90,80,70,60,50),forthreads=(2)
    

See Also

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/25/2015.

Oracle E-Business Suite Applications R12.1.3 (OLTP X-Large): SPARC M7-8 World Record

Oracle's SPARC M7-8 server, using a four-chip Oracle VM Server for SPARC (LDom) virtualized server, produced a world record 20,000 users running the Oracle E-Business OLTP X-Large benchmark. The benchmark runs five Oracle E-Business online workloads concurrently: Customer Service, iProcurement, Order Management, Human Resources Self-Service, and Financials.

  • The virtualized four-chip LDom on the SPARC M7-8 was able to handle more users than the previous best result which used eight processors of Oracle's SPARC M6-32 server.

  • The SPARC M7-8 server using Oracle VM Server for SPARC provides enterprise applications with high availability, where each application is executed in its own environment, insulated from and independent of the others.

Performance Landscape

Oracle E-Business (3-tier) OLTP X-Large Benchmark
System        Chips   Total Online Users   Weighted Average        90th Percentile
                                           Response Time (sec)     Response Time (sec)
SPARC M7-8    4       20,000               0.70                    1.13
SPARC M6-32   8       18,500               0.61                    1.16

Breakdown of the total number of users by component:

Users per Component
Component            SPARC M7-8     SPARC M6-32
Total Online Users   20,000 users   18,500 users
HR Self-Service       5,000 users    4,000 users
Order-to-Cash         2,500 users    2,300 users
iProcurement          2,700 users    2,400 users
Customer Service      7,000 users    7,000 users
Financial             2,800 users    2,800 users

Configuration Summary

System Under Test:

SPARC M7-8 server
8 x SPARC M7 processors (4.13 GHz)
4 TB memory
2 x 600 GB SAS-2 HDD
using a Logical Domain with
4 x SPARC M7 processors (4.13 GHz)
2 TB memory
2 x Sun Storage Dual 16Gb Fibre Channel PCIe Universal HBA
2 x Sun Dual Port 10GBase-T Adapter
Oracle Solaris 11.3
Oracle E-Business Suite 12.1.3
Oracle Database 11g Release 2

Storage Configuration:

4 x Oracle ZFS Storage ZS3-2 appliances each with
2 x Read Flash Accelerator SSD
1 x Storage Drive Enclosure DE2-24P containing:
20 x 900 GB 10K RPM SAS-2 HDD
4 x Write Flash Accelerator SSD
1 x Sun Storage Dual 8Gb FC PCIe HBA
Used for Database files, Zones OS, EBS Mid-Tier Apps software stack
and db-tier Oracle Server
2 x Sun Server X4-2L server with
2 x Intel Xeon Processor E5-2650 v2
128 GB memory
1 x Sun Storage 6Gb SAS PCIe RAID HBA
4 x 400 GB SSD
14 x 600 GB HDD
Used for Redo log files, db backup storage.

Benchmark Description

The Oracle E-Business OLTP X-Large benchmark simulates thousands of online users executing transactions typical of an internal Enterprise Resource Processing, simultaneously executing five application modules: Customer Service, Human Resources Self Service, iProcurement, Order Management and Financial.

Each database tier uses a database instance of about 600 GB in size, supporting thousands of application users, accessing hundreds of objects (tables, indexes, SQL stored procedures, etc.).

Key Points and Best Practices

This test demonstrates virtualization technologies concurrently running multiple Oracle multi-tier, business-critical applications and databases on four SPARC M7 processors contained in a single SPARC M7-8 server, supporting thousands of users executing a high volume of complex transactions with a constrained (<1 sec) weighted average response time.

The Oracle E-Business LDom is further configured using Oracle Solaris Zones.

This result of 20,000 users was achieved by load balancing the Oracle E-Business Suite Applications 12.1.3 five online workloads across two Oracle Solaris processor sets and redirecting all network interrupts to a dedicated third processor set.

Each application processor set (set-1 and set-2) was concurrently running two Oracle E-Business Suite Application servers and two database server instances, each within its own Oracle Solaris Zone (4 x Zones per set).

Each application server network interface (to a client) was configured to map to the locality group associated with the CPUs processing the related workload, to guarantee memory locality of network structures and application server hardware resources.

All external storage was connected with at least two paths to the host's multipath-capable Fibre Channel controller ports, and the Oracle Solaris I/O multipathing feature was enabled.

See Also

Disclosure Statement

Oracle E-Business Suite R12 extra-large multiple-online module benchmark, SPARC M7-8, SPARC M7, 4.13 GHz, 4 chips, 128 cores, 1024 threads, 2 TB memory, 20,000 online users, average response time 0.70 sec, 90th percentile response time 1.13 sec, Oracle Solaris 11.3, Oracle Solaris Zones, Oracle VM Server for SPARC, Oracle E-Business Suite 12.1.3, Oracle Database 11g Release 2, Results as of 10/25/2015.

Tuesday Mar 26, 2013

SPARC T5-2 Achieves ZFS File System Encryption Benchmark World Record

Oracle continues to lead in enterprise security. Oracle's SPARC T5 processors combined with the Oracle Solaris ZFS file system demonstrate faster file system encryption than equivalent x86 systems using the Intel Xeon Processor E5-2600 Sequence chips which have AES-NI security instructions.

Encryption is the process where data is encoded for privacy and a key is needed by the data owner to access the encoded data.

  • The SPARC T5-2 server is 3.4x faster than a 2 processor Intel Xeon E5-2690 server running Oracle Solaris 11.1 that uses the AES-NI GCM security instructions for creating encrypted files.

  • The SPARC T5-2 server is 2.2x faster than a 2 processor Intel Xeon E5-2690 server running Oracle Solaris 11.1 that uses the AES-NI CCM security instructions for creating encrypted files.

  • The SPARC T5-2 server consumes a significantly smaller percentage of system resources compared to a 2 processor Intel Xeon E5-2690 server.

Performance Landscape

Below are results running two different ciphers for ZFS encryption. Results are presented for runs without any cipher, labeled clear, and a variety of different key lengths. The results represent the maximum delivered values measured for 3 concurrent sequential write operations using 1M blocks. Performance is measured in MB/sec (bigger is better). System utilization is reported as %CPU as measured by iostat (smaller is better).

The results for the x86 server were obtained using Oracle Solaris 11.1 with performance bug fixes.

Encryption Using AES-GCM Ciphers

System                          GCM Encryption: 3 Concurrent Sequential Writes
                                Clear           AES-256-GCM     AES-192-GCM     AES-128-GCM
                                MB/sec   %CPU   MB/sec   %CPU   MB/sec   %CPU   MB/sec   %CPU
SPARC T5-2 server               3,918    7      3,653    14     3,676    15     3,628    14
SPARC T4-2 server               2,912    11     2,662    31     2,663    30     2,779    31
2-Socket Intel Xeon E5-2690     3,969    42     1,062    58     1,067    58     1,076    57
SPARC T5-2 vs x86 server        1.0x            3.4x            3.4x            3.4x

Encryption Using AES-CCM Ciphers

System                          CCM Encryption: 3 Concurrent Sequential Writes
                                Clear           AES-256-CCM     AES-192-CCM     AES-128-CCM
                                MB/sec   %CPU   MB/sec   %CPU   MB/sec   %CPU   MB/sec   %CPU
SPARC T5-2 server               3,862    7      3,665    15     3,622    14     3,707    12
SPARC T4-2 server               2,945    11     2,471    26     2,801    26     2,442    25
2-Socket Intel Xeon E5-2690     3,868    42     1,566    64     1,632    63     1,689    66
SPARC T5-2 vs x86 server        1.0x            2.3x            2.2x            2.2x

Configuration Summary

Storage Configuration:

Sun Storage 6780 array
4 CSM2 trays, each with 16 83GB 15K RPM drives
8 x 8 Gb/sec Fibre Channel ports per host
R0 Write cache enabled, controller mirroring off for peak write bandwidth
8 Drive R0 512K stripe pools mirrored via ZFS to storage

Sun Storage 6580 array
9 CSM2 trays, each with 16 136GB 15K RPM drives
8 x 4 Gb/sec Fibre Channel ports per host
R0 Write cache enabled, controller mirroring off for peak write bandwidth
4 Drive R0 512K stripe pools mirrored via ZFS to storage

Server Configuration:

SPARC T5-2 server
2 x SPARC T5 3.6 GHz processors
512 GB memory
Oracle Solaris 11.1

SPARC T4-2 server
2 x SPARC T4 2.85 GHz processors
256 GB memory
Oracle Solaris 11.1

Sun Server X3-2L server
2 x Intel Xeon E5-2690, 2.90 GHz processors
128 GB memory
Oracle Solaris 11.1

Switch Configuration:

Brocade 5300 FC switch

Benchmark Description

This benchmark evaluates secure file system performance by measuring the rate at which encrypted data can be written. The Vdbench tool was used to generate the IO load. The test performed 3 concurrent sequential write operations using 1M blocks to 3 separate files.

Key Points and Best Practices

  • ZFS encryption is integrated with the ZFS command set. Like other ZFS operations, encryption operations such as key changes and re-key are performed online.

  • Data is encrypted using AES (Advanced Encryption Standard) with key lengths of 256, 192, and 128 in the CCM and GCM operation modes.

  • The flexibility of encrypting specific file systems is a key feature.

  • ZFS encryption is inheritable to descendent file systems. Key management can be delegated through ZFS delegated administration.

  • ZFS encryption uses the Oracle Solaris Cryptographic Framework which gives it access to SPARC T5 and Intel Xeon E5-2690 processor hardware acceleration or to optimized software implementations of the encryption algorithms automatically.

  • On modern computers with multiple threads per core, simple statistics like %utilization measured in tools like iostat and vmstat are not "hard" indications of the resources that might be available for other processing. For example, 90% idle may not mean that 10 times the work can be done. So drawing numerical conclusions must be done carefully.

See Also

Disclosure Statement

Copyright 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of March 26, 2013.

Tuesday Oct 02, 2012

Performance of Oracle Business Intelligence Benchmark on SPARC T4-4

Oracle's SPARC T4-4 server configured with four SPARC T4 3.0 GHz processors delivered 25,000 concurrent users on Oracle Business Intelligence Enterprise Edition (BI EE) 11g benchmark using Oracle Database 11g Release 2 running on Oracle Solaris 10.

  • A SPARC T4-4 server running Oracle Business Intelligence Enterprise Edition 11g achieved 25,000 concurrent users with an average response time of 0.36 seconds with Oracle BI server cache set to ON.

  • The benchmark data clearly shows that the underlying SPARC T4-4 server hardware and the Oracle BI EE 11g (11.1.1.6.0 64-bit) platform scale within a single system, supporting 25,000 concurrent users while executing 415 transactions/sec.

  • The benchmark demonstrated the scalability of Oracle Business Intelligence Enterprise Edition 11g 11.1.1.6.0, which was deployed in a vertical scale-out fashion on a single SPARC T4-4 server.

  • Oracle Internet Directory configured on SPARC T4 server provided authentication for the 25,000 Oracle BI EE users with sub-second response time.

  • A SPARC T4-4 with internal Solid State Drive (SSD) using the ZFS file system showed significant I/O performance improvement over traditional disk for the Web Catalog activity. In addition, ZFS helped get past the UFS limitation of 32767 sub-directories in a Web Catalog directory.

  • The multi-threaded 64-bit Oracle Business Intelligence Enterprise Edition 11g and SPARC T4-4 server proved to be a successful combination by providing sub-second response times for the end user transactions, consuming only half of the available CPU resources at 25,000 concurrent users, leaving plenty of head room for increased load.

  • The Oracle Business Intelligence on SPARC T4-4 server benchmark results demonstrate that comprehensive BI functionality built on a unified infrastructure with a unified business model yields best-in-class scalability, reliability and performance.

  • Oracle BI EE 11g is a newer version of Business Intelligence Suite with richer and superior functionality. Results produced with Oracle BI EE 11g benchmark are not comparable to results with Oracle BI EE 10g benchmark. Oracle BI EE 11g is a more difficult benchmark to run, exercising more features of Oracle BI.

Performance Landscape

Results for the Oracle BI EE 11g version of the benchmark. Results are not comparable to the Oracle BI EE 10g version of the benchmark.

Oracle BI EE 11g Benchmark
System                                    Number of Users   Response Time (sec)
1 x SPARC T4-4 (4 x SPARC T4 3.0 GHz)     25,000            0.36

Results for the Oracle BI EE 10g version of the benchmark. Results are not comparable to the Oracle BI EE 11g version of the benchmark.

Oracle BI EE 10g Benchmark
System                                     Number of Users
2 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz)    50,000
1 x SPARC T5440 (4 x SPARC T2+ 1.6 GHz)    28,000

Configuration Summary

Hardware Configuration:

SPARC T4-4 server
4 x SPARC T4 processors, 3.0 GHz
128 GB memory
4 x 300 GB internal SSD

Storage Configuration:

Sun ZFS Storage 7120
16 x 146 GB disks

Software Configuration:

Oracle Solaris 10 8/11
Oracle Solaris Studio 12.1
Oracle Business Intelligence Enterprise Edition 11g (11.1.1.6.0)
Oracle WebLogic Server 10.3.5
Oracle Internet Directory 11.1.1.6.0
Oracle Database 11g Release 2

Benchmark Description

Oracle Business Intelligence Enterprise Edition (Oracle BI EE) delivers a robust set of reporting, ad-hoc query and analysis, OLAP, dashboard, and scorecard functionality with a rich end-user experience that includes visualization, collaboration, and more.

The Oracle BI EE benchmark test used five different business user roles - Marketing Executive, Sales Representative, Sales Manager, Sales Vice-President, and Service Manager. These roles included a maximum of 5 different pre-built dashboards. Each dashboard page had an average of 5 reports in the form of a mix of charts, tables and pivot tables, returning anywhere from 50 rows to approximately 500 rows of aggregated data. The test scenario also included drill-down into multiple levels from a table or chart within a dashboard.

The benchmark test scenario uses a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards using Service Manager. The BI user selects the Service Effectiveness dashboard, which shows him four distinct reports, Service Request Trend, First Time Fix Rate, Activity Problem Areas, and Cost Per Completed Service Call spanning 2002 to 2005. The user then proceeds to view the Customer Satisfaction dashboard, which also contains a set of 4 related reports, drills down on some of the reports to see the detail data. The BI user continues to view more dashboards – Customer Satisfaction and Service Request Overview, for example. After navigating through those dashboards, the user logs out of the application. The benchmark test is executed against a full production version of the Oracle Business Intelligence 11g Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a real world customer scenario.

See Also

Disclosure Statement

Copyright 2012, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 30 September 2012.

Friday Sep 30, 2011

SPARC T4-2 Server Beats Intel (Westmere AES-NI) on ZFS Encryption Tests

Oracle continues to lead in enterprise security. Oracle's SPARC T4 processors combined with Oracle's Solaris ZFS file system demonstrate faster file system encryption than equivalent systems based on the Intel Xeon Processor 5600 Sequence chips which use AES-NI security instructions.

Encryption is the process where data is encoded for privacy and a key is needed by the data owner to access the encoded data. The benefits of using ZFS encryption are:

  • The SPARC T4 processor is 3.5x to 5.2x faster than the Intel Xeon Processor X5670, which has the AES-NI security instructions, when creating encrypted files.

  • ZFS encryption is integrated with the ZFS command set. Like other ZFS operations, encryption operations such as key changes and re-key are performed online.

  • Data is encrypted using AES (Advanced Encryption Standard) with key lengths of 256, 192, and 128 in the CCM and GCM operation modes.

  • The flexibility of encrypting specific file systems is a key feature.

  • ZFS encryption is inheritable to descendent file systems. Key management can be delegated through ZFS delegated administration.

  • ZFS encryption uses the Oracle Solaris Cryptographic Framework which gives it access to SPARC T4 processor and Intel Xeon X5670 processor (Intel AES-NI) hardware acceleration or to optimized software implementations of the encryption algorithms automatically.

Performance Landscape

Below are results running two different ciphers for ZFS encryption. Results are presented for runs without any cipher, labeled clear, and a variety of different key lengths.

Encryption Using AES-CCM Ciphers

MB/sec – 5 File Create*             Clear   AES-256-CCM   AES-192-CCM   AES-128-CCM
SPARC T4-2 server                   3,803   3,167         3,335         3,225
SPARC T3-2 server                   2,286   1,554         1,561         1,594
2-Socket 2.93 GHz Xeon X5670        3,325   750           764           773

Speedup T4-2 vs X5670               1.1x    4.2x          4.4x          4.2x
Speedup T4-2 vs T3-2                1.7x    2.0x          2.1x          2.0x

Encryption Using AES-GCM Ciphers

MB/sec – 5 File Create*             Clear   AES-256-GCM   AES-192-GCM   AES-128-GCM
SPARC T4-2 server                   3,618   3,929         3,164         2,613
SPARC T3-2 server                   2,278   1,451         1,455         1,449
2-Socket 2.93 GHz Xeon X5670        3,299   749           748           753

Speedup T4-2 vs X5670               1.1x    5.2x          4.2x          3.5x
Speedup T4-2 vs T3-2                1.6x    2.7x          2.2x          1.8x

(*) Maximum Delivered values measured over 5 concurrent mkfile operations.

Configuration Summary

Storage Configuration:

Sun Storage 6780 array
16 x 15K RPM drives
Raid 0 pool
Write back cache enabled
Controller cache mirroring disabled for maximum bandwidth for test
Eight 8 Gb/sec ports per host

Server Configuration:

SPARC T4-2 server
2 x SPARC T4 2.85 GHz processors
256 GB memory
Oracle Solaris 11

SPARC T3-2 server
2 x SPARC T3 1.6 GHz processors
Oracle Solaris 11 Express 2010.11

Sun Fire X4270 M2 server
2 x Intel Xeon X5670, 2.93 GHz processors
Oracle Solaris 11

Benchmark Description

The benchmark ran the UNIX command mkfile(1M). mkfile is a simple, single-threaded program that creates a file of a specified size. The test script ran 5 mkfile operations in the background and recorded the peak bandwidth observed during the test.
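
The original test script is not shown in this post. As a rough illustration only, the following Python sketch mimics the same idea: 5 concurrent sequential file creates, with the aggregate write bandwidth sampled once per second and the peak reported. The file size, block size, and target directory are placeholders, not the values used in the benchmark.

    import os
    import time
    from multiprocessing import Process, Value

    # Placeholder parameters -- not the values used in the published benchmark.
    NUM_FILES = 5                      # mirrors the 5 concurrent mkfile operations
    FILE_SIZE = 2 * 1024**3            # 2 GiB per file (placeholder)
    BLOCK_SIZE = 1024 * 1024           # sequential 1 MiB writes
    TARGET_DIR = "/pool/encrypted_fs"  # hypothetical ZFS file system under test

    def create_file(path, size, counter):
        """Sequentially write `size` bytes of zeros to `path`, tracking progress."""
        block = b"\0" * BLOCK_SIZE
        written = 0
        with open(path, "wb") as f:
            while written < size:
                f.write(block)
                written += BLOCK_SIZE
                with counter.get_lock():
                    counter.value = written
            f.flush()
            os.fsync(f.fileno())

    if __name__ == "__main__":
        counters = [Value("q", 0) for _ in range(NUM_FILES)]
        workers = [Process(target=create_file,
                           args=(os.path.join(TARGET_DIR, f"file{i}"), FILE_SIZE, counters[i]))
                   for i in range(NUM_FILES)]
        for w in workers:
            w.start()

        # Sample aggregate progress once per second and keep the peak rate.
        peak, last_total, last_time = 0.0, 0, time.time()
        while any(w.is_alive() for w in workers):
            time.sleep(1)
            now, total = time.time(), sum(c.value for c in counters)
            peak = max(peak, (total - last_total) / (now - last_time) / 1024**2)
            last_total, last_time = total, now

        for w in workers:
            w.join()
        print(f"Peak aggregate write bandwidth: {peak:.0f} MiB/sec")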

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of December 16, 2011.

Monday Sep 19, 2011

Halliburton ProMAX® Seismic Processing on Sun Blade X6270 M2 with Sun ZFS Storage 7320

Halliburton/Landmark's ProMAX® 3D Pre-Stack Kirchhoff Time Migration's (PSTM) single workflow scalability and multiple workflow throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Blade X6270 M2 server modules attached to Oracle's Sun ZFS Storage 7320 appliance.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX® workflows.

  • Multiple concurrent 24-process ProMAX® PSTM workflow throughput is constant; 10 workflows on 10 nodes finish as fast as 1 workflow on one compute node. Additionally, processing twice the data volume yields similar traces/second throughput performance.

  • A single ProMAX® PSTM workflow has good scaling from 1 to 10 nodes of a Sun Blade X6270 M2 cluster: it scales to 4.7X on 10 nodes with one input data set and to 6.3X with two consecutive input data sets (i.e. twice the data).

  • A single ProMAX® PSTM workflow has near linear scaling of 11x on a Sun Blade X6270 M2 server module when running from 1 to 12 processes.

  • The 12-thread ProMAX® workflow throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 6 concurrent workflows.

Performance Landscape

Multiple 24-Process Workflow Throughput Scaling

This test measures the system throughput scalability as concurrent 24-process workflows are added, one workflow per node. The per workflow throughput and the system scalability are reported.

Aggregate system throughput scales linearly. Ten concurrent workflows finish in the same time as does one workflow on a single compute node.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling


Single Workflow Scaling

This test measures single workflow scalability across a 10-node cluster. Utilizing a single data set, performance exhibits near linear scaling of 11x at 12 processes, and per-node scaling of 4x at 6 nodes; performance flattens quickly reaching a peak of 60x at 240 processors and per-node scaling of 4.7x with 10 nodes.

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set. Doubling the data set size minimizes time spent in workflow initialization, data input and output.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

This next test measures single workflow scalability across a 10-node cluster (as above) but limits scheduling to a maximum of 12 processes per node, effectively restricting the run to a maximum of one process per physical core. The speedups relative to a single process and to a single node are reported.

Utilizing a single data set, performance exhibits near linear scaling of 37x at 48 processes, and per-node scaling of 4.3x at 6 nodes. Performance of 55x at 120 processes and per-node scaling of 5x with 10 nodes is reached, and scalability trends higher more strongly than in the case of two processes running per physical core above. For equivalent total process counts, multi-node runs using only a single process per physical core appear to run between 28% and 64% more efficiently (at 96 and 24 processes, respectively). With a full complement of 10 nodes (120 processes), the peak performance is only 9.5% lower than with 2 processes per physical core (240 processes).

Running with two consecutive input data sets in the workflow, scaling is considerably improved with peak scaling ~35% higher than obtained using a single data set.

Halliburton ProMAX® Pre-Stack Time Migration - Single Workflow Scaling

Multiple 12-Process Workflow Throughput Scaling, Compact vs. Distributed Scheduling

The fourth test compares compact and distributed scheduling of 1, 2, 4, and 6 concurrent 12-processor workflows.

All things being equal, the system bisection bandwidth should improve with distributed scheduling of a fixed-size workflow; as more nodes are used for a workflow, more memory and system cache are employed, and any node memory bandwidth bottlenecks can be offset by distributing communication across the network (provided the network and inter-node communication stack do not become a bottleneck). When physical cores are not over-subscribed, compact and distributed scheduling performance is within 3%, suggesting that there may be little memory contention for this workflow on the benchmarked system configuration.

With compact scheduling of two concurrent 12-processor workflows, the physical cores become over-subscribed and performance degrades 36% per workflow. With four concurrent workflows, physical cores are over-subscribed 4x and performance degrades 66% per workflow. With six concurrent workflows, over-subscribed compact scheduling performance degrades 77% per workflow. As multiple 12-processor workflows become more and more distributed, the performance approaches the non-over-subscribed case.

Halliburton ProMAX® Pre-Stack Time Migration - Multiple Workflow Scaling (141,616 traces x 624 samples)


Test Notes

All tests were performed with one input data set (70808 traces x 624 samples) and two consecutive input data sets (2 * (70808 traces x 624 samples)) in the workflow. All results reported are the average of at least 3 runs and performance is based on reported total wall-clock time by the application.

All tests were run with NFS attached Sun ZFS Storage 7320 appliance and then with NFS attached legacy Sun Fire X4500 server. The StorageTek Workload Analysis Tool (SWAT) was invoked to measure the I/O characteristics of the NFS attached storage used on separate runs of all workflows.

Configuration Summary

Hardware Configuration:

10 x Sun Blade X6270 M2 server modules, each with
2 x 3.33 GHz Intel Xeon X5680 processors
48 GB DDR3-1333 memory
4 x 146 GB, Internal 10000 RPM SAS-2 HDD
10 GbE
Hyper-Threading enabled

Sun ZFS Storage 7320 Appliance
1 x Storage Controller
2 x 2.4 GHz Intel Xeon 5620 processors
48 GB memory (12 x 4 GB DDR3-1333)
2 TB Read Cache (4 x 512 GB Read Flash Accelerator)
10 GbE
1 x Disk Shelf
20.0 TB RAID-Z (20 x 1 TB SAS-2, 7200 RPM HDD)
4 x Write Flash Accelerators

Sun Fire X4500
2 x 2.8 GHz AMD Opteron 290 processors
16 GB DDR1-400 memory
34.5 TB RAID-Z (46 x 750 GB SATA-II, 7200 RPM HDD)
10 GbE

Software Configuration:

Oracle Linux 5.5
Parallel Virtual Machine 3.3.11 (bundled with ProMAX)
Intel 11.1.038 Compilers
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX® family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX® is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX® is integrated with Halliburton's OpenWorks® Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single workflow scalability and multiple workflow throughput of the ProMAX® 3D Prestack Kirchhoff Time Migration (PSTM) while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Benchmarks were performed with both one and two consecutive input data sets.

Each workflow consisted of:

  • reading the previously constructed MPEG encoded processing parameter file
  • reading the compressed seismic data traces from disk
  • performing the PSTM imaging
  • writing the result to disk

Workflows using two input data sets were constructed by simply adding a second identical seismic data read task immediately after the first in the processing parameter file. This effectively doubled the data volume read, processed, and written.

This version of ProMAX® currently only uses Parallel Virtual Machine (PVM) as the parallel processing paradigm. The PVM software only used TCP networking and has no internal facility for assigning memory affinity and processor binding. Every compute node is running a PVM daemon.

The ProMAX® processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Primary PSTM business metrics are typically time-to-solution and accuracy of the subsurface imaging solution.

Key Points and Best Practices

  • Multiple job system throughput scales perfectly; ten concurrent workflows on 10 nodes each completes in the same time and has the same throughput as a single workflow running on one node.
  • Best single workflow scaling is 6.6x using 10 nodes.

    When tasked with processing several similar workflows, while individual time-to-solution will be longer, the most efficient way to run is to fully distribute them one workflow per node (or even across two nodes) and run them concurrently, rather than to use all nodes for each workflow and run them consecutively. For example, while the best-case configuration used here runs 6.6 times faster using all ten nodes compared to a single node, ten such 10-node jobs running consecutively will overall take over 50% longer to complete (10/6.6 ≈ 1.5 times as long) than ten jobs running concurrently, one per node.

  • Throughput was seen to scale better with larger workflows. While throughput with both large and small workflows are similar with only one node, the larger dataset exhibits 11% and 35% more throughput with four and 10 nodes respectively.

  • 200 processes appears to be a scalability asymptote with these workflows on the systems used.
  • Hyperthreading marginally helps throughput. For the largest model run on 10 nodes, 240 processes delivers 11% more performance than with 120 processes.

  • The workflows do not exhibit significant I/O bandwidth demands. Even with 10 concurrent 24-process jobs, the measured aggregate system I/O did not exceed 100 MB/s.

  • 10 GbE was the only network used and, though shared for all interprocess communication and network attached storage, it appears to have sufficient bandwidth for all test cases run.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX®, GeoProbe®, OpenWorks®. Results as of 9/1/2011.

Tuesday Oct 26, 2010

3D VTI Reverse Time Migration Scalability On Sun Fire X2270-M2 Cluster with Sun Storage 7210

This Oil & Gas benchmark shows the Sun Storage 7210 system delivers almost 2 GB/sec bandwidth and realizes near-linear scaling performance on a cluster of 16 Sun Fire X2270 M2 servers.

Oracle's Sun Storage 7210 system attached via QDR InfiniBand to a cluster of sixteen of Oracle's Sun Fire X2270 M2 servers was used to demonstrate the performance of a Reverse Time Migration application, an important application in the Oil & Gas industry. The total application throughput and computational kernel scaling are presented for two production-sized grids, each imaging 800 samples.

  • Both the Reverse Time Migration I/O and the combined computation show near-linear scaling from 8 to 16 nodes on the Sun Storage 7210 system connected via QDR InfiniBand to a Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 2.0x improvement
      2486 x 1151 x 1231: 1.7x improvement
  • The computational kernel of the Reverse Time Migration has linear to super-linear scaling from 8 to 16 nodes in Oracle's Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231 : 2.2x improvement
      2486 x 1151 x 1231 : 2.0x improvement
  • Intel Hyper-Threading provides additional performance benefits to both the Reverse Time Migration I/O and computation when going from 12 to 24 OpenMP threads on the Sun Fire X2270 M2 server cluster:

      1243 x 1151 x 1231: 8% - computational kernel; 2% - total application throughput
      2486 x 1151 x 1231: 12% - computational kernel; 6% - total application throughput
  • The Sun Storage 7210 system delivers the Velocity, Epsilon, and Delta data to the Reverse Time Migration at a steady rate even when timing includes memory initialization and data object creation:

      1243 x 1151 x 1231: 1.4 to 1.6 GBytes/sec
      2486 x 1151 x 1231: 1.2 to 1.3 GBytes/sec

    One can see that when doubling the size of the problem, the additional complexity of overlapping I/O and multiple node file contention only produces a small reduction in read performance.

Performance Landscape

Application Scaling

Performance and scaling results of the total application, including I/O, for the reverse time migration demonstration application are presented. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Application Scaling Across Multiple Nodes
                Grid Size: 1243 x 1151 x 1231                      Grid Size: 2486 x 1151 x 1231
Number   Total Time   Kernel Time   Total     Kernel      Total Time   Kernel Time   Total     Kernel
Nodes    (sec)        (sec)         Speedup   Speedup     (sec)        (sec)         Speedup   Speedup
16       504          259           2.0       2.2*        1024         551           1.7       2.0
14       565          279           1.8       2.0         1191         677           1.5       1.6
12       662          343           1.6       1.6         1426         817           1.2       1.4
10       784          394           1.3       1.4         1501         856           1.2       1.3
8        1024         560           1.0       1.0         1745         1108          1.0       1.0

* Super-linear scaling due to the compute kernel fitting better into available cache

Application Scaling – Hyper-Threading Study

The effects of hyperthreading are presented when running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server.

Hyper-Threading Comparison – 12 versus 24 OpenMP Threads
                      Grid Size: 1243 x 1151 x 1231                       Grid Size: 2486 x 1151 x 1231
Number   Threads     Total Time   Kernel Time   Total HT   Kernel HT     Total Time   Kernel Time   Total HT   Kernel HT
Nodes    per Node    (sec)        (sec)         Speedup    Speedup       (sec)        (sec)         Speedup    Speedup
16       24          504          259           1.02       1.08          1024         551           1.06       1.12
16       12          515          279           1.00       1.00          1088         616           1.00       1.00

Read Performance

Read performance is presented for the velocity, epsilon and delta files running the reverse time migration demonstration application. Results were obtained using a Sun Fire X2270 M2 server cluster with a Sun Storage 7210 system for the file server. The servers were running with hyperthreading enabled, allowing for 24 OpenMP threads per server.

Velocity, Epsilon, and Delta File Read and Memory Initialization Performance
                        Grid Size: 1243 x 1151 x 1231                            Grid Size: 2486 x 1151 x 1231
Number   Overlap        Time    Time Relative   Total GBytes   Read Rate         Time    Time Relative   Total GBytes   Read Rate
Nodes    MBytes Read    (sec)   to 8 Nodes      Read           (GB/sec)          (sec)   to 8 Nodes      Read           (GB/sec)
16       2040           16.7    1.1             23.2           1.4               36.8    1.1             44.3           1.2
8        951            14.8    1.0             22.1           1.6               33.0    1.0             43.2           1.3

Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 servers, each with
2 x 2.93 GHz Intel Xeon X5670 processors
48 GB memory (12 x 4 GB at 1333 MHz)

Sun Storage 7210 system connected via QDR InfiniBand
2 x 18 GB SATA SSD (logzilla)
40 x 1 TB 7200 RPM SATA disks

Software Configuration:

SUSE Linux Enterprise Server SLES 10 SP 2
Oracle Message Passing Toolkit 8.2.1 (for MPI)
Sun Studio 12 Update 1 C++, Fortran, OpenMP

Benchmark Description

This Reverse Time Migration (RTM) demonstration application measures the total time it takes to image 800 samples of various production size grids and write the final image to disk. In this version, each node reads in only the trace, velocity, and conditioning data to be processed by that node plus a four element inline 3-D array pad (spatial order of eight) shared with its neighbors to the left and right during the initialization phase. It represents a full RTM application including the data input, computation, communication, and final output image to be used by the next work flow step involving 3D volumetric seismic interpretation.

Key Points and Best Practices

This demonstration application represents a full Reverse Time Migration solution. Many references to the RTM application tend to focus on the compute kernel and ignore the complexity that the input, communication, and output bring to the task.

I/O Characterization without Optimal Checkpointing

Velocity, Epsilon, and Delta Files - Grid Reading

The additional amount of overlapping reads to share velocity, epsilon, and delta edge data with neighbors can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (z_dimension) x (4 bytes) x (3 files)

For this particular benchmark study, the additional 3-D pad overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 1231 x 4 x 3 = 2.04 GB extra
    8 nodes: 7 x 8 x 1151 x 1231 x 4 x 3 = 0.95 GB extra

For the first of the two test cases, the total size of the three files used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 1231 x 4 bytes = 7.05 GB per file x 3 files = 21.13 GB

With the additional 3-D pad, the total amount of data read is:

    16 nodes: 2.04 GB + 21.13 GB = 23.2 GB
    8 nodes: 0.95 GB + 21.13 GB = 22.1 GB

For the second of the two test cases, the total size of the three files used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 1231 x 4 bytes = 14.09 GB per file x 3 files = 42.27 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: 2.04 GB + 42.27 GB = 44.3 GB
    8 nodes: 0.95 GB + 42.27 GB = 43.2 GB

Note that the amount of overlapping data read increases not only with the number of nodes, but also as the y dimension and/or the z dimension increases.
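
These values can be recomputed directly from the formula above. A minimal Python sketch (all constants are taken from this post; nothing else is assumed):

    # Recompute the velocity/epsilon/delta read volumes from the formula above.
    BYTES_PER_SAMPLE = 4
    NUM_FILES = 3            # velocity, epsilon, delta
    ORDER_IN_SPACE = 8

    def grid_read_volume_gb(nodes, nx, ny, nz):
        """Return (overlap GB, total GB read) for the three grid files."""
        overlap = (nodes - 1) * ORDER_IN_SPACE * ny * nz * BYTES_PER_SAMPLE * NUM_FILES
        base = nx * ny * nz * BYTES_PER_SAMPLE * NUM_FILES
        return overlap / 1e9, (overlap + base) / 1e9

    for nx in (1243, 2486):
        for nodes in (16, 8):
            extra, total = grid_read_volume_gb(nodes, nx, 1151, 1231)
            print(f"{nx} x 1151 x 1231, {nodes:2d} nodes: "
                  f"{extra:.2f} GB overlap, {total:.1f} GB total read")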

Trace Reading

The additional amount of overlapping reads to share trace edge data with neighbors for can be calculated using the following equation:

    (number_nodes - 1) x (order_in_space) x (y_dimension) x (4 bytes) x (number_of_time_slices)

For this particular benchmark study, the additional overlap for the 16 and 8 node cases is:

    16 nodes: 15 x 8 x 1151 x 4 x 800 = 442MB extra
    8 nodes: 7 x 8 x 1151 x 4 x 800 = 206MB extra

For the first case the size of the trace data file used for the 1243 x 1151 x 1231 case is

    1243 x 1151 x 4 bytes x 800 = 4.578 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 4.578 GB = 5.0 GB
    8 nodes: .206 GB + 4.578 GB = 4.8 GB

For the second case the size of the trace data file used for the 2486 x 1151 x 1231 case is

    2486 x 1151 x 4 bytes x 800 = 9.156 GB

With the additional pad based on the number of nodes, the total amount of data read is:

    16 nodes: .442 GB + 9.156 GB = 9.6 GB
    8 nodes: .206 GB + 9.156 GB = 9.4 GB

As the number of nodes is increased, the overlap causes more disk lock contention.

Writing Final Output Image

1243x1151x1231 - 7.1 GB per file:

    16 nodes: 78 x 1151 x 1231 x 4 = 442MB/node (7.1 GB total)
    8 nodes: 156 x 1151 x 1231 x 4 = 884MB/node (7.1 GB total)

2486x1151x1231 - 14.1 GB per file:

    16 nodes: 156 x 1151 x 1231 x 4 = 930 MB/node (14.1 GB total)
    8 nodes: 311 x 1151 x 1231 x 4 = 1808 MB/node (14.1 GB total)

Resource Allocation

It is best to allocate one node as the Oracle Grid Engine resource scheduler and MPI master host. This is especially true when running with 24 OpenMP threads in hyperthreading mode to avoid oversubscribing a node that is cooperating in delivering the solution.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 10/20/2010.

Wednesday Sep 22, 2010

Oracle Solaris 10 9/10 ZFS OLTP Performance Improvements

Oracle Solaris ZFS has seen significant performance improvements in the Oracle Solaris 10 9/10 release compared to the previous release, Oracle Solaris 10 10/09.
  • A 28% reduction in response time, holding the load constant, in an OLTP workload test comparing the Oracle Solaris 10 9/10 release to Oracle Solaris 10 10/09.
  • A 19% increase in IOPS throughput, holding the response time constant at 28 msec, in the same OLTP workload test.
  • OLTP workload throughput rates of at least 800 IOPS, using Oracle's Sun SPARC Enterprise T5240 server and Oracle's StorageTek 2540 array, were used in calculating the above improvement percentages (the arithmetic is sketched below).
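
The improvement percentages follow directly from the table in the Performance Landscape section. A minimal Python sketch of the arithmetic (pairing the 800 IOPS rows for the constant-load comparison and the ~28 msec points for the constant-response-time comparison is my reading of the data, not something spelled out in the original post):

    # Response time (msec) by load, from the table below: IOPS -> (Solaris 10 9/10, Solaris 10 10/09)
    resp = {
        100: (5.1, 8.3),
        500: (11.7, 24.6),
        800: (20.1, 28.1),
        900: (23.9, 32.0),
        950: (28.8, 34.4),
    }

    # ~28% lower response time at a constant load of 800 IOPS
    new, old = resp[800]
    print(f"Response time reduction at 800 IOPS: {(old - new) / old:.0%}")

    # ~19% more IOPS at a roughly constant response time of about 28 msec:
    # Solaris 10 9/10 sustains 950 IOPS at 28.8 msec vs. 800 IOPS at 28.1 msec on 10/09.
    print(f"Throughput increase at ~28 msec: {950 / 800 - 1:.1%}")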

Performance Landscape

8K Block Random Read/Write OLTP-Style Test
         Response Time (msec)
IOPS     Oracle Solaris 10 9/10   Oracle Solaris 10 10/09
100      5.1                      8.3
500      11.7                     24.6
800      20.1                     28.1
900      23.9                     32.0
950      28.8                     34.4

Results and Configuration Summary

Storage Configuration:

1 x StorageTek 2540 Array
12 x 73 GB 15K RPM HDDs
2 RAID5 5+1 volumes
1 RAID0 host stripe across the volumes

Server Configuration:

1 x Sun SPARC Enterprise T5240 server with
8 GB memory
2 x 1.6 GHz UltraSPARC T2 Plus processors

Software Configuration:

Oracle Solaris 10 10/09
Oracle Solaris 10 9/10
ZFS
SVM

Benchmark Description

The benchmark is an IOPS test consisting of a mixture of random 8K block reads and writes accessing a significant portion of the available storage. As such, the workload is not very "cache friendly" and, hence, illustrates the capability of the system to more fully utilize the processing capability of the back-end storage.

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Sep 21, 2010

ProMAX Performance and Throughput on Sun Fire X2270 and Sun Storage 7410

Halliburton/Landmark's ProMAX 3D Prestack Kirchhoff Time Migration's single job scalability and multiple job throughput using various scheduling methods are evaluated on a cluster of Oracle's Sun Fire X2270 servers attached via QDR InfiniBand to Oracle's Sun Storage 7410 system.

Two resource scheduling methods, compact and distributed, are compared while increasing the system load with additional concurrent ProMAX jobs.

  • A single ProMAX job has near linear scaling of 5.5x on 6 nodes of a Sun Fire X2270 cluster.

  • A single ProMAX job has near linear scaling of 7.5x on a Sun Fire X2270 server when running from 1 to 8 threads.

  • ProMAX can take advantage of Oracle's Sun Storage 7410 system in place of dedicated local disks: there was no significant difference in run time observed when running up to 8 concurrent 16-thread jobs.

  • The 8-thread ProMAX job throughput using the distributed scheduling method is equivalent or slightly faster than the compact scheme for 1 to 4 concurrent jobs.

  • The 16-thread ProMAX job throughput using the distributed scheduling method is up to 8% faster when compared to the compact scheme on an 8-node Sun Fire X2270 cluster.

The multiple-job throughput characteristics revealed in this benchmark study are key to pre-configuring Oracle Grid Engine resource scheduling for ProMAX on a Sun Fire X2270 cluster and provide valuable insight for server consolidation.

Performance Landscape

Single Job Scaling

Single-job performance on a single node is near linear up to the number of cores in the node, i.e. 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and reducing the incremental speedups.
ProMAX single-job performance on the 6-node cluster shows near-linear speedup from node to node.
Single Job 6-Node Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Threads Per Node Speedup to 1 Thread Speedup to 1 Node
6 16 54.2 5.5
4 16 36.2 3.6
3 16 26.1 2.6
2 16 17.6 1.8
1 16 10.0 1.0
1 14 9.2
1 12 8.6
1 10 7.2*
1 8 7.5
1 6 5.9
1 4 3.9
1 3 3.0
1 2 2.0
1 1 1.0

* 2 threads contend with two master node daemons

Multiple Job Throughput Scaling, Compact Scheduling

With the Sun Storage 7410 system, performance of 8 concurrent jobs on the cluster using compact scheduling is equivalent to a single job.

Multiple Job Throughput Scalability
Hyperthreading Enabled - 16 Threads/Node Maximum
Number of Nodes Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Percent Cluster Used
1 1 16 1.00 1 13
2 1 16 1.00 2 25
4 1 16 1.00 4 50
8 1 16 1.00 8 100

Multiple 8-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

These results compare various levels of distributed-method resource scheduling against 1, 2, and 4 concurrent-job compact-method baselines.

Multiple 8-Thread Job Scheduling
HyperThreading Enabled - Use 8 Threads/Node Maximum
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 8 Threads Used
1 1 8 1.00 1 8 100
1 4 2 1.01 4 2 25
1 8 1 1.01 8 1 13

2 1 8 1.00 2 8 100
2 4 2 1.01 4 4 50
2 8 1 1.01 8 2 25

4 1 8 1.00 4 8 100
4 4 2 1.00 4 8 100
4 8 1 1.01 8 4 100

Multiple 16-Thread Job Throughput Scaling, Compact vs. Distributed Scheduling

The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

Multiple 16-Thread Job Scheduling
HyperThreading Enabled - 16 Threads/Node Available
Number of Jobs Number of Nodes per Job Threads Per Node per Job Performance Relative to 1 Job Total Nodes Total Threads per Node Used Percent of PVM Master 16 Threads Used
1 1 16 0.66 1 16 100*
1 2 8 1.00 2 8 50
1 4 4 1.03 4 4 25
1 8 2 1.06 8 2 13

2 1 16 0.70 2 16 100*
2 2 8 1.00 4 8 50
2 4 4 1.07 8 4 25
2 8 2 1.08 8 4 25

4 1 16 0.74 4 16 100*
4 4 4 0.74 4 16 100*
4 2 8 1.00 8 8 50
4 4 4 1.05 8 8 50
4 8 2 1.04 8 8 50

8 1 16 1.00 8 16 100*
8 4 4 1.00 8 16 100*
8 8 2 1.00 8 16 100*

* master PVM host; running 20 to 21 total threads (over-subscribed)

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
48 GB memory at 1333 MHz
1 x 500 GB SATA
Sun Storage 7410 system
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives = 466 GB
2 Internal 93 GB read optimized SSD = 186 GB
1 External Sun Storage J4400 array with 22 1TB SATA drives and 2 18GB write optimized SSD
11 TB mirrored data and mirrored write optimized SSD

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Parallel Virtual Machine 3.3.11
Oracle Grid Engine
Intel 11.1 Compilers
OpenWorks Database requires Oracle 10g Enterprise Edition
Libraries: pthreads 2.4, Java 1.6.0_01, BLAS, Stanford Exploration Project Libraries

Benchmark Description

The ProMAX family of seismic data processing tools is the most widely used Oil and Gas Industry seismic processing application. ProMAX is used for multiple applications, from field processing and quality control, to interpretive project-oriented reprocessing at oil companies and production processing at service companies. ProMAX is integrated with Halliburton's OpenWorks Geoscience Oracle Database to index prestack seismic data and populate the database with processed seismic.

This benchmark evaluates single job scalability and multiple job throughput of the ProMAX 3D Prestack Kirchhoff Time Migration while processing the Halliburton benchmark data set containing 70,808 traces with 8 msec sample interval and trace length of 4992 msec. Alternative thread scheduling methods are compared for optimizing single and multiple job throughput. The compact scheme schedules the threads of a single job on as few nodes as possible, whereas the distributed scheme schedules the threads across as many nodes as possible. The effects of load on the Sun Storage 7410 system are measured. This information provides valuable insight into determining the Oracle Grid Engine resource management policies.

Hyperthreading is enabled for all of the tests. It should be noted that every node is running a PVM daemon and ProMAX license server daemon. On the master PVM daemon node, there are three additional ProMAX daemons running.

The first test measures single job scalability across a 6-node cluster with an additional node serving as the master PVM host. The speedups relative to a single node and to a single thread are reported.

The second test measures multiple job scalability running 1 to 8 concurrent 16-thread jobs using the Sun Storage 7410 system. The performance is reported relative to a single job.

The third test compares 8-thread multiple job throughput using different job scheduling methods on a cluster. The compact method involves putting all 8 threads for a job on the same node. The distributed method involves spreading the 8 threads of a job across multiple nodes. The results compare various levels of distributed-method resource scheduling against 1, 2, and 4 concurrent-job compact-method baselines.

The fourth test is similar to the second test except running 16-thread ProMAX jobs. The results are reported relative to the performance of 1, 2, 4, and 8 concurrent 2-node, 8-thread jobs.

The ProMAX processing parameters used for this benchmark:

Minimum output inline = 65
Maximum output inline = 85
Inline output sampling interval = 1
Minimum output xline = 1
Maximum output xline = 200 (fold)
Xline output sampling interval = 1
Antialias inline spacing = 15
Antialias xline spacing = 15
Stretch Mute Aperature Limit with Maximum Stretch = 15
Image Gather Type = Full Offset Image Traces
No Block Moveout
Number of Alias Bands = 10
3D Amplitude Phase Correction
No compression
Maximum Number of Cache Blocks = 500000

Key Points and Best Practices

  • The application was rebuilt with the Intel 11.1 Fortran and C++ compilers with these flags.

    -xSSE4.2 -O3 -ipo -no-prec-div -static -m64 -ftz -fast-transcendentals -fp-speculation=fast
  • There are additional execution threads associated with a ProMAX node. There are two threads that run on each node: the license server and PVM daemon. There are at least three additional daemon threads that run on the PVM master server: the ProMAX interface GUI, the ProMAX job execution - SuperExec, and the PVM console and control. It is best to allocate one node as the master PVM server to handle the additional 5+ threads. Otherwise, hyperthreading can be enabled and the master PVM host can support up to 8 ProMAX job threads.

  • When hyperthreading is enabled on one of the non-master PVM hosts, there is a 7% penalty going from 8 to 10 threads. However, 12 threads are 11 percent faster than 8. This can be attributed to the two additional support threads that run when hyperthreading is in use.

  • Single-job performance on a single node is near linear up to the number of cores in the node, i.e. 2 Intel Xeon X5570s with 4 cores each. With hyperthreading (2 active threads per core) enabled, more ProMAX threads are used, increasing the load on the CPU's memory architecture and reducing the incremental speedups.

    Users need to be aware of these performance differences and how they affect their production environment.

See Also

Disclosure Statement

The following are trademarks or registered trademarks of Halliburton/Landmark Graphics: ProMAX. Results as of 9/20/2010.

Monday Sep 20, 2010

Schlumberger's ECLIPSE 300 Performance Throughput On Sun Fire X2270 Cluster with Sun Storage 7410

Oracle's Sun Storage 7410 system, attached via QDR InfiniBand to a cluster of eight of Oracle's Sun Fire X2270 servers, was used to evaluate multiple job throughput of Schlumberger's Linux-64 ECLIPSE 300 compositional reservoir simulator processing their standard 2 Million Cell benchmark model with 8 rank parallelism (MM8 job).

  • The Sun Storage 7410 system showed little difference in performance (2%) compared to running the MM8 job with dedicated local disk.

  • When running 8 concurrent jobs on 8 different nodes, all against the Sun Storage 7410 system, the performance saw little degradation (5%) compared to a single MM8 job running on dedicated local disk.

Experiments were run changing how the cluster was utilized in scheduling jobs. Rather than running with the default compact mode, tests were run distributing the single job among the various nodes. Performance improvements were measured when changing from the default compact scheduling scheme (1 job to 1 node) to a distributed scheduling scheme (1 job to multiple nodes).

  • When running at 75% of the cluster capacity, distributed scheduling outperformed the compact scheduling by up to 34%. Even when running at 100% of the cluster capacity, the distributed scheduling is still slightly faster than compact scheduling.

  • When combining workloads, using the distributed scheduling allowed two MM8 jobs to finish 19% faster than the reference time and a concurrent PSTM workload to finish 2% faster.

The Oracle Solaris Studio Performance Analyzer and Sun Storage 7410 system analytics were used to identify a 3D Prestack Kirchhoff Time Migration (PSTM) as a potential candidate for consolidating with ECLIPSE. Both scheduling schemes are compared while running various job mixes of these two applications using the Sun Storage 7410 system for I/O.

These experiments showed a potential opportunity for consolidating applications using Oracle Grid Engine resource scheduling and Oracle Virtual Machine templates.

Performance Landscape

Results are presented below on a variety of experiments run using the 2009.2 ECLIPSE 300 2 Million Cell Performance Benchmark (MM8). The compute nodes are a cluster of Sun Fire X2270 servers connected with QDR InfiniBand. First, some definitions used in the tables below:

Local HDD: Each job runs on a single node to its dedicated direct attached storage.
NFSoIB: One node hosts its local disk for NFS mounting to other nodes over InfiniBand.
IB 7410: Sun Storage 7410 system over QDR InfiniBand.
Compact Scheduling: All 8 MM8 MPI processes run on a single node.
Distributed Scheduling: Allocate the 8 MM8 MPI processes across all available nodes.

First Test

The first test compares the performance of a single MM8 test on a single node using local storage to running a number of jobs across the cluster and showing the effect of different storage solutions.

Compact Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load Number of MM8 Jobs Local HDD Relative Throughput NFSoIB Relative Throughput IB 7410 Relative Throughput
13% 1 1.00 1.00\* 0.98
25% 2 0.98 0.97 0.98
50% 4 0.98 0.96 0.97
75% 6 0.98 0.95 0.95
100% 8 0.98 0.95 0.95

\* Performance measured on node hosting its local disk to other nodes in the cluster.

Second Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using the compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with the distributed schedule. The tests are run on an 8-node cluster, so each distributed job has only 1 MPI process per node.

Comparing Compact and Distributed Scheduling
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load  Number of MM8 Jobs  Compact Scheduling Relative Throughput  Distributed Scheduling\* Relative Throughput
13% 1 1.00 1.34
25% 2 1.00 1.32
50% 4 0.99 1.25
75% 6 0.97 1.10
100% 8 0.97 0.98

\* Each distributed job has 1 MPI process per node.

Third Test

This next test uses the Sun Storage 7410 system and compares the performance of running the MM8 job on 1 node using the compact scheduling to running multiple jobs with compact scheduling and to running multiple jobs with the distributed schedule. This test only uses 4 nodes, so each distributed job has two MPI processes per node.

Comparing Compact and Distributed Scheduling on 4 Nodes
Multiple Job Throughput Results Relative to Single Job
2009.2 ECLIPSE 300 MM8 2 Million Cell Performance Benchmark

Cluster Load  Number of MM8 Jobs  Compact Scheduling Relative Throughput  Distributed Scheduling\* Relative Throughput
25% 1 1.00 1.39
50% 2 1.00 1.28
100% 4 1.00 1.00

\* Each distributed job has two MPI processes per node.

Fourth Test

The last test involves running two different applications on the 4 node cluster. It compares the performance of running the cluster fully loaded and changing how the applications are run, either compact or distributed. The comparisons are made against the individual application running the compact strategy (as few nodes as possible). It shows that appropriately mixing jobs can give better job performance than running just one kind of application on a single cluster.

Multiple Job, Multiple Application Throughput Results
Comparing Scheduling Strategies
2009.2 ECLIPSE 300 MM8 2 Million Cell and 3D Kirchhoff Time Migration (PSTM)

Number of PSTM Jobs  Number of MM8 Jobs  ECLIPSE Compact\*  ECLIPSE Distributed\*\*  PSTM Distributed\*\*\*  PSTM Compact\*\*\*\*  Cluster Load
0  1  1.00  1.40  -  -  25%
0  2  1.00  1.27  -  -  50%
0  4  0.99  0.98  -  -  100%
1  2  -  1.19  1.02  -  100%
2  0  -  -  1.07  0.96  100%
1  0  -  -  1.08  1.00  50%

\* ECLIPSE Compact Scheduling: 1 node x 8 processes per job
\*\* ECLIPSE Distributed Scheduling: 4 nodes x 2 processes per job
\*\*\* PSTM Distributed Scheduling: 4 nodes x 4 processes per job
\*\*\*\* PSTM Compact Scheduling: 2 nodes x 8 processes per job

Results and Configuration Summary

Hardware Configuration:

8 x Sun Fire X2270 servers, each with
2 x 2.93 GHz Intel Xeon X5570 processors
24 GB memory (6 x 4 GB memory at 1333 MHz)
1 x 500 GB SATA
Sun Storage 7410 system, 24 TB total, QDR InfiniBand
4 x 2.3 GHz AMD Opteron 8356 processors
128 GB memory
2 Internal 233GB SAS drives (466 GB total)
2 Internal 93 GB read optimized SSD (186 GB total)
1 Sun Storage J4400 with 22 x 1 TB SATA drives and 2 x 18 GB write optimized SSDs
20 TB RAID-Z2 (double parity) data and 2-way striped write optimized SSD or
11 TB mirrored data and mirrored write optimized SSD
QDR InfiniBand Switch

Software Configuration:

SUSE Linux Enterprise Server 10 SP 2
Scali MPI Connect 5.6.6
GNU C 4.1.2 compiler
2009.2 ECLIPSE 300
ECLIPSE license daemon flexlm v11.3.0.0
3D Kirchhoff Time Migration

Benchmark Description

The benchmark is a home-grown study in resource usage options when running the Schlumberger ECLIPSE 300 Compositional reservoir simulator with 8 rank parallelism (MM8) to process Schlumberger's standard 2 Million Cell benchmark model. Schlumberger pre-built executables were used to process a 260x327x73 (2 Million Cell) sub-grid with 6,206,460 total grid cells and model 7 different compositional components within a reservoir. No source code modifications or executable rebuilds were conducted.

The ECLIPSE 300 MM8 job uses 8 MPI processes. It can run within a single node (compact) or across multiple nodes of a cluster (distributed). By using the MM8 job, it is possible to compare the performance of running each job on a separate node using local disk with that of using a shared network attached storage solution. The benchmark tests study the effect of increasing the number of MM8 jobs in a throughput model.

The first test compares the performance of running 1, 2, 4, 6 and 8 jobs on a cluster of 8 nodes using local disk, NFSoIB disk, and the Sun Storage 7410 system connected via InfiniBand. Results are compared against the time it takes to run 1 job with local disk. This test shows what performance impact there is when loading down a cluster.

The second test compares different methods of scheduling jobs on a cluster. The compact method involves putting all 8 MPI processes for a job on the same node. The distributed method involves using 1 MPI process per node. The results compare the performance against 1 job on one node.

The third test is similar to the second test, but uses only 4 nodes in the cluster, so when running distributed, there are 2 MPI processes per node.

The fourth test compares the compact and distributed scheduling methods on 4 nodes while running 2 MM8 jobs and one 16-way parallel 3D Prestack Kirchhoff Time Migration (PSTM) job.

Key Points and Best Practices

  • ECLIPSE is very sensitive to memory bandwidth and needs to be run with 1333 MHz or faster memory. In order to maintain 1333 MHz memory speed, the maximum memory configuration for the processors used in this benchmark is 24 GB. BIOS upgrades now allow 1333 MHz memory for up to 48 GB of memory. Additional nodes can be used to handle data sets that require more memory than is available per node. Allocating at least 20% of memory per node for I/O caching helps application performance.

  • If allocating an 8-way parallel job (MM8) to a single node, it is best to use an ECLIPSE license for that particular node to avoid any additional network overhead of sharing a global license with all the nodes in a cluster.

  • Understanding the ECLIPSE MM8 I/O access patterns is essential to optimizing a shared storage solution. The analytics available on the Oracle Unified Storage 7410 provide valuable I/O characterization information even without source code access. A single MM8 job run shows an initial read and write load related to reading the input grid, parsing Petrel ASCII input parameter files, and creating an initial solution grid and runtime specifications. This is followed by a very long running simulation that writes data, restart files, and generates reports to the 7410. Due to the nature of the small block I/O, the mirrored configuration for the 7410 outperformed the RAID-Z2 configuration (a pool-layout sketch follows this list).

    A single MM8 job reads, processes, and writes approximately 240 MB of grid and property data in the first 36 seconds of execution. The actual read and write of the grid data, that is intermixed with this first stage of processing, is done at a rate of 240 MB/sec to the 7410 for each of the two operations.

    Then, it calculates and reports the well connections at an average 260 KB writes/second with 32 operations/second = 32 x 8 KB writes/second. However, the actual size of each I/O operation varies between 2 to 100 KB and there are peaks every 20 seconds. The write cache is on average operating at 8 accesses/second at approximately 61 KB/second (8 x 8 KB writes/sec). As the number of concurrent jobs increases, the interconnect traffic and random I/O operations per second to the 7410 increases.

  • MM8 multiple job startup time is reduced on shared file systems, if each job uses separate input files.
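
The mirrored-versus-RAID-Z2 trade-off above is selected through the storage profile on a Sun Storage 7410 appliance; on a general-purpose Solaris system with ZFS, the same trade-off can be sketched with the commands below. This is a minimal sketch, not the benchmark configuration: the disk names (c1t0d0 through c1t3d0), the log device (c2t0d0), and the 8K record size are illustrative assumptions.

    # Mirrored pool: favors the small, random I/O pattern seen with MM8
    zpool create ecl_mirror mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 log c2t0d0

    # RAID-Z2 (double parity) pool: more usable capacity, but slower small-block I/O
    zpool create ecl_raidz2 raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 log c2t0d0

    # Optionally match the record size to the dominant I/O size (illustrative value)
    zfs set recordsize=8k ecl_mirror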

See Also

Disclosure Statement

Copyright 2010, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/20/2010.

Tuesday Apr 13, 2010

Oracle Sun Flash Accelerator F20 PCIe Card Accelerates Web Caching Performance

Using Oracle's Sun FlashFire technology, the Sun Flash Accelerator F20 PCIe Card is shown to be a high performance and cost effective caching device for web servers. Many current web and application servers are designed with an active cache that is used for holding items such as session objects, files and web pages. The Sun F20 card is shown to be an excellent candidate for improving performance over HDD-based solutions.

  • The Sun Flash Accelerator F20 PCIe Card provides 2x better Quality of Service (QoS) at the same load as compared to 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card enables scaling to 3x more users than 15K RPM high performance disk drives.

  • The Sun Flash Accelerator F20 PCIe Card provides 25% higher Quality of Service (QoS) than 15K RPM high performance disk drives at maximum rate.

  • The Sun Flash Accelerator F20 PCIe Card allows for easy expansion of the webcache. Each card provides an additional 96 GB of storage.

  • The Sun Flash Accelerator F20 PCIe Card used as a caching device offers Bitrate and Quality of Service (QoS) comparable to that provided by memory. While memory also provides excellent caching performance in comparison to disk, memory capacity is limited in servers.

Performance Landscape

Experiment results using three Sun Flash Accelerator F20 PCIe Cards.

Load Factor  No Cache (Max Load)  F20 Webcache (@Disk Load)  F20 Webcache (Max Load)  Memcache (@F20 Load)
Max Connections 7,000 7,000 27,000 27,000
Average Bitrate 445 Kbps 870 Kbps 602 Kbps 678 Kbps
Cache Hit Rate 0% 98% 99% 56%

QoS Bitrates %Connect %Connect %Connect %Connect
900 Kbps - 1 Mbps 0% 97% 0% 0%
800 Kbps 0% 3% 0% 6%
700 Kbps 0% 0% 64% 70%
600 Kbps 18% 0% 24% 15%
420 Kbps - 500 Kbps 88% 0% 12% 9%

Experiment results using two Sun Flash Accelerator F20 PCIe Cards.

Load Factor  No Cache (Max Load)  F20 Webcache (@Disk Load)  F20 Webcache (Max Load)  Memcache (@F20 Load)
Max Connections 7,000 7,000 22,000 27,000
Average Bitrate 445 Kbps 870 Kbps 622 Kbps 678 Kbps
Cache Hit Rate 0% 98% 80% 56%

QoS Bitrates %Connect %Connect %Connect %Connect
900 Kbps - 1 Mbps 0% 97% 0% 0%
800 Kbps 0% 3% 1% 6%
700 Kbps 0% 0% 68% 70%
600 Kbps 18% 0% 26% 15%
420 Kbps - 500 Kbps 88% 0% 5% 9%

Results and Configuration Summary

Hardware Configuration:

Sun Fire X4270, 72 GB memory
3 x Sun Flash Accelerator F20 PCIe Cards
Sun Storage J4400 (12 x 15K RPM disks)

Software Configuration:

Sun Java System Web Server 7
OpenSolaris
Flickr Photo Download Workload
Oracle Solaris Zettabyte File System (ZFS)

Three configurations are compared:

  1. No cache, 12 x high-speed 15K RPM Disks
  2. 3 x Sun Flash Accelerator F20 PCIe Cards as cache device
  3. 64 GB server memory as cache device

Benchmark Description

This benchmark is based upon the description of the flickr website presented at http://highscalability.com/flickr-architecture. It measures performance of an HTTP-based file photo Slide Show workload. The workload randomly selects and downloads from 80 photos stored in 4 bins:

  • 20 large photos, 1800x1800p, 1 MB, 1% probability
  • 20 medium photos, 1000x1000p, 500 KB, 4% probability
  • 20 small photos, 540x540p, 100 KB, 35% probability
  • 20 thumbnail photos, 100x100p, 5 KB, 60% probability

Benchmark metrics are:

  • Scalability – Number of persistent connections achieved
  • Quality of Service (QoS) – bitrate achieved by each user
    • max speed: 1 Mbps, min speed SLA: 420 Kbps
    • divides bitrates between max and min in 5 bands, corresponding to dial-in, T1, etc.
    • example: 900 Kbps, 800 Kbps, 700 Kbps, 600 Kbps, 500 Kbps
    • reports %users in each bitrate band

Three cases were tested:

  • Disk as Overflow Cache – Contents are served from 12 high-performance 15K RPM disks configured in a ZFS zpool.
  • Sun Flash Accelerator F20 PCIe Card as Cache Device – Contents are served from 2 F20 cards, with 8 component DOMs configured in a ZFS zpool (see the sketch after this list).
  • Memory as Cache – Contents are served from tmpfs.
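
To make the three cases more concrete, the commands below are a minimal sketch on Solaris: a ZFS pool striped across an F20 card's flash DOMs for case 2, and a tmpfs mount for case 3. The device names (c3t0d0 through c3t3d0), pool name, and mount points are hypothetical, and the tmpfs size simply mirrors the 64 GB figure listed above.

    # Case 2: stripe a ZFS pool across the F20 card's four flash DOMs (device names hypothetical)
    zpool create webcache c3t0d0 c3t1d0 c3t2d0 c3t3d0
    zfs create -o mountpoint=/cache/photos webcache/photos

    # Case 3: memory as cache via tmpfs (65536m = 64 GB, per the configuration above)
    mkdir -p /cache/mem
    mount -F tmpfs -o size=65536m swap /cache/mem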

Key Points and Best Practices

See Also

Disclosure Statement

Results as of 4/1/2010.

Thursday Nov 19, 2009

SPECmail2009: New World record on T5240 1.6GHz Sun 7310 and ZFS

The Sun SPARC Enterprise T5240 server running the Sun Java System Messaging Server 7.2 achieved a World Record SPECmail2009 result using the Sun Storage 7310 Unified Storage System and the ZFS file system. Sun's Open Storage platforms enable another world record.

  • World record SPECmail2009 benchmark using the Sun SPARC Enterprise T5240 server (two 1.6GHz UltraSPARC T2 Plus), Sun Communications Suite 7, Solaris 10, and the Sun Storage 7310 Unified Storage System achieved 14,500 SPECmail_Ent2009 users at 69,857 Sessions/Hour.

  • This SPECmail2009 benchmark result clearly demonstrates that the Sun Messaging Server 7.2, Solaris 10 and ZFS solution can support a large, enterprise level IMAP mail server environment as a low cost 'Sun on Sun' solution, delivering the best performance and maximizing data integrity and availability of Sun Open Storage and ZFS.

  • The Sun SPARC Enterprise T5240 server supported 2.4 times more users with a 2.4 times better sessions/hour rate than the Apple Xserv3,1 solution on the SPECmail2009 benchmark.

  • There are no IBM Power6 results on this benchmark.

  • The configuration using Sun Open Storage outperformed all previous results, which used traditional direct attached storage and a significantly higher number of disk devices.

SPECmail2009 Performance Landscape (ordered by performance)

System  Processors  Users  Sessions/hour  Disks  OS  Messaging Server
Sun SPARC Enterprise T5240  2 x 1.6GHz UltraSPARC T2 Plus  14,500  69,857  58 (NAS)  Solaris 10  CommSuite 7.2, Sun JMS 7.2
Sun SPARC Enterprise T5240  2 x 1.6GHz UltraSPARC T2 Plus  12,000  57,758  80 (DAS)  Solaris 10  CommSuite 5, Sun JMS 6.3
Sun Fire X4275  2 x 2.93GHz Xeon X5570  8,000  38,348  44 (NAS)  Solaris 10  Sun JMS 6.2
Apple Xserv3,1  2 x 2.93GHz Xeon X5570  6,000  28,887  82 (DAS)  MacOS 10.6  Dovecot 1.1.14, apple 0.5
Sun SPARC Enterprise T5220  1 x 1.4GHz UltraSPARC T2  3,600  17,316  52 (DAS)  Solaris 10  Sun JMS 6.2

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Users - SPECmail_Ent2009 Users
Sessions/hour - SPECmail2009 Sessions/hour
NAS - Network Attached Storage
DAS - Direct Attached Storage

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB memory
      2 x 146GB, 10K RPM SAS disks, 4 x 32GB SSDs

External Storage:

    2 x Sun Storage 7310 Unified Storage System, each with
      32 GB of memory
      24 x 1 TB 7200 RPM SATA Drives

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communications Suite 7 Update 2
      Sun Java System Messaging Server 7.2
      Directory Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun Storage 7310 Unified Storage System was configured with one J4400 JBOD array (22 x 1TB SATA drives) as a mirrored device, and four shared volumes were built on the mirrored device. The total of eight mirrored volumes from the two Sun Storage 7310 systems were mounted on the system under test (SUT) over the NFSv4 protocol for the messaging mail index and mail message file systems. Four SSDs were used as SUT internal disks, each configured as a ZFS file system; these ZFS file systems were used for the messaging server queue, store metadata, and LDAP. The SSDs substantially reduced the store metadata and queue latencies (a storage-layout sketch follows this list).

  • Each Sun Storage 7310 Unified Storage System was connected to the SUT via a dual 10-Gigabit Ethernet Fiber XFP card.

  • The Sun Storage 7310 Unified Storage System software version is 2009.08.11,1-0.

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • Substantial performance improvement and scalability were observed with Sun Communications Suite 7 Update 2, Java Messaging Server 7.2 and Directory Server 6.2.

  • See the SPEC Report for all OS, network and messaging server tunings.
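
As a minimal sketch of the storage layout just described (the sketch referenced in the first bullet), the SUT-side commands might look like the following; the host names, share paths, pool name, and SSD device name are hypothetical and are not taken from the SPEC report.

    # Mount mirrored shares from the two Sun Storage 7310 systems over NFSv4
    mount -F nfs -o vers=4 7310-a:/export/mailstore1 /mail/store1
    mount -F nfs -o vers=4 7310-b:/export/mailindex1 /mail/index1

    # One internal SSD as a ZFS file system (queue shown; store metadata and LDAP are similar)
    zpool create queuepool c4t0d0
    zfs create -o mountpoint=/mail/queue queuepool/queue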

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 10/22/09 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240, SPECmail_Ent2009 14,500 users at 69,857 SPECmail2009 Sessions/hour. Apple Xserv3,1, SPECmail_Ent2009 6,000 users at 28,887 SPECmail2009 Sessions/hour.

Tuesday Oct 13, 2009

Sun T5440 Oracle BI EE Sun SPARC Enterprise T5440 World Record

The Oracle BI EE workload, a component of Oracle Fusion Middleware, was run on two Sun SPARC Enterprise T5440 servers and achieved world record performance.
  • Two Sun SPARC Enterprise T5440 servers with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best performance of 50K concurrent users on the Oracle BI EE 10.1.3.4 benchmark with Oracle 11g database running on free and open Solaris 10.

  • The two-node Sun SPARC Enterprise T5440 configuration with Oracle BI EE running on Solaris 10 using 8 Solaris Containers shows 1.8x scaling over Sun's previous one-node SPARC Enterprise T5440 server result with 4 Solaris Containers.

  • The two-node SPARC Enterprise T5440 configuration demonstrated the performance and scalability of the UltraSPARC T2 Plus processor, showing that 50K users can be serviced with a 0.2776 sec response time.

  • The Sun SPARC Enterprise T5220 server was used as an NFS server with 4 internal SSDs and the ZFS file system which showed significant I/O performance improvement over traditional disk for Business Intelligence Web Catalog activity.

  • Oracle Fusion Middleware provides a family of complete, integrated, hot pluggable and best-of-breed products known for enabling enterprise customers to create and run agile and intelligent business applications. Oracle BI EE performance demonstrates why so many customers rely on Oracle Fusion Middleware as their foundation for innovation.

  • IBM has not published any POWER6 processor based results on this important benchmark.

Performance Landscape

System  Chips  GHz  Processor Type  Users
2 x Sun SPARC Enterprise T5440 8 1.6 UltraSPARC T2 Plus 50,000
1 x Sun SPARC Enterprise T5440 4 1.6 UltraSPARC T2 Plus 28,000
5 x Sun Fire T2000 1 1.2 UltraSPARC T1 10,000

Results and Configuration Summary

Hardware Configuration:

    2 x Sun SPARC Enterprise T5440 (1.6GHz/128GB)
    1 x Sun SPARC Enterprise T5220 (1.2GHz/64GB) and 4 SSDs (used as NFS server)

Software Configuration:

    Solaris 10 5/09
    Oracle BI EE 10.1.3.4
    Oracle Fusion Middleware
    Oracle 11gR1

Benchmark Description

The objective of this benchmark is to highlight how Oracle BI EE can support pervasive deployments in large enterprises, using minimal hardware, by simulating an organization that needs to support more than 25,000 active concurrent users, each operating in mixed mode: ad-hoc reporting, application development, and report viewing.

The user population was divided into a mix of administrative users and business users. A maximum of 28,000 concurrent users were actively interacting and working in the system during the steady-state period. The tests executed 580 transactions per second, with think times of 60 seconds per user, between requests. In the test scenario 95% of the workload consisted of business users viewing reports and navigating within dashboards. The remaining 5% of the concurrent users, categorized as administrative users, were doing application development.

The benchmark scenario used a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards, viz. "Service Manager". The user then selects the "Service Effectiveness" dashboard, which shows him four distinct reports: "Service Request Trend", "First Time Fix Rate", "Activity Problem Areas", and "Cost Per Completed Service Call, 2002 till 2005". The user then proceeds to view the "Customer Satisfaction" dashboard, which also contains a set of 4 related reports. He then proceeds to drill down on some of the reports to see the detail data. Then the user proceeds to more dashboards, for example "Customer Satisfaction" and "Service Request Overview". After navigating through these dashboards, he logs out of the application.

This benchmark did not use a synthetic database schema. The benchmark tests were run on a full production version of the Oracle Business Intelligence Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a true customer scenario.

See Also

Disclosure Statement

Oracle BI EE benchmark results 10/13/2009, see

SPECcpu2006 Results On MSeries Servers With Updated SPARC64 VII Processors

The SPEC CPU2006 benchmarks were run on the new 2.88 GHz and 2.53 GHz SPARC64 VII processors for the Sun SPARC Enterprise M-series servers. The new processors were tested in the Sun SPARC Enterprise M4000, M5000, M8000 and M9000 servers.


  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 20% on the SPECint_rate2006 benchmark.

  • The Sun SPARC Enterprise M9000 server running the new 2.88 GHz SPARC64 VII processors beats the IBM Power 595 server running 5.0 GHz POWER6 processors by 29% on the SPECint_rate_base2006 benchmark.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII 2.88GHz processors delivered results of 2590 SPECint_rate2006 and 2100 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 64 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 13% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII 2.88GHz processors delivered results of 1450 SPECint_rate2006 and 1250 SPECfp_rate2006.

  • The Sun SPARC Enterprise M9000 server with 32 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 17% for SPECint_rate2006 and 13% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII 2.88GHz processors delivered results of 753 SPECint_rate2006 and 666 SPECfp_rate2006.

  • The Sun SPARC Enterprise M8000 server with 16 SPARC64 VII processors at 2.88GHz improves performance vs. 2.52 GHz by 18% for SPECint_rate2006 and 14% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII 2.53GHz processors delivered results of 296 SPECint_rate2006 and 234 SPECfp_rate2006.

  • The Sun SPARC Enterprise M5000 server with 8 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 12% for SPECint_rate2006 and 5% for SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII 2.53GHz processors delivered results of 152 SPECint_rate2006 and 116 SPECfp_rate2006.

  • The Sun SPARC Enterprise M4000 server with 4 SPARC64 VII processors at 2.53GHz improves performance vs. 2.40 GHz by 13% for SPECint_rate2006 and 4% for SPECfp_rate2006.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 10/07/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - large systems

System  Cores/Chips  Type  GHz  Base Copies  Base  Peak  Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 9031 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 8845 na
SGI Altix 4700 Density 256/128 Itanium 2 1.66 256 2893 3354
vSMP Foundation 128/32 Xeon X5570 2.93 255 3147 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 2715 2971
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 2400 2590 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 2088 2288
IBM Power 595 64/32 POWER6 5.0 128 1866 2155
HP Superdome 128/64 Itanium 2 1.6 128 1534 1648
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1370 1450 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1111 1294
SPARC Enterprise M9000 128/32 SPARC64 VII 2.52 255 1141 1240
Unisys ES7000 96/16 Xeon X7460 2.66 96 999 1049
SGI Altix ICE 8200EX 32/8 Xeon X5570 2.93 64 931 999
IBM Power 575 32/16 POWER6 4.7 64 812 934
IBM Power 570 32/16 POWER6+ 4.2 64 661 832
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 706 753 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.4 127 553 650
SPARC Enterprise M8000 64/16 SPARC64 VII 2.52 127 565 637

SPECint_rate2006 results - small systems

System  Cores/Chips  Type  GHz  Base Copies  Base  Peak  Comments
Sun Fire X4440 24/4 Opteron 8435 SE 2.6 24 296 377
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 267 296 New
Sun Blade X6440 16/4 Opteron 8389 2.9 16 226 292
HP ProLiant BL680c G5 24/4 Xeon E7458 2.4 24 247 268
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 232 264
IBM Power 550 8/4 POWER6+ 5.0 16 215 263
Sun Fire X2270 8/2 Xeon X5570 2.93 16 223 260
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 171 183
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 136 152 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 118 135

SPECfp_rate2006 results - large systems

System  Cores/Chips  Type  GHz  Base Copies  Base  Peak  Comments
SGI Altix 4700 Bandwidth 1024/512 Itanium 2 1.6 1020 10583 na
SGI Altix 4700 Density 1024/512 Itanium 2 1.66 1020 10580 na
Sun Blade X6440 Cluster 768/192 Opteron 8384 2.7 705 6502 na
SGI Altix 4700 Bandwidth 256/128 Itanium 2 1.6 256 3419 3507
ScaleMP vSMP Foundation 128/32 Xeon X5570 2.93 255 2553 na
IBM Power 595 64/32 POWER6 5.0 128 1681 2184
IBM Power 595 64/32 POWER6 5.0 128 1822 2108
SPARC Enterprise M9000 256/64 SPARC64 VII 2.88 511 1930 2100 New
SPARC Enterprise M9000 256/64 SPARC64 VII 2.52 511 1861 2005
SGI Altix 4700 Bandwidth 128/64 Itanium 2 1.66 128 1832 1947
HP Superdome 128/64 Itanium 2 1.6 128 1422 1479
SPARC Enterprise M9000 128/32 SPARC64 VII 2.88 255 1190 1250 New
SPARC Enterprise M9000 128/64 SPARC64 VI 2.4 255 1160 1225
SPARC Enterprise M9000 128/32 SPARC64 VII 2.52 255 1059 1110
IBM Power 575 32/16 POWER6 4.7 64 730 839
SPARC Enterprise M8000 64/16 SPARC64 VII 2.88 127 616 666 New
SPARC Enterprise M9000 64/32 SPARC64 VI 2.52 127 588 636
IBM Power 570 32/16 POWER6+ 4.2 64 517 602
SPARC Enterprise M8000 64/16 SPARC64 VII 2.52 127 538 582

SPECfp_rate2006 results - small systems

System  Cores/Chips  Type  GHz  Base Copies  Base  Peak  Comments
Supermicro H8QM8-2 24/4 Opteron 8435 SE 2.8 24 261 287
SPARC Enterprise T5440 32/4 UltraSPARC T2 Plus 1.6 255 254 270
IBM Power 560 16/8 POWER6+ 3.6 32 226 263
SPARC Enterprise M5000 32/8 SPARC64 VII 2.53 64 218 234 New
SPARC Enterprise M5000 32/8 SPARC64 VII 2.4 63 208 223
IBM Power 550 8/4 POWER6+ 5.0 16 188 222
ASUS Z8PE-D18 8/2 Xeon X5570 2.93 16 197 203
SPARC Enterprise T5240 16/2 UltraSPARC T2 Plus 1.6 127 124 133
SPARC Enterprise M4000 16/4 SPARC64 VII 2.53 32 111 116 New
SPARC Enterprise M4000 16/4 SPARC64 VII 2.4 32 107 112

Results and Configuration Summary

Test Configurations:

Sun SPARC Enterprise M9000
64 x 2.88 GHz SPARC64 VII
1152 GB (448 x 2GB + 64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M9000
32 x 2.88 GHz SPARC64 VII
704 GB (160 x 2GB + 96 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M8000
16 x 2.88 GHz SPARC64 VII
512 GB (128 x 4GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M5000
8 x 2.53 GHz SPARC64 VII
128 GB (64 x 2GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Sun SPARC Enterprise M4000
4 x 2.53 GHz SPARC64 VII
32 GB (32 x 1GB)
Solaris 10 10/09
Sun Studio 12 Update 1

Results Summary:

Metric  M9000 (64 chips)  M9000 (32 chips)  M8000  M5000  M4000
SPECint_rate_base2006 2400 1370 706 267 136
SPECint_rate2006 2590 1450 753 296 152
SPECfp_rate_base2006 1930 1190 616 218 111
SPECfp_rate2006 2100 1250 666 234 116
SPECint_base2006 - - 12.4 - 12.1
SPECint2006 - - 13.6 - 12.9
SPECfp_base2006 - - 15.6 - 13.3
SPECfp2006 - - 16.5 - 13.9
SPECfp_base2006 - autopar - - 28.2 - -
SPECfp2006 - autopar - - 33.9 - -

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 8000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.
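
To make the base/peak distinction concrete, the fragment below is a hypothetical SPEC CPU2006 configuration-file sketch in the style used with Sun Studio compilers; the flags shown are illustrative and are not the tuning used in the submissions reported here.

    # Base: one conservative set of optimization flags shared by every integer benchmark
    int=base=default=default:
    COPTIMIZE   = -fast
    CXXOPTIMIZE = -fast

    # Peak: per-benchmark tuning is permitted (flags illustrative)
    400.perlbench=peak=default=default:
    COPTIMIZE   = -fast -xipo=2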

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise M9000 server were measured on a Fujitsu SPARC Enterprise M9000. The Sun SPARC Enterprise M9000 and Fujitsu SPARC Enterprise M9000 are electronically equivalent. Results for the Sun SPARC Enterprise M8000, M4000 and M5000 were measured on those systems. The similarly named Fujitsu systems are electronically equivalent.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12 Update 1, which is used in these submissions, provides updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

See Also

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 7 October 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun SPARC Enterprise M9000 2400 SPECint_rate_base2006, 2590 SPECint_rate2006, 1930 SPECfp_rate_base2006, 2100 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips) 1370 SPECint_rate_base2006, 1450 SPECint_rate2006, 1190 SPECfp_rate_base2006, 1250 SPECfp_rate2006; Sun SPARC Enterprise M8000 706 SPECint_rate_base2006, 753 SPECint_rate2006, 616 SPECfp_rate_base2006, 666 SPECfp_rate2006; Sun SPARC Enterprise M5000 267 SPECint_rate_base2006, 296 SPECint_rate2006, 218 SPECfp_rate_base2006, 234 SPECfp_rate2006; Sun SPARC Enterprise M4000 136 SPECint_rate_base2006, 152 SPECint_rate2006, 111 SPECfp_rate_base2006, 116 SPECfp_rate2006; Sun SPARC Enterprise M9000 (2.52GHz) 2088 SPECint_rate_base2006, 2288 SPECint_rate2006, 1860 SPECfp_rate_base2006, 2010 SPECfp_rate2006; Sun SPARC Enterprise M9000 (32 chips 2.52GHz) 1140 SPECint_rate_base2006, 1240 SPECint_rate2006, 1060 SPECfp_rate_base2006, 1110 SPECfp_rate2006; Sun SPARC Enterprise M8000 (2.52GHz) 565 SPECint_rate_base2006, 637 SPECint_rate2006, 538 SPECfp_rate_base2006, 582 SPECfp_rate2006; Sun SPARC Enterprise M5000 (2.4GHz) 232 SPECint_rate_base2006, 264 SPECint_rate2006, 208 SPECfp_rate_base2006, 223 SPECfp_rate2006; Sun SPARC Enterprise M4000 (2.4GHz) 118 SPECint_rate_base2006, 135 SPECint_rate2006, 107 SPECfp_rate_base2006, 112 SPECfp_rate2006; IBM Power 595 1866 SPECint_rate_base2006, 2155 SPECint_rate2006,

Wednesday Aug 12, 2009

SPECmail2009 on Sun SPARC Enterprise T5240 and Sun Java System Messaging Server 6.3

Significance of Results

The Sun SPARC Enterprise T5240 server running the Sun Java Messaging server 6.3 achieved World Record SPECmail2009 results using ZFS.

  • A Sun SPARC Enterprise T5240 server powered by two 1.6 GHz UltraSPARC T2 Plus processors running the Sun Java Communications Suite 5 software along with the Solaris 10 Operating System and using six Sun StorageTek 2540 arrays achieved a new World Record 12000 SPECmail_Ent2009 IMAP4 users at 57,758 Sessions/hour for SPECmail2009.
  • The Sun SPARC Enterprise T5240 server achieved twice the number of users and twice the sessions/hour rate of the Apple Xserv3,1 solution equipped with Intel Nehalem processors.
  • The Sun result was obtained using ~10% fewer disk spindles with the Sun StorageTek 2540 RAID controller direct attach storage solution versus Apple's direct attached storage.
  • This benchmark result demonstrates that the Sun SPARC Enterprise T5240 server together with Sun Java Communication Suite 5 component Sun Java System Messaging Server 6.3, Solaris 10 and ZFS on Sun StorageTek 2540 arrays supports a large, enterprise level IMAP mail server environment. This solution is reliable, low cost, and low power, delivering the best performance and maximizing the data integrity with Sun's ZFS file systems.

Performance Landscape

SPECmail2009 (ordered by performance)

System  Processor Type  GHz  Ch, Co, Th  SPECmail_Ent2009 Users  SPECmail2009 Sessions/hour
Sun SPARC Enterprise T5240 UltraSPARC T2 Plus 1.6 2, 16, 128 12,000 57,758
Sun Fire X4275 Xeon X5570 2.93 2, 8, 16 8,000 38,348
Apple Xserv3,1 Xeon X5570 2.93 2, 8, 16 6,000 28,887
Sun SPARC Enterprise T5220 UltraSPARC T2 1.4 1, 8, 64 3,600 17,316

Notes:

    Number of SPECmail_Ent2009 users (bigger is better)
    SPECmail2009 Sessions/hour (bigger is better)
    Ch, Co, Th: Chips, Cores, Threads

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5240

      2 x 1.6 GHz UltraSPARC T2 Plus processors
      128 GB
      8 x 146GB, 10K RPM SAS disks

    6 x Sun StorageTek 2540 Arrays,

      4 arrays with 12 x 146GB 15K RPM SAS disks
      2 arrays with 12 x 73GB 15K RPM SAS disks

    2 x Sun Fire X4600 benchmark manager, load generator and mail sink

      8 x AMD Opteron 8356 2.7 GHz QC processors
      64 GB
      2 x 73GB 10K RPM SAS disks

    Sun Fire X4240 load generator

      2 x AMD Opteron 2384 2.7 GHz DC processors
      16 GB
      2 x 73GB 10K RPM SAS disks

Software Configuration:

    Solaris 10
    ZFS
    Sun Java Communication Suite 5
    Sun Java System Messaging Server 6.3

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each Sun StorageTek 2540 array was configured with 6 hardware RAID1 volumes, for a total of 36 RAID1 volumes: 24 of size 146GB and 12 of size 73GB. Four zpools, each of 6 x 146GB RAID1 volumes, were mounted as the four primary message-store ZFS file systems. Four zpools of 8x73GB RAID1 volumes were mounted as the four primary message indexes. The hardware RAID1 volumes were created with a 64K stripe size and with read-ahead turned off. The 7 x 146GB internal drives were used to create four zpools and ZFS file systems for LDAP, store metadata, the queue and the mail server log (a configuration sketch follows this list).

  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap

  • See the SPEC Report for all OS, network and messaging server tunings.
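
A rough sketch of how one of the four primary message-store pools could be built from the hardware RAID1 LUNs follows (this is the configuration sketch referenced in the first bullet); the LUN device names and mount point are hypothetical, and only one of the four store pools is shown.

    # One message-store pool striped across six 146GB hardware RAID1 LUNs (names hypothetical)
    zpool create store1 c5t0d0 c5t1d0 c5t2d0 c5t3d0 c5t4d0 c5t5d0
    zfs create -o mountpoint=/mail/store1 store1/fs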

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 08/07/2009 on www.spec.org. SPECmail2009: Sun SPARC Enterprise T5240 (16 cores, 2 chips) SPECmail_Ent2009 12000 users at 57,758 SPECmail2009 Sessions/hour. Apple Xserv3,1 (8 cores, 2 chips) SPECmail_Ent2009 6000 users at 28,887 SPECmail2009 Sessions/hour.

Thursday Jul 23, 2009

World Record Performance of Sun CMT Servers

This week, Sun continues to highlight the record-breaking performance of its latest update to the chip multi-threaded (CMT) Sun SPARC Enterprise server family running Solaris. Some of these benchmarks leverage a variety of Sun's unique technologies, including ZFS, SSDs, various storage products and many more. These benchmarks were blogged about by various members of our team and the URLs are shown below.

Messages

  • Sun's CMT is the most powerful CPU regardless of architectural/implementation details (#transistors, #cores, threads, MHz, etc.)!
  • Performance tests show that Sun can outperform IBM Power6 by more than 2x on a variety of benchmarks.
  • Performance tests show Sun's new 1.6GHz CMT systems can be 20% faster than Sun's previous generation 1.4GHz processors, given Sun's continual advancements in both hardware and software.

Benchmark Results Recently Blogged

Sun T5440 Oracle BI EE World Record Performance
http://blogs.sun.com/BestPerf/entry/sun_t5440_oracle_bi_ee

Sun T5440 World Record SAP-SD 4-Processor Two-tier SAP ERP 6.0 EP 4 (Unicode), Beats IBM POWER6 (note1)
http://blogs.sun.com/BestPerf/entry/sun_t5440_world_record_sap

Zeus ZXTM Traffic Manager World Record on Sun T5240
http://blogs.sun.com/BestPerf/entry/top_performance_on_sun_sparc

Sun T5440 SPECjbb2005, Sun 1.6GHz T2 Plus chip is 2.3x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/sun_t5440_specjbb2005_beats_ibm

New SPECjAppServer2004 Performance on the Sun SPARC Enterprise T5440
http://blogs.sun.com/BestPerf/entry/new_specjappserver2004_performance_on_sun

1.6 GHz SPEC CPU2006: World Record 4-chip system, Rate Benchmarks, Beats IBM POWER6
http://blogs.sun.com/BestPerf/entry/1_6_ghz_spec_cpu2006

Sun Blade T6320 World Record 1-chip SPECjbb2005 performance, Sun 1.6GHz T2 Plus chip is 2.6x IBM 4.7GHz POWER6 chip
http://blogs.sun.com/BestPerf/entry/new_specjbb2005_performance_on_the

Comparison Table

Oracle BI EE (Sun T5440; Tier: Appl, Database; Software: Oracle 11g, Oracle BIEE, ZFS, Solaris)
  • World Record: T5440
  • Achieved 28,000 users
  • Reference

SAP-SD 2-Tier (Sun T5440; Tier: Appl, Database; Software: SAP ECC 6.0 EP4, Oracle 10g, Solaris)
  • World Record 4-socket: T5440
  • T5440 Beats 4-socket IBM 550 5GHz Power6 by 26% (note1)
  • T5440 Beats HP DL585 G6 4-socket Opteron (note1)
  • Unicode version

SPECjAppServer2004 (Sun T5440; Tier: Appl, Database; Software: Oracle WebLogic, Oracle 11g, JDK 1.6.0_14, Solaris)
  • World Record Single System (Appl Tier): T5440
  • T5440 is 6.4x faster than IBM Power 570 4.7GHz Power6
  • T5440 is 73% faster than HP DL 580 G5 Xeon 6C
  • Oracle Fusion Middleware

Sun T5440 SPECjbb2005 (Sun T5440; Tier: Appl; Software: Java HotSpot, OpenSolaris)
  • 1.6GHz US T2 Plus CPU is 2.3x faster than IBM 4.7GHz Power6 CPU
  • 1.6GHz US T2 Plus CPU is 21% faster than previous generation 1.4GHz US T2 Plus CPU
  • Sun T5440 has 2.3x better power/perf than the IBM 570 (8 x 4.7GHz Power6)

Sun Blade T6320 SPECjbb2005 (Sun T6320; Tier: Appl; Software: Java HotSpot, OpenSolaris)
  • World Record 1-socket: T6320
  • 1.6GHz US T2 Plus CPU is 2.6x faster than IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

SPEC CPU2006 (Sun T5440, Sun T5240, Sun T5220, Sun T5120, Sun T6320; Tier: all tiers; Software: Sun Studio 12, Solaris, ZFS)
  • World Record 4-socket: T5440
  • 1.6GHz US T2 Plus CPU is 2.6x faster than IBM 4.7GHz Power6 CPU
  • T6320 is 3% faster than Fujitsu 3.16GHz Xeon QC

Zeus ZXTM Traffic Manager (Sun T5240; Tier: Web; Software: Zeus ZXTM v5.1r1, Solaris)
  • World Record: T5240
  • T5240 Beats f5 BIG-IP VIPRION by 34%; 2.6x better $/perf
  • T5240 Beats f5 BIG-IP 8800 by 91%; 2.7x better $/perf
  • T5240 Beats Citrix 12000 by 2.2x; 3.3x better $/perf
  • No IBM result

Virtualization

Sun's announcement also included updated virtualization software (LDoms 1.1). Downloads are available to existing SPARC Enterprise server customers at: http://www.sun.com/servers/coolthreads/ldoms/index.jsp. Also look at the blog posting "LDoms for Dummies" at http://blogs.sun.com/PierreReynes/entry/ldoms_for_dummies

Try & Buy Program

Sun is also offering free 60-day trials on Sun CMT servers with a very popular Try and Buy program: http://www.sun.com/tryandbuy.

Benchmark Performance Disclosure Statements (the URLs listed above go into more detail on each of these benchmarks)

Note1: 4-processor world record on the 2-tier SAP SD Standard Application Benchmark with 4720 SD User, as of July 23, 2009, IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. T5440 beats HP new 4-socket Opteron Servers (HPDL585 G6 with 4665 SD User and HP BL685c G6 with 4422 SD User)

Two-tier SAP Sales and Distribution (SD) standard SAP ERP 6.0 2005/EP4 (Unicode) application benchmarks as of 07/21/09: Sun SPARC Enterprise T5440 Server (4 processors, 32 cores, 256 threads) 4,720 SAP SD Users, 4x 1.6 GHz UltraSPARC T2 Plus, 256 GB memory, Oracle10g, Solaris10, Cert# 2009026. HP ProLiant DL585 G6 (4 processors, 24 cores, 24 threads) 4,665 SAP SD Users, 4x 2.8 GHz AMD Opteron Processor 8439 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009025. HP ProLiant BL685c G6 (4 processors, 24 cores, 24 threads) 4,422 SAP SD Users, 4x 2.6 GHz AMD Opteron Processor 8435, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009021. IBM System 550 (4 processors, 8 cores, 16 threads) 3,752 SAP SD Users, 4x 5 GHz Power6, 64 GB memory, DB2 9.5, AIX 6.1, Cert# 2009023. HP ProLiant DL585 G5 (4 processors, 16 cores, 16 threads) 3,430 SAP SD Users, 4x 3.1 GHz AMD Opteron Processor 8393 SE, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009008. HP ProLiant BL685 G6 (4 processors, 16 cores, 16 threads) 3,118 SAP SD Users, 4x 2.9 GHz AMD Opteron Processor 8389, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009007. NEC Express5800 (4 processors, 24 cores, 24 threads) 2,957 SAP SD Users, 4x 2.66 GHz Intel Xeon Processor X7460, 64 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009018. Dell PowerEdge M905 (4 processors, 16 cores, 16 threads) 2,129 SAP SD Users, 4x 2.7 GHz AMD Opteron Processor 8384, 96 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2009017. Sun Fire X4600M2 (8 processors, 32 cores, 32 threads) 7,825 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384, 128 GB memory, MaxDB 7.6, Solaris 10, Cert# 2008070. IBM System x3650 M2 (2 Processors, 8 Cores, 16 Threads) 5,100 SAP SD users,2x 2.93 Ghz Intel Xeon X5570, DB2 9.5, Windows Server 2003 Enterprise Edition, Cert# 2008079. HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 4,995 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2008071. SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark.

Oracle Business Intelligence Enterprise Edition benchmark, see http://www.oracle.com/solutions/business_intelligence/resource-library-whitepapers.html for more. Results as of 7/20/09.

Zeus is TM of Zeus Technology Limited. Results as of 7/21/2009 on http://www.zeus.com/news/press_articles/zeus-price-performance-press-release.html?gclid=CLn4jLuuk5cCFQsQagod7gTkJA.

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 16 July 2009. Sun's new results quoted on this page have been submitted to SPEC. Sun Blade T6320 89.2 SPECint_rate_base2006, 96.7 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 89.1 SPECint_rate_base2006, 97.0 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006; Sun SPARC Enterprise T5240 172 SPECint_rate_base2006, 183 SPECint_rate2006, 124 SPECfp_rate_base2006, 133 SPECfp_rate2006; Sun SPARC Enterprise T5440 338 SPECint_rate_base2006, 360 SPECint_rate2006, 254 SPECfp_rate_base2006, 270 SPECfp_rate2006; Sun Blade T6320 76.4 SPECint_rate_base2006, 85.5 SPECint_rate2006, 58.1 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5220/T5120 76.2 SPECint_rate_base2006, 83.9 SPECint_rate2006, 57.9 SPECfp_rate_base2006, 62.3 SPECfp_rate2006; Sun SPARC Enterprise T5240 142 SPECint_rate_base2006, 157 SPECint_rate2006, 111 SPECfp_rate_base2006, 119 SPECfp_rate2006; Sun SPARC Enterprise T5440 270 SPECint_rate_base2006, 301 SPECint_rate2006, 212 SPECfp_rate_base2006, 230 SPECfp_rate2006; IBM p 570 53.2 SPECint_rate_base2006, 60.9 SPECint_rate2006, 51.5 SPECfp_rate_base2006, 58.0 SPECfp_rate2006; IBM Power 520 102 SPECint_rate_base2006, 124 SPECint_rate2006, 88.7 SPECfp_rate_base2006, 105 SPECfp_rate2006; IBM Power 550 215 SPECint_rate_base2006, 263 SPECint_rate2006, 188 SPECfp_rate_base2006, 222 SPECfp_rate2006; HP Integrity BL870c 114 SPECint_rate_base2006; HP Integrity rx7640 87.4 SPECfp_rate_base2006, 90.8 SPECfp_rate2006.

SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results as of 7/17/2009 on http://www.spec.org. SPECjbb2005, Sun Blade T6320 229576 SPECjbb2005 bops, 28697 SPECjbb2005 bops/JVM; IBM p 570 88089 SPECjbb2005 bops, 88089 SPECjbb2005 bops/JVM; Fujitsu TX100 223691 SPECjbb2005 bops, 111846 SPECjbb2005 bops/JVM; IBM x3350 194256 SPECjbb2005 bops, 97128 SPECjbb2005 bops/JVM; Sun SPARC Enterprise T5120 192055 SPECjbb2005 bops, 24007 SPECjbb2005 bops/JVM.

SPECjAppServer2004, Sun SPARC Enterprise T5440 (4 chips, 32 cores) 7661.16 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 24 cores) 4410.07 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 16 cores) 3339.94 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge 2950 (4 chips, 16 cores) 4794.33 SPECjAppServer2004 JOPS@Standard; Dell PowerEdge R610 (2 chips, 8 cores) 3975.13 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge R610 (4 chips, 16 cores) 7311.50 SPECjAppServer2004 JOPS@Standard; IBM Power 570 (2 chips, 4 cores) 1197.51 SPECjAppServer2004 JOPS@Standard; SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Results from http://www.spec.org as of 7/20/09.

SPECjbb2005 Sun SPARC Enterprise T5440 (4 chips, 32 cores) 841380 SPECjbb2005 bops, 26293 SPECjbb2005 bops/JVM. Results submitted to SPEC. HP DL585 G5 (4 chips, 24 cores) 937207 SPECjbb2005 bops, 234302 SPECjbb2005 bops/JVM. IBM Power 570 (8 chips, 16 cores) 798752 SPECjbb2005 bops, 99844 SPECjbb2005 bops/JVM. Sun SPARC Enterprise T5440 (4 chips, 32 cores) 692736 SPECjbb2005 bops, 21648 SPECjbb2005 bops/JVM. SPEC, SPECjbb reg tm of Standard Performance Evaluation Corporation. Results from www.spec.org as of 7/20/09.

IBM p 570 8P 4.7GHz (4 building blocks) power specifications calculated as 80% of maximum input power reported 7/8/09 in “Facts and Features Report”: ftp://ftp.software.ibm.com/common/ssi/pm/br/n/psb01628usen/PSB01628USEN.PDF

Tuesday Jul 21, 2009

Sun T5440 Oracle BI EE World Record Performance

Oracle BI EE Sun SPARC Enterprise T5440 World Record Performance

The Sun SPARC Enterprise T5440 server running the new 1.6 GHz UltraSPARC T2 Plus processor delivered world record performance on Oracle Business Intelligence Enterprise Edition (BI EE) tests using Sun's ZFS.
  • The Sun SPARC Enterprise T5440 server with four 1.6 GHz UltraSPARC T2 Plus processors delivered the best single system performance of 28K concurrent users on the Oracle BI EE benchmark. This result used Solaris 10 with Solaris Containers and the Oracle 11g Database software.

  • The benchmark demonstrates the scalability of the Oracle Business Intelligence Cluster with 4 nodes running in Solaris Containers within a single Sun SPARC Enterprise T5440 server.

  • The Sun SPARC Enterprise T5440 server with internal SSDs and the ZFS file system showed significant I/O performance improvement over traditional disk for Business Intelligence Web Catalog activity.

Performance Landscape

System  Chips  Cores  Threads  GHz  Processor Type  Users
1 x Sun SPARC Enterprise T5440 4 32 256 1.6 UltraSPARC T2 Plus 28,000
5 x Sun Fire T2000 1 8 32 1.2 UltraSPARC T1 10,000

Results and Configuration Summary

Hardware Configuration:

    Sun SPARC Enterprise T5440
      4 x 1.6 GHz UltraSPARC T2 Plus processors
      256 GB
      STK2540 (6 x 146GB)

Software Configuration:

    Solaris 10 5/09
    Oracle BIEE 10.1.3.4 64-bit
    Oracle 11g R1 Database

Benchmark Description

The objective of this benchmark is to highlight how Oracle BI EE can support pervasive deployments in large enterprises, using minimal hardware, by simulating an organization that needs to support more than 25,000 active concurrent users, each operating in mixed mode: ad-hoc reporting, application development, and report viewing.

The user population was divided into a mix of administrative users and business users. A maximum of 28,000 concurrent users were actively interacting and working in the system during the steady-state period. The tests executed 580 transactions per second, with think times of 60 seconds per user, between requests. In the test scenario 95% of the workload consisted of business users viewing reports and navigating within dashboards. The remaining 5% of the concurrent users, categorized as administrative users, were doing application development.

The benchmark scenario used a typical business user sequence of dashboard navigation, report viewing, and drill down. For example, a Service Manager logs into the system and navigates to his own set of dashboards, viz. "Service Manager". The user then selects the "Service Effectiveness" dashboard, which shows him four distinct reports: "Service Request Trend", "First Time Fix Rate", "Activity Problem Areas", and "Cost Per Completed Service Call, 2002 till 2005". The user then proceeds to view the "Customer Satisfaction" dashboard, which also contains a set of 4 related reports. He then proceeds to drill down on some of the reports to see the detail data. Then the user proceeds to more dashboards, for example "Customer Satisfaction" and "Service Request Overview". After navigating through these dashboards, he logs out of the application.

This benchmark did not use a synthetic database schema. The benchmark tests were run on a full production version of the Oracle Business Intelligence Applications with a fully populated underlying database schema. The business processes in the test scenario closely represent a true customer scenario.

Key Points and Best Practices

Since the server has 32 cores, we created 4 Solaris Containers with 8 cores dedicated to each container. A total of four instances of BI Server + Presentation Server (collectively referred to as an 'instance' from here on) were installed, one instance per container. All four BI instances were clustered using the BI Cluster software components.

The ZFS file system was used to overcome the 'Too many links' error seen with ~28,000 concurrent users. Earlier runs had hit the UFS limit of 32,767 sub-directories (LINK_MAX) with ~28K users online, producing thousands of errors from the inability to create new directories beyond 32,767 within a single directory. The Web Catalog stores each user profile on disk by creating at least one dedicated directory per user, so with more than 25,000 concurrent users, ZFS is clearly the way to go.
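
A minimal sketch of one such container and of the Web Catalog file system follows; the zone name, paths, pool name, and the use of 64 dedicated CPUs (8 cores x 8 hardware threads) are illustrative assumptions, not the exact commands used in the benchmark.

    # One BI container with 8 dedicated cores (64 hardware threads on UltraSPARC T2 Plus)
    zonecfg -z bizone1 'create; set zonepath=/zones/bizone1; add dedicated-cpu; set ncpus=64; end; verify; commit'
    zoneadm -z bizone1 install
    zoneadm -z bizone1 boot

    # ZFS file system for the Web Catalog avoids the UFS 32,767 sub-directory (LINK_MAX) limit
    zfs create -o mountpoint=/export/webcat rpool/webcat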

See Also:

Oracle Business Intelligence Website,  BUSINESS INTELLIGENCE has other results

Disclosure Statement

Oracle Business Intelligence Enterprise Edition benchmark, see http://www.oracle.com/solutions/business_intelligence/resource-library-whitepapers.html for more. Results as of 7/20/09.

1.6 GHz SPEC CPU2006 - Rate Benchmarks

UltraSPARC T2 and T2 Plus Systems

Improved Performance Over 1.4 GHz

Reported 07/21/09

Significance of Results

Results are presented for the SPEC CPU2006 rate benchmarks run on the new 1.6 GHz Sun UltraSPARC T2 and Sun UltraSPARC T2 Plus processors based systems. The new processors were tested in the Sun CMT family of systems, including the Sun SPARC Enterprise T5120, T5220, T5240, T5440 servers and the Sun Blade T6320 server module.

SPECint_rate2006

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered 57% and 37% better results than the best 4-chip IBM POWER6+ based systems on the SPEC CPU2006 integer throughput metrics.

  • The Sun SPARC Enterprise T5240 server equipped with two 1.6 GHz UltraSPARC T2 Plus processor chips, produced 68% and 48% better results than the best 2-chip IBM POWER6+ based systems on the SPEC CPU2006 integer throughput metrics.

  • The single-chip 1.6 GHz UltraSPARC T2 processor-based Sun CMT servers produced 59% to 68% better results than the best single-chip IBM POWER6 based systems on the SPEC CPU2006 integer throughput metrics.

  • On the four-chip Sun SPARC Enterprise T5440 server, when compared versus the 1.4 GHz version of this server, the new 1.6 GHz UltraSPARC T2 Plus processor delivered performance improvements of 25% and 20% as measured by the SPEC CPU2006 integer throughput metrics.

  • The new 1.6 GHz UltraSPARC T2 Plus processor, when put into the 2-chip Sun SPARC Enterprise T5240 server, delivered improvements of 20% and 17% when compared to the 1.4 GHz UltraSPARC T2 Plus processor based server, as measured by the SPEC CPU2006 integer throughput metrics.

  • On the single-chip Sun Blade T6320 server module, Sun SPARC Enterprise T5120 and T5220 servers, the new 1.6 GHz UltraSPARC T2 processor delivered performance improvements of 13% to 17% over the 1.4 GHz version of these servers, as measured by the SPEC CPU2006 integer throughput metrics.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered a SPECint_rate_base2006 score 3X the best 4-chip Itanium based system.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processors, delivered a SPECint_rate_base2006 score of 338, a World Record score for 4-chip systems running a single operating system instance (i.e. SMP, not clustered).

SPECfp_rate2006

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered 35% and 22% better results than the best 4-chip IBM POWER6+ based systems on the SPEC CPU2006 floating-point throughput metrics.

  • The Sun SPARC Enterprise T5240 server, equipped with two 1.6 GHz UltraSPARC T2 Plus processor chips, produced 40% and 27% better results than the best 2-chip IBM POWER6+ based systems on the SPEC CPU2006 floating-point throughput metrics.

  • The single-chip 1.6 GHz UltraSPARC T2 processor based Sun CMT servers produced 18% to 24% better results than the best single-chip IBM POWER6 based systems on the SPEC CPU2006 floating-point throughput metrics.

  • On the four chip Sun SPARC Enterprise T5440 server, the new 1.6 GHz UltraSPARC T2 Plus processor delivered performance improvements of 20% and 17% when compared to 1.4 GHz processors in the same system, as measured by the SPEC CPU2006 floating-point throughput metrics.

  • The new 1.6 GHz UltraSPARC T2 Plus processor, when put into a Sun SPARC Enterprise T5240 server, delivered an improvement of 12% when compared to the 1.4 GHz UltraSPARC T2 Plus processor based server as measured by the SPEC CPU2006 floating-point throughput metrics.

  • On the single-processor Sun Blade T6320 server module and Sun SPARC Enterprise T5120 and T5220 servers, the new 1.6 GHz UltraSPARC T2 processor delivered a performance improvement of 10% to 11% over the 1.4 GHz version of these servers, as measured by the SPEC CPU2006 floating-point throughput metrics.

  • The Sun SPARC Enterprise T5440 server, equipped with four 1.6 GHz UltraSPARC T2 Plus processor chips, delivered a peak score 3X that of the best 4-chip Itanium based system, and a base score 2.9X, on the SPEC CPU2006 floating-point throughput metrics.

Performance Landscape

SPEC CPU2006 Performance Charts - bigger is better, selected results, please see www.spec.org for complete results. All results as of 7/17/09.

In the tables below
"Base" = SPECint_rate_base2006 or SPECfp_rate_base2006
"Peak" = SPECint_rate2006 or SPECfp_rate2006

SPECint_rate2006 results - 1 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
Supermicro X8DAI         4/1          Xeon W3570          3200  8            127    136    Best Nehalem result
HP ProLiant BL465c G6    6/1          Opteron 2435        2600  6            82.1   104    Best Istanbul result
Sun SPARC T5220          8/1          UltraSPARC T2       1582  63           89.1   97.0   New
Sun SPARC T5120          8/1          UltraSPARC T2       1582  63           89.1   97.0   New
Sun Blade T6320          8/1          UltraSPARC T2       1582  63           89.2   96.7   New
Sun Blade T6320          8/1          UltraSPARC T2       1417  63           76.4   85.5
Sun SPARC T5120          8/1          UltraSPARC T2       1417  63           76.2   83.9
IBM System p 570         2/1          POWER6              4700  4            53.2   60.9   Best POWER6 result

SPECint_rate2006 - 2 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
Fujitsu CELSIUS R670     8/2          Xeon W5580          3200  16           249    267    Best Nehalem result
Sun Blade X6270          8/2          Xeon X5570          2933  16           223    260
A+ Server 1021M-UR+B     12/2         Opteron 2439 SE     2800  12           168    215    Best Istanbul result
Sun SPARC T5240          16/2         UltraSPARC T2 Plus  1582  127          171    183    New
Sun SPARC T5240          16/2         UltraSPARC T2 Plus  1415  127          142    157
IBM Power 520            4/2          POWER6+             4700  8            101    124    Best POWER6+ peak
IBM Power 520            4/2          POWER6+             4700  8            102    122    Best POWER6+ base
HP Integrity rx2660      4/2          Itanium 9140M       1666  4            58.1   62.8   Best Itanium peak
HP Integrity BL860c      4/2          Itanium 9140M       1666  4            61.0   na     Best Itanium base

SPECint_rate2006 - 4 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
SGI Altix ICE 8200EX     16/4         Xeon X5570          2933  32           466    499    Best Nehalem result (clustered, not SMP)
Tyan Thunder n4250QE     24/4         Opteron 8439 SE     2800  24           326    417    Best Istanbul result
Sun SPARC T5440          32/4         UltraSPARC T2 Plus  1596  255          338    360    New. World record 4-chip SMP SPECint_rate_base2006
Sun SPARC T5440          32/4         UltraSPARC T2 Plus  1414  255          270    301
IBM Power 550            8/4          POWER6+             5000  16           215    263    Best POWER6 result
HP Integrity BL870c      8/4          Itanium 9150N       1600  8            114    na     Best Itanium result

SPECfp_rate2006 - 1 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
Supermicro X8DAI         4/1          Xeon W3570          3200  8            102    106    Best Nehalem result
HP ProLiant BL465c G6    6/1          Opteron 2435        2600  6            65.2   72.2   Best Istanbul result
Sun SPARC T5220          8/1          UltraSPARC T2       1582  63           64.1   68.5   New
Sun SPARC T5120          8/1          UltraSPARC T2       1582  63           64.1   68.5   New
Sun Blade T6320          8/1          UltraSPARC T2       1582  63           64.1   68.5   New
Sun Blade T6320          8/1          UltraSPARC T2       1417  63           58.1   62.3
Sun SPARC T5120          8/1          UltraSPARC T2       1417  63           57.9   62.3
Sun SPARC T5220          8/1          UltraSPARC T2       1417  63           57.9   62.3
IBM System p 570         2/1          POWER6              4700  4            51.5   58.0   Best POWER6 result

SPECfp_rate2006 - 2 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
ASUS TS700-E6            8/2          Xeon W5580          3200  16           201    207    Best Nehalem result
A+ Server 1021M-UR+B     12/2         Opteron 2439 SE     2800  12           133    147    Best Istanbul result
Sun SPARC T5240          16/2         UltraSPARC T2 Plus  1582  127          124    133    New
Sun SPARC T5240          16/2         UltraSPARC T2 Plus  1415  127          111    119
IBM Power 520            4/2          POWER6+             4700  8            88.7   105    Best POWER6+ result
HP Integrity rx2660      4/2          Itanium 9140M       1666  4            54.5   55.8   Best Itanium result

SPECfp_rate2006 - 4 chip systems

System                   Cores/Chips  Processor Type      MHz   Base Copies  Base   Peak   Comments
SGI Altix ICE 8200EX     16/4         Xeon X5570          2933  32           361    372    Best Nehalem result
Tyan Thunder n4250QE     24/4         Opteron 8439 SE     2800  24           259    285    Best Istanbul result
Sun SPARC T5440          32/4         UltraSPARC T2 Plus  1596  255          254    270    New
Sun SPARC T5440          32/4         UltraSPARC T2 Plus  1414  255          212    230
IBM Power 550            8/4          POWER6+             5000  16           188    222    Best POWER6+ result
HP Integrity rx7640      8/4          Itanium 2 9040      1600  8            87.4   90.8   Best Itanium result

Results and Configuration Summary

Test Configurations:


Sun Blade T6320
1.6 GHz UltraSPARC T2
64 GB (16 x 4GB)
Solaris 10 10/08
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5120/T5220
1.6 GHz UltraSPARC T2
64 GB (16 x 4GB)
Solaris 10 10/08
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5240
2 x 1.6 GHz UltraSPARC T2 Plus
128 GB (32 x 4GB)
Solaris 10 5/09
Sun Studio 12, Sun Studio 12 Update 1, gccfss V4.2.1

Sun SPARC Enterprise T5440
4 x 1.6 GHz UltraSPARC T2 Plus
256 GB (64 x 4GB)
Solaris 10 5/09
Sun Studio 12 Update 1, gccfss V4.2.1

Results Summary:



                         T6320   T5120   T5220   T5240   T5440
SPECint_rate_base2006    89.2    89.1    89.1    171     338
SPECint_rate2006         96.7    97.0    97.0    183     360
SPECfp_rate_base2006     64.1    64.1    64.1    124     254
SPECfp_rate2006          68.5    68.5    68.5    133     270

Benchmark Description

SPEC CPU2006 is SPEC's most popular benchmark, with over 7000 results published in the three years since it was introduced. It measures:

  • "Speed" - single copy performance of chip, memory, compiler
  • "Rate" - multiple copy (throughput)

The rate metrics are used for the throughput-oriented systems described on this page. These metrics include:

  • SPECint_rate2006: throughput for 12 integer benchmarks derived from real applications such as perl, gcc, XML processing, and pathfinding
  • SPECfp_rate2006: throughput for 17 floating point benchmarks derived from real applications, including chemistry, physics, genetics, and weather.

There are "base" variants of both the above metrics that require more conservative compilation, such as using the same flags for all benchmarks.

See www.spec.org for additional information.

Key Points and Best Practices

Results on this page for the Sun SPARC Enterprise T5120 server were measured on a Sun SPARC Enterprise T5220. The Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 are electronically equivalent. A Sun SPARC Enterprise T5120 can hold up to 4 disks, and a T5220 can hold up to 8. This system was tested with 4 disks; therefore, results on this page apply to both the T5120 and the T5220.

Know when you need throughput vs. speed. The Sun CMT systems described on this page provide massive throughput, as demonstrated by the fact that up to 255 jobs are run on the 4-chip system, 127 on 2-chip, and 63 on 1-chip. Some of the competitive chips do have a speed advantage - e.g. Nehalem and Istanbul - but none of the competitive results undertake to run the large number of jobs tested on Sun's CMT systems.

Use the latest compiler. The Sun Studio group is always working to improve the compiler. Sun Studio 12, and Sun Studio 12 Update 1, which are used in these submissions, provide updated code generation for a wide variety of SPARC and x86 implementations.

I/O still counts. Even in a CPU-intensive workload, some I/O remains. This point is explored in some detail at http://blogs.sun.com/jhenning/entry/losing_my_fear_of_zfs.

Disclosure Statement

SPEC, SPECint, SPECfp reg tm of Standard Performance Evaluation Corporation. Competitive results from www.spec.org as of 16 July 2009.  Sun's new results quoted on this page have been submitted to SPEC.
Sun Blade T6320 89.2 SPECint_rate_base2006, 96.7 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006;
Sun SPARC Enterprise T5220/T5120 89.1 SPECint_rate_base2006, 97.0 SPECint_rate2006, 64.1 SPECfp_rate_base2006, 68.5 SPECfp_rate2006;
Sun SPARC Enterprise T5240 172 SPECint_rate_base2006, 183 SPECint_rate2006, 124 SPECfp_rate_base2006, 133 SPECfp_rate2006;
Sun SPARC Enterprise T5440 338 SPECint_rate_base2006, 360 SPECint_rate2006, 254 SPECfp_rate_base2006, 270 SPECfp_rate2006;
Sun Blade T6320 76.4 SPECint_rate_base2006, 85.5 SPECint_rate2006, 58.1 SPECfp_rate_base2006, 62.3 SPECfp_rate2006;
Sun SPARC Enterprise T5220/T5120 76.2 SPECint_rate_base2006, 83.9 SPECint_rate2006, 57.9 SPECfp_rate_base2006, 62.3 SPECfp_rate2006;
Sun SPARC Enterprise T5240 142 SPECint_rate_base2006, 157 SPECint_rate2006, 111 SPECfp_rate_base2006, 119 SPECfp_rate2006;
Sun SPARC Enterprise T5440 270 SPECint_rate_base2006, 301 SPECint_rate2006, 212 SPECfp_rate_base2006, 230 SPECfp_rate2006;
IBM p 570 53.2 SPECint_rate_base2006, 60.9 SPECint_rate2006, 51.5 SPECfp_rate_base2006, 58.0 SPECfp_rate2006;
IBM Power 520 102 SPECint_rate_base2006, 124 SPECint_rate2006, 88.7 SPECfp_rate_base2006, 105 SPECfp_rate2006;
IBM Power 550 215 SPECint_rate_base2006, 263 SPECint_rate2006, 188 SPECfp_rate_base2006, 222 SPECfp_rate2006;
HP Integrity BL870c 114 SPECint_rate_base2006;
HP Integrity rx7640 87.4 SPECfp_rate_base2006, 90.8 SPECfp_rate2006.

New SPECjAppServer2004 Performance on the Sun SPARC Enterprise T5440

One Sun SPARC Enterprise T5440 server with four UltraSPARC T2 Plus processors at 1.6GHz delivered a single-system World Record result of 7661.16 SPECjAppServer2004 JOPS@Standard using Oracle WebLogic Server, a component of Oracle Fusion Middleware, together with Oracle Database 11g.

  • This benchmark used the Oracle WebLogic 10.3.1 Application Server and Oracle Database 11g Enterprise Edition. This benchmark result proves that the Sun SPARC Enterprise T5440 server using the UltraSPARC T2 Plus processor performs as an outstanding J2EE application server as well as an Oracle 11g OLTP database server.
  • The Sun SPARC Enterprise T5440 server (four 1.6 GHz UltraSPARC T2 Plus chips) running as the application server delivered 6.4X better performance than the best published single application server result from the IBM p 570 system based on the 4.7 GHz POWER6 processor.
  • The Sun SPARC Enterprise T5440 server (four 1.6 GHz UltraSPARC T2 Plus chips) demonstrated 73% better performance than the HP DL580 G5 result of 4410.07 SPECjAppServer2004 JOPS@Standard, which used four 2.66 GHz Intel 6-core Xeon processors.
  • The Sun SPARC Enterprise T5440 server (four 1.6 GHz UltraSPARC T2 Plus chips) demonstrated 2.3X better performance than the HP DL580 G5 result of 3339.94 SPECjAppServer2004 JOPS@Standard, which used four 2.93 GHz Intel 4-core Xeon processors.
  • One Sun SPARC Enterprise T5440 server (four 1.6 GHz UltraSPARC T2 Plus chips) demonstrated 1.9X better performance than the Dell PowerEdge R610 result of 3975.13 SPECjAppServer2004 JOPS@Standard, which used two 2.93 GHz Intel 4-core Xeon processors.
  • One Sun SPARC Enterprise T5440 server (four 1.6 GHz UltraSPARC T2 Plus chips) demonstrated 5% better performance than the two-node Dell PowerEdge R610 result of 7311.50 SPECjAppServer2004 JOPS@Standard, which used two R610 systems, each with two 2.93 GHz Intel 4-core Xeon processors.
  • These results were obtained using Sun Java SE 6 Update 14 Performance Release on the Sun SPARC Enterprise T5440 server and running the Solaris 10 5/09 Operating Environment.
  • The Sun SPARC Enterprise T5440 server used Solaris Containers technology to consolidate 7 Oracle Weblogic application server instances to achieve this result.
  • Oracle Fusion Middleware provides a family of complete, integrated, hot pluggable and best-of-breed products known for enabling enterprise customers to create and run agile and intelligent business applications. Oracle WebLogic Server’s on-going, record-setting Java application server performance demonstrates why so many customers rely on Oracle Fusion Middleware as their foundation for innovation.

Performance Landscape

SPECjAppServer2004 Performance Chart as of 7/20/2009. Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org. SPECjAppServer2004 JOPS@Standard (bigger is better)

Vendor and SPECjAppServer2004 JOPS@Standard, with J2EE server and DB server configurations:

Sun, 7661.16 JOPS@Standard
    J2EE Server: 1x Sun SPARC Enterprise T5440 (32 cores, 4 chips, 1.6 GHz US-T2 Plus), Oracle WebLogic 10.3.1
    DB Server:   1x Sun SPARC Enterprise T5440 (32 cores, 4 chips, 1.4 GHz US-T2 Plus), Oracle 11g DB 11.1.0.7

Dell, 7311.50 JOPS@Standard
    J2EE Server: 2x PowerEdge R610 (16 cores, 4 chips, 2.93 GHz Xeon X5570), Oracle WebLogic 10.3
    DB Server:   1x PowerEdge R900 (24 cores, 4 chips, 2.66 GHz Xeon X7460), Oracle 11g DB 11.1.0.7

Sun, 6334.86 JOPS@Standard
    J2EE Server: 1x Sun SPARC Enterprise T5440 (32 cores, 4 chips, 1.4 GHz US-T2 Plus), Oracle WebLogic 10.3
    DB Server:   1x Sun SPARC Enterprise T5440 (32 cores, 4 chips, 1.4 GHz US-T2 Plus), Oracle 11g DB 11.1.0.7

Dell, 4794.33 JOPS@Standard
    J2EE Server: 2x PowerEdge 2950 (16 cores, 4 chips, 3.3 GHz Xeon X5470), Oracle WebLogic 10.3
    DB Server:   1x PowerEdge R900 (24 cores, 4 chips, 2.66 GHz Xeon X7460), Oracle 11g DB 11.1.0.6

HP, 4410.07 JOPS@Standard
    J2EE Server: 1x ProLiant DL580 G5 (24 cores, 4 chips, 2.66 GHz Xeon X7460), Oracle WebLogic 10.3
    DB Server:   1x ProLiant DL580 G5 (24 cores, 4 chips, 2.66 GHz Xeon X7460), Oracle 11g DB 11.1.0.6

HP, 3975.13 JOPS@Standard
    J2EE Server: 1x Dell PowerEdge R610 (8 cores, 2 chips, 2.93 GHz Xeon X5570), Oracle WebLogic 10.3
    DB Server:   1x PowerEdge R900 (24 cores, 4 chips, 2.66 GHz Xeon X7460), Oracle 11g DB 11.1.0.7

IBM, 1197.51 JOPS@Standard
    J2EE Server: 1x IBM System p 570 (4 cores, 2 chips, 4.7 GHz POWER6), WebSphere Application Server V6.1
    DB Server:   1x IBM p5 550 (4 cores, 2 chips, 2.1 GHz POWER5+), IBM DB2 Universal Database 9.1

Results and Configuration Summary

Application Server:
    Sun SPARC Enterprise T5440
      4 x 1.6 GHz 8-core UltraSPARC T2 Plus
      256 GB memory
      2 x 10GbE XAUI NIC
      2 x 32GB SATA SSD
      Solaris 10 5/09
      Solaris Containers
      Oracle WebLogic 10.3.1 Application Server - Standard Edition
      Oracle Fusion Middleware
      JDK 1.6.0_14 Performance Release

Database Server:

    Sun SPARC Enterprise T5440
      4 x 1.4 GHz 8-core UltraSPARC T2 Plus
      256 GB memory
      6 x Sun StorageTek 2540 FC Array
      4 x Sun StorageTek 2501 FC Expansion Array
      Solaris 10 5/09
      Oracle Database Enterprise Edition Release 11.1.0.7

Benchmark Description

SPECjAppServer2004 (Java Application Server) is a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. SPECjAppServer2004 is an end-to-end application which exercises all major J2EE technologies implemented by compliant application servers as follows:
  • The web container, including servlets and JSPs
  • The EJB container
  • EJB2.0 Container Managed Persistence
  • JMS and Message Driven Beans
  • Transaction management
  • Database connectivity
Moreover, SPECjAppServer2004 also heavily exercises all parts of the underlying infrastructure that make up the application environment, including hardware, JVM software, database software, JDBC drivers, and the system network. The primary metric of the SPECjAppServer2004 benchmark is jAppServer Operations Per Second (JOPS) which is calculated by adding the metrics of the Dealership Management Application in the Dealer Domain and the Manufacturing Application in the Manufacturing Domain. There is NO price/performance metric in this benchmark.

Key Points and Best Practices

  • 7 Oracle WebLogic server instances on the Sun SPARC Enterprise T5440 server were hosted in separate Solaris Containers to demonstrate consolidation of multiple application servers.
  • Each appserver container was bound to a separate processor set, each containing 4 cores. This was done to improve performance by reducing memory access latency, using the physical memory closest to the processors. The default set was used for network and disk interrupt handling. (A hedged sketch of the processor-set and scheduling-class commands follows this list.)
  • The Oracle WebLogic application servers were executed in the FX scheduling class to improve performance by reducing the frequency of context switches.
  • The Oracle database processes were run in 4 processor sets using the psrset utility and executed in the FX scheduling class. This was done to improve performance by reducing memory access latency and reducing context switches.
  • The Oracle Log Writer process was run in a separate processor set containing 1 core and executed in the RT scheduling class. This was done to ensure that the Log Writer made the most efficient use of CPU resources.
  • Enhancements to the JVM had a major impact on performance.
  • The Sun SPARC Enterprise T5440 used 2x 10GbE NICs for network traffic from the driver systems.
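
The bindings described above can be approximated with the stock Solaris psrset and priocntl utilities. The following is a hedged sketch; the CPU IDs and process IDs are placeholders, not the exact commands used in the submission:

  # Hypothetical processor-set and scheduling-class setup (all IDs are placeholders).

  # Create a processor set for one group of 4 cores and bind a database process to it.
  # On these CMT chips 4 cores are 32 hardware threads, so the real list would
  # name all 32 CPU IDs; only a few are shown here.
  psrset -c 0 1 2 3 4 5 6 7
  psrset -b 1 12345                # assume the new pset got ID 1; 12345 is a placeholder PID

  # Run a WebLogic application server JVM in the FX scheduling class
  priocntl -s -c FX -i pid 23456

  # Dedicate one core to the Oracle Log Writer and run it in the RT class
  psrset -c 32 33 34 35 36 37 38 39
  psrset -b 2 34567
  priocntl -s -c RT -i pid 34567   # RT class requires appropriate privileges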

See Also

Disclosure Statement

SPECjAppServer2004, Sun SPARC Enterprise T5440 (4 chips, 32 cores) 7661.16 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 24 cores) 4410.07 SPECjAppServer2004 JOPS@Standard; HP DL580 G5 (4 chips, 16 cores) 3339.94 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge 2950 (4 chips, 16 cores) 4794.33 SPECjAppServer2004 JOPS@Standard; Dell PowerEdge R610 (2 chips, 8 cores) 3975.13 SPECjAppServer2004 JOPS@Standard; Two Dell PowerEdge R610 (4 chips, 16 cores) 7311.50 SPECjAppServer2004 JOPS@Standard; IBM p570 (2 chips, 4 cores) 1197.51 SPECjAppServer2004 JOPS@Standard; SPEC, SPECjAppServer reg tm of Standard Performance Evaluation Corporation. Results from http://www.spec.org as of 7/20/09

Friday Jul 03, 2009

SPECmail2009 on Sun Fire X4275+Sun Storage 7110: Mail Server System Solution

Significance of Results

Sun has a new SPECmail2009 result on a Sun Fire X4275 server and Sun Storage 7110 Unified Storage System running Sun Java Messaging server 6.2.  OpenStorage and ZFS were a key part of the new World Record SPECmail2009.

  • The Sun Fire X4275 server, equipped with two 2.93 GHz quad-core Intel Xeon X5570 processors and running the Sun Java System Messaging Server 6.2 on Solaris 10, achieved a new World Record 8000 SPECmail_Ent2009 IMAP4 users at 38,348 Sessions/hour.

  • The Sun result was obtained using about half the disk spindles of Apple's direct-attached storage solution, thanks to the Sun Storage 7110 Unified Storage System. The Sun submission is the first result using a NAS solution, specifically two Sun Storage 7110 Unified Storage Systems.
  • This benchmark result clearly demonstrates that the Sun Fire X4275 server together with Sun Java System Messaging Server 6.2, Solaris 10 on Sun Storage 7110 Unified Storage Systems can support a large, enterprise level IMAP mail server environment as a reliable low cost solution, delivering best performance and maximizing data integrity with ZFS.

SPECmail 2009 Performance Landscape (ordered by performance)

System                       Processor       GHz    Ch, Co, Th   SPECmail_Ent2009 Users   SPECmail2009 Sessions/hour
Sun Fire X4275               Intel X5570     2.93   2, 8, 16     8000                     38,348
Apple Xserv3,1               Intel X5570     2.93   2, 8, 16     6000                     28,887
Sun SPARC Enterprise T5220   UltraSPARC T2   1.4    1, 8, 64     3600                     17,316

Notes:
    Number of SPECmail_Ent2009 users (bigger is better)
    SPECmail2009 Sessions/hour (bigger is better)
    Ch, Co, Th: Chips, Cores, Threads

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4275
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      72 GB
      12 x 300GB, 10000 RPM SAS disk

    2 x Sun Storage 7110 Unified Storage System, each with
      16 x 146GB SAS 10K RPM

Software Configuration:

    O/S: Solaris 10
    ZFS
    Mail Server: Sun Java System Messaging Server 6.2

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet today's demanding e-mail users over fast corporate local area networks (LAN). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using industry standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000 user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each XTA7110 was configured as a 1x RAID1 (14x 143GB) volume with 2 NFS shared LUNs, accessed by the SUT via NFSv4. The mailstore volumes were mounted with the NFS mount options nointr, hard, xattr (see the mount sketch after this list). There was a total of 4 Gb/sec of network connectivity between the SUT and the 7110 Unified Storage Systems, using 2 NorthStar dual 1 Gb/sec NICs.
  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap
  • See the SPEC Report for all OS, network and messaging server tunings.
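
A hedged example of mounting one of the 7110 shares with the options listed above; the server name, export path and mount point are placeholders:

  # Hypothetical NFSv4 mount of a 7110 mailstore share.
  mkdir -p /mailstore1
  mount -F nfs -o vers=4,nointr,hard,xattr nas7110-1:/export/mailstore1 /mailstore1

  # Equivalent /etc/vfstab entry for a persistent mount:
  # nas7110-1:/export/mailstore1  -  /mailstore1  nfs  -  yes  vers=4,nointr,hard,xattr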

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 07/06/2009 on www.spec.org. SPECmail2009: Sun Fire X4275 (8 cores, 2 chips) SPECmail_Ent2009 8000 users at 38,348 SPECmail2009 Sessions/hour. Apple Xserv3,1 (8 cores, 2 chips) SPECmail_Ent2009 6000 users at 28,887 SPECmail2009 Sessions/hour.

Wednesday Jun 17, 2009

Performance of Sun 7410 and 7310 Unified Storage Array Line

Roch (rhymes with Spock) Bourbonnais posted more data showing the performance of Sun's OpenStorage products. Some of his basic conclusions are:

  • The Sun Storage 7410 Unified Storage Array delivers 1 GB/sec throughput performance.
  • The Sun Storage 7310 Unified Storage Array delivers over 500 MB/sec on streaming writes for backups and imaging applications.
  • The Sun Storage 7410 Unified Storage Array delivers over 22,000 8K synchronous writes per second, combining great DB performance and ease of deployment of Network Attached Storage while delivering the economic benefits of inexpensive SATA disks.
  • The Sun Storage 7410 Unified Storage Array delivers over 36,000 random 8K reads per second from a 400GB working set for great Mail application responsiveness. This corresponds to an enterprise of 100,000 people with every employee accessing new data every 3.6 seconds, consolidated on a single server.

You can read more about it at: http://blogs.sun.com/roch/entry/compared_performance_of_sun_7000

Monday Jun 15, 2009

Sun Fire X4600 M2 Server Two-tier SAP ERP 6.0 (Unicode) Standard Sales and Distribution (SD) Benchmark

Significance of Results

  • World Record performance result with 8 processors on the two-tier SAP ERP 6.0 enhancement pack 4 (unicode) standard sales and distribution (SD) benchmark as of June 10, 2009.
  • The Sun Fire X4600 M2 server with 8 AMD Opteron 8384 SE processors (32 cores, 32 threads) achieved 6,050 SAP SD Benchmark users running SAP ERP application release 6.0 enhancement pack 4 benchmark with unicode software, using MaxDB 7.8 database and Solaris 10 OS.
  • This benchmark result highlights the optimal performance of SAP ERP on Sun Fire servers running the Solaris OS and the seamless multilingual support available for systems running SAP applications.
  • ZFS is used in this benchmark for its database and log files.
  • The Sun Fire X4600 M2 server beats both the HP ProLiant DL785 G5 and the NEC Express5800 running Windows by 10% and 35% respectively even though all three systems use the same number of processors.
  • In January 2009, a new version, the Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark, was released. This new release has higher CPU requirements and so yields 25-50% fewer users compared to the previous Two-tier SAP ERP 6.0 (non-unicode) Standard Sales and Distribution (SD) Benchmark; 10-30% of this is due to the extra overhead from processing the larger character strings required by Unicode encoding. Refer to the SAP Note for more details (https://service.sap.com/sap/support/notes/1139642 Note: User and password for SAP Service Marketplace required).

  • Unicode is a computing standard that allows for the representation and manipulation of text expressed in most of the world's writing systems. Before the Unicode requirement, this benchmark used ASCII characters meaning each was just 1 byte. The new version of the benchmark requires Unicode characters and the Application layer (where ~90% of the cycles in this benchmark are spent) uses a new encoding, UTF-16, which uses 2 bytes to encode most characters (including all ASCII characters) and 4 bytes for some others. This requires computers to do more computation and use more bandwidth and storage for most character strings. Refer to the above SAP Note for more details.
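
The size difference is easy to see by piping a short ASCII string through iconv; this is just a hedged illustration, not part of the benchmark kit, and codeset names can vary between iconv implementations:

  # ASCII encoding: 1 byte per character
  printf 'ORDER-12345' | wc -c                                # 11 bytes
  # UTF-16 (little-endian, no BOM): 2 bytes per ASCII character
  printf 'ORDER-12345' | iconv -f ASCII -t UTF-16LE | wc -c   # 22 bytes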

Performance Landscape

SAP-SD 2-Tier Performance Table (in decreasing performance order).

SAP ERP 6.0 Enhancement Pack 4 (Unicode) Results
(New version of the benchmark as of January 2009)

System: Sun Fire X4600 M2 (8x AMD Opteron 8384 SE @ 2.7 GHz, 256 GB)
    OS / Database: Solaris 10, MaxDB 7.8
    Users: 6,050    SAP ERP/ECC Release: 2009 / 6.0 EP4 (Unicode)    SAPS: 33,230    SAPS/Proc: 4,154    Date: 10-Jun-09

System: HP ProLiant DL785 G5 (8x AMD Opteron 8393 SE @ 3.1 GHz, 128 GB)
    OS / Database: Windows Server 2008 Enterprise Edition, SQL Server 2008
    Users: 5,518    SAP ERP/ECC Release: 2009 / 6.0 EP4 (Unicode)    SAPS: 30,180    SAPS/Proc: 3,772    Date: 24-Apr-09

System: NEC Express 5800 (8x Intel Xeon X7460 @ 2.66 GHz, 256 GB)
    OS / Database: Windows Server 2008 Datacenter Edition, SQL Server 2008
    Users: 4,485    SAP ERP/ECC Release: 2009 / 6.0 EP4 (Unicode)    SAPS: 25,280    SAPS/Proc: 3,160    Date: 09-Feb-09

System: Sun Fire X4270 (2x Intel Xeon X5570 @ 2.93 GHz, 48 GB)
    OS / Database: Solaris 10, Oracle 10g
    Users: 3,700    SAP ERP/ECC Release: 2009 / 6.0 EP4 (Unicode)    SAPS: 20,300    SAPS/Proc: 10,150    Date: 30-Mar-09

SAP ERP 6.0 (non-unicode) Results
(Old version of the benchmark retired at the end of 2008)

System: Sun Fire X4600 M2 (8x AMD Opteron 8384 @ 2.7 GHz, 128 GB)
    OS / Database: Solaris 10, MaxDB 7.6
    Users: 7,825    SAP ERP/ECC Release: 2005 / 6.0    SAPS: 39,270    SAPS/Proc: 4,909    Date: 09-Dec-08

System: IBM System x3650 (2x Intel Xeon X5570 @ 2.93 GHz, 48 GB)
    OS / Database: Windows Server 2003 EE, DB2 9.5
    Users: 5,100    SAP ERP/ECC Release: 2005 / 6.0    SAPS: 25,530    SAPS/Proc: 12,765    Date: 19-Dec-08

System: HP ProLiant DL380 G6 (2x Intel Xeon X5570 @ 2.93 GHz, 48 GB)
    OS / Database: Windows Server 2003 EE, SQL Server 2005
    Users: 4,995    SAP ERP/ECC Release: 2005 / 6.0    SAPS: 25,000    SAPS/Proc: 12,500    Date: 15-Dec-08

Complete benchmark results may be found at the SAP benchmark website http://www.sap.com/benchmark.

Results and Configuration Summary

Hardware Configuration:

    One, Sun Fire X4600 M2
      8 x 2.7 GHz AMD Opteron 8384 SE processors (8 processors / 32 cores / 32 threads)
      256 GB memory
      3 x STK2540, 3 x STK2501 each with 12 x 146GB/15KRPM disks

Software Configuration:

    Solaris 10
    SAP ECC Release: 6.0 Enhancement Pack 4 (Unicode)
    MaxDB 7.8

Certified Results

    Performance:
    6050 benchmark users
    SAP Certification:
    2009022

Key Points and Best Practices

  • This is the best 8 Processor SAP ERP 6.0 EP4 (Unicode) result as of June 10, 2009.
  • Two-tier SAP ERP 6.0 Enhancement Pack 4 (Unicode) Standard Sales and Distribution (SD) Benchmark on Sun Fire X4600 M2 (8 processors, 32 cores, 32 threads, 8x2.7 GHz AMD Opteron 8384 SE) was able to support 6,050 SAP SD Users on top of the Solaris 10 OS.
  • Since random writes are an important part of this benchmark, we used ZFS to help coalesce them into sequential writes (a hedged layout sketch follows this list).
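
A minimal sketch of such a layout, assuming a hypothetical pool name, devices and record sizes; the benchmark's exact ZFS settings are not reproduced here:

  # Hypothetical ZFS layout for the database data and log areas.
  # Pool name, devices and recordsize values are examples only.
  zpool create sappool c2t0d0 c2t1d0 c2t2d0 c2t3d0
  zfs create -o recordsize=8k -o mountpoint=/sapdb/data sappool/data   # assumed 8 KB DB page size
  zfs create -o mountpoint=/sapdb/log sappool/log                      # log writes are sequential anyway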

Benchmark Description

The SAP Standard Application SD (Sales and Distribution) Benchmark is a two-tier ERP business test that is indicative of full business workloads of complete order processing and invoice processing, and demonstrates the ability to run both the application and database software on a single system. The SAP Standard Application SD Benchmark represents the critical tasks performed in real-world ERP business environments.

SAP is one of the premier world-wide ERP application providers, and maintains a suite of benchmark tests to demonstrate the performance of competitive systems on the various SAP products.

Disclosure Statement

Two-tier SAP Sales and Distribution (SD) standard SAP ERP 6.0 2005/EP4 (Unicode) application benchmarks as of 06/10/09: Sun Fire X4600 M2 (8 processors, 32 cores, 32 threads) 6,050 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384 SE, 256 GB memory, MaxDB 7.8, Solaris 10, Cert# 2009022. HP ProLiant DL785 G5 (8 processors, 32 cores, 32 threads) 5,518 SAP SD Users, 8x 3.1 GHz AMD Opteron 8393 SE, 128 GB memory, SQL Server 2008, Windows Server 2008 Enterprise Edition, Cert# 2009009. NEC Express 5800 (8 processors, 48 cores, 48 threads) 4,485 SAP SD Users, 8x 2.66 GHz Intel Xeon X7460, 256 GB memory, SQL Server 2008, Windows Server 2008 Datacenter Edition, Cert# 2009001. Sun Fire X4270 (2 processors, 8 cores, 16 threads) 3,700 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, Oracle 10g, Solaris 10, Cert# 2009005. Sun Fire X4600M2 (8 processors, 32 cores, 32 threads) 7,825 SAP SD Users, 8x 2.7 GHz AMD Opteron 8384, 128 GB memory, MaxDB 7.6, Solaris 10, Cert# 2008070. IBM System x3650 M2 (2 Processors, 8 Cores, 16 Threads) 5,100 SAP SD users,2x 2.93 Ghz Intel Xeon X5570, DB2 9.5, Windows Server 2003 Enterprise Edition, Cert# 2008079. HP ProLiant DL380 G6 (2 processors, 8 cores, 16 threads) 4,995 SAP SD Users, 2x 2.93 GHz Intel Xeon x5570, 48 GB memory, SQL Server 2005, Windows Server 2003 Enterprise Edition, Cert# 2008071.

SAP, R/3, reg TM of SAP AG in Germany and other countries. More info www.sap.com/benchmark

Thursday Jun 11, 2009

SAS Grid Computing 9.2 utilizing the Sun Storage 7410 Unified Storage System

Sun has demonstrated the first large scale grid validation for the SAS Grid Computing 9.2 benchmark. This workload showed both the strength of Solaris 10 utilizing containers for ease of deployment and the value of the Sun Storage 7410 Unified Storage System and Fishworks Analytics in analyzing, tuning and delivering performance in complex, data-intensive, multi-node SAS grid environments.

In order to model the real world, the Grid Endurance Test uses large data sizes and complex data processing. The results reflect real customer scenarios. These benchmark results represent significant engineering effort, collaboration and coordination between SAS and Sun. They also illustrate the commitment of the two companies to provide the best solutions for the most demanding data integration requirements.

  • A combination of 7 Sun Fire X2200 M2 servers utilizing Solaris 10 and a Sun Storage 7410 Unified Storage System showed continued performance improvement as the node count increased from 2 to 7 nodes for the Grid Endurance Test.
  • SAS environments are often complex. Ease of deployment, configuration, use, and ability to observe application IO characteristics (hotspots, trouble areas) are critical for production environments. The power of Fishworks Analytics combined with the reliability of ZFS is a perfect fit for these types of applications.
  • Sun Storage 7410 Unified Storage System (exporting via NFS) satisfied performance needs, throughput peaking at over  900MB/s (near 10GbE line speed) in this multi-node environment.
  • Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were trivially migrated (within minutes) as HW resources became available (Grid expanded).
  • This result is the only large scale grid validation for SAS Grid Computing 9.2, and the first and most timely qualification of OpenStorage for SAS.
  • The tests showed delivered throughput of over 100MB/s through a client's 1Gb connection.

Configuration

The test grid consisted of 8 Sun Fire X2200 M2 servers, 1 configured as the grid manager and 7 as the actual grid nodes. Each node had a 1GbE connection through a Brocade FastIron 1GbE/10GbE switch. The 7410 had a 10GbE connection to the switch and sat as the back end storage, providing the common shared file system to all nodes that SAS Grid Computing requires. A storage appliance like the 7410 serves as an easy to set up and maintain solution, satisfying the bandwidth required by the grid. Our particular 7410 consisted of 46 700GB 7200RPM SATA drives, 36GB of write-optimized SSDs and 300GB of read-optimized SSDs.


About the Test

The workload is a batch mixture.  CPU bound workloads are numerically intensive tests, some using tables varying in row count from  9,000 to almost 200,000.  The tables have up to 297 variables, and are processed with both stepwise linear regression and stepwise logistic regression.   Other computational tests use GLM (General Linear Model).  IO intensive jobs vary as well.  One particular test reads raw data from multiple files, then generates 2 SAS data sets, one containing over 5 million records, the 2nd over 12 million.  Another IO intensive job creates a 50 million record SAS data set, then subsequently does lookups against it and finally sorts it into a dimension table.   Finally, other jobs are both compute and IO intensive.

 The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access, as can be viewed via Fishworks Analytics further below.  The typical block size for IO is 32KB. 

Governing the batch is the SAS Grid Manager Scheduler, Platform LSF. It determines when to add a job to a node based on the number of open job slots (user defined) and a point-in-time sample of how busy the node actually is. From run to run, jobs end up scheduled randomly, making runs less predictable. Inevitably, multiple IO intensive jobs will get scheduled on the same node, throttling the 1Gb connection and creating a bottleneck while other nodes do little to no IO. Often this is unavoidable due to the great variety in behavior a SAS program can go through during its lifecycle. For example, a program can start out as CPU intensive and be scheduled on a node processing an IO intensive job. This is the desired behavior and the correct decision based on that point in time. However, the initially CPU intensive job can then turn IO intensive as it proceeds through its lifecycle.


Results of scaling up node count

Below is a chart of results scaling from 2 to 7 nodes.  The metric is total run time from when the 1st job is scheduled, until the last job is completed.

Scaling of 400 Analytics Batch Workload
Number of Nodes    Time to Completion
2                  6 hr 19 min
3                  4 hr 35 min
4                  3 hr 30 min
5                  3 hr 12 min
6                  2 hr 54 min
7                  2 hr 44 min

One may note that time to completion is not linear as node count scales upwards.   To a large extent this is due to the nature of the workload as explained above regarding 1Gb connections getting saturated.  If this were a highly tuned benchmark with jobs placed with high precision, we certainly could have improved run time.  However, we did not do this in order to keep the batch as realistic as possible.  On the positive side, we do continue to see improved run times up to the 7th node.

The Fishworks Analytics displays below show several performance statistics with varying numbers of nodes, with more nodes on the left and fewer on the right.  The first two graphs show file operations per second, and the third shows network bytes per second.   The 7410 provides over 900 MB/sec in the seven-node test.  More information about the interpretation of the Fishworks data for these test will be provided in a later white paper.

An impressive part of the Fishworks Analytics shot above is that throughput of 763MB/s was achieved during the sample period. That wasn't the top end of what the 7410 could provide. For the tests summarized in the table above, the 7 node run peaked at over 900MB/s through a single 10GbE connection. Clearly the 7410 can sustain a fairly high level of IO.

It is also important to note that while we did try to emulate a real world scenario with varying types of jobs and well over 1TB of data being manipulated during the batch, this is a benchmark. The workload tries to encompass a large variety of job behavior. Your scenario may differ quite a bit from what was run here. Along with scheduling issues, we were certainly seeing signs of pushing this 7410 configuration near its limits (with the SAS IO pattern and data set sizes), which also affected the ability to achieve linear scaling. But many grid environments are running workloads that aren't very IO intensive and tend to be more CPU bound with minimal IO requirements. In that scenario one could expect to see excellent node scaling well beyond what was demonstrated by this batch. To demonstrate this, the batch was run sans the IO intensive jobs. These jobs do require some IO, but tend to be restricted to 25MB/s or less per process and only for the purpose of initially reading a data set or writing results.

  • 3 nodes ran in 120 minutes
  • 7 nodes ran in 58 minutes
Very nice scaling, near linear, especially given the lag time that can occur when scheduling batch jobs. The point of this exercise: know your workload. In this case, the 7410 solution on the back end was more than up to the demands these 350+ jobs put on it, and there was still room to grow and scale out more nodes, further reducing overall run time.


Tuning(?)

The question mark is actually appropriate. For the achieved results, after configuring a RAID1 share on the 7410, only 1 parameter made a significant difference. During the IO intensive periods, single 1Gb client throughput was observed at 120MB/s simplex and 180MB/s duplex, producing well over 100,000 interrupts a second. Jumbo frames were enabled on the 7410 and clients, reducing interrupts by almost 75% and reducing IO intensive job run time by an average of 12%. Many other NFS, Solaris and tcp/ip tunings were tried, with no meaningful improvement in microbenchmarks or the actual batch. A nice, relatively simple (for a grid) setup.
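
A hedged sketch of enabling jumbo frames on a Solaris client; the interface name is a placeholder, and depending on the NIC driver the larger MTU may also have to be enabled in the driver's .conf file, on the switch and on the 7410:

  # Hypothetical example: raise the MTU to 9000 bytes on the client data interface.
  # 'nxge0' is a placeholder interface name.
  ifconfig nxge0 mtu 9000
  ifconfig nxge0              # verify the new MTU in the output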

Not a direct tuning, but an application change worth mentioning came from the visibility that Analytics provides. Early on during the load phase of the benchmark, the IO rate was less than spectacular. What should have taken about 4.5 hours was going to take almost a day. Drilling down through Analytics showed us that hundreds of thousands of file opens/closes were occurring that the development team had been unaware of. Once that was fixed, the data loader ran at expected rates.


Okay - Really no other tuning?  How about 10GbE!

Alright, so there was something else we tried which was outside the test results achieved above. The X2200 we were using is an 8-core box. Even when maxing out the 1Gb testing with multiple IO bound jobs, there were still CPU resources left over. Considering that a higher core count with more memory is becoming the standard when referencing a "client", it makes sense to utilize all those resources. In the case where a node would be scheduled with multiple IO jobs, we wanted to see if 10GbE could potentially push up client throughput. Through our testing, two things helped improve performance.

The first was to turn off interrupt blanking. With blanking disabled, packets are processed when they arrive as opposed to being processed when an interrupt is issued. Doing this resulted in a ~15% increase in duplex throughput. Caveat: there is a reason interrupt blanking exists, and it isn't to slow down your network throughput. Tune this only if you have a decent amount of idle CPU, as disabling interrupt blanking will consume it. The other piece that resulted in a significant increase in throughput through the 10GbE NIC was to use multiple NFS client processes. We achieved this through zones. By adding a second zone, throughput through the single 10GbE interface increased ~30%. The final duplex numbers (also peak throughput) were:

  • 288MB/s no tuning
  • 337MB/s interrupt blanking disabled
  • 430MB/s 2 NFS client processes + interrupt blanking disabled
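
One way to get that second NFS client instance on the same box is a small non-global zone that mounts the same share; this is a hedged sketch with placeholder zone, interface, address and path names, not the configuration actually used:

  # Hypothetical second zone acting as an additional NFS client over the 10GbE link.
  zonecfg -z sasclient2 "create; set zonepath=/zones/sasclient2; set autoboot=true; commit"
  zonecfg -z sasclient2 "add net; set physical=nxge0; set address=192.168.1.52/24; end; commit"
  zoneadm -z sasclient2 install
  zoneadm -z sasclient2 boot

  # Inside the zone, mount the same 7410 share the global zone uses:
  #   mount -F nfs -o vers=4 nas7410:/export/sasdata /sasdata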


Conclusion - what does this show?

  • SAS Grid Computing which requires a shared file system between all nodes, can fit in very nicely on the 7410 storage appliance.  The workload continues to scale while adding nodes.
  • The 7410 can provide very solid throughput peaking at over 900MB/s (near 10GbE linespeed) with the configuration tested. 
  • The 7410 is easy to set up, gives an incredible depth of knowledge about the IO your application does which can lead to optimization. 
  • Know your workload, in many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price while providing the benefits described (and others not described) above.
  • 10GbE client networking can be a help if your 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU overhead.


Additional Reading

Sun.com on SAS Grid Computing and the Sun Storage 7410 Unified Storage Array

Description of SAS Grid Computing


Wednesday Jun 03, 2009

Wide Variety of Topics to be discussed on BestPerf

A sample of the various Sun and partner technologies to be discussed:
OpenSolaris, Solaris, Linux, Windows, vmware, gcc, Java, Glassfish, MySQL, Sun-Studio, ZFS, dtrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD
