Friday Jul 03, 2009

SPECmail2009 on Sun Fire X4275+Sun Storage 7110: Mail Server System Solution

Significance of Results

Sun has a new SPECmail2009 result on a Sun Fire X4275 server and Sun Storage 7110 Unified Storage System running Sun Java System Messaging Server 6.2.  OpenStorage and ZFS were a key part of this new World Record SPECmail2009 result.

  • The Sun Fire X4275 server, equipped with two 2.93 GHz quad-core Intel Xeon X5570 processors and running Sun Java System Messaging Server 6.2 on Solaris 10, achieved a new World Record of 8000 SPECmail_Ent2009 IMAP4 users at 38,348 Sessions/hour.

  • The Sun result was obtained with about half the disk spindles of Apple's direct-attached storage solution by using the Sun Storage 7110 Unified Storage System.  The Sun submission is the first result to use a NAS solution, specifically two Sun Storage 7110 Unified Storage Systems.
  • This benchmark result clearly demonstrates that the Sun Fire X4275 server, together with Sun Java System Messaging Server 6.2 and Solaris 10 on Sun Storage 7110 Unified Storage Systems, can support a large, enterprise-level IMAP mail server environment as a reliable, low-cost solution, delivering top performance and maximizing data integrity with ZFS.

SPECmail2009 Performance Landscape (ordered by performance)

System                       Processor       GHz    Ch, Co, Th   SPECmail_Ent2009   SPECmail2009
                                                                 Users              Sessions/hour
Sun Fire X4275               Intel X5570     2.93   2, 8, 16     8000               38,348
Apple Xserv3,1               Intel X5570     2.93   2, 8, 16     6000               28,887
Sun SPARC Enterprise T5220   UltraSPARC T2   1.4    1, 8, 64     3600               17,316

Notes:
    Number of SPECmail_Ent2009 users (bigger is better)
    SPECmail2009 Sessions/hour (bigger is better)
    Ch, Co, Th: Chips, Cores, Threads

Complete benchmark results may be found at the SPEC benchmark website http://www.spec.org

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X4275
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      72 GB memory
      12 x 300 GB 10,000 RPM SAS disks

    2 x Sun Storage 7110 Unified Storage System, each with
      16 x 146GB SAS 10K RPM

Software Configuration:

    O/S: Solaris 10
    ZFS
    Mail Server: Sun Java System Messaging Server 6.2

Benchmark Description

The SPECmail2009 benchmark measures the ability of corporate e-mail systems to meet the needs of today's demanding e-mail users over fast corporate local area networks (LANs). The SPECmail2009 benchmark simulates corporate mail server workloads that range from 250 to 10,000 or more users, using the industry-standard SMTP and IMAP4 protocols. This e-mail server benchmark creates client workloads based on a 40,000-user corporation, and uses folder and message MIME structures that include both traditional office documents and a variety of rich media content. The benchmark also adds support for encrypted network connections using industry-standard SSL v3.0 and TLS 1.0 technology. SPECmail2009 replaces all versions of SPECmail2008, first released in August 2008. The results from the two benchmarks are not comparable.

Software on one or more client machines generates a benchmark load for a System Under Test (SUT) and measures the SUT response times. A SUT can be a mail server running on a single system or a cluster of systems.

A SPECmail2009 'run' simulates a 100% load level associated with the specific number of users, as defined in the configuration file. The mail server must maintain a specific Quality of Service (QoS) at the 100% load level to produce a valid benchmark result. If the mail server does maintain the specified QoS at the 100% load level, the performance of the mail server is reported as SPECmail_Ent2009 SMTP and IMAP Users at SPECmail2009 Sessions per hour. The SPECmail_Ent2009 users at SPECmail2009 Sessions per Hour metric reflects the unique workload combination for a SPEC IMAP4 user.

Key Points and Best Practices

  • Each XTA7110 was configured as one RAID1 volume (14 x 143GB) with 2 NFS shared LUNs, accessed by the SUT via NFSv4. The mailstore volumes were mounted with the NFS mount options nointr, hard, and xattr (an example mount command is shown after this list). There was a total of 4 Gb/sec of network connectivity between the SUT and the 7110 Unified Storage Systems, using 2 NorthStar dual 1 Gb/sec NICs.
  • The clients used these Java options: java -d64 -Xms4096m -Xmx4096m -XX:+AggressiveHeap
  • See the SPEC Report for all OS, network and messaging server tunings.
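
For illustration, a minimal sketch of an equivalent client-side mount on Solaris 10; the share path and mount point are hypothetical (the real export names are in the SPEC report):

    mount -F nfs -o vers=4,hard,nointr,xattr 7110-a:/export/mailstore1 /mailstore1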

See Also

Disclosure Statement

SPEC, SPECmail reg tm of Standard Performance Evaluation Corporation. Results as of 07/06/2009 on www.spec.org. SPECmail2009: Sun Fire X4275 (8 cores, 2 chips) SPECmail_Ent2009 8000 users at 38,348 SPECmail2009 Sessions/hour. Apple Xserv3,1 (8 cores, 2 chips) SPECmail_Ent2009 6000 users at 28,887 SPECmail2009 Sessions/hour.

Tuesday Jun 16, 2009

Sun Fire X2270 MSC/Nastran Vendor_2008 Benchmarks

Significance of Results

The I/O intensive MSC/Nastran Vendor_2008 benchmark test suite was used to compare the performance on a Sun Fire X2270 server when using SSDs internally instead of HDDs.

The effect on performance from increasing memory to augment I/O caching was also examined. The Sun Fire X2270 server was equipped with Intel QC Xeon X5570 processors (Nehalem). The positive effect of adding memory to increase I/O caching is offset to some degree by the reduction in memory frequency that occurs when additional DIMMs populate the bays of each memory channel on each CPU socket of these Nehalem processors.

  • SSDs can significantly improve NASTRAN performance, especially on runs with larger core counts.
  • Additional memory in the server can also increase performance; however, in some systems additional memory reduces the memory frequency, which may offset the benefit of the increased capacity.
  • If SSDs are not used, striped disks will often improve the performance of I/O-bound MCAE applications.
  • To obtain the highest performance it is recommended that SSDs be used and that servers be configured with the largest memory possible without decreasing the memory frequency. One should always look at the workload characteristics and compare against this benchmark to correctly set expectations.

SSD vs. HDD Performance

The performance of two striped 30GB SSDs was compared to two striped 7200 rpm 500GB SATA drives on a Sun Fire X2270 server.

  • At the 8-core level (the maximum for a single node), SSDs were 2.2x faster for the larger xxocmd2 and the smaller xlotdf1 cases.
  • For 1-core results SSDs are up to 3% faster.
  • On the smaller mdomdf1 test case there was no increase in performance in the 1-, 2-, and 4-core configurations.

Performance Enhancement with I/O Memory Caching

Nastran performance can often be increased by adding memory to provide additional in-core space for caching I/O, thereby reducing the I/O demands.

The main memory was doubled from 24GB to 48GB. At the 24GB level one 4GB DIMM was placed in the first bay of each of the 3 CPU memory channels on each of the two CPU sockets on the Sun Fire X2270 platform. This configuration allows a memory frequency of 1333MHz.

At the 48GB level a second 4GB DIMM was placed in the second bay of each of the 3 CPU memory channels on each socket. This reduces the memory frequency to 1066MHz.

Adding Memory With HDDs (SATA)

  • The additional server memory increased performance when running with the slower SATA drives at the higher core counts (e.g., 4 and 8 cores on a single node).
  • The larger xxocmd2 case was 42% faster and the smaller xlotdf1 case was 32% faster at the maximum 8-core level on a single system.
  • The special I/O-intensive getrag case was 8% faster at the 1-core level.

Adding Memory With SSDs

  • At the maximum 8-core level (for a single node) the larger xxocmd2 case was 47% faster in overall run time.
  • The effects were much smaller at lower core counts; at the 1-core level most test cases ran from 5% to 14% slower, with the lower memory frequency outweighing the added in-core space available for I/O caching versus direct transfer to SSD.
  • Only the special I/O-intensive getrag case was an exception, running 6% faster at the 1-core level.

Increasing Performance with Two Striped (SATA) Drives

The performance of multiple striped drives was also compared to that of a single drive. The study compared two striped internal 7200 rpm 500GB SATA drives to a single internal SATA drive.

  • On a single node with 8 cores, the largest test xx0cmd2 was 40% faster, the smaller test case xl0tdf1 was 33% faster, and even the smallest test case mdomdf1 was 12% faster.

  • On 1-core the added boost in performance with striped disks was from 4% to 13% on the various test cases.

  • On 1-core the special I/O-intensive test case getrag was 29% faster.

Performance Landscape

Times in table are elapsed time (sec).


MSC/Nastran Vendor_2008 Benchmark Test Suite
Sun Fire X2270, 2 x X5570 QC 2.93 GHz

Results with 2 x 7200 RPM SATA HDDs:

Test          Cores   48 GB      24 GB      24 GB      Ratio (2xSATA)   Ratio
                      2xSATA     2xSATA     1xSATA     48GB/24GB        2xSATA/1xSATA
                      1067MHz    1333MHz    1333MHz
vlosst1         1        133        127        134        1.05             0.95
xxocmd2         1        946        895        978        1.06             0.87
                2        622        614        703        1.01             0.87
                4        466        631        991        0.74             0.64
                8       1049       1554       2590        0.68             0.60
xlotdf1         1       2226       2000       2081        1.11             0.96
                2       1307       1240       1308        1.05             0.95
                4        858        833       1030        1.03             0.81
                8        912       1562       2336        0.58             0.67
xloimf1         1       1216       1151       1236        1.06             0.93
mdomdf1         1        987        913        983        1.08             0.93
                2        524        485        520        1.08             0.93
                4        270        237        269        1.14             0.88
Sol400_1        1       2555       2479       2674        1.03             0.93
(xl1fn40_1)
Sol400_S        1       2450       2302       2481        1.06             0.93
(xl1fn40_S)
getrag          1        778        843       1178        0.92             0.71
(xx0xst0)

Results with 2 x SSDs:

Test          Cores   48 GB      24 GB      Ratio        Ratio (24GB)
                      1067MHz    1333MHz    48GB/24GB    2xSATA/2xSSD
vlosst1         1        133        126       1.05          1.01
xxocmd2         1        947        884       1.07          1.01
                2        600        583       1.03          1.05
                4        426        404       1.05          1.56
                8        381        711       0.53          2.18
xlotdf1         1       2214       1939       1.14          1.03
                2       1315       1189       1.10          1.04
                4        744        751       0.99          1.11
                8        674        712       0.95          2.19
xloimf1         1       1228       1290       0.95          0.89
mdomdf1         1        987        911       1.08          1.00
                2        524        484       1.08          1.00
                4        270        250       1.08          0.95
Sol400_1        1       2549       2402       1.06          1.03
(xl1fn40_1)
Sol400_S        1       2449       2262       1.08          1.02
(xl1fn40_S)
getrag          1        771        817       0.94          1.03
(xx0xst0)

Results and Configuration Summary

Hardware Configuration:
    Sun Fire X2270
      One 2-socket rack-mounted server
      2 x 2.93 GHz QC Intel Xeon X5570 processors
      2 x internal striped SSDs
      2 x internal striped 7200 rpm 500GB SATA drives

Software Configuration:

    O/S: Linux 64-bit SUSE SLES 10 SP 2
    Application: MSC/NASTRAN MD 2008
    Benchmark: MSC/NASTRAN Vendor_2008 Benchmark Test Suite
    HP MPI: 02.03.00.00 [7585] Linux x86-64
    Voltaire OFED-5.1.3.1_5 GridStack for SLES 10

Benchmark Description

The benchmark tests are representative of typical MSC/Nastran applications including both SMP and DMP runs involving linear statics, nonlinear statics, and natural frequency extraction.

The MD (Multi Discipline) Nastran 2008 application performs both structural (stress) analysis and thermal analysis. These analyses may be either static or transient dynamic and can be linear or nonlinear as far as material behavior and/or deformations are concerned. The new release includes the MARC module for general purpose nonlinear analyses and the Dytran module that employs an explicit solver to analyze crash and high velocity impact conditions.

  • As of Summer 2008 there is an official Solaris x64 version of the MD Nastran 2008 system that is certified and maintained.
  • The memory requirements for the test cases in the new MSC/Nastran Vendor 2008 benchmark test suite range from a few hundred megabytes to no more than 5 GB.

Please see the MSC.Software website for a more complete description of the tests.

Key Points and Best Practices

For more on Best Practices of SSD on HPC applications also see the Sun Blueprint:
http://wikis.sun.com/display/BluePrints/Solid+State+Drives+in+HPC+-+Reducing+the+IO+Bottleneck

Additional information on the MSC/Nastran Vendor 2008 benchmark test suite.

  • Based on the maximum physical memory on a platform, the user can stipulate the maximum portion of this memory that can be allocated to the Nastran job. This is done on the command line with the mem= option. On Linux-based systems, where the platform has a large amount of memory and where the model does not have large scratch I/O requirements, the scratch space can be allocated to a tmpfs file system. On Solaris x64 systems, advantage can be taken of ZFS for higher I/O performance. (A sketch of a tmpfs-backed scratch setup follows this list.)

  • The MSC/Nastran Vendor 2008 test cases do not scale very well; a few do not scale at all, and the rest scale up to 8 cores at best.

  • The test cases for the MSC/Nastran module all have a substantial I/O component, with 15% to 25% of the total run time associated with I/O activity (primarily scratch files). The required scratch file size ranges from less than 1 GB up to about 140 GB. Performance is enhanced by using the fastest available drives and striping together more than one of them, or by using a high-performance disk storage system, and can be enhanced further by implementing a Lustre-based I/O system. High-performance interconnects such as InfiniBand, for inter-node cluster message passing as well as for I/O transfer from the storage system, can also enhance performance substantially.
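
As an illustration of the tmpfs approach mentioned above, here is a minimal sketch on Linux. The mount point, size, job file name, and the sdirectory/scratch keywords of the nastran submission command are assumptions for this sketch, not the exact settings used in these runs:

    # create a RAM-backed scratch file system (requires enough free physical memory)
    mkdir -p /nastran_scratch
    mount -t tmpfs -o size=32g tmpfs /nastran_scratch

    # submit a job with an explicit memory allocation and the tmpfs scratch directory
    nastran xx0cmd2.dat mem=4gb sdirectory=/nastran_scratch scratch=yes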

See Also

Disclosure Statement

MSC.Software is a registered trademark of MSC. All information on the MSC.Software website is copyrighted. MSC/Nastran Vendor 2008 results from http://www.mscsoftware.com and this report as of June 9, 2009.

Friday Jun 12, 2009

OpenSolaris Beats Linux on memcached Sun Fire X2270

OpenSolaris provided 25% better performance on memcached than Linux on the Sun Fire X2270 server. memcached 1.3.2 using OpenSolaris gave a maximum throughput of 352K ops/sec compared to the same server running RHEL5 (with kernel 2.6.29) which produced a result of 281K ops/sec.

memcached is the de-facto distributed caching server used to scale many web2.0 sites today. With the requirement to support a very large number of users as sites grow, memcached aids scalability by effectively cutting down on MySQL traffic and improving response times.

  • memcached is a very lightweight server but is known not to scale beyond 4-6 threads. Some scalability improvements have gone into the 1.3 release (still in beta); the worker thread count is set at startup, as shown in the sketch after this list.
  • As customers move to the newer, more powerful Intel Nehalem based systems, it is important that they have the ability to use these systems efficiently using appropriate software and hardware components.
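
As a point of reference, here is a minimal sketch of starting memcached with an explicit thread count and cache size; the values are illustrative, not the exact settings used in this test:

    # run memcached as a daemon with 8 worker threads and a 40 GB cache on the default port
    memcached -d -u nobody -t 8 -m 40960 -p 11211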

Performance Landscape

memcached performance results: ops/sec (bigger is better)

System           C/C/T    Processor        GHz    Memory   Operating System                            Performance
                                                                                                       (ops/sec)
Sun Fire X2270   2/8/16   Intel X5570 QC   2.93   48GB     OpenSolaris 2009.06                         352K
Sun Fire X2270   2/8/16   Intel X5570 QC   2.93   48GB     RedHat Enterprise Linux 5 (kernel 2.6.29)   281K

C/C/T: Chips, Cores, Threads

Results and Configuration Summary

Sun's results used the following hardware and software components.

Hardware:

    Sun Fire X2270
    2 x Intel X5570 QC 2.93 GHz
    48GB of memory
    10GbE Intel Oplin card

Software:

    OpenSolaris 2009.06
    Linux RedHat 5 (on kernel 2.6.29)

Benchmark Description

memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. The memcached benchmark was based on Apache Olio - a web2.0 workload.

The benchmark initially populates the server cache with objects of different sizes to simulate the types of data that real sites typically store in memcached :

  • small objects (4-100 bytes) to represent locks and query results
  • medium objects (1-2 KBytes) to represent thumbnails, database rows, resultsets
  • large objects (5-20 KBytes) to represent whole or partially generated pages

The benchmark then runs a mixture of operations (90% gets, 10% sets) and measures the throughput and response times when the system reaches steady state. The workload is implemented using Faban, an open-source benchmark development framework. Faban not only speeds benchmark development, but its harness is also a great way to queue, monitor and archive runs for analysis.

Key Points and Best Practices

OpenSolaris Tuning

The following /etc/system settings were used to set the number of MSIX:

  • set ddi_msix_alloc_limit=4
  • set pcplusmp:apic_intr_policy=1

For the ixgbe interface, 4 transmit and 4 receive rings gave the best performance (a sketch of where these driver properties are set follows the list below):

  • tx_queue_number=4, rx_queue_number=4
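
A minimal sketch of where these ring counts might be set, assuming the standard Solaris driver configuration file for ixgbe; a driver reload or reboot is required for the change to take effect:

    # /kernel/drv/ixgbe.conf (excerpt)
    tx_queue_number = 4;
    rx_queue_number = 4;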

The crossbow threads were bound:

dladm set-linkprop -p cpus=12,13,14,15 ixgbe0

Linux Tuning

Linux was more complicated to tune. The following Linux tunables were changed to try to get the best performance:

  • net.ipv4.tcp_timestamps = 0
  • net.core.wmem_default = 67108864
  • net.core.wmem_max = 67108864
  • net.core.optmem_max = 67108864
  • net.ipv4.tcp_dsack = 0
  • net.ipv4.tcp_sack = 0
  • net.ipv4.tcp_window_scaling = 0
  • net.core.netdev_max_backlog = 300000
  • net.ipv4.tcp_max_syn_backlog = 200000

Here are the ixgbe-specific settings that were used (2 transmit rings, 2 receive rings); a sketch of how the sysctl and driver settings might be applied follows the list:

  • RSS=2,2 InterruptThrottleRate=1600,1600
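
A minimal sketch of how these settings might be applied, assuming the sysctl values are set at run time (and persisted in /etc/sysctl.conf) and the ring/throttle options are passed when the ixgbe module is loaded; the module-reload step is an assumption, not the exact procedure used:

    # kernel tunables (repeat for each entry in the list above)
    sysctl -w net.ipv4.tcp_timestamps=0
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.core.netdev_max_backlog=300000

    # reload the ixgbe driver with 2 TX/2 RX rings and the chosen interrupt throttle rate
    modprobe -r ixgbe
    modprobe ixgbe RSS=2,2 InterruptThrottleRate=1600,1600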

Linux Issues

Using the 10GbE Intel Oplin card on Linux required the following driver and kernel rebuilds.

  • With the default ixgbe driver from the RedHat distribution (version 1.3.30-k2 on kernel 2.6.18), the interface simply hung during the benchmark test.
  • This led to downloading the driver from the Intel site (1.3.56.11-2-NAPI) and re-compiling it (as sketched below). This version does work, and we got a maximum throughput of 232K operations/sec on the same Linux kernel (2.6.18). However, this version of the kernel does not have support for multiple TX rings.
  • Kernel version 2.6.29 includes support for multiple TX rings but still does not include the 1.3.56.11-2-NAPI ixgbe driver, so we downloaded, built and installed these versions of the kernel and driver. This worked well, giving a maximum throughput of 281K ops/sec with some tuning.
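
A minimal sketch of the driver rebuild, assuming the standard build procedure for the Intel source tarball (exact file names and paths will differ):

    # build and install the downloaded ixgbe driver, then load it
    tar xzf ixgbe-1.3.56.11.tar.gz
    cd ixgbe-1.3.56.11/src
    make install
    modprobe ixgbe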

See Also

Disclosure Statement

Sun Fire X2270 server with OpenSolaris 352K ops/sec. Sun Fire X2270 server with RedHat Linux 281K ops/sec. For memcached information, visit http://www.danga.com/memcached. Results as of June 8, 2009.

Thursday Jun 11, 2009

SAS Grid Computing 9.2 utilizing the Sun Storage 7410 Unified Storage System

Sun has demonstrated the first large-scale grid validation for SAS Grid Computing 9.2. This workload showed both the strength of Solaris 10, using Containers for ease of deployment, and the value of the Sun Storage 7410 Unified Storage System with Fishworks Analytics in analyzing, tuning and delivering performance in complex, data-intensive, multi-node SAS grid environments.

In order to model the real world, the Grid Endurance Test uses large data sizes and complex data processing.  The results reflect realistic customer scenarios.  These benchmark results represent significant engineering effort, collaboration and coordination between SAS and Sun, and illustrate the commitment of the two companies to provide the best solutions for the most demanding data integration requirements.

  • A combination of 7 Sun Fire X2200 M2 servers running Solaris 10 and a Sun Storage 7410 Unified Storage System showed continued performance improvement as the node count increased from 2 to 7 nodes for the Grid Endurance Test.
  • SAS environments are often complex. Ease of deployment, configuration and use, and the ability to observe application I/O characteristics (hotspots, trouble areas) are critical for production environments. The power of Fishworks Analytics combined with the reliability of ZFS is a perfect fit for these types of applications.
  • The Sun Storage 7410 Unified Storage System (exporting via NFS) satisfied the performance needs, with throughput peaking at over 900MB/s (near 10GbE line speed) in this multi-node environment.
  • Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were trivially migrated (within minutes) as hardware resources became available and the grid expanded.
  • This result is the only large-scale grid validation for SAS Grid Computing 9.2, and the first and most timely qualification of OpenStorage for SAS.
  • The test showed delivered throughput of over 100MB/s through a client's 1Gb connection.

Configuration

The test grid consisted of 8 Sun Fire X2200 M2 servers, 1 configured as the grid manager and 7 as the actual grid nodes.  Each node had a 1GbE connection through a Brocade FastIron 1GbE/10GbE switch.  The 7410 had a 10GbE connection to the switch and served as the back-end storage, providing the common shared file system that SAS Grid Computing requires across all nodes.  A storage appliance like the 7410 is an easy to set up and maintain solution that satisfies the bandwidth required by the grid.  Our particular 7410 consisted of 46 x 700GB 7200 RPM SATA drives, 36GB of write-optimized SSDs and 300GB of read-optimized SSDs.


About the Test

The workload is a batch mixture.  CPU bound workloads are numerically intensive tests, some using tables varying in row count from  9,000 to almost 200,000.  The tables have up to 297 variables, and are processed with both stepwise linear regression and stepwise logistic regression.   Other computational tests use GLM (General Linear Model).  IO intensive jobs vary as well.  One particular test reads raw data from multiple files, then generates 2 SAS data sets, one containing over 5 million records, the 2nd over 12 million.  Another IO intensive job creates a 50 million record SAS data set, then subsequently does lookups against it and finally sorts it into a dimension table.   Finally, other jobs are both compute and IO intensive.

 The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access, as can be viewed via Fishworks Analytics further below.  The typical block size for IO is 32KB. 

Governing the batch is the SAS Grid Manager Scheduler, Platform LSF.  It determines when to add a job to a node based on the number of open job slots (user defined) and a point-in-time sample of how busy the node actually is.  From run to run, jobs end up scheduled randomly, making runs less predictable.  Inevitably, multiple IO intensive jobs will get scheduled on the same node, throttling the 1Gb connection and creating a bottleneck while other nodes do little to no IO.  Often this is unavoidable due to the great variety in behavior a SAS program can go through during its lifecycle.  For example, a program can start out as CPU intensive and be scheduled on a node processing an IO intensive job.  This is the desired behavior and the correct decision based on that point in time.  However, the initially CPU intensive job can then turn IO intensive as it proceeds through its lifecycle.


Results of scaling up node count

Below is a chart of results scaling from 2 to 7 nodes.  The metric is total run time from when the 1st job is scheduled, until the last job is completed.

Scaling of 400 Analytics Batch Workload
Number of Nodes Time to Completion
2 6hr 19min
3 4hr 35min
4 3hr 30min
5 3hr 12min
6 2hr 54min
7 2hr 44min

One may note that time to completion is not linear as node count scales upwards.   To a large extent this is due to the nature of the workload as explained above regarding 1Gb connections getting saturated.  If this were a highly tuned benchmark with jobs placed with high precision, we certainly could have improved run time.  However, we did not do this in order to keep the batch as realistic as possible.  On the positive side, we do continue to see improved run times up to the 7th node.

The Fishworks Analytics displays below show several performance statistics with varying numbers of nodes, with more nodes on the left and fewer on the right.  The first two graphs show file operations per second, and the third shows network bytes per second.   The 7410 provides over 900 MB/sec in the seven-node test.  More information about the interpretation of the Fishworks data for these test will be provided in a later white paper.

An impressive part of the Fishworks Analytics capture above is that throughput of 763MB/s was achieved during the sample period.  That was not the top end of what the 7410 could provide.  For the tests summarized in the table above, the 7 node run peaked at over 900MB/s through a single 10GbE connection.  Clearly the 7410 can sustain a fairly high level of IO.

It is also important to note that while we did try to emulate a real world scenario, with varying types of jobs and well over 1TB of data being manipulated during the batch, this is a benchmark.  The workload tries to encompass a large variety of job behavior; your scenario may differ quite a bit from what was run here.  Along with the scheduling issues, we were certainly seeing signs of pushing this 7410 configuration near its limits (with the SAS IO pattern and data set sizes), which also affected the ability to achieve linear scaling.  But many grid environments are running workloads that are not very IO intensive and tend to be more CPU bound with minimal IO requirements.  In that scenario one could expect to see excellent node scaling well beyond what was demonstrated by this batch.  To demonstrate this, the batch was run without the IO intensive jobs.  These jobs do require some IO, but tend to be restricted to 25MB/s or less per process and only for the purpose of initially reading a data set or writing results.

  • 3 nodes ran in 120 minutes
  • 7 nodes ran in 58 minutes
Very nice scaling, near linear, especially given the lag time that can occur with scheduling batch jobs.  The point of this exercise: know your workload.  In this case, the 7410 solution on the back end was more than up to the demands these 350+ jobs put on it, and there was still room to grow and scale out more nodes, further reducing overall run time.


Tuning(?)

The question mark is actually appropriate.  For the achieved results, after configuring a RAID1 share on the 7410, only one parameter made a significant difference.  During the IO intensive periods, single 1Gb client throughput was observed at 120MB/s simplex and 180MB/s duplex, producing well over 100,000 interrupts a second.  Jumbo frames were enabled on the 7410 and the clients (see the sketch below), reducing interrupts by almost 75% and reducing IO intensive job run time by an average of 12%.  Many other NFS, Solaris and TCP/IP tunings were tried, with no meaningful improvement in microbenchmarks or in the actual batch.  A nice, relatively simple (for a grid) setup.
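
As an illustration only (the exact procedure depends on the client NIC driver, and on the 7410 the MTU is changed per interface in the appliance configuration), raising the MTU on a Solaris client might look like the following; the interface name and appliance hostname are hypothetical, and some drivers require a driver.conf change before a 9000-byte MTU is accepted:

    # raise the MTU on the client data interface (interface name is illustrative)
    ifconfig nge0 mtu 9000

    # sanity-check with a large payload against the appliance's data address
    ping -s nas-data 8192 3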

Not a direct tuning, but an application change worth mentioning came from the visibility that Analytics provides.  Early on, during the load phase of the benchmark, the IO rate was less than spectacular: what should have taken about 4.5 hours was going to take almost a day.  Drilling down through Analytics showed us that hundreds of thousands of file opens and closes were occurring that the development team had been unaware of.  That was quickly fixed, and the data loader ran at the expected rates.


Okay - Really no other tuning?  How about 10GbE!

Alright, so there was something else we tried which was outside the test results achieved above.  The X2200 we were using is an 8-core box.  Even when maxing out the 1Gb testing with multiple IO bound jobs, there were still CPU resources left over.  Considering that a higher core count with more memory is becoming the standard for a "client", it makes sense to utilize all those resources.  In the case where a node would be scheduled with multiple IO jobs, we wanted to see if 10GbE could potentially push up client throughput.  Through our testing, two things helped improve performance.

The first was to turn off interrupt blanking.  With blanking disabled, packets are processed when they arrive, as opposed to being processed when an interrupt is issued.  Doing this resulted in a ~15% increase in duplex throughput.  Caveat: there is a reason interrupt blanking exists, and it isn't to slow down your network throughput.  Tune this only if you have a decent amount of idle CPU, as disabling interrupt blanking will consume it.  The other piece that resulted in a significant increase in throughput through the 10GbE NIC was to use multiple NFS client processes.  We achieved this through zones (a sketch of the zone setup follows the list below).  By adding a second zone, throughput through the single 10GbE interface increased ~30%.  The final duplex numbers, which are also peak throughput, were:

  • 288MB/s no tuning
  • 337MB/s interrupt blanking disabled
  • 430MB/s 2 NFS client processes + interrupt blanking disabled
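
A minimal sketch of adding a second NFS client environment with a Solaris 10 zone sharing the same 10GbE interface; the zone name, path and address are hypothetical, and the real deployment details are specific to this grid:

    # define, install and boot a small shared-IP zone on the 10GbE interface
    zonecfg -z client2 'create; set zonepath=/zones/client2; add net; set physical=ixgbe0; set address=192.168.10.12/24; end'
    zoneadm -z client2 install
    zoneadm -z client2 boot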


Conclusion - what does this show?

  • SAS Grid Computing which requires a shared file system between all nodes, can fit in very nicely on the 7410 storage appliance.  The workload continues to scale while adding nodes.
  • The 7410 can provide very solid throughput peaking at over 900MB/s (near 10GbE linespeed) with the configuration tested. 
  • The 7410 is easy to set up, gives an incredible depth of knowledge about the IO your application does which can lead to optimization. 
  • Know your workload, in many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price while providing the benefits described (and others not described) above.
  • 10GbE client networking can be a help if your 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU overhead.


Additional Reading

Sun.com on SAS Grid Computing and the Sun Storage 7410 Unified Storage System

Description of SAS Grid Computing


    Friday Jun 05, 2009

    Interpreting Sun's SPECpower_ssj2008 Publications

    Sun recently entered the SPECpower fray with the publication of three results on the SPECpower_ssj2008 benchmark.  Strangely, the three publications documented results on the same hardware platform (Sun Netra X4250) running identical software stacks, but the results were markedly different.  What exactly were we trying to get at?

     Benchmark Configurations

    Sun produces robust industrial-grade servers with a range of redundancy features we believe benefit our customers.   These features increase reliability, at the cost of additional power consumption. For example, redundant power supplies and redundant fans allow servers to tolerate faults, and hot-swap capabilities further minimize downtime.

    The benchmark run and reporting rules require the incorporation within the tested configuration of all components implied by the model name.  Within these limitations, the first publication was intended to be the best result (that is, the lowest power consumption per unit of performance) achievable on the Sun Netra X4250 platform, by minimizing the configured hardware to the greatest extent possible.

    Common Components

    All tested configurations had the following components in common:

    • System:  Sun Netra X4250
    • Processor: 2 x Intel L5408 QC @ 2.13GHz
    • 2 x 658 watt redundant AC power supplies
    • redundant fans
    • standard I/O expansion mezzanine
    • standard Telco dry contact alarm

    And the same software stack:

    • OS: Windows Server 2003 R2 Enterprise X64 Edition SP2
    • Drivers: platform-specific drivers from Sun Netra X4250 Tools and Drivers DVD Version 2.1N
    • JVM: Java HotSpot 32-Bit Server VM on Windows, version 1.6.0_14

    Tiny Configuration

    In addition to the common hardware components, the tiny configuration was limited to:

    • 8 GB of Memory (4 x 2048 MB as PC2-5300F 2Rx8)
    • 1 x Sun 146 GB 10K RPM SAS internal drive

    This is called the tiny configuration because it seems unlikely that most customers would configure an 8-core server with only one disk and only 1 GB available per core. Nevertheless, from a benchmark point of view, this configuration gave the best result.

    Typical Configuration

    The other two results were both produced on a configuration we considered much more typical of configurations that are actually ordered by customers.  In addition to the common hardware, this typical configuration included:

    • 32 GB of Memory (8 x 4096 MB as PC2-5300F)
    • 4 x Sun 146 GB 10K RPM SAS internal drives
    • 1 x Sun x8 PCIe Quad Gigabit Ethernet option card (X4447A-Z)

    Nothing special was done with the additional components.  The added memory increased the performance component of the benchmark. The other components were installed and configured but allowed to sit idle, so consumed less power than they would have under load.

    One Other Thing: Tuning for Performance

    So one thing we're getting at is the difference in power consumption between a small configuration optimized for a power-performance benchmark and a typical configuration optimized for customer workloads.  Hardware (power consumption) is only half of the benchmark--the other half being the performance achieved by the System Under Test (SUT).

    Tuning Choices 

    In all three publications the identical tunings were applied at the software level: identical java command-line arguments and JVM-to-processor affinity.  We also applied, in the case of the better results, the common (but usually non-default) BIOS-level optimization of disabling hardware prefetcher and adjacent cache line prefetch.  These optimizations are commonly applied to produce optimized SPECpower_ssj2008 results but it is unlikely that many production applications would benefit from these settings.  To demonstrate the effect of this tuning, the final result was generated with standard BIOS settings.

     And just so we couldn't be accused of sand-bagging the results, the number of JVMs was increased in the typical configurations to take advantage of the additional memory populated over and above the tiny configuration.  Additional performance was achieved but sadly it doesn't compensate for the higher power consumption of all that memory.

    So in summary we tuned:

    • Tiny Configuration: non-default BIOS settings
    • Typical Configuration 1: non-default BIOS settings; additional JVMs to utilize added memory
    • Typical Configuration 2: default BIOS settings; additional JVMs to utilize added memory

    At the OS level, all tunings  were identical.

    Results 

    The results are summarized in this table:

    System                                      Processor   GHz    Overall        Peak          Peak    Idle
                                                Model              ssj_ops/watt   Performance   Power   Power
                                                                                  ssj_ops       watts   watts
    Sun Netra X4250 (8GB, non-default BIOS)     L5408       2.13   600            244832        226     174
    Sun Netra X4250 (32GB, non-default BIOS)    L5408       2.13   478            251555        294     226
    Sun Netra X4250 (32GB, default BIOS)        L5408       2.13   437            229828        296     225

    (See the SPEC website for the full disclosure of each result.)

    Conclusions

    • The measurement and reporting methods of the benchmark encourage small memory configurations.  Comparing the first and second result, adding additional memory yielded minimal performance improvement (from 244832 to 251555) but a large increase in power consumption, 68 watts at peak.

    • In our opinion, unrealistically small configurations yield the best results on this benchmark.  On the more typical system, the benchmark overall metric decreased from 600 overall ssj_ops per watt to 478 overall ssj_ops per watt, despite our best effort to utilize the additional configured memory.

    • On typical configurations, reverting to default BIOS settings resulted in a significant decrease in performance (from 251555 to 229828) with no corresponding decrease in power consumption (essentially identical for both results).

    Configurations typical of customer systems (with adequate memory, internal disks, and option cards) consume more power than configurations which are commonly benchmarked, while providing no corresponding improvement in SPECpower_ssj2008 benchmark performance. The result is a lower overall power-performance metric on typical configurations and a lack of published benchmark results on robust systems with the capacities and redundancies that enterprise customers desire.

    Fair Use Disclosure

    SPEC, SPECpower, and SPECpower_ssj are trademarks of the Standard Performance Evaluation Corporation.  All results from the SPEC website (www.spec.org) as of June 5, 2009.  For a complete set of accepted results refer to that site.

    Wednesday Jun 03, 2009

    Wide Variety of Topics to be discussed on BestPerf

    A sample of the various Sun and partner technologies to be discussed:
    OpenSolaris, Solaris, Linux, Windows, VMware, gcc, Java, GlassFish, MySQL, Sun Studio, ZFS, DTrace, perflib, Oracle, DB2, Sybase, OpenStorage, CMT, SPARC64, X64, X86, Intel, AMD

