By Marcus Heckel on Jun 11, 2009
In order to model the real world, the Grid Endurance Test uses large data sizes and complex data processing, reflecting realistic customer scenarios. These benchmark results represent significant engineering effort and close collaboration between SAS and Sun, and they illustrate the commitment of the two companies to provide the best solutions for the most demanding data integration requirements.
- A combination of 7 Sun Fire X2200 M2 servers utilizing Solaris 10 and a Sun Storage 7410 Unified Storage System showed continued performance improvement as the node count increased from 2 to 7 nodes for the Grid Endurance Test.
- SAS environments are often complex. Ease of deployment, configuration, use, and ability to observe application IO characteristics (hotspots, trouble areas) are critical for production environments. The power of Fishworks Analytics combined with the reliability of ZFS is a perfect fit for these types of applications.
- Sun Storage 7410 Unified Storage System (exporting via NFS) satisfied performance needs, throughput peaking at over 900MB/s (near 10GbE line speed) in this multi-node environment.
- Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were trivially migrated (within minutes) as HW resources became available (Grid expanded).
- This result is the only large-scale grid validation for SAS Grid Computing 9.2, and the first qualification of OpenStorage for SAS.
- The tests showed delivered throughput of over 100MB/s through a client's 1Gb connection.
The test grid consisted of 8 Sun Fire X2200 M2 servers: 1 configured as the grid manager and 7 as the actual grid nodes. Each node had a 1GbE connection through a Brocade FastIron 1GbE/10GbE switch. The 7410 had a 10GbE connection to the switch and served as the back-end storage, providing the common shared file system across all nodes that SAS Grid Computing requires. A storage appliance like the 7410 is easy to set up and maintain while satisfying the bandwidth required by the grid. Our particular 7410 consisted of 46 700GB 7200RPM SATA drives, 36GB of write-optimized SSDs, and 300GB of read-optimized SSDs.
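As a rough sketch of how such a setup comes together on the client side, each grid node mounts the share exported by the 7410 at a common path. The hostname, share name, and mount point below are illustrative, not taken from the actual test environment:

```shell
# Hypothetical Solaris 10 client-side mount of the 7410's NFS export.
# All grid nodes mount the same share at the same path so SAS Grid
# Computing sees one shared file system.
mkdir -p /sasdata
mount -F nfs -o rsize=32768,wsize=32768 sun7410:/export/sasdata /sasdata

# To persist across reboots, an equivalent /etc/vfstab entry would be:
# sun7410:/export/sasdata  -  /sasdata  nfs  -  yes  rsize=32768,wsize=32768
```

The 32KB read and write sizes line up with the typical SAS IO block size described below; actual mount options should be validated for a given environment.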
About the Test
The workload is a batch mixture. CPU bound workloads are numerically intensive tests, some using tables varying in row count from 9,000 to almost 200,000. The tables have up to 297 variables, and are processed with both stepwise linear regression and stepwise logistic regression. Other computational tests use GLM (General Linear Model). IO intensive jobs vary as well. One particular test reads raw data from multiple files, then generates 2 SAS data sets, one containing over 5 million records, the 2nd over 12 million. Another IO intensive job creates a 50 million record SAS data set, then subsequently does lookups against it and finally sorts it into a dimension table. Finally, other jobs are both compute and IO intensive.
The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access, as can be viewed via Fishworks Analytics further below. The typical block size for IO is 32KB.
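As a rough illustration (not part of the benchmark itself), the sequential 32KB pattern can be approximated with `dd` against the NFS share; the file path is hypothetical:

```shell
# Crude microbenchmark mimicking SAS's sequential 32KB IO pattern.
# Sequential 1GB write in 32KB blocks to the shared file system:
dd if=/dev/zero of=/sasdata/tmp/seqtest.dat bs=32k count=32768

# Sequential 1GB read back in 32KB blocks:
dd if=/sasdata/tmp/seqtest.dat of=/dev/null bs=32k
```

A microbenchmark like this only characterizes a single stream from a single client; the interesting behavior in this test came from many such streams landing on the 7410 at once.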
Governing the batch is the SAS Grid Manager Scheduler, Platform LSF. It decides when to add a job to a node based on the number of open job slots (user defined) and a point-in-time sample of how busy the node actually is. From run to run, jobs end up scheduled differently, making runs less predictable. Inevitably, multiple IO intensive jobs get scheduled on the same node, saturating its 1Gb connection and creating a bottleneck while other nodes do little to no IO. Often this is unavoidable, given the great variety of behavior a SAS program can exhibit during its lifecycle. For example, a program can start out CPU intensive and be scheduled on a node already processing an IO intensive job. That is the desired behavior and the correct decision at that point in time; however, the initially CPU intensive job can then turn IO intensive as it proceeds through its lifecycle.
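The scheduling flow above can be sketched with standard LSF commands; the queue name and SAS program names here are hypothetical, used only to show the shape of a submission:

```shell
# Sketch: submitting batch jobs to Platform LSF, which places each job
# on the node with open job slots and the lowest sampled load at that
# moment (queue and script names are illustrative).
bsub -q normal -o regress_%J.log sas -sysin stepwise_regression.sas
bsub -q normal -o sort50m_%J.log sas -sysin build_dimension_table.sas

# Inspect where the scheduler actually placed the jobs:
bjobs -u all
```

Because placement is based on a point-in-time sample, two IO-heavy submissions can still land on the same node, which is exactly the saturation behavior described above.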
Results of scaling up node count
Below is a chart of results scaling from 2 to 7 nodes. The metric is total run time from when the 1st job is scheduled, until the last job is completed.
[Chart: Scaling of 400 Analytics Batch Workload — Number of Nodes vs. Time to Completion]
One may note that time to completion is not linear as node count scales upwards. To a large extent this is due to the nature of the workload as explained above, with 1Gb connections getting saturated. If this were a highly tuned benchmark with jobs placed with high precision, we certainly could have improved run time. However, we did not do this, in order to keep the batch as realistic as possible. On the positive side, run times do continue to improve up through the 7th node.
The Fishworks Analytics displays below show several performance statistics with varying numbers of nodes, with more nodes on the left and fewer on the right. The first two graphs show file operations per second, and the third shows network bytes per second. The 7410 provides over 900MB/s in the seven-node test. More information about the interpretation of the Fishworks data for these tests will be provided in a later white paper.
An impressive detail in the Fishworks Analytics shot above: throughput of 763MB/s was achieved during the sample period. And that wasn't the top end of what the 7410 could provide. For the tests summarized in the table above, the 7 node run peaked at over 900MB/s through a single 10GbE connection. Clearly the 7410 can sustain a fairly high level of IO.
It is also important to note that while we did try to emulate a real world scenario, with varying types of jobs and well over 1TB of data being manipulated during the batch, this is still a benchmark. The workload tries to encompass a large variety of job behavior; your scenario may differ considerably from what was run here. Along with the scheduling issues, we were certainly seeing signs of pushing this 7410 configuration near its limits (with the SAS IO pattern and data set sizes), which also affected the ability to achieve linear scaling. But many grid environments run workloads that aren't very IO intensive and tend to be CPU bound with minimal IO requirements. In that scenario one could expect excellent node scaling well beyond what was demonstrated by this batch. To demonstrate this, the batch was rerun without the IO intensive jobs. The remaining jobs do require some IO, but tend to be restricted to 25MB/s or less per process, and only for initially reading a data set or writing results.
- 3 nodes ran in 120 minutes
- 7 nodes ran in 58 minutes
Tuning?
The question mark is actually appropriate. For the achieved results, after configuring a RAID1 share on the 7410, only one parameter made a significant difference. During the IO intensive periods, single 1Gb client throughput was observed at 120MB/s simplex and 180MB/s duplex, producing well over 100,000 interrupts a second. Jumbo frames were enabled on the 7410 and the clients, reducing interrupts by almost 75% and reducing IO intensive job run time by an average of 12%. Many other NFS, Solaris, and TCP/IP tunings were tried, with no meaningful improvement in microbenchmarks or in the actual batch. A nice, relatively simple (for a grid) setup.
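On the client side, enabling jumbo frames looks roughly like the following; the interface name is illustrative, and the switch ports and the 7410's 10GbE interface must be set to the same MTU for this to take effect end to end:

```shell
# Assumed Solaris 10 client commands for jumbo frames (interface name
# e1000g0 is hypothetical; substitute your NIC).
ifconfig e1000g0 mtu 9000

# Verify the new MTU took effect:
ifconfig e1000g0 | grep mtu

# Many NIC drivers must also be told to allow larger frames, e.g. for
# the e1000g driver via /kernel/drv/e1000g.conf (driver-specific):
#   MaxFrameSize=3;
```

With 9000-byte frames, a 32KB SAS IO completes in a handful of packets instead of dozens of 1500-byte frames, which is where the ~75% interrupt reduction comes from.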
One change worth mentioning was not a tuning but an application fix, made possible by the visibility Analytics provides. Early in the load phase of the benchmark, the IO rate was less than spectacular: what should have taken about 4.5 hours was on pace to take almost a day. Drilling down through Analytics showed us that hundreds of thousands of file opens and closes were occurring that the development team had been unaware of. That was quickly fixed, and the data loader ran at the expected rate.
Okay - Really no other tuning? How about 10GbE!
Alright, so there was something else we tried, outside the test results achieved above. The X2200 we were using is an 8 core box. Even when maxing out the 1Gb connection with multiple IO bound jobs, there were still CPU resources left over. Considering that higher core counts and more memory are becoming the standard for a "client", it makes sense to utilize all those resources. For the case where a node is scheduled with multiple IO jobs, we wanted to see whether 10GbE could push client throughput higher. Through our testing, two things helped improve performance.
The first was to turn off interrupt blanking. With blanking disabled, packets are processed as they arrive rather than waiting for an interrupt to be issued. Doing this resulted in a ~15% increase in duplex throughput. Caveat: there is a reason interrupt blanking exists, and it isn't to slow down your network throughput. Tune this only if you have a decent amount of idle CPU, as disabling interrupt blanking will consume it. The other change that produced a significant increase in throughput through the 10GbE NIC was using multiple NFS client processes, which we achieved through zones. Adding a second zone increased throughput through the single 10GbE interface by ~30%. The final duplex numbers (also peak throughput) were:
- 288MB/s no tuning
- 337MB/s interrupt blanking disabled
- 430MB/s 2 NFS client processes + interrupt blanking disabled
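A hedged sketch of the two tunings follows. The interrupt-blanking parameter is driver-specific and the name shown here is hypothetical (consult your NIC driver's documentation); the zone name and address are likewise illustrative:

```shell
# 1. Disable interrupt blanking (hypothetical driver parameter name --
#    the real knob varies by NIC driver). This trades idle CPU for
#    immediate packet processing:
ndd -set /dev/nxge0 rx_intr_time 0

# 2. Add a second Solaris zone so a second NFS client instance drives
#    the same 10GbE interface (names/addresses are illustrative):
zonecfg -z sasnode2 "create; set zonepath=/zones/sasnode2; \
  add net; set physical=nxge0; set address=192.168.1.102/24; end"
zoneadm -z sasnode2 install
zoneadm -z sasnode2 boot
```

The second zone helps because each Solaris NFS client instance maintains its own connection state to the server; two instances on one interface can keep the 10GbE pipe fuller than one.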
Conclusion - what does this show?
- SAS Grid Computing, which requires a shared file system across all nodes, fits very nicely on the 7410 storage appliance. The workload continued to scale as nodes were added.
- The 7410 can provide very solid throughput peaking at over 900MB/s (near 10GbE linespeed) with the configuration tested.
- The 7410 is easy to set up and gives an incredible depth of insight into your application's IO, which can lead to optimization.
- Know your workload. In many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price, while providing the benefits described above (and others not described).
- 10GbE client networking can be a help if your 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU overhead.