Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks

A 16-node cluster of Oracle's Sun Fire X2270 M2 servers showed super-linear scaling of two Hadoop benchmarks. Performance was measured using the Terasort benchmark with a 100GB data set. In addition, performance was measured using Cloudburst which maps next generation "short read" sequence data onto the human and other genomes.

  • On the Terasort workload, a 16-node Sun Fire X2270 M2 cluster sorted the 100GB data set at a rate of 433.3 MB/s finishing in 236.3 seconds.

  • The 16-node Sun Fire X2270 M2 cluster was 9.3x faster on a per node basis than the 2010 winner of the Terasort benchmark competition (www.sortbenchmark.org) which used a 3,452-node Xeon cluster to sort 100 TB of input data in 173 minutes. Both systems used Hadoop, Terasort and 2-socket x86 servers. Allowances have to be made for the differences in problem complexity.

  • The Terasort benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).

  • Using Cloudburst on a workload of the human genome and the SRR001113 short read data set, a 16-node Sun Fire X2270 M2 cluster finished mapping the short reads onto the human genome in 34.2 minutes.

  • On a per node basis, a 2-node Sun Fire X2270 M2 cluster was 1.7x faster than a 12-node Xeon cluster that processed the human genome and the SRR001113 short read data set in approximately 60,000 seconds (see figure 3 of this journal article). Both systems used Hadoop, CloudBurst and x86 servers.

  • The Terasort benchmark showed super-linear scaling on the Sun Fire X2270 M2 cluster (total of 32 Intel 2.93GHz Xeons).

Performance Landscape

Terasort
100 GB input data set
Performance is "real" execution time reported by /usr/bin/time in seconds (smaller is better)
Number
of Nodes
Seconds Scaling Linear
Scaling
16 236.3 25.4 16
8 466.3 12.9 8
4 927.2 6.5 4
2 2140.8 2.8 2
1 6010.2 1.0 1

CloudBurst
SR001111 short read data set mapped onto the hs_ref_GRCh37 human genome
Performance is "total running" time reported by CloudBurst in seconds (smaller is better)
Number
of Nodes
Seconds Scaling Linear
Scaling
16 2054.9 8.4 8
8 3615.8 4.7 4
4 7895.7 2.2 2
2 17155.1 1.0 1

Results and Configuration Summary

Hardware Configuration:

16 x Sun Fire X2270 M2 server, each server with
2 Intel Xeon X5670 2.93GHz processors, turbo enabled
96 GB memory 1066 MHz
HDD SATA 1 TB 7200 RPM 3.5-in.
2 x 10/100/1000 ethernet

Software Configuration:

Oracle Solaris 10 10/09
Java Platform, Standard Edition, JDK 6 Update 20 Performance Release
Hadoop v0.20.2

Benchmark Description

The Apache Hadoop middleware is the Yahoo implementation of Google's Map Reduce. Map Reduce permits the programmer to write serial code that Map Reduce schedules for parallel execution. Map Reduce has been applied to a wide variety of problems, including image processing, sorting, database merging and genomics.

Hadoop uses the Hadoop Distributed Filesystem (HDFS) that distributes data across the local disks of a cluster such that each node in the cluster accesses its local disk to the greatest extent possible.

Results for two different Hadoop benchmarks are reported above:

  • Terasort is an I/O intensive benchmark that was originally developed by Jim Gray. By having many Hadoop data nodes, it is possible to achieve high I/O capacity. For purposes of benchmarking, the Teragen program was used to create an input data set that comprised 100 GB.

  • CloudBurst is a genome assembly benchmark that was developed by Michael Schatz, previously of the University of Maryland and presently of Cold Springs Harbor Laboratory. CloudBurst maps what is known as DNA short read data onto a reference genome. For purposes of benchmarking, the SRR001113 short read data set is mapped onto the hs_ref_GRCh37 sequence data for all chromosomes of the human genome. Specifically, the hs_ref_GRCh37 FASTA files for chromosomes 1, 2, ... 21, 22, X and Y were catenated in that order to obtain one large FASTA file that represented all chromosomes of the human genome. For purposes of benchmarking, any DNA fragment from the SRR001113 short read data set that contained more than three mismatches was ignored.

See Also

Disclosure Statement

Hadoop, see http://hadoop.apache.org/ for more information. Results as of 9/20/2010.
Comments:

Post a Comment:
Comments are closed for this entry.
About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today