Tuesday Sep 01, 2009

String Searching - Sun T5240 & T5440 Outperform IBM Cell Broadband Engine

Significance of Results

Sun SPARC Enterprise T5220, T5240 and T5440 servers ran benchmarks using the Aho-Corasick string searching algorithm. String searching or pattern matching are important to a variety of commercial, government and HPC applications. One of the core functions needed for text identification algorithms in data repositories is real-time string searching. For this benchmark, the IBM, HP and Sun systems used the Aho-Corasick algorithm for string searching.

Sun SPARC Enterprise T5440

  • A 1.6 GHz Sun SPARC Enterprise T5440 server could search a book as tall as Mt. Everest (29,208 feet, 861 GB book) in 61 seconds, which corresponds to a string search rate of 14.2 GB/s.

  • A 1.6 GHz Sun SPARC Enterprise T5440 server can search at a rate of 14.2 GB/s, which corresponds to searching a book containing one terabyte of data (34,745 feet high) in only 70 seconds.

  • The 4-chip 1.6 GHz Sun SPARC Enterprise T5440 server performed string searching at a rate of 14.2 GB/s which is 29.9 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s

  • The 4-chip 1.6 GHz Sun SPARC Enterprise T5440 server performed string searching 3.7 times as fast as the 4-chip HP DL-580 (2.93 GHz Xeon QC) server that performed string searching at a rate of 3.87 GB/s. The 1.6 GHz Sun SPARC Enterprise T5440 server has a 1.7 times advantage in delivered power-performance over the HP DL-580 (using a power consumption rate of 830 watts for the HP system as measured on other tests).

  • The 1.6 GHz Sun SPARC Enterprise T5440 server demonstrated a 12% improvement over the 1.4 GHz Sun SPARC Enterprise T5440.

  • The 1.6 GHz Sun SPARC Enterprise T5440 server demonstrated a 2x speedup over the 1.6 GHz Sun SPARC Enterprise T5240 server which demonstrated a 2.3x speedup over the 1.4 GHz Sun SPARC Enterprise T5220 server.

Sun SPARC Enterprise T5240

  • The 2-chip 1.6 GHz Sun SPARC Enterprise T5240 server performed string searching at a rate of 7.22 GB/s which is 15.4 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.

  • The 2-chip 1.6 GHz Sun SPARC Enterprise T5240 server performed string searching 1.9 times as fast as the 4-chip HP DL-580 (2.93 GHz Xeon QC) server that performed string searching at a rate of 3.87 GB/s. The 1.6 GHz Sun SPARC Enterprise T5240 server has a 2.4 times advantage in delivered power-performance over the HP DL-580 (using a power consumption rate of 830 watts for the HP system as measured on other

  • The 1.6 GHz Sun SPARC Enterprise T5240 server demonstrated a 14% speedup over the 1.4 GHz Sun SPARC Enterprise T5240 server.

Sun SPARC Enterprise T5220

  • The 1-chip 1.4 GHz Sun SPARC Enterprise T5220 server performed string searching at a rate of 3.16 GB/s which is 6.7 times as fast as the 2-chip IBM Cell Broadband Engine DD3 Blade that performed string searching at a rate of 0.475 GB/s.

Performance Landscape

System Throughput
(GB/sec)
Chips Cores
Sun SPARC Enterprise T5440 (1.6 GHz) 14.2 4 32
Sun SPARC Enterprise T5440 (1.4 GHz) 12.7 4 32
Sun SPARC Enterprise T5240 (1.6 GHz) 7.2 2 16
Sun SPARC Enterprise T5240 (1.4 GHz) 6.4 2 16
HP DL-580 (2.9 GHz) 3.9 4 16
Sun SPARC Enterprise T5220 (1.4 GHz) 3.2 1 8
IBM Cell Broadband Engine DD3 Blade (3.2 GHz) 0.475 2 16

Results and Configuration Summary

Hardware Configuration:
    Sun SPARC Enterprise T5440 (1.6 GHz)
      4 x 1.6 GHz UltraSPARC T2 Plus processors
      256 GB
    Sun SPARC Enterprise T5440 (1.4 GHz)
      4 x 1.4 GHz UltraSPARC T2 Plus processors
      128 GB
    Sun SPARC Enterprise T5240 (1.6 GHz)
      2 x 1.6 GHz UltraSPARC T2 Plus processors
      64 GB
    Sun SPARC Enterprise T5240 (1.4 GHz)
      2 x 1.4 GHz UltraSPARC T2 Plus processors
      64 GB
    Sun SPARC Enterprise T5220 (1.4 GHz)
      1 x 1.4 GHz UltraSPARC T2 processor
      32 GB

Software Configuration:

    Sun SPARC Enterprise T5440 (1.6 GHz)
      OpenSolaris 2009.06
      Sun Studio 12 (Sun C 5.9 2007.05)
    Sun SPARC Enterprise T5440 (1.4 GHz)
      Solaris 10 2008.07
      Sun Studio 12 (Sun C 5.9 2007.05)
    Sun SPARC Enterprise T5240 (1.6 GHz)
      OpenSolaris 2009.06
      Sun Studio 12 (Sun C 5.9 2007.05)
    Sun SPARC Enterprise T5240 (1.4 GHz)
      Solaris 10 2008.07
      Sun Studio 12 (Sun C 5.9 2007.05)
    Sun SPARC Enterprise T5220 (1.4 GHz)
      Solaris 10 2008.07
      Sun Studio 12 (Sun C 5.9 2007.05)

Benchmark Description

One of the core functions needed for text identification algorithms in data repositories is real-time string searching. This string searching benchmark demonstrates the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code creation and speed of code execution. In IEEE Computer, Volume 41, Number 4, pp. 42-50, April 2008, IBM describes a variant of the Aho-Corasick string searching algorithm that uses deterministic finite automata. The algorithm first constructs a graph that represents a dictionary, then walks that graph using successive input characters from a text file. Each "state" in the graph includes a state transition table (STT) that is accessed using the next input character from the text file to determine the address of the next state in the graph. IBM defines an automaton as a two-step loop that: (1) obtains the address of the next state from the STT, and (2) fetches the next state in the graph.

IBM reports the performance of its Cell Broadband Engine (CBE) to execute this algorithm to search a 4.4 MB version of the King James Bible using a dictionary of the 20,000 most used words in the English language (average word length of 7.59 characters). Each of the 8 synergistic processing elements (SPEs) of each of the two CBEs executes 16 automata, for a total of 256 automata. All automata and hence all SPEs access a single, shared dictionary.

IBM describes elaborate optimizations of the Aho-Corasick algorithm, including state shuffling, state replication, alphabet shuffling and state caching. These optimizations were required to: (1) overcome "memory congestion", i.e., contention amongst the SPEs for access to the shared dictionary, and (2) compensate for the limited local storage that is associated with each SPE. These optimizations were necessary to achieve the performance reported for the CBE DD3 Blade.

IBM does not provide references that indicate where to obtain the dictionary and Bible. IBM reports the algorithmic performance in Gbits/s but does not indicate whether an 8-bit byte is extended to 10 bits as required for network transmission.

In order to closely approximate the dictionary and Bible that were used by IBM, Sun used a dictionary of 25,143 English words (the Open Solaris file cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/spell/list) for which the average word length is 7.2 characters, and a 4.6 MB version of the King James Bible (www.patriot.net/users/bmcgin/kjv12.zip). For reporting of results in Gbits/s, the length of a byte is assumed to be 8 bits.

Key Points and Best Practices

  • Power was measured during execution of the Aho-Corasick algorithm using a WattsUp power meter, and the average rate of power consumption is presented.

  • The Aho-Corasick algorithm as deployed on the IBM Cell Broadband Engine DD3 Blade required substantial optimization and tuning to achieve the reported performance, whereas on the Sun SPARC Enterprise T5220, T5240 or T5440 servers only a basic implementation of the algorithm and a simple compilation were needed.

  • In order to demonstrate the usefulness of Sun's UltraSPARC T2 and T2 Plus processors for both ease of code generation and speed of code execution, Sun implemented the Aho-Corasick algorithm using ANSI C. No optimizations of the algorithm were required to achieve the performance reported for the T5220, T5240 and T5440. The source code was compiled using the -m32 -xO3 and -xopenmp options. The dictionary is represented using a graph that comprises 82 MB. Each core of the T5220, T5240 or T5440 executes 8 automata using one OpenMP thread per automaton. Thus, the T5220 executes 64 total automata, the T5240 executes 128 total automata and the T5440 executes 256 total automata. All automata and hence all cores access a single, shared dictionary. Access to this dictionary is accelerated by the large, shared L2 caches of the Sun SPARC Enterprise T5220, T5240 and T5440.

See Also

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.

Index Pages
Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today