Thursday Mar 19, 2009

Interesting collation of stuff

Interesting new book, The Developer's Edge, brings together a good collection of technical articles harvested from the Sun blogosphere. Naturally, there is some info included on T2 crypto. The book can be found here.

Wednesday Sep 06, 2006

Should be good for life sciences

The Hot Chips presentation for Niagara-2 can be found here. With double the threads of Niagara-1, improved single-thread performance, and greatly improved floating-point performance (an FGU per core), this should be a good fit for the life sciences space -- extending applicability from genomics to the molecular dynamics apps that have significant FP requirements.

Tuesday Sep 05, 2006

NCBI BLAST Performance

While looking at the performance of the new computational core in BLAST 2.2.14, I noticed that the precompiled version for SPARC Solaris is compiled using an old compiler (Studio 8). Building your own version from the source code (also available on the website) appears to improve performance noticeably -- about 15% (results observed on a 1GHz T1000 running the NCBI benchmark queries). It's likely that further gains are feasible by altering the default compiler flags -- although the default flags look reasonable (-fast etc).

Monday Apr 24, 2006

Bioinformatics and Try and Buy

In recent posts I have mentioned the performance of Bioinformatics applications on the T1000 and T2000 (Niagara boxes). However, I never got around to mentioning Sun's Try and Buy program:

At the risk of this sounding like a sales pitch, this is a pretty interesting program that essentially allows customers to experiment with a Niagara box for 60 days free of charge. At the end of the 60 days, customers can either opt to buy the system or return the box to Sun -- I even think Sun pays for the return shipping. If you are interested in investigating the performance of bioinformatics applications on Niagara, this represents a great opportunity to do so.

Wednesday Apr 05, 2006

HMMER and T2000 (Niagara box)

I've been looking at HMMER (another frequently used bioinformatics application) performance on the T2000. So far things look good -- good scaling, and pretty impressive performance when compared with other platforms. I will try to post some scaling results in the next day or so. Be sure to use the Sun compiler when compiling HMMER -- it applies some pretty cool optimizations, leveraging conditional moves to reduce the number of hard-to-predict conditional branches in hot loops...

Monday Apr 03, 2006

NCBI BLAST Scaling on Sun Fire T2000

As promised, here are some scaling results for NCBI BLAST on a T2000. The first plot focuses on reducing the latency of a single query -- performance scaling from a run with 4 threads (-a 4) to a run with 32 threads (-a 32) [nucleotide query (2160 letters), blastn, nt database]. For the 4-thread run, the threads are bound to a single core using psrset. The scaling is pretty good, but you can't avoid Amdahl's law, and we see around 6.8X moving from 1 core to 8 cores. This scaling is pretty typical, although some queries scale better and some scale worse. The final moments of a BLAST run are single-threaded; during this time the final scoring for the best matches is computed and the output file is generated (as a result, if the number of matches reported is decreased from the default, scaling improves). It should, however, be mentioned that there is no reason why much of this final single-threaded portion of the run could not be multi-threaded, and the move to multi-core processors is starting to justify the incremental parallelization of this final serial phase.
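The Amdahl's-law arithmetic behind the 6.8X figure can be sketched as follows (the serial fraction here is inferred from the measurement, not measured directly). With parallelizable fraction $f$ on $N$ cores:

```latex
S(N) = \frac{1}{(1-f) + \frac{f}{N}}
```

Setting $S(8) = 6.8$:

```latex
\frac{1}{6.8} = (1-f) + \frac{f}{8} \approx 0.147
\quad\Longrightarrow\quad
f = \frac{8}{7}\,(1 - 0.147) \approx 0.975
```

In other words, a serial fraction of only about 2.5% is enough to pull 8-core scaling down to 6.8X -- and it would cap the speedup near 40X ($1/(1-f)$) no matter how many cores were available, which is why parallelizing that final serial phase matters.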

Alternatively, rather than running a single multi-threaded query on a T2000, it is possible to run multiple independent queries in parallel. This could be in the form of four 8T queries, eight 4T queries or, at the extreme, thirty-two single-threaded queries. The following plot illustrates the scaling, in terms of aggregate throughput of queries, for the single-threaded query scenario. Moving from 1-core (4 independent single-threaded queries bound to a single core) to 8-cores (32 independent single-threaded queries), throughput increases almost 8X.

Leveraging a throughput-centric approach delivers almost perfect scaling going from 1 to 8 cores. Further, given the manner in which many BLAST queries are handled, this throughput approach may be perfectly acceptable: on a shared server that accepts queries from multiple users, there is an abundance of queries available for parallel processing and no tight time constraint on each individual query (queries might be submitted via a web interface and the results emailed to the requester). It is also possible to leverage a hybrid configuration of, say, eight 4-thread queries in parallel, if there are certain QoS levels that can't be met with single-threaded queries.

Monday Mar 13, 2006

Niagara, Bioinformatics and a 32-way BLAST trace

As I touched on briefly in my last entry, we have looked at the scaling of a variety of different bioinformatics applications on Niagara (T1). We have seen great scaling, whether you consider scaling from a single 1-thread job to a single 32-thread job or to 32 1-thread jobs in parallel. I have some plots that show this nicely, and I will try to post them here in the next couple of days.

Following from this, I just wanted to quickly mention the availability of a 32-way BLAST trace. The trace leverages Sun's RST format (more details here) and can be downloaded via:

The trace is for a protein query (1887 letters), using the blastp program, against a Non-Redundant Protein database (2,244,936 sequences; 757,978,433 total letters). It's a pretty substantial trace, and it should be pretty useful for investigating the various pressures a key bioinformatics application places on a CMT processor (how do the various threads interact in the shared L2 cache, what are the off-chip bandwidth requirements, how effectively do the threads co-exist on a single core, etc.). If time permits, I will try and post some of these details in the coming days.

While BLAST scaling has been examined before, this has typically been in the context of MP systems, where each thread has its own L1 and L2 caches and inter-thread communication can be costly. With the introduction of CMT processors, it may be that we could achieve significant improvements in BLAST coding by taking into account that multiple threads now share the same L2 cache (and, in the case of Niagara, even the L1 caches are shared between 4 threads). Just a thought.


Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), which is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.

