By sprack on Mar 19, 2009
An interesting new book, The Developer's Edge, brings together a good collection of technical articles harvested from the Sun blogosphere. Naturally, it includes some material on T2 crypto. The book can be found here.
In recent posts I have mentioned the performance of Bioinformatics applications on the T1000 and T2000 (Niagara boxes). However, I never got around to mentioning Sun's Try and Buy program:
At the risk of sounding like a sales pitch, this is a pretty interesting program that essentially allows customers to experiment with a Niagara box for 60 days free of charge. At the end of the 60 days, customers can either opt to buy the system or return it to Sun; I believe Sun even pays for the return shipping. If you are interested in investigating the performance of Bioinformatics applications on Niagara, this represents a great opportunity to do so.
As promised, here are some scaling results for NCBI BLAST on a T2000. The first plot focuses on reducing the latency of a single query: performance scaling from a run with 4 threads (-a 4) to a run with 32 threads (-a 32) [nucleotide query (2160 letters), blastn, nt database]. For the 4-thread run, the threads are bound to a single core using psrset. The scaling is pretty good, but you can't avoid Amdahl's law, and we see around 6.8X moving from 1 core to 8 cores. This scaling is pretty typical, although some queries scale better and some scale worse. The final moments of a BLAST run are single-threaded; during this time, the final scoring for the best matches is computed and the output file is generated (as a result, scaling improves if the number of matches reported is decreased from the default). It should, however, be mentioned that there is no reason why much of this final single-threaded portion of the run could not be multi-threaded, and the move to multi-core processors is starting to justify incrementally parallelizing this final serial phase.
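As a quick sanity check on that 6.8X number, Amdahl's law, speedup = 1 / (s + (1 - s)/N), can be solved for the implied serial fraction s. This little back-of-the-envelope sketch is my own, derived from the measured figure above, not from the original runs:

```shell
# Solve Amdahl's law for the serial fraction s, given the observed
# 6.8X speedup (S) moving from 1 core to 8 cores (N):
#   S = 1 / (s + (1 - s)/N)  =>  s = (1/S - 1/N) / (1 - 1/N)
s=$(awk 'BEGIN { S = 6.8; N = 8; printf "%.4f", (1/S - 1/N) / (1 - 1/N) }')
echo "implied serial fraction: $s"    # roughly 2.5% of the run is serial
```

In other words, even a ~2.5% serial tail is enough to pull 8-core scaling down from 8X to 6.8X, which is why parallelizing the final scoring/output phase starts to look worthwhile.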
Alternatively, rather than running a single multi-threaded query on a T2000, it is possible to run multiple independent queries in parallel. This could take the form of four 8-thread queries, eight 4-thread queries, or, at the extreme, thirty-two single-threaded queries. The following plot illustrates the scaling, in terms of aggregate query throughput, for the single-threaded scenario. Moving from 1 core (4 independent single-threaded queries bound to a single core) to 8 cores (32 independent single-threaded queries), throughput increases almost 8X.
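For the curious, the 32-way throughput configuration can be sketched roughly as follows. This is a hypothetical illustration, not the script used for the measurements: it just emits the command line for each of the 32 single-threaded blastn queries, binding 4 queries to each of the 8 cores. It assumes processor sets 0-7 (one per core) were created beforehand with psrset -c, and the query file names are made up for the example:

```shell
# Emit one command line per query: 32 independent single-threaded (-a 1)
# blastn queries against nt, 4 queries bound to each of 8 processor sets.
i=0
while [ $i -lt 32 ]; do
  core=$((i / 4))   # queries 0-3 on set 0, 4-7 on set 1, ... 28-31 on set 7
  echo "psrset -e $core blastall -p blastn -d nt -a 1 -i query$i.fa -o query$i.out &"
  i=$((i + 1))
done
```

Each emitted line runs one query in the background inside its processor set; a trailing wait would block until all 32 complete.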
Leveraging a throughput-centric approach delivers almost perfect scaling going from 1 to 8 cores. Further, given the manner in which many BLAST queries are handled, this throughput approach may be perfectly acceptable: a shared server that accepts queries from multiple users (queries might be submitted via a web interface and the results emailed to the requester) has an abundance of queries available for parallel processing and no tight time constraints on any individual query. It is also possible to leverage a hybrid solution of, say, eight 4-thread queries in parallel if certain QoS levels can't be met with single-threaded queries.
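The hybrid configuration mentioned above follows the same pattern; again, a hypothetical sketch with illustrative file names, assuming one pre-created processor set per core:

```shell
# Eight 4-thread (-a 4) queries, one bound to each of the 8 cores,
# trading some aggregate throughput for lower per-query latency.
q=0
while [ $q -lt 8 ]; do
  echo "psrset -e $q blastall -p blastn -d nt -a 4 -i query$q.fa -o query$q.out &"
  q=$((q + 1))
done
```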
As I touched on briefly in my last entry, we have looked at the scaling of a variety of different bioinformatics applications on Niagara (T1). We have seen great scaling, whether from a single 1-thread job to a single 32-thread job, or to 32 1-thread jobs running in parallel. I have some plots that show this nicely, and I will try to post them here in the next couple of days.
Following from this, I just wanted to quickly mention the availability of a 32-way BLAST trace. The trace leverages Sun's RST format (more details here) and can be downloaded via:
The trace is for a protein query (1887 letters), using the blastp program, against a Non-Redundant Protein database (2,244,936 sequences; 757,978,433 total letters). It is a pretty substantial trace, and it should be quite useful for investigating the various pressures a key bioinformatics application places on a CMT processor (how the various threads interact in the shared L2 cache, what the off-chip bandwidth requirements are, how effectively the threads co-exist on a core, etc.). If time permits, I will try to post some of these details in the coming days.
While BLAST scaling has been examined before, this has typically been in the context of MP systems, where each thread leverages a separate L1 and L2 cache and inter-thread communication can be costly. With the introduction of CMT processors, it may be that significant improvements could be achieved in BLAST coding by taking into account that multiple threads now share the same L2 cache (and, in the case of Niagara, even each L1 cache is shared between 4 threads). Just a thought.
Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), which is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.