NCBI BLAST Scaling on Sun Fire T2000
By sprack on Apr 03, 2006
As promised, here are some scaling results for NCBI BLAST on a T2000. The first plot focuses on reducing the latency of a single query: it shows performance scaling from a run with 4 threads (-a 4) to a run with 32 threads (-a 32) [nucleotide query (2160 letters), blastn, nt database]. For the 4T run, the threads are bound to a single core using psrset, so this run serves as the 1-core baseline. The scaling is pretty good, but you can't avoid Amdahl's law, and we see around 6.8X moving from 1 core to 8 cores. This scaling is pretty typical, although some queries scale better and some scale worse. The final moments of a BLAST run are single-threaded; during this time the final scoring for the best matches is computed and the output file is generated (as a result, scaling improves if the number of matches reported is reduced from the default). It should be mentioned, however, that there is no reason why much of this final single-threaded portion of the run could not be multi-threaded, and the move to multi-core processors is starting to justify the incremental parallelization of the final serial phase.
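As a rough illustration of the Amdahl's law point, the 6.8X figure can be turned into an effective serial fraction. This is only a back-of-the-envelope sketch: it assumes the 1-core run is the baseline, that the core count is the only thing that changes, and that all lost scaling comes from serial work rather than from memory or I/O contention. The function names are made up for this example.

```python
def serial_fraction(speedup: float, cores: int) -> float:
    """Solve Amdahl's law, S = 1 / (s + (1 - s) / N), for the serial fraction s.

    Assumption: every bit of scaling loss is attributed to serial work.
    """
    return (cores / speedup - 1.0) / (cores - 1.0)


def amdahl_speedup(s: float, cores: int) -> float:
    """Predicted speedup on `cores` cores for a serial fraction s."""
    return 1.0 / (s + (1.0 - s) / cores)


# Figures from the post: about 6.8X going from 1 core to 8 cores.
s = serial_fraction(6.8, 8)
print(f"effective serial fraction: {s:.2%}")
print(f"predicted speedup on 8 cores: {amdahl_speedup(s, 8):.2f}X")
```

With these inputs the effective serial fraction comes out at roughly 2.5% of the single-core run time, which is consistent with a short single-threaded phase at the end of the run being enough to hold an 8-core run below 7X.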
Alternatively, rather than running a single multi-threaded query on a T2000, it is possible to run multiple independent queries in parallel. This could be in the form of four 8T queries, eight 4T queries or, at the extreme, thirty-two single-threaded queries. The following plot illustrates the scaling, in terms of aggregate query throughput, for the single-threaded query scenario. Moving from 1 core (4 independent single-threaded queries bound to a single core) to 8 cores (32 independent single-threaded queries), throughput increases almost 8X.
A throughput-centric approach delivers almost perfect scaling going from 1 to 8 cores. Further, given the manner in which many BLAST queries are handled, this throughput approach may be perfectly acceptable: a shared server accepts queries from multiple users, resulting in an abundance of queries for parallel processing and no tight time constraints for each query (queries might be submitted via a web interface and the results emailed to the requester). If there are certain QoS levels that can't be met with single-threaded queries, it is also possible to use a hybrid solution of, say, eight 4-thread queries in parallel.
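For anyone who wants to try the throughput or hybrid setup, here is a minimal sketch of a batch launcher. It is not the harness used for the results above. It assumes the legacy `blastall` executable is on the PATH and uses its `-p`, `-d`, `-i`, `-o` and `-a` options; the helper names, the `.out` output naming and the `runner` parameter are hypothetical choices made for this example, and it does not do any processor binding (no psrset).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def blast_command(query_path, out_path, threads=1, program="blastn", database="nt"):
    """Build a blastall command line; -a sets the thread count for one query."""
    return ["blastall", "-p", program, "-d", database,
            "-i", query_path, "-o", out_path, "-a", str(threads)]


def run_batch(query_paths, concurrent_queries=32, threads_per_query=1,
              runner=subprocess.run):
    """Run independent queries side by side and return their exit codes.

    concurrent_queries * threads_per_query should not exceed the number of
    hardware threads (32 on a T2000), e.g. 32 x 1 or 8 x 4.
    `runner` is a hypothetical hook so the launcher can be tested without BLAST.
    """
    def run_one(path):
        # Assumption: results are written next to the query as <query>.out
        cmd = blast_command(path, path + ".out", threads_per_query)
        return runner(cmd).returncode

    # Worker threads only wait on child processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=concurrent_queries) as pool:
        return list(pool.map(run_one, query_paths))
```

Calling `run_batch(paths)` with the defaults gives the thirty-two single-threaded query case, while `run_batch(paths, concurrent_queries=8, threads_per_query=4)` gives the hybrid of eight 4-thread queries.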