Next-generation CMT processors
By sprack on Dec 06, 2005
It's amazing to see the pace at which Chip Multithreaded (CMT) processors are evolving. It hasn't been that long since we were all obsessed with high-frequency, super-complex out-of-order (OOO) processors. Diminishing returns on performance and outlandish power requirements soon put an end to a number of these chip projects and hastened the industry-wide move toward the CMT design point.
In the last few years alone we have seen CMT processors evolve through two generations. The first simply put two uniprocessors on the same die, with nothing shared between the two cores but the off-chip resources. The second moved to a more integrated design point, in which the cores share an on-chip level-2 cache (there are a number of obvious reasons why this sharing could be beneficial).
With the release of the UltraSPARC T1 (code-named Niagara), the next generation of CMT processors is starting to arrive. Rather than just reusing uniprocessor designs, we are seeing processors designed from the start for a CMT design point. In the case of the UltraSPARC T1, this design point is commercial server workloads, such as databases, web servers, and application servers.
Server workloads are broadly characterized by high levels of thread-level parallelism (TLP), low instruction-level parallelism (ILP), and large working sets. The potential for further improvements in overall single-thread CPI is limited, but significant performance gains can be observed by leveraging the available TLP: supporting many simultaneous hardware threads of execution through a combination of multiple cores (chip multiprocessing, CMP) and multithreading (MT).
Sun's UltraSPARC T1 processor provides support for 32 hardware threads using eight 4-way vertically threaded cores. In comparison to other server processors, each of these hardware threads is fairly modest (lower frequency, smaller issue width, etc.). However, the aggregate performance of the 32 hardware threads that make up the UltraSPARC T1 is significant, often providing several times the performance of existing dual-core designs. And, given the almost cubic dependence between core frequency and power consumption, it does so at a fraction of the power of other solutions!
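To make the frequency/power trade-off concrete, here is a back-of-the-envelope sketch in Python. It is not a model of the UltraSPARC T1: it assumes power scales exactly with the cube of frequency, that throughput scales linearly with frequency, and that the workload is perfectly parallel. It also ignores static power, caches, and memory. The function names and the example numbers (8 cores at half frequency) are purely illustrative.

```python
def relative_power(freq_ratio, exponent=3.0):
    """Power of one core relative to a full-frequency core.

    Assumes power ~ frequency ** exponent (cubic by default), which is
    only a rough rule of thumb.
    """
    return freq_ratio ** exponent


def chip_estimate(cores, freq_ratio, exponent=3.0):
    """Return (throughput, power), both relative to ONE full-frequency core.

    Assumes throughput scales linearly with frequency and core count,
    i.e. a perfectly parallel workload.
    """
    throughput = cores * freq_ratio
    power = cores * relative_power(freq_ratio, exponent)
    return throughput, power


# Hypothetical example: 8 simple cores, each at half the frequency.
throughput, power = chip_estimate(cores=8, freq_ratio=0.5)
print(f"throughput x{throughput:.1f}, power x{power:.2f}")
```

Under these simplifying assumptions, eight half-frequency cores deliver four times the throughput of one full-frequency core within the same power budget, which is the intuition behind trading per-thread speed for thread count.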
In Sun's Advanced Processor Architecture (APA) group, we have been focusing on next-generation CMT processors for some time. We discuss some of the opportunities and challenges associated with this design trend in a recent publication at the International Symposium on High Performance Computer Architecture (HPCA'05), which can be found here along with the slide set for the presentation.
Another topic we have been investigating is how well the server CMT design point fits with other classes of application. The results have been encouraging, with CMT server processors delivering great performance at a fraction of the power associated with more traditional processors.
One interesting application space is bioinformatics. In this space, significant effort is expended comparing DNA, RNA, or protein queries against large (multi-GB) databases of sequences. A variety of applications have been developed to identify similarities between the query and the sequences in the database. Probably the best-known such application is BLAST.
These databases are composed of literally millions of different sequences, so there is an abundance of available parallelism. Most of these applications, including BLAST, have been coded as multithreaded applications and have been widely demonstrated to scale well.
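The kind of parallelism described here can be sketched in a few lines of Python. This is not how BLAST works internally: the scoring function below is a toy (a count of shared 3-letter substrings), and the function names, the round-robin chunking, and the sample sequences are all made up for illustration. The point is only that each database sequence can be scored independently, so the database splits cleanly across workers.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, subject, k=3):
    """Toy similarity: number of distinct k-mers shared (NOT BLAST scoring)."""
    return len(kmers(query, k) & kmers(subject, k))


def search_chunk(query, chunk):
    """Score the query against every (name, sequence) pair in one chunk."""
    return [(name, score(query, seq)) for name, seq in chunk]


def parallel_search(query, database, workers=4):
    """Split the database round-robin into chunks and score them concurrently."""
    chunks = [database[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: search_chunk(query, c), chunks))
    hits = [hit for part in parts for hit in part]
    # Best-scoring sequences first.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical miniature "database" of named sequences.
database = [("a", "ACGTACGT"), ("b", "TTTTTTTT"), ("c", "ACGTTTTT")]
for name, s in parallel_search("ACGTAC", database):
    print(name, s)
```

Note that in CPython the global interpreter lock keeps threads from speeding up a CPU-bound loop like this one, so the sketch shows the partitioning structure rather than a real performance gain.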
We have been experimenting with T2000 systems running both multithreaded and single-threaded BLAST configurations and have found that performance scales almost perfectly with the number of cores utilized: the performance observed with 8 cores (32 threads) is almost 8X the performance observed with 1 core (4 threads).
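For readers who want to put a number on "almost perfectly," scaling is usually summarized as speedup and parallel efficiency. The sketch below shows the arithmetic in Python. The timings are hypothetical placeholders, not our measured results, and the function names are just for illustration.

```python
def speedup(baseline_time, scaled_time):
    """How many times faster the scaled run is than the baseline run."""
    return baseline_time / scaled_time


def parallel_efficiency(speedup_value, resource_ratio):
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfect."""
    return speedup_value / resource_ratio


# Hypothetical timings in seconds (NOT measured data):
one_core_time = 800.0    # 1 core, 4 threads
eight_core_time = 105.0  # 8 cores, 32 threads

s = speedup(one_core_time, eight_core_time)
e = parallel_efficiency(s, resource_ratio=8)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

An efficiency close to 1.0 is what "almost 8X on 8 cores" means. Anything well below that would point to a shared bottleneck such as memory bandwidth or lock contention.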
Looks like the T2000 could be a nice fit in the bioinformatics space. Stay tuned...