Next-generation CMT processors

It's amazing to see the pace at which Chip Multithreaded (CMT) processors are evolving. It hasn't been that long since we were all obsessed with high-frequency, super-complex out-of-order (OOO) processors. Diminishing returns on performance and outlandish power requirements soon put an end to a number of these chip projects and hastened the industry-wide move toward the CMT design point.

In the last few years alone we have seen CMT processors evolve through two generations. The first simply put two uniprocessors on the same die, with the two cores sharing nothing but the off-chip resources. More recently we have moved to a more integrated design point, where the cores share an on-chip level-2 cache (there are a number of obvious reasons why this sharing can be beneficial).

With the release of the UltraSPARC T1 (code-named Niagara), the next generation of CMT processors is starting to arrive. Rather than just reusing uniprocessor designs, we are seeing processor designs tailored to a CMT design point. In the case of the UltraSPARC T1, this design point is commercial server workloads, such as databases, web servers, and application servers.

Server workloads are broadly characterized by high levels of thread-level parallelism (TLP), low instruction-level parallelism (ILP) and large working sets. The potential for further improvements in overall single-thread CPI is limited, but significant performance gains can be observed by leveraging the available TLP -- providing support for many simultaneous hardware threads of execution via a combination of support for multiple cores (Chip Multiprocessors (CMP)) and Multi-Threading (MT).

Sun's UltraSPARC T1 processor provides support for 32 hardware threads using eight 4-way vertically threaded cores. In comparison to other server processors, each of these hardware threads is fairly modest (lower frequency, smaller issue width, etc.). However, the aggregate performance of the 32 hardware threads that comprise the UltraSPARC T1 is significant, often providing several times the performance of existing dual-core designs. And, given the almost cubic dependence between core frequency and power consumption, it does so at a fraction of the power of other solutions!
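To make the frequency/power trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes power scales exactly with the cube of frequency (the post only says "almost cubic") and ignores leakage, core complexity, and everything outside the cores. The function names and the half-frequency example are illustrative assumptions, not T1 measurements.

```python
def hardware_threads(cores, threads_per_core):
    """Total hardware threads on a chip with identical cores."""
    return cores * threads_per_core


def relative_power(freq_ratio, exponent=3.0):
    """Power of one core relative to a baseline core, assuming
    power ~ frequency ** exponent (exponent=3 is an assumption)."""
    return freq_ratio ** exponent


def chip_relative_power(cores, freq_ratio, exponent=3.0):
    """Power of `cores` identical cores, relative to ONE baseline core
    running at full frequency (freq_ratio = 1.0)."""
    return cores * relative_power(freq_ratio, exponent)


if __name__ == "__main__":
    # The T1 configuration described above: 8 cores, 4 threads each.
    print(hardware_threads(8, 4))
    # Hypothetical: 8 cores at half the baseline frequency.
    print(chip_relative_power(8, 0.5))
```

Under this idealized model, eight cores running at half frequency fit in the same power budget as a single full-frequency core, which is the intuition behind trading clock speed for thread count.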

In Sun's Advanced Processor Architecture (APA) group, we have been focusing on next-generation CMT processors for some time. We discuss some of the opportunities and challenges associated with this design trend in a recent publication at the International Symposium on High Performance Computer Architecture (HPCA'05), which can be found here along with the slide set for the presentation.

Another topic we have been investigating is how well the server CMT design point fits with other classes of application. The results have been encouraging, with CMT server processors delivering great performance at a fraction of the power associated with more traditional processors.

One interesting application space is that of Bioinformatics. In this space, significant effort is expended comparing DNA, RNA or protein queries against large (multi-GB) databases of sequences. A variety of different applications have been developed to identify similarities between the query and the sequences in the database. Probably the best known such application is BLAST.

These databases are composed of literally millions of different sequences, so there is an abundance of available parallelism. Most of these applications, including BLAST, have been coded as multithreaded applications and have been widely demonstrated to scale well.
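The parallelism here comes from the fact that each database sequence can be scored against the query independently. The sketch below illustrates that structure only. It is not BLAST: the scoring function (a count of shared 3-letter substrings) is a toy stand-in, and all names and sequences are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, subject, k=3):
    """Toy similarity: number of distinct k-mers shared by both sequences."""
    return len(kmers(query, k) & kmers(subject, k))


def search(query, database, workers=4):
    """Score every database sequence independently, one task per sequence.

    Returns (index of best-scoring sequence, list of all scores).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda subject: score(query, subject), database))
    best = max(range(len(database)), key=lambda i: scores[i])
    return best, scores


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTTTTTT", "ACGTTTTT"]  # hypothetical sequences
    print(search("ACGTAC", db))
```

Because no task depends on another, the work divides cleanly across however many hardware threads are available. Note that in CPython the global interpreter lock prevents pure-Python threads like these from running CPU-bound work in parallel, so this shows the decomposition rather than a real speedup.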

We have been experimenting with T2000 systems running both multithreaded and single-threaded BLAST configurations and have found that performance scales almost perfectly with the number of cores utilized, i.e. the performance observed with 8 cores (32 threads) is almost 8X the performance observed using 1 core (4 threads).
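For readers who want to quantify "almost perfectly", the usual measures are speedup and parallel efficiency. The sketch below uses hypothetical run times, not our measured T2000 results.

```python
def speedup(baseline_time, scaled_time):
    """How many times faster the scaled run is than the baseline run."""
    return baseline_time / scaled_time


def efficiency(baseline_time, scaled_time, resource_factor):
    """Fraction of ideal (linear) scaling achieved.

    resource_factor is how many times more cores the scaled run used,
    e.g. 8 when going from 1 core to 8 cores.
    """
    return speedup(baseline_time, scaled_time) / resource_factor


if __name__ == "__main__":
    # Hypothetical timings in seconds: 1 core vs. 8 cores.
    one_core, eight_cores = 80.0, 10.5
    print(round(speedup(one_core, eight_cores), 2))
    print(round(efficiency(one_core, eight_cores, 8), 3))
```

An efficiency close to 1.0 corresponds to the near-linear scaling described above.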

Looks like the T2000 could be a nice fit in the Bioinformatics space. Stay tuned....


I'll be very interested to see if Sun can scale the clock on this device. This will, IMHO, be a good indicator of whether Sun's CMT product has a solid, long-term future. By comparison, Sun's (or should I say TI's) ability to scale the clock on its UltraSPARC III chips could best be described by a single word: miserable! :(

So here is my expectation of good clock scaling capability for the T1:
- first 3 months: 15% increase in clock frequency
- first 6 months: 22%
- first 12 months: 50%
- first 24 months: 85%

At less than the 18 month mark we should have an announcement of a followup, new design, next generation T1 product. It should have the following broad characteristics:
- on-die cache is 2x the size of the T1's
- # of cores increased by 50%
- process technology is the next smallest preferred size [1]
- launch clock = 1.5x launch clock of the T1
- first delivery in production quantities < 24 months after the release of the T1

After that, the new generation part should follow the same, or better, clock scaling schedule outlined above for the T1.

Note [1]: the T1 is 130nm IIRC. So the followup part should use 90nm technology, the 3rd generation part 65nm, etc.

Posted by Al Hopper on January 31, 2006 at 11:14 PM PST #


Posted by the occasional blog on August 23, 2006 at 10:48 AM PDT #


Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), which focuses on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.

