The presentation covers some ways to utilise the new CMT systems. It also touches upon the Cool Tools initiative.
For me, one of the interesting parts of the presentation was collecting data on how applications scale in the presence of mutex locks and false sharing. In traditional multiprocessor systems both issues can result in poor scaling, and for the same underlying reason: multiple threads write to the same cacheline. Consequently, the cacheline gets bounced between processors, and each bounce is quite costly. In a CMT processor, such as the UltraSPARC-T1, the sharing happens in the common L2 cache, which is much closer to the processor cores and hence much cheaper to reach. So the impact of mutex locks and false sharing is greatly reduced.
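To make false sharing concrete, here is a minimal sketch in C with POSIX threads. It is not the code used for the presentation: the struct names, the 64-byte `LINE_SIZE`, and the iteration count are all assumptions chosen for illustration. Two threads each write to their own counter. In the packed layout the counters normally sit on the same cacheline, while in the padded layout they are a full line apart.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define LINE_SIZE 64            /* assumed cacheline size in bytes */
#define ITERATIONS 1000000L     /* writes per thread; arbitrary */

/* Counters packed together: both normally land on the same cacheline. */
struct packed_counters {
    volatile long a;
    volatile long b;
};

/* Padding places b exactly LINE_SIZE bytes after a, so the two
   counters can never share a cacheline of that size. */
struct padded_counters {
    volatile long a;
    char pad[LINE_SIZE - sizeof(long)];
    volatile long b;
};

/* Thread body: repeatedly write to a single counter. */
static void *bump(void *arg)
{
    volatile long *counter = arg;
    for (long i = 0; i < ITERATIONS; i++)
        (*counter)++;
    return NULL;
}

/* Start one writer thread per counter and wait for both to finish.
   Each counter has exactly one writer, so there is no data race. */
static void run_pair(volatile long *a, volatile long *b)
{
    pthread_t ta, tb;
    int rc_a = pthread_create(&ta, NULL, bump, (void *)a);
    int rc_b = pthread_create(&tb, NULL, bump, (void *)b);
    assert(rc_a == 0 && rc_b == 0);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
}

/* Wall-clock seconds taken by run_pair on the given counters. */
static double timed_run(volatile long *a, volatile long *b)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    run_pair(a, b);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `timed_run(&p.a, &p.b)` on a `struct packed_counters p` and then on a `struct padded_counters` instance gives two times to compare. For the comparison to be fair, the packed instance should be declared with `_Alignas(LINE_SIZE)`, so that it does not straddle a line boundary. How large the gap is depends entirely on the machine, which is the point of the argument above.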
This leads to an interesting observation. Not only does the processor have many threads ready to contribute to the performance of a multi-threaded application, it also has an architecture which reduces the traditional costs of writing multi-threaded applications.