Scorching Mutexes with CoolThreads - Part 1
By pgdh on Dec 06, 2005
How could I not be excited about CoolThreads?! Regular readers of Multiple Threads will be aware of my technical white paper on Multithreading in the Solaris Operating Environment, and of more recent work making getenv() scale for Solaris 10. And there's a lot more stuff I'm itching to blog -- not least libMicro, a scalable framework for microbenchmarking which is now available to the world via the OpenSolaris site.
For my small part in today's launch of the first CoolThreads servers, I thought it would be fun to use libMicro to explore some aspects of application-level mutex performance on the UltraSPARC T1 chip. In traditional symmetric multiprocessing (SMP) configurations, mutex performance is dogged by inter-chip cache-to-cache latencies.
To applications, the Sun Fire T2000 Server looks like a 32-way monster. Indeed, log in to one of these babies over the network and you soon get the impression of a towering, floor-breaking, hot-air-blasting, mega-power-consuming beast. In reality, it's a cool, quiet, unassuming 2U rackable box with a tiny appetite!
Eight processor cores -- each with its own L1 cache and four hardware strands -- share a common on-chip L2 cache. The thirty-two virtual processors see very low latencies from strand to strand, and core to core. But how does this translate to mutex performance? And is there any measurable difference between inter-core and intra-core synchronization?
First up, I took libMicro's cascade_mutex test case for a spin. Literally! This test takes a defined number of threads and/or processes and arranges them in a ring. Each thread has two mutexes on which it blocks alternately; and each thread manipulates the two mutexes of the next thread in the ring such that only one thread is unblocked at a time. Just now, I'm only interested in the minimum time taken to get right around the loop.
The default application mutex implementation in Solaris uses an adaptive algorithm in which a thread waiting for a mutex does a short spin for the lock in the hope of avoiding a costly sleep in the kernel. However, in the case of an intraprocess mutex, the waiter will only attempt the spin as long as the mutex holder is running (there is no point spinning for a mutex held by a thread that is making no forward progress).
With 16 threads running cascade_mutex the T2000 achieved a blistering 11.9us/loop (that's less than 750ns per thread)! The Sun Fire V890, a traditional multi-chip SMP server, took a more leisurely 25.3us/loop. Clearly, mutex synchronization can be very fast with CoolThreads!
Naturally, spinning is not going to help the cascade_mutex case if you have more runnable threads than available virtual processors. With 32 active threads the V890 loop time rockets to 850us/loop, whereas the T2000 (with just enough hardware strands available) manages a very respectable 32.4us/loop. Only when the T2000 runs out of virtual processors does the V890 catch up (thanks to its better single-thread performance). At 33 threads the T2000 jumps to 1140us/loop versus 900us/loop on the V890.
libMicro's cascade_mutex case clearly shows that UltraSPARC T1 delivers incredibly low-latency synchronization across 32 virtual processors. Whilst this is a good thing in general, it is particularly good news for the many applications that use thread worker pools to express their concurrency.
In Part 2 we will explore the small difference cascade_mutex sees between strands on the same core and strands on different cores. Stay tuned!