An Oracle blog about Transactional locks

Polite busy-waiting with WRPAUSE on SPARC

Dave Dice
Senior Research Scientist

Unbounded busy-waiting is an poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads — ostensibly doing useful work and making progress — that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator which temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0" which is a high-latency no-nop. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.

Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads than are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.

Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines -- "pipeline fusion". Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. We can easily normalize the scores and compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference.

As we can see in the table below, WRPAUSE provides a good way to spin politely. And for short-term waiting it's much more efficient than parking in the kernel and potentially creating software timers for timed OS-level waiting. So we have a new facility that's as polite and effective -- with respect to sibling interference -- as is parking a thread, but that avoids the trip to the kernel and the other overheads associated with context switching. It's worth pointing out that the T=1 and T=2 scores for poll() and WRPAUSE forms are about equal because at T=1 we're leveraging both pipelines. And 3348 units of work is the approximate cycle cap for a core.

Aggregate progress
T = #worker threads
Wait Mechanism for 8-T threadsT=1T=2T=3T=4T=5T=6T=7T=8
Park thread in poll() 32653347334833483348334833483348
no-op 415 831 124316482060249729303349
RD %ccr,%g0 "pause" 14262429269228623013316232553349
PRNG step 412 829 124616702092251029303348
WRPause(8000) 32443361333133483349334833483348
WRPause(4000) 32153308331533223347334833473348
WRPause(1000) 30853199322432513310334833483348
WRPause(500) 29173070315032223270330933483348
WRPause(250) 26942864294930773205338833483348
WRPause(100) 21552469262227902911321433303348

Join the discussion

Comments ( 2 )
  • Aleksey Shipilev Wednesday, October 24, 2012

    Good stuff.

    If T4 has two pipelines, why T1 -> T2 does not increase performance in poll()-wait case? I would assume two threads are running, why their composite performance is not better?

    P.S. It would be helpful to normalize table to the maximum achievable performance with the payload (which I presume is 3348). Otherwise numbers are rather hard to read. Errors would be helpful too.

  • dave Wednesday, October 24, 2012

    Hi Aleksey,

    In the T1 vs T2 case, at T1 the #s for poll() reflect the our thread leveraging both pipelines, which is why the aggregate performance scores for T1 and T2 are about the same, although the per-thread normalized throughput for T1 higher. And 3348 units of work is the approximate cycle bound for the core.

    And I agree about normalizing the #s. I had a 2nd graph that was normalized but thought the raw aggregate #s might be more interesting.

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.