Wednesday Apr 22, 2015

Locks with LIFO admission order

Why would we ever want a lock with a LIFO admission policy?

First, a LIFO lock provides a useful measure of real-world scalability. Lets say we have a set of threads that each iterate as follows : acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as per-thread iteration counts. Amdahl's law says the maximum ideal speedup relative to a single thread should be (N+C)/C. We can run our experiments, varying the thread count, measure aggregate throughput and see compare to see how close we come to Amdahl's bound. Assuming we have a heterogeneous system and ignoring any potential superlinear effects, the observed peak speedup will be capped by Amdahl's bound. And if we use a fair FIFO lock, such as MCS, the threads will all have approximately equal completion counts.

It's worth noting that Amdahl's law is sometimes misapplied to locks and critical sections. In the classic Amdahl model, during the serial phase no other threads may be executing concurrently, while with locks, when one threads is in the critical section other threads may be executing concurrently in their critical sections. That is, classic Amdahl's law applies to barriers. See also Gustafson's law, Gunther's universal scaling law, and in particular Eyerman's model. Critically, though, the maximum speedup bounds still hold.

Now lets say we switch to a LIFO lock. Ideally, the aggregate throughput will be the same as we saw with the FIFO lock. If N=30 and C=10, then the ideal speedup is 4X. If we run with 10 threads under a LIFO lock, when we examine the distribution of per-thread completion counts we expect to see 4 threads dominate with about equal performance, and 6 threads should have starved completely. This gives us another non-analytic empirical way to guage the maximum speedup over a lock. Put another way, can we figure out how many threads we can "squeeze" or pack into a contended lock before we hit saturation. We keep increasing T until some threads show evidence of starvation. This lets us discern the N/C ratio. Of course we could try to derive the ratio using FIFO locks, varying T, and using Amdahl's law, but in practice there are quite a few practical confounding factors. The LIFO approach gives us a fairly direct reading of the number of threads that will "fit" before we reach over-saturation.

LIFO locks are also useful in their own right. While they are deeply unfair, they work very well with spin-then-park waiting strategies. If we imagine the lock as implemented with a stack of waiting threads, threads near the head are mostly likely to be spinning, and are also most likely to be next granted the lock. If the lock is over-saturated, then under a LIFO policy, ownership will circulate over just a subset of the contending threads. In turn this can reduce cache pressure and yield benefits arising from thermal and energy bounds. Of course we have to take measures to ensure long-term eventual fairness, but many locks intentionally trade-off short-term fairness for throughput. (See our "Cohort" locks, for example).

A possibly systemic downside to LIFO locks is that arrivals and departures may need to access the same lock metadata, creating an acute coherence hot-spot. With a contended MCS lock, for instance, an unlock operation doesn't need to access the "tail" field.

I wondered if there was a LIFO analog to the classic FIFO ticket lock and put the question to my colleagues in Oracle Lab's Scalable Synchronization Research Group, and collected some of the designs, which I'll report below. It's an interesting exercise and puzzle, and hard to resist for concurrency folks. Alex Kogan, Victor Luchangco, Tim Harris, Yossi Lev and I contributed. Any mistakes are mine.

The most basic LIFO algorithm I could think of was to implement an explicit stack of waiting threads with a central "head" pointer which points to the most recently arrived thread. The approach is pretty obvious and yields a true LIFO admission order. Expressed in a pidgin Python/C++ dialect and assuming atomic<T> for all shared variables, the following sketch describes that form. The lock is very similar to the Thread::MuxAcquire() and ::MuxRelease() primitives that I wrote for the HotSpot JVM. (Those are internal locks used by JVM to get over a bootstrapping phase where the normal native C++ HotSpot Monitor:: and Mutex:: classes aren't yet initialized). We call this form "E3".

(I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).


class StackElement :
StackElement * Next
int Wait

class LIFOLock :
// Encoding for Head field :
// 0 = lock is free
// 1 = locked but no waiting threads
// T = locked : T refers to stack of waiting threads
StackElement * Head

Acquire(L) :
StackElement A ; // on-stack auto-variable
auto h = L->Head
if h == 0 :
h = CAS (&L->Head, 0, 1)
if h == 0 : return // uncontended fast-path acquisition
// inopportune interleaving -- some other thread mutated L->Head
// in the LD-CAS window above. We lost the race

// Apparent contention ...
// Doorway phase : use CAS to push our element A onto the stack
assert h != 0
A.Next = h & ~1
A.Wait = 1
auto v = CAS (&L->Head, h, &A)
if v == h : break
if v == 0 :
v = CAS (&L->Head, 0, 1)
if v == 0 : return
h = v

// Waiting phase
// the lock provides local spinning
while A.Wait != 0 : Pause()
assert L->Head != 0

Release(L) :
auto h = L->Head
assert h != 0
if h == 1
h = CAS (&L->Head, 1, 0)
if h == 1 : return // simple uncontended release
// the stack can only grow while the lock is held ...

// The lock is contended
// try to pop the head of the stack.
// This is the most recently arrived thread
assert h != 0 && h != 1
// Note that we're using CAS to push and pop elements
// Normally that would leave us exposed to ABA problems.
// But as there can be only one thread trying to pop -- that being the owner --
// we have multiple-push-single-pop concurrency model and are thus not vulnerable
// to ABA pathologies. The lock itself restricts concurrency and prevents
// multiple concurrent pop operations.
auto next = h->Next
auto v = CAS (&L->Head, h, next)
if v == h :
assert h->Wait == 1
h->Wait = 0
h = v

// Variation :
// Note that if the CAS fails we are guaranteed to have at least 2 elements
// on the stack. We can splice out the element that follows the element
// identified by "v" and pass ownership to the associated thread.
// The thread we pick will either be the same as the head at the time
// we fetched L->Head in Release(), or some thread that arrived afterward.
// Depending on how liberal your interpretation, this is a plausibly LIFO ordering.
// This approach yields a constant-time Release() operator, with no loops.
// As constituted above, the policy is "strict" LIFO, however.

The next few variations forgo explicit nodes, and as such, we'll have global spinning. The broad inspiration for this family is the CLH lock, where a thread knows the adjacent thread on the queue, but the queue is implicit.

We call the following "E5" because it was the 5th experimental version.


Class LIFOLock
int Head
int NextToRun

enum DistinguishedValues
U = 0 // unlocked
H = 1 // Locked, no other threads waiting
P = 2 // ownership being passed
Invalid = 3 //
IDBias = 4 // offset so thread ID values don't intersect above

Acquire(L) :
auto tkt = Self->UniqueID + IDBias
assert tkt != U && tkt != P && tkt != H
auto h = L->Head ;
for :
if h == P :
Pause() ; h = L->Head; continue ;
auto newh = (h == U) ? H : tkt
auto v = CAS (&L->Head, h, newh)
if v == h : break
h = v
if h != U :
while L->NextToRun != tkt : Pause()
L->NextToRun = Invalid
assert L->Head == P
L->Head = h

Release (L) :
auto h = L->Head ;
for :
assert h != P && h != U
auto newh = (h == H) ? U : P ;
auto v = CAS (&L->Head, h, newh) ;
if v == h : break
h = v
if h != H : L->NextToRun = h

The first thing to notice is that the "P" encoding can result in two waiting phases in Acquire() : arriving threads may first wait while Head == P and then for their specific turn. The interlock protocol to hand-off feels rather synchronous. P state is effectively a lock that freezes out arrivals until the successor manages
to depart. In addition, a group of threads could be waiting while Head == P, but subsequently "enqueue" themselves in an order that differs from their arrival order, so we don't have strict pedantic FIFO. (See also FCFE = First-Come-First-Enabled).

We can streamline E5 slightly, yielding E5B :


Acquire(L) :
auto tkt = Self->UniqueID + IDBias
assert tkt != U && tkt != P && tkt != H
auto h = L->Head
for :
if h == P :
Pause() ; h = L->Head; continue ;
if h == U :
auto v = CAS (&L->Head, U, H)
if v == U : return
h = v
// Case : H or T = most-recently-arrived thread
auto v = CAS (&L->Head, h, tkt)
if v == h : break
h = v
while L->NextToRun != tkt : Pause()
L->NextToRun = U
assert L->Head == P
L->Head = h

Release (L) :
auto h = L->Head ;
if h == H :
// uncontended path
auto v = CAS (&L->Head, H, U) ;
if v == H : return
h = swap (&L->Head, P)
assert h != U && h != H && h != P
L->NextToRun = h

The next version, E6, eliminates the P encoding and switches to a seqlock-like lock for the hand-off transition. The lock instance requires just a single "Next" field. When the low-order bit of Next is set, arrivals are frozen during the hand-off. E6 appears the fastest on non-NUMA systems, possibly because the paths are relatively tight.


Acquire(L) :
auto w = L->Next ;
if w & 1 :
Pause(); continue
auto v = CAS (&L->Next, w, w+2)
if w != v :
w = v ; continue
if w != 0 :
while L->Next != (w+1) : Pause()
L->Next = w

Release(L) :
auto w = L->Next ;
assert w != 0 && (w & 1) == 0
if w == 2:
auto v = CAS (&L->Next, 2, 0) ;
if v == 2 : return ;
FetchAdd (&L->Next, -1) // set to odd, locked for handoff

For E7 we revert to using an "inner lock" to protect an explicit stack of waiting threads. An MCS or CLH lock work nicely for that purpose. E7 provides local spinning and, depending on the inner lock implementation, is true FIFO. We use an encoding of Head == 1 to indicate the lock is held but no threads are waiting.


Acquire(L) :
CLHAcquire (&L->Inner)
Thread * h = L->Head
if h == NULL :
L->Head = 1 ;
CLHRelease (&L->Inner) ;
Self->Grant = 0
Self->Next = h
L->Head = Self
CLHRelease (&L->Inner)
while Self->Grant == 0 : Pause() ;

LockRelease (L) {
CLHAcquire (&L->Inner)
Thread * h = L->Head
assert h != 0
if h == 1 :
L->Head = 0 ;
CLHRelease (&L->Inner)
else :
L->Head = h->Next
CLHRelease (&L->Inner)
assert h->Grant == 0
h->Grant = 1

For E8 we try to make greater user of atomic fetch-and-add operators. The lock contains "Ticket" and "Admit" fields.


Acquire(L) :
auto t = FetchAdd (&L->Ticket, 1) ;
if t != 0 :
for :
while L->Admit != t : Pause()
// Adjudicate among set of threads with same ticket "depth" value
// Admits some non-determinism because of race
auto v = SWAP (&L->Admit, -1) ;
assert v == t || v == -1
if v != -1 : break ;

Release(L) :
// Claim : t value in LockRelease() must be >= t value
// in corresponding LockAcquire() invocation
auto t = FetchAdd (&L->Ticket, -1) ;
assert t > 0
if --t != 0 L L->Admit = t ;

E9 uses a latch -- really a lock that allows asymmetric acquire and release. Specifically, if thread T1 acquires the latch then T2 may subsequently release the latch. (This is precisely the "thread-oblivious" property required by top-level cohort locks). For our purposes we can use a ticket lock. Our lock structure contains an inner ticket lock and Depth and Admit fields.


Acquire (L) :
TicketAcquire (L->TicketLock)
auto d = L->Depth ++ ;
TicketRelease (L->TicketLock)
if d != 0 :
while L->Admit != d : Pause()
L->Admit = -1 ;
TicketRelease (L->TicketLock)

Release (L) :
TicketAcquire (L->TicketLock)
auto d = -- L->Depth ;
assert d >= 0
L->Admit = d
if d == 0 :
TicketRelease (L->TicketLock)

Those were the most interesting variations. We found the exercise rather fun. And I'm sure there are lots of wonderfully clever ideas that we missed.

Monday Apr 20, 2015

waiting policies for locks : spin-then-park

I thought I'd collect a few notes on the topic of waiting policies in the context of locks. Lock algorithms in the literature are usually implemented via unbounded spinning. That's fine for an academic paper, but in practice it's almost never viable to use unbounded spinning. More typically, we'll use a spin-then-park policy where the spin phase is bounded, and a thread reverts to parking as necessary. In theory we can also yield occasionally while spinning, but yielding doesn't replace the need to park. (In practice, yielding is often a fool's errand, particularly with modern schedulers where the semantics of the operation have
been considerably weakened over time).

Parking is expected to deschedule the caller until it is unparked. This frees up the CPU for other eligible ready threads. It's also friendly and "polite" to siblings on the same core that share the pipelines, and for thermal/energy caps and turbo-mode. (MONITOR-MWAIT addresses some of those concerns, but the waiting thread still occupies a CPU). The interface to park-unpark is simple. A thread T1 calls park() and is descheduled until some thread T2 calls unpark(T1). If T2 happens to call unpark(T2) before T1 parks, then T1's park() operation will return immediately. You can conceptualize of the implementation as a per-thread restricted-range semaphore. Park() is also allowed to return spuriously, so the caller is expected to use it in a loop. A simple litmus test for correct park-unpark usage is that the application should still work if park and unpark were implemented as no-ops, although we'd be reduced to degenerate spinning. I've used the park-unpark interface in the JVM for about a decade, and exported to Java-land for use in java.util.Concurrent (JUC). On Solaris it's easy to implement park-unpark with a per-thread lock-condvar-flag triple. On Linux you can opt to use futexes directly. Per-thread pipe pairs also yields a good implementation, although it consumes too many file descriptors (handles). See also benaphores. Redundant unpark() operations are benign and are expected to be cheap.

A consequence of the "point-to-point" park-unpark interface -- where thread A unparks thread B by directly naming B -- is that we need to have explicit list of threads waiting on lock. Lock-free approaches come in handy for list management.

A park() implementation will typically spin for a short period locally before it has to revert to the kernel. (Half the round-trip context switch time is the norm, and is 2-competitive. See Karlin et. al). This constitutes local spinning which is relatively benign in terms of induced coherence traffic. This gives us a spin-then-park waiting policy. Park() also has a variation with a timeout.

Now lets shift to using park-unpark with classic lock algorithms. (I'll assume our context is C/C++ instead of Java, btw). With MCS, it's trivial and obvious to adapt the algorithm to use parking. I'll typically add a thread reference field to the MCS QNode structure. (I've yet to try it, but I think the same can be done with the "K42" MCS variant). Broadly, queue-based locks with local spinning are amenable to conversion to spin-then-park. A ticket lock, in comparison, isn't a good candidate. CLH requires a bit more finesse, so I'll provide that here for illustration.

As an aside, MCS with spin-then-park waiting is acceptable in some circumstances, but should be used with caution. Recall that MCS is strict FIFO-FCFS. (All strict FIFO locks also provide succession by direct handoff of ownership). If the critical section length or thread count is such that the threads at the tail of the MCS queue start to park, then you'll spend all a large fraction of time in the kernel in voluntary context switching, which is quite expensive. Specifically, the critical path includes kernel overheads to wake a successor. Recently arrived threads at the tail are spinning, while those at the head -- the next ones to be granted ownership -- may be parked in the kernel, which is exactly what we don't want. The performance inflection point where this effect manifests is rather abrupt as we increase contention. Because waiting threads ultimately need to be able to park, and because parking with MCS can toxic to performance, we tend not to find this combination in production code.

(I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).

A classic pure-spinning CLH algorithm might appear as follows in a pidgin
python/C++ dialect :


Acquire(L) :
auto n = QNodeAllocate()
n->Locked = 1
auto prv = swap (&L->Tail, n)
while prv->Locked != 0 : Pause
QnodeFree (prv)
L->Owner = n

Release(L) :
L->Owner->Locked = 0

The fields should be obvious from the usage, so I've skipped the structure

(I'll take a quick digression to note the duality between queues and mutual exclusion locks. It's usually trivial to convert a lock-free MPSC queue -- multiple-producer-single-consumer -- into a lock. Threads in lock() arrive and enqueue an "acquire" request upon which they wait. Only the owner can dequeue at unlock()-time, thus the "SC" constraint. Conversely, you can take a queue-based lock such as CLH and deconstruct it to yield an MPSC queue such as Vyukov's MPSC queue).

I've converted it to use park-unpark as follows, but I'd be interested in other
approaches :


Acquire(L) :
auto n = CLHNodeAllocate()
n->Locked = 1
auto prv = swap (&L->Tail, n)
if prv->Locked == 1 :
// Self uniquely identifies the current thread
if swap(&prv->Locked, Self) != 0 :
while prv->Locked != 0 :
CLHNodeFree (prv)
L->Owner = n

Release(L) :
auto w = swap (&L->Owner->Locked, 0)
assert w != 0
if w != 1 : Unpark(w)

This changes the encoding of the node "Locked" field from the usual 0/1 to 0, 1 or a thread reference. 0 continues to mean available, 1 means locked, and any other value indicates locked and identifies the thread waiting on that queue node. We assume "1" is distinguished and no thread will be identified by that value. CLHNodeAllocate and CLHNodeFree are implemented with thread-local caches of free queue nodes, and typically only require a few load and store operations. We can do a bit better than the above if you're willing to make the queue nodes type-stable and immortal. I'd like to avoid that, however.

We'll now shift to Brandenburg's PF-T reader-writer lock. This is an extremely clever algorithm and is quite subtle. It's also very terse. If you're not familiar with the algorithm, I recommend reading section 4.1 of their paper in RTSJ11. The lock is phase-fair, which, depending on your tastes, may be a useful property. As written, it uses pure spinning. Additionally, the only atomic it needs is fetch-and-add. PF-T requires a more elaborate transformation -- which I'll step through incrementally -- to use spin-then-park. For the purposes of explication I'll assume a sequentially consistent memory model. All shared fields are assumed to
be atomic<:T>.


Class Brandenburg :
int rin
int rout
int win
int wout

// The lowest order bit of "rin" encodes the phase number.
// The next higher bit if "rin" is the writer-present bit.
// Note that rin+rout resemble the ingress-egress variables
// we use in our cohort reader-writer locks.
// Bit 0x100 and above in rin and rout is the count of arrived readers and
// departed readers, respectively. 0x100 was used in the original algorithm as
// that enabled the use a byte-store instead of an atomic to clear the phase and
// writer-present bits when releasing write permission. Mixed-size accesses in
/ the context of shared or atomic operations can be problematic, so I've eliminated
// the byte-store optimization and just use an atomic to clear those bits.

Reader(L) :
// acquire R permission
auto w = FetchAdd (&L->rin, 0x100) & 3
if w != 0 :
while w == (L->rin & 3) : Pause()

// release R permission
FetchAdd (&L->rout, 0x100)

Writer(L) :
// acquire W permission
// resolve W-W conflicts
auto Ticket = FetchAdd (&L->win, 1)
while Ticket != L->wout: Pause()
// Next, wait for readers to drain
auto w = 2 | (Ticket & 1)
auto tx = FetchAdd (&L->rin, w)
while tx != L->rout : Pause()

// Clear the low-order 2 bits of L->rin
FetchAdd (&L->rin, -(L->rin & 3)) ;
L->wout ++

The first thing to note is that win and wout really constitute a ticket lock used to adjudicate access between writers. The implementation also leverages the low-order bit of "win" as the phase number. So we can replace win+wout with a proper mutex, and add an explicit phase variable. We'll assume that
our mutex is implemented properly and avoids unbounded spinning. This results in the following:


Class Brandenburg :
int rin
int rout
mutex WriterLock
int Phase

Reader(L) :
// acquire R permission
auto w = FetchAdd (&L->rin, 0x100) & 3
if w != 0 :
while w == (L->rin & 3) : Pause()

// release R permission
FetchAdd (&L->rout, 0x100)

Writer(L) :
// acquire W permission
MutexLock (WriterLock)
// Next, wait for readers to drain
auto p = (++L->Phase) & 1
auto tx = FetchAdd (&L->rin, p|2)
while ((tx ^ L->rout) & ~3) != 0 : Pause()

// Clear the low-order 2 bits of L->rin
FetchAdd (&L->rin, -(L->rin & 3)) ;
MutexUnlock (WriterLock)

So at this point the writers are in check (no unbounded spinning) except for at most one writer that's waiting for readers to drain. In anticipation of subsequent changes I'm going to clean up the rin+rout encodings and move the phase number back into the LSB of rin. The use of 0x100 as the increment for a single reader is an historical artifact related to the word-tearing trick, so we'll use 4 instead. (The low order bits remain the writer-present bit and phase number). We also change the code to advance the phase when releasing write permission.


Class Brandenburg :
int rin
int rout
mutex WriterLock

Reader(L) :
// acquire R permission
auto w = FetchAdd (&L->rin, 4)
if (w & 2) != 0 :
// writer is active; wait for next read phase
while ((v ^ L->rin) & 1) == 0 : Pause()

// release R permission
FetchAdd (&L->rout, 4)

Writer(L) :
// acquire W permission
MutexLock (WriterLock)
// Resolve W-R conflicts : specifically W vs extant R
auto t = FetchAdd (&L->rin, 2) & ~1 ;
// Wait for readers to drain
while L->rout != t : Pause() ;

// Clear the writer-present bit and advance phase
auto cur = L->rin ;
FetchAdd (&L->rin, (cur ^ 3) - cur) ;
MutexUnlock (WriterLock)

The changes in the step above were relatively cosmetic, but the next transformation is rather large. We're going to implement an explicit stack of waiting readers. This is the "RList" field, the low order bit of which also serves as the phase number. There are now 2 copies of the phase bit, one in the LSB of rin and one in the LSB or RList. They change almost in unison. We'll also assume that the thread structure has "Next" and "Grant" fields that we can use. "Self" is thread-local variable that refers to a thread's own thread structure. Note that our approach is not vulnerable to ABA problem.


Class Brandenburg :
mutex WriterLock
Thread * RList
int rin
int rout

Reader(L) :
// acquire R permission
auto w = FetchAdd (&L->rin, 4)
if (w & 2) != 0 :
// writer is active; wait for next read phase
// Attempt to push ourselves on the RList
auto ph = v & 1 // arrival phase of this thread
Self->Grant = 0
Thread * head = L->RList
for :
// If phase changed then proceed directly to RCS
// Threads with the previous phase "bounce off" RList.
// RList accepts threads only of the current phase
if (head & 1) != ph : goto EnterRCS ;
Self->Next = head & ~1
auto v = CAS (&L->RList, head, Self|ph) ;
if v == head : break ;
head = v ;
while Self->Grant == 0 : Park()


// release R permission
FetchAdd (&L->rout, 4)

Writer(L) :
// acquire W permission
MutexLock (WriterLock)
// Resolve W-R conflicts : specifically W vs extant R
auto t = FetchAdd (&L->rin, 2)
assert (t & 2) == 0
// Wait for readers to drain
while L->rout != (t & ~1) : Pause()

auto cur = L->rin
// Detach the list of waiting readers and advance the phase number
// in the least significant bit of RList
auto List = swap (&L->RList, NULL|((cur + 1) & 1) )
assert (List & 1) == (cur & 1)
// Clear the writer-present bit and advance phase
FetchAdd (&L->rin, (cur ^ 3) - cur) ;

// Wake the list of waiting readers we have in-hand
// We could drop WriterLock early but there's no point as we're changing
// phase and readers run next.
List = List & ~1
while List != NULL :
auto w = List
List = List->Next
assert w->Grant == 0
w->Grant = 1

MutexUnlock (WriterLock)

At this point we're close to complete. There's only one remaining case of unbounded spinning. That appears when a writer waits for the readers to drain. That's relatively easy to remedy by having the waiting writer make its ID or reference visible so readers can wake it. (There's a bit of a Dekker-ish protocol involved, but nothing elaborate). We can also run into performance problems where the writer, while unparking readers, is preempted by one of those readers. This is the so-called wakee-preempts-waker problem. We can address that issue with concurrent helping -- possibly by forming the set of threads to wake into a tree. Finally, we can make the lock NUMA-friendly by using a cohort lock for WriterLock and by using per-node rin+rout fields. I believe the final form remains phase-fair. And in fact it performs
quite well on NUMA systems.

Note that the "C-RW-NP" Cohort NUMA-aware reader-writer has only one instance of unbounded indefinite spinning, where the next writer waits for readers to depart. Again, that particular case is trivial to address.


Friday Feb 20, 2015

Cohort Locks in ACM TOPC

Lock Cohorting: A General Technique for Designing NUMA Locks appears in ACM Transactions on Parallel Computing (TOPC) Volume 1 Issue 2, January 2015. (local copy -- apologies for gzip-ed pdf but the file size exceeded Oracle blog limits).

This is an extended journal version of our PPoPP 2012 conference paper. Since the PPoPP version we've found ways to make the lock friendlier for Intel MESI/MESIF-based systems. Our preferred form now uses a partitioned ticket lock for the top-level global lock, and normal ticket locks for the node-level local locks. In addition, the journal version shows that cohort locks can provide benefit even on single-node non-NUMA systems by treating each core as a node for the purposes of cohort formation.

See also US8694706.

ACM DL Author-ize serviceLock Cohorting: A General Technique for Designing NUMA Locks
David Dice, Virendra J. Marathe, Nir Shavit
ACM Transactions on Parallel Computing - SPECIAL ISSUE ON PPOPP '12, 2015

Wednesday Jan 28, 2015

Using reader-writer locks to improve hardware TLE

At its most simple, traditional hardware TLE (Transactional Lock Elision) operates as follows : start a hardware transaction; "subscribe" to the lock state (committing or aborting if the lock is found to be held); execute the critical section; and finally commit at the end of the critical section. For the purposes of discussion I'm assuming an Intel TSX-RTM implementation of hardware transactional memory (HTM). If we can't make progress using the optimistic transactional mode then we revert to using classic pessimistic mutual exclusion by taking the lock, running the critical section in non-transactional execution mode, and finally releasing lock. Myriad retry policies are possible. It's worth noting that HTM implementations built on top of the existing cache coherence protocol with "requester-wins" conflict resolution policies will usually admit or allow mutual abort with no progress by any parties. That is, the progress properties can be rather poor.

A relatively simple way to improve on the implementation above is to employ a reader-writer lock. Threads trying to use optimistic transactions to execute the critical section must acquire and hold shared "R" permission. Threads executing the critical section via the classic non-transactional path must hold exclusive "W" permission. This approach confers a number of advantages. First, this algorithm obviates the need for subscription in optimistic path. 2nd, it avoids the lemming effect. (Perhaps I should have used the term "pyrrhic locking" instead, as we all know full well that lemmings don't really follow each other off cliffs to their death. It's pyrrhic in the sense that you "win" the lock, but lose the ability to runs transactions). 3rd, under the usual simplistic TLE implementations, a thread taking the classic path will cause all concurrent threads in the optimistic path to abort. But with the reader-writer lock variant, threads taking the pessimistic path wait politely for the optimistic threads to finish. The existence of threads in the R path is visible to threads trying to acquire W -- this is vaguely analogous to the idea of "visible readers" in STMs. (See also TLRW). Visible readers and polite writers may result in fewer aborts and cycles wasted on futile transactions. Both the simple traditional form and the form based on reader-writer locks prohibit simultaneous execution of transactional and non-transactional modes under the lock, but the reader-writer lock variety is a bit less medieval. In a sense we have a "guarded" execution mode for transactions. Finally, some reader-writer lock implementations intentionally control admission policy so as to promote large groups of simultaneous readers -- we call this "R-group formation". When used for TLE, a reader-writer lock that promotes R-group formation will also tend to exhibit better throughput, say, than a reader-writer lock that uses a strict FIFO admission policy. Specifically, larger R-groups let us co-schedule larger groups of concurrent transactions.

As side-note, hardware TLE can be implemented either within pthread_mutex primitives, or on top of such primitives. In the latter case, if your TLE implementation covers all critical sections protected by the lock in question then you can add a separate "isLocked" tracking variable for the purposes of subscription. In this case the slow "classic" pessimistic path would acquire the lock and store "true" into isLocked. On x86 you also need a fence or fence-equivalent instruction after the store. In some circumstances, however, you might not have access to all the critical sections to impose the use of the isLocked convention. In that case it'd helpful if glib/libc were to expose a pthread_mutex_islocked_np() operator which would test the lock state and subscribe. TLE could call that function in the fast optimistic path. Note that the reader-writer lock form needs neither an isLocked() call to inspect the lock, or an explicit tracking variable.

There are a few ways to embellish this approach. I’ve only explored this on a single-node system, but it looks useful to use a cohort NUMA reader-writer lock with 2 synthetic logical cohort nodes, one for R and one for W. This tends to further favor and promote R-group formation, and thus "batch" or co-schedule groups of transactions together to an even greater degree, yielding better parallelism.

Another idea is to augment the R path with K-exclusion. (See US8402464 for more details). Because of the mutual abort with no progress issue mentioned above, we can encounter fratricide if we have too many threads simultaneously in the R path under a given lock. By decreasing the number of threads in the R-path, we can sometimes improve aggregate throughput. K-exclusion around the R path gives us a convenient way to throttle the number of threads. Specifically, we vary K based on the recent transactional success or abort rates. When K reaches 1 we have a degenerate case and just revert to mutual exclusion.

An AIMD policy for K seems to work reasonably well. AIMD policies are usually associated with TCP congestion windows, but seem to map cleanly to problems involving optimistic synchronization. The TCP sending window is also optimistic, as elements in the window may require re-transmission. (A few years ago I experimented with AIMD policies to experiment with the spin duration for contended java locks).

The canonical use case for reader-writer locks is the usual and obvious single-writer vs multiple-reader roles. Our example above is single pessimistic/normal vs multiple optimistic/transactional. Another mode we see is in lock-free code with deferred memory reclamation, where R permission confers the right to access some data, and the W role is required to unmap or physically free the underlying memory. A related usage is to protect the file system mount tables in operating systems. R permission enables a thread to parse pathnames and access the filesystem without worry of dismount, while W is used by unmount to make sure the filesystem is quiesced. Another use-case is multiple mutator threads vs a garbage collector where the heap is protected by a single reader-writer lock. If you have a JVM or other managed runtime with a stop-the-world copying collector, then normal heap mutator threads have R access. If a mutator calls out of the JVM via JNI, or parks itself, then it relinquishes R for the duration of that operation. The garbage collector takes the W role when active. Mutators also periodically poll for pending GC and relinquish R to allow the collector to run.

See also PhTM : Phased Transactional Memory which appeared in Transact 2007.

Note that a good quality reader-writer lock will provide automatic gang wakeup.

Invisible text : still takes up space

Invisible text : no space allocated

Monday Jan 26, 2015

Blizzard Preparations

3 cars in a 2-car garage

AVIS Frankfurt

I recently rented a car from AVIS at Frankfurt for travel to Dagstuhl. We drove to Dagstuhl, left it parked for 5 days, and returned without incident. On return, the attendant -- very pleasant, btw -- found 'damage' to the front driver's side rim and wanted to know details about the 'accident' . There was no accident while the car was in my hands. I related such to the attendant, but didn't make any progress. AVIS tacked an additional 147€ change on my corporate credit card. This was immediately after the Paris Hebdo incident, so we knew security lines would be long and slow, so even though we arrived 3 hours before the flight, we didn't have any spare time to argue the point.

Interestingly, there was no outbound inspection by AVIS when I rented the car. Below is a picture of the rim after the attendant pointed out the issue. There's certainly something on the rim at about the 6:00 and 3:30 positions, but it didn't happen while I had the car. The photographic quality is bad, as the return area was very poorly lit. I took a quick look around the car when I rented it, but didn't spot the supposedly damaged area.

Hopefully this isn't systemic behavior or practice on the part of AVIS.

I suppose the lesson is (a) take a walk-around video when your rent the car -- hopefully there's enough light -- or (b) perhaps use a different vendor.

A subsequent letter from AVIS detailing the damage charge ended as follows :

If you would like to exclude this risk in your next rental, you can purchase an excess-waiver for a small additional charge.


Wednesday Jan 14, 2015

Seen in St. John

I originally mistook this as "Dave".

Just back from Dagstuhl

Concurrent Computing in the Many-core Era 2015 - Schloss Dagstuhl.

As usual, a productive and invigorating week. (All talks My slides).

We also visited the Völklingen Ironworks. (wikipedia).

Monday Dec 29, 2014

Measuring short-term fairness for locks

In a companion blog entry I discussed ways to measure or characterize the long-term fairness of lock implementations. Measuring short-term fairness is also interesting. We could easily have a lock, such as cohort locks, that are fair over long intervals but intentionally unfair over the short-term. It's useful to have ways to define exactly how unfair a lock is over some interval. We'll assume the lock admission policy is work-conserving.

It's worth mentioning that measuring short-term fairness is slightly tricky because of induced probe effect. As such, we sometimes use statistical sampling to reduce the overheads albeit with reduced certainty. I'll leave that discussion for subsequent blog entries.

I'll borrow Denning's working set concept, but apply it locks instead of memory management. The working set of a program over some interval I is the set of distinct pages accessed during I. Where convenient and unambiguous, we'll sometimes quantify the working set as the number of pages in the working set. Similarly, I'll define the lock working set (LWS) for a given lock L over interval I as the set of distinct threads that acquired the lock during the interval. Instead of defining the interval I over wall-clock time, it's more convenient to treat each acquire of the lock as a virtual increment of our interval clock. We'll define LWSS as the size of the LWS. Even though we might define the interval I to be small, we can partition a long run into a set of M I-length disjoint intervals and report the average or median LWSS over those M points.

A related statistic is the median time to reacquire (MTTR). Conceptually, we can think of a measurement run as generating a stream of thread IDs that acquired the lock. We can then iterate through that stream, and for every acquire, determine the number of intervening acquire operations that occurred since the thread last obtained the lock. That gives us a time-to-reacquire value, and in turn, we can compute the MTTR over the run.

Note that we might measure either the number of intervening acquires, or the number of distinct threads to acquire in the interim.

My colleague Tim Harris noted that it could be useful to plot by varying the interval length I on the x-axis, and the average LWSS on the y-axis. This is somewhat reminiscent of MMU (minimum mutator utilization) graphs that appear in the garbage collection literature.

When a contended lock is unfair over the short term, it's typically the case that the set of threads circulating over the lock is partitioned into active and passive sets. The active circulating set corresponds to the LWSS.

Measuring long-term fairness for locks

Lets say we have N concurrent and homogenous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-work benchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're also interested in quantifying the fairness, as reflected by the distribution of those points. For simplicity, we'll assume that the scheduler is ideally fair, there are no NUMA or physical "geographic" issues in play, and that each thread has about the same cache footprint. The only remaining source of unfairness is the lock implementation or policies. (In practice there are myriad confounding factors, including energy and thermal caps, but we'll ignore those and assume an idealized model). A histogram of the completion counts is a good way to visualize the fairness or unfairness of the lock. I recommend HdrHistogram.

Sometimes, though, it's convenient to describe unfairness in terms of a simple real-valued numerical statistic. Such a descriptive statistic can be used to quantity the fairness of various lock policies, and in particular to help establish the trade-off between fairness and throughput. Ideally that statistic would be scale-invariant -- that property is desirable but optional. Some of the usual statistics are standard deviation or variance. Extrema-based statistics such as (max-min) or (max-min)/max can also be useful. These can give us a sense of the size of the range of the data points. The average divided by the max can also provide some insight. IQR is another commonly used statistic, as is Jain's Fairness Index. Cederman et al. suggested another useful fairness metric. In recent papers I've reported the relative standard deviation. (In communications literature it's not uncommon to see fairness expressed in terms of average/stddev, which is the reciprocal of the relative standard deviation). Median Absolute Deviation (MAD) is also informative. I don't have one favorite -- my lock benchmark harnesses report all of the above.

Recently I was looking at other measures of disparity or uniformity and came up with the following idea. First, we sort our N completion counts in ascending order. We then plot as follows. On the X-axis we have the number of threads T, and on the Y-axis we have the cumulative sum of iteration counts up to thread #T. If the completion counts were 10,15,22,33 then the Y values would be C(x) = 0,10,25,37,70, for instance, for 0,1,2,3,4 threads, respectively. Beware that the C(x) function is obviously discrete, but for the purposes of discussion and terminology I'll treat it as a continuous real-valued function. Next, we normalize both axes to [0,1]. For the Y-axis, we can simply divide C(x) by the maximum -- final -- C(x) value. If all the data points are equal -- ideally fair -- then C(x) is just the identity diagonal function : C(x)=x. Note that C(x) is convex. The more unfair the distribution, the larger the area under the diagonal and above the normalized C(x) function. And in fact that area measure makes a useful index of inequality. We could derive additional statistics from this approach, such as the maximum x-C(x) value, but the area between the diagonal and C(x) seems viable.

Subsequently, I realized a similar approach had been long used in economics to describe income disparity : Gini Coefficient.

See also : Measuring Short-Term Fairness for locks.

Tuesday Dec 02, 2014

No blocking parking

Taken at lovely Gillette Stadium; possibly amusing to synchronization practitioners. "Easy exit" sounds interesting as well.

Thursday Oct 09, 2014

Synchronization horror story of the day

While evaluating some new lock algorithms (exposed via LD_PRELOAD interposition on the pthread_lock and condvar interfaces), I noticed that C++ catch and throw on Solaris/SPARC compiled with GCC 4.9.0 will take runtime paths that take locks and impede scaling. Specifically, if you have a set of threads that simply execute "try { throw 20; } catch (int e) {;}", they won't scale as expected because of runtime locks. Unwind_RaiseException and Unwind_Find_FDE always appear on the stack when we encounter the contention. Interestingly, I didn't see any evidence of this behavior on Ubuntu 14.04 on x86.

Tuesday Jul 29, 2014

A simple lazy subscription pathology

Following up on The Pitfalls of lazy subscription, I thought I'd provide a simple case that illustrates where transactional lock elision (TLE) with lazy subscription can fail. I've managed to reproduce the failure in-house on an i7-4770 (haswell).

Say we have a ring-buffer that’s used for logging. It just records the last 16 messages posted. (This is kind of common for “black box flight recorders” that appear in kernels and the JVM). The memory layout is : int pos; intptr_t RingBuffer[16]; volatile int Lock. We'll also assume that we compile with code at low optimization levels, so accesses to "pos" result in loads and stores with no caching in registers. “Lock” is a simple test-and-test-and-set lock possibly augmented with TLE and late subscription.

The critical section is :

Post(m) :
auto index = pos ++ // load and store pos
RingBuffer[index] = m
if pos == 16 : pos = 0 // reload pos

A number of threads just loop, calling Post(0). Some run take the lock, others use TLE with late subscription. It's possible for one of threads using TLE with late subscription to observe an intermediate and transient "pos" value of 16 by virtue of some thread that holds "Lock" and has incremented "pos" from 15 to 16, but not yet reset the variable to 0. The thread using TLE that observed "pos" at 16 will then overwrite "Lock" with 0, so the late subscription check succeeds and the transaction inadvertently commits. This is a case of "wrongful commit" via lock corruption.

Characterizing which critical sections remain safe under TLE with late subscription remains an interesting research topic. But in the general case it's unsafe.

It's worth pointing out that for weakly ordered processors, the load or loads that constitute the classic early subscription check must have proper fencing. None is needed on x86, of course.

Monday Jul 28, 2014

Hardware extensions to make lazy subscription safe

Hardware extensions to make lazy subscription safe is a follow-on to our WTTM 2014 paper, Pitfalls of Lazy Subscription . We describe a number of hardware approaches to avoid the wrongful commit pathology admitted by generalized transactional lock elision using lazy subscription. Eager (early) subscription remains safe, of course.

Wednesday Jun 11, 2014

PTLQueue : a scalable bounded-capacity MPMC queue

I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPMC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java.

Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait.

A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type.


// PTLQueue :

Put(v) : // producer : partial method - waits as necessary
assert v != null
assert Mask >= 1 && (Mask & (Mask+1)) == 0 // Document invariants
// doorway step
// Obtain a sequence number -- ticket
// As a practical concern the ticket value is temporally unique
// The ticket also identifies and selects a slot
auto tkt = AtomicFetchIncrement (&PutCursor, 1)
slot * s = &Slots[tkt & Mask]
// waiting phase :
// wait for slot's generation to match the tkt value assigned to this put() invocation.
// The "generation" is implicitly encoded as the upper bits in the cursor
// above those used to specify the index : tkt div (Mask+1)
// The generation serves as an epoch number to identify a cohort of threads
// accessing disjoint slots
while s->Turn != tkt : Pause
assert s->MailBox == null
s->MailBox = v // deposit and pass message

Take() : // consumer : partial method - waits as necessary
auto tkt = AtomicFetchIncrement (&TakeCursor,1)
slot * s = &Slots[tkt & Mask]
// 2-stage waiting :
// First wait for turn for our generation
// Acquire exclusive "take" access to slot's MailBox field
// Then wait for the slot to become occupied
while s->Turn != tkt : Pause
// Concurrency in this section of code is now reduced to just 1 producer thread
// vs 1 consumer thread.
// For a given queue and slot, there will be most one Take() operation running
// in this section.
// Consumer waits for producer to arrive and make slot non-empty
// Extract message; clear mailbox; advance Turn indicator
// We have an obvious happens-before relation : HBO
// Put(m) happens-before corresponding Take() that returns that same "m" - antecedent
T v = s->MailBox
if v != null :
s->MailBox = null
ST-ST barrier
s->Turn = tkt + Mask + 1 // unlock slot to admit next producer and consumer
return v

PTLQueue borrows and derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation.

PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues but does not use per-slot Turn variables. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and incorporates distinct put and take cursors and per-slot Turn fields. Instead of a per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field.

At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional.

The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs.

It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. With respect to progress properties, however, PTLQueue is not lock-free.

The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively.

We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper.

I'll now introduce a variation on the above, but borrow Dmitry Vyukov's encoding for the Turn variable. After a thread deposits a value into the MailBox, it advances Turn by 1 to indicate to Take() operations that a value is present. Turn values follow this trajectory : K, K+1, K+1+Mask, etc. (More precisely, the Turn field for slot I takes on values of the form: I*G, (I*G)+1, I*(G+1), etc where G is the generation number). This form is friendlier for a total tryTake() operator. Put() and Take() operators are also more symmetric than in the version above. It also allows null to be passed without substituting proxy values. Like the form above, it can use and benefit from fetch-and-increment instead of CAS for the partial forms. the partial tryTake() operator, however, uses CAS. tryPut() -- not shown -- is analogous to tryTake(). UPDATE: see the comments section for a concern identified by Nitsan.


// PTLQueueV2 :

Put(v) : // producer : partial method - waits as necessary
assert Mask > 1 && (Mask & (Mask+1)) == 0 // Document invariants
// doorway step
// Obtain a sequence number -- ticket
// As a practical concern the ticket value is temporally unique
// The ticket also identifies and selects a slot
auto tkt = AtomicFetchIncrement (&PutCursor, 1)
slot * s = &Slots[tkt & Mask]

// waiting phase :
// wait for slot's generation to match the tkt value assigned to this put() invocation.
// The "generation" is implicitly encoded as the upper bits in the cursor
// above those used to specify the index : tkt div (Mask+1)
// The generation serves as an epoch number to identify a cohort of threads
// accessing disjoint slots
while s->Turn != tkt : Pause

assert s->MailBox == null
s->MailBox = v // deposit and pass message
s->Turn = tkt + 1 // mark occupied - release corresponding Take() operation
// We pass & cede ownership and exclusive access of MailBox to the next Take()
// That Take() operation will be in the same generation as this Put()

Take() : // consumer : partial method - waits as necessary
auto tkt = AtomicFetchIncrement (&TakeCursor,1)
slot * s = &Slots[tkt & Mask]

// Wait turn for our generation and for the slot become occupied
while s->Turn != (tkt+1) : Pause

T v = s->MailBox // extract message
// In a garbage-collected environment we might want to set s->MailBox = null
s->Turn = tkt + Mask + 1 // mark unoccupied - unlock slot to admit next Put()
// We pass & cede ownership and exclusive access to MailBox to the next Put()
// That Put() will be in the next generation
return v

// tryTake() is a "strong" operator in that it can return null IFF
// the queue was empty at some point during the tryTake() invocation.
// Spurious false-null return values are not allowed.
// Note that if we provide tryTake() with the interface below, then null can not
// be a legal message value passed via Put(null). Alternatively, we could allow
// null messages but augment tryTake() to return a success-failure indication as well
// as the value.
// Thanks to Nathan Reyonds for comments on an earlier version of this algorithm.

tryTake() : // consume : "strong" total method - returns ASAP
auto tkt = TakeCursor
slot * s = &Slots[tkt & Mask]
auto delta = s->Turn - (tkt+1)
if delta == 0 :
if CAS (&TakeCursor, tkt, tkt+1) == tkt :
// our thread has exclusive access to s->MailBox
T v = s->MailBox
s->Turn = tkt + Mask + 1 // ceded MailBox access to next Put()
return v
if delta lessThan 0 :
return null
// inopportune concurrent interleaving - race
// tkt is stale with respect to TakeCursor
// Other Take() or tryTake() operations bypassed this operation
// and raced past
// This can happen if we stall after loading TakeCursor, for instance
// just retry
assert TakeCursor != tkt

There's quite a bit of related literature in this area. I'll call out a few relevant references:

I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore lets say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable).

As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Lets say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to a some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use :"if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}".

It's worth noting the obvious duality between locks and queues. If you have strict FIFO lock implementation with local spinning and succession by direct handoff such as MCS or CLH,then you can usually transform that lock into a queue.

If you want a multiple-producer single-consumer MPSC queue then you can replace atomic operations on the TakeCursor with normal updates. More generally you can take a simple SPSC queue and wrap the put() and take() operations with put and take mutexes, restricting concurrency to 1-vs-1, but the performance isn't usually as good as what you'd get from a queue designed for MPSC usage.

Wednesday Apr 30, 2014

SPAA 2014 : Persistent Unfairness Arising From Cache Residency Imbalance

Persistent Unfairness Arising From Cache Residency Imbalance by Dave Dice, Virendra J. Marathe and Nir Shavit will appear in SPAA 2014.

This is the Matthew Effect for caches.

The description of the methodology is necessarily terse in the paper, but we took pains to measure and minimize DRAM channel imbalance and to control for DRAM bank/page locality.

An interesting related paper is Efficient techniques for predicting cache sharing and throughput by Andreas Sandberg, David Black-Schaffer and Erik Hagersten (Uppsala).

Tuesday Apr 29, 2014

malloc for Haswell - Hardware Transactional Memory

Haswell i7-4770 processors.

For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possibly indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4K will map to the same index (set) in the L1. The low-order 6 bits of the address presented to the L1 form the offset into the line, and the next higher 6 bits serve as the L1 index. The MMU base page size is 4KB, so there is no overlap between the virtual page number and the index field in a virtual address. The L1 index field passes through address translation verbatim. As such, OS-level page coloring is not in play with respect to the L1. (An advantage of this design is that indexing can commence before the virtual address is translated to a physical address, although we still need the physical address for tag comparison). Some CPUs hash addresses -- usually XORing high-order physical address bits into the index bits -- in order to reduce the odds of index hotspots and index imbalance, but experiments suggest that does not appear to be the case with the i7-4770.

Such simple caches -- without the index hashing mentioned above -- can be vulnerable to excessive index conflicts, but malloc allocators can be made index-aware (local copy) to mitigate and reduce the frequency of index conflicts. Index imbalance results in underutilization of the cache. Some indices will be "cold"(less frequently accessed) while others are "hot" and thus incur relatively higher miss rates. It's worth pointing out that most application/allocator combinations don't exhibit excessive index conflicts, but for those that do, the performance impact can be significant. An index-aware allocator can act to "immunize" an application against some common cases of index-imbalance while typically incurring no additional cost over index-oblivious allocators. Think of the index-aware allocator as cheap insurance against a rare but painful performance disorder. The paper above describes an index-aware allocator designed for the L1 in a SPARC T2+ processor, but it's trivial to change a few cache geometry constants and retarget the allocator to the i7-4770.

The "CIA-Malloc" (Cache-Index Aware) allocator described in the paper has a number of other useful design properties. It also happens to be NUMA-friendly and large-page-friendly. Underlying pages are allocated on the node where the malloc() was invoked. Put another way, the pages underlying a block returned by malloc() will typically reside on the node where the malloc() was invoked. The allocator is also scalable with very little internal lock contention or coherence traffic. Each per-CPU sub-heap has a private lock -- the only time we'll encounter contention is via migration or preemption, which is relatively rare. The critical sections are also constant-time and very short. We also make heavy use of trylock(), so if a thread is obstructed it can usually make progress by reverting to another data structure. Remote free() operations are lock-free. Critically, the allocator acts to reduce the cost of malloc() and free() operations as well as the cost to the application when accessing blocks allocated via malloc(). The allocator is also designed specifically to reduce common cases of false sharing : allocator metadata-vs-metadata; metadata-vs-block; and inter-block block-vs-block. Metadata-vs-metadata sharing and false sharing is reduced by using per-CPU sub-heaps. False sharing arising between adjacent data blocks -- blocks returned by malloc() -- is addressed by placement and alignment. These attributes will prove even more useful when we use CIA-Malloc in conjunction with hardware transactions.

The i7-4770 provides hardware transactional memory (HTM). For the purposes of discussion we'll assume we're using TSX-RTM for the purposes of transactional lock elision (TLE). The critical section body contains unmodified HTM-oblivious legacy code that expects to run under the lock in the usual fashion, but via TLE we can modify the lock implementation to attempt optimistic execution, reverting to the lock only as necessary. The i7-4770's HTM implementation tracks the transactional write-set in the L1 and the read-set over the cache hierarchy. It uses a requester-wins conflict resolution strategy implemented via the MESIF coherence protocol. At most a single cache can have a given line in M/E state at any one time -- a classic multiple-reader single-writer model. Eviction or invalidation of a tracked cache line results in a transactional abort. For example if a transaction on CPU C loads address A, and some other CPU writes A before C commits, the write will invalidate the line from C's cache and cause an abort. Similarly, if C stores into A and some other CPU loads or stores into A before C commits, the invalidation of A will cause C's transaction to abort. Read-write or write-write sharing on locations accessed within a transaction results in coherence invalidation and consequent abort. (The HTM implementation in the i7-4770 shares quite a few aspects with Sun's experimental ROCK processor).

In addition to coherence traffic, self-displacement via conflict misses can also result in aborts. This is where a CIA-Malloc allocator may provide benefit relative to other allocators. Normally an index-aware allocator is expected to reduce conflict misses arising from index-imbalance, but it can also reduce transactional aborts caused by eviction of read-set or write-set entries from index conflicts. Aborts are usually far more expensive than simple cache misses. (Absent any potential benefit from warming up of caches, aborts are pure wasted and futile effort).

Lets take an actual example. The following data was collected on an i7-4770 running Ubuntu 14.04. We use a simple C single-threaded benchmark that uses malloc() to individually allocate a set of 250 nodes, and then arranges those nodes into an circular intrusive singly linked list. The benchmark was compiled with gcc 4.8.2 using the x32 ABI. The node structure has a "next" field at offset 0 followed by a volatile integer "w" field. A command-line switch gives us ability to specify the effective size of the node as passed to malloc(). Since there may be a correlation between allocation order and virtual address, we randomize the order of the nodes with a Fisher-Yates shuffle in order to minimize the impact of automatic hardware stride-based prefetchers. (Such a randomized order can put stress on the TLBs with lots of page crossings as we traverse the list, but that's not the dominant performance issue for the cases we'll discuss). We then report the time needed to complete 10000000 steps of the following loop body :

a->w = 0 ; a = a->next
If we use an effective node size of 950 bytes, then the default glibc malloc() allocator places our nodes at 960 byte intervals (1024-64) and each step of the loop requires 2.1 nsecs. When we increase the node size to 1010 the interval is 1024 bytes and each step takes 8.1 nsecs. If we further increase the node size to 1080 bytes then the interval is 1088 bytes (1024+64)and the time drops back to 2.1 nsecs. The performance drop at 1010 bytes was caused by the 1024-byte placement interval. The base addresses of our 250 nodes resided on just 4 of the 64 possible indices, so we grossly underutilized the L1. This nicely illustrates index conflicts arising from index-oblivious allocator placement policies. An index-aware allocator will avoid this scenario.

Now we'll extend our benchmark to use RTM hardware transactions. Each traversal of the ring is wrapped in a RTM XBEGIN-XEND transaction and we'll measure and report success rates. This is intended to model TLE or "naked" use of RTM transactions. We keep the ring circumference at 250 nodes, so each transaction will iterate over that
many elements. With nodes of 960 bytes the failure rate is 0.145%. Increasing the node size to 1010 bytes, the failure rate becomes 100%. (We have no progress). But if we bump the size to 1080 bytes then the failure rate drops back to 0.2%. The complete and persistent failure at 1010 bytes was caused by elements of the write-set being evicted and consequent abort. But if we use CIA-Malloc modified for the i7-4770 we can avoid such abrupt performance inflections.

To recap, an index-aware allocator can help avoid performance pathologies for normal non-transactional code as well as improving the success rate of hardware transactions. Specifically, conflict misses become aborts in transactions, and aborts are more expensive than normal misses.

Ideally, an HTM-friendly allocator will satisfy the desiderata enumerated in Cache-Index Aware Memory Allocation and also act to reduce the abort rate. The following properties are desirable:

  • malloc() and free() should be callable from within transactions with low odds of abort
  • memory accesses within transactions to blocks returned by malloc() are less prone to abort; specifically we want to reduce conflict misses and coherence invalidation from sharing, both of which cause aborts.
      Sharing and false-sharing:

    • allocator metadata-vs-blocks
    • allocator metadata-vs-metadata : we note that internal lock contention or promiscuous shared locks internal to the allocator are a special case of metadata-vs-metadata sharing that can cause aborts.
    • block-vs-block : inter-block false sharing. Application code can also use explicit memalign() and padding to reduce the odds of inter-block false sharing. An allocator or application that increases alignment to avoid false sharing may also cause quantization of block sizes and increased internal fragmentation.

  • malloc() and free() operations are scalable. Highly scalable allocators also tend to be transaction-friendly as they avoid or reduce the use shared global mutable data.

As a side note, under a requester-wins conflict resolution strategy, to the extent possible and reasonable it's a good idea to shift stores of frequently accessed shared variables toward the end of a transaction. You can do this by hand, or a transaction-aware compiler or JIT can perform some of the transformations. The modCount field Java's hashtable is a canonical example of an update that should be shifted. Shifting reduces the window of vulnerability where the store resides in the transaction's write-set. But the asymmetry in the i7-4770 where the write-set is tracked in the L1 and the read-set in the L1-L2-L3 gives us yet another reason to shift stores toward the end of a transaction. Consider a transaction that executes a store followed by large number of loads. Those loads may displace the store and cause an abort. But if we shift the store to the end of the transaction, the same set of accesses (just reordered) can succeed without abort. The store may displace a loaded line from the L1, but the L2 can still track the line.

Finally, when a given size-class is index-unfriendly, we can use the punctuated array approach as described in the original CIA-Malloc paper. A more radical approach is to intentionally and explicitly pick size-classes that are prime multiples of the cache line size. This helps to reduce inter-size-class index conflicts.

Another approach to index-aware allocation (other than friendly size-classes or punctuated arrays) is to simply intercept malloc(S) calls and run a Bernoulli trial to decide if we want to add 64 to S. The probably P should be low. As such, we'll occasionally, on a random basis, add 64 to S. A slightly more elaborate variation will make P a function of S, increasing the odds for larger S values.

Finally, we describe how to use hardware transactional memory within an allocator implementation in Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory which appeared in SPAA 2010 (DOI). That I know of, this is the first use of HTM in an allocator implementation. The allocator used in the SPAA paper was derived from the allocator described in ISMM 2002 Mostly Lock-Free Malloc (DOI), which used restartable critical sections instead of transactions.

Tuesday Mar 11, 2014

A simple PRNG construction idiom

The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less that 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNG is one of my personal favorites. If I need better quality I'll step up to Ziff's four tap or Mersenne twister.

One variation of Marsaglia has only one word of state, a 4G-1 period, and requires just 3 shifts and 3 XOR operations to generate a new value. 0 is an absorbing state that we avoid. See MarsagliaNext(), below. Ignoring 0, the trajectory or stream of values forms a cycle -- conceptually a ring. The initialization and seeding operation should act to place different threads at different positions on that ring. In a sense the ring is shared by all threads but we start the threads at different points. Unfortunately, we can sometimes find that 2 different threads are at about the same position at about the same time through simple bad luck, and thus generate similar streams of values. (Longer PRNG cycles reduce the odds of such scenarios, of course). Ideally, to avoid such inopportune and undesirable behavior, each thread would have its own private ring of values.

A simple approach that is tantamount to giving each thread its own ring -- trajectory of values -- is shown below in NextRandom(). Hash32() is a hash function, which we'll describe later. Note that we explicitly "color" the value passed into Hash32() with the address of the thread-local PRNG state. Recall that at any one time, such an address will be associated with at most one thread, and is thus temporally unique. This approach gives us additional independence over concurrently executing threads. It also makes NextRandom() self-seeding -- explicit initialization is not required.

The Hash32() hash function is the key to this particular PRNG, and its implementation directly embodies the trade-off between performance and the quality of the resulting distribution. Conceptually, you could think of Hash32() as representing a randomized cycle with a period of 4G. We might implement Hash32() via a strong cryptographic hash, such as SHA-3 or MD5. Those hash functions would work perfectly well, but tend to be high quality but somewhat expensive, so we'd prefer a cheaper hash. Critically, we want the hash function to exhibit a high degree of avalanche effect. (A good encryption function could also be used as a hash operator). Some cheaper candidates for the hash function include: MurmurHash ;CityHash ; FNV hash family; siphash; and Jenkins hash.

Doug Lea and Guy Steele invented some of the best candidates for our hash function: see Mix32() and Mix64() below. These are relatively cheap but do well on avalanche tests and strike a reasonable balance between quality and performance. Beware that they're obviously not cryptographically strong. Mix32() and Mix64() are related to mix functions found in java.util.SplittableRandom.

Update 2014-03-13 : Thanks to Paul Sandoz for pointing out Stafford's mix function.

static int32_t MarsagliaNext () {
static __thread int32_t rv = 0 ;
if (rv == 0) rv = GenerateSeed() ;
int32_t v = rv ;
v ^= v << 6 ;
v ^= uint32_t(v) >> 21 ;
v ^= v << 7 ;
rv = v ;
return v ;

static int NextRandom() {
// assumes a 32-bit environment : ILP32
// Instead of incrementing by 1 we could also
// increment by a large prime or by the MIX constant
// if we're using Mix32() as our hash function.
static __thread int rv = 0 ;
return Hash32 (int(&rv) + (++rv)) ;

// Mix32 and Mix64 are provided courtesy of Doug Lea and Guy Steele

static int32_t Mix32 (int32_t z) {
static const int32_t MIX = 0x9abe94e3;
// Another workable constant is 0xa0ca9b6b
z = (z ^ (uint32_t(z) >> 16)) * MIX;
z = (z ^ (uint32_t(z) >> 16)) * MIX;
return z ^ (uint32_t(z) >> 16);

static int64_t Mix64 (int64_t z) {
static const int64_t MIX = 0xdaba0b6eb09322e3LL;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
return z ^ (uint64_t(z) >> 32);

MSPC 2014 submission deadline has been extended until April 4th

See the workshop web site for details.

Monday Mar 10, 2014

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But lets assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsgalia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion lets assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trails until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« April 2015