X

An Oracle blog about Transactional locks

Locks with LIFO admission order

Dave Dice
Senior Research Scientist
Why would we ever want a lock with a LIFO admission policy?
First, a LIFO lock provides a useful measure of real-world scalability. Lets say we have a set of threads that each iterate as follows : acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as per-thread iteration counts. Amdahl's law says the maximum ideal speedup relative to a single thread should be (N+C)/C. We can run our experiments, varying the thread count, measure aggregate throughput and see compare to see how close we come to Amdahl's bound. Assuming we have a heterogeneous system and ignoring any potential superlinear effects, the observed peak speedup will be capped by Amdahl's bound. And if we use a fair FIFO lock, such as MCS, the threads will all have approximately equal completion counts.
It's worth noting that Amdahl's law is sometimes misapplied to locks and critical sections. In the classic Amdahl model, during the serial phase no other threads may be executing concurrently, while with locks, when one threads is in the critical section other threads may be executing concurrently in their critical sections. That is, classic Amdahl's law applies to barriers. See also Gustafson's law, Gunther's universal scaling law, and in particular Eyerman's model. Critically, though, the maximum speedup bounds still hold.
Now lets say we switch to a LIFO lock. Ideally, the aggregate throughput will be the same as we saw with the FIFO lock. If N=30 and C=10, then the ideal speedup is 4X. If we run with 10 threads under a LIFO lock, when we examine the distribution of per-thread completion counts we expect to see 4 threads dominate with about equal performance, and 6 threads should have starved completely. This gives us another non-analytic empirical way to guage the maximum speedup over a lock. Put another way, can we figure out how many threads we can "squeeze" or pack into a contended lock before we hit saturation. We keep increasing T until some threads show evidence of starvation. This lets us discern the N/C ratio. Of course we could try to derive the ratio using FIFO locks, varying T, and using Amdahl's law, but in practice there are quite a few practical confounding factors. The LIFO approach gives us a fairly direct reading of the number of threads that will "fit" before we reach over-saturation.
LIFO locks are also useful in their own right. While they are deeply unfair, they work very well with spin-then-park waiting strategies. If we imagine the lock as implemented with a stack of waiting threads, threads near the head are mostly likely to be spinning, and are also most likely to be next granted the lock. If the lock is over-saturated, then under a LIFO policy, ownership will circulate over just a subset of the contending threads. In turn this can reduce cache pressure and yield benefits arising from thermal and energy bounds. Of course we have to take measures to ensure long-term eventual fairness, but many locks intentionally trade-off short-term fairness for throughput. (See our "Cohort" locks, for example).
A possibly systemic downside to LIFO locks is that arrivals and departures may need to access the same lock metadata, creating an acute coherence hot-spot. With a contended MCS lock, for instance, an unlock operation doesn't need to access the "tail" field.
I wondered if there was a LIFO analog to the classic FIFO ticket lock and put the question to my colleagues in Oracle Lab's Scalable Synchronization Research Group, and collected some of the designs, which I'll report below. It's an interesting exercise and puzzle, and hard to resist for concurrency folks. Alex Kogan, Victor Luchangco, Tim Harris, Yossi Lev and I contributed. Any mistakes are mine.
The most basic LIFO algorithm I could think of was to implement an explicit stack of waiting threads with a central "head" pointer which points to the most recently arrived thread. The approach is pretty obvious and yields a true LIFO admission order. Expressed in a pidgin Python/C++ dialect and assuming atomic<T> for all shared variables, the following sketch describes that form. The lock is very similar to the Thread::MuxAcquire() and ::MuxRelease() primitives that I wrote for the HotSpot JVM. (Those are internal locks used by JVM to get over a bootstrapping phase where the normal native C++ HotSpot Monitor:: and Mutex:: classes aren't yet initialized). We call this form "E3".
(I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).
 

class StackElement :
StackElement * Next
int Wait
class LIFOLock :
// Encoding for Head field :
// 0 = lock is free
// 1 = locked but no waiting threads
// T = locked : T refers to stack of waiting threads
StackElement * Head
Acquire(L) :
StackElement A ; // on-stack auto-variable
auto h = L->Head
if h == 0 :
h = CAS (&L->Head, 0, 1)
if h == 0 : return // uncontended fast-path acquisition
// inopportune interleaving -- some other thread mutated L->Head
// in the LD-CAS window above. We lost the race
// Apparent contention ...
// Doorway phase : use CAS to push our element A onto the stack
for
assert h != 0
A.Next = h & ~1
A.Wait = 1
auto v = CAS (&L->Head, h, &A)
if v == h : break
if v == 0 :
v = CAS (&L->Head, 0, 1)
if v == 0 : return
h = v
// Waiting phase
// the lock provides local spinning
while A.Wait != 0 : Pause()
assert L->Head != 0
Release(L) :
auto h = L->Head
assert h != 0
if h == 1
h = CAS (&L->Head, 1, 0)
if h == 1 : return // simple uncontended release
// the stack can only grow while the lock is held ...
// The lock is contended
// try to pop the head of the stack.
// This is the most recently arrived thread
assert h != 0 && h != 1
// Note that we're using CAS to push and pop elements
// Normally that would leave us exposed to ABA problems.
// But as there can be only one thread trying to pop -- that being the owner --
// we have multiple-push-single-pop concurrency model and are thus not vulnerable
// to ABA pathologies. The lock itself restricts concurrency and prevents
// multiple concurrent pop operations.
for
auto next = h->Next
auto v = CAS (&L->Head, h, next)
if v == h :
assert h->Wait == 1
h->Wait = 0
break
h = v
// Variation :
// Note that if the CAS fails we are guaranteed to have at least 2 elements
// on the stack. We can splice out the element that follows the element
// identified by "v" and pass ownership to the associated thread.
// The thread we pick will either be the same as the head at the time
// we fetched L->Head in Release(), or some thread that arrived afterward.
// Depending on how liberal your interpretation, this is a plausibly LIFO ordering.
// This approach yields a constant-time Release() operator, with no loops.
// As constituted above, the policy is "strict" LIFO, however.

The next few variations forgo explicit nodes, and as such, we'll have global spinning. The broad inspiration for this family is the CLH lock, where a thread knows the adjacent thread on the queue, but the queue is implicit.
We call the following "E5" because it was the 5th experimental version.
 

Class LIFOLock
int Head
int NextToRun
enum DistinguishedValues
U = 0 // unlocked
H = 1 // Locked, no other threads waiting
P = 2 // ownership being passed
Invalid = 3 //
IDBias = 4 // offset so thread ID values don't intersect above
Acquire(L) :
auto tkt = Self->UniqueID + IDBias
assert tkt != U && tkt != P && tkt != H
auto h = L->Head ;
for :
if h == P :
Pause() ; h = L->Head; continue ;
auto newh = (h == U) ? H : tkt
auto v = CAS (&L->Head, h, newh)
if v == h : break
h = v
if h != U :
while L->NextToRun != tkt : Pause()
L->NextToRun = Invalid
assert L->Head == P
L->Head = h
Release (L) :
auto h = L->Head ;
for :
assert h != P && h != U
auto newh = (h == H) ? U : P ;
auto v = CAS (&L->Head, h, newh) ;
if v == h : break
h = v
if h != H : L->NextToRun = h

The first thing to notice is that the "P" encoding can result in two waiting phases in Acquire() : arriving threads may first wait while Head == P and then for their specific turn. The interlock protocol to hand-off feels rather synchronous. P state is effectively a lock that freezes out arrivals until the successor manages
to depart. In addition, a group of threads could be waiting while Head == P, but subsequently "enqueue" themselves in an order that differs from their arrival order, so we don't have strict pedantic FIFO. (See also FCFE = First-Come-First-Enabled).
We can streamline E5 slightly, yielding E5B :
 

Acquire(L) :
auto tkt = Self->UniqueID + IDBias
assert tkt != U && tkt != P && tkt != H
auto h = L->Head
for :
if h == P :
Pause() ; h = L->Head; continue ;
if h == U :
auto v = CAS (&L->Head, U, H)
if v == U : return
h = v
continue
// Case : H or T = most-recently-arrived thread
auto v = CAS (&L->Head, h, tkt)
if v == h : break
h = v
while L->NextToRun != tkt : Pause()
L->NextToRun = U
assert L->Head == P
L->Head = h
Release (L) :
auto h = L->Head ;
if h == H :
// uncontended path
auto v = CAS (&L->Head, H, U) ;
if v == H : return
h = swap (&L->Head, P)
assert h != U && h != H && h != P
L->NextToRun = h

The next version, E6, eliminates the P encoding and switches to a seqlock-like lock for the hand-off transition. The lock instance requires just a single "Next" field. When the low-order bit of Next is set, arrivals are frozen during the hand-off. E6 appears the fastest on non-NUMA systems, possibly because the paths are relatively tight.
 

Acquire(L) :
auto w = L->Next ;
for
if w & 1 :
Pause(); continue
auto v = CAS (&L->Next, w, w+2)
if w != v :
w = v ; continue
if w != 0 :
while L->Next != (w+1) : Pause()
L->Next = w
break
Release(L) :
auto w = L->Next ;
assert w != 0 && (w & 1) == 0
if w == 2:
auto v = CAS (&L->Next, 2, 0) ;
if v == 2 : return ;
FetchAdd (&L->Next, -1) // set to odd, locked for handoff

For E7 we revert to using an "inner lock" to protect an explicit stack of waiting threads. An MCS or CLH lock work nicely for that purpose. E7 provides local spinning and, depending on the inner lock implementation, is true FIFO. We use an encoding of Head == 1 to indicate the lock is held but no threads are waiting.
 

Acquire(L) :
CLHAcquire (&L->Inner)
Thread * h = L->Head
if h == NULL :
L->Head = 1 ;
CLHRelease (&L->Inner) ;
return
Self->Grant = 0
Self->Next = h
L->Head = Self
CLHRelease (&L->Inner)
while Self->Grant == 0 : Pause() ;

LockRelease (L) {
CLHAcquire (&L->Inner)
Thread * h = L->Head
assert h != 0
if h == 1 :
L->Head = 0 ;
CLHRelease (&L->Inner)
else :
L->Head = h->Next
CLHRelease (&L->Inner)
assert h->Grant == 0
h->Grant = 1

For E8 we try to make greater user of atomic fetch-and-add operators. The lock contains "Ticket" and "Admit" fields.
 

Acquire(L) :
auto t = FetchAdd (&L->Ticket, 1) ;
if t != 0 :
for :
while L->Admit != t : Pause()
// Adjudicate among set of threads with same ticket "depth" value
// Admits some non-determinism because of race
auto v = SWAP (&L->Admit, -1) ;
assert v == t || v == -1
if v != -1 : break ;
Release(L) :
// Claim : t value in LockRelease() must be >= t value
// in corresponding LockAcquire() invocation
auto t = FetchAdd (&L->Ticket, -1) ;
assert t > 0
if --t != 0 L L->Admit = t ;

E9 uses a latch -- really a lock that allows asymmetric acquire and release. Specifically, if thread T1 acquires the latch then T2 may subsequently release the latch. (This is precisely the "thread-oblivious" property required by top-level cohort locks). For our purposes we can use a ticket lock. Our lock structure contains an inner ticket lock and Depth and Admit fields.
 

Acquire (L) :
TicketAcquire (L->TicketLock)
auto d = L->Depth ++ ;
TicketRelease (L->TicketLock)
if d != 0 :
while L->Admit != d : Pause()
L->Admit = -1 ;
TicketRelease (L->TicketLock)
Release (L) :
TicketAcquire (L->TicketLock)
auto d = -- L->Depth ;
assert d >= 0
L->Admit = d
if d == 0 :
TicketRelease (L->TicketLock)

Those were the most interesting variations. We found the exercise rather fun. And I'm sure there are lots of wonderfully clever ideas that we missed.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services