Saturday Jan 16, 2016

Refined Transactional Lock Elision

Refined Transactional Lock Elision in PPoPP 2016 by Dave Dice, Alex Kogan and Yossi Lev.

Abstract

Transactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce concurrency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hardware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables hardware speculation once there is a thread running under lock. In this paper we present two algorithms that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock. We demonstrate the benefit of our algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks and a wide range of workloads, including an AVL tree-based micro-benchmark and ccTSA, a real sequence assembler application.

Tuesday Nov 17, 2015

JNI performance - false sharing on the "-UseMembar" serialization page

For background on the membar elision techniques and the serialization page, see the following: 7644409; Asymmetric Dekker Synchronization; and QPI Quiescence. On normal x86 and SPARC systems these are strictly local latency optimizations (because MEMBAR is a local operation) although on some systems where fences have global effects, they may actually improve scalability. As an aside, such optimizations may no longer be profitable on modern processors where the cost of fences has decreased steadily. Relatedly, on larger systems, the TLB shootdown activity -- interprocessor interrupts, etc. -- associated with mprotect(PROT_NONE) may constitute a system-wide scaling impediment. So the prevailing trend is away from such techniques, and back toward fences. Similar arguments apply to biased locking -- another local latency optimization -- which may have outworn its usefulness.

A colleague in Oracle Labs ran into a puzzling JNI performance problem. It originally manifested in a complex environment, but he managed to reduce the problem to a simple test case where a set of independent concurrent threads make JNI calls to targets that return immediately. Scaling starts to fade at a suspiciously low number of threads. (I eliminated the usual thermal, energy and hyperthreading concerns).

On a hunch, I tried +UseMembar, and the scaling was flat. The problem appears to be false sharing on the store accesses into the serialization page. If you're following along in the openjdk source code, the culprits appear to be write_memory_serialize_page() and MacroAssembler::serialize_memory(). The "hash" function that selects an offset in the page -- to reduce false sharing -- needs improvement. And since the membar elision code was written, biased locking has, I believe, forced thread instances to be aligned on 256-byte boundaries, which contributes in part to the poor hash distribution. On a whim, I added an "Ordinal" field to the thread structure, and initialized it in the Thread ctor via fetch-and-add on a static global. The 5th created thread will have Ordinal==5, etc. I then changed the hash function in the files mentioned above to generate an offset calculated via ((Ordinal*128) & (PageSize-1)). "128" is important as that's the alignment/padding unit needed to avoid false sharing on x86. (The unit of coherence on x86 is a 64-byte cache line, but Intel notes in their manuals that you need 128 bytes to avoid false sharing; adjacent-sector prefetch makes the effective unit 128 bytes). This provided relief.
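For concreteness, here is a minimal sketch of the kind of hash I have in mind. It is illustrative only and not the actual HotSpot change: ThreadOrdinalSource, ThreadRec and SerializeOffsetFor are hypothetical names, and the real patch would live in the Thread ctor and the two functions named above.

// Hedged sketch -- hypothetical names, not the HotSpot code itself.
#include <atomic>
#include <cstdint>

static std::atomic<uintptr_t> ThreadOrdinalSource {0};
static const uintptr_t PageSize = 4096;

struct ThreadRec {
  uintptr_t Ordinal;
  // The 5th thread created gets Ordinal == 5, and so on.
  ThreadRec() : Ordinal(ThreadOrdinalSource.fetch_add(1) + 1) {}
};

// Offset of this thread's slot within the serialization page.
// 128-byte units avoid false sharing given adjacent-sector prefetch on x86.
static inline uintptr_t SerializeOffsetFor (const ThreadRec * t) {
  return (t->Ordinal * 128) & (PageSize - 1);
}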

With 128-byte units and a 4K base page size, we have only 32 unique "slots" on the serialization page. It might make sense to increase the serialization region to multiple pages, with the number of pages possibly a function of the number of logical CPUs. That is, to reduce the odds of collisions, it probably makes sense to conservatively over-provision the region. (mprotect() operations on contiguous regions of virtual pages are only slightly more expensive than mprotect() operations on a single page, at least on x86 or SPARC, so switching from a single page to multiple pages shouldn't result in any performance loss). Ideally we'd index with the CPUID, but I don't see that happening, as getting the CPUID in a timely fashion can be problematic on some platforms. We could still have very poor distribution with the OrdinalID scheme I mentioned above. Slightly better than the OrdinalID approach might be to try to balance the number of threads associated with each of the slots. This could be done in the thread ctor. It's still palliative, as you could have a poor distribution over the set of threads using JNI at any given moment. But something like that, coupled with increasing the size of the region, would probably work well.
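A back-of-the-envelope sizing rule might look like the following sketch; the function name and the 2x over-provisioning factor are arbitrary illustrative choices, not anything from the JDK.

// Hedged sketch : size the serialization region by logical CPU count.
#include <unistd.h>
#include <cstddef>

size_t SerializeRegionBytes() {
  const size_t PageSize  = (size_t) sysconf(_SC_PAGESIZE);
  const size_t CpuCount  = (size_t) sysconf(_SC_NPROCESSORS_ONLN);
  const size_t SlotBytes = 128;                         // false-sharing avoidance unit on x86
  const size_t need      = CpuCount * SlotBytes * 2;    // over-provision to reduce collisions
  return ((need + PageSize - 1) / PageSize) * PageSize; // round up to whole pages
}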

p.s., the mprotect()-based serialization technique is safe only on systems that have a memory consistency model that's TSO or stronger. And the access to the serialization page has to be a store; because of memory model issues, a load isn't sufficient.

Update: friends in J2SE have filed an RFE as JDK-8143878.

Wednesday Sep 30, 2015

Dekker’s mutual exclusion algorithm made RW-safe

Dekker’s mutual exclusion algorithm made RW-safe by Peter Buhr, Dave Dice and Wim H. Hesselink appears in CCPE 2015. (Sorry, it's pay-walled).

Evaluating HTM for pauseless garbage collectors in Java

Evaluating HTM for pauseless garbage collectors in Java by Maria Carpen-Amarie, Dave Dice, Patrick Marlier, Gaël Thomas and Pascal Felber appeared in The 13th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA-15).

Abstract:

While garbage collectors (GCs) significantly simplify programmers’ tasks by transparently handling memory management, they also introduce various overheads and sources of unpredictability. Most importantly, GCs typically block the application while reclaiming free memory, which makes them unfit for environments where responsiveness is crucial, such as real-time systems. There have been several approaches for developing concurrent GCs that can exploit the processing capabilities of multi-core architectures, but at the expense of a synchronization overhead between the application and the collector. In this paper, we investigate a novel approach to implementing pauseless moving garbage collection using hardware transactional memory (HTM). We describe the design of a moving GC algorithm that can operate concurrently with the application threads. We study the overheads resulting from using transactional barriers in the Java virtual machine (JVM) and discuss various optimizations. Our findings show that, while the cost of these barriers can be minimized by carefully restricting them to volatile accesses when executing within the interpreter, the actual performance degradation becomes unacceptably high with the just-in-time compiler. The results tend to indicate that current HTM mechanisms cannot be readily used to implement a pauseless GC in Java that can compete with state-of-the-art concurrent GCs.

See also: US9208081.

Saturday Sep 26, 2015

Malthusian Locks

Arxiv version

Wednesday Apr 22, 2015

Locks with LIFO admission order

Why would we ever want a lock with a LIFO admission policy?

First, a LIFO lock provides a useful measure of real-world scalability. Let's say we have a set of threads that each iterate as follows : acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as per-thread iteration counts. Amdahl's law says the maximum ideal speedup relative to a single thread should be (N+C)/C. We can run our experiments, varying the thread count, measure aggregate throughput, and compare to see how close we come to Amdahl's bound. Assuming we have a homogeneous system and ignoring any potential superlinear effects, the observed peak speedup will be capped by Amdahl's bound. And if we use a fair FIFO lock, such as MCS, the threads will all have approximately equal completion counts.
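To make the methodology concrete, a minimal sketch of such a fixed-time harness appears below. The structure follows the description above; the translation of C and N into dummy spin units is a placeholder, and the thread count and run length are arbitrary.

// Hedged sketch of the fixed-time-report-work loop described above.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

static std::mutex Lock;
static std::atomic<bool> Done {false};

static void Spin (long units) { for (volatile long i = 0; i < units; i++) ; }

static void Worker (long C, long N, long * iterations) {
  while (!Done.load()) {
    { std::lock_guard<std::mutex> g(Lock); Spin(C); }   // critical section of length C
    Spin(N);                                            // non-critical section of length N
    (*iterations)++;
  }
}

int main() {
  const int T = 8; const long C = 10, N = 30;
  std::vector<long> counts (T, 0);
  std::vector<std::thread> threads;
  for (int i = 0; i < T; i++) threads.emplace_back (Worker, C, N, &counts[i]);
  std::this_thread::sleep_for (std::chrono::seconds(10));
  Done = true;
  for (auto & t : threads) t.join();
  long sum = 0;
  for (auto c : counts) { sum += c; printf ("%ld\n", c); }
  printf ("aggregate=%ld ideal speedup=%.1fX\n", sum, double(N + C) / C);
  return 0;
}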

It's worth noting that Amdahl's law is sometimes misapplied to locks and critical sections. In the classic Amdahl model, during the serial phase no other threads may be executing concurrently, while with locks, when one thread is in the critical section other threads may be executing concurrently in their non-critical sections. That is, classic Amdahl's law applies to barriers. See also Gustafson's law, Gunther's universal scaling law, and in particular Eyerman's model. Critically, though, the maximum speedup bounds still hold.

Now let's say we switch to a LIFO lock. Ideally, the aggregate throughput will be the same as we saw with the FIFO lock. If N=30 and C=10, then the ideal speedup is 4X. If we run with 10 threads under a LIFO lock, when we examine the distribution of per-thread completion counts we expect to see 4 threads dominate with about equal performance, and 6 threads should have starved completely. This gives us another non-analytic, empirical way to gauge the maximum speedup afforded by a lock. Put another way, we can figure out how many threads we can "squeeze" or pack into a contended lock before we hit saturation. We keep increasing T until some threads show evidence of starvation. This lets us discern the N/C ratio. Of course we could try to derive the ratio using FIFO locks, varying T, and using Amdahl's law, but in practice there are quite a few confounding factors. The LIFO approach gives us a fairly direct reading of the number of threads that will "fit" before we reach over-saturation.

LIFO locks are also useful in their own right. While they are deeply unfair, they work very well with spin-then-park waiting strategies. If we imagine the lock as implemented with a stack of waiting threads, threads near the head are most likely to be spinning, and are also most likely to be next granted the lock. If the lock is over-saturated, then under a LIFO policy, ownership will circulate over just a subset of the contending threads. In turn this can reduce cache pressure and yield benefits arising from thermal and energy bounds. Of course we have to take measures to ensure long-term eventual fairness, but many locks intentionally trade off short-term fairness for throughput. (See our "Cohort" locks, for example).

A possibly systemic downside to LIFO locks is that arrivals and departures may need to access the same lock metadata, creating an acute coherence hot-spot. With a contended MCS lock, for instance, an unlock operation doesn't need to access the "tail" field.

I wondered if there was a LIFO analog to the classic FIFO ticket lock, and put the question to my colleagues in Oracle Labs' Scalable Synchronization Research Group, and collected some of the designs, which I'll report below. It's an interesting exercise and puzzle, and hard to resist for concurrency folks. Alex Kogan, Victor Luchangco, Tim Harris, Yossi Lev and I contributed. Any mistakes are mine.

The most basic LIFO algorithm I could think of was to implement an explicit stack of waiting threads with a central "head" pointer which points to the most recently arrived thread. The approach is pretty obvious and yields a true LIFO admission order. Expressed in a pidgin Python/C++ dialect and assuming atomic<T> for all shared variables, the following sketch describes that form. The lock is very similar to the Thread::MuxAcquire() and ::MuxRelease() primitives that I wrote for the HotSpot JVM. (Those are internal locks used by the JVM to get over a bootstrapping phase where the normal native C++ HotSpot Monitor:: and Mutex:: classes aren't yet initialized). We call this form "E3".

(I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).

 

class StackElement :
    StackElement * Next
    int Wait

class LIFOLock :
    // Encoding for Head field :
    // 0 = lock is free
    // 1 = locked but no waiting threads
    // T = locked : T refers to stack of waiting threads
    StackElement * Head

Acquire(L) :
    StackElement A ;      // on-stack auto-variable
    auto h = L->Head
    if h == 0 :
        h = CAS (&L->Head, 0, 1)
        if h == 0 : return    // uncontended fast-path acquisition
        // inopportune interleaving -- some other thread mutated L->Head
        // in the LD-CAS window above. We lost the race

    // Apparent contention ...
    // Doorway phase : use CAS to push our element A onto the stack
    for :
        assert h != 0
        A.Next = h & ~1
        A.Wait = 1
        auto v = CAS (&L->Head, h, &A)
        if v == h : break
        if v == 0 :
            v = CAS (&L->Head, 0, 1)
            if v == 0 : return
        h = v

    // Waiting phase
    // the lock provides local spinning
    while A.Wait != 0 : Pause()
    assert L->Head != 0

Release(L) :
    auto h = L->Head
    assert h != 0
    if h == 1 :
        h = CAS (&L->Head, 1, 0)
        if h == 1 : return    // simple uncontended release
        // the stack can only grow while the lock is held ...

    // The lock is contended
    // Try to pop the head of the stack.
    // This is the most recently arrived thread.
    assert h != 0 && h != 1
    // Note that we're using CAS to push and pop elements.
    // Normally that would leave us exposed to ABA problems.
    // But as there can be only one thread trying to pop -- that being the owner --
    // we have a multiple-push-single-pop concurrency model and are thus not vulnerable
    // to ABA pathologies. The lock itself restricts concurrency and prevents
    // multiple concurrent pop operations.
    for :
        auto next = h->Next
        auto v = CAS (&L->Head, h, next)
        if v == h :
            assert h->Wait == 1
            h->Wait = 0
            break
        h = v

// Variation :
// Note that if the CAS fails we are guaranteed to have at least 2 elements
// on the stack. We can splice out the element that follows the element
// identified by "v" and pass ownership to the associated thread.
// The thread we pick will either be the same as the head at the time
// we fetched L->Head in Release(), or some thread that arrived afterward.
// Depending on how liberal your interpretation, this is a plausibly LIFO ordering.
// This approach yields a constant-time Release() operator, with no loops.
// As constituted above, the policy is "strict" LIFO, however.


The next few variations forgo explicit nodes, and as such, we'll have global spinning. The broad inspiration for this family is the CLH lock, where a thread knows the adjacent thread on the queue, but the queue is implicit.

We call the following "E5" because it was the 5th experimental version.

 

class LIFOLock :
    int Head
    int NextToRun

enum DistinguishedValues :
    U       = 0   // unlocked
    H       = 1   // locked, no other threads waiting
    P       = 2   // ownership being passed
    Invalid = 3
    IDBias  = 4   // offset so thread ID values don't intersect above

Acquire(L) :
    auto tkt = Self->UniqueID + IDBias
    assert tkt != U && tkt != P && tkt != H
    auto h = L->Head
    for :
        if h == P :
            Pause() ; h = L->Head ; continue
        auto newh = (h == U) ? H : tkt
        auto v = CAS (&L->Head, h, newh)
        if v == h : break
        h = v
    if h != U :
        while L->NextToRun != tkt : Pause()
        L->NextToRun = Invalid
        assert L->Head == P
        L->Head = h

Release(L) :
    auto h = L->Head
    for :
        assert h != P && h != U
        auto newh = (h == H) ? U : P
        auto v = CAS (&L->Head, h, newh)
        if v == h : break
        h = v
    if h != H : L->NextToRun = h


The first thing to notice is that the "P" encoding can result in two waiting phases in Acquire() : arriving threads may first wait while Head == P and then wait again for their specific turn. The interlock protocol for hand-off feels rather synchronous. The P state is effectively a lock that freezes out arrivals until the successor completes the hand-off. In addition, a group of threads could be waiting while Head == P, but subsequently "enqueue" themselves in an order that differs from their arrival order, so we don't have strict pedantic LIFO. (See also FCFE = First-Come-First-Enabled).

We can streamline E5 slightly, yielding E5B :

 

Acquire(L) :
    auto tkt = Self->UniqueID + IDBias
    assert tkt != U && tkt != P && tkt != H
    auto h = L->Head
    for :
        if h == P :
            Pause() ; h = L->Head ; continue
        if h == U :
            auto v = CAS (&L->Head, U, H)
            if v == U : return
            h = v
            continue
        // Case : H or T = most-recently-arrived thread
        auto v = CAS (&L->Head, h, tkt)
        if v == h : break
        h = v
    while L->NextToRun != tkt : Pause()
    L->NextToRun = U
    assert L->Head == P
    L->Head = h

Release(L) :
    auto h = L->Head
    if h == H :
        // uncontended path
        auto v = CAS (&L->Head, H, U)
        if v == H : return
    h = swap (&L->Head, P)
    assert h != U && h != H && h != P
    L->NextToRun = h


The next version, E6, eliminates the P encoding and switches to a seqlock-like lock for the hand-off transition. The lock instance requires just a single "Next" field. When the low-order bit of Next is set, arrivals are frozen during the hand-off. E6 appears the fastest on non-NUMA systems, possibly because the paths are relatively tight.

 

Acquire(L) :
    auto w = L->Next
    for :
        if w & 1 :
            // hand-off in progress : refetch and retry
            Pause() ; w = L->Next ; continue
        auto v = CAS (&L->Next, w, w+2)
        if w != v :
            w = v ; continue
        if w != 0 :
            while L->Next != (w+1) : Pause()
            L->Next = w
        break

Release(L) :
    auto w = L->Next
    assert w != 0 && (w & 1) == 0
    if w == 2 :
        auto v = CAS (&L->Next, 2, 0)
        if v == 2 : return
    FetchAdd (&L->Next, -1)    // set to odd, locked for handoff


For E7 we revert to using an "inner lock" to protect an explicit stack of waiting threads. An MCS or CLH lock works nicely for that purpose. E7 provides local spinning and, depending on the inner lock implementation, is true LIFO. We use an encoding of Head == 1 to indicate the lock is held but no threads are waiting.

 

Acquire(L) :
    CLHAcquire (&L->Inner)
    Thread * h = L->Head
    if h == NULL :
        L->Head = 1
        CLHRelease (&L->Inner)
        return
    Self->Grant = 0
    Self->Next = h
    L->Head = Self
    CLHRelease (&L->Inner)
    while Self->Grant == 0 : Pause()

Release(L) :
    CLHAcquire (&L->Inner)
    Thread * h = L->Head
    assert h != 0
    if h == 1 :
        L->Head = 0
        CLHRelease (&L->Inner)
    else :
        L->Head = h->Next
        CLHRelease (&L->Inner)
        assert h->Grant == 0
        h->Grant = 1


For E8 we try to make greater use of atomic fetch-and-add operators. The lock contains "Ticket" and "Admit" fields.

 

Acquire(L) :
    auto t = FetchAdd (&L->Ticket, 1)
    if t != 0 :
        for :
            while L->Admit != t : Pause()
            // Adjudicate among set of threads with same ticket "depth" value
            // Admits some non-determinism because of race
            auto v = SWAP (&L->Admit, -1)
            assert v == t || v == -1
            if v != -1 : break

Release(L) :
    // Claim : the t value in Release() must be >= the t value
    // in the corresponding Acquire() invocation
    auto t = FetchAdd (&L->Ticket, -1)
    assert t > 0
    if --t != 0 : L->Admit = t


E9 uses a latch -- really a lock that allows asymmetric acquire and release. Specifically, if thread T1 acquires the latch then T2 may subsequently release the latch. (This is precisely the "thread-oblivious" property required by top-level cohort locks). For our purposes we can use a ticket lock. Our lock structure contains an inner ticket lock and Depth and Admit fields.

 

Acquire(L) :
    TicketAcquire (L->TicketLock)
    auto d = L->Depth ++
    TicketRelease (L->TicketLock)
    if d != 0 :
        while L->Admit != d : Pause()
        L->Admit = -1
        TicketRelease (L->TicketLock)   // release the latch still held by the previous owner's Release()

Release(L) :
    TicketAcquire (L->TicketLock)
    auto d = -- L->Depth
    assert d >= 0
    L->Admit = d
    if d == 0 :
        TicketRelease (L->TicketLock)   // no waiters : release the latch ourselves


Those were the most interesting variations. We found the exercise rather fun. And I'm sure there are lots of wonderfully clever ideas that we missed.

Monday Apr 20, 2015

waiting policies for locks : spin-then-park

I thought I'd collect a few notes on the topic of waiting policies in the context of locks. Lock algorithms in the literature are usually implemented via unbounded spinning. That's fine for an academic paper, but in practice it's almost never viable to use unbounded spinning. More typically, we'll use a spin-then-park policy where the spin phase is bounded, and a thread reverts to parking as necessary. In theory we can also yield occasionally while spinning, but yielding doesn't replace the need to park. (In practice, yielding is often a fool's errand, particularly with modern schedulers where the semantics of the operation have
been considerably weakened over time).

Parking is expected to deschedule the caller until it is unparked. This frees up the CPU for other eligible ready threads. It's also friendly and "polite" to siblings on the same core that share the pipelines, and for thermal/energy caps and turbo-mode. (MONITOR-MWAIT addresses some of those concerns, but the waiting thread still occupies a CPU). The interface to park-unpark is simple. A thread T1 calls park() and is descheduled until some thread T2 calls unpark(T1). If T2 happens to call unpark(T1) before T1 parks, then T1's park() operation will return immediately. You can conceptualize the implementation as a per-thread restricted-range semaphore. Park() is also allowed to return spuriously, so the caller is expected to use it in a loop. A simple litmus test for correct park-unpark usage is that the application should still work if park and unpark were implemented as no-ops, although we'd be reduced to degenerate spinning. I've used the park-unpark interface in the JVM for about a decade, and exported it to Java-land for use in java.util.concurrent (JUC). On Solaris it's easy to implement park-unpark with a per-thread lock-condvar-flag triple. On Linux you can opt to use futexes directly. Per-thread pipe pairs also yield a good implementation, although they consume too many file descriptors (handles). See also benaphores. Redundant unpark() operations are benign and are expected to be cheap.
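A minimal sketch of that per-thread lock-condvar-flag construction in portable C++ might look as follows. This is just one plausible rendering, not the JVM's implementation; it omits the spin phase, the timed variant, and any cancellation handling.

#include <condition_variable>
#include <mutex>

// Per-thread restricted-range semaphore : at most one pending permit.
struct Parker {
  std::mutex M;
  std::condition_variable CV;
  int Permit = 0;           // 0 or 1

  void Park() {             // blocks until a permit is available, then consumes it;
                            // callers still recheck their own condition in a loop
    std::unique_lock<std::mutex> g(M);
    while (Permit == 0) CV.wait(g);
    Permit = 0;
  }

  void Unpark() {           // redundant unparks collapse into a single permit
    { std::lock_guard<std::mutex> g(M); Permit = 1; }
    CV.notify_one();
  }
};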

A consequence of the "point-to-point" park-unpark interface -- where thread A unparks thread B by directly naming B -- is that we need to have an explicit list of threads waiting on the lock. Lock-free approaches come in handy for list management.

A park() implementation will typically spin for a short period locally before it has to revert to the kernel. (Half the round-trip context switch time is the norm, and is 2-competitive. See Karlin et al.). This constitutes local spinning, which is relatively benign in terms of induced coherence traffic. This gives us a spin-then-park waiting policy. Park() also has a variation with a timeout.
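Using the Parker sketch above, the spin-then-park waiting policy itself is just a bounded local spin followed by blocking; the SpinLimit below is a placeholder that would normally be calibrated to about half the round-trip context-switch cost.

#include <atomic>

static const int SpinLimit = 1000;   // placeholder; calibrate per platform

void WaitFor (std::atomic<int> & flag, Parker & self) {
  for (int i = 0; i < SpinLimit; i++) {        // bounded local spinning phase
    if (flag.load (std::memory_order_acquire) != 0) return;
    // a PAUSE / cpu-relax hint would normally go here
  }
  while (flag.load (std::memory_order_acquire) == 0)
    self.Park();                               // revert to blocking in the kernel
}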

Now let's shift to using park-unpark with classic lock algorithms. (I'll assume our context is C/C++ instead of Java, btw). With MCS, it's trivial and obvious to adapt the algorithm to use parking. I'll typically add a thread reference field to the MCS QNode structure. (I've yet to try it, but I think the same can be done with the "K42" MCS variant). Broadly, queue-based locks with local spinning are amenable to conversion to spin-then-park. A ticket lock, in comparison, isn't a good candidate. CLH requires a bit more finesse, so I'll provide that here for illustration.

As an aside, MCS with spin-then-park waiting is acceptable in some circumstances, but should be used with caution. Recall that MCS is strict FIFO-FCFS. (All strict FIFO locks also provide succession by direct handoff of ownership). If the critical section length or thread count is such that the threads at the tail of the MCS queue start to park, then you'll spend a large fraction of time in the kernel in voluntary context switching, which is quite expensive. Specifically, the critical path includes kernel overheads to wake a successor. Recently arrived threads at the tail are spinning, while those at the head -- the next ones to be granted ownership -- may be parked in the kernel, which is exactly what we don't want. The performance inflection point where this effect manifests is rather abrupt as we increase contention. Because waiting threads ultimately need to be able to park, and because parking with MCS can be toxic to performance, we tend not to find this combination in production code.

(I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).

A classic pure-spinning CLH algorithm might appear as follows in a pidgin
python/C++ dialect :

 

Acquire(L) :
    auto n = QNodeAllocate()
    n->Locked = 1
    auto prv = swap (&L->Tail, n)
    while prv->Locked != 0 : Pause()
    QNodeFree (prv)
    L->Owner = n

Release(L) :
    L->Owner->Locked = 0


The fields should be obvious from the usage, so I've skipped the structure
definitions.

(I'll take a quick digression to note the duality between queues and mutual exclusion locks. It's usually trivial to convert a lock-free MPSC queue -- multiple-producer-single-consumer -- into a lock. Threads in lock() arrive and enqueue an "acquire" request upon which they wait. Only the owner can dequeue at unlock()-time, thus the "SC" constraint. Conversely, you can take a queue-based lock such as CLH and deconstruct it to yield an MPSC queue such as Vyukov's MPSC queue).

I've converted it to use park-unpark as follows, but I'd be interested in other
approaches :

 

Acquire(L) :
    auto n = CLHNodeAllocate()
    n->Locked = 1
    auto prv = swap (&L->Tail, n)
    if prv->Locked == 1 :
        // Self uniquely identifies the current thread
        if swap (&prv->Locked, Self) != 0 :
            while prv->Locked != 0 :
                Park()
    CLHNodeFree (prv)
    L->Owner = n

Release(L) :
    auto w = swap (&L->Owner->Locked, 0)
    assert w != 0
    if w != 1 : Unpark(w)


This changes the encoding of the node "Locked" field from the usual 0/1 to 0, 1 or a thread reference. 0 continues to mean available, 1 means locked, and any other value indicates locked and identifies the thread waiting on that queue node. We assume "1" is distinguished and no thread will be identified by that value. CLHNodeAllocate and CLHNodeFree are implemented with thread-local caches of free queue nodes, and typically only require a few load and store operations. We can do a bit better than the above if you're willing to make the queue nodes type-stable and immortal. I'd like to avoid that, however.

We'll now shift to Brandenburg's PF-T reader-writer lock. This is an extremely clever algorithm and is quite subtle. It's also very terse. If you're not familiar with the algorithm, I recommend reading section 4.1 of their paper in RTSJ11. The lock is phase-fair, which, depending on your tastes, may be a useful property. As written, it uses pure spinning. Additionally, the only atomic it needs is fetch-and-add. PF-T requires a more elaborate transformation -- which I'll step through incrementally -- to use spin-then-park. For the purposes of explication I'll assume a sequentially consistent memory model. All shared fields are assumed to be atomic<T>.

 

Class Brandenburg :
    int rin
    int rout
    int win
    int wout

// The lowest-order bit of "rin" encodes the phase number.
// The next higher bit of "rin" is the writer-present bit.
// Note that rin+rout resemble the ingress-egress variables
// we use in our cohort reader-writer locks.
//
// Bit 0x100 and above in rin and rout is the count of arrived readers and
// departed readers, respectively. 0x100 was used in the original algorithm as
// that enabled the use of a byte-store instead of an atomic to clear the phase and
// writer-present bits when releasing write permission. Mixed-size accesses in
// the context of shared or atomic operations can be problematic, so I've eliminated
// the byte-store optimization and just use an atomic to clear those bits.

Reader(L) :
    // acquire R permission
    auto w = FetchAdd (&L->rin, 0x100) & 3
    if w != 0 :
        while w == (L->rin & 3) : Pause()

    // release R permission
    FetchAdd (&L->rout, 0x100)

Writer(L) :
    // acquire W permission
    // resolve W-W conflicts
    auto Ticket = FetchAdd (&L->win, 1)
    while Ticket != L->wout : Pause()
    // Next, wait for readers to drain
    auto w = 2 | (Ticket & 1)
    auto tx = FetchAdd (&L->rin, w)
    while tx != L->rout : Pause()

    // Clear the low-order 2 bits of L->rin
    FetchAdd (&L->rin, -(L->rin & 3))
    L->wout ++


The first thing to note is that win and wout really constitute a ticket lock used to adjudicate access between writers. The implementation also leverages the low-order bit of "win" as the phase number. So we can replace win+wout with a proper mutex, and add an explicit phase variable. We'll assume that
our mutex is implemented properly and avoids unbounded spinning. This results in the following:

 

Class Brandenburg :
    int rin
    int rout
    mutex WriterLock
    int Phase

Reader(L) :
    // acquire R permission
    auto w = FetchAdd (&L->rin, 0x100) & 3
    if w != 0 :
        while w == (L->rin & 3) : Pause()

    // release R permission
    FetchAdd (&L->rout, 0x100)

Writer(L) :
    // acquire W permission
    MutexLock (WriterLock)
    // Next, wait for readers to drain
    auto p = (++L->Phase) & 1
    auto tx = FetchAdd (&L->rin, p|2)
    while ((tx ^ L->rout) & ~3) != 0 : Pause()

    // Clear the low-order 2 bits of L->rin
    FetchAdd (&L->rin, -(L->rin & 3))
    MutexUnlock (WriterLock)


So at this point the writers are in check (no unbounded spinning) except for at most one writer that's waiting for readers to drain. In anticipation of subsequent changes I'm going to clean up the rin+rout encodings and move the phase number back into the LSB of rin. The use of 0x100 as the increment for a single reader is an historical artifact related to the word-tearing trick, so we'll use 4 instead. (The low order bits remain the writer-present bit and phase number). We also change the code to advance the phase when releasing write permission.

 

Class Brandenburg :
    int rin
    int rout
    mutex WriterLock

Reader(L) :
    // acquire R permission
    auto w = FetchAdd (&L->rin, 4)
    if (w & 2) != 0 :
        // writer is active; wait for the next read phase
        while ((w ^ L->rin) & 1) == 0 : Pause()

    // release R permission
    FetchAdd (&L->rout, 4)

Writer(L) :
    // acquire W permission
    MutexLock (WriterLock)
    // Resolve W-R conflicts : specifically W vs extant R
    auto t = FetchAdd (&L->rin, 2) & ~1
    // Wait for readers to drain
    while L->rout != t : Pause()

    // Clear the writer-present bit and advance the phase
    auto cur = L->rin
    FetchAdd (&L->rin, (cur ^ 3) - cur)
    MutexUnlock (WriterLock)


The changes in the step above were relatively cosmetic, but the next transformation is rather large. We're going to implement an explicit stack of waiting readers. This is the "RList" field, the low order bit of which also serves as the phase number. There are now 2 copies of the phase bit, one in the LSB of rin and one in the LSB of RList. They change almost in unison. We'll also assume that the thread structure has "Next" and "Grant" fields that we can use. "Self" is a thread-local variable that refers to a thread's own thread structure. Note that our approach is not vulnerable to the ABA problem.

 

Class Brandenburg :
    mutex WriterLock
    Thread * RList
    int rin
    int rout

Reader(L) :
    // acquire R permission
    auto w = FetchAdd (&L->rin, 4)
    if (w & 2) != 0 :
        // writer is active; wait for the next read phase
        // Attempt to push ourselves on the RList
        auto ph = w & 1    // arrival phase of this thread
        Self->Grant = 0
        Thread * head = L->RList
        for :
            // If the phase changed then proceed directly to the RCS.
            // Threads with the previous phase "bounce off" RList.
            // RList accepts threads only of the current phase.
            if (head & 1) != ph : goto EnterRCS
            Self->Next = head & ~1
            auto v = CAS (&L->RList, head, Self|ph)
            if v == head : break
            head = v
        while Self->Grant == 0 : Park()

EnterRCS:

    // release R permission
    FetchAdd (&L->rout, 4)

Writer(L) :
    // acquire W permission
    MutexLock (WriterLock)
    // Resolve W-R conflicts : specifically W vs extant R
    auto t = FetchAdd (&L->rin, 2)
    assert (t & 2) == 0
    // Wait for readers to drain
    while L->rout != (t & ~1) : Pause()

    auto cur = L->rin
    // Detach the list of waiting readers and advance the phase number
    // in the least significant bit of RList
    auto List = swap (&L->RList, NULL|((cur + 1) & 1))
    assert (List & 1) == (cur & 1)
    // Clear the writer-present bit and advance phase
    FetchAdd (&L->rin, (cur ^ 3) - cur)

    // Wake the list of waiting readers we have in-hand.
    // We could drop WriterLock early but there's no point as we're changing
    // phase and readers run next.
    List = List & ~1
    while List != NULL :
        auto w = List
        List = List->Next
        assert w->Grant == 0
        w->Grant = 1
        Unpark(w)

    MutexUnlock (WriterLock)

At this point we're close to complete. There's only one remaining case of unbounded spinning. That appears when a writer waits for the readers to drain. That's relatively easy to remedy by having the waiting writer make its ID or reference visible so readers can wake it. (There's a bit of a Dekker-ish protocol involved, but nothing elaborate). We can also run into performance problems where the writer, while unparking readers, is preempted by one of those readers. This is the so-called wakee-preempts-waker problem. We can address that issue with concurrent helping -- possibly by forming the set of threads to wake into a tree. Finally, we can make the lock NUMA-friendly by using a cohort lock for WriterLock and by using per-node rin+rout fields. I believe the final form remains phase-fair. And in fact it performs
quite well on NUMA systems.

Note that the "C-RW-NP" Cohort NUMA-aware reader-writer has only one instance of unbounded indefinite spinning, where the next writer waits for readers to depart. Again, that particular case is trivial to address.


Friday Feb 20, 2015

Cohort Locks in ACM TOPC

Lock Cohorting: A General Technique for Designing NUMA Locks appears in ACM Transactions on Parallel Computing (TOPC) Volume 1 Issue 2, January 2015. (local copy -- apologies for gzip-ed pdf but the file size exceeded Oracle blog limits).

This is an extended journal version of our PPoPP 2012 conference paper. Since the PPoPP version we've found ways to make the lock friendlier for Intel MESI/MESIF-based systems. Our preferred form now uses a partitioned ticket lock for the top-level global lock, and normal ticket locks for the node-level local locks. In addition, the journal version shows that cohort locks can provide benefit even on single-node non-NUMA systems by treating each core as a node for the purposes of cohort formation.

See also US8694706.




Wednesday Jan 28, 2015

Using reader-writer locks to improve hardware TLE : TLE-RW

At its most simple, traditional hardware TLE (Transactional Lock Elision) operates as follows : start a hardware transaction; "subscribe" to the lock state (aborting if the lock is found to be held); execute the critical section; and finally commit at the end of the critical section. For the purposes of discussion I'm assuming an Intel TSX-RTM implementation of hardware transactional memory (HTM). If we can't make progress using the optimistic transactional mode then we revert to using classic pessimistic mutual exclusion by taking the lock, running the critical section in non-transactional execution mode, and finally releasing the lock. Myriad retry policies are possible. It's worth noting that HTM implementations built on top of the existing cache coherence protocol with "requester-wins" conflict resolution policies will usually admit mutual abort with no progress by any party. That is, the progress properties can be rather poor.
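For reference, a minimal sketch of that traditional TLE form on TSX-RTM follows. The retry policy is deliberately naive, the fallback is a plain test-and-test-and-set lock, and the code assumes a compiler with the RTM intrinsics enabled; treat it as illustrative rather than a production elision path.

#include <immintrin.h>
#include <atomic>

static std::atomic<int> Lock {0};    // 0 = free, 1 = held

template <typename CS>
void TLEExecute (CS body, int MaxAttempts = 4) {
  for (int a = 0; a < MaxAttempts; a++) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
      if (Lock.load (std::memory_order_relaxed) != 0)
        _xabort (0xFF);              // subscribe : lock is held, so abort and retry
      body();
      _xend();
      return;
    }
    // aborted; a real policy would inspect status and back off or bail out early
  }
  // Pessimistic fallback : classic test-and-test-and-set acquisition
  for (;;) {
    int expected = 0;
    if (Lock.load (std::memory_order_relaxed) == 0 &&
        Lock.compare_exchange_weak (expected, 1, std::memory_order_acquire))
      break;
    _mm_pause();
  }
  body();
  Lock.store (0, std::memory_order_release);
}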

A relatively simple way to improve on the implementation above is to employ a reader-writer lock. Threads trying to use optimistic transactions to execute the critical section must acquire and hold shared "R" permission. Threads executing the critical section via the classic non-transactional path must hold exclusive "W" permission. This approach confers a number of advantages. First, this algorithm obviates the need for subscription in the optimistic path. Second, it avoids the lemming effect. (Perhaps I should have used the term "pyrrhic locking" instead, as we all know full well that lemmings don't really follow each other off cliffs to their death. It's pyrrhic in the sense that you "win" the lock, but lose the ability to run transactions). Third, under the usual simplistic TLE implementations, a thread taking the classic path will cause all concurrent threads in the optimistic path to abort. But with the reader-writer lock variant, threads taking the pessimistic path wait politely for the optimistic threads to finish. The existence of threads in the R path is visible to threads trying to acquire W -- this is vaguely analogous to the idea of "visible readers" in STMs. (See also TLRW). Visible readers and polite writers may result in fewer aborts and fewer cycles wasted on futile transactions. Both the simple traditional form and the form based on reader-writer locks prohibit simultaneous execution of transactional and non-transactional modes under the lock, but the reader-writer lock variety is a bit less medieval. In a sense we have a "guarded" execution mode for transactions. Finally, some reader-writer lock implementations intentionally control admission policy so as to promote large groups of simultaneous readers -- we call this "R-group formation". When used for TLE, a reader-writer lock that promotes R-group formation will also tend to exhibit better throughput, say, than a reader-writer lock that uses a strict FIFO admission policy. Specifically, larger R-groups let us co-schedule larger groups of concurrent transactions.
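A bare-bones rendering of the reader-writer variant appears below, layered over pthread_rwlock purely for brevity. An actual implementation would prefer a lock that promotes R-group formation, as noted above, and a smarter retry policy.

#include <immintrin.h>
#include <pthread.h>

static pthread_rwlock_t RW = PTHREAD_RWLOCK_INITIALIZER;

template <typename CS>
void TLERWExecute (CS body, int MaxAttempts = 4) {
  // Optimistic path : hold R so pessimistic writers wait politely for us,
  // and no lock subscription is needed inside the transaction.
  pthread_rwlock_rdlock (&RW);
  for (int a = 0; a < MaxAttempts; a++) {
    if (_xbegin() == _XBEGIN_STARTED) {
      body();
      _xend();
      pthread_rwlock_unlock (&RW);
      return;
    }
  }
  pthread_rwlock_unlock (&RW);

  // Pessimistic path : exclusive W excludes all optimistic readers.
  pthread_rwlock_wrlock (&RW);
  body();
  pthread_rwlock_unlock (&RW);
}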

As a side note, hardware TLE can be implemented either within pthread_mutex primitives, or on top of such primitives. In the latter case, if your TLE implementation covers all critical sections protected by the lock in question then you can add a separate "isLocked" tracking variable for the purposes of subscription. In this case the slow "classic" pessimistic path would acquire the lock and store "true" into isLocked. On x86 you also need a fence or fence-equivalent instruction after the store. In some circumstances, however, you might not have access to all the critical sections to impose the use of the isLocked convention. In that case it'd be helpful if glibc were to expose a pthread_mutex_islocked_np() operator which would test the lock state and subscribe. TLE could call that function in the fast optimistic path. Note that the reader-writer lock form needs neither an isLocked() call to inspect the lock nor an explicit tracking variable.

There are a few ways to embellish this approach. I’ve only explored this on a single-node system, but it looks useful to use a cohort NUMA reader-writer lock with 2 synthetic logical cohort nodes, one for R and one for W. This tends to further favor and promote R-group formation, and thus "batch" or co-schedule groups of transactions together to an even greater degree, yielding better parallelism.

Another idea is to augment the R path with K-exclusion. (See US8402464 for more details). Because of the mutual abort with no progress issue mentioned above, we can encounter fratricide if we have too many threads simultaneously in the R path under a given lock. By decreasing the number of threads in the R-path, we can sometimes improve aggregate throughput. K-exclusion around the R path gives us a convenient way to throttle the number of threads. Specifically, we vary K based on the recent transactional success or abort rates. When K reaches 1 we have a degenerate case and just revert to mutual exclusion.

An AIMD policy for K seems to work reasonably well. AIMD policies are usually associated with TCP congestion windows, but seem to map cleanly to problems involving optimistic synchronization. The TCP sending window is also optimistic, as elements in the window may require re-transmission. (A few years ago I used AIMD policies to adapt the spin duration for contended Java locks).
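One plausible AIMD rule for K is sketched below; for simplicity it adjusts on individual commit and abort events rather than on measured rates, and the initial value, bound, and adjustment steps are arbitrary placeholders rather than tuned values.

#include <algorithm>
#include <atomic>

// Additive-increase / multiplicative-decrease control of the K-exclusion bound.
struct KController {
  std::atomic<int> K {8};
  static constexpr int MaxK = 64;

  void OnCommit() {         // transaction committed : gently admit more concurrency
    int k = K.load (std::memory_order_relaxed);
    if (k < MaxK) K.store (k + 1, std::memory_order_relaxed);
  }

  void OnAbort() {          // conflict abort : back off sharply
    int k = K.load (std::memory_order_relaxed);
    K.store (std::max (1, k / 2), std::memory_order_relaxed);
  }
};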

The canonical use case for reader-writer locks is the usual and obvious single-writer vs multiple-reader roles. Our example above is single pessimistic/normal vs multiple optimistic/transactional. Another mode we see is in lock-free code with deferred memory reclamation, where R permission confers the right to access some data, and the W role is required to unmap or physically free the underlying memory. A related usage is to protect the file system mount tables in operating systems. R permission enables a thread to parse pathnames and access the filesystem without worry of dismount, while W is used by unmount to make sure the filesystem is quiesced. Another use-case is multiple mutator threads vs a garbage collector where the heap is protected by a single reader-writer lock. If you have a JVM or other managed runtime with a stop-the-world copying collector, then normal heap mutator threads have R access. If a mutator calls out of the JVM via JNI, or parks itself, then it relinquishes R for the duration of that operation. The garbage collector takes the W role when active. Mutators also periodically poll for pending GC and relinquish R to allow the collector to run.

See also PhTM : Phased Transactional Memory which appeared in Transact 2007.

Note that a good quality reader-writer lock will provide automatic gang wakeup.


Monday Jan 26, 2015

Blizzard Preparations

3 cars in a 2-car garage

AVIS Frankfurt

I recently rented a car from AVIS at Frankfurt for travel to Dagstuhl. We drove to Dagstuhl, left it parked for 5 days, and returned without incident. On return, the attendant -- very pleasant, btw -- found 'damage' to the front driver's side rim and wanted to know details about the 'accident'. There was no accident while the car was in my hands. I said as much to the attendant, but didn't make any progress. AVIS tacked an additional 147€ charge on my corporate credit card. This was immediately after the Charlie Hebdo attack in Paris, so we knew security lines would be long and slow; even though we arrived 3 hours before the flight, we didn't have any spare time to argue the point.

Interestingly, there was no outbound inspection by AVIS when I rented the car. Below is a picture of the rim after the attendant pointed out the issue. There's certainly something on the rim at about the 6:00 and 3:30 positions, but it didn't happen while I had the car. The photographic quality is bad, as the return area was very poorly lit. I took a quick look around the car when I rented it, but didn't spot the supposedly damaged area.

Hopefully this isn't systemic behavior or practice on the part of AVIS.

I suppose the lesson is (a) take a walk-around video when you rent the car -- hopefully there's enough light -- or (b) perhaps use a different vendor.

A subsequent letter from AVIS detailing the damage charge ended as follows :

If you would like to exclude this risk in your next rental, you can purchase an excess-waiver for a small additional charge.

AVIS

Wednesday Jan 14, 2015

Seen in St. John

I originally mistook this for "Dave".

Just back from Dagstuhl

Concurrent Computing in the Many-core Era 2015 - Schloss Dagstuhl.

As usual, a productive and invigorating week. (All talks | My slides).

Conference publication

We also visited the Völklingen Ironworks. (wikipedia).

Monday Dec 29, 2014

Measuring short-term fairness for locks

In a companion blog entry I discussed ways to measure or characterize the long-term fairness of lock implementations. Measuring short-term fairness is also interesting. We could easily have a lock, such as a cohort lock, that is fair over long intervals but intentionally unfair over the short term. It's useful to have ways to define exactly how unfair a lock is over some interval. We'll assume the lock admission policy is work-conserving.

It's worth mentioning that measuring short-term fairness is slightly tricky because of induced probe effect. As such, we sometimes use statistical sampling to reduce the overheads albeit with reduced certainty. I'll leave that discussion for subsequent blog entries.

I'll borrow Denning's working set concept, but apply it to locks instead of memory management. The working set of a program over some interval I is the set of distinct pages accessed during I. Where convenient and unambiguous, we'll sometimes quantify the working set as the number of pages in the working set. Similarly, I'll define the lock working set (LWS) for a given lock L over interval I as the set of distinct threads that acquired the lock during the interval. Instead of defining the interval I over wall-clock time, it's more convenient to treat each acquire of the lock as a virtual increment of our interval clock. We'll define LWSS as the size of the LWS. Even though we might define the interval I to be small, we can partition a long run into a set of M I-length disjoint intervals and report the average or median LWSS over those M points.
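Given a recorded acquisition trace -- the sequence of thread IDs in lock-acquisition order -- the average LWSS over disjoint intervals can be computed directly; a sketch follows, with the interval length I as a parameter.

#include <cstddef>
#include <unordered_set>
#include <vector>

// Average lock working set size over disjoint intervals of I acquisitions.
double AverageLWSS (const std::vector<int> & trace, size_t I) {
  double total = 0;
  size_t intervals = 0;
  for (size_t base = 0; base + I <= trace.size(); base += I) {
    std::unordered_set<int> distinct (trace.begin() + base, trace.begin() + base + I);
    total += distinct.size();
    intervals++;
  }
  return intervals ? total / intervals : 0.0;
}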

A related statistic is the median time to reacquire (MTTR). Conceptually, we can think of a measurement run as generating a stream of thread IDs that acquired the lock. We can then iterate through that stream, and for every acquire, determine the number of intervening acquire operations that occurred since the thread last obtained the lock. That gives us a time-to-reacquire value, and in turn, we can compute the MTTR over the run.
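The MTTR can be computed from the same kind of trace; the sketch below counts intervening acquisitions rather than distinct intervening threads (both variants are noted below).

#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Median number of intervening acquisitions between successive acquisitions
// by the same thread, over the whole trace.
long MedianTimeToReacquire (const std::vector<int> & trace) {
  std::unordered_map<int, size_t> last;   // thread ID -> index of its previous acquisition
  std::vector<long> gaps;
  for (size_t i = 0; i < trace.size(); i++) {
    auto it = last.find (trace[i]);
    if (it != last.end()) gaps.push_back ((long)(i - it->second) - 1);
    last[trace[i]] = i;
  }
  if (gaps.empty()) return 0;
  std::nth_element (gaps.begin(), gaps.begin() + gaps.size()/2, gaps.end());
  return gaps[gaps.size()/2];
}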

Note that we might measure either the number of intervening acquires, or the number of distinct threads to acquire in the interim.

My colleague Tim Harris noted that it could be useful to plot by varying the interval length I on the x-axis, and the average LWSS on the y-axis. This is somewhat reminiscent of MMU (minimum mutator utilization) graphs that appear in the garbage collection literature.

When a contended lock is unfair over the short term, it's typically the case that the set of threads circulating over the lock is partitioned into active and passive sets. The active circulating set corresponds to the LWSS.

Measuring long-term fairness for locks

Let's say we have N concurrent and homogeneous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-work benchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're also interested in quantifying the fairness, as reflected by the distribution of those points. For simplicity, we'll assume that the scheduler is ideally fair, there are no NUMA or physical "geographic" issues in play, and that each thread has about the same cache footprint. The only remaining source of unfairness is the lock implementation or policies. (In practice there are myriad confounding factors, including energy and thermal caps, but we'll ignore those and assume an idealized model). A histogram of the completion counts is a good way to visualize the fairness or unfairness of the lock. I recommend HdrHistogram.

Sometimes, though, it's convenient to describe unfairness in terms of a simple real-valued numerical statistic. Such a descriptive statistic can be used to quantify the fairness of various lock policies, and in particular to help establish the trade-off between fairness and throughput. Ideally that statistic would be scale-invariant -- that property is desirable but optional. Some of the usual statistics are standard deviation or variance. Extrema-based statistics such as (max-min) or (max-min)/max can also be useful. These can give us a sense of the size of the range of the data points. The average divided by the max can also provide some insight. IQR is another commonly used statistic, as is Jain's Fairness Index. Cederman et al. suggested another useful fairness metric. In recent papers I've reported the relative standard deviation. (In communications literature it's not uncommon to see fairness expressed in terms of average/stddev, which is the reciprocal of the relative standard deviation). Median Absolute Deviation (MAD) is also informative. I don't have one favorite -- my lock benchmark harnesses report all of the above.

Recently I was looking at other measures of disparity or uniformity and came up with the following idea. First, we sort our N completion counts in ascending order. We then plot as follows. On the X-axis we have the number of threads T, and on the Y-axis we have the cumulative sum of iteration counts up to thread #T. (Think: CDF or Riemann Integral) If the completion counts were 10,15,22,33 then the Y values would be C(x) = 0,10,25,47,80, for instance, for 0,1,2,3,4 threads, respectively. Beware that the C(x) function is obviously discrete, but for the purposes of discussion and terminology I'll treat it as a continuous real-valued function. Next, we normalize both axes to [0,1]. For the Y-axis, we can simply divide C(x) by the maximum -- final -- C(x) value. If all the data points are equal -- ideally fair -- then C(x) is just the identity diagonal function : C(x)=x. Note that C(x) is convex. The more unfair the distribution, the larger the area under the diagonal and above the normalized C(x) function. And in fact that area measure makes a useful index of inequality. We could derive additional statistics from this approach, such as the maximum x-C(x) value, but the area between the diagonal and C(x) seems viable.
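A direct computation of that area-based index from the per-thread counts might look like the following sketch; it yields 0 for a perfectly fair distribution, and it is essentially half of the Gini-style coefficient mentioned below.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Area between the identity diagonal and the normalized cumulative curve C(x)
// built from the sorted per-thread completion counts. 0 means perfectly fair;
// larger values indicate greater disparity.
double UnfairnessArea (std::vector<long> counts) {
  std::sort (counts.begin(), counts.end());
  const double N = (double) counts.size();
  const double total = std::accumulate (counts.begin(), counts.end(), 0.0);
  if (counts.empty() || total == 0) return 0.0;
  double area = 0, running = 0;
  for (size_t k = 1; k <= counts.size(); k++) {
    running += counts[k-1];
    area += (k / N) - (running / total);   // diagonal minus C(x) at x = k/N
  }
  return area / N;                         // rectangle-rule approximation of the area
}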

Subsequently, I realized a similar approach had been long used in economics to describe income disparity : Gini Coefficient.

See also : Measuring Short-Term Fairness for locks.

Tuesday Dec 02, 2014

No blocking parking

Taken at lovely Gillette Stadium; possibly amusing to synchronization practitioners. "Easy exit" sounds interesting as well.

Thursday Oct 09, 2014

Synchronization horror story of the day

While evaluating some new lock algorithms (exposed via LD_PRELOAD interposition on the pthread_lock and condvar interfaces), I noticed that C++ catch and throw on Solaris/SPARC compiled with GCC 4.9.0 will take runtime paths that acquire locks and impede scaling. Specifically, if you have a set of threads that simply execute "try { throw 20; } catch (int e) {;}", they won't scale as expected because of runtime locks. _Unwind_RaiseException and _Unwind_Find_FDE always appear on the stack when we encounter the contention. Interestingly, I didn't see any evidence of this behavior on Ubuntu 14.04 on x86.

Tuesday Jul 29, 2014

A simple lazy subscription pathology

Following up on The Pitfalls of lazy subscription, I thought I'd provide a simple case that illustrates where transactional lock elision (TLE) with lazy subscription can fail. I've managed to reproduce the failure in-house on an i7-4770 (Haswell).

Say we have a ring-buffer that’s used for logging. It just records the last 16 messages posted. (This is kind of common for “black box flight recorders” that appear in kernels and the JVM). The memory layout is : int pos; intptr_t RingBuffer[16]; volatile int Lock. We'll also assume that we compile with code at low optimization levels, so accesses to "pos" result in loads and stores with no caching in registers. “Lock” is a simple test-and-test-and-set lock possibly augmented with TLE and late subscription.

The critical section is :



Post(m) :
    Acquire(&Lock)
    auto index = pos ++           // load and store pos
    RingBuffer[index] = m
    if pos == 16 : pos = 0        // reload pos
    Release(&Lock)

A number of threads just loop, calling Post(0). Some take the lock, others use TLE with late subscription. It's possible for one of the threads using TLE with late subscription to observe an intermediate and transient "pos" value of 16 by virtue of some thread that holds "Lock" and has incremented "pos" from 15 to 16, but not yet reset the variable to 0. The thread using TLE that observed "pos" at 16 will then store into RingBuffer[16], overwriting "Lock" with 0, so the late subscription check succeeds and the transaction inadvertently commits. This is a case of "wrongful commit" via lock corruption.

Characterizing which critical sections remain safe under TLE with late subscription remains an interesting research topic. But in the general case it's unsafe.

It's worth pointing out that for weakly ordered processors, the load or loads that constitute the classic early subscription check must have proper fencing. None is needed on x86, of course.

Monday Jul 28, 2014

Hardware extensions to make lazy subscription safe

Hardware extensions to make lazy subscription safe is a follow-on to our WTTM 2014 paper, Pitfalls of Lazy Subscription . We describe a number of hardware approaches to avoid the wrongful commit pathology admitted by generalized transactional lock elision using lazy subscription. Eager (early) subscription remains safe, of course.

About

Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.
