
Recent Posts

Towards an Efficient Pauseless Java GC with Selective HTM-Based Access Barriers - in ManLang 2017

Towards an Efficient Pauseless Java GC with Selective HTM-Based Access Barriers appears in ManLang 2017 -- formerly PPPJ (http://d3s.mff.cuni.cz/conferences/manlang17/). Abstract: The garbage collector (GC) is a critical component of any managed runtime environment (MRE), such as the Java virtual machine. While the main goal of the GC is to simplify and automate memory management, it may have a negative impact on the application performance, especially on multi-core systems. This is typically due to stop-the-world pauses, i.e., intervals for which the application threads are blocked during the collection. Existing approaches to concurrent GCs allow the application threads to perform at the same time as the GC at the expense of throughput and simplicity. In this paper we build upon an existing pauseless transactional GC algorithm and design an important optimization that would significantly increase its throughput. More precisely, we devise selective access barriers that define multiple paths based on the state of the garbage collector. Preliminary evaluation of the selective barriers shows up to 93% improvement over the initial transactional barriers in the worst case scenario. We estimate the performance of a pauseless GC having selective transactional barriers and find it to be on par with Java's concurrent collector.


Malthusian Locks

Malthusian Locks appears in EuroSys 2017. An extended version is in arxiv. Abstract: Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse, which can manifest even when there are fewer ready threads than available cores. Often, such software also has highly contended locks. We leverage the existence of such locks by modifying the lock admission policy so as to intentionally limit the number of distinct threads circulating over the lock in a given period. Specifically, if there are more threads circulating than are necessary to keep the lock saturated (continuously held), our approach will selectively cull and passivate some of those excess threads. We borrow the concept of swapping from the field of memory management and impose concurrency restriction (CR) if a lock suffers from contention. The resultant admission order is unfair over the short term but we explicitly provide long-term fairness by periodically shifting threads between the set of passivated threads and those actively circulating. Our approach is palliative, but is often effective at avoiding or reducing scalability collapse, and in the worst case does no harm. Specifically, throughput is either unaffected or improved, and unfairness is bounded, relative to common test-and-set locks which allow unbounded bypass and starvation. By reducing competition for shared resources, such as pipelines, processors and caches, concurrency restriction may also reduce overall resource consumption and improve the overall load carrying capacity of a system.


Mitigating the Java nanoTime coherence hotspot

Java's nanoTime() API guarantees a monotonic (really, non-retrograde) relative clock source. It's also expected to be causal in the following sense. Say thread A calls nanoTime(), gets value V, and then stores V into memory. Some other thread B then observes A's store of V, then calls nanoTime() itself and gets W. We expect W should be greater than or equal to V. In an ideal world the clock sources underlying nanoTime() would be causal. But that's not always the case on all platforms, in which case the JVM has to enforce the property by tracking the maximum observed time in a memory location and only returning the larger of the underlying source and that variable. In turn, this creates a coherence hotspot that can impede scaling. (In some cases we have even seen a concave scaling curve, particularly on NUMA systems). I've provided a simple harness that models the nanoTime() implementation on Solaris. It creates T threads, each of which loops calling MonoTime(). At the end of a 10 second measurement interval it reports throughput -- the aggregate number of MonoTime() calls completed by that cohort of threads. On a SPARC T5 for 1, 2, 4, 8 and 16 threads, I observe 11M, 14M, 25M, 29M and 29M iterations completed, respectively. You might naively expect ideal scaling up to the number of cores (16 in this case), but that's not the case. (Note that I just happened to use a T5 for data collection. Gethrtime() is causal on a T5). MonoTime can also take a "granularity" argument, which allows a trade-off between the granularity of the returned value and the update rate on the variable that tracks the maximum observed time. (The command line is "./MonoTime Threads GranularityInNsecs"). 0 reflects the existing nanoTime() implementation. If I use a granularity of 1000 then for 1, 2, 4, 8 and 16 threads I observe 14M, 27M, 54M, 104M, and 181M iterations. (The improvement at 1 thread arises because of reduced local latency as we use atomic "CAS" less, but the improved scalability proper comes from reduced coherence traffic on the variable that tracks the maximum returned value). This might make an interesting "-XX:" switch for HotSpot, both as a diagnostic test for the existence of the problem, and as a potential work-around.
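
For concreteness, here is a minimal sketch of the MonoTime() idea described above. It is only an illustration under assumptions: rawTime() stands in for the underlying clock source (gethrtime() on Solaris), the names are mine, and memory orderings are left at the defaults.

    #include <atomic>
    #include <cstdint>
    #include <time.h>

    // Illustrative raw clock source; stands in for gethrtime()/clock_gettime().
    static uint64_t rawTime() {
      timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return uint64_t(ts.tv_sec) * 1000000000ULL + uint64_t(ts.tv_nsec);
    }

    // Shared maximum-observed-time variable -- the coherence hotspot.
    static std::atomic<uint64_t> MaxObserved{0};

    // Returns a non-retrograde, causal timestamp.  Granularity == 0 updates
    // MaxObserved on every call; a larger Granularity quantizes the returned
    // value and reduces the CAS rate on the shared variable.
    uint64_t MonoTime(uint64_t Granularity) {
      uint64_t now = rawTime();
      if (Granularity != 0) now -= now % Granularity;
      uint64_t seen = MaxObserved.load();
      for (;;) {
        if (seen >= now) return seen;   // some thread already observed a later time
        if (MaxObserved.compare_exchange_weak(seen, now)) return now;
        // CAS failure reloaded 'seen'; retry
      }
    }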


Preemption tolerant MCS locks

A simple test-and-set based spin lock is a reasonable choice when contention is nil or low. Lock handover is relatively efficient and there's no need to maintain a list of waiting threads, yielding a simple design. Under higher contention, queue-based locks such as MCS often provide better throughput and are a better choice. Classic MCS locks are strictly FIFO queue-based locks that use local spinning. (Clever lock implementations can try to adaptively select the best algorithm based on recent load, in order to try to have the best of both). However, if there are more ready threads than logical CPUs and involuntary context switching -- preemption -- is in play, then MCS performance can suffer. Specifically, the unlock operator might pass ownership to a preempted thread, resulting in undesirable lock-waiter preemption which in turn can result in convoying and head-of-line blocking phenomena. Somewhat perversely, simple test-and-set locks often perform better than MCS locks when preemption is active, as threads have to actively compete for the lock. That is, ownership will never be passed to a preempted thread. (Both MCS and test-and-set locks are vulnerable to lock-holder preemption, and it can certainly be the case that a thread acquires the test-and-set lock and is then immediately preempted, but by definition the successor was running when it acquired a test-and-set lock). To make MCS more preemption tolerant, we can use the Solaris schedctl facility to avoid handoff of the lock to threads that are preempted. Schedctl gives us an extremely efficient way -- requiring just a load instruction -- to query the scheduling state of threads from user-mode code. The unlock operator uses schedctl to interrogate the scheduling state of a tentative successor. The successor is the next node in the MCS chain after the owner's node. If that thread is running on a CPU, then the unlock proceeds normally and the caller passes ownership of the lock to the successor in the usual fashion. But if the successor has been preempted, the unlock operator splices the successor's MCS node out of the MCS chain and writes a special evicted value into that node. We can use the same field on which threads spin, using 3 possible values instead of the usual 2. The unlock operator then inspects the chain and repeats as necessary. When the preempted thread eventually resumes and is dispatched onto a CPU, it notices the evicted value and simply re-enqueues, "re-arriving" at the lock. If the unlock operator evicts the terminal element in the chain, then the lock is released and set to available state. (Evicting interior nodes requires just pointer manipulation. Excising the terminal node requires a CAS, as would be the case when the lock is made available). This clearly breaks the FIFO queue policy of normal MCS. On the other hand it forces the lock's admission policy to respect and accede to the kernel's scheduling and priority decisions instead of blindly and obstinately using FIFO. Put a different way, it's troubling if excessively rigid lock policies run counter to kernel scheduling policies. The kernel is ultimately in control, after all. If the kernel decided that thread A should run and B should be preempted, the lock should respect that decision and admit accordingly. And critically, this technique preserves MCS performance when there are more ready threads than CPUs and preemption is in play.
Normally we might find a fairly abrupt cliff at that point, but the augmented MCS avoids the usual drop in performance exhibited by classic MCS when the number of threads exceeds the number of CPUs. It's also worth noting that there's a window between the inspection of the schedctl state and passing the lock to a successor where that thread might be preempted, but in practice the window is small and the risk is no greater than normal lock-holder preemption. Kontothanassis et al. and He et al. suggested related ways to mitigate the issue of handoff to preempted waiters. I'll try to post a graph in the next week or so.
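
To make the eviction step concrete, here is a rough sketch of such an unlock path. This is only an illustration: ThreadIsRunning() stands in for the schedctl query (a single load of the successor's scheduling state), the QNode layout and memory orderings are simplified, and the re-arrival path in the lock operator is elided.

    #include <atomic>

    enum WaitState { LOCKED = 0, GRANTED = 1, EVICTED = 2 };   // 3-valued spin field

    struct QNode {
      std::atomic<QNode*> Next{nullptr};
      std::atomic<int>    State{LOCKED};
    };

    struct MCSLock { std::atomic<QNode*> Tail{nullptr}; };

    // Stand-in for the schedctl query: true if the thread that owns node n
    // is currently running on a CPU.
    bool ThreadIsRunning(QNode * n);

    void Unlock(MCSLock * L, QNode * self) {
      for (;;) {
        QNode * succ = self->Next.load();
        if (succ == nullptr) {
          // No visible successor: try to swing Tail back to null and release.
          QNode * expected = self;
          if (L->Tail.compare_exchange_strong(expected, nullptr)) return;
          while ((succ = self->Next.load()) == nullptr) { /* arrival in progress */ }
        }
        if (ThreadIsRunning(succ)) {
          succ->State.store(GRANTED);             // normal MCS handoff
          return;
        }
        // Successor is preempted: evict it rather than handing it the lock.
        QNode * after = succ->Next.load();
        if (after == nullptr) {
          // succ may be the terminal node: excise it and free the lock via CAS.
          QNode * expected = succ;
          if (L->Tail.compare_exchange_strong(expected, nullptr)) {
            succ->State.store(EVICTED);
            return;                               // lock is now available
          }
          while ((after = succ->Next.load()) == nullptr) { /* arrival in progress */ }
        }
        self->Next.store(after);                  // splice the interior node out
        succ->State.store(EVICTED);               // it will re-arrive when dispatched
        // Loop and examine the next candidate successor.
      }
    }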


Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures

Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures by Dave Dice, Maurice Herlihy and Alex Kogan, in ISMM 2016. Abstract: Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. These proposals include (1) exploiting the operating system’s memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard look-aside buffer (HLB) that allows a reclaiming thread to query whether there are hazardous pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.


JNI performance - false sharing on the "-UseMembar" serialization page

For background on the membar elision techniques and the serialization page, see the following: 7644409; Asymmetric Dekker Synchronization; and QPI Quiescence. On normal x86 and SPARC systems these are strictly local latency optimizations (because MEMBAR is a local operation) although on some systems where fences have global effects, they may actually improve scalability. As an aside, such optimizations may no longer be profitable on modern processors where the cost of fences has decreased steadily. Relatedly, on larger systems, the TLB shootdown activity -- interprocessor interrupts, etc -- associated with mprotect(PROT_NONE) may constitute a system-wide scaling impediment. So the prevailing trend is away from such techniques, and back toward fences. Similar arguments apply to biased locking -- another local latency optimization -- which may have outworn its usefulness. A colleague in Oracle Labs ran into a puzzling JNI performance problem. It originally manifested in a complex environment, but he managed to reduce the problem to a simple test case where a set of independent concurrent threads make JNI calls to targets that return immediately. Scaling starts to fade at a suspiciously low number of threads. (I eliminated the usual thermal, energy and hyperthreading concerns). On a hunch, I tried +UseMembar, and the scaling was flat. The problem appears to be false sharing for the store accesses into the serialization page. If you're following along in the OpenJDK source code, the culprits appear to be write_memory_serialize_page() and MacroAssembler::serialize_memory(). The "hash" function that selects an offset in the page -- to reduce false sharing -- needs improvement. And since the membar elision code was written, I believe biased locking forced the thread instances to be aligned on 256-byte boundaries, which contributes in part to the poor hash distribution. On a whim, I added an "Ordinal" field to the thread structure, and initialize it in the Thread ctor by fetch-and-add of a static global. The 5th created thread will have Ordinal==5, etc. I then changed the hash function in the files mentioned above to generate an offset calculated via ((Ordinal*128) & (PageSize-1)). "128" is important as that's the alignment/padding unit to avoid false sharing on x86. (The unit of coherence on x86 is a 64-byte cache line, but Intel notes in their manuals that you need 128 to avoid false sharing. Adjacent-sector prefetch makes it 128 bytes, effectively). This provided relief. With 128-byte units and a 4K base page size, we have only 32 unique "slots" on the serialization page. It might make sense to increase the serialization region to multiple pages, with the number of pages possibly a function of the number of logical CPUs. That is, to reduce the odds of collisions, it probably makes sense to conservatively over-provision the region. (mprotect() operations on contiguous regions of virtual pages are only slightly more expensive than mprotect() operations on a single page, at least on x86 or SPARC. So switching from a single page to multiple pages shouldn't result in any performance loss). Ideally we'd index with the CPUID, but I don't see that happening as getting the CPUID in a timely fashion can be problematic on some platforms. We could still have very poor distribution with the Ordinal scheme I mentioned above. Slightly better than the Ordinal approach might be to try to balance the number of threads associated with each of the slots. This could be done in the thread ctor.
It's still palliative as you could have a poor distribution over the set of threads using JNI at any given moment. But something like that, coupled with increasing the size of the region, would probably work well. p.s., the mprotect()-based serialization technique is safe only on systems that have a memory consistency model that's TSO or stronger. And the access to the serialization page has to be a store. Because of memory model issues, a load isn't sufficient. Update: friends in J2SE have filed an RFE as JDK-8143878.
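
As a concrete illustration of the offset computation described above, here is a minimal sketch. The names (ThreadShim, SerializationOffset, RegionSize) are mine and purely illustrative; the actual HotSpot change would live in write_memory_serialize_page() and the matching assembler path.

    #include <atomic>
    #include <cstddef>

    static std::atomic<int> OrdinalSource{0};       // bumped in the Thread ctor

    struct ThreadShim {
      int Ordinal;
      ThreadShim() : Ordinal(OrdinalSource.fetch_add(1)) {}
    };

    // Offset into the serialization region for a given thread.  128-byte spacing
    // avoids false sharing on x86 (64-byte lines plus adjacent-sector prefetch).
    // RegionSize is a power of two and would ideally span several pages,
    // over-provisioned relative to the number of logical CPUs.
    static inline size_t SerializationOffset(const ThreadShim * t, size_t RegionSize) {
      return (size_t(t->Ordinal) * 128) & (RegionSize - 1);
    }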


Evaluating HTM for pauseless garbage collectors in Java

Evaluating HTM for pauseless garbage collectors in Java by Maria Carpen-Amarie, Dave Dice, Patrick Marlier, Gaël Thomas and Pascal Felber appeared in The 13th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA-15). Abstract: While garbage collectors (GCs) significantly simplify programmers’ tasks by transparently handling memory management, they also introduce various overheads and sources of unpredictability. Most importantly, GCs typically block the application while reclaiming free memory, which makes them unfit for environments where responsiveness is crucial, such as real-time systems. There have been several approaches for developing concurrent GCs that can exploit the processing capabilities of multi-core architectures, but at the expense of a synchronization overhead between the application and the collector. In this paper, we investigate a novel approach to implementing pauseless moving garbage collection using hardware transactional memory (HTM). We describe the design of a moving GC algorithm that can operate concurrently with the application threads. We study the overheads resulting from using transactional barriers in the Java virtual machine (JVM) and discuss various optimizations. Our findings show that, while the cost of these barriers can be minimized by carefully restricting them to volatile accesses when executing within the interpreter, the actual performance degradation becomes unacceptably high with the just-in-time compiler. The results tend to indicate that current HTM mechanisms cannot be readily used to implement a pauseless GC in Java that can compete with state-of-the-art concurrent GCs. See also: US9208081.


Locks with LIFO admission order

Why would we ever want a lock with a LIFO admission policy? First, a LIFO lock provides a useful measure of real-world scalability. Let's say we have a set of threads that each iterate as follows: acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as per-thread iteration counts. Amdahl's law says the maximum ideal speedup relative to a single thread should be (N+C)/C. We can run our experiments, varying the thread count, measure aggregate throughput, and compare to see how close we come to Amdahl's bound. Assuming we have a heterogeneous system and ignoring any potential superlinear effects, the observed peak speedup will be capped by Amdahl's bound. And if we use a fair FIFO lock, such as MCS, the threads will all have approximately equal completion counts. It's worth noting that Amdahl's law is sometimes misapplied to locks and critical sections. In the classic Amdahl model, during the serial phase no other threads may be executing concurrently, while with locks, when one thread is in the critical section other threads may be executing concurrently in their non-critical sections. That is, classic Amdahl's law applies to barriers. See also Gustafson's law, Gunther's universal scaling law, and in particular Eyerman's model. Critically, though, the maximum speedup bounds still hold. Now let's say we switch to a LIFO lock. Ideally, the aggregate throughput will be the same as we saw with the FIFO lock. If N=30 and C=10, then the ideal speedup is 4X. If we run with 10 threads under a LIFO lock, when we examine the distribution of per-thread completion counts we expect to see 4 threads dominate with about equal performance, and 6 threads should have starved completely. This gives us another non-analytic empirical way to gauge the maximum speedup over a lock. Put another way, can we figure out how many threads we can "squeeze" or pack into a contended lock before we hit saturation? We keep increasing T until some threads show evidence of starvation. This lets us discern the N/C ratio. Of course we could try to derive the ratio using FIFO locks, varying T, and using Amdahl's law, but in practice there are quite a few practical confounding factors. The LIFO approach gives us a fairly direct reading of the number of threads that will "fit" before we reach over-saturation. LIFO locks are also useful in their own right. While they are deeply unfair, they work very well with spin-then-park waiting strategies. If we imagine the lock as implemented with a stack of waiting threads, threads near the head are most likely to be spinning, and are also most likely to be next granted the lock. If the lock is over-saturated, then under a LIFO policy, ownership will circulate over just a subset of the contending threads. In turn this can reduce cache pressure and yield benefits arising from thermal and energy bounds. Of course we have to take measures to ensure long-term eventual fairness, but many locks intentionally trade off short-term fairness for throughput. (See our "Cohort" locks, for example). A possibly systemic downside to LIFO locks is that arrivals and departures may need to access the same lock metadata, creating an acute coherence hot-spot. With a contended MCS lock, for instance, an unlock operation doesn't need to access the "tail" field.
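
As a concrete illustration of the fixed-time-report-work methodology sketched above (and before turning to actual LIFO designs), here is a minimal harness. It is only a sketch under assumptions: std::mutex stands in for the lock under test, and Spin() is a crude stand-in for the C and N work units.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Fixed-time-report-work: T threads loop over lock/CS/unlock/NCS for a fixed
    // interval; at the end we dump per-thread completion counts.  Under a LIFO
    // lock, starved threads show up directly in the distribution.
    static std::mutex L;                    // placeholder for the lock under test
    static std::atomic<bool> Stop{false};

    static void Spin(int units) {           // crude stand-in for C and N work units
      for (volatile int i = 0; i < units * 100; i++) {}
    }

    int main() {
      const int T = 10, C = 10, N = 30;
      std::vector<long> counts(T, 0);
      std::vector<std::thread> threads;
      for (int t = 0; t < T; t++) {
        threads.emplace_back([&, t] {
          while (!Stop.load(std::memory_order_relaxed)) {
            { std::lock_guard<std::mutex> g(L); Spin(C); }   // critical section
            Spin(N);                                         // non-critical section
            counts[t]++;
          }
        });
      }
      std::this_thread::sleep_for(std::chrono::seconds(10));
      Stop.store(true);
      for (auto & th : threads) th.join();
      long total = 0;
      for (int t = 0; t < T; t++) { total += counts[t]; printf("thread %d : %ld\n", t, counts[t]); }
      printf("aggregate : %ld\n", total);
      return 0;
    }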
I wondered if there was a LIFO analog to the classic FIFO ticket lock and put the question to my colleagues in Oracle Labs' Scalable Synchronization Research Group, and collected some of the designs, which I'll report below. It's an interesting exercise and puzzle, and hard to resist for concurrency folks. Alex Kogan, Victor Luchangco, Tim Harris, Yossi Lev and I contributed. Any mistakes are mine. The most basic LIFO algorithm I could think of was to implement an explicit stack of waiting threads with a central "head" pointer which points to the most recently arrived thread. The approach is pretty obvious and yields a true LIFO admission order. Expressed in a pidgin Python/C++ dialect and assuming atomic<T> for all shared variables, the following sketch describes that form. The lock is very similar to the Thread::MuxAcquire() and ::MuxRelease() primitives that I wrote for the HotSpot JVM. (Those are internal locks used by the JVM to get over a bootstrapping phase where the normal native C++ HotSpot Monitor:: and Mutex:: classes aren't yet initialized). We call this form "E3". (I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism).

    class StackElement :
      StackElement * Next
      int Wait

    class LIFOLock :
      // Encoding for Head field :
      // 0 = lock is free
      // 1 = locked but no waiting threads
      // T = locked : T refers to stack of waiting threads
      StackElement * Head

    Acquire(L) :
      StackElement A ;    // on-stack auto-variable
      auto h = L->Head
      if h == 0 :
        h = CAS (&L->Head, 0, 1)
        if h == 0 : return    // uncontended fast-path acquisition
        // inopportune interleaving -- some other thread mutated L->Head
        // in the LD-CAS window above.  We lost the race
      // Apparent contention ...
      // Doorway phase : use CAS to push our element A onto the stack
      for
        assert h != 0
        A.Next = h & ~1
        A.Wait = 1
        auto v = CAS (&L->Head, h, &A)
        if v == h : break
        if v == 0 :
          v = CAS (&L->Head, 0, 1)
          if v == 0 : return
        h = v
      // Waiting phase
      // the lock provides local spinning
      while A.Wait != 0 : Pause()
      assert L->Head != 0

    Release(L) :
      auto h = L->Head
      assert h != 0
      if h == 1 :
        h = CAS (&L->Head, 1, 0)
        if h == 1 : return    // simple uncontended release
        // the stack can only grow while the lock is held ...
      // The lock is contended
      // try to pop the head of the stack.
      // This is the most recently arrived thread
      assert h != 0 && h != 1
      // Note that we're using CAS to push and pop elements
      // Normally that would leave us exposed to ABA problems.
      // But as there can be only one thread trying to pop -- that being the owner --
      // we have a multiple-push-single-pop concurrency model and are thus not vulnerable
      // to ABA pathologies.  The lock itself restricts concurrency and prevents
      // multiple concurrent pop operations.
      for
        auto next = h->Next
        auto v = CAS (&L->Head, h, next)
        if v == h :
          assert h->Wait == 1
          h->Wait = 0
          break
        h = v
      // Variation :
      // Note that if the CAS fails we are guaranteed to have at least 2 elements
      // on the stack.  We can splice out the element that follows the element
      // identified by "v" and pass ownership to the associated thread.
      // The thread we pick will either be the same as the head at the time
      // we fetched L->Head in Release(), or some thread that arrived afterward.
      // Depending on how liberal your interpretation, this is a plausibly LIFO ordering.
      // This approach yields a constant-time Release() operator, with no loops.
      // As constituted above, the policy is "strict" LIFO, however.

The next few variations forgo explicit nodes, and as such, we'll have global spinning. The broad inspiration for this family is the CLH lock, where a thread knows the adjacent thread on the queue, but the queue is implicit. We call the following "E5" because it was the 5th experimental version.

    Class LIFOLock :
      int Head
      int NextToRun

    enum DistinguishedValues
      U = 0         // unlocked
      H = 1         // Locked, no other threads waiting
      P = 2         // ownership being passed
      Invalid = 3
      IDBias = 4    // offset so thread ID values don't intersect above

    Acquire(L) :
      auto tkt = Self->UniqueID + IDBias
      assert tkt != U && tkt != P && tkt != H
      auto h = L->Head ;
      for :
        if h == P : Pause() ; h = L->Head; continue ;
        auto newh = (h == U) ? H : tkt
        auto v = CAS (&L->Head, h, newh)
        if v == h : break
        h = v
      if h != U :
        while L->NextToRun != tkt : Pause()
        L->NextToRun = Invalid
        assert L->Head == P
        L->Head = h

    Release (L) :
      auto h = L->Head ;
      for :
        assert h != P && h != U
        auto newh = (h == H) ? U : P ;
        auto v = CAS (&L->Head, h, newh) ;
        if v == h : break
        h = v
      if h != H : L->NextToRun = h

The first thing to notice is that the "P" encoding can result in two waiting phases in Acquire() : arriving threads may first wait while Head == P and then for their specific turn. The interlock protocol to hand-off feels rather synchronous. P state is effectively a lock that freezes out arrivals until the successor manages to depart. In addition, a group of threads could be waiting while Head == P, but subsequently "enqueue" themselves in an order that differs from their arrival order, so we don't have strict pedantic FIFO. (See also FCFE = First-Come-First-Enabled). We can streamline E5 slightly, yielding E5B :

    Acquire(L) :
      auto tkt = Self->UniqueID + IDBias
      assert tkt != U && tkt != P && tkt != H
      auto h = L->Head
      for :
        if h == P : Pause() ; h = L->Head; continue ;
        if h == U :
          auto v = CAS (&L->Head, U, H)
          if v == U : return
          h = v
          continue
        // Case : H or T = most-recently-arrived thread
        auto v = CAS (&L->Head, h, tkt)
        if v == h : break
        h = v
      while L->NextToRun != tkt : Pause()
      L->NextToRun = U
      assert L->Head == P
      L->Head = h

    Release (L) :
      auto h = L->Head ;
      if h == H :
        // uncontended path
        auto v = CAS (&L->Head, H, U) ;
        if v == H : return
      h = swap (&L->Head, P)
      assert h != U && h != H && h != P
      L->NextToRun = h

The next version, E6, eliminates the P encoding and switches to a seqlock-like lock for the hand-off transition. The lock instance requires just a single "Next" field. When the low-order bit of Next is set, arrivals are frozen during the hand-off. E6 appears the fastest on non-NUMA systems, possibly because the paths are relatively tight.

    Acquire(L) :
      auto w = L->Next ;
      for
        if w & 1 : Pause(); continue
        auto v = CAS (&L->Next, w, w+2)
        if w != v : w = v ; continue
        if w != 0 :
          while L->Next != (w+1) : Pause()
          L->Next = w
        break

    Release(L) :
      auto w = L->Next ;
      assert w != 0 && (w & 1) == 0
      if w == 2:
        auto v = CAS (&L->Next, 2, 0) ;
        if v == 2 : return ;
      FetchAdd (&L->Next, -1)    // set to odd, locked for handoff

For E7 we revert to using an "inner lock" to protect an explicit stack of waiting threads. An MCS or CLH lock works nicely for that purpose. E7 provides local spinning and, depending on the inner lock implementation, is true FIFO. We use an encoding of Head == 1 to indicate the lock is held but no threads are waiting.
    Acquire(L) :
      CLHAcquire (&L->Inner)
      Thread * h = L->Head
      if h == NULL :
        L->Head = 1 ;
        CLHRelease (&L->Inner) ;
        return
      Self->Grant = 0
      Self->Next = h
      L->Head = Self
      CLHRelease (&L->Inner)
      while Self->Grant == 0 : Pause() ;

    LockRelease (L) :
      CLHAcquire (&L->Inner)
      Thread * h = L->Head
      assert h != 0
      if h == 1 :
        L->Head = 0 ;
        CLHRelease (&L->Inner)
      else :
        L->Head = h->Next
        CLHRelease (&L->Inner)
        assert h->Grant == 0
        h->Grant = 1

For E8 we try to make greater use of atomic fetch-and-add operators. The lock contains "Ticket" and "Admit" fields.

    Acquire(L) :
      auto t = FetchAdd (&L->Ticket, 1) ;
      if t != 0 :
        for :
          while L->Admit != t : Pause()
          // Adjudicate among set of threads with same ticket "depth" value
          // Admits some non-determinism because of race
          auto v = SWAP (&L->Admit, -1) ;
          assert v == t || v == -1
          if v != -1 : break ;

    Release(L) :
      // Claim : t value in LockRelease() must be >= t value
      // in corresponding LockAcquire() invocation
      auto t = FetchAdd (&L->Ticket, -1) ;
      assert t > 0
      if --t != 0 :
        L->Admit = t ;

E9 uses a latch -- really a lock that allows asymmetric acquire and release. Specifically, if thread T1 acquires the latch then T2 may subsequently release the latch. (This is precisely the "thread-oblivious" property required by top-level cohort locks). For our purposes we can use a ticket lock. Our lock structure contains an inner ticket lock and Depth and Admit fields.

    Acquire (L) :
      TicketAcquire (L->TicketLock)
      auto d = L->Depth ++ ;
      TicketRelease (L->TicketLock)
      if d != 0 :
        while L->Admit != d : Pause()
        L->Admit = -1 ;
        TicketRelease (L->TicketLock)

    Release (L) :
      TicketAcquire (L->TicketLock)
      auto d = -- L->Depth ;
      assert d >= 0
      L->Admit = d
      if d == 0 :
        TicketRelease (L->TicketLock)

Those were the most interesting variations. We found the exercise rather fun. And I'm sure there are lots of wonderfully clever ideas that we missed.


waiting policies for locks : spin-then-park

I thought I'd collect a few notes on the topic of waiting policies in the context of locks. Lock algorithms in the literature are usually implemented via unbounded spinning. That's fine for an academic paper, but in practice it's almost never viable to use unbounded spinning. More typically, we'll use a spin-then-park policy where the spin phase is bounded, and a thread reverts to parking as necessary. In theory we can also yield occasionally while spinning, but yielding doesn't replace the need to park. (In practice, yielding is often a fool's errand, particularly with modern schedulers where the semantics of the operation have been considerably weakened over time). Parking is expected to deschedule the caller until it is unparked. This frees up the CPU for other eligible ready threads. It's also friendly and "polite" to siblings on the same core that share the pipelines, and for thermal/energy caps and turbo-mode. (MONITOR-MWAIT addresses some of those concerns, but the waiting thread still occupies a CPU). The interface to park-unpark is simple. A thread T1 calls park() and is descheduled until some thread T2 calls unpark(T1). If T2 happens to call unpark(T1) before T1 parks, then T1's park() operation will return immediately. You can conceptualize the implementation as a per-thread restricted-range semaphore. Park() is also allowed to return spuriously, so the caller is expected to use it in a loop. A simple litmus test for correct park-unpark usage is that the application should still work if park and unpark were implemented as no-ops, although we'd be reduced to degenerate spinning. I've used the park-unpark interface in the JVM for about a decade, and exported it to Java-land for use in java.util.concurrent (JUC). On Solaris it's easy to implement park-unpark with a per-thread lock-condvar-flag triple. On Linux you can opt to use futexes directly. Per-thread pipe pairs also yield a good implementation, although it consumes too many file descriptors (handles). See also benaphores. Redundant unpark() operations are benign and are expected to be cheap. A consequence of the "point-to-point" park-unpark interface -- where thread A unparks thread B by directly naming B -- is that we need to have an explicit list of threads waiting on the lock. Lock-free approaches come in handy for list management. A park() implementation will typically spin for a short period locally before it has to revert to the kernel. (Half the round-trip context switch time is the norm, and is 2-competitive. See Karlin et al.). This constitutes local spinning which is relatively benign in terms of induced coherence traffic. This gives us a spin-then-park waiting policy. Park() also has a variation with a timeout. Now let's shift to using park-unpark with classic lock algorithms. (I'll assume our context is C/C++ instead of Java, btw). With MCS, it's trivial and obvious to adapt the algorithm to use parking. I'll typically add a thread reference field to the MCS QNode structure. (I've yet to try it, but I think the same can be done with the "K42" MCS variant). Broadly, queue-based locks with local spinning are amenable to conversion to spin-then-park. A ticket lock, in comparison, isn't a good candidate. CLH requires a bit more finesse, so I'll provide that here for illustration. As an aside, MCS with spin-then-park waiting is acceptable in some circumstances, but should be used with caution. Recall that MCS is strict FIFO-FCFS. (All strict FIFO locks also provide succession by direct handoff of ownership).
If the critical section length or thread count is such that the threads at the tail of the MCS queue start to park, then you'll spend a large fraction of time in the kernel in voluntary context switching, which is quite expensive. Specifically, the critical path includes kernel overheads to wake a successor. Recently arrived threads at the tail are spinning, while those at the head -- the next ones to be granted ownership -- may be parked in the kernel, which is exactly what we don't want. The performance inflection point where this effect manifests is rather abrupt as we increase contention. Because waiting threads ultimately need to be able to park, and because parking with MCS can be toxic to performance, we tend not to find this combination in production code. (I apologize for the crude listings that follow. Oracle's blog software explicitly filters out remote javascript scripts, so I'm unable to use civilized pretty-print facilities such as github's "gist" mechanism). A classic pure-spinning CLH algorithm might appear as follows in a pidgin python/C++ dialect :

    Acquire(L) :
      auto n = QNodeAllocate()
      n->Locked = 1
      auto prv = swap (&L->Tail, n)
      while prv->Locked != 0 : Pause()
      QNodeFree (prv)
      L->Owner = n

    Release(L) :
      L->Owner->Locked = 0

The fields should be obvious from the usage, so I've skipped the structure definitions. (I'll take a quick digression to note the duality between queues and mutual exclusion locks. It's usually trivial to convert a lock-free MPSC queue -- multiple-producer-single-consumer -- into a lock. Threads in lock() arrive and enqueue an "acquire" request upon which they wait. Only the owner can dequeue at unlock()-time, thus the "SC" constraint. Conversely, you can take a queue-based lock such as CLH and deconstruct it to yield an MPSC queue such as Vyukov's MPSC queue). I've converted it to use park-unpark as follows, but I'd be interested in other approaches :

    Acquire(L) :
      auto n = CLHNodeAllocate()
      n->Locked = 1
      auto prv = swap (&L->Tail, n)
      if prv->Locked == 1 :
        // Self uniquely identifies the current thread
        if swap(&prv->Locked, Self) != 0 :
          while prv->Locked != 0 : Park()
      CLHNodeFree (prv)
      L->Owner = n

    Release(L) :
      auto w = swap (&L->Owner->Locked, 0)
      assert w != 0
      if w != 1 : Unpark(w)

This changes the encoding of the node "Locked" field from the usual 0/1 to 0, 1 or a thread reference. 0 continues to mean available, 1 means locked, and any other value indicates locked and identifies the thread waiting on that queue node. We assume "1" is distinguished and no thread will be identified by that value. CLHNodeAllocate and CLHNodeFree are implemented with thread-local caches of free queue nodes, and typically only require a few load and store operations. We can do a bit better than the above if you're willing to make the queue nodes type-stable and immortal. I'd like to avoid that, however. We'll now shift to Brandenburg's PF-T reader-writer lock. This is an extremely clever algorithm and is quite subtle. It's also very terse. If you're not familiar with the algorithm, I recommend reading section 4.1 of their paper in RTSJ11. The lock is phase-fair, which, depending on your tastes, may be a useful property. As written, it uses pure spinning. Additionally, the only atomic it needs is fetch-and-add. PF-T requires a more elaborate transformation -- which I'll step through incrementally -- to use spin-then-park. For the purposes of explication I'll assume a sequentially consistent memory model. All shared fields are assumed to be atomic<T>.
    Class Brandenburg :
      int rin
      int rout
      int win
      int wout

    // The lowest order bit of "rin" encodes the phase number.
    // The next higher bit of "rin" is the writer-present bit.
    // Note that rin+rout resemble the ingress-egress variables
    // we use in our cohort reader-writer locks.
    //
    // Bit 0x100 and above in rin and rout is the count of arrived readers and
    // departed readers, respectively.  0x100 was used in the original algorithm as
    // that enabled the use of a byte-store instead of an atomic to clear the phase and
    // writer-present bits when releasing write permission.  Mixed-size accesses in
    // the context of shared or atomic operations can be problematic, so I've eliminated
    // the byte-store optimization and just use an atomic to clear those bits.

    Reader(L) :
      // acquire R permission
      auto w = FetchAdd (&L->rin, 0x100) & 3
      if w != 0 :
        while w == (L->rin & 3) : Pause()
      // release R permission
      FetchAdd (&L->rout, 0x100)

    Writer(L) :
      // acquire W permission
      // resolve W-W conflicts
      auto Ticket = FetchAdd (&L->win, 1)
      while Ticket != L->wout: Pause()
      // Next, wait for readers to drain
      auto w = 2 | (Ticket & 1)
      auto tx = FetchAdd (&L->rin, w)
      while tx != L->rout : Pause()
      // Clear the low-order 2 bits of L->rin
      FetchAdd (&L->rin, -(L->rin & 3)) ;
      L->wout ++

The first thing to note is that win and wout really constitute a ticket lock used to adjudicate access between writers. The implementation also leverages the low-order bit of "win" as the phase number. So we can replace win+wout with a proper mutex, and add an explicit phase variable. We'll assume that our mutex is implemented properly and avoids unbounded spinning. This results in the following:

    Class Brandenburg :
      int rin
      int rout
      mutex WriterLock
      int Phase

    Reader(L) :
      // acquire R permission
      auto w = FetchAdd (&L->rin, 0x100) & 3
      if w != 0 :
        while w == (L->rin & 3) : Pause()
      // release R permission
      FetchAdd (&L->rout, 0x100)

    Writer(L) :
      // acquire W permission
      MutexLock (WriterLock)
      // Next, wait for readers to drain
      auto p = (++L->Phase) & 1
      auto tx = FetchAdd (&L->rin, p|2)
      while ((tx ^ L->rout) & ~3) != 0 : Pause()
      // Clear the low-order 2 bits of L->rin
      FetchAdd (&L->rin, -(L->rin & 3)) ;
      MutexUnlock (WriterLock)

So at this point the writers are in check (no unbounded spinning) except for at most one writer that's waiting for readers to drain. In anticipation of subsequent changes I'm going to clean up the rin+rout encodings and move the phase number back into the LSB of rin. The use of 0x100 as the increment for a single reader is an historical artifact related to the word-tearing trick, so we'll use 4 instead. (The low order bits remain the writer-present bit and phase number). We also change the code to advance the phase when releasing write permission.

    Class Brandenburg :
      int rin
      int rout
      mutex WriterLock

    Reader(L) :
      // acquire R permission
      auto w = FetchAdd (&L->rin, 4)
      if (w & 2) != 0 :
        // writer is active; wait for next read phase
        while ((v ^ L->rin) & 1) == 0 : Pause()
      // release R permission
      FetchAdd (&L->rout, 4)

    Writer(L) :
      // acquire W permission
      MutexLock (WriterLock)
      // Resolve W-R conflicts : specifically W vs extant R
      auto t = FetchAdd (&L->rin, 2) & ~1 ;
      // Wait for readers to drain
      while L->rout != t : Pause() ;
      // Clear the writer-present bit and advance phase
      auto cur = L->rin ;
      FetchAdd (&L->rin, (cur ^ 3) - cur) ;
      MutexUnlock (WriterLock)

The changes in the step above were relatively cosmetic, but the next transformation is rather large. We're going to implement an explicit stack of waiting readers.
This is the "RList" field, the low order bit of which also serves as the phase number. There are now 2 copies of the phase bit, one in the LSB of rin and one in the LSB or RList. They change almost in unison. We'll also assume that the thread structure has "Next" and "Grant" fields that we can use. "Self" is thread-local variable that refers to a thread's own thread structure. Note that our approach is not vulnerable to ABA problem. Class Brandenburg : mutex WriterLock Thread * RList int rin int routReader(L) : // acquire R permission auto w = FetchAdd (&L->rin, 4) if (w & 2) != 0 : // writer is active; wait for next read phase // Attempt to push ourselves on the RList auto ph = v & 1 // arrival phase of this thread Self->Grant = 0 Thread * head = L->RList for : // If phase changed then proceed directly to RCS // Threads with the previous phase "bounce off" RList. // RList accepts threads only of the current phase if (head & 1) != ph : goto EnterRCS ; Self->Next = head & ~1 auto v = CAS (&L->RList, head, Self|ph) ; if v == head : break ; head = v ; while Self->Grant == 0 : Park() EnterRCS: // release R permission FetchAdd (&L->rout, 4) Writer(L) : // acquire W permission MutexLock (WriterLock) // Resolve W-R conflicts : specifically W vs extant R auto t = FetchAdd (&L->rin, 2) assert (t & 2) == 0 // Wait for readers to drain while L->rout != (t & ~1) : Pause() auto cur = L->rin // Detach the list of waiting readers and advance the phase number // in the least significant bit of RList auto List = swap (&L->RList, NULL|((cur + 1) & 1) ) assert (List & 1) == (cur & 1) // Clear the writer-present bit and advance phase FetchAdd (&L->rin, (cur ^ 3) - cur) ; // Wake the list of waiting readers we have in-hand // We could drop WriterLock early but there's no point as we're changing // phase and readers run next. List = List & ~1 while List != NULL : auto w = List List = List->Next assert w->Grant == 0 w->Grant = 1 Unpark(w) MutexUnlock (WriterLock) At this point we're close to complete. There's only one remaining case of unbounded spinning. That appears when a writer waits for the readers to drain. That's relatively easy to remedy by having the waiting writer make its ID or reference visible so readers can wake it. (There's a bit of a Dekker-ish protocol involved, but nothing elaborate). We can also run into performance problems where the writer, while unparking readers, is preempted by one of those readers. This is the so-called wakee-preempts-waker problem. We can address that issue with concurrent helping -- possibly by forming the set of threads to wake into a tree. Finally, we can make the lock NUMA-friendly by using a cohort lock for WriterLock and by using per-node rin+rout fields. I believe the final form remains phase-fair. And in fact it performs quite well on NUMA systems. Note that the "C-RW-NP" Cohort NUMA-aware reader-writer has only one instance of unbounded indefinite spinning, where the next writer waits for readers to depart. Again, that particular case is trivial to address. .


Using reader-writer locks to improve hardware TLE : TLE-RW

At its most simple, traditional hardware TLE (Transactional Lock Elision) operates as follows: start a hardware transaction; "subscribe" to the lock state (committing or aborting if the lock is found to be held); execute the critical section; and finally commit at the end of the critical section. For the purposes of discussion I'm assuming an Intel TSX-RTM implementation of hardware transactional memory (HTM). If we can't make progress using the optimistic transactional mode then we revert to using classic pessimistic mutual exclusion by taking the lock, running the critical section in non-transactional execution mode, and finally releasing the lock. Myriad retry policies are possible. It's worth noting that HTM implementations built on top of the existing cache coherence protocol with "requester-wins" conflict resolution policies will usually admit or allow mutual abort with no progress by any parties. That is, the progress properties can be rather poor. A relatively simple way to improve on the implementation above is to employ a reader-writer lock. Threads trying to use optimistic transactions to execute the critical section must acquire and hold shared "R" permission. Threads executing the critical section via the classic non-transactional path must hold exclusive "W" permission. This approach confers a number of advantages. First, this algorithm obviates the need for subscription in the optimistic path. Second, it avoids the lemming effect. (Perhaps I should have used the term "pyrrhic locking" instead, as we all know full well that lemmings don't really follow each other off cliffs to their death. It's pyrrhic in the sense that you "win" the lock, but lose the ability to run transactions). Third, under the usual simplistic TLE implementations, a thread taking the classic path will cause all concurrent threads in the optimistic path to abort. But with the reader-writer lock variant, threads taking the pessimistic path wait politely for the optimistic threads to finish. The existence of threads in the R path is visible to threads trying to acquire W -- this is vaguely analogous to the idea of "visible readers" in STMs. (See also TLRW). Visible readers and polite writers may result in fewer aborts and cycles wasted on futile transactions. Both the simple traditional form and the form based on reader-writer locks prohibit simultaneous execution of transactional and non-transactional modes under the lock, but the reader-writer lock variety is a bit less medieval. In a sense we have a "guarded" execution mode for transactions. Finally, some reader-writer lock implementations intentionally control admission policy so as to promote large groups of simultaneous readers -- we call this "R-group formation". When used for TLE, a reader-writer lock that promotes R-group formation will also tend to exhibit better throughput, say, than a reader-writer lock that uses a strict FIFO admission policy. Specifically, larger R-groups let us co-schedule larger groups of concurrent transactions. As a side-note, hardware TLE can be implemented either within pthread_mutex primitives, or on top of such primitives. In the latter case, if your TLE implementation covers all critical sections protected by the lock in question then you can add a separate "isLocked" tracking variable for the purposes of subscription. In this case the slow "classic" pessimistic path would acquire the lock and store "true" into isLocked. On x86 you also need a fence or fence-equivalent instruction after the store.
In some circumstances, however, you might not have access to all the critical sections to impose the use of the isLocked convention. In that case it'd be helpful if glib/libc were to expose a pthread_mutex_islocked_np() operator which would test the lock state and subscribe. TLE could call that function in the fast optimistic path. Note that the reader-writer lock form needs neither an isLocked() call to inspect the lock, nor an explicit tracking variable. There are a few ways to embellish this approach. I've only explored this on a single-node system, but it looks useful to use a cohort NUMA reader-writer lock with 2 synthetic logical cohort nodes, one for R and one for W. This tends to further favor and promote R-group formation, and thus "batch" or co-schedule groups of transactions together to an even greater degree, yielding better parallelism. Another idea is to augment the R path with K-exclusion. (See US8402464 for more details). Because of the mutual abort with no progress issue mentioned above, we can encounter fratricide if we have too many threads simultaneously in the R path under a given lock. By decreasing the number of threads in the R-path, we can sometimes improve aggregate throughput. K-exclusion around the R path gives us a convenient way to throttle the number of threads. Specifically, we vary K based on the recent transactional success or abort rates. When K reaches 1 we have a degenerate case and just revert to mutual exclusion. An AIMD policy for K seems to work reasonably well. AIMD policies are usually associated with TCP congestion windows, but seem to map cleanly to problems involving optimistic synchronization. The TCP sending window is also optimistic, as elements in the window may require re-transmission. (A few years ago I experimented with AIMD policies for the spin duration of contended Java locks). The canonical use case for reader-writer locks is the usual and obvious single-writer vs multiple-reader roles. Our example above is single pessimistic/normal vs multiple optimistic/transactional. Another mode we see is in lock-free code with deferred memory reclamation, where R permission confers the right to access some data (guaranteeing continued existence of some object) and the W role is required to unmap or physically free the underlying memory. That is, R confers read and write access and W confers "destroy" permission. We see similar usage for trees, where W is required for structural changes to the tree, and R allows traversal of the tree and read-write permission for non-structural fields in the nodes. A related usage is to protect the file system mount tables in operating systems. R permission enables a thread to parse pathnames and access the filesystem without worry of dismount, while W is used by unmount to make sure the filesystem is quiesced. Another use-case is multiple mutator threads vs a garbage collector where the heap is protected by a single reader-writer lock. If you have a JVM or other managed runtime with a stop-the-world copying collector, then normal heap mutator threads have R access. If a mutator calls out of the JVM via JNI, or parks itself, then it relinquishes R for the duration of that operation. The garbage collector takes the W role when active.
Mutators also periodically poll for pending GC and relinquish R to allow the collector to run. Instead of using reader-writer locks, we could also employ Blelloch's room synchronization algorithm, providing one room for classic "physical" locking and another for "virtual" transactional execution. See also PhTM: Phased Transactional Memory, which appeared in Transact 2007. Note that a good quality reader-writer lock will provide automatic gang wakeup.
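
To make the structure concrete, here is a rough sketch of the guarded-execution scheme described above, using Intel RTM intrinsics and a POSIX rwlock as the reader-writer lock. It is only an illustration: the retry policy is simplistic, and a real implementation would pick the rwlock and its admission policy (R-group formation, cohort/NUMA variants) much more carefully.

    #include <immintrin.h>     // RTM intrinsics: _xbegin/_xend (compile with -mrtm)
    #include <pthread.h>

    // Guarded execution: optimistic (transactional) critical sections run while
    // holding R; the pessimistic fallback runs while holding W.  No explicit
    // lock subscription is needed in the optimistic path.
    struct TLERW {
      pthread_rwlock_t RW = PTHREAD_RWLOCK_INITIALIZER;
    };

    template <typename CriticalSection>
    void RunCS(TLERW * L, CriticalSection cs, int MaxAttempts = 3) {
      for (int attempt = 0; attempt < MaxAttempts; attempt++) {
        pthread_rwlock_rdlock(&L->RW);            // join the current R-group
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
          cs();                                   // speculative execution
          _xend();
          pthread_rwlock_unlock(&L->RW);
          return;
        }
        pthread_rwlock_unlock(&L->RW);            // aborted: drop R and retry
      }
      pthread_rwlock_wrlock(&L->RW);              // pessimistic path: exclusive W
      cs();
      pthread_rwlock_unlock(&L->RW);
    }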


AVIS Frankfurt

I recently rented a car from AVIS at Frankfurt for travel to Dagstuhl. We drove to Dagstuhl, left it parked for 5 days, and returned without incident. On return, the attendant -- very pleasant, btw -- found 'damage' to the front driver's side rim and wanted to know details about the 'accident'. There was no accident while the car was in my hands. I related such to the attendant, but didn't make any progress. AVIS tacked an additional 147€ charge on my corporate credit card. This was immediately after the Paris Hebdo incident, so we knew security lines would be long and slow; even though we arrived 3 hours before the flight, we didn't have any spare time to argue the point. Interestingly, there was no outbound inspection by AVIS when I rented the car. Below is a picture of the rim after the attendant pointed out the issue. There's certainly something on the rim at about the 6:00 and 3:30 positions, but it didn't happen while I had the car. The photographic quality is bad, as the return area was very poorly lit. I took a quick look around the car when I rented it, but didn't spot the supposedly damaged area. Hopefully this isn't systemic behavior or practice on the part of AVIS. I suppose the lesson is (a) take a walk-around video when you rent the car -- hopefully there's enough light -- or (b) perhaps use a different vendor. A subsequent letter from AVIS detailing the damage charge ended as follows: "If you would like to exclude this risk in your next rental, you can purchase an excess-waiver for a small additional charge."


Measuring long-term fairness for locks

Lets say we have N concurrent and homogenous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-work benchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're also interested in quantifying the fairness, as reflected by the distribution of those points. For simplicity, we'll assume that the scheduler is ideally fair, there are no NUMA or physical "geographic" issues in play, and that each thread has about the same cache footprint. The only remaining source of unfairness is the lock implementation or policies. (In practice there are myriad confounding factors, including energy and thermal caps, but we'll ignore those and assume an idealized model). A histogram of the completion counts is a good way to visualize the fairness or unfairness of the lock. I recommend HdrHistogram. Sometimes, though, it's convenient to describe unfairness in terms of a simple real-valued numerical statistic. Such a descriptive statistic can be used to quantity the fairness of various lock policies, and in particular to help establish the trade-off between fairness and throughput. Ideally that statistic would be scale-invariant -- that property is desirable but optional. Some of the usual statistics are standard deviation or variance. Extrema-based statistics such as (max-min) or (max-min)/max can also be useful. These can give us a sense of the size of the range of the data points. The average divided by the max can also provide some insight. IQR is another commonly used statistic, as is Jain's Fairness Index. Cederman et al. suggested another useful fairness metric. In recent papers I've reported the relative standard deviation. (In communications literature it's not uncommon to see fairness expressed in terms of average/stddev, which is the reciprocal of the relative standard deviation). Median Absolute Deviation (MAD) is also informative. I don't have one favorite -- my lock benchmark harnesses report all of the above. Recently I was looking at other measures of disparity or uniformity and came up with the following idea. First, we sort our N completion counts in ascending order. We then plot as follows. On the X-axis we have the number of threads T, and on the Y-axis we have the cumulative sum of iteration counts up to thread #T. (Think: CDF or Riemann Integral) If the completion counts were 10,15,22,33 then the Y values would be C(x) = 0,10,25,37,70, for instance, for 0,1,2,3,4 threads, respectively. Beware that the C(x) function is obviously discrete, but for the purposes of discussion and terminology I'll treat it as a continuous real-valued function. Next, we normalize both axes to [0,1]. For the Y-axis, we can simply divide C(x) by the maximum -- final -- C(x) value. If all the data points are equal -- ideally fair -- then C(x) is just the identity diagonal function : C(x)=x. Note that C(x) is convex. The more unfair the distribution, the larger the area under the diagonal and above the normalized C(x) function. And in fact that area measure makes a useful index of inequality. We could derive additional statistics from this approach, such as the maximum x-C(x) value, but the area between the diagonal and C(x) seems viable. 
Subsequently, I realized a similar approach had long been used in economics to describe income disparity : the Gini coefficient. See also : Measuring Short-Term Fairness for locks.
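As a concrete illustration, here's a minimal sketch that computes the area between the ideal diagonal and the normalized cumulative curve described above. The function name and the Riemann-style normalization are my own choices for illustration, not taken from any published harness; doubling the result gives a value roughly comparable to a Gini coefficient.

#include <algorithm>
#include <cstdio>
#include <vector>

// Compute an unfairness index from per-thread completion counts : the area
// between the ideal diagonal and the normalized cumulative-sum curve C(x).
double UnfairnessIndex (std::vector<long> counts) {
  if (counts.empty()) return 0.0 ;
  std::sort (counts.begin(), counts.end()) ;       // ascending order
  double total = 0 ;
  for (long c : counts) total += c ;
  if (total == 0) return 0.0 ;
  const double n = double(counts.size()) ;
  double running = 0, area = 0 ;
  for (size_t i = 0; i < counts.size(); i++) {
    running += counts[i] ;
    area += (double(i+1) / n) - (running / total) ;  // diagonal minus normalized C(x)
  }
  return area / n ;    // 0 means perfectly fair; larger values mean more unfair
}

int main () {
  std::vector<long> counts {10, 15, 22, 33} ;
  std::printf ("unfairness index = %f\n", UnfairnessIndex (counts)) ;
  return 0 ;
}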


General

A simple lazy subscription pathology

Following up on The Pitfalls of lazy subscription, I thought I'd provide a simple case that illustrates where transactional lock elision (TLE) with lazy subscription can fail. I've managed to reproduce the failure in-house on an i7-4770 (Haswell). Say we have a ring-buffer that's used for logging. It just records the last 16 messages posted. (This is kind of common for "black box flight recorders" that appear in kernels and the JVM). The memory layout is : int pos; intptr_t RingBuffer[16]; volatile int Lock. We'll also assume that we compile with code at low optimization levels, so accesses to "pos" result in loads and stores with no caching in registers. "Lock" is a simple test-and-test-and-set lock possibly augmented with TLE and late subscription. The critical section is :

Post(m) :
  Acquire(&Lock)
  auto index = pos ++        // load and store pos
  RingBuffer[index] = m
  if pos == 16 : pos = 0     // reload pos
  Release(&Lock)

A number of threads just loop, calling Post(0). Some take the lock, others use TLE with late subscription. It's possible for one of the threads using TLE with late subscription to observe an intermediate and transient "pos" value of 16 by virtue of some thread that holds "Lock" and has incremented "pos" from 15 to 16, but not yet reset the variable to 0. The thread using TLE that observed "pos" at 16 will then store into RingBuffer[16] -- which is out of bounds and overlays "Lock" -- overwriting "Lock" with 0, so the late subscription check succeeds and the transaction inadvertently commits. This is a case of "wrongful commit" via lock corruption. Characterizing which critical sections remain safe under TLE with late subscription remains an interesting research topic, but in the general case it's unsafe. It's worth pointing out that for weakly ordered processors, the load or loads that constitute the classic early subscription check must have proper fencing. None is needed on x86, of course.
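For concreteness, here's a hypothetical sketch of the hazardous path using Intel RTM intrinsics, assuming a typical LP64 struct layout where Lock sits immediately after RingBuffer[15]. It's a demonstration of the failure mode described above, not a recommended TLE implementation, and the fallback path is elided.

#include <immintrin.h>     // _xbegin / _xend / _xabort : compile with -mrtm
#include <cstdint>

struct Log {
  int pos ;
  intptr_t RingBuffer[16] ;
  volatile int Lock ;               // adjacent to the end of RingBuffer
} ;

// TLE with *lazy* subscription : the lock word is checked only at the end.
void PostElided (Log * log, intptr_t m) {
  if (_xbegin() == _XBEGIN_STARTED) {
    int index = log->pos ++ ;       // may transiently observe pos == 16
    log->RingBuffer[index] = m ;    // with index == 16 this store lands on Lock;
                                    // m == 0 makes the lock appear free
    if (log->pos == 16) log->pos = 0 ;
    if (log->Lock != 0) _xabort (1) ;   // late subscription check -- passes!
    _xend() ;                       // wrongful commit via lock corruption
    return ;
  }
  // Fallback path : acquire Lock with a normal test-and-set (omitted here).
}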


General

PTLQueue : a scalable bounded-capacity MPMC queue

I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPMC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java. Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait. A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type. 
// PTLQueue :

Put(v) :
  // producer : partial method - waits as necessary
  assert v != null
  assert Mask >= 1 && (Mask & (Mask+1)) == 0    // Document invariants
  // doorway step
  // Obtain a sequence number -- ticket
  // As a practical concern the ticket value is temporally unique
  // The ticket also identifies and selects a slot
  auto tkt = AtomicFetchIncrement (&PutCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // waiting phase :
  // Wait for the slot's generation to match the tkt value assigned to this put() invocation.
  // The "generation" is implicitly encoded as the upper bits in the cursor
  // above those used to specify the index : tkt div (Mask+1)
  // The generation serves as an epoch number to identify a cohort of threads
  // accessing disjoint slots
  while s->Turn != tkt : Pause
  assert s->MailBox == null
  s->MailBox = v                // deposit and pass message

Take() :
  // consumer : partial method - waits as necessary
  auto tkt = AtomicFetchIncrement (&TakeCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // 2-stage waiting :
  // First wait for our turn in our generation -- this acquires exclusive
  // "take" access to the slot's MailBox field.
  // Then wait for the slot to become occupied.
  while s->Turn != tkt : Pause
  // Concurrency in this section of code is now reduced to just 1 producer thread
  // vs 1 consumer thread.
  // For a given queue and slot, there will be at most one Take() operation running
  // in this section.
  // Consumer waits for the producer to arrive and make the slot non-empty.
  // Extract the message; clear the mailbox; advance the Turn indicator.
  // We have an obvious happens-before relation :
  // Put(m) happens-before the corresponding Take() that returns that same "m".
  for
    T v = s->MailBox
    if v != null :
      s->MailBox = null
      ST-ST barrier
      s->Turn = tkt + Mask + 1  // unlock slot to admit next producer and consumer
      return v
    Pause

PTLQueue borrows and derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot MailBox field to at most one simultaneous put() and take() operation. PTL uses a single "ticket" variable and per-slot Turn (grant) fields, while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues but does not use per-slot Turn variables. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and incorporates distinct put and take cursors and per-slot Turn fields. Instead of per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. 
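As a concrete rendering of the pseudocode above, here's a minimal C++11 sketch using std::atomic. It's a sketch under stated assumptions, not a drop-in implementation: the class and member names are mine, padding is simplified to alignas(128), and values are passed as non-null pointers with null reserved to mean "empty", as discussed above.

#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>     // _mm_pause

template <typename T, size_t SlotCount>          // SlotCount must be a power of two
class PTLQueueSketch {
  static_assert ((SlotCount & (SlotCount-1)) == 0, "power-of-two capacity") ;
  struct alignas(128) Slot {                     // padded to tame false sharing
    std::atomic<uint64_t> Turn {0} ;
    std::atomic<T *> MailBox {nullptr} ;
  } ;
  alignas(128) std::atomic<uint64_t> PutCursor  {0} ;
  alignas(128) std::atomic<uint64_t> TakeCursor {0} ;
  Slot Slots [SlotCount] ;

 public:
  PTLQueueSketch () {
    for (size_t i = 0; i < SlotCount; i++)
      Slots[i].Turn.store (i, std::memory_order_relaxed) ;
  }

  void put (T * v) {                             // partial method : waits as necessary
    assert (v != nullptr) ;                      // null is reserved to mean "empty"
    const uint64_t tkt = PutCursor.fetch_add (1) ;   // doorway : ticket + slot selection
    Slot & s = Slots[tkt & (SlotCount-1)] ;
    while (s.Turn.load (std::memory_order_acquire) != tkt) _mm_pause() ;
    s.MailBox.store (v, std::memory_order_release) ;  // deposit and pass message
  }

  T * take () {                                  // partial method : waits as necessary
    const uint64_t tkt = TakeCursor.fetch_add (1) ;
    Slot & s = Slots[tkt & (SlotCount-1)] ;
    while (s.Turn.load (std::memory_order_acquire) != tkt) _mm_pause() ;
    T * v ;
    while ((v = s.MailBox.load (std::memory_order_acquire)) == nullptr) _mm_pause() ;
    s.MailBox.store (nullptr, std::memory_order_relaxed) ;
    s.Turn.store (tkt + SlotCount, std::memory_order_release) ;  // admit next generation
    return v ;
  }
} ;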
To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor, obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and the operations to increment those cursors serve double duty : slot selection and ticket assignment for locking the slot's MailBox field.

At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional.

The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies the PutCursor but does not access the TakeCursor, and a take() operation modifies the TakeCursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs. It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. With respect to progress properties, however, PTLQueue is not lock-free. The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move in either direction. 
Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively. We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper.

I'll now introduce a variation on the above, but borrow Dmitry Vyukov's encoding for the Turn variable. After a thread deposits a value into the MailBox, it advances Turn by 1 to indicate to Take() operations that a value is present. Turn values follow this trajectory : K, K+1, K+1+Mask, etc. (More precisely, the Turn field for slot I takes on values of the form: I*G, (I*G)+1, I*(G+1), etc., where G is the generation number). This form is friendlier for a total tryTake() operator. The Put() and Take() operators are also more symmetric than in the version above. It also allows null to be passed without substituting proxy values. Like the form above, it can use and benefit from fetch-and-increment instead of CAS for the partial forms. The total tryTake() operator, however, uses CAS. tryPut() -- not shown -- is analogous to tryTake(). UPDATE: see the comments section for a concern identified by Nitsan.

// PTLQueueV2 :

Put(v) :
  // producer : partial method - waits as necessary
  assert Mask > 1 && (Mask & (Mask+1)) == 0     // Document invariants
  // doorway step
  // Obtain a sequence number -- ticket
  // As a practical concern the ticket value is temporally unique
  // The ticket also identifies and selects a slot
  auto tkt = AtomicFetchIncrement (&PutCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // waiting phase :
  // Wait for the slot's generation to match the tkt value assigned to this put() invocation.
  // The "generation" is implicitly encoded as the upper bits in the cursor
  // above those used to specify the index : tkt div (Mask+1)
  // The generation serves as an epoch number to identify a cohort of threads
  // accessing disjoint slots
  while s->Turn != tkt : Pause
  assert s->MailBox == null
  s->MailBox = v            // deposit and pass message
  s->Turn = tkt + 1         // mark occupied - release corresponding Take() operation
  // We pass & cede ownership and exclusive access of MailBox to the next Take()
  // That Take() operation will be in the same generation as this Put()

Take() :
  // consumer : partial method - waits as necessary
  auto tkt = AtomicFetchIncrement (&TakeCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // Wait our turn for this generation and for the slot to become occupied
  while s->Turn != (tkt+1) : Pause
  T v = s->MailBox          // extract message
  // In a garbage-collected environment we might want to set s->MailBox = null
  s->Turn = tkt + Mask + 1  // mark unoccupied - unlock slot to admit next Put()
  // We pass & cede ownership and exclusive access to MailBox to the next Put()
  // That Put() will be in the next generation
  return v

// tryTake() is a "strong" operator in that it can return null IFF
// the queue was empty at some point during the tryTake() invocation.
// Spurious false-null return values are not allowed.
// Note that if we provide tryTake() with the interface below, then null can not
// be a legal message value passed via Put(null). Alternatively, we could allow
// null messages but augment tryTake() to return a success-failure indication as well
// as the value. 
// Thanks to Nathan Reynolds for comments on an earlier version of this algorithm.

tryTake() :
  // consumer : "strong" total method - returns ASAP
  for
    auto tkt = TakeCursor
    slot * s = &Slots[tkt & Mask]
    auto delta = s->Turn - (tkt+1)
    if delta == 0 :
      if CAS (&TakeCursor, tkt, tkt+1) == tkt :
        // our thread has exclusive access to s->MailBox
        T v = s->MailBox
        s->Turn = tkt + Mask + 1    // cede MailBox access to next Put()
        return v
      continue
    if delta < 0 :
      return null                   // the queue was empty at this point
    // inopportune concurrent interleaving - race
    // tkt is stale with respect to TakeCursor
    // Other Take() or tryTake() operations bypassed this operation and raced past
    // This can happen if we stall after loading TakeCursor, for instance
    // just retry
    assert TakeCursor != tkt

There's quite a bit of related literature in this area. I'll call out a few relevant references:

- Wilson's NYU Courant Institute UltraComputer dissertation from 1988 is classic and the canonical starting point : Operating System Data Structures for Shared-Memory MIMD Machines with Fetch-and-Add. Regarding provenance and priority, I think PTLQueue or queues effectively equivalent to PTLQueue have been independently rediscovered a number of times. See CB-Queue and BNPBV, below, for instance. But Wilson's dissertation anticipates the basic idea and seems to predate all the others.
- Gottlieb et al. : Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
- Orozco et al. : CB-Queue in Toward high-throughput algorithms on many-core architectures, which appeared in TACO 2012.
- Meneghin et al. : BNPVB family in Performance evaluation of inter-thread communication mechanisms on multicore/multithreaded architecture
- Dmitry Vyukov : bounded MPMC queue (highly recommended)
- Alex Otenko : US8607249 (highly related)
- John Mellor-Crummey : Concurrent queues: Practical fetch-and-phi algorithms. Technical Report 229, Department of Computer Science, University of Rochester
- Thomasson : FIFO Distributed Bakery Algorithm (very similar to PTLQueue)
- Scott and Scherer : Dual Data Structures
- Morrison and Afek : Fast concurrent queues for x86 processors (LCRQ -- another queue based on fetch-and-add)
- Lamport : Concurrent Reading and Writing covers the case of the single-producer single-consumer circular buffer.
- Facebook Folly

I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory -- otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore, let's say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable).

As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Let's say that C is usually non-zero, so we typically don't wait. 
But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }" (a sketch appears below). Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use : "if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}". It's worth noting the obvious duality between locks and queues. If you have a strict FIFO lock implementation with local spinning and succession by direct handoff, such as MCS or CLH, then you can usually transform that lock into a queue. If you want a multiple-producer single-consumer MPSC queue then you can replace atomic operations on the TakeCursor with normal updates. More generally you can take a simple SPSC queue and wrap the put() and take() operations with put and take mutexes, restricting concurrency to 1-vs-1, but the performance isn't usually as good as what you'd get from a queue designed for MPSC usage.
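Here's a minimal sketch of that branch-friendly waiting idiom, assuming x86 and a flag the compiler is not allowed to cache; the function and parameter names are illustrative.

#include <immintrin.h>     // _mm_pause

// Wait until *c becomes non-zero, structured so that the entry branch and
// the loop-exit branch are distinct.
inline void AwaitNonZero (volatile int * c) {
  if (*c == 0) {            // entry branch : usually not taken
    for (;;) {
      _mm_pause() ;         // polite spin
      if (*c != 0) break ;  // separate exit branch
    }
  }
}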


General

malloc for Haswell - Hardware Transactional Memory

I've been looking into "malloc" dynamic storage allocators targeted specifically at Intel Haswell i7-4770 processors. For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possible indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4K will map to the same index (set) in the L1. The low-order 6 bits of the address presented to the L1 form the offset into the line, and the next higher 6 bits serve as the L1 index. The MMU base page size is 4KB, so there is no overlap between the virtual page number and the index field in a virtual address. The L1 index field passes through address translation verbatim. As such, OS-level page coloring is not in play with respect to the L1. (An advantage of this design is that indexing can commence before the virtual address is translated to a physical address, although we still need the physical address for tag comparison). Some CPUs hash addresses -- usually XORing high-order physical address bits into the index bits -- in order to reduce the odds of index hotspots and index imbalance, but experiments suggest that does not appear to be the case with the i7-4770.

Such simple caches -- without the index hashing mentioned above -- can be vulnerable to excessive index conflicts, but malloc allocators can be made index-aware (local copy) to mitigate and reduce the frequency of index conflicts. Index imbalance results in underutilization of the cache. Some indices will be "cold" (less frequently accessed) while others are "hot" and thus incur relatively higher miss rates. It's worth pointing out that most application/allocator combinations don't exhibit excessive index conflicts, but for those that do, the performance impact can be significant. An index-aware allocator can act to "immunize" an application against some common cases of index imbalance while typically incurring no additional cost over index-oblivious allocators. Think of the index-aware allocator as cheap insurance against a rare but painful performance disorder. The paper above describes an index-aware allocator designed for the L1 in a SPARC T2+ processor, but it's trivial to change a few cache geometry constants and retarget the allocator to the i7-4770.

The "CIA-Malloc" (Cache-Index Aware) allocator described in the paper has a number of other useful design properties. It also happens to be NUMA-friendly and large-page-friendly. Underlying pages are allocated on the node where the malloc() was invoked. Put another way, the pages underlying a block returned by malloc() will typically reside on the node where the malloc() was invoked. The allocator is also scalable with very little internal lock contention or coherence traffic. Each per-CPU sub-heap has a private lock -- the only time we'll encounter contention is via migration or preemption, which is relatively rare. The critical sections are also constant-time and very short. We also make heavy use of trylock(), so if a thread is obstructed it can usually make progress by reverting to another data structure. Remote free() operations are lock-free. Critically, the allocator acts to reduce the cost of malloc() and free() operations as well as the cost to the application when accessing blocks allocated via malloc(). 
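To make the geometry concrete, here's a tiny helper of my own (not part of any allocator) that maps an address to its L1 set under the i7-4770 parameters above; note that two addresses 4KB apart land on the same set.

#include <cstdint>
#include <cstdio>

// 64-byte lines (6 offset bits) and 64 sets (6 index bits), per the text above.
static inline unsigned L1SetIndex (uintptr_t addr) {
  return unsigned((addr >> 6) & 63) ;
}

int main () {
  uintptr_t a = 0x12345678 ;
  std::printf ("set(a)=%u set(a+4K)=%u\n", L1SetIndex (a), L1SetIndex (a + 4096)) ;
  return 0 ;
}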
The allocator is also designed specifically to reduce common cases of false sharing : allocator metadata-vs-metadata; metadata-vs-block; and inter-block block-vs-block. Metadata-vs-metadata sharing and false sharing is reduced by using per-CPU sub-heaps. False sharing arising between adjacent data blocks -- blocks returned by malloc() -- is addressed by placement and alignment. These attributes will prove even more useful when we use CIA-Malloc in conjunction with hardware transactions. The i7-4770 provides hardware transactional memory (HTM). For the purposes of discussion we'll assume we're using TSX-RTM for transactional lock elision (TLE). The critical section body contains unmodified HTM-oblivious legacy code that expects to run under the lock in the usual fashion, but via TLE we can modify the lock implementation to attempt optimistic execution, reverting to the lock only as necessary. The i7-4770's HTM implementation tracks the transactional write-set in the L1 and the read-set over the cache hierarchy. It uses a requester-wins conflict resolution strategy implemented via the MESIF coherence protocol. At most a single cache can have a given line in M/E state at any one time -- a classic multiple-reader single-writer model. Eviction or invalidation of a tracked cache line results in a transactional abort. For example if a transaction on CPU C loads address A, and some other CPU writes A before C commits, the write will invalidate the line from C's cache and cause an abort. Similarly, if C stores into A and some other CPU loads or stores into A before C commits, the invalidation of A will cause C's transaction to abort. Read-write or write-write sharing on locations accessed within a transaction results in coherence invalidation and consequent abort. (The HTM implementation in the i7-4770 shares quite a few aspects with Sun's experimental ROCK processor). In addition to coherence traffic, self-displacement via conflict misses can also result in aborts. This is where a CIA-Malloc allocator may provide benefit relative to other allocators. Normally an index-aware allocator is expected to reduce conflict misses arising from index imbalance, but it can also reduce transactional aborts caused by eviction of read-set or write-set entries from index conflicts. Aborts are usually far more expensive than simple cache misses. (Absent any potential benefit from warming up of caches, aborts are purely wasted and futile effort).

Let's take an actual example. The following data was collected on an i7-4770 running Ubuntu 14.04. We use a simple C single-threaded benchmark that uses malloc() to individually allocate a set of 250 nodes, and then arranges those nodes into a circular intrusive singly-linked list. The benchmark was compiled with gcc 4.8.2 using the x32 ABI. The node structure has a "next" field at offset 0 followed by a volatile integer "w" field. A command-line switch gives us the ability to specify the effective size of the node as passed to malloc(). Since there may be a correlation between allocation order and virtual address, we randomize the order of the nodes with a Fisher-Yates shuffle in order to minimize the impact of automatic hardware stride-based prefetchers. (Such a randomized order can put stress on the TLBs with lots of page crossings as we traverse the list, but that's not the dominant performance issue for the cases we'll discuss). 
We then report the time needed to complete 10000000 steps of the following loop body :

  a->w = 0 ; a = a->next

If we use an effective node size of 950 bytes, then the default glibc malloc() allocator places our nodes at 960-byte intervals (1024-64) and each step of the loop requires 2.1 nsecs. When we increase the node size to 1010 bytes the interval is 1024 bytes and each step takes 8.1 nsecs. If we further increase the node size to 1080 bytes then the interval is 1088 bytes (1024+64) and the time drops back to 2.1 nsecs. The performance drop at 1010 bytes was caused by the 1024-byte placement interval. The base addresses of our 250 nodes resided on just 4 of the 64 possible indices, so we grossly underutilized the L1. This nicely illustrates index conflicts arising from index-oblivious allocator placement policies. An index-aware allocator will avoid this scenario.

Now we'll extend our benchmark to use RTM hardware transactions. Each traversal of the ring is wrapped in an RTM XBEGIN-XEND transaction and we'll measure and report success rates. This is intended to model TLE or "naked" use of RTM transactions. We keep the ring circumference at 250 nodes, so each transaction will iterate over that many elements. With the 960-byte placement interval the failure rate is 0.145%. Increasing the node size to 1010 bytes, the failure rate becomes 100%. (We have no progress). But if we bump the size to 1080 bytes then the failure rate drops back to 0.2%. The complete and persistent failure at 1010 bytes was caused by elements of the write-set being evicted, with consequent aborts. But if we use CIA-Malloc modified for the i7-4770 we can avoid such abrupt performance inflections. To recap, an index-aware allocator can help avoid performance pathologies for normal non-transactional code as well as improving the success rate of hardware transactions. Specifically, conflict misses become aborts in transactions, and aborts are more expensive than normal misses.

Ideally, an HTM-friendly allocator will satisfy the desiderata enumerated in Cache-Index Aware Memory Allocation and also act to reduce the abort rate. The following properties are desirable:

- malloc() and free() should be callable from within transactions with low odds of abort.
- Memory accesses within transactions to blocks returned by malloc() are less prone to abort; specifically we want to reduce conflict misses and coherence invalidation from sharing, both of which cause aborts. The sharing and false-sharing concerns are: allocator metadata-vs-block; allocator metadata-vs-metadata (we note that internal lock contention or promiscuous shared locks internal to the allocator are a special case of metadata-vs-metadata sharing that can cause aborts); and block-vs-block, i.e., inter-block false sharing. Application code can also use explicit memalign() and padding to reduce the odds of inter-block false sharing. An allocator or application that increases alignment to avoid false sharing may also cause quantization of block sizes and increased internal fragmentation.
- malloc() and free() operations are scalable. Highly scalable allocators also tend to be transaction-friendly as they avoid or reduce the use of shared global mutable data.

As a side note, under a requester-wins conflict resolution strategy, to the extent possible and reasonable it's a good idea to shift stores of frequently accessed shared variables toward the end of a transaction. 
You can do this by hand, or a transaction-aware compiler or JIT can perform some of the transformations. The modCount field in Java's Hashtable is a canonical example of an update that should be shifted. Shifting reduces the window of vulnerability where the store resides in the transaction's write-set. But the asymmetry in the i7-4770 where the write-set is tracked in the L1 and the read-set in the L1-L2-L3 gives us yet another reason to shift stores toward the end of a transaction. Consider a transaction that executes a store followed by a large number of loads. Those loads may displace the store and cause an abort. But if we shift the store to the end of the transaction, the same set of accesses (just reordered) can succeed without abort. The store may displace a loaded line from the L1, but the L2 can still track the line. Finally, when a given size-class is index-unfriendly, we can use the punctuated array approach as described in the original CIA-Malloc paper. A more radical approach is to intentionally and explicitly pick size-classes that are prime multiples of the cache line size. This helps to reduce inter-size-class index conflicts. Another approach to index-aware allocation (other than friendly size-classes or punctuated arrays) is to simply intercept malloc(S) calls and run a Bernoulli trial to decide if we want to add 64 to S. The probability P should be low. As such, we'll occasionally, on a random basis, add 64 to S. A slightly more elaborate variation will make P a function of S, increasing the odds for larger S values. Finally, we describe how to use hardware transactional memory within an allocator implementation in Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory which appeared in SPAA 2010 (DOI). As far as I know, this is the first use of HTM in an allocator implementation. The allocator used in the SPAA paper was derived from the allocator described in ISMM 2002 Mostly Lock-Free Malloc (DOI), which used restartable critical sections instead of transactions.
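As a rough sketch of the randomized size-perturbation idea above -- the wrapper name and probabilities are illustrative, not taken from CIA-Malloc:

#include <cstdlib>
#include <random>

// Occasionally, on a random basis, add one cache line (64 bytes) to the request.
void * PerturbedMalloc (size_t s) {
  static thread_local std::minstd_rand rng { std::random_device{}() } ;
  const double p = (s >= 4096) ? 0.125 : 0.03125 ;   // higher odds for larger requests
  if (std::bernoulli_distribution (p) (rng)) s += 64 ;
  return std::malloc (s) ;
}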


General

A simple PRNG construction idiom

The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less than 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNG is one of my personal favorites. If I need better quality I'll step up to Ziff's four-tap or Mersenne twister. One variation of Marsaglia has only one word of state, a 4G-1 period, and requires just 3 shifts and 3 XOR operations to generate a new value. 0 is an absorbing state that we avoid. See MarsagliaNext(), below. Ignoring 0, the trajectory or stream of values forms a cycle -- conceptually a ring. The initialization and seeding operation should act to place different threads at different positions on that ring. In a sense the ring is shared by all threads but we start the threads at different points. Unfortunately, we can sometimes find that 2 different threads are at about the same position at about the same time through simple bad luck, and thus generate similar streams of values. (Longer PRNG cycles reduce the odds of such scenarios, of course). Ideally, to avoid such inopportune and undesirable behavior, each thread would have its own private ring of values.

A simple approach that is tantamount to giving each thread its own ring -- trajectory of values -- is shown below in NextRandom(). Hash32() is a hash function, which we'll describe later. Note that we explicitly "color" the value passed into Hash32() with the address of the thread-local PRNG state. Recall that at any one time, such an address will be associated with at most one thread, and is thus temporally unique. This approach gives us additional independence over concurrently executing threads. It also makes NextRandom() self-seeding -- explicit initialization is not required. The Hash32() hash function is the key to this particular PRNG, and its implementation directly embodies the trade-off between performance and the quality of the resulting distribution. Conceptually, you could think of Hash32() as representing a randomized cycle with a period of 4G. We might implement Hash32() via a strong cryptographic hash, such as SHA-3 or MD5. Those hash functions would work perfectly well, but tend to be somewhat expensive, so we'd prefer a cheaper hash. Critically, we want the hash function to exhibit a high degree of avalanche effect. (A good encryption function could also be used as a hash operator). Some cheaper candidates for the hash function include: MurmurHash; CityHash; the FNV hash family; SipHash; and the Jenkins hash. Doug Lea and Guy Steele invented some of the best candidates for our hash function: see Mix32() and Mix64() below. These are relatively cheap but do well on avalanche tests and strike a reasonable balance between quality and performance. Beware that they're obviously not cryptographically strong. Mix32() and Mix64() are related to the mix functions found in java.util.SplittableRandom.

Update 2014-03-13 : Thanks to Paul Sandoz for pointing out Stafford's mix function.

static int32_t MarsagliaNext () { static __thread int32_t rv = 0 ; if (rv == 0) rv = GenerateSeed() ; int32_t v = rv ; v ^= v > 21 ; v ^= v
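Here's a hedged sketch of the construction idiom described above -- a thread-local counter, colored with the address of the thread-local state, pushed through a mixer with good avalanche behavior. I've substituted the well-known MurmurHash3 32-bit finalizer for Hash32()/Mix32(); it's illustrative, not the exact mix function referenced in the post.

#include <cstdint>

// MurmurHash3 32-bit finalizer : a cheap mixer with good avalanche behavior.
static inline uint32_t Mix32Finalizer (uint32_t h) {
  h ^= h >> 16 ;  h *= 0x85ebca6bu ;
  h ^= h >> 13 ;  h *= 0xc2b2ae35u ;
  h ^= h >> 16 ;
  return h ;
}

// Self-seeding thread-local PRNG : no explicit initialization required.
static uint32_t NextRandom () {
  static thread_local uint32_t state = 0 ;
  // Color the counter with the address of the thread-local state so that
  // concurrently executing threads follow effectively independent trajectories.
  const uint32_t colored = ++state ^ (uint32_t)(uintptr_t)&state ;
  return Mix32Finalizer (colored) ;
}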


General

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore, let's say you're willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point. A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But let's assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsaglia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion let's assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trials until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. 
The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted. Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems. (Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).
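Here's a minimal sketch of the rejection-sampling nextInt(N) described above. XorShift32() stands in for the PRNG of your choice (shown with Marsaglia's standard 13/17/5 triplet), the round-up-to-power-of-two step uses the usual bit-smearing idiom from Hacker's Delight, and the retry cap of 16 before falling back to MOD is an arbitrary illustrative choice.

#include <cstdint>

static uint32_t XorShift32 (uint32_t * s) {      // seed must be non-zero
  uint32_t v = *s ;
  v ^= v << 13 ;  v ^= v >> 17 ;  v ^= v << 5 ;
  return *s = v ;
}

static uint32_t NextInt (uint32_t * s, uint32_t n) {   // returns a value in [0,n), n > 0
  uint32_t m = n - 1 ;                           // round n up to a power of two M;
  m |= m >> 1 ;  m |= m >> 2 ;  m |= m >> 4 ;    // m becomes the mask (M-1)
  m |= m >> 8 ;  m |= m >> 16 ;
  for (int tries = 0; tries < 16; tries++) {
    const uint32_t v = XorShift32 (s) & m ;      // uniform in [0,M)
    if (v < n) return v ;                        // accept; P(accept) = n/M > 1/2
  }
  return XorShift32 (s) % n ;                    // bound the worst case : fall back to MOD
}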


General

Java @Contended annotation to help reduce false sharing

See this posting by Aleksey Shipilev for details -- @Contended is something we've wanted for a long time. The JVM provides automatic layout and placement of fields. Usually it'll (a) sort fields by descending size to improve footprint, and (b) pack reference fields so the garbage collector can process a contiguous run of reference fields when tracing. @Contended gives the program a way to provide more explicit guidance with respect to concurrency and false sharing. Using this facility we can sequester hot, frequently written shared fields away from other mostly read-only or cold fields. The simple rule is that read-sharing is cheap, and write-sharing is very expensive. We can also pack fields together that tend to be written together by the same thread at about the same time. More generally, we're trying to influence relative field placement to minimize coherency misses. In a simple single-threaded environment fields that are accessed closely together in time should be placed proximally in space to promote cache locality. That is, temporal locality should condition spatial locality. Fields accessed together in time should be nearby in space. That having been said, when threads are accessing our fields concurrently we have to be careful to avoid false sharing and excessive invalidation from coherence traffic. As such, we try to cluster or otherwise sequester fields that tend to be written at approximately the same time by the same thread onto the same cache line. Note that there's a tension at play: if we try too hard to minimize single-threaded capacity misses then we can end up with excessive coherency misses running in a parallel environment. In native C/C++ code it's fairly typical for programmers to use informed concurrency-aware structure layout. @Contended should give us the same capability in Java, although in native code the binding of fields to offsets happens at compile-time, while it happens at load-time for Java. It's worth pointing out that in the general case there is no single optimal layout for both single-threaded and multithreaded environments. And the ideal layout problem itself is NP-hard. Ideally, a JVM would employ hardware monitoring facilities to detect sharing behavior and change the layout on the fly. That's a bit difficult as we don't yet have the right plumbing to provide efficient and expedient information to the JVM. Hint: we need to disintermediate the OS and hypervisor. Another challenge is that raw field offsets are used in the unsafe facility, so we'd need to address that issue, possibly with an extra level of indirection. Finally, I'd like to be able to pack final fields together as well, as those are known to be read-only.
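For reference, here is what the same intent looks like in the native-code idiom mentioned above -- roughly what @Contended asks the JVM to do for you. The structure and field names are illustrative only.

#include <atomic>

struct Counters {
  int capacity ;                            // mostly read-only configuration
  int flags ;
  alignas(64) std::atomic<long> hits ;      // hot, frequently written : gets its own 64-byte line
  char pad [64 - sizeof(std::atomic<long>)] ;  // keep later fields off that line
  long lowWaterMark ;                       // cold field, lands on the following line
} ;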


General

NUMA-aware placement of communication variables

For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonably well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node. But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed and so are likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line. Or at least that's what I thought.

Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important.

For background, an X4800 system consists of 8 Intel X7560 Xeons. Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's no central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant. 
Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip time and throughput rate between the client and server. In this benchmark the server acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies.

Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needs to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path.

Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and is not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd-party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. 
There are myriad variations that have been implemented, and unfortunately terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping.

While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned into banks or slices, with each slice associated with a core on the ring. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular. Finally, I've seen strong hints that the placement of threads relative to lock metadata accessed by those threads plays an important part in performance for contended locks.
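For illustration, here's a minimal sketch of the split-mailbox client/server ping-pong described above. The NUMA placement of the two mailboxes (for instance via first-touch on the intended node) is assumed to be handled by the harness and isn't shown, and all names are illustrative.

#include <atomic>
#include <immintrin.h>     // _mm_pause

// Ideally : request homed on the client's node, response on the server's node,
// each on its own cache line.
struct alignas(64) Mailbox { std::atomic<long> value {0} ; } ;

void Client (Mailbox * request, Mailbox * response, long iterations) {
  for (long i = 1; i <= iterations; i++) {
    request->value.store (i, std::memory_order_release) ;        // post request
    while (response->value.load (std::memory_order_acquire) != i + 1)
      _mm_pause() ;                                              // await reply
  }
}

void Server (Mailbox * request, Mailbox * response, long iterations) {
  for (long i = 1; i <= iterations; i++) {
    long r ;
    while ((r = request->value.load (std::memory_order_acquire)) != i)
      _mm_pause() ;                                              // poll for request
    response->value.store (r + 1, std::memory_order_release) ;   // transponder : reply = request + 1
  }
}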


General

C-states and P-states : confounding factors for benchmarking

I was recently looking into a performance issue in the java.util.concurrent (JUC) fork-join pool framework related to particularly long latencies when trying to wake (unpark) threads in the pool. Eventually I tracked the issue down to the power & scaling governor and idle-state policies on x86. Briefly, P-states refer to the set of clock rates (speeds) at which a processor can run. C-states reflect the possible idle states. The deeper the C-state (higher numerical values) the less power the processor will draw, but the longer it takes the processor to respond and exit that sleep state on the next idle to non-idle transition. In some cases the latency can be worse than 100 microseconds. C0 is the normal execution state, and P0 is "full speed", with higher Pn values reflecting reduced clock rates. C-states and P-states are orthogonal, although P-states only have meaning at C0. You could also think of the states as occupying a spectrum as follows : P0, P1, P2, ..., Pn, C1, C2, ..., Cn, where all the P-states are at C0. Our fork-join framework was calling unpark() to wake a thread from the pool, and that thread was being dispatched onto a processor at a deep C-state, so we were observing rather impressive latencies between the time of the unpark and the time the thread actually resumed and was able to accept work. (I originally thought we were seeing situations where the wakee was preempting the waker, but that wasn't the case. I'll save that topic for a future blog entry). It's also worth pointing out that higher P-state values draw less power and there's usually some latency in ramping up the clock (P-states) in response to offered load.

The issue of C-states and P-states isn't new and has been described at length elsewhere, but it may be new to Java programmers, adding a new confounding factor to benchmarking methodologies and procedures. To get stable results I'd recommend running at C0 and P0, particularly for server-side applications. As appropriate, disabling "turbo" mode may also be prudent. But it also makes sense to run with the system defaults to understand if your application exhibits any performance sensitivity to power management policies. The operating system power management sub-system typically controls the P-states and C-states based on current and recent load. The scaling governor manages P-states. Operating systems often use adaptive policies that try to avoid deep C-states for some period if recent deep idle episodes proved to be very short and futile. This helps make the system more responsive under bursty or otherwise irregular load. But it also means the system is stateful and exhibits a memory effect, which can further complicate benchmarking. Forcing C0 + P0 should avoid this issue.


General

Performance triage

Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your particular system-under-test. Some simple comparisons and microbenchmarks can be useful at this stage. It's also useful to write some very simple programs to validate some of the reported or expected system limits. Can that disk controller really tolerate and sustain 500 reads per second? To reduce the number of confounding factors it's better to try to answer that question with a very simple targeted program. And finally, nothing beats having familiarity with the technologies that underlie your particular layer.

On the topic of confounding factors, as our technology stacks become deeper and less transparent, we often find our own technology working against us in some unexpected way to choke performance rather than simply running into some fundamental system limit. A good example is the warm-up time needed by just-in-time compilers in Java Virtual Machines. I won't delve too far into that particular hole except to say that it's rare to find good benchmarks and methodology for Java code. Another example is power management on x86. Power management is great, but it can take a while for the CPUs to throttle up from low(er) frequencies to full throttle. And while I love "turbo" mode, it makes benchmarking applications with multiple threads a chore as you have to remember to turn it off and then back on, otherwise short single-threaded runs may look abnormally fast compared to runs with higher thread counts. In general for performance characterization I disable turbo mode and fix the power governor at "performance" state. Another source of complexity is the scheduler, which I've discussed in prior blog entries.

Let's say I have a running application and I want to better understand its behavior and performance. We'll presume it's warmed up, is under load, and is in an execution mode representative of what we think the norm would be. It should be in steady-state, if a steady-state mode even exists. On Solaris the very first thing I'll do is take a set of "pstack" samples. Pstack briefly stops the process and walks each of the stacks, reporting symbolic information (if available) for each frame. For Java, pstack has been augmented to understand Java frames, and even report inlining. A few pstack samples can provide powerful insight into what's actually going on inside the program. You'll be able to see calling patterns, which threads are blocked on what system calls or synchronization constructs, memory allocation, etc. If your code is CPU-bound then you'll get a good sense of where the cycles are being spent. (I should caution that normal C/C++ inlining can diffuse an otherwise "hot" method into other methods. This is a rare instance where pstack sampling might not immediately point to the key problem). At this point you'll need to reconcile what you're seeing with pstack and your mental model of what you think the program should be doing. They're often rather different. And generally if there's a key performance issue, you'll spot it with a moderate number of samples. 
I'll also use OS-level observability tools to look for bottlenecks where threads contend for locks, other situations where threads are blocked, and the distribution of threads over the system. On Solaris some good tools are mpstat and, to a lesser degree, vmstat. Try running "mpstat -a 5" in one window while the application program runs concurrently. One key measure is the voluntary context switch rate "vctx" or "csw", which reflects threads descheduling themselves. It's also good to look at the user, system, and idle CPU percentages. This can give a broad but useful sense of whether your threads are mostly parked or mostly running. For instance if your program makes heavy use of malloc/free, then it might be the case that you're contending on the central malloc lock in the default allocator. In that case you'd see malloc calling the lock routines in the stack traces, observe a high csw/vctx rate as threads block for the malloc lock, and your "usr" time would be less than expected. Solaris dtrace is a wonderful and invaluable performance tool as well, but in a sense you have to frame and articulate a meaningful and specific question to get a useful answer, so I tend not to use it for first-order screening of problems. It's also most effective for OS and software-level performance issues as opposed to HW-level issues. For that reason I recommend mpstat & pstack as the 1st step in performance triage. If some other OS-level issue is evident then it's good to switch to dtrace to drill more deeply into the problem. Only after I've ruled out OS-level issues do I switch to using hardware performance counters to look for architectural impediments.
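To make the "validate the claimed limit" suggestion above concrete, here is a minimal sketch (mine, not from any particular tool) of a targeted probe for the 500-reads-per-second question: it issues random single-block reads against a file or device path you supply and reports the sustained rate. The path, block size, and 10-second interval are arbitrary choices, and without O_DIRECT-style cache bypass you may be measuring the page cache rather than the device, so treat it only as a starting point.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
  /* Path is a placeholder -- point it at the device or large file you care about. */
  const char *path = (argc > 1) ? argv[1] : "/tmp/testfile";
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  off_t span = lseek(fd, 0, SEEK_END);
  if (span < 512) { fprintf(stderr, "target too small\n"); return 1; }
  long blocks = (long)(span / 512);
  char buf[512];
  long reads = 0;
  time_t end = time(NULL) + 10;                 /* 10-second measurement interval */
  srand(1);
  while (time(NULL) < end) {
    off_t where = (off_t)(rand() % blocks) * 512;   /* random single-block read */
    if (pread(fd, buf, sizeof buf, where) < 0) { perror("pread"); return 1; }
    reads++;
  }
  printf("%ld reads in 10s = %ld reads/sec\n", reads, reads / 10);
  return 0;
}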


General

TXPAUSE : polite waiting for hardware transactional memory

Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually use locks in conjunction with condition variables to implement our "wait until" constraint. MONITOR-MWAIT is another way to wait for a memory location to change, but it only allows us to track one cache line and it's only available on x86. There's no similar "wait until" construct for hardware transactions.

At the instruction-set level a simple way to express "wait until" in transactions would be to add a new TXPAUSE instruction that could be used within an active hardware transaction. TXPAUSE would politely stall the invoking thread, possibly surrendering or yielding compute resources, while at the same time continuing to track the transaction's address-set. Once a transaction has executed TXPAUSE it can only abort. Ideally that'd happen when some other thread modifies a variable that's in the transaction's read-set or write-set. And since we're aborting, all writes would be discarded. In a sense this gives us multi-location MWAIT but with much more flexibility. We could also augment TXPAUSE with a cycle-count bound to cap the time spent stalled. (For a discussion of the benefits of polite waiting, see this blog entry on the use of WRPAUSE.) A related concept for software transactional memory is the retry facility proposed by Harris et al. in Composable memory transactions.

I should note that we can already use hardware transactions to enter a tight spin loop in a transaction to wait for updates to the address-set, which will in turn cause an abort. Assuming that the implementation monitors the address-set via cache-coherence probes, by waiting in this fashion we actually communicate via the probes, and not via memory values. That is, the updating thread signals the waiter via probes instead of by traditional memory values. But TXPAUSE takes that a step further and gives us a polite way to spin within transactions.

Let's consider a classic 'polite' test-and-test-and-set loop as might be found in a simple spin lock. The contending threads loop, loading the lock word value to see if the lock has transitioned from LOCKED to UNLOCKED state. If they observe UNLOCKED they'll then try to acquire the lock with an atomic operation such as XCHG or compare-and-swap. While busy-waiting, the cache line underlying the lockword might be in MESI "S" = shared state in the contending thread's local cache. When the owner ultimately drops the lock -- typically by using a STORE instruction to overwrite LOCKED with UNLOCKED -- the line will change from "S" to "I" (Invalid) in the contending thread's cache. The contending thread continues to loop. The next load by the contending thread misses because the line is in local "I" state, and then causes a read-to-share (RTS) coherence operation to get the line back in the local cache in "S" state. Assuming the spinning thread sees UNLOCKED it'll then try the atomic read-modify-write instruction to acquire the lock. This causes another read-to-own (RFO or RTO) coherence operation to upgrade the line from "S" to "M" so the atomic can run. 
So the operations in the local cache were: a remote invalidation from the owner, an RTS via the load and then an RTO via the atomic. (This is the best case, btw). Now let's say the spinning thread uses a hardware transaction to wait. Assuming Haswell-like hardware, the contending thread would execute an XBEGIN instruction to start the transaction, and then load the lock word. If the lock word is LOCKED, then the transaction simply enters a tight loop, where the only exit from the loop is via abort. If the lock word happened to contain UNLOCKED then the transaction can try to acquire the lock with a store of LOCKED followed by a COMMIT instruction. In that case we're done and the thread has acquired ownership. The case where we spin in the transaction is more interesting. The transaction will have issued a load on the lock word, so the line is in local "S" state. The store used by the owner to subsequently drop the lock will cause the line underlying the lock to be invalidated in the spinning thread's local cache. This causes an abort. After the abort, the contending thread can try an atomic compare-and-swap to acquire the lock. This will typically transition the cache line from I to M by virtue of an RTO coherence operation. Note that we've saved the intermediate RTS operation by spinning in this manner, so, even absent TXPAUSE, it can be useful to use transactions to wait for values in memory to change. (As an aside, ignoring the use of hardware transactions, MOESI protocols are much more tolerant of spin locks than are MESI systems). See also US20100169895. Keywords: Hardware Transactional Memory; HTM; Haswell; RTM; TSX; spin lock; busy-wait.

On the topic of new instructions, it might be useful to have a selective memory fence instruction that specified prior and subsequent addresses. Currently we might write "ST A=1; MEMBAR StoreLoad; LD B". A selective fence would allow us to write "ST A=1; MEMBAR StoreLoad A, B; LD B;". The selective fence would ensure that the store to A was visible before the fetch of B, but without imposing full store-load barrier semantics for all accesses. A degenerate form might be expressed as "MEMBAR A", which would specify that all prior (in program order) stores to A would be visible before any subsequent memory accesses. On a simple in-order processor with a store buffer, "MEMBAR A" could simply wait while an address matching A appears in the store buffer. When A finally drains out to visible & coherent space, the MEMBAR would complete. It's possible such selective fences might give the CPU designers more latitude to improve performance.
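To make the transactional-waiting idiom above concrete, here's a rough sketch in C using Intel's RTM intrinsics. This is a plausible rendering of the idea, not code from any shipping lock; it assumes a Haswell-class part built with -mrtm, an int lock word where 0 means UNLOCKED and 1 means LOCKED, and GCC-style builtins for the fallback path.

#include <immintrin.h>

/* lock points to an int lock word: 0 == UNLOCKED, 1 == LOCKED. */
static void lock_acquire(volatile int *lock) {
  for (;;) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
      if (*lock == 0) {            /* UNLOCKED: take it inside the transaction */
        *lock = 1;
        _xend();                   /* commit -- we now own the lock */
        return;
      }
      for (;;) { }                 /* LOCKED: spin inside the transaction; the
                                      owner's store of UNLOCKED invalidates the
                                      line and aborts us, saving the RTS step */
    }
    /* Aborted, or RTM not available: fall back to a plain atomic attempt. */
    if (__sync_bool_compare_and_swap(lock, 0, 1))
      return;
    _mm_pause();                   /* polite x86 busy-wait hint */
  }
}

With a hypothetical TXPAUSE, the inner for(;;) loop would become a single polite stall instead of burning a pipeline while waiting for the abort.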


General

Polite busy-waiting with WRPAUSE on SPARC

Unbounded busy-waiting is a poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads -- ostensibly doing useful work and making progress -- that run on the same core. On a SPARC T4, for instance, 8 strands will share a core, and that core has its own L1 cache and 2 pipelines. On x86 we have the PAUSE instruction, which, naively, can be thought of as a hardware "yield" operator which temporarily surrenders compute resources to threads on sibling strands. Of course this helps avoid intra-core performance interference. On the SPARC T2 our preferred busy-waiting idiom was "RD %CCR,%G0", which is a high-latency no-op. The T4 provides a dedicated and extremely useful WRPAUSE instruction. The processor architecture manuals are the authoritative source, but briefly, WRPAUSE writes a cycle count into the PAUSE register, which is ASR27. Barring interrupts, the processor then delays for the requested period. There's no need for the operating system to save the PAUSE register over context switches as it always resets to 0 on traps.

Digressing briefly, if you use unbounded spinning then ultimately the kernel will preempt and deschedule your thread if there are other ready threads that are starving. But by using a spin-then-block strategy we can allow other ready threads to run without resorting to involuntary time-slicing, which operates on a long-ish time scale. Generally, that makes your application more responsive. In addition, by blocking voluntarily we give the operating system far more latitude regarding power management. Finally, I should note that while we have OS-level facilities like sched_yield() at our disposal, yielding almost never does what you'd want or naively expect.

Returning to WRPAUSE, it's natural to ask how well it works. To help answer that question I wrote a very simple C/pthreads benchmark that launches 8 concurrent threads and binds those threads to processors 0..7. The processors are numbered geographically on the T4, so those threads will all be running on just one core. Unlike the SPARC T2, where logical CPUs 0,1,2 and 3 were assigned to the first pipeline, and CPUs 4,5,6 and 7 were assigned to the 2nd, there's no fixed mapping between CPUs and pipelines in the T4. And in some circumstances when the other 7 logical processors are idling quietly, it's possible for the remaining logical processor to leverage both pipelines -- "pipeline fusion". Some number T of the threads will iterate in a tight loop advancing a simple Marsaglia xor-shift pseudo-random number generator. T is a command-line argument. The main thread loops, reporting the aggregate number of PRNG steps performed collectively by those T threads in the last 10 second measurement interval. The other threads (there are 8-T of these) run in a loop busy-waiting concurrently with the T threads. We vary T between 1 and 8 threads, and report on various busy-waiting idioms. The values in the table are the aggregate number of PRNG steps completed by the set of T threads. The unit is millions of iterations per 10 seconds. For the "PRNG step" busy-waiting mode, the busy-waiting threads execute exactly the same code as the T worker threads. 
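For reference, the heart of the worker loop described above looks roughly like the sketch below. This is an illustrative reconstruction, not the original benchmark source; thread creation, processor binding (processor_bind() on Solaris), and the different waiting modes used by the 8-T waiter threads are omitted.

#include <stdint.h>

/* Each worker advances a Marsaglia xor-shift PRNG in a tight loop and bumps
   a per-thread progress counter; the main thread samples and sums those
   counters every 10 seconds to produce the scores in the table below. */
static void *prng_worker(void *arg) {
  volatile uint64_t *progress = (volatile uint64_t *)arg;
  uint32_t x = 2463534242u;        /* any non-zero seed will do */
  for (;;) {
    x ^= x << 13;                  /* xor-shift step */
    x ^= x >> 17;
    x ^= x << 5;
    (*progress)++;
  }
}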
We can easily normalize the scores and compute the average rate of progress for individual worker threads by dividing the aggregate score by the number of worker threads T. I should note that the PRNG steps are extremely cycle-heavy and access almost no memory, so arguably this microbenchmark is not as representative of "normal" code as it could be. And for the purposes of comparison I included a row in the table that reflects a waiting policy where the waiting threads call poll(NULL,0,1000) and block in the kernel. Obviously this isn't busy-waiting, but the data is interesting for reference. As we can see in the table below, WRPAUSE provides a good way to spin politely. And for short-term waiting it's much more efficient than parking in the kernel and potentially creating software timers for timed OS-level waiting. So we have a new facility that's as polite and effective -- with respect to sibling interference -- as is parking a thread, but that avoids the trip to the kernel and the other overheads associated with context switching. It's worth pointing out that the T=1 and T=2 scores for poll() and WRPAUSE forms are about equal because at T=1 we're leveraging both pipelines. And 3348 units of work is the approximate cycle cap for a core.

Aggregate progress of the T worker threads (millions of PRNG steps per 10 seconds); the remaining 8-T threads wait using the mechanism in the left column:

Wait mechanism for 8-T threads    T=1   T=2   T=3   T=4   T=5   T=6   T=7   T=8
Park thread in poll()            3265  3347  3348  3348  3348  3348  3348  3348
no-op                             415   831  1243  1648  2060  2497  2930  3349
RD %ccr,%g0 "pause"              1426  2429  2692  2862  3013  3162  3255  3349
PRNG step                         412   829  1246  1670  2092  2510  2930  3348
WRPause(8000)                    3244  3361  3331  3348  3349  3348  3348  3348
WRPause(4000)                    3215  3308  3315  3322  3347  3348  3347  3348
WRPause(1000)                    3085  3199  3224  3251  3310  3348  3348  3348
WRPause(500)                     2917  3070  3150  3222  3270  3309  3348  3348
WRPause(250)                     2694  2864  2949  3077  3205  3388  3348  3348
WRPause(100)                     2155  2469  2622  2790  2911  3214  3330  3348


General

Thread placement policies on NUMA systems - update

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc. -- will likely outweigh (a) any potential communication benefits we might achieve by packing our threads more densely onto a subset of the NUMA nodes, and (b) benefits of NUMA affinity between memory allocated by one thread and accessed by other threads. We want our threads spread widely over the system and not packed together. Conceptually, when placing a new thread, the kernel picks the least loaded NUMA node (the node with the lowest aggregate load average), and then the least loaded core on that node, etc. Furthermore, the kernel places threads onto resources -- sockets, cores, pipelines, etc. -- without regard to the thread's process membership. That is, initial placement is process-agnostic. Keep reading, though: this description is incorrect.

On Solaris 10 on a SPARC T5440 with 4 x T2+ NUMA nodes, if the system is otherwise unloaded and we launch a process that creates 20 compute-bound concurrent threads, then typically we'll see a perfect balance with 5 threads on each node. We see similar behavior on an 8-node x86 x4800 system, where each node has 8 cores and each core is 2-way hyperthreaded. So far so good; this behavior seems in agreement with the policy I described in the 1st paragraph. I recently tried the same experiment on a 4-node T4-4 running Solaris 11. Both the T5440 and T4-4 are 4-node systems that expose 256 logical thread contexts. To my surprise, all 20 threads were placed onto just one NUMA node while the other 3 nodes remained completely idle. I checked the usual suspects such as processor sets inadvertently left around by colleagues, processors left offline, and power management policies, but the system was configured normally. I then launched multiple concurrent instances of the process, and, interestingly, all the threads from the 1st process landed on one node, all the threads from the 2nd process landed on another node, and so on. This happened even if I interleaved thread creation between the processes, so I was relatively sure the effect wasn't related to thread creation time, but rather that placement was a function of process membership.

At this point I consulted the Solaris sources and talked with folks in the Solaris group. The new Solaris 11 behavior is intentional. The kernel is no longer using a simple maximum dispersal policy, and thread placement is process membership-aware. Now, even if other nodes are completely unloaded, the kernel will still try to pack new threads onto the home lgroup (socket) of the primordial thread until the load average of that node reaches 50%, after which it will pick the next least loaded node as the process's new favorite node for placement. On the T4-4 we have 64 logical thread contexts (strands) per socket (lgroup), so if we launch 48 concurrent threads we will find 32 placed on one node and 16 on some other node. If we launch 64 threads we'll find 32 and 32. That means we can end up with our threads clustered on a small subset of the nodes in a way that's quite different from what we've seen on Solaris 10. 
So we have a policy that allows process-aware packing but reverts to spreading threads onto other nodes if a node becomes too saturated. It turns out this policy was enabled in Solaris 10, but certain bugs suppressed the mixed packing/spreading behavior. There are configuration variables in /etc/system that allow us to dial the affinity between nascent threads and their primordial thread up and down: see lgrp_expand_proc_thresh, specifically. In the OpenSolaris source code the key routine is mpo_update_tunables(). This method reads the /etc/system variables and sets up some global variables that will subsequently be used by the dispatcher, which calls lgrp_choose() in lgrp.c to place nascent threads. The lgrp_expand_proc_thresh variable controls how loaded an lgroup must be before we'll consider homing a process's threads to another lgroup. Tune this value lower to have it spread your process's threads out more.

To recap, the 'new' partial packing placement policy is as follows. Threads from the same process are packed onto a subset of the strands of a socket (50% for T-series). Once that socket reaches the 50% threshold the kernel then picks another preferred socket for that process. Threads from unrelated processes are spread across sockets. More precisely, different processes may have different preferred sockets (lgroups). Beware that I've simplified and elided details for the purposes of explication. The truth is in the code.

Remarks: It's worth noting that initial thread placement is just that. If there's a gross imbalance between the load on different nodes then the kernel will migrate threads to achieve a better and more even distribution over the set of available nodes. Once a thread runs and gains some affinity for a node, however, it becomes "stickier" under the assumption that the thread has residual cache residency on that node, and that memory allocated by that thread resides on that node given the default "first-touch" page-level NUMA allocation policy. Exactly how the various policies interact and which have precedence under what circumstances could be the topic of a future blog entry. The scheduler is work-conserving.

The x4800 mentioned above is an interesting system. Each of the 8 sockets houses an Intel 7500-series processor. Each processor has 3 coherent QPI links and the system is arranged as a glueless 8-socket twisted ladder "mobius" topology. Nodes are either 1 or 2 hops distant over the QPI links. As an aside, the mapping of logical CPUIDs to physical resources is rather interesting on Solaris/x4800. On SPARC/Solaris the CPUID layout is strictly geographic, with the highest order bits identifying the socket, the next lower bits identifying the core within that socket, followed by the pipeline (if present) and finally the logical thread context ("strand") on the core. But on Solaris on the x4800 the CPUID layout is as follows: bit [6] identifies the hyperthread on a core; bits [5:3] identify the socket, or package in Intel terminology; and bits [2:0] identify the core within a socket. Such low-level details should be of interest only if you're binding threads -- a bad idea, the kernel typically handles placement best -- or if you're writing NUMA-aware code that's aware of the ambient placement and makes decisions accordingly.

Solaris introduced the so-called critical-threads mechanism, which is expressed by putting a thread into the FX scheduling class at priority 60. The critical-threads mechanism applies to placement on cores, not on sockets, however. 
That is, it's an intra-socket policy, not an inter-socket policy. Solaris 11 introduces the Power Aware Dispatcher (PAD), which packs threads instead of spreading them out in an attempt to be able to keep sockets or cores at lower power levels. Maximum dispersal may be good for performance but is anathema to power management. PAD is off by default, but power management policies constitute yet another confounding factor with respect to scheduling and dispatching.

The new policy is a compromise between packing and maximum dispersal. If your threads communicate heavily -- one thread reads cache lines last written by some other thread -- then the new dense packing policy may improve performance by reducing traffic on the coherent interconnect. On the other hand, if the threads in your process communicate rarely, then it's possible the new packing policy might result in contention on shared computing resources. Unfortunately there's no simple litmus test that says whether packing or spreading is optimal in a given situation. The optimal answer varies by system load, application, number of threads, and platform hardware characteristics. Currently we don't have the necessary tools and sensoria to decide at runtime, so we're reduced to an empirical approach where we run trials and try to decide on a placement policy. The situation is quite frustrating. Relatedly, it's often hard to determine just the right level of concurrency to maximize throughput. (Understanding constructive vs destructive interference in the shared caches would be a good start. We could augment the lines with a small tag field indicating which strand last installed or accessed a line. Given that, we could add new CPU performance counters that tallied misses where a thread evicts a line it installed and misses where a thread displaces a line installed by some other thread.)
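As an aside, a quick way to reproduce the placement experiment described earlier is to spawn N compute-bound threads and have each periodically report where it's running; since SPARC/Solaris CPUIDs are geographic, the socket each thread landed on can be read straight off the CPUID. The sketch below is illustrative only (the thread count, spin loop, and reporting are arbitrary choices), and observing with mpstat or prstat works just as well.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/processor.h>      /* getcpuid() on Solaris */

static void *spin(void *arg) {
  long id = (long)arg;
  volatile unsigned long x = 1;
  for (;;) {
    for (long i = 0; i < 100000000; i++) x ^= x << 1;   /* stay compute-bound */
    printf("thread %ld on cpu %d\n", id, (int)getcpuid());
  }
}

int main(int argc, char **argv) {
  int n = (argc > 1) ? atoi(argv[1]) : 20;    /* 20 threads, as in the text */
  for (long i = 0; i < n; i++) {
    pthread_t t;
    pthread_create(&t, NULL, spin, (void *)i);
  }
  pause();                                    /* park main; watch with mpstat */
  return 0;
}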


General

Inverted schedctl usage in the JVM

The schedctl facility in Solaris allows a thread to request that the kernel defer involuntary preemption for a brief period. The mechanism is strictly advisory -- the kernel can opt to ignore the request. Schedctl is typically used to bracket lock critical sections. That, in turn, can avoid convoying -- threads piling up on a critical section behind a preempted lock-holder -- and other lock-related performance pathologies. If you're interested see the man pages for schedctl_start() and schedctl_stop() and the schedctl.h include file. The implementation is very efficient. schedctl_start(), which asks that preemption be deferred, simply stores into a thread-specific structure -- the schedctl block -- that the kernel maps into user-space. Similarly, schedctl_stop() clears the flag set by schedctl_start() and then checks a "preemption pending" flag in the block. Normally, this will be false, but if it's set schedctl_stop() will yield to politely grant the CPU to other threads. Note that you can't abuse this facility for long-term preemption avoidance as the deferral is brief. If your thread exceeds the grace period the kernel will preempt it and transiently degrade its effective scheduling priority. Further reading: US05937187 and various papers by Andy Tucker.

We'll now switch topics to the implementation of the "synchronized" locking construct in the HotSpot JVM. (I should note that everything I'm describing resides only in my local workspaces and isn't yet integrated into HotSpot.) If a lock is contended then on multiprocessor systems we'll spin briefly to try to avoid context switching. Context switching is wasted work and inflicts various cache and TLB penalties on the threads involved. If context switching were "free" then we'd never spin to avoid switching, but that's not the case. We use an adaptive spin-then-park strategy. One potentially undesirable outcome is that we can be preempted while spinning. When our spinning thread is finally rescheduled the lock may or may not be available. If not, we'll spin and then potentially park (block) again, thus suffering a 2nd context switch. Recall that the reason we spin is to avoid context switching. To avoid this scenario I've found it useful to enable schedctl to request deferral while spinning. But while spinning I've arranged for the code to periodically check or poll the "preemption pending" flag. If that's found set we simply abandon our spinning attempt and park immediately. This avoids the double context-switch scenario above. This particular usage of schedctl is inverted in the sense that we cover the spin loop instead of the critical section. (I've experimented with extending the schedctl preemption deferral period over the critical section -- more about that in a subsequent blog entry.)

One annoyance is that the schedctl blocks for the threads in a given process are tightly packed on special pages mapped from kernel space into user-land. As such, writes to the schedctl blocks can cause false sharing on other adjacent blocks. Hopefully the kernel folks will make changes to avoid this by padding and aligning the blocks to ensure that one cache line underlies at most one schedctl block at any one time. It's vaguely ironic that a facility designed to improve cooperation between threads suffers from false sharing. Schedctl also exposes a thread's scheduling state. So if thread T2 holds a lock L, and T1 is contending for L, T1 can check T2's state to see whether it's running (ONPROC in Solaris terminology), ready, or blocked. 
If T2 is not running then it's usually prudent for T1 to park instead of continuing to spin, as the spin attempt is much more likely to be futile.
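Here is a rough sketch of the inverted usage described above, covering a spin loop with schedctl. Only the documented schedctl_init()/schedctl_start()/schedctl_stop() interfaces are used; the HotSpot prototype additionally peeks at the preemption-pending flag inside the schedctl block, which has no stable public accessor, so this sketch simply bounds the spin instead. park_self() and the spin bound are placeholders.

#include <schedctl.h>

extern void park_self(void);       /* placeholder for the blocking slow path */

/* sc comes from a per-thread schedctl_init() call; *lockword is 0 when the
   lock is free and 1 when held. */
void spin_then_park(volatile int *lockword, schedctl_t *sc) {
  schedctl_start(sc);                        /* please defer preemption while we spin */
  for (int i = 0; i < 4000; i++) {           /* arbitrary spin bound */
    if (*lockword == 0 &&
        __sync_bool_compare_and_swap(lockword, 0, 1)) {
      schedctl_stop(sc);                     /* acquired; stop covering the spin */
      return;
    }
  }
  schedctl_stop(sc);                         /* end deferral before blocking */
  park_self();                               /* give up and park */
}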


General

atomic fetch-and-add vs compare-and-swap

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD, which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD.) CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of vulnerability to remote interference, and thus no need for a retry loop. Arguably, XADD has better progress properties, assuming the underlying XADD implementation doesn't have an implicit loop, but even in that case the window would be narrower than with Load;Φ;CAS.

If you use the typical Load;Φ;CAS idiom, assuming normal snoop-based cache coherence, the Load may cause a read-to-share bus transaction to get the underlying cache line into S or E state. The CAS, which effectively has store semantics with respect to the cache coherence protocol, may induce another bus transaction to upgrade the line to M state. So in the worst case the idiom might incur two bus transactions, but an XADD implementation will usually drive the line directly into M state, requiring just one bus transaction. Of course you might be able to speculate on the value and have a fast-path that tries a "naked" CAS without any prior load. (I use this frequently in native HotSpot code.) Also, it's possible for sophisticated processor implementations to perform coherence speculation and aggressively probe the target line into M state. Finally, in some cases you can profitably insert a prefetch-for-write (PREFETCHW) instruction prior to the load to avoid the upgrade transaction. But that approach has to be applied carefully as in some cases it can do more harm than good. Given all that, XADD, where available, has an advantage.

Let's say you were trying to increment a variable with the usual Load;INC;CAS loop. When the CAS starts failing with sufficient frequency you can find that the branch to exit the loop (normally taken under no or light contention) starts to predict toward the failure path where we stay in the loop. So when the CAS ultimately succeeds, you'll incur a branch mispredict when trying to exit the loop. This can be quite painful on processors with deep pipelines and lots of out-of-order speculative machinery. Typically, this is in a piece of code where you don't want a long stall. Relatedly, when the CAS starts to fail frequently, the branch begins to predict that control stays in the loop, and in turn the loop runs faster and we cycle around more quickly to the CAS. Typically, we'd want some back-off in the loop. Under light load with infrequent failures the mispredict to retry the loop served as a potentially useful implicit back-off. But under higher load we lose the benefit of implicit back-off arising from the mispredict. There's no loop and no such issues with XADD.

Even though CAS has a higher (infinite) consensus number, there are times when fetch-and-add -- which has a consensus number of just 2 -- is preferable. Update: see CR7023898
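For concreteness, the two increment idioms contrasted above look roughly like this in C with GCC-style builtins (a sketch; the function names are mine):

#include <stdint.h>

/* Classic optimistic Load;INC;CAS loop: it can fail and retry under
   contention, and in the worst case takes two coherence transactions
   (an RTS for the load, then an RTO for the CAS). */
static uint64_t inc_cas(volatile uint64_t *p) {
  for (;;) {
    uint64_t old = *p;
    if (__sync_bool_compare_and_swap(p, old, old + 1))
      return old;
    /* optional back-off would go here */
  }
}

/* LOCK:XADD form: no retry loop, no failure path, and the line is usually
   driven straight to M state in a single transaction. */
static uint64_t inc_xadd(volatile uint64_t *p) {
  return __sync_fetch_and_add(p, 1);
}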


General

False sharing induced by card table marking

Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garbage collection. (See "A Fast Write Barrier for Generational Garbage Collectors" by Urs Holzle, ECOOP-OOPSLA 1993, for details.) Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object the runtime write barrier code will mark the card table entry for the card page containing (covering) that field as dirty. In the HotSpot JVM the card page size is 512 bytes and the card table is implemented as a simple array of bytes. That is, a given card table entry, which represents the state of a card page, is just one byte. The write barrier is emitted directly by the JIT and is usually just a shift and store instruction. In a subsequent non-moving minor GC, the collector can avoid scanning reference fields in a card that is not dirty.

This design is well-proven and widely employed, but unfortunately it can result in performance issues in highly concurrent environments. Let's say our cache line size is 64 bytes, which is fairly common in modern processors. This means that 64 cards (32KB = 64 * 512) will share a cache line in the card table. So reference stores by different threads that just happen to fall within the same 32KB region cause writes to the same cache line underlying the card table. This can result in excessive write invalidation and cache coherence traffic, which can reduce performance and impede scaling. Interestingly, the impact can worsen after a full/moving GC: threads tend to allocate into different address ranges by virtue of thread-local allocation buffers (TLABs), but after a full collection the remaining objects tend to be more tightly packed and thus more prone to the problem. Furthermore, most card table stores are redundant, as often the card is already marked dirty. This suggests a simple solution: instead of using an unconditional store in the barrier, we first check the card table entry and only store if it is clean. This slightly increases the barrier path-length and adds a conditional branch -- unless we were to be somewhat clever with conditional moves, annulling a redundant store by changing the destination address to a thread-local dummy variable -- but it avoids the problem.

(For historical background, some years ago Doug Lea noticed an odd slow-down after a full GC in some concurrent Java code. He contacted me and I speculated that false sharing in the card table could be the issue. We conjured up a JVM with an experimental -XX:+UseCondCardMark flag that let us emit write barriers as either the usual unconditional store, or a conditional form that avoids redundant stores. The conditional form provided relief.) I ran into the problem recently when experimenting with some concurrent queue implementations, which are reference-heavy, on a 256-way Sun/Oracle T5440. This is a 4-socket system where each socket contains a Niagara T2+ UltraSPARC processor having 64 logical processors. My benchmark has 50 producer threads and 50 consumer threads and measures and reports the message throughput over a timing interval. In the default configuration we can pass about 8.8K messages per msec. Using the same JVM, when I add the -XX:+UseCondCardMark flag we can achieve 52.5K messages per msec, clearly demonstrating the magnitude of the effect. 
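In pseudo-C, the unconditional and conditional barriers discussed above look roughly like the following sketch. The 512-byte card size and the dirty value of 0 follow the HotSpot description in the text; the card_table variable and function names here are illustrative.

#include <stdint.h>

#define CARD_SHIFT 9                    /* 512-byte card pages */
extern volatile uint8_t *card_table;    /* biased base: one byte per card page */

/* Unconditional barrier: just a shift and a store, but every reference store
   in a 32KB region dirties the same 64-byte card-table cache line. */
static inline void card_mark(void *field_addr) {
  card_table[(uintptr_t)field_addr >> CARD_SHIFT] = 0;        /* 0 == dirty */
}

/* Conditional form (the -XX:+UseCondCardMark idea): skip the store when the
   card is already dirty, trading a load and a branch for far less write
   sharing on the card table. */
static inline void card_mark_conditional(void *field_addr) {
  volatile uint8_t *card = &card_table[(uintptr_t)field_addr >> CARD_SHIFT];
  if (*card != 0)                        /* still clean? */
    *card = 0;                           /* mark dirty */
}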
I should also note that we ran into this same issue when experimenting with Java-level lock elision using hardware transactional memory. If two unrelated concurrent transactions happened to store into reference fields in the same 32KB region we'd have false aborts because of write-sharing on the card table cache line. Again, -XX:+UseCondCardMark provided relief. Update: see CR7029167


General

Solaris scheduling : SPARC and CPUIDs

Since it's a commonly asked question and source of confusion I thought I'd write up the following. First, I should introduce some terminology and state the mapping between Solaris logical CPUIDs and physical resources. On a Solaris/SPARC Niagara-2 system the logical CPUIDs map geographically to physical IDs and resources. You can interpret a logical CPUID as follows: (DieNumber: remaining high-order bits; CoreNumber: 3 bits; ThreadGroup: 1 bit; Strand: 2 bits). That is, you have 8 cores per die, 2 thread groups per core, and 4 strands per thread group. All the logical processors on a given core share a level-1 cache. The "ThreadGroup" is a rather obscure name for a pipeline. On a Niagara-1, for instance, there is only 1 pipeline per core, but you have 2 per core on an N2. You can query the CPUID on which a thread is currently running with getcpuid(), which is extremely fast.

Assuming a simple model where all your threads remain runnable and there's no preemption: with unbound threads, the Solaris scheduler will balance (disperse fairly) 1st over dies, then over cores, then over pipelines, in order to maximize true parallelism and, to the extent possible, avoid contention over shared resources by placing threads as 'far away' from each other as possible. This is usually the best policy -- particularly for completely independent threads -- but beware that it ignores the issue of inter-chip coherency costs. If you have high coherence costs (writes to shared variables) then packing a group of communicating threads on-chip can sometimes be better than letting them disperse over multiple chips. (As an aside, you're typically much better off letting the scheduler assign CPUs than by trying to bind yourself. Naive binding -- say, with sequential CPUIDs -- will almost always result in suboptimal performance.) In the future these policies might shift to allow better power management by trying to keep dies or cores "parked" (idle, drawing less power). Digressing slightly, recent Linux schedulers do try to impose some type of power management by default, making it sometimes hard to squeeze maximal performance out of an MP Nehalem/AMD system and introducing confounding factors for those of us benchmarking.

If the threads are entirely CPU-bound then typically the scheduler will place them and the thread:cpu relationship then becomes fairly stable. Even if the thread blocks briefly, if it comes back ONPROC (running) in under 3 msecs it's considered to have residual affinity and will go back to the processor where it last ran, barring gross dispatch queue length imbalance. Check the OpenSolaris kernel sources for "rechoose_interval" if you're curious.

For background, each core has its own local dispatch queue and makes its own local scheduling decisions. There's no centralized global scheduler agent and no centralized scheduling data structures (with the exception of a queue for unbound real-time threads). The scheduling policies and parameters are designed such that the local & independent decisions collectively result in achieving the desired global scheduling policy. (If you're biologically inspired, think of a communal insect colony.) At a high level the scheduler attempts to maximize aggregate useful work completed in unit-time. Specifically, the scheduler tries to maximize the # of CPUs doing useful work as well as attempting, where possible, to maintain affinity (minimize migration). With CMT/CMP/HT systems it also tries to disperse work (LWPs) over cores. 
As I mentioned above, over time more constraints are being added to the scheduler's responsibilities, such as trying to minimize or reduce power consumption, striking a balance between performance and energy. The system disperses threads via stealing (pulling, where idle CPUs steal from other dispatch queues) and queue balancing (pushing, where, if there's a gross imbalance in dispatch queue depth, a local scheduler will try to pass some blocked threads to other less-loaded dispatch queues). As an aside, threads in the real-time scheduling class cause more centralized dispatch. If they're unbound they can induce lots of preemption and migration, but that's a topic for another day.
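Going back to the geographic CPUID layout described at the top of this entry, decoding a Niagara-2 logical CPUID is just a matter of pulling out bit fields. The sketch below is purely illustrative and follows the widths given above (2 strand bits, 1 pipeline bit, 3 core bits, remaining bits for the die).

#include <stdio.h>

static void decode_n2_cpuid(int id) {
  int strand   = id & 0x3;           /* 2 bits: strand within the pipeline */
  int pipeline = (id >> 2) & 0x1;    /* 1 bit : "thread group" (pipeline)  */
  int core     = (id >> 3) & 0x7;    /* 3 bits: 8 cores per die            */
  int die      = id >> 6;            /* remaining bits: die / chip         */
  printf("cpu %d -> die %d core %d pipeline %d strand %d\n",
         id, die, core, pipeline, strand);
}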


General

QPI Quiescence

It's not uncommon to find Dekker-like idioms in modern concurrent programs. On platforms with weaker memory models -- say where a store followed by a load in program order can be reordered by the architecture to appear as a load and then a store in the effective memory order (sometimes called the "visibility order") -- programs must use barrier instructions to enforce memory ordering to implement Dekker correctly. For the purposes of discussion, and assuming a relatively common system model, we'll define memory order as the order of operations as they appear at the interface between the processor and the first-level coherent cache. Examples of barriers are MFENCE on x86 and MEMBAR #storeload on SPARC. In addition, x86 and SPARC TSO memory models allow only one variety of architectural reordering, the store-then-load form noted above. (For simplicity we'll restrict the discussion to TSO-like memory consistency models.) On some platforms barriers introduce significant local latency. Perversely, we sometimes find that atomic instructions which have barrier semantics (are barrier-equivalent) are faster than the purpose-defined barrier instructions. A simplistic barrier implementation might simply quiesce the pipeline and wait for the store buffer to drain. To allay a common misconception it's worth pointing out that barriers -- sometimes called fences -- are typically implemented as processor-local operations and don't cause any distinguished action on the bus or interconnect; instead they simply instruct the processor to ensure that prior stores become visible before subsequent loads (subsequent and prior refer to the barrier in program order). That is, they don't force anything to happen -- such as coherence messages on the bus -- that was not already destined to occur. Instead, they simply enforce an order, momentarily reconciling program and memory order. Crucially, at least with current x86 and SPARC implementations, barriers don't force anything to occur off-processor. That also means they don't impede or impair scalability. There's no fundamental reason, however, why barriers should be so slow. The processor implementation is free to speculate over the barrier, for instance, as long as stores in the speculative episode are not made visible and loads in the episode are tracked for coherence. And in fact on at least one processor, barrier instructions effectively have 0 latency.

Returning to the Dekker idiom, threads T1 and T2 might coordinate as follows: T1 executes (ST A; barrier; LD B) and T2 executes (ST B; barrier; LD A); we refer to this pattern as the Dekker duality. As a concrete example, we coordinate thread state transitions in the HotSpot JVM via a similar protocol, where T1 is a Java thread (mutator) executing the reentry path from a JNI call, T2 has the role of the VM thread coordinating a stop-the-world safepoint, A is a variable that indicates T1's thread state (executing outside the JVM on a JNI call, or executing inside the managed runtime), and B indicates if a stop-the-world safepoint is pending. Critically, if T1 is running on a JNI call and attempts to return back into the managed environment while a safepoint is executing, we need to stall T1 at the point of ingress, as the VM expects that Java threads will not access the heap during a safepoint. (Among other uses, safepoints are employed for certain types of garbage collection operations, for instance where we don't want the collector and the Java threads accessing the heap simultaneously.) 
For the purposes of illustration I'm showing just a single mutator thread T1 and a single VM thread T2, but in practice the mechanism is much more general. T1's path, above, is likely to execute much more often than T2's, as JNI calls could be expected to occur more frequently than safepoints. As such, to improve performance we'd like to elide the barrier instruction from T1's path. ADS (Asymmetric Dekker Synchronization) is a family of related mechanisms that allow us to safely remove the barrier from T1 while shifting the responsibility of dealing with T1's potential reorderings to T2. We call it asymmetric because to be profitable T1's path needs to run much more frequently than T2's. We then call T1's path the fast-path and T2's the slow-path. (This mechanism can be enabled and disabled by way of the -XX:+/-UseMembar switch.) The Asymmetric Dekker Synchronization document mentioned above enumerates a number of ways in which we might allow T1 and T2 to coordinate while still removing the barrier from T1's hot path, including signals, cross-calls (inter-processor interrupts), page-protection mechanisms, etc. On Windows T2 might simply invoke the FlushProcessWriteBuffers facility, which seems to precisely match our needs. (Some time ago I filed an RFE -- request-for-enhancement -- for Solaris to provide a similar facility.) Still, we're always looking for better ways to implement our asymmetric protocol, which almost brings us to QPI quiescence, but first we need some historical background.

Long ago Intel implemented atomics with a global bus lock. It was called the #LOCK signal and driven by the LOCK: prefix on instructions, thus the names we have today. Bus locking was conceptually simple as most multiprocessor Intel systems used a common front-side bus (FSB) between the processors and memory. Unrelated atomic operations, however, could impair overall performance as #LOCK had to quiesce the bus. The old FSB was a split-transaction request-response bus, and allowed multiple requests in-flight at a given time, so asserting #LOCK too often could rob the system of performance. Bus locking also supported atomics that spanned 2 cache lines. Intel subsequently switched to so-called cache-locking, where atomic read-modify-write instructions were implemented directly in the local cache of the processor executing the atomic, avoiding the need to lock the shared bus. From the perspective of the bus such atomic operations are no different than a store. (All SPARC systems that I know of use cache-locking.) Cache-locking was a good step forward as atomics now scale ideally if there's no sharing of the underlying cache lines. Despite that, Intel preserved bus locking to handle the exotic legacy case of atomics that span cache lines (split atomics), which, by definition, are misaligned accesses. For this odd case the best solution was to simply resort to bus locking so the two lines underlying the operand could be accessed atomically. Note that Intel and AMD appear to frown upon such behavior in their reference manuals, but the processors still support it for legacy reasons, at least as of today. With the advent of QuickPath Interconnect (QPI) Intel eliminated the common FSB and switched to a topology more akin to AMD's HyperTransport. Nehalem is the first processor to use QPI. But even with QPI the architects needed a way to support those legacy split atomics. To that end, QPI has the ability to quiesce the whole system to allow the split atomic to execute. 
It appears that QPI quiescence also drains the pipes and forces at least the equivalent of barrier semantics over the whole system. That is, split atomics may serve as a way to force a system-wide "remote" barrier.

Additional remarks:

Beware, it's not clear that QPI quiescence is actually safe and provides true quiescence. Empirical tests with a program designed to stress the -UseMembar facility and inspired by a simple hunch about the QPI implementation suggest so, but absence of evidence isn't evidence of absence -- we might yet find Karl Popper's black swan. At best, QPI quiescence should be considered an academic curiosity and should never be used in production code. I'm sure processor vendors would be loath to endorse such stunts because it could ultimately limit their latitude in future bus designs. QPI quiescence is simply an implementation artifact and not an architecturally defined or guaranteed trait. Even if the facility were somehow blessed I doubt vendors would want to expose it by means of split atomics, so perhaps a new instruction might be called for. (Put another way, there are two issues: should the facility be provided, and if so, how should it be exposed to programmers?) So for the moment QPI quiescence is only good for prototyping what might be accomplished with a hypothetical instruction.

It's possible such a mechanism might be applicable to certain forms of RCU (read-copy-update).

Dmitriy V'jukov recently posted some timing results for FlushProcessWriteBuffers. It's pretty efficient on his system. I hope to be able to run similar benchmarks to measure the cost of QPI quiescence in the near future, at which point I'll post the results here. (Dmitriy is also the author of the relacy race detector, which I recommend.) It's worth noting that implementations of facilities such as FlushProcessWriteBuffers can be made very efficient. For example the implementation might be able to avoid a cross-call to a processor if it's known that no thread in the process is executing on that processor, using the knowledge that context switches are serializing events.

Some extremely preliminary data shows that QPI quiescence by way of a split atomic incurs a local penalty of about 4800 cycles on an i7-920, and the degree of impact on the progress of other processors is very much a function of the miss rate of those processors.

I mentioned the idea of QPI quiescence to Dmitriy, who in turn pointed out a relevant article on QPI in Dr. Dobb's. The "Locks" section is particularly interesting. As Dmitriy noted, if it's possible to use quiescence for hot-plugging then it's not unreasonable to think that both the bus and processors are completely quiesced with no pending stores languishing in store buffers, which is precisely the behavior in which we're interested.

Is there a working analog to QPI quiescence on AMD's coherent HyperTransport? This is left as an exercise for the curious reader.

Can QPI quiescence lead to a break-down of performance isolation in virtual machines running on the same physical system? (That's pretty easy to test even in the absence of virtual machines, with a simple multithreaded program or a few single-threaded processes.)

Related reading: For another example of the Dekker duality in the HotSpot JVM, and further discussion about the challenges of weak memory models, refer to a previous blog entry about a long-standing bug in the park-unpark subsystem. 
Biased locking, which is used in the HotSpot JVM, is a mechanism that attempts to reduce the impact of high-latency atomic instructions. Interestingly, as processor vendors make improvements in the latency of such instructions there may come a time in the near future when biased locking is no longer profitable, at least on some platforms. See also US07644409 and US7475395.
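As a footnote, the split-atomic probe alluded to above is easy to construct: a locked read-modify-write whose operand deliberately straddles two cache lines. The sketch below is a hedged illustration only (GCC-specific, alignment rules deliberately violated, and dependent on an implementation artifact that vendors may remove or trap on), so it belongs in an experiment, never in production code.

#include <stdint.h>

static char buf[128] __attribute__((aligned(64)));   /* two adjacent 64-byte lines */

static void split_atomic_probe(void) {
  /* Place a 4-byte operand so it straddles the boundary between the first
     and second cache lines of buf: bytes 62..65. */
  volatile uint32_t *split = (volatile uint32_t *)(buf + 62);
  __sync_fetch_and_add(split, 1);   /* misaligned LOCK:XADD -> bus lock / quiescence */
}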


General

A scheduling and dispatching oddity on Linux

While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using Ubuntu 9.10 with Linux kernel 2.6.31-1 running on a 1x4x2 Core i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall-off past 4 threads (it's a 1x4x2 system so we expect some fade past 4 threads, even for ideally scalable benchmarks) and more variance than I expected. The benchmark harness runs for a fixed time and reports aggregate progress over the measurement interval, and of course the system is otherwise idle. As a sanity check the benchmark also reports the total user CPU time consumed over the measurement interval. Interestingly, the CPU time values were unexpectedly low, considerably less than min(#threads,#cpus) * MeasurementInterval. All the threads should stay running, or at least ready, for the duration of the interval. Note too that the effect is independent of the interval length and still occurs with long intervals, so it's not a simple issue of allowing the scheduler time to steal and rebalance or otherwise converge on a state where the runnable threads are well-dispersed over the processors. It appeared that the scheduler wasn't aggressively dispatching ready threads onto idle CPUs. Put another way, there were prolonged periods where we had both idle CPUs and ready threads at the same time -- the kernel was failing to saturate the available processors.

To avoid the problem I initially tried binding threads to processors via sched_setaffinity, which provided complete relief. Still, I'm cautious about binding because it requires knowledge of the platform topology. On SPARC/CMT/Solaris, for instance, logical CPUIDs map to physical resources geographically in the following manner: the bits in the logical CPUID select, from most significant bits to least significant, chip then core then pipeline ("thread group" in Sun terminology) then strand. So if you just bind threads by "natural" order (thread N to CPUID N) then you'll end up with many threads sharing some cores and other cores completely idle, which is likely undesirable and may yield skewed scaling results. This, btw, is a common benchmarking pitfall. On Solaris/SPARC you're better off letting the kernel disperse threads onto processors as it'll balance 1st over chips, then cores, then pipelines, which is optimal for independent threads to make headway. (That policy is clearly not optimal if there's sharing -- particularly if there are writers -- in which case you might win by packing the threads less "distant" from each other, for some interconnect distance metric, and assuming the increased cache pressure and replication doesn't do too much harm.) Unlike SPARC/Solaris, the logical operating system-level CPUID to physical resource mappings on my Ubuntu/x64 system are well-dispersed if you use natural CPU assignment, but there's no hard guarantee of that property, although I vaguely recall that Intel advises a certain canonical mapping. In more detail, the logical CPUID to physical mapping on my system -- as discovered by iterating over the logical CPUID values and using the CPUID instruction to query physical resources -- is: 0 to C0S0, 1 to C1S0, 2 to C2S0, 3 to C3S0, 4 to C0S1, 5 to C1S1, 6 to C2S1, 7 to C3S1, where C is the core# and S is the relative strand# on the core. 
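For reference, the binding work-around mentioned above is just a few lines with sched_setaffinity; the sketch below binds the calling thread to one logical CPU. As noted, this bakes topology knowledge into the benchmark, so treat it as a diagnostic rather than a fix.

#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling thread (pid 0) to a single logical CPU.  On this box the
   natural CPUID order happens to disperse threads over cores, per the
   mapping listed above. */
static int bind_self_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return sched_setaffinity(0, sizeof(set), &set);
}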
I'm guessing that the Linux kernel I'm using institutes policies that attempt to balance power with performance, whereas Solaris currently optimizes for performance. After further poking through the Linux kernel sources I realized we could adjust the scheduling policy more to our liking through tunables exposed via the /proc file system. At that point I came upon the tune-sched-domains script, which makes it easy to quickly adjust scheduler tunables. (Note that the script assumes bash.) First, run tune-sched-domains with no arguments and examine the SD_WAKE_AFFINITY and SD_WAKE_IDLE settings. We want SD_WAKE_AFFINITY clear and SD_WAKE_IDLE set. (If I'm interpreting the comments in the kernel code correctly, WAKE_AFFINITY appears to try to place the wakee on the same CPU as the waker, presuming they communicate through memory that's already in the local cache, while WAKE_IDLE instructs the kernel to aggressively wake idle CPUs when making threads ready.) If necessary, compute a new SD_ mask value and run the script again, passing the value (in decimal) as an argument to the script. These settings provided relief for the under-utilization problem. In addition I noticed that the HotSpot JVM performed much better on multi-threaded workloads under the settings mentioned above. While I didn't have time for the experiments, it may be that adjusting the LB_BIAS flag also provides relief.


General

A race in LockSupport park() arising from weak memory models

I recently diagnosed the root cause of a concurrency bug, CR6822370, and thought it sufficiently interesting to share the details. (CR 6822370 actually represents a cluster of bugs that are now thought to be related by a common underlying issue.) Briefly, we have a lost wakeup bug in the native C++ Parker::park() platform-specific infrastructure code that implements java.util.concurrent.LockSupport.park(). The lost wakeup arises from a race that itself arises because of architectural reordering, which in turn occurs because of missing memory barrier instructions. The lost wakeup may manifest as various 'hangs' or instances of progress failure.

For background, a Parker is a native HotSpot C++ type that implements a construct that's somewhat like a restricted-range binary semaphore, except that park() is allowed to return spuriously. See LockSupport for details. If a thread is blocked in park() we're guaranteed that a subsequent unpark() will make it ready. (A perfectly legal but low-quality implementation of park() and unpark() would be empty methods, in which case the program degenerates to simple spinning. And in fact that's the litmus test for correct park()-unpark() usage.) A Parker instance is associated with at most one thread at any one time, and only that designated thread can invoke park() on that particular Parker. Any thread, however, may call unpark() on a given Parker. Furthermore, Parker instances are type-stable and immortal, at least in Sun's HotSpot implementation. On Solaris and Linux a Parker contains a pthread condvar (named _cond), a mutex (named _mutex), and a volatile integer, _counter, that represents the state of the semaphore. Despite its name, _counter only takes on the values 0 (neutral) and 1 (signaled). Each thread has a Parker instance dedicated to use by LockSupport.

Parker::park() and unpark(), redacted and annotated for clarity:

void Parker::unpark() {
  pthread_mutex_lock(_mutex);
  int s = _counter;
  _counter = 1;
  pthread_mutex_unlock(_mutex);
  if (s < 1) {
    // We signal after having dropped the lock to minimize the hold time and
    // in case the underlying implementation doesn't provide wait morphing.
    pthread_cond_signal(_cond);
  }
}

void Parker::park() {
  if (_counter > 0) {          // Optional optimization to avoid taking the lock
    _counter = 0;
    return;
  }
  if (pthread_mutex_trylock(_mutex) != 0) {   // Another optional optimization
    // Must be a concurrent unpark() - just return
    return;
  }
  if (_counter > 0) {          // no wait needed
    _counter = 0;
    pthread_mutex_unlock(_mutex);
    return;
  }
  pthread_cond_wait(_cond, _mutex);
  _counter = 0;
  pthread_mutex_unlock(_mutex);
}

Failure scenario: let's suppose we have a thread T1 whose LockSupport/Parker instance has _counter=1.

1. T1 calls LockSupport.park(), which then invokes Parker::park(). park() fetches _counter and observes 1, then sets _counter=0 and returns. There are no atomic or fence instructions in this path. Crucially, the store of 0 into _counter languishes in the processor's local store buffer and is not yet visible to other processors.
2. T1 returns from park(), checks for the condition of interest and observes that it's not yet signaled, and again calls park(). The store to _counter from step (1) is still languishing in the store buffer and is not yet globally visible. Control reaches the very top of park().
3. Thread T2 causes the condition of interest to enter the signaled state via some stores. These are typically stores to Java volatile variables, so the JIT will emit the necessary memory barriers after the stores. 
5. Thread T2 then calls unpark(T1).
6. The unpark() operator stores 1 into _counter. We'll assume that this store becomes globally visible immediately. T2 returns from unpark().
7. Thread T1's old store of 0 into _counter from step (1) finally drains from its local store buffer and becomes globally visible, overwriting the value 1 just written by T2 in step (6). With due lack of formal precision, _counter is now "globally" 0.
8. Thread T1 in park() fetches _counter, observes 0, and then blocks on the condvar. We have a lost wakeup -- T1 stalls.

Remarks and analysis

The underlying issue is a classic optimization in unpark() which tries to avoid taking a lock. The _counter instance variable is often, but not always, accessed under _mutex. If the underlying platform provided sequential consistency then the code as shown above would be correct, but x86 and SPARC provide weaker memory consistency models, allowing memory or visibility order to differ from program order. (Refer to the excellent papers by Peter Sewell et al. for background). Given those architectural reorderings, the program admits a race which can in turn result in missed wakeups.

The problem would only manifest when we were using the -UseMembar optimization, which lets us remove fences from certain hot thread state transition paths that need to coordinate safepoints between mutator threads and the JVM. This feature is enabled by default, but we can turn it off with the -XX:+UseMembar switch, which causes the JVM to emit normal fence instructions in the state transition paths. (That particular optimization is an example of asymmetric Dekker synchronization). Crucially, the park() path contains such a state transition. In reality the fence emitted under +UseMembar was simply covering up the otherwise latent Parker:: bug; +UseMembar constitutes a work-around. Sensitivity to UseMembar was initially confounding but ultimately a valuable clue.

After thinking about various timing scenarios I settled on the one given above as the most likely culprit. To support that hypothesis I wrote a simple C model of the pathology and verified that it would "fail" in a similar fashion. Having collected data with the C model on various platforms, I suspect that processors where stores can languish in the store buffer for longer periods are more exposed to the bug.

Inserting appropriate barrier instructions after both stores of 0 into _counter in park() provides a solution. Furthermore, we're not formally guaranteed that pthread_mutex_unlock() has barrier semantics, so to be conservative we need a barrier in that location as well. For our purposes a fence instruction prevents subsequent loads (subsequent in program order) from executing before prior stores become globally visible. We typically use volatile to control for compile-time reordering and fences to control for architectural reordering.

The bug will not manifest on uniprocessors or in environments where threads are otherwise constrained to a single processor. The bug is a "day-one" bug, present in all versions of HotSpot. Parker::park() and unpark() reside in os_linux.cpp, os_solaris.cpp and os_windows.cpp for Linux, Solaris and Windows, respectively. The built-in synchronized implementation uses a different park mechanism (PlatformPark::) whereas the java.util.concurrent infrastructure uses Parker::. Only Parker:: is vulnerable.
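To make the shape of the fix concrete, here is a minimal sketch -- not the actual HotSpot patch -- with a full fence conservatively placed after every store of 0 into _counter, including the one that precedes pthread_mutex_unlock(). The simplified Parker class below is a model of the code shown earlier, and std::atomic_thread_fence is used as a portable stand-in for HotSpot's own ordering primitives; treat names and details as illustrative assumptions.

  #include <pthread.h>
  #include <atomic>

  class Parker {                        // simplified model, not the real class
    std::atomic<int> _counter;          // 0 = neutral, 1 = signaled
    pthread_mutex_t  _mutex;
    pthread_cond_t   _cond;

   public:
    Parker() : _counter(0) {
      pthread_mutex_init(&_mutex, nullptr);
      pthread_cond_init(&_cond, nullptr);
    }

    void park() {
      if (_counter.load(std::memory_order_relaxed) > 0) {      // fast path
        _counter.store(0, std::memory_order_relaxed);
        // The fix: order the store of 0 before the caller's subsequent load
        // of the condition of interest, so the store cannot languish unseen.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return;
      }
      if (pthread_mutex_trylock(&_mutex) != 0) {
        return;                         // concurrent unpark() - just return
      }
      if (_counter.load(std::memory_order_relaxed) > 0) {      // no wait needed
        _counter.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);   // same reasoning
        pthread_mutex_unlock(&_mutex);
        return;
      }
      pthread_cond_wait(&_cond, &_mutex);
      _counter.store(0, std::memory_order_relaxed);
      // Don't rely on pthread_mutex_unlock() having full barrier semantics.
      std::atomic_thread_fence(std::memory_order_seq_cst);
      pthread_mutex_unlock(&_mutex);
    }

    void unpark() {
      pthread_mutex_lock(&_mutex);
      int s = _counter.exchange(1, std::memory_order_seq_cst);
      pthread_mutex_unlock(&_mutex);
      if (s < 1) {
        pthread_cond_signal(&_cond);    // signal after dropping the lock
      }
    }
  };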
Additional reading:
- Memory barriers and fences
- Instruction selection for volatile fences in Java
- Java Memory Model Cookbook (Doug Lea)
- Multiprocessors Should Support Simple Memory Models (Mark Hill)

Appendix - elaborated scenario

We have a volatile Java variable C, initially 0. Thread T1's Parker::_counter value is 1.

Thread T1 executes: while (C == 0) park();
Thread T2 executes: C = 1; unpark(T1);

Timing:
1. Thread T1 loads C, observes 0, and calls park().
2. Thread T1, in the 1st park() invocation:
   - Load _counter (1)
   - Store _counter (0) -- languishes in the store buffer
   - return from park()
3. Thread T1: Load C (0)
4. Thread T1: call park() a 2nd time
5. Thread T2: Store C=1; MFENCE (by virtue of the Java volatile). MFENCE is a completely local operation, influencing only T2.
6. Thread T2: call unpark(T1):
   - lock _mutex with an atomic instruction such as CAS or LOCK:CMPXCHG
   - Load _counter (1) -- observes the stale value from memory, which is completely legal: T1's store of 0 is still sitting in T1's store buffer. (Since the observed value is 1, unpark() also skips pthread_cond_signal(); even a signal sent here would be lost, as T1 is not yet waiting on _cond.)
   - Store _counter (1)
   - unlock _mutex
7. Thread T1's store of 0 into _counter finally drains from the store buffer and becomes globally visible, overwriting the value 1 just stored by T2.
8. Thread T1 continues in the 2nd invocation of park():
   - Load _counter (0)
   - lock _mutex
   - Load _counter (0)
   - block on _cond

Another way to think of the problem is via the Dekker duality. Observe that T1 executes { Store _counter; Load C; } while T2 executes { Store C; MFENCE; Load _counter; }. Note the missing MFENCE from T1's path. The duality is slightly obscured because the store into _counter occurs in the 1st invocation of park(), which returns immediately, and the load of C occurs in the application code. And in fact that's what distinguishes this bug: the failure idiom is distributed over two invocations of park(). Update: Scott Owens in Peter Sewell's group coined the term "triangular race" in his 2010 ECOOP paper "Reasoning about the Implementation of Concurrency Abstractions on x86-TSO" to describe this type of situation.
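The Dekker duality above reduces to a store-buffering style litmus test with the fence missing on one side. The following is an illustrative stand-in for the kind of simple C model mentioned earlier (rewritten here with C++11 atomics), not the original test case; the variable names counter, C, r1 and r2 are just labels for the model. Under sequential consistency the flagged outcome is impossible, while TSO-style reordering permits it; whether it is actually observed depends on hardware and timing, and a serious harness would reuse threads rather than spawning them per iteration.

  #include <atomic>
  #include <cstdio>
  #include <thread>

  static std::atomic<int> counter(1);   // models Parker::_counter
  static std::atomic<int> C(0);         // models the application's condition
  static int r1, r2;

  int main() {
    long anomalies = 0;
    for (long i = 0; i < 1000000; i++) {
      counter.store(1);                 // reset: _counter signaled, condition clear
      C.store(0);

      std::thread t1([] {
        // park() fast path followed by the caller's recheck -- no fence between.
        counter.store(0, std::memory_order_relaxed);
        r1 = C.load(std::memory_order_relaxed);
      });
      std::thread t2([] {
        // Producer plus unpark(): volatile-style store, fence, then inspect _counter.
        C.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = counter.load(std::memory_order_relaxed);
      });
      t1.join();
      t2.join();

      // Forbidden under sequential consistency: T1 missed T2's store to C *and*
      // T2 missed T1's store to counter.  This is exactly the reordering the
      // lost-wakeup scenario depends on.
      if (r1 == 0 && r2 == 1) anomalies++;
    }
    printf("anomalous (non-SC) outcomes: %ld\n", anomalies);
    return 0;
  }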


General

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry. Let's say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis, the shape of the curve would be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the benefit of concurrency is overcome by communication and synchronization costs). Under such circumstances it's common to introduce some type of admission control -- say, simple back-off or more elaborate mechanisms -- to restrict concurrency. Ideally, this yields an asymptotic curve where performance remains constant after reaching the peak, avoiding any fall-off.

If you tune the performance of such an application using the usual measure-analyze-modify cycle but pay attention only to the throughput values at high concurrency levels, then you might be badly misled. The usual development feedback loop can fail because poor "optimizations" that slow down the code may actually serve as inadvertent implicit back-off (contention management or admission control) that attenuates the negative scalability at higher concurrency levels but also needlessly impairs performance at lower concurrency levels. Ideally, back-off should be applied only as needed, in response to contention. A related effect is that inserting diagnostic probes might yield better performance in the region of negative scalability because of probe overhead -- a performance "Heisenbug" where performance improves when execution is more closely observed.

The take-away is that we should be careful to measure the performance of any proposed change over a wide range of concurrency values, and not just at the extremes. An interesting human analog is Brooks's law. The same issues -- the overheads of orchestrating large numbers of humans or threads -- may underlie both effects.
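As a rough illustration of the sort of explicit admission control mentioned above -- a sketch only, with a hypothetical AdmissionGate class and an arbitrary limit, not a tuned or recommended mechanism -- a simple counting gate can cap the number of threads concurrently active in a contended phase while excess threads back off:

  #include <atomic>
  #include <chrono>
  #include <thread>

  class AdmissionGate {
    std::atomic<int> _slots;            // remaining admissions
   public:
    explicit AdmissionGate(int limit) : _slots(limit) {}

    void enter() {
      for (;;) {
        int s = _slots.load(std::memory_order_relaxed);
        if (s > 0 &&
            _slots.compare_exchange_weak(s, s - 1, std::memory_order_acquire)) {
          return;                       // admitted
        }
        // Excess threads wait out of the way instead of adding to contention.
        std::this_thread::sleep_for(std::chrono::microseconds(50));
      }
    }

    void exit() { _slots.fetch_add(1, std::memory_order_release); }
  };

  // Usage: bracket the contended region with gate.enter() / gate.exit().
  // e.g.  static AdmissionGate gate(8);   // illustrative cap, not a recommendation

Note that this sketch restricts concurrency unconditionally; as the post argues, a better mechanism would engage only in response to observed contention, so that low-concurrency performance is not impaired.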


General

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI GetIntArrayElements() and ReleaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here in order that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an int[] array that was effectively immutable. References to the array were also passed to native C code via JNI calls. Those native methods would access the array via the GetIntArrayElements() and ReleaseIntArrayElements() primitives, but in no case did the native code ever modify the array contents. The problem was that the Java readers would occasionally observe scenarios suggesting that the contents of the array were transiently "flickering" to unexpected values -- ephemeral corruption -- and then returning to the expected values. (Recall that the array was effectively immutable, so this shouldn't have been happening).

I carefully checked the Java code and native code and quickly convinced myself the source of the bug was elsewhere. I next switched my attention to the implementation of GetIntArrayElements() and ReleaseIntArrayElements(). Briefly, GetIntArrayElements() will, as used by this particular application:
(a) switch the caller's thread state back to "InVM" -- if there's a stop-the-world garbage collection event going on, the state transition operator should block the caller until the GC completes, at which point the thread can safely access the heap to extract the array data; the "InVM" thread state is such that GC should be inhibited while we're copying the data out;
(b) malloc a properly sized array to hold the data;
(c) memcpy() the data from the heap into that just-allocated array;
(d) restore the thread state; and
(e) return a pointer to the just-allocated array to the caller.
ReleaseIntArrayElements() operates in a similar fashion but reverses the memcpy() direction (copying into the heap) and then releases the allocated array. Note that in our case the source and destination memcpy() regions are completely disjoint.

My first thought was that something might have been wrong with the state transition operations, allowing what should have been stop-the-world GC to inadvertently run concurrently with the memcpy() operations. Further investigation ruled that out, as there were no GC events when the apparent corruption was observed. Upon reflection, however, I recalled that some optimized memcpy() implementations may execute explicitly coded benign speculative stores into the destination region. Another possibility is the block-initializing-store (BIS) used by memcpy(). A concurrent observer might fetch from a memcpy() target cache line in the narrow timing window between when the line was zeroed and when the contents were copied back into the line. In either case, the memcpy() in ReleaseIntArrayElements() could transiently mutate the contents of the Java array in the heap to some unexpected value. By the time the memcpy() finishes, of course, the destination buffer will again contain the expected contents. I quickly confirmed this scenario might occur by writing a quick throw-away multi-threaded C test case that modeled the JNI behavior and demonstrated the effect.

The underlying mechanism at work deserves a bit more explanation. Let's say we're writing C code and have an effectively immutable array A.
A contains a known set of values. Concurrently, we have a 1st thread reading from A and a 2nd thread memcpy()ing precisely the same set of values known to be in A back into A. At first glance it's not unreasonable to assume that the reader (the 1st thread) will always see the expected values in A. That is, we're assuming that memcpy()ing the same values back into an array is idempotent. The memcpy() operation is certainly idempotent from the point of view of the caller of memcpy(), but interestingly it's not idempotent from the point of view of concurrent observers. With optimized memcpy() implementations, concurrent observers can see "odd" intermediate states. Naively, you'd expect that concurrent observers would see the array as unchanging, but that isn't the case on all platforms. In the case at hand, the memcpy() in ReleaseIntArrayElements() is the source of the problem, as the array in the heap might "flicker".

To further confirm this diagnosis I provided the customer with a simplistic byte-by-byte unoptimized memcpy() implementation in an LD_PRELOAD-able memcpy.so shared object. And indeed, the problem wasn't evident when running with this simple memcpy() instead of the optimized system form.

While technically legal, such behavior in memcpy() -- and in the JNI operators that call memcpy() -- is, at best, surprising and violates the principle of least astonishment. Strictly speaking, memcpy() provides no particular guarantees about concurrently observable intermediate states; it only specifies the state of the destination buffer at the end of the invocation. That is, while a memcpy() is in flight the destination buffer might contain transient or ephemeral values which could be observed by other threads. Overwriting an array with an exact copy of itself is idempotent from the perspective of the caller after memcpy() completes, but concurrent observers of the array are permitted to see transient ephemeral "garbage" values. My first thought when we ran into this behavior some years ago was that it was a bug in the memcpy() implementation, but upon deeper reflection I concluded the bug was in my errant assumptions about how memcpy() worked. Having said that, while such behavior is technically permissible, I'm also sympathetic: such effects are, at best, unexpected, and it's not unreasonable to assume the destination buffer would remain unmolested. One could imagine customers easily falling into this trap, in both C code and Java.
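The following sketch models the effect described above. The bis_copy() routine is a hypothetical stand-in that makes the intermediate state explicit (zero the destination, then fill it), mimicking what a block-initializing-store or speculative-store memcpy() may do internally; it is not the libc memcpy(), and the race on the plain int array is deliberate, since the point is precisely what a concurrent observer may see.

  #include <atomic>
  #include <cstdio>
  #include <cstring>
  #include <thread>

  static int A[64];                         // the "effectively immutable" array
  static std::atomic<bool> done(false);

  // Stand-in for an optimized memcpy(): the destination transiently holds zeros.
  static void bis_copy(int* dst, const int* src, size_t n) {
    memset(dst, 0, n * sizeof(int));        // transient state visible to readers
    for (size_t i = 0; i < n; i++) dst[i] = src[i];
  }

  int main() {
    int snapshot[64];
    for (int i = 0; i < 64; i++) A[i] = snapshot[i] = 42;

    std::thread reader([] {                 // models the Java readers
      long flickers = 0;
      while (!done.load(std::memory_order_acquire)) {
        for (int i = 0; i < 64; i++) {
          if (A[i] != 42) flickers++;       // "corruption" that later heals itself
        }
      }
      printf("observed %ld transient mismatches\n", flickers);
    });

    // Models ReleaseIntArrayElements(): copy identical contents back into A.
    for (long iter = 0; iter < 1000000; iter++) bis_copy(A, snapshot, 64);

    done.store(true, std::memory_order_release);
    reader.join();
    return 0;
  }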


General

CAS and Cache Trivia - Invalidate or Update In-Place

In a previous blog entry on biased locking I noted that CAS (the atomic Compare-And-Swap instruction) usually has the same semantics as a store with respect to caches and the interconnect. It's worth calling out an interesting exception, however. For background, the level-1 data cache -- often shortened to L1D$ or just D$ -- of Niagara-1 and Niagara-2 processors uses a write-around policy. All stores go over the cross-bar, but if the line is also present in the D$, the store instruction (ST) will also update the line in place. This is where CAS differs from ST. If the line targeted by the CAS is also in the D$ of the processor executing the CAS, the line will be invalidated instead of being updated in place.

Of course it's not uncommon for the line to be present in the cache, given the prevalence of the LD...CAS usage idiom. More importantly, it's extremely common for a thread to access the same line in short order after a CAS, and this is where the CAS-invalidates policy can impact performance. A good example is a lock that's acquired via CAS where the lock metadata is collocated with the data covered by the lock. Obviously, after having acquired a lock, a thread is quite likely to access data protected by that same lock. The first data access to the just-CASed-to line will miss in the D$. Another example would be a thread that repeatedly locks and unlocks the same lock. Assuming the lock and unlock operators both use a LD...CAS idiom, even if the line containing the lock metadata wasn't accessed within the critical section, the CAS in the lock operation will cause the LD in the subsequent unlock to miss, and the CAS in unlock will cause the LD in the subsequent lock operation to miss.

Thankfully, on Niagara-family processors a miss in the D$ that hits in the level-2 cache can be satisfied fairly quickly. Still, we'd like to avoid that D$ miss if at all possible. I'd argue that, to the extent feasible, processor designers should avoid a CAS-invalidates policy, which I believe to be an implementation artifact and not strictly fundamental to CAS. CAS-invalidates is cache-antagonistic to common synchronization usage.
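To make the access pattern concrete, here is a minimal sketch of the LD...CAS lock and unlock idiom discussed above, written with C++11 atomics; the comments mark where a CAS-invalidates policy would force D$ misses. The Lock type and its layout are illustrative assumptions, not Niagara-specific or production code.

  #include <atomic>

  struct Lock {
    std::atomic<int> word{0};           // 0 = free, 1 = held; assume this word
                                        // shares a cache line with the data it protects

    void lock() {
      for (;;) {
        int v = word.load(std::memory_order_relaxed);             // LD ...
        if (v == 0 &&
            word.compare_exchange_weak(v, 1, std::memory_order_acquire)) {  // ... CAS
          // Under CAS-invalidates, the CAS just evicted this line from our D$,
          // so the first access to the collocated protected data will miss.
          return;
        }
      }
    }

    void unlock() {
      int v = word.load(std::memory_order_relaxed);               // LD ...
      // This LD misses if lock()'s CAS invalidated the line and the critical
      // section never touched it; the CAS below then invalidates the line
      // again, so the LD in the next lock() misses as well.
      word.compare_exchange_strong(v, 0, std::memory_order_release);        // ... CAS
    }
  };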

