Monday Jul 28, 2014

Hardware extensions to make lazy subscription safe

Hardware extensions to make lazy subscription safe is a follow-on to our WTTM 2014 paper, Pitfalls of Lazy Subscription . We describe a number of hardware approaches to avoid the wrongful commit pathology admitted by generalized transactional lock elision using lazy subscription. Eager (early) subscription remains safe, of course.

Wednesday Jun 11, 2014

PTLQueue : a scalable bounded-capacity MPMC queue

I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPMC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java.

Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait.

A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type.


// PTLQueue :

Put(v) : // producer : partial method - waits as necessary
assert v != null
assert Mask >= 1 && (Mask & (Mask+1)) == 0 // Document invariants
// doorway step
// Obtain a sequence number -- ticket
// As a practical concern the ticket value is temporally unique
// The ticket also identifies and selects a slot
auto tkt = AtomicFetchIncrement (&PutCursor, 1)
slot * s = &Slots[tkt & Mask]
// waiting phase :
// wait for slot's generation to match the tkt value assigned to this put() invocation.
// The "generation" is implicitly encoded as the upper bits in the cursor
// above those used to specify the index : tkt div (Mask+1)
// The generation serves as an epoch number to identify a cohort of threads
// accessing disjoint slots
while s->Turn != tkt : Pause
assert s->MailBox == null
s->MailBox = v // deposit and pass message

Take() : // consumer : partial method - waits as necessary
auto tkt = AtomicFetchIncrement (&TakeCursor,1)
slot * s = &Slots[tkt & Mask]
// 2-stage waiting :
// First wait for turn for our generation
// Acquire exclusive "take" access to slot's MailBox field
// Then wait for the slot to become occupied
while s->Turn != tkt : Pause
// Concurrency in this section of code is now reduced to just 1 producer thread
// vs 1 consumer thread.
// For a given queue and slot, there will be most one Take() operation running
// in this section.
// Consumer waits for producer to arrive and make slot non-empty
// Extract message; clear mailbox; advance Turn indicator
// We have an obvious happens-before relation : HBO
// Put(m) happens-before corresponding Take() that returns that same "m" - antecedent
T v = s->MailBox
if v != null :
s->MailBox = null
ST-ST barrier
s->Turn = tkt + Mask + 1 // unlock slot to admit next producer and consumer
return v

PTLQueue borrows and derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could conceptualize of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation.

PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues but does not use per-slot Turn variables. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and incorporates distinct put and take cursors and per-slot Turn fields. Instead of a per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field. To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field.

At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional.

The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs.

It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. With respect to progress properties, however, PTLQueue is not lock-free.

The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively.

We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper.

I'll now introduce a variation on the above, but borrow Dmitry Vyukov's encoding for the Turn variable. After a thread deposits a value into the MailBox, it advances Turn by 1 to indicate to Take() operations that a value is present. Turn values follow this trajectory : K, K+1, K+1+Mask, etc. (More precisely, the Turn field for slot I takes on values of the form: I*G, (I*G)+1, I*(G+1), etc where G is the generation number). This form is friendlier for a total tryTake() operator. Put() and Take() operators are also more symmetric than in the version above. It also allows null to be passed without substituting proxy values. Like the form above, it can use and benefit from fetch-and-increment instead of CAS for the partial forms. the partial tryTake() operator, however, uses CAS. tryPut() -- not shown -- is analogous to tryTake(). UPDATE: see the comments section for a concern identified by Nitsan.


// PTLQueueV2 :

Put(v) : // producer : partial method - waits as necessary
assert Mask > 1 && (Mask & (Mask+1)) == 0 // Document invariants
// doorway step
// Obtain a sequence number -- ticket
// As a practical concern the ticket value is temporally unique
// The ticket also identifies and selects a slot
auto tkt = AtomicFetchIncrement (&PutCursor, 1)
slot * s = &Slots[tkt & Mask]

// waiting phase :
// wait for slot's generation to match the tkt value assigned to this put() invocation.
// The "generation" is implicitly encoded as the upper bits in the cursor
// above those used to specify the index : tkt div (Mask+1)
// The generation serves as an epoch number to identify a cohort of threads
// accessing disjoint slots
while s->Turn != tkt : Pause

assert s->MailBox == null
s->MailBox = v // deposit and pass message
s->Turn = tkt + 1 // mark occupied - release corresponding Take() operation
// We pass & cede ownership and exclusive access of MailBox to the next Take()
// That Take() operation will be in the same generation as this Put()

Take() : // consumer : partial method - waits as necessary
auto tkt = AtomicFetchIncrement (&TakeCursor,1)
slot * s = &Slots[tkt & Mask]

// Wait turn for our generation and for the slot become occupied
while s->Turn != (tkt+1) : Pause

T v = s->MailBox // extract message
// In a garbage-collected environment we might want to set s->MailBox = null
s->Turn = tkt + Mask + 1 // mark unoccupied - unlock slot to admit next Put()
// We pass & cede ownership and exclusive access to MailBox to the next Put()
// That Put() will be in the next generation
return v

// tryTake() is a "strong" operator in that it can return null IFF
// the queue was empty at some point during the tryTake() invocation.
// Spurious false-null return values are not allowed.
// Note that if we provide tryTake() with the interface below, then null can not
// be a legal message value passed via Put(null). Alternatively, we could allow
// null messages but augment tryTake() to return a success-failure indication as well
// as the value.
// Thanks to Nathan Reyonds for comments on an earlier version of this algorithm.

tryTake() : // consume : "strong" total method - returns ASAP
auto tkt = TakeCursor
slot * s = &Slots[tkt & Mask]
auto delta = s->Turn - (tkt+1)
if delta == 0 :
if CAS (&TakeCursor, tkt, tkt+1) == tkt :
// our thread has exclusive access to s->MailBox
T v = s->MailBox
s->Turn = tkt + Mask + 1 // ceded MailBox access to next Put()
return v
if delta lessThan 0 :
return null
// inopportune concurrent interleaving - race
// tkt is stale with respect to TakeCursor
// Other Take() or tryTake() operations bypassed this operation
// and raced past
// This can happen if we stall after loading TakeCursor, for instance
// just retry
assert TakeCursor != tkt

There's quite a bit of related literature in this area. I'll call out a few relevant references:

I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore lets say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable).

As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Lets say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to a some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use :"if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}".

It's worth noting the obvious duality between locks and queues. If you have strict FIFO lock implementation with local spinning and succession by direct handoff such as MCS or CLH,then you can usually transform that lock into a queue.

If you want a multiple-producer single-consumer MPSC queue then you can replace atomic operations on the TakeCursor with normal updates. More generally you can take a simple SPSC queue and wrap the put() and take() operations with put and take mutexes, restricting concurrency to 1-vs-1, but the performance isn't usually as good as what you'd get from a queue designed for MPSC usage.

Wednesday Apr 30, 2014

SPAA 2014 : Persistent Unfairness Arising From Cache Residency Imbalance

Persistent Unfairness Arising From Cache Residency Imbalance by Dave Dice, Virendra J. Marathe and Nir Shavit will appear in SPAA 2014.

This is the Matthew Effect for caches.

The description of the methodology is necessarily terse in the paper, but we took pains to measure and minimize DRAM channel imbalance and to control for DRAM bank/page locality.

An interesting related paper is Efficient techniques for predicting cache sharing and throughput by Andreas Sandberg, David Black-Schaffer and Erik Hagersten (Uppsala).

Tuesday Apr 29, 2014

malloc for Haswell - Hardware Transactional Memory

Haswell i7-4770 processors.

For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possibly indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4K will map to the same index (set) in the L1. The low-order 6 bits of the address presented to the L1 form the offset into the line, and the next higher 6 bits serve as the L1 index. The MMU base page size is 4KB, so there is no overlap between the virtual page number and the index field in a virtual address. The L1 index field passes through address translation verbatim. As such, OS-level page coloring is not in play with respect to the L1. (An advantage of this design is that indexing can commence before the virtual address is translated to a physical address, although we still need the physical address for tag comparison). Some CPUs hash addresses -- usually XORing high-order physical address bits into the index bits -- in order to reduce the odds of index hotspots and index imbalance, but experiments suggest that does not appear to be the case with the i7-4770.

Such simple caches -- without the index hashing mentioned above -- can be vulnerable to excessive index conflicts, but malloc allocators can be made index-aware (local copy) to mitigate and reduce the frequency of index conflicts. Index imbalance results in underutilization of the cache. Some indices will be "cold"(less frequently accessed) while others are "hot" and thus incur relatively higher miss rates. It's worth pointing out that most application/allocator combinations don't exhibit excessive index conflicts, but for those that do, the performance impact can be significant. An index-aware allocator can act to "immunize" an application against some common cases of index-imbalance while typically incurring no additional cost over index-oblivious allocators. Think of the index-aware allocator as cheap insurance against a rare but painful performance disorder. The paper above describes an index-aware allocator designed for the L1 in a SPARC T2+ processor, but it's trivial to change a few cache geometry constants and retarget the allocator to the i7-4770.

The "CIA-Malloc" (Cache-Index Aware) allocator described in the paper has a number of other useful design properties. It also happens to be NUMA-friendly and large-page-friendly. Underlying pages are allocated on the node where the malloc() was invoked. Put another way, the pages underlying a block returned by malloc() will typically reside on the node where the malloc() was invoked. The allocator is also scalable with very little internal lock contention or coherence traffic. Each per-CPU sub-heap has a private lock -- the only time we'll encounter contention is via migration or preemption, which is relatively rare. The critical sections are also constant-time and very short. We also make heavy use of trylock(), so if a thread is obstructed it can usually make progress by reverting to another data structure. Remote free() operations are lock-free. Critically, the allocator acts to reduce the cost of malloc() and free() operations as well as the cost to the application when accessing blocks allocated via malloc(). The allocator is also designed specifically to reduce common cases of false sharing : allocator metadata-vs-metadata; metadata-vs-block; and inter-block block-vs-block. Metadata-vs-metadata sharing and false sharing is reduced by using per-CPU sub-heaps. False sharing arising between adjacent data blocks -- blocks returned by malloc() -- is addressed by placement and alignment. These attributes will prove even more useful when we use CIA-Malloc in conjunction with hardware transactions.

The i7-4770 provides hardware transactional memory (HTM). For the purposes of discussion we'll assume we're using TSX-RTM for the purposes of transactional lock elision (TLE). The critical section body contains unmodified HTM-oblivious legacy code that expects to run under the lock in the usual fashion, but via TLE we can modify the lock implementation to attempt optimistic execution, reverting to the lock only as necessary. The i7-4770's HTM implementation tracks the transactional write-set in the L1 and the read-set over the cache hierarchy. It uses a requester-wins conflict resolution strategy implemented via the MESIF coherence protocol. At most a single cache can have a given line in M/E state at any one time -- a classic multiple-reader single-writer model. Eviction or invalidation of a tracked cache line results in a transactional abort. For example if a transaction on CPU C loads address A, and some other CPU writes A before C commits, the write will invalidate the line from C's cache and cause an abort. Similarly, if C stores into A and some other CPU loads or stores into A before C commits, the invalidation of A will cause C's transaction to abort. Read-write or write-write sharing on locations accessed within a transaction results in coherence invalidation and consequent abort. (The HTM implementation in the i7-4770 shares quite a few aspects with Sun's experimental ROCK processor).

In addition to coherence traffic, self-displacement via conflict misses can also result in aborts. This is where a CIA-Malloc allocator may provide benefit relative to other allocators. Normally an index-aware allocator is expected to reduce conflict misses arising from index-imbalance, but it can also reduce transactional aborts caused by eviction of read-set or write-set entries from index conflicts. Aborts are usually far more expensive than simple cache misses. (Absent any potential benefit from warming up of caches, aborts are pure wasted and futile effort).

Lets take an actual example. The following data was collected on an i7-4770 running Ubuntu 14.04. We use a simple C single-threaded benchmark that uses malloc() to individually allocate a set of 250 nodes, and then arranges those nodes into an circular intrusive singly linked list. The benchmark was compiled with gcc 4.8.2 using the x32 ABI. The node structure has a "next" field at offset 0 followed by a volatile integer "w" field. A command-line switch gives us ability to specify the effective size of the node as passed to malloc(). Since there may be a correlation between allocation order and virtual address, we randomize the order of the nodes with a Fisher-Yates shuffle in order to minimize the impact of automatic hardware stride-based prefetchers. (Such a randomized order can put stress on the TLBs with lots of page crossings as we traverse the list, but that's not the dominant performance issue for the cases we'll discuss). We then report the time needed to complete 10000000 steps of the following loop body :

a->w = 0 ; a = a->next
If we use an effective node size of 950 bytes, then the default glibc malloc() allocator places our nodes at 960 byte intervals (1024-64) and each step of the loop requires 2.1 nsecs. When we increase the node size to 1010 the interval is 1024 bytes and each step takes 8.1 nsecs. If we further increase the node size to 1080 bytes then the interval is 1088 bytes (1024+64)and the time drops back to 2.1 nsecs. The performance drop at 1010 bytes was caused by the 1024-byte placement interval. The base addresses of our 250 nodes resided on just 4 of the 64 possible indices, so we grossly underutilized the L1. This nicely illustrates index conflicts arising from index-oblivious allocator placement policies. An index-aware allocator will avoid this scenario.

Now we'll extend our benchmark to use RTM hardware transactions. Each traversal of the ring is wrapped in a RTM XBEGIN-XEND transaction and we'll measure and report success rates. This is intended to model TLE or "naked" use of RTM transactions. We keep the ring circumference at 250 nodes, so each transaction will iterate over that
many elements. With nodes of 960 bytes the failure rate is 0.145%. Increasing the node size to 1010 bytes, the failure rate becomes 100%. (We have no progress). But if we bump the size to 1080 bytes then the failure rate drops back to 0.2%. The complete and persistent failure at 1010 bytes was caused by elements of the write-set being evicted and consequent abort. But if we use CIA-Malloc modified for the i7-4770 we can avoid such abrupt performance inflections.

To recap, an index-aware allocator can help avoid performance pathologies for normal non-transactional code as well as improving the success rate of hardware transactions. Specifically, conflict misses become aborts in transactions, and aborts are more expensive than normal misses.

Ideally, an HTM-friendly allocator will satisfy the desiderata enumerated in Cache-Index Aware Memory Allocation and also act to reduce the abort rate. The following properties are desirable:

  • malloc() and free() should be callable from within transactions with low odds of abort
  • memory accesses within transactions to blocks returned by malloc() are less prone to abort; specifically we want to reduce conflict misses and coherence invalidation from sharing, both of which cause aborts.
      Sharing and false-sharing:

    • allocator metadata-vs-blocks
    • allocator metadata-vs-metadata : we note that internal lock contention or promiscuous shared locks internal to the allocator are a special case of metadata-vs-metadata sharing that can cause aborts.
    • block-vs-block : inter-block false sharing. Application code can also use explicit memalign() and padding to reduce the odds of inter-block false sharing. An allocator or application that increases alignment to avoid false sharing may also cause quantization of block sizes and increased internal fragmentation.

  • malloc() and free() operations are scalable. Highly scalable allocators also tend to be transaction-friendly as they avoid or reduce the use shared global mutable data.

As a side note, under a requester-wins conflict resolution strategy, to the extent possible and reasonable it's a good idea to shift stores of frequently accessed shared variables toward the end of a transaction. You can do this by hand, or a transaction-aware compiler or JIT can perform some of the transformations. The modCount field Java's hashtable is a canonical example of an update that should be shifted. Shifting reduces the window of vulnerability where the store resides in the transaction's write-set. But the asymmetry in the i7-4770 where the write-set is tracked in the L1 and the read-set in the L1-L2-L3 gives us yet another reason to shift stores toward the end of a transaction. Consider a transaction that executes a store followed by large number of loads. Those loads may displace the store and cause an abort. But if we shift the store to the end of the transaction, the same set of accesses (just reordered) can succeed without abort. The store may displace a loaded line from the L1, but the L2 can still track the line.

Finally, when a given size-class is index-unfriendly, we can use the punctuated array approach as described in the original CIA-Malloc paper. A more radical approach is to intentionally and explicitly pick size-classes that are prime multiples of the cache line size. This helps to reduce inter-size-class index conflicts.

Another approach to index-aware allocation (other than friendly size-classes or punctuated arrays) is to simply intercept malloc(S) calls and run a Bernoulli trial to decide if we want to add 64 to S. The probably P should be low. As such, we'll occasionally, on a random basis, add 64 to S. A slightly more elaborate variation will make P a function of S, increasing the odds for larger S values.

Finally, we describe how to use hardware transactional memory within an allocator implementation in Simplifying Concurrent Algorithms by Exploiting Hardware Transactional Memory which appeared in SPAA 2010 (DOI). That I know of, this is the first use of HTM in an allocator implementation. The allocator used in the SPAA paper was derived from the allocator described in ISMM 2002 Mostly Lock-Free Malloc (DOI), which used restartable critical sections instead of transactions.

Tuesday Mar 11, 2014

A simple PRNG construction idiom

The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less that 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNG is one of my personal favorites. If I need better quality I'll step up to Ziff's four tap or Mersenne twister.

One variation of Marsaglia has only one word of state, a 4G-1 period, and requires just 3 shifts and 3 XOR operations to generate a new value. 0 is an absorbing state that we avoid. See MarsagliaNext(), below. Ignoring 0, the trajectory or stream of values forms a cycle -- conceptually a ring. The initialization and seeding operation should act to place different threads at different positions on that ring. In a sense the ring is shared by all threads but we start the threads at different points. Unfortunately, we can sometimes find that 2 different threads are at about the same position at about the same time through simple bad luck, and thus generate similar streams of values. (Longer PRNG cycles reduce the odds of such scenarios, of course). Ideally, to avoid such inopportune and undesirable behavior, each thread would have its own private ring of values.

A simple approach that is tantamount to giving each thread its own ring -- trajectory of values -- is shown below in NextRandom(). Hash32() is a hash function, which we'll describe later. Note that we explicitly "color" the value passed into Hash32() with the address of the thread-local PRNG state. Recall that at any one time, such an address will be associated with at most one thread, and is thus temporally unique. This approach gives us additional independence over concurrently executing threads. It also makes NextRandom() self-seeding -- explicit initialization is not required.

The Hash32() hash function is the key to this particular PRNG, and its implementation directly embodies the trade-off between performance and the quality of the resulting distribution. Conceptually, you could think of Hash32() as representing a randomized cycle with a period of 4G. We might implement Hash32() via a strong cryptographic hash, such as SHA-3 or MD5. Those hash functions would work perfectly well, but tend to be high quality but somewhat expensive, so we'd prefer a cheaper hash. Critically, we want the hash function to exhibit a high degree of avalanche effect. (A good encryption function could also be used as a hash operator). Some cheaper candidates for the hash function include: MurmurHash ;CityHash ; FNV hash family; siphash; and Jenkins hash.

Doug Lea and Guy Steele invented some of the best candidates for our hash function: see Mix32() and Mix64() below. These are relatively cheap but do well on avalanche tests and strike a reasonable balance between quality and performance. Beware that they're obviously not cryptographically strong. Mix32() and Mix64() are related to mix functions found in java.util.SplittableRandom.

Update 2014-03-13 : Thanks to Paul Sandoz for pointing out Stafford's mix function.

static int32_t MarsagliaNext () {
static __thread int32_t rv = 0 ;
if (rv == 0) rv = GenerateSeed() ;
int32_t v = rv ;
v ^= v << 6 ;
v ^= uint32_t(v) >> 21 ;
v ^= v << 7 ;
rv = v ;
return v ;

static int NextRandom() {
// assumes a 32-bit environment : ILP32
// Instead of incrementing by 1 we could also
// increment by a large prime or by the MIX constant
// if we're using Mix32() as our hash function.
static __thread int rv = 0 ;
return Hash32 (int(&rv) + (++rv)) ;

// Mix32 and Mix64 are provided courtesy of Doug Lea and Guy Steele

static int32_t Mix32 (int32_t z) {
static const int32_t MIX = 0x9abe94e3;
// Another workable constant is 0xa0ca9b6b
z = (z ^ (uint32_t(z) >> 16)) * MIX;
z = (z ^ (uint32_t(z) >> 16)) * MIX;
return z ^ (uint32_t(z) >> 16);

static int64_t Mix64 (int64_t z) {
static const int64_t MIX = 0xdaba0b6eb09322e3LL;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
return z ^ (uint64_t(z) >> 32);

MSPC 2014 submission deadline has been extended until April 4th

See the workshop web site for details.

Monday Mar 10, 2014

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But lets assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsgalia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion lets assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trails until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).

Monday Feb 24, 2014

MSPC 2014

MSPC 2014 will be co-located with PLDI 2014 in Edinburgh. The paper due-date is March 10th.

Tuesday Jan 28, 2014

Selective Core Boosting: The Return of the Turbo Button

Available via TU Dresden or locally.

Friday Jan 24, 2014

Do you use sun.misc.Unsafe ?

Take the following survey on API usage to help JDK planning. (Yes, we're not supposed to use it directly. Caveat Utor. But many folks do, including myself.)

Friday May 31, 2013

Jam Jar Jet

A simple working model of a valveless pulse jet engine inspired by an article in Make magazine. I found our worked better without the diffuser. Read the full article before you try this at home. Sometimes our blog systems does not allow embedded video, so just in case you can use this direct link.

Scalable Statistics Counters

Scalable Statistics Counters will appear in SPAA 2013.

Lightweight Contention Management for Efficient Compare-and-Swap Operations

"Lightweight Contention Management for Efficient Compare-and-Swap Operations" by Dave Dice, Danny Hendler and Ilya Mirsky will appear in Euro-Par 2013. A longer technical report is also available in arxiv.

Monday Jan 07, 2013

Fixing the Krieger read-write lock

Using Hardware Transactional Memory to Correct and Simplify a Readers-Writer Lock Algorithm will appear in PPoPP 2013.

NUMA-Aware Reader-Writer Locks

NUMA-Aware Reader-Writer Locks will appear in PPoPP 2013.

As a side note, it seems Vyukov's distributed reader-write lock is similar to the concept of sub-latches and super-latches found in some database implementations.


ACM DL Author-ize serviceNUMA-aware reader-writer locks
Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, Nir Shavit
PPoPP '13 Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2013

Friday Nov 23, 2012

Java @Contended annotation to help reduce false sharing

See this posting by Aleksey Shipilev for details -- @Contended is something we've wanted for a long time. The JVM provides automatic layout and placement of fields. Usually it'll (a) sort fields by descending size to improve footprint, and (b) pack reference fields so the garbage collector can process a contiguous run of reference fields when tracing. @Contended gives the program a way to provide more explicit guidance with respect to concurrency and false sharing. Using this facility we can sequester hot frequently written shared fields away from other mostly read-only or cold fields. The simple rule is that read-sharing is cheap, and write-sharing is very expensive. We can also pack fields together that tend to be written together by the same thread at about the same time.

More generally, we're trying to influence relative field placement to minimize coherency misses. In a simple single-threaded environment fields that are accessed closely together in time should be placed proximally in space to promote cache locality. That is, temporal locality should condition spatial locality. Fields accessed together in time should be nearby in space. That having been said, when threads are accessing our fields concurrently we have to be careful to avoid false sharing and excessive invalidation from coherence traffic. As such, we try to cluster or otherwise sequester fields that tend to written at approximately the same time by the same thread onto the same cache line. Note that there's a tension at play: if we try too hard to minimize single-threaded capacity misses then we can end up with excessive coherency misses running in a parallel environment. In native C/C++ code it's fairly typical for programmers to use informed concurrency-aware structure layout. @Contended should give use the same capability in Java, although in native code the binding of fields to offsets happens at compile-time, while it happens at load-time for the Java. It's worth pointing out that in the general case there is no single optimal layout for both single-thread and multithreaded environments. And the ideal layout problem itself is NP-hard.

Ideally, a JVM would employ hardware monitoring facilities to detect sharing behavior and change the layout on the fly. That's a bit difficult as we don't yet have the right plumbing to provide efficient and expedient information to the JVM. Hint: we need to disintermediate the OS and hypervisor. Another challenge is that raw field offsets are used in the unsafe facility, so we'd need to address that issue, possibly with an extra level of indirection.

Finally, I'd like to be able to pack final fields together as well, as those are known to be read-only.

Thursday Nov 22, 2012

NUMA-aware placement of communication variables

For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it. Ideally, a page is collocated on the node with the thread that's expected to most frequently access the page, as simple misses on the page can be satisfied without resorting to transferring the line over the interconnect. The default "first touch" NUMA page placement policy tends to work reasonable well in this regard. When a virtual page is first accessed, the operating system will attempt to provision and map that virtual page to a physical page allocated from the node where the accessing thread is running. It's worth noting that the node-level memory interleaving granularity is usually a multiple of the page size, so we can say that a given page P resides on some node N. That is, the memory underlying a page resides on just one node.

But when thinking about accesses to heavily-written communication variables we normally consider what caches the lines underlying such variables might be resident in, and in what states. We want to minimize coherence misses and cache probe activity and interconnect traffic in general. I don't usually give much thought to the location of the home NUMA node underlying such highly shared variables. On a SPARC T5440, for instance, which consists of 4 T2+ processors connected by a central coherence hub, the home node and placement of heavily accessed communication variables has very little impact on performance. The variables are frequently accessed so likely in M-state in some cache, and the location of the home node is of little consequence because a requester can use cache-to-cache transfers to get the line.

Or at least that's what I thought. Recently, though, I was exploring a simple shared memory point-to-point communication model where a client writes a request into a request mailbox and then busy-waits on a response variable. It's a simple example of delegation based on message passing. The server polls the request mailbox, and having fetched a new request value, performs some operation and then writes a reply value into the response variable. As noted above, on a T5440 performance is insensitive to the placement of the communication variables -- the request and response mailbox words. But on a Sun/Oracle X4800 I noticed that was not the case and that NUMA placement of the communication variables was actually quite important.

For background an X4800 system consists of 8 Intel X7560 Xeons . Each package (socket) has 8 cores with 2 contexts per core, so the system is 8x8x2. Each package is also a NUMA node and has locally attached memory. Every package has 3 point-to-point QPI links for cache coherence, and the system is configured with a twisted ladder "mobius" topology. The cache coherence fabric is glueless -- there's not central arbiter or coherence hub. The maximum distance between any two nodes is just 2 hops over the QPI links. For any given node, 3 other nodes are 1 hop distant and the remaining 4 nodes are 2 hops distant.

Using a single request (client) thread and a single response (server) thread, a benchmark harness explored all permutations of NUMA placement for the two threads and the two communication variables, measuring the average round-trip-time and throughput rate between the client and server. In this benchmark the server simply acts as a simple transponder, writing the request value plus 1 back into the reply field, so there's no particular computation phase and we're only measuring communication overheads. In addition to varying the placement of communication variables over pairs of nodes, we also explored variations where both variables were placed on one page (and thus on one node) -- either on the same cache line or different cache lines -- while varying the node where the variables reside along with the placement of the threads. The key observation was that if the client and server threads were on different nodes, then the best placement of variables was to have the request variable (written by the client and read by the server) reside on the same node as the client thread, and to place the response variable (written by the server and read by the client) on the same node as the server. That is, if you have a variable that's to be written by one thread and read by another, it should be homed with the writer thread. For our simple client-server model that means using split request and response communication variables with unidirectional message flow on a given page. This can yield up to twice the throughput of less favorable placement strategies.

Our X4800 uses the QPI 1.0 protocol with source-based snooping. Briefly, when node A needs to probe a cache line it fires off snoop requests to all the nodes in the system. Those recipients then forward their response not to the original requester, but to the home node H of the cache line. H waits for and collects the responses, adjudicates and resolves conflicts and ensures memory-model ordering, and then sends a definitive reply back to the original requester A. If some node B needed to transfer the line to A, it will do so by cache-to-cache transfer and let H know about the disposition of the cache line. A needs to wait for the authoritative response from H. So if a thread on node A wants to write a value to be read by a thread on node B, the latency is dependent on the distances between A, B, and H. We observe the best performance when the written-to variable is co-homed with the writer A. That is, we want H and A to be the same node, as the writer doesn't need the home to respond over the QPI link, as the writer and the home reside on the very same node. With architecturally informed placement of communication variables we eliminate at least one QPI hop from the critical path.

Newer Intel processors use the QPI 1.1 coherence protocol with home-based snooping. As noted above, under source-snooping a requester broadcasts snoop requests to all nodes. Those nodes send their response to the home node of the location, which provides memory ordering, reconciles conflicts, etc., and then posts a definitive reply to the requester. In home-based snooping the snoop probe goes directly to the home node and are not broadcast. The home node can consult snoop filters -- if present -- and send out requests to retrieve the line if necessary. The 3rd party owner of the line, if any, can respond either to the home or the original requester (or even to both) according to the protocol policies. There are myriad variations that have been implemented, and unfortunately vendor terminology doesn't always agree between vendors or with the academic taxonomy papers. The key is that home-snooping enables the use of a snoop filter to reduce interconnect traffic. And while home-snooping might have a longer critical path (latency) than source-based snooping, it also may require fewer messages and less overall bandwidth. It'll be interesting to reprise these experiments on a platform with home-based snooping.

While collecting data I also noticed that there are placement concerns even in the seemingly trivial case when both threads and both variables reside on a single node. Internally, the cores on each X7560 package are connected by an internal ring. (Actually there are multiple contra-rotating rings). And the last-level on-chip cache (LLC) is partitioned in banks or slices, which with each slice being associated with a core on the ring topology. A hardware hash function associates each physical address with a specific home bank. Thus we face distance and topology concerns even for intra-package communications, although the latencies are not nearly the magnitude we see inter-package. I've not seen such communication distance artifacts on the T2+, where the cache banks are connected to the cores via a high-speed crossbar instead of a ring -- communication latencies seem more regular.

Finally, I've seen strong hints that the placement of threads relative to lock metadata accessed by those threads plays an important part in performance for contended locks.

Tuesday Nov 06, 2012

C-states and P-states : confounding factors for benchmarking

I was recently looking into a performance issue in the java.util.concurrent (JUC) fork-join pool framework related to particularly long latencies when trying to wake (unpark) threads in the pool. Eventually I tracked the issue down to the power & scaling governor and idle-state policies on x86. Briefly, P-states refer to the set of clock rates (speeds) at which a processor can run. C-states reflect the possible idle states. The deeper the C-state (higher numerical values) the less power the processor will draw, but the longer it takes the processor to respond and exit that sleep state on the next idle to non-idle transition. In some cases the latency can be worse than 100 microseconds. C0 is normal execution state, and P0 is "full speed" with higher Pn values reflecting reduced clock rates. C-states are P-states are orthogonal, although P-states only have meaning at C0. You could also think of the states as occupying a spectrum as follows : P0, P1, P2, Pn, C1, C2, ... Cn, where all the P-states are at C0. Our fork-join framework was calling unpark() to wake a thread from the pool, and that thread was being dispatched onto a processor at deep C-state, so we were observing rather impressive latencies between the time of the unpark and the time the thread actually resumed and was able to accept work. (I originally thought we were seeing situations where the wakee was preempting the waker, but that wasn't the case. I'll save that topic for a future blog entry). It's also worth pointing out that higher P-state values draw less power and there's usually some latency in ramping up the clock (P-states) in response to offered load.

The issue of C-states and P-states isn't new and has been described at length elsewhere, but it may be new to Java programmers, adding a new confounding factor to benchmarking methodologies and procedures. To get stable results I'd recommend running at C0 and P0, particularly for server-side applications. As appropriate, disabling "turbo" mode may also be prudent. But it also makes sense to run with the system defaults to understand if your application exhibits any performance sensitivity to power management policies.

The operating system power management sub-system typically control the P-state and C-states based on current and recent load. The scaling governor manages P-states. Operating systems often use adaptive policies that try to avoid deep C-states for some period if recent deep idle episodes proved to be very short and futile. This helps make the system more responsive under bursty or otherwise irregular load. But it also means the system is stateful and exhibits a memory effect, which can further complicate benchmarking. Forcing C0 + P0 should avoid this issue.

Thursday Oct 25, 2012

NUMA-aware constructs for java.util.concurrent

The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor stealing from queues associated with 'nearby' threads.

Performance triage

Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your particular system-under-test. Some simple comparisons and microbenchmarks can be useful at this stage. It's also useful to write some very simple programs to validate some of the reported or expected system limits. Can that disk controller really tolerate and sustain 500 reads per second? To reduce the number of confounding factors it's better to try to answer that question with a very simple targeted program. And finally, nothing beats having familiarity with the technologies that underlying your particular layer.

On the topic of confounding factors, as our technology stacks become deeper and less transparent, we often find our own technology working against us in some unexpected way to choke performance rather than simply running into some fundamental system limit. A good example is the warm-up time needed by just-in-time compilers in Java Virtual Machines. I won't delve too far into that particular hole except to say that it's rare to find good benchmarks and methodology for java code. Another example is power management on x86. Power management is great, but it can take a while for the CPUs to throttle up from low(er) frequencies to full throttle. And while I love "turbo" mode, it makes benchmarking applications with multiple threads a chore as you have to remember to turn it off and then back on otherwise short single-threaded runs may look abnormally fast compared to runs with higher thread counts. In general for performance characterization I disable turbo mode and fix the power governor at "performance" state. Another source of complexity is the scheduler, which I've discussed in prior blog entries.

Lets say I have a running application and I want to better understand its behavior and performance. We'll presume it's warmed up, is under load, and is an execution mode representative of what we think the norm would be. It should be in steady-state, if a steady-state mode even exists. On Solaris the very first thing I'll do is take a set of "pstack" samples. Pstack briefly stops the process and walks each of the stacks, reporting symbolic information (if available) for each frame. For Java, pstack has been augmented to understand java frames, and even report inlining. A few pstack samples can provide powerful insight into what's actually going on inside the program. You'll be able to see calling patterns, which threads are blocked on what system calls or synchronization constructs, memory allocation, etc. If your code is CPU-bound then you'll get a good sense where the cycles are being spent. (I should caution that normal C/C++ inlining can diffuse an otherwise "hot" method into other methods. This is a rare instance where pstack sampling might not immediately point to the key problem). At this point you'll need to reconcile what you're seeing with pstack and your mental model of what you think the program should be doing. They're often rather different. And generally if there's a key performance issue, you'll spot it with a moderate number of samples.

I'll also use OS-level observability tools to lock for the existence of bottlenecks where threads contend for locks; other situations where threads are blocked; and the distribution of threads over the system. On Solaris some good tools are mpstat and too a lesser degree, vmstat. Try running "mpstat -a 5" in one window while the application program runs concurrently. One key measure is the voluntary context switch rate "vctx" or "csw" which reflects threads descheduling themselves. It's also good to look at the user; system; and idle CPU percentages. This can give a broad but useful understanding if your threads are mostly parked or mostly running. For instance if your program makes heavy use of malloc/free, then it might be the case you're contending on the central malloc lock in the default allocator. In that case you'd see malloc calling lock in the stack traces, observe a high csw/vctx rate as threads block for the malloc lock, and your "usr" time would be less than expected.

Solaris dtrace is a wonderful and invaluable performance tool as well, but in a sense you have to frame and articulate a meaningful and specific question to get a useful answer, so I tend not to use it for first-order screening of problems. It's also most effective for OS and software-level performance issues as opposed to HW-level issues. For that reason I recommend mpstat & pstack as my the 1st step in performance triage. If some other OS-level issue is evident then it's good to switch to dtrace to drill more deeply into the problem.

Only after I've ruled out OS-level issues do I switch to using hardware performance counters to look for architectural impediments.


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« May 2016