Friday Feb 20, 2015

Cohort Locks in ACM TOPC

Lock Cohorting: A General Technique for Designing NUMA Locks appears in ACM Transactions on Parallel Computing (TOPC) Volume 1 Issue 2, January 2015. (local copy -- apologies for the gzipped PDF, but the file size exceeded Oracle blog limits).

This is an extended journal version of our PPoPP 2012 conference paper. Since the PPoPP version we've found ways to make the lock friendlier for Intel MESI/MESIF-based systems. Our preferred form now uses a partitioned ticket lock for the top-level global lock, and normal ticket locks for the node-level local locks. In addition, the journal version shows that cohort locks can provide benefit even on single-node non-NUMA systems by treating each core as a node for the purposes of cohort formation.

See also US8694706.

Wednesday Jan 28, 2015

Using reader-writer locks to improve hardware TLE

At its most simple, traditional hardware TLE (Transactional Lock Elision) operates as follows : start a hardware transaction; "subscribe" to the lock state (aborting if the lock is found to be held); execute the critical section; and finally commit at the end of the critical section. For the purposes of discussion I'm assuming an Intel TSX-RTM implementation of hardware transactional memory (HTM). If we can't make progress using the optimistic transactional mode then we revert to classic pessimistic mutual exclusion by taking the lock, running the critical section in non-transactional execution mode, and finally releasing the lock. Myriad retry policies are possible. It's worth noting that HTM implementations built on top of the existing cache coherence protocol with "requester-wins" conflict resolution policies will usually admit mutual abort with no progress by any of the parties. That is, the progress properties can be rather poor.

A relatively simple way to improve on the implementation above is to employ a reader-writer lock. Threads trying to use optimistic transactions to execute the critical section must acquire and hold shared "R" permission. Threads executing the critical section via the classic non-transactional path must hold exclusive "W" permission. This approach confers a number of advantages. First, it obviates the need for subscription in the optimistic path. Second, it avoids the lemming effect. (Perhaps I should have used the term "pyrrhic locking" instead, as we all know full well that lemmings don't really follow each other off cliffs to their death. It's pyrrhic in the sense that you "win" the lock, but lose the ability to run transactions). Third, under the usual simplistic TLE implementations, a thread taking the classic path will cause all concurrent threads in the optimistic path to abort. But with the reader-writer lock variant, threads taking the pessimistic path wait politely for the optimistic threads to finish. The existence of threads in the R path is visible to threads trying to acquire W. This may result in fewer aborts and fewer cycles wasted on futile transactions. Both the simple traditional form and the form based on reader-writer locks prohibit simultaneous execution of transactional and non-transactional modes under the lock, but the reader-writer lock variety is a bit less medieval. In a sense we have a "guarded" execution mode for transactions. Finally, some reader-writer lock implementations intentionally control admission policy so as to promote large groups of simultaneous readers -- we call this "R-group formation". When used for TLE, a reader-writer lock that promotes R-group formation will also tend to exhibit better throughput than, say, a reader-writer lock that uses a strict FIFO admission policy. Specifically, larger R-groups let us co-schedule larger groups of concurrent transactions.
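Here's a minimal sketch of the reader-writer form, assuming TSX-RTM intrinsics from immintrin.h and a POSIX reader-writer lock; the function names and retry policy are illustrative, not taken from any particular implementation:

#include <immintrin.h>
#include <pthread.h>

extern void CriticalSection();                      // HTM-oblivious legacy body
static pthread_rwlock_t RW = PTHREAD_RWLOCK_INITIALIZER;

void RunWithRWTLE(int MaxAttempts) {
  for (int i = 0; i < MaxAttempts; i++) {
    pthread_rwlock_rdlock(&RW);                     // shared "R" : optimistic mode permitted
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
      CriticalSection();                            // no lock subscription required
      _xend();
      pthread_rwlock_unlock(&RW);
      return;
    }
    pthread_rwlock_unlock(&RW);                     // aborted : drop R, then retry or fall back
  }
  pthread_rwlock_wrlock(&RW);                       // exclusive "W" : classic pessimistic path
  CriticalSection();
  pthread_rwlock_unlock(&RW);
}

Note that the R permission is acquired and released outside the transaction, so the transaction itself never touches the lock word.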

As a side-note, hardware TLE can be implemented either within pthread_mutex primitives, or on top of such primitives. In the latter case, if your TLE implementation covers all critical sections protected by the lock in question, then you can add a separate "isLocked" tracking variable for the purposes of subscription. In this case the slow "classic" pessimistic path would acquire the lock and store "true" into isLocked. On x86 you also need a fence or fence-equivalent instruction after the store. In some circumstances, however, you might not have access to all the critical sections to impose the use of the isLocked convention. In that case it'd be helpful if glibc/libc were to expose a pthread_mutex_islocked_np() operator which would test the lock state and subscribe. TLE could call that function in the fast optimistic path. Note that the reader-writer lock form needs neither an isLocked() call to inspect the lock nor an explicit tracking variable.
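As a rough sketch of the layered form with an explicit tracking variable (RTM intrinsics again; isLocked, the fence choice, and the absent retry policy are illustrative assumptions):

#include <immintrin.h>
#include <pthread.h>

extern void CriticalSection();
static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static volatile int isLocked = 0;                   // tracking variable for subscription

void RunWithTLE() {
  unsigned status = _xbegin();
  if (status == _XBEGIN_STARTED) {
    if (isLocked) _xabort(0x01);                    // early subscription : a pessimistic holder is active
    CriticalSection();
    _xend();
    return;
  }
  pthread_mutex_lock(&M);                           // slow classic pessimistic path
  isLocked = 1;
  __sync_synchronize();                             // fence-equivalent after the store on x86
  CriticalSection();
  isLocked = 0;
  pthread_mutex_unlock(&M);
}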

There are a few ways to embellish this approach. I’ve only explored this on a single-node system, but it looks useful to use a cohort NUMA reader-writer lock with 2 synthetic logical cohort nodes, one for R and one for W. This tends to further favor and promote R-group formation, and thus "batch" or co-schedule groups of transactions together to an even greater degree, yielding better parallelism.

Another idea is to augment the R path with K-exclusion. (See US8402464 for more details). Because of the mutual abort with no progress issue mentioned above, we can encounter fratricide if we have too many threads simultaneously in the R path under a given lock. By decreasing the number of threads in the R-path, we can sometimes improve aggregate throughput. K-exclusion around the R path gives us a convenient way to throttle the number of threads. Specifically, we vary K based on the recent transactional success or abort rates. When K reaches 1 we have a degenerate case and just revert to mutual exclusion.

An AIMD policy for K seems to work reasonably well. AIMD policies are usually associated with TCP congestion windows, but seem to map cleanly to problems involving optimistic synchronization. The TCP sending window is also optimistic, as elements in the window may require re-transmission. (A few years ago I used AIMD policies to adapt the spin duration for contended Java locks).
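A minimal sketch of such an AIMD controller for K; the thresholds, bounds, and names here are illustrative:

static const int Kmax = 16;                  // illustrative upper bound on R-path admission
static const int AdditiveThreshold = 128;    // commits required before raising K
static int K = Kmax;                         // current admission bound for the R path
static int Successes = 0;

void OnTransactionOutcome(bool committed) {
  if (committed) {
    if (++Successes >= AdditiveThreshold && K < Kmax) {
      K++;                                   // additive increase
      Successes = 0;
    }
  } else {
    K = (K > 1) ? K / 2 : 1;                 // multiplicative decrease; K == 1 reverts to mutual exclusion
    Successes = 0;
  }
}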

The canonical use case for reader-writer locks is the usual and obvious single-writer vs multiple-reader roles. Our example above is single pessimistic/normal vs multiple optimistic/transactional. Another mode we see is in lock-free code with deferred memory reclamation, where R permission confers the right to access some data, and the W role is required to unmap or physically free the underlying memory. A related usage is to protect the file system mount tables in operating systems. R permission enables a thread to parse pathnames and access the filesystem without worry of dismount, while W is used by unmount to make sure the filesystem is quiesced. Another use-case is multiple mutator threads vs a garbage collector where the heap is protected by a single reader-writer lock. If you have a JVM or other managed runtime with a stop-the-world copying collector, then normal heap mutator threads have R access. If a mutator calls out of the JVM via JNI, or parks itself, then it relinquishes R for the duration of that operation. The garbage collector takes the W role when active. Mutators also periodically poll for pending GC and relinquish R to allow the collector to run.

See also PhTM : Phased Transactional Memory which appeared in Transact 2007.

Note that a good quality reader-writer lock will provide automatic gang wakeup.


Monday Jan 26, 2015

Blizzard Preparations

3 cars in a 2-car garage

AVIS Frankfurt

I recently rented a car from AVIS at Frankfurt for travel to Dagstuhl. We drove to Dagstuhl, left it parked for 5 days, and returned without incident. On return, the attendant -- very pleasant, btw -- found 'damage' to the front driver's side rim and wanted to know details about the 'accident'. There was no accident while the car was in my hands. I related as much to the attendant, but didn't make any progress. AVIS tacked an additional 147€ charge on my corporate credit card. This was immediately after the Charlie Hebdo attack in Paris, so we knew the security lines would be long and slow; even though we arrived 3 hours before the flight, we didn't have any spare time to argue the point.

Interestingly, there was no outbound inspection by AVIS when I rented the car. Below is a picture of the rim after the attendant pointed out the issue. There's certainly something on the rim at about the 6:00 and 3:30 positions, but it didn't happen while I had the car. The photographic quality is bad, as the return area was very poorly lit. I took a quick look around the car when I rented it, but didn't spot the supposedly damaged area.

Hopefully this isn't systemic behavior or practice on the part of AVIS.

I suppose the lesson is (a) take a walk-around video when you rent the car -- hopefully there's enough light -- or (b) perhaps use a different vendor.

A subsequent letter from AVIS detailing the damage charge ended as follows :

If you would like to exclude this risk in your next rental, you can purchase an excess-waiver for a small additional charge.

AVIS

Wednesday Jan 14, 2015

Seen in St. John

I originally mistook this as "Dave".

Just back from Dagstuhl

Concurrent Computing in the Many-core Era 2015 - Schloss Dagstuhl.

As usual, a productive and invigorating week. (All talks; My slides).

We also visited the Völklingen Ironworks. (wikipedia).

Monday Dec 29, 2014

Measuring short-term fairness for locks

In a companion blog entry I discussed ways to measure or characterize the long-term fairness of lock implementations. Measuring short-term fairness is also interesting. We could easily have a lock, such as a cohort lock, that is fair over long intervals but intentionally unfair over the short term. It's useful to have ways to define exactly how unfair a lock is over some interval. We'll assume the lock admission policy is work-conserving.

It's worth mentioning that measuring short-term fairness is slightly tricky because of induced probe effect. As such, we sometimes use statistical sampling to reduce the overheads albeit with reduced certainty. I'll leave that discussion for subsequent blog entries.

I'll borrow Denning's working set concept, but apply it to locks instead of memory management. The working set of a program over some interval I is the set of distinct pages accessed during I. Where convenient and unambiguous, we'll sometimes quantify the working set as the number of pages in the working set. Similarly, I'll define the lock working set (LWS) for a given lock L over interval I as the set of distinct threads that acquired the lock during the interval. Instead of defining the interval I over wall-clock time, it's more convenient to treat each acquire of the lock as a virtual increment of our interval clock. We'll define LWSS as the size of the LWS. Even though we might define the interval I to be small, we can partition a long run into a set of M I-length disjoint intervals and report the average or median LWSS over those M points.
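A small sketch of the LWSS computation over a trace of acquiring thread IDs, using the virtual acquisition clock described above (the function name is mine):

#include <unordered_set>
#include <vector>

// Average LWSS over disjoint intervals of length I, given the acquisition trace.
double AverageLWSS(const std::vector<int>& acquires, size_t I) {
  size_t intervals = 0;
  double total = 0;
  for (size_t base = 0; base + I <= acquires.size(); base += I) {
    std::unordered_set<int> ws(acquires.begin() + base, acquires.begin() + base + I);
    total += ws.size();                 // distinct threads acquiring during this interval
    intervals++;
  }
  return intervals ? total / intervals : 0.0;
}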

A related statistic is the median time to reacquire (MTTR). Conceptually, we can think of a measurement run as generating a stream of thread IDs that acquired the lock. We can then iterate through that stream, and for every acquire, determine the number of intervening acquire operations that occurred since the thread last obtained the lock. That gives us a time-to-reacquire value, and in turn, we can compute the MTTR over the run.

Note that we might measure either the number of intervening acquires, or the number of distinct threads to acquire in the interim.
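The sketch below computes time-to-reacquire in terms of intervening acquires; swapping the simple count for the size of a distinct-thread set gives the other variant:

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Median time-to-reacquire, in intervening acquires, for a trace of thread IDs.
int64_t MTTR(const std::vector<int>& acquires) {
  std::unordered_map<int, int64_t> last;        // thread ID -> index of its previous acquire
  std::vector<int64_t> ttr;
  for (int64_t i = 0; i < (int64_t) acquires.size(); i++) {
    auto it = last.find(acquires[i]);
    if (it != last.end()) ttr.push_back(i - it->second - 1);
    last[acquires[i]] = i;
  }
  if (ttr.empty()) return 0;
  std::nth_element(ttr.begin(), ttr.begin() + ttr.size() / 2, ttr.end());
  return ttr[ttr.size() / 2];                   // upper median for even counts
}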

My colleague Tim Harris noted that it could be useful to plot the interval length I on the x-axis against the average LWSS on the y-axis. This is somewhat reminiscent of MMU (minimum mutator utilization) graphs that appear in the garbage collection literature.

When a contended lock is unfair over the short term, it's typically the case that the set of threads circulating over the lock is partitioned into active and passive sets. The active circulating set corresponds to the LWSS.

Measuring long-term fairness for locks

Let's say we have N concurrent and homogeneous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-work benchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're also interested in quantifying the fairness, as reflected by the distribution of those points. For simplicity, we'll assume that the scheduler is ideally fair, there are no NUMA or physical "geographic" issues in play, and that each thread has about the same cache footprint. The only remaining source of unfairness is the lock implementation or policies. (In practice there are myriad confounding factors, including energy and thermal caps, but we'll ignore those and assume an idealized model). A histogram of the completion counts is a good way to visualize the fairness or unfairness of the lock. I recommend HdrHistogram.

Sometimes, though, it's convenient to describe unfairness in terms of a simple real-valued numerical statistic. Such a descriptive statistic can be used to quantify the fairness of various lock policies, and in particular to help establish the trade-off between fairness and throughput. Ideally that statistic would be scale-invariant -- that property is desirable but optional. Some of the usual statistics are standard deviation or variance. Extrema-based statistics such as (max-min) or (max-min)/max can also be useful. These can give us a sense of the size of the range of the data points. The average divided by the max can also provide some insight. IQR is another commonly used statistic, as is Jain's Fairness Index. Cederman et al. suggested another useful fairness metric. In recent papers I've reported the relative standard deviation. (In communications literature it's not uncommon to see fairness expressed in terms of average/stddev, which is the reciprocal of the relative standard deviation). Median Absolute Deviation (MAD) is also informative. I don't have one favorite -- my lock benchmark harnesses report all of the above.
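For reference, two of the statistics mentioned above are easy to compute from the N completion counts; a sketch, with function names of my own choosing:

#include <cmath>
#include <vector>

// Jain's fairness index : (sum x)^2 / (n * sum x^2); 1.0 is perfectly fair.
double JainIndex(const std::vector<double>& x) {
  double s = 0, s2 = 0;
  for (double v : x) { s += v; s2 += v * v; }
  return (s * s) / (x.size() * s2);
}

// Relative standard deviation : stddev / mean; 0.0 is perfectly fair.
double RelStdDev(const std::vector<double>& x) {
  double mean = 0;
  for (double v : x) mean += v;
  mean /= x.size();
  double var = 0;
  for (double v : x) var += (v - mean) * (v - mean);
  var /= x.size();
  return std::sqrt(var) / mean;
}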

Recently I was looking at other measures of disparity or uniformity and came up with the following idea. First, we sort our N completion counts in ascending order. We then plot as follows. On the X-axis we have the number of threads T, and on the Y-axis we have the cumulative sum of iteration counts up to thread #T. If the completion counts were 10,15,22,33 then the Y values would be C(x) = 0,10,25,47,80, for instance, for 0,1,2,3,4 threads, respectively. Beware that the C(x) function is obviously discrete, but for the purposes of discussion and terminology I'll treat it as a continuous real-valued function. Next, we normalize both axes to [0,1]. For the Y-axis, we can simply divide C(x) by the maximum -- final -- C(x) value. If all the data points are equal -- ideally fair -- then C(x) is just the identity diagonal function : C(x)=x. Note that C(x) is convex. The more unfair the distribution, the larger the area under the diagonal and above the normalized C(x) function. And in fact that area measure makes a useful index of inequality. We could derive additional statistics from this approach, such as the maximum x-C(x) value, but the area between the diagonal and C(x) seems viable.
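A sketch of that area-based index, computed as a simple discrete approximation (the normalization details are a judgment call):

#include <algorithm>
#include <vector>

// Area between the diagonal and the normalized cumulative curve C(x).
// Returns 0 for a perfectly fair distribution; larger values indicate more unfairness.
double UnfairnessArea(std::vector<double> counts) {
  std::sort(counts.begin(), counts.end());            // ascending completion counts
  const size_t n = counts.size();
  double total = 0;
  for (double c : counts) total += c;
  double area = 0, cum = 0;
  for (size_t t = 1; t <= n; t++) {
    cum += counts[t - 1];
    area += ((double) t / n - cum / total) / n;        // (diagonal - C) * interval width
  }
  return area;
}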

Subsequently, I realized a similar approach had been long used in economics to describe income disparity : Gini Coefficient.

See also : Measuring Short-Term Fairness for locks.

Tuesday Dec 02, 2014

No blocking parking

Taken at lovely Gillette Stadium; possibly amusing to synchronization practitioners. "Easy exit" sounds interesting as well.

Thursday Oct 09, 2014

Synchronization horror story of the day

While evaluating some new lock algorithms (exposed via LD_PRELOAD interposition on the pthread_lock and condvar interfaces), I noticed that C++ catch and throw on Solaris/SPARC compiled with GCC 4.9.0 will take runtime paths that take locks and impede scaling. Specifically, if you have a set of threads that simply execute "try { throw 20; } catch (int e) {;}", they won't scale as expected because of runtime locks. _Unwind_RaiseException and _Unwind_Find_FDE always appear on the stack when we encounter the contention. Interestingly, I didn't see any evidence of this behavior on Ubuntu 14.04 on x86.

Tuesday Jul 29, 2014

A simple lazy subscription pathology

Following up on The Pitfalls of lazy subscription, I thought I'd provide a simple case that illustrates where transactional lock elision (TLE) with lazy subscription can fail. I've managed to reproduce the failure in-house on an i7-4770 (Haswell).

Say we have a ring-buffer that’s used for logging. It just records the last 16 messages posted. (This is kind of common for “black box flight recorders” that appear in kernels and the JVM). The memory layout is : int pos; intptr_t RingBuffer[16]; volatile int Lock. We'll also assume that we compile with code at low optimization levels, so accesses to "pos" result in loads and stores with no caching in registers. “Lock” is a simple test-and-test-and-set lock possibly augmented with TLE and late subscription.

The critical section is :



Post(m) :
  Acquire(&Lock)
  auto index = pos ++          // load and store pos
  RingBuffer[index] = m
  if pos == 16 : pos = 0       // reload pos
  Release(&Lock)

A number of threads just loop, calling Post(0). Some take the lock; others use TLE with late subscription. It's possible for one of the threads using TLE with late subscription to observe an intermediate and transient "pos" value of 16, by virtue of some thread that holds "Lock" having incremented "pos" from 15 to 16 but not yet reset the variable to 0. The thread using TLE that observed "pos" at 16 will then overwrite "Lock" with 0, so the late subscription check succeeds and the transaction inadvertently commits. This is a case of "wrongful commit" via lock corruption.
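To make the interleaving concrete, here's a sketch of the optimistic path with lazy subscription, using the variables and Acquire/Release from the example above plus RTM intrinsics from immintrin.h (retry policy omitted):

unsigned status = _xbegin() ;
if (status == _XBEGIN_STARTED) {
  int index = pos ++ ;             // may observe the transient value pos == 16
  RingBuffer[index] = m ;          // RingBuffer[16] overwrites the adjacent Lock word with m (0)
  if (pos == 16) pos = 0 ;
  if (Lock != 0) _xabort(1) ;      // lazy subscription : the check passes because Lock was clobbered
  _xend() ;                        // wrongful commit, despite racing with a lock holder
} else {
  Acquire(&Lock) ;                 // pessimistic fallback : run the same body under the lock
  // ... body as in Post(m) above ...
  Release(&Lock) ;
}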

Characterizing which critical sections remain safe under TLE with late subscription remains an interesting research topic. But in the general case it's unsafe.

It's worth pointing out that for weakly ordered processors, the load or loads that constitute the classic early subscription check must have proper fencing. None is needed on x86, of course.

Monday Jul 28, 2014

Hardware extensions to make lazy subscription safe

Hardware extensions to make lazy subscription safe is a follow-on to our WTTM 2014 paper, Pitfalls of Lazy Subscription. We describe a number of hardware approaches to avoid the wrongful commit pathology admitted by generalized transactional lock elision using lazy subscription. Eager (early) subscription remains safe, of course.

Wednesday Jun 11, 2014

PTLQueue : a scalable bounded-capacity MPMC queue

I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPMC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by virtue of the container nodes that -- in some queue implementations -- are used to hold values. A bounded-capacity queue can provide flow control between components. Beware, however, that bounded collections can also result in resource deadlock if abused. The put() and take() operators are partial and wait for the collection to become non-full or non-empty, respectively. Put() and take() do not allocate memory, and are not vulnerable to the ABA pathologies. The PTLQueue algorithm can be implemented equally well in C/C++ and Java.

Partial operators are often more convenient than total methods. In many use cases if the preconditions aren't met, there's nothing else useful the thread can do, so it may as well wait via a partial method. An exception is in the case of work-stealing queues where a thief might scan a set of queues from which it could potentially steal. Total methods return ASAP with a success-failure indication. (It's tempting to describe a queue or API as blocking or non-blocking instead of partial or total, but non-blocking is already an overloaded concurrency term. Perhaps waiting/non-waiting or patient/impatient might be better terms). It's also trivial to construct partial operators by busy-waiting via total operators, but such constructs may be less efficient than an operator explicitly and intentionally designed to wait.

A PTLQueue instance contains an array of slots, where each slot has volatile Turn and MailBox fields. The array has power-of-two length allowing mod/div operations to be replaced by masking. We assume sensible padding and alignment to reduce the impact of false sharing. (On x86 I recommend 128-byte alignment and padding because of the adjacent-sector prefetch facility). Each queue also has PutCursor and TakeCursor cursor variables, each of which should be sequestered as the sole occupant of a cache line or sector. You can opt to use 64-bit integers if concerned about wrap-around aliasing in the cursor variables. Put(null) is considered illegal, but the caller or implementation can easily check for and convert null to a distinguished non-null proxy value if null happens to be a value you'd like to pass. Take() will accordingly convert the proxy value back to null. An advantage of PTLQueue is that you can use atomic fetch-and-increment for the partial methods. We initialize each slot at index I with (Turn=I, MailBox=null). Both cursors are initially 0. All shared variables are considered "volatile" and atomics such as CAS and AtomicFetchAndIncrement are presumed to have bidirectional fence semantics. Finally T is the templated type.
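Concretely, the data layout might look something like the following sketch; the alignment values and field types are illustrative rather than prescriptive:

#include <atomic>
#include <cstdint>

// Sketch of the layout : one queue has Mask+1 slots plus the two cursors.
template <typename T>
struct PTLQueue {
  struct alignas(128) Slot {                       // 128-byte sectors vs adjacent-sector prefetch
    std::atomic<intptr_t> Turn;                    // initialized to the slot's own index I
    std::atomic<T>        MailBox;                 // initialized to null; T assumed pointer-like
  };
  alignas(128) std::atomic<intptr_t> PutCursor;    // initially 0
  alignas(128) std::atomic<intptr_t> TakeCursor;   // initially 0
  Slot *   Slots;                                  // power-of-two length array
  intptr_t Mask;                                   // array length - 1
};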

 

// PTLQueue :

Put(v) : // producer : partial method - waits as necessary
  assert v != null
  assert Mask >= 1 && (Mask & (Mask+1)) == 0    // Document invariants
  // doorway step
  // Obtain a sequence number -- ticket
  // As a practical concern the ticket value is temporally unique
  // The ticket also identifies and selects a slot
  auto tkt = AtomicFetchIncrement (&PutCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // waiting phase :
  // wait for the slot's generation to match the tkt value assigned to this put() invocation.
  // The "generation" is implicitly encoded as the upper bits in the cursor
  // above those used to specify the index : tkt div (Mask+1)
  // The generation serves as an epoch number to identify a cohort of threads
  // accessing disjoint slots
  while s->Turn != tkt : Pause
  assert s->MailBox == null
  s->MailBox = v                                // deposit and pass message

Take() : // consumer : partial method - waits as necessary
  auto tkt = AtomicFetchIncrement (&TakeCursor, 1)
  slot * s = &Slots[tkt & Mask]
  // 2-stage waiting :
  // First wait for our turn in our generation --
  // this acquires exclusive "take" access to the slot's MailBox field.
  // Then wait for the slot to become occupied.
  while s->Turn != tkt : Pause
  // Concurrency in this section of code is now reduced to just 1 producer thread
  // vs 1 consumer thread.
  // For a given queue and slot, there will be at most one Take() operation running
  // in this section.
  // Consumer waits for producer to arrive and make the slot non-empty
  // Extract message; clear mailbox; advance Turn indicator
  // We have an obvious happens-before relation :
  // Put(m) happens-before the corresponding Take() that returns that same "m" -- antecedent
  for
    T v = s->MailBox
    if v != null :
      s->MailBox = null
      ST-ST barrier
      s->Turn = tkt + Mask + 1                  // unlock slot to admit the next producer and consumer
      return v
    Pause


PTLQueue borrows and derives from the Partitioned Ticket Lock "PTL" (US20120240126-A1) and the MultiLane Concurrent Bag (US8689237). The latter is essentially a circular ring-buffer where the elements themselves are queues or concurrent collections. You can think of the PTLQueue as a partitioned ticket lock "PTL" augmented to pass values from lock to unlock via the slots. Alternatively, you could think of PTLQueue as a degenerate MultiLane bag where each slot or "lane" consists of a simple single-word MailBox instead of a general queue. Each lane in PTLQueue also has a private Turn field which acts like the Turn (Grant) variables found in PTL. Turn enforces strict FIFO ordering and restricts concurrency on the slot mailbox field to at most one simultaneous put() and take() operation.

PTL uses a single "ticket" variable and per-slot Turn (grant) fields while MultiLane has distinct PutCursor and TakeCursor cursors and abstract per-slot sub-queues but does not use per-slot Turn variables. Both PTL and MultiLane advance their cursor and ticket variables with atomic fetch-and-increment. PTLQueue borrows from both PTL and MultiLane and incorporates distinct put and take cursors and per-slot Turn fields. Instead of per-slot queues, PTLQueue uses a simple single-word MailBox field. PutCursor and TakeCursor act like a pair of ticket locks, conferring "put" and "take" access to a given slot. PutCursor, for instance, assigns an incoming put() request to a slot and serves as a PTL "Ticket" to acquire "put" permission to that slot's MailBox field.

To better explain the operation of PTLQueue we deconstruct the operation of put() and take() as follows. Put() first increments PutCursor, obtaining a new unique ticket. That ticket value also identifies a slot. Put() next waits for that slot's Turn field to match that ticket value. This is tantamount to using a PTL to acquire "put" permission on the slot's MailBox field. Finally, having obtained exclusive "put" permission on the slot, put() stores the message value into the slot's MailBox. Take() similarly advances TakeCursor, identifying a slot, and then acquires and secures "take" permission on a slot by waiting for Turn. Take() then waits for the slot's MailBox to become non-empty, extracts the message, and clears MailBox. Finally, take() advances the slot's Turn field, which releases both "put" and "take" access to the slot's MailBox. Note the asymmetry : put() acquires "put" access to the slot, but take() releases that lock. At any given time, for a given slot in a PTLQueue, at most one thread has "put" access and at most one thread has "take" access. This restricts concurrency from general MPMC to 1-vs-1. We have 2 ticket locks -- one for put() and one for take() -- each with its own "ticket" variable in the form of the corresponding cursor, but they share a single "Grant" egress variable in the form of the slot's Turn variable. Advancing the PutCursor, for instance, serves two purposes. First, we obtain a unique ticket which identifies a slot. Second, incrementing the cursor is the doorway protocol step to acquire the per-slot mutual exclusion "put" lock. The cursors and operations to increment those cursors serve double-duty : slot-selection and ticket assignment for locking the slot's MailBox field.

At any given time a slot MailBox field can be in one of the following states: empty with no pending operations -- neutral state; empty with one or more waiting take() operations pending -- deficit; occupied with no pending operations; occupied with one or more waiting put() operations -- surplus; empty with a pending put() or pending put() and take() operations -- transitional; or occupied with a pending take() or pending put() and take() operations -- transitional.

The partial put() and take() operators can be implemented with an atomic fetch-and-increment operation, which may confer a performance advantage over a CAS-based loop. In addition we have independent PutCursor and TakeCursor cursors. Critically, a put() operation modifies PutCursor but does not access the TakeCursor and a take() operation modifies the TakeCursor cursor but does not access the PutCursor. This acts to reduce coherence traffic relative to some other queue designs.

It's worth noting that slow threads or obstruction in one slot (or "lane") does not impede or obstruct operations in other slots -- this gives us some degree of obstruction isolation. With respect to progress properties, however, PTLQueue is not lock-free.

The implementation above is expressed with polite busy-waiting (Pause) but it's trivial to implement per-slot parking and unparking to deschedule waiting threads. It's also easy to convert the queue to a more general deque by replacing the PutCursor and TakeCursor cursors with Left/Front and Right/Back cursors that can move either direction. Specifically, to push and pop from the "left" side of the deque we would decrement and increment the Left cursor, respectively, and to push and pop from the "right" side of the deque we would increment and decrement the Right cursor, respectively.

We used a variation of PTLQueue for message passing in our recent OPODIS 2013 paper.

I'll now introduce a variation on the above, but borrow Dmitry Vyukov's encoding for the Turn variable. After a thread deposits a value into the MailBox, it advances Turn by 1 to indicate to Take() operations that a value is present. Turn values follow this trajectory : K, K+1, K+1+Mask, etc. (More precisely, the Turn field for slot I takes on values of the form I + g*(Mask+1) and I + g*(Mask+1) + 1, where g is the generation number). This form is friendlier for a total tryTake() operator. Put() and Take() operators are also more symmetric than in the version above. It also allows null to be passed without substituting proxy values. Like the form above, it can use and benefit from fetch-and-increment instead of CAS for the partial forms. The partial tryTake() operator, however, uses CAS. tryPut() -- not shown -- is analogous to tryTake(). UPDATE: see the comments section for a concern identified by Nitsan.

 

// PTLQueueV2 :

Put(v) : // producer : partial method - waits as necessary
  assert Mask > 1 && (Mask & (Mask+1)) == 0     // Document invariants
  // doorway step
  // Obtain a sequence number -- ticket
  // As a practical concern the ticket value is temporally unique
  // The ticket also identifies and selects a slot
  auto tkt = AtomicFetchIncrement (&PutCursor, 1)
  slot * s = &Slots[tkt & Mask]

  // waiting phase :
  // wait for the slot's generation to match the tkt value assigned to this put() invocation.
  // The "generation" is implicitly encoded as the upper bits in the cursor
  // above those used to specify the index : tkt div (Mask+1)
  // The generation serves as an epoch number to identify a cohort of threads
  // accessing disjoint slots
  while s->Turn != tkt : Pause

  assert s->MailBox == null
  s->MailBox = v                                // deposit and pass message
  s->Turn = tkt + 1                             // mark occupied - release corresponding Take() operation
  // We pass & cede ownership and exclusive access of MailBox to the next Take()
  // That Take() operation will be in the same generation as this Put()

Take() : // consumer : partial method - waits as necessary
  auto tkt = AtomicFetchIncrement (&TakeCursor, 1)
  slot * s = &Slots[tkt & Mask]

  // Wait our turn for this generation and for the slot to become occupied
  while s->Turn != (tkt+1) : Pause

  T v = s->MailBox                              // extract message
  // In a garbage-collected environment we might want to set s->MailBox = null
  s->Turn = tkt + Mask + 1                      // mark unoccupied - unlock slot to admit the next Put()
  // We pass & cede ownership and exclusive access to MailBox to the next Put()
  // That Put() will be in the next generation
  return v

// tryTake() is a "strong" operator in that it can return null IFF
// the queue was empty at some point during the tryTake() invocation.
// Spurious false-null return values are not allowed.
// Note that if we provide tryTake() with the interface below, then null can not
// be a legal message value passed via Put(null). Alternatively, we could allow
// null messages but augment tryTake() to return a success-failure indication as well
// as the value.
// Thanks to Nathan Reynolds for comments on an earlier version of this algorithm.

tryTake() : // consumer : "strong" total method - returns ASAP
  for
    auto tkt = TakeCursor
    slot * s = &Slots[tkt & Mask]
    auto delta = s->Turn - (tkt+1)
    if delta == 0 :
      if CAS (&TakeCursor, tkt, tkt+1) == tkt :
        // our thread has exclusive access to s->MailBox
        T v = s->MailBox
        s->Turn = tkt + Mask + 1                // cede MailBox access to the next Put()
        return v
      continue
    if delta < 0 :
      return null
    // inopportune concurrent interleaving - race :
    // tkt is stale with respect to TakeCursor.
    // Other Take() or tryTake() operations bypassed this operation
    // and raced past.
    // This can happen if we stall after loading TakeCursor, for instance.
    // Just retry.
    assert TakeCursor != tkt


There's quite a bit of related literature in this area.

I'll propose an optimization left as an exercise for the reader. Say we wanted to reduce memory usage by eliminating inter-slot padding. Such padding is usually "dark" memory and otherwise unused and wasted. But eliminating the padding leaves us at risk of increased false sharing. Furthermore, let's say it was usually the case that the PutCursor and TakeCursor were numerically close to each other. (That's true in some use cases). We might still reduce false sharing by incrementing the cursors by some value other than 1 that is not trivially small and is coprime with the number of slots. Alternatively, we might increment the cursor by one and mask as usual, resulting in a logical index. We then use that logical index value to index into a permutation table, yielding an effective index for use in the slot array. The permutation table would be constructed so that nearby logical indices would map to more distant effective indices. (Open question: what should that permutation look like? Possibly some perversion of a Gray code or De Bruijn sequence might be suitable).

As an aside, say we need to busy-wait for some condition as follows : "while C == 0 : Pause". Let's say that C is usually non-zero, so we typically don't wait. But when C happens to be 0 we'll have to spin for some period, possibly brief. We can arrange for the code to be more machine-friendly with respect to the branch predictors by transforming the loop into : "if C == 0 : for { Pause; if C != 0 : break; }". Critically, we want to restructure the loop so there's one branch that controls entry and another that controls loop exit. A concern is that your compiler or JIT might be clever enough to transform this back to "while C == 0 : Pause". You can sometimes avoid this by inserting a call to some type of very cheap "opaque" method that the compiler can't elide or reorder. On Solaris, for instance, you could use : "if C == 0 : { gethrtime(); for { Pause; if C != 0 : break; }}".

It's worth noting the obvious duality between locks and queues. If you have a strict FIFO lock implementation with local spinning and succession by direct handoff, such as MCS or CLH, then you can usually transform that lock into a queue.

If you want a multiple-producer single-consumer MPSC queue then you can replace atomic operations on the TakeCursor with normal updates. More generally you can take a simple SPSC queue and wrap the put() and take() operations with put and take mutexes, restricting concurrency to 1-vs-1, but the performance isn't usually as good as what you'd get from a queue designed for MPSC usage.

Wednesday Apr 30, 2014

SPAA 2014 : Persistent Unfairness Arising From Cache Residency Imbalance

Persistent Unfairness Arising From Cache Residency Imbalance by Dave Dice, Virendra J. Marathe and Nir Shavit will appear in SPAA 2014.

This is the Matthew Effect for caches.

The description of the methodology is necessarily terse in the paper, but we took pains to measure and minimize DRAM channel imbalance and to control for DRAM bank/page locality.

An interesting related paper is Efficient techniques for predicting cache sharing and throughput by Andreas Sandberg, David Black-Schaffer and Erik Hagersten (Uppsala).

Tuesday Apr 29, 2014

malloc for Haswell - Hardware Transactional Memory


The following discussion and measurements are based on Haswell i7-4770 processors.

For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possible indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4K will map to the same index (set) in the L1. The low-order 6 bits of the address presented to the L1 form the offset into the line, and the next higher 6 bits serve as the L1 index. The MMU base page size is 4KB, so there is no overlap between the virtual page number and the index field in a virtual address. The L1 index field passes through address translation verbatim. As such, OS-level page coloring is not in play with respect to the L1. (An advantage of this design is that indexing can commence before the virtual address is translated to a physical address, although we still need the physical address for tag comparison). Some CPUs hash addresses -- usually XORing high-order physical address bits into the index bits -- in order to reduce the odds of index hotspots and index imbalance, but experiments suggest that does not appear to be the case with the i7-4770.
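As a small illustration of the geometry above (a sketch, with the constants hard-coded for the i7-4770 L1):

#include <cstdint>

// Decompose an address for the i7-4770 L1 : 64-byte lines, 64 sets, 8 ways.
static inline unsigned L1Offset(uintptr_t a) { return a & 63; }          // bits 0..5
static inline unsigned L1Index (uintptr_t a) { return (a >> 6) & 63; }   // bits 6..11
// Addresses that differ by a multiple of 4KB (64 lines * 64 sets) map to the same set.

Note that base addresses placed at 1024-byte intervals advance the index by 16 sets each time and therefore land on only 4 of the 64 sets; that observation will matter in the example below.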

Such simple caches -- without the index hashing mentioned above -- can be vulnerable to excessive index conflicts, but malloc allocators can be made index-aware (local copy) to mitigate and reduce the frequency of index conflicts. Index imbalance results in underutilization of the cache. Some indices will be "cold" (less frequently accessed) while others are "hot" and thus incur relatively higher miss rates. It's worth pointing out that most application/allocator combinations don't exhibit excessive index conflicts, but for those that do, the performance impact can be significant. An index-aware allocator can act to "immunize" an application against some common cases of index-imbalance while typically incurring no additional cost over index-oblivious allocators. Think of the index-aware allocator as cheap insurance against a rare but painful performance disorder. The paper above describes an index-aware allocator designed for the L1 in a SPARC T2+ processor, but it's trivial to change a few cache geometry constants and retarget the allocator to the i7-4770.

The "CIA-Malloc" (Cache-Index Aware) allocator described in the paper has a number of other useful design properties. It also happens to be NUMA-friendly and large-page-friendly. Underlying pages are allocated on the node where the malloc() was invoked. Put another way, the pages underlying a block returned by malloc() will typically reside on the node where the malloc() was invoked. The allocator is also scalable with very little internal lock contention or coherence traffic. Each per-CPU sub-heap has a private lock -- the only time we'll encounter contention is via migration or preemption, which is relatively rare. The critical sections are also constant-time and very short. We also make heavy use of trylock(), so if a thread is obstructed it can usually make progress by reverting to another data structure. Remote free() operations are lock-free. Critically, the allocator acts to reduce the cost of malloc() and free() operations as well as the cost to the application when accessing blocks allocated via malloc(). The allocator is also designed specifically to reduce common cases of false sharing : allocator metadata-vs-metadata; metadata-vs-block; and inter-block block-vs-block. Metadata-vs-metadata sharing and false sharing is reduced by using per-CPU sub-heaps. False sharing arising between adjacent data blocks -- blocks returned by malloc() -- is addressed by placement and alignment. These attributes will prove even more useful when we use CIA-Malloc in conjunction with hardware transactions.

The i7-4770 provides hardware transactional memory (HTM). For the purposes of discussion we'll assume we're using TSX-RTM for the purposes of transactional lock elision (TLE). The critical section body contains unmodified HTM-oblivious legacy code that expects to run under the lock in the usual fashion, but via TLE we can modify the lock implementation to attempt optimistic execution, reverting to the lock only as necessary. The i7-4770's HTM implementation tracks the transactional write-set in the L1 and the read-set in the L1 and L2. It uses a requester-wins conflict resolution strategy implemented via the MESIF coherence protocol. At most a single cache can have a given line in M/E state at any one time -- a classic multiple-reader single-writer model. Eviction or invalidation of a tracked cache line results in a transactional abort. For example if a transaction on CPU C loads address A, and some other CPU writes A before C commits, the write will invalidate the line from C's cache and cause an abort. Similarly, if C stores into A and some other CPU loads or stores into A before C commits, the invalidation of A will cause C's transaction to abort. Read-write or write-write sharing on locations accessed within a transaction results in coherence invalidation and consequent abort. (The HTM implementation in the i7-4770 shares quite a few aspects with Sun's experimental ROCK processor).

In addition to coherence traffic, self-displacement via conflict misses can also result in aborts. This is where a CIA-Malloc allocator may provide benefit relative to other allocators. Normally an index-aware allocator is expected to reduce conflict misses arising from index-imbalance, but it can also reduce transactional aborts caused by eviction of read-set or write-set entries from index conflicts. Aborts are usually far more expensive than simple cache misses. (Absent any potential benefit from warming up of caches, aborts are pure wasted and futile effort).

Let's take an actual example. The following data was collected on an i7-4770 running Ubuntu 14.04. We use a simple C single-threaded benchmark that uses malloc() to individually allocate a set of 250 nodes, and then arranges those nodes into a circular intrusive singly linked list. The benchmark was compiled with gcc 4.8.2 using the x32 ABI. The node structure has a "next" field at offset 0 followed by a volatile integer "w" field. A command-line switch gives us the ability to specify the effective size of the node as passed to malloc(). Since there may be a correlation between allocation order and virtual address, we randomize the order of the nodes with a Fisher-Yates shuffle in order to minimize the impact of automatic hardware stride-based prefetchers. (Such a randomized order can put stress on the TLBs with lots of page crossings as we traverse the list, but that's not the dominant performance issue for the cases we'll discuss). We then report the time needed to complete 10000000 steps of the following loop body :

a->w = 0 ; a = a->next
If we use an effective node size of 950 bytes, then the default glibc malloc() allocator places our nodes at 960-byte intervals (1024-64) and each step of the loop requires 2.1 nsecs. When we increase the node size to 1010 the interval is 1024 bytes and each step takes 8.1 nsecs. If we further increase the node size to 1080 bytes then the interval is 1088 bytes (1024+64) and the time drops back to 2.1 nsecs. The performance drop at 1010 bytes was caused by the 1024-byte placement interval. The base addresses of our 250 nodes resided on just 4 of the 64 possible indices, so we grossly underutilized the L1. This nicely illustrates index conflicts arising from index-oblivious allocator placement policies. An index-aware allocator will avoid this scenario.

Now we'll extend our benchmark to use RTM hardware transactions. Each traversal of the ring is wrapped in an RTM XBEGIN-XEND transaction and we'll measure and report success rates. This is intended to model TLE or "naked" use of RTM transactions. We keep the ring circumference at 250 nodes, so each transaction will iterate over that many elements. With nodes of 960 bytes the failure rate is 0.145%. Increasing the node size to 1010 bytes, the failure rate becomes 100%. (We have no progress). But if we bump the size to 1080 bytes then the failure rate drops back to 0.2%. The complete and persistent failure at 1010 bytes was caused by elements of the write-set being evicted and the consequent aborts. But if we use CIA-Malloc modified for the i7-4770 we can avoid such abrupt performance inflections.

To recap, an index-aware allocator can help avoid performance pathologies for normal non-transactional code as well as improving the success rate of hardware transactions. Specifically, conflict misses become aborts in transactions, and aborts are more expensive than normal misses.

Ideally, an HTM-friendly allocator will satisfy the desiderata enumerated in Cache-Index Aware Memory Allocation and also act to reduce the abort rate. The following properties are desirable:


  • malloc() and free() should be callable from within transactions with low odds of abort
  • memory accesses within transactions to blocks returned by malloc() are less prone to abort; specifically we want to reduce conflict misses and coherence invalidation from sharing, both of which cause aborts.
      Sharing and false-sharing:

    • allocator metadata-vs-blocks
    • allocator metadata-vs-metadata : we note that internal lock contention or promiscuous shared locks internal to the allocator are a special case of metadata-vs-metadata sharing that can cause aborts.
    • block-vs-block : inter-block false sharing. Application code can also use explicit memalign() and padding to reduce the odds of inter-block false sharing. An allocator or application that increases alignment to avoid false sharing may also cause quantization of block sizes and increased internal fragmentation.

  • malloc() and free() operations are scalable. Highly scalable allocators also tend to be transaction-friendly as they avoid or reduce the use of shared global mutable data.

As a side note, under a requester-wins conflict resolution strategy, to the extent possible and reasonable it's a good idea to shift stores of frequently accessed shared variables toward the end of a transaction. You can do this by hand, or a transaction-aware compiler or JIT can perform some of the transformations. The modCount field in Java's Hashtable is a canonical example of an update that should be shifted. Shifting reduces the window of vulnerability where the store resides in the transaction's write-set. But the asymmetry in the i7-4770, where the write-set is tracked in the L1 and the read-set in the L1 and L2, gives us yet another reason to shift stores toward the end of a transaction. Consider a transaction that executes a store followed by a large number of loads. Those loads may displace the store and cause an abort. But if we shift the store to the end of the transaction, the same set of accesses (just reordered) can succeed without abort. The store may displace a loaded line from the L1, but the L2 can still track the line.

Finally, when a given size-class is index-unfriendly, we can use the punctuated array approach as described in the original CIA-Malloc paper. A more radical approach is to intentionally and explicitly pick size-classes that are prime multiples of the cache line size. This helps to reduce inter-size-class index conflicts.

Another approach to index-aware allocation (other than friendly size-classes or punctuated arrays) is to simply intercept malloc(S) calls and run a Bernoulli trial to decide if we want to add 64 to S. The probability P should be low. As such, we'll occasionally, on a random basis, add 64 to S. A slightly more elaborate variation will make P a function of S, increasing the odds for larger S values.
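A sketch of that interception; the probability, the PRNG, and the wrapper name are illustrative, and NextRandom() could be any cheap thread-local PRNG, such as the one in the March 2014 entry below:

#include <cstdlib>

extern int NextRandom();                 // cheap thread-local PRNG (see the PRNG entry below)

// Occasionally pad the request by one cache line to perturb placement.
void * PaddedMalloc(size_t S) {
  if ((NextRandom() & 15) == 0) S += 64; // P = 1/16 here; a fancier variant makes P a function of S
  return malloc(S);
}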

Tuesday Mar 11, 2014

A simple PRNG construction idiom


The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less than 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNG is one of my personal favorites. If I need better quality I'll step up to Ziff's four tap or the Mersenne twister.


One variation of Marsaglia's xor-shift generator has only one word of state, a period of 4G-1, and requires just 3 shifts and 3 XOR operations to generate a new value. 0 is an absorbing state that we avoid. See MarsagliaNext(), below. Ignoring 0, the trajectory or stream of values forms a cycle -- conceptually a ring. The initialization and seeding operation should act to place different threads at different positions on that ring. In a sense the ring is shared by all threads but we start the threads at different points. Unfortunately, we can sometimes find that 2 different threads are at about the same position at about the same time through simple bad luck, and thus generate similar streams of values. (Longer PRNG cycles reduce the odds of such scenarios, of course). Ideally, to avoid such inopportune and undesirable behavior, each thread would have its own private ring of values.


A simple approach that is tantamount to giving each thread its own ring -- trajectory of values -- is shown below in NextRandom(). Hash32() is a hash function, which we'll describe later. Note that we explicitly "color" the value passed into Hash32() with the address of the thread-local PRNG state. Recall that at any one time, such an address will be associated with at most one thread, and is thus temporally unique. This approach gives us additional independence over concurrently executing threads. It also makes NextRandom() self-seeding -- explicit initialization is not required.


The Hash32() hash function is the key to this particular PRNG, and its implementation directly embodies the trade-off between performance and the quality of the resulting distribution. Conceptually, you could think of Hash32() as representing a randomized cycle with a period of 4G. We might implement Hash32() via a strong cryptographic hash, such as SHA-3 or MD5. Those hash functions would work perfectly well, but they're somewhat expensive, so we'd prefer a cheaper hash. Critically, we want the hash function to exhibit a high degree of avalanche effect. (A good encryption function could also be used as a hash operator). Some cheaper candidates for the hash function include: MurmurHash; CityHash; the FNV hash family; siphash; and the Jenkins hash.


Doug Lea and Guy Steele invented some of the best candidates for our hash function: see Mix32() and Mix64() below. These are relatively cheap but do well on avalanche tests and strike a reasonable balance between quality and performance. Beware that they're obviously not cryptographically strong. Mix32() and Mix64() are related to mix functions found in java.util.SplittableRandom.


Update 2014-03-13 : Thanks to Paul Sandoz for pointing out Stafford's mix function.


static int32_t MarsagliaNext () {
  static __thread int32_t rv = 0 ;
  if (rv == 0) rv = GenerateSeed() ;
  int32_t v = rv ;
  v ^= v << 6 ;
  v ^= uint32_t(v) >> 21 ;
  v ^= v << 7 ;
  rv = v ;
  return v ;
}

static int NextRandom() {
  // assumes a 32-bit environment : ILP32
  // Instead of incrementing by 1 we could also
  // increment by a large prime or by the MIX constant
  // if we're using Mix32() as our hash function.
  static __thread int rv = 0 ;
  return Hash32 (int(&rv) + (++rv)) ;
}

// Mix32 and Mix64 are provided courtesy of Doug Lea and Guy Steele

static int32_t Mix32 (int32_t z) {
  static const int32_t MIX = 0x9abe94e3;
  // Another workable constant is 0xa0ca9b6b
  z = (z ^ (uint32_t(z) >> 16)) * MIX;
  z = (z ^ (uint32_t(z) >> 16)) * MIX;
  return z ^ (uint32_t(z) >> 16);
}

static int64_t Mix64 (int64_t z) {
  static const int64_t MIX = 0xdaba0b6eb09322e3LL;
  z = (z ^ (uint64_t(z) >> 32)) * MIX;
  z = (z ^ (uint64_t(z) >> 32)) * MIX;
  return z ^ (uint64_t(z) >> 32);
}



MSPC 2014 submission deadline has been extended until April 4th

See the workshop web site for details.

Monday Mar 10, 2014

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore, let's say you're willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But let's assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsaglia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.
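In code, the fixed-point reduction is just the following (a sketch; the helper name is mine):

#include <cstdint>

// Map a 32-bit random value v into [0,N) using 10 fractional bits; only 1024
// distinct outcomes are possible, and large N risks overflow in the 32-bit multiply.
static inline uint32_t NextIntScaled(uint32_t v, uint32_t N) {
  return ((v & 0x3FF) * N) >> 10;
}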

Another approach is to borrow an idea from rejection sampling. For discussion let's assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trials until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.
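A sketch of the rejection-sampling form; MarsagliaNext() refers to the xor-shift PRNG from the March 2014 entry below, and the bit-smearing idiom computes M-1 in constant time:

#include <cstdint>

extern int32_t MarsagliaNext();           // xor-shift PRNG from the PRNG entry below

// Return a value uniformly distributed in [0,N), N > 0, without MOD or DIV.
uint32_t NextIntReject(uint32_t N) {
  uint32_t m = N - 1;                     // round N up to a power of two, minus one
  m |= m >> 1;  m |= m >> 2;  m |= m >> 4;
  m |= m >> 8;  m |= m >> 16;             // m == M-1 where M is the least power of 2 >= N
  for (;;) {
    uint32_t v = (uint32_t) MarsagliaNext() & m;   // candidate in [0,M)
    if (v < N) return v;                  // accept; otherwise reject and retry
  }
}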

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).

Monday Feb 24, 2014

MSPC 2014

MSPC 2014 will be co-located with PLDI 2014 in Edinburgh. The paper due-date is March 10th.

Tuesday Jan 28, 2014

Selective Core Boosting: The Return of the Turbo Button

Available via TU Dresden or locally.

Friday Jan 24, 2014

Do you use sun.misc.Unsafe ?

Take the following survey on API usage to help JDK planning. (Yes, we're not supposed to use it directly. Caveat Utor. But many folks do, including myself.)
