Fence Strength

A memory model discussion about fence strength that spans 2 or more days or two or more participants strongly suggests that the most conservatively strong fencing should be used.    

Friday, September 29, 2017 | Read More

Towards an Efficient Pauseless Java GC with Selective HTM-Based Access Barriers - in ManLang 2017

Towards an Efficient Pauseless Java GC with Selective HTM-Based Access Barriers appears in ManLang 2017 -- formerly PPPJ  (http://d3s.mff.cuni.cz/conferences/manlang17/).   Abstract The garbage collector (GC) is a critical component of any managed runtime environment (MRE), such as the Java virtual machine. While the main goal of the GC is to simplify and automate memory management, it may have a negative impact on the application performance, especially on multi-core systems....

Friday, September 15, 2017 | Read More

Malthusian Locks

Malthusian Locks appears in EuroSys 2017. An extended version is in arxiv.Abstract: Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse, which can manifest even when there are fewer ready threads than available cores. Often, such software also has highly contended locks. We leverage the existence of such locks by modifying the...

Wednesday, March 15, 2017 | General | Read More

Transactional Pointers: Experiences with HTM-Based Reference Counting in C++

Transactional Pointers: Experiences with HTM-Based Reference Counting in C++ by Maria Carpen-Amarie , Dave Dice, Gaël Thomas and Pascal Felber appeared in NETYS 2016. Apologies -- the paper is paywalled.AbstractThe most popular programming languages, such as C++ or Java, have libraries and data structures designed to automatically address concurrency hazards in order to run on multiple threads. In particular, this trend has also been adopted in the memory management domain....

Tuesday, September 20, 2016 | General | Read More

An Inflatable SeqLock

The following is inspired by some discussions with John Rose on how we might implement atomic access to value type "tuples" for subsequent JDK releases. As I understand it, the basic idea is to have something close C++'s atomic<POD> facility. Since reading (loading) a value type is expected to be the dominant mode of access, we'd like to use something like SeqLocks so that readers don't have to write to any shared locations. (See also). But at the same time we'd like writers...

Wednesday, August 10, 2016 | General | Read More

Mitigating the Java nanoTime coherence hotspot

Java's nanoTime() API guarantees a monotonic (really, non-retrograde) relative clock source. It's also expected to be causal in the following sense. Say thread A calls nanoTime(), gets value V, and then stores V into memory. Some other thread B then observes A's store of V, then calls nanoTime() itself and gets W. We expect W should be greater than or equal to V.In an ideal world the clock sources underlying nanoTime() would be causal. But that's not always the case on all...

Monday, July 18, 2016 | General | Read More

Preemption tolerant MCS locks

A simple test-and-set based spin lock is a reasonable choice when contention is nil or low. Lock handover is relatively efficient and there's no need to maintain a list of waiting threads, yielding a simple design. Under higher contention, queue-based such as MCS often provide better throughput and be a better choice. Classic MCS locks are strictly FIFO queue-based locks that use local spinning. (Clever lock implementations can try to adaptively select the best algorithm...

Saturday, July 16, 2016 | General | Read More

Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures

Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures by Dave Dice, Maurice Herlihy and Alex Kogan, in ISMM 2016Abstract:Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as...

Wednesday, May 4, 2016 | General | Read More

Refined Transactional Lock Elision

Refined Transactional Lock Elision in PPoPP 2016 by Dave Dice, Alex Kogan and Yossi Lev.AbstractTransactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce con- currency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hard- ware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables...

Saturday, January 16, 2016 | General | Read More

JNI performance - false sharing on the "-UseMembar" serialization page

For background on the membar elision techniques and the serialization page, see the following: 7644409; Asymmetric Dekker Synchronization; and QPI Quiescence. On normal x86 and SPARC systems these are strictly local latency optimizations (because MEMBAR is a local operation) although on some systems where fences have global effects, they may actually improve scalability. As an aside, such optimizations may no longer be profitable on modern processors where the cost of fences...

Tuesday, November 17, 2015 | General | Read More

Dekker’s mutual exclusion algorithm made RW-safe

Dekker’s mutual exclusion algorithm made RW-safe by Peter Buhr, Dave Dice and Wim H. Hesselink appears in CCPE 2015. (Sorry, it's pay-walled).

Wednesday, September 30, 2015 | General | Read More

Evaluating HTM for pauseless garbage collectors in Java

Evaluating HTM for pauseless garbage collectors in Java by Maria Carpen-Amarie, Dave Dice, Patrick Marlier, Gaël Thomas and Pascal Felber appeared in The 13th IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA-15).Abstract:While garbage collectors (GCs) significantly simplify programmers’ tasks by transparently handling memory management, they also introduce various overheads and sources of unpredictability. Most importantly, GCs...

Wednesday, September 30, 2015 | General | Read More

Malthusian Locks

Arxiv version

Saturday, September 26, 2015 | General | Read More

Locks with LIFO admission order

Why would we ever want a lock with a LIFO admission policy?First, a LIFO lock provides a useful measure of real-world scalability. Lets say we have a set of threads that each iterate as follows : acquire some lock L; execute a fixed-length critical section body of duration C; release L; and finally execute a non-critical section of length N. We run T threads concurrently, and at the end of a measurement interval we report the total number of iterations completed, as well as...

Thursday, April 23, 2015 | General | Read More

waiting policies for locks : spin-then-park

I thought I'd collect a few notes on the topic of waiting policies in the context of locks. Lock algorithms in the literature are usually implemented via unbounded spinning. That's fine for an academic paper, but in practice it's almost never viable to use unbounded spinning. More typically, we'll use a spin-then-park policy where the spin phase is bounded, and a thread reverts to parking as necessary. In theory we can also yield occasionally while spinning, but yielding...

Monday, April 20, 2015 | General | Read More

Cohort Locks in ACM TOPC

Lock Cohorting: A General Technique for Designing NUMA Locks appears in ACM Transactions on Parallel Computing (TOPC) Volume 1 Issue 2, January 2015. (local copy -- apologies for gzip-ed pdf but the file size exceeded Oracle blog limits). This is an extended journal version of our PPoPP 2012 conference paper. Since the PPoPP version we've found ways to make the lock friendlier for Intel MESI/MESIF-based systems. Our preferred form now uses a partitioned ticket lock for the...

Friday, February 20, 2015 | General | Read More

Using reader-writer locks to improve hardware TLE : TLE-RW

At its most simple, traditional hardware TLE (Transactional Lock Elision) operates as follows : start a hardware transaction; "subscribe" to the lock state (committing or aborting if the lock is found to be held); execute the critical section; and finally commit at the end of the critical section. For the purposes of discussion I'm assuming an Intel TSX-RTM implementation of hardware transactional memory (HTM). If we can't make progress using the optimistic transactional mode...

Wednesday, January 28, 2015 | General | Read More

Blizzard Preparations

3 cars in a 2-car garage

Tuesday, January 27, 2015 | General | Read More

AVIS Frankfurt

I recently rented a car from AVIS at Frankfurt for travel to Dagstuhl. We drove to Dagstuhl, left it parked for 5 days, and returned without incident. On return, the attendant -- very pleasant, btw -- found 'damage' to the front driver's side rim and wanted to know details about the 'accident' . There was no accident while the car was in my hands. I related such to the attendant, but didn't make any progress. AVIS tacked an additional 147€ change on my corporate credit card....

Monday, January 26, 2015 | General | Read More

Seen in St. John

I originally mistook this as "Dave".

Wednesday, January 14, 2015 | General | Read More

Just back from Dagstuhl

Concurrent Computing in the Many-core Era 2015 - Schloss Dagstuhl. As usual, a productive and invigorating week. (All talks My slides). Conference publication We also visited the Völklingen Ironworks. (wikipedia).

Wednesday, January 14, 2015 | General | Read More

Measuring short-term fairness for locks

In a companion blog entry I discussed ways to measure or characterize the long-term fairness of lock implementations. Measuring short-term fairness is also interesting. We could easily have a lock, such as cohort locks, that are fair over long intervals but intentionally unfair over the short-term. It's useful to have ways to define exactly how unfair a lock is over some interval. We'll assume the lock admission policy is work-conserving.It's worth mentioning that measuring...

Monday, December 29, 2014 | General | Read More

Measuring long-term fairness for locks

Lets say we have N concurrent and homogenous threads that each loop as follows : acquire a central lock; execute a critical section; release the lock; execute a non-critical section. We start the threads at the same time. At the end of the measurement interval we have N data points : the number of iterations completed by each thread. This is the so-called fixed-time-report-workbenchmark methodology. Aggregate throughput is measured by the sum of the data points, but we're...

Monday, December 29, 2014 | General | Read More

No blocking parking

Taken at lovely Gillette Stadium; possibly amusing to synchronization practitioners. "Easy exit" sounds interesting as well.

Tuesday, December 2, 2014 | General | Read More

Synchronization horror story of the day

While evaluating some new lock algorithms (exposed via LD_PRELOAD interposition on the pthread_lock and condvar interfaces), I noticed that C++ catch and throw on Solaris/SPARC compiled with GCC 4.9.0 will take runtime paths that take locks and impede scaling. Specifically, if you have a set of threads that simply execute "try { throw 20; } catch (int e) {;}", they won't scale as expected because of runtime locks. Unwind_RaiseException and Unwind_Find_FDE always appear on the...

Thursday, October 9, 2014 | General | Read More

A simple lazy subscription pathology

Following up on The Pitfalls of lazy subscription, I thought I'd provide a simple case that illustrates where transactional lock elision (TLE) with lazy subscription can fail. I've managed to reproduce the failure in-house on an i7-4770 (haswell).Say we have a ring-buffer that’s used for logging. It just records the last 16 messages posted. (This is kind of common for “black box flight recorders” that appear in kernels and the JVM). The memory layout is : int pos; intptr_t...

Tuesday, July 29, 2014 | General | Read More

Hardware extensions to make lazy subscription safe

Hardware extensions to make lazy subscription safe is a follow-on to our WTTM 2014 paper, Pitfalls of Lazy Subscription . We describe a number of hardware approaches to avoid the wrongful commit pathology admitted by generalized transactional lock elision using lazy subscription. Eager (early) subscription remains safe, of course.

Monday, July 28, 2014 | General | Read More

PTLQueue : a scalable bounded-capacity MPMC queue

I've used the following concurrent queue algorithm enough that it warrants a blog entry. I'll sketch out the design of a fast and scalable multiple-producer multiple-consumer (MPMC) concurrent queue called PTLQueue. The queue has bounded capacity and is implemented via a circular array. Bounded capacity can be a useful property if there's a mismatch between producer rates and consumer rates where an unbounded queue might otherwise result in excessive memory consumption by...

Wednesday, June 11, 2014 | General | Read More

SPAA 2014 : Persistent Unfairness Arising From Cache Residency Imbalance

Persistent Unfairness Arising From Cache Residency Imbalance by Dave Dice, Virendra J. Marathe and Nir Shavit will appear in SPAA 2014. This is the Matthew Effect for caches. The description of the methodology is necessarily terse in the paper, but we took pains to measure and minimize DRAM channel imbalance and to control for DRAM bank/page locality. An interesting related paper is Efficient techniques for predicting cache sharing and throughput by Andreas Sandberg, David...

Wednesday, April 30, 2014 | General | Read More

malloc for Haswell - Hardware Transactional Memory

Invisible messageI've been looking into "malloc" dynamic storage allocators targeted specifically at Intel Haswell i7-4770 processors. For background, the i7-4770 has relatively simple cache geometry. The L1 (level-1 cache) is 32KB with 64-byte lines, is physically tagged, and is 8-way set-associative. There are 64 possibly indices (sets). As such the cache page size is 4KB -- addresses that differ by an integer multiple of 4K will map to the same index (set) in the L1. The...

Tuesday, April 29, 2014 | General | Read More

A simple PRNG construction idiom

The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less that 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNGis one of my personal favorites. If I...

Tuesday, March 11, 2014 | General | Read More

MSPC 2014 submission deadline has been extended until April 4th

See the workshop web site for details.

Tuesday, March 11, 2014 | General | Read More

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point. A good starting point might be Marsaglia's...

Monday, March 10, 2014 | General | Read More

MSPC 2014

MSPC 2014 will be co-located with PLDI 2014 in Edinburgh. The paper due-date is March 10th.

Monday, February 24, 2014 | General | Read More

Do you use sun.misc.Unsafe ?

Take the following survey on API usage to help JDK planning. (Yes, we're not supposed to use it directly. Caveat Utor. But many folks do, including myself.)

Friday, January 24, 2014 | General | Read More

Jam Jar Jet

A simple working model of a valveless pulse jet engine inspired by an article in Make magazine. I found our worked better without the diffuser. Read the full article before you try this at home. Sometimes our blog systems does not allow embedded video, so just in case you can use this direct link.

Friday, May 31, 2013 | General | Read More

NUMA-Aware Reader-Writer Locks

NUMA-Aware Reader-Writer Locks will appear in PPoPP 2013. As a side note, it seems Vyukov's distributed reader-write lock is similar to the concept of sub-latches and super-latches found in some database implementations. US8966491NUMA-aware reader-writer locks Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, Nir Shavit PPoPP '13 Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2013 frames are...

Monday, January 7, 2013 | General | Read More

Java @Contended annotation to help reduce false sharing

See this posting by Aleksey Shipilev for details -- @Contended is something we've wanted for a long time. The JVM provides automatic layout and placement of fields. Usually it'll (a) sort fields by descending size to improve footprint, and (b) pack reference fields so the garbage collector can process a contiguous run of reference fields when tracing. @Contended gives the program a way to provide more explicit guidance with respect to concurrency and false sharing. Using...

Friday, November 23, 2012 | General | Read More

NUMA-aware placement of communication variables

For classic NUMA-aware programming I'm typically most concerned about simple cold, capacity and compulsory misses and whether we can satisfy the miss by locally connected memory or whether we have to pull the line from its home node over the coherent interconnect -- we'd like to minimize channel contention and conserve interconnect bandwidth. That is, for this style of programming we're quite aware of where memory is homed relative to the threads that will be accessing it....

Thursday, November 22, 2012 | General | Read More

C-states and P-states : confounding factors for benchmarking

I was recently looking into a performance issue in the java.util.concurrent (JUC) fork-join pool framework related to particularly long latencies when trying to wake (unpark) threads in the pool. Eventually I tracked the issue down to the power & scaling governor and idle-state policies on x86. Briefly, P-states refer to the set of clock rates (speeds) at which a processor can run. C-states reflect the possible idle states. The deeper the C-state (higher numerical values) the...

Tuesday, November 6, 2012 | General | Read More

NUMA-aware constructs for java.util.concurrent

The constructs in the java.util.concurrent JSR-166 "JUC" concurrency library are currently NUMA-oblivious. That's because we currently don't have the topology discovery infrastructure and underpinnings in place that would allow and enable NUMA-awareness. But some quick throw-away prototypes show that it's possible to write NUMA-aware library code. I happened to use JUC Exchanger as a research vehicle. Another interesting idea is to adapt fork-join work-stealing to favor...

Thursday, October 25, 2012 | General | Read More

Performance triage

Folks often ask me how to approach a suspected performance issue. My personal strategy is informed by the fact that I work on concurrency issues. (When you have a hammer everything looks like a nail, but I'll try to keep this general). A good starting point is to ask yourself if the observed performance matches your expectations. Expectations might be derived from known system performance limits, prototypes, and other software or environments that are comparable to your...

Thursday, October 25, 2012 | General | Read More

TXPAUSE : polite waiting for hardware transactional memory

Classic locks are an appropriate tool to prevent potentially conflicting operations A and B, invoked by different threads, from running at the same time. In a sense the locks cause either A to run before B or vice-versa. Similarly, we can replace the locks with hardware transactional memory, or use transactional lock elision to leverage potential disjoint access parallelism between A and B. But often we want A to wait until B has run. In a Pthreads environment we'd usually...

Thursday, October 25, 2012 | General | Read More

Polite busy-waiting with WRPAUSE on SPARC

Unbounded busy-waiting is an poor idea for user-space code, so we typically use spin-then-block strategies when, say, waiting for a lock to be released or some other event. If we're going to spin, even briefly, then we'd prefer to do so in a manner that minimizes performance degradation for other sibling logical processors ("strands") that share compute resources. We want to spin politely and refrain from impeding the progress and performance of other threads &#8212...

Wednesday, October 24, 2012 | General | Read More

Thread placement policies on NUMA systems - update

In a prior blog entry I noted that Solaris used a "maximum dispersal" placement policy to assign nascent threads to their initial processors. The general idea is that threads should be placed as far away from each other as possible in the resource topology in order to reduce resource contention between concurrently running threads. This policy assumes that resource contention -- pipelines, memory channel contention, destructive interference in the shared caches, etc -- will...

Tuesday, July 3, 2012 | General | Read More

HCLH Locks - an additional constraint

HCLH Locks are NUMA-friendly hierarchical queue-based CLH locks. Subsequently, we've discovered that for correctness threads must be bound 1:1. Absent such binding the locks are vulnerable to exclusion failure related to node circulation & lifecycle issues. That particular invariant wasn't made clear in the HCLH paper.

Friday, May 25, 2012 | General | Read More

Foxboro says "No Dice"

Tuesday, March 20, 2012 | General | Read More

Lock Cohorting - to appear in PPoPP 2012

Lock Cohorting: A General Technique for Designing NUMA Locks by Dave Dice, Virendra Marathe and Nir Shavit in PPoPP 2012. The Cohort lock technique allows developers to construct NUMA-aware locks from NUMA-oblivious locks.

Thursday, December 22, 2011 | General | Read More

Call for papers : Transact 2012

Title: 7th Workshop on Transactional Computing Deadline: December 1, 2011 Webpage: http://transact2012.cse.lehigh.edu Workshop: February 26, 2012 Contact: spear@cse.lehigh.edu Location: New Orleans, Louisiana, USA (with PPoPP 2012) Synopsis:TRANSACT 2012 is a forum for the presentation of research on all aspects of transactional computing. The scope of the workshop is intentionally broad, with the goal of encouraging interaction across the languages, architecture, systems,...

Thursday, November 3, 2011 | General | Read More

MONITOR-MWAIT for spin loops

Some ramblings regarding the use of MONITOR-MWAIT for busy-wait loops.

Tuesday, November 1, 2011 | General | Read More

Make your ambition match your education

Seen in the DC Metro system in ads for a certain institution of higher education. How sad.

Monday, August 1, 2011 | General | Read More

Inverted schedctl usage in the JVM

The schedctl facility in Solaris allows a thread to request that the kernel defer involuntary preemption for a brief period. The mechanism is strictly advisory - the kernel can opt to ignore the request. Schedctl is typically used to bracket lock critical sections. That, in turn, can avoid convoying -- threads piling up on a critical section behind a preempted lock-holder -- and other lock-related performance pathologies. If you're interested see the man pages for...

Tuesday, June 21, 2011 | General | Read More

MultiLane : a concurrent blocking multiset

MultiLane : a concurrent blocking multiset appeared in SPAA 2011. It's rather trivial to extend the general idea and construct a deque instead of a queue-like construct.

Tuesday, June 14, 2011 | General | Read More

Partitioned Ticket Lock

Partitioned Ticket Lock appeared in SPAA 2011. When the number of "lanes" is just 1, a partitioned ticket lock is just a degenerate instance of a classic ticket lock. We can easily provide K-exclusion by setting the number of lanes to K or larger and initialization the 1st K slots to 0 through K-1. For instance if we have 8 lanes and K=3, the the slots would be initialized to 0,1,2,0,0,0,0,0 respectively.Brief announcement: a partitioned ticket lock David DiceSPAA '11...

Tuesday, June 14, 2011 | General | Read More

Trends in processor count and memory growth

Of late system address space has been growing at a rate of between 1 and 1.5 addressable bits per year. Interestingly, though, we're now adding processors to systems faster than we're adding memory. That is, processors/GB RAM is currently increasing. If that trend holds true and programming models remain the same, then, arguably, synchronization will become even more important as more processors will be coordinating access to a given amount of memory. Proper fine-grained...

Friday, June 10, 2011 | General | Read More

Writing Musical History

Two sections of American Popular Music at Bridgewater State University are discussing the music of the last decade. The textbook stops around 1999. They are looking for your help. Imagine that it is 10 years from now and you must update the American popular music textbook to include a chapter on the years 2000-2011. What artists/groups do you feel absolutely must be in that chapter? The survey takes about a minute to complete.

Saturday, April 16, 2011 | General | Read More

Flat-Combining NUMA Locks

Flat-Combining NUMA Locks to appear in SPAA 2011.

Thursday, April 14, 2011 | General | Read More

Cache index-aware memory allocation

Cache Index-Aware Memory Allocation to appear in International Symposium on Memory Managment (ISMM) 2011. As a consequence of the design, this "malloc" allocator is also NUMA-aware and NUMA-friendly assuming the system uses the usual default first-touch NUMA page placement policy. Cache index-aware memory allocation Yehuda Afek, Dave Dice, Adam Morrison ISMM '11 Proceedings of the international symposium on Memory management, 2011 frames are not supported

Tuesday, April 5, 2011 | General | Read More

crazy concurrency

Crazy Concurrency -- high-end nerdcore.

Tuesday, March 1, 2011 | General | Read More

atomic fetch-and-add vs compare-and-swap

There are a number of cases where an atomic fetch-and-add instruction might yield better performance than the classic Load;Φ;CAS loop, where CAS is the atomic compare-and-swap instruction. The x86 architecture exposes LOCK:XADD which I'll use in the discussion below. (If you don't need the original value you can also use LOCK:INC or LOCK:ADD instead of LOCK:XADD). CAS is "optimistic" and admits failure, whereas XADD does not. With XADD there's no explicit window of...

Tuesday, February 15, 2011 | General | Read More

False sharing induced by card table marking

Garbage-collected runtime environments frequently use card tables in conjunction with write barriers to accelerate root-scanning during garage collection. (See A Fast write barrier for generational garbage collectors by Urs Holzle, ECOOP-OOPSLA 1993 for details). Briefly, and skipping a few details, this design partitions the heap into an array of power-of-two sized card pages. When a mutator stores into a reference field of an object the runtime write barrier code will mark...

Monday, February 14, 2011 | General | Read More

MultiLane : a concurrent blocking multiset

MultiLane : a concurrent blocking multiset - under submission. It's rather trivial to extend the general idea and construct a deque instead of a queue-like construct.

Thursday, January 20, 2011 | General | Read More

Partitioned Ticket Lock

The Partitioned Ticket Lock - under submission.

Friday, January 14, 2011 | General | Read More

Submissions are being accepted for VEE 2010

Submissions are now being accepted for VEE 2010 (more formally, the 2011 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments). VEE is co-located with ASPLOS this year.

Friday, October 15, 2010 | General | Read More

Cache index-aware memory allocation

Cache index-Aware memory allocation(under submission) by Dave Dice, Yehuda Afek and Adam Morrison. We describe a relatively simple set of changes to malloc allocators that may reduce the conflict miss rate of applications accessing blocks returned by malloc.

Tuesday, September 7, 2010 | General | Read More

Solaris scheduling : SPARC and CPUIDs

Since it's a commonly asked question and source of confusion I thought I'd write up the following. First, I should introduce some terminology and state the mapping between solaris logical CPUIDs and physical resources. On a Solaris/SPARC Niagara-2 system the logical CPUIDs map geographically to physical IDs and resources. You can interpret a logical CPUID as follows: (DieNumber: D; CoreNumber:3 ; ThreadGroup:1 ; Strand:2). That is, you have 8 cores per die, 2 thread groups per...

Wednesday, July 21, 2010 | General | Read More

Europar 2010 : Transactional Mutex Locks

Transactional Mutex Locks by Luke Dalessandro, Dave Dice, Michael Scott, Nir Shavit and Michael Spear will appear in Europar 2010.

Tuesday, June 1, 2010 | General | Read More

TLRW: Return of the Read-Write Lock

TLRW: Return of the Read-Write Lock by Dave Dice and Nir Shavit will appear in SPAA 2010. (The paper introduces the concept of the TLRW-ByteLock).

Monday, April 5, 2010 | General | Read More

QPI Quiescence

It's not uncommon to find Dekker-like idioms in modern concurrent programs. On platforms with weaker memory models -- say where a store followed by a load in program order can be reordered by the architecture to appear as a load and then a store in the effective memory order (sometimes called the "visibility order") -- programs must use barrier instructions to enforce memory ordering to implement Dekker correctly. For the purposes of discussion and assuming a...

Friday, February 12, 2010 | General | Read More

A scheduling and dispatching oddity on Linux

While benchmarking a concurrent application on Linux I ran into an odd problem worth relating. Specifically, I'm using ubuntu 9.10 with linux kernel 2.6.31-1 running on a 1x4x2 Core2 i7-920 "Nehalem" (1 package; 4 cores/package; 2 logical processors/core via hyperthreading). I'd noticed that our scaling numbers were a bit odd, with more than the usual fall off past 4 threads (it's a 1x4x2 system so we expect some fade past 4 threads, even for ideally scalable benchmarks) and...

Tuesday, January 26, 2010 | General | Read More

A race in LockSupport park() arising from weak memory models

I recently diagnosed the root cause of a concurrency bug, CR6822370, and thought it sufficiently interesting to share the details. (CR 6822370 actually represents a cluster of bugs that are now thought to be related by a common underlying issue). Briefly, we have a lost wakeup bug in the native C++ Parker::park() platform-specificinfrastructure code that implements java.util.concurrent.LockSupport.park(). The lost wakeup arises from a race that itself arises because of...

Monday, November 30, 2009 | General | Read More

ROCK Hardware Transactional memory - Sun technical report

We recent published an expanded version of our earlier ASPLOS paper as a Sun labs technical report. Early Experience with a Commercial Hardware Transactional Memory Implementation by Dave Dice, Yossi Lev, Mark Moir, Dan Nussbaum, and Marek Olszewski, published as SMLI-TR-2009-180.

Thursday, October 29, 2009 | General | Read More

The perils of negative scalability

I've seen the following issue confound customers and colleagues of late, so thought it worth a blog entry. Lets say you have an application that exhibits negative scalability. That is, if you were to plot throughput on the Y-axis and concurrency on the X-axis the shape of the curve would be convex downward -- performance climbs up to an apex and then falls off. (How can this happen? A common reason is that the communication overheads start to dominate and the benefit...

Wednesday, September 23, 2009 | General | Read More

X86 platform memory model - clarifications in recent Intel manuals

Intel has taken steps toward addressing some of the concerns about potential platform memory model relaxations that I identified in a previous blog entry. Specifically see section of their latest Software Developer's Manual which now guarantees global store ordering.

Monday, June 1, 2009 | General | Read More

Instruction selection for volatile fences : MFENCE vs LOCK:ADD

In the past the JVM has used MFENCE but, because of latency issues on AMD processors and potential pipeline issues on modern Intel processors it appears that a LOCK:ADD of 0 to the top of stack is preferable (Gory Details).

Friday, May 29, 2009 | General | Read More

memcpy() concurrency curiosities

I was recently asked to diagnose a problem a customer was encountering that involved Java and the JNI getIntArrayElements() and releaseIntArrayElements() primitives. The outcome of the exploration was sufficiently interesting that I thought I'd mention it here in order that other JNI users might avoid the same pitfall. Briefly, the customer was running a highly threaded application on a machine with lots of available concurrency. The Java threads would repeatedly read from an...

Sunday, May 24, 2009 | General | Read More

Carmina Burana Reinterpreted

Among other things, musicologists recontextualize and reinterpret older works for the current generation. If you love Carmina Burana but struggle with medieval high German or ecclesiastic "church" Latin, musicologists have an easy-to-digest answer (SFW).

Wednesday, March 19, 2008 | General | Read More

CAS and Cache Trivia - Invalidate or Update In-Place

In a previous blog entry on biased locking I noted that CAS (the atomic Compare-And-Swap instruction) usually has the same semantics as store with respect to caches and the interconnect. It's worth calling out an interesting exception, however. For background, the level-1 data cache -- often shortened to L1D$ or just D$ -- of Niagara-1 and Niagara-2 processors uses a write-aroundpolicy. All stores go over the cross-bar, but if the line is also present in the D$, the store...

Friday, March 14, 2008 | General | Read More

Rock-style Transactional Memory - Lock Elision for Java, etc.

In Applications of the Adaptive Transactional Memory Test Platform to appear in Transact 2008 we should how Rock-like transactional memory can be applied to various components of the Solaris software stack, including transactional lock elision for "synchronized" blocks in Java.

Wednesday, February 13, 2008 | General | Read More

National Wear Red Day

February 1st is National Wear Red Day. This is particularly dear to my heart because of my Wife's experience with cardiac arrest last May 15th.

Monday, January 28, 2008 | General | Read More

Questions about Java-SE on Solaris?

Feel free to post questions to our Ask the Experts forum.

Friday, January 18, 2008 | General | Read More

Integrated Cloud Applications & Platform Services