MultiLane: a concurrent blocking multiset

MultiLane: a concurrent blocking multiset (under submission).


It's rather trivial to extend the general idea to build a deque instead of a queue-like collection.
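For concreteness, here is a minimal sketch of the ticket-and-mask scheme discussed in the paper and in the comments below. The class and field names are illustrative rather than the paper's actual code; each lane is simply a standard bounded blocking queue, and the lane count must be a power of two so that masking works as a modulus.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: a blocking multiset built from "Width" blocking lanes.
// put() and take() each grab a ticket from a central cursor and are routed to
// lane[ticket & (Width-1)], where they block on that lane as needed.
final class MultiLane<T> {
    private final BlockingQueue<T>[] lanes;
    private final AtomicLong putCursor = new AtomicLong();
    private final AtomicLong takeCursor = new AtomicLong();

    @SuppressWarnings("unchecked")
    MultiLane(int width, int capacityPerLane) {
        // width must be a power of two so (ticket & (width-1)) == ticket % width
        lanes = new BlockingQueue[width];
        for (int i = 0; i < width; i++) {
            lanes[i] = new ArrayBlockingQueue<>(capacityPerLane);
        }
    }

    void put(T x) throws InterruptedException {
        long ticket = putCursor.getAndIncrement();
        lanes[(int) (ticket & (lanes.length - 1))].put(x);        // blocks if that lane is full
    }

    T take() throws InterruptedException {
        long ticket = takeCursor.getAndIncrement();
        return lanes[(int) (ticket & (lanes.length - 1))].take(); // blocks if that lane is empty
    }
}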

Comments:

nice, very simple.

Minor nit: wouldn't making the Lanes.length - 1 (or Width - 1) a final var be a minor performance improvement?

Posted by Jed Wesley-Smith on January 20, 2011 at 12:44 PM EST #

Jed, thanks. That's a good point about using a final field. The "implementation" in the paper is intentionally condensed because of space limits related to publication. In practice there are a few things you'd do differently.

The theorem prover in a good JIT may be able to reliably prove that (X & (Length-1)) < Length for all X, and thus safely elide the array bounds check for the code in the paper. If we pull Length-1 into a final field then the JIT has to perform much more program analysis to be able to elide the bounds check. So if we use the final field it becomes more likely the JIT will not be able to elide the AIOOB check. (It depends on the specific JIT, of course).

What we're most concerned about for performance are the loads, branches, and dependent operations. ALU operations like "subtract" are almost free.
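To make the two variants concrete, here's a small sketch (illustrative names, and it assumes a power-of-two lane count):

import java.util.concurrent.atomic.AtomicLong;

final class LaneIndexing {
    private final Object[] lanes;
    private final int mask;                       // precomputed lanes.length - 1
    private final AtomicLong cursor = new AtomicLong();

    LaneIndexing(int width) {                     // width must be a power of two
        lanes = new Object[width];
        mask = width - 1;
    }

    // (a) Derive the mask from the array itself. A JIT that can prove
    // (t & (lanes.length-1)) < lanes.length may elide the bounds check.
    Object nextLaneViaLength() {
        long t = cursor.getAndIncrement();
        return lanes[(int) (t & (lanes.length - 1))];
    }

    // (b) Use the precomputed final mask. The subtraction disappears, but the
    // JIT must relate "mask" back to lanes.length to prove the index is in
    // range, so the AIOOB check is more likely to remain.
    Object nextLaneViaFinalMask() {
        long t = cursor.getAndIncrement();
        return lanes[(int) (t & mask)];
    }
}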

Regards, -Dave

Posted by guest on January 21, 2011 at 02:29 AM EST #

David, interesting point about the bounds check optimisation - does that actually happen on current Java 6 VMs?

Posted by Jed Wesley-Smith on January 21, 2011 at 10:21 AM EST #

Jed, there are certainly JITs that can optimize such idioms. But assume for a moment that the JIT doesn't. In that case an operation will fetch the final precomputed mask value, mask the index with that value, fetch the array bound from the array header, compare and branch (the index-out-of-bounds check), and then finally fetch the desired slot. The branch to throw the exception will never be taken, of course. So we'll have 3 loads. If we use (array.length-1) we'll have only two loads. Even if the JIT isn't sufficiently smart to elide the bounds check, it will normally be good enough (global value numbering) to fetch the array length field only once.

But I agree there's something aesthetically pleasing about keeping the mask in a final field (:>).

Regards, -Dave

Posted by guest on January 22, 2011 at 09:32 AM EST #

Hi David,

Some comments on the paper.

"Because of the disjunct between advancing the cursor and then accessing the sub-collection identified by the cursor value, our collection are not strictly
FIFO even if the sub-collections are strictly FIFO."
It's a bit unclear what do you mean here. In a single-consumer scenario MultiLane provides the same FIFO guarantees as an underlying queue. And in multi-consumer scenario, ordering is weakly defined; and as far as I see, per-producer/per-consumer order is still preserved (that is, a single consumer can't recieve 2 messages from a single producer in reversed order). So, what exactly do you mean here?

"A MultiLane collection is work-conserving if the underlying collections are work-conserving. Specifically, take() will return promptly when and if any messages are available in the collection."
I don't think that's the case. Because the MultiLane is not linearizable, it may be the case that several messages are in the queue (threads have returned from push()), yet a subsequent pop() returns 'false'. The most wicked scenario is: thread 1 pushes a message; thread 2 pushes a message and then pops a message; no other threads - it may be the case that thread 2 will get 'false' from pop().

By the way, I think it's worth noting that MultiLane breaks the linearizability and lock-free property of an underlying queue.

"The key benefits to our approach, as compared to existing collections, are (a) reduced coherence traffic as we distribute operations over the lanes."
It's unclear how it reduces coherence traffic, because all the traffic caused by the underlying queues is still in place, plus there is additional traffic caused by obtaining the tickets.

"general we find that a MultiLane form will significantly outperform its underlying counterpart"
It would be interesting to benchmark it with some other underlying queues. Because, well, MultiLane has highly contended ticket sequencers operated on with wait-free atomic RMWs, while some queue implementations are able to accomplish all the useful work with a single wait-free atomic RMW. For example, take a look at the following MPSC queue, which has a single XCHG for producers and no atomic RMW for the consumer:
http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue
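For reference, here is my own rough Java rendering of that queue (a sketch, not the code from the page; MpscQueue and the field names are just illustrative):

import java.util.concurrent.atomic.AtomicReference;

// Vyukov-style non-intrusive MPSC queue: producers pay a single getAndSet
// (XCHG); the single consumer needs no atomic RMW, only a volatile read.
final class MpscQueue<T> {
    private static final class Node<T> {
        volatile Node<T> next;
        T value;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;  // producers swap the new node in here
    private Node<T> tail;                         // owned by the single consumer

    MpscQueue() {
        Node<T> stub = new Node<>(null);
        head = new AtomicReference<>(stub);
        tail = stub;
    }

    // Multiple producers: one XCHG, then a plain link store.
    void push(T value) {
        Node<T> node = new Node<>(value);
        Node<T> prev = head.getAndSet(node);      // the only atomic RMW
        prev.next = node;                         // publishes the node to the consumer
    }

    // Single consumer: returns null when the queue looks empty. Note that a
    // producer stalled between the XCHG and the link store temporarily hides
    // later messages from the consumer, so this is not linearizable either.
    T poll() {
        Node<T> next = tail.next;
        if (next == null) return null;
        T value = next.value;
        next.value = null;                        // the consumed node becomes the new stub
        tail = next;
        return value;
    }
}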

I've also posted an article with some comments, though it's mostly focused on how the problem can be approached at the architectural level:
http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

Best regards,
Dmitry Vyukov

Posted by Dmitry Vyukov on January 25, 2011 at 01:24 AM EST #

Hi Dmitry,

Regarding non-FIFO behavior, consider the following scenario where we have 2 lanes and the read and write cursors are initially 0:

Put thread A increments the put cursor from 0 to 1 then stalls before putting into the underlying lane[0].
Put thread B increments the put cursor from 1 to 2 and puts message X into lane[1].
Put thread B increments the put cursor from 2 to 3 and puts message Y into lane[0].
Take thread C increments the take cursor from 0 to 1 and takes Y from lane[0].
Take thread C increments the take cursor from 1 to 2 and takes X from lane[1].
Thus, thread C observed thread B's put() operations out of order.
Often, applications can tolerate such out-of-order delivery. (Consider the use of UDP vs TCP).

Recall that the mechanism requires the underlying collections to be blocking, thus pop() would never return false, so the scenario you mentioned won't happen. Likewise, the progress properties are largely irrelevant because the underlying collections and the aggregate multilane collection are both blocking. (In practice this seems to be what customers often want).

I absolutely agree that the centralized cursors in the multilane mechanism can constitute algorithmic and coherence hot spots and ultimately impede scaling, but it's often the case that it'll scale much better than a single instance of the underlying type.

Regards, -Dave

Posted by guest on January 25, 2011 at 02:06 AM EST #

> Regarding non-FIFO behavior...

Aha! I see.

> Recall that the mechanism requires the underlying collections to be blocking...

Oh, for some reason I assumed that it could be used with any queue. But why did you limit it to blocking queues? Well, it would break linearizability, but applications are often OK with that (I still believe that the M&S queue is not linearizable - one can't blindly replace a mutex-based queue with an M&S queue).

By the way, I think that MultiLane breaks the lock-free property in either case - a producer stalled right after the ticket increment can block system-wide progress.

Posted by Dmitry Vyukov on January 25, 2011 at 02:23 AM EST #

The original problem and customer requirements were in terms of blocking multisets. That's actually a fairly common need. Theoreticians love polling, but commercial engineers often just want message passing and prefer blocking to polling. If it's a resource pool instead of a communication channel they're also OK with blocking. So the issue was framed that way, and the multilane mechanism strictly requires such blocking underlying collections. (If we had thought of a way to make it more general and support polling or operations with timeouts without adding excess complexity, we'd have done so, but those "solutions" were rather ugly).

As for blocking system-wide progress, if a put() stalls after incrementing the cursor then it can stall itself and perhaps a paired-up take(), but generally we'll still have flow over the aggregate multiset, making it fairly forgiving.

Regards, -Dave

Posted by guest on January 25, 2011 at 02:54 AM EST #

> but those "solutions" were rather ugly

I see: the problem is that a consumer has already incremented the sequencer, and there is no easy way to roll it back.
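Concretely, building on the ticket-and-mask sketch near the top of the post (hypothetical method and field names, with java.util.concurrent.TimeUnit assumed), a timed poll runs straight into that problem:

T pollWithTimeout(long timeout, TimeUnit unit) throws InterruptedException {
    long ticket = takeCursor.getAndIncrement();               // now committed to one lane
    T x = lanes[(int) (ticket & (lanes.length - 1))].poll(timeout, unit);
    if (x == null) {
        // Timed out, but the ticket has already been consumed: a later put()
        // can still be routed to this lane expecting a paired take(), and
        // there is no cheap way to "un-increment" the sequencer or hand the
        // ticket to another consumer.
    }
    return x;
}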

> making it fairly forgiving

Fairly forgiving - yes; lock-free - no. Think of a situation with a single consumer: it's blocked while there are messages available in the queue.

Posted by Dmitry Vyukov on January 25, 2011 at 09:04 PM EST #

Since we require the underlying collections to be fully blocking, I don't think it's useful to try to label the top-level construct as lock-free or not lock-free.

But I understand your concern about progress properties. In your example with a single consumer, yes, it can certainly stall if the producer it was fated to pair with incremented the cursor and then stalled. And it'll stay stalled even if other producers subsequently arrive. I think the key is that even though individual threads can stall and impede a complementary thread on the same lane, we can still have good aggregate throughput if we have lots of producers and consumers. Somewhat imprecisely, it's closer to obstruction-free than lock-free.
Regards, -Dave

Posted by guest on January 26, 2011 at 01:16 AM EST #

Following up on some private comments sent under separate cover, someone mentioned the multilane approach might be similar to that used in TBB concurrent queues. There's a superficial similarity, but if I understand correctly, a thread that needs to wait in the TBB environment will spin on the central global variables, whereas the multilane approach will spin in the sub-collection code. There's also some similarity to Mellor-Crummey's "Concurrent Queues: Practical Fetch-and-Φ Algorithms" (http://hdl.handle.net/1802/6107), but our approach is blocking, uses fewer central variables, and is arguably simpler. The old NYU Ultracomputer references in the Mellor-Crummey paper are worth reading, btw. Regards, -Dave

Posted by guest on February 17, 2011 at 10:33 AM EST #
