A simple test-and-set based spin lock is a reasonable choice when contention is nil or low. Lock handover is relatively efficient and there's no need to maintain a list of waiting threads, yielding a simple design. Under higher contention, queue-based such as MCS often provide better throughput and be a better choice. Classic MCS locks
are strictly FIFO queue-based locks that use local spinning. (Clever lock implementations
can try to adaptively select the best algorithm based on recent load, in order to try to have the best of both).
However, if there are more ready threads than logical CPUs and involuntary context switching -- preemption -- is in play, then MCS performance can suffer. Specifically, the unlock operator might pass ownership to a preempted thread, resulting in undesirable lock waiter preemption
which in turn can result in convoying and head-of-line blocking phenomena. Somewhat perversely, simple test-and-set locks often perform better than MCS locks when preemption is active, as threads have to actively compete for the lock. That is, ownership will never be passed to a preempted thread. (Both MCS and test-and-set locks are vulnerable to lock holder preemption
, and it can certainly be the case that a thread acquires the test-and-set lock and is then immediately preempted, but by definition the successor was running when it acquired a test-and-set lock).
To make MCS more preemption tolerant, we can use the Solaris schedctl
facility to avoid handoff of the lock to threads that are preempted. Schedctl gives us an extremely efficient way -- requiring just a load instruction -- to query the scheduling state of threads from user-mode code. The unlock operator uses schedctl to interrogate the scheduling state of a tentative successor. The successor is the next node in the MCS chain after the owner's node. If that thread is running on a CPU, then the unlock proceeds normally and the caller passes ownership of the lock to the successor in the usual fashion. But if the successor has been preempted, the unlock operator splices the successor's MCS node out of the MCS chain and writes a special evicted
value into that node. We can use the same field on which threads spin, using 3 possible values instead of the usual 2. The unlock operator then inspects the chain and repeats as necessary. When the preempted thread eventually resumes and is dispatched onto a CPU, it notices the evicted value and simply reenqueues "re-arriving" at the lock. If the unlock operator evicts the terminal element in the chain, then the lock is released and set to available state. (Evicting interior nodes requires just pointer manipulation. Excising the terminal node requires as CAS, as would be the case when the lock is made available).
This clearly breaks the FIFO queue policy of normal MCS. On the other hand it forces the lock's admission policy to respect and accede to the kernel's scheduling and priority decisions instead of blindly and obstinately using FIFO. Put a different way, it's troubling if excessively rigid lock policies run counter to kernel scheduling policies. The kernel is ultimately in control, after all. If the kernel decided that thread A should run and B should be preempted, the lock should respect that decision and admit accordingly. And critically, this techniques preserves MCS performance when there are more ready threads than CPUs and preemption is in play. Normally we might find a fairly abrupt cliff at that point, but the augmented MCS avoids the usual drop at performance exhibited by classic MCS when the number of threads exceeds the number of CPUs.
It's also worth noting that there's a window between the inspection of the schedctl state and passing the lock to a successor where that thread might be preempted, but in practice the window is small and the risk is no greater than normal lock holder preemption. Kontothanassis et al.
and He et al.
suggested related ways to mitigate the issue of handoff to preempted waiters.
I'll try to post a graph in next week or so.