Inverted schedctl usage in the JVM

The schedctl facility in Solaris allows a thread to request that the kernel defer involuntary preemption for a brief period. The mechanism is strictly advisory: the kernel can opt to ignore the request. Schedctl is typically used to bracket lock critical sections. That, in turn, can avoid convoying -- threads piling up on a critical section behind a preempted lock-holder -- and other lock-related performance pathologies. If you're interested, see the man pages for schedctl_start() and schedctl_stop() and the schedctl.h include file. The implementation is very efficient. schedctl_start(), which asks that preemption be deferred, simply stores into a thread-specific structure -- the schedctl block -- that the kernel maps into user-space. Similarly, schedctl_stop() clears the flag set by schedctl_start() and then checks a "preemption pending" flag in the block. Normally this will be false, but if it's set, schedctl_stop() will yield to politely grant the CPU to other threads. Note that you can't abuse this facility for long-term preemption avoidance, as the deferral is brief. If your thread exceeds the grace period, the kernel will preempt it and transiently degrade its effective scheduling priority. Further reading: US05937187 and various papers by Andy Tucker.
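
As a concrete illustration of the conventional pattern, here's a minimal sketch that brackets a pthread critical section with schedctl. The schedctl_init(), schedctl_start(), and schedctl_stop() calls are the real Solaris interfaces; the mutex and the work inside it are just placeholders.

    #include <schedctl.h>
    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void update_shared_state(void) {
        /* One schedctl block per thread, mapped into user-space by the kernel. */
        static __thread schedctl_t *sc = NULL;
        if (sc == NULL)
            sc = schedctl_init();

        schedctl_start(sc);            /* advisory: please defer preemption */
        pthread_mutex_lock(&lock);
        /* ... short critical section ... */
        pthread_mutex_unlock(&lock);
        schedctl_stop(sc);             /* clear the flag; yields if preemption is pending */
    }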

We'll now switch topics to the implementation of the "synchronized" locking construct in the HotSpot JVM. If a lock is contended, then on multiprocessor systems we'll spin briefly to try to avoid context switching. Context switching is wasted work and inflicts various cache and TLB penalties on the threads involved. If context switching were "free" then we'd never spin to avoid switching, but that's not the case. We use an adaptive spin-then-park strategy. One potentially undesirable outcome is that we can be preempted while spinning. When our spinning thread is finally rescheduled, the lock may or may not be available. If not, we'll spin and then potentially park (block) again, thus suffering a second context switch. Recall that the reason we spin is to avoid context switching. To avoid this scenario, I've found it useful to use schedctl to request preemption deferral while spinning. In addition, I've arranged for the spin loop to periodically poll the "preemption pending" flag; if that's found set, we simply abandon the spinning attempt and park immediately. This avoids the double context-switch scenario above. This particular usage of schedctl is inverted in the sense that we cover the spin loop instead of the critical section. (I've experimented with extending the schedctl preemption-deferral period over the critical section -- more about that in a subsequent blog entry.)
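
In rough pseudo-C, the inverted pattern looks something like the sketch below. This is the shape of the idea, not the actual HotSpot code: lock_t, try_lock(), park(), cpu_pause(), SPIN_LIMIT, and preemption_pending() are all hypothetical stand-ins, the last being a read of the "preemption pending" flag in the schedctl block.

    /* Returns 1 if the lock was acquired while spinning, 0 if we had to park. */
    int spin_then_park(lock_t *l, schedctl_t *sc) {
        schedctl_start(sc);                /* cover the spin loop, not the critical section */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (try_lock(l)) {
                schedctl_stop(sc);
                return 1;                  /* acquired without a context switch */
            }
            if (preemption_pending(sc)) {  /* kernel wants the CPU back */
                break;                     /* abandon the spin; park immediately */
            }
            cpu_pause();
        }
        schedctl_stop(sc);
        park(l);                           /* block until woken; caller retries the acquisition */
        return 0;
    }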

One annoyance is that the schedctl blocks for the threads in a given process are tightly packed on special pages mapped from kernel space into user-land. As such, writes to one thread's schedctl block can cause false sharing with adjacent blocks. Hopefully the kernel folks will make changes to avoid this by padding and aligning the blocks to ensure that one cache line underlies at most one schedctl block at any one time. It's vaguely ironic that a facility designed to improve cooperation between threads suffers from false sharing.
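
A user-level analogue of the fix would be something like the following union trick. It's purely illustrative (the real blocks are laid out by the kernel), and the 64-byte line size is an assumption.

    #define CACHE_LINE 64

    /* Pad each block out to a cache-line multiple and align the whole thing,
       so a store to one thread's block never dirties a neighbor's line. */
    typedef union {
        schedctl_t sc;
        char pad[((sizeof(schedctl_t) + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE];
    } padded_schedctl_t __attribute__((aligned(CACHE_LINE)));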

Schedctl also exposes a thread's scheduling state. So if thread T2 holds a lock L, and T1 is contending for L, T1 can check T2's state to see whether it's running (ONPROC in Solaris terminology), ready, or blocked. If T2 is not running then it's usually prudent for T1 to park instead of continuing to spin, as the spin attempt is much more likely to be futile.
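
Grafted onto the earlier spin loop, that heuristic might look like this. The lock_owner_schedctl() accessor is hypothetical, and the sc_state field and SC_ONPROC constant are my reading of sys/schedctl.h, so treat those names as assumptions.

    void lock_contended(lock_t *l) {
        for (;;) {
            if (try_lock(l))
                return;                                   /* acquired */
            schedctl_t *owner = lock_owner_schedctl(l);   /* owner's mapped block */
            if (owner == NULL || owner->sc_state != SC_ONPROC)
                park(l);       /* owner is off-CPU; further spinning is futile */
            else
                cpu_pause();   /* owner is running; keep spinning */
        }
    }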

Comments:

Frankly, I wanted a CPU feature for this, and I believe certain parts are already there.
    wait_on_cache_line addr
    if (write) {
        spin and test for value
    } else {  // this is the timesliced-out case
        park wherever this thread should be parked
    }

The expectation is that the if branch is taken if the cache line sees a write, and the else branch is taken if no such write takes place and a switch-out/power-down event occurs first. There are some corner cases involving cache-line associativity. Until one of those two events occurs, the core/hardware thread goes into a slightly lower-power state. I believe ARM probably has something like this.

----
But yes, I agree about the convoy problem. I'd coined a different term for this, viz. the "avalanche effect". Even the smallest code segment protected with a bare spin lock can turn ugly if the holder gets preempted while holding the lock. Now it will find it hard to get its timeslice back, because a large number of useless threads will each burn a full timeslice spinning when you clearly know they have no chance of getting the lock... (it starts simple, then turns into an avalanche)

Posted by guest on June 21, 2011 at 09:16 AM EDT #

Frankly, I'd wanted a CPU feature for this to be more power efficient, and I believe it already exists in some form on ARM/x86, though I'm not very sure of that.

    wait_on_cache_line_write addr
    if (write) {
        spin and check if the write is what you want
    } else {  // this is the ctx-switch-out case
        // alright, you're getting booted
        // do you want to park someplace special?
    }

As you can imagine, the if case is when that cache line gets written by some other core. The else case is when the timeslice expires or the thread gets kicked out for whatever reason (low-power event/interrupts?).

Between the two, you don't want to spin at all: let the underlying bus/lines go into low power if possible. You get your wish, and the power folks are happy too.
-----------------------

On a slightly different note, I'd coined a different term, the *avalanche effect*, in place of convoy. The problem I was seeing was a thread holding an adaptive lock getting context-switched out [1]. It was then unable to get back on the CPU because hordes of threads were slicing in, doing their useless adaptive spinning, when we clearly know they have no chance of gaining the lock.

So no matter how small your code segment is, once you're switched out holding the lock you can easily cause an avalanche, hence the name :-)

Posted by bank kus on June 21, 2011 at 09:22 AM EDT #

Hi Banks,

Have you looked at MONITOR-MWAIT?

BTW, another facility in schedctl is that I can check some other thread's schedctl block to see if it's currently running. I can abort spinning if the current owner is descheduled, which works rather nicely.

Posted by guest on June 21, 2011 at 09:50 AM EDT #

Hi Dave,

Do you recommend any tools, like Oracle Solaris Studio, for looking at how threads and cores interact? I use Java. Concurrency is hard for people other than experts, and it looks like there is a paucity of tools. Intel's tools might be expensive.

Posted by guest on October 16, 2011 at 11:37 PM EDT #

Solaris Studio is pretty good. You can use collect/analyze to sample many of the key performance counters, which is typically one of the best ways to understand how threads and cores interact. Regards, -Dave

Posted by guest on October 21, 2011 at 07:53 AM EDT #
