poking at the kernel dispatcher.
By timatworkhomeandinbetween on Jul 07, 2005
I have been drawn into an escalation involving the kernel dispatcher. This is the code that picks and schedules threads onto the available CPUs according to rules in the dispatch tables. It is by far the most complex code in the whole machine and IMHO it is the code that allows Solaris to scale so well.
In this escalation a load of threads, including the clock, are spinning waiting for another thread to release a dispatcher lock. Sadly no one releases it, and the machine hangs until the deadman or a cluster heartbeat stops the machine. Dispatcher locks are simple spin locks, designed for easy use, that protect modifications to the dispatching object when a thread transitions from sleepq to runq to oncpu. As a thread moves from object to object, the thread's t_lockp pointer is moved to point to the lock in the new object.
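To make the t_lockp idea concrete, here is a minimal user-space sketch in C11. It is not the Solaris source: every sketch_ name is hypothetical, the lock is a bare atomic boolean, and the rule that the mover takes the new lock before dropping the old one is my assumption about how a well-behaved transition would look. The point is only that locking a thread means locking whatever t_lockp points at, and re-checking the pointer once the lock is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dispatcher lock: a bare spin lock. */
typedef struct { atomic_bool held; } sketch_disp_lock_t;

/* Hypothetical stand-in for a kernel thread; only the lock pointer matters. */
typedef struct { _Atomic(sketch_disp_lock_t *) t_lockp; } sketch_thread_t;

static void sketch_lock_enter(sketch_disp_lock_t *lp) {
    while (atomic_exchange(&lp->held, true))
        ; /* spin until the previous holder clears the flag */
}
static void sketch_lock_exit(sketch_disp_lock_t *lp) { atomic_store(&lp->held, false); }
static bool sketch_lock_held(sketch_disp_lock_t *lp) { return atomic_load(&lp->held); }

/* Lock the thread: take the lock t_lockp points at, retry if the thread moved. */
static sketch_disp_lock_t *sketch_thread_lock(sketch_thread_t *t) {
    for (;;) {
        sketch_disp_lock_t *lp = atomic_load(&t->t_lockp);
        sketch_lock_enter(lp);
        if (lp == atomic_load(&t->t_lockp))
            return lp;            /* still the right lock, we own the thread */
        sketch_lock_exit(lp);     /* thread moved while we spun, try again */
    }
}

/* Unlock the thread: release whatever t_lockp points at right now. */
static void sketch_thread_unlock(sketch_thread_t *t) {
    sketch_lock_exit(atomic_load(&t->t_lockp));
}

/* Assumed transition rule: the caller already holds the thread lock. */
static void sketch_thread_move(sketch_thread_t *t, sketch_disp_lock_t *newlp) {
    sketch_disp_lock_t *oldlp = atomic_load(&t->t_lockp);
    assert(sketch_lock_held(oldlp));
    if (newlp == oldlp)
        return;
    sketch_lock_enter(newlp);             /* take the new object's lock */
    atomic_store(&t->t_lockp, newlp);     /* repoint the thread at it */
    sketch_lock_exit(oldlp);              /* drop the old object's lock */
}

/* Walk one thread from a sleepq lock to a runq lock and check the lock state. */
static bool sketch_demo_move(void) {
    sketch_disp_lock_t sleepq_lock, runq_lock;
    sketch_thread_t t;
    atomic_init(&sleepq_lock.held, false);
    atomic_init(&runq_lock.held, false);
    atomic_init(&t.t_lockp, &sleepq_lock);

    sketch_thread_lock(&t);
    sketch_thread_move(&t, &runq_lock);
    bool ok = !sketch_lock_held(&sleepq_lock) && sketch_lock_held(&runq_lock);
    sketch_thread_unlock(&t);
    return ok && !sketch_lock_held(&runq_lock);
}
```

Calling sketch_demo_move() from a main() should return true: after the move the thread lock is the runq lock, and the unlock releases that same lock because the mover held the thread lock throughout.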
Now Solaris is quite complex internally, especially when you consider doors, which are zero context switch function calls from one thread into another. Somewhere in all of the swtch/resume/door code we know that something is switching a thread t's t_lockp from one cpu lock to another cpu lock whilst at the same time ts_update_list() is doing thread_lock(t)/thread_unlock(t), so it unlocks a different lock to the one it locked.
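The failure is easier to see when the suspected interleaving is replayed step by step. The following is a sketch only, in plain C11 with hypothetical demo_ names, and it runs the three steps in one thread of control rather than racing real threads. It assumes the unlock simply releases whatever t_lockp points at when it is called, which is what the symptom above suggests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the kernel's types. */
typedef struct { atomic_bool held; } demo_lock_t;
typedef struct { _Atomic(demo_lock_t *) t_lockp; } demo_thread_t;

/* Try once instead of spinning, so the replay itself cannot hang. */
static bool demo_lock_try(demo_lock_t *lp) {
    return !atomic_exchange(&lp->held, true);
}
static void demo_lock_exit(demo_lock_t *lp) { atomic_store(&lp->held, false); }

/*
 * Replays the suspected sequence. Returns true if the lock taken in
 * step 1 is left held with nobody left to release it.
 */
static bool demo_replay_mismatch(void) {
    demo_lock_t cpu0_lock, cpu1_lock;
    demo_thread_t t;
    atomic_init(&cpu0_lock.held, false);
    atomic_init(&cpu1_lock.held, false);
    atomic_init(&t.t_lockp, &cpu0_lock);

    /* Step 1: the updater does its thread_lock(t) and gets cpu0's lock. */
    demo_lock_t *locked = atomic_load(&t.t_lockp);
    bool got = demo_lock_try(locked);
    assert(got);

    /* Step 2: some other path repoints t_lockp without holding the thread lock. */
    atomic_store(&t.t_lockp, &cpu1_lock);

    /* Step 3: the updater does its thread_unlock(t), which follows t_lockp again. */
    demo_lock_t *unlocked = atomic_load(&t.t_lockp);
    demo_lock_exit(unlocked);

    /* cpu0's lock is still set; a real spinner would now spin on it forever. */
    return locked != unlocked && !demo_lock_try(&cpu0_lock);
}
```

In this replay demo_replay_mismatch() returns true: the unlock clears cpu1's lock, cpu0's lock stays held, and every later attempt to take it fails, which matches the picture of threads spinning on a lock that no one will ever release.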
My job was made easier by some clever tracing code that a colleague of mine has developed, which showed the lock and unlock of different locks. Sadly it couldn't spot who manipulated t's t_lockp without having done its own thread_lock(t) first. Code inspection has us looking at the door code and maybe the interrupt pin/unpin code.
It would be good to extend the mechanism we have for statically checking locking sequences in storage drivers to deal with dispatcher locks. Maybe that is something the OpenSolaris community could look at?