Debugging Solaris Scheduling Problems, and Other Fun Party Tricks

The day has arrived. At last. This morning you awoke to the sound of birds singing a pleasant song, sunshine beaming in from outside... and your first waking thought brought a smile to your face... as it did mine. Today Solaris is open, and to you we proudly offer it. Welcome.

Great. So now what? OpenSolaris is yours, and you'd like to get your feet wet, right? Well, where should we start? Over the last few years, most of my professional time has been spent understanding, improving, and evolving the scheduler/dispatcher. That's right, many a performance issue have I seen, and at many a hung system have I cursed. I've learned a lot in that time, and the time has come to begin dumping a cerebral core of my share of arcane information that will hopefully save someone else working in this area of the kernel some time and frustration. That's not to say that the scheduler/dispatcher code is frustrating, but some of it is "historical", and more often than not, it's easier to understand why and how the code works if you know why it was written in the first place.
The Dispatcher and Scheduling Classes Intro
When someone says "hey, I think there's a scheduling problem here", they might be talking about one (or more) of a few different things:
  • Unfairness - Some threads are getting more (or less) CPU time than they should
  • Starvation - An extreme example of the fairness problem, some runnable thread(s) are barely running, or aren't getting to run at all
  • Excess Migrations - Some thread(s) are hopping between CPUs too often. This can hurt performance since a migrating thread will likely be welcomed by a cold cache.
  • Poor load balancing - Scheduling threads to run on CPUs in such a way that they are forced to excessively contend for resources...where a resource could be CPU time, bandwidth, use of a shared component (e.g. memory controller, pipeline, etc.)
But really, the above deals with both dispatching (where threads run, i.e. on which CPUs) and scheduling (when they run, as designated by thread priorities).

In Solaris, the core dispatcher code lives in disp.c under usr/src/uts/common/disp. When a thread becomes runnable, this code is responsible for picking the CPU on which the thread should next run. When it will get a chance to run depends on that thread's priority relative to other runnable threads waiting for that CPU. A thread's priority is governed by its scheduling class. You can take a look at the scheduling classes currently loaded on the system from within mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp crypto ptm logindmux lofs cpc ]
> ::class
SLOT NAME       INIT FCN                 CLASS FCN
   0 SYS        sys_init                 sys_classfuncs
   1 TS         ts_init                  ts_classfuncs
   2 FX         fx_init                  fx_classfuncs
   3 IA         ia_init                  ia_classfuncs
   4 RT         rt_init                  rt_classfuncs
   5            0                        0
   6            0                        0
   7            0                        0
   8            0                        0
   9            0                        0
>

Note that mdb -k invokes the debugger on the running kernel, *very* useful for debugging a live system. If you want to debug the running kernel and have it stop so that you can look at a "snapshot" of the kernel, set breakpoints, data watchpoints, etc., you would use "mdb -K", which drops you into kmdb. For those who remember kadb, kmdb is its successor. All hail ::print!!! So moving along, from ::class we see:
  • System scheduling class - For kernel service threads. Threads in this class have no time quantum, and they will remain on the CPU until such time as they voluntarily yield it. Typical thread priorities: 60 - 99
  • Time Share scheduling class - The default class for user threads. Threads in this class will have their priority lowered as they consume more CPU time, and increased as they use less. Thread priorities: 0 - 59
  • FX (fixed priority) scheduling class - Threads in this class have fixed priority. By default it's 0, but privileged users can run threads in this scheduling class at priorities greater than 0. Thread priorities: 0 - 60.
  • Interactive scheduling class - Used by the windowing system. Very similar to TS, except that there's a mechanism allowing threads to have a "boost" of priority. The windowing system uses this to give processes associated with "in focus" windows higher priority. Thread priorities: 0 - 59
  • Real Time scheduling class - Threads in the RT class assume the highest system priorities, even higher than the system class. Thread priorities: 100 - 159
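As a quick cross-check from userland, priocntl -l lists the classes configured on the system along with their class-specific priority ranges. This is a hedged example; the exact text and ranges vary by release and by which classes are loaded:

$ priocntl -l
CONFIGURED CLASSES
==================

SYS (System Class)

TS (Time Sharing)
        Configured TS User Priority Range: -60 through 60

FX (Fixed priority)
        Configured FX User Priority Range: 0 through 60

IA (Interactive)
        Configured IA User Priority Range: -60 through 60

RT (Real Time)
        Maximum Configured RT Priority: 59

The ranges priocntl reports are class-specific (user) priorities, not the global priorities cited above.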
You can find more about these scheduling classes from the priocntl(1) man page. Note that there's a difference between a scheduling class priority and the thread's system-wide priority. The priorities cited above are system priorities: the global, scheduling-class-independent scale of priorities used by the dispatcher for all threads. In the source base, the code which implements these scheduling classes can again be found under usr/src/uts/common/disp.

esaxe@sark$ ls
SCCS/         disp_lock.c   ia.c          shuttle.c     ts_dptbl.c
class.c       fss.c         priocntl.c    sysclass.c
cpupart.c     fx.c          rt.c          thread.c
disp.c        fx_dptbl.c    rt_dptbl.c    ts.c

As an example of how the dispatcher and scheduling classes hook together, let's look at preempt(), the function invoked when the currently running thread on a CPU should yield in favor of a higher priority runnable thread:

/*
 * Preempt the currently running thread in favor of the highest
 * priority thread.  The class of the current thread controls
 * where it goes on the dispatcher queues. If panicking, turn
 * preemption off.
 */
void
preempt()
{
        kthread_t       *t = curthread;
        klwp_t          *lwp = ttolwp(curthread);

        if (panicstr)
                return;

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");

        thread_lock(t);

        if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
                /*
                 * this thread has already been chosen to be run on
                 * another CPU. Clear kprunrun on this CPU since we're
                 * already headed for swtch().
                 */
                CPU->cpu_kprunrun = 0;
                thread_unlock_nopreempt(t);
                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
        } else {
                if (lwp != NULL)
                        lwp->lwp_ru.nivcsw++;
                CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
                THREAD_TRANSITION(t);
                CL_PREEMPT(t);
                DTRACE_SCHED(preempt);
                thread_unlock_nopreempt(t);

                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

        swtch();                /* clears CPU->cpu_runrun via disp() */
        }
}

Typically the "else" route is taken, and we see that the scheduling class specific preemption routine is called via CL_PREEMPT(). If the thread is in the TS (timeshare) scheduling class, this resolves to ts_preempt(), which ultimately results in either setbackdq() or setfrontdq() being called, placing the thread at the back or the front (respectively) of the queue of threads at that thread's priority. At the end of preempt() above, we see swtch() is called (switch threads), which in turn calls disp() (return the highest priority thread that the CPU can now run, or the idle thread if there's nothing), and finally resume(), which does the actual context switch. Whew!
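CL_PREEMPT() itself is just an indirection through the thread's scheduling class operations vector. A minimal sketch of the idea (paraphrasing the macros in usr/src/uts/common/sys/class.h):

/*
 * Sketch: each thread points at its class's operations vector via
 * t_clfuncs, and the CL_*() macros simply dispatch through it.
 */
#define CL_PREEMPT(tp)  (*(tp)->t_clfuncs->thread.cl_preempt)(tp)

For a TS thread, t_clfuncs points at ts_classfuncs (the CLASS FCN column in the ::class output earlier), so cl_preempt lands in ts_preempt().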

There's something important to note here. All this is being done in the context of the thread that's being preempted! That is, the thread that is being preempted is actually placing itself on a run queue somewhere, finding the next best thread to run, and then context switching to it! The thread is told it needs to preempt by the setting of the cpu_runrun or cpu_kprunrun flags in its CPU's structure. Those flags may have been set by another, higher priority thread that placed itself on the run queue (see cpu_resched()). Whenever the thread takes a trap, returns from a system call, etc., it checks for these flags and proceeds to preempt itself if one is set. The idea that the dispatcher isn't some group of one or more bureaucratic threads "directing traffic" may seem surprising, but it's implemented this way for performance and scalability. All threads on the system share in executing dispatcher code, and we must take care that the bits of policy employed by every thread add up to serve the greater good.
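To make the flag-setting side of that handshake concrete, here's a rough sketch (not the verbatim source) of what cpu_resched() in disp.c boils down to when a thread of priority tpri becomes runnable on CPU cp:

/*
 * Sketch of cpu_resched(): a thread of priority "tpri" just became
 * runnable on CPU "cp"; flag cp's current thread for preemption if
 * it's running at a lower priority.  (Simplified -- the real
 * cpu_resched() in disp.c handles several more cases.)
 */
static void
cpu_resched(cpu_t *cp, pri_t tpri)
{
        if (tpri > cp->cpu_dispatch_pri) {
                if (tpri >= kpreemptpri)
                        cp->cpu_kprunrun = 1;   /* kernel preemption needed */
                else
                        cp->cpu_runrun = 1;     /* user-level preemption */
                if (cp != CPU)
                        poke_cpu(cp->cpu_id);   /* interrupt cp so it notices */
        }
}

The flag then sits there until the flagged thread next passes through a trap or system call return path, at which point it preempts itself as described above.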
Debugging
So let's talk a bit about debugging. What's that you say, debugging is painful? Tired of adding printk() statements here and there, re-compiling, and adding more printk() statements? I feel your pain. But let me assure you that when that smile erupted on your face this morning, it was for a reason. I enjoy developing Solaris, and it's not because I think hunting down fantastically complex and subtle problems with print statements is fun; it's because tools like (k)mdb and DTrace free me to investigate, observe, and understand what's going on without having to figure out what's needed to make the problem observable.

For now, let's look at one of the tools at your disposal:
(k)mdb, dcmds, and CTF
Let's invoke the kernel debugger on my build machine's running kernel to see what it's up to:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp ptm logindmux lofs cpc ]
> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  -1   no    no t-11   2a10001fcc0 (idle)
  1 300423b2000  1b    0    0  -1   no    no t-63   2a1002f7cc0 (idle)
  2 300423ba000  1b    0    0  -1   no    no t-1    2a10037fcc0 (idle)
  3 300423c6000  1b    0    0  -1   no    no t-91   2a1003c7cc0 (idle)
  8 300423ce000  1b    0    0  59   no    no t-0    30d4775abc0 mdb
  9 300423de000  1b    0    0  -1   no    no t-0    2a100539cc0 (idle)
 10 300423ea000  1b    0    0  -1   no    no t-71   2a1005c1cc0 (idle)
 11 300423f2000  1b    0    0  -1   no    no t-8    2a100609cc0 (idle)
 16 300423fa000  1b    0    0  -1   no    no t-10   2a100691cc0 (idle)
 17 30042402000  1b    0    0  -1   no    no t-191  2a100719cc0 (idle)
 18 3004240a000  1b    0    0  -1   no    no t-0    2a1007a1cc0 (idle)
 19 30042412000  1b    0    0  -1   no    no t-0    2a1007e9cc0 (idle)
>

Well, that's not very exciting. All but CPU 8 are idle, and CPU 8 is running my mdb process thread. Let me kick off a build:

> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  30   no    no t-0    325d59b8820 acomp
  1 300423b2000  1b    1    0  30   no    no t-0    308e67c02a0 acomp
  2 300423ba000  1b    1    0  55   no    no t-2    30760fbce60 cg
  3 300423c6000  1b    2    0  53   no    no t-1    33886adf500 cg
  8 300423ce000  1b    1    0  39   no    no t-1    35876d042c0 iropt
  9 300423de000  1b    2    0  59   no    no t-1    30d4775abc0 mdb
 10 300423ea000  1b    0    0  42   no    no t-1    3081f8cc200 cg
 11 300423f2000  1b    2    0  32   no    no t-0    319c6969400 cg
 16 300423fa000  1b    1    0  53   no    no t-0    338853f1ae0 cg
 17 30042402000  1b    0    0  43   no    no t-0    307215023a0 cg
 18 3004240a000  1b    1    0  49   no    no t-0    30d4771c5e0 cg
 19 30042412000  1b    0    0  43   no    no t-1    308e67dbc80 cg

That's better. ::cpuinfo gives you an overview of what thread each CPU is running; ::cpuinfo -v gives you even more. Let's look at CPU 1 in detail:

> 300423b2000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  1 300423b2000  1b    4    0   0   no    no t-5    3075514a440 acomp
                  |    |
       RUNNING <--+    +-->  PRI THREAD      PROC
         READY                 0 35876d042c0 sh
        EXISTS                 0 3075514b140 acomp
        ENABLE                 0 30041c654a0 dmake
                               0 308e629b120 dmake

So CPU 1 has 4 runnable threads, all at priority 0. The current thread, "acomp", has been on for 5 ticks (5/100ths of a second), and this CPU has the right set of flags set for an online and active CPU.
There's a whole array of dcmds available to ease looking around the system. You can get a list using the ::dcmds dcmd. Also of interest are the walkers, which allow you to easily traverse lists of things....CPUs, threads, pages, etc. For example, let's walk the list of all active thread structures:

> ::walk thread
180e000
2a10001fcc0
2a100017cc0
2a10000fcc0
2a100007cc0
2a10003fcc0
2a100037cc0
2a10002fcc0
2a100027cc0
2a10005fcc0
2a100057cc0
2a10004fcc0
>> More [, , q, n, c, a] ?

(It's a long list, and I'll spare you all the output.) Now let's pipe the output of ::walk thread to ::findstack, which prints a stack trace for a thread given a thread structure:

> ::walk thread |::findstack
stack pointer for thread 180e000: 180b0f1
[ 000000000180b0f1 sched+0x570() ]
  000000000180b211 main+0x458()
  000000000180b2f1 _start+0x110()
stack pointer for thread 2a10001fcc0: 2a10001f021
[ 000002a10001f021 ktl0+0x48() ]
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
stack pointer for thread 2a100017cc0: 2a100017151
[ 000002a100017151 cv_wait+0x5c() ]
  000002a100017201 taskq_thread+0x13c()
  000002a1000172d1 thread_start+4()
stack pointer for thread 2a10000fcc0: 2a10000f141
[ 000002a10000f141 sema_p+0x25c() ]
  000002a10000f1f1 kmdbmod`kctl_wr_thread+0xbc()
  000002a10000f2d1 thread_start+4()
stack pointer for thread 2a100047cc0: 2a100046f51
  000002a100046ff1 clock+0x20()
  000002a1000470e1 cyclic_softint+0x374()
  000002a100047221 cbe_level10+8()
  000002a1000472d1 intr_thread+0x294()
  000002a10007f2d1 utl0+0x4c()
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
>> More [, , q, n, c, a] ?

When a system seems hung, I'll typically break into kmdb and do a ::walk thread |::findstack to look for interesting stacks. If there's a pile of threads all asleep with mutex_vector_enter() near the top of the stack, then from there I'll dig down to find who owns what locks, and why the various lock owners aren't letting go (or why they can't let go because they can't run).
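As a hypothetical example of that drill-down (the addresses below are made up): the address of the mutex being waited on generally shows up as an argument in the mutex_vector_enter() frame. Feeding it to the ::mutex dcmd shows whether the lock is held and which thread owns it, and the owner goes straight back into ::findstack to see what it's stuck on:

> 30001234560::mutex
> 300187e6cc0::findstack

Repeat until you find the thread at the bottom of the pile that's holding a lock but not making progress.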

Let's walk all the CPU structures, and for each of them print out the cpu_thread field (a pointer to the kthread_t for the on-proc thread), and for each of those threads, let's print out their t_pri field, the thread's current priority:

> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0x3b
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff

So most of the CPUs are running a priority -1 thread (idle), and one is running a priority 59 thread (my mdb process). You'll find that ::print is your friend:

> cpu0::print cpu_t cpu_disp |::print disp_t
{
    disp_lock = 0
    disp_npri = 0xaa
    disp_q = 0x309c1a19000
    disp_q_limit = 0x309c1a19ff0
    disp_qactmap = 0x300425ed380
    disp_maxrunpri = 0xffff
    disp_max_unbound_pri = 0xffff
    disp_nrunnable = 0
    disp_cpu = cpu0
}

mdb and kmdb know how to print out *every* structure thanks to the CTF data compiled in. This eliminates the need for clunky adb macros. Relating to the scheduler/dispatcher, you may want to consider getting familiar with the kthread_t, cpu_t, and disp_t structures.
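Because the type information is all there, you can also interrogate the types themselves. Two dcmds I find handy (behavior as I recall it; the sizes and offsets printed will of course differ between builds and architectures): ::sizeof reports a structure's size, and ::offsetof the byte offset of a member, which is useful when you're staring at raw memory:

> ::sizeof kthread_t
> ::offsetof kthread_t t_pri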
So with a basic knowledge of a few structures, a few dcmds and walkers, you're on your way! The same tricks we've been doing here are of course relevant on crash dumps. In fact, let's take a look at a crash dump from a recent interrupt related scheduling bug...
The ecache_scrubber stuck in run queue limbo bug
A few weeks back Shiv Agarwal pointed me at his Sun Fire 6900 test rig that seemed to be hard hanging every now and again while executing some Dynamic Reconfiguration (DR) stress tests. (DR refers to the machine's ability to add and remove CPU/memory/IO boards, disks, etc. from a running system without requiring a reboot.) Hard hangs are particularly frustrating, because the system is so hung that it's impossible to break into kmdb for further analysis. This is where the "deadman" feature in Solaris becomes useful. When enabled, every second a high (level 15) interrupt will fire on every CPU and execute the deadman() function, which watches the kernel variable "lbolt" (a tick counter incremented each time the clock interrupt fires).

If the deadman observes that lbolt hasn't changed in a while, it will induce a panic. The deadman feature can be enabled by setting the kernel variable "snooping" to 1 in /etc/system:

set snooping = 1

... or by hotpatching the variable's value early in boot.
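For example, if the box was booted with kmdb loaded (boot -k), you can do that hotpatch from the debugger prompt; something along these lines, where the /W format verb writes a 32-bit value and :c resumes the kernel:

[0]> snooping/W 1
[0]> :c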

Shiv enabled the deadman, and when he reproduced the problem, instead of hard hanging, the system panicked:

> panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of
> clock inactivity
>
> 000002a10005fdf0 genunix:cyclic_fire+64 (300138a9fb8, 20d5a6fe1472,
> 3b9ac800, 0, 300121d9698, 300348b0280)
>   %l0-3: 00000300195648c0 0000030012a7eb80 000020d5a6fe17c0 000000003b9aca00
>   %l4-7: 00000300348b0320 000020d5a77b5280 0000000000000014 0000000000000004
> 000002a10005fea0 unix:cbe_level14+38 (0, 0, e, 30006356000, 400080, 100aef8)
>   %l0-3: 000000000000022e 0000000000010002 0000000001911ab8 0000000000000000
>   %l4-7: 0000000000000001 0000000000000001 0000000001911ab0 0000000001910470
> 000002a10005ff50 unix:current_thread+140 (1, 6912166044, 6912166040,
> 18af260, f0d0e, f0d0f)
>   %l0-3: 0000000001007674 000002a102682a31 000000000000000e 0000000000010002
>   %l4-7: 00000000018670b0 0000000000000000 000000000000000b 000002a1026832e0
> 000002a102683380 unix:mp_cpu_quiesce+90 (180c000, 200, 1, 0, 3938700, fc00)
>   %l0-3: 0000000000000000 0000000000000000 0000000000000000 00000000018af260
>   %l4-7: 00000000018af000 00000000018aa660 0000000000000000 000000000085daf8
> 000002a102683430 sbdp:sbdp_cpu_poweroff+d4 (180c000, 1c00, 0, 0, 0,
> f082e520)
>   %l0-3: 00000000000000bb 00000000705e3e20 00000000705e3c00 0000030015828000
>   %l4-7: 000000007043dc00 000007fff0914000 0000000000001fff 00000000705e3e18
> 000002a102683500 unix:cpu_poweroff+24 (180c000, 10, 22, 2f, 2f, 180c000)
>   %l0-3: 0000000000000000 00000000705e2900 00000000705e2930 0000000000000030
>   %l4-7: 000000000190be28 000002a102683590 fffffffff082e520 00000000705e2800
> 000002a1026835b0 sbd:sbd_pre_detach_cpu+1b0 (300062944c0, 300139c0000,
> 180c000, 0, 30006294510, 30012bdd2c8)
>   %l0-3: 0000030015848000 000000000000008f 0000030015b65658 00000000000bab15
>   %l4-7: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000000
> 000002a102683660 sbd:sbd_dev_unconfigure+50 (300062944c0, 300139c0000,
> 1, 0, 0, 2a10268371c)
>   %l0-3: 0000000000000005 0000000000000001 0000000000000019 000003000f675180
>   %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
> 000002a102683720 sbd:sbd_exec_op+94 (300062944c0, 0, 445000, 7, 7, 8c)
>   %l0-3: 0000030015848000 00000000704bf800 0000000000000003 ffffffffffffff1f
>   %l4-7: ffffffffffffffd3 0000000000445208 0000030006294510 0000000000000000
> 000002a1026837f0 sbd:sbd_ioctl+150 (a500000000, 445208, 293c0, 100003,
> 2a102683977, 445000)
>   %l0-3: 0000000000000003 00000000704bf800 0000000000445208 0000000000445201
> 000002a102683990 genunix:ioctl+184 (3, 30019513580, 293c0, ff1d3108,
> ff1ea000, 445208)
>   %l0-3: 000003001301c590 000003001301c5b0 0000000000000004 0000000000000001
>   %l4-7: 000003001301c5b0 0000000000000000 0000000000000000 0000000000000000
>
> panic: entering debugger (continue to save dump)

[1]> $c
kaif_enter(18340f8, 0, 11e0a14, 1828400, 11e0a14, 4c00)
panicsys+0x408(18b3410, 18aa2b0, 30006356000, 1, 20040, 1883800)
vpanic+0xcc(11d6c20, 2a10005fe78, 0, 0, 43c998540, 1b5872)
panic+0x1c(11d6c20, 9, 18c9000, 11d6c00, 18c9000, 0)
cyclic_fire+0x64(300138a9fb8, 20d5a6fe1472, 3b9ac800, 0, 300121d9698,
300348b0280)
cbe_level14+0x38(0, 0, e, 30006356000, 400080, 100aef8)
current_thread+0x140(1, 6912166044, 6912166040, 18af260, f0d0e, f0d0f)
mp_cpu_quiesce+0x90(180c000, 200, 1, 0, 3938700, fc00)
sbdp`sbdp_cpu_poweroff+0xd4(180c000, 1c00, 0, 0, 0, f082e520)
cpu_poweroff+0x24(180c000, 10, 22, 2f, 2f, 180c000)
sbd`sbd_pre_detach_cpu+0x1b0(300062944c0, 300139c0000, 180c000, 0, 30006294510,
30012bdd2c8)
sbd`sbd_dev_unconfigure+0x50(300062944c0, 300139c0000, 1, 0, 0, 2a10268371c)
sbd`sbd_exec_op+0x94(300062944c0, 0, 445000, 7, 7, 8c)
sbd`sbd_ioctl+0x150(a500000000, 445208, 293c0, 100003, 2a102683977, 445000)
ssm`ssm_ioctl+0xcc(a500000000, 3000313f880, 293c0, 100003, 3001218cf80, 445208)
ioctl+0x184(3, 30019513580, 293c0, ff1d3108, ff1ea000, 445208)
syscall_trap32+0xcc(3, 445208, 293c0, ff1d3108, ff1ea000, b)

OK, so the golden question: why wasn't the clock interrupt able to fire so that it could change lbolt and make deadman() happy?
Using ::cpuinfo to look around:

> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
  1 0000187e7a0  1b    9    0  49   no    no t-1    30018dd6cc0 cfgadm
  2 300061d2000   d   10    0 169   no    no t-0    2a101433cc0 sched
  3 3000631e000  2d    0    0  -1   no    no t-31541 2a102053cc0 (idle)
  8 30006340000   d    8    0 169   no    no t-0    2a101943cc0 sched
 10 30008c74000  1d   10    0 169   no    no t-0    2a10117bcc0 sched
 11 30008864000   d    6    0 169   no    no t-0    2a101f2bcc0 sched
 16 30008442000  1d    9    0 169   no    no t-0    2a101d23cc0 sched
 17 3000da1e000  1d    6    0 169   no    no t-0    2a100611cc0 sched
 18 3000c9fc000  1d    6    0 169   no    no t-0    2a101983cc0 sched
 19 30005bda000  1d    8    0 169   no    no t-0    2a1020a3cc0 sched

Who are the priority 169 threads running on CPUs 2 and 8-19?

> 2a101433cc0::findstack
stack pointer for thread 2a101433cc0: 2a101433021
[ 000002a101433021 panic_idle+0x1c() ]
  000002a1014330d1 ktl0+0x48()
  000002a101433221 cpu_pause+0x5c()
  000002a1014332d1 thread_start+4()
>

Ah, those are pause threads. There are certain lists and data structures walked by the dispatcher code without any locks held. One example is the per-processor-set list of active CPUs walked in disp_anywork(), a function used by the dispatcher to determine if there's anything better to run in the system besides idling. This keeps the dispatcher's hot paths lock-free, at the cost of requiring that all CPUs be paused (running high priority threads that just spin) whenever those lists/structures are changed around (ensuring no other CPUs are executing dispatcher code)...which of course happens very rarely.
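For the curious, here's a rough sketch of the kind of lockless traversal disp_anywork() does (simplified; the real function in disp.c has a few more checks):

/*
 * Sketch of disp_anywork(): called from the idle path to ask "is there
 * anything runnable I could go get?".  Note the walk of the partition's
 * CPU list with no locks held -- which is why changing that list
 * requires pausing all CPUs first.  (Simplified from disp.c.)
 */
int
disp_anywork(void)
{
        cpu_t   *cp = CPU;
        cpu_t   *ocp;

        if (cp->cpu_disp->disp_nrunnable != 0)
                return (1);             /* work sitting on our own queue */

        if (!(cp->cpu_flags & CPU_OFFLINE)) {
                for (ocp = cp->cpu_next_part; ocp != cp;
                    ocp = ocp->cpu_next_part) {
                        /* unbound work we could steal from a sibling? */
                        if (ocp->cpu_disp->disp_max_unbound_pri != -1)
                                return (1);
                }
        }
        return (0);
}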

OK, so some of the CPUs are running their pause threads, while CPUs 0 and 3 are in idle(). The thread on CPU1 is in the middle of a DR operation. Looking at the first argument to mp_cpu_quiesce() above, we see it's trying to power off CPU0 and quiesce its interrupts.

> 30018dd6cc0::findstack
stack pointer for thread 30018dd6cc0: 2a102682ae1
  000002a102682b81 mp_cpu_quiesce+0x90()
  000002a102682c31 sbdp_cpu_poweroff+0xd4()
  000002a102682d01 cpu_poweroff+0x24()
  000002a102682db1 sbd_pre_detach_cpu+0x1b0()
  000002a102682e61 sbd_dev_unconfigure+0x50()
  000002a102682f21 sbd_exec_op+0x94()
  000002a102682ff1 sbd_ioctl+0x150()
  000002a1026830c1 ssm_ioctl+0xcc()
  000002a102683191 ioctl+0x184()
  000002a1026832e1 syscall_trap32+0xcc()
> mp_cpu_quiesce+0x90::dis
mp_cpu_quiesce+0x80:            nop
mp_cpu_quiesce+0x84:            clr       %g2
mp_cpu_quiesce+0x88:            ba        +0x34         <mp_cpu_quiesce+0xbc>
mp_cpu_quiesce+0x8c:            cmp       %g2, 0
mp_cpu_quiesce+0x90:            call      +0x1db71c     <drv_usecwait>
mp_cpu_quiesce+0x94:            mov       %i2, %o0
mp_cpu_quiesce+0x98:            add       %l7, 1, %l7
mp_cpu_quiesce+0x9c:            cmp       %l7, %i4
mp_cpu_quiesce+0xa0:            bl,pt     %icc, -0x40   <mp_cpu_quiesce+0x60>

Which, looking at the C code, puts us here in mp_cpu_quiesce():

        /*
         * Declare CPU as no longer being READY to process interrupts and
         * wait for them to stop. A CPU that is not READY can no longer
         * participate in x-calls or x-traps.
         */
        cp->cpu_flags &= ~CPU_READY;
        CPUSET_DEL(cpu_ready_set, cpuid);
        membar_sync();

        for (i = 0; i < sanity_limit; i++) {
                if (cp->cpu_intr_actv == 0 &&
                    cp->cpu_thread == cp->cpu_idle_thread) {
                        found_intr = 0;
                        break;
                }
                DELAY(1); <-------------- **
        }

So CPU1 is caught in the for loop waiting for CPU0's cpu_intr_actv bitmask to become empty (cpu_intr_actv is a bitmap, kept in the CPU structure, of the interrupt priority levels currently active). Sure enough, it isn't:

> 180c000::print cpu_t cpu_intr_actv
cpu_intr_actv = 0x200

That's bit 9 set (0x200), so a PIL 9 interrupt is still active. Taking a closer look at CPU0:

> 180c000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
                  |    |    |
       RUNNING <--+    |    +--> PIL THREAD
      QUIESCED         |           9 2a101423cc0
        EXISTS         |
       OFFLINE         +-->  PRI THREAD      PROC
                             168 2a101423cc0 sched
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()

Hey look, there's a level 9 interrupt thread runnable and eagerly awaiting the CPU to grab it, and run it.

What's an interrupt thread? Interrupt service code in Solaris executes while the existing thread on the processor is "pinned". That is, when an interrupt happens, the CPU immediately jumps from executing the interrupted thread to the interrupt service code. When complete, control is restored to the interrupted thread and life moves on. It is not often that interrupts lose the CPU prior to finishing execution, but it does happen, most often because the interrupt tries to grab an adaptive lock that is already held. When this happens, the interrupt is turned into a real thread (rather than leeching off the interrupted thread), and blocks as usual. Interrupt threads, like real time threads, are at the top of the food chain in terms of priorities (with the RT class loaded, 160-169). PIL 10 interrupts and below are permitted to grab adaptive locks, and can therefore block (and be turned into real threads). Interrupts above PIL 10 may never block. The dispatcher operates at PIL 11, and if it were possible to take an interrupt that later blocked while already in the dispatcher, all sorts of nastiness could ensue.
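If I have the mapping right (treat this as an educated guess rather than gospel), those ten interrupt-thread priorities map linearly onto the PILs that can block, i.e. a PIL n interrupt thread runs at global priority 159 + n. That squares with the ::cpuinfo -v output above: the PIL 9 ecache scrubber thread shows up at priority 168 (159 + 9 = 168).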

But for our runnable interrupt thread, despite its eagerness to run again, its day will never come. CPU0 is in idle(), but looking at the flags we see that it's OFFLINE and QUIESCED. This is the state your CPU assumes after you use psradm -f, for example. The CPU isn't permitted to execute any threads, and strictly speaking it's dangerous for it to be servicing interrupts. The offline CPU will remain parked in this loop in idle():

                /*
                 * If CPU is completely quiesced by p_online(2), just wait
                 * here with minimal bus traffic until put online.
                 */
                while (cp->cpu_flags & CPU_QUIESCED)
                        (*idle_cpu)();

What happens if the interrupt thread blocks? You're looking at it. The thread will remain unrunnable. Its only hope is for the QUIESCED flag to be cleared, should the CPU be brought back online.

So did this interrupt block? Not exactly. The ecache scrubbing code is brave, and wants to scrub the caches of even offline CPUs. It does this by having one CPU send a cross trap to the target (offline) CPU, which causes the target to take a PIL 9 interrupt. Again, doing this on offline CPUs is very dicey, but looking at the interrupt code executed, there's no evidence of it grabbing any locks. So how did it lose the CPU? Looking at the interrupt thread stack, we find our clue:

> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()
>

The topmost stack frame has already popped off, so really the last bit of code we executed was swtch(), called from intr_thread+0x240:

intr_thread+0x234:              ldub      [%o2 + 0xe9], %o5
intr_thread+0x238:              brz,pt    %o5, +0x1c    <intr_thread+0x254>
intr_thread+0x23c:              mov       %o2, %l3
intr_thread+0x240:              call      +0x486a8      <preempt>
intr_thread+0x244:              mov       %o3, %l2
intr_thread+0x248:              mov       %l3, %o2
intr_thread+0x24c:              mov       %l2, %o3

Uh oh. Looks like intr_thread+0x240 called preempt(), which then called swtch(). This thread voluntarily gave up the CPU. Why? Code exists in intr_thread() that prevents the pathological scenario of a thread being pinned by an interrupt for an excessive period of time. Some networking interrupt handlers will happily stay in interrupt context as long as there's a steady stream of packets to service, so intr_thread(), once it detects that we've processed 16 or so interrupts without giving the poor pinned thread a chance to run, will force itself, and whatever it's pinning, to preempt, giving the interrupted thread a chance to migrate elsewhere. Unfortunately, while it may have had the best intentions, it did this on a QUIESCED CPU, which made the ecache scrubber interrupt thread unrunnable. The ecache scrubber interrupt is sent regularly via a cyclic firing on another CPU (the clock and deadman are both cyclic based).
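The real logic lives in SPARC assembly in intr_thread(), but a paraphrase in C of the policy described above might look like this. Every name here is made up for illustration; only the shape of the check is intended to match:

/*
 * Paraphrase of the escape hatch: count consecutive interrupts taken
 * while the same thread stays pinned, and after roughly 16 of them
 * force a preempt() so the pinned thread can get rescheduled.
 */
#define INTR_PIN_LIMIT  16              /* hypothetical name for the threshold */

static int intrs_while_pinned;          /* hypothetical per-CPU counter */

static void
intr_done_sketch(void)
{
        if (++intrs_while_pinned < INTR_PIN_LIMIT)
                return;                 /* fine, go back to the pinned thread */

        intrs_while_pinned = 0;
        preempt();                      /* ends up in swtch(), per the stack above */
}

In our crash dump, that preempt() is exactly the call we saw at intr_thread+0x240.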

So with the ecache scrubber left unrunnable, CPU0 became stuck at PIL 9, which left CPU1 stuck waiting in a loop for CPU0's PIL to drop. But CPU1 got stuck after it had already paused the other CPUs (done via pause_cpus()).

One key point I should mention: the pause thread raises its CPU's PIL to 10. Hmm, I wonder where clock should be running...

> ::cycinfo -v
...
CPU     CYC_CPU   STATE NELEMS        ROOT            FIRE HANDLER
 10 30011394900  online     12 300189d9480    a926ae697c00 do_scrub_offline

                                       6
                                       |
                    +------------------+------------------+
                    10                                    1
                    |                                     |
          +---------+--------+                  +---------+---------+
          9                  5                  3                   2
          |                  |                  |                   |
     +----+----+        +----+----+        +----+----+         +----+---+
     4         0        11        7        8
     |         |        |         |        |
   +-+-+    +--+-+    +-+-+    +--+-+    +-+-+

         ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
  300189d9480   0    8  low 212603    a926cd692a74  705032 cheetah_nudge_buddy
  300189d94a8   1    2  low 14989212    a926ae697c00   10000 do_scrub
  300189d94d0   2    6  low 14989212    a926ae697c00   10000 do_scrub
  300189d94f8   3    5  low 14989212    a926ae697c00   10000 do_scrub
  300189d9520   4    7 high     0    a926c7527474 1000000 deadman
  300189d9548   5    4  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9570   6    0  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9598   7   10  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d95c0   8   11  low 149892    a926c6410000 1000000 dtrace_state_deadman
  300189d95e8   9    3  low 15139104    a926adf27592    9900 dtrace_state_clean
  300189d9610  10    1 lock 14989212    a926ae697c00   10000 clock <---
  300189d9638  11    9 high     0    a926ae697c00   10000 cbe_hres_tick
...

Wow, CPU10 has the clock cyclic, and there's a *lot* of pending clock interrupts. Note that the clock fires at PIL 10, but the pause thread running on that CPU has masked interrupts at PIL 10 and below. :P Well, there it is, that's why clock couldn't bump lbolt, and after 9 seconds the deadman pulled the plug. A pause thread running on CPU0 (in the past) was probably also responsible for backing up the number of pending PIL 9 scrub requests, such that when the PIL was lowered, the overwhelming number of pending PIL 9 interrupts triggered the preempt mechanism in intr_thread().

The fix for this bug was pretty trivial: intr_thread() shouldn't employ its preempt() mechanism if the CPU is quiesced. This was bug 6256245.

Fun stuff. In another posting, we'll look at how DTrace can be an indispensable tool in tracking down scheduling problems, and the joys of the sched provider.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb