Sunday May 06, 2007

The road ahead...

I'm semi-worried because I should be getting up in a few hours, but the spell of the espresso beans clearly has not i'll keep going. :) Within the last few months, work has wrapped up on our "Processor Groups" project, otherwise known as Multi-Level CMT scheduling optimizations. The putback introduced (among other things) a platform independent abstraction for representing a group of logical CPUs with some physical or characteristic sharing relationship. As of Nevada build 57 (and Solaris 10 update 4), Solaris uses this abstraction to construct groupings of CPUs that have performance relevant hardware sharing relationships, including shared caches, pipelines, pipes to memory, etc. The kernel scheduler/dispatcher then implements load balancing and affinity policy against these groupings in an attempt to maximize throughput and improve cache utilization efficiency.

This infrastructural upgrade has replaced the kernel's legacy CMT scheduling framework (implemented in chip.c/chip.h), which provided for only a single level of optimization (physical processor). The PG infrastructure enables Solaris to optimize for N levels, which is needed in cases where multiple levels/types of hardware sharing exists between logical CPUs in the system. Longer term, we're interested in providing a PG based implementation for the scheduling side of the kernel's NUMA optimizations. In addition to simplifying the implementation, this would potentially get us to the point of a having an affinity/load balancing policy implementation that spans both NUMA and CMT.

The road ahead is an exciting one

Over the next year or two, i'll be focusing my efforts on Solaris platform independent power management policy...which will entail bringing (or coordinating to bring) power management awareness to the platform independent kernel subsystems that deal with power manageable resources. We'll start with the dispatcher. :)

"Tesla" is the code name for the project, which will be run "in the open" via OpenSolaris. Over the last week, i've been working on the logistical aspects of the project (getting content on the page, setting up mail aliases, figuring out the Mercurial SCM, etc.). I'm hoping the project will go live either tomorrow (uh, today) or Tuesday.

Technorati Tags: OpenSolaris

Inspiration strikes (but caffeine helps)

Argh, my poor neglected blog. Pulling it up after all this time, I didn't even look at the date of the last entry, and I think i'll keep it that way. With any luck, i'll post enough entries soon enough that by the time I look, I won't feel like more of a slacker than I already do. But at least i'm here now (thanks partially goes to the chocolate covered espresso beans I happened acress a few hours ago). It's past midnight and i'm fairly it's a perfect time to bring things up to date.

"Eric Saxe's Blog" doesn't cut it. We need a proper title. Hmm...:



What better way to engross an impressionable, video game loving 7 year boy with computers than to tell him that inside that unassuming white box programs in glowing red and blue suits were battling it out in a video game arena? For better or worse, the nostalgia of the time is still with me 26 years later. :) Campy? Perhaps, but who's to say that without such inspiration i'd be doing what I love today?

Wednesday Dec 07, 2005

Do-it-yourself Kernel Development Preso

Russ Blaine gave this presentation to the ACM student chapter at Northeastern University. The first part of the presentation describes (in seat gripping detail) our adventures in root causing a \*very\* recent bug:

6348316 cyclic subsystem baffled by x86 softint behavior

It's a nice ride, with lots of sights along the way...the cyclic subsystem, device configuration, autovectored interrupts, and more... The second part of the presentation, is essentially the same as the OpenSolaris presentation Steve presented at UCSD last week, and the last part talks about why the Solaris Kernel Group is such a great place to work, who should work for us, and why.
Technorati Tags: OpenSolaris

Tuesday Dec 06, 2005

UltraSPARC T1 and Solaris: Threading Throughout

We are unveiling several UltraSPARC T1 (aka Niagara) based servers today. If you don't know what the T1 processor is, and you haven't heard about this chip and the systems that will house it, then you really should have a look. Seriously, this chip is impressive. Working at Sun, every now and again i'm fortunate enough to hear about new products and technologies we've got coming down the pipe. When I had first heard about the Niagara (T1) processor I was in disbelief.

32 logical CPUs presented by a single chip at 72 Watts?! Simply amazing.

Housed in T1's 2 square inch package are 8 processor cores, each capable of running 4 threads simultaneously. For me, the gravity of all this really sunk in when I invoked psrinfo(1M) on a test box and watched as the top of the output scrolled out of view in my xterm:
esaxe@ontario-mc25$ psrinfo
0       on-line   since 10/14/2000 20:54:37
1       on-line   since 10/14/2000 20:54:39
2       on-line   since 10/14/2000 20:54:39
3       on-line   since 10/14/2000 20:54:39
4       on-line   since 10/14/2000 20:54:39
5       on-line   since 10/14/2000 20:54:39
6       on-line   since 10/14/2000 20:54:39
7       on-line   since 10/14/2000 20:54:39
8       on-line   since 10/14/2000 20:54:39
9       on-line   since 10/14/2000 20:54:39
10      on-line   since 10/14/2000 20:54:39
11      on-line   since 10/14/2000 20:54:39
12      on-line   since 10/14/2000 20:54:39
13      on-line   since 10/14/2000 20:54:39
14      on-line   since 10/14/2000 20:54:39
15      on-line   since 10/14/2000 20:54:39
16      on-line   since 10/14/2000 20:54:39
17      on-line   since 10/14/2000 20:54:39
18      on-line   since 10/14/2000 20:54:39
19      on-line   since 10/14/2000 20:54:39
20      on-line   since 10/14/2000 20:54:39
21      on-line   since 10/14/2000 20:54:39
22      on-line   since 10/14/2000 20:54:39
23      on-line   since 10/14/2000 20:54:39
24      on-line   since 10/14/2000 20:54:39
25      on-line   since 10/14/2000 20:54:39
26      on-line   since 10/14/2000 20:54:39
27      on-line   since 10/14/2000 20:54:39
28      on-line   since 10/14/2000 20:54:39
29      on-line   since 10/14/2000 20:54:39
30      on-line   since 10/14/2000 20:54:39
31      on-line   since 10/14/2000 20:54:39
Yes, I know the system's clock is off by a few years. But seriously, output like this is something i'm used to seeing on monsters like the Sun Fire E25K and Sun Fire E6900 Servers. It was mind expanding indeed to see this sort of output from a small box with but a single physical processor.

Like UltraSPARC-IV and UltraSPARC-IV+ the T1 implements a Chip Multi-Threading (CMT) architecture...(which simply means it is able to run multiple threads of execution simultaneously). For a nice bit of background on CMT technology check out this article (I won't rehash it all here).

Solaris, as you might imagine is a natural fit for CMT systems like the T1000 and T2000, as it has efficiently operated across systems having twice as many CPUs (and more) for years.

For CMT however, acheiving good performance requires more than simply being able to scale. In a previous blog entry I talked about some of the CMT scheduling optimizations we've implemented in Solaris. Andrei will be discussing these optimizations more specifically in the context of the T1 processor (so again, I won't rehash), but it is worth underscoring that (especially for threaded processor architectures), the optimized thread placement and load balancing performed by the scheduler is a huge performance win.

Looking ahead, it's likely that workload characterization is going to be an important (and interesting) area of research. For example, we know that throughput on threaded processor architectures is maximized when threads running on the same core (sharing the same pipeline) are able to effectively execute in each other's stall cycles. CPI (Cycles Per Instruction) should therefore be an interesting metric to note when trying to characterize a given workload's scaling ability on this architecture. What other workload characteristics and metrics will be important/useful to collect and observe? We've got our work cut out for us. :)

Technorati Tags:

Wednesday Nov 30, 2005

Good Times @ UCSD

We had a great time presenting to Professor Snoeren's undergrad OS course at UCSD yesterday. Most of the students had used Solaris in the course of their schoolwork, but hadn't yet dabbled much with OpenSolaris. The presentation Steve put together was a very nice introduction to the project...showing how the students could get involved, how to build the source, etc.

Here's Steve's OpenSolaris presentation.

We then followed up with a 30 minute presentation (with demos) of DTrace, which was well received. The presentation I used was derived from another (much longer) presentation Bryan, Mike, and Adam put together. The first part of the presentation provides the foundations for understanding why DTrace is such a key part of Solaris's observability portfolio. The rest of the presentation is structured mainly around the demos which highlight some of the key DTrace features. I would have loved to delve deeper, but there's only so much one can do in 30 minutes.

Here's the DTrace presentation I used.
Here's the corresponding demo scripts.

Monday Nov 28, 2005

Off to UCSD

Tomorrow Steve Lau and I are flying down to our old stomping ground at UCSD, to present OpenSolaris and DTrace to Professor Snoeren's undergraduate operating systems class (CSE 120). What's fun is that the room in which we will be presenting (Center Hall room 109) is the same room in which I was sitting listening to OS lectures (as a student) back in the day. It's funny how some things come around full circle. Steve put together a great set of slides giving an OpenSolaris crash course and i'll be demoing / talking about DTrace. It should be a blast!...(except for the getting up at 4:45 to catch the early flight part) ;)

Thursday Jun 16, 2005

And now for these messages...

I'd like to encourage anyone with an interest in [Open]Solaris performance to participate in the performance community at From there you can join the discussion list, raise issues, ask questions, and exchange ideas.

Much of the Kernel Performance Group (my group) is already heavily involved with the community and OpenSolaris in general. We're very interested in externalizing much of what has (before OpenSolaris) been quite internal. At our project team meeting yesterday we tossed around some ideas for how we could open up some of our meetings to the community. Very exciting indeed!!

Tuesday Jun 14, 2005

Debugging Solaris Scheduling Problems, and Other Fun Party Tricks

Debugging Solaris Scheduling Problems, and Other Fun Party Tricks The day has arrived. At last. This morning you awoke to the sound of birds singing a pleasant song, sunshine beaming in from outside...and your first waking thought brought a smile to your it did mine. Today Solaris is open, and to you we proudly offer it. Welcome.

Great. So now what? OpenSolaris is yours, and you'd like to get your feet wet, right? Well, where should we start? Over the last few years most of my professional time has been spent understanding, improving and evolving the scheduler/dispatcher. That's right, many a performance issue have I seen, and to many a hung system have I cursed. I've learned a lot in that time, but the time has come to begin dumping a cerebral core of my share of arcane information that will hopefully save someone else working in this area of the kernel some time and frustration. That's not to say that the scheduler/ dispatcher code is frustrating, but some of it is "historical", and more often than not, it's easier to understand why and how the code works if you know why it was written in the first place.
The Dispatcher and Scheduling Classes Intro
When someone says "hey, I think there's a scheduling problem here", they might be talking about one (or more) of a few different things:
  • Unfairness - Some threads are getting more (or less) CPU time than they should
  • Starvation - An extreme example of the fairness problem, some runnable thread(s) are barely running, or aren't getting to run at all
  • Excess Migrations - Some thread(s) are hopping between CPUs too often. This can hurt performance since a migrating thread will likely be welcomed by a cold cache.
  • Poor load balancing - Scheduling threads to run on CPUs in such a way that they are forced to excessively contend for resources...where a resource could be CPU time, bandwidth, use of a shared component (e.g. memory controller, pipeline, etc.)
But really, the above deals with both dispatching (where threads run, (on which CPUs)) and scheduling (when they run (designated by thread priorities)).

In Solaris, the core dispatcher code lives in disp.c under usr/src/uts/common/disp. When a thread becomes runnable, this code is responsible for picking the CPU on which the thread should next run. When it will get a chance to run depends on that thread's priority relative to other runnable threads waiting for that CPU. A threads priority is governed by it's scheduling class. You can take a look at the current scheduling classes loaded on the system from within mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp crypto ptm logindmux lofs cpc ]
> ::class
SLOT NAME       INIT FCN                 CLASS FCN
   0 SYS        sys_init                 sys_classfuncs
   1 TS         ts_init                  ts_classfuncs
   2 FX         fx_init                  fx_classfuncs
   3 IA         ia_init                  ia_classfuncs
   4 RT         rt_init                  rt_classfuncs
   5            0                        0
   6            0                        0
   7            0                        0
   8            0                        0
   9            0                        0

Note that mdb -k invokes the debugger on the running kernel, \*very\* useful for debugging a live system. If you want to debug the running kernel, and have it stop so that you can look at a "snapshot" of the kernel, set breakpoints, data watchpoints, etc, you would use "mdb -K", which would drop you into kmdb. For those who remember kadb, kmdb is it's successor. All hail ::print!!! So moving along, from ::class we see:
  • System scheduling class - For kernel service threads. Threads in this class have no time quantums, and they will remain on the CPU until such time as they voluntarily yield it. Typical thread priorities: 60 - 99
  • Time Share scheduling class - The default class for user threads. Threads in this class will have their priority lowered as they consume more CPU time, and increased as they use less. Thread priorities: 0 - 59
  • FX (fixed priority) scheduling class - Threads in this class have fixed priority. By default it's 0, but privileged users can run threads in this scheduling class at priorities greater than 0. Thread priorities: 0 - 60.
  • Interactive scheduling class - Used by the windowing system. Very similar to TS, except that there's a mechanism allowing threads to have a "boost" of priority. The windowing system uses this to give processes associated with "in focus" windows higher priority. Thread priorities: 0 - 59
  • Real Time scheduling class - Threads in the RT class assume the highest system priorities, even higher than system. Thread priorities: 100 - 159
You can find more about these scheduling classes from the priocntl(1) man page, Note that there's a difference between a scheduling class priority, and the thread's system wide priority. The priorities cited above are system priorities, which is the global, scheduling class independent scale of priorities used for all threads by the dispatcher. In the source base, the code which implements these scheduling classes can again be found under usr/src/uts/common/disp.

esaxe@sark$ ls
SCCS/         disp_lock.c   ia.c          shuttle.c     ts_dptbl.c
class.c       fss.c         priocntl.c    sysclass.c
cpupart.c     fx.c          rt.c          thread.c
disp.c        fx_dptbl.c    rt_dptbl.c    ts.c

As an example of how the dispatcher and scheduling classes hook together, let's look at preempt(), the function invoked when the currently running thread on a CPU should yield in favor of a higher priority runnable thread:

 \* Preempt the currently running thread in favor of the highest
 \* priority thread.  The class of the current thread controls
 \* where it goes on the dispatcher queues. If panicking, turn
 \* preemption off.
        kthread_t       \*t = curthread;
        klwp_t          \*lwp = ttolwp(curthread);

        if (panicstr)

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");


        if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
                 \* this thread has already been chosen to be run on
                 \* another CPU. Clear kprunrun on this CPU since we're
                 \* already headed for swtch().
                CPU->cpu_kprunrun = 0;
                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
        } else {
                if (lwp != NULL)
                CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);

                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

                swtch();                /\* clears CPU->cpu_runrun via disp() \*/

Typically the "else" route is taken, and we see that the scheduling class specific preemption routine is called, CL_PREEMPT(). If the thread is in the TS (timeshare) scheduling class, then this evaluates to ts_preempt(), which ultimately results in either setbackdq() or setfrondq() being called, which place the thread at the front, or the back of the queue (respectively) of threads that exist at that thread's priority. At the end of preempt() above, we see swtch() is called (switch threads), which in turn calls disp() (return the highest priority thread that the CPU can now run (or the idle thread if there's nothing)), and finally resume() which does the actual context switch. Whew!

There's something important to note here. All this is being done in the context of the thread that's being preempted! That is, the thread that is being preempted is actually placing itself on a run queue somewhere, is finding the next best thread to run, and is then context switching to it! The thread is told it needs to preempt by the setting of the "cpu_runrun, or cpu_kprunrun" flags in the CPU structure. That flag may have been set by another, higher priority thread that placed itself on the run queue (see cpu_resched()). Whenever the thread takes a trap, returns from a system call, etc. It checks for this flag and proceeds to preempt itself if it is set. The idea that the dispatcher isn't some group of one or more bureaucratic threads "directing traffic" may seem surprising, but it's implemented this way for performance and scalability. All threads on the system share in executing dispatcher code, and we must take care that the bits of policy employed by every thread add up to serve the greater good.
So lets talk a bit about debugging. What's that you say, debugging is painful? Tired of adding printk() statements here and there, re-compiling, and adding more printk() statements? I feel your pain. But let me assure you that when that smile erupted on your face this morning, it was for a reason. I enjoy developing Solaris, and it's not because I think hunting down fantastically complex and subtle problems with print statements is fun, it's because tools like (k)mdb and DTrace free me to investigate, observe, and understand what's going on without having to figure out what's needed to make the problem observable.

For now, lets look at one of the tools at your disposal:
(k)mdb, dcmds, and CTF
Let's invoke the kernel debugger on my build machine's running kernel to see what it's up to:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp ptm logindmux lofs cpc ]
> ::cpuinfo
  0 0000180c000  1b    0    0  -1   no    no t-11   2a10001fcc0 (idle)
  1 300423b2000  1b    0    0  -1   no    no t-63   2a1002f7cc0 (idle)
  2 300423ba000  1b    0    0  -1   no    no t-1    2a10037fcc0 (idle)
  3 300423c6000  1b    0    0  -1   no    no t-91   2a1003c7cc0 (idle)
  8 300423ce000  1b    0    0  59   no    no t-0    30d4775abc0 mdb
  9 300423de000  1b    0    0  -1   no    no t-0    2a100539cc0 (idle)
 10 300423ea000  1b    0    0  -1   no    no t-71   2a1005c1cc0 (idle)
 11 300423f2000  1b    0    0  -1   no    no t-8    2a100609cc0 (idle)
 16 300423fa000  1b    0    0  -1   no    no t-10   2a100691cc0 (idle)
 17 30042402000  1b    0    0  -1   no    no t-191  2a100719cc0 (idle)
 18 3004240a000  1b    0    0  -1   no    no t-0    2a1007a1cc0 (idle)
 19 30042412000  1b    0    0  -1   no    no t-0    2a1007e9cc0 (idle)

Well, that's not very exciting. All but CPU 8 is idle, and CPU 8 is running my mdb process thread. Let me kick off a build:

> ::cpuinfo
  0 0000180c000  1b    0    0  30   no    no t-0    325d59b8820 acomp
  1 300423b2000  1b    1    0  30   no    no t-0    308e67c02a0 acomp
  2 300423ba000  1b    1    0  55   no    no t-2    30760fbce60 cg
  3 300423c6000  1b    2    0  53   no    no t-1    33886adf500 cg
  8 300423ce000  1b    1    0  39   no    no t-1    35876d042c0 iropt
  9 300423de000  1b    2    0  59   no    no t-1    30d4775abc0 mdb
 10 300423ea000  1b    0    0  42   no    no t-1    3081f8cc200 cg
 11 300423f2000  1b    2    0  32   no    no t-0    319c6969400 cg
 16 300423fa000  1b    1    0  53   no    no t-0    338853f1ae0 cg
 17 30042402000  1b    0    0  43   no    no t-0    307215023a0 cg
 18 3004240a000  1b    1    0  49   no    no t-0    30d4771c5e0 cg
 19 30042412000  1b    0    0  43   no    no t-1    308e67dbc80 cg

That's better, ::cpuinfo gives you an overview of what threads every CPU is running. ::cpuinfo -v gives you even more. Lets look at CPU 1 in detail:

> 300423b2000::cpuinfo -v
  1 300423b2000  1b    4    0   0   no    no t-5    3075514a440 acomp
                  |    |
       RUNNING <--+    +-->  PRI THREAD      PROC
         READY                 0 35876d042c0 sh
        EXISTS                 0 3075514b140 acomp
        ENABLE                 0 30041c654a0 dmake
                               0 308e629b120 dmake

So CPU1 has 4 runnable threads, all at priority 0. The current thread "acomp" has been on for 5 ticks (5/100) of a second, and this CPU has the right set of flags set for an online and active CPU.
There's a whole array of dcmds available to ease looking around the system. You can get a list using the ::dcmds dcmd. Also of interest are the walkers, which allow you to easily traverse lists of things....CPUs, threads, pages, etc. For example, let's walk the list of all active thread structures:

> ::walk thread
>> More [, , q, n, c, a] ?

(It's a long list, and i'll spare you all the output). Now let's pipe the output of ::walk thread to ::findstack, which prints a stack trace for a thread given a thread structure:

> ::walk thread |::findstack
stack pointer for thread 180e000: 180b0f1
[ 000000000180b0f1 sched+0x570() ]
  000000000180b211 main+0x458()
  000000000180b2f1 _start+0x110()
stack pointer for thread 2a10001fcc0: 2a10001f021
[ 000002a10001f021 ktl0+0x48() ]
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
stack pointer for thread 2a100017cc0: 2a100017151
[ 000002a100017151 cv_wait+0x5c() ]
  000002a100017201 taskq_thread+0x13c()
  000002a1000172d1 thread_start+4()
stack pointer for thread 2a10000fcc0: 2a10000f141
[ 000002a10000f141 sema_p+0x25c() ]
  000002a10000f1f1 kmdbmod`kctl_wr_thread+0xbc()
  000002a10000f2d1 thread_start+4()
stack pointer for thread 2a100047cc0: 2a100046f51
  000002a100046ff1 clock+0x20()
  000002a1000470e1 cyclic_softint+0x374()
  000002a100047221 cbe_level10+8()
  000002a1000472d1 intr_thread+0x294()
  000002a10007f2d1 utl0+0x4c()
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
>> More [, , q, n, c, a] ?

When a system seems hung. I'll typically break into kmdb, and will do a ::walk thread |::findstack to look for interesting stacks. If there's a pile of threads all asleep with mutex_vector_enter() near the top of the stack, then from there i'll dig down to find who owns what locks, and why the various lock owners aren't letting go (or why they can't let go because they can't run).

Let's walk all the CPU structures, and for each of them print out the cpu_thread field (pointer to the kthread_t for the on_proc) thread, and for each of those threads, let's print out their t_pri field, the thread's current priority:

> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0x3b
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff

So most of the CPUs are running a priority -1 thread (idle), and one is running a priority 59 thread (my mdb process). You'll find that ::print is your friend:

> cpu0::print cpu_t cpu_disp |::print disp_t
    disp_lock = 0
    disp_npri = 0xaa
    disp_q = 0x309c1a19000
    disp_q_limit = 0x309c1a19ff0
    disp_qactmap = 0x300425ed380
    disp_maxrunpri = 0xffff
    disp_max_unbound_pri = 0xffff
    disp_nrunnable = 0
    disp_cpu = cpu0

mdb and kmdb know how to print out \*every\* structure thanks to the CTF data compiled in. This eliminates the need for clunky adb macros. Relating to the scheduler/dispatcher, you may want to consider getting familiar with the kthread_t, cpu_t, and disp_t structures.
So with a basic knowledge of a few structures, a few dcmds and walkers, you're on your way! The same tricks we've been doing here are of course relevant on crash dumps. In fact, let's take a look at a crash dump from a recent interrupt related scheduling bug...
The ecache_scrubber stuck in run queue limbo bug
A few weeks back Shiv Agarwal pointed me at his Sun-Fire 6900 test rig that seemed to be hard hanging every now and again while executing some Dynamic Reconfiguration (DR) stress tests. (DR refers to the machine's ability to add and remove CPU/Memory/IO boards/Disks, etc from a running system without requiring reboot). Hard hangs are particularly frustrating, because the system is so hung that it's impossible to break the system into kmdb for further analysis. This is where the "deadman" feature in Solaris becomes useful. When enabled, every second, a high (level 15) interrupt will fire on every CPU, and will execute the deadman() function, which watches kernel variable "lbolt" (a tick counter incremented each time the clock interrupt fires).

If the deadman observes that lbolt hasen't changed in a while, it will induce a panic. The deadman feature can be enabled by setting the kernel variable "snooping" to 1 in /etc/system:

set snooping = 1

... or by hotpatching the variable's value early in boot.

Shiv enabled the deadman, and when he reproduced the problem instead of hard hanging the system panicked:

> panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of
> clock inactivity
> 000002a10005fdf0 genunix:cyclic_fire+64 (300138a9fb8, 20d5a6fe1472,
> 3b9ac800, 0, 300121d9698, 300348b0280)
>   %l0-3: 00000300195648c0 0000030012a7eb80 000020d5a6fe17c0 000000003b9aca00
>   %l4-7: 00000300348b0320 000020d5a77b5280 0000000000000014 0000000000000004
> 000002a10005fea0 unix:cbe_level14+38 (0, 0, e, 30006356000, 400080, 100aef8)
>   %l0-3: 000000000000022e 0000000000010002 0000000001911ab8 0000000000000000
>   %l4-7: 0000000000000001 0000000000000001 0000000001911ab0 0000000001910470
> 000002a10005ff50 unix:current_thread+140 (1, 6912166044, 6912166040,
> 18af260, f0d0e, f0d0f)
>   %l0-3: 0000000001007674 000002a102682a31 000000000000000e 0000000000010002
>   %l4-7: 00000000018670b0 0000000000000000 000000000000000b 000002a1026832e0
> 000002a102683380 unix:mp_cpu_quiesce+90 (180c000, 200, 1, 0, 3938700, fc00)
>   %l0-3: 0000000000000000 0000000000000000 0000000000000000 00000000018af260
>   %l4-7: 00000000018af000 00000000018aa660 0000000000000000 000000000085daf8
> 000002a102683430 sbdp:sbdp_cpu_poweroff+d4 (180c000, 1c00, 0, 0, 0,
> f082e520)
>   %l0-3: 00000000000000bb 00000000705e3e20 00000000705e3c00 0000030015828000
>   %l4-7: 000000007043dc00 000007fff0914000 0000000000001fff 00000000705e3e18
> 000002a102683500 unix:cpu_poweroff+24 (180c000, 10, 22, 2f, 2f, 180c000)
>   %l0-3: 0000000000000000 00000000705e2900 00000000705e2930 0000000000000030
>   %l4-7: 000000000190be28 000002a102683590 fffffffff082e520 00000000705e2800
> 000002a1026835b0 sbd:sbd_pre_detach_cpu+1b0 (300062944c0, 300139c0000,
> 180c000, 0, 30006294510, 30012bdd2c8)
>   %l0-3: 0000030015848000 000000000000008f 0000030015b65658 00000000000bab15
>   %l4-7: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000000
> 000002a102683660 sbd:sbd_dev_unconfigure+50 (300062944c0, 300139c0000,
> 1, 0, 0, 2a10268371c)
>   %l0-3: 0000000000000005 0000000000000001 0000000000000019 000003000f675180
>   %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
> 000002a102683720 sbd:sbd_exec_op+94 (300062944c0, 0, 445000, 7, 7, 8c)
>   %l0-3: 0000030015848000 00000000704bf800 0000000000000003 ffffffffffffff1f
>   %l4-7: ffffffffffffffd3 0000000000445208 0000030006294510 0000000000000000
> 000002a1026837f0 sbd:sbd_ioctl+150 (a500000000, 445208, 293c0, 100003,
> 2a102683977, 445000)
>   %l0-3: 0000000000000003 00000000704bf800 0000000000445208 0000000000445201
> 000002a102683990 genunix:ioctl+184 (3, 30019513580, 293c0, ff1d3108,
> ff1ea000, 445208)
>   %l0-3: 000003001301c590 000003001301c5b0 0000000000000004 0000000000000001
>   %l4-7: 000003001301c5b0 0000000000000000 0000000000000000 0000000000000000
> panic: entering debugger (continue to save dump)

[1]> $c
kaif_enter(18340f8, 0, 11e0a14, 1828400, 11e0a14, 4c00)
panicsys+0x408(18b3410, 18aa2b0, 30006356000, 1, 20040, 1883800)
vpanic+0xcc(11d6c20, 2a10005fe78, 0, 0, 43c998540, 1b5872)
panic+0x1c(11d6c20, 9, 18c9000, 11d6c00, 18c9000, 0)
cyclic_fire+0x64(300138a9fb8, 20d5a6fe1472, 3b9ac800, 0, 300121d9698,
cbe_level14+0x38(0, 0, e, 30006356000, 400080, 100aef8)
current_thread+0x140(1, 6912166044, 6912166040, 18af260, f0d0e, f0d0f)
mp_cpu_quiesce+0x90(180c000, 200, 1, 0, 3938700, fc00)
sbdp`sbdp_cpu_poweroff+0xd4(180c000, 1c00, 0, 0, 0, f082e520)
cpu_poweroff+0x24(180c000, 10, 22, 2f, 2f, 180c000)
sbd`sbd_pre_detach_cpu+0x1b0(300062944c0, 300139c0000, 180c000, 0, 30006294510,
sbd`sbd_dev_unconfigure+0x50(300062944c0, 300139c0000, 1, 0, 0, 2a10268371c)
sbd`sbd_exec_op+0x94(300062944c0, 0, 445000, 7, 7, 8c)
sbd`sbd_ioctl+0x150(a500000000, 445208, 293c0, 100003, 2a102683977, 445000)
ssm`ssm_ioctl+0xcc(a500000000, 3000313f880, 293c0, 100003, 3001218cf80, 445208)
ioctl+0x184(3, 30019513580, 293c0, ff1d3108, ff1ea000, 445208)
syscall_trap32+0xcc(3, 445208, 293c0, ff1d3108, ff1ea000, b)

Ok, so the golden question. Why wasn't the clock interrupt able to fire so that it could change lbolt to make deadman() happy?
Using ::cpuinfo to look around:

> ::cpuinfo
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
  1 0000187e7a0  1b    9    0  49   no    no t-1    30018dd6cc0 cfgadm
  2 300061d2000   d   10    0 169   no    no t-0    2a101433cc0 sched
  3 3000631e000  2d    0    0  -1   no    no t-31541 2a102053cc0 (idle)
  8 30006340000   d    8    0 169   no    no t-0    2a101943cc0 sched
 10 30008c74000  1d   10    0 169   no    no t-0    2a10117bcc0 sched
 11 30008864000   d    6    0 169   no    no t-0    2a101f2bcc0 sched
 16 30008442000  1d    9    0 169   no    no t-0    2a101d23cc0 sched
 17 3000da1e000  1d    6    0 169   no    no t-0    2a100611cc0 sched
 18 3000c9fc000  1d    6    0 169   no    no t-0    2a101983cc0 sched
 19 30005bda000  1d    8    0 169   no    no t-0    2a1020a3cc0 sched

Who are the priority 169 threads running on 2, 8-19?

> 2a101433cc0::findstack
stack pointer for thread 2a101433cc0: 2a101433021
[ 000002a101433021 panic_idle+0x1c() ]
  000002a1014330d1 ktl0+0x48()
  000002a101433221 cpu_pause+0x5c()
  000002a1014332d1 thread_start+4()

Ah, those are pause threads. There are certain lists and data structures walked by the dispatcher code without any locks held. One example is the per processor set list of active CPUs walked in disp_anywork(), a function used by the dispatcher to determine if there's any thing better to run in the system besides idle(). This enables lockless performance in the dispatcher, at the cost of requiring that all CPUs be paused (run high priority threads that just spin) when those lists/structures are changed around (ensuring no other CPUs are executing dispatcher code) ...which of course happens very rarely.

OK, so some of the CPUs are running their pause threads, while CPUs 0 and 3 are idle(). The thread on CPU1 is in the middle of a DR operation. Looking at the first argument to mp_cpu_quiesce() above, we see it's trying to power off and quiesce CPU0 of interrupts.

> 30018dd6cc0::findstack
stack pointer for thread 30018dd6cc0: 2a102682ae1
  000002a102682b81 mp_cpu_quiesce+0x90()
  000002a102682c31 sbdp_cpu_poweroff+0xd4()
  000002a102682d01 cpu_poweroff+0x24()
  000002a102682db1 sbd_pre_detach_cpu+0x1b0()
  000002a102682e61 sbd_dev_unconfigure+0x50()
  000002a102682f21 sbd_exec_op+0x94()
  000002a102682ff1 sbd_ioctl+0x150()
  000002a1026830c1 ssm_ioctl+0xcc()
  000002a102683191 ioctl+0x184()
  000002a1026832e1 syscall_trap32+0xcc()
> mp_cpu_quiesce+0x90::dis
mp_cpu_quiesce+0x80:            nop
mp_cpu_quiesce+0x84:            clr       %g2
mp_cpu_quiesce+0x88:            ba        +0x34         <mp_cpu_quiesce+0xbc>
mp_cpu_quiesce+0x8c:            cmp       %g2, 0
mp_cpu_quiesce+0x90:            call      +0x1db71c     <drv_usecwait>
mp_cpu_quiesce+0x94:            mov       %i2, %o0
mp_cpu_quiesce+0x98:            add       %l7, 1, %l7
mp_cpu_quiesce+0x9c:            cmp       %l7, %i4
mp_cpu_quiesce+0xa0:            bl,pt     %icc, -0x40   <mp_cpu_quiesce+0x60>

Which, looking at the C code puts us here in mp_cpu_quiesce():

         \* Declare CPU as no longer being READY to process interrupts and
         \* wait for them to stop. A CPU that is not READY can no longer
         \* participate in x-calls or x-traps.
        cp->cpu_flags &= ~CPU_READY;
        CPUSET_DEL(cpu_ready_set, cpuid);

        for (i = 0; i < sanity_limit; i++) {
                if (cp->cpu_intr_actv == 0 &&
                    cp->cpu_thread == cp->cpu_idle_thread) {
                        found_intr = 0;
                DELAY(1); <-------------- \*\*

So CPU1 is caught in the for loop waiting for CPU0's cpu_intr_actv bitmask to become empty (cpu_intr_actv is a bitmap of active interrupts that lives in the CPU structure). Sure enough, it isn't:

> 180c000::print cpu_t cpu_intr_actv
cpu_intr_actv = 0x200

Taking a close look at CPU0:

> 180c000::cpuinfo -v
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
                  |    |    |
       RUNNING <--+    |    +--> PIL THREAD
      QUIESCED         |           9 2a101423cc0
        EXISTS         |
       OFFLINE         +-->  PRI THREAD      PROC
                             168 2a101423cc0 sched
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()

Hey look, there's a level 9 interrupt thread runnable and eagerly awaiting the CPU to grab it, and run it.

What's an interrupt thread? Interrupt service code in Solaris executes while the existing thread on the processor is "pinned". That is, when an interrupt happens, the CPU immediately jumps from executing the interrupted thread to the interrupt service code. When complete, control is restored to the interrupted thread and life moves on. It is not often that interrupts lose the CPU prior to finishing execution, but it does happen, most often because the interrupt tries to grab an adaptive lock that is already held. When this happens, the interrupt is turned into a real thread (rather than leaching off the interrupted thread), and blocks as usual. Interrupt threads, like real time threads are at the top of the food chain in terms of priorities (With the RT class loaded, 160-169). PIL 10 interrupts and below are permitted to grab adaptive locks, and can therefore block (and be turned into real threads). Interrupts above PIL 10 may never block. The dispatcher operates at PIL 11, and if it were possible to take an interrupt that later blocked while already in the dispatcher all sorts of nastiness could ensue.

But for our runnable interrupt thread, despite it's eagerness to run again, it's day will never come. CPU0 is idle(), but looking at the flags we see that it's OFFLINE and QUIESCED. This is the state your CPU assumes after you use psradm -f, for example. The CPU isn't permitted to execute any threads, and strictly speaking it's dangerous for it be servicing interrupts. The offline CPU will remain parked in this loop in idle()

                 \* If CPU is completely quiesced by p_online(2), just wait
                 \* here with minimal bus traffic until put online.
                while (cp->cpu_flags & CPU_QUIESCED)

What happens if the interrupt thread blocks? You're looking at it. The thread will remain unrunnable. It's only hope is for the QUIESCED flag to be cleared should the CPU be brought back online.

So did this interrupt block? Not exactly. The ecache scrubbing code is brave, and wants to scrub the caches of even offline CPUs. It does this by having one CPU send a cross trap to the target (offline) CPU, which causes the target to take a PIL 9 interrupt. Again, this is very dicey doing this on offline CPUs, but looking at the interrupt code executed there's no evidence of it grabbing any locks. So how did it lose the CPU? Looking at the interrupt thread stack we find our clue:

> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()

The topmost stack frame already popped off, so really the last bit of code we executed was switch, called by intr_thread+0x240:

intr_thread+0x234:              ldub      [%o2 + 0xe9], %o5
intr_thread+0x238:              brz,pt    %o5, +0x1c    <intr_thread+0x254>
intr_thread+0x23c:              mov       %o2, %l3
intr_thread+0x240:              call      +0x486a8      <preempt>
intr_thread+0x244:              mov       %o3, %l2
intr_thread+0x248:              mov       %l3, %o2
intr_thread+0x24c:              mov       %l2, %o3

Oh oh. Looks like intr_thread+0x240 called preempt(), which then called swtch(). This thread voluntarily gave up the CPU. Why? Code exists in intr_thread() that prevents that pathological scenario of a thread being pinned by an interrupt for an excessive period of time. Some networking interrupt handlers will happily stay in interrupt context as long as there's a steady stream of packets to service, so intr_thread(), once it detects that we've processed 16 or so interrupts without giving the poor pinned thread a chance to run, will force itself, and whomever it's pinning to preempt to give the interrupted thread a chance to migrate elsewhere. Unfortunately, while it may have had the best intentions, it did this on a QUIESCED CPU, which made the ecache scrubber interrupt thread unrunnable. The ecache scrubber interrupt is sent regularly via a cyclic firing on another CPU (the clock, and deadman are both cyclic based).

So with the ecache scrubber left unrunnable, CPU0 became stuck at PIL 9, which left CPU1 stuck waiting in a loop for CPU0's PIL to drop. But CPU1 got stuck after it had already paused CPUs. (Done via pause_cpus()

One key point I should mention, the pause thread raises it's CPU's PIL to 10. Hmm, I wonder where clock should be running...

> ::cycinfo -v
 10 30011394900  online     12 300189d9480    a926ae697c00 do_scrub_offline

                    10                                    1
                    |                                     |
          +---------+--------+                  +---------+---------+
          9                  5                  3                   2
          |                  |                  |                   |
     +----+----+        +----+----+        +----+----+         +----+---+
     4         0        11        7        8
     |         |        |         |        |
   +-+-+    +--+-+    +-+-+    +--+-+    +-+-+

  300189d9480   0    8  low 212603    a926cd692a74  705032 cheetah_nudge_buddy
  300189d94a8   1    2  low 14989212    a926ae697c00   10000 do_scrub
  300189d94d0   2    6  low 14989212    a926ae697c00   10000 do_scrub
  300189d94f8   3    5  low 14989212    a926ae697c00   10000 do_scrub
  300189d9520   4    7 high     0    a926c7527474 1000000 deadman
  300189d9548   5    4  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9570   6    0  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9598   7   10  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d95c0   8   11  low 149892    a926c6410000 1000000 dtrace_state_deadman
  300189d95e8   9    3  low 15139104    a926adf27592    9900 dtrace_state_clean
  300189d9610  10    1 lock 14989212    a926ae697c00   10000 clock <---
  300189d9638  11    9 high     0    a926ae697c00   10000 cbe_hres_tick

Wow, CPU10 has the clock cyclic, and there's a \*lot\* of pending clock interrupts. Note that the clock fires at PIL 10, but the pause thread running at that CPU has masked interrupts PIL10 and below. :P Well, there it is, that's why clock couldn't bump lbolt, and after 9 seconds the deadman pulled the plug. A pause_thread running on CPU0 (in the past) was probably also responsible for backing up the number of pending PIL 9 scrub requests, such that when PIL was lowered, the overwhelming number of pending PIL 9 interrupts triggered the preempt mechanism in intr_thread().

The fix for this bug was pretty trivial. intr_thread shouldn't employ it's preempt() mechanism if the CPU is quiesced. This was bug 6256245.

Fun stuff. In another posting, we'll look at how DTrace can be an indispensable tool in tracking down scheduling problems, and the joys of the sched provider.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb



« April 2014