Tuesday Dec 06, 2005

UltraSPARC T1 and Solaris: Threading Throughout

We are unveiling several UltraSPARC T1 (aka Niagara) based servers today. If you don't know what the T1 processor is, or haven't heard about the systems that will house it, you really should have a look. Seriously, this chip is impressive. Working at Sun, every now and again I'm fortunate enough to hear about new products and technologies we've got coming down the pipe. When I first heard about the Niagara (T1) processor, I was in disbelief.

32 logical CPUs presented by a single chip at 72 Watts?! Simply amazing.

Housed in the T1's 2-square-inch package are 8 processor cores, each capable of running 4 threads simultaneously. For me, the gravity of all this really sunk in when I invoked psrinfo(1M) on a test box and watched as the top of the output scrolled out of view in my xterm:
esaxe@ontario-mc25$ psrinfo
0       on-line   since 10/14/2000 20:54:37
1       on-line   since 10/14/2000 20:54:39
2       on-line   since 10/14/2000 20:54:39
3       on-line   since 10/14/2000 20:54:39
4       on-line   since 10/14/2000 20:54:39
5       on-line   since 10/14/2000 20:54:39
6       on-line   since 10/14/2000 20:54:39
7       on-line   since 10/14/2000 20:54:39
8       on-line   since 10/14/2000 20:54:39
9       on-line   since 10/14/2000 20:54:39
10      on-line   since 10/14/2000 20:54:39
11      on-line   since 10/14/2000 20:54:39
12      on-line   since 10/14/2000 20:54:39
13      on-line   since 10/14/2000 20:54:39
14      on-line   since 10/14/2000 20:54:39
15      on-line   since 10/14/2000 20:54:39
16      on-line   since 10/14/2000 20:54:39
17      on-line   since 10/14/2000 20:54:39
18      on-line   since 10/14/2000 20:54:39
19      on-line   since 10/14/2000 20:54:39
20      on-line   since 10/14/2000 20:54:39
21      on-line   since 10/14/2000 20:54:39
22      on-line   since 10/14/2000 20:54:39
23      on-line   since 10/14/2000 20:54:39
24      on-line   since 10/14/2000 20:54:39
25      on-line   since 10/14/2000 20:54:39
26      on-line   since 10/14/2000 20:54:39
27      on-line   since 10/14/2000 20:54:39
28      on-line   since 10/14/2000 20:54:39
29      on-line   since 10/14/2000 20:54:39
30      on-line   since 10/14/2000 20:54:39
31      on-line   since 10/14/2000 20:54:39
Yes, I know the system's clock is off by a few years. But seriously, output like this is something I'm used to seeing on monsters like the Sun Fire E25K and Sun Fire E6900 servers. It was mind expanding indeed to see this sort of output from a small box with but a single physical processor.

Like the UltraSPARC-IV and UltraSPARC-IV+, the T1 implements a Chip Multi-Threading (CMT) architecture, which simply means it is able to run multiple threads of execution simultaneously. For a nice bit of background on CMT technology, check out this article (I won't rehash it all here).

Solaris, as you might imagine, is a natural fit for CMT systems like the T1000 and T2000; it has efficiently operated across systems with twice as many CPUs (and more) for years.

For CMT, however, achieving good performance requires more than simply being able to scale. In a previous blog entry I talked about some of the CMT scheduling optimizations we've implemented in Solaris. Andrei will be discussing these optimizations more specifically in the context of the T1 processor (so again, I won't rehash), but it is worth underscoring that, especially for threaded processor architectures, the optimized thread placement and load balancing performed by the scheduler is a huge performance win.

Looking ahead, it's likely that workload characterization is going to be an important (and interesting) area of research. For example, we know that throughput on threaded processor architectures is maximized when threads running on the same core (sharing the same pipeline) are able to effectively execute in each other's stall cycles. CPI (Cycles Per Instruction) should therefore be an interesting metric to note when trying to characterize a given workload's scaling ability on this architecture. What other workload characteristics and metrics will be important/useful to collect and observe? We've got our work cut out for us. :)
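
To make the CPI idea concrete, here's a trivial helper (an illustrative sketch of the arithmetic only, not part of any Solaris tool) that derives CPI from a pair of hardware counter samples like those cpustat(1M) can collect:

```c
#include <assert.h>

/*
 * Cycles Per Instruction from two hardware counter samples taken
 * over the same interval.  A high CPI means lots of stall cycles:
 * exactly the kind of workload that can share a core's pipeline
 * with sibling threads that execute during those stalls.
 */
static double
cpi(unsigned long long cycles, unsigned long long instrs)
{
	return ((double)cycles / (double)instrs);
}
```

So a workload that retires 500,000 instructions in 2,000,000 cycles has a CPI of 4.0, leaving plenty of pipeline slack for its core-mates to soak up.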


Tuesday Jun 14, 2005

Debugging Solaris Scheduling Problems, and Other Fun Party Tricks

The day has arrived. At last. This morning you awoke to the sound of birds singing a pleasant song, sunshine beaming in from outside...and your first waking thought brought a smile to your face...as it did mine. Today Solaris is open, and to you we proudly offer it. Welcome.

Great. So now what? OpenSolaris is yours, and you'd like to get your feet wet, right? Well, where should we start? Over the last few years most of my professional time has been spent understanding, improving, and evolving the scheduler/dispatcher. That's right, many a performance issue have I seen, and at many a hung system have I cursed. I've learned a lot in that time, and the time has come to begin dumping a cerebral core of my share of arcane information that will hopefully save someone else working in this area of the kernel some time and frustration. That's not to say that the scheduler/dispatcher code is frustrating, but some of it is "historical", and more often than not, it's easier to understand why and how the code works if you know why it was written in the first place.
The Dispatcher and Scheduling Classes Intro
When someone says "hey, I think there's a scheduling problem here", they might be talking about one (or more) of a few different things:
  • Unfairness - Some threads are getting more (or less) CPU time than they should
  • Starvation - An extreme example of the fairness problem, some runnable thread(s) are barely running, or aren't getting to run at all
  • Excess Migrations - Some thread(s) are hopping between CPUs too often. This can hurt performance since a migrating thread will likely be welcomed by a cold cache.
  • Poor load balancing - Scheduling threads to run on CPUs in such a way that they are forced to excessively contend for resources...where a resource could be CPU time, bandwidth, use of a shared component (e.g. memory controller, pipeline, etc.)
But really, the above deals with both dispatching (where threads run, i.e. on which CPUs) and scheduling (when they run, as designated by thread priorities).

In Solaris, the core dispatcher code lives in disp.c under usr/src/uts/common/disp. When a thread becomes runnable, this code is responsible for picking the CPU on which the thread should next run. When it will get a chance to run depends on that thread's priority relative to other runnable threads waiting for that CPU. A thread's priority is governed by its scheduling class. You can take a look at the scheduling classes currently loaded on the system from within mdb:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp crypto ptm logindmux lofs cpc ]
> ::class
SLOT NAME       INIT FCN                 CLASS FCN
   0 SYS        sys_init                 sys_classfuncs
   1 TS         ts_init                  ts_classfuncs
   2 FX         fx_init                  fx_classfuncs
   3 IA         ia_init                  ia_classfuncs
   4 RT         rt_init                  rt_classfuncs
   5            0                        0
   6            0                        0
   7            0                        0
   8            0                        0
   9            0                        0

Note that mdb -k invokes the debugger on the running kernel, *very* useful for debugging a live system. If you want to debug the running kernel and have it stop so that you can look at a "snapshot" of the kernel, set breakpoints, data watchpoints, etc., you would use "mdb -K", which drops you into kmdb. For those who remember kadb, kmdb is its successor. All hail ::print!!! So moving along, from ::class we see:
  • System scheduling class - For kernel service threads. Threads in this class have no time quantums, and they will remain on the CPU until such time as they voluntarily yield it. Typical thread priorities: 60 - 99
  • Time Share scheduling class - The default class for user threads. Threads in this class will have their priority lowered as they consume more CPU time, and increased as they use less. Thread priorities: 0 - 59
  • FX (fixed priority) scheduling class - Threads in this class have fixed priority. By default it's 0, but privileged users can run threads in this scheduling class at priorities greater than 0. Thread priorities: 0 - 60.
  • Interactive scheduling class - Used by the windowing system. Very similar to TS, except that there's a mechanism allowing threads to have a "boost" of priority. The windowing system uses this to give processes associated with "in focus" windows higher priority. Thread priorities: 0 - 59
  • Real Time scheduling class - Threads in the RT class assume the highest system priorities, even higher than system. Thread priorities: 100 - 159
You can find out more about these scheduling classes from the priocntl(1) man page. Note that there's a difference between a scheduling class priority and the thread's system-wide priority. The priorities cited above are system priorities: the global, scheduling class independent scale of priorities used for all threads by the dispatcher. In the source base, the code which implements these scheduling classes can again be found under usr/src/uts/common/disp.
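
As a rough mental model of how those per-class ranges tile the global priority scale, here's an illustrative lookup table (a sketch built only from the ranges listed above; the kernel actually derives priorities from the per-class dispatch tables like ts_dptbl.c, fx_dptbl.c, and rt_dptbl.c):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: the slice of the system-wide priority scale
 * each scheduling class maps onto, per the ranges listed above.
 */
typedef struct prirange {
	const char *name;	/* scheduling class */
	int lo;			/* lowest global priority */
	int hi;			/* highest global priority */
} prirange_t;

static const prirange_t ranges[] = {
	{ "TS",  0,   59 },	/* timeshare (the default) */
	{ "IA",  0,   59 },	/* interactive */
	{ "FX",  0,   60 },	/* fixed priority */
	{ "SYS", 60,  99 },	/* kernel service threads */
	{ "RT",  100, 159 },	/* real time */
};

static const prirange_t *
class_range(const char *name)
{
	for (size_t i = 0; i < sizeof (ranges) / sizeof (ranges[0]); i++) {
		if (strcmp(ranges[i].name, name) == 0)
			return (&ranges[i]);
	}
	return (NULL);
}
```

Note how RT sits above SYS on the global scale, which is exactly why runaway real time threads can starve kernel service threads.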

esaxe@sark$ ls
SCCS/         disp_lock.c   ia.c          shuttle.c     ts_dptbl.c
class.c       fss.c         priocntl.c    sysclass.c
cpupart.c     fx.c          rt.c          thread.c
disp.c        fx_dptbl.c    rt_dptbl.c    ts.c

As an example of how the dispatcher and scheduling classes hook together, let's look at preempt(), the function invoked when the currently running thread on a CPU should yield in favor of a higher priority runnable thread:

/*
 * Preempt the currently running thread in favor of the highest
 * priority thread.  The class of the current thread controls
 * where it goes on the dispatcher queues. If panicking, turn
 * preemption off.
 */
void
preempt()
{
        kthread_t       *t = curthread;
        klwp_t          *lwp = ttolwp(curthread);

        if (panicstr)
                return;

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");

        thread_lock(t);

        if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
                /*
                 * this thread has already been chosen to be run on
                 * another CPU. Clear kprunrun on this CPU since we're
                 * already headed for swtch().
                 */
                CPU->cpu_kprunrun = 0;
                thread_unlock_nopreempt(t);
                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
        } else {
                if (lwp != NULL)
                        lwp->lwp_ru.nivcsw++;
                CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
                THREAD_TRANSITION(t);
                CL_PREEMPT(t);
                thread_unlock_nopreempt(t);

                TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

                swtch();                /* clears CPU->cpu_runrun via disp() */
        }
}
Typically the "else" route is taken, and we see that the scheduling class specific preemption routine is called, CL_PREEMPT(). If the thread is in the TS (timeshare) scheduling class, then this evaluates to ts_preempt(), which ultimately results in either setbackdq() or setfrontdq() being called, which place the thread at the back or the front of the queue (respectively) of threads that exist at that thread's priority. At the end of preempt() above, we see swtch() is called (switch threads), which in turn calls disp() (return the highest priority thread that the CPU can now run, or the idle thread if there's nothing), and finally resume(), which does the actual context switch. Whew!
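
To see what the queue-placement half of this accomplishes, here's a toy model (an illustrative sketch with hypothetical types, far simpler than the real disp.c) of per-priority FIFO queues with setfrontdq()/setbackdq()-style placement and a disp()-style highest-priority-first pick:

```c
#include <assert.h>
#include <stddef.h>

#define	NPRI	170		/* global priorities 0..169 */

typedef struct toy_thread {
	int pri;		/* this thread's global priority */
	struct toy_thread *next;
} toy_thread_t;

static toy_thread_t *dq[NPRI];	/* one FIFO queue per priority */

/* setfrontdq()-style: run next among this priority's peers */
static void
setfrontdq(toy_thread_t *t)
{
	t->next = dq[t->pri];
	dq[t->pri] = t;
}

/* setbackdq()-style: go to the back of this priority's queue */
static void
setbackdq(toy_thread_t *t)
{
	toy_thread_t **pp = &dq[t->pri];

	while (*pp != NULL)
		pp = &(*pp)->next;
	t->next = NULL;
	*pp = t;
}

/* disp()-style: dequeue the highest priority runnable thread, if any */
static toy_thread_t *
disp(void)
{
	for (int pri = NPRI - 1; pri >= 0; pri--) {
		if (dq[pri] != NULL) {
			toy_thread_t *t = dq[pri];
			dq[pri] = t->next;
			return (t);
		}
	}
	return (NULL);		/* caller would run the idle thread */
}
```

The real dispatcher is per-CPU, lock-aware, and tracks occupancy with a bitmap (disp_qactmap) so it never scans empty queues, but the policy is the same: highest priority first, FIFO within a priority.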

There's something important to note here. All this is being done in the context of the thread that's being preempted! That is, the thread that is being preempted is actually placing itself on a run queue somewhere, is finding the next best thread to run, and is then context switching to it! The thread is told it needs to preempt by the setting of the cpu_runrun or cpu_kprunrun flags in the CPU structure. Those flags may have been set by another, higher priority thread that placed itself on the run queue (see cpu_resched()). Whenever the thread takes a trap, returns from a system call, etc., it checks for these flags and proceeds to preempt itself if one is set. The idea that the dispatcher isn't some group of one or more bureaucratic threads "directing traffic" may seem surprising, but it's implemented this way for performance and scalability. All threads on the system share in executing dispatcher code, and we must take care that the bits of policy employed by every thread add up to serve the greater good.
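
The "preempt yourself" protocol boils down to a flag that a higher priority thread sets and the running thread polls at safe points. A minimal userland sketch of the idea (the names mimic the kernel's, but this is illustrative, not kernel code):

```c
#include <assert.h>

/* stand-ins for the per-CPU cpu_runrun / cpu_kprunrun flags */
static int cpu_runrun;
static int preempt_count;

/* cpu_resched()-style: a higher priority thread asks for the CPU */
static void
cpu_resched(void)
{
	cpu_runrun = 1;
}

/*
 * At trap return / system call exit, the *running* thread notices
 * the flag and preempts itself: it enqueues itself and swtch()es.
 */
static void
trap_return_check(void)
{
	if (cpu_runrun) {
		cpu_runrun = 0;
		preempt_count++;	/* preempt() would run here */
	}
}
```
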
So let's talk a bit about debugging. What's that you say, debugging is painful? Tired of adding printk() statements here and there, re-compiling, and adding more printk() statements? I feel your pain. But let me assure you that when that smile erupted on your face this morning, it was for a reason. I enjoy developing Solaris, and it's not because I think hunting down fantastically complex and subtle problems with print statements is fun; it's because tools like (k)mdb and DTrace free me to investigate, observe, and understand what's going on without first having to figure out what's needed to make the problem observable.

For now, let's look at one of the tools at your disposal:
(k)mdb, dcmds, and CTF
Let's invoke the kernel debugger on my build machine's running kernel to see what it's up to:

# mdb -k
Loading modules: [ unix krtld genunix specfs ufs sd sgsbbc sgenv ip sctp fcp fctl ssd nca random nfs sppp ptm logindmux lofs cpc ]
> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  -1   no    no t-11   2a10001fcc0 (idle)
  1 300423b2000  1b    0    0  -1   no    no t-63   2a1002f7cc0 (idle)
  2 300423ba000  1b    0    0  -1   no    no t-1    2a10037fcc0 (idle)
  3 300423c6000  1b    0    0  -1   no    no t-91   2a1003c7cc0 (idle)
  8 300423ce000  1b    0    0  59   no    no t-0    30d4775abc0 mdb
  9 300423de000  1b    0    0  -1   no    no t-0    2a100539cc0 (idle)
 10 300423ea000  1b    0    0  -1   no    no t-71   2a1005c1cc0 (idle)
 11 300423f2000  1b    0    0  -1   no    no t-8    2a100609cc0 (idle)
 16 300423fa000  1b    0    0  -1   no    no t-10   2a100691cc0 (idle)
 17 30042402000  1b    0    0  -1   no    no t-191  2a100719cc0 (idle)
 18 3004240a000  1b    0    0  -1   no    no t-0    2a1007a1cc0 (idle)
 19 30042412000  1b    0    0  -1   no    no t-0    2a1007e9cc0 (idle)

Well, that's not very exciting. All but CPU 8 are idle, and CPU 8 is running my mdb process thread. Let me kick off a build:

> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  1b    0    0  30   no    no t-0    325d59b8820 acomp
  1 300423b2000  1b    1    0  30   no    no t-0    308e67c02a0 acomp
  2 300423ba000  1b    1    0  55   no    no t-2    30760fbce60 cg
  3 300423c6000  1b    2    0  53   no    no t-1    33886adf500 cg
  8 300423ce000  1b    1    0  39   no    no t-1    35876d042c0 iropt
  9 300423de000  1b    2    0  59   no    no t-1    30d4775abc0 mdb
 10 300423ea000  1b    0    0  42   no    no t-1    3081f8cc200 cg
 11 300423f2000  1b    2    0  32   no    no t-0    319c6969400 cg
 16 300423fa000  1b    1    0  53   no    no t-0    338853f1ae0 cg
 17 30042402000  1b    0    0  43   no    no t-0    307215023a0 cg
 18 3004240a000  1b    1    0  49   no    no t-0    30d4771c5e0 cg
 19 30042412000  1b    0    0  43   no    no t-1    308e67dbc80 cg

That's better. ::cpuinfo gives you an overview of which thread every CPU is running; ::cpuinfo -v gives you even more. Let's look at CPU 1 in detail:

> 300423b2000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  1 300423b2000  1b    4    0   0   no    no t-5    3075514a440 acomp
                  |    |
       RUNNING <--+    +-->  PRI THREAD      PROC
         READY                 0 35876d042c0 sh
        EXISTS                 0 3075514b140 acomp
        ENABLE                 0 30041c654a0 dmake
                               0 308e629b120 dmake

So CPU 1 has 4 runnable threads, all at priority 0. The current thread, "acomp", has been on for 5 ticks (5/100 of a second), and this CPU has the right set of flags set for an online and active CPU.
There's a whole array of dcmds available to ease looking around the system. You can get a list using the ::dcmds dcmd. Also of interest are the walkers, which allow you to easily traverse lists of things: CPUs, threads, pages, etc. For example, let's walk the list of all active thread structures:

> ::walk thread
>> More [<space>, <cr>, q, n, c, a] ?

(It's a long list, and I'll spare you all the output.) Now let's pipe the output of ::walk thread to ::findstack, which prints a stack trace for a thread given a thread structure:

> ::walk thread |::findstack
stack pointer for thread 180e000: 180b0f1
[ 000000000180b0f1 sched+0x570() ]
  000000000180b211 main+0x458()
  000000000180b2f1 _start+0x110()
stack pointer for thread 2a10001fcc0: 2a10001f021
[ 000002a10001f021 ktl0+0x48() ]
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
stack pointer for thread 2a100017cc0: 2a100017151
[ 000002a100017151 cv_wait+0x5c() ]
  000002a100017201 taskq_thread+0x13c()
  000002a1000172d1 thread_start+4()
stack pointer for thread 2a10000fcc0: 2a10000f141
[ 000002a10000f141 sema_p+0x25c() ]
  000002a10000f1f1 kmdbmod`kctl_wr_thread+0xbc()
  000002a10000f2d1 thread_start+4()
stack pointer for thread 2a100047cc0: 2a100046f51
  000002a100046ff1 clock+0x20()
  000002a1000470e1 cyclic_softint+0x374()
  000002a100047221 cbe_level10+8()
  000002a1000472d1 intr_thread+0x294()
  000002a10007f2d1 utl0+0x4c()
  000002a10001f171 disp_getwork+0x164()
  000002a10001f221 idle+0xdc()
  000002a10001f2d1 thread_start+4()
>> More [<space>, <cr>, q, n, c, a] ?

When a system seems hung, I'll typically break into kmdb and do a ::walk thread |::findstack to look for interesting stacks. If there's a pile of threads all asleep with mutex_vector_enter() near the top of the stack, then from there I'll dig down to find who owns what locks, and why the various lock owners aren't letting go (or why they can't let go because they can't run).

Let's walk all the CPU structures, and for each of them print out the cpu_thread field (a pointer to the kthread_t of the on-proc thread), and for each of those threads, let's print out their t_pri field, the thread's current priority:

> ::walk cpu |::print cpu_t cpu_thread |::print kthread_t t_pri
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0x3b
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff
t_pri = 0xffff

So most of the CPUs are running a priority -1 thread (idle), and one is running a priority 59 thread (my mdb process). You'll find that ::print is your friend:
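
Incidentally, those t_pri = 0xffff values are just -1 seen through an unsigned lens: t_pri is a narrow signed type (pri_t), and -1 in 16-bit two's complement renders as 0xffff. A quick demonstration in plain C (assuming a 16-bit representation, which the output above suggests):

```c
#include <assert.h>
#include <stdint.h>

/*
 * The idle thread runs at priority -1, which reinterpreted as an
 * unsigned 16-bit quantity is 0xffff -- exactly what ::print showed.
 */
static uint16_t
pri_as_unsigned(int16_t pri)
{
	return ((uint16_t)pri);		/* -1 -> 0xffff */
}
```
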

> cpu0::print cpu_t cpu_disp |::print disp_t
    disp_lock = 0
    disp_npri = 0xaa
    disp_q = 0x309c1a19000
    disp_q_limit = 0x309c1a19ff0
    disp_qactmap = 0x300425ed380
    disp_maxrunpri = 0xffff
    disp_max_unbound_pri = 0xffff
    disp_nrunnable = 0
    disp_cpu = cpu0

mdb and kmdb know how to print out *every* structure thanks to the CTF data compiled in. This eliminates the need for clunky adb macros. Relating to the scheduler/dispatcher, you may want to consider getting familiar with the kthread_t, cpu_t, and disp_t structures.
So with a basic knowledge of a few structures, a few dcmds and walkers, you're on your way! The same tricks we've been doing here are of course relevant on crash dumps. In fact, let's take a look at a crash dump from a recent interrupt related scheduling bug...
The ecache_scrubber stuck in run queue limbo bug
A few weeks back Shiv Agarwal pointed me at his Sun Fire E6900 test rig that seemed to be hard hanging every now and again while executing some Dynamic Reconfiguration (DR) stress tests. (DR refers to the machine's ability to add and remove CPU/Memory/IO boards, disks, etc. from a running system without requiring a reboot.) Hard hangs are particularly frustrating, because the system is so hung that it's impossible to break into kmdb for further analysis. This is where the "deadman" feature in Solaris becomes useful. When enabled, every second a high (level 15) interrupt will fire on every CPU and execute the deadman() function, which watches the kernel variable "lbolt" (a tick counter incremented each time the clock interrupt fires).

If the deadman observes that lbolt hasn't changed in a while, it will induce a panic. The deadman feature can be enabled by setting the kernel variable "snooping" to 1 in /etc/system:

set snooping = 1

... or by hotpatching the variable's value early in boot.
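
The deadman's logic is simple enough to sketch in userland C (illustrative only; the real deadman() runs from a level 15 cyclic on every CPU, and the variable names here are approximations):

```c
#include <assert.h>

static volatile long lbolt;		/* bumped by clock() each tick */
static long deadman_lbolt;		/* last lbolt value deadman saw */
static int deadman_countdown = 9;	/* seconds of tolerated inactivity */
static int deadman_panicked;

/* runs once a second at high PIL on every CPU */
static void
deadman(void)
{
	if (lbolt != deadman_lbolt) {
		deadman_lbolt = lbolt;	/* clock is alive; reset */
		deadman_countdown = 9;
		return;
	}
	if (--deadman_countdown == 0)
		deadman_panicked = 1;	/* panic: "deadman: timed out..." */
}
```

Because it runs at level 15, the deadman keeps firing even when the clock (and everything below PIL 14) is wedged, which is exactly what lets it convert a hard hang into an analyzable crash dump.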

Shiv enabled the deadman, and when he reproduced the problem instead of hard hanging the system panicked:

> panic[cpu1]/thread=30018dd6cc0: deadman: timed out after 9 seconds of
> clock inactivity
> 000002a10005fdf0 genunix:cyclic_fire+64 (300138a9fb8, 20d5a6fe1472,
> 3b9ac800, 0, 300121d9698, 300348b0280)
>   %l0-3: 00000300195648c0 0000030012a7eb80 000020d5a6fe17c0 000000003b9aca00
>   %l4-7: 00000300348b0320 000020d5a77b5280 0000000000000014 0000000000000004
> 000002a10005fea0 unix:cbe_level14+38 (0, 0, e, 30006356000, 400080, 100aef8)
>   %l0-3: 000000000000022e 0000000000010002 0000000001911ab8 0000000000000000
>   %l4-7: 0000000000000001 0000000000000001 0000000001911ab0 0000000001910470
> 000002a10005ff50 unix:current_thread+140 (1, 6912166044, 6912166040,
> 18af260, f0d0e, f0d0f)
>   %l0-3: 0000000001007674 000002a102682a31 000000000000000e 0000000000010002
>   %l4-7: 00000000018670b0 0000000000000000 000000000000000b 000002a1026832e0
> 000002a102683380 unix:mp_cpu_quiesce+90 (180c000, 200, 1, 0, 3938700, fc00)
>   %l0-3: 0000000000000000 0000000000000000 0000000000000000 00000000018af260
>   %l4-7: 00000000018af000 00000000018aa660 0000000000000000 000000000085daf8
> 000002a102683430 sbdp:sbdp_cpu_poweroff+d4 (180c000, 1c00, 0, 0, 0,
> f082e520)
>   %l0-3: 00000000000000bb 00000000705e3e20 00000000705e3c00 0000030015828000
>   %l4-7: 000000007043dc00 000007fff0914000 0000000000001fff 00000000705e3e18
> 000002a102683500 unix:cpu_poweroff+24 (180c000, 10, 22, 2f, 2f, 180c000)
>   %l0-3: 0000000000000000 00000000705e2900 00000000705e2930 0000000000000030
>   %l4-7: 000000000190be28 000002a102683590 fffffffff082e520 00000000705e2800
> 000002a1026835b0 sbd:sbd_pre_detach_cpu+1b0 (300062944c0, 300139c0000,
> 180c000, 0, 30006294510, 30012bdd2c8)
>   %l0-3: 0000030015848000 000000000000008f 0000030015b65658 00000000000bab15
>   %l4-7: 0000000000000000 ffffffffffffffff 0000000000000001 0000000000000000
> 000002a102683660 sbd:sbd_dev_unconfigure+50 (300062944c0, 300139c0000,
> 1, 0, 0, 2a10268371c)
>   %l0-3: 0000000000000005 0000000000000001 0000000000000019 000003000f675180
>   %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
> 000002a102683720 sbd:sbd_exec_op+94 (300062944c0, 0, 445000, 7, 7, 8c)
>   %l0-3: 0000030015848000 00000000704bf800 0000000000000003 ffffffffffffff1f
>   %l4-7: ffffffffffffffd3 0000000000445208 0000030006294510 0000000000000000
> 000002a1026837f0 sbd:sbd_ioctl+150 (a500000000, 445208, 293c0, 100003,
> 2a102683977, 445000)
>   %l0-3: 0000000000000003 00000000704bf800 0000000000445208 0000000000445201
> 000002a102683990 genunix:ioctl+184 (3, 30019513580, 293c0, ff1d3108,
> ff1ea000, 445208)
>   %l0-3: 000003001301c590 000003001301c5b0 0000000000000004 0000000000000001
>   %l4-7: 000003001301c5b0 0000000000000000 0000000000000000 0000000000000000
> panic: entering debugger (continue to save dump)

[1]> $c
kaif_enter(18340f8, 0, 11e0a14, 1828400, 11e0a14, 4c00)
panicsys+0x408(18b3410, 18aa2b0, 30006356000, 1, 20040, 1883800)
vpanic+0xcc(11d6c20, 2a10005fe78, 0, 0, 43c998540, 1b5872)
panic+0x1c(11d6c20, 9, 18c9000, 11d6c00, 18c9000, 0)
cyclic_fire+0x64(300138a9fb8, 20d5a6fe1472, 3b9ac800, 0, 300121d9698,
cbe_level14+0x38(0, 0, e, 30006356000, 400080, 100aef8)
current_thread+0x140(1, 6912166044, 6912166040, 18af260, f0d0e, f0d0f)
mp_cpu_quiesce+0x90(180c000, 200, 1, 0, 3938700, fc00)
sbdp`sbdp_cpu_poweroff+0xd4(180c000, 1c00, 0, 0, 0, f082e520)
cpu_poweroff+0x24(180c000, 10, 22, 2f, 2f, 180c000)
sbd`sbd_pre_detach_cpu+0x1b0(300062944c0, 300139c0000, 180c000, 0, 30006294510,
sbd`sbd_dev_unconfigure+0x50(300062944c0, 300139c0000, 1, 0, 0, 2a10268371c)
sbd`sbd_exec_op+0x94(300062944c0, 0, 445000, 7, 7, 8c)
sbd`sbd_ioctl+0x150(a500000000, 445208, 293c0, 100003, 2a102683977, 445000)
ssm`ssm_ioctl+0xcc(a500000000, 3000313f880, 293c0, 100003, 3001218cf80, 445208)
ioctl+0x184(3, 30019513580, 293c0, ff1d3108, ff1ea000, 445208)
syscall_trap32+0xcc(3, 445208, 293c0, ff1d3108, ff1ea000, b)

Ok, so the golden question. Why wasn't the clock interrupt able to fire so that it could change lbolt to make deadman() happy?
Using ::cpuinfo to look around:

> ::cpuinfo
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
  1 0000187e7a0  1b    9    0  49   no    no t-1    30018dd6cc0 cfgadm
  2 300061d2000   d   10    0 169   no    no t-0    2a101433cc0 sched
  3 3000631e000  2d    0    0  -1   no    no t-31541 2a102053cc0 (idle)
  8 30006340000   d    8    0 169   no    no t-0    2a101943cc0 sched
 10 30008c74000  1d   10    0 169   no    no t-0    2a10117bcc0 sched
 11 30008864000   d    6    0 169   no    no t-0    2a101f2bcc0 sched
 16 30008442000  1d    9    0 169   no    no t-0    2a101d23cc0 sched
 17 3000da1e000  1d    6    0 169   no    no t-0    2a100611cc0 sched
 18 3000c9fc000  1d    6    0 169   no    no t-0    2a101983cc0 sched
 19 30005bda000  1d    8    0 169   no    no t-0    2a1020a3cc0 sched

Who are the priority 169 threads running on 2, 8-19?

> 2a101433cc0::findstack
stack pointer for thread 2a101433cc0: 2a101433021
[ 000002a101433021 panic_idle+0x1c() ]
  000002a1014330d1 ktl0+0x48()
  000002a101433221 cpu_pause+0x5c()
  000002a1014332d1 thread_start+4()

Ah, those are pause threads. There are certain lists and data structures walked by the dispatcher code without any locks held. One example is the per-processor-set list of active CPUs walked in disp_anywork(), a function used by the dispatcher to determine if there's anything better to run in the system besides idle(). This enables lockless performance in the dispatcher, at the cost of requiring that all CPUs be paused (run high priority threads that just spin) when those lists/structures are changed around (ensuring no other CPUs are executing dispatcher code)...which of course happens very rarely.

OK, so some of the CPUs are running their pause threads, while CPUs 0 and 3 are in idle(). The thread on CPU 1 is in the middle of a DR operation. Looking at the first argument to mp_cpu_quiesce() above, we see it's trying to power off CPU 0 and quiesce it of interrupts.

> 30018dd6cc0::findstack
stack pointer for thread 30018dd6cc0: 2a102682ae1
  000002a102682b81 mp_cpu_quiesce+0x90()
  000002a102682c31 sbdp_cpu_poweroff+0xd4()
  000002a102682d01 cpu_poweroff+0x24()
  000002a102682db1 sbd_pre_detach_cpu+0x1b0()
  000002a102682e61 sbd_dev_unconfigure+0x50()
  000002a102682f21 sbd_exec_op+0x94()
  000002a102682ff1 sbd_ioctl+0x150()
  000002a1026830c1 ssm_ioctl+0xcc()
  000002a102683191 ioctl+0x184()
  000002a1026832e1 syscall_trap32+0xcc()
> mp_cpu_quiesce+0x90::dis
mp_cpu_quiesce+0x80:            nop
mp_cpu_quiesce+0x84:            clr       %g2
mp_cpu_quiesce+0x88:            ba        +0x34         <mp_cpu_quiesce+0xbc>
mp_cpu_quiesce+0x8c:            cmp       %g2, 0
mp_cpu_quiesce+0x90:            call      +0x1db71c     <drv_usecwait>
mp_cpu_quiesce+0x94:            mov       %i2, %o0
mp_cpu_quiesce+0x98:            add       %l7, 1, %l7
mp_cpu_quiesce+0x9c:            cmp       %l7, %i4
mp_cpu_quiesce+0xa0:            bl,pt     %icc, -0x40   <mp_cpu_quiesce+0x60>

Which, looking at the C code, puts us here in mp_cpu_quiesce():

        /*
         * Declare CPU as no longer being READY to process interrupts and
         * wait for them to stop. A CPU that is not READY can no longer
         * participate in x-calls or x-traps.
         */
        cp->cpu_flags &= ~CPU_READY;
        CPUSET_DEL(cpu_ready_set, cpuid);

        for (i = 0; i < sanity_limit; i++) {
                if (cp->cpu_intr_actv == 0 &&
                    cp->cpu_thread == cp->cpu_idle_thread) {
                        found_intr = 0;
                        break;
                }
                DELAY(1);       /* <-------------- ** */
        }

So CPU1 is caught in the for loop waiting for CPU0's cpu_intr_actv bitmask to become empty (cpu_intr_actv is a bitmap of active interrupts that lives in the CPU structure). Sure enough, it isn't:

> 180c000::print cpu_t cpu_intr_actv
cpu_intr_actv = 0x200
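
That 0x200 decodes directly: cpu_intr_actv keeps one bit per PIL, so bit 9 being set means a level 9 interrupt is still active on the CPU. A one-line sanity check of the encoding (an illustrative helper, not a kernel function):

```c
#include <assert.h>

/*
 * cpu_intr_actv-style encoding: one bit per priority interrupt
 * level (PIL).  Marking a PIL 9 interrupt active sets bit 9.
 */
static unsigned int
intr_actv_set(unsigned int actv, int pil)
{
	return (actv | (1U << pil));
}
```
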

Taking a close look at CPU0:

> 180c000::cpuinfo -v
 ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC
  0 0000180c000  2d    1    9  -1   no   yes t-31536 2a1015a3cc0 (idle)
                  |    |    |
       RUNNING <--+    |    +--> PIL THREAD
      QUIESCED         |           9 2a101423cc0
        EXISTS         |
       OFFLINE         +-->  PRI THREAD      PROC
                             168 2a101423cc0 sched
> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()

Hey look, there's a level 9 interrupt thread, runnable and eagerly waiting for the CPU to grab it and run it.

What's an interrupt thread? Interrupt service code in Solaris executes while the existing thread on the processor is "pinned". That is, when an interrupt happens, the CPU immediately jumps from executing the interrupted thread to the interrupt service code. When complete, control is restored to the interrupted thread and life moves on. It is not often that interrupts lose the CPU prior to finishing execution, but it does happen, most often because the interrupt tries to grab an adaptive lock that is already held. When this happens, the interrupt is turned into a real thread (rather than leeching off the interrupted thread), and blocks as usual. Interrupt threads, like real time threads, are at the top of the food chain in terms of priorities (with the RT class loaded, 160-169). PIL 10 interrupts and below are permitted to grab adaptive locks, and can therefore block (and be turned into real threads). Interrupts above PIL 10 may never block. The dispatcher operates at PIL 11, and if it were possible to take an interrupt that later blocked while already in the dispatcher, all sorts of nastiness could ensue.

But for our runnable interrupt thread, despite its eagerness to run again, its day will never come. CPU0 is in idle(), but looking at the flags we see that it's OFFLINE and QUIESCED. This is the state your CPU assumes after you use psradm -f, for example. The CPU isn't permitted to execute any threads, and strictly speaking it's dangerous for it to be servicing interrupts. The offline CPU will remain parked in this loop in idle():

                /*
                 * If CPU is completely quiesced by p_online(2), just wait
                 * here with minimal bus traffic until put online.
                 */
                while (cp->cpu_flags & CPU_QUIESCED)
                        ;

What happens if the interrupt thread blocks? You're looking at it. The thread will remain unrunnable. Its only hope is for the QUIESCED flag to be cleared, should the CPU be brought back online.

So did this interrupt block? Not exactly. The ecache scrubbing code is brave, and wants to scrub the caches of even offline CPUs. It does this by having one CPU send a cross trap to the target (offline) CPU, which causes the target to take a PIL 9 interrupt. Again, doing this on offline CPUs is very dicey, but looking at the interrupt code executed, there's no evidence of it grabbing any locks. So how did it lose the CPU? Looking at the interrupt thread stack, we find our clue:

> 2a101423cc0::findstack
stack pointer for thread 2a101423cc0: 2a1014230d1
  000002a101423171 scrub_ecache+0x80()
  000002a101423221 swtch+0x78()
  000002a1014232d1 intr_thread+0x240()

The topmost stack frame has already popped off, so really the last bit of code we executed was swtch(), called by intr_thread+0x240:

intr_thread+0x234:              ldub      [%o2 + 0xe9], %o5
intr_thread+0x238:              brz,pt    %o5, +0x1c    <intr_thread+0x254>
intr_thread+0x23c:              mov       %o2, %l3
intr_thread+0x240:              call      +0x486a8      <preempt>
intr_thread+0x244:              mov       %o3, %l2
intr_thread+0x248:              mov       %l3, %o2
intr_thread+0x24c:              mov       %l2, %o3

Uh oh. Looks like intr_thread+0x240 called preempt(), which then called swtch(). This thread voluntarily gave up the CPU. Why? Code exists in intr_thread() that prevents the pathological scenario of a thread being pinned by an interrupt for an excessive period of time. Some networking interrupt handlers will happily stay in interrupt context as long as there's a steady stream of packets to service, so once intr_thread() detects that we've processed 16 or so interrupts without giving the poor pinned thread a chance to run, it will force itself (and whomever it's pinning) to preempt, giving the interrupted thread a chance to migrate elsewhere. Unfortunately, while it may have had the best intentions, it did this on a QUIESCED CPU, which made the ecache scrubber interrupt thread unrunnable. The ecache scrubber interrupt is sent regularly via a cyclic firing on another CPU (the clock and deadman are both cyclic based).

So with the ecache scrubber left unrunnable, CPU0 became stuck at PIL 9, which left CPU1 stuck waiting in a loop for CPU0's PIL to drop. But CPU1 got stuck after it had already paused the other CPUs (done via pause_cpus()).

One key point I should mention: the pause thread raises its CPU's PIL to 10. Hmm, I wonder where clock should be running...

> ::cycinfo -v
CPU  CYC_CPU   STATE NELEMS     ROOT            FIRE HANDLER
 10 30011394900  online     12 300189d9480    a926ae697c00 do_scrub_offline

                    10                                    1
                    |                                     |
          +---------+--------+                  +---------+---------+
          9                  5                  3                   2
          |                  |                  |                   |
     +----+----+        +----+----+        +----+----+         +----+---+
     4         0        11        7        8
     |         |        |         |        |
   +-+-+    +--+-+    +-+-+    +--+-+    +-+-+

   ADDR NDX HEAP LEVL  PEND            FIRE USECINT HANDLER
  300189d9480   0    8  low 212603    a926cd692a74  705032 cheetah_nudge_buddy
  300189d94a8   1    2  low 14989212    a926ae697c00   10000 do_scrub
  300189d94d0   2    6  low 14989212    a926ae697c00   10000 do_scrub
  300189d94f8   3    5  low 14989212    a926ae697c00   10000 do_scrub
  300189d9520   4    7 high     0    a926c7527474 1000000 deadman
  300189d9548   5    4  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9570   6    0  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d9598   7   10  low 14989212    a926ae697c00   10000 do_scrub_offline
  300189d95c0   8   11  low 149892    a926c6410000 1000000 dtrace_state_deadman
  300189d95e8   9    3  low 15139104    a926adf27592    9900 dtrace_state_clean
  300189d9610  10    1 lock 14989212    a926ae697c00   10000 clock <---
  300189d9638  11    9 high     0    a926ae697c00   10000 cbe_hres_tick

Wow, CPU10 has the clock cyclic, and there are a *lot* of pending clock interrupts. Note that the clock fires at PIL 10, but the pause thread running on that CPU has masked interrupts at PIL 10 and below. :P Well, there it is: that's why clock couldn't bump lbolt, and after 9 seconds the deadman pulled the plug. A pause thread running on CPU0 (in the past) was probably also responsible for backing up the number of pending PIL 9 scrub requests, such that when the PIL was lowered, the overwhelming number of pending PIL 9 interrupts triggered the preempt mechanism in intr_thread().
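The masking rule at play is simply that an interrupt is taken only when its PIL is strictly above the CPU's current priority level. A hypothetical one-liner makes the stall obvious: the pause thread sitting at PIL 10 keeps the PIL 10 clock cyclic pending indefinitely:

```c
#include <assert.h>

/*
 * An interrupt at level 'pil' is delivered only if it is strictly above
 * the CPU's current interrupt priority level. Illustrative helper, not
 * kernel code.
 */
static int
intr_deliverable(int pil, int cur_ipl)
{
        return (pil > cur_ipl);
}
```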

The fix for this bug was pretty trivial: intr_thread() shouldn't employ its preempt() mechanism if the CPU is quiesced. This was bug 6256245.
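In spirit, the fix amounts to one extra test in the preempt decision. The names, flag value, and threshold below are illustrative (the text above says "16 or so"), not the actual intr_thread() code:

```c
#include <assert.h>

#define CPU_QUIESCED            0x0020  /* illustrative value only */
#define INTR_PREEMPT_THRESH     16      /* "16 or so" interrupts without progress */

/*
 * Should the interrupt thread force a preemption so the pinned thread
 * can migrate? After 6256245: never on a quiesced CPU, where preempting
 * would strand the interrupt thread forever.
 */
static int
intr_should_preempt(int intrs_since_pinned_ran, int cpu_flags)
{
        if (cpu_flags & CPU_QUIESCED)
                return (0);
        return (intrs_since_pinned_ran >= INTR_PREEMPT_THRESH);
}
```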

Fun stuff. In another posting, we'll look at how DTrace can be an indispensable tool in tracking down scheduling problems, and the joys of the sched provider.
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Technorati Tag: DTrace
Technorati Tag: mdb

Monday Jun 13, 2005

CMT and Solaris Performance

I (too) long ago mentioned that I would talk a bit about CMT processor architectures, and the performance enhancements we've made (and are making) to Solaris as a result. In this post, I would at least like to get started. :)
Some background... (why / what is CMT)
Looking behind us, we see there have been some trends in processor evolution. Caches have grown larger, clock speeds have grown faster, and pipelines have become more sophisticated. These trends have held true for quite some time...so much so that we've practically come to expect them. Look at clock speed. For how long has it been the de facto rule of thumb for classifying and comparing processor performance? The notion that given two processors, the one with the higher clock speed is faster has historically been beaten into us. But how reliable is this metric these days? How much correlation is there in practice between clock speed and performance?

In reality, it's pretty clear that the technique of cranking up clock speed to squeeze out more performance has become a game of diminishing returns. As clock speeds have increased, so too has the relevance and impact of latency on overall performance. What's the point of being able to chew through many instructions quickly if much of the processor's time is wasted stalled waiting for memory? Thanks to Amdahl, we know that at some point a higher clock speed won't help if latency dominates. One must either find a way to reduce the latency experienced by the processor, or find something more productive to do during that latency besides stalling. Sure, caches can help to reduce memory latency, but they too have tradeoffs: as they grow larger, they become slower and more expensive. Whizzier pipeline designs may help the processor chew through more instructions in a given cycle, but stalls are still a problem. What to do?
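To put a rough number on the Amdahl's-law point: if some fraction of run time is memory stall that a faster clock can't touch, the achievable speedup is capped. A back-of-the-envelope sketch (the numbers are illustrative):

```c
#include <assert.h>

/*
 * If 'stall_frac' of the original run time is memory stall (unaffected
 * by clock speed), cranking the clock up by 'clock_ratio' only shrinks
 * the compute portion:
 *
 *      speedup = 1 / (stall_frac + (1 - stall_frac) / clock_ratio)
 */
static double
clock_speedup(double stall_frac, double clock_ratio)
{
        return (1.0 / (stall_frac + (1.0 - stall_frac) / clock_ratio));
}
```

Even an infinitely fast clock tops out at 1/stall_frac: with half the time spent stalled, no clock bump can ever get you past 2x.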
A throughput approach
Chip Multi-Threaded (CMT) architecture diverges somewhat from the traditional ways of getting more work out of a processor. Rather than butting heads with the laws of physics in an attempt to quickly burn through a single instruction stream (stumbling and stalling along the way), CMT processors do more by allowing multiple threads to execute in parallel. The nature of the parallelism depends on how the processor implements CMT, but in general this is a hallmark characteristic. This is in contrast to traditional (non-CMT) processors that can execute at most one thread at a time. Because multiple threads can execute in parallel, each CMT processor presents itself to the operating system as multiple "logical" CPUs, upon which threads may be scheduled to run. One nice benefit of CMT is that if the operating system already works in traditional multi-processor configurations, then typically little (or no) change is required to get it to function on a CMT system. But what about performance? Herein lies the challenge. Typically, the logical CPUs presented by a CMT processor share some of the processor's physical resources and/or facilities. Depending on the CMT implementation this sharing can be extensive. This means that a given thread executing on a CMT processor may impact (for better or worse) the performance of the other executing threads on the same chip, depending on the nature of the sharing. Before I dig too deep here, and to give some context, let's look at some common CMT implementations:

Some processors (such as the UltraSPARC-IV and Dual-core AMD Opteron) implement CMT by incorporating multiple traditional processor cores in a single physical package. This technique is known as Chip Multi-Processing (or CMP), and from a performance perspective it is generally the tamest. In some cases, one or more on-chip caches may be shared between the processor cores (allowing for threads to constructively/destructively interfere with each other), and shared logic / datapaths to cache, memory, and memory controllers may exist which can also cause contention / bandwidth issues.

Far more interesting are the multi-threaded processing architectures where each core may present multiple logical CPUs, sometimes called "strands". Examples include Intel's Hyper-Threaded P4/Xeon processors, and Sun's forthcoming Niagara processor (Niagara is actually a threaded CMP: 4 threads per core, with up to 8 cores per chip). This architecture directly addresses the latency problem by allowing the processor core to execute instructions from a different instruction stream (thread) if a given thread stalls. Threads executing on CPUs presented by a given core will typically share everything except a set of registers (necessary for maintaining distinct thread state), so one thread's performance impact on another (on the same core) can be considerable. Everything, including caches, TLBs, ALUs/FPUs, and the pipeline itself, is shared.

Looking ahead, it's quite possible (even probable) that future processors will have many cores, each with many threads. Various sharing relationships will likely exist between the various logical CPUs...some may share caches, some may share pipelines, some may share floating point units, and some may share something else, while others (on different physical processors) may share nothing at all. All sorts of new opportunities will arise (and already have) for threads to contend for resources and impact each other via cache interference (or not), depending on where (on which logical CPUs) they are dispatched to run, and which processor resources the threads need.

What follows is a brief description of some of the enhancements we've already made to Solaris to improve performance on CMT processors (specifically CMP and Hyper-Threaded P4/Xeon). Andrei, Sasha, and I have more recently been working extensively on the dispatcher implementation in preparation for Niagara and beyond. Once the pearly gates of OpenSolaris have opened, I'll follow up with a tour of the implementation of the enhancements described below, and will talk more about our work in progress and the direction we're headed.
Load balancing, logical vs. physical awareness
To improve performance (and performance determinism) on CMP and Hyper-Threaded P4/Xeon processors, our first goal was to give the Solaris dispatcher awareness of which logical CPUs share which physical processors, and to put some load balancing policy in place. Given 4 physical processors, each with 2 logical CPUs, if one were to run 4 threads, mpstat(1M) should clearly show one thread per chip (with no doubling up). This policy seeks to minimize contention over any per-physical-processor resources/facilities. Once this awareness was in place, we enhanced psrinfo(1M) to display the logical-to-physical CPU mappings:

esaxe@badaboom$ psrinfo -vp
The physical processor has 2 virtual processors (0, 4)
  x86 (chipid 0x0 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (1, 5)
  x86 (chipid 0x1 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (2, 6)
  x86 (chipid 0x2 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz
The physical processor has 2 virtual processors (3, 7)
  x86 (chipid 0x3 GenuineIntel family 15 model 2 step 6 clock 3000 MHz)
        Intel(r) Xeon(tm) MP CPU 3.00GHz

The above output is from a 4-processor Hyper-Threaded Xeon MP box. You can see CPUs 0 and 4 are Hyper-Threads from the same chip. If I run 4 busy loops, we see the dispatcher in action via mpstat(1M):

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   315  207   37    0    3    0    0    25    0   0   0 100
  1    0   0    0    18    3    0    5    0    0    0     0  100   0   0   0
  2    0   0    0    15    2    0    5    0    0    0     0  100   0   0   0
  3    1   0   49    16    7   17    0    1    2    0    74    0   0   0 100
  4    0   0    0    15    1    0    6    0    0    0     0  100   0   0   0
  5    0   0    0    13    0    8    0    1    0    0     0    0   0   0 100
  6    0   0    0    33    0   49    0    1    0    0     3    0   0   0 100
  7    0   0    0    13    0    0    5    0    0    0     0  100   0   0   0

The "100" in the usr column indicates the CPU is at 100% user time, clearly a busy loop. With processors that are either multi-core, or multi-threaded, only a single level of load balancing (at the physical processor level) is needed. When multi-core, multi-threaded, multi-processor configurations emerge, the dispatcher will need to balance across additional levels as well.
Threaded processor enhancements
On threaded processor architectures, Solaris makes use of the "halt" (or equivalent) instruction to suspend idle CPUs when no work is available. When a thread becomes runnable, a cross trap is sent to "wake up" a designated CPU. We found that on our 2-processor Hyper-Threaded P4 test box, not halting idle CPUs cost around a 30% performance hit on a single-threaded benchmark, and around 47% higher power consumption (as measured on an idle system).

Solaris also makes use of the Pentium 4 pause instruction, which, when inserted in spin loops, lessens the performance impact of a spinning thread on another thread running on the same chip. This instruction was added to spin loops both in the kernel and in libc.
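Here's a user-level sketch of what such a spin loop looks like, with the pause hint guarded so the example stays portable (the function name is mine, not libc's):

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Spin until *flag becomes nonzero. The 'pause' hint tells a
 * Hyper-Threaded core that we're busy-waiting, so the sibling strand
 * isn't starved of pipeline resources while we spin. Illustrative
 * sketch only.
 */
static void
spin_until_set(atomic_int *flag)
{
        while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
#if defined(__i386__) || defined(__x86_64__)
                __asm__ __volatile__("pause");
#endif
        }
}
```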
More to come
Ok, I'm going to have to cut this post here. Obviously what I've touched on is the tip of the iceberg. I haven't even talked about application performance issues, the importance of workload observability on CMT, or the API conundrum.


