We now resume() your regular programming.

Now that Opening Day is here and you can see for yourself that OpenSolaris is not vaporware, I can finally discuss code snippets without having to say vague things like "in general, this code does this." You may now actually look for yourself. (Even better, I can now link to the actual code I was talking about.)

One of the most confusing areas of the kernel involves swtch() and resume().

Put simply, these two routines are largely responsible for multitasking: swtch() selects the next thread a given CPU core is to run, and resume() actually resumes execution of that thread, beginning either at its initial address or at the point where it was last swtch()ed out.
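
To make that division of labor concrete, here is a minimal user-space sketch, not kernel code. It assumes a Linux/glibc system with the POSIX ucontext routines, and every name in it (toy_thread_t, toy_swtch(), toy_resume(), worker(), run_toy_threads()) is made up for illustration. toy_swtch() plays the part of swtch() by picking the next runnable thread round-robin (the real dispatcher is far more involved), and toy_resume() plays the part of resume() by saving the outgoing context and loading the incoming one.

```c
#include <assert.h>
#include <ucontext.h>

#define NTHREADS   2
#define STACK_SIZE (64 * 1024)

/* Hypothetical toy thread: a saved context, a private stack, a done flag. */
typedef struct toy_thread {
    ucontext_t ctx;
    char       stack[STACK_SIZE];
    int        finished;
} toy_thread_t;

static toy_thread_t threads[NTHREADS];
static ucontext_t   idle_ctx;   /* where control goes when a thread exits */
static int          current;    /* index of the running toy thread */
static char         trace[32];  /* records the order in which threads ran */
static int          trace_len;

/* Analogue of resume(): save the outgoing context, load the incoming one. */
static void toy_resume(int next)
{
    int prev = current;
    current = next;
    int rc = swapcontext(&threads[prev].ctx, &threads[next].ctx);
    assert(rc == 0);    /* reached only when 'prev' is resumed later */
    (void)rc;
}

/* Analogue of swtch(): choose the next runnable thread, then resume it. */
static void toy_swtch(void)
{
    int next = (current + 1) % NTHREADS;
    if (!threads[next].finished)
        toy_resume(next);
}

static void worker(int id)
{
    for (int i = 0; i < 3; i++) {
        trace[trace_len++] = (char)('A' + id);
        toy_swtch();            /* give up the "CPU" voluntarily */
    }
    threads[id].finished = 1;   /* returning transfers control to uc_link */
}

const char *run_toy_threads(void)
{
    trace_len = 0;
    for (int i = 0; i < NTHREADS; i++) {
        int rc = getcontext(&threads[i].ctx);
        assert(rc == 0);
        (void)rc;
        threads[i].ctx.uc_stack.ss_sp = threads[i].stack;
        threads[i].ctx.uc_stack.ss_size = STACK_SIZE;
        threads[i].ctx.uc_link = &idle_ctx;
        threads[i].finished = 0;
        /* First resume starts at the thread's "initial address", worker(). */
        makecontext(&threads[i].ctx, (void (*)(void))worker, 1, i);
    }
    /* Start (or continue) every thread that has not yet run to completion. */
    for (int i = 0; i < NTHREADS; i++) {
        if (!threads[i].finished) {
            current = i;
            swapcontext(&idle_ctx, &threads[i].ctx);
        }
    }
    trace[trace_len] = '\0';
    return trace;
}
```

Calling run_toy_threads() interleaves the two workers: each one continues from exactly the point where it last called toy_swtch(), which is the behavior described above for a thread that was swtch()ed out.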

Given that the various resume() routines contain some of the most architecture-specific code within Solaris, the code differs not only between SPARC and x86, but also between x86 and AMD64. Since I was responsible for the changes made to the resume() routines for AMD64, I thought I'd cover that in a bit more detail:

Both the IA-32 ABI and AMD64 ABI (both PDF files, Adobe Reader or equivalent required) refer to certain registers as non-volatile; that is, the contents of those registers must be preserved across function calls by the called function, which has to save and restore any of them it modifies. This makes sense, as function call overhead would be enormous if you had to manually save and restore each register value you cared about whenever you called another function.

This gets a bit trickier within the context of resume(), as those registers have to be carefully saved off because one thread is about to give up use of a CPU core to another thread. Therefore the non-volatile registers need to be saved somewhere thread-specific where the values may easily be retrieved. (Likewise, those register values have to be restored when the thread later resume()s execution.)

Prior to Solaris 10, the x86 resume() routines saved the contents of the non-volatile registers on the stack. While a perfectly serviceable solution, the problem was that debuggers like kmdb and mdb had to go through a fair amount of trouble to find the "real stack" for those threads in order to provide a stack trace. To summarize, the debugger had to start at the last saved stack pointer, generate stack traces two different ways, and present whichever stack trace it found to be longer. Functional, but rather non-optimal.

The SPARC resume() code worked around this by saving the registers within a pre-defined area of its kthread_t, called a label_t. (The label_t had already seen use on SPARC and x86 as the area where register values were saved in order to provide non-local goto support, otherwise known as setjmp() and longjmp().)
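
For readers who have not run into non-local gotos before, here is a small user-space sketch using the standard C setjmp()/longjmp() from <setjmp.h>. It is only an analogy: jmp_buf stands in for the role a label_t plays in the kernel, and the function names (unwind_from(), run_nonlocal_goto()) are made up for illustration.

```c
#include <setjmp.h>

/* User-space stand-in for a label_t: holds the saved register state. */
static jmp_buf recovery_point;

/* Recurse a few frames deep, then jump straight back to the setjmp() site. */
static void unwind_from(int depth)
{
    if (depth == 0)
        longjmp(recovery_point, 1);   /* does not return */
    unwind_from(depth - 1);
}

/* Returns 1 if control came back through longjmp(), 0 otherwise. */
int run_nonlocal_goto(void)
{
    if (setjmp(recovery_point) == 0) {
        /* Direct return from setjmp(): registers are now saved. */
        unwind_from(3);
        return 0;                     /* not reached */
    }
    /* Second "return" from setjmp(), with the saved registers restored. */
    return 1;
}
```

The point to notice is that setjmp() stores enough register state (stack pointer, return address, non-volatile registers) for execution to pick up later at the saved location, which is the same kind of state resume() has to keep for a thread that is switched out.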

Since the kthread_t definition is shared between SPARC and x86, this meant that on x86, save for setjmp() and longjmp(), this area had largely gone unused. By modifying the x86 resume() code to also use the label_t for swtch()ed out thread register storage, we were able not only to reduce the complexity of mdb's stack backtrace algorithm, but also to provide a new degree of congruency in the way resume() works on SPARC, x86 and AMD64.

The macros that save and restore the registers in the resume() code are called, unsurprisingly enough, SAVE_REGS() and RESTORE_REGS():

/*
 * Save non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
 * The stack frame must be created before the save of %rsp so that tracebacks
 * of swtch()ed-out processes show the process as having last called swtch().
 */
#define SAVE_REGS(label_t, retaddr)                             \
        movq    %rbp, LABEL_RBP(label_t);                       \
        movq    %rbx, LABEL_RBX(label_t);                       \
        movq    %r12, LABEL_R12(label_t);                       \
        movq    %r13, LABEL_R13(label_t);                       \
        movq    %r14, LABEL_R14(label_t);                       \
        movq    %r15, LABEL_R15(label_t);                       \
        pushq   %rbp;                                           \
        movq    %rsp, %rbp;                                     \
        movq    %rsp, LABEL_SP(label_t);                        \
        movq    retaddr, (label_t);                             \
        movq    %rdi, %r12;                                     \
        call    __dtrace_probe___sched_off__cpu

/*
 * Restore non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
 * We load up %rsp from the label_t as part of the context switch, so
 * we don't repeat that here.
 * We don't do a 'leave,' because reloading %rsp/%rbp from the label_t
 * already has the effect of putting the stack back the way it was when
 * we came in.
 */
#define RESTORE_REGS(scratch_reg)                       \
        movq    %gs:CPU_THREAD, scratch_reg;            \
        leaq    T_LABEL(scratch_reg), scratch_reg;      \
        movq    LABEL_RBP(scratch_reg), %rbp;           \
        movq    LABEL_RBX(scratch_reg), %rbx;           \
        movq    LABEL_R12(scratch_reg), %r12;           \
        movq    LABEL_R13(scratch_reg), %r13;           \
        movq    LABEL_R14(scratch_reg), %r14;           \
        movq    LABEL_R15(scratch_reg), %r15

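
If assembler macros are not your thing, the data movement in SAVE_REGS() and RESTORE_REGS() can be modeled in plain C. The sketch below is an illustration only: toy_label_t, toy_regs_t and the two functions are hypothetical names, and the field layout is an assumption apart from the return address sitting at offset 0 (the source comments note that LABEL_PC is 0). The stack frame push and the DTrace probe call are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical named-field view of a label_t; the real type and its
 * LABEL_* offsets come from the Solaris headers, not from this sketch. */
typedef struct toy_label {
    uint64_t pc;                  /* saved return address, offset 0 */
    uint64_t sp;                  /* saved %rsp */
    uint64_t rbp, rbx;
    uint64_t r12, r13, r14, r15;
} toy_label_t;

/* Simulated register file holding only the registers of interest. */
typedef struct toy_regs {
    uint64_t rsp, rbp, rbx, r12, r13, r14, r15;
} toy_regs_t;

/* Model of SAVE_REGS(): copy the non-volatiles, %rsp and a return address
 * into the outgoing thread's label. */
void toy_save_regs(toy_label_t *l, const toy_regs_t *cpu, uint64_t retaddr)
{
    l->rbp = cpu->rbp;
    l->rbx = cpu->rbx;
    l->r12 = cpu->r12;
    l->r13 = cpu->r13;
    l->r14 = cpu->r14;
    l->r15 = cpu->r15;
    l->sp  = cpu->rsp;
    l->pc  = retaddr;
}

/* Model of RESTORE_REGS(): reload everything except %rsp, which the
 * context switch itself has already loaded from the label. */
void toy_restore_regs(const toy_label_t *l, toy_regs_t *cpu)
{
    cpu->rbp = l->rbp;
    cpu->rbx = l->rbx;
    cpu->r12 = l->r12;
    cpu->r13 = l->r13;
    cpu->r14 = l->r14;
    cpu->r15 = l->r15;
}

/* Save, let "another thread" clobber the registers, restore, and check. */
int toy_round_trip(void)
{
    toy_regs_t cpu = { 0x7000, 1, 2, 12, 13, 14, 15 };
    toy_label_t label;

    toy_save_regs(&label, &cpu, 0x2000);
    cpu = (toy_regs_t){ 0x9000, 91, 92, 93, 94, 95, 96 };
    toy_restore_regs(&label, &cpu);

    assert(cpu.rsp == 0x9000);    /* untouched by the restore model */
    return cpu.rbp == 1 && cpu.rbx == 2 && cpu.r12 == 12 &&
           cpu.r13 == 13 && cpu.r14 == 14 && cpu.r15 == 15 &&
           label.sp == 0x7000 && label.pc == 0x2000;
}
```

Because the saved values live at fixed offsets inside a per-thread structure rather than somewhere on the stack, a debugger can find them without guessing, which is the simplification for mdb described earlier.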
If you peruse the code in swtch.s, you're also likely to notice that there is not one resume() routine, but rather there are three variants:
  • resume(): This is the big cheese, the routine that gets called the vast majority of the time. Its task is "pretty simple": save the registers of the current thread and a return address to the label_t, then load the registers for the new thread from its label_t, and resume its execution wherever execution of that thread last left off. Simple in concept, rather tricky in execution; this is the kind of code where misplacing one instruction will cause incredibly frustrating, unreproducible errors that will literally take weeks to hunt down.

    (There's no exaggeration at all on that "weeks" time frame; I'll leave it as an exercise for the reader to determine how I know.)

  • resume_from_intr(): This routine resumes execution of a thread that had been forced to give up the CPU due to an interrupt. Solaris processes interrupts using dedicated interrupt threads, so whatever thread has control of a CPU core must give up control to an interrupt thread when an interrupt arrives. Thus resume_from_intr() is called from swtch() when the interrupt service routine has finished executing and the system is ready to release the CPU core back to normal use.
  • resume_from_zombie(): This one is a bit different; when a thread has finished execution and is ready to release its resources back to the system, it's considered to be zombied. When a thread is zombied and is to relinquish use of the CPU core, it actually calls the routine swtch_from_zombie(), which in turn calls resume_from_zombie() to start execution of a new thread. This routine also performs housekeeping operations on the zombied thread, including putting it on "death row" so that it may be reaped by its creator.
The basic structure of all three routines is the same: save the non-volatiles on entry, load up the new thread's registers, and begin executing it. However, if you try to trace the execution of a thread once the thread's registers have been restored, you'll see something seemingly bizarre occurring.

We execute this instruction, one that seemingly doesn't come back:

        jmp     spl0
This is because very early on in each of the resume() routines, you'll find code that looks something like this, to give the example of resume() itself:
        leaq    resume_return(%rip), %r11
        SAVE_REGS(%rax, %r11)
What this code does is save resume_return as the return address for the thread. This means that when the thread resumes execution, and its registers are restored:
        leaq    T_LABEL(%r12), %r11
        movq    LABEL_SP(%r11), %rsp    /* switch to outgoing thread's stack */
        movq    (%r11), %r13            /* saved return addr, LABEL_PC is 0 */

[ ... condensed for brevity ... ]

        movq    %r13, %rax      /* save return address */
        pushq   %rax            /* push return address for spl0() */
when the jmp is executed:
        jmp     spl0
spl0() will run to completion, and as spl0 ends with a ret instruction, this will cause the thread to return to the return address specified earlier.

So in the case of a thread calling resume(), when it later is allowed to resume execution after having given up the CPU core it was executing on, it will return from spl0() to the label resume_return, at which time the stack frame created in SAVE_REGS() is destroyed and execution will return to the caller that had originally called resume():

        /*
         * Remove stack frame created in SAVE_REGS()
         */
        addq    $CLONGSIZE, %rsp
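
To check the bookkeeping, the stack pointer's journey through the instructions shown above can be simulated in C. This is a sketch under stated assumptions, not the kernel's logic: toy_cpu_t, push(), pop() and the two addresses are invented for the model, CLONGSIZE is taken to be 8 bytes on AMD64, and the condensed portion of the code is not modeled at all.

```c
#include <assert.h>
#include <stdint.h>

#define CLONGSIZE   8     /* assumed size of a long on AMD64 */
#define STACK_WORDS 32

/* Made-up addresses standing in for real code locations. */
enum { CALLER_PC = 0x1000, RESUME_RETURN = 0x2000 };

typedef struct toy_cpu {
    uint64_t stack[STACK_WORDS];
    uint64_t rsp;         /* byte offset into stack[]; grows downward */
    uint64_t rbp;
    uint64_t pc;
} toy_cpu_t;

static void push(toy_cpu_t *c, uint64_t v)
{
    c->rsp -= CLONGSIZE;
    c->stack[c->rsp / CLONGSIZE] = v;
}

static uint64_t pop(toy_cpu_t *c)
{
    uint64_t v = c->stack[c->rsp / CLONGSIZE];
    c->rsp += CLONGSIZE;
    return v;
}

/* Returns 1 if the thread ends up back in the caller of resume(). */
int toy_resume_round_trip(void)
{
    toy_cpu_t c = { {0}, STACK_WORDS * CLONGSIZE, 0xb0, 0 };

    push(&c, CALLER_PC);               /* the caller's "call resume" */
    uint64_t rsp_on_entry = c.rsp;

    /* SAVE_REGS(): build the frame, then record %rsp and the return PC. */
    push(&c, c.rbp);                   /* pushq %rbp */
    c.rbp = c.rsp;                     /* movq  %rsp, %rbp */
    uint64_t label_sp = c.rsp;         /* movq  %rsp, LABEL_SP(label_t) */
    uint64_t label_pc = RESUME_RETURN; /* movq  retaddr, (label_t) */

    c.rsp = 0;   /* stand-in for the CPU running other threads elsewhere */

    /* The thread is switched back in. */
    c.rsp = label_sp;                  /* movq  LABEL_SP(%r11), %rsp */
    push(&c, label_pc);                /* pushq %rax */
    c.pc = pop(&c);                    /* jmp spl0; spl0's ret pops it */
    assert(c.pc == RESUME_RETURN);

    /* resume_return: drop the frame, then return to resume()'s caller. */
    c.rsp += CLONGSIZE;                /* addq  $CLONGSIZE, %rsp */
    assert(c.rsp == rsp_on_entry);
    c.pc = pop(&c);                    /* ret */
    return c.pc == CALLER_PC;
}
```

The addq of CLONGSIZE exactly undoes the pushq %rbp from SAVE_REGS(), so the stack pointer is back where it was on entry to resume() and the final return lands in the original caller.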

There you have it, an all too short treatise on resume() that's likely to have raised more questions than it answered (and if that is indeed the case, let me apologize in advance).

However, if you are now more confused than when you started reading, thanks to OpenSolaris, you may now read the source code and investigate the mysteries of resume() for yourself.

That's what the world of Open Source is all about.

