We now resume() your regular programming.
By kucharsk on Jun 14, 2005
One of the most confusing areas of the kernel is the pair of routines swtch() and resume().
Put simply, these two routines are largely responsible for multitasking; in its simplest terms, swtch() selects the next thread a given CPU core is to run, and resume() actually resumes execution of a given thread beginning at either its initial address or at the point it was last swtch()ed out.
Given that the various resume() routines contain some of the most architecture-specific code within Solaris, the code differs not only between SPARC and x86, but also between x86 and AMD64. Since I was responsible for the changes made to the resume() routines for AMD64, I thought I'd cover that in a bit more detail:
Both the IA-32 ABI and AMD64 ABI (both PDF files, Adobe Reader or equivalent required) refer to certain registers as non-volatile; that is, the contents of those registers must be preserved, by the callee, across function calls. This makes sense, as function call overhead would be enormous if you had to manually save and restore each register value you cared about whenever you called another function.
This gets a bit trickier within the context of resume(), as those registers have to be carefully saved off because one thread is about to give up use of a CPU core to another thread. Therefore the non-volatile registers need to be saved somewhere thread-specific where the values may easily be retrieved. (Likewise, those register values have to be restored when the thread later resume()s execution.)
Prior to Solaris 10, the x86 resume() routines saved the contents of the non-volatile registers on the stack. While a perfectly serviceable solution, the problem was that debuggers like kmdb and mdb had to go through a fair amount of trouble to find the "real stack" for those threads in order to provide a stack trace. To summarize, the debugger had to start at the last saved stack pointer and generate stack traces two different ways and it would present whichever stack trace it found to be longer. Functional, but rather non-optimal.
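To make the "pick the longer trace" idea concrete, here is a minimal sketch in C. It is not mdb's actual algorithm: struct frame, trace_depth() and pick_longer_trace() are hypothetical names invented for illustration, and the sketch simply assumes the debugger has two candidate starting frames and walks a frame-pointer chain from each.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stack frame linked through saved frame pointers. */
struct frame {
    const struct frame *caller;   /* saved frame pointer of the caller */
    const void *return_addr;      /* where this frame returns to */
};

/*
 * Count the frames reachable from 'start'. The walk gives up after
 * 'limit' frames so that a corrupt chain cannot loop forever.
 */
size_t trace_depth(const struct frame *start, size_t limit)
{
    size_t depth = 0;

    assert(limit > 0);            /* a zero limit would hide every trace */
    while (start != NULL && depth < limit) {
        depth++;
        start = start->caller;
    }
    return depth;
}

/* Return whichever candidate starting frame yields the longer backtrace. */
const struct frame *pick_longer_trace(const struct frame *a,
                                      const struct frame *b, size_t limit)
{
    return trace_depth(a, limit) >= trace_depth(b, limit) ? a : b;
}
```

Having to guess between two candidates like this is exactly the kind of work that goes away once the saved registers live in a known place.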
The SPARC resume() code worked around this by saving the registers within a pre-defined area of its kthread_t, called a label_t. (The label_t had already seen use on both SPARC and x86 as the area where register values were saved in order to provide non-local goto support, otherwise known as setjmp() and longjmp().)
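For anyone who hasn't met non-local gotos before, the userland versions of setjmp() and longjmp() show the idea. This is only a sketch using the standard C library, not kernel code: jmp_buf plays the part the label_t plays in the kernel, and bail_out() and nonlocal_goto_demo() are hypothetical names made up for the example.

```c
#include <assert.h>
#include <setjmp.h>

/* Userland stand-in for a label_t: holds the saved register context. */
static jmp_buf saved_context;

/* Hypothetical helper: abandon the current call chain and jump back
 * to wherever setjmp() last recorded its context. */
static void bail_out(int code)
{
    assert(code != 0);            /* longjmp() would turn a 0 into 1 */
    longjmp(saved_context, code);
}

/* setjmp() returns twice: 0 when the context is first saved, and the
 * value handed to longjmp() when execution is sent back to it. */
int nonlocal_goto_demo(void)
{
    switch (setjmp(saved_context)) {
    case 0:
        bail_out(42);             /* does not return here */
        return -1;                /* not reached */
    case 42:
        return 42;                /* arrived via longjmp() */
    default:
        return -2;                /* unexpected code */
    }
}
```

Calling nonlocal_goto_demo() returns 42: the registers recorded by setjmp() are reloaded by longjmp(), which is the same save-then-restore pattern resume() needs for a swtch()ed-out thread.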
Since the kthread_t definition is shared between SPARC and x86, this meant that on x86, save for setjmp() and longjmp(), this area had largely gone unused. By modifying the x86 resume() code to also use the label_t for swtch()ed out thread register storage, we were able to not only reduce the complexity of mdb's stack backtrace algorithm, but we would also be able to provide a new degree of congruency in the way resume() works on SPARC, x86 and AMD64.
On AMD64, the save and restore are wrapped up in a pair of macros:

    /*
     * Save non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
     *
     * The stack frame must be created before the save of %rsp so that tracebacks
     * of swtch()ed-out processes show the process as having last called swtch().
     */
    #define SAVE_REGS(label_t, retaddr)                 \
            movq    %rbp, LABEL_RBP(label_t);           \
            movq    %rbx, LABEL_RBX(label_t);           \
            movq    %r12, LABEL_R12(label_t);           \
            movq    %r13, LABEL_R13(label_t);           \
            movq    %r14, LABEL_R14(label_t);           \
            movq    %r15, LABEL_R15(label_t);           \
            pushq   %rbp;                               \
            movq    %rsp, %rbp;                         \
            movq    %rsp, LABEL_SP(label_t);            \
            movq    retaddr, (label_t);                 \
            movq    %rdi, %r12;                         \
            call    __dtrace_probe___sched_off__cpu

    /*
     * Restore non-volatile regs other than %rsp (%rbx, %rbp, and %r12 - %r15)
     *
     * We load up %rsp from the label_t as part of the context switch, so
     * we don't repeat that here.
     *
     * We don't do a 'leave,' because reloading %rsp/%rbp from the label_t
     * already has the effect of putting the stack back the way it was when
     * we came in.
     */
    #define RESTORE_REGS(scratch_reg)                   \
            movq    %gs:CPU_THREAD, scratch_reg;        \
            leaq    T_LABEL(scratch_reg), scratch_reg;  \
            movq    LABEL_RBP(scratch_reg), %rbp;       \
            movq    LABEL_RBX(scratch_reg), %rbx;       \
            movq    LABEL_R12(scratch_reg), %r12;       \
            movq    LABEL_R13(scratch_reg), %r13;       \
            movq    LABEL_R14(scratch_reg), %r14;       \
            movq    LABEL_R15(scratch_reg), %r15

If you peruse the code in swtch.s, you're also likely to notice that there is not one resume() routine, but rather there are three variants:
- resume(): This is the big cheese, the routine that gets called a vast majority of the time. Its task is "pretty simple": save the registers of the current thread and a return address to the label_t, then load the registers for the new thread from its label_t, and resume its execution wherever execution of that thread last left off. Simple in concept, rather tricky in execution; this is the kind of code where misplacing one instruction will cause incredibly frustrating, unreproducible errors that will literally take weeks to hunt down and locate.
(There's no exaggeration at all on that "weeks" time frame; I'll leave it as an exercise for the reader to determine how I know.)
- resume_from_intr(): This routine resumes execution of a thread that had been forced to give up the CPU due to an interrupt. Solaris processes interrupts using dedicated interrupt threads, so whatever thread has control of a CPU core must give up control to an interrupt thread when an interrupt arrives. Thus resume_from_intr() is called from swtch() when the interrupt service routine has finished executing and the system is ready to release the CPU core back to normal use.
- resume_from_zombie(): This one is a bit different; when a thread has finished execution and is ready to release its resources back to the system, it's considered to be zombied. When a thread is zombied and is to relinquish use of the CPU core, it actually calls the routine swtch_from_zombie(), which in turn calls resume_from_zombie() to start execution of a new thread. This routine also performs housekeeping operations on the zombied thread, including putting it on "death row" so that it may be reaped by its creator.
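The core job shared by all three variants (save the outgoing context, load the incoming one, and carry on wherever that one left off) can be tried out in userland. The sketch below is only an analogy, not how the kernel does it: it assumes a system that still provides the POSIX getcontext(), makecontext() and swapcontext() calls, the ucontext_t structures stand in for per-thread label_t areas, and worker(), trace and context_switch_demo() are hypothetical names.

```c
#include <assert.h>
#include <ucontext.h>

/* These two contexts play the role of per-thread register save areas. */
static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];
static int trace[4];
static int ntrace;

static void worker(void)
{
    trace[ntrace++] = 1;                  /* first run: starts at its entry point */
    swapcontext(&worker_ctx, &main_ctx);  /* save our state, resume main */
    trace[ntrace++] = 3;                  /* second run: continues right here */
}

/* Bounce between two contexts and report the order in which code ran. */
int context_switch_demo(void)
{
    int rc;

    ntrace = 0;
    rc = getcontext(&worker_ctx);
    assert(rc == 0);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof (worker_stack);
    worker_ctx.uc_link = &main_ctx;       /* resumed when worker() returns */
    makecontext(&worker_ctx, worker, 0);

    swapcontext(&main_ctx, &worker_ctx);  /* worker begins at its initial address */
    trace[ntrace++] = 2;
    swapcontext(&main_ctx, &worker_ctx);  /* worker picks up where it left off */
    trace[ntrace++] = 4;

    return trace[0] * 1000 + trace[1] * 100 + trace[2] * 10 + trace[3];
}
```

context_switch_demo() returns 1234, showing that the worker first started at its initial address and later resumed at the exact point where it had been switched out.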
We execute this instruction, one that seemingly doesn't come back:

    jmp     spl0

It does come back, though, because very early on in each of the resume() routines, you'll find code that looks something like this, to give the example of resume() itself:

    leaq    resume_return(%rip), %r11
    SAVE_REGS(%rax, %r11)

What this code does is save resume_return as the return address for the thread. This means that when the thread resumes execution, and its registers are restored:

    leaq    T_LABEL(%r12), %r11
    movq    LABEL_SP(%r11), %rsp    /* switch to outgoing thread's stack */
    movq    (%r11), %r13            /* saved return addr, LABEL_PC is 0 */

    [ ... condensed for brevity ... ]

    movq    %r13, %rax              /* save return address */
    RESTORE_REGS(%r11)
    pushq   %rax                    /* push return address for spl0() */

when the jmp is executed:

    jmp     spl0

spl0() will run to completion, and as spl0() ends with a ret instruction, this will cause the thread to return to the return address specified earlier.
So in the case of a thread calling resume(), when it later is allowed to resume execution after having given up the CPU core it was executing on, it will return from spl0() to the label resume_return, at which time the stack frame created in SAVE_REGS() is destroyed and execution will return to the caller that had originally called resume():

    resume_return:
            /*
             * Remove stack frame created in SAVE_REGS()
             */
            addq    $CLONGSIZE, %rsp
            ret

There you have it, an all too short treatise on resume() that's likely to have raised more questions than it answered (and if that was indeed the case, let me apologize in advance).
That's what the world of Open Source is all about.