Wednesday Jun 15, 2005

SPARC System Call Anatomy

Russ Blaine has described aspects of the x86 and x64 system call implementation in OpenSolaris. In this entry I'll describe the codeflow from userland to kernel and back for a SPARC system call.

Making A System Call

An application making a system call actually calls a libc wrapper function which performs any required posturing and then enters the kernel with a software trap instruction. This means that user code and compilers do not need to know the runes to enter the kernel, and allows binaries to work on later versions of the OS where perhaps the runes have been modified, system call numbers newly overloaded etc.

OpenSolaris for SPARC supports 3 software traps for entering the kernel:

| S/W Trap # | Instruction | Description |
|---|---|---|
| 0x0 | ta 0x0 | Used for system calls for binaries running in SunOS 4.x binary compatibility mode. |
| 0x8 | ta 0x8 | 32-bit (ILP32) binary running on 64-bit (LP64) kernel |
| 0x40 | ta 0x40 | 64-bit (LP64) binary running on 64-bit (LP64) kernel |

Since OpenSolaris (as Solaris since Solaris 10) no longer includes a 32-bit kernel the ILP32 syscall on ILP32 kernel is no longer implemented.

In the wrapper function the syscall arguments are rearranged if necessary (the kernel function implementing the syscall may expect them in a different order to the syscall API, for example multiple related system calls may share a single system call number and select behaviour based on an additional argument passed into the kernel). It then places the system call number in register %g1 and executes one of the above trap-always instructions (e.g., the 32-bit libc will use ta 0x8 while the 64-bit libc will use ta 0x40). There's a lot more activity and posturing in the wrapper functions than described here, but for our purposes we simply note that it all boils down to a ta instruction to enter the kernel.

Handling A System Call Trap

A ta n instruction, as executed in userland by the wrapper function, results in a trap of type 0x100 + n being taken, and we move from traplevel 0 (where all userland and most kernel code executes) to traplevel 1 in nucleus context. Code that executes in nucleus context has to be handcrafted in assembler since nucleus context does not comply with the ABI conventions and is generally much more restricted in what it can do. The task of the trap handler executing at traplevel 1 is to provide the necessary glue in order to get us back to TL0 and running privileged (kernel) C code that implements the actual system call.
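As a quick illustration of the trap-type arithmetic, here is a minimal C sketch. The helper name sw_trap_type and the SW_TRAP_BASE macro are mine, not from the kernel source, and the 7-bit mask reflects my understanding that software trap numbers occupy 0x100 thru 0x17f.

```c
#include <stdint.h>

/* Assumption: software traps occupy trap types 0x100..0x17f. */
#define SW_TRAP_BASE 0x100u

/*
 * Hypothetical helper: the trap type taken for a "ta n" instruction.
 * Only the low 7 bits of n select the software trap.
 */
static inline uint32_t
sw_trap_type(uint32_t n)
{
	return (SW_TRAP_BASE + (n & 0x7fu));
}
```

So ta 0x8 lands on trap type 0x108 and ta 0x40 on trap type 0x140, matching the trap table comments.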

The trap table entries for sun4u and sun4v for these traps are identical. I'm going to follow the two regular syscall traps and ignore the SunOS 4.x trap. Note that a trap table handler has just 8 instructions dedicated to it in the trap table - it must use these to do a little work and then to branch elsewhere:

/\*
 \* SYSCALL is used for system calls on both ILP32 and LP64 kernels
 \* depending on the "which" parameter (should be either syscall_trap
 \* or syscall_trap32).
 \*/
#define SYSCALL(which)                  \\
        TT_TRACE(trace_gen)             ;\\
        set     (which), %g1            ;\\
        ba,pt   %xcc, sys_trap          ;\\
        sub     %g0, 1, %g4             ;\\
        .align  32


        /\* hardware traps \*/
        /\* user traps \*/
        GOTO(syscall_trap_4x);          /\* 100  old system call \*/
        SYSCALL(syscall_trap32);        /\* 108  ILP32 system call on LP64 \*/
        SYSCALL(syscall_trap)           /\* 140  LP64 system call \*/

So in both cases we branch to sys_trap, requesting a TL0 handler of syscall_trap32 for an ILP32 syscall and syscall_trap for an LP64 syscall. In both cases we request that PIL remain as it currently is (always 0 since we came from userland). sys_trap is generic glue code that is used to take us from nucleus (TL>0) context back to TL0 running a specified handler (address in %g1, usually written in C) at a chosen PIL. The specified handler is called with arguments as given by registers %g2 and %g3 at the time we branch to sys_trap: the SYSCALL macro above does not move anything into these registers (no arguments to be passed to the handler). sys_trap handlers are always called with a first argument pointing to a struct regs that provides access to all the register values at the time of branching to sys_trap; for syscalls these will include the system call number in %g1 and arguments in the output registers. Note that %g1 as prepared in the wrapper and %g1 as used in the SYSCALL macro for the trap table entry are not the same register: on trap we move from the regular globals (in which userland executes) on to the alternate globals. The sys_trap glue collects all the correct (user) registers together and makes them available in the struct regs it passes to the handler.

sys_trap is also responsible for setting up our return linkage. When the TL0 handling is complete the handler will return, restoring the stack pointer and program counter as constructed in sys_trap. Since we trapped from userland it will be user_rtt that is interposed as the glue that TL0 handling code will return into, and which will get us back out of the kernel and into userland again.

Aside: Fancy Improving Something In OpenSolaris?

Adam Leventhal logged bug 4816328 "system call traps should go straight to user_trap" some time ago. As described above, the SYSCALL macro branches to sys_trap:

        ! force tl=1, update %cwp, branch to correct handler
        wrpr    %g0, 1, %tl
        rdpr    %tstate, %g5
        btst    TSTATE_PRIV, %g5
        and     %g5, TSTATE_CWP, %g6
        bnz,pn  %xcc, priv_trap
        wrpr    %g0, %g6, %cwp

Well we know that we're at TL1 and that we were unprivileged before the trap, so (aside from the current window pointer manipulation which Adam explains in the bug report - it's not required coming from a syscall trap) we could save a few instructions by going straight to user_trap from the trap table. Adam's benchmarking suggests that this can save around 45ns per system call - more than 1% of a quick system call!

syscall_trap32(struct regs \*rp);

We'll follow the ILP32 syscall route; the route for LP64 is analogous, with trivial differences such as not having to clear the upper 32 bits of arguments. You can view the source here. This runs at TL0 as a sys_trap handler so could be written in C; however, for performance and hands-on-DIY assembler-level reasons it is in assembler. Our task is to look up and call the nominated system call handler, performing the required housekeeping along the way.

        ldx     [THREAD_REG + T_CPU], %g1       ! get cpu pointer               
        mov     %o7, %l0                        ! save return addr              

First note that we do not obtain a new register window here - we will squat within the window that sys_trap crafted for itself. Normally this would mean that you'd have to live within the output registers, but by agreement handlers called via sys_trap are permitted to use registers %l0 thru %l3.

We begin by loading a pointer to the cpu this thread is executing on into %g1, and saving the return address (as constructed by sys_trap and passed in %o7) in %l0.

        ! If the trapping thread has the address mask bit clear, then it's      
        !   a 64-bit process, and has no business calling 32-bit syscalls.      
        ldx     [%o0 + TSTATE_OFF], %l1         ! saved %tstate.am is that
        andcc   %l1, TSTATE_AM, %l1             !   of the trapping proc        
        be,pn   %xcc, _syscall_ill32            !                               
          mov   %o0, %l1                        ! save reg pointer              

The comment says it all. The AM bit in the PSTATE at the time we trapped (executed the ta instruction) is available in the %tstate register after the trap, and sys_trap preserved that for us in the regs structure before it could be modified by further traps. Assuming we're not a 64-bit app making a 32-bit syscall:

        srl     %i0, 0, %o0                     ! copy 1st arg, clear high bits 
        srl     %i1, 0, %o1                     ! copy 2nd arg, clear high bits 
        ldx     [%g1 + CPU_STATS_SYS_SYSCALL], %g2                              
        inc     %g2                             ! cpu_stats.sys.syscall++       
        stx     %g2, [%g1 + CPU_STATS_SYS_SYSCALL]                              

The libc wrapper placed up to the first 6 arguments in %o0 thru %o5 (with the rest, if any, on stack). During sys_trap a SAVE instruction was performed to obtain a new register window, so those arguments are now available in the corresponding input registers (despite us not performing a save in syscall_trap32 itself). We're going to call the real handler so we prepare the arguments in our outputs (which we're sharing with sys_trap, but outputs are understood to be volatile across calls). The shift-right-logical by 0 bits is a 32-bit operation (i.e., not srlx) so it performs no shifting but it does clear the uppermost 32 bits of the arguments. We also increment the statistic counting the number of system calls made by this cpu; this statistic is in the cpu_t and the offset, like most, is generated for us by genassym.
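If the srl-by-zero trick looks odd, a rough C equivalent may help. This is only a model of the effect on a 64-bit register value; the function name srl0 is made up for illustration.

```c
#include <stdint.h>

/*
 * Model of "srl %iN, 0, %oN": a 32-bit logical shift by zero bits.
 * Nothing is shifted, but the result is zero-extended to 64 bits,
 * so whatever was in the upper 32 bits is discarded.
 */
static inline uint64_t
srl0(uint64_t reg)
{
	return ((uint64_t)(uint32_t)reg);
}
```

A 32-bit caller may leave garbage in the upper half of its registers, so this is how the kernel makes sure a handler only ever sees the 32-bit value the application intended.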

        ! Set new state for LWP                                                 
        ldx     [THREAD_REG + T_LWP], %l2                                       
        mov     LWP_SYS, %g3                                                    
        srl     %i2, 0, %o2                     ! copy 3rd arg, clear high bits 
        stb     %g3, [%l2 + LWP_STATE]                                          
        srl     %i3, 0, %o3                     ! copy 4th arg, clear high bits 
        ldx     [%l2 + LWP_RU_SYSC], %g2        ! pesky statistics              
        srl     %i4, 0, %o4                     ! copy 5th arg, clear high bits 
        addx    %g2, 1, %g2                                                     
        stx     %g2, [%l2 + LWP_RU_SYSC]                                        
        srl     %i5, 0, %o5                     ! copy 6th arg, clear high bits 
        ! args for direct syscalls now set up                                   

We continue preparing arguments as above. Interleaved with these instructions we change the lwp_state member of the associated lwp structure (there must be one - a user thread made a syscall, this is not a kernel thread) to indicate it is running in-kernel (LWP_SYS, would have been LWP_USER prior to this update) and increment the count of the number of syscalls made by this particular lwp (there is a 1:1 correspondence between user threads and lwps these days).

Next we write a TRAPTRACE entry - only on DEBUG kernels. That's a topic for another day - I'll skip the code here, too.

While we're on the subject of tracing, note that the next code snippet includes mentions of SYSCALLTRACE. This is not defined in normal production kernels. But, of course, one of the great beauties of DTrace is that it doesn't require custom kernels to perform its tracing since it can insert/enable probes on-the-fly - so SYSCALLTRACE is near worthless now!

        ! Test for pre-system-call handling                                     
        ldub    [THREAD_REG + T_PRE_SYS], %g3   ! pre-syscall proc?             
#ifdef SYSCALLTRACE                                                             
        sethi   %hi(syscalltrace), %g4                                          
        ld      [%g4 + %lo(syscalltrace)], %g4                                  
        orcc    %g3, %g4, %g0                   ! pre_syscall OR syscalltrace?
#else
        tst     %g3                             ! is pre_syscall flag set?
#endif /\* SYSCALLTRACE \*/                                                       
        bnz,pn  %icc, _syscall_pre32            ! yes - pre_syscall needed      
        ! Fast path invocation of new_mstate                                    
        mov     LMS_USER, %o0                                                   
        call    syscall_mstate                                                  
        mov     LMS_SYSTEM, %o1                                                 
        lduw    [%l1 + O0_OFF + 4], %o0         ! reload 32-bit args            
        lduw    [%l1 + O1_OFF + 4], %o1                                         
        lduw    [%l1 + O2_OFF + 4], %o2                                         
        lduw    [%l1 + O3_OFF + 4], %o3                                         
        lduw    [%l1 + O4_OFF + 4], %o4                                         
        lduw    [%l1 + O5_OFF + 4], %o5                                         

        ! lwp_arg now set up                                                    

If the curthread->t_pre_sys flag is set then we branch to _syscall_pre32 to call pre_syscall. If that does not abort the call it will reload the outputs with the args (they were lost on the call to _syscall_pre32) using lduw instructions from the regs area, loading from just the lower 32-bit word of the args (we can no longer use srl by 0 since no registers have the arguments anymore), and branch back to label 3 in the source, which sits just after the argument reloads above (as if we'd done the same after a call to syscall_mstate).

If we don't have pre-syscall work to perform then call syscall_mstate(LMS_USER, LMS_SYSTEM) to record the transition from user to system state for microstate accounting purposes. Microstate accounting is always performed now - it used not to be the default and was enabled when desired.

After the unconditional call to syscall_mstate we reload the arguments from the regs struct into the output registers (as after the pre-syscall work). Evidently our earlier srl work on the args is a complete waste of time (although not expensive) since we always land up loading them from the passed regs structure. This appears to be a hangover from days when microstate accounting was not always enabled.
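The "+ 4" in those lduw reloads is worth a moment: SPARC is big-endian, so the low 32 bits of a saved 64-bit register live in the second word of its 8-byte slot. Here is a small endian-independent C sketch of that idea. All the names are hypothetical, and the byte array simply stands in for one slot of the regs structure.

```c
#include <stdint.h>

/* Store a 64-bit value in big-endian byte order, as SPARC memory would hold it. */
static void
store_be64(uint8_t buf[8], uint64_t v)
{
	for (int i = 0; i < 8; i++)
		buf[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load a 32-bit big-endian word, like lduw (zero-extending). */
static uint32_t
load_be32(const uint8_t *p)
{
	return (((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3]);
}

/* Model of "lduw [rp + ON_OFF + 4], %oN": fetch only the low word of the slot. */
static uint32_t
low_word_of_slot(uint64_t saved_reg)
{
	uint8_t slot[8];

	store_be64(slot, saved_reg);
	return (load_be32(slot + 4));
}
```

The result is the same value the earlier srl by 0 would have produced from the input register, which is why the reload makes the srl work redundant.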

Aside: Another Performance Opportunity?

So we see that our original argument shuffling is always undone as we have to reload after a call for microstate accounting, at least. But those reloads are made from the regs structure (cache/memory accesses) while it is clear that the input registers remain untouched and we could simply perform register-to-register manipulations (srl for the 32-bit version, mov for the 64-bit version). Reading through and documenting code like this really is worthwhile - I'll log a bug now!

        ! Call the handler.  The %o's have been set up.                         
        lduw    [%l1 + G1_OFF + 4], %g1         ! get 32-bit code               
        set     sysent32, %g3                   ! load address of vector table  
        cmp     %g1, NSYSCALL                   ! check range                   
        sth     %g1, [THREAD_REG + T_SYSNUM]    ! save syscall code             
        bgeu,pn %ncc, _syscall_ill32                                            
          sll   %g1, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g3, %g4, %g5                   ! g5 = addr of sysentry         
        ldx     [%g5 + SY_CALLC], %g3           ! load system call handler      
        brnz,a,pt %g1, 4f                       ! check for indir()             
        mov     %g5, %l4                        ! save addr of sysentry         
        ! Yuck.  If %g1 is zero, that means we're doing a syscall() via the     
        ! indirect system call.  That means we have to check the                
        ! flags of the targetted system call, not the indirect system call      
        ! itself.  See return value handling code below.                        
        set     sysent32, %l4                   ! load address of vector table  
        cmp     %o0, NSYSCALL                   ! check range                   
        bgeu,pn %ncc, 4f                        ! out of range, let C handle it 
          sll   %o0, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g4, %l4, %l4                   ! compute & save addr of sysent 
        call    %g3                             ! call system call handler      

We load the nominated syscall number into %g1, sanity-check it for range, look up the entry at that index in the table of 32-bit system calls, sysent32, and extract the registered handler (the real implementation). Ignoring the indirect syscall cruft, we then call the handler and the real work of the syscall is executed. Eric Schrock has described the sysent/sysent32 table in his blog entry on adding system calls to Solaris.
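Stripped of the indirect-syscall handling, the lookup is an ordinary bounds-checked table dispatch. The C sketch below shows the shape of it under my own assumptions: the table, its size, the handlers and the struct layout are invented for illustration and are not the real sysent32 definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define NSYSCALL_SKETCH 4u	/* hypothetical table size */

/* Hypothetical handler type: up to six arguments, one return value. */
typedef int64_t (*handler_t)(const uint64_t args[6]);

/* Hypothetical stand-in for a sysent entry. */
struct sysent_sketch {
	uint16_t  sy_flags;
	handler_t sy_callc;
};

static int64_t sk_nosys(const uint64_t a[6]) { (void)a; return (-1); }
static int64_t sk_add(const uint64_t a[6]) { return ((int64_t)(a[0] + a[1])); }

static const struct sysent_sketch sketch_table[NSYSCALL_SKETCH] = {
	{ 0, sk_nosys }, { 0, sk_add }, { 0, sk_nosys }, { 0, sk_nosys },
};

/* Range-check the code with an unsigned compare (like bgeu), then index. */
static const struct sysent_sketch *
sysent_lookup(uint32_t code)
{
	if (code >= NSYSCALL_SKETCH)
		return (NULL);
	return (&sketch_table[code]);
}

/* Look up the entry and call its registered handler. */
static int64_t
syscall_dispatch(uint32_t code, const uint64_t args[6])
{
	const struct sysent_sketch *se = sysent_lookup(code);

	if (se == NULL)
		return (-1);	/* stands in for the _syscall_ill32 path */
	return (se->sy_callc(args));
}
```

In the assembler the index is computed as a shift by SYSENT_SHIFT, which is what array indexing does here; the unsigned compare means a "negative" code is rejected along with any that are too large.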

        ! If handler returns long long then we need to split the 64 bit         
        ! return value in %o0 into %o0 and %o1 for ILP32 clients.               
        lduh    [%l4 + SY_FLAGS], %g4           ! load sy_flags                 
        andcc   %g4, SE_64RVAL | SE_32RVAL2, %g0 ! check for 64-bit return      
        bz,a,pt %xcc, 5f                                                        
          srl   %o0, 0, %o0                     ! 32-bit only                   
        srl     %o0, 0, %o1                     ! lower 32 bits into %o1        
        srlx    %o0, 32, %o0                    ! upper 32 bits into %o0        

For ILP32 clients we need to massage 64-bit return types into 2 adjacent and paired registers.
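In C terms the massaging looks roughly like this. It is a sketch only: split_rval and its wide parameter are my names, with wide standing in for the SE_64RVAL | SE_32RVAL2 flag test.

```c
#include <stdint.h>

/*
 * Model of the return-value fixup for an ILP32 caller.
 * o0/o1 model the %o0/%o1 registers after the handler returns.
 */
static void
split_rval(uint64_t *o0, uint64_t *o1, int wide)
{
	if (!wide) {
		*o0 = (uint32_t)*o0;	/* srl  %o0, 0, %o0: 32-bit only */
		return;
	}
	*o1 = (uint32_t)*o0;		/* srl  %o0, 0, %o1: lower 32 bits */
	*o0 = *o0 >> 32;		/* srlx %o0, 32, %o0: upper 32 bits */
}
```

Note the order in the wide case: the low half has to be copied out to %o1 before %o0 is overwritten with the high half.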

        ! Check for post-syscall processing.                                    
        ! This tests all members of the union containing t_astflag, t_post_sys, 
        ! and t_sig_check with one test.                                        
        ld      [THREAD_REG + T_POST_SYS_AST], %g1                              
        tst     %g1                             ! need post-processing?         
        bnz,pn  %icc, _syscall_post32           ! yes - post_syscall or AST set 
        mov     LWP_USER, %g1                                                   
        stb     %g1, [%l2 + LWP_STATE]          ! set lwp_state                 
        stx     %o0, [%l1 + O0_OFF]             ! set rp->r_o0                  
        stx     %o1, [%l1 + O1_OFF]             ! set rp->r_o1                  
        clrh    [THREAD_REG + T_SYSNUM]         ! clear syscall code            
        ldx     [%l1 + TSTATE_OFF], %g1         ! get saved tstate              
        ldx     [%l1 + nPC_OFF], %g2            ! get saved npc (new pc)        
        mov     CCR_IC, %g3                                                     
        sllx    %g3, TSTATE_CCR_SHIFT, %g3                                      
        add     %g2, 4, %g4                     ! calc new npc                  
        andn    %g1, %g3, %g1                   ! clear carry bit for no error  
        stx     %g2, [%l1 + PC_OFF]                                             
        stx     %g4, [%l1 + nPC_OFF]                                            
        stx     %g1, [%l1 + TSTATE_OFF]                                         

If post-syscall processing is required then branch to _syscall_post32 which will call post_syscall and then "return" by jumping to the return address passed by sys_trap (which is always user_rtt for syscalls). If not then change the lwp_state back to LWP_USER and stash the return value (possibly in 2 registers as above) in the regs structure, clear the curthread->t_sysnum since we're no longer executing a syscall, and step the PC and nPC values on so that the RETRY instruction at the end of user_rtt which we're about to "return" into will not simply re-execute the ta instruction.
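The PC stepping and carry clearing can be modelled in a few lines of C. This is a sketch under stated assumptions: the struct is a cut-down, hypothetical stand-in for struct regs, and the two constants reflect my understanding of the SPARC V9 layout (icc carry is bit 0 of CCR, and CCR sits at bit 32 of TSTATE) rather than being copied from the kernel headers.

```c
#include <stdint.h>

#define CCR_IC_SKETCH		0x1ull	/* assumed: icc carry bit within CCR */
#define TSTATE_CCR_SHIFT_SKETCH	32	/* assumed: position of CCR in TSTATE */

/* Hypothetical cut-down stand-in for struct regs. */
struct regs_sketch {
	uint64_t r_tstate;
	uint64_t r_pc;
	uint64_t r_npc;
};

/* Fast-path success return: clear carry (no error) and step past the ta. */
static void
syscall_success_return(struct regs_sketch *rp)
{
	uint64_t npc = rp->r_npc;

	rp->r_tstate &= ~(CCR_IC_SKETCH << TSTATE_CCR_SHIFT_SKETCH);
	rp->r_pc = npc;		/* resume at the instruction after the ta */
	rp->r_npc = npc + 4;
}
```

The carry bit is how the libc wrapper learns whether the call failed, so clearing it here is the "no error" signal.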

        ! fast path outbound microstate accounting call                         
        mov     LMS_SYSTEM, %o0                                                 
        call    syscall_mstate                                                  
        mov     LMS_USER, %o1                                                   
        jmp     %l0 + 8                                                         

Transition our state from system to user again (for microstate accounting purposes) and "return" through user_rtt as arranged by sys_trap. It is the task of user_rtt to get us back out of the kernel to resume at the instruction indicated in %tstate (for which we stepped the PC and nPC) and continue execution in userland.


Tuesday Jun 14, 2005


thread_nomigrate(): Environmentally friendly prevention of kernel thread migration

The launch of OpenSolaris today means that as a Solaris developer I can take the voice that has already given me and talk not just in general about aspects of Solaris in which I work but in detail and with source freely quoted and referenced as I wish!  We've come a long way - who'd have thought several years ago that employees (techies, even!) would have the freedom to discuss in public what we do for a living in the corporate world (as has delivered for some time now) and now, with OpenSolaris, not just talk in general about subject matter but also discuss the design and implementation.  Fabulous!

I thought I'd start by describing a kernel-private interface I added in Solaris 10 which can be used to request short-term prevention of a kernel thread from migrating between processors.  Thread migration refers to a thread changing processors - running on one processor until preemption or blocking and then resuming on a different processor.  A description of thread_nomigrate (the new interface) soon turns into a mini tour of some aspects of the dispatcher (I don't work in dispatcher land much, I just have an interest in the area, and I had a project that required this functionality).

A Quick Overview of Processor Selection

I'm not going to attempt a nitty-gritty detailed story here - just enough for the discussion below.

The kthread_t member t_state tracks the current run state of a kernel thread.  State TS_ONPROC indicates that a thread is currently running on a processor.  This state is always preceded by state TS_RUN - runnable but not yet on a processor.  Threads in state TS_RUN are enqueued on various dispatch queues; each processor has a bunch of dispatch queues (one for every priority level) and there are other global dispatch queues such as the partition-wide preemption queue.  All enqueuing to dispatch queues is performed by the dispatcher workhorses setfrontdq and setbackdq.  It is these functions which honour processor and processor-set binding requests or call cpu_choose to select the best processor to enqueue on.  When a thread is enqueued on a dispatch queue of some processor it is nominally aimed at being run on that processor, and in most cases will be;  however idle processors may choose to run suitable threads initially dispatched to other processors. Eric Saxe has described a lot more of the operation of the dispatcher and scheduler in his opening day blog.

Requirements for "Weak Binding"

There were already a number of ways of avoiding migration (for a thread not already permanently bound, such as an interrupt thread):

  • Raise IPL to above LOCK_LEVEL.

    Not something you want to do for more than a moment, but it is one way to avoid being preempted and hence also to avoid migration (for as long as the state persists).  Not suitable for general use.

  • processor_bind System Call.

    processor_bind implements the corresponding system call which may be used directly from applications or could be the result of a use of pbind(1M).  It acquires cpu_lock and uses cpu_bind_{thread,process,task,project,zone,contract} depending on the arguments. Function cpu_bind_thread locks the target thread and records the new user-selected binding by processor id in t_bind_cpu of the kthread structure and again but by cpu structure address in t_bound_cpu, and then requeues the thread if it was waiting on a dispatch queue somewhere (thread state TS_RUN) or pokes it off cpu if it is currently on cpu (possibly not the one to which we've just bound it) to force it through the dispatcher at which point the new binding will take effect (it will be noticed in setfrontdq/setbackdq).  The others - cpu_bind_process etc - are built on top of cpu_bind_thread and on each other.

  • thread_affinity_set(kthread_id_t t,int cpu_id) and thread_affinity_clear(kthread_id_t).

    The artist previously known as affinity_set (and still available as that for compatibility), used to request a system-specified (as opposed to userland-specified) binding.  Again this requires that cpu_lock be held (or it acquires it for you if cpu_id is specified as CPU_CURRENT).  It locks the indicated thread (note that it might not be curthread) and sets a hard affinity for the requested (or current) processor by incrementing t_affinitycnt and setting t_bound_cpu in the kthread structure.  The hard affinity count will prevent any processor_bind initiated requests from succeeding.  Finally it forces the target thread through the dispatcher if necessary (so that the requested binding may be honoured).

  • kpreempt_disable() and kpreempt_enable().

    This actually prevents processor migration as a bonus side-effect of disabling preemption.  It is extremely lightweight and usable from any context (well, any where you could ever care about migration); in particular it does not require cpu_lock at all and can be called regardless of IPL and from interrupt context.

    To prevent preemption kpreempt_disable simply increments curthread->t_preempt.  To re-enable preemption this count is decremented.  Uses may be nested so preemption is only possible again when the count returns to zero.  When the count is decremented to zero we must also check for any preemption requests we ignored while preemption was disabled - i.e., whether cpu_kprunrun is set for the current processor - and call kpreempt synchronously now if so.  To understand how that prevents preemption you need to understand a little more of how preemption works in Solaris.  To preempt a thread running on cpu we set cpu_kprunrun for the processor it is on and "poke" that with a null interrupt whereafter return-from-interrupt processing will notice the flag set and call kpreempt.  It is in kpreempt that we consult t_preempt to see if preemption has been temporarily disabled;  if it is then the request is ignored for now and actioned only when preemption is re-enabled.

    Since a thread already running on one processor can only migrate to a new processor if we can get it off the current processor, disabling preemption has a bonus side-effect of preventing migration.  If, however, a thread with preemption disabled performs some operation that causes the thread to sleep (which would be legal but silly - why accept sleeping if you're asking not to be bumped from processor?) then it may be migrated on resume since no part of the set{front,back}dq or cpu_choose code consults t_preempt.

    There is one big disadvantage to using kpreempt_disable.  It, errr, disables preemption which may interfere with the dispatch latency for other threads - preemption should only ever be disabled for a tiny window so that the thread can be pushed out of the way for higher priority threads (especially for realtime threads for which dispatch latency must be bounded).
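The nesting count and deferred-preemption behaviour of kpreempt_disable described in the last item can be modelled in a few lines of C. This is a toy sketch, not kernel code: the structs and the _sketch functions are hypothetical, and a simple counter stands in for actually switching threads.

```c
#include <stdbool.h>

/* Hypothetical cut-down thread and cpu state. */
struct thread_sketch { int t_preempt; };
struct cpu_sketch { bool cpu_kprunrun; int preemptions; };

/* Honour a pending preemption request unless the thread has disabled it. */
static void
kpreempt_sketch(struct thread_sketch *t, struct cpu_sketch *cp)
{
	if (t->t_preempt > 0)
		return;			/* ignored for now; flag stays set */
	cp->cpu_kprunrun = false;
	cp->preemptions++;		/* stand-in for a real thread switch */
}

static void
kpreempt_disable_sketch(struct thread_sketch *t)
{
	t->t_preempt++;
}

/* Re-enable; on the outermost enable, action any request we ignored. */
static void
kpreempt_enable_sketch(struct thread_sketch *t, struct cpu_sketch *cp)
{
	if (--t->t_preempt == 0 && cp->cpu_kprunrun)
		kpreempt_sketch(t, cp);
}

/* Stand-in for setting cpu_kprunrun and poking the cpu with a null interrupt. */
static void
request_preempt_sketch(struct thread_sketch *t, struct cpu_sketch *cp)
{
	cp->cpu_kprunrun = true;
	kpreempt_sketch(t, cp);
}
```

With two nested disables, a preemption request made in between is remembered in cpu_kprunrun and only actioned when the second enable brings the count back to zero.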

Thus we already had userland-requested long-term binding to a specific processor (or set) via processor_bind, system-requested long-term binding to a specific processor via thread_affinity_set, and system-requested short-term "binding" (as in "don't kick me off processor") via kpreempt_disable.

I was modifying kernel bcopy, copyin, copyout and hwblkpagecopy code (see cheetah_copy.s) to add a new hardware test feature which would require that hardware-accelerated copies (bigger copies use the floating point unit and the prefetch cache to speed copy) run on the same processor throughout the copy (even if preempted for a while in mid-copy by a higher priority thread).  I could not use processor_bind (non-starter, it's for user specified binding), nor thread_affinity_set which requires cpu_lock (bcopy(9F) can be called from interrupt context including high level interrupt).  That left kpreempt_disable which, although beautifully light-weight, could not be used for more than a moment without introducing realtime dispatch glitches - and copies (although accelerated) can be very large.  I needed a thread_nomigrate which would stop a kernel thread from migrating from the current processor (whichever you happened to be on when called) but would still allow the thread to be preempted, which was reasonably light-weight (copy code is performance critical), and which had few restrictions on caller context (no more than copy code).  Sounded simple enough!

Some Terminology

I'll refer to threads that are bound to a processor with t_bound_cpu set as being strongly bound.  The processor_bind and thread_affinity_set interfaces produce strong bindings in this sense.  This isn't traditional terminology - none was necessary - but we'll see that the new interface introduces weak binding so I had to call the existing mechanism something.

Processor Offlining

Another requirement of the proposed interface was that it must not interfere with processor offlining.  A quick look at cpu_offline source shows that it fails if there are threads that are strongly bound to the target processor - it waits a short interval to allow any such bindings to drop but if there are any remaining thereafter (no new binding can occur while it waits as cpu_lock is held) the offline attempt fails.  The new interface was required to work more like kpreempt_disable does - not interfere with offlining at all.  kpreempt_disable achieves this through resisting the attempt to preempt the thread with the high-priority per-cpu pause thread - cpu_offline waits until all cpus are running their pause thread so a kpreempt_disable just makes it wait a tiny bit longer.  For the new mechanism, however, we could not acquire cpu_lock as a barrier to preventing new weak bindings (as used in cpu_offline for strong bindings) and the whole point of the new mechanism is not to interfere with preemption so I could not use that method, either.

No Blocking Allowed

As mentioned above, kpreempt_disable does not assure no-migrate semantics if the thread voluntarily gives up cpu.  Since a sleep may take a while we don't want weak-bound threads sleeping as that would interfere with processor offlining.  So we'll outlaw sleeping.  This is no loss - if you can sleep then you can afford to use the full thread_affinity_set route.

Weak-binding Must Be Short-Term

Again to avoid interfering with processor offlining.  A weakbound thread which is preempted will necessarily be queued on the dispatch queues of the processor to which it is weakbound.  During attempted offline of a processor we will need to allow threads weakbound to that processor to drain - we must be sure that allowing threads in TS_RUN state to run a short while longer will be enough for them to complete their action and drop their weak binding.


This turned out to be trickier than initially hoped, which explains some of the liberal commenting you'll find in the source!

void thread_nomigrate(void);

You can view the source to this new function here.  I'll discuss it in chunks below, leaving out the comments that you'll find in the source as I'll elaborate more here.

        cpu_t \*cp;
        kthread_id_t t = curthread;

        kpreempt_disable();
        cp = CPU;

It is the "current" cpu to which we will bind.  To nail down exactly which that is (since we may migrate at any moment!) we must first disable migration and we do this in the simplest way possible.  We must re-enable preemption before returning (and only keep it disabled for a moment).

Note that since we have preemption disabled, any strong binding requests which happen in parallel on other cpus for this thread will not be able to poke us across to the strongbound cpu (which may be different to the one we're currently on).

        if (CPU_ON_INTR(cp) || t->t_flag & T_INTR_THREAD ||
            getpil() >= DISP_LEVEL) {

During a highlevel interrupt context the caller does not own the current thread structure and so should not make changes to it.  If we are a lowlevel interrupt thread then we can't migrate anyway.  If we're at high IPL then we also cannot migrate.  So we need take no action; in thread_allowmigrate we must perform a corresponding test.

        if (t->t_nomigrate && t->t_weakbound_cpu && t->t_weakbound_cpu != cp) {
                if (!panicstr)
                        panic("thread_nomigrate: binding to %p but already "
                            "bound to %p", (void *)cp,
                            (void *)t->t_weakbound_cpu);

Some sanity checking that we've not already weakbound to a different cpu.  Weakbinding is recorded by writing the cpu address to the t_weakbound_cpu member and incrementing the t_nomigrate nesting count, as we'll see below.


Prior to this point we might be racing with a competing strong binding request running on another cpu (e.g., a pbind(1M) command line request on a process in copy code and requesting a weak binding).  But strong binding acquires the thread lock for the target thread, so we can synchronize (without blocking) by grabbing our thread lock.  Note that this restricts the context of callers to those for which grabbing the thread lock is appropriate.

        if (t->t_nomigrate < 0 || weakbindingbarrier && t->t_nomigrate == 0) {
                return;         /* with kpreempt_disable still active */

This was the result of an unfortunate interaction between the initial implementation and pool rebinding (see poolbind(1M)).  Pool bindings must succeed or fail atomically - either all threads are rebound in the request or none are (as described in Andrei's blog).  The rebinding code would acquire cpu_lock (preventing further strong bindings) and check that all rebindings could succeed;  but since cpu_lock does not affect weak binding it could later find that some thread refused the rebinding.  The fix involved introducing a mechanism by which weakbinding could, fleetingly, be upgraded to preemption disabling.  The weakbindingbarrier is raised and lowered by calls to weakbinding_{stop,start}.  If it is raised, or if this is a nested call and we've already gone the no-preempt route for this thread, then we return with preemption disabled and signify/count this through negative counting in t_nomigrate.  The t_weakbound_cpu member will be left NULL.  Note that wimping out and selecting the stronger condition of disabling preemption to achieve no-migration semantics does not significantly undermine the goal of never interfering with dispatch latency: if you are performing pool rebinding operations you expect a glitch as threads are moved.
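The negative counting is easier to see in a toy user-space model.  This is only a sketch and not the kernel code: model_thread_t, model_nomigrate and the preempt_depth field are hypothetical stand-ins for kthread_t, thread_nomigrate and the preemption-disable nesting count, and the thread lock is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few kthread_t members discussed here. */
typedef struct model_thread {
	int nomigrate;              /* >0: weakbound nesting, <0: no-preempt nesting */
	const void *weakbound_cpu;  /* cpu we are weakbound to, or NULL */
	int preempt_depth;          /* models the kpreempt_disable nesting count */
} model_thread_t;

enum model_route { MODEL_WEAKBOUND, MODEL_NO_PREEMPT };

/*
 * Model of one weakbinding request made on cpu 'cp' while the barrier is
 * raised (nonzero) or lowered (zero).
 */
enum model_route
model_nomigrate(model_thread_t *t, const void *cp, int barrier)
{
	t->preempt_depth++;                 /* preemption disabled on entry */

	if (t->nomigrate < 0 || (barrier && t->nomigrate == 0)) {
		t->nomigrate--;             /* count this level negatively */
		return (MODEL_NO_PREEMPT);  /* preemption stays disabled */
	}

	assert(t->nomigrate >= 0);          /* only the weakbound route gets here */
	t->nomigrate++;
	t->weakbound_cpu = cp;
	t->preempt_depth--;                 /* preemption re-enabled before return */
	return (MODEL_WEAKBOUND);
}
```

In this model a thread that took the no-preempt route keeps taking it for nested requests even after the barrier is lowered, and a thread that is already weakbound ignores a barrier raised later.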

It's possible that we are running on a different cpu to which we are strongbound - a strong binding request was made between the time we disabled preemption and when we acquired the thread lock.  We can still grant the weakbinding in this case, which will result in our weak binding being different to our strong binding!  This is not unhealthy as long as we allow the thread to gravitate towards its strongbound cpu as soon as the weakbinding drops (which will be soon since it is a short-term condition).  To favour weakbinding over any strong we will also require some changes in setfrontdq and setbackdq.

Weakbinding requests always succeed - there is no return value to indicate failure.  However we may sometimes want to delay granting a weakbinding request until we are running on a more suitable cpu.  Recall that a weakbinding simply prevents migration during the critical section, but does not nominate a particular cpu.  If our current cpu is the subject of an offline request then we will migrate the thread to another cpu and retry the weakbinding request there.  We do this to avoid the (admittedly unlikely) case that repeated weakbinding requests made by a thread prevent that cpu from being offlined (remember that the strategy is that any weakbound threads waiting to run on an offline target will drop their binding if allowed to run for a moment longer - if new bindings are continually being made then that assumption is violated).

        if (cp != cpu_inmotion || t->t_nomigrate > 0 || t->t_preempt > 1 ||
            t->t_bound_cpu == cp) {
                t->t_weakbound_cpu = cp;

We set cpu_inmotion during cpu_offline to record the target cpu.  If we're not currently on an offline target (the common case) or if we've already weakbound to this cpu (this is a nested call) or if we can't migrate away from this cpu because preemption is disabled or we're strongbound to it then go ahead and grant the weakbinding to this cpu by incrementing the nesting count and recording our weakbinding in t_weakbound_cpu (for the dispatcher).  Make these changes visible to the world before dropping the thread lock so that competing strong binding requests see the full view of the world.  Finally re-enable preemption, and we're done.

        } else {
                /*
                 * Move to another cpu before granting the request by
                 * forcing this thread through preemption code.  When we
                 * get to set{front,back}dq called from CL_PREEMPT()
                 * cpu_choose() will be used to select a cpu to queue
                 * us on - that will see cpu_inmotion and take
                 * steps to avoid returning us to this cpu.
                 */
                cp->cpu_kprunrun = 1;
                kpreempt_enable();      /* will call preempt() */
                goto again;

If we are the target of an offline request and are not obliged to grant the weakbinding to this cpu, then force ourselves onto another cpu.  The dispatcher will lean away from the cpu_inmotion and we'll resume elsewhere and likely grant the binding there.  Who says goto can never be used?
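The grant-or-migrate decision can be pulled out as a pure predicate, which makes the four "grant it here" cases easy to test in isolation.  Again this is a hedged user-space sketch: model_thread_t and model_grant_here are hypothetical names, and the inmotion argument plays the part of cpu_inmotion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kthread_t members the test looks at. */
typedef struct model_thread {
	int nomigrate;          /* weakbinding nesting count */
	int preempt_depth;      /* preemption-disable nesting, including our own */
	const void *bound_cpu;  /* strong binding, or NULL */
} model_thread_t;

/*
 * Returns true if a weakbinding request made on cpu 'cp' should be granted
 * there, false if the thread should first be pushed to another cpu.
 * 'inmotion' is the current offline target, or NULL if there is none.
 */
bool
model_grant_here(const model_thread_t *t, const void *cp, const void *inmotion)
{
	assert(t->preempt_depth >= 1);      /* we disabled preemption on entry */

	return (cp != inmotion ||           /* not an offline target */
	    t->nomigrate > 0 ||             /* nested: already weakbound here */
	    t->preempt_depth > 1 ||         /* caller had preemption disabled */
	    t->bound_cpu == cp);            /* strongbound to this cpu */
}
```

Only a first-time request from an unbound, preemptible thread sitting on the offline target comes back false, which is the one case that takes the goto.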

void thread_allowmigrate(void);

This drops the weakbinding if the nesting count reduces to zero, but must also look out for the special cases made in thread_nomigrate.  Source may be viewed here.

        kthread_id_t t = curthread;

        ASSERT(t->t_weakbound_cpu == CPU ||
            (t->t_nomigrate < 0 && t->t_preempt > 0) ||
            CPU_ON_INTR(CPU) || t->t_flag & T_INTR_THREAD ||
            getpil() >= DISP_LEVEL);

On DEBUG kernels check that all is operating as it should be. There's a story to tell here regarding cpr (checkpoint-resume power management) which I'll recount a little later.

        if (CPU_ON_INTR(CPU) || (t->t_flag & T_INTR_THREAD) ||
            getpil() >= DISP_LEVEL)

This corresponds to the beginning of thread_nomigrate for the case where we did not have to do anything to prevent migration.

        if (t->t_nomigrate < 0) {

Negative nested counting in t_nomigrate indicates that we're resolving weakbinding requests by upgrading them to no-preemption semantics during pool rebinding.

        } else {
                if (t->t_bound_cpu &&
                    t->t_weakbound_cpu != t->t_bound_cpu)
                        CPU->cpu_kprunrun = 1;
                t->t_weakbound_cpu = NULL;

If we decrement the nesting count to 0 then clear our weak binding recorded in t_weakbound_cpu.  If we are weakbound to a different cpu to which we are strongbound (as explained above) force a trip through preempt so that we can now drop all resistance and migrate.
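Putting the two branches together, the unwinding can be modelled in user space roughly as follows.  This is a sketch built from the description here rather than from the kernel source: model_thread_t and model_allowmigrate are hypothetical names, and the boolean return value stands in for setting cpu_kprunrun.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kthread_t members discussed here. */
typedef struct model_thread {
	int nomigrate;              /* >0: weakbound nesting, <0: no-preempt nesting */
	int preempt_depth;          /* models the preemption-disable nesting count */
	const void *weakbound_cpu;  /* weak binding, or NULL */
	const void *bound_cpu;      /* strong binding, or NULL */
} model_thread_t;

/*
 * Undo one level of weakbinding.  Returns true if the caller should force
 * a trip through the dispatcher so the thread can move to its strongbound cpu.
 */
bool
model_allowmigrate(model_thread_t *t)
{
	bool force;

	assert(t->nomigrate != 0);          /* must pair with an earlier request */

	if (t->nomigrate < 0) {
		/* Granted as no-preempt: unwind both nesting counts. */
		t->nomigrate++;
		assert(t->preempt_depth > 0);
		t->preempt_depth--;
		return (false);
	}

	if (--t->nomigrate > 0)
		return (false);             /* still nested, stay weakbound */

	/* Last level dropped: weak and strong bindings may disagree. */
	force = (t->bound_cpu != NULL && t->weakbound_cpu != t->bound_cpu);
	t->weakbound_cpu = NULL;
	return (force);
}
```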

Changes to setfrontdq and setbackdq

As outlined above it is these two functions which select dispatch queues on which to place threads that are in a runnable state (including threads preempted from cpu).  These functions already checked for strong binding of the thread being enqueued, so they required an additional check for weak binding.  As explained above it is sometimes possible that a thread be both strong and weak bound, normally to the same cpu but sometimes for a short time to different cpus - the changes should therefore favour weak binding over strong.
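The precedence rule amounts to something like the following sketch.  It is an illustration only: model_thread_t and model_queue_cpu are hypothetical names, and the real functions do a great deal more than pick a cpu.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two binding members of kthread_t. */
typedef struct model_thread {
	const void *bound_cpu;      /* strong binding, or NULL */
	const void *weakbound_cpu;  /* weak binding, or NULL */
} model_thread_t;

/*
 * Pick the cpu whose dispatch queue a runnable thread should join.
 * A weak binding wins over a strong one; with neither, the dispatcher's
 * own preference ('preferred') is used.
 */
const void *
model_queue_cpu(const model_thread_t *t, const void *preferred)
{
	assert(t != NULL);

	if (t->weakbound_cpu != NULL)
		return (t->weakbound_cpu);
	if (t->bound_cpu != NULL)
		return (t->bound_cpu);
	return (preferred);
}
```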

Changes to cpu_offline

The cpu_lock is held on calling cpu_offline, and that stops further strong bindings to the target (or any) cpu while we're in cpu_offline.  Except in special circumstances (of a failing cpu) a cpu with bound threads cannot be offlined;  if there are any strongbound threads then cpu_offline performs a brief delay loop to give them a chance to unbind and then fails if any remain.  The existence of strongbound threads is checked with disp_bound_threads and disp_bound_anythreads.

To meet the requirement that weakbinding not interfere with offlining we needed a similar mechanism to prevent any further weak bindings to the target cpu and a means of allowing existing weak bindings to drain; we must do this, however, without using a mutex or similar.

The solution was to introduce cpu_inmotion, which is normally NULL but is set to the target cpu address while that cpu is being offlined.  Since this variable is not protected by any mutex, some consideration of memory ordering in multiprocessor systems is required.  We force the store to cpu_inmotion to global visibility in cpu_offline so we know that no new loads (on other cpus) will see the old value after that point (we've "raised a barrier" to weak binding).  However, loads already performed on other cpus may have the old value (we're not synchronised in any way), so we have to be prepared for a thread running on the target cpu to still manage to weakbind just one last time.  In that case we repeat the loop to allow weakbound threads to drain, and thereafter we know no further weakbindings could have occurred since the barrier has long since been visible.  The weakbinding barrier cpu_inmotion is checked in thread_nomigrate, and a thread trying to weakbind to the cpu that is the target of an offline request will go through preemption code to first migrate to another cpu.
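The barrier-and-drain idea can be sketched with C11 atomics in user space.  Everything named model_* is hypothetical, the per-cpu weakbound counter is an invention of the model (the kernel inspects its dispatch queues instead), and a sequentially consistent atomic_store stands in for forcing the store to global visibility.  The drain loop simply repeats while bindings remain, which is what covers a late binder that loaded the old value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu state: how many threads are weakbound here. */
typedef struct model_cpu {
	atomic_int weakbound;
} model_cpu_t;

/* NULL normally; the offline target while an offline is in progress. */
static _Atomic(model_cpu_t *) model_cpu_inmotion;

/* Returns false if the caller should migrate elsewhere and retry. */
bool
model_try_weakbind(model_cpu_t *cp)
{
	if (atomic_load(&model_cpu_inmotion) == cp)
		return (false);
	atomic_fetch_add(&cp->weakbound, 1);
	return (true);
}

void
model_drop_weakbind(model_cpu_t *cp)
{
	int old = atomic_fetch_sub(&cp->weakbound, 1);

	assert(old > 0);
	(void) old;
}

/* Raise the barrier, then let existing bindings drain; returns pass count. */
int
model_offline_drain(model_cpu_t *cp)
{
	int passes = 0;

	atomic_store(&model_cpu_inmotion, cp);  /* sequentially consistent */
	while (atomic_load(&cp->weakbound) > 0) {
		/* Stands in for letting one weakbound thread run and unbind. */
		model_drop_weakbind(cp);
		passes++;
	}
	return (passes);
}

/* Lower the barrier once the offline attempt has finished. */
void
model_offline_done(void)
{
	atomic_store(&model_cpu_inmotion, (model_cpu_t *)NULL);
}
```

While the barrier is up, requests on the target cpu are refused in this model but requests on other cpus are unaffected, which is the property that lets offlining proceed without a mutex.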

A Twist In The Tail

I integrated thread_nomigrate along with the project that first required it into build 63 of Solaris 10.  A number of builds later a bug turned up in the case described above where a thread may be temporarily weakbound to a different cpu to the one to which it is strongbound.  In fixing that I modified the assertion test in thread_allowmigrate.  The test suite we had developed for the project was modified to cover the new case, and I put the changes back after successful testing.

Or so I thought.  ON archives routinely go for pre-integration regression testing (before being rolled up to the whole wad-of-stuff that makes a full Solaris build) and they soon turned up a test failure that had systems running DEBUG archives failing the new assertion check in thread_allowmigrate during cpr (checkpoint-resume - power management) validation tests.

Now cpr testing is on the putback checklist but I'd skipped it in the bug fix on the grounds that I couldn't possibly have affected it.  Well that was true - the newly uncovered bug was actually introduced back in the initial putback of build 63 (about 14 weeks earlier) but was now exposed by my extended assertion.

Remember that the initial consumer of the thread_nomigrate interface was to be some modified kernel hardware-accelerated copy code - bcopy in particular.  Well it turns out that cpr uses bcopy when it writes the pages of the system to disk for later restore, taking special care of some pages which may change during the checkpoint operation itself.  However it did not take any special care with regard to the kthread_t structure of the thread performing the cpr operation, and when bcopy called thread_nomigrate the thread structure for the running thread would record the current cpu address in t_weakbound_cpu and the nesting count in t_nomigrate; if the page being checkpointed/copied happened to be that containing this particular kthread_t then those values were preserved and restored on the resume operation - undoing the stores of thread_allowmigrate for this thread - effectively warping us back in time!
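The time warp is easy to reproduce in miniature.  In this user-space sketch (all model_* names are hypothetical, and a single memcpy stands in for copying the page that holds the thread structure) the snapshot is taken while the copying thread is still weakbound, so restoring it brings the binding back from the dead.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the two kthread_t members involved. */
typedef struct model_kthread {
	int nomigrate;      /* weakbinding nesting count */
	int weakbound_cpu;  /* cpu id, or -1 for none */
} model_kthread_t;

/*
 * Checkpoint: the copy routine weakbinds the running thread, copies the
 * "page" holding that thread's own structure, then drops the binding.
 */
void
model_checkpoint(model_kthread_t *t, model_kthread_t *image, int cpu)
{
	assert(t != image);                 /* memcpy needs distinct objects */

	t->nomigrate++;                     /* weakbind on entry to the copy */
	t->weakbound_cpu = cpu;

	memcpy(image, t, sizeof (*image));  /* snapshot taken while bound */

	if (--t->nomigrate == 0)            /* binding dropped after the copy */
		t->weakbound_cpu = -1;
}

/* Resume: the saved image overwrites the live structure. */
void
model_resume(model_kthread_t *t, const model_kthread_t *image)
{
	memcpy(t, image, sizeof (*t));
}
```

After model_checkpoint the live structure is clean, but after model_resume it once again claims a nesting count of one and a weak binding that nothing will ever drop.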

There's certainly a moral there: never assume you understand all the interactions of the various elements of the OS, and do perform all required regression testing no matter how unrelated it seems at the time!  I just required the humiliation of the "fix" being backed out to remind me of this.


Tuesday May 10, 2005

OpenSolaris Release Date Set

For the nay-sayers who claim that "they'll never do it" - Casper has given a strong indication of the planned release date for OpenSolaris.  Should be amusing to see what the die-hard trolls in comp.unix.solaris and comp.os.linux.advocacy make of it!

I work in the Fault Management core group; this blog describes some of the work performed in that group.

