SPARC System Call Anatomy

Russ Blaine has described aspects of the x86 and x64 system call implementation in OpenSolaris. In this entry I'll describe the codeflow from userland to kernel and back for a SPARC system call.

Making A System Call

An application making a system call actually calls a libc wrapper function which performs any required posturing and then enters the kernel with a software trap instruction. This means that user code and compilers do not need to know the runes to enter the kernel, and allows binaries to work on later versions of the OS where perhaps the runes have been modified, system call numbers newly overloaded etc.

OpenSolaris for SPARC supports 3 software traps for entering the kernel:

S/W Trap #   Instruction   Description
0x0          ta 0x0        System calls for binaries running in SunOS 4.x binary compatibility mode
0x8          ta 0x8        32-bit (ILP32) binary running on a 64-bit (LP64) kernel
0x40         ta 0x40       64-bit (LP64) binary running on a 64-bit (LP64) kernel

Since OpenSolaris (like Solaris since Solaris 10) no longer includes a 32-bit kernel, the ILP32-syscall-on-ILP32-kernel case is no longer implemented.

In the wrapper function the syscall arguments are rearranged if necessary (the kernel function implementing the syscall may expect them in a different order from the syscall API; for example, multiple related system calls may share a single system call number and select behaviour based on an additional argument passed into the kernel). The wrapper then places the system call number in register %g1 and executes one of the above trap-always instructions (e.g., the 32-bit libc will use ta 0x8 while the 64-bit libc will use ta 0x40). There's a lot more activity and posturing in the wrapper functions than described here, but for our purposes we simply note that it all boils down to a ta instruction to enter the kernel.

Handling A System Call Trap

A ta n instruction, as executed in userland by the wrapper function, results in a trap of type 0x100 + n being taken, and we move from traplevel 0 (where all userland and most kernel code executes) to traplevel 1 in nucleus context. Code that executes in nucleus context has to be handcrafted in assembler, since nucleus context does not comply with the ABI and other conventions and is generally much more restricted in what it can do. The task of the trap handler executing at traplevel 1 is to provide the necessary glue to get us back to TL0 and running privileged (kernel) C code that implements the actual system call.

The trap table entries for sun4u and sun4v for these traps are identical. I'm going to follow the two regular syscall traps and ignore the SunOS 4.x trap. Note that a trap table handler has just 8 instructions dedicated to it in the trap table - it must use these to do a little work and then branch elsewhere:

/*
 * SYSCALL is used for system calls on both ILP32 and LP64 kernels
 * depending on the "which" parameter (should be either syscall_trap
 * or syscall_trap32).
 */
#define SYSCALL(which)                  \
        TT_TRACE(trace_gen)             ;\
        set     (which), %g1            ;\
        ba,pt   %xcc, sys_trap          ;\
        sub     %g0, 1, %g4             ;\
        .align  32

...
...

trap_table:
scb:    
trap_table0:
        /* hardware traps */
        ...
        ...
        /* user traps */
        GOTO(syscall_trap_4x);          /* 100  old system call */
        ...
        SYSCALL(syscall_trap32);        /* 108  ILP32 system call on LP64 */
        ...
        SYSCALL(syscall_trap)           /* 140  LP64 system call */
        ...

So in both cases we branch to sys_trap, requesting a TL0 handler of syscall_trap32 for an ILP32 syscall and syscall_trap for an LP64 syscall. In both cases we request that the PIL remain as it currently is (always 0, since we came from userland). sys_trap is generic glue code that takes us from nucleus (TL>0) context back to TL0, running a specified handler (address in %g1, usually written in C) at a chosen PIL. The specified handler is called with arguments taken from registers %g2 and %g3 at the time we branch to sys_trap; the SYSCALL macro above does not move anything into these registers (no arguments are passed to the handler). sys_trap handlers are always called with a first argument pointing to a struct regs that provides access to all the register values at the time of branching to sys_trap; for syscalls these will include the system call number in %g1 and the arguments in the output registers. (Note that %g1 as prepared in the wrapper and %g1 as used in the SYSCALL macro for the trap table entry are not the same register - on trap we move from the regular globals, in which userland executes, onto the alternate globals - but the sys_trap glue collects all the correct user registers together and makes them available in the struct regs it passes to the handler.)

sys_trap is also responsible for setting up our return linkage. When the TL0 handling is complete the handler will return, restoring the stack pointer and program counter as constructed in sys_trap. Since we trapped from userland it will be user_rtt that is interposed as the glue that TL0 handling code will return into, and which will get us back out of the kernel and into userland again.

Aside: Fancy Improving Something In OpenSolaris?

Adam Leventhal logged bug 4816328 "system call traps should go straight to user_trap" some time ago. As described above, the SYSCALL macro branches to sys_trap:

        ENTRY_NP(sys_trap)
        !
        ! force tl=1, update %cwp, branch to correct handler
        !
        wrpr    %g0, 1, %tl
        rdpr    %tstate, %g5
        btst    TSTATE_PRIV, %g5
        and     %g5, TSTATE_CWP, %g6
        bnz,pn  %xcc, priv_trap
        wrpr    %g0, %g6, %cwp

        ALTENTRY(user_trap)
	...
	...
Well, we know that we're at TL1 and that we were unprivileged before the trap, so (aside from the current-window-pointer manipulation, which Adam explains in the bug report - it's not required coming from a syscall trap) we could save a few instructions by going straight to user_trap from the trap table. Adam's benchmarking suggests that this can save around 45ns per system call - more than 1% of a quick system call!

syscall_trap32(struct regs *rp);

We'll follow the ILP32 syscall route; the route for LP64 is analogous, with trivial differences such as not having to clear the upper 32 bits of arguments. You can view the source here. This runs at TL0 as a sys_trap handler so it could be written in C; however, for performance and hands-on assembler-level reasons it is written in assembler. Our task is to look up and call the nominated system call handler, performing the required housekeeping along the way.

        ENTRY_NP(syscall_trap32)                                                
        ldx     [THREAD_REG + T_CPU], %g1       ! get cpu pointer               
        mov     %o7, %l0                        ! save return addr              

First note that we do not obtain a new register window here - we will squat within the window that sys_trap crafted for itself. Normally this would mean that you'd have to live within the output registers, but by agreement handlers called via sys_trap are permitted to use registers %l0 thru %l3.

We begin by loading a pointer to the cpu this thread is executing on into %g1, and saving the return address (in %o7, as constructed by sys_trap) into %l0.

        !                                                                       
        ! If the trapping thread has the address mask bit clear, then it's      
        !   a 64-bit process, and has no business calling 32-bit syscalls.      
        !                                                                       
        ldx     [%o0 + TSTATE_OFF], %l1         ! saved %tstate.am is that      
        andcc   %l1, TSTATE_AM, %l1             !   of the trapping proc        
        be,pn   %xcc, _syscall_ill32            !                               
          mov   %o0, %l1                        ! save reg pointer              

The comment says it all. The AM bit in the PSTATE at the time we trapped (executed the ta instruction) is available in the %tstate register after the trap, and sys_trap preserved it for us in the regs structure before it could be modified by further traps. Assuming we're not a 64-bit app making a 32-bit syscall:

        srl     %i0, 0, %o0                     ! copy 1st arg, clear high bits 
        srl     %i1, 0, %o1                     ! copy 2nd arg, clear high bits 
        ldx     [%g1 + CPU_STATS_SYS_SYSCALL], %g2                              
        inc     %g2                             ! cpu_stats.sys.syscall++       
        stx     %g2, [%g1 + CPU_STATS_SYS_SYSCALL]                              

The libc wrapper placed up to the first 6 arguments in %o0 thru %o5 (with the rest, if any, on the stack). During sys_trap a SAVE instruction was performed to obtain a new register window, so those arguments are now available in the corresponding input registers (despite us not performing a save in syscall_trap32 itself). We're going to call the real handler, so we prepare the arguments in our outputs (which we share with sys_trap, but outputs are understood to be volatile across calls). The shift-right-logical by 0 bits is a 32-bit operation (i.e., srl, not srlx) so it performs no shifting, but it does clear the uppermost 32 bits of the arguments. We also increment the statistic counting the number of system calls made by this cpu; this statistic is in the cpu_t, and the offset, like most such offsets, is generated by genassym.

        !                                                                       
        ! Set new state for LWP                                                 
        !                                                                       
        ldx     [THREAD_REG + T_LWP], %l2                                       
        mov     LWP_SYS, %g3                                                    
        srl     %i2, 0, %o2                     ! copy 3rd arg, clear high bits 
        stb     %g3, [%l2 + LWP_STATE]                                          
        srl     %i3, 0, %o3                     ! copy 4th arg, clear high bits 
        ldx     [%l2 + LWP_RU_SYSC], %g2        ! pesky statistics              
        srl     %i4, 0, %o4                     ! copy 5th arg, clear high bits 
        addx    %g2, 1, %g2                                                     
        stx     %g2, [%l2 + LWP_RU_SYSC]                                        
        srl     %i5, 0, %o5                     ! copy 6th arg, clear high bits 
        ! args for direct syscalls now set up                                   

We continue preparing arguments as above. Interleaved with these instructions we change the lwp_state member of the associated lwp structure (there must be one - a user thread made a syscall; this is not a kernel thread) to indicate that it is running in-kernel (LWP_SYS; it would have been LWP_USER prior to this update), and increment the count of the number of syscalls made by this particular lwp (there is a 1:1 correspondence between user threads and lwps these days).

Next we write a TRAPTRACE entry - only on DEBUG kernels. That's a topic for another day - I'll skip the code here, too.

While we're on the subject of tracing, note that the next code snippet mentions SYSCALLTRACE. This is not defined in normal production kernels. But, of course, one of the great beauties of DTrace is that it doesn't require custom kernels to perform its tracing, since it can insert/enable probes on-the-fly - so SYSCALLTRACE is all but worthless now!

        !                                                                       
        ! Test for pre-system-call handling                                     
        !                                                                       
        ldub    [THREAD_REG + T_PRE_SYS], %g3   ! pre-syscall proc?             
#ifdef SYSCALLTRACE                                                             
        sethi   %hi(syscalltrace), %g4                                          
        ld      [%g4 + %lo(syscalltrace)], %g4                                  
        orcc    %g3, %g4, %g0                   ! pre_syscall OR syscalltrace?  
#else                                                                           
        tst     %g3                             ! is pre_syscall flag set?      
#endif /\* SYSCALLTRACE \*/                                                       
        bnz,pn  %icc, _syscall_pre32            ! yes - pre_syscall needed      
          nop                                                                   
                                                                                
        ! Fast path invocation of new_mstate                                    
        mov     LMS_USER, %o0                                                   
        call    syscall_mstate                                                  
        mov     LMS_SYSTEM, %o1                                                 
                                                                                
        lduw    [%l1 + O0_OFF + 4], %o0         ! reload 32-bit args            
        lduw    [%l1 + O1_OFF + 4], %o1                                         
        lduw    [%l1 + O2_OFF + 4], %o2                                         
        lduw    [%l1 + O3_OFF + 4], %o3                                         
        lduw    [%l1 + O4_OFF + 4], %o4                                         
        lduw    [%l1 + O5_OFF + 4], %o5                                         

        ! lwp_arg now set up                                                    
3:

If the curthread->t_pre_sys flag is set then we branch to _syscall_pre32 to call pre_syscall. If that does not abort the call, it reloads the outputs with the args (they were lost across the call) using lduw instructions from the regs area, loading just the lower 32-bit word of each arg (we can no longer use srl by 0, since no registers hold the arguments anymore), and branches back to label 3 above (as if we'd done the same after a call to syscall_mstate).

If we don't have pre-syscall work to perform then we call syscall_mstate(LMS_USER, LMS_SYSTEM) to record the transition from user to system state for microstate accounting purposes. Microstate accounting is always performed now - it used not to be enabled by default and was switched on only when desired.

After the unconditional call to syscall_mstate we reload the arguments from the regs struct into the output registers (as after the pre-syscall work). Evidently our earlier srl work on the args is a complete waste of time (although not an expensive one), since we always end up loading them from the passed regs structure. This appears to be a hangover from the days when microstate accounting was not always enabled.

Aside: Another Performance Opportunity?

So we see that our original argument shuffling is always undone, as we have to reload after the call for microstate accounting, at least. But those reloads are made from the regs structure (cache/memory accesses), while it is clear that the input registers remain untouched and we could simply perform register-to-register manipulations (srl for the 32-bit version, mov for the 64-bit version). Reading through and documenting code like this really is worthwhile - I'll log a bug now!

        !                                                                       
        ! Call the handler.  The %o's have been set up.                         
        !                                                                       
        lduw    [%l1 + G1_OFF + 4], %g1         ! get 32-bit code               
        set     sysent32, %g3                   ! load address of vector table  
        cmp     %g1, NSYSCALL                   ! check range                   
        sth     %g1, [THREAD_REG + T_SYSNUM]    ! save syscall code             
        bgeu,pn %ncc, _syscall_ill32                                            
          sll   %g1, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g3, %g4, %g5                   ! g5 = addr of sysentry         
        ldx     [%g5 + SY_CALLC], %g3           ! load system call handler      
                                                                                
        brnz,a,pt %g1, 4f                       ! check for indir()             
        mov     %g5, %l4                        ! save addr of sysentry         
        !                                                                       
        ! Yuck.  If %g1 is zero, that means we're doing a syscall() via the     
        ! indirect system call.  That means we have to check the                
        ! flags of the targetted system call, not the indirect system call      
        ! itself.  See return value handling code below.                        
        !                                                                       
        set     sysent32, %l4                   ! load address of vector table  
        cmp     %o0, NSYSCALL                   ! check range                   
        bgeu,pn %ncc, 4f                        ! out of range, let C handle it 
          sll   %o0, SYSENT_SHIFT, %g4          ! delay - get index             
        add     %g4, %l4, %l4                   ! compute & save addr of sysent 
4:                                                                              
        call    %g3                             ! call system call handler      
        nop                                                                     

We load the nominated syscall number into %g1, sanity-check it for range, look up the entry at that index in the table of 32-bit system calls (sysent32), and extract the registered handler (the real implementation). Ignoring the indirect syscall cruft, we then call the handler and the real work of the syscall is executed. Eric Schrock has described the sysent/sysent32 table in his blog entry on adding system calls to Solaris.

        !                                                                       
        ! If handler returns long long then we need to split the 64 bit         
        ! return value in %o0 into %o0 and %o1 for ILP32 clients.               
        !                                                                       
        lduh    [%l4 + SY_FLAGS], %g4           ! load sy_flags                 
        andcc   %g4, SE_64RVAL | SE_32RVAL2, %g0 ! check for 64-bit return      
        bz,a,pt %xcc, 5f                                                        
          srl   %o0, 0, %o0                     ! 32-bit only                   
        srl     %o0, 0, %o1                     ! lower 32 bits into %o1        
        srlx    %o0, 32, %o0                    ! upper 32 bits into %o0        

For ILP32 clients we need to massage 64-bit return types into the two adjacent, paired registers %o0 and %o1.

        !                                                                       
        ! Check for post-syscall processing.                                    
        ! This tests all members of the union containing t_astflag, t_post_sys, 
        ! and t_sig_check with one test.                                        
        !                                                                       
        ld      [THREAD_REG + T_POST_SYS_AST], %g1                              
        tst     %g1                             ! need post-processing?         
        bnz,pn  %icc, _syscall_post32           ! yes - post_syscall or AST set 
        mov     LWP_USER, %g1                                                   
        stb     %g1, [%l2 + LWP_STATE]          ! set lwp_state                 
        stx     %o0, [%l1 + O0_OFF]             ! set rp->r_o0                  
        stx     %o1, [%l1 + O1_OFF]             ! set rp->r_o1                  
        clrh    [THREAD_REG + T_SYSNUM]         ! clear syscall code            
        ldx     [%l1 + TSTATE_OFF], %g1         ! get saved tstate              
        ldx     [%l1 + nPC_OFF], %g2            ! get saved npc (new pc)        
        mov     CCR_IC, %g3                                                     
        sllx    %g3, TSTATE_CCR_SHIFT, %g3                                      
        add     %g2, 4, %g4                     ! calc new npc                  
        andn    %g1, %g3, %g1                   ! clear carry bit for no error  
        stx     %g2, [%l1 + PC_OFF]                                             
        stx     %g4, [%l1 + nPC_OFF]                                            
        stx     %g1, [%l1 + TSTATE_OFF]                                         

If post-syscall processing is required then we branch to _syscall_post32, which will call post_syscall and then "return" by jumping to the return address passed by sys_trap (always user_rtt for syscalls). If not, then we change the lwp_state back to LWP_USER, stash the return value (possibly in 2 registers, as above) in the regs structure, clear curthread->t_sysnum since we're no longer executing a syscall, and step the PC and nPC values on so that the RETRY instruction at the end of user_rtt, which we're about to "return" into, will not simply re-execute the ta instruction.

        ! fast path outbound microstate accounting call                         
        mov     LMS_SYSTEM, %o0                                                 
        call    syscall_mstate                                                  
        mov     LMS_USER, %o1                                                   
                                                                                
        jmp     %l0 + 8                                                         
         nop                                                                    

Transition our state from system to user again (for microstate accounting purposes) and "return" through user_rtt as arranged by sys_trap. It is the task of user_rtt to get us back out of the kernel to resume at the instruction indicated in %tstate (for which we stepped the PC and nPC) and continue execution in userland.

Comments:

I want to know: what is the policy for scheduling traps and syscalls? Assume core 10 hits a syscall. Who is going to execute that syscall? Core 10 or another core?

Posted by m on July 03, 2013 at 03:45 PM EST #

A trap executes on the cpu on which the trap instruction was executed. A system call is just one type of trap, invoked from userland with a "trap always" instruction (libc does this for you - you link with the libc wrapper). A so-called "fast trap" will run to completion at traplevel > 0, but most system calls require handling in C code running at TL0. So the syscall trap handler invokes the user_trap code, which arranges for us to execute the nominated handler at TL0 and a chosen PIL - zero for syscalls. Now we are just a regularly schedulable thread, which may migrate to other cpus, e.g. if it waits too long on this cpu for cpu time; or, if we block, we could return on a different cpu. Only if you bind to the cpu before the syscall are you guaranteed that all of that syscall will complete on the same cpu.

Posted by guest on July 03, 2013 at 04:04 PM EST #

Thanks for reply.
>Only if you bind to the cpu before the syscall are you
>guaranteed that all of that syscall will complete on the same cpu.
And that may hurt the performance if the trap is waiting too long on a cpu. Right?

Posted by m on July 03, 2013 at 04:09 PM EST #

Possibly - by binding you are decreasing the resource on which we can run, so you could reduce performance. If you use processor sets then you can reserve sets of processors exclusively for stuff that you bind to the set - anything else must run elsewhere, and you can direct interrupts away, too. In the usual case there is little to be gained from binding - the thread won't bounce around cpus too readily, as we have algorithms to make use of warm caches, latency groups etc. And many system calls are dominated by off-cpu sleep time, such as waiting for an IO to complete.

Posted by guest on July 03, 2013 at 04:34 PM EST #

One thing is not clear. You said "which may migrate to other cpus eg if it waits too long on this cpu for cpu time, or if we block we could return on a different cpu". The thing is, when a cpu hits a syscall or trap, the most idle processor is itself, and that means all of its resources are free (otherwise the resources would be occupied, which would mean the cpu is running something). So under what conditions does the "waits too long" happen?

Posted by m on July 04, 2013 at 04:24 PM EST #

There can be lots of active threads in the system, potentially more than we have cpu resources. Each cpu has a whole bunch of dispatch queues corresponding to the thread priority levels, and while only one thread executes on a given cpu at a time there may be many others waiting on the queues of that cpu. If the running userland thread performs a syscall which then blocks, we'll switch to a waiting thread, or to the idle thread if there is none. When the thread becomes runnable again it will usually wait on the queues of the cpu it last ran on, but it is not necessarily bound to that cpu, and another idle cpu could decide to lend a hand and run the thread. The thread scheduling decisions are not really any different for being in a syscall vs running in userland - the dispatcher just dispatches threads, whatever their nature and whether running in the kernel or in userland, in priority order.

Posted by gavin on July 04, 2013 at 04:52 PM EST #

That is interesting. Is there any command or method to find the dispatch queue length at any given time or over a period (for example, every 1ms I want to see the dispatch queue length on all cores)?

Posted by m on July 04, 2013 at 06:45 PM EST #

vmstat(1M) in the 'r' column will report the total number of runnable threads
(not counting those actually executing on cpu).

It's not trivial to observe run queue lengths in realtime. Any tool that
collected that info to present nicely at the command line would have a substantial
negative performance impact. You can use DTrace to observe enqueue and dequeue
and count up and down in the script, as per the example at https://wikis.oracle.com/display/DTrace/sched+Provider . You can also write DTrace scripts to look
at the disp_t embedded within each cpu_t - the disp_nrunnable member records the
total runnable threads across all priority queues on that cpu. You can sample that using a profile probe.

Posted by gavin on July 05, 2013 at 01:57 PM EST #

About

I work in the Fault Management core group; this blog describes some of the work performed in that group.
