Wednesday Jun 15, 2005

Solaris Internals (How I Spent My Summer Vacation)

Solaris Internals, Second Edition, and Watching the Kernel Flow

The launch of OpenSolaris, and all the very cool technology and features in Solaris 10 have generated a great deal of excitement in the industry. I for one am particularly jazzed about these events. Let me take a minute and talk about why, and then we'll do a bit of technical stuff...

Solaris Internals

Back in 1997 (and on through 2000), I coauthored Solaris Internals with my colleague and good mate Richard McDougall. After a few fits, sputters and false starts, we finally found ourselves sitting on opposite ends of a phone connection (Richard lives in the Bay Area, I'm in New Jersey), each holding a shiny new copy of Solaris Internals, exchanging grunts of disbelief. All of a sudden, just as my euphoria was reaching its peak, I heard a groaned "Uh oh" from the other end of the phone. Richard (who never misses anything) had spotted a bug - a rather severe bug. Page 107 was not followed by page 108, but rather by page 180! You're cruising along, reading all about kernel bootstrap and initialization, flip the page, and whammo, you're staring at a diagram of the page scanner's two-handed clock algorithm. Not good.

We immediately placed a call to the publisher with a cry of "stop the presses!" They responded immediately, pulling the books that had been printed, fixing the transposed pages, and starting over. As far as I know, Richard and I have the only copies with the page bug (I'm waiting to sell mine on eBay - private bids accepted).

Following the publication of the first edition, Richard and I decided we needed a little time to decompress, get to know our families again, get caught up on some stuff around the house, and generally try to feel normal for a short while. After a bit of that, we'd jump in and get going on an update for Solaris 8 and Solaris 9. As it happens, writing and exercising (working out) have a few things in common: it's really hard to get started, and especially hard after you've stopped for a bit. Time marched on, and our efforts failed to keep pace. We finally reached a point where it just made more sense to focus our efforts on Solaris 10, and that is precisely what we are doing. We're working feverishly to complete the manuscript this summer and get it on the shelves ASAP. Our thanks for all the support and kind words we have received over the years on the first edition.

What's Different The Second Time Around?

Working on the first edition was largely an exercise in reading through the kernel source and describing what it does in English text. For those areas where I found myself scratching my head, there were a lot of email exchanges with various members of Solaris Kernel Engineering, seeking clarification and/or the rationale behind a design choice. I'm compelled to add here that the engineers on the receiving end of my emails are the real authors (at least for the stuff I wrote - not speaking for Richard here) - an amazing group of individuals who remained supportive and responsive throughout the effort. Their contributions to the accuracy and thoroughness of the text are immeasurable.

Working on the second edition, we have a new tool that makes understanding what is happening in the kernel a whole heck of a lot easier. I am talking, of course, about DTrace. DTrace is a truly amazing technology, and having spent a bit of time looking at various areas of DTrace in the source code, I am even more amazed. I'm still trying to figure out how they did it! With DTrace at our disposal, we now have a tool that substantially reduces the amount of time required to understand the code path through the kernel for a specific event (that is, of course, just one of the zillions of things you can do with DTrace). Here are a couple of quick examples...

"Trussing" the Kernel...

A potentially interesting approach to walking through the Solaris kernel source is to pick an event or subsystem of interest, use DTrace to generate the code path, and use that as a reference point to start your source examination. An area that I've looked at and written about a bit is the process/threads architecture and subsystem, so we'll take a quick look at process and thread creation. But first, a quick note.

Threads, and thread models, exist in various forms in modern operating systems. In Solaris, they've been a core part of the kernel from the beginning (Solaris 2.0, circa 1991), as the kernel itself is multithreaded. Additionally, application programming interfaces (APIs) exist so software developers can write multithreaded applications. Solaris threads have evolved over time, going through a series of refinements and some architectural changes. Phil Harman, also a very good friend and co-worker, is an expert in this area, and authored a great paper for Solaris 9 called Multithreading in the Solaris Operating System - an outstanding technical reference describing the architectural changes and characteristics of the new thread model that was introduced in Solaris 9 (well, Solaris 8 technically - read the paper for the particulars).

Threads are executable objects: every process has one or more threads, and each thread can be scheduled and executed independently of other threads in the same process. In Solaris, think of a process as a state container for one or more threads. Various components of the process are shared by the threads in the process - the address space (memory space), open files, credentials, etc. Each thread, however, has its own stack, and each thread can issue system calls and enter the kernel.
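To make the state-container idea concrete, here's a minimal C sketch (my own illustration - the names are made up, and the unsynchronized increment is deliberately simplistic) showing two threads sharing a process global while each keeps automatic data on its own private stack:

#include <stdio.h>
#include <pthread.h>

int shared = 0;                 /* process global - visible to every thread */

void *
worker(void *arg)
{
        int local = 42;         /* automatic variable - lives on this thread's stack */

        shared++;               /* both threads update the same memory (no locking here) */
        printf("local=%d shared=%d\n", local, shared);
        return (NULL);
}

int
main(void)
{
        pthread_t t1, t2;

        (void) pthread_create(&t1, NULL, worker, NULL);
        (void) pthread_create(&t2, NULL, worker, NULL);
        (void) pthread_join(t1, NULL);
        (void) pthread_join(t2, NULL);
        return (0);
}

Build it with something like 'cc -mt shared.c -o shared -lpthread', and both threads will see (and update) the same shared variable, while each gets its own copy of local.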

Processes, threads, and pretty much every other memory object that resides in the Solaris kernel are defined in header files, most of which can be found in the common/sys directory in the source tree, so this is the place to look for the structure definitions - reference proc.h and thread.h.
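By the way, once you know the structure names, mdb(1) makes a nice companion to reading the headers, letting you inspect live instances of these structures. As a quick sketch (the PID here is made up, and I've omitted the output), something like this walks from a process ID to its proc_t and pretty-prints a few fields:

solaris10> mdb -k
> 0t1234::pid2proc | ::print proc_t p_as p_tlist p_cred
> $q

::pid2proc converts the decimal PID to a proc_t pointer, and ::print formats the structure members for you using the kernel's type information.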

Now let's say we're interested in understanding the code path through the kernel for process creation. This becomes a no-brainer with DTrace. All processes are created with the fork(2) system call, so all we need is a D script to trace the kernel when the fork(2) system call is entered. Additionally, we could write a simple C program to call fork(2) if we don't have a load on our system generating fork(2) calls. We need a D script that uses the syscall provider to enable a probe for fork(2) entry and return. I can verify the probe name right on the command line:

solaris10> dtrace -n syscall::fork:entry
dtrace: invalid probe specifier syscall::fork:entry: probe description syscall::fork:entry does not match any probes

Egads! What madness is this? No fork(2) probe in the syscall provider? Let's see...

solaris10> dtrace -l -P syscall | grep fork
    9    syscall                                             forkall entry
   10    syscall                                             forkall return
  201    syscall                                               vfork entry
  202    syscall                                               vfork return
  243    syscall                                               fork1 entry
  244    syscall                                               fork1 return

Ah, OK. We have probes for forkall(2), vfork(2) and fork1(2), but not fork(2). A quick look at the man page tells us most of what we need to know. It's an implementation detail...there are actually three flavors of fork(2) in Solaris 10: fork(2), forkall(2) and fork1(2) (well, there's also vfork(2), but that's a deprecated interface, since we now have the posix_spawn(3C) interface). Prior to Solaris 10, fork(2) would replicate all the threads when called from a multithreaded process, and fork1(2) was created for applications that wished to replicate only the calling thread in the child process (unless the code was linked with libpthread, in which case fork(2) provided fork1(2) behaviour). In Solaris 10, fork(2) provides fork1(2) behaviour, replicating only the calling thread in the child process. A forkall(2) system call exists for when the replicate-all-threads behaviour is desired. So what happens when we have a fork(2) call in our code? Since fork(2) and fork1(2) do the same thing in Solaris 10, there's no need for multiple source files. In the libc source tree, you'll find fork(2) set to the same entry point as fork1(2) with a #pragma compiler directive in scalls.c:

116 /*
117  * fork() is fork1() for both Posix threads and Solaris threads.
118  * The forkall() interface exists for applications that require
119  * the semantics of replicating all threads.
120  */
121 #pragma weak fork = _fork1
122 #pragma weak _fork = _fork1
123 #pragma weak fork1 = _fork1
124 pid_t
125 _fork1(void)
126 {

This was done so the bazillions of applications out there that call fork(2) will just work on Solaris 10. Now where were we...oh yeah, why use the fork1(2) probe in the D script we're about to write, when it's fork(2) we're interested in? Because fork(2) is aliased to fork1(2) in the libc source, such that calls to fork(2) and fork1(2) enter the same function (_fork1(void), line 125 above). Thus, the DTrace syscall provider cannot locate a probe for a fork(2) entry point because one does not exist - which is just fine, because now we know why.
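As an aside, it's easy to convince yourself of the fork1(2) semantics with a little test program of your own. This sketch is mine (not from the book): the parent starts a spinner thread and then calls fork(2). On Solaris 10 the spinner never shows up in the child, because only the calling thread was replicated; swap the fork() for forkall() and you'll see the spinner printing with the child's PID as well.

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

void *
spinner(void *arg)
{
        for (;;) {
                printf("spinner in pid %d\n", (int)getpid());
                sleep(1);
        }
        /* NOTREACHED */
        return (NULL);
}

int
main(void)
{
        pthread_t t;

        (void) pthread_create(&t, NULL, spinner, NULL);
        sleep(2);                       /* let the spinner get going */

        if (fork() == 0) {              /* fork1() behaviour: only this thread in the child */
                sleep(3);               /* no "spinner in pid <child>" lines show up */
                printf("child %d exiting\n", (int)getpid());
                _exit(0);
        }
        sleep(5);                       /* parent: the spinner keeps printing here */
        return (0);
}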

So let's get back to having a look at the code path for a process creation in Solaris 10. Here's a D script that'll do the trick:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::fork1:entry
{
        self->trace = 1;
}

fbt:::
/ self->trace /
{
}

syscall::fork1:return
{
        self->trace = 0;
        exit(0);
}
The D script above sets the flowindent option, which makes a large function-call-return trace much easier to read. When a fork(2) or fork1(2) system call is executed, the entry probe fires. The action in that probe clause sets a thread-local variable (self->trace) to 1. We use a probe name of fbt::: in the script, which enables every Function Boundary Tracing probe that exists (basically, most every kernel function). Note we have a predicate that tests self->trace, ensuring that we only take the action (printing the function name - the default action, since nothing is specified in the probe clause) when we're in our fork(2) code path. Finally, when fork1(2) returns, we clear the trace flag and exit, so we only get one set of function call flow, which is all we're interested in. Running the D script, along with a simple C program that does a fork(2) call, we get:

solaris10> ./pcreate.d -c ./fk
dtrace: script './pcreate.d' matched 32360 probes
Child PID: 5284
Parent PID: 5283
dtrace: pid 5283 has exited
CPU FUNCTION                                 
 12  -> fork1                                 
 12  <- fork1                                 
 12  -> cfork                                 
 12    -> secpolicy_basic_fork                
 12    <- secpolicy_basic_fork                
 12    -> priv_policy                         
 12    <- priv_policy                         
 12    -> holdlwps                            
 12      -> schedctl_finish_sigblock          
 12      <- schedctl_finish_sigblock          
 12      -> pokelwps                          
 12      <- pokelwps                          
 12    <- holdlwps                            
 12    -> flush_user_windows_to_stack         
 12    <- flush_user_windows_to_stack         
 12    -> pool_barrier_enter                  
 12    <- pool_barrier_enter                  
 12    -> getproc                             
 12        <- setbackdq                       
 12        -> generic_enq_thread              
 12        <- generic_enq_thread              
 12        -> disp_lock_exit                  
 12        <- disp_lock_exit                  
 12      <- continuelwps                      
 12      -> continuelwps                      
 12        -> thread_lock                     
 12        <- thread_lock                     
 12        -> disp_lock_exit                  
 12        <- disp_lock_exit                  
 12      <- continuelwps                      
 12      -> thread_lock                       
 12      <- thread_lock                       
 12      -> thread_transition                 
 12      <- thread_transition                 
 12      -> disp_lock_exit_high               
 12      <- disp_lock_exit_high               
 12      -> ts_setrun                         
 12      <- ts_setrun                         
 12      -> setbackdq                         
 12        -> cpu_update_pct                  
 12          -> cpu_grow                      
 12            -> cpu_decay                   
 12              -> exp_x                     
 12              <- exp_x                     
 12            <- cpu_decay                   
 12          <- cpu_grow                      
 12        <- cpu_update_pct                  
 13  <= fork1                                 

I cut most of the lines from the output for this post. The idea here is to illustrate how easy it is to plot the code path through the kernel for a fork call (process creation). You can see what a great option flowindent is - it does a beautiful job of presenting a long function call flow, showing entry points (->) and returns (<-). Note also in the command line output that I used the '-c' flag on the dtrace command line. This instructs dtrace to run the specified command and exit when it completes. In this case, the command was the fk executable, a simple piece of code that issues a fork(2) call.
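The fk source isn't shown in this post, but a sketch like the following (my reconstruction, based on the output above) is all it takes to produce the Child PID/Parent PID lines:

#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid;

        if ((pid = fork()) == 0) {
                _exit(0);               /* child does nothing and exits */
        }
        printf("Child PID: %d\n", (int)pid);
        printf("Parent PID: %d\n", (int)getpid());
        (void) wait(NULL);              /* reap the child */
        return (0);
}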

Another quick example - this time, a thread create. Here's the D:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

pid$target::pthread_create:entry
{
        self->trace = 1;
}

fbt:::
/ self->trace /
{
}

pid$target::pthread_create:return
{
        self->trace = 0;
        exit(0);
}


The logic of this D script is the same as in the previous example. What's different here is that we're using the PID provider to enable probes in a user process. This is because the entry point we're interested in is not a system call, it's a library routine (pthread_create(3C)). As such, it gets mapped into the user process's address space, and it's the PID provider that opens that door for us. We're also using the $target DTrace macro variable. This works in conjunction with the '-c' command line option: $target is set to the PID of the command started on the command line, which we can then use as part of the PID provider component of the probe name to probe the process we're interested in. Here's the run:

solaris10> ./tcreate.d -c ./tds    
dtrace: script './tcreate.d' matched 32360 probes
Created 8 writers, 16 readers
CPU FUNCTION                                 
  0  -> pre_syscall                           
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- pre_syscall                           
  0  -> sysconfig                             
  0  <- sysconfig                             
  0  -> post_syscall                          
  0    -> clear_stale_fd                      
  0    <- clear_stale_fd                      
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- post_syscall                          
  0  -> pre_syscall                           
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- pre_syscall                           
  0  -> smmap32                               
  0    -> smmap_common                        
  0      -> as_rangelock                      
  0      <- as_rangelock                      
  0      -> zmap                              
  0        -> map_addr                        
  0          -> map_addr_proc                 
  0            -> rctl_enforced_value         
  0              -> rctl_set_find             
  0    -> lwp_continue                        
  0      -> thread_lock                       
  0      <- thread_lock                       
  0      -> setrun_locked                     
  0        -> thread_transition               
  0        <- thread_transition               
  0        -> disp_lock_exit_high             
  0        <- disp_lock_exit_high             
  0      <- setrun_locked                     
  0      -> ts_setrun                         
  0      <- ts_setrun                         
  0      -> setbackdq                         
  0        -> cpu_update_pct                  
  0          -> cpu_decay                     
  0            -> exp_x                       
  0            <- exp_x                       
  0          <- cpu_decay                     
  0        <- cpu_update_pct                  
  0        -> cpu_choose                      
  0        <- cpu_choose                      
  0        -> disp_lock_enter_high            
  0        <- disp_lock_enter_high            
  0        -> cpu_resched                     
  0        <- cpu_resched                     
  0      <- setbackdq                         
  0      -> generic_enq_thread                
  0      <- generic_enq_thread                
  0      -> disp_lock_exit                    
  0      <- disp_lock_exit                    
  0    <- lwp_continue                        
  0    -> cv_broadcast                        
  0    <- cv_broadcast                        
  0  <- syslwp_continue                       
  0  -> post_syscall                          
  0    -> clear_stale_fd                      
  0    <- clear_stale_fd                      
  0    -> syscall_mstate                      
  0    <- syscall_mstate                      
  0  <- post_syscall                          
  0  <- pthread_create

Once again, most of the output was cut for space. This time, we executed a binary called 'tds', which of course is a program that creates threads.
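The tds source isn't included here either, but judging from its "Created 8 writers, 16 readers" message, a sketch along these lines (the empty worker bodies are my placeholders - creating the threads is all we need to fire the probes) would drive the same code path:

#include <stdio.h>
#include <pthread.h>

#define NWRITERS        8
#define NREADERS        16

/* The real tds presumably does reader/writer work; empty bodies suffice for tracing. */
void *
writer(void *arg)
{
        return (NULL);
}

void *
reader(void *arg)
{
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NWRITERS + NREADERS];
        int i;

        for (i = 0; i < NWRITERS; i++)
                (void) pthread_create(&tid[i], NULL, writer, NULL);
        for (i = 0; i < NREADERS; i++)
                (void) pthread_create(&tid[NWRITERS + i], NULL, reader, NULL);

        printf("Created %d writers, %d readers\n", NWRITERS, NREADERS);

        for (i = 0; i < NWRITERS + NREADERS; i++)
                (void) pthread_join(tid[i], NULL);
        return (0);
}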

That's it for now. With some simple D scripting, and sample programs that generate an event of interest, you can trace the code path through the kernel, then use that data to zero in on points of interest in the source code.



