Observing the Solaris Kernel

This past weekend I had the pleasure of presenting at the SIGCSE conference in Houston. For the unacquainted the audience is computer science educators, so my focus was on using OpenSolaris as a vehicle for teaching operating system internals.

For the conference I prepared some slides and a demo that attempt to peel off the first layer of the observability onion into the Solaris kernel. Aside from DTrace (which by now the whole world has heard about) I also make extensive use of the excellent mdb kernel debugging facilities. The demos in particular are worth sharing so I am posting them here for all to see and use as you see fit.

First, the slides (also in StarOffice; if you steal them for your own use all I ask is that you provide a pointer back to this blog entry). The main theme of my presentation is that OpenSolaris brings to the table free access to the source code, while the powerful observability tools allow quick insight into the dynamics of the kernel without mastering the source.

The first demo starts from a simple C program, follows the open() syscall flow down to the process file table with DTrace, and then drills all the way down (using the kernel debugger) into the inode at the file system level. This is accomplished by leveraging access to the source code, and for the demo I deliberately chose a part of the system to look at I had never been in before. This demo took about twenty minutes to create and about twenty minutes to walk through.

The second demo shows how mdb can be used to examine the anatomy of a stopped process. I ran out of time before getting to the second demo; I'll post the second demo in another blog entry in a few days, so keep an eye out for it.


First Demo

Let's begin with the following program:
/\*
 \* A simple program which creates a file, writes one byte to it,
 \* closes it, and exits.
 \*/
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int
main(void)
{
        int fd;
        const char buf[] = "\\0";

        if ((fd = open("simple1.out", O_CREAT|O_WRONLY, 0600)) < 0) {
                perror("Unable to create file");
                exit (1);
        }

        if (write(fd, buf, 1) < 0) {
                perror("Unable to write file");
                exit (2);
        }

        return (0);
}
Let's take a closer look at the open(2) system call. Running truss(1) shows that the system call we're interested in is actually the third, since there are two implicit open()s performed automatically (one by the runtime linker looking for a config file, and the other to map in the C library).
% truss -t open ./simple1
open("/var/ld/ld.config", O_RDONLY)             Err#2 ENOENT
open("/lib/libc.so.1", O_RDONLY)                = 3
open("simple1.out", O_WRONLY|O_CREAT, 0600)     = 3
Armed with this much knowledge I use a really simple DTrace script to see what functions open() calls.
#!/usr/sbin/dtrace -Fs
open:entry
/ execname == "simple1" /
{
        self->tracing++;
}
:return
/ self->tracing > 0 /
{
        trace(0);
}
open:return
/ self->tracing > 0 /
{
        self->tracing--;
}
:entry
/ self->tracing > 0 /
{
        trace(0);
}
Running this script yields quite a bit of output, of which the first few calls are interesting because we can see that the system call is setting up the new file descriptor which is eventually returned.
% CPU FUNCTION
  0  => open                                          0
  0    -> open32                                      0
  0      -> copen                                     0
  0        -> falloc                                  0
  0          -> ufalloc                               0
  0            -> ufalloc_file                        0
  0              -> fd_find                           0
...
Going over to opensolaris.org and pulling up the source for fd_find() we can see that the file descriptor is an index into an array, fi_list[], which is embedded in the process u structure. Armed with that new info, I setup a breakpoint into the debugger from the end of open using DTrace.
#!/usr/sbin/dtrace -wFs
open:return
/ execname == "simple1" /
{ breakpoint(); }
Starting the script as root and then running the simple1 program as my user ID in another window results in the following session (recall we're looking for the third open call, so we ignore the first two times we hit the breakpoint):
[console] root@rutamaya > ./simple1_files.d
dtrace: script './simple1_files.d' matched 4 probes
dtrace: allowing destructive actions
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> :c
dtrace: breakpoint action at probe syscall::open:return (ecb fffffe8501e671e0)
kmdb: target stopped at:
kaif_enter+8:   popfq
[0]> <gsbase::print cpu_t cpu_thread->t_procp->p_user.u_finfo.fi_list
cpu_thread->t_procp->p_user.u_finfo.fi_list = 0xfffffea7aa34b800
[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xfffffe813aa86c80
uf_file->f_vnode = 0xffffffffb282ce00

%gsbase holds a pointer to the current CPU structure; from there I can get the running thread, access the process pointer, follow the u structure to the u_finfo which contains the array of open files (again indexed by file descriptor, as we discovered from the source). Once I have the pointer of this array I can follow the struct file to get the vnode for each open file.

Going back to the source and looking at the struct vnode I can see that the v_type and v_path fields are interesting so let's take a look at them..

[0]> 0xfffffea7aa34b800,6::print uf_entry_t uf_file|::grep >0|::print "struct file" f_vnode->v_type f_vnode->v_path
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 4 (VCHR)
f_vnode->v_path = 0xfffffea79ece81b8 "/devices/pseudo/pts@0:1"
f_vnode->v_type = 1 (VREG)
f_vnode->v_path = 0xfffffe80ea212e40 "/home/elowe/proj/SIGCSE/simple1.out"

This prints each element of the array of uf_entry_t structures in the u structure if the pointer is not NULL, and pipes the dot (which is a struct file to the ::print dcmd to examine the fields we're interested in. The first three file descriptors are stdin/stdout/stderr and all point back to my virtual terminal; the fourth is the file being written to by the simple1 program.

Since I'm using NFS (and I can see from my earlier DTrace output that nfs3_open() is in the call path) I decided it might be interesting to poke at the underlying inode. The source shows that nfs3_open() gets an rnode_t by casting the vnode's v_data directly into the file system's internal inode representation. A little bit ago we had the vnode pointer in our mitts:

[0]> 0xfffffea7aa34b800,4::print uf_entry_t uf_file->f_vnode
...
uf_file->f_vnode = 0xffffffffb282ce00

So now we can keep digging:

[0]> 0xffffffffb282ce00::print vnode_t
{
..
    v_data = 0xfffffe83625e9d10
    v_vfsp = 0xffffffff885eaf00
..
    v_op = 0xffffffff86d24c40
..
[0]> 0xffffffff86d24c40::print struct vnodeops
{
    vnop_name = 0xffffffffc01c5a20 "nfs3"
    vop_open = nfs`nfs3_open
..
[0]> 0xfffffe83625e9d10::print rnode_t
{
..
    r_size = 0x1
    r_attr = {
        va_mask = 0xbfff
        va_type = 1 (VREG)
        va_mode = 0x180
        va_uid = 0x1ebfa
        va_gid = 0xa
..
[0]> 180=O
                0600
[0]> 0x1ebfa=D
                125946
[0]> a=D
                10
[0]> :c
We can see that the file mask is 0600, the UID is 0x1ebfa hex (which is 125946 in decimal), and the group ID is 10 decimal.
% ls -l simple1.out
-rw-------   1 elowe    staff          1 Feb 27 11:34 simple1.out

Technorati Tags: [ ]

Comments:

Post a Comment:
Comments are closed for this entry.
About

elowe

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today