Wednesday Feb 11, 2009

A basic demonstration of the DTrace feature of OpenSolaris

This is a demonstration; for full coverage of DTrace, please refer to the Solaris DTrace Guide at: http://docs.sun.com/app/docs/doc/817-6223

D is a C-like, event-oriented scripting language that uses operating system instrumentation points as its events.
These instrumentation points are called probes.
You can list all the probes on your system using the '-l' option.

# pfexec dtrace -l
ID      PROVIDER    MODULE                FUNCTION        NAME
1       dtrace                                            BEGIN
2       dtrace                                            END
3       dtrace                                            ERROR
4       python723   libpython2.4.so.1.0   PyEval_EvalFrame function-entry
5       python723   libpython2.4.so.1.0   dtrace_entry function-entry
[...]
71348   fbt         semsys                _info           return
71349   fbt         semsys                semexit         entry
71350   fbt         semsys                semexit         return
71351   syscall                           semsys          entry
71352   syscall                           semsys          return

You can see that I have 71,352 places in my system where I can gather statistics right now.

Every probe in the system is identified by a 4-level hierarchy:

provider : subsystem : function : probeName

More information on that can be found in chapter 1 of the DTrace Guide: http://docs.sun.com/app/docs/doc/817-6223/chp-intro?a=view

What is important to understand now is that 'provider' is normally the name of the kernel module that performs the instrumentation, 'subsystem' is a kernel subsystem or a user library, 'function' is the function we want to inspect, and 'probeName' is the point of that function where we want to gather statistics. If I omit any member of this hierarchy, DTrace interprets it as '*' (match anything).
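For example, you can match on a single level of the hierarchy and let the rest default to '*'. These listing flags are documented in dtrace(1M); a sketch:

# list every probe published by the syscall provider
pfexec dtrace -l -P syscall

# list every probe named 'entry', in any provider, module or function
pfexec dtrace -l -n ':::entry'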

Provider 

The name of the DTrace provider that is publishing this probe. The provider name typically corresponds to the name of the DTrace kernel module that performs the instrumentation to enable the probe. 

Subsystem

If this probe corresponds to a specific program location, the name of the module in which the probe is located. This name is either the name of a kernel module or the name of a user library. 

Function 

If this probe corresponds to a specific program location, the name of the program function in which the probe is located. 

Name 

The final component of the probe name is a name that gives you some idea of the probe's semantic meaning, such as BEGIN or END.

So, let's say I want to trace the entry point of every system call in the system:

syscall:::entry

The '-n' option runs a D program given on the command line, so this command enables all syscall entry probes:

By default, the output prints the id of the CPU, the id of the probe, the function name and the probe name:

# pfexec dtrace -n 'syscall:::entry{ }'
 CPU     ID               FUNCTION:NAME
  0      70861               ioctl:entry
  0      70861               ioctl:entry
  0      70999           sysconfig:entry
  0      70999           sysconfig:entry
  1      70767                read:entry
  1      70861               ioctl:entry
  1      70861               ioctl:entry
  [...]

This may look impressive, but only for a second. Very soon you will want some way to aggregate this huge amount of information.

To aggregate it by syscall name and count how many times it was called:

# pfexec dtrace -n 'syscall:::entry{ @[probefunc]=count(); }'
^C
  fstat                1
  getpid               1
  mmap                 1
  schedctl             1
  sigpending           1
  stat64               1
  sysconfig            3
  getgid               4
    [...]
  setcontext           24
  writev               39
  setitimer            47
  p_online             64
  clock_gettime        150
  read                 261
  write                301
  pollsys              335
  lwp_park             379
  ioctl                590

For more information on aggregations, refer to chapter 9 of the DTrace Guide: http://docs.sun.com/app/docs/doc/817-6223/chp-aggs?a=view
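count() is only one of the aggregating functions; quantize(), for instance, builds a power-of-two histogram. As a sketch (assuming the same syscall provider), this one-liner distributes the return values of read(), i.e. how many bytes each read returned:

# pfexec dtrace -n 'syscall::read:return{ @["read sizes"] = quantize(arg0); }'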

You can aggregate by execname too:


# pfexec dtrace -n 'syscall:::entry{ @[execname,probefunc]=count();}'
^C
 dtrace               fstat                  1
  dtrace              lwp_park               1
  dtrace              lwp_sigmask            1
 [...]
  firefox-bin         read                   84
  Xorg                clock_gettime          98
  firefox-bin         ioctl                  132
  firefox-bin         pollsys                151
  rhythmbox           ioctl                  165
  rhythmbox           write                  169
  rhythmbox           lwp_park               230

One could say that, so far, the only thing we have discovered is that I'm listening to music right now, because rhythmbox is making syscalls like crazy.

So, let's say I want to really understand what is going on when some process makes a particular system call, say, ioctl. Wouldn't it be perfect if we could just ask the kernel to report every function it enters?

Let's look at this script:

#!/usr/sbin/dtrace -s

#pragma D option flowindent

syscall::ioctl:entry
{
        self->follow = 1;
}

fbt:::
/self->follow/
{}

syscall::ioctl:return
/self->follow/
{
        self->follow = 0;
        exit(0);
}

Each probe description is an event.

Don't bother yourself with the first two lines. The first one is the interpreter line, and the second is a pragma that indents the output to follow the flow of function calls.

The first probe fires when a thread enters the system call ioctl. The second fires every time the kernel enters or leaves a function, but only for threads that have the special thread-local variable follow set. fbt is the function boundary tracing provider, which instruments nearly every function in the kernel. What is between the slashes is a filter, called a predicate. The last probe fires every time the syscall ioctl returns in a thread with the variable set.

Things are getting clearer now.

The flow goes like this:

The first thread that calls ioctl gets a thread-local variable named follow and prints the standard output (the probe details). Each function that this thread then calls fires a probe, and the next, and the next, each printing its details, until we reach the return of the syscall. That probe prints itself, unsets the thread-local variable and exits the DTrace program.

Without more delays, let's see the output:

# pfexec dtrace -s flw.d
dtrace: script 'flw.d' matched 60932 probes
CPU FUNCTION
 0  -> ioctl
 0    -> getf
 0      -> set_active_fd
 0      <- set_active_fd
 0    <- getf
 0    -> fop_ioctl
 0      -> crgetmapped
 0      <- crgetmapped
 0      -> spec_ioctl
 0        -> cdev_ioctl
 0          -> ddi_get_soft_state
 0          <- ddi_get_soft_state
 0          -> kmem_zalloc
 0            -> kmem_cache_alloc
 0              -> kmem_depot_alloc
 0              <- kmem_depot_alloc
 0              -> kmem_slab_alloc
 0              <- kmem_slab_alloc
 0            <- kmem_cache_alloc
 0          <- kmem_zalloc
 0          -> kmem_free
 0            -> kmem_cache_free
 0            <- kmem_cache_free
 0          <- kmem_free
 0        <- cdev_ioctl
 0      <- spec_ioctl
 0    <- fop_ioctl
 0    -> releasef
 0      -> clear_active_fd
 0      <- clear_active_fd
 0      -> cv_broadcast
 0      <- cv_broadcast
 0    <- releasef
 0  <- ioctl
 0  <= ioctl


Now you can search at http://src.opensolaris.org/source/ for each function and really understand what is happening when ioctl is called.

As we might imagine, ioctl deals with I/O control: there is some file descriptor handling, in the middle we can see some memory allocation, and of course all of that is protected by mutual exclusion locks. All of this can be seen in the output.
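The same thread-local variable pattern can be used to measure latency instead of tracing flow. A minimal sketch (my own variant, not from the guide): record a timestamp at entry and aggregate the elapsed nanoseconds at return:

#!/usr/sbin/dtrace -s

syscall::ioctl:entry
{
        self->ts = timestamp;
}

syscall::ioctl:return
/self->ts/
{
        @["ioctl latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}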

If you run this script a few more times, it won't be long until you get a result like this:

# pfexec dtrace -s flw.d
dtrace: script 'flw.d' matched 60932 probes
CPU FUNCTION
 0  -> do_interrupt
 0    -> tlb_service
 0    <- tlb_service
 0    -> uppc_intr_enter
 0      -> uppc_setspl
 0      <- uppc_setspl
 0    <- uppc_intr_enter
 0    -> intr_thread_prolog
 0    <- intr_thread_epilog
 0  <- dispatch_hardint
 0  <- switch_sp_and_call
 0  <- do_interrupt
 0  -> sys_rtt_common
 0  <- sys_rtt_common
 0  -> do_interrupt
 0    -> tlb_service
 0    <- tlb_service
 0    -> uppc_intr_enter
 0      -> uppc_setspl
 0      <- uppc_setspl
 0    <- uppc_intr_enter
 0    -> intr_thread_prolog
 0    <- intr_thread_epilog
 0  <- dispatch_hardint
 0  <- switch_sp_and_call
 0  <- do_interrupt
 0  -> sys_rtt_common
 0  <- sys_rtt_common
 0  -> do_interrupt
 0    -> tlb_service
 0    <- tlb_service
 0    -> uppc_intr_enter
 0      -> uppc_setspl
 0      <- uppc_setspl
 0    <- uppc_intr_enter
 0    -> intr_thread_prolog
 0    <- intr_thread_epilog
 0  <- dispatch_hardint
 0  <- switch_sp_and_call
 0  <- do_interrupt
 0  -> sys_rtt_common
 0  <- sys_rtt_common
 0  -> do_interrupt
 0    -> tlb_service
 0    <- tlb_service
 0    -> uppc_intr_enter
 0      -> uppc_setspl
 0      <- uppc_setspl
 0    <- uppc_intr_enter
 0    -> intr_thread_prolog
 0    <- intr_thread_epilog
 0  <- dispatch_hardint
 0  <- switch_sp_and_call
 0  <- do_interrupt
 0  -> sys_rtt_common
 0  <- sys_rtt_common
 0  -> ioctl
 0    -> getf
 0      -> set_active_fd
 0      <- set_active_fd
 0    <- getf
 0    -> fop_ioctl
 0      -> crgetmapped
 0      <- crgetmapped
 0      -> nm_ioctl
 0        -> fop_ioctl
 0          -> crgetmapped
 0          <- crgetmapped
 0          -> fifo_ioctl
 0            -> fifo_strioctl
 0              -> strioctl
 0                -> job_control_type
 0                <- job_control_type
 0                -> strcopyout
 0                <- strcopyout
 0              <- strioctl
 0            <- fifo_strioctl
 0          <- fifo_ioctl
 0        <- fop_ioctl
 0      <- nm_ioctl
 0    <- fop_ioctl
 0    -> releasef
 0      -> clear_active_fd
 0      <- clear_active_fd
 0      -> cv_broadcast
 0      <- cv_broadcast
 0    <- releasef
 0  <- ioctl
 0  <= ioctl

Probably, while ioctl was being called, the CPU had to service interrupts and briefly take our thread off the CPU. We can see calls from the interrupt handling sequence. The TLB is the hardware table that caches virtual memory mappings, and we can see some calls related to stack pointer switching that clearly evidence context changes.

As we can see, DTrace is a great way to find bugs, profile, and do performance analysis and evaluation. It is also a great way to learn and understand operating system behavior.

So, where can we go from here?

Community at Opensolaris.org:
opensolaris.org/os/community/dtrace/

Adam Leventhal:
http://blogs.sun.com/ahl

BigAdmin resources:
www.sun.com/bigadmin/content/dtrace/

Sun's How To:
www.sun.com/software/solaris/howtoguides/dtracehowto.jsp

Solaris Internals:
http://www.solarisinternals.com

Please post a comment or send me an email if you have any doubts or suggestions, or if you find something here that isn't quite right. We're a community, and your help is very welcome and encouraged. =)
About

Hi, my name is Marcelo Arbore, I study electrical engineering at University of Sao Paulo, Brazil, and I work for Sun Microsystems. I also like to surf, I love the sea and I will live in California some day. marcelo.arbore@sun.com
