X

News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

Getting started with SystemTap on Oracle Linux

Guest Author
Oracle Linux kernel engineer Dai Ngo provides an introduction on how to analyse a live Oracle Linux system using SystemTap.

 

Introduction

There are a wealth of tools available for tracing and debugging the Linux kernel on a live system. These include Kprobes, Ftrace,  trace-cmd, Dtrace, eBPF, SystemTap, crash, gdb, etc. Among these tools, few allow the user to develop and re-use scripts that can filter events and collect data more than just function arguments and returned values. Dtrace, eBPF and SystemTap are the ones among these tools that do.

Why SystemTap?

Dtrace is probably the most light-weight tool to run with regard to installation and the requirement of support packages. However, Dtrace's event filtering is based on conditional expressions which are not very flexible. You may not construct a conditional expression where, for example, one path returns a string and another path returns an integer. Dtrace's scripting language does not support the conditional if-then-else construct, functions, or iteration statements such as while, for, or foreach. Also Dtrace is available only for UEK-4 and UEK-5. Dtrace support for UEK-6 is still in-progress.

eBPF is probably the most powerful and feature rich and there are plenty of documents, guides and blogs about eBPF on the web. However, eBPF is available only for UEK-5 or newer kernels.

This is why SystemTap can be considered the preferred tool because it runs on UEK-4, UEK-5, UEK-6 and upstream kernels. The same script can be run on all of these kernels with minimal modification to match the code in each version.

This blog describes the installation procedure for SystemTap on UEK systems and provides an introduction on how to use SystemTap's scripts for tracing events or debugging problems in the Linux kernel on a live system. The information presented here is just a fraction of what SystemTap can do. The goal for this blog is to introduce you to SystemTap's common features that are used daily to debug problems in UEK kernels and also to provide enough information to facilitate exploring SystemTap's capability further.

Installation and Setup of SystemTap

To deploy SystemTap, install the SystemTap RPM along with the following corresponding packages for your kernel:

  • kernel-uek-devel-`uname -r`.rpm <required>

  • kernel-uek-debuginfo-common-`uname -r`.rpm <optional>

  • kernel-uek-debuginfo-`uname -r`.rpm <optional>

SystemTap can run in a limited mode with just the kernel-uek-devel-`uname -r`.rpm package installed. This is referred to as DWARF-less mode which has limited functionality, see DWARF-less Mode section for more details on this mode. DWARF is the acronym for 'Debugging With Attribute Record Format'. With the optional debuginfo packages, SystemTap runs in DWARF mode which allows the scripts to make use of source debugging functionality.

  1. Install SystemTap RPM
    # yum install systemtap

  2. Install kernel-uek-devel-`uname -r`
    # yum install kernel-uek-devel-`uname -r`

  3. Download kernel-uek-debuginfo-common-`uname-r`.rpm and kernel-uek-debuginfo-`uname -r`.rpm from Oracle's public repos:

    1. https://oss.oracle.com/ol8/debuginfo

    2. https://oss.oracle.com/ol7/debuginfo

    3. https://oss.oracle.com/ol6/debuginfo
      # yum install  ./kernel-uek-debuginfo`uname-r`*.rpm

Verify SystemTap was installed successfully by running this simple script:

# stap -v -e 'probe kprobe.{function("vfs_read")} {printf("read performed\n"); exit()}'
Pass 1: parsed user script and 473 library scripts using 275004virt/74412res/7088shr/67632data kb, in 2620usr/90sys/2724real ms.
Pass 2: analyzed script: 1 probe, 1 function, 0 embeds, 0 globals using 276588virt/75852res/7224shr/69216data kb, in 20usr/0sys/28real ms.
Pass 3: translated to C into "/tmp/stap6PLVAR/stap_fc98b194a9f631b25c5d683d39ab9054_1127_src.c" using 276588virt/77080res/8248shr/69216data kb, in 0usr/0sys/0real ms.
Pass 4: compiled C into "stap_fc98b194a9f631b25c5d683d39ab9054_1127.ko" in 4370usr/1140sys/5544real ms.
Pass 5: starting run.
read performed
Pass 5: run completed in 30usr/80sys/376real ms.
[root@nfsvme14 ssc]#

Update SystemTap for UEK-6

Note: UEK-6 requires SystemTap version 4.1 or newer due to a bug in 4.0.

For UEK-6, perform the installation as above then update SystemTap to version 4.1 or newer. You can download the newer version of SystemTap from https://yum.oracle.com/repo/OracleLinux/OL7/SoftwareCollections/x86_64/ or build the latest version from the git repo yourself as below:

  1. Install required packages for building SystemTap: gcc, gcc-c++, elfutils and elfutils-devel
    # yum install gcc gcc-c++ elfutils  elfutils-devel

  2. Build latest version of SystemTap
    # git clone git://sourceware.org/git/systemtap.git
    # cd systemtap
    # ./configure
    # make
    # make install

Running SystemTap

Here is an example script for tracing NFS events using SystemTap:

#!/usr/bin/env stap
/* simple script shows call graph of all NFS client functions */

probe begin {
    printf("Starting NFS script...\n")
}
probe {module("nfs")}.function("*@fs/nfs/*.c").call ? {
    printf("%s -> %s\n", thread_indent(1), ppfunc())
}

probe {module("nfs")}.function("*@fs/nfs/*.c").return ? {
    printf("%s <- %s\n", thread_indent(-1), ppfunc())
}

probe end {
    printf("\nNFS script stopped\n")
}

Execute the script as follows:

# stap -v -d sunrpc  -d nfs -d nfsv3 -d nfsv4 -d kernel nfs_clnt_funcs.stp
  1135 ls(31012):     -> nfs_refresh_inode
  1138 ls(31012):      -> nfs_refresh_inode
  1141 ls(31012):       -> nfs_refresh_inode_locked
  1145 ls(31012):        -> nfs_update_inode
  1148 ls(31012):         -> nfs_file_has_buffered_writers
  1152 ls(31012):         <- nfs_file_has_buffered_writers
  1155 ls(31012):         -> nfs_set_cache_invalid
  1159 ls(31012):         <- nfs_set_cache_invalid
  1164 ls(31012):        <- nfs_update_inode
  1167 ls(31012):       <- nfs_refresh_inode_locked
  1171 ls(31012):      <- nfs_refresh_inode
  1174 ls(31012):     <- nfs_refresh_inode

For UEK-6 the STP_NO_BUILDID_CHECK command line option must be specified to get around the buildid checking error:

# stap -v -d sunrpc -d nfs -d nfsv3 -d nfsv4 -d kernel -DSTP_NO_BUILDID_CHECK nfs_clnt_funcs.stp
  • -v verbose output

  • -d modules Add symbol/unwind information for the given module into the kernel object module

See man page stap(1) for more information.

SystemTap script

The  SystemTap script language resembles awk and C.  There are two main outermost constructs: probes and functions. Within these, statements and expressions use C-like operator syntax and precedence. See section SCRIPT LANGUAGE in stap(1) man page for more details.

The stap program is the front-end to the SystemTap tool. It accepts probing instructions written in its scripting language, translates those instructions into C code, compiles this C code, and loads the resulting kernel module into a running Linux kernel to perform the requested system trace or probe functions.

DWARF-less Mode

If only kernel-uek-devel-`uname -r`.rpm is installed, SystemTap can still be used in DWARF-less mode that uses the Kprobe family of probes to examine the entry and exit points of kernel and module functions.  Without debug information, you cannot look up the arguments or local variables of a function using these probes. However you can still access the function's parameters such as  uint_arg(1)pointer_arg(2), ulong_arg(3), etc.  Even with this limited capability, it's still useful for basic tracing of events or just for learning the new code.

For example, when probing the function nfs_readdir declared as:

static int nfs_readdir(struct file file, structdir_context ctx)

You can obtain the values of parameters file and ctx as pointer_arg(1), and pointer_arg(2). The returned value of a function can be accessed in the return probe by calling returnval() or returnstr() to get a string version of that value.

Example of using DWARF-less probes and predicate to filter out unwanted events and what the output looks like:

#!/usr/bin/env stap
/* examples of DWARF-less probing */

probe begin{
    printf("Starting NFS script...\n")
}

probe kprobe.module("nfs").function("nfs_readdir")
{
    printf("\t\t%s: file[%p] ctx[%p]\n",
           ppfunc(), pointer_arg(1), pointer_arg(2))
    print_backtrace()
}

probe kprobe.module("nfs").function("nfs_readdir").return
{
    printf("\t\t%s: RET[%d]\n", ppfunc(), returnval())
    print_backtrace()
}

/* example of filtering */
probe kprobe.module("nfs").function("nfs_llseek_dir")
{
    /* skip this probe if not SEEK_CUR */
    if (uint_arg(3) != 1)
        next
    print_backtrace()
}

probe end{
    printf("NFS script stopped\n")
}

Example output from running this sample script:

# stap  -v -d sunrpc  -d nfs -d kernel rd.stap
Pass 1: parsed user script and 476 library scripts using 218064virt/86568res/4988shr/82120data kb, in 1440usr/80sys/1516real ms.
Pass 2: analyzed script: 6 probes, 6 functions, 3 embeds, 4 globals using 219516virt/88008res/5124shr/83572data kb, in 30usr/0sys/29real ms.
Pass 3: using cached /root/.systemtap/cache/7b/stap_7b6a993fec722134b0aada80a2b13027_18577.c
Pass 4: using cached /root/.systemtap/cache/7b/stap_7b6a993fec722134b0aada80a2b13027_18577.ko
Pass 5: starting run.
Starting NFS script...
        nfs_readdir: file[0xffff9bf39534c200] ctx[0xffffb5b3411a7ed0]
 0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
 0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
 0xffffffffb729ae28 : SyS_getdents+0x98/0x120 [kernel]
 0xffffffffb7003949 : do_syscall_64+0x79/0x1c0 [kernel]
 0xffffffffb7a001ad : entry_SYSCALL_64_after_hwframe+0x169/0x1ce [kernel]

        nfs_readdir: RET[0]
Returning from:  0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
Returning to  :  0xffffffffb729a5e8 : iterate_dir+0x98/0x190 [kernel]
 0xffffffffb729a5e8 : iterate_dir+0x98/0x190 [kernel]
 0xffffffffb729ae28 : SyS_getdents+0x98/0x120 [kernel]
 0xffffffffb7003949 : do_syscall_64+0x79/0x1c0 [kernel]
 0xffffffffb7a001ad : entry_SYSCALL_64_after_hwframe+0x169/0x1ce [kernel]

        nfs_readdir: file[0xffff9bf39534c200] ctx[0xffffb5b3411a7ed0]
 0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
 0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
 0xffffffffb729ae28 : SyS_getdents+0x98/0x120 [kernel]
 0xffffffffb7003949 : do_syscall_64+0x79/0x1c0 [kernel]
 0xffffffffb7a001ad : entry_SYSCALL_64_after_hwframe+0x169/0x1ce [kernel]

        nfs_readdir: RET[0]
Returning from:  0xffffffffc06c6fd0 : nfs_readdir+0x0/0x790 [nfs]
Returning to  :  0xffffffffb729a5e8 : iterate_dir+0x98/0x190 [kernel]
 0xffffffffb729a5e8 : iterate_dir+0x98/0x190 [kernel]
 0xffffffffb729ae28 : SyS_getdents+0x98/0x120 [kernel]
 0xffffffffb7003949 : do_syscall_64+0x79/0x1c0 [kernel]
 0xffffffffb7a001ad : entry_SYSCALL_64_after_hwframe+0x169/0x1ce [kernel]

^CNFS script stopped

Using SystemTap with corresponding debug packages (DWARF mode)

To fully utilize SystemTap's capability the corresponding debug packages, kernel-uek-debuginfo-common-`uname -r`.rpm and kernel-uek-debuginfo-`uname -r`.rpm must be installed.

SystemTap Basics

Here are stap commands to show available probes in the kernel, or in a specific a module, and to display the argument names and types defined for a function:

  • Displaying available probes

    # stap -l 'kernel.function("*")' |sort |more
    ...
    kernel.function("ac6_get_first@net/ipv6/anycast.c:429")
    kernel.function("ac6_get_idx@net/ipv6/anycast.c:475")
    kernel.function("ac6_get_next@net/ipv6/anycast.c:452")
    kernel.function("ac6_proc_exit@net/ipv6/anycast.c:551")
    ...
    # stap -l '{module("nfsd")}.function("*")' |more
    ...
    module("nfsd").function("__nfs4_file_get_access@fs/nfsd/nfs4state.c:558")
    module("nfsd").function("__nfs4_file_put_access@fs/nfsd/nfs4state.c:604")
    module("nfsd").function("__nfs4_get_fd@fs/nfsd/nfs4state.c:435")
    module("nfsd").function("__nfsd4_create_reclaim_record_grace@fs/nfsd/nfs4recover.c:169")
    module("nfsd").function("__nfsd4_find_backchannel@fs/nfsd/nfs4callback.c:1245")
    module("nfsd").function("__nfsd4_find_conn@fs/nfsd/nfs4state.c:3501")
    module("nfsd").function("__nfsd4_hash_conn@fs/nfsd/nfs4state.c:1755")
    module("nfsd").function("__nfsd4_init_cld_pipe@fs/nfsd/nfs4recover.c:980")
    ...
  • Displaying a function's signature

    # stap -L '{module("nfsd")}.function("__nfs4_file_get_access")'
    module("nfsd").function("__nfs4_file_get_access@fs/nfsd/nfs4state.c:558") $fp:struct nfs4_file* $access:u32

Accessing function arguments and returned value using SystemTap probes

Function arguments can be accessed directly using the defined name preceded by '$'.  If the argument is a pointer to a structure, its member can be accessed directly from the entry probe. Structure members can be casted to appropriate data structure's allowing the probe to extract data many levels deep from within the function's arguments.

  • Accessing function arguments from the entry probe

    probe module("nfsd").function("nfsd4_putfh")
    {
        rqstp = $rqstp
        si = &@cast(&rqstp->rq_addr, "sockaddr_in")
        ina = &@cast(&si->sin_addr, "in_addr")
        printf("\t\t%s: START %s\n", ppfunc(), $$parms)
        printf("\t\tXID[0x%x] sin_addr[0x%0x] sin_port[%d]\n\n",
               rqstp->rq_xid, ina->s_addr, si->sin_port)
        print_backtrace()
    }

Function arguments can also be accessed from the return probe by using '&@cast(@entry($arg_name), "arg_type")' syntax.

  • Accessing function arguments from the return probe

    probe kernel.{function("__inet_stream_connect").return,
        function("inet_shutdown").return}
    {
        socket = &@cast(@entry($sock), "socket")
        sk = &@cast(socket->sk, "sock")
        printf("%s %s: inet_sock[%p 0x%x] sk_flags[0x%x] [%s]\n",
               ctime(), ppfunc(), sock, sc->skc_state, sc->skc_flags, execname())
        printf("\t\t\t%s  --> RET[%d]\n", ppfunc(), $return)
        print_backtrace()
    }

Using SystemTap Functions

SystemTap scripts may define subroutines to factor out common work. Functions may take any number of scalar arguments, and must return a single scalar value. Scalars in this context are integers or strings.

Example to illustrate the use of a SystemTap function and Tapset symdata and substr functions to get the symbol of a kernel address.

  • SystemTap function and Tapset function

    function is_sunrpc:long(sym:string)
    {
        /* "xs_tcp_ops+0x0/0xe0 [sunrpc]" */
        sstr = substr(sym, 0, 14)
        if (sstr == "xs_tcp_ops+0x0")
            return 1
        else
            return 0
    }
    
    /* only interested in SUNRPC sockets */
    probe kernel.{function("tcp_close"), function("tcp_fin"), function("tcp_reset")}
    {
        if ($sk->sk_user_data) {
            xprt = &@cast($sk->sk_user_data, "rpc_xprt", "kernel:sunrpc")
            /* check if the ops of this sock is for a sunrpc's TCP connection */
            yes = is_sunrpc(symdata(xprt->ops))
            if (yes)
                show_sunrpc_state(xprt, xprt->snd_task)
        }
    }

Using Tapset system function to run a shell command

You can also run a shell command from a SystemTap probe using the Tapset system function.

Example of using the system function to send a message to syslog from a SystemTap function:

  • Using Tapset's system function to execute shell command

    /*
     * Print the warning message to the stdout and send it to the syslog via logger.
     */
    function warnlog(fsname:string, dev:long) {
        message = sprintf("%s (%d, %d) - no space left on the file system (uid: %d, %s:%d)\n",
                          fsname, MAJOR(dev), MINOR(dev), uid(), execname(), pid());
        /* Print the message to the output */
        printf("[%s]: %s", ctime(gettimeofday_s()), message);
    
        /* Log the message with logger */
        command = sprintf("/usr/bin/logger '%s'", message);
        system(command);
    }

Probing at specific statement in source code

Sometimes there is a need to place a probe at some arbitrary points in a function. You can can accomplish this using the .statement probe. The .statement variant places a probe at the exact spot in the source code, exposing those local variables that are visible there.

  • Example of probing at a specific line number in source code and its output

    probe module("sunrpc").statement("__rpc_add_timer@net/sunrpc/sched.c:99")
    {
        fileline = symfileline(addr())
        if (strtol(fileline, 16)) {
            // fileline is just the address
            fname = sprintf("%s", symdata(addr()))
        } else {
            // print symbol & filename & line number
            fname = sprintf("%s@%s", symname(addr()), fileline)
        }
        printf("%s: PROBE[%s]\n\t\t@[%s]\n", ppfunc(), pp(), fname)
        print_backtrace()
        printf("\n")
    }
    __rpc_add_timer: PROBE[module("sunrpc").statement("__rpc_add_timer@net/sunrpc/sched.c:99")]
            @[__dta___rpc_sleep_on_priority_980+0x12e/0x3b0 [sunrpc]]
     0xffffffffc0463e5e : __dta___rpc_sleep_on_priority_980+0x12e/0x3b0 [sunrpc]
     0xffffffffc0464127 : rpc_sleep_on+0x47/0x80 [sunrpc]
     0xffffffffc045a4d1 : xprt_transmit+0x361/0x3c0 [sunrpc]
     0xffffffffc04564c1 : call_transmit+0x1c1/0x2a0 [sunrpc]
     0xffffffffc0461c38 : __dta___rpc_execute_956+0xa8/0x420 [sunrpc]
     0xffffffffc0461fc2 : rpc_async_schedule+0x12/0x20 [sunrpc]
     0xffffffff890b18b9 : process_one_work+0x169/0x3a0 [kernel]
     0xffffffff890b236d : worker_thread+0x4d/0x3f0 [kernel]
     0xffffffff890b8005 : kthread+0x105/0x140 [kernel]
     0xffffffff89a0035e : ret_from_fork+0x3e/0x50 [kernel]
     0xffffffff89a0035e : ret_from_fork+0x3e/0x50 [kernel]

Embedded C with guru mode

Sometimes, you need to access data from the kernel or make a call to a kernel function or you just want to use C code to process the data extracted from a probe.  This can be done by running stap with embedded C code in guru mode. When in guru mode, the embedded C code,  enclosed between "%{" and "%}" markers in the top level of the script file, is transcribed verbatim, without analysis, in sequence, into the top level of the generated C code.

Guru mode is set by passing the -g option to the stap command.

Note: Running stap in guru mode could crash the system if the C code has bugs since safety features such as code and data memory reference protection are removed so use it with caution. To prevent this from happening, SystemTap provides runtime macros for safely accessing data from addresses that could potentially be invalid or dangerous. However it's up to you to make sure that kernel memory accesses from C code are done using these runtime macros:

  • kread macro to read a long value from a  memory location.

  • kwrite macro to write a long to a memory location.

  • kderef macro is similar to kread macro and is used to read data with the size of 1, 2, 4, or 8 bytes.

  • kderef_string to retrieves string from kernel memory.

The complete list of SystemTap runtime macros can be found in the SystemTap repo header file systemtap-repo/runtime/linux/loc2c-runtime.h.

Example of using kread to read a long value from a kernel memory location:

  • Using kread to read a long value from a kernel memory location

    /*
     * sfunction kernel_long - Retrieves a long value stored in kernel memory
     * @addr: The kernel address to retrieve the long from
     *
     * Description: Returns the long value from a given kernel memory address.
     * Reports an error when reading from the given address fails.
     */
    function kernel_long:long (addr:long) %{    /* pure */
        __label__ deref_fault;
        STAP_RETVALUE = kread((long *)(uintptr_t)STAP_ARG_addr);
        if (0) {
    deref_fault:        /* branched to from kread() */
            snprintf(CONTEXT->error_buffer, sizeof(CONTEXT->error_buffer),
                     "kernel long copy fault at 0x%p [man error::fault]", (void *)(uintptr_t)STAP_ARG_addr);
            CONTEXT->last_error = CONTEXT->error_buffer;
        }
    %}

Example of using kwrite to write a long value to a kernel memory location:

  • Using kwrite to write a long value to a kernel memory location

    /*
     * sfunction set_kernel_long - Writes a long value to kernel memory
     * @addr: The kernel address to write the long to
     * @val: The long which is to be written
     *
     * Description: Writes the long value to a given kernel memory address.
     * Reports an error when writing to the given address fails. Requires the use of guru mode (-g).
     */
    function set_kernel_long (addr:long, val:long) %{   /* guru */
        __label__ deref_fault;
        kwrite((long *)(uintptr_t)STAP_ARG_addr, STAP_ARG_val);
        if (0) {
    deref_fault:    /* branched to from kwrite() */
        snprintf(CONTEXT->error_buffer, sizeof(CONTEXT->error_buffer),
                 "kernel long copy fault at 0x%p [man error::fault]", (void *)(uintptr_t)STAP_ARG_addr);
        CONTEXT->last_error = CONTEXT->error_buffer;
      }
    %}

More examples of the runtime macros can be found in the /usr/share/systemtap/example directory.

Statistics (aggregate)

Aggregate instances are used to collect statistics on numerical values, when it is important to accumulate new data quickly and in large volume. These instances operate without exclusive locks, and store only aggregated stream statistics. Aggregates make sense only for global variables. They are stored individually or as elements of an associative array.

The aggregation operator is “<<<”, and its effect is similar to an assignment or a C++ output streaming operation. The left operand specifies a scalar or array-index l-value, which must be declared global. The right operand is a numeric expression. The meaning is intuitive: add the given number as a sample to the set of numbers to compute their statistics. The specific list of statistics to gather is given separately by the extraction functions.

SystemTap provides 2 types of extraction functions to extract information about aggregate:

  • integer extractors : @count(s), @sum(s)@min(s)@max(s) and @avg(s)

  • histogram extractors: @hist_linear and @hist log

Refer to SystemTap Language Reference for more information on aggregate and its extractors.

The following script is an example that illustrates the usage of aggregate to collect statistics of NFS file read requests:

#!/usr/bin/env stap
global times
global rdsize

function time() { return gettimeofday_us() }    # time measurement function

probe module("nfs").function("nfs_file_read").return
{
    t = time() - @entry(time())
    sz = @entry($to->count)
    rdsize <<< sz
    times[sz] <<< t
    printf("%s: %d bytes %dus\n", ppfunc(), sz, t)
}

probe end
{
    printf("\n\t\tRead Sizes\n")
    print(@hist_log(rdsize))
    foreach (x+ in times) {
        printf("\n\t\tElapse Time (us) For Read Size %d Bytes\n", x)
        print(@hist_log(times[x]))
    }
}

Sample output of from running this script:

nfs_file_read: 8192 bytes 3397us
nfs_file_read: 8192 bytes 12us
nfs_file_read: 8192 bytes 10us
nfs_file_read: 1048576 bytes 5299us
nfs_file_read: 1048576 bytes 3us
^C
                 Read Sizes
  value |-------------------------------------------------- count
   2048 |                                                   0
   4096 |                                                   0
   8192 |@@@                                                3
  16384 |                                                   0
  32768 |                                                   0
        ~
 262144 |                                                   0
 524288 |                                                   0
1048576 |@@                                                 2
2097152 |                                                   0
4194304 |                                                   0

            Elapse Time (us) For Read Size 8192 Bytes
value |-------------------------------------------------- count
    2 |                                                   0
    4 |                                                   0
    8 |@@                                                 2
   16 |                                                   0
   32 |                                                   0
      ~
  512 |                                                   0
 1024 |                                                   0
 2048 |@                                                  1
 4096 |                                                   0
 8192 |                                                   0

          Elapse Time (us) For Read Size 1048576 Bytes
value |-------------------------------------------------- count
    0 |                                                   0
    1 |                                                   0
    2 |@                                                  1
    4 |                                                   0
    8 |                                                   0
      ~
 1024 |                                                   0
 2048 |                                                   0
 4096 |@                                                  1
 8192 |                                                   0
16384 |                                                   0

Pass 5: run completed in 50usr/160sys/109080real ms.

Using staprun(8)

CODEHERE

The staprun program is the back-end of the SystemTap tool.  It expects a kernel module produced by the front-end stap tool. Splitting the SystemTap tool into a front-end and a back-end allows a user to compile a SystemTap script on a development  machine that has the kernel development tools (needed to compile the script) and then transfer the resulting kernel module to a production machine that doesn't have any development tools installed. This is useful where you need to run a SystemTap script on customer's systems that do not have all the required RPMs installed.

To create the kernel '.ko' module, compile the script using stap command with option -p4 as shown below. See man page of staprun(8) for more information.

Creating the systemtap ko module:

root@nfsvmf14 cache]# stap -p4 -d sunrpc -d nfsv4 -d nfsv3 -d nfs -d kernel -v -g -DSTP_NO_BUILDID_CHECK nfs4_lockowner-uek5.stp -m nfs4_lockowner-uek5
Pass 1: parsed user script and 476 library scripts using 217992virt/86500res/4808shr/82048data kb, in 1490usr/80sys/1573real ms.
Pass 2: analyzed script: 14 probes, 33 functions, 1 embed, 13 globals using 225572virt/95592res/6436shr/89628data kb, in 370usr/200sys/574real ms.
Pass 3: translated to C into "/tmp/stapRPCDdH/stap_f0dfd47f93a0d4828a011cd6892869ee_21269_src.c" using 234836virt/102508res/6628shr/98892data kb, in 6070usr/360sys/6443real ms. /root/.systemtap/cache/f0/stap_f0dfd47f93a0d4828a011cd6892869ee_21269.ko
Pass 4: compiled C into "stap_f0dfd47f93a0d4828a011cd6892869ee_21269.ko" in 23610usr/2810sys/26661real ms.
[root@nfsvmf14 ssc]#

Copy the 'nfs4_lockowner-uek5.ko' module to the target system and use staprun command to start the script:

[root@nfsvmf14 cache]# staprun nfs4_lockowner-uek5.ko
NFS Probes Start...

SystemTap script for tracing SUNRPC state machine

Below is a working example of a script that traces activity of the SUNRPC state machine of every RPC request on the UEK-5 kernel.   The script includes examples of embedded C, Tapset functions, SystemTap functions and programming tips in the probes.

#!/usr/bin/env stap
/*
 * Copyright (c) 2020 Oracle and/or its affiliates.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.                                                                                                                                                             
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.                                                                    
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details.                                                                                   
 */

/*
 * stap  -d sunrpc -d nfsv4 -d nfsv3 -d nfs -d kernel -v -g -DSTP_NO_BUILDID_CHECK rpc-ss-uek5.stp [n]
 */
global rpc_timer_fired
global rpc_tid
global nostack

probe begin
{
    rpc_timer_fired = 0
    nostack = 0
    if ($# > 0)
        nostack = 1
    printf("Probe starts...[nostack=%d]\n", nostack)
}

/*
 * Embedded C
 */
%{
#include <linux/sched.h>                /* includes asm/cputime.h */
#include <linux/time.h>
#include <linux/sunrpc/clnt.h>
#include <linux/sunrpc/sched.h>         /* rpc_task & etc */
char *tcp_state[13] = {
    "INVALID - 0", "TCP_ESTABLISHED", "TCP_SYN_SENT", "TCP_SYN_RECV",
    "TCP_FIN_WAIT1", "TCP_FIN_WAIT2", "TCP_TIME_WAIT", "TCP_CLOSE",
    "TCP_CLOSE_WAIT", "TCP_LAST_ACK", "TCP_LISTEN", "TCP_CLOSING", "TCP_NEW_SYN_RECV"
};

char genstr[MAXSTRINGLEN] = { "NONE" };
%}
function xprtstate:string(state:long) %{
    int state = (int)STAP_ARG_state;
    char *str = "";
    int ix, n, len = 0;

    for (ix = 0; ix <= 9; ix++) {
        if (state & (1 << ix)) {
            switch (ix) {
                case 0: str = "XPRT_LOCKED"; break;
                case 1: str = " XPRT_CONNECTED"; break;
                case 2: str = " XPRT_CONNECTING"; break;
                case 3: str = " XPRT_CLOSE_WAIT"; break;
                case 4: str = " XPRT_BOUND"; break;
                case 5: str = " XPRT_BINDING"; break;
                case 6: str = " XPRT_CLOSING"; break;
                case 9: str = " XPRT_CONGESTED"; break;
            }
            n = snprintf(&genstr[len], MAXSTRINGLEN - len, str);
            len += n;
         }
    }
    snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", genstr);
%}

function tcpstate:string(state:long) %{
    int state = (int)STAP_ARG_state;
    if (state > 12)
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "INVALID - %d",  state);
    else
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", tcp_state[state]);
%}

function task2tstr:string(task:long) %{ /* pure */ /* unprivileged */
    struct rpc_task *taskp = (struct rpc_task *)STAP_ARG_task;
    if (taskp->tk_flags & 1)
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", "ASYNC");
    else
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", "SYNC");
%}
function task2proc:string(task:long) %{ /* pure */ /* unprivileged */
    struct rpc_task *taskp = (struct rpc_task *)STAP_ARG_task;
    if (taskp->tk_msg.rpc_proc)
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "%s", taskp->tk_msg.rpc_proc->p_name);
    else
        snprintf(STAP_RETVALUE, MAXSTRINGLEN, "NULL");
%}

function task2timeout:long(task:long) %{ /* pure */ /* unprivileged */
    struct rpc_task *taskp = (struct rpc_task *)STAP_ARG_task;
    STAP_RETVALUE = (int64_t) taskp->tk_client->cl_timeout;
%}

/*
 * Systemtap functions
 */
function is_sunrpc:long(sym:string)
{
    /* "xs_tcp_ops+0x0/0xe0 [sunrpc]" */
    sstr = substr(sym, 0, 14)
    if (sstr == "xs_tcp_ops+0x0")
        return 1
    else
        return 0
}

function show_tcp_state_short(sk:long)
{
    if (sk == 0)
        next
    sock = &@cast(sk, "sock")
    sc = &@cast(&sock->__sk_common, "sock_common")
    printf("%s %s: inet_sock[%p 0x%x: %s] sk_flags[0x%x] [%s]\n",
           ctime(), ppfunc(), sock, sc->skc_state, tcpstate(sc->skc_state),
           sc->skc_flags, execname())
}
function show_tcp_state(xp:long, tk:long)
{
    taskp = &@cast(tk, "rpc_task", "kernel:sunrpc")
    xprt = &@cast(xp, "rpc_xprt", "kernel:sunrpc")
    if (xprt == 0) {
        printf("%s %s: xprt NULL [%s]\n", ctime(), ppfunc(), execname())
        next
    }
    sxprt = &@cast(xprt, "sock_xprt", "kernel:sunrpc")
    sk = &@cast(sxprt->inet, "sock")
    if (sk == 0) {
        printf("%s %s: xprt[%p 0x%x:%s] inet_sock[0] [%s]\n", ctime(), ppfunc(),
               xprt, xprt->state, xprtstate(xprt->state), execname())
    } else {
        printf("%s %s: xprt[%p 0x%x:%s] [%s]\n", ctime(), ppfunc(),
               xprt, xprt->state, xprtstate(xprt->state), execname())
        sc = &@cast(&sk->__sk_common, "sock_common")
        if (sc)
            printf("\tinet_sock[%p %d: %s]\n",
                   sk, sc->skc_state, tcpstate(sc->skc_state))
        else
            printf("\tinet_sock[%p %d: %s]\n",
                   sk, sc->skc_state, tcpstate(sc->skc_state))
    }
    if (taskp != 0) {
        printf("\ttask[%p] %s tk_action[%p] tk_status[%d] tk_runstate[0x%x] tk_flags[0x%x]\n",
               taskp, task2proc(taskp), taskp->tk_action,
               taskp->tk_status, taskp->tk_runstate, taskp->tk_flags)
    }
}

probe module("nfs").function("nfs_init_timeout_values")
{
    /* HZ == 1000 */
    printf("%s %s: proto[%d] timeo[%d 1/10sec] ticks[%d] retrans[%d]\n", ctime(), ppfunc(),
           $proto, $timeo, $timeo * 1000 / 10, $retrans)
}
probe module("nfs").function("nfs_init_timeout_values").return
{
    to = &@cast(@entry($to), "rpc_timeout")
    printf("\tRETURN to_initval[%d] to_retries[%d] to_increment[%d] to_maxval[%d]\n",
           to->to_initval, to->to_retries, to->to_increment, to->to_maxval)
    if (nostack == 1)
        next
    print_backtrace()
    printf("\n")
}

probe module("sunrpc").function("xs_tcp_set_socket_timeouts")
{
    xprt = $xprt
    show_tcp_state(xprt, xprt->snd_task)
    to = &@cast(xprt->timeout, "rpc_timeout")
    if (to != 0) {
        /* unit in ticks (jiffies). HZ is 1000 ticks, see include/generated/autoconf.h */
        printf("\t\t\tto_initval[%d] to_retries[%d] to_increment[%d] to_maxval[%d] xprt[%p]\n",
               to->to_initval, to->to_retries, to->to_increment, to->to_maxval, xprt)
    }
    if (nostack == 1)
        next
    print_backtrace()
    printf("\n")
}
probe module("sunrpc").{function("call_status"),
    function("call_start"),
    function("call_reserve"),
    function("call_reserveresult"),
    function("call_refresh"),
    function("call_refreshresult"),
    function("call_allocate"),
    function("call_bind"),
    function("call_bind_status"),
    function("call_connect"),
    function("call_connect_status"),
    function("call_transmit"),
    function("call_decode"),
    function("xs_tcp_send_request"),
    function("__rpc_sleep_on_priority"),
    function("call_transmit_status"),
    function("rpc_exit_task")}
{
    taskp = $task
    rqstp = taskp->tk_rqstp
    if (ppfunc() == "call_start")
        printf("\n")
    if (rqstp != 0) {
        sxprt = &@cast(taskp->tk_xprt, "sock_xprt")
        show_tcp_state(sxprt, taskp)
    } else {
        printf("%s %s: xprt NULL, sock NULL [%s]\n", ctime(), ppfunc(), execname())
        printf("\t\t\ttask[%p] %s tk_action[%p] tk_status[%d]\n",
               taskp, task2proc(taskp), taskp->tk_action, taskp->tk_status)
    }
    if (nostack)
        next
    print_backtrace()
    printf("\n")
}


probe module("sunrpc").function("__rpc_queue_timer_fn").return
{
    rpc_timer_fired = 0
    rpc_tid = 0
}

Sample output from this script for the RENEW request:

Tue Nov 17 00:49:57 2020 call_start: xprt NULL, sock NULL [kworker/0:1]
                        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454410] tk_status[0]
Tue Nov 17 00:49:57 2020 call_reserve: xprt NULL, sock NULL [kworker/0:1]
                        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454790] tk_status[0]
Tue Nov 17 00:49:57 2020 call_reserveresult: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc04547f0] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_refresh: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454b50] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_refreshresult: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454bb0] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_allocate: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454910] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_bind: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc04542c0] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_connect: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0456100] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 call_transmit: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0456300] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 xs_tcp_send_request: xprt[0xffff8b9f554bf000 0x13:XPRT_LOCKED XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc04561b0] tk_status[0] tk_runstate[0x5] tk_flags[0x5281]
Tue Nov 17 00:49:57 2020 __rpc_sleep_on_priority: xprt[0xffff8b9f554bf000 0x13:XPRT_LOCKED XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc04561b0] tk_status[0] tk_runstate[0x5] tk_flags[0x5a81]
Tue Nov 17 00:49:57 2020 call_transmit_status: xprt[0xffff8b9f554bf000 0x13:XPRT_LOCKED XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc04561b0] tk_status[0] tk_runstate[0x7] tk_flags[0x5a81]
Tue Nov 17 00:49:57 2020 call_status: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0456830] tk_status[0] tk_runstate[0x5] tk_flags[0x5a81]
Tue Nov 17 00:49:57 2020 call_decode: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]
        task[0xffff8b9f550b6a00] RENEW tk_action[0xffffffffc0454cf0] tk_status[112] tk_runstate[0x5] tk_flags[0x5a81]
Tue Nov 17 00:49:57 2020 rpc_exit_task: xprt[0xffff8b9f554bf000 0x12: XPRT_CONNECTED XPRT_BOUND] [kworker/0:1]
        inet_sock[0xffff8b9f56172800 1: TCP_ESTABLISHED]

Resources

SystemTap installation includes a number of Tapset scripts and SystemTap scripts. These scripts provide excellent examples for learning SystemsTap's features and how to apply them in your own scripts.

Tapset scripts are located in the /usr/share/systemtap/tapset directory. Tapset scripts are libraries of probe aliases and auxiliary functions. While these scripts look like regular SystemTap scripts, they cannot be run directly but they can be accessed from your own scripts. Refer to SystemTap Tapset Reference Manual for a complete list of available Tapset resources.

SystemTap Scripts are located in the /usr/share/systemtap/examples directory. These scripts are designed for specific purposes and can be run directly.  There are numerous useful example scripts that can be used immediately. See /usr/share/systemtap/examples/index.txt for the description of each script and the command syntax of how to run it.

For example the SystemTap script /usr/share/systemtap/examples/profiling/thread-times.stp, which is used to profile thread CPU usage. The thread-times.stp script sets up time-based sampling. Every five seconds it prints out a sorted list with the top twenty threads occupying the CPU, broken down as a percentage of user and kernel time.

# stap -DSTP_NO_BUILDID_CHECK /usr/share/systemtap/examples/profiling/thread-times.stp
            comm   tid   %user %kernel (of 4953 ticks)
       swapper/0     0   0.00%  99.29%
            sshd  8402   0.00%   0.20%
    kworker/0:1H   519   0.00%   0.14%
             top  7043   0.04%   0.02%
    xfsaild/dm-0   613   0.00%   0.06%
    ...

            comm   tid   %user %kernel (of 4953 ticks)
       swapper/0     0   0.00%  99.19%
            sshd  8402   0.00%   0.18%
              ls  8419   0.02%   0.06%
        nfsidmap  8422   0.00%   0.06%
        nfsidmap  8425   0.00%   0.06%
    ...

References

Legal Notice

The content on this page is governed by the Terms of Use; except as for the sample code. The sample code on this page is provided under the GPLv2 only. 

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.