Adding your own performance statistics to Solaris

Have you ever wondered where all of those magical vmstat statistics come from, or how you might go about adding your own? Now that we have DTrace, you can get all the ad hoc statistics you need, so we seldom need to add hard-coded statistics. However, the kstat mechanism is still used to provide lightweight statistics that are a stable part of your kernel code. The kstat interface is the right choice for standard information that will be reported by a user-level tool. For example, if you wanted to add your own device driver I/O statistics to the pool reported by the iostat command, you would add a kstat provider.

Peter Boothby provides an excellent introduction to the kstat facility and the user API in his developer article. Since Pete's covered the detail, I'll just recap the major parts and jump into an overview of a kstat provider.

We'll start with the basics about adding your own kstat provider, and now that we've launched OpenSolaris, I can include source references ;-)

The statistics reported by vmstat, iostat and most of the other Solaris tools are gathered via a central kernel statistics subsystem, known as "kstat". The kstat facility is an all-purpose interface for collecting and reporting named and typed data.

A typical scenario has a kstat producer and a kstat reader. The kstat reader is a user-mode utility that reads, potentially aggregates, and then reports the results. For example, the vmstat utility is a kstat reader that aggregates statistics provided by the VM system in the kernel.

Statistics are named and accessed via a four-tuple: class, module, name, instance. Solaris 8 introduced a new method to access kstat information from the command line or in custom-written scripts. You can use the command-line tool /usr/bin/kstat interactively to print all or selected kstat information from a system. This program is written in the Perl language, and you can use the Perl XS extension module to write your own custom Perl programs. Both facilities are documented in the pages of the online manual.

The kstat Command

You can invoke the kstat command on the command line or within shell scripts to selectively extract kernel statistics. Like many other Solaris commands, kstat takes optional interval and count arguments for repetitive, periodic output. Its command options are quite flexible.

The kstat command accepts two forms of arguments. The first form follows standard UNIX command-line syntax, using options such as -m, -i, -n, and -s; the second form provides a way to pass some of the arguments as colon-separated fields (module:instance:name:statistic). Both forms offer the same functionality. Each of the module, instance, name, or statistic specifiers may be a shell glob pattern or a Perl regular expression enclosed by "/" characters. It is possible to use both specifier types within a single operand. Leaving a specifier empty is equivalent to using the "*" glob pattern for that specifier. Running kstat with no arguments prints nearly all kstat entries from the running kernel (most, but not all, kstats of KSTAT_TYPE_RAW are decoded).

The tests specified by the options are logically ANDed, and all matching kstats are selected. The argument for the -c, -i, -m, -n, and -s options may be specified as a shell glob pattern, or as a Perl regular expression enclosed in "/" characters. If you want to pass a regular expression containing shell metacharacters to the command, you must protect it from the shell by enclosing it in the appropriate quotation marks. For example, to show all kstats that have a statistic name beginning with intr in the module cpu_stat, you could use the following command:

$ kstat -p -m cpu_stat -s 'intr*'
cpu_stat:0:cpu_stat0:intr       878951000
cpu_stat:0:cpu_stat0:intrblk    21604
cpu_stat:0:cpu_stat0:intrthread 668353070
cpu_stat:1:cpu_stat1:intr       211041358
cpu_stat:1:cpu_stat1:intrblk    280
cpu_stat:1:cpu_stat1:intrthread 209879640

The -p option used in the preceding example displays output in a parsable format. If you do not specify this option, kstat produces output in a human-readable, tabular format. In the following example, we leave out the -p flag and use the module:instance:name:statistic argument form with a Perl regular expression.

$ kstat cpu_stat:::/^intr/
module: cpu_stat                        instance: 0
name:   cpu_stat0                       class:    misc
        intr                            879131909
        intrblk                         21608
        intrthread                      668490486
module: cpu_stat                        instance: 1
name:   cpu_stat1                       class:    misc
        intr                            211084960
        intrblk                         280
        intrthread                      209923001
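The parsable -p format is easy to consume from your own tools. Here is a minimal sketch in C, assuming each line has the shape module:instance:name:statistic followed by whitespace and an unsigned integer value. The kstat_line_t type and parse_kstat_line() helper are hypothetical names made up for this illustration, not part of any Solaris library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of `kstat -p` output (hypothetical helper type). */
typedef struct kstat_line {
        char module[32];
        int instance;
        char name[32];
        char statistic[32];
        unsigned long long value;
} kstat_line_t;

/*
 * Parse "module:instance:name:statistic<whitespace>value".
 * Returns 0 on success, -1 if the line does not match.
 * Only unsigned integer values are accepted; string values and
 * fractional values (such as snaptime) are rejected.
 */
static int
parse_kstat_line(const char *line, kstat_line_t *out)
{
        int used = 0;
        int n;

        assert(line != NULL && out != NULL);

        n = sscanf(line, "%31[^:]:%d:%31[^:]:%31s %llu%n",
            out->module, &out->instance, out->name,
            out->statistic, &out->value, &used);
        if (n != 5)
                return (-1);

        /* Anything other than trailing whitespace means a bad value. */
        if (line[used + strspn(line + used, " \t\r\n")] != '\0')
                return (-1);

        return (0);
}
```

Feeding it the first line of the -p output above yields module "cpu_stat", instance 0, name "cpu_stat0", statistic "intr" and value 878951000.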

A kstat provider - a walk-through

To add your own statistics to your Solaris kernel, you will need to create a kstat provider, which consists of an initialization function that creates the statistics group, and a callback function that updates the statistics before they are read. The callback function is often used to aggregate or summarize information before it is reported back to the reader. The kstat provider interface is defined in kstat(3KSTAT) and kstat(9S). There is some more verbose information in usr/src/uts/common/sys/kstat.h.

The first step is to decide on the type of information you want to export. The three primary types are RAW, NAMED, and IO. The RAW interface exports raw C data structures to userland, and its use is strongly discouraged, since a change in the C structure will cause incompatibilities in the reader. The NAMED mechanism is preferred, since the data is typed and extensible. Both the NAMED and IO types use typed data.

The NAMED type is for providing single or multiple records of data, and is the most common choice. The IO record is specifically for providing I/O statistics. It is collected and reported by the iostat command, and therefore should be used only for items which can be viewed and reported as I/O devices (we do this currently for I/O devices and NFS file systems).
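To see why named records are more robust than raw structures, here is a small userland sketch of a reader that finds a statistic by name and checks its type, so it keeps working when the provider adds new records. The named_stat_t type and both helpers are stand-ins invented for this example; they are not the kernel's kstat_named_t or the libkstat API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a named, typed record. */
typedef enum { STAT_U32, STAT_U64 } stat_type_t;

typedef struct named_stat {
        const char *name;
        stat_type_t type;
        union {
                uint32_t u32;
                uint64_t u64;
        } value;
} named_stat_t;

/* Find a record by name; NULL means the provider does not export it. */
static const named_stat_t *
named_lookup(const named_stat_t *recs, size_t nrecs, const char *name)
{
        size_t i;

        assert(recs != NULL && name != NULL);
        for (i = 0; i < nrecs; i++) {
                if (strcmp(recs[i].name, name) == 0)
                        return (&recs[i]);
        }
        return (NULL);
}

/* Widen whichever type the record carries to 64 bits. */
static uint64_t
named_value(const named_stat_t *rec)
{
        assert(rec != NULL);
        return (rec->type == STAT_U32 ? rec->value.u32 : rec->value.u64);
}
```

Because the reader never depends on the position or size of a field, adding or reordering records in the provider does not break it, which is the compatibility problem the RAW type suffers from.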

A simple example of named statistics is the virtual memory summaries provided via "system_pages":

$ kstat -n system_pages
module: unix                            instance: 0     
name:   system_pages                    class:    pages                         
        availrmem                       343567
        crtime                          0
        desfree                         4001
        desscan                         25
        econtig                         4278190080
        fastscan                        256068
        freemem                         248309
        kernelbase                      3556769792
        lotsfree                        8002
        minfree                         2000
        nalloc                          11957763
        nalloc_calls                    9981
        nfree                           11856636
        nfree_calls                     6689
        nscan                           0
        pagesfree                       248309
        pageslocked                     168569
        pagestotal                      512136
        physmem                         522272
        pp_kernel                       64102
        slowscan                        100
        snaptime                        6573953.83957897

These are first declared and initialized by the following C structs in usr/src/uts/common/os/kstat_fr.c:

struct {
        kstat_named_t physmem;
        kstat_named_t nalloc;
        kstat_named_t nfree;
        kstat_named_t nalloc_calls;
        kstat_named_t nfree_calls;
        kstat_named_t kernelbase;
        kstat_named_t econtig;
        kstat_named_t freemem;
        kstat_named_t availrmem;
        kstat_named_t lotsfree;
        kstat_named_t desfree;
        kstat_named_t minfree;
        kstat_named_t fastscan;
        kstat_named_t slowscan;
        kstat_named_t nscan;
        kstat_named_t desscan;
        kstat_named_t pp_kernel;
        kstat_named_t pagesfree;
        kstat_named_t pageslocked;
        kstat_named_t pagestotal;
} system_pages_kstat = {
        { "physmem",            KSTAT_DATA_ULONG },
        { "nalloc",             KSTAT_DATA_ULONG },
        { "nfree",              KSTAT_DATA_ULONG },
        { "nalloc_calls",       KSTAT_DATA_ULONG },
        { "nfree_calls",        KSTAT_DATA_ULONG },
        { "kernelbase",         KSTAT_DATA_ULONG },
        { "econtig",            KSTAT_DATA_ULONG },
        { "freemem",            KSTAT_DATA_ULONG },
        { "availrmem",          KSTAT_DATA_ULONG },
        { "lotsfree",           KSTAT_DATA_ULONG },
        { "desfree",            KSTAT_DATA_ULONG },
        { "minfree",            KSTAT_DATA_ULONG },
        { "fastscan",           KSTAT_DATA_ULONG },
        { "slowscan",           KSTAT_DATA_ULONG },
        { "nscan",              KSTAT_DATA_ULONG },
        { "desscan",            KSTAT_DATA_ULONG },
        { "pp_kernel",          KSTAT_DATA_ULONG },
        { "pagesfree",          KSTAT_DATA_ULONG },
        { "pageslocked",        KSTAT_DATA_ULONG },
        { "pagestotal",         KSTAT_DATA_ULONG },
};

These statistics are the simplest type: merely a basic list of unsigned long variables (64-bit on a 64-bit kernel). Once declared, the kstats are registered with the subsystem:

static int system_pages_kstat_update(kstat_t *, int);

        kstat_t *ksp;

        ksp = kstat_create("unix", 0, "system_pages", "pages", KSTAT_TYPE_NAMED,
                sizeof (system_pages_kstat) / sizeof (kstat_named_t),
                KSTAT_FLAG_VIRTUAL);    /* ks_data points at our own struct */
        if (ksp) {
                ksp->ks_data = (void *) &system_pages_kstat;
                ksp->ks_update = system_pages_kstat_update;
                kstat_install(ksp);     /* make the kstat visible to readers */
        }


The kstat_create() function takes the four-tuple description (module, instance, name, class), the kstat type, and the size of the kstat (here, the number of kstat_named_t records), and returns a handle to the created kstat. The handle is then updated to include a pointer to the data and a callback function that will be invoked when the user reads the statistics.

The callback function has the task of updating the data structure pointed to by ks_data when invoked. If you choose not to have one, simply set the callback function to default_kstat_update(). The system pages kstat preamble looks like this:

static int
system_pages_kstat_update(kstat_t *ksp, int rw)
{
        if (rw == KSTAT_WRITE) {
                return (EACCES);
        }

This basic preamble checks to see if the user code is trying to read or write the structure. (Yes, it's possible to write to some statistics, if the provider allows it). Once basic checks are done, the update call-back simply stores the statistics into the predefined data structure, and then returns.

        system_pages_kstat.freemem.value.ul     = (ulong_t)freemem;
        system_pages_kstat.availrmem.value.ul   = (ulong_t)availrmem;
        system_pages_kstat.lotsfree.value.ul    = (ulong_t)lotsfree;
        system_pages_kstat.desfree.value.ul     = (ulong_t)desfree;
        system_pages_kstat.minfree.value.ul     = (ulong_t)minfree;
        system_pages_kstat.fastscan.value.ul    = (ulong_t)fastscan;
        system_pages_kstat.slowscan.value.ul    = (ulong_t)slowscan;
        system_pages_kstat.nscan.value.ul       = (ulong_t)nscan;
        system_pages_kstat.desscan.value.ul     = (ulong_t)desscan;
        system_pages_kstat.pagesfree.value.ul   = (ulong_t)freemem;

        return (0);
}

That's it for a basic named kstat.
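If you want to experiment with the create/update pattern outside the kernel, the following userland mock shows the flow: a handle carries a data pointer and an update callback, and every read first runs the callback to take a fresh snapshot. All of the mock_ names, pages_stats and live_freemem are assumptions made for this sketch; the real kernel types and kstat_create() are not used here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userland stand-ins; these are not the real kernel definitions. */
#define MOCK_KSTAT_READ         0
#define MOCK_KSTAT_WRITE        1

typedef struct mock_named {
        const char *name;
        unsigned long value;
} mock_named_t;

typedef struct mock_kstat {
        void *ks_data;                                  /* records for readers */
        int (*ks_update)(struct mock_kstat *, int);     /* snapshot callback */
} mock_kstat_t;

/* A "live" counter maintained by the subsystem (hypothetical). */
static unsigned long live_freemem = 248309;

static struct {
        mock_named_t freemem;
        mock_named_t pagesfree;
} pages_stats = {
        { "freemem", 0 },
        { "pagesfree", 0 },
};

/* Refuse writes; on reads, copy the live values into the records. */
static int
pages_update(mock_kstat_t *ksp, int rw)
{
        (void) ksp;
        if (rw == MOCK_KSTAT_WRITE)
                return (EACCES);
        pages_stats.freemem.value = live_freemem;
        pages_stats.pagesfree.value = live_freemem;
        return (0);
}

/* What a read amounts to: run the callback, then expose ks_data. */
static int
mock_kstat_read(mock_kstat_t *ksp)
{
        assert(ksp != NULL && ksp->ks_update != NULL);
        return (ksp->ks_update(ksp, MOCK_KSTAT_READ));
}
```

Initializing a handle as { &pages_stats, pages_update } and calling mock_kstat_read() on it refreshes both records from live_freemem, while calling the callback with MOCK_KSTAT_WRITE returns EACCES, just like the preamble above.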

I/O statistics

Moving on to I/O, we can see how I/O stats are measured and recorded. There is a special kstat type for I/O statistics. This provides a common methodology for recording statistics about devices which have queuing, utilization and response time metrics.

These devices are measured as a queue using a "Riemann sum", which is a count of the visits to the queue and a sum of the "active" time. These two metrics can be used to determine the average service time and I/O counts for the device. There are typically two queues for each device: the wait queue and the active queue. These represent the time spent waiting after the request has been accepted and enqueued, and then the time spent active on the device. The statistics are covered in kstat(3KSTAT):

     typedef struct kstat_io {
     /*
      * Basic counters.
      */
     u_longlong_t     nread;      /* number of bytes read */
     u_longlong_t     nwritten;   /* number of bytes written */
     uint_t           reads;      /* number of read operations */
     uint_t           writes;     /* number of write operations */

     /*
      * Accumulated time and queue length statistics.
      *
      * Time statistics are kept as a running sum of "active" time.
      * Queue length statistics are kept as a running sum of the
      * product of queue length and elapsed time at that length --
      * that is, a Riemann sum for queue length integrated against time.
      *
      *          ^
      *          |                       _________
      *          8                       | i4    |
      *          |                       |       |
      *  Queue   6                       |       |
      *  Length  |       _________       |       |
      *          4       | i2    |_______|       |
      *          |       |           i3          |
      *          2_______|                       |
      *          |    i1                         |
      *          |_______________________________|
      *          Time->  t1      t2      t3      t4
      *
      * At each change of state (entry or exit from the queue),
      * we add the elapsed time (since the previous state change)
      * to the active time if the queue length was non-zero during
      * that interval; and we add the product of the elapsed time
      * times the queue length to the running length*time sum.
      *
      * This method is generalizable to measuring residency
      * in any defined system: instead of queue lengths, think
      * of "outstanding RPC calls to server X".
      *
      * A large number of I/O subsystems have at least two basic
      * "lists" of transactions they manage: one for transactions
      * that have been accepted for processing but for which processing
      * has yet to begin, and one for transactions which are actively
      * being processed (but not done). For this reason, two cumulative
      * time statistics are defined here: pre-service (wait) time,
      * and service (run) time.
      *
      * The units of cumulative busy time are accumulated nanoseconds.
      * The units of cumulative length*time products are elapsed time
      * times queue length.
      */
     hrtime_t   wtime;            /* cumulative wait (pre-service) time */
     hrtime_t   wlentime;         /* cumulative wait length*time product */
     hrtime_t   wlastupdate;      /* last time wait queue changed */
     hrtime_t   rtime;            /* cumulative run (service) time */
     hrtime_t   rlentime;         /* cumulative run length*time product */
     hrtime_t   rlastupdate;      /* last time run queue changed */
     uint_t     wcnt;             /* count of elements in wait state */
     uint_t     rcnt;             /* count of elements in run state */
     } kstat_io_t;

An I/O device driver has a similar declare-and-create section to the one we saw with the named statistics. A quick look at the floppy disk device driver (usr/src/uts/sun/io/fd.c) shows the kstat_create() call in the device driver attach function:

static int
fd_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        /* ... */
        fdc->c_un->un_iostat = kstat_create("fd", 0, "fd0", "disk",
            KSTAT_TYPE_IO, 1, KSTAT_FLAG_PERSISTENT);
        if (fdc->c_un->un_iostat) {
                fdc->c_un->un_iostat->ks_lock = &fdc->c_lolock;
                kstat_install(fdc->c_un->un_iostat);
        }

The per-I/O statistics are updated in two places. The first is the device driver strategy function, where the I/O is first received and queued. At this point, the I/O is marked as waiting on the wait queue:

#define KIOSP   KSTAT_IO_PTR(un->un_iostat)

static int
fd_strategy(register struct buf *bp)
{
        struct fdctlr *fdc;
        struct fdunit *un;

        fdc = fd_getctlr(bp->b_edev);
        un = fdc->c_un;

        /* Mark I/O as waiting on wait q */
        if (un->un_iostat) {
                kstat_waitq_enter(KIOSP);
        }

The I/O spends some time on the wait queue until the device is able to process the request. For each I/O the fdstart() routine moves the I/O from the wait queue to the run queue via the kstat_waitq_to_runq() function:

static void
fdstart(struct fdctlr *fdc)
{
        /* ... */

                /* Mark I/O as active, move from wait to active q */
                if (un->un_iostat) {
                        kstat_waitq_to_runq(KIOSP);
                }

                /* Do I/O... */

When the I/O is complete (still in the fdstart() function), it is marked as leaving the active queue via kstat_runq_exit(). This updates the last part of the statistic, leaving us with the number of I/Os, and the total time spent on each queue.

                /* Mark I/O as complete */
                if (un->un_iostat) {
                        if (bp->b_flags & B_READ) {
                                KIOSP->reads++;
                                KIOSP->nread +=
                                        (bp->b_bcount - bp->b_resid);
                        } else {
                                KIOSP->writes++;
                                KIOSP->nwritten += (bp->b_bcount - bp->b_resid);
                        }
                        kstat_runq_exit(KIOSP);
                }



These statistics provide us with our familiar metrics where actv is the average length of the queue of active I/Os and asvc_t is the average service time in the device. The wait queue is represented accordingly with wait and wsvc_t.

$ iostat -xn 10
                    extended device statistics              
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.2    0.1    9.2    1.1  0.1  0.5    0.1   10.4   1   1 fd0

That's likely enough for a primer to kstat providers, as the detailed information is available in other places... Enjoy!




kstat tells how many pages are locked, is there a way to see what these are?

Posted by A F on October 20, 2005 at 01:27 AM PDT #

I am on a quest to understand how Solaris does I/O and how to interpret the metrics out of iostat and sar. For example, I have a batch application that is a simple read, update and write process. In iostat I see very few read I/Os but many write I/Os. Also, in my current run I am CPU bound (96%). I am guessing that all the I/Os are asynchronous; the %rcache and %wcache are both 100. The confusing thing is that from sar I get a large number of lread/s and a small number of lwrit/s. The I/O subsystem is 1 large 5 disk array set up as 0+1 with 1 GB of cache and 1 2Gb FCAL channel. What I am trying to understand is the scalability of this app. When will I start pushing the memory or device too hard, and how hard am I pushing them at this point? Can I derive this from the already available Solaris monitors? Thanks for your time in reading this. If there is another site that deals with this I would appreciate knowing about it. Not a UNIX guy, Bob Schwarz

Posted by Bob Schwarz on May 24, 2006 at 07:20 AM PDT #

In the kstat_waitq_enter function, shouldn't wlentime be accumulated before wcnt is incremented? Similarly, in exit, shouldn't the count be decremented after wlentime is incremented? A little bit confused. Thanks

kstat_waitq_enter(kstat_io_t *kiop)
{
        hrtime_t new, delta;
        ulong_t wcnt;

        new = gethrtime_unscaled();
        delta = new - kiop->wlastupdate;
        kiop->wlastupdate = new;
        wcnt = kiop->wcnt++;
        if (wcnt != 0) {
                kiop->wlentime += delta * wcnt;
                kiop->wtime += delta;
        }
}

Posted by just a question on July 19, 2008 at 01:44 PM PDT #
