Tuesday Jan 26, 2010

Workaround for runaway metacity

Sun Ray on OpenSolaris build 131 requires the same workarounds I previously mentioned.

There is one more that helps with both 130 and 131. With the new gdm set up the login screen now runs "metacity" and occasionally this can get into a loop just consuming CPU. The trigger is that metacity has been sent a signal to terminate but then tries to be a bit too clever and goes into the loop. I've filed this bug so that it can be fixed.

Happily once again you can work around this with a bit of dtrace:

#!/usr/sbin/dtrace -qws

proc:::signal-handle
/ execname == "metacity" && args[0] == 15 /
{
	system("logger -t metacity.d -p daemon.notice killing metacity[%d]", pid);
	raise(9);
}

Sunday Jan 03, 2010

Automatic virus scanning with c-icap & ZFS

Now that I have OpenSolaris running on the home server I thought I would take advantage of the virus scanning capabilities using the clamav instance I have running. After downloading, compiling and installing c-icap I was able to get the service up and running quickly using the instructions here.

However, using a simple test of copying an entire home directory, I would see regular errors of the form:

Jan  2 16:18:49 pearson vscand: [ID 940187 daemon.error] Error receiving data from Scan Engine: Error 0

These were accompanied by an error returned to the application and an increase in the error count reported by vscanadm stats.

From the source it was clear that recv1 was returning 0, indicating that the virus scan engine had closed the connection. What was not clear was why.

So I ran this D script to see if the contents of the buffer being read would give a clue:


root@pearson:/root# cat vscan.d 
pid$target::vs_icap_readline:entry
{
        self->buf = arg1;
        self->buflen = arg2;
}
syscall::recv:return /self->buf && arg1 == 0/
{
        this->b = copyin(self->buf, self->buflen);
        trace(stringof(this->b));
}
pid$target::vs_icap_readline:return
/self->buf/
{
        self->buf=0;
        self->buflen=0;
}
root@pearson:/root# 

root@pearson:/root# dtrace -s  vscan.d -p $(pgrep vscand)
dtrace: script 'vscan.d' matched 3 probes
CPU     ID                    FUNCTION:NAME
  1   4344                      recv:return 
             0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f  0123456789abcdef
         0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

The clue was that the error came back on the very first byte read: the virus scan engine was deliberately closing the connection after handling a request, which, since it had negotiated "keep-alive", it should not do.

The solution2 was to set the MaxKeepAliveRequests entry in the c-icap.conf file to -1, disabling this feature.
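For reference, the change is a one-line edit to c-icap.conf (the directive takes a request count; -1 disables the limit):

```
# c-icap.conf: allow an unlimited number of requests on a
# keep-alive connection instead of closing after a fixed count
MaxKeepAliveRequests -1
```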

1Why is recv being used to read one byte at a time? Insane; a bug will be filed.

2It is in my opinion a bug that the vscand can't cope gracefully with this. Another bug will be filed.

Wednesday Dec 02, 2009

Tracing getipnodesXXXX calls

When I wrote the D script to decode gethostbyname() and gethostbyaddr() library calls I fully intended to proactively write a script to do getipnodebyname() and getipnodebyaddr(), and for that matter all the getXbyY routines. However, that spare five minutes never arrived, so it was only today, while investigating a real issue, that I needed a script to decode getipnodebyaddr(). Fortunately, taking the other script and modifying it to work with getipnodebyXXXX() was not that hard.

It can only decode 5 addresses per call before it runs out of space for DIF, as it has to do more than the gethostbyXXXX() version: it has to cope with both IPv4 and IPv6 addresses:

dhcp-10-18-9-247-gmp03# dtrace -32 -CZs getipnodebyXXXX.d -c "getent ipnodes ibm.com"
129.42.17.103	ibm.com
129.42.18.103	ibm.com
129.42.16.103	ibm.com
Look up: ibm.com:
Host: ibm.com
	h_address[0]: 0:0:0:0:0:0:0:0:0:0:ff:ff:81:2a:11:67
	h_address[1]: 0:0:0:0:0:0:0:0:0:0:ff:ff:81:2a:12:67
	h_address[2]: 0:0:0:0:0:0:0:0:0:0:ff:ff:81:2a:10:67

dhcp-10-18-9-247-gmp03# dtrace -32 -CZs getipnodebyXXXX.d -c "smbadm list"  
[*] [CJG]
[*] [cjg.uk.sun.com]
	[+x6250a-vbox10.cjg.uk.sun.com] [10.18.8.140]
[*] [CJG] [S-1-5-21-1782774743-1218725973-889210084]
[.] [DHCP-10-18-9-24] [S-1-5-21-277162072-319636157-2443625992]
Look up: x6250a-vbox10:
Host: x6250a-vbox10.cjg.uk.sun.com
	h_address[0]: 10.18.8.140

The script is here. Feel free to use it.

Thursday Nov 12, 2009

The Kings of Computing use dtrace!

I've said many times that dtrace is not just a wonderful tool for developers and performance gurus. The Kings of Computing, who are of course System Admins, also find it really useful.

There is an ancient version of make, called Parallel make, that occasionally suffers from a bug (1223984) where it gets into a loop like this:

waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD
alarm(0)					= 30
alarm(30)					= 0
waitid(P_ALL, 0, 0x08047270, WEXITED|WTRAPPED)	Err#10 ECHILD

This will then consume a CPU and the user's CPU shares. The application is never going to be fixed, so the normal advice is not to use it. However, since it can be NFS mounted from anywhere, I can't reliably delete all copies of it, so occasionally we see runaway processes on our build server.

It turns out this is a snip to fix with dtrace. Simply look for cases where the wait system call returns an error with errno set to ECHILD (10); if that happens more than 20 times in a row for the same process, and that process does not call fork, then stop the process.

The script is simple enough for me to just do it on the command line:


# dtrace -wn 'syscall::waitsys:return / arg1 <= 0 && 
execname == "make.bin" && errno == 10  && waitcount[pid]++ > 20 / {

	stop();

	printf("uid %d pid %d", uid, pid) }

syscall::forksys:return / arg1 > 0 / { waitcount[pid] = 0 }'
dtrace: description 'syscall::waitsys:return ' matched 2 probes
dtrace: allowing destructive actions
CPU     ID                    FUNCTION:NAME
  2  20588                   waitsys:return uid 36580 pid 29252
  3  20588                   waitsys:return uid 36580 pid 2522
  5  20588                   waitsys:return uid 36580 pid 28663
  7  20588                   waitsys:return uid 36580 pid 29884
 10  20588                   waitsys:return uid 36580 pid 941
 15  20588                   waitsys:return uid 36580 pid 1098

This was way easier than messing around with prstat, truss and pstop!

Thursday Sep 10, 2009

Using dtrace to find double frees

Some of the most common failures that result in customer calls are misuses of the memory allocation routines: malloc, calloc, realloc, valloc, memalign and free. There are many ways in which you can misuse these routines and the data they return, and the resulting failures often occur within the routines even though the problem is with the calling program.

I'm not going to discuss here all the ways you can abuse these routines, but will look at one particular type of abuse: the double free. When you allocate memory using these routines it is your responsibility to free it again so that the memory does not “leak”. However you must only free the memory once. Freeing it more than once is a bug and the results are undefined.

This very simple code has a double free:

#include <stdlib.h>

void
doit(int n, char *x)
{
        if (n-- == 0)
                free(x);
        else
                doit(n,x);
}
int
main(int argc, char **argv)
{
        char *x;
        char *y;

        x = malloc(100000);
        
        doit(3, x);
        doit(10, x);
}

and if you compile and run that program all appears well.


However a more realistic program could go on to fail in interesting ways, leaving you with the difficult task of finding the culprit. It is for that reason that libumem has good checking for double frees:


: exdev.eu FSS 26 $;  LD_PRELOAD=libumem.so.1 /home/cg13442/lang/c/double_free
Abort(coredump)
: exdev.eu FSS 27 $; mdb core
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::status
debugging core file of double_free (64-bit) from exdev
file: /home/cg13442/lang/c/double_free
initial argv: /home/cg13442/lang/c/double_free
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=18108 uid=14442 code=-1
> ::umem_status
Status:         ready and active
Concurrency:    16
Logs:           (inactive)
Message buffer:
free(e53650): double-free or invalid buffer
stack trace:
libumem.so.1'umem_err_recoverable+0xa6
libumem.so.1'process_free+0x17e
libumem.so.1'free+0x16
double_free'doit+0x3a
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'doit+0x4d
double_free'main+0x100
double_free'_start+0x6c

> 

Good though this is, there are situations where libumem is not used and others where it can't be used1. In those cases it is useful to be able to do this with dtrace, and anyway it is always nice to have more than one arrow in your quiver:


: exdev.eu FSS 54 $; /usr/sbin/dtrace -qs doublefree.d -c /home/cg13442/lang/c/double_free 2> /dev/null
Hit Control-C to stop tracing
double free?
	Address: 0xe53650
	Previous free at: 2009 Jun 23 12:23:22, LWP -1
	This     free at: 2009 Jun 23 12:23:22, LWP -1
	Frees 42663 nsec apart
	Allocated 64474 nsec ago by LWP -1

              libumem.so.1`free
              double_free`doit+0x3a
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d

: exdev.eu FSS 56 $; 

If run as root you can get the real LWP values that did the allocation and the frees:

: exdev.eu FSS 63 $; pfexec /usr/sbin/dtrace -qs doublefree.d -c /home/cg1344>
Hit Control-C to stop tracing
double free?
	Address: 0xe53650
	Previous free at: 2009 Jun 23 14:21:29, LWP 1
	This     free at: 2009 Jun 23 14:21:29, LWP 1
	Frees 27543 nsec apart
	Allocated 39366 nsec ago by LWP 1

              libumem.so.1`free
              double_free`doit+0x3a
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d
              double_free`doit+0x4d

: exdev.eu FSS 64 $;

Here is the script in all its glory.

#!/usr/sbin/dtrace -qs

BEGIN
{
	printf("Hit Control-C to stop tracing\n");
}
ERROR 
/ arg4 == DTRACEFLT_KPRIV || arg4 == DTRACEFLT_UPRIV /
{
	lwp = -1;
}

pid$target::realloc:entry,
pid$target::free:entry
{
	self->addr = arg0;
	self->recurse++;
}

pid$target::realloc:return,
pid$target::free:return
/ self->recurse /
{
	self->recurse--;
	self->addr = 0;
}

pid$target::malloc:entry,
pid$target::memalign:entry,
pid$target::valloc:entry,
pid$target::calloc:entry,
pid$target::realloc:entry,
pid$target::free:entry
/ lwp != -1 && self->lwp == 0 /
{
	self->lwp = curlwpsinfo->pr_lwpid;
}

pid$target::malloc:entry,
pid$target::calloc:entry,
pid$target::realloc:entry,
pid$target::memalign:entry,
pid$target::valloc:entry,
pid$target::free:entry
/ self->lwp == 0 /
{
	self->lwp = lwp;
}

pid$target::malloc:return,
pid$target::calloc:return,
pid$target::realloc:return,
pid$target::memalign:return,
pid$target::valloc:return
{
	alloc_time[arg1] = timestamp;
	allocated[arg1] = 1;
	free_walltime[arg1] = 0LL;
	free_time[arg1] = 0LL;
	free_lwpid[arg1] = 0;
	alloc_lwpid[arg1] = self->lwp;
	self->lwp = 0;
}

pid$target::realloc:entry,
pid$target::free:entry
/ self->recurse == 1 && alloc_time[arg0] && allocated[arg0] == 0 /
{
	printf("double free?\n");
	printf("\tAddress: 0x%p\n", arg0);
	printf("\tPrevious free at: %Y, LWP %d\n", free_walltime[arg0],
		free_lwpid[arg0]);
	printf("\tThis     free at: %Y, LWP %d\n", walltimestamp,
		self->lwp);
	printf("\tFrees %d nsec apart\n", timestamp - free_time[arg0]);
	printf("\tAllocated %d nsec ago by LWP %d\n",
		timestamp - alloc_time[arg0], alloc_lwpid[arg0]);

	ustack(10);
}

pid$target::realloc:entry,
pid$target::free:entry
/ self->recurse == 1 && alloc_time[arg0] && allocated[arg0] == 1 /
{
	free_walltime[arg0] = walltimestamp;
	free_time[arg0] = timestamp;
	free_lwpid[arg0] = self->lwp;

	allocated[arg0] = 0;
}

pid$target::free:entry
/self->lwp && self->recurse == 0/
{
	self->lwp = 0;
}

1In most of the cases where it “can't” be used, it is because it finds fatal problems early in the start-up of the application. The application writers then make the bizarre claim that this is a problem with libumem and tell you it is not supported with their app. In fact the problem is with the application.

Wednesday Sep 09, 2009

Understanding iostat

1Iostat has been around for years and, until dtrace came along and allowed us to look more deeply into the kernel, was the tool for analysing how the IO subsystem was working in Solaris. However, interpreting its output has proved to cause problems in the past.

First if you are looking at latency issues it is vital that you use the smallest time quantum to iostat you can, which as of Solaris 10 is 1 second. Here is a sample of some output produced from “iostat -x 1”:

                  extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 1026.5    1.6 1024.5  0.0 25.6   24.8   0  23 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 


The first thing to draw your attention to is the Column “%b” which the manual tells you is:

%b percent of time the disk is busy (transactions in progress)

So in this example the disk was “busy”, i.e. had at least one transaction (command) in progress, for 23% of the time period. That is 0.23 seconds, as the time period was 1 second.

Now look at the “actv” column. Again the manual says:

actv average number of transactions actively being serviced (removed from the queue but not yet completed)
This is the number of I/O operations accepted, but not yet serviced, by the device.
In this example the average number of transactions outstanding for this time quantum was 25.6. Now here is the bit that is so often missed. Since we know that all the transactions actually took place within 0.23 seconds, and were not evenly spread across the full second, the average queue depth when busy was 100/23 * 25.6, or 111.3. Thanks to dtrace and this D script you can see the actual IO pattern2:

Even having done the maths, iostat smooths out peaks in the IO pattern and thus under-reports the peak number of transactions as 103.0 when the true value is 200.

The same is true for the bandwidth. The iostat output above reports 1031.5 transactions a second (r/s + w/s), but again this does not take into account that all those IO requests happened in 0.23 seconds. So the true figure for the device would be 1031.5 * 100/23, which is 4485 transactions/sec.
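That scaling can be sanity-checked with a few lines of Python (a sketch using the numbers from the first iostat sample above; 100/%b is the busy-time scaling factor):

```python
# Values from the iostat sample: %b = 23, actv = 25.6,
# r/s + w/s = 5.0 + 1026.5 transactions per second.
busy_pct = 23.0
actv = 25.6
iops = 5.0 + 1026.5

# The device was only busy for 0.23 s of the 1 s sample, so scale
# the per-second averages up to what they were while it was active.
scale = 100.0 / busy_pct

queue_when_busy = scale * actv   # average queue depth while busy
true_iops = scale * iops         # transaction rate while busy

print(round(queue_when_busy, 1))  # 111.3
print(round(true_iops))           # 4485
```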
If we up the load on the disk a bit then you can conclude more from the iostat:
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       5.0 2155.7    1.6 2153.7 30.2 93.3   57.1  29  45 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0 3989.1    0.0 3989.1 44.6 157.2   50.6  41  83 
                 extended device statistics                 
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
sd3       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 

Since the %w column is non-zero, and from the manual %w is:

%w percent of time there are transactions waiting for service (queue non-empty)

This is telling us that the device's active queue was full. So on the third line of the above output the device's queue was full for 0.41 seconds. Since the queue depth is quite easy to find out3, and in this case was 256, you can deduce that the queue depth for those 0.41 seconds was 256. Thus the average for the remaining 0.59 seconds was (157.2 - (0.41 * 256))/0.59, which is 88.5. The graph of the dtrace results tells a different story:


These examples demonstrate what can happen if your application dumps a large number of transactions onto a storage device: the throughput will be fine, and the iostat data can look OK, but if the granularity of the samples is not close to your latency requirement any problem can be hidden by the statistical nature of iostat.
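As a cross-check, the queue-depth deduction above can be reproduced numerically (a sketch; 256 is the un_throttle queue depth found with mdb as described in footnote 3):

```python
# Third line of the heavier iostat sample: %w = 41, actv = 157.2.
wait_frac = 0.41   # fraction of the second the active queue was full
actv = 157.2       # average outstanding transactions over the second
throttle = 256     # device queue depth (un_throttle)

# For 0.41 s the queue sat pinned at 256, so the average over the
# remaining 0.59 s must have been:
rest = (actv - wait_frac * throttle) / (1.0 - wait_frac)
print(round(rest, 1))  # 88.5
```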

1Apologies to those who saw a draft of this on my blog briefly.

2The application creating the IO attempts to keep 200 transactions in the disk at all times. It is interesting to see that it fails, as it does not get notification of the completion of the IO until all or nearly all the outstanding transactions have completed.

3This command will do it for all the devices on your system:

echo '*sd_state::walk softstate | ::print -d -at "struct sd_lun" un_throttle' | pfexec mdb -k

however be warned: the throttle is dynamic, so dtrace gives the real answer.

Thursday Aug 06, 2009

Monitoring mounts

Sometimes in the course of being a system administrator it is useful to know what file systems are being mounted, when, and which mounts fail and why. While you can turn on automounter verbose mode, that only answers the question for the automounter.

Dtrace makes answering the general question a snip:

: exdev.eu FSS 24 $; cat mount_monitor.d                         
#!/usr/sbin/dtrace -qs

fbt::domount:entry
/ args[1]->dir /
{
        self->dir = args[1]->flags & 0x8 ? args[1]->dir : 
              copyinstr((intptr_t)args[1]->dir);
}
fbt::domount:return
/ self->dir != 0 /
{
        
        printf("%Y domount ppid %d, %s %s pid %d -> %s", walltimestamp, 
              ppid, execname, self->dir, pid, arg1 == 0 ? "OK" : "failed");
}
fbt::domount:return
/ self->dir != 0 && arg1 == 0/
{
        printf("\n");
        self->dir = 0;
}
fbt::domount:return
/ self->dir != 0 && arg1 != 0/
{
        printf("errno %d\n", arg1);
        self->dir = 0;
}
: exdev.eu FSS 25 $; pfexec /usr/sbin/dtrace -qs  mount_monitor.d
2009 Aug  6 12:57:57 domount ppid 0, sched /share/consoles pid 0 -> OK
2009 Aug  6 12:57:59 domount ppid 0, sched /share/chroot pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/newsrc pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/build2 pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/chris_at_play pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws_eng pid 0 -> OK
2009 Aug  6 12:58:00 domount ppid 0, sched /share/ws pid 0 -> OK
2009 Aug  6 12:58:03 domount ppid 0, sched /home/tx pid 0 -> OK
2009 Aug  6 12:58:04 domount ppid 0, sched /home/fl pid 0 -> OK
2009 Aug  6 12:58:05 domount ppid 0, sched /home/socal pid 0 -> OK
2009 Aug  6 12:58:07 domount ppid 0, sched /home/bur pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/docs pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/browser pid 0 -> OK
2009 Aug  6 12:58:23 domount ppid 0, sched /net/e2big.uk/export/install/cdroms pid 0 -> OK
2009 Aug  6 12:59:45 domount ppid 8929, Xnewt /tmp/.X11-pipe/X6 pid 8935 -> OK

In particular, the last line, if repeated often, can give you a clue that things are not right.

Friday Jul 31, 2009

Adding a Dtrace provider to the kernel

Since writing scsi.d I have been pondering whether there should really be a scsi dtrace provider that allows you to do all that scsi.d does and more. The push of 6797025 both removed the main reason for not doing this and gave impetus to do it, as scsi.d needed incompatible changes to use the new return function as the return “probe”.

This work is very much work in progress and may or may not see the light of day due to some other issues around scsi addressing. However, I thought I would document how I added a kernel dtrace provider so that, if you want to do the same, you don't have to do so much searching1.

Adding the probes themselves is simplicity itself using the DTRACE_PROBEN() macros. Following the convention I added this macro:

#define	DTRACE_SCSI_2(name, type1, arg1, type2, arg2)			\
	DTRACE_PROBE2(__scsi_##name, type1, arg1, type2, arg2);

to usr/src/uts/common/sys/sdt.h. Then after including <sys/sdt.h> in each file I put this macro in each of the places I wanted my probes:

 	DTRACE_SCSI_2(transport, struct scsi_pkt *, pkt,
 	    struct scsi_address *, P_TO_ADDR(pkt))

The bit that took a while to find was how to turn these into a provider. To do that edit the file “usr/src/uts/common/dtrace/sdt_subr.c” and create the attribute structure2:

 static dtrace_pattr_t scsi_attr = {
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 };

and add it to the sdt_providers array:


	{ "scsi", "__scsi_", &scsi_attr, 0 },

then add the probes to the sdt_args array:

	{ "scsi", "transport", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "transport", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},
	{ "scsi", "complete", 0, 0, "struct scsi_pkt *", "scsi_pktinfo_t *"},
	{ "scsi", "complete", 1, 1, "struct scsi_address *", "scsi_addrinfo_t *"},

Finally you need to create a file containing the definitions of the output structures, scsi_pktinfo_t and scsi_addrinfo_t and define translators for them. That goes into /usr/lib/dtrace and I called mine scsa.d (there is already one called scsi.d).

/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma D depends_on module scsi
#pragma D depends_on provider scsi

inline char TEST_UNIT_READY = 0x0;
#pragma D binding "1.0" TEST_UNIT_READY
inline char REZERO_UNIT_or_REWIND = 0x0001;
#pragma D binding "1.0" REZERO_UNIT_or_REWIND

inline char SCSI_HBA_ADDR_COMPLEX = 0x0040;
#pragma D binding "1.0" SCSI_HBA_ADDR_COMPLEX

typedef struct scsi_pktinfo {
	caddr_t pkt_ha_private;
	uint_t	pkt_flags;
	int	pkt_time;
	uchar_t *pkt_scbp;
	uchar_t *pkt_cdbp;
	ssize_t pkt_resid;
	uint_t	pkt_state;
	uint_t 	pkt_statistics;
	uchar_t pkt_reason;
	uint_t	pkt_cdblen;
	uint_t	pkt_tgtlen;
	uint_t	pkt_scblen;
} scsi_pktinfo_t;

#pragma D binding "1.0" translator
translator scsi_pktinfo_t  < struct scsi_pkt *P > {
	pkt_ha_private = P->pkt_ha_private;
	pkt_flags = P->pkt_flags;
	pkt_time = P->pkt_time;
	pkt_scbp = P->pkt_scbp;
	pkt_cdbp = P->pkt_cdbp;
	pkt_resid = P->pkt_resid;
	pkt_state = P->pkt_state;
	pkt_statistics = P->pkt_statistics;
	pkt_reason = P->pkt_reason;
	pkt_cdblen = P->pkt_cdblen;
	pkt_tgtlen = P->pkt_tgtlen;
	pkt_scblen = P->pkt_scblen;
};

typedef struct scsi_addrinfo {
	struct scsi_hba_tran	*a_hba_tran;
	ushort_t a_target;	/* ua target */
	uchar_t	 a_lun;		/* ua lun on target */
	struct scsi_device *a_sd;
} scsi_addrinfo_t;

#pragma D binding "1.0" translator
translator scsi_addrinfo_t  < struct scsi_address *A > {
	a_hba_tran = A->a_hba_tran;
	a_target = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_target;
	a_lun = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_lun;
	a_sd = (A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		A->a.a_sd : 0;
};

Again, this is just enough to get going so that I can see and use the probes:

jack@v4u-2500b-gmp03:~$ pfexec dtrace -P scsi -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1303       scsi              scsi                    scsi_transport transport
 1313       scsi              scsi                 scsi_hba_pkt_comp complete
jack@v4u-2500b-gmp03:~$ 

While this all works well for parallel scsi, getting the address of devices on fibre is not clear to me. If you have any suggestions, I'm all ears.

1If there is such a document already in existence then please add a comment; I will just wish I could have found it.

2These may not be the right attributes, but they get me to the point where it compiles and can be used in a PoC.

Tuesday Jun 30, 2009

Using dtrace to track down memory leaks

I've been working with a customer to try and find a memory “leak” in their application. Many things have been tried: libumem, and the mdb ::findleaks command, all with no success.

So I was, as I am sure others before me have been, pondering whether you could use dtrace to do this. Well, I think you can. I have a script that puts probes into malloc et al. and counts how often they are called by each thread, and likewise how often free is called.

Then, in the entry probe of each function in the target application, note how many calls there have been to the allocators and how many to free (and, with a bit of care, realloc). Then in the return probe compare the number of calls to allocate and free with the saved values and aggregate the results. The principle is that you find the routines that result in allocations they don't clean up. This should give you a list of functions that are possible leakers, which you can then investigate1.
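As a sketch of the idea (this is not the real allocated.d: the counting here is global rather than per thread, realloc is ignored, and a.out stands for the main object of the target process):

```d
/*
 * Simplified sketch only.  Count allocator and free calls, snapshot
 * the counts on entry to every function in the main object, and
 * charge the difference to that function on return.
 */
pid$target:libc.so.1:malloc:entry,
pid$target:libc.so.1:calloc:entry,
pid$target:libc.so.1:valloc:entry,
pid$target:libc.so.1:memalign:entry
{
	allocs++;
}

pid$target:libc.so.1:free:entry
/ arg0 /
{
	frees++;
}

pid$target:a.out::entry
{
	self->a[ustackdepth] = allocs;
	self->f[ustackdepth] = frees;
}

pid$target:a.out::return
{
	@allocations[probefunc] = sum(allocs - self->a[ustackdepth]);
	@freecount[probefunc] = sum(frees - self->f[ustackdepth]);
	@mismatch[probefunc] = sum((allocs - self->a[ustackdepth]) -
	    (frees - self->f[ustackdepth]));
}
```

The real allocated.d is more careful: it counts per thread, handles realloc, and restricts the probes to the objects named by TARGET_OBJ.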

Using the same technique for getting dtrace to “follow fork” that I described here, I ran this up on diskomizer, a program that I understand well and am reasonably sure does not have systemic memory leaks. The dtrace script reports three sets of results.

  1. A count of how many times each routine and its descendants have called a memory allocator.

  2. A count of how many times each routine and its descendants have called free, or realloc with a non-NULL pointer as the first argument.

  3. The difference between the two numbers above.

Then with a little bit of nawk to remove all the functions for which the counts are zero gives:

# /usr/sbin/dtrace -Z -wD TARGET_OBJ=diskomizer2 -o /tmp/out-us \
	-s /tmp/followfork.d \
	-Cs /tmp/allocated.d -c \
         "/opt/SUNWstc-diskomizer/bin/sparcv9/diskomizer -f /devs -f background \
          -o background=0 -o SECONDS_TO_RUN=1800"
dtrace: failed to compile script /tmp/allocated.d: line 20: failed to create entry probe for 'realloc': No such process
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
dtrace: buffer size lowered to 25m
 
# nawk '$1 != 0 { print  $0 }' < /tmp/out.3081
allocations
           1 diskomizer`do_dev_control
           1 diskomizer`set_dev_state
           1 diskomizer`set_state
           3 diskomizer`report_exit_reason
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           6 diskomizer`update_aio_read_stats
           7 diskomizer`cancel_all_io
           9 diskomizer`update_aio_write_stats
          13 diskomizer`cleanup
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          80 diskomizer`my_calloc
         240 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         449 diskomizer`handle_write
         606 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2561 diskomizer`init_buf
        2561 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66255 diskomizer`handle_write_then_read
      124888 diskomizer`init_read_buf
      124897 diskomizer`do_new_read
      127460 diskomizer`expect_signal
freecount
           1 diskomizer`expect_signal
           3 diskomizer`report_exit_reason
           4 diskomizer`close_and_free_paths
           6 diskomizer`update_aio_read_stats
           9 diskomizer`update_aio_write_stats
          11 diskomizer`cancel_all_io
          15 diskomizer`update_aio_time_stats
          15 diskomizer`update_time_stats
          17 diskomizer`cleanup
         160 diskomizer`init_read
         318 diskomizer`do_restart_stopped_devices
         318 diskomizer`start_io
         442 diskomizer`handle_write
         599 diskomizer`do_new_write
        2125 diskomizer`handle_read_then_write
        2560 diskomizer`init_buf
        2560 diskomizer`set_io_len
       58491 diskomizer`handle_read
       66246 diskomizer`handle_write_then_read
      124888 diskomizer`do_new_read
      124888 diskomizer`init_read_buf
      127448 diskomizer`cancel_expected_signal
mismatch_count
     -127448 diskomizer`cancel_expected_signal
          -4 diskomizer`cancel_all_io
          -4 diskomizer`cleanup
          -4 diskomizer`close_and_free_paths
           1 diskomizer`do_dev_control
           1 diskomizer`init_buf
           1 diskomizer`set_dev_state
           1 diskomizer`set_io_len
           1 diskomizer`set_state
           6 diskomizer`alloc_time_str
           6 diskomizer`alloc_time_str_fmt
           7 diskomizer`do_new_write
           7 diskomizer`handle_write
           9 diskomizer`do_new_read
           9 diskomizer`handle_write_then_read
          80 diskomizer`init_read
          80 diskomizer`my_calloc
      127459 diskomizer`expect_signal

#

From the above you can see that there are two functions that create and free the majority of the allocations, and their counts almost match, which is expected as they are effectively constructor and destructor for each other. The small mismatch is not unexpected in this context.

More important here, however, are the vast number of functions that are not listed at all, either because they and their children make no calls to the memory allocator or because their allocations and frees match exactly. Those are the functions that we have just ruled out.

From here it is easy to drill down on the functions that interest you, i.e. the ones with unbalanced allocations.
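
If you capture the listings to a file, a little awk can do the comparison for you. This is only a sketch under assumptions: the "alloc"/"free" prefixed input below is a hand-made stand-in for the real dtrace output, which you would first massage into this shape.

```shell
# Sum the allocation and free counts per function and print only the
# functions whose totals do not cancel out (the "interesting" ones).
awk '
    /^alloc/ { a[$3] += $2; next }
    /^free/  { f[$3] += $2 }
    END {
        for (fn in a)
            if (a[fn] != f[fn])
                printf("%s unbalanced by %d\n", fn, a[fn] - f[fn])
    }' <<'EOF'
alloc 160 diskomizer`init_read
free 160 diskomizer`init_read
alloc 127460 diskomizer`expect_signal
free 127448 diskomizer`expect_signal
EOF
```

With the sample input above only expect_signal survives the filter, which matches the small mismatch seen in the real listings.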


I've uploaded the files allocated.d and followfork.d so you can see the details. If you find them useful then let me know.

1Unfortunately the list is longer than you would want: on SPARC it includes any functions that don't have their own stack frame, due to the way dtrace calculates ustackdepth, which the script makes use of.

2The script only probes particular objects, in this case the main diskomizer binary, but you can limit it to a particular library or even a particular set of entry points based on name if you edit the script.

Saturday Jun 27, 2009

Follow fork for dtrace pid provider?

There is an ongoing request to have follow-fork functionality for the dtrace pid provider, but so far no one has stepped up to the plate for that RFE. In the meantime my best workaround is this:

cjg@brompton:~/lang/d$ cat followfork.d
proc:::start
/ppid == $target/
{
	stop();
	printf("fork %d\n", pid);
	system("dtrace -qs child.d -p %d", pid);
}
cjg@brompton:~/lang/d$ cat child.d
pid$target::malloc:entry
{
	printf("%d %s:%s %d\n", pid, probefunc, probename, ustackdepth)
}
cjg@brompton:~/lang/d$ pfexec /usr/sbin/dtrace -qws followfork.d -s child.d -p 26758
26758 malloc:entry 22
26758 malloc:entry 15
26758 malloc:entry 18
26758 malloc:entry 18
26758 malloc:entry 18
fork 27548
27548 malloc:entry 7
27548 malloc:entry 7
27548 malloc:entry 18
27548 malloc:entry 16
27548 malloc:entry 18

Clearly you can have the child script do whatever you wish.

Better solutions are welcome!

Saturday Mar 28, 2009

snapshot on unlink?

This thread on OpenSolaris made me wonder how hard it would be to take a snapshot before any file is deleted. It turns out that using dtrace it is not hard at all: dtrace monitors the unlink and unlinkat calls, and a short script takes the snapshots:

#!/bin/ksh93

function snapshot
{
	eval $(print x=$2)

	until [[ "$x" == "/" || -d "$x/.zfs/snapshot" ]]
	do
		x="${x%/*}"
	done
	if [[ "$x" == "/" || "$x" == "/tmp" ]]
	then
		return
	fi
	if [[ -d "$x/.zfs/snapshot" ]]
	then
		print mkdir "$x/.zfs/snapshot/unlink_$1"
		pfexec mkdir "$x/.zfs/snapshot/unlink_$1"
	fi
}
function parse
{
	eval $(print x=$4)
	
	if [[ "${x%%/*}" == "" ]]
	then
		snapshot $1 "$2$4"
	else
		snapshot $1 "$2$3/$4"
	fi
}
pfexec dtrace -wqn 'syscall::fsat:entry /pid != '$$' && uid > 100 && arg0 == 5/ {
	printf("%d %d \"%s\" \"%s\" \"%s\"\n",
	pid, walltimestamp, root, cwd, copyinstr(arg2)); stop()
}
syscall::unlink:entry /pid != '$$' && uid > 100 / {
	printf("%d %d \"%s\" \"%s\" \"%s\"\n",
	pid, walltimestamp, root, cwd, copyinstr(arg0)); stop()
}' | while read pid timestamp root cwd file
do
	print prun $pid
	parse $timestamp $root $cwd $file
	pfexec prun $pid
done

Now this is just a Saturday night proof of concept; it should be noted that it has a significant performance impact and single-threads all calls to unlink.
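
The heart of the snapshot() function is the loop that walks up the path until it finds a filesystem root. Here is that logic pulled out as a stand-alone sketch you can run against an ordinary directory tree; the /tmp tree built here is a fake stand-in for a ZFS mount, and find_snapdir is a name I have made up for this illustration.

```shell
# Strip path components until a directory containing .zfs/snapshot is
# found, or give up when we reach /.
find_snapdir() {
        x=$1
        until [ "$x" = "/" ] || [ -d "$x/.zfs/snapshot" ]
        do
                x=${x%/*}
                [ -n "$x" ] || x=/
        done
        echo "$x"
}

# Build a fake tree: "home" plays the part of a ZFS filesystem root.
mkdir -p /tmp/snapdemo/home/.zfs/snapshot /tmp/snapdemo/home/cjg/docs
find_snapdir /tmp/snapdemo/home/cjg/docs    # prints /tmp/snapdemo/home
```

In the real script the result is then used as the target for `mkdir "$x/.zfs/snapshot/unlink_$timestamp"`, which is what actually creates the snapshot.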

Also you end up with lots of snapshots:

cjg@brompton:~$ zfs list -t snapshot -o name,used | grep unlink

rpool/export/home/cjg@unlink_1238270760978466613  11.9M
rpool/export/home/cjg@unlink_1238275070771981963  59K
rpool/export/home/cjg@unlink_1238275074501904526  59K
rpool/export/home/cjg@unlink_1238275145860458143  34K
rpool/export/home/cjg@unlink_1238275168440000379  197K
rpool/export/home/cjg@unlink_1238275233978665556  197K
rpool/export/home/cjg@unlink_1238275295387410635  197K
rpool/export/home/cjg@unlink_1238275362536035217  197K
rpool/export/home/cjg@unlink_1238275429554657197  136K
rpool/export/home/cjg@unlink_1238275446884300017  350K
rpool/export/home/cjg@unlink_1238275491543380576  197K
rpool/export/home/cjg@unlink_1238275553842097361  197K
rpool/export/home/cjg@unlink_1238275643490236001  63K
rpool/export/home/cjg@unlink_1238275644670212158  63K
rpool/export/home/cjg@unlink_1238275646030183268  0
rpool/export/home/cjg@unlink_1238275647010165407  0
rpool/export/home/cjg@unlink_1238275648040143427  54K
rpool/export/home/cjg@unlink_1238275649030124929  54K
rpool/export/home/cjg@unlink_1238275675679613928  197K
rpool/export/home/cjg@unlink_1238275738608457151  198K
rpool/export/home/cjg@unlink_1238275800827304353  57.5K
rpool/export/home/cjg@unlink_1238275853116324001  32.5K
rpool/export/home/cjg@unlink_1238275854186304490  53.5K
rpool/export/home/cjg@unlink_1238275862146153573  196K
rpool/export/home/cjg@unlink_1238275923255007891  55.5K
rpool/export/home/cjg@unlink_1238275962114286151  35.5K
rpool/export/home/cjg@unlink_1238275962994267852  56.5K
rpool/export/home/cjg@unlink_1238275984723865944  55.5K
rpool/export/home/cjg@unlink_1238275986483834569  29K
rpool/export/home/cjg@unlink_1238276004103500867  49K
rpool/export/home/cjg@unlink_1238276005213479906  49K
rpool/export/home/cjg@unlink_1238276024853115037  50.5K
rpool/export/home/cjg@unlink_1238276026423085669  52.5K
rpool/export/home/cjg@unlink_1238276041792798946  50.5K
rpool/export/home/cjg@unlink_1238276046332707732  55.5K
rpool/export/home/cjg@unlink_1238276098621721894  66K
rpool/export/home/cjg@unlink_1238276108811528303  69.5K
rpool/export/home/cjg@unlink_1238276132861080236  56K
rpool/export/home/cjg@unlink_1238276166070438484  49K
rpool/export/home/cjg@unlink_1238276167190417567  49K
rpool/export/home/cjg@unlink_1238276170930350786  57K
rpool/export/home/cjg@unlink_1238276206569700134  30.5K
rpool/export/home/cjg@unlink_1238276208519665843  58.5K
rpool/export/home/cjg@unlink_1238276476484690821  54K
rpool/export/home/cjg@unlink_1238276477974663478  54K
rpool/export/home/cjg@unlink_1238276511584038137  60.5K
rpool/export/home/cjg@unlink_1238276519053902818  71K
rpool/export/home/cjg@unlink_1238276528213727766  62K
rpool/export/home/cjg@unlink_1238276529883699491  47K
rpool/export/home/cjg@unlink_1238276531683666535  3.33M
rpool/export/home/cjg@unlink_1238276558063169299  35.5K
rpool/export/home/cjg@unlink_1238276559223149116  62.5K
rpool/export/home/cjg@unlink_1238276573552877191  35.5K
rpool/export/home/cjg@unlink_1238276584602668975  35.5K
rpool/export/home/cjg@unlink_1238276586002642752  53K
rpool/export/home/cjg@unlink_1238276586522633206  51K
rpool/export/home/cjg@unlink_1238276808718681998  216K
rpool/export/home/cjg@unlink_1238276820958471430  77.5K
rpool/export/home/cjg@unlink_1238276826718371992  51K
rpool/export/home/cjg@unlink_1238276827908352138  51K
rpool/export/home/cjg@unlink_1238276883227391747  198K
rpool/export/home/cjg@unlink_1238276945366305295  58.5K
rpool/export/home/cjg@unlink_1238276954766149887  32.5K
rpool/export/home/cjg@unlink_1238276955946126421  54.5K
rpool/export/home/cjg@unlink_1238276968985903108  52.5K
rpool/export/home/cjg@unlink_1238276988865560952  31K
rpool/export/home/cjg@unlink_1238277006915250722  57.5K
rpool/export/home/cjg@unlink_1238277029624856958  51K
rpool/export/home/cjg@unlink_1238277030754835625  51K
rpool/export/home/cjg@unlink_1238277042004634457  51.5K
rpool/export/home/cjg@unlink_1238277043934600972  52K
rpool/export/home/cjg@unlink_1238277045124580763  51K
rpool/export/home/cjg@unlink_1238277056554381122  51K
rpool/export/home/cjg@unlink_1238277058274350998  51K
rpool/export/home/cjg@unlink_1238277068944163541  59K
rpool/export/home/cjg@unlink_1238277121423241127  32.5K
rpool/export/home/cjg@unlink_1238277123353210283  53.5K
rpool/export/home/cjg@unlink_1238277136532970668  52.5K
rpool/export/home/cjg@unlink_1238277152942678490  0
rpool/export/home/cjg@unlink_1238277173482320586  0
rpool/export/home/cjg@unlink_1238277187222067194  49K
rpool/export/home/cjg@unlink_1238277188902043005  49K
rpool/export/home/cjg@unlink_1238277190362010483  56K
rpool/export/home/cjg@unlink_1238277228691306147  30.5K
rpool/export/home/cjg@unlink_1238277230021281988  51.5K
rpool/export/home/cjg@unlink_1238277251960874811  57K
rpool/export/home/cjg@unlink_1238277300159980679  30.5K
rpool/export/home/cjg@unlink_1238277301769961639  50K
rpool/export/home/cjg@unlink_1238277302279948212  49K
rpool/export/home/cjg@unlink_1238277310639840621  28K
rpool/export/home/cjg@unlink_1238277314109790784  55.5K
rpool/export/home/cjg@unlink_1238277324429653135  49K
rpool/export/home/cjg@unlink_1238277325639636996  49K
rpool/export/home/cjg@unlink_1238277360029166691  356K
rpool/export/home/cjg@unlink_1238277375738948709  55.5K
rpool/export/home/cjg@unlink_1238277376798933629  29K
rpool/export/home/cjg@unlink_1238277378458911557  50K
rpool/export/home/cjg@unlink_1238277380098888676  49K
rpool/export/home/cjg@unlink_1238277397738633771  48K
rpool/export/home/cjg@unlink_1238277415098386055  49K
rpool/export/home/cjg@unlink_1238277416258362893  49K
rpool/export/home/cjg@unlink_1238277438388037804  57K
rpool/export/home/cjg@unlink_1238277443337969269  30.5K
rpool/export/home/cjg@unlink_1238277445587936426  51.5K
rpool/export/home/cjg@unlink_1238277454527801430  50.5K
rpool/export/home/cjg@unlink_1238277500967098623  196K
rpool/export/home/cjg@unlink_1238277562866135282  55.5K
rpool/export/home/cjg@unlink_1238277607205456578  49K
rpool/export/home/cjg@unlink_1238277608135443640  49K
rpool/export/home/cjg@unlink_1238277624875209357  57K
rpool/export/home/cjg@unlink_1238277682774484369  30.5K
rpool/export/home/cjg@unlink_1238277684324464523  50K
rpool/export/home/cjg@unlink_1238277685634444004  49K
rpool/export/home/cjg@unlink_1238277686834429223  75.5K
rpool/export/home/cjg@unlink_1238277700074256500  48K
rpool/export/home/cjg@unlink_1238277701924235244  48K
rpool/export/home/cjg@unlink_1238277736473759068  49.5K
rpool/export/home/cjg@unlink_1238277748313594650  55.5K
rpool/export/home/cjg@unlink_1238277748413593612  28K
rpool/export/home/cjg@unlink_1238277750343571890  48K
rpool/export/home/cjg@unlink_1238277767513347930  49.5K
rpool/export/home/cjg@unlink_1238277769183322087  50K
rpool/export/home/cjg@unlink_1238277770343306935  48K
rpool/export/home/cjg@unlink_1238277786193093885  48K
rpool/export/home/cjg@unlink_1238277787293079433  48K
rpool/export/home/cjg@unlink_1238277805362825259  49.5K
rpool/export/home/cjg@unlink_1238277810602750426  195K
rpool/export/home/cjg@unlink_1238277872911814531  195K
rpool/export/home/cjg@unlink_1238277934680920214  195K
rpool/export/home/cjg@unlink_1238277997220016825  195K
rpool/export/home/cjg@unlink_1238278063868871589  54.5K
rpool/export/home/cjg@unlink_1238278094728323253  61K
rpool/export/home/cjg@unlink_1238278096268295499  63K
rpool/export/home/cjg@unlink_1238278098518260168  52K
rpool/export/home/cjg@unlink_1238278099658242516  56K
rpool/export/home/cjg@unlink_1238278103948159937  57K
rpool/export/home/cjg@unlink_1238278107688091854  54K
rpool/export/home/cjg@unlink_1238278113907980286  62K
rpool/export/home/cjg@unlink_1238278116267937390  64K
rpool/export/home/cjg@unlink_1238278125757769238  196K
rpool/export/home/cjg@unlink_1238278155387248061  136K
rpool/export/home/cjg@unlink_1238278160547156524  229K
rpool/export/home/cjg@unlink_1238278165047079863  351K
rpool/export/home/cjg@unlink_1238278166797050407  197K
rpool/export/home/cjg@unlink_1238278168907009714  55K
rpool/export/home/cjg@unlink_1238278170666980686  341K
rpool/export/home/cjg@unlink_1238278171616960684  54.5K
rpool/export/home/cjg@unlink_1238278190336630319  777K
rpool/export/home/cjg@unlink_1238278253245490904  329K
rpool/export/home/cjg@unlink_1238278262235340449  362K
rpool/export/home/cjg@unlink_1238278262915331213  362K
rpool/export/home/cjg@unlink_1238278264915299508  285K
rpool/export/home/cjg@unlink_1238278310694590970  87K
rpool/export/home/cjg@unlink_1238278313294552482  66K
rpool/export/home/cjg@unlink_1238278315014520386  31K
rpool/export/home/cjg@unlink_1238278371773568934  258K
rpool/export/home/cjg@unlink_1238278375673503109  198K
rpool/export/home/cjg@unlink_1238278440802320314  138K
rpool/export/home/cjg@unlink_1238278442492291542  55.5K
rpool/export/home/cjg@unlink_1238278445312240229  2.38M
rpool/export/home/cjg@unlink_1238278453582077088  198K
rpool/export/home/cjg@unlink_1238278502461070222  256K
rpool/export/home/cjg@unlink_1238278564359805760  256K
rpool/export/home/cjg@unlink_1238278625738732194  63.5K
rpool/export/home/cjg@unlink_1238278633428599541  61.5K
rpool/export/home/cjg@unlink_1238278634568579678  137K
rpool/export/home/cjg@unlink_1238278657838186760  288K
rpool/export/home/cjg@unlink_1238278659768151784  223K
rpool/export/home/cjg@unlink_1238278661518121640  159K
rpool/export/home/cjg@unlink_1238278664378073421  136K
rpool/export/home/cjg@unlink_1238278665908048641  138K
rpool/export/home/cjg@unlink_1238278666968033048  136K
rpool/export/home/cjg@unlink_1238278668887996115  281K
rpool/export/home/cjg@unlink_1238278670307970765  227K
rpool/export/home/cjg@unlink_1238278671897943665  162K
rpool/export/home/cjg@unlink_1238278673197921775  164K
rpool/export/home/cjg@unlink_1238278674027906895  164K
rpool/export/home/cjg@unlink_1238278674657900961  165K
rpool/export/home/cjg@unlink_1238278675657885128  165K
rpool/export/home/cjg@unlink_1238278676647871187  241K
rpool/export/home/cjg@unlink_1238278678347837775  136K
rpool/export/home/cjg@unlink_1238278679597811093  199K
rpool/export/home/cjg@unlink_1238278687297679327  197K
rpool/export/home/cjg@unlink_1238278749616679679  197K
rpool/export/home/cjg@unlink_1238278811875554411  56.5K

cjg@brompton:~$ 

Good job that snapshots are cheap. I'm not going to be doing this all the time but it makes you think what could be done.
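
Cleaning up afterwards is a small pipeline. This sketch only generates the `zfs destroy` commands (a dry run); the literal printf list is a stand-in so the filtering can be seen without a pool, and in real use you would feed it from `zfs list -H -t snapshot -o name` and pipe the result through a shell once you are happy with it.

```shell
# Turn the snapshot listing into destroy commands, keeping only the
# unlink_* snapshots created by the proof of concept above.
printf '%s\n' \
    'rpool/export/home/cjg@unlink_1238270760978466613' \
    'rpool/export/home/cjg@manual-backup' |
grep '@unlink_' |
sed 's/^/pfexec zfs destroy /'
```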

Friday Mar 20, 2009

Limiting scsi.d output to IOs that take a long time.

I have added support to scsi.d for reporting only on SCSI packets that take more than a certain amount of time to complete. This is particularly useful when combined with the PERF_REPORT option, to see just the IO requests that took over a time threshold while still collecting statistics on all the other IO requests.

To use this you have to specify “-D REPORT_OVERTIME=X”, where X is the time threshold in nanoseconds. Here is an example that will only print the details of IO requests that took more than 250ms:

# scsi.d -D REPORT_OVERTIME=$((250*1000*1000)) -D PERF_REPORT 
Only reporting IOs longer than 250000000ns
Hit Control C to interrupt
00005.388800363 glm0:-> 0x2a WRITE(10) address 00:00, lba 0x00270e67, len 0x000008, control 0x00 timeout 60 CDBP 300002df438 1 sched(0) cdb(10) 2a0000270e6700000800
00005.649494475 glm0:<- 0x2a WRITE(10) address 00:00, lba 0x00270e67, len 0x000008, control 0x00 timeout 60 CDBP 300002df438, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 260833us
00005.384612799 glm0:-> 0x0a  WRITE(6) address 00:00, lba 0x048564, len 0x000001, control 0x00 timeout 60 CDBP 30002541ac0 1 sched(0) cdb(6) 0a0485640100
00005.716416658 glm0:<- 0x0a  WRITE(6) address 00:00, lba 0x048564, len 0x000001, control 0x00 timeout 60 CDBP 30002541ac0, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 331939us
00005.385907691 glm0:-> 0x0a  WRITE(6) address 00:00, lba 0x0605b4, len 0x000001, control 0x00 timeout 60 CDBP 300035637a0 1 sched(0) cdb(6) 0a0605b40100
00005.773925990 glm0:<- 0x0a  WRITE(6) address 00:00, lba 0x0605b4, len 0x000001, control 0x00 timeout 60 CDBP 300035637a0, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 388153us
00005.389078533 glm0:-> 0x2a WRITE(10) address 00:00, lba 0x004b19d3, len 0x000003, control 0x00 timeout 60 CDBP 300002df0f8 1 sched(0) cdb(10) 2a00004b19d300000300
00005.824242527 glm0:<- 0x2a WRITE(10) address 00:00, lba 0x004b19d3, len 0x000003, control 0x00 timeout 60 CDBP 300002df0f8, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 435303us
^C

  glm                                                       0
           value  ------------- Distribution ------------- count    
         4194304 |                                         0        
         8388608 |                                         3        
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        678      
        33554432 |@@@@@                                    98       
        67108864 |@                                        27       
       134217728 |                                         8        
       268435456 |                                         3        
       536870912 |                                         0        

# 

The implementation uses dtrace speculations and as such may require some tuning of the various settings. Specifically, I have set the number of speculation buffers to 1000, which should be enough for all but the busiest systems; if you do reach that limit you can increase it using the following options:


-D NSPEC=N

Set the number of speculations to this value.

-D SPECSIZE=N

Set the size of the speculation buffer. This should be 200 times the value of NSPEC.

-D CLEANRATE=N

Specify the clean rate.
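
To make the arithmetic concrete, here is how the two values relate; NSPEC=2000 is a made-up figure for illustration, not a recommendation.

```shell
# SPECSIZE follows from NSPEC using the 200x rule stated above.
NSPEC=2000
SPECSIZE=$((200 * NSPEC))
echo "scsi.d -D NSPEC=$NSPEC -D SPECSIZE=$SPECSIZE -D REPORT_OVERTIME=$((250*1000*1000))"
```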

As usual the script is available here. It is version 1.18.

Tuesday Mar 10, 2009

Update to scsi.d

I've just released a new version of scsi.d that fixes a strange bug that only affected Solaris 10 but would result in scsi.d not starting:

# dtrace -qCs scsi.d-1.16  
dtrace: failed to compile script scsi.d-1.16: "/usr/include/sys/scsi/scsi_pkt.h", line 33: incomplete struct/union/enum struct scsi_address: pkt_address
# 

The solution was simple, even if I still don't really understand why the header included fine in OpenSolaris but not in Solaris 10.

Thursday Jan 29, 2009

lwp_park and lwp_unpark

If you truss a process that uses the locking primitives mutex_lock() and mutex_unlock() or their POSIX equivalents, you will often see calls to lwp_park() and lwp_unpark() appearing:

/1:	 1.3658	lwp_park(0x00000000, 0)				= 0
/2:	 1.3659	lwp_unpark(1)					= 0
/1:	 1.3659	lwp_park(0x00000000, 0)				= 0
/2:	 1.3660	lwp_unpark(1)					= 0
/1:	 1.3660	lwp_park(0x00000000, 0)				= 0
/1:	 1.3661	lwp_unpark(2)					= 0
/2:	 1.3661	lwp_park(0x00000000, 0)				= 0

These system calls are, as their names imply, the calls that cause the current LWP to stop (park) and that allow a parked LWP to run again (unpark). Consider the noddy example code:


#include <stdlib.h>
#include <unistd.h>
#include <thread.h>
#include <synch.h>

mutex_t global_lock;
int global_count;

static void *
locker(void *arg) 
{
	while (1) {
		mutex_lock(&global_lock);
		global_count++;
		mutex_unlock(&global_lock);
	}
}

int
main(int argc, char **argv)
{
	int count;

	if (argc == 2) {
		count = strtol(argv[1], NULL, 0);
	}  else {
		count = 2;
	}
	while (--count > 0) {
		thr_create(0, 0, locker, NULL, NULL, NULL);
	}
	locker(NULL);
}

The mutex global_lock is going to be battled over by all the threads that are created. If one of those threads needs to sleep because it can't get the mutex, it has to make a system call so that it can stop using the CPU; that system call is lwp_park(). When the other thread, the one that holds the mutex, releases it, it signals the blocked thread using lwp_unpark(), passing the LWP id of the thread to unpark. This can be seen in the truss:


/1:	 1.3658	lwp_park(0x00000000, 0)				= 0
/2:	 1.3659	lwp_unpark(1)					= 0
/1:	 1.3659	lwp_park(0x00000000, 0)				= 0
/2:	 1.3660	lwp_unpark(1)					= 0
/1:	 1.3660	lwp_park(0x00000000, 0)				= 0
/1:	 1.3661	lwp_unpark(2)					= 0
/2:	 1.3661	lwp_park(0x00000000, 0)				= 0

However, the truss output can be a bit misleading. Remember that truss only reports a system call when it returns (unless it blocks long enough to be reported as sleeping), so a call like lwp_park() will not appear until the corresponding lwp_unpark() call arrives from another thread. You can see this in the output above: when LWP 2, on the first line, calls lwp_unpark(1), LWP 1 returns from what was the blocked lwp_park() call and continues on its merry way. Alas, as can be seen, it does not get very far before it once again finds itself blocked, but that is what happens if your code is just fighting for a lock. If this were a real application then there are many D scripts that could be used to help track down the issue, not least one like this:

pfexec /usr/sbin/dtrace -n 'syscall::lwp_park:entry /execname == "mutex"/
    { @[ustack(5)] = count() }'

Tuesday Jan 27, 2009

New version of scsi.d required for build 106

This version supports some more filters. Specifically you can now specify these new options:

  • MIN_BLOCK only report on IO to blocks greater than or equal to this value.

  • MAX_BLOCK only report on IO to blocks less than or equal to this value.

This is most useful for limiting your trace to particular block ranges, be they a file system or, as in the case that caused me to add this, the disk label, to see who is trampling on it.

In this contrived example it was format:

pfexec /usr/sbin/dtrace -o /tmp/dt.$$ -Cs  scsi.d -D MAX_BLOCK=3 
<SNIP>
00058.529467684 glm0:-> 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560 1 format(3985) cdb(6) 0a0000000100
00058.542945891 glm0:<- 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560, reason 0x0 (COMPLETED) pkt_state 0x1
f state 0x0 Success Time 13604us

While this answered my question there are neater ways of answering the question just by using the IO provider:

: s4u-nv-gmp03.eu TS 68 $; pfexec /usr/sbin/dtrace -n 'io:::start / args[0]->b_blkno < 3 && args[0]->b_flags & B_WRITE / { printf("%s %s %d %d", execname, args[1]->dev_statname, args[0]->b_blkno, args[0]->b_bcount) }'
dtrace: description 'io:::start ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512

Also, build 106 of Nevada has changed the structure definition for scsi_address, and in doing so breaks scsi.d, which has intimate knowledge of scsi_address structures. I have a solution that you can download, but in writing it I also filed this bug:

679803 dtrace suffers from macro recursion when including scsi_address.h

which scsi.d has to work around. When that bug is resolved the work around may have to be revisited.

All versions of scsi.d are available here, and this specific version, 1.16, here.

Thank you to Artem Kachitchkine for bringing the changes to scsi_address.h and their effects on scsi.d to my attention.

Wednesday Jan 21, 2009

Getting the right CTF

I just spent too long, way too long, working out why a system dump's CTF did not seem to match the source code, or for that matter the assembler that had been generated.

When a Solaris release ships, all the CTF data is merged into the unix file. As updates are released, structures that change are not updated in the unix file (since the old definition may still be in use); instead the CTF definition is held in the module in which the structure is defined.

So, faced with a dump where I needed to look at the “conn_udp” element in “struct conn_s”, mdb kept saying there was no element “conn_udp”:

> ::print -at conn_t conn_udp
mdb: failed to find member conn_udp of conn_t: no such member of structure or union
> 

since the assembler made it abundantly clear that we were indeed using this element (I would show you the source but this is Solaris 10 and the source is very different from the OpenSolaris code). The thing to recall was that the structure is really defined in the “ip” module so to get the correct definition you need this:

> ::print -at ip`conn_t conn_udp
30 struct udp_s *conn_udp 
>

This also affects dtrace, as it too is a consumer of CTF (note this dtrace invocation is entirely pointless):

# dtrace -n 'fbt::udp_bind:entry / ((conn_t *)(args[0]->q_ptr))->conn_udp / { tot++ }'
dtrace: invalid probe specifier fbt::udp_bind:entry / ((conn_t *)(args[0]->q_ptr))->conn_udp / { tot++ }: in predicate: conn_udp is not a member of struct conn_s
# 

and again by getting the definition from the original module gets the right answer:

# dtrace -n 'fbt::udp_bind:entry / ((ip`conn_t *)(args[0]->q_ptr))->conn_udp / { tot++ }'
dtrace: description 'fbt::udp_bind:entry ' matched 1 probe
^C

#

Since “ip`conn_t” will always give the right answer, even in the case where the merged CTF data in unix is the current version, it is best to understand where the object was declared.

I kind of wish that, at least in the case of dtrace, it would get this right when you have specified the module: since it knows what module you are in, it could choose the CTF from that module.


# dtrace -n 'fbt:ip:udp_bind:entry / ((conn_t *)(args[0]->q_ptr))->conn_udp / { tot++ }'
dtrace: invalid probe specifier fbt:ip:udp_bind:entry / ((conn_t *)(args[0]->q_ptr))->conn_udp / { tot++ }: in predicate: conn_udp is not a member of struct conn_s
#


Should IMO work.

Wednesday Nov 05, 2008

dtrace top tip

When logged on to a laptop (or any desktop system using X) don't run this command:

 $ pfexec dtrace -l -p $(pgrep Xorg)

Instead do this:

 $ pfexec dtrace -l -o /tmp/dt -p $(pgrep Xorg)

The former will deadlock the X server and, if like me you are in a hotel room with no other way to log in to the system, will require you to power cycle it. The latter will put the results in /tmp/dt, which you can then look at.

Even as I hit return I thought I should not do that, as bad things could happen; see CR 4259419.

I blame this on staying up late at an Election night Party with some very happy people. Thank you to them for letting me share the experience, I just hope that you are not let down in the same way we have been after the 1997 Labour victory which brought so much hope to so many.

Friday Oct 10, 2008

Gathering performance statistics with scsi.d

While scsi.d is good for looking at SCSI packets and seeing those raw CDBs, not many people are really interested in what a SCSI packet looks like (well, not enough people, if you ask me). What is much more interesting is how long the SCSI packets are taking. scsi.d tells you this for each packet, but aggregating the data is far more useful.

: e2big.eu TS 81 $; pfexec /usr/sbin/dtrace -Cs scsi.d -D QUIET -D PERF_REPORT -D REPORT_TARGET \
-D REPORT_LUN -n 'tick-1m {printa(@); clear(@); exit(0)}'
Hit Control C to interrupt

  qus                                                       1
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@                                     25       
          524288 |@@@@@@@@@@@@                             68       
         1048576 |@@@@@@                                   34       
         2097152 |                                         2        
         4194304 |@@@                                      19       
         8388608 |@@@@@                                    29       
        16777216 |@@@@                                     22       
        33554432 |@@@@@@                                   35       
        67108864 |                                         1        
       134217728 |                                         0        

  fp                                                        2
           value  ------------- Distribution ------------- count    
          262144 |                                         0        
          524288 |                                         3        
         1048576 |                                         1        
         2097152 |@@                                       15       
         4194304 |@@@@@@@@                                 67       
         8388608 |@@@@@@@@@@                               81       
        16777216 |@@@@@@@@                                 65       
        33554432 |@@@@@@@@                                 66       
        67108864 |@@@                                      27       
       134217728 |                                         0        

  fp                                                        0
           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |                                         27       
          262144 |@                                        485      
          524288 |@@@@@@                                   2901     
         1048576 |@@@@@                                    2203     
         2097152 |@@@@@@@                                  3204     
         4194304 |@@@@@@@@@                                4087     
         8388608 |@@@@@@@@                                 3978     
        16777216 |@@@                                      1606     
        33554432 |@                                        570      
        67108864 |                                         123      
       134217728 |                                         45       
       268435456 |                                         0        

  fp                                                        3
           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |                                         41       
          262144 |@                                        493      
          524288 |@@@@@@                                   2926     
         1048576 |@@@@                                     2157     
         2097152 |@@@@@@                                   3228     
         4194304 |@@@@@@@@@                                4461     
         8388608 |@@@@@@@@@                                4561     
        16777216 |@@@                                      1634     
        33554432 |@                                        510      
        67108864 |                                         116      
       134217728 |                                         52       
       268435456 |                                         2        
       536870912 |                                         0        

  scsi_vhci                                                 0
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@                                        588      
          524288 |@@@@@                                    4807     
         1048576 |@@@@@@                                   5423     
         2097152 |@@@@@@@                                  6609     
         4194304 |@@@@@@@@@                                8627     
         8388608 |@@@@@@@@@                                8641     
        16777216 |@@@                                      3289     
        33554432 |@                                        1088     
        67108864 |                                         239      
       134217728 |                                         97       
       268435456 |                                         2        
       536870912 |                                         0        


: e2big.eu TS 82 $; 
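The bucket boundaries in these quantize distributions are raw power-of-two values; assuming scsi.d aggregates the packet times in nanoseconds (the unit of DTrace's timestamp variable), a throwaway conversion makes them easier to read:

```python
# Convert power-of-two quantize bucket boundaries to milliseconds.
# Assumption: scsi.d aggregates packet times in nanoseconds (DTrace's
# native timestamp unit); the values below are buckets from the report
# above.
for ns in (262144, 1048576, 8388608, 33554432, 134217728):
    print(f"{ns:>10} ns ~= {ns / 1_000_000:.3f} ms")
```

On that reading, the bulk of the IOs on the busy "fp 0" HBA sit in the roughly 0.5 ms to 16 ms range.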

All the new options are supplied via -D flags to dtrace and they are:

QUIET
    Be quiet. Don't report any packets seen. Useful when you only want a performance report.

PERF_REPORT
    Produce a per-HBA performance report when the script completes. The report is an aggregation held in @ so it can be printed at regular intervals using a tick probe, as in the above example, but without the call to exit().

REPORT_TARGET
    If producing a performance report, include the target to produce a per-target report.

REPORT_LUN
    If producing a per-target report, include the LUN to produce a per-LUN report.

DYNVARSIZE
    Pass this value to the #pragma D option dynvarsize= option. Eg:

    -D DYNVARSIZE=64m


The latest version of the script, version 1.15 is here: http://blogs.sun.com/chrisg/resource/scsi_d/scsi.d-1.15

Friday Oct 03, 2008

What is that system call doing?

Getting back on topic, here is a nice short bit of Dtrace.

Sometimes by the time I get to see an issue the “where on the object” question is well defined, and in two recent cases it came down to “Why is system call X slow?”. The two system calls were not the same in each case, but the bit of D to find the answer was almost identical in both.

Faced with a system call that is taking a long time you have to understand the three possible reasons this can happen:

  1. It has to do a lot of processing to achieve its results.

  2. It blocks for a long time waiting for an asynchronous event to occur.

  3. It blocks for a short time but many times waiting for asynchronous events to occur.

So it would be really nice to be able to see where a system call is spending all its time. The starting point for such an investigation is that while in the system call there are only two important states: the thread is either running on a CPU or it is not. Typically when it is not, it is because it is blocked for some reason. So we can use the DTrace sched provider's on-cpu and off-cpu probes to see how much time the system call spends blocked, and print out stacks if it is blocked for more than a given amount of time.
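One subtlety with reasons 2 and 3 above: they can produce identical off-cpu totals, and only the distribution of the individual waits tells them apart, which is why quantizing each interval pays off as well as summing them. A synthetic illustration (made-up numbers):

```python
# Two ways to spend 2 ms off-cpu: one long block (reason 2) versus many
# short blocks (reason 3). The totals are identical; only the
# per-interval distribution distinguishes them.
one_long_block = [2_000_000]        # a single 2 ms wait, in nanoseconds
many_short_blocks = [20_000] * 100  # a hundred 20 us waits
assert sum(one_long_block) == sum(many_short_blocks)
print(max(one_long_block), max(many_short_blocks))  # 2000000 20000
```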

Here it is running against a simple mv(1) command:

$ pfexec /usr/sbin/dtrace -s syscall-time.d -c "mv .d .x"
dtrace: script 'syscall-time.d' matched 17 probes
dtrace: pid 26118 has exited
CPU     ID                    FUNCTION:NAME
  3  79751                     rename:entry rename(.d, .x)
  3  21381                    resume:on-cpu Off cpu for: 1980302
              genunix`cv_timedwait_sig+0x1c6
              rpcmod`clnt_cots_kcallit+0x55d
              nfs`nfs4_rfscall+0x3a9
              nfs`rfs4call+0xb7
              nfs`nfs4rename_persistent_fh+0x1eb
              nfs`nfs4rename+0x482
              nfs`nfs4_rename+0x89
              genunix`fop_rename+0xc2
              genunix`vn_renameat+0x2ab
              genunix`vn_rename+0x2b

  3  79752                    rename:return 
  on-cpu                                            
           value  ------------- Distribution ------------- count    
           16384 |                                         0        
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3        
           65536 |@@@@@@@@@@                               1        
          131072 |                                         0        

  off-cpu                                           
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2        
          524288 |                                         0        
         1048576 |@@@@@@@@@@@@@                            1        
         2097152 |                                         0        

rename times on: 205680 off: 2625604 total: 2831284


$ 

From the aggregations at the bottom of the output you can see that the system call went off-cpu three times, and on one of those occasions it was off CPU for long enough that my limit of a millisecond was reached and so a stack trace was printed. It also becomes pretty clear where that system call spent all its time. It was a “rename” system call and I'm on an NFS file system, so it has to wait for the server to respond, and that server is going to have to make sure it has updated some non-volatile storage.
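As a sanity check, the two per-call totals printed by the script account for the entire elapsed time of the call:

```python
# Values from the rename run above: on-cpu plus off-cpu time should
# equal the total elapsed time of the system call.
on_ns, off_ns, total_ns = 205680, 2625604, 2831284
assert on_ns + off_ns == total_ns
print(f"off-cpu share: {off_ns / total_ns:.1%}")  # off-cpu share: 92.7%
```

So over 90% of this call was spent blocked, waiting on the NFS server.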


Here is the script:


#!/usr/sbin/dtrace -s
/* run using dtrace -p or dtrace -c */
syscall::rename:entry
/ pid == $target /
{
	self->traceme = 1;
	self->ts = timestamp;
	self->on_cpu = timestamp;
	self->total_on = 0;
	self->total_off = 0;
	printf("rename(%s, %s)", copyinstr(arg0), copyinstr(arg1));
}
sched:::off-cpu
/ self->traceme == 1 /
{
	self->off_cpu = timestamp;
	self->total_on += self->off_cpu - self->on_cpu;
}
sched:::off-cpu
/ self->traceme == 1 /
{
	 @["on-cpu"] = quantize(self->off_cpu - self->on_cpu);
}
sched:::on-cpu
/ self->traceme == 1 /
{
	self->on_cpu = timestamp;
	 @["off-cpu"] = quantize(self->on_cpu - self->off_cpu);
	 self->total_off += self->on_cpu - self->off_cpu;
}
/* if off cpu for more than a millisecond print a stack */
sched:::on-cpu
/ self->traceme == 1 && timestamp - self->off_cpu > 1000*1000 /
{
	printf("Off cpu for: %d", self->on_cpu - self->off_cpu);
	stack(10);
}
sched:::off-cpu
/ self->traceme == 1 && timestamp - self->on_cpu > 1000*1000 /
{
	printf("On cpu for: %d", self->off_cpu - self->on_cpu);
	stack(10);
}
syscall::rename:return
/self->traceme/
{
	self->traceme = 0;
	self->total_on += timestamp - self->on_cpu;
	@["on-cpu"] = quantize(timestamp - self->on_cpu);
	printa(@);
	printf("%s times on: %d off: %d total: %d\n", probefunc, self->total_on,
	self->total_off, timestamp-self->ts);
	self->on_cpu = 0;
	self->off_cpu = 0;
	self->total_on = 0;
	self->total_off = 0;
}

11/9/2009 Updated script to correct the time in the comment.

Friday Jul 18, 2008

Filtering which processes scsi.d reports IO for.

I have updated scsi.d to allow you to see just the commands associated with particular processes, either by execname or by process id.

To see all the scsi IO associated with a process with an exec name of “dd”:

# /tmp/scsi.d -D EXECNAME='"dd"'                             
Hit Control C to interrupt
00058.416893400 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 000000000000
00058.417960685 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1208us
00058.418159142 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 000000000000
00058.419187590 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1128us
00058.419399600 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 1a0003002400
00058.443792908 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 24492us
00058.443986521 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 1a0004002400
00058.460271773 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 16385us
00058.460509036 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080000000100
00058.464377145 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 3966us
00058.464571604 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080000000100
00058.465001748 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 528us
00058.465739416 glm0:-> 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080186a00100
00058.471783255 glm0:<- 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 6143us


or to see all the scsi IO associated with a particular command:

# ./scsi.d -c "dd  if=/dev/rdsk/c0t0d0s0 of=/dev/null count=1 iseek=100000"
Hit Control C to interrupt
1+0 records in
1+0 records out
00000.046670200 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 000000000000
00000.047677067 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1151us
00000.047871553 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 000000000000
00000.048759564 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 988us
00000.048979511 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 1a0003002400
00000.073204217 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 24326us
00000.073406796 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 1a0004002400
00000.089689153 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 16385us
00000.089932178 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080000000100
00000.093793880 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 3962us
00000.093991464 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080000000100
00000.094429110 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 538us
00000.095207716 glm0:-> 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080186a00100
00000.101200857 glm0:<- 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 6096us

Similarly you can see the IO associated with a particular process:

# ./scsi.d -p $(pgrep -x dd) | head -30 | sort -n
Hit Control C to interrupt
00000.002472402 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d2, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d20100
00000.010438300 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d2, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 8156us
00000.010706063 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d3, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d30100
00000.011154931 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d3, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 560us
00000.011371674 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d4, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d40100
00000.011814640 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d4, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 552us
00000.012025254 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d5, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d50100
00000.012468655 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d5, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 552us

Clearly there are plenty of possibilities here, not least the reporting of all IO associated with a process or command and all of its children.

This is still work in progress, but since I've used it to file a bug I thought I'd better put it out here so that others can enjoy it. This is version 1.14.

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
