Wednesday Jul 09, 2008

Getting dtrace to exit with the same status as the traced program

Today's dtrace one-liner answers the question: how do you get dtrace to exit with the same exit code as the program you are tracing, so that you can use it in a script?

There are going to be many ways to solve this. Mine is simply to include a probe for the exit system call and pass its arg0 on to the dtrace exit() function:

while dtrace -n 'syscall::*exit:entry / pid == $target / { trace(arg0); exit(arg0) } fbt:autofs::return / pid == $target / { stack(5); trace(arg1) }' -c 'ls -d /home/cjg'
do
umountall -F nfs
svcadm restart autofs
done

Simple and allows the test to run and run until the problem is seen.
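Stripped of the autofs probes, the core of the trick is just this minimal sketch:

dtrace -n 'syscall::*exit:entry / pid == $target / { exit(arg0) }' -c 'ls -d /home/cjg'
echo $?

dtrace stops when the traced command calls exit and passes the command's exit status on as its own, which is what lets the while loop above keep going until the test fails.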

Monday Jul 07, 2008

Full tracing of gethostbyXXX calls.

The gethostbyname.d “one liner” is now so far from being a one liner that the name is no longer funny. It does, however, do all the things you would want it to do: printing the entire hostent1 structure on success and the h_errno value on failure.

: enoexec.eu FSS 257 $; pfexec /usr/sbin/dtrace -32 -CZs gethostbyXXXX.d -c "getent hosts www.ibm.com"

129.42.56.216	www.ibm.com www.ibm.com.cs186.net
Look up: www.ibm.com: took 3276803us
Host: www.ibm.com
	h_alias[0]: www.ibm.com.cs186.net
	h_address[0]: 129.42.56.216

: enoexec.eu FSS 258 $; pfexec /usr/sbin/dtrace -32 -CZs gethostbyXXXX.d -c "getent hosts this.host.does.not.exist.sun.com"
Look up: this.host.does.not.exist.sun.com: took 31789us
gethostbyname_r failed. h_errno: 1: Host not found

: enoexec.eu FSS 259 $; pfexec /usr/sbin/dtrace -32 -CZs gethostbyXXXX.d -c "getent hosts 127.0.0.1"
127.0.0.1	localhost
Look up: 127.0.0.1 took 247882us:
Host: localhost
	h_address[0]: 127.0.0.1

: enoexec.eu FSS 260 $; 

I learned some useful things while writing this:

  1. Writing userland scripts is much harder than kernel scripts. If your script has “#include <xxxx.h>” in it then you are entering a world of hurt. Specifically, make sure you set the data model of dtrace (-32 or -64) to match that of the application you are tracing.

  2. The various copyin() routines write into scratch memory. Scratch memory is only good during the current clause, so if you want to move the data from clause to clause you need to store it (see the sketch after this list). Thanks again to Jon for pointing this out.

  3. How cool would “follow fork” be for dtrace? Very.

  4. It would be really nice to bundle this script up as “getXXXbyYYY.d” and include all the other getXXXbyYYY routines that there are (getpwbyname, getpwbyuid, etc.); however, even this script is on the edge of producing DIF that is too large for the standard settings, so having more probes would make it unworkable. Anyway, they make a good exercise for the reader.
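To illustrate point 2 with a small sketch (the same pattern as the gethostbyname_r script below, assuming arg4 is the pointer to the h_errno location): keep the user-space address in a thread-local variable and redo the copyin() in the clause that consumes the data, rather than carrying a scratch pointer across clauses:

pid$target::gethostbyname_r:entry
{
        /* keep the user address; a pointer returned by copyin() here
           would be scratch memory and invalid in the return clause */
        self->addr = arg4;
}

pid$target::gethostbyname_r:return
/ self->addr /
{
        /* re-copy in the clause that uses the data */
        printf("h_errno %d\n", *(int *)copyin(self->addr, sizeof (int)));
        self->addr = 0;
}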

The script is here: gethostbyXXXX.d



1Well, only the first 10 host name aliases and the first 10 IP addresses are printed. Adding more is trivial but you will need to increase the size of the DIF that the kernel is prepared to accept.

Thursday Jul 03, 2008

Dtracing gethostbyname_r

Today's dtrace “one liner” is to dig into why a process is failing in a call to gethostbyname_r(). If the application reported all that was going on this would not be needed; meanwhile, in the real world, we are lucky enough to have dtrace. To test it out I'm tracing the getent(1) command. In real life you would use the -p flag to dtrace with the process ID of the daemon.

/usr/sbin/dtrace -n 'pid$target::gethostbyname_r:entry {
        self->name = arg0;
}
pid$target::gethostbyname_r:return / arg1 == 0 / {
        ustack(5);
}
pid$target::gethostbyname_r:return / self->name / {
        printf("%s", copyinstr(self->name));
        self->name = 0
}' -c "getent hosts xxxxxdredd"
dtrace: description 'pid$target::gethostbyname_r:entry ' matched 3 probes
dtrace: pid 4748 has exited
CPU     ID                    FUNCTION:NAME
  2  48759           gethostbyname_r:return 
              libnsl.so.1`gethostbyname_r+0xc4
              getent`dogethost+0x54
              getent`main+0x7c
              getent`_start+0x108

  2  48759           gethostbyname_r:return xxxxxdredd

It would be nice to be able to get the h_errno value as well but so far I've not managed that.

Update:

Jon Haslem kindly explained to me the subtleties of copyin() so that I can get the h_errno value.

/usr/sbin/dtrace -Zn 'pid$target::gethostbyname_r:entry {
        self->name = arg0; 
        self->errno = arg4; 
}
pid$target::gethostbyname_r:return / arg1 == 0 / {
        ustack(5);
        printf("%d %s h_errno %x", pid,
                copyinstr(self->name),
                *(int *)copyin(self->errno, sizeof (int)));
}' -c "getent hosts xxxxxdredd"
dtrace: description 'pid$target::gethostbyname_r:entry ' matched 2 probes
dtrace: pid 5087 has exited
CPU     ID                    FUNCTION:NAME
  2  48764            gethostbyname_r:entry errno: d4220008
  2  48765           gethostbyname_r:return 
              libnsl.so.1`gethostbyname_r+0xc4
              getent`dogethost+0x54
              getent`main+0x7c
              getent`_start+0x108
5087 xxxxxdredd h_errno 1

# 

Wednesday Jul 02, 2008

What is the maximum number of commands queued to a LUN

This is not quite a one liner as I'm reusing the code from a previous post to print out the devices in a human readable form; otherwise it is just a one liner, and it was when I typed it in.

The question posed here was: what is the maximum number of commands sent to a LUN at any one time? Clearly this will max out at the throttle for the device. However, what was interesting, since the customer had already tuned the throttle down and the problem had gone away, was what their configuration was capable of sending to the LUN:

#!/usr/sbin/dtrace -qCs

#define SD_TO_DEVINFO(un) ((struct dev_info *)((un)->un_sd->sd_dev))

#define DEV_NAME(un) \
        stringof(`devnamesp[SD_TO_DEVINFO(un)->devi_major].dn_name) /* ` */

#define DEV_INST(un) (SD_TO_DEVINFO(un)->devi_instance)

fbt:*sd:*sd_start_cmds:entry
{
        @[DEV_NAME(args[0]), DEV_INST(args[0])] = max(args[0]->un_ncmds_in_driver);
}
END {
        printa("%s%d %@d\n", @);
}


This produces a nice list of disk devices and the maximum number of commands that have been sent to each of them at any one time:

# dtrace -qCs  /var/tmp/max_sd.d -n 'tick-5sec { exit(0) }'
sd2 1
sd0 70

# 

Combine that with the D script from the latency bubble posting earlier and you can drill down on where your IO is waiting.

Tuesday Jul 01, 2008

Today's Dtrace one liner

Today's dtrace one liner is part of a case investigating why messages are not making it into the messages file. Using the divide and conquer principle, the first question you need to answer is: “Is the process that is supposed to generate the messages calling into syslog?”

$ dtrace -n 'pid$target::syslog:entry { printf("%d %s", arg0, copyinstr(arg1)) }' -p $(pgrep xxxx)
dtrace: description 'pid$target::syslog:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1  43227                     syslog:entry 5 %s
^C

That is enough to answer the first question. You can get all flash and pull out the string passed in as the second argument, but then it is not a one liner and is answering a different question. However, it is neat, so here it is:

$ dtrace -qn 'pid$target::syslog:entry { printf("%d %s", arg0, (this->arg1 = copyinstr(arg1))) }
pid$target::syslog:entry / this->arg1 == "%s"/ { printf(" %s\n", copyinstr(arg2)) }' -p $(pgrep xxxx)

How about varargs and vsprintf for dtrace....

Friday May 30, 2008

A more generic scsi.d

I had reason to want to be able to decode scsi_pkt structures in dtrace again this week and it struck me that that is really what scsi.d does. So I went about modifying it to be able to translate the structures in the more general case. The result is that it is now moderately simple to get it to decode a scsi_pkt structure in any routine that can access them.

If that is the boat you are into, and clearly I am, then download the latest scsi.d script (version 1.13 or later) and look for the “FRAMEWORK” comments, which show you where you need to add your code.

I've also added a few more SCSI commands to the commands that scsi.d will understand. These are mostly from the set of commands used by tape devices.

As an example I have left in the script the probe I was interested in, which was for working out why we were seeing ssd report illegal requests: which command does the target not like?

01054.352580740 scsi_vhci0:ILL 0x1d SEND_DIAGNOSTIC address 00:00, lba 0x100100, len 0x000010, control 0x00 timeout 300 CDBP 6002a2681c0 sched(0) cdb(6) 1d1001001000


Then it is a simple matter to explain why this is not a problem. Get the monitoring software that is generating these commands to not talk to the targets that don't understand them.

Tuesday Mar 25, 2008

Automatic opening a USB disk on Sun Ray

One of my users had a bit of a hissy fit today when she plugged her USB thumb drive into the Sun Ray and it did nothing. That is, it did nothing visible. Behind the scenes the drive had been mounted somewhere, but there was no realistic way she could know this.

So I need a way to get the file browser to open when the drive is inserted. A quick google finds “"USB Drive" daemon for Sun Ray sessions”, which looks like the answer. The problem I have with this is that it polls to see if there is something mounted. Given my users never log out, this would mean it running, on average, every second. Also, the 5 second delay just does not take into account the attention span of a teenager.

There has to be a better way.

My solution is to use dtrace to see when the file system has been mounted and then run nautilus with that directory.
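I won't reproduce the whole thing here, but a minimal sketch of the idea looks like this (hypothetical; the real script also drops its extra privilege, as described below, before running anything on the user's behalf):

#!/bin/ksh
# Sketch of utmountd: report each successful mount(2) and open the
# mount point in the file browser.
/usr/sbin/dtrace -q -n '
syscall::mount:entry
{
        self->dir = copyinstr(arg1);    /* arg1 is the mount point */
        self->inmount = 1;
}

syscall::mount:return
/ self->inmount && (int)arg1 == 0 /     /* return value 0 == success */
{
        printf("%s\n", self->dir);
}

syscall::mount:return
/ self->inmount /
{
        self->inmount = 0;
        self->dir = 0;
}' | while read dir
do
        nautilus "$dir" &
done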

The great thing about Solaris 10 and later is that I can give the script just the privilege that allows it to run dtrace, without handing out access to the world. Then of course the script can give that privilege away when it no longer needs it.

So I came up with this script. Save it. Mine is in /usr/local, which in turn is a symbolic link to /tank/fs/local. Then add an entry to /etc/security/exec_attr, substituting the correct absolute path (i.e. one with no symbolic links in it) in the line.

Basic Solaris User:solaris:cmd:::/tank/fs/local/bin/utmountd:privs=dtrace_kernel

This gives the script just enough privileges to allow it to work. It then drops the extra privilege so that when it runs nautilus it has no extra privileges.

Then you just have to arrange for users to run the script when they login using:

pfexec /usr/local/bin/utmountd

I have done this by creating a file called /etc/dt/config/Xsession.d/utmountd that contains these lines:


pfexec /usr/local/bin/utmountd &
trap "kill $!" EXIT

I leave making this work for users of CDE as an exercise for the reader.

Wednesday Feb 27, 2008

Latency Bubbles follow up

Following on from the latency bubbles in your IO posting, I have been asked two questions about it privately:

  1. How can you map those long numbers in the output into readable entries, e.g. sd0?

  2. How can I confirm that disksort has been turned off?

The first one just requires another glob of D:

#pragma D option quiet

#define SD_TO_DEVINFO(un) ((struct dev_info *)((un)->un_sd->sd_dev))

#define DEV_NAME(un) \
        stringof(`devnamesp[SD_TO_DEVINFO(un)->devi_major].dn_name) /* ` */

#define DEV_INST(un) (SD_TO_DEVINFO(un)->devi_instance)


fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        bstart[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ arg0 != 0 /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->buf /
{
        this->priv = (struct sd_xbuf *) this->buf->b_private;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->priv /
{
        this->un = this->priv->xb_un;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ this->buf && bstart[this->buf] && this->un /
{
        @l[DEV_NAME(this->un), DEV_INST(this->un)] =
                lquantize((timestamp - bstart[this->buf])/1000000, 0,
                60000, 60000);
        @q[DEV_NAME(this->un), DEV_INST(this->un)] =
                quantize((timestamp - bstart[this->buf])/1000000);
        bstart[this->buf] = 0;
}


The second requires a little bit of mdb. Yes, you could also get the same from dtrace, but mdb gives the immediate answer, firstly for all the disks that use the sd driver and then for instance 1 (a value of 0 means disksort has not been disabled):

# echo '*sd_state::walk softstate | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0
60000e23f2b unsigned un_f_disksort_disabled = 0
# echo '*sd_state::softstate 1 | ::print -at "struct sd_lun" un_f_disksort_disabled' | mdb -k
300000ad46b unsigned un_f_disksort_disabled = 0

Thursday Feb 14, 2008

Dtrace bites with tail optimisation

I really should have caught this when I was using a shared shell on the customer's system. However, I did not. If you want to dtrace the return values of tail-call optimised functions you cannot always trust what dtrace says. Given this code:


	size_t size;
	int error;

	VERIFY(nvlist_size(nvl, &size, NV_ENCODE_NATIVE) == 0);

and I want to know what ends up getting written into size, I did the following:


dtrace -n '
fbt:genunix:nvlist_size:entry
{
        self->size = args[1];
        printf("size ret = %p %d", self->size, \*self->size);
}
fbt::nvlist_size:return
{
        printf("size ret = %p %d", self->size, *self->size);
        self->size = 0;
}' -c "zlogin $ZONENAME /usr/sbin/zfs list"
dtrace: description '
fbt:genunix:nvlist_size:entry
' matched 2 probes
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              122M   134G  25.5K  /tank
tank/chris         51K   134G  26.5K  /tank/chris
tank/chris/spam  24.5K   134G  24.5K  /tank/chris/spam
dtrace: pid 19681 has exited
CPU     ID                    FUNCTION:NAME
 14  17668                nvlist_size:entry size ret = 2a100799770 7696589036160
 14  17669               nvlist_size:return size ret = 2a100799770 7696589036160

then scratched my head as to why it appeared that the value was not being written to. I expected the final value printed at nvlist_size:return above to contain my result, so why had it not been written to?

int
nvlist_size(nvlist_t *nvl, size_t *size, int encoding)
{
	return (nvlist_common(nvl, NULL, size, encoding, NVS_OP_GETSIZE));
}

Looking at the source to nvlist_size() it is clearly going to get tail-call optimised, as the assembler shows:

> nvlist_size::dis
nvlist_size:                    sra       %o2, 0x0, %o3
nvlist_size+4:                  mov       %o1, %o5
nvlist_size+8:                  mov       %o5, %o2
nvlist_size+0xc:                mov       %o7, %g1
nvlist_size+0x10:               clr       %o1
nvlist_size+0x14:               mov       0x2, %o4
nvlist_size+0x18:               call      -0x168        <nvlist_common>
nvlist_size+0x1c:               mov       %g1, %o7
> 

Luckily tracing nvlist_common() can still get to the correct information:

dtrace -n '
fbt::nvlist_common:entry
{
        self->nvlist_common_size = args[2];
        printf("%p %d\\n",
                self->nvlist_common_size, \*self->nvlist_common_size);
}
fbt::nvlist_common:return
{
        printf("%p %d\\n",
                self->nvlist_common_size, \*self->nvlist_common_size);
}

fbt:genunix:nvlist_size:entry
{
        self->size = args[1];
        printf("size ret = %p %d", self->size, \*self->size);
}
fbt::nvlist_size:return
{
        printf("size ret = %p %d", self->size, \*self->size);

        self->size = 0;
}'  -c "zlogin $ZONENAME /usr/sbin/zfs list"
dtrace: description '
fbt::nvlist_common:entry
' matched 4 probes
NAME              USED  AVAIL  REFER  MOUNTPOINT
tank              122M   134G  25.5K  /tank
tank/chris         51K   134G  26.5K  /tank/chris
tank/chris/spam  24.5K   134G  24.5K  /tank/chris/spam
dtrace: pid 19654 has exited
CPU     ID                    FUNCTION:NAME
14  17668                nvlist_size:entry size ret = 2a1014e5680 2890534899393
 14  17669               nvlist_size:return size ret = 2a1014e5680 2890534899393
 14   9628              nvlist_common:entry 2a1014e5680 2890534899393

 14   9629             nvlist_common:return 2a1014e5680 944

Once you have been bitten, you discover that the documentation does tell you to beware of this. What would be nice would be a way of warning the user that they are at risk.

Friday Jan 11, 2008

Latency bubbles in your disk IO

The following was written in response to an email from a customer about monitoring IO, prompted by my scsi.d postings. Tim covers where disk IO requests can be queued in his posting titled “Where can I/O queue up in sd/ssd”, which I would recommend as a starting point.

The disk IO sub-systems are built to provide maximum throughput, which is most often the right thing. However, the weakness of tuning for throughput is that occasionally you can get some bizarre behaviour when it comes to latency. The way that optimum IO bandwidth is achieved is by sorting each IO by logical block address (LBA) and then issuing the IOs in that order to minimize head seek. This is documented in the disksort(9F) manual page.

So if you have a sequence of writes to blocks N, N+1, N+2, N-200, N+3, N+4, N+5, N+6, N+7, in that order, and your LUN has a queue depth, and therefore throttle, of 2.1 The IOs will actually be delivered to the LUN in this order: N, N+1, N+2, N+3, N+4, N+5, N+6, N+7, N-200. Hence there will be significant latency applied to the IO going to LBA N-200, and in practice it is possible to have IO requests delayed on the waitq for many seconds (I have a pathological test case that can hold them there for the time it takes to perform an IO on nearly every block on the LUN, literally hours). You had better hope that that IO was not your important one!

This issue only comes into play once the disk driver has reached the throttle for the device, as up until that point each IO can be passed straight to the LUN for processing.2 Once the driver has reached the throttle for the LUN it begins queuing IO requests internally and by default will sort them to get maximum throughput. Clearly, the lower the throttle, the sooner you get into this potential scenario.

Now for the good news. For most disk arrays sorting by LBA does not make much sense, since the LUN will be made up of a number of drives and there will be a read cache and a write cache. So for these devices it makes sense to disable disksort and deliver the IO requests to the LUN in the order in which they are delivered to the disk driver. If you look in the source for sd.c you will see that we do this by default for most common arrays. To achieve this there is a flag, “disable disksort”, that can be set in sd.conf or ssd.conf depending on which driver is in use. See Michael's blog entry about editing sd.conf. While you are reading that entry, note that you can use it to set the throttle for individual LUNs, so you do not have to set [s]sd_max_throttle, which penalizes all devices rather than just the one you were aiming at, which may be the only one with a small queue depth; you will see below why a small queue depth can be a really bad thing.
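For illustration only, since the exact property syntax varies between releases (check Michael's entry for the details of your release), a per-device entry in sd.conf takes roughly this shape, with the vendor ID padded to 8 characters:

# Hypothetical sd.conf entry: cap the throttle and disable disksort
# for one array type, leaving every other device alone.
sd-config-list = "SUN     MyArray", "throttle-max:32, disksort:false";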

So how could you spot these latency bubbles?

It will come as no surprise that the answer is dtrace. Using my pathological test case, but with it set to run for only 10 minutes against a single spindle, the following D produces a clear indication that all is not well:

fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        start[(struct buf *)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
        this->un = ((struct sd_xbuf *) this->buf->b_private)->xb_un;
        @[this->un] = lquantize((timestamp - start[this->buf])/1000000,
                 60000, 600000, 60000);
        @q[this->un] = quantize((timestamp - start[this->buf])/1000000);

        start[this->buf] = 0;
}

This produces the following output3; the times are in milliseconds:


dtrace: script 'ssdrwtime.d' matched 4 probes
^C

 

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204   
           60000 |                                         0        
          120000 |                                         0        
          180000 |                                         0        
          240000 |                                         0        
          300000 |                                         0        
          360000 |                                         0        
          420000 |                                         0        
          480000 |                                         2        
          540000 |                                         300      
       >= 600000 |                                         0        

 
    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         40       
               1 |                                         9        
               2 |                                         6        
               4 |                                         17       
               8 |                                         23       
              16 |                                         6        
              32 |                                         36       
              64 |@@                                       15407    
             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660   
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         0        
            4096 |                                         0        
            8192 |                                         0        
           16384 |                                         0        
           32768 |                                         0        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         302      
         1048576 |                                         0        

Now recall that my test case is particularly unpleasant, but it demonstrates the point: 300 IO requests took over 9 minutes, and they only actually got to complete as the test case was shutting down, while the vast majority of the IO requests completed in less than 256ms.


Now let's run the same pathological test with disksort disabled:

dtrace: script 'ssdrwtime.d' matched 4 probes
^C

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956   
           60000 |                                         0        


    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         30       
               1 |                                         21       
               2 |                                         30       
               4 |                                         0        
               8 |                                         0        
              16 |                                         50       
              32 |                                         3        
              64 |                                         384      
             128 |                                         505      
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169   
             512 |@                                        11764    
            1024 |                                         0        

Notice that the majority of the IO requests took longer now, falling in the 256ms bucket rather than the 128ms bucket, but none of the IO requests took many minutes.


Now, my test case is pathological, but if you have drives with small queue depths and disksort is still enabled you are open to some quite spectacular latency bubbles. To mitigate this my advice is:

  1. Don't ever set the global [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set the appropriate throttle for each device.

  2. Consider what is more important to you: throughput or latency. If it is latency, or if your LUN is on a storage array, then turn off disksort using the [s]sd.conf file.

  3. If you have pathological applications then understand that the IO subsystem can give you throughput or bounded latency, not both. So separate out the IO devices that need throughput from those for which latency is more important.

  4. Be aware that even “dumb” disk drives often implement disksort internally, so in some cases they can give similar issues when they have a queue depth of greater than 2.4 In those cases you may find it better to throttle them down to a queue depth of 2 and disable disksort in [s]sd to get the most predictable latency, albeit at the expense of throughput. If this is your issue then you can spot it either by using scsi.d directly or by modifying it to produce aggregations like those above. I'll leave that as an exercise for the reader.


1The queue depth of a LUN is the number of commands that it can handle at the same time. The throttle is usually set to the same number and is used by the disk driver to prevent it sending more commands than the device can cope with.

2Now the LUN itself may then reorder the IOs if it has more than two IOs in its internal queue.

3Edited to remove output for other drives.

4With a queue depth of 2 the drive cannot sort the IO requests: it has to have one as active and the next one waiting. When the active one completes, the waiting one will be actioned before a new command can come from the initiator.

Thursday Dec 06, 2007

Unable to mount the device? dtrace tells me why.

I had to copy a large movie file1 to work, so rather than battle the corporate firewall, or pack my laptop in my pannier, I thought I would just load it onto the USB pen drive I got at CEC.

When I inserted the pen drive, instead of nautilus popping up with the contents I got a pop-up saying:

Unable to mount volume 'NO NAME'

No clue why though. A quick dtrace:

1192 # dtrace -n 'syscall::mount:entry { printf("%s %s %s", copyinstr(arg0), copyinstr(arg1), copyinstr(arg3)) } syscall::mount:return { printf("%s %d %d", execname, (int) arg1, errno) }'
dtrace: description 'syscall::mount:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0  98800                      mount:entry /dev/dsk/c5t0d0p0:1 /media/NO NAME pcfs
  0  98801                     mount:return mount -1 30

and we have the answer. Whoever is calling mount assumes the file system is read-write but fails to try read-only when the mount fails (errno 30 is EROFS). Flicking the switch on the side of the pen drive to read-write, something I was going to need anyway, and reinserting it, and I was away. Working this out before dtrace would have been horrible; this was so quick to do that it did not interfere with my breakfast.


Now I have to file the bug.


1The file was my first attempt at recording a presentation using http://www.screencast-o-matic.com/ from which I learned a number of things. First, while I can talk to a room full of engineers, or even on a telephone, without needing or wanting any form of script and without stuttering or tripping over my words, when faced with a microphone that is recording I seem to have a problem. Secondly, good though screencast-o-matic is, it still took me over an hour to record 15 minutes of content, and even that content is so bad I'm not putting it on the web. Clearly something I need to work on.

Tuesday Jun 19, 2007

Where are all the log files?

Today's question was:

Is there a list of all the default log files that are used in Solaris?

Not that I know of. Since most software can be configured to log anywhere you wish, it would be an impossible task to come up with a complete list that was of any practical benefit.

However there are some places to go looking for log files:

  1. The file /etc/syslog.conf will contain the names of logfiles written to via syslog.

  2. The directory /var/svc/log is the default location for log files from SMF. These files are connected to a daemon's standard out and standard error, so they can grow.

  3. Then the files in /etc/default will define log files for services that are not using syslog, for example /var/adm/sulog.

So, having ticked off those log files, you need a strategy for maintaining them; mine is to keep 100k of log for the logs in /var/svc/log and let logadm(1M) look after them. I keep sulog forever and clean it by hand, as I'm paranoid. Configuring logadm to look after the SMF logs is easy:

for i in /var/svc/log/*.log
do
logadm -w $i -C1 -c -s100k
done
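Each logadm -w writes an entry into /etc/logadm.conf, so after the loop you end up with one line per SMF log of roughly this form (illustrative):

/var/svc/log/system-filesystem-autofs:default.log -C 1 -c -s 100k

From then on logadm, run from root's crontab, rotates any of them that have grown past 100k, keeping one old copy.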

So how can I be sure that there are no more log files out there? You could use find to find all the files modified in the last 24 hours; however, this will get you a lot of false positives. Since what is really interesting is the active log files in the “/” and “/var” file systems, I can use dtrace to find them by running this script for a few hours:

syscall::write:entry
/ ! (execname == "ksh" && arg0 == 63 ) &&
    fds[arg0].fi_oflags & O_APPEND &&
    (fds[arg0].fi_mount == "/" || fds[arg0].fi_mount == "/var" )/
{
        @logs[fds[arg0].fi_pathname] = count();
        logfiles[ fds[arg0].fi_pathname]++
}
syscall::write:entry
/ logfiles[ fds[arg0].fi_pathname] == 1 &&
    ! (execname == "ksh" && arg0 == 63 ) &&
    fds[arg0].fi_oflags & O_APPEND &&
    (fds[arg0].fi_mount == "/" || fds[arg0].fi_mount == "/var" )/
{
        printf("%s %s", fds[arg0].fi_fs, fds[arg0].fi_pathname);
}

Half an hour of that gives me:

# dtrace -s /home/cjg/lang/d/log.d
dtrace: script '/home/cjg/lang/d/log.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0   4575                      write:entry ufs /var/cron/log
  0   4575                      write:entry ufs /var/adm/wtmpx
  0   4575                      write:entry ufs /var/adm/sulog
  0   4575                      write:entry ufs /var/adm/messages
  0   4575                      write:entry ufs /var/apache2/logs/access_log
  0   4575                      write:entry ufs /var/svc/log/system-filesystem-autofs:default.log
  0   4575                      write:entry ufs /var/log/syslog
  0   4575                      write:entry ufs /var/log/exim/mainlog
^C

  /var/adm/messages                                                 1
  /var/adm/sulog                                                    2
  /var/adm/wtmpx                                                    2
  /var/svc/log/system-filesystem-autofs:default.log                 41
  /var/apache2/logs/access_log                                      7
  /var/log/exim/mainlog                                            28
  /var/log/syslog                                                  42
  /var/cron/log                                                  16772
# 

Clearly there is still scope for false positives, files in /var/tmp that are opened O_APPEND for example, or if you use a different shell, but it gives a very good starting point.



1The autofs log file has been written to thanks to me using the well hidden feature of being able to turn automounter verbose mode on and off by accessing the file “=v”, as root, in the root of an indirect mount point. Typically this is “/net/=v”. Equally, you can set the trace level by accessing “/net/=N” where N is any integer.

2Cron is so busy as I am still running the test jobs for the timezone aware cron.

Friday Mar 16, 2007

scsi.d updates are like busses.

None for months, then three come along one after another. This latest update is a direct result of my failing to test the script on Solaris 10. Some of the things that work on Nevada are not quite there on Solaris 10.

Specifically, in Nevada you can do this:

this->name = "->"

But not in Solaris 10. So the name now has to be a flag that is then checked later.

As always the new script is in the scsi.d directory; this one is version 1.12.

On the up-side, the bugs were reported within hours, as there are actually people using this script to do useful work, which makes me realise it was worth putting out there.


Wednesday Mar 14, 2007

Bug fix for scsi.d

Today, well last night, I learned that dtrace clause-local variables (this->), unlike thread-local variables (self->), are not initialised by default. This is all clearly stated in the documentation, had I read it with more care. My thanks to Adam on the OpenSolaris zfs-discuss alias for pointing it out to me. The long and the short of this is that when I changed scsi.d to use clause-local variables I needed to also ensure that they were initialised before testing them.
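As a sketch of the trap (hypothetical probes, but the same shape as the scsi.d code): a clause-local that is only assigned under a predicate must be cleared in an earlier clause, otherwise a later test can see stale junk:

/* Initialise the clause-local first; it is NOT guaranteed to be zero. */
fbt:sd:sdintr:entry
{
        this->buf = 0;
}

/* Only assigned when there is a packet to look at... */
fbt:sd:sdintr:entry
/ arg0 != 0 /
{
        this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private;
}

/* ...so this test is only safe because of the first clause. */
fbt:sd:sdintr:entry
/ this->buf /
{
        printf("buf %p\n", this->buf);
}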

While I was editing it I also added support for Group 6 and 7 SCSI commands. Since these commands are vendor specific it just prints out the first 32 bytes of the CDB, or the whole CDB if it is less than 32 bytes. In the event that you need more than 32 bytes to be printed adding this is trivial. The section of code, with line numbers, that prints out bytes 25 to 32 is listed here:

   554  PRINT_CDB(24)
   555  PRINT_CDB(25)
   556  PRINT_CDB(26)
   557  PRINT_CDB(27)
   558  PRINT_CDB(28)
   559  PRINT_CDB(29)
   560  PRINT_CDB(30)
   561  PRINT_CDB(31)

There are no prizes for working out how to make it print byte 33.

So there is version 1.11 of scsi.d.


Tuesday Mar 13, 2007

scsi.d improvements.

With thanks to Brendan Gregg for the encouragement, I have improved the argument handling in scsi.d. I say “improved” carefully, as it is still not right, mainly because I am too stubborn to wrap the thing in a shell script when dtrace can so nearly do what I want. The new argument syntax is as follows:

scsi.d [ @T timeinseconds ] [ hba [ instance ]]

The “@T” is because dtrace won't let you use “-T”, as it does not stop argument processing after reading the arguments on the first line of the script. If/when it does, I will change it to “-T”.

So now you can trace a particular HBA without setting a timeout, something you previously could not do and all the strange quotes are no longer required.
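So, hypothetically, to trace the fp HBA instance 5 for 10 seconds you would now run:

pfexec ./scsi.d @T 10 fp 5

and to watch the same HBA with no timeout at all you would just drop the @T pair:

pfexec ./scsi.d fp 5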

Additionally, all the thread-local variables have become clause-local variables (“self” becomes “this”), which should reduce the impact of the script on the system.

So I bring you version 1.10 of scsi.d.


Friday Dec 29, 2006

scsi.d update

A small fix to scsi.d provided by a blogless colleague results in version 1.7.

My thanks to him.


Thursday Jul 13, 2006

scsi.d update

I have added printing of the name of the executable and the process ID that initiates an IO so that it is easier to see who is causing all those SCSI commands to be sent.

Then, for those who just have to have the raw bits to be happy, I have updated scsi.d to dump out the raw CDB as well.

00000.627329400 fp5:-> 0x2a WRITE(10) address 00:00, lba 0x0001bfe9, len 0x000001, control 0x00 timeout 60 CDBP 600d7881d1c diskomizer64mpis(23849) cdb(10) 2a000001bfe900000100
00000.788444600 fp5:-> 0x2a WRITE(10) address 00:00, lba 0x00de6380, len 0x000010, control 0x00 timeout 60 CDBP 600a4282abc diskomizer64mpis(23847) cdb(10) 2a0000de638000001000

You can find the script here: http://blogs.sun.com/roller/resources/chrisg/scsi.d



Friday Feb 10, 2006

scsi.d script

After some feedback about the format of the output from my Dtrace script for looking at SCSI IO, I have now added a timestamp, which helps with sorting the output. The output is now cleaner and hopefully clearer, though it does not fit on an 80 column screen.

00000.844267200 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 60031134488 SDB 60031134518
00000.844354400 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08
00000.848251440 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 6001dd1ba50 SDB 6001dd1bae0
00000.848310720 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 3001da270f8 SDB 3001da27188
00000.850371280 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08, reason 0x0 (COMPLETED) state 0x5f Time 6084us
00000.851151040 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 60031134488 SDB 60031134518, reason 0x0 (COMPLETED) state 0x5f Time 6927us
00000.853292800 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 3001da270f8 SDB 3001da27188, reason 0x0 (COMPLETED) state 0x5f Time 5014us
00000.854442400 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 6001dd1ba50 SDB 6001dd1bae0, reason 0x0 (COMPLETED) state 0x5f Time 6226us
00002.839392160 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 3001da263c0 SDB 3001da26450
00002.839482480 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 60002cb4538 SDB 60002cb45c8
00002.849052160 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 60002cb4538 SDB 60002cb45c8, reason 0x0 (COMPLETED) state 0x5f Time 9630us
00002.850171840 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 3001da263c0 SDB 3001da26450, reason 0x0 (COMPLETED) state 0x5f Time 10824us
00003.840019440 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08
00003.840110160 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 30014c0c780 SDB 30014c0c810
00003.846265280 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 30014c0c780 SDB 30014c0c810, reason 0x0 (COMPLETED) state 0x5f Time 6205us
00003.847439680 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08, reason 0x0 (COMPLETED) state 0x5f Time 7470us

Lots of “fun” games can be played with this; for example, the above shows that this system has target 0 and target 6 forming a mirror, making isp1 a single point of failure. Although my favourite is this one:


While running


# dd if=/dev/rdsk/c0t8d0s2 of=/dev/null oseek=1024 iseek=$(( 16#1fffff )) count=2
2+0 records in
2+0 records out
#

I get the following trace:


00001.971470332 qus1:-> 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a94f0 SDB 300016a9520
00001.972324082 qus1:<- 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a94f0 SDB 300016a9520, reason 0x0 (COMPLETED) state 0x17 Time 937us
00001.972433832 qus1:-> 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a9d90 SDB 300016a9dc0
00001.973217082 qus1:<- 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a9d90 SDB 300016a9dc0, reason 0x0 (COMPLETED) state 0x17 Time 826us
00001.973324748 qus1:-> 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000300, len 0x000024, control 0x00 timeout 60 CDB 300016a9380 SDB 300016a93b0
00001.976352165 qus1:<- 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000300, len 0x000024, control 0x00 timeout 60 CDB 300016a9380 SDB 300016a93b0, reason 0x0 (COMPLETED) state 0x5f Time 3070us
00001.976443415 qus1:-> 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000400, len 0x000024, control 0x00 timeout 60 CDB 300016a9ab0 SDB 300016a9ae0
00001.979359665 qus1:<- 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000400, len 0x000024, control 0x00 timeout 60 CDB 300016a9ab0 SDB 300016a9ae0, reason 0x0 (COMPLETED) state 0x5f Time 2959us
00001.979453248 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a9c20 SDB 300016a9c50
00001.979814748 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a9c20 SDB 300016a9c50, reason 0x0 (COMPLETED) state 0x5f Time 403us
00001.979898415 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a90a0 SDB 300016a90d0
00001.980151165 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a90a0 SDB 300016a90d0, reason 0x0 (COMPLETED) state 0x5f Time 294us
00001.980507332 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x001fffff, len 0x000001, control 0x00 timeout 60 CDB 300016a9660 SDB 300016a9690
00001.993267665 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x001fffff, len 0x000001, control 0x00 timeout 60 CDB 300016a9660 SDB 300016a9690, reason 0x0 (COMPLETED) state 0x5f Time 12804us
00001.993382498 qus1:-> 0x28 ( READ(10)) address 08:00, lba 0x00200000, len 0x000001, control 0x00 timeout 60 CDB 300016a9940 SDB 300016a9970
00001.999256915 qus1:<- 0x28 ( READ(10)) address 08:00, lba 0x00200000, len 0x000001, control 0x00 timeout 60 CDB 300016a9940 SDB 300016a9970, reason 0x0 (COMPLETED) state 0x5f Time 5921us


I like it as you can see the transition from READ(6) to READ(10) as it moves from LBA 0x1fffff to 0x200000. Did I mention needing to get out more?


You can get the script here. Still to do is correct decoding of CDBs bigger than 10 bytes, which is not a problem for my current systems, and more detailed decoding of CDBs that are not reads and writes.



About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com
