Friday Jul 31, 2009

Adding a Dtrace provider to the kernel

Since writing scsi.d I have been pondering if there should really be a scsi dtrace provider that allows you to do all that scsi.d does and more. Since the push of 6797025 that both removed the main reason for not doing this and also gave impetus to do it as scsi.d needed incompatible changes to use the new return function as the return “probe”.

This work is very much work in progress and may or may not see the light of day due to some other issues around scsi addressing, however I thought I would document how I added a kernel dtrace provider so if you want to you don't have to do so much searching1.

Adding the probes themselves is simplicity itself using the DTRACE_PROBEN() macros. Following the convention I added this macro:

#define	DTRACE_SCSI_2(name, type1, arg1, type2, arg2)			\\
	DTRACE_PROBE2(__scsi_##name, type1, arg1, type2, arg2);

to usr/src/uts/common/sys/sdt.h. Then after including <sys/sdt.h> in each file I put this macro in each of the places I wanted my probes:

 	DTRACE_SCSI_2(transport, struct scsi_pkt \*, pkt,
 	    struct scsi_address \*, P_TO_ADDR(pkt))

The bit that took a while to find was how to turn these into a provider. To do that edit the file “usr/src/uts/common/dtrace/sdt_subr.c” and create the attribute structure2:

 static dtrace_pattr_t scsi_attr = {
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_UNKNOWN },
 { DTRACE_STABILITY_PRIVATE, DTRACE_STABILITY_PRIVATE, DTRACE_CLASS_ISA },
 { DTRACE_STABILITY_EVOLVING, DTRACE_STABILITY_EVOLVING, DTRACE_CLASS_ISA },
 };

and add it to the sdt_providers array:


	{ "scsi", "__scsi_", &scsi_attr, 0 },

than add the probes to the sdt_args array:

	{ "scsi", "transport", 0, 0, "struct scsi_pkt \*", "scsi_pktinfo_t \*"},
	{ "scsi", "transport", 1, 1, "struct scsi_address \*", "scsi_addrinfo_t \*"},
	{ "scsi", "complete", 0, 0, "struct scsi_pkt \*", "scsi_pktinfo_t \*"},
	{ "scsi", "complete", 1, 1, "struct scsi_address \*", "scsi_addrinfo_t \*"},

Finally you need to create a file containing the definitions of the output structures, scsi_pktinfo_t and scsi_addrinfo_t and define translators for them. That goes into /usr/lib/dtrace and I called mine scsa.d (there is already one called scsi.d).

/\*
 \* CDDL HEADER START
 \*
 \* The contents of this file are subject to the terms of the
 \* Common Development and Distribution License (the "License").
 \* You may not use this file except in compliance with the License.
 \*
 \* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 \* or http://www.opensolaris.org/os/licensing.
 \* See the License for the specific language governing permissions
 \* and limitations under the License.
 \*
 \* When distributing Covered Code, include this CDDL HEADER in each
 \* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 \* If applicable, add the following below this CDDL HEADER, with the
 \* fields enclosed by brackets "[]" replaced with your own identifying
 \* information: Portions Copyright [yyyy] [name of copyright owner]
 \*
 \* CDDL HEADER END
 \*/
/\*
 \* Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 \* Use is subject to license terms.
 \*/

#pragma D depends_on module scsi
#pragma D depends_on provider scsi

inline char TEST_UNIT_READY = 0x0;
#pragma D binding "1.0" TEST_UNIT_READY
inline char REZERO_UNIT_or_REWIND = 0x0001;
#pragma D binding "1.0" REZERO_UNIT_or_REWIND

inline char SCSI_HBA_ADDR_COMPLEX = 0x0040;
#pragma D binding "1.0" SCSI_HBA_ADDR_COMPLEX

typedef struct scsi_pktinfo {
	caddr_t pkt_ha_private;
	uint_t	pkt_flags;
	int	pkt_time;
	uchar_t \*pkt_scbp;
	uchar_t \*pkt_cdbp;
	ssize_t pkt_resid;
	uint_t	pkt_state;
	uint_t 	pkt_statistics;
	uchar_t pkt_reason;
	uint_t	pkt_cdblen;
	uint_t	pkt_tgtlen;
	uint_t	pkt_scblen;
} scsi_pktinfo_t;

#pragma D binding "1.0" translator
translator scsi_pktinfo_t  < struct scsi_pkt \*P > {
	pkt_ha_private = P->pkt_ha_private;
	pkt_flags = P->pkt_flags;
	pkt_time = P->pkt_time;
	pkt_scbp = P->pkt_scbp;
	pkt_cdbp = P->pkt_cdbp;
	pkt_resid = P->pkt_resid;
	pkt_state = P->pkt_state;
	pkt_statistics = P->pkt_statistics;
	pkt_reason = P->pkt_reason;
	pkt_cdblen = P->pkt_cdblen;
	pkt_tgtlen = P->pkt_tgtlen;
	pkt_scblen = P->pkt_scblen;
};

typedef struct scsi_addrinfo {
	struct scsi_hba_tran	\*a_hba_tran;
	ushort_t a_target;	/\* ua target \*/
	uchar_t	 a_lun;		/\* ua lun on target \*/
	struct scsi_device \*a_sd;
} scsi_addrinfo_t;

#pragma D binding "1.0" translator
translator scsi_addrinfo_t  < struct scsi_address \*A > {
	a_hba_tran = A->a_hba_tran;
	a_target = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_target;
	a_lun = !(A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		0 : A->a.spi.a_lun;
	a_sd = (A->a_hba_tran->tran_hba_flags & SCSI_HBA_ADDR_COMPLEX) ?
		A->a.a_sd : 0;
};

again this is just enough to get going so I can see and use the probes:

jack@v4u-2500b-gmp03:~$ pfexec dtrace -P scsi -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
 1303       scsi              scsi                    scsi_transport transport
 1313       scsi              scsi                 scsi_hba_pkt_comp complete
jack@v4u-2500b-gmp03:~$ 

While this all works well for parallel scsi getting the address of devices on fibre is not clear to me. If you have any suggestions I'm all ears.

1If there is such a document already in existence then please add a comment. I will just wish I could have found it.

2These may not be the right attributes but gets me to the point it compiles and can be used in a PoC.

Friday Mar 20, 2009

Limiting scsi.d output to IOs that take a long time.

I have added support for only reporting on scsi packets that take more than a certain amount of time to complete to scsi.d. This is partularly useful when combined with the PERF_REPORT option to see just the IO requests that took over a time threshold and still collect statistics on all the other IO requests.

To use this you have to specify “-D REPORT_OVERTIME=X” where X is the time you wish to report on in nano seconds. Here is an example that will only print the details of IO requests that took more than 250ms:

# scsi.d -D REPORT_OVERTIME=$((250\*1000\*1000)) -D PERF_REPORT 
Only reporting IOs longer than 250000000ns
Hit Control C to interrupt
00005.388800363 glm0:-> 0x2a WRITE(10) address 00:00, lba 0x00270e67, len 0x000008, control 0x00 timeout 60 CDBP 300002df438 1 sched(0) cdb(10) 2a0000270e6700000800
00005.649494475 glm0:<- 0x2a WRITE(10) address 00:00, lba 0x00270e67, len 0x000008, control 0x00 timeout 60 CDBP 300002df438, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 260833us
00005.384612799 glm0:-> 0x0a  WRITE(6) address 00:00, lba 0x048564, len 0x000001, control 0x00 timeout 60 CDBP 30002541ac0 1 sched(0) cdb(6) 0a0485640100
00005.716416658 glm0:<- 0x0a  WRITE(6) address 00:00, lba 0x048564, len 0x000001, control 0x00 timeout 60 CDBP 30002541ac0, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 331939us
00005.385907691 glm0:-> 0x0a  WRITE(6) address 00:00, lba 0x0605b4, len 0x000001, control 0x00 timeout 60 CDBP 300035637a0 1 sched(0) cdb(6) 0a0605b40100
00005.773925990 glm0:<- 0x0a  WRITE(6) address 00:00, lba 0x0605b4, len 0x000001, control 0x00 timeout 60 CDBP 300035637a0, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 388153us
00005.389078533 glm0:-> 0x2a WRITE(10) address 00:00, lba 0x004b19d3, len 0x000003, control 0x00 timeout 60 CDBP 300002df0f8 1 sched(0) cdb(10) 2a00004b19d300000300
00005.824242527 glm0:<- 0x2a WRITE(10) address 00:00, lba 0x004b19d3, len 0x000003, control 0x00 timeout 60 CDBP 300002df0f8, reason 0x0 (COMPLETED) pkt_state 0x1f state 0x0 Success Time 435303us
\^C

  glm                                                       0
           value  ------------- Distribution ------------- count    
         4194304 |                                         0        
         8388608 |                                         3        
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        678      
        33554432 |@@@@@                                    98       
        67108864 |@                                        27       
       134217728 |                                         8        
       268435456 |                                         3        
       536870912 |                                         0        

# 

The implementation of this uses dtrace speculations and as such may require some tuning of the various settings. Specifically the number of speculation buffers I have set to 1000 which should be enough for all but the busiest systems but if you do reach that limit you can increase them using the following options:


-D NSPEC=N

Set the number of speculations to this value.

-D SPECSIZE=N

Set the size of the speculaton buffer. This should be 200 times the size of NSPEC.

-D CLEANRATE=N

Specify the clean rate.

As usual the script is available here. It is version 1.18.

Tuesday Mar 10, 2009

Update to scsi.d

I've just released a new version of scsi.d that fixes a strange bug that only effected 10 but would result in scsi.d not starting:

# dtrace -qCs scsi.d-1.16  
dtrace: failed to compile script scsi.d-1.16: "/usr/include/sys/scsi/scsi_pkt.h", line 33: incomplete struct/union/enum struct scsi_address: pkt_address
# 

The solution was simple even if I still don't really understand why the header included fine in OpenSolaris but not in 10.

Tuesday Jan 27, 2009

New version of scsi.d required for build 106

This version supports some more filters. Specifically you can now specify these new options:

  • MIN_BLOCK only report on IO to less than or equal to this value.

  • MAX_BLOCK only report on IO to blocks greater or equal to this value.

This is most useful for limiting your trace to particular block ranges, be they file system or as was the case that caused me to add this to see who is trampling on the disk label.

In this contrived example it was format:

pfexec /usr/sbin/dtrace -o /tmp/dt.$$ -Cs  scsi.d -D MAX_BLOCK=3 
<SNIP>
00058.529467684 glm0:-> 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560 1 format(3985) cdb(6) 0a0000000100
00058.542945891 glm0:<- 0x0a  WRITE(6) address 01:00, lba 0x000000, len 0x000001
, control 0x00 timeout 60 CDBP 60012635560, reason 0x0 (COMPLETED) pkt_state 0x1
f state 0x0 Success Time 13604us

While this answered my question there are neater ways of answering the question just by using the IO provider:

: s4u-nv-gmp03.eu TS 68 $; pfexec /usr/sbin/dtrace -n 'io:::start / args[0]->b_blkno < 3 && args[0]->b_flags & B_WRITE / { printf("%s %s %d %d", execname, args[1]->dev_statname, args[0]->b_blkno, args[0]->b_bcount) }'
dtrace: description 'io:::start ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512
  0    629             default_physio:start format sd0 0 512

Also build 106 of nevada has changed the structure definition for scsi_address and in doing so this breaks scsi.d which has intimate knowledge of scsi_address structures. I have a solution that you can download but in writing it I also filed this bug:

679803 dtrace suffers from macro recursion when including scsi_address.h

which scsi.d has to work around. When that bug is resolved the work around may have to be revisited.

All versions of scsi.d are available here and this specific verison, version 1.16 here.

Thank you to Artem Kachitchkine for bringing the changes to scsi_address.h and their effects on scsi.d to my attention.

Friday Oct 10, 2008

Gathering performance statistics with scsi.d

While scsi.d is good for looking at scsi packets and seeing those raw CDBs not many people are really interested in what a SCSI packet looks like, well not enough people if you ask me. However what is much more interesting is how long the scsi packets are taking. Now scsi.d tells you this for each packet but aggregating the data would be are more useful.

: e2big.eu TS 81 $; pfexec /usr/sbin/dtrace -Cs scsi.d -D QUIET -D PERF_REPORT -D REPORT_TARGET \\
-D REPORT_LUN -n tick-1m {printa(@); clear(@); exit(0) }
Hit Control C to interrupt

  qus                                                       1
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@                                     25       
          524288 |@@@@@@@@@@@@                             68       
         1048576 |@@@@@@                                   34       
         2097152 |                                         2        
         4194304 |@@@                                      19       
         8388608 |@@@@@                                    29       
        16777216 |@@@@                                     22       
        33554432 |@@@@@@                                   35       
        67108864 |                                         1        
       134217728 |                                         0        

  fp                                                        2
           value  ------------- Distribution ------------- count    
          262144 |                                         0        
          524288 |                                         3        
         1048576 |                                         1        
         2097152 |@@                                       15       
         4194304 |@@@@@@@@                                 67       
         8388608 |@@@@@@@@@@                               81       
        16777216 |@@@@@@@@                                 65       
        33554432 |@@@@@@@@                                 66       
        67108864 |@@@                                      27       
       134217728 |                                         0        

  fp                                                        0
           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |                                         27       
          262144 |@                                        485      
          524288 |@@@@@@                                   2901     
         1048576 |@@@@@                                    2203     
         2097152 |@@@@@@@                                  3204     
         4194304 |@@@@@@@@@                                4087     
         8388608 |@@@@@@@@                                 3978     
        16777216 |@@@                                      1606     
        33554432 |@                                        570      
        67108864 |                                         123      
       134217728 |                                         45       
       268435456 |                                         0        

  fp                                                        3
           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |                                         41       
          262144 |@                                        493      
          524288 |@@@@@@                                   2926     
         1048576 |@@@@                                     2157     
         2097152 |@@@@@@                                   3228     
         4194304 |@@@@@@@@@                                4461     
         8388608 |@@@@@@@@@                                4561     
        16777216 |@@@                                      1634     
        33554432 |@                                        510      
        67108864 |                                         116      
       134217728 |                                         52       
       268435456 |                                         2        
       536870912 |                                         0        

  scsi_vhci                                                 0
           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@                                        588      
          524288 |@@@@@                                    4807     
         1048576 |@@@@@@                                   5423     
         2097152 |@@@@@@@                                  6609     
         4194304 |@@@@@@@@@                                8627     
         8388608 |@@@@@@@@@                                8641     
        16777216 |@@@                                      3289     
        33554432 |@                                        1088     
        67108864 |                                         239      
       134217728 |                                         97       
       268435456 |                                         2        
       536870912 |                                         0        


: e2big.eu TS 82 $; 

All the new options are supplied via -D flags to dtrace and they are:

Option Name

Description

QUIET

Be quiet. Don't report any packets seen. Useful when you only want a performance report.

PERF_REPORT

Produce a per HBA performance report when the script complete. The report is an aggregation held in @ so can be printed at regular intervals using a tick probe as in the above example but without the call to exit().

REPORT_TARGET

If producing a peformance report include the target to produce per target report.

REPORT_LUN

If producing a per target report then include the LUN to produce a per lun report.

DYNVARSIZE

Pass this value to the #pragma D option dynvarsize= option. Eg:

-D DYNVARSIZE=64m


The latest version of the script, version 1.15 is here: http://blogs.sun.com/chrisg/resource/scsi_d/scsi.d-1.15

Friday Jul 18, 2008

Filtering what scsi.d processes scsi.d reports IO for.

I have updated scsi.d to allow you to see just the the commands associated with particular processes either by execname or by process id.

To see all the scsi IO associated with a process with an exec name of “dd”:

# /tmp/scsi.d -D EXECNAME='"dd"'                             
Hit Control C to interrupt
00058.416893400 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 000000000000
00058.417960685 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1208us
00058.418159142 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 000000000000
00058.419187590 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1128us
00058.419399600 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 1a0003002400
00058.443792908 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 24492us
00058.443986521 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 1a0004002400
00058.460271773 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 16385us
00058.460509036 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080000000100
00058.464377145 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 3966us
00058.464571604 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080000000100
00058.465001748 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 528us
00058.465739416 glm0:-> 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23319) cdb(6) 080186a00100
00058.471783255 glm0:<- 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 6143us


or to see all the scsi IO associated with a particular command:

# ./scsi.d -c "dd  if=/dev/rdsk/c0t0d0s0 of=/dev/null count=1 iseek=100000"
Hit Control C to interrupt
1+0 records in
1+0 records out
00000.046670200 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 000000000000
00000.047677067 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 1151us
00000.047871553 glm0:-> 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 000000000000
00000.048759564 glm0:<- 0x00 TEST_UNIT_READY address 00:00, lba 0x000000, len 0x000000, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x17 Time 988us
00000.048979511 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 1a0003002400
00000.073204217 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000300, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 24326us
00000.073406796 glm0:-> 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 1a0004002400
00000.089689153 glm0:<- 0x1a MODE_SENSE(6) address 00:00, lba 0x000400, len 0x000024, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 16385us
00000.089932178 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080000000100
00000.093793880 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 3962us
00000.093991464 glm0:-> 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080000000100
00000.094429110 glm0:<- 0x08   READ(6) address 00:00, lba 0x000000, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 538us
00000.095207716 glm0:-> 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358 1 dd(23326) cdb(6) 080186a00100
00000.101200857 glm0:<- 0x08   READ(6) address 00:00, lba 0x0186a0, len 0x000001, control 0x00 timeout 60 CDBP 300e8f41358, reason 0x0 (COMPLETED) state 0x1f Time 6096us

Similarly you can see the IO associated with a particular process:

# ./scsi.d -p $(pgrep -x dd) | head -30 | sort -n
Hit Control C to interrupt
00000.002472402 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d2, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d20100
00000.010438300 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d2, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 8156us
00000.010706063 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d3, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d30100
00000.011154931 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d3, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 560us
00000.011371674 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d4, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d40100
00000.011814640 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d4, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 552us
00000.012025254 glm0:-> 0x08   READ(6) address 00:00, lba 0x02b2d5, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318 1 dd(23328) cdb(6) 0802b2d50100
00000.012468655 glm0:<- 0x08   READ(6) address 00:00, lba 0x02b2d5, len 0x000001, control 0x00 timeout 60 CDBP 300e8f40318, reason 0x0 (COMPLETED) state 0x1f Time 552us

Clearly there are plenty of possiblities here, not least the reporting of all IO associated with a process or command and all of it's children.

This is still work in progress but since I've used it to file a bug I thought I better put it out here so that others can enjoy it. This is version 1.14.

Friday May 30, 2008

A more generic scsi.d

I had reason to want to be able to decode scsi_pkt structures in dtrace again this week and it struck me that that is really what scsi.d does. So I went about modifying it to be able to translate the structures in the more general case. The result is that it is now moderately simple to get it to decode a scsi_pkt structure in any routine that can access them.

If that is the boat you are into, and clearly I am, then download the latest scsi.d script (version 1.13 or later) and then look for the comments “FRAMEWORK” to show you where you need to add your code.

I've also added a few more SCSI commands to the commands that scsi.d will understand. These are mostly from the set of commands used by tape devices.

As an example I have left the probe I was interested in in the script, which was why we are seeing ssd reporting illegal requests. Which command does the target not like?

01054.352580740 scsi_vhci0:ILL 0x1d SEND_DIAGNOSTIC address 00:00, lba 0x100100, len 0x000010, control 0x00 timeout 300 CDBP 6002a2681c0 sched(0) cdb(6) 1d1001001000


Then it is a simple matter to explain why this is not a problem. Get the monitoring software that is generating these commands to not talk to the targets that don't understand them.

Friday Jan 11, 2008

Latency bubbles in your disk IO

The following was written in response to an email from a customer about monitoring IO in response to my scsi.d postings. Tim covers where disk IO requests can be queued in his posting titled “Where can I/O queue up in sd/ssdwhich I would recommend as a starting point.

The disk IO sub-systems are built to provide maximum through put which is most often the right thing. However the weakness of tuning for throughput is that occasionally you can get some bizarre behaviour when it comes to latency. The way that optimum IO bandwidth is achieved is by sorting each io by logical block address (LBA) and then issuing those in order to minimize head seek. This is documented in the disksort(9F) manual page.

So if you have a sequence of writes to blocks N, N+1, N+2, N-200, N+3, N+4, N+5,N+6, N+7 in that order and your LUN as a queue depth and therefore throttle of 2.1 The IO's will actually be delivered to the LUN in this order N, N+1, N+2, N+3, N+4, N+5,N+6, N+7, N-200. Hence there will be a significant latency applied to the IO going to LBA N-200 and in practice it is possible to have IO requests delayed on the waitq for many seconds (I have a pathological test case that can hold them there for the time it takes to perform an IO on nearly every block on the LUN, literally hours). You better hope that that IO was not your important one!

This issue only comes into play in the disk driver has reached the throttle for the device as up until that point each IO can be passed straight to the LUN for processing.2 Once the driver has reached the throttle for the LUN it begins queuing IO requests internally and by default will sort them to get maximum throughput. Clearly the lower the throttle the the sooner you get into this potential scenario.

Now for the good news. For most disk arrays sorting by LBA does not make much sense since the LUN will be made up of a number of drives and there will be a read cache and a write cache. So for these devices it makes sense to disable disksort and deliver the IO requests to the LUN in the order in which they are delivered to the disk driver. If you look in the source for sd.c you will see that we do this by default for most common arrays. To achieve this there is a flag, “disable disksort”, that can be set in sd.conf or ssd.conf depending on which driver is in use. See Micheal's blog entry about editing sd.conf. While you are reading that entry note you can use it to set the throttle for individual LUNS so you do not have to set [s]sd_max_throttle, which will penalize all devices rather than just the one you were aiming for. If you have just one that only has a small queue depth and you will see below why a small queue depth can be a really bad thing.

So how could you spot these latency bubbles?

It will come as no surprise that the answer is dtrace. Using my pathological test case, but with it set to run for only 10 minute to a single spindle, the following D produces a clear indication that all is not well:

fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
        start[(struct buf \*)arg0] = timestamp;
}

fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf \*)((struct scsi_pkt \*)arg0)->pkt_private)] != 0 /
{
        this->un = ((struct sd_xbuf \*) this->buf->b_private)->xb_un;
        @[this->un] = lquantize((timestamp - start[this->buf])/1000000, 
                 60000, 600000, 60000);
        @q[this->un] = quantize((timestamp - start[this->buf])/1000000);
        
        start[this->buf] = 0;
}

This produces the following output3, the times are milliseconds:


dtrace: script 'ssdrwtime.d' matched 4 probes
\^C

 

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204   
           60000 |                                         0        
          120000 |                                         0        
          180000 |                                         0        
          240000 |                                         0        
          300000 |                                         0        
          360000 |                                         0        
          420000 |                                         0        
          480000 |                                         2        
          540000 |                                         300      
       >= 600000 |                                         0        

 
    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         40       
               1 |                                         9        
               2 |                                         6        
               4 |                                         17       
               8 |                                         23       
              16 |                                         6        
              32 |                                         36       
              64 |@@                                       15407    
             128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   361660   
             256 |                                         0        
             512 |                                         0        
            1024 |                                         0        
            2048 |                                         0        
            4096 |                                         0        
            8192 |                                         0        
           16384 |                                         0        
           32768 |                                         0        
           65536 |                                         0        
          131072 |                                         0        
          262144 |                                         0        
          524288 |                                         302      
         1048576 |                                         0        

Now recall that my test case is particularly unpleasant but it demonstrates the point. 300 IO requests took over 9 minutes and they only actually got to complete as the test case was shutting down. While the vast majority of the IO requests complete in less than 256ms.


Now lets run the same pathological test with disksort disabled:

dtrace: script 'ssdrwtime.d' matched 4 probes
\^C

    6597960853440
           value  ------------- Distribution ------------- count    
         < 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956   
           60000 |                                         0        


    6597960853440
           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         30       
               1 |                                         21       
               2 |                                         30       
               4 |                                         0        
               8 |                                         0        
              16 |                                         50       
              32 |                                         3        
              64 |                                         384      
             128 |                                         505      
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  531169   
             512 |@                                        11764    
            1024 |                                         0        

Notice that now the majority of the IO requests took longer now, falling in the 256ms bucket rather than the 128ms bucket but none of the IO requests took many minutes.


Now my test case is pathological but if you have drives with small queue depths and disk sort is still enabled you are open to some quite spectacular latency bubbles. To mitigate this my advice is:

  1. Don't ever set the global [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set the appropriate throttle for each device.

  2. Consider what is more important to you. Throughput or latency. If is is latency or if your LUN is on a storage array then turn of disksort using the [s]sd.conf file.

  3. If you have pathological applications then understand that the IO subsystem can give you throughput or bounded latency, not both. So separate out the IO devices that need throughput from those for which latency is more important.

  4. Be aware that even “dumb” disk drives often implement disk sort internally so in some cases they can give a similar issues when they have a queue depth of greater than 24. In those cases you may find it better to throttle them down to a queue depth of 2 and disable disksort in [s]sd to get the most predictable latency all be it at the expense of throughput. If this is your issue then you can spot this either by using scsi.d directly or by modifying it to produce aggregations like those above. I'll leave that as an exercise for the reader.


1The queue depth of a LUN is the number of commands that it can handle at the same time. The throttle is usually set to the same number and it used by the disk driver to prevent it sending more commands than the device can cope with.

2Now the LUN itself may then re order the IO if it has more then two IO's in it's internal queue.

3Edited to remove output for other drives.

4With a queue depth of 2 the drive can not sort the IO requests. It has to have one as active and the next one waiting. When the active one completes the waiting one will be actioned before a new command can come from the initiator.

Friday Mar 16, 2007

scsi.d updates are like busses.

None for months then three come along one after another. This latest update is a direct result of me failing to test the script on Solaris 10. Some of the things that work on Nevada are not quite there on Solaris 10.

Specifically in nevada you can do this:

this->name = “->”

But not in Solaris 10. So the name now has to be a flag that is then checked later.

As always the new script is in the scsi.d directory this one is version 1.12.

On the up-side the bugs were reported within hours as there are actually people using this script to do useful work which makes me realise this was worth putting out there.

Tags:

Wednesday Mar 14, 2007

Bug fix for scsi.d

Today, well last night, I learned that dtrace clause local variables (this->) unlike thread local variables (self->) are not initialised by default. This is all clearly stated in the documentation had I read it with more care. My thanks to Adam on the OpenSoalris zfs-discuss alias for pointing it out to me. The long and the short of this means that when I changed scsi.d to use clause local variables I needed to also ensure that they were initialised before testing them.

While I was editing it I also added support for Group 6 and 7 SCSI commands. Since these commands are vendor specific it just prints out the first 32 bytes of the CDB, or the whole CDB if it is less than 32 bytes. In the event that you need more than 32 bytes to be printed adding this is trivial. The section of code, with line numbers, that prints out bytes 25 to 32 is listed here:

   554  PRINT_CDB(24)
   555  PRINT_CDB(25)
   556  PRINT_CDB(26)
   557  PRINT_CDB(27)
   558  PRINT_CDB(28)
   559  PRINT_CDB(29)
   560  PRINT_CDB(30)
   561  PRINT_CDB(31)

There are no prizes for working out how to make it print byte 33.

So there is verson 1.11 of scsi.d.

Tags:

Tuesday Mar 13, 2007

scsi.d improvements.

With thanks to Brendan Gregg for the encouragement I have improved the argument handling in scsi.d. I asy improved carefully as it is still not right. Mainly as I am too subborn to wrap the thing in a shell script when dtrace can so nearly do what I want. Then new argument syntax is as follows:

scsi.d [ @T timeinseconds ] [ hba [ instamce ]]

The “@T” is because dtrace won't let you use “-T” as it does not stop argument processing after reading the arguments on the first line of the script. If/When it does I will change it to be “-T”.

So now you can trace a particular HBA without setting a timeout, something you previously could not do and all the strange quotes are no longer required.

Additionally all the thread local variables have become clause local variables,“self “ becomes “this” which should reduce the impact of the script on the system.

So I bring you version 1.10 of scsi.d.

Tags:

Monday Feb 19, 2007

16 byte CDB with scsi.d

I've been working to put USCSI support for LUNS greater than 2Tb into diskomizer (a test program for testing disks). So while I have a system that has a LUN greater than 2TB I thought I would just test out scsi.d as it has the code for all the SCSI commands in it. Alas there was a trivial bug that meant the lookup of the names of commands did not work. So I have spun a new version (1.9). Download it from here.

Now for the sad out there here is the transition from 10 byte CDB to 16:

00000.015415100 fp5:-> 0x2a WRITE(10) address 40:01, lba 0xfffffffe, len 0x000001, control 0x00 timeout 60 CDBP 60002278ebc dd(11510) cdb(10) 2a00fffffffe00000100
00000.016729900 fp5:<- 0x2a WRITE(10) address 40:01, lba 0xfffffffe, len 0x000001, control 0x00 timeout 60 CDBP 60002278ebc, reason 0x0 (COMPLETED) state 0x1f Time 1360us
00000.016828400 fp5:-> 0x2a WRITE(10) address 40:01, lba 0xffffffff, len 0x000001, control 0x00 timeout 60 CDBP 6000388d3b4 dd(11510) cdb(10) 2a00ffffffff00000100
00000.018059000 fp5:<- 0x2a WRITE(10) address 40:01, lba 0xffffffff, len 0x000001, control 0x00 timeout 60 CDBP 6000388d3b4, reason 0x0 (COMPLETED) state 0x1f Time 1272us
00000.018205900 fp5:-> 0x8a WRITE(16) address 40:01, lba 0x0000000100000000, len 0x000001, control 0x00 timeout 60 CDBP 600060c8704 dd(11510) cdb(16) 8a000000000100000000000000010000
00000.019467100 fp5:<- 0x8a WRITE(16) address 40:01, lba 0x0000000100000000, len 0x000001, control 0x00 timeout 60 CDBP 600060c8704, reason 0x0 (COMPLETED) state 0x1f Time 1320us
00000.019611800 fp5:-> 0x8a WRITE(16) address 40:01, lba 0x0000000100000001, len 0x000001, control 0x00 timeout 60 CDBP 6000227ab6c dd(11510) cdb(16) 8a000000000100000001000000010000
00000.020823300 fp5:<- 0x8a WRITE(16) address 40:01, lba 0x0000000100000001, len 0x000001, control 0x00 timeout 60 CDBP 6000227ab6c, reason 0x0 (COMPLETED) state 0x1f Time 1257us


I know I need to get out more.

Monday Jan 29, 2007

scsi.d update

A very small update to scsi.d. It allows it to work even when there is some CTF confusion due to having an incorrectly built kernel module loaded, typically via an IDR.

Version 1.8 is available here.

Friday Dec 29, 2006

scsi.d update

A small fix to scsi.d provided by a bloggless colleague results in version 1.7.

My thanks to him.

Tags:

Thursday Jul 13, 2006

scsi.d update

I have added printing of the name of the executable and the process id that initiates an IO so that it is easier to see who is causing all those scsi commands to be sent.

Then for those who just have to have the raw bits to be happy, I have updated scsi.d to also dump out the raw cdb as well.

00000.627329400 fp5:-> 0x2a WRITE(10) address 00:00, lba 0x0001bfe9, len 0x000001, control 0x00 timeout 60 CDBP 600d7881d1c diskomizer64mpis(23849) cdb(10) 2a000001bfe900000100
00000.788444600 fp5:-> 0x2a WRITE(10) address 00:00, lba 0x00de6380, len 0x000010, control 0x00 timeout 60 CDBP 600a4282abc diskomizer64mpis(23847) cdb(10) 2a0000de638000001000

You can find the script here: http://blogs.sun.com/roller/resources/chrisg/scsi.d


Tags:

Friday Feb 10, 2006

scsi.d script

After some feedback about the format of the output from my Dtrace script for looking at SCSI io I how have added a timestamp which helps sorting the output. The output is now cleaner and hopefully clearer though does not fit on a 80 column screen.

00000.844267200 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 60031134488 SDB 60031134518
00000.844354400 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08
00000.848251440 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 6001dd1ba50 SDB 6001dd1bae0
00000.848310720 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 3001da270f8 SDB 3001da27188
00000.850371280 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08, reason 0x0 (COMPLETED) state 0x5f Time 6084us
00000.851151040 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143a76e, len 0x000002, control 0x00 timeout 60 CDB 60031134488 SDB 60031134518, reason 0x0 (COMPLETED) state 0x5f Time 6927us
00000.853292800 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 3001da270f8 SDB 3001da27188, reason 0x0 (COMPLETED) state 0x5f Time 5014us
00000.854442400 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143ddd0, len 0x000002, control 0x00 timeout 60 CDB 6001dd1ba50 SDB 6001dd1bae0, reason 0x0 (COMPLETED) state 0x5f Time 6226us
00002.839392160 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 3001da263c0 SDB 3001da26450
00002.839482480 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 60002cb4538 SDB 60002cb45c8
00002.849052160 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 60002cb4538 SDB 60002cb45c8, reason 0x0 (COMPLETED) state 0x5f Time 9630us
00002.850171840 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143e0b0, len 0x000004, control 0x00 timeout 60 CDB 3001da263c0 SDB 3001da26450, reason 0x0 (COMPLETED) state 0x5f Time 10824us
00003.840019440 isp1:-> 0x2a (WRITE(10)) address 06:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08
00003.840110160 isp1:-> 0x2a (WRITE(10)) address 00:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 30014c0c780 SDB 30014c0c810
00003.846265280 isp1:<- 0x2a (WRITE(10)) address 00:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 30014c0c780 SDB 30014c0c810, reason 0x0 (COMPLETED) state 0x5f Time 6205us
00003.847439680 isp1:<- 0x2a (WRITE(10)) address 06:00, lba 0x0143e200, len 0x000004, control 0x00 timeout 60 CDB 3000cd59e78 SDB 3000cd59f08, reason 0x0 (COMPLETED) state 0x5f Time 7470us

Lots of “fun” games can be played with this, like the above shows that this system has target 0 and target 6 forming a mirror making isp1 a Single Point of failure. Although my favourite is this one:


While running


# dd if=/dev/rdsk/c0t8d0s2 of=/dev/null oseek=1024 iseek=$(( 16#1fffff )) count=2
2+0 records in
2+0 records out
#

I get the following trace:

Th

00001.971470332 qus1:-> 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a94f0 SDB 300016a9520
00001.972324082 qus1:<- 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a94f0 SDB 300016a9520, reason 0x0 (COMPLETED) state 0x17 Time 937us
00001.972433832 qus1:-> 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a9d90 SDB 300016a9dc0
00001.973217082 qus1:<- 0x00 (TEST UNIT READY) address 08:00, lba 0x00000000, len 0x000000, control 0x00 timeout 60 CDB 300016a9d90 SDB 300016a9dc0, reason 0x0 (COMPLETED) state 0x17 Time 826us
00001.973324748 qus1:-> 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000300, len 0x000024, control 0x00 timeout 60 CDB 300016a9380 SDB 300016a93b0
00001.976352165 qus1:<- 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000300, len 0x000024, control 0x00 timeout 60 CDB 300016a9380 SDB 300016a93b0, reason 0x0 (COMPLETED) state 0x5f Time 3070us
00001.976443415 qus1:-> 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000400, len 0x000024, control 0x00 timeout 60 CDB 300016a9ab0 SDB 300016a9ae0
00001.979359665 qus1:<- 0x1a (MODE SENSE(6)) address 08:00, lba 0x00000400, len 0x000024, control 0x00 timeout 60 CDB 300016a9ab0 SDB 300016a9ae0, reason 0x0 (COMPLETED) state 0x5f Time 2959us
00001.979453248 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a9c20 SDB 300016a9c50
00001.979814748 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a9c20 SDB 300016a9c50, reason 0x0 (COMPLETED) state 0x5f Time 403us
00001.979898415 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a90a0 SDB 300016a90d0
00001.980151165 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x00000000, len 0x000001, control 0x00 timeout 60 CDB 300016a90a0 SDB 300016a90d0, reason 0x0 (COMPLETED) state 0x5f Time 294us
00001.980507332 qus1:-> 0x08 (  READ(6)) address 08:00, lba 0x001fffff, len 0x000001, control 0x00 timeout 60 CDB 300016a9660 SDB 300016a9690
00001.993267665 qus1:<- 0x08 (  READ(6)) address 08:00, lba 0x001fffff, len 0x000001, control 0x00 timeout 60 CDB 300016a9660 SDB 300016a9690, reason 0x0 (COMPLETED) state 0x5f Time 12804us
00001.993382498 qus1:-> 0x28 ( READ(10)) address 08:00, lba 0x00200000, len 0x000001, control 0x00 timeout 60 CDB 300016a9940 SDB 300016a9970
00001.999256915 qus1:<- 0x28 ( READ(10)) address 08:00, lba 0x00200000, len 0x000001, control 0x00 timeout 60 CDB 300016a9940 SDB 300016a9970, reason 0x0 (COMPLETED) state 0x5f Time 5921us


I like it has you see the transition from READ(6) to READ(10) as it moves from LBA 0x1fffff to 0x200000. Did I mention needing to get out more?


You can get the script here. Still do do is correct decoding of CDBs bigger than 10 bytes, which is not a problem for my current systems and more detailed decoding of CDBs that are not reads and writes.


Tags:

About

This is the old blog of Chris Gerhard. It has mostly moved to http://chrisgerhard.wordpress.com

Search

Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today