Monday May 10, 2010

Failfast for Solaris IO


Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(\*buf) function. As part of this investigation I found a need to learn about failfast for Solaris IO which I will discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals I strongly recommend reading General Flow of Control from the Writing Device Drivers book available for free on

At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.


Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(\*buf, \*\*scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(\*scsi_pkt).

sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(\*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver. Commands time out if no response is received from the target within pkt_time, set to SD_IO_TIME (60s) by sd_initpkt_for_buf(). In case of timeout the driver will attempt to retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.
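As a rough sanity check on the five-minute figure, the arithmetic can be sketched in C. The constant and function names below are hypothetical stand-ins that reuse the values quoted above; they are not the driver's own definitions. The model assumes every retry waits the full pkt_time before timing out.

```c
/* Hypothetical model values taken from the text above; not the sd driver's headers. */
#define MODEL_IO_TIME_SECS   60   /* pkt_time, i.e. the SD_IO_TIME value quoted above */
#define MODEL_RETRIES_FC      3   /* retry count quoted for fibre channel */
#define MODEL_RETRIES_OTHER   5   /* retry count quoted otherwise */

/* Seconds spent waiting if each of `retries` retries runs to the full timeout. */
static unsigned int
model_retry_wait_secs(unsigned int retries)
{
	return (retries * MODEL_IO_TIME_SECS);
}
```

With these assumptions model_retry_wait_secs(MODEL_RETRIES_OTHER) is 300 seconds, which is the five minutes mentioned above; counting the initial attempt as well would add one more pkt_time.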

Failfast is a process that takes place within the driver to fail a pending buf more quickly and inform the upper-layer Volume Manager (VM: ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf b_flags mask to enable the behaviour (the VM can check the driver's ddi-failfast-supported property to know whether B_FAILFAST can be used).

Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away we ultimately expect all of our read IOs to be serviced by the working disk. For this to happen the driver must return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster, thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC, as this accurately describes the conditions that need to be true for us to want failfast behaviour.
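The round-robin-then-retry behaviour described above can be sketched as a small user-space model. Everything here (the model_ names, the two-sided mirror, the side_ok array) is a hypothetical illustration, not code from ZFS, SVM or VxVM; the only point is that the sooner a dead side returns EIO, the sooner the loop moves on to the working side.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_NSIDES 2

/* Hypothetical mirror: each side either works (1) or fails reads with EIO (0). */
struct model_mirror {
	int side_ok[MODEL_NSIDES];
	size_t next_side;               /* round-robin cursor */
};

/* Stand-in for issuing a read to one side of the mirror; returns 0 or EIO. */
static int
model_side_read(const struct model_mirror *m, size_t side)
{
	return (m->side_ok[side] ? 0 : EIO);
}

/*
 * Start at the round-robin side; on EIO fall through to the remaining sides.
 * Stores the side that serviced the read in *served.
 * Returns 0 on success, or EIO if every side failed.
 */
static int
model_mirror_read(struct model_mirror *m, size_t *served)
{
	size_t first = m->next_side;

	m->next_side = (m->next_side + 1) % MODEL_NSIDES;
	for (size_t i = 0; i < MODEL_NSIDES; i++) {
		size_t side = (first + i) % MODEL_NSIDES;

		if (model_side_read(m, side) == 0) {
			*served = side;
			return (0);
		}
	}
	return (EIO);
}
```

In this model a mirror whose side 0 has gone away still satisfies every read from side 1, but each read that starts on side 0 pays for the failed attempt first, which is the cost failfast tries to shrink.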


Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.

When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (these are CMD_TIMEOUT, and CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(\*sd_lun, \*buf, int retry_check_flag, ...) with the buf and with SD_RETRIES_FAILFAST set in retry_check_flag. sd_retry_command() and sd_return_command(\*sd_lun, \*buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.

Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.

Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(\*sd_lun) is then called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). This buf is also returned with an error set if it has B_FAILFAST set, otherwise it is retried.

Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should now be clear from above that only B_FAILFAST bufs are affected by the failfast state which means any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can allow the transition back to SD_FAILFAST_INACTIVE.

Any buf passed into a SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, \*sd_lun, \*buf).
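The transitions described above can be summarised as a small user-space state machine. This is a minimal sketch built from the prose in this post, not the sd source: all model_ names and the flag value are hypothetical, the wait queue flush is left out, and the behaviour of a failfast retry arriving while the instance is already active is an assumption.

```c
#include <stddef.h>

#define MODEL_B_FAILFAST 0x1    /* hypothetical bit, not the real B_FAILFAST value */

enum model_ff_state { MODEL_FAILFAST_INACTIVE, MODEL_FAILFAST_ACTIVE };
enum model_action { MODEL_RETRY, MODEL_FAIL_EIO, MODEL_ISSUE };

struct model_buf { int b_flags; };

struct model_lun {
	enum model_ff_state un_failfast_state;
	struct model_buf *un_failfast_bp;   /* non-NULL means "failfast pending" */
};

/* Models the retry path; failfast_retry stands in for SD_RETRIES_FAILFAST. */
static enum model_action
model_retry_command(struct model_lun *un, struct model_buf *bp, int failfast_retry)
{
	if (!failfast_retry) {
		/* Any other retry reason drops the instance back to inactive. */
		un->un_failfast_state = MODEL_FAILFAST_INACTIVE;
		un->un_failfast_bp = NULL;
		return (MODEL_RETRY);
	}
	if (un->un_failfast_state == MODEL_FAILFAST_ACTIVE) {
		/* Assumption: while active, B_FAILFAST bufs are failed here too. */
		return ((bp->b_flags & MODEL_B_FAILFAST) ? MODEL_FAIL_EIO : MODEL_RETRY);
	}
	if (un->un_failfast_bp == NULL) {
		un->un_failfast_bp = bp;        /* inactive -> pending */
		return (MODEL_RETRY);
	}
	if (un->un_failfast_bp == bp) {
		/* pending -> active; the wait queue flush is not modelled */
		un->un_failfast_state = MODEL_FAILFAST_ACTIVE;
		un->un_failfast_bp = NULL;
		return ((bp->b_flags & MODEL_B_FAILFAST) ? MODEL_FAIL_EIO : MODEL_RETRY);
	}
	return (MODEL_RETRY);               /* another buf while pending: no change */
}

/* Models a successful completion via the return path. */
static void
model_return_command(struct model_lun *un)
{
	un->un_failfast_state = MODEL_FAILFAST_INACTIVE;
	un->un_failfast_bp = NULL;
}

/* Models the check made when a new buf is submitted to the instance. */
static enum model_action
model_core_iostart(const struct model_lun *un, const struct model_buf *bp)
{
	if (un->un_failfast_state == MODEL_FAILFAST_ACTIVE &&
	    (bp->b_flags & MODEL_B_FAILFAST))
		return (MODEL_FAIL_EIO);
	return (MODEL_ISSUE);
}
```

Feeding the same B_FAILFAST buf through model_retry_command() twice takes the model from inactive to pending to active, after which model_core_iostart() fails further B_FAILFAST bufs until model_return_command() or a non-failfast retry resets the state.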

Monday Jan 11, 2010

mdb: biowait(buf_t \*bp) to (s)sd softstate

How to get the sd_lun structure from a buf (e.g. in biowait()).

> 0x2a1002efca0::findstack -v
stack pointer for thread 2a1002efca0: 2a1002ee1a1
[ 000002a1002ee1a1 sema_p+0x138() ]
000002a1002ee251 biowait+0x6c(46bb44a9d00, 0, 18bac00, 30024b12000, 1a, 46bb44a9d00)
000002a1002ee301 default_physio+0x388(12ebf74, 24, 0, 46bb44a9d40, 12ddc10, 46bb44a9d38)
000002a1002ee431 scsi_uscsi_handle_cmd+0x1b8(2000000010, 1, 338a8de2c50, 12ebf74, 46bb44a9d00, 3005a3d1d70)
000002a1002ee521 sd_send_scsi_cmd+0x114(2000000010, 1970800, 3005a3d1d70, 1, 3000111ecc0, 2a1002eeeb0)
000002a1002ee5e1 sd_send_scsi_MODE_SENSE+0x110(3000240be40, 6, 3389353b680, 24, 4, 1)
000002a1002ee701 sd_get_physical_geometry+0x9c(3389353b680, 2a1002ef06c, 43d5bd5, 200, 1, 3000111ecc0)
000002a1002ee7b1 sd_resync_geom_caches+0xb4(3000111ecc0, 43d5bd5, 200, 1, 3ec1, ff)
000002a1002ee881 sd_validate_geometry+0xb4(3000111ecc0, 1, 60, 1, 7, fa000050)
000002a1002ee941 sd_ready_and_valid+0x2d4(3000111ecc0, 2a1002efca0, 0, 3000240be40, 3000240be40, c1)
000002a1002eea51 sdopen+0x248(1, 3000111ecc0, 0, 1978108, 3000111eda0, 0)
000002a1002eeb01 spec_open+0x4f8(2a1002ef528, 224, 3000410be48, a21, 430043ae440, 0)
000002a1002eebc1 fop_open+0x78(2a1002ef528, 2, 3000410be48, 40000003, 301ac8b1ec0, 301ac8b1ec0)
000002a1002eec71 dev_lopen+0x34(2a1002ef5e0, 3, 4, 3000410be48, ffffffff, ffffffffffffffff)
000002a1002eed31 md_layered_open+0x120(13, 2a1002ef6c8, 3, 30003e9d580, 2000000010, 3000410be48)
000002a1002eedf1 stripe_open_all_devs+0x188(58, 3, 0, 0, 0, dc)
000002a1002eeed1 stripe_open+0xa0(dc, 3, 4, 3000113a628, 30003e88b70, 3)
000002a1002eef81 md_layered_open+0xb8(0, 2a1002ef908, 3, 3000113a628, dc, 3000410be48)
000002a1002ef041 mirror_probe_dev+0x98(3000113a078, 19be608, 0, 1, 30003e8b2b0, 0)
000002a1002ef111 md_probe_one+0x84(49976e5ee40, 3000113a078, 0, 68c965b92c0, 14, 7ba0730c)
000002a1002ef1c1 md_daemon+0x21c(0, 19bf478, 33864030100, 19bf478, 2a1002efa88, 19bf4a0)
000002a1002ef291 thread_start+4(19bf478, 0, 0, 0, 0, 0)

The first argument to biowait() is a pointer to a buf_t structure.

> 46bb44a9d00::print -t buf_t
int b_flags = 0x200067
struct buf \*b_forw = 0
struct buf \*b_back = 0
struct buf \*av_forw = 0
struct buf \*av_back = 0
o_dev_t b_dev = 0
size_t b_bcount = 0x24
union b_un = {
caddr_t b_addr = 0x3389353b680
struct fs \*b_fs = 0x3389353b680
struct cg \*b_cg = 0x3389353b680
struct dinode \*b_dino = 0x3389353b680
daddr32_t \*b_daddr = 0x3389353b680
lldaddr_t _b_blkno = {
longlong_t _f = 0
struct _p = {
int32_t _u = 0
int32_t _l = 0
char b_obs1 = '\\0'
size_t b_resid = 0x24
clock_t b_start = 0
struct proc \*b_proc = 0
struct page \*b_pages = 0
clock_t b_obs2 = 0
size_t b_bufsize = 0
int (\*)() b_iodone = 0
struct vnode \*b_vp = 0
struct buf \*b_chain = 0
int b_obs3 = 0
int b_error = 0x5
void \*b_private = 0x3005a3d1d70
dev_t b_edev = 0x2000000010
ksema_t b_sem = {
void \* [2] _opaque = [ 0, 0 ]
ksema_t b_io = {
void \* [2] _opaque = [ 0x2a1002efca0, 0 ]
struct buf \*b_list = 0
struct page \*\*b_shadow = 0x338a79b81c0
void \*b_dip = 0x30003961b90
struct vnode \*b_file = 0
offset_t b_offset = 0xffffffffffffffff

We are interested in getting the sd_lun so we'll take the b_edev which is a dev_t. The DDI getminor(dev_t dev) and getmajor(dev_t dev) functions allow us to extract the major and minor numbers from a dev_t.

So to get the major number we shift right by NBITSMINOR64 (32) on 64-bit or NBITSMINOR (18) on 32-bit. We then AND with MAXMAJ64 (0xffffffff) on 64-bit or MAXMAJ (MAXMAJ64) on 32-bit:

> (0x2000000010>>0t32)&0xffffffff=D

And for the minor number we AND with MAXMIN64 (0xffffffff) on 64-bit or MAXMIN (MAXMIN64) on 32-bit:

> 0x2000000010&0xffffffff=D

Alternatively if your genunix module provides the ::devt dcmd, this can be used:

> 0x2000000010::devt
32 16

The ::major2name dcmd converts the major number to a name. Alternatively we could check in /etc/name2major from an explorer or on the host itself.

> 0t32::major2name

In this case the device is sd. If it had returned ssd all of the following commands that mention sd should be replaced with ssd.

Converting the minor number to an sd instance is slightly more tricky. The driver's DDI getinfo(9E) function is called, in the case of sd this is sdinfo(9E). The SDUNIT(dev_t dev) macro is called:

#define SDUNIT(dev) (getminor((dev)) >> SDUNIT_SHIFT)

So we need to shift the minor number right by SDUNIT_SHIFT (3 on my system):

> 0t16>>0t3=D

We now know that this thread is waiting on a buffer which is being serviced by sd2.
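The shifts and masks above can also be written out in C as a cross-check. The MODEL_ constants and model_ functions are hypothetical names that hard-code the 64-bit values quoted in this post (and the SDUNIT_SHIFT of 3 seen on my system); they are not the DDI getmajor()/getminor() functions or the driver's SDUNIT() macro.

```c
#include <stdint.h>

/* Hypothetical constants mirroring the 64-bit values quoted above. */
#define MODEL_NBITSMINOR64  32
#define MODEL_MAXMAJ64      0xffffffffULL
#define MODEL_MAXMIN64      0xffffffffULL
#define MODEL_SDUNIT_SHIFT  3   /* value observed on the author's system */

/* Major number: shift the minor bits away, then mask. */
static uint32_t
model_getmajor(uint64_t dev)
{
	return ((uint32_t)((dev >> MODEL_NBITSMINOR64) & MODEL_MAXMAJ64));
}

/* Minor number: mask off everything above the minor bits. */
static uint32_t
model_getminor(uint64_t dev)
{
	return ((uint32_t)(dev & MODEL_MAXMIN64));
}

/* sd instance: the minor number with the low SDUNIT_SHIFT bits dropped. */
static uint32_t
model_sdunit(uint64_t dev)
{
	return (model_getminor(dev) >> MODEL_SDUNIT_SHIFT);
}
```

For the b_edev in this dump, 0x2000000010, these give major 32, minor 16 and instance 2, matching the ::devt output and the sd2 result.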

The next stage is to get this sd instance's sd_lun structure. These are held in an array pointed to by sd's DDI softstate pointer, sd_state (or ssd_state for ssd). For more information see ddi_soft_state(9F) in the Solaris 10 man page collection.

> \*sd_state::print -t struct i_ddi_soft_state
void \*\*array = 0x3000111a640
kmutex_t lock = {
void \* [1] _opaque = [ 0 ]
size_t size = 0x558
size_t n_items = 0x40
struct i_ddi_soft_state \*next = 0x30003b7f900

We're after sd instance 2 so that will be entry 3 in the array (remember before sd2 are sd0 and sd1). There are a number of different ways to do this so I'll cover a few.

Getting the softstate #1, without any helper dcmds:

The array is a list of pointers, so we'll need to know the size of a uintptr_t. We then multiply this by the sd_state instance that we want and add it to the array, e.g.:

> ::sizeof uintptr_t
sizeof (uintptr_t) = 8
> 0x3000111a640+8\*2/J
0x3000111a650: 3000111ecc0 <-- sd2
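The same lookup can be sketched in C. The struct below is a hypothetical cut-down view of the soft state header printed earlier (only array and n_items), and the model_ functions are illustrations of the pointer arithmetic, not the ddi_get_soft_state(9F) implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down view of the soft state header printed above. */
struct model_soft_state {
	void **array;       /* one pointer per instance */
	size_t n_items;     /* number of slots in array */
};

/* Address of the slot for `item`: the same sum as the mdb expression above. */
static uintptr_t
model_slot_addr(uintptr_t array_base, size_t item)
{
	return (array_base + sizeof (void *) * item);
}

/* Bounds-checked lookup; returns NULL for out-of-range items or empty slots. */
static void *
model_get_soft_state(const struct model_soft_state *ss, size_t item)
{
	if (ss == NULL || ss->array == NULL || item >= ss->n_items)
		return (NULL);
	return (ss->array[item]);
}
```

On an LP64 build, where pointers are 8 bytes, model_slot_addr(0x3000111a640, 2) works out to 0x3000111a650, the address dereferenced with /J above.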

Getting the softstate #2, walking the array up to the state we want:

Here we tell mdb to print out the address (a) and the contents (P) of the first 3 (,3) values from array:

> 0x3000111a640,3/naP
0x3000111a640: 0x3000111e680 <-- sd0
0x3000111a648: 0x30003ea80c0 <-- sd1
0x3000111a650: 0x3000111ecc0 <-- sd2

Getting the softstate #3, using the ::array dcmd (probably the worst way but the one I somehow always try and use):

We get the first three (0t3) elements from the start of the array. We specify that each element of the array is a uintptr_t. This might make more sense if the array was not an array of pointers but of a real structure.

> 0x3000111a640::array uintptr_t 0t3
> 3000111a650/J
0x3000111a650: 3000111ecc0 <-- sd2

Getting the softstate #4, with the helpful ::softstate dcmd (definitely the best way):

> \*sd_state::softstate 0t2

Getting all softstates using the softstate walker:

> \*sd_state::walk softstate

We're now done. We have the sd_lun pointer for sd2 and we can do whatever we want with it.

Below are a few helpful things that can be dumped. This is in no way exhaustive.

> \*sd_state::softstate 0t2|::print -t struct sd_lun
struct scsi_device \*un_sd = 0x3000240be40
struct buf \*un_rqs_bp = 0x300010a1340
struct scsi_pkt \*un_rqs_pktp = 0x300039a7e90
int un_sense_isbusy = 0
int un_buf_chain_type = 0x1
int un_uscsi_chain_type = 0x8
int un_direct_chain_type = 0x8
int un_priority_chain_type = 0x9
struct buf \*un_waitq_headp = 0
struct buf \*un_waitq_tailp = 0
struct buf \*un_retry_bp = 0
int (\*)() un_retry_statp = 0
void \*un_xbuf_attr = 0x30003ba0200
uint32_t un_sys_blocksize = 0x200
uint32_t un_tgt_blocksize = 0x200
uint64_t un_blockcount = 0x43d5bd5
uchar_t un_ctype = 0x2
char \*un_node_type = 0x1972e48 "ddi_block:channel"

> 0x3000240be40::print -t struct scsi_device
struct scsi_address sd_address = {
struct scsi_hba_tran \*a_hba_tran = 0x300024290c0
ushort_t a_target = 0x2
uchar_t a_lun = 0
uchar_t a_sublun = 0
dev_info_t \*sd_dev = 0x30003961b90
kmutex_t sd_mutex = {
void \* [1] _opaque = [ 0 ]
void \*sd_reserved = 0x300024290c0
struct scsi_inquiry \*sd_inq = 0x3000393d788
struct scsi_extended_sense \*sd_sense = 0
caddr_t sd_private = 0x3000111ecc0

> 0x30003961b90::devinfo
30003961b90 sd, instance #2
System properties at 30003b59810:
name='lun' type=int items=1
name='target' type=int items=1
name='class_prop' type=string items=1
name='class' type=string items=1
Driver properties at 30003b594f0:
name='ddi-no-autodetach' type=int items=1
name='inquiry-serial-no' type=string items=1
value='00N0A2TH '
name='pm-components' type=string items=3
value='NAME=spindle-motor' + '0=off' + '1=on'
name='pm-hardware-state' type=string items=1
name='ddi-failfast-supported' type=any items=0
name='ddi-kernel-ioctl' type=any items=0
name='device-nblocks' type=int64 items=1
Hardware properties at 30003b59518:
name='devid' type=string items=1
name='inquiry-revision-id' type=string items=1
name='inquiry-product-id' type=string items=1
value='MAP3367N SUN36G'
name='inquiry-vendor-id' type=string items=1
name='inquiry-device-type' type=int items=1

I hope this was helpful. I'm going to try and put up a few more posts in a similar style. I'm happy to take requests but can't guarantee results!

Thursday Aug 13, 2009

Solaris RPE Kernel rotation

This month I am on a rotation into the Solaris RPE (Revenue Product Engineering) Kernel team where I have picked a bug in the Solaris kernel which I am attempting to diagnose and fix.  Thanks to Mita Solanky, Chris Beal, Bill Watson & Rob Harris for helping me arrange this.

Normally I work as a part of the TSC (Technical Solutions Centre) Kernel team as an engineer who diagnoses kernel-related issues.  This could be system panics (i.e. crash dump analysis), performance problems, errors, queries from customers, etc.  The RPE organisation is the direct escalation path for us engineers in the TSC, and they are required to understand and provide fixes for bugs logged by the TSC.

By spending time in RPE I am getting to know better how this organisation functions (e.g. process), the exact role the engineers play, what external pressures they have and more about how code changes go back into the Solaris product.  RPE have some communication with NRE (New Revenue Engineering) which is interesting for me as I do not usually encounter NRE engineers as part of my day-to-day job.

Once the rotation has finished I'll blog again about my experience here as compared to the TSC role I normally play.  For now this serves as a brief introduction to my next blog entry where I will discuss the bug I am working on.

