Monday May 10, 2010

Failfast for Solaris IO


Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(*buf) function. As part of this investigation I needed to learn about failfast for Solaris IO, which I will discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals, I strongly recommend reading General Flow of Control from the Writing Device Drivers book available for free on

At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.


Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(*buf, **scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(*scsi_pkt).

sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver. Commands time out if no response is received from the target within pkt_time, set to SD_IO_TIME (60s) by sd_initpkt_for_buf(). In case of timeout the driver will attempt to retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.

Failfast is a process that takes place within the driver to fail a pending buf more quickly and inform the upper-layer Volume Manager (VM: ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf's b_flags mask to enable the behaviour (the VM can check the driver's ddi-failfast-supported property to know whether B_FAILFAST can be used).

Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away we ultimately expect all of our read IOs to be serviced by the working disk. In order for this to happen it is necessary for the driver to return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC as this accurately describes the conditions that need to be true for us to want to use failfast behaviour.
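The read path described above can be sketched as a toy model. This is not code from any real volume manager: the Submirror class, the mirror_read() function and the d10/d20 names are all hypothetical, and the rule of only asking for failfast while another copy is still untried is my assumption, based on the B_ALTDATASRC naming.

```python
import errno


class Submirror:
    """Hypothetical stand-in for one copy of the data (one side of a mirror)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.failfast_seen = []  # records the failfast flag passed on each read

    def read(self, block, failfast):
        # A real driver would take far longer to return EIO when failfast
        # is not set; this model only tracks the outcome, not the timing.
        self.failfast_seen.append(failfast)
        return 0 if self.healthy else errno.EIO


def mirror_read(sides, block, start=0):
    """Read from the copy at index 'start', moving to the next copy on EIO."""
    tried = 0
    for i in range(len(sides)):
        side = sides[(start + i) % len(sides)]
        tried += 1
        # Assumption: only ask for failfast while an alternate copy remains.
        failfast = tried < len(sides)
        if side.read(block, failfast) == 0:
            return side.name
    raise OSError(errno.EIO, "no copy of the block could be read")


if __name__ == "__main__":
    sides = [Submirror("d10", healthy=False), Submirror("d20")]
    print(mirror_read(sides, block=42, start=0))
```

The sooner the dead side hands back EIO, the sooner the loop reaches the working copy, which is the whole point of the flag.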


Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.

When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (these are CMD_TIMEOUT, and CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(*sd_lun, *buf, int retry_check_flag, ...) with the buf and SD_RETRIES_FAILFAST set in retry_check_flag. sd_retry_command() and sd_return_command(*sd_lun, *buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.

Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.

Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(*sd_lun) is then called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). The un_failfast_bp buf itself is also returned with an error set if it has B_FAILFAST set; otherwise it is retried.

Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should now be clear from above that only B_FAILFAST bufs are affected by the failfast state which means any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can allow the transition back to SD_FAILFAST_INACTIVE.

Any buf passed into an SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, *sd_lun, *buf).
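To make the transitions easier to follow, here is a small model of the state machine as I have described it. It is a sketch, not the sd source: the function names mirror the driver's, but the Buf and SdLun classes, the string return values and the simplified arguments are my own inventions. How a timed-out B_FAILFAST buf is handled when the instance is already active is an assumption on my part.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SD_FAILFAST_INACTIVE = 0
SD_FAILFAST_ACTIVE = 1


@dataclass
class Buf:
    name: str
    failfast: bool = False       # stands in for B_FAILFAST in b_flags
    error: Optional[str] = None  # "EIO" once failed back to the caller


@dataclass
class SdLun:
    failfast_state: int = SD_FAILFAST_INACTIVE   # un_failfast_state
    failfast_bp: Optional[Buf] = None            # un_failfast_bp
    waitq: List[Buf] = field(default_factory=list)

    def state_name(self) -> str:
        if self.failfast_state == SD_FAILFAST_ACTIVE:
            return "active"
        return "pending" if self.failfast_bp is not None else "inactive"


def fail(bp):
    bp.error = "EIO"
    return "failed"


def sd_failfast_flushq(un):
    # Fail every B_FAILFAST buf on the wait queue; leave the rest queued.
    for bp in un.waitq:
        if bp.failfast:
            bp.error = "EIO"
    un.waitq = [bp for bp in un.waitq if not bp.failfast]


def sd_retry_command(un, bp, retries_failfast):
    if not retries_failfast:
        # Retry for any other reason: back to inactive.
        un.failfast_state = SD_FAILFAST_INACTIVE
        un.failfast_bp = None
        return "retried"
    if un.failfast_state == SD_FAILFAST_ACTIVE:
        # Assumption: an already-active instance fails B_FAILFAST bufs here too.
        return fail(bp) if bp.failfast else "retried"
    if un.failfast_bp is None:
        un.failfast_bp = bp                      # inactive -> pending
    elif un.failfast_bp is bp:
        un.failfast_state = SD_FAILFAST_ACTIVE   # pending -> active
        un.failfast_bp = None
        sd_failfast_flushq(un)
        if bp.failfast:
            return fail(bp)
    return "retried"


def sd_return_command(un, bp):
    # Models a successful completion only: back to inactive.
    un.failfast_state = SD_FAILFAST_INACTIVE
    un.failfast_bp = None
    return "done"


def sd_core_iostart(un, bp):
    if un.failfast_state == SD_FAILFAST_ACTIVE and bp.failfast:
        return fail(bp)
    un.waitq.append(bp)
    return "queued"
```

Calling sd_retry_command() twice with the same buf and the failfast retry flag walks an instance from inactive through pending to active; a later sd_return_command() for any buf takes it back to inactive.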

Wednesday Jan 13, 2010

mdb: calculating thread idle time

In SolarisCAT the thread output shows the thread idle time. In mdb I've not found a dcmd that provides this same information (although if you are using the excellent ACT module, the ::act_thread dcmd does solve this problem).

Fortunately the answer is very simple:

> 2a100499ca0::findstack -v
stack pointer for thread 2a100499ca0: 2a100498e41
[ 000002a100498e41 sema_p+0x138() ]
000002a100498ef1 biowait+0x6c(600d3d31240, 0, 18ba800, 30008670000, 2080201, 600d3d31240)
000002a100498fa1 bwrite_common+0x1ac(0, 600d3d31240, 1, 0, 0, 1)
000002a100499051 ldl_savestate+0x88(600d2c5da40, 14e5757ef, a72ba5c0, 600de7f8080, 600de7f8280, 0)
000002a100499101 logmap_sethead+0x78(600d2c5db60, 600d2c5da40, f555, 6ed48, 600ccb7f6b0, 600ccb7f5c0)
000002a1004991b1 trans_roll+0x354(600d2c5da40, 10, 2000, 10, 600df0f3180, 600ccb7f6ba)
000002a100499291 thread_start+4(600d2c5da40, 0, 6269665f7365, 745f62696c6c5f68, 6f6c645f636f6465, 2e70726f63000000)

> 2a100499ca0::print -t kthread_t t_disp_time
clock_t t_disp_time = 0x9a22a2
> *panic_lbolt64-0x9a22a2=D

panic_lbolt64 is set to lbolt64 when panic() is called. lbolt64 is incremented each time the clock thread runs. By default this is 100 times/second, so one tick is 10ms, but if hires_tick is set in /etc/system the clock thread runs 1000 times/second (see usr/src/uts/common/conf/param.c for details). On a live system panic_lbolt64 can be replaced by lbolt64.

Converting ticks to minutes and seconds is left as a (really simple) exercise for the reader.
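For anyone who would rather not do the exercise by hand, here is a minimal sketch of the arithmetic. The idle_time() helper is hypothetical, and the lbolt value in the example is made up, since the dump's real panic_lbolt64 is not shown above; only t_disp_time (0x9a22a2) comes from the mdb output.

```python
def idle_time(lbolt, t_disp_time, hz=100):
    """Return the idle time as 'XmYYs' given two tick counts.

    hz is the clock rate: 100 by default, 1000 if hires_tick is set.
    """
    ticks = lbolt - t_disp_time
    seconds = ticks // hz                 # drop the sub-second remainder
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes}m{seconds:02d}s"


if __name__ == "__main__":
    t_disp_time = 0x9A22A2                # from ::print above
    lbolt = t_disp_time + 123456          # made-up value for illustration
    print(idle_time(lbolt, t_disp_time))  # 123456 ticks at 100Hz -> 20m34s
```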

Saturday May 17, 2008

Compiling MPlayer on OpenSolaris 2008.05

Video playback (with codecs) is an absolute must to make OpenSolaris a workable alternative to Ubuntu for my home machine.

What follows is a very quick guide to compiling MPlayer v1.0rc2 on OpenSolaris.

Fetch gcc, gmake and gawk to allow us to compile MPlayer with minimum fuss; we also pull SUNWxorg-headers to allow us to compile the Xv video-out plugin (I also have FSWxorg-headers installed; if things don't work out, you may want to add that to the end of the list). As IPSgawk is in the Blastwave IPS repository, we add that also:

$ pfexec pkg set-authority -O Blastwave
$ pfexec pkg install SUNWgcc SUNWgmake IPSgawk SUNWxorg-headers

Compiling MPlayer is now very straightforward. The only special thing we need to do is promote /opt/csw/gnu in our PATH at compile time. This is because various parts of the build fail with the standard Solaris awk, so we override it with GNU awk:

$ echo $PATH
$ export PATH=/usr/gnu/bin:/opt/csw/gnu:/usr/bin:/usr/X11/bin:/usr/sbin:/sbin
$ wget
$ tar jxf MPlayer-1.0rc2.tar.bz2 && cd MPlayer-1.0rc2
$ ./configure
$ gmake

The key thing to watch for is the list of enabled audio and video plugins once configure completes (you almost certainly want Xv).

If all went well, you should be able to run MPlayer with a simple ./mplayer -vo xv /path/to/video

In the meantime, I'm looking into the best way to get MPlayer packages available via one of the standard IPS repositories.

