Failfast for Solaris IO
By Lewis Thompson on May 10, 2010
Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(\*buf) function. As part of this investigation I needed to learn about failfast for Solaris IO, which I will discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals, I strongly recommend reading General Flow of Control from the Writing Device Drivers book, available for free on docs.sun.com.
At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.
Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(\*buf, \*\*scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(\*scsi_pkt).
sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(\*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver. Commands time out if no response is received from the target within pkt_time, set to SD_IO_TIME (60s) by sd_initpkt_for_buf(). In case of timeout the driver will attempt to retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.
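To make the "up to 5 minutes" figure concrete, here is a minimal arithmetic sketch. The constant names and the helper worst_case_secs() are hypothetical. The numbers are the ones quoted above, and the sketch assumes every attempt runs to its full timeout and that the retry count is the number of timed-out attempts.

```c
#include <assert.h>

/* Values quoted in the text; the real definitions live in the sd driver. */
#define DEMO_IO_TIME_SECS     60  /* per-command timeout (pkt_time) */
#define DEMO_RETRIES_DEFAULT   5  /* non fibre channel */
#define DEMO_RETRIES_FIBRE     3  /* fibre channel */

/*
 * Hypothetical helper: upper bound, in seconds, on how long a caller waits
 * for an error when every attempt times out. Assumes `attempts` counts the
 * timed-out attempts the same way the text does.
 */
int worst_case_secs(int io_time_secs, int attempts)
{
    assert(io_time_secs >= 0 && attempts >= 0);
    return io_time_secs * attempts;
}
```

With the quoted values this gives 300 seconds (5 minutes) in the default case and 180 seconds for fibre channel.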
Failfast is a mechanism within the driver that fails a pending buf more quickly and informs the upper layer Volume Manager (VM: ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf b_flags mask to enable the behaviour (the VM can check the ddi-failfast-supported property, which the driver exports, to know whether B_FAILFAST can be used).
Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away we ultimately expect all of our read IOs to be serviced by the working disk. In order for this to happen it is necessary for the driver to return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC as this accurately describes the conditions that need to be true for us to want to use failfast behaviour.
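The VM side of this can be sketched in user-space C. Everything here is hypothetical (the vm_ names, the struct layout and the placeholder flag bit are not the real kernel definitions). It only illustrates the idea that a read which fails with EIO on one side of a mirror is retried on another side, and that the flag is set because an alternative data source exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define VM_B_FAILFAST 0x1  /* placeholder bit, not the real B_FAILFAST value */

/* Cut-down stand-in for a buf: only the fields this sketch needs. */
struct vm_buf { int b_flags; int b_error; };

/* Hypothetical per-disk read routine: returns 0 or an errno value. */
typedef int (*vm_read_fn)(struct vm_buf *bp);

int vm_read_ok(struct vm_buf *bp)   { bp->b_error = 0;   return 0; }
int vm_read_dead(struct vm_buf *bp) { bp->b_error = EIO; return EIO; }

/*
 * Read from a mirror, starting at side `first` (the round-robin choice)
 * and moving to the next side whenever the current one reports an error.
 * Assumption: the flag is set for every attempt; a real volume manager
 * may decide this per attempt.
 */
int vm_mirror_read(vm_read_fn sides[], size_t nsides, size_t first,
    struct vm_buf *bp)
{
    assert(nsides > 0 && bp != NULL);
    bp->b_flags |= VM_B_FAILFAST;  /* another copy exists: fail fast */
    for (size_t i = 0; i < nsides; i++) {
        size_t side = (first + i) % nsides;
        if (sides[side](bp) == 0)
            return 0;              /* serviced by a working disk */
    }
    return EIO;                    /* every copy failed */
}
```

The sooner the dead side returns EIO, the sooner the loop reaches a working side, which is the whole point of the flag.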
Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.
When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (CMD_TIMEOUT, or CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(\*sd_lun, \*buf, int retry_check_flag, ...) with the buf and with SD_RETRIES_FAILFAST set in retry_check_flag. Both sd_retry_command() and sd_return_command(\*sd_lun, \*buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.
Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.
Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(\*sd_lun) is then called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). The un_failfast_bp buf itself is also returned with an error set if it has B_FAILFAST set; otherwise it is retried.
Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should now be clear from above that only B_FAILFAST bufs are affected by the failfast state which means any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can allow the transition back to SD_FAILFAST_INACTIVE.
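The three transitions can be modelled as a small user-space state machine. This is a sketch of the description above, not the sd source: all demo_ names are hypothetical, the flush routine is a counter standing in for sd_failfast_flushq(), and demo_return_command() models only the successful-completion case. The handling of a failfast retry that arrives while the instance is already active is an assumption, since the text does not spell it out.

```c
#include <assert.h>
#include <stddef.h>

enum demo_ff_state { DEMO_FAILFAST_INACTIVE, DEMO_FAILFAST_ACTIVE };
enum demo_action   { DEMO_RETRY, DEMO_FAIL };

#define DEMO_B_FAILFAST 0x1  /* placeholder bit, not the real flag value */

struct demo_buf { int b_flags; };

struct demo_lun {
    enum demo_ff_state failfast_state;
    const struct demo_buf *failfast_bp;  /* non-NULL means failfast pending */
    int flushes;                         /* times the stand-in flush ran */
};

/* Stand-in for sd_failfast_flushq(): just records that it was called. */
void demo_failfast_flushq(struct demo_lun *un) { un->flushes++; }

/* Models the failfast bookkeeping described for sd_retry_command(). */
enum demo_action demo_retry_command(struct demo_lun *un,
    const struct demo_buf *bp, int failfast_retry)
{
    assert(un != NULL && bp != NULL);
    if (!failfast_retry) {               /* any other retry reason */
        un->failfast_state = DEMO_FAILFAST_INACTIVE;
        un->failfast_bp = NULL;
        return DEMO_RETRY;
    }
    if (un->failfast_state == DEMO_FAILFAST_ACTIVE) {
        /* Assumption: while active, only B_FAILFAST bufs are failed. */
        return (bp->b_flags & DEMO_B_FAILFAST) ? DEMO_FAIL : DEMO_RETRY;
    }
    if (un->failfast_bp == NULL) {       /* first buf: go pending */
        un->failfast_bp = bp;
        return DEMO_RETRY;
    }
    if (un->failfast_bp == bp) {         /* same buf again: go active */
        un->failfast_state = DEMO_FAILFAST_ACTIVE;
        un->failfast_bp = NULL;
        demo_failfast_flushq(un);
        return (bp->b_flags & DEMO_B_FAILFAST) ? DEMO_FAIL : DEMO_RETRY;
    }
    return DEMO_RETRY;                   /* other bufs: no state change */
}

/* Models a successful completion arriving via sd_return_command(). */
void demo_return_command(struct demo_lun *un)
{
    assert(un != NULL);
    un->failfast_state = DEMO_FAILFAST_INACTIVE;
    un->failfast_bp = NULL;
}
```

Walking one B_FAILFAST buf through two timeouts takes the instance from inactive to pending to active, and a single successful completion of any buf brings it back to inactive.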
Any buf passed into a SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, \*sd_lun, \*buf).
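The entry check amounts to a single condition. The sketch below is hypothetical user-space code (the io_ names, the struct layouts and the placeholder flag bit are mine), and it assumes EIO is the error handed back, in line with the error code mentioned earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum io_ff_state { IO_FAILFAST_INACTIVE, IO_FAILFAST_ACTIVE };

#define IO_B_FAILFAST 0x1  /* placeholder bit, not the real flag value */

struct io_buf { int b_flags; int b_error; };
struct io_lun { enum io_ff_state failfast_state; int queued; };

/*
 * Models the check described for sd_core_iostart(): a B_FAILFAST buf that
 * arrives while the instance is active is failed at once, and anything
 * else is accepted (here: counted as queued).
 */
int io_core_iostart(struct io_lun *un, struct io_buf *bp)
{
    assert(un != NULL && bp != NULL);
    if (un->failfast_state == IO_FAILFAST_ACTIVE &&
        (bp->b_flags & IO_B_FAILFAST)) {
        bp->b_error = EIO;  /* assumed error code */
        return EIO;
    }
    un->queued++;
    return 0;
}
```

A buf without the flag still goes to the device while the instance is active, which is what gives the instance a chance to return to SD_FAILFAST_INACTIVE.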