Monday May 10, 2010

Failfast for Solaris IO

Introduction

Recently I have been trying to better understand Solaris IO, specifically what goes on once a process enters the biowait(\*buf) function. As part of this investigation I needed to learn about failfast for Solaris IO, which I discuss in this blog post. Failfast was first presented as PSARC/2002/126: Buf Flag for Faster Failover. If you do not already have a general knowledge of Solaris IO internals, I strongly recommend reading General Flow of Control from the Writing Device Drivers book, available for free on docs.sun.com.


At a later date I would like to expand this article with more discussion of the interaction between ZFS & SVM and the failfast flag, along with a more general Solaris IO entry.


Functionality

Data buffers (buffer, buf or bp from now on) passed into the (s)sd driver ("the driver") are encapsulated within a scsi_pkt structure (packet or pkt) in sd_initpkt_for_buf(\*buf, \*\*scsi_pkt) before they are passed to the transport (glm, qlc, etc.) via scsi_transport(\*scsi_pkt).


sd_initpkt_for_buf() sets the scsi_pkt's command completion routine, pkt_comp, to sdintr(\*scsi_pkt). When the transport has finished processing a packet (e.g. due to a completion, timeout or error) it sets pkt_reason as required and then calls pkt_comp to pass the packet back to the driver. Commands time out if no response is received from the target within pkt_time, set to SD_IO_TIME (60s) by sd_initpkt_for_buf(). In case of timeout the driver will attempt to retry the packet up to SD_RETRY_COUNT times (3 for fibre channel, otherwise 5). This means that without failfast it can take up to 5 minutes for, e.g., a read to return an error in the case of a non-responsive disk.
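To make the arithmetic concrete, here is a minimal sketch in C. It assumes that every attempt waits for the full pkt_time before timing out; the function and macro names are hypothetical and are not taken from sd.

```c
/*
 * Illustrative only: worst-case seconds before an error is returned,
 * assuming every attempt waits for the full command timeout.
 */
#define DEMO_SD_IO_TIME 60  /* seconds, the value quoted in the text */

static int
demo_worst_case_secs(int io_time_secs, int attempts)
{
    return (io_time_secs * attempts);
}
```

Five attempts of 60s each give 300s, the 5 minutes quoted above; the fibre channel count of 3 gives 180s. If the original attempt is counted separately from the retries, add one more timeout period.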


Failfast is a mechanism within the driver that fails a pending buf more quickly and informs the upper-layer Volume Manager (VM: ZFS, SVM, VxVM). Co-operation is required from the VM, which must set B_FAILFAST in the buf's b_flags mask to enable the behaviour (the VM can check the driver's ddi-failfast-supported property to know whether B_FAILFAST can be used).
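As a rough illustration of the opt-in, the sketch below shows an upper layer setting the flag only once it has learned, from the ddi-failfast-supported property, that B_FAILFAST can be used. The struct, the function and the flag value are hypothetical stand-ins; the real definitions live in the Solaris kernel headers.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the flag value is illustrative only. */
#define DEMO_B_FAILFAST 0x1000000

struct demo_buf {
    int b_flags;    /* bit mask of buf flags */
};

/* Opt the buf in to failfast only when support has been advertised. */
static void
demo_prepare_buf(struct demo_buf *bp, bool failfast_supported)
{
    if (failfast_supported)
        bp->b_flags |= DEMO_B_FAILFAST;
}
```

A buf prepared without advertised support keeps its flags untouched, so the driver treats it like any other IO.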


Most VMs tend to round-robin read IO when multiple copies of the data exist. In the case of a mirror where one disk has gone away, we ultimately expect all of our read IOs to be serviced by the working disk. For this to happen the driver must return a failure code (EIO) to the VM so that it can retry with the working disk. When B_FAILFAST is set we can return EIO faster, thereby reducing the overall average IO time. The B_FAILFAST flag was initially proposed as B_ALTDATASRC, as this accurately describes the condition that must hold for us to want failfast behaviour: an alternative source for the data exists.
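The read path described above can be sketched as follows. This is a toy model of a two-way mirror, not code from any real volume manager; every name in it is hypothetical, and a "dead" side simply returns EIO at once, where a real disk would return it only after the driver gives up.

```c
#include <errno.h>

#define DEMO_NSIDES 2

struct demo_mirror {
    int next_side;                  /* round-robin cursor */
    int side_dead[DEMO_NSIDES];     /* nonzero: reads from this side fail */
};

/* Hypothetical per-side read: 0 on success, EIO on failure. */
static int
demo_side_read(const struct demo_mirror *m, int side)
{
    return (m->side_dead[side] ? EIO : 0);
}

/* Try the round-robin choice first; on EIO fall back to the other side. */
static int
demo_mirror_read(struct demo_mirror *m)
{
    int first = m->next_side;
    int i;

    m->next_side = (first + 1) % DEMO_NSIDES;
    for (i = 0; i < DEMO_NSIDES; i++) {
        if (demo_side_read(m, (first + i) % DEMO_NSIDES) == 0)
            return (0);
    }
    return (EIO);   /* no working copy left */
}
```

The fallback can only start once the first side has returned EIO, which is why the time the driver takes to report the failure dominates the read latency.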


Implementation

Within sd each physical target LUN is represented as an sd instance (sd_lun or un), each of which tracks its internal failfast state in un_failfast_state and un_failfast_bp. The instance may be in one of three states: SD_FAILFAST_INACTIVE, failfast pending (an inferred state where un_failfast_bp != NULL) and SD_FAILFAST_ACTIVE.
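Because failfast pending is inferred rather than stored, it may help to see how the three states fall out of the two fields. The sketch below uses hypothetical types that only mirror the field names mentioned above; it is not the sd_lun definition.

```c
#include <stddef.h>

/* The two values actually stored in un_failfast_state. */
enum demo_ff_state { DEMO_FF_INACTIVE, DEMO_FF_ACTIVE };

/* The three states as seen by a reader of the code. */
enum demo_ff_view { DEMO_VIEW_INACTIVE, DEMO_VIEW_PENDING, DEMO_VIEW_ACTIVE };

struct demo_buf {
    int b_flags;
};

struct demo_lun {
    enum demo_ff_state un_failfast_state;
    struct demo_buf *un_failfast_bp;
};

/* Derive the effective state from the two tracking fields. */
static enum demo_ff_view
demo_ff_classify(const struct demo_lun *un)
{
    if (un->un_failfast_state == DEMO_FF_ACTIVE)
        return (DEMO_VIEW_ACTIVE);
    if (un->un_failfast_bp != NULL)
        return (DEMO_VIEW_PENDING);
    return (DEMO_VIEW_INACTIVE);
}
```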


When any packet (i.e. regardless of B_FAILFAST) returned to sd via pkt_comp qualifies for a retry due to a timeout condition specified in pkt_reason (these are CMD_TIMEOUT, and CMD_INCOMPLETE where the incomplete reason is a selection timeout), we call into sd_retry_command(\*sd_lun, \*buf, int retry_check_flag, ...) with the buf and with SD_RETRIES_FAILFAST set in retry_check_flag. sd_retry_command() and sd_return_command(\*sd_lun, \*buf) change the instance failfast state. Every instance begins in SD_FAILFAST_INACTIVE.


Transition to failfast pending: The first buf to enter sd_retry_command() with SD_RETRIES_FAILFAST set will take the sd instance into the failfast pending state by registering itself as the un_failfast_bp. The buf is then retried normally. Subsequent SD_RETRIES_FAILFAST bufs will be retried without changing any failfast state.


Transition to SD_FAILFAST_ACTIVE: When the un_failfast_bp buf returns to sd_retry_command() it transitions the instance to SD_FAILFAST_ACTIVE by setting un_failfast_state and clearing un_failfast_bp. sd_failfast_flushq(\*sd_lun) is then called, which arranges for all B_FAILFAST bufs on the wait queue to be returned to the caller with a suitable error set (this is done via a thread). The buf itself is also returned with an error set if it has B_FAILFAST set; otherwise it is retried.


Transition to SD_FAILFAST_INACTIVE: Any buf that either completes successfully (via sd_return_command()) or requires a retry for any reason other than those that take us into failfast pending will transition us into SD_FAILFAST_INACTIVE by updating un_failfast_state and clearing un_failfast_bp. It should be clear from the above that only B_FAILFAST bufs are affected by the failfast state, which means that any subsequent buf without B_FAILFAST (or indeed any buf currently in the transport) can trigger the transition back to SD_FAILFAST_INACTIVE.
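The three transitions can be modelled in a few lines of C. This is a sketch of the behaviour as described in this post, not a copy of the sd source: the types, the flag value and the flush counter are hypothetical, and the wait-queue flush is reduced to a counter increment.

```c
#include <stddef.h>

#define DEMO_B_FAILFAST 0x1000000   /* illustrative value only */

enum demo_ff_state { DEMO_FF_INACTIVE, DEMO_FF_ACTIVE };
enum demo_action { DEMO_ACT_RETRY, DEMO_ACT_FAIL };

struct demo_buf {
    int b_flags;
};

struct demo_lun {
    enum demo_ff_state un_failfast_state;
    struct demo_buf *un_failfast_bp;
    int un_flushes;     /* stands in for calls to the queue flush */
};

/* Retry path: failfast_retry models SD_RETRIES_FAILFAST being set. */
static enum demo_action
demo_retry_command(struct demo_lun *un, struct demo_buf *bp, int failfast_retry)
{
    if (!failfast_retry) {
        /* Retry for some other reason: back to inactive. */
        un->un_failfast_state = DEMO_FF_INACTIVE;
        un->un_failfast_bp = NULL;
        return (DEMO_ACT_RETRY);
    }
    if (un->un_failfast_bp == NULL) {
        /* First timed-out buf: enter failfast pending. */
        un->un_failfast_bp = bp;
    } else if (un->un_failfast_bp == bp) {
        /* The same buf timed out again: go active and flush. */
        un->un_failfast_state = DEMO_FF_ACTIVE;
        un->un_failfast_bp = NULL;
        un->un_flushes++;
        if (bp->b_flags & DEMO_B_FAILFAST)
            return (DEMO_ACT_FAIL);
    }
    /* Any other buf is retried without touching failfast state. */
    return (DEMO_ACT_RETRY);
}

/* Successful completion: back to inactive. */
static void
demo_return_command(struct demo_lun *un)
{
    un->un_failfast_state = DEMO_FF_INACTIVE;
    un->un_failfast_bp = NULL;
}
```

Note that in this model a buf without B_FAILFAST can still drive the instance into the active state; it is simply retried rather than failed when it does so.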


Any buf passed into an SD_FAILFAST_ACTIVE sd instance with B_FAILFAST set is immediately failed in sd_core_iostart(int index, \*sd_lun, \*buf).
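A minimal sketch of that entry check follows. The types, the flag value and the function are hypothetical, and the use of EIO as the error is an assumption based on the failure code mentioned earlier in this post.

```c
#include <errno.h>

#define DEMO_B_FAILFAST 0x1000000   /* illustrative value only */

enum demo_ff_state { DEMO_FF_INACTIVE, DEMO_FF_ACTIVE };

struct demo_buf {
    int b_flags;
    int b_error;
};

struct demo_lun {
    enum demo_ff_state un_failfast_state;
};

/*
 * Returns EIO if the buf is failed on entry, or 0 if it would
 * proceed to the normal queueing path.
 */
static int
demo_core_iostart(const struct demo_lun *un, struct demo_buf *bp)
{
    if (un->un_failfast_state == DEMO_FF_ACTIVE &&
        (bp->b_flags & DEMO_B_FAILFAST)) {
        bp->b_error = EIO;  /* assumed error code */
        return (EIO);
    }
    return (0);
}
```

Only bufs that opted in are failed early; a buf without the flag is queued as usual even while the instance is active.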


Thursday Aug 13, 2009

Solaris RPE Kernel rotation

This month I am on a rotation into the Solaris RPE (Revenue Product Engineering) Kernel team where I have picked a bug in the Solaris kernel which I am attempting to diagnose and fix.  Thanks to Mita Solanky, Chris Beal, Bill Watson & Rob Harris for helping me arrange this.


Normally I work as a part of the TSC (Technical Solutions Centre) Kernel team as an engineer who diagnoses kernel-related issues.  This could be system panics (i.e. crash dump analysis), performance problems, errors, queries from customers, etc.  The RPE organisation is the direct escalation path for us engineers in the TSC, and they are required to understand and provide fixes for bugs logged by the TSC.


By spending time in RPE I am getting to know better how this organisation functions (e.g. process), the exact role the engineers play, what external pressures they have and more about how code changes go back into the Solaris product.  RPE have some communication with NRE (New Revenue Engineering) which is interesting for me as I do not usually encounter NRE engineers as part of my day-to-day job.


Once the rotation has finished I'll blog again about my experience here as compared to the TSC role I normally play.  For now, this serves as a brief introduction to my next blog entry, where I will discuss the bug I am working on.
