SVM and the B_FAILFAST flag

Now that OpenSolaris is here, it is a lot easier to talk about some of the interesting implementation details in the code. In this post I want to discuss the first project I did after I started working on the Solaris Volume Manager (SVM). It is on my mind right now because it also happens to be related to one of my most recent changes to the code. This change is not even in Solaris Express yet; it is only available in OpenSolaris. Early access to these kinds of changes is just one small reason why OpenSolaris is so cool.

My first SVM project was to add support for the B_FAILFAST flag. This flag is defined in /usr/include/sys/buf.h, and it was implemented in some of the disk drivers so that I/O requests queued in the driver could be cleared out quickly when the driver saw that a disk was malfunctioning. For SVM the big requester for this feature was our clustering software. The problem they were seeing was that in a production environment there would be many concurrent I/O requests queued up down in the sd driver. When the disk was failing, the sd driver would need to process each of these requests, wait for the timeouts and retries, and slowly drain its queue. The cluster software could not fail over to another node until all of these pending requests had been cleared out of the system. The B_FAILFAST flag is the exact solution to this problem. It tells the driver to do two things. First, it reduces the number of retries that the driver makes to a failing disk before it gives up and returns an error. Second, when the first I/O buf that is queued up in the driver gets an error, the driver immediately errors out all of the other pending bufs in its queue. Furthermore, any new bufs sent down with the B_FAILFAST flag immediately return with an error.
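To make those two behaviors concrete, here is a toy user-space model in C. It is only a sketch and is not the sd driver code: the names (toy_buf_t, toy_issue, toy_drain), the retry counts, and the flag value are all made up for illustration, and the model assumes that only bufs carrying the flag are flushed once the first fail-fast buf fails.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FAILFAST          0x1  /* stand-in for B_FAILFAST; value is arbitrary */
#define TOY_RETRIES_NORMAL    5    /* hypothetical retry budget without the flag */
#define TOY_RETRIES_FAILFAST  1    /* hypothetical reduced budget with the flag */

typedef struct toy_buf {
    int flags;     /* TOY_FAILFAST or 0 */
    int error;     /* 0 = success, 1 = failed */
    int done;      /* 1 once the buf has been completed */
    int attempts;  /* how many times the "disk" was actually tried */
} toy_buf_t;

/* Hypothetical disk access: returns 0 on success, -1 on failure. */
typedef int (*toy_disk_fn)(void);

int toy_disk_bad(void) { return -1; }
int toy_disk_ok(void)  { return 0; }

/* Issue one buf, retrying up to the budget selected by its flags. */
void toy_issue(toy_buf_t *bp, toy_disk_fn disk)
{
    assert(bp != NULL && disk != NULL);

    int budget = (bp->flags & TOY_FAILFAST) ?
        TOY_RETRIES_FAILFAST : TOY_RETRIES_NORMAL;
    int rc = -1;

    for (int i = 0; i < budget && rc != 0; i++) {
        bp->attempts++;
        rc = disk();
    }
    bp->error = (rc != 0);
    bp->done = 1;
}

/*
 * Drain a queue in order. Once a fail-fast buf fails, every later
 * fail-fast buf is completed with an error without touching the disk.
 */
void toy_drain(toy_buf_t *q, size_t n, toy_disk_fn disk)
{
    int tripped = 0;

    for (size_t i = 0; i < n; i++) {
        if (tripped && (q[i].flags & TOY_FAILFAST)) {
            q[i].error = 1;
            q[i].done = 1;
            continue;
        }
        toy_issue(&q[i], disk);
        if (q[i].error && (q[i].flags & TOY_FAILFAST))
            tripped = 1;
    }
}
```

With a failing disk and three flagged bufs, only the first buf ever reaches the disk; the other two come back errored with zero attempts. Without the flag, every buf burns its full retry budget, which is the slow drain the cluster software was complaining about.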

This seemed fairly straightforward to implement in SVM. The code had to be modified to detect whether the underlying devices supported the B_FAILFAST flag and, if so, to set the flag in the buf being passed down from the md driver to the underlying drivers that made up the metadevice. For simplicity we decided to add this support only to the mirror metadevice in SVM. However, the more I looked at this, the more complicated it seemed to be. We were worried about creating new failure modes with B_FAILFAST, and the big concern was the possibility of a "spurious" error: getting back an error on the buf that we would not have seen if we had let the underlying driver perform its full set of timeouts and retries. This concern eventually drove the whole design of the initial B_FAILFAST implementation within the mirror code.

To handle the spurious error case I implemented an algorithm within the driver so that when we got back an errored B_FAILFAST buf we would resubmit that buf without the B_FAILFAST flag set. During this retry, all of the other failed I/O bufs would also immediately come back up into the md driver. I queued those up so that I could either fail all of them after the retried buf finally failed, or resubmit them back down to the underlying driver if the retried I/O succeeded. Implementing this correctly took a lot longer than I expected when I took on this first project, and it was one of those things that worked but that I was never very happy with. The code was complex, and I never felt completely confident that there wasn't some obscure error condition lurking there that would come back to bite us later. In addition, because of the retry, the failover of a component within a mirror actually took *longer* now if there was only a single I/O being processed.
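The shape of that original retry-and-queue scheme can be sketched roughly as follows. This is a simplified user-space model written for this post, not the code that shipped: toy_mirror_t, toy_failfast_error, toy_probe_done, and the fixed-size park list are all hypothetical, and the sketch leaves the flags on the parked bufs untouched when they are resubmitted, since that detail is not spelled out above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FAILFAST    0x1   /* stand-in for B_FAILFAST; value is arbitrary */
#define TOY_MAX_PARKED  16    /* hypothetical limit, just to keep the model simple */

typedef struct toy_io {
    int flags;        /* TOY_FAILFAST or 0 */
    int error;        /* 0 = success, 1 = failed */
    int resubmitted;  /* how many times this buf was sent back down */
} toy_io_t;

typedef struct toy_mirror {
    toy_io_t *probe;                   /* buf being retried without fail-fast */
    toy_io_t *parked[TOY_MAX_PARKED];  /* failed bufs waiting on the probe */
    size_t nparked;
} toy_mirror_t;

/*
 * A fail-fast buf came back with an error. The first one becomes the
 * probe and is retried with the flag cleared; later ones are parked.
 * Returns 1 if the buf became the probe, 0 if parked, -1 if the list is full.
 */
int toy_failfast_error(toy_mirror_t *m, toy_io_t *io)
{
    assert(m != NULL && io != NULL);

    if (m->probe == NULL) {
        io->flags &= ~TOY_FAILFAST;
        io->error = 0;
        io->resubmitted++;
        m->probe = io;
        return 1;
    }
    if (m->nparked == TOY_MAX_PARKED)
        return -1;
    m->parked[m->nparked++] = io;
    return 0;
}

/*
 * The probe finished its full set of retries. If it failed, the error was
 * real and every parked buf fails too; if it succeeded, the error was
 * spurious and the parked bufs are resubmitted. Returns the resubmit count.
 */
size_t toy_probe_done(toy_mirror_t *m, int probe_error)
{
    assert(m != NULL && m->probe != NULL);

    size_t resent = 0;

    m->probe->error = probe_error;
    for (size_t i = 0; i < m->nparked; i++) {
        toy_io_t *io = m->parked[i];
        if (probe_error) {
            io->error = 1;
        } else {
            io->error = 0;
            io->resubmitted++;
            resent++;
        }
    }
    m->nparked = 0;
    m->probe = NULL;
    return resent;
}
```

Even in this stripped-down form there are two pieces of state to keep consistent (the probe and the park list), and a lone failing I/O pays for both the fail-fast attempt and the full retry, which matches the longer single-I/O failover described above.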

This original algorithm was delivered in the Solaris 10 (S10) code and was also released as a patch for Solaris 9 and Solstice DiskSuite (SDS) 4.2.1. It has been in use for a couple of years, which gave me some direct experience with how well the B_FAILFAST option worked in real life. We have actually seen one or two of these so-called spurious errors, but in all cases there were real, underlying problems with the disks. The storage was marginal, and SVM would have been better off just erroring out those components within the mirror and immediately failing over to the good side of the mirror. By this time I was comfortable with this idea, so I rewrote the B_FAILFAST code within the mirror driver. This new algorithm is what you'll see today in the OpenSolaris code base. I basically decided to just trust the error we get back when B_FAILFAST is set. The code follows the normal error path, so it puts the submirror component into the maintenance state and uses only the other, good side of the mirror from that point onward. I was able to remove the queue and simplify the logic almost back to what it was before we added support for B_FAILFAST.

However, there is still one special case we have to worry about when using B_FAILFAST. As I mentioned above, when B_FAILFAST is set, all of the pending I/O bufs that are queued down in the underlying driver will fail once the first buf gets an error. When we are down to the last side of a mirror, the SVM code will continue to try to do I/O to those last submirror components, even though they are taking errors. This is called the LAST_ERRED state within SVM, and it is an attempt to provide access to as much of your data as possible. When using B_FAILFAST, it is likely that some of the failed I/O bufs were never seen by the disk and so never had a chance to succeed. With the new algorithm the code detects this state and reissues all of the I/O bufs without B_FAILFAST set. There is no longer any queueing; we just resubmit the I/O bufs without the flag, and all future I/O to the submirror is done without the flag. Once the LAST_ERRED state is cleared, the code returns to using the B_FAILFAST flag.
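The decision logic described here is small enough to sketch in a few lines of C. Again, this is a simplified model with made-up names (toy_submirror_t, toy_io_flags, toy_on_done) rather than the actual mirror code, and it tracks the state per submirror purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FAILFAST 0x1   /* stand-in for B_FAILFAST; value is arbitrary */

/* Hypothetical, reduced set of states for the sketch. */
typedef enum {
    TOY_SM_OKAY,
    TOY_SM_MAINT,
    TOY_SM_LAST_ERRED
} toy_sm_state_t;

typedef enum {
    TOY_ACT_NONE,                 /* I/O succeeded, nothing to do */
    TOY_ACT_NORMAL_ERROR_PATH,    /* trust the error, handle it as usual */
    TOY_ACT_REISSUE_NO_FAILFAST   /* last side of the mirror: try again in full */
} toy_action_t;

typedef struct toy_submirror {
    toy_sm_state_t state;
    int supports_failfast;  /* 1 if every component can take the flag */
} toy_submirror_t;

/* Flags to use for new I/O: fail-fast is suppressed while LAST_ERRED. */
int toy_io_flags(const toy_submirror_t *sm)
{
    assert(sm != NULL);

    if (sm->supports_failfast && sm->state != TOY_SM_LAST_ERRED)
        return TOY_FAILFAST;
    return 0;
}

/* What to do when an I/O to this submirror completes. */
toy_action_t toy_on_done(const toy_submirror_t *sm, int io_error, int io_flags)
{
    assert(sm != NULL);

    if (!io_error)
        return TOY_ACT_NONE;
    if ((io_flags & TOY_FAILFAST) && sm->state == TOY_SM_LAST_ERRED)
        return TOY_ACT_REISSUE_NO_FAILFAST;
    return TOY_ACT_NORMAL_ERROR_PATH;
}
```

Because the choice of flags is derived from the current state on every I/O, nothing extra is needed to "return" to fail-fast: once the state is no longer LAST_ERRED, toy_io_flags starts handing out the flag again.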

All of this is really an implementation detail of mirroring in SVM. There is no user-visible component to this except for a change in how quickly the mirror fails the errored drives in the submirror. All of the code is contained within the mirror portion of the SVM driver, and you can see it in mirror.c. The function mirror_check_failfast is used to determine whether all of the components in a submirror support the B_FAILFAST flag. The mirror_done function is called when the I/O to the underlying submirror is complete. In this function we check whether the I/O failed and whether B_FAILFAST was set. If so, we call the submirror_is_lasterred function to check for that condition, and the last_err_retry function is called only when we need to resubmit the I/O. That function is actually executed in a helper thread, since the I/O completes in a thread separate from the one that initiated the I/O down into the md driver.
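Two ideas from this paragraph are easy to show in isolation: the "all components must support the flag" check, and handing the retry to a helper instead of doing it in the completion path. The C below is a user-space sketch only. None of these names (toy_all_support_failfast, toy_defer_retry, toy_helper_drain) exist in mirror.c, and a plain array stands in for whatever mechanism the driver really uses to reach its helper thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_WORK 8   /* hypothetical queue depth */

typedef struct toy_comp {
    int supports_failfast;  /* 1 if the underlying device honors the flag */
} toy_comp_t;

/* Fail-fast is usable only if every component supports it. */
int toy_all_support_failfast(const toy_comp_t *comps, size_t n)
{
    assert(comps != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!comps[i].supports_failfast)
            return 0;
    }
    return 1;  /* note: vacuously true for an empty list */
}

typedef struct toy_workq {
    int ids[TOY_MAX_WORK];  /* identifiers of I/Os waiting to be retried */
    size_t count;
} toy_workq_t;

/* Completion context: just record the retry, do not resubmit here. */
int toy_defer_retry(toy_workq_t *wq, int io_id)
{
    assert(wq != NULL);

    if (wq->count == TOY_MAX_WORK)
        return -1;
    wq->ids[wq->count++] = io_id;
    return 0;
}

/* Helper context: resubmit everything that was deferred. */
size_t toy_helper_drain(toy_workq_t *wq, void (*resubmit)(int io_id))
{
    assert(wq != NULL && resubmit != NULL);

    size_t n = wq->count;

    for (size_t i = 0; i < n; i++)
        resubmit(wq->ids[i]);
    wq->count = 0;
    return n;
}

/* Trivial resubmit callback that only counts calls, for demonstration. */
int toy_resubmit_calls = 0;

void toy_count_resubmit(int io_id)
{
    (void)io_id;
    toy_resubmit_calls++;
}
```

The point of the split is that the completion side stays short and never issues new I/O itself; the resubmission happens later, from a context where it is safe to do so.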

To wrap up, the SVM md driver code lives in the source tree at usr/src/uts/common/io/lvm. The main md driver is in the md subdirectory, and each specific kind of metadevice also has its own subdirectory (mirror, stripe, etc.). The SVM command line utilities live in usr/src/cmd/lvm, and the shared library code that SVM uses lives in usr/src/lib/lvm. Libmeta is the primary library. In another post I'll talk in more detail about some of these other components of SVM.

