Bug of the week

There are many bugs out there that are interesting, either because of an implementation detail or the debugging necessary to root cause the problem. As you may have noticed, I like to publicly expound upon the most interesting ones I've fixed (as long as it's not a security vulnerability). This week turned up a rather interesting specimen:

6198523 dirfindvp() can erroneously return ENOENT

This bug was first spotted by Casper back in November last year while trying to do some builds on ZFS. The basic pathology is that at some point during the build, we'd get error messages like:

sh: cannot determine current directory

Some ideas were kicked around by the ZFS team, and after the problem seemed to go away, the team believed that some recent mass of changes had also fixed the problem. Five months later, Jonathan hit the same bug on another build machine running ZFS. As I wrote the getcwd() code, I was determined to root cause the problem this time around.

Back in build 56 of S10, I moved getcwd(3c) into the kernel, along with changes to store pathnames with vnodes (which is used by the DTrace I/O provider as well as pfiles(1)). Basically, we first try to do a forward lookup on the stored pathname; if that works, then we simply return the resolved path1. If this fails (vnode paths are never guaranteed to be correct), then we have to fall back into the slow path. This slow path involves looking up the parent, finding the current vnode in parent, prepending path, and repeat. Once we reach the root of the filesystem, we have a complete path.

To debug this problem, I used this D script to track the behavior of dirtopath(), the function that performs the dirty work of the slow path. Running this for a while produced a tasty bit of information:

dirtopath       /export/ws/build/usr/src/cmd/sgs/ld
lookup(/export/ws/build/usr/src/cmd, .make.dependency.8309dfdc.234596.166) failed (2)
dirfindvp(/export/ws/build/usr/src/cmd,/export/ws/build/usr/src/cmd/sgs) failed (2)
dirtopath() returned 2

Looking at this, it was clear that dirfindvp() (which finds a given vnode in its parent) was inappropriately failing. In particular, after a failed lookup for a temporary make file, we bail out of the loop and report failure, despite the fact that "sgs" is still sitting there in the directory. A long look at the code revealed the problem. Without revealing too much of the code (OpenSolaris, where are you?), it's essentially structured like so:

        while (!err && !eof) {

                /\* ... \*/

                while ((intptr_t)dp < (intptr_t)dbuf + dbuflen) {

                        /\* ... \*/

                        /\*
                         \* We only want to bail out if there was an error other
                         \* than ENOENT.  Otherwise, it could be that someone
                         \* just removed an entry since the readdir() call, and
                         \* the entry we want is further on in the directory.
                         \*/
                        if (err != ENOENT) {                             
                                break;
                        }                                                       
                }                                                               
        }           

The code is trying to avoid exactly our situation: we fail to do a lookup of a file we just saw beacuse the contents are rapidly changing. The bug is that in the while loop we have a check for !err && !eof. If we fail to look up an entry, and it's the last entry in the chunk we just read, then we'll prematurely bail out of the enclosing while loop, returning ENOENT when we shouldn't. Using this test program, it's easy to reproduce on both ZFS and UFS. There are several noteworthy aspects of this bug:

  • The bug had been in the gate for over a year, and there hadn't been a single reported build failure.

  • It only happens when the cached vnode value is invalid, which is rare2.

  • It is a race condition between readdir, lookup, and remove.

  • On UFS, inodes are marked as deleted but can still be looked up until the delete queue is processed at a later point. ZFS deletes entries immediately, so this was much more apparent on ZFS.

  • Because of the above, it was incredibly transient. It would have taken an order of magnitude more time to root cause if not for DTrace, which excels at solving these transient phenomena

A three line change and the bug was fixed, and will make it back to S10 in time for Update 1. If it hadn't been for those among us willing to run our builds on top of ZFS, this problem would not have been found until ZFS integrated, or a customer escalation cost the company a whole bunch of money.


1 There are many more subtleties here relating to Zones, and verifying that the path hasn't been changed to refer to another file. The curious among you will have to wait for OpenSolaris.

2 I haven't yet investigated why we ended up in the slow path in this case. First things first.

Comments:

Post a Comment:
Comments are closed for this entry.
About

Musings about Fishworks, Operating Systems, and the software that runs on them.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today