Thursday Sep 25, 2008

Debugging a refcnt

A refcnt is a way to keep track of how many outstanding holds are being used on a data structure. Say that you need to reference something in the data structure, but you don't want to do it while holding a lock. You do want to make sure that the data is still there when you go back. What you do is bump the refcnt and have the cleanup code never free that data until the refcnt goes back to 0.

Some issues are what happens if it never goes back to 0? Or if it accidentally goes back to 0 before you are done with it? The answers are that you leak memory and crash, in that order.

I'm debugging the scenario where the refcnt goes back to 0 before you release your hold. Before I dive into looking at that, let's look at the relevant code in the mirror mount source: usr/src/uts/common/fs/nfs/nfs4_stub_vnops.c (Note that this code will change over time, I'm linking to the copy of the gate.)

We bump the count here:

    291 static void
    292 nfs4_ephemeral_tree_hold(nfs4_ephemeral_tree_t \*net)
    293 {
    294 	mutex_enter(&net->net_cnt_lock);
    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);
    297 	mutex_exit(&net->net_cnt_lock);
    298 }

The first thing to note is that we grab net_cnt_lock before we do anything. How do we even know that is valid? Well, we grab a lock on the mountinfo and check to see if there is an ephemeral node:

    696 	mutex_enter(&mi->mi_lock);
    701 	if (mi->mi_ephemeral_tree == NULL) {
    727 	} else {
    728 		net = mi->mi_ephemeral_tree;
    729 		mutex_exit(&mi->mi_lock);
    731 		nfs4_ephemeral_tree_hold(net);

Oh, that is very naughty indeed - what I said and what the code says disagree. What do other parts of the code say?

   1422 	mutex_enter(&mi->mi_lock);
   1444 	nfs4_ephemeral_tree_hold(net);
   1528 	mutex_exit(&mi->mi_lock);

And a special case in nfs4_ephemeral_harvest_forest

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2102 		nfs4_ephemeral_tree_hold(net);

I expect the code at 2102 to not hold the mntinfo4. We don't have our hands on that info at the time. But I believe the code at 731 to be a bug that will eventually bit us.

But back to the hold code:

    295 	net->net_refcnt++;
    296 	ASSERT(net->net_refcnt != 0);

We bump the refcnt and immediately assert that it is not 0. As refcnt is an unsigned int, this will catch the case where we wrap over the size of an integer. We don't ever expect to hit this case.

It is in the rele case that we are more concerned:

    300 /\*
    301  \* We need a safe way to decrement the refcnt whilst the
    302  \* lock is being held.
    303  \*/
    304 static void
    305 nfs4_ephemeral_tree_decr(nfs4_ephemeral_tree_t \*net)
    306 {
    307 	ASSERT(mutex_owned(&net->net_cnt_lock));
    308 	ASSERT(net->net_refcnt != 0);
    309 	net->net_refcnt--;
    310 }
    312 static void
    313 nfs4_ephemeral_tree_rele(nfs4_ephemeral_tree_t \*net)
    314 {
    315 	mutex_enter(&net->net_cnt_lock);
    316 	nfs4_ephemeral_tree_decr(net);
    317 	mutex_exit(&net->net_cnt_lock);
    318 }

First note that either we hold the lock ourself and call nfs4_ephemeral_tree_decr or we let nfs4_ephemeral_tree_rele grab the lock for us. The assert on 307 checks that this is valid. The assert on 308 checks to see that we are not already at 0 when we decrement. I.e., because the refcnt is unsigned, once we drop below 0, we can not do a check like:

                ASSERT(net->net_refcnt >= 0);

By the way, we do not need to be holding mi->mi_lock everywhere we rele or decr. The reason we need to when we hold is that we need to know the memory is valid. With the other two operations, by definition, the memory must be valid.

To prove that statement, we need to show that the ephemeral tree can not be disassociated from the mntinfo4 without the mi->mi_lock being held. The only place we do that is in nfs4_ephemeral_umount:

   2003 		/\*
   2004 		 \* At this point, the tree should no
   2005 		 \* longer be associated with the
   2006 		 \* mntinfo4. We need to pull it off
   2007 		 \* there and let the harvester take
   2008 		 \* care of it once the refcnt drops.
   2009 		 \*/
   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Okay, further validation that we should have the lock held at 731.

The final piece of the puzzle is what happens in the harvester? I.e., we decided in 2011 to disassociate the tree with the mntinfo4. But we didn't release the memory. How do we do that? Well, we look in nfs4_ephemeral_harvest_forest and we also see why we don't care about holding mi->mi_lock there.

   2098 	mutex_enter(&ntg->ntg_forest_lock);
   2099 	for (net = ntg->ntg_forest; net != NULL; net = next) {
   2100 		next = net->net_next;
   2102 		nfs4_ephemeral_tree_hold(net);

When we hold ntg->ntg_forest_lock, no other thread is allowed to manipulate the forest of ephemeral trees. What that means is that the current forest is locked and the memory is not going away. Hence we don't need to hold mi->mi_lock. The other code has to hold it because the reference in the mntinfo4 can go away (see line 2011).

   2140 		while (e) {
   2247 			e = prior;
   2248 		}
   2250 check_done:
   2252 		/\*
   2253 		 \* At this point we are done processing this tree.
   2254 		 \*
   2255 		 \* If the tree is invalid and we are the only reference
   2256 		 \* to it, then we push it on the local linked list
   2257 		 \* to remove it at the end. We avoid that action now
   2258 		 \* to keep the tree processing going along at a fair clip.
   2259 		 \*
   2260 		 \* Else, even if we are the only reference, we drop
   2261 		 \* our hold on the current tree and allow it to be
   2262 		 \* reused as needed.
   2263 		 \*/
   2264 		mutex_enter(&net->net_cnt_lock);
   2265 		if (net->net_refcnt == 1 &&
   2266 		    net->net_status & NFS4_EPHEMERAL_TREE_INVALID) {
   2267 			nfs4_ephemeral_tree_decr(net);
   2268 			net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2269 			mutex_exit(&net->net_cnt_lock);
   2270 			mutex_exit(&net->net_tree_lock);
   2272 			if (prev)
   2273 				prev->net_next = net->net_next;
   2274 			else
   2275 				ntg->ntg_forest = net->net_next;
   2277 			net->net_next = harvest;
   2278 			harvest = net;
   2279 			continue;
   2280 		}
   2282 		nfs4_ephemeral_tree_decr(net);
   2283 		net->net_status &= ~NFS4_EPHEMERAL_TREE_LOCKED;
   2284 		mutex_exit(&net->net_cnt_lock);
   2285 		mutex_exit(&net->net_tree_lock);
   2287 		prev = net;
   2288 	}
   2289 	mutex_exit(&ntg->ntg_forest_lock);

2248 completes the a while processing loop started in 2140 above. Among other things, this loop decides if a mirror mount has been idle long enough to automatically release it. It does this in a recursion elimination to visit every child and sibling node of the root of the ephemeral tree.

The interesting code starts on lines 2265 and 2266. If the refcnt is 1, then the hold we established on 2102 is the only reference to this node. Then, if the tree has been invalidated, either by the harvest loop above (this time or earlier) or by a manual unmount, then we know it is safe to release this node. We decrement the refcnt at 2267. This is important because if we leave it at 1 and push it on the harvest list, then another node can do a rele and we think it is okay. We remove it from the forst in lines 2272 to 2275 and push it onto a local linked list harvest on lines 2277 and 2278. We then go on to process the next node.

When we have finished all trees in the forest, they are all in one of 3 states:

  1. Not in use by a thread and harvested.
  2. Not idled out. May or may not be in use by a thread.
  3. In use by a thread, idled out, and not harvested.

With all of that background out of the way, let us look at a coredump:

> $c
assfail+0x7e(fffffffff8490b98, fffffffff8490b70, 134)
nfs4_ephemeral_umount_unlock+0x3f(ffffff0004f33c14, ffffff0004f33bb8)
nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
nfs4_free_mount+0x2fe(ffffff047ae40dc8, 400, ffffff01974f2b28)
> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_status = 0
    net_refcnt = 0
> ffffff0004f33c14::print boolean_t
1 (B_TRUE)

This is the assert on line 308. Some things we can tell right away is that this should be a manual unmount, since we came through nfs4_free_mount and not the harvester. The flags are 0x400, which means MS_FORCE - a forced umount call. Also, while the refcnt is 0, it has not been harvested. In that case, we would expect to see that the state would include NFS4_EPHEMERAL_TREE_INVALID. And indeed, the mntinfo4 (see net_mount) does still point to this node:

> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree
mi_ephemeral_tree = 0xffffff020db7fc18
> ffffff0523c36000::print mntinfo4_t mi_ephemeral_tree | ::print nfs4_ephemeral_tree_t
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_status = 0
    net_refcnt = 0

So, who did us in? Well, the first thought I had was this piece of code in nfs4_ephemeral_umount : (which we would have just called)

   1990 		nfs4_ephemeral_tree_decr(net);
   1991 		mutex_exit(&net->net_cnt_lock);
   1993 		if (was_locked == FALSE)
   1994 			mutex_exit(&net->net_tree_lock);

We only hold the refcnt in here if we do not set was_locked to TRUE:

   1764 	int			was_locked = FALSE;
   1845 			was_locked = TRUE;
   1846 		} else {
   1847 			net->net_refcnt++;
   1848 			ASSERT(net->net_refcnt != 0);
   1849 		}
   1851 		mutex_exit(&net->net_cnt_lock);

Hmm, we also avoid using nfs4_ephemeral_tree_hold. We should probably create a symmetric function for incr like we did for decr.

Anyway, I fixed the code at 19909 and we still see the panic. The new one has the same signature. By the way, I found this probable error by matching holds to reles. I stopped with just the simple cases when I found the above issue.

Okay, what does nfs4_free_mount() do?

   4039 	/\*
   4040 	 \* If we got an error, then do not nuke the
   4041 	 \* tree. Either the harvester is busy reclaiming
   4042 	 \* this node or we ran into some busy condition.
   4043 	 \*
   4044 	 \* The harvester will eventually come along and cleanup.
   4045 	 \* The only problem would be the root mount point.
   4046 	 \*
   4047 	 \* Since the busy node can occur for a variety
   4048 	 \* of reasons and can result in an entry staying
   4049 	 \* in df output but no longer accessible from the
   4050 	 \* directory tree, we are okay.
   4051 	 \*/
   4052 	if (!nfs4_ephemeral_umount(mi, flag, cr,
   4053 	    &must_unlock, &eph_tree))
   4054 		nfs4_ephemeral_umount_activate(mi, &must_unlock,
   4055 		    &eph_tree);

So we call nfs4_ephemeral_umount, which had that suspect piece of code. We know that we will not set NFS4_EPHEMERAL_TREE_LOCKED ourselves. We can see that it was not held, because pmust_unlock was set to TRUE. Also, the status tells us we did not mark the tree as NFS4_EPHEMERAL_TREE_INVALID and the fact that the mntinfo4 pointer is valid tells us we did not disassociate the tree from it. All of which tells me fixing line 1990 would probably have no impact on the execution path. I.e., I'm starting to suspect we don't have a problem with a race case.

The only way to return 0 and not disassociate the tree is to go down this path back in nfs4_ephemeral_umount:

   1945 	if (!is_derooting) {
   1946 		/\*
   1947 		 \* Only work on children if the caller has not already
   1948 		 \* done so.
   1949 		 \*/
   1950 		if (!is_recursed) {
   1951 			ASSERT(eph != NULL);
   1953 			error = nfs4_ephemeral_unmount_engine(eph,
   1954 			    FALSE, flag, cr);
   1955 			if (error)
   1956 				goto is_busy;
   1957 		}
   1958 	} else {

And as a matter of fact, the evidence suggests that no other code did this either. nfs4_ephemeral_unmount_engine must have been error free.

So either we have decremented the refcnt twice for our one hold or we decremented when we did not have a hold.

Heh, I've gone down many a code branch, only to backtrack as I found the code was correct. Perhaps we should return to our debug session? nfs4_ephemeral_umount_activate takes a pointer to a mntinfo4 and a pointer to a pointer of a nfs4_ephemeral_tree_t:

nfs4_ephemeral_umount_activate+0x7b(ffffff0524673000, ffffff0004f33c14, ffffff0004f33bb8)
> \*ffffff0004f33bb8::print nfs4_ephemeral_tree_t
    net_mount = 0xffffff0523c36000
    net_root = 0
    net_next = 0
    net_tree_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_cnt_lock = {
        _opaque = [ 0xffffff0004f33c80 ]
    net_status = 0
    net_refcnt = 0
> ffffff0524673000::print mntinfo4_t mi_ephemeral
mi_ephemeral = 0

The ephemeral node is not pointing to the same thing as the function. And I have more faith in the function, because it had to just execute:

   1733 	if (mi->mi_ephemeral) {
   1734 		/\*
   1735 		 \* If we are the root node of an ephemeral branch
   1736 		 \* which is being removed, then we need to fixup
   1737 		 \* pointers into and out of the node.
   1738 		 \*/
   1739 		if (!(mi->mi_flags & MI4_EPHEMERAL_RECURSED))
   1740 			nfs4_ephemeral_umount_cleanup(mi->mi_ephemeral);
   1742 		ASSERT(mi->mi_ephemeral != NULL);
   1744 		kmem_free(mi->mi_ephemeral, sizeof (\*mi->mi_ephemeral));
   1745 		mi->mi_ephemeral = NULL;
   1746 	}
   1747 	mutex_exit(&mi->mi_lock);
   1749 	nfs4_ephemeral_umount_unlock(pmust_unlock, pnet);

I.e., if we pass in a mntinfo4, either the ephemeral tree is already NULL or we free it and NULL it out. And oh crap, if mi->mi_ephemeral was a pointer to the same memory location as \*pnet, we just screwed the pooch and caused a core. I.e., the only place we should be removing this memory is at:

   2010 		mutex_enter(&mi->mi_lock);
   2011 		mi->mi_ephemeral_tree = NULL;
   2012 		mutex_exit(&mi->mi_lock);

Well, this entry is long enough as it is. I'll convince myself that removing 1744 is the correct thing to do and then test that fix. Note, this kmem_free might explain why systems do not always core. If the memory is garbage, all it has to be is non-zero for the assert to not trigger. Whether or not other nasty things could have happened is to be seen later.

UPDATE: Hmm, one is a mi->mi_ephemeral and the other is a mi->mi_ephemeral_tree. The end result may stand, but my analysis may be off. Back with more later!

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily



« April 2014