Friday Oct 10, 2008

Understanding layout creation to understand what spe will have to do

In my task list for spe, a large item has been how to tie it into the current code base - you might have seen me reference it as translating data path to guid. To do that, I've had to understand what the current code is doing and the limitations in that code. I've also had to question exactly what it is we want done.

Quick overview of spe

The Simple Policy Engine (spe) tells the pNFS MetaData Server (MDS) how to lay out the stripes on the data servers (DS) at file creation time. If you think of RAID, a file is striped across disks and we need to know how many disks it is striped across and how wide each stripe is. Then, to determine which disk a particular piece of data is on, we divide the file offset by the stripe width and take the result modulo the number of disks.
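To make the RAID analogy concrete, here is a minimal sketch of the offset-to-disk arithmetic, with the wrap-around across disks written out. The `stripe_geom_t` type and the function name are made up for illustration and do not come from the pNFS code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a striped file; not a pNFS structure. */
typedef struct stripe_geom {
	uint32_t count;	/* number of disks the file is striped across */
	uint64_t width;	/* bytes written to one disk before moving on */
} stripe_geom_t;

/* Return the index of the disk holding the byte at the given file offset. */
static uint32_t
stripe_disk_for_offset(const stripe_geom_t *g, uint64_t offset)
{
	assert(g->count > 0 && g->width > 0);
	/* Find the stripe unit the offset lands in, then wrap over the disks. */
	return ((uint32_t)((offset / g->width) % g->count));
}
```

With 4 disks and a 32 KB stripe width, offset 0 lands on disk 0, offset 32 KB on disk 1, and offset 128 KB wraps back to disk 0.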

This is simplistic, but it is also the basic concept behind layout creation in pNFS. A huge difference is that we need to tell the client not only the stripe count and width, but also the machine addresses of the DSes. It is a little bit more complex than that, as each DS might have several data stores associated with it, a data store might be moved to a different DS, etc. We capture that complexity in the globally unique id (guid) assigned to each data store. But conceptually, let's consider just the base case of each DS having only one data store, which always stays on that DS.

Overview of Current Layout Generation

So the NFSv4.1 protocol defines an OPEN operation and a LAYOUTGET operation. It doesn't define how an implementation will determine which data sets are put into the layout.

In the current OpenSolaris implementation, these two operations result in the following call chains:

"OPEN" -> mds_op_open -> mds_do_opennull -> mds_create_file
"LAYOUTGET" -> mds_op_layout_get -> mds_fetch_layout -> mds_get_flo -> fake_spe

In my development gate, a call to spe_allocate currently occurs in mds_create_file.

The relevant files to look at are: usr/src/uts/common/fs/nfs/nfs41_srv.c and usr/src/uts/common/fs/nfs/nfs41_state.c.

Note: I will be quoting routines from the above two files. Over time, those files will change and will no longer match what I quote.


The interesting stuff in layout creation occurs in mds_fetch_layout:

Note that we are starting in nfs41_srv.c.

   8320 	if (mds_get_flo(cs, &lp) != NFS4_OK)


And in mds_get_flo:

   8269 	mutex_enter(&cs->vp->v_lock);
   8270 	fp = (rfs4_file_t *)vsd_get(cs->vp, cs->instp->vkey);
   8271 	mutex_exit(&cs->vp->v_lock);
   8273 	/* Odd.. no rfs4_file_t for the vnode.. */
   8274 	if (fp == NULL)

Which basically states that the file must have been created and must be in memory. This is not a panic for at least the following reasons:

  1. Client may have sent the LAYOUTGET before the OPEN. A crappy thing to do, but not a reason for a panic.
  2. The server may have rebooted since the client sent the OPEN. Even if the file is on disk on the MDS, it is not incore. Clue the client in that they may need to reissue the OPEN.

   8277 	/* do we have a odl already ? */
   8278 	if (fp->flp == NULL) {
   8279 		/* Nope, read from disk */
   8280 		if (mds_get_odl(cs->vp, &fp->flp) != NFS4_OK) {
   8281 			/*
   8282 			 * XXXXX:
   8283 			 * XXXXX: No ODL, so lets go query PE
   8284 			 * XXXXX:
   8285 			 */
   8286 			fake_spe(cs->instp, &fp->flp);
   8288 			if (fp->flp == NULL)
   8289 				return (NFS4ERR_LAYOUTUNAVAILABLE);
   8290 		}
   8291 	}

Note that an odl is an on-disk layout. The check on 8278 is how I will tie the spe in with this code. During an OPEN, I can simply set fp->flp and bypass this logic. If there is any error, then this field will be NULL and we can grab a simple default layout here. So I'll probably rename fake_spe to mds_generate_default_flo.
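The plan of setting fp->flp during OPEN and falling back to a default layout can be sketched roughly as follows. Every type and function here is a simplified, hypothetical stand-in (the real rfs4_file_t and mds_layout_t carry far more state), and the on-disk layout read is left out to keep the control flow visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for mds_layout_t and rfs4_file_t. */
typedef struct layout { int id; } layout_t;
typedef struct file_state { layout_t *flp; } file_state_t;

static layout_t default_layout = { 1 };

/* Stand-in for the renamed fake_spe(): hand back the one default layout. */
static void
generate_default_flo(layout_t **flp)
{
	*flp = &default_layout;
}

/* OPEN path: cache the policy's layout; NULL means the policy failed. */
static void
open_set_layout(file_state_t *fp, layout_t *from_policy)
{
	fp->flp = from_policy;
}

/* LAYOUTGET path: use the cached layout if present, else fall back. */
static layout_t *
get_layout(file_state_t *fp)
{
	assert(fp != NULL);
	if (fp->flp == NULL)
		generate_default_flo(&fp->flp);
	return (fp->flp);
}
```

The point of the shape is that the LAYOUTGET path does not need to know whether a policy ran: a non-NULL flp short-circuits everything else.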


So understanding what fake_spe does will help me understand what the real spe will have to do:

   8236 	int key = 1;
   8241 	*flp = NULL;
   8243 	rw_enter(&instp->mds_layout_lock, RW_READER);
   8244 	lp = (mds_layout_t *)rfs4_dbsearch(instp->mds_layout_idx,
   8245 	    (void *)(uintptr_t)key, &create, NULL, RFS4_DBS_VALID);
   8246 	rw_exit(&instp->mds_layout_lock);
   8248 	if (lp == NULL)
   8249 		lp = mds_gen_default_layout(instp, mds_max_lo_devs);
   8251 	if (lp != NULL)
   8252 		*flp = lp;

The current code only ever has 1 layout in memory. Hence, the key is 1. We'll need to see how that layout is generated. And that occurs in mds_gen_default_layout. Note how simplistic this code is - if for any reason the layout is deleted from the table, it is simply added back in here. Right now, the only reason the layout would be deleted is if a DS reboots (look at ds_exchange in ds_srv.c).


This is the code that builds up the layout and stuffs it in memory:

Note that we have switched into nfs41_state.c.

   1046 int mds_default_stripe = 32;
   1047 int mds_max_lo_devs = 20;
   1052 	struct mds_gather_args args;
   1053 	mds_layout_t *lop;
   1055 	bzero(&args, sizeof (args));
   1057 	args.max_devs_needed = MIN(max_devs_needed,
   1058 	    MIN(mds_max_lo_devs, 99));
   1060 	rw_enter(&instp->ds_addr_lock, RW_READER);
   1061 	rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
   1062 	rw_exit(&instp->ds_addr_lock);
   1064 	/*
   1065 	 * if we didn't find any devices then we do no service
   1066 	 */
   1067 	if (args.dex == 0)
   1068 		return (NULL);
   1070 	args.lo_arg.loid = 1;
   1071 	args.lo_arg.lo_stripe_unit = mds_default_stripe * 1024;
   1073 	rw_enter(&instp->mds_layout_lock, RW_WRITER);
   1074 	lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
   1075 	    (void *)&args);
   1076 	rw_exit(&instp->mds_layout_lock);

We first walk across the instp->ds_addr_tab and look for effectively 20 entries. Note that max_devs_needed is always 20 for this code and so will be args.max_devs_needed.

I think the check on 1067 is incorrect and a result of the current implementation normally being on a community with 1 DS. It should be the case that args.dex is greater than or equal to max_devs_needed. Actually, we need to be passing in how many devices we will have D (the ones assigned to a policy) and how many we need to use S, with S <= D. The args.dex will have to be >= S.
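As a rough sketch of the stricter check described above, the test on args.dex could take S and D into account along these lines. The function and parameter names are hypothetical, and the precondition that the walk only visits the policy's own devices is an assumption, not something the current code guarantees.

```c
#include <assert.h>

/*
 * Hypothetical replacement for the "args.dex == 0" test.
 *   found: devices actually gathered (args.dex)
 *   s:     devices the layout must stripe across
 *   d:     devices assigned to the policy
 * Returns 1 if a layout can be built, 0 otherwise.
 */
static int
have_enough_devs(unsigned found, unsigned s, unsigned d)
{
	/* Assumption: only the policy's devices are walked, so found <= d. */
	assert(found <= d);
	if (s == 0 || s > d)
		return (0);
	return (found >= s);
}
```

So with D = 8 and S = 4, finding a single device is no longer enough, while finding 4 or more is.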

Note that on 1070, we assign it the only layout id which will ever be generated. And if we play things right, we could store this layout id back in the policy and avoid regenerating the layout if at all possible.

Finally we stuff the newly created layout into the table.


So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:

    974 	if (gap->dex < gap->max_devs_needed) {
    975 		gap->lo_arg.lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
    976 		gap->dev_ptr[gap->dex] = dp;
    977 		gap->dex++;
    978 	}

So we keep on reading ds_addr_t data structures until we have enough.
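The walk-with-callback pattern can be illustrated with a small self-contained sketch. Here `table_walk` is a hypothetical stand-in for rfs4_dbe_walk over a plain array, and `dev_entry_t` and `gather_args_t` are simplified placeholders for ds_addr_t and struct mds_gather_args, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_DEVS 100	/* hypothetical bound on the device array */

/* Hypothetical table entry: just an id. */
typedef struct dev_entry { int id; } dev_entry_t;

typedef struct gather_args {
	int max_devs_needed;
	int dex;		/* number of devices gathered so far */
	int lo_devs[MAX_DEVS];	/* ids of the gathered devices */
} gather_args_t;

/* Called once per table entry; ignores entries once we have enough. */
static void
gather_devs(const dev_entry_t *dp, void *arg)
{
	gather_args_t *gap = arg;

	assert(gap->max_devs_needed <= MAX_DEVS);
	if (gap->dex < gap->max_devs_needed)
		gap->lo_devs[gap->dex++] = dp->id;
}

/* Stand-in for the table walker: visit every entry in a plain array. */
static void
table_walk(const dev_entry_t *tab, size_t n,
    void (*cb)(const dev_entry_t *, void *), void *arg)
{
	for (size_t i = 0; i < n; i++)
		cb(&tab[i], arg);
}
```

Walking a 3-entry table with max_devs_needed set to 20 gathers all 3 entries, while setting it to 2 stops gathering after the second entry even though the walk still visits the third.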

Now, how is that table populated? You can look for ds_addr_idx over in usr/src/uts/common/fs/nfs/ds_srv.c, but basically, for each data store that a DS registers, one of these is created.

The upshot of all this is that if a pNFS community has N data stores, then the layout generated for the current implementation will have a stripe count of N.

Back to mds_fetch_layout

Note that we are back in nfs41_srv.c.

Okay, we've generated the layout and start to generate the otw (over the wire) layout:

   8333 	mds_set_deviceid(lp->dev_id, &otw_flo.nfl_deviceid);

Crap, it is sending the device id across the wire! I'm going to have to rethink my approach. Instead of storing a policy as a device list and picking which devices I want out of that list (i.e., a Round Robin (RR) scheduler), I'm going to have to store each generated set as a new device list.

I don't understand the process like I thought I did.

Going back to mds_gather_devs, it is not stuffing data stores into a table as I thought. Instead, it is stuffing DS network addresses into a table.

Missing link

What I'm missing is how the ds_addr entries map back to data stores. Okay, this code in mds_gen_default_layout does it:

   1073 	rw_enter(&instp->mds_layout_lock, RW_WRITER);
   1074 	lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
   1075 	    (void *)&args);
   1076 	rw_exit(&instp->mds_layout_lock);

We have just gotten the device list via the walk over mds_gather_devs. And now we effectively call mds_layout_create on 1074.


   1104 	ds_addr_t *dp;
   1105 	struct mds_gather_args *gap = (struct mds_gather_args *)arg;
   1106 	struct mds_addlo_args *alop = &gap->lo_arg;
   1119 	lp->layout_type = LAYOUT4_NFSV4_1_FILES;
   1120 	lp->stripe_unit = alop->lo_stripe_unit;
   1122 	for (i = 0; alop->lo_devs[i] && i < 100; i++) {
   1123 		lp->devs[i] = alop->lo_devs[i];
   1124 		dp = mds_find_ds_addr(instp, alop->lo_devs[i]);
   1125 		/* lets hope this doesn't occur */
   1126 		if (dp == NULL)
   1127 			return (FALSE);
   1128 		gap->dev_ptr[i] = dp;
   1129 	}

Okay, alop->lo_devs is the array we built in mds_gather_devs. Yes, yes, that is true.

I just figured out where all of my confusion is coming from - the code has struct ds_addr and ds_addr_t. In the xdr code, struct ds_addr is just an address (usr/src/head/rpcsvc/ds_prot.x):

    338 /*
    339  * ds_addr -
    340  *
    341  * A structure that is used to specify an address and
    342  * its usage.
    343  *
    344  *    addr:
    345  *
    346  *    The specific address on the DS.
    347  *
    348  *    validuse:
    349  *
    350  *    Bitmap associating the netaddr defined in "addr"
    351  *    to the protocols that are valid for that interface.
    352  */
    353 struct ds_addr {
    354 	struct netaddr4     addr;
    355 	ds_addruse          validuse;
    356 };

But in the code I've been looking at, ds_addr_t is a different structure (see usr/src/uts/common/nfs/mds_state.h):

    133 /*
    134  * ds_addr:
    135  *
    136  * This list is updated via the control-protocol
    137  * message DS_REPORTAVAIL.
    138  *
    139  * FOR NOW: We scan this list to automatically build the default
    140  * layout and the multipath device struct (mds_mpd)
    141  */
    142 typedef struct {
    143 	rfs4_dbe_t		*dbe;
    144 	netaddr4		dev_addr;
    145 	struct knetconfig	*dev_knc;
    146 	struct netbuf		*dev_nb;
    147 	uint_t			dev_flags;
    148 	ds_owner_t		*ds_owner;
    149 	list_node_t		ds_addr_next;
    150 } ds_addr_t;

This is pure evil because we typically expect foo_t to be typedef struct foo foo_t. As you can see, I've been fighting that in the above analysis.
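A stripped-down illustration of the naming trap, using made-up names (foo, bar) rather than the real ds_addr definitions: C lets a struct tag and a similarly named typedef refer to completely unrelated types, and the compiler will not complain.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional pairing: foo_t is simply another name for struct foo. */
typedef struct foo { int addr; } foo_t;

/* The confusing pairing: bar_t is an unrelated, anonymous struct. */
struct bar { int addr; };
typedef struct { int addr; void *owner; int flags; } bar_t;

/* Both pairs compile cleanly, yet only the first names a single type. */
static_assert(sizeof (foo_t) == sizeof (struct foo), "same type");
static_assert(sizeof (bar_t) != sizeof (struct bar), "different types");
```

A reader who sees bar_t in one file and struct bar in another has no hint from the names alone that the two have different members.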

I'm going to file an issue on this naming convention and leave the analysis here. I'll come back to it and rewrite it as if I knew all along that I was using a ds_addr_t and not a struct ds_addr.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily


