Thursday Oct 16, 2008

Barebones framework in place for snoop!

I just got a framework in place for a table driven approach for snoop to decode the Control Protocol used between our DS and MDS servers for pNFS.

Here you can see the DS talking to the MDS and the MDS sending a NULL query:

[th199096@pnfs-9-25 ~]> sudo snoop.ctl -i /root/ds2tmds.snoop | grep CTL
  4   0.00596    pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_EXIBI
  6   0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25    CTL-DS R DS_EXIBI
  8   0.00009    pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_REPORTAVAIL
 13   0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25    CTL-MDS C MDS_NULL
 15   2.99097    pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-MDS R MDS_NULL
 17   0.00012 pnfs-9-26.Central.Sun.COM -> pnfs-9-25    CTL-DS R DS_REPORTAVAIL

Guess I'll have to fill in the guts of this to see what is really being said here!

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Friday Oct 10, 2008

Understanding layout creation to understand what spe will have to do

In my task list for spe, a large item has been how to tie it into the current code base - you might have seen me reference it as translating data path to guid. To do that, I've had to understand what the current code is doing and the limitations in that code. I've also had to question exactly what it is we want done.

Quick overview of spe

The Simple Policy Engine (spe) tells the pNFS MetaData Server (MDS) how to lay out the stripes on the data servers (DS) at file creation time. If you think of RAID, a file is striped across disks and we need to know how many disks it is striped across and what the width of the stripe is. Then, to determine which disk holds a particular piece of data, we divide the file offset by the stripe width and take the result modulo the number of disks.

This is simplistic, but it is also the basic concept behind layout creation in pNFS. A huge difference is that we need to tell the client not only the stripe count and width, but also the machine addresses of the DSes. It is a little more complex than that, as each DS might have several data stores associated with it, a data store might be moved to a different DS, etc. We capture that complexity in the globally unique id (guid) assigned to each data store. But conceptually, let's consider just the base case of each DS having only one data store which is always on that DS.

Overview of Current Layout Generation

So the NFSv4.1 protocol defines an OPEN operation and a LAYOUTGET operation. It doesn't define how an implementation will determine which data sets are put into the layout.

In the current OpenSolaris implementation, these two operations result in the following call chains:

"OPEN" -> mds_op_open -> mds_do_opennull -> mds_create_file
"LAYOUTGET" -> mds_op_layout_get -> mds_fetch_layout -> mds_get_flo -> fake_spe

In my development gate, a call to spe_allocate currently occurs in mds_create_file.

The relevant files to look at are: usr/src/uts/common/fs/nfs/nfs41_srv.c and usr/src/uts/common/fs/nfs/nfs41_state.c.

Note: I will be quoting routines in the above two files. Over time, those files will change and will not match up with what I quote.


The interesting stuff in layout creation occurs in mds_fetch_layout:

Note that we are starting in nfs41_srv.c.

   8320 	if (mds_get_flo(cs, &lp) != NFS4_OK)


And in mds_get_flo:

   8269 	mutex_enter(&cs->vp->v_lock);
   8270 	fp = (rfs4_file_t *)vsd_get(cs->vp, cs->instp->vkey);
   8271 	mutex_exit(&cs->vp->v_lock);
   8273 	/* Odd.. no rfs4_file_t for the vnode.. */
   8274 	if (fp == NULL)

Which basically states that the file must have been created and be in memory. This is not a panic for at least the following reasons:

  1. Client may have sent the LAYOUTGET before the OPEN. A crappy thing to do, but not a reason for a panic.
  2. The server may have rebooted since the client sent the OPEN. Even if the file is on disk on the MDS, it is not incore. Clue the client in that they may need to reissue the OPEN.

   8277 	/* do we have a odl already ? */
   8278 	if (fp->flp == NULL) {
   8279 		/* Nope, read from disk */
   8280 		if (mds_get_odl(cs->vp, &fp->flp) != NFS4_OK) {
   8281 			/*
   8282 			 * XXXXX:
   8283 			 * XXXXX: No ODL, so lets go query PE
   8284 			 * XXXXX:
   8285 			 */
   8286 			fake_spe(cs->instp, &fp->flp);
   8288 			if (fp->flp == NULL)
   8289 				return (NFS4ERR_LAYOUTUNAVAILABLE);
   8290 		}
   8291 	}

Note that an odl is an on-disk layout. And the statement on 8278 is how I will tie the spe in with this code. During an OPEN, I can simply set fp->flp and bypass this logic. If there is any error, then this field will be NULL and we can grab a simple default layout here. So I'll probably rename fake_spe to mds_generate_default_flo.
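
To make the plan concrete, here is a toy model of that tie-in. Every name here is invented for illustration (the real code deals in rfs4_file_t, mds_layout_t, and NFS4 error codes): OPEN asks the policy engine first, and the LAYOUTGET path only builds a default when fp->flp is still NULL.

```c
#include <stddef.h>

/* Toy model of the planned tie-in; all names are hypothetical. */
typedef struct { int stripe_count; } layout_t;
typedef struct { layout_t *flp; } file_t;

static layout_t default_layout = { 1 };
static layout_t policy_layout  = { 8 };

/* Stand-in for spe_allocate(): returns NULL when the policy fails. */
layout_t *
spe_allocate(int policy_ok)
{
	return (policy_ok ? &policy_layout : NULL);
}

/* OPEN-time hook: best effort, any error just leaves flp NULL. */
void
open_hook(file_t *fp, int policy_ok)
{
	fp->flp = spe_allocate(policy_ok);
}

/* LAYOUTGET-time fallback, mirroring the check in mds_get_flo. */
layout_t *
get_flo(file_t *fp)
{
	if (fp->flp == NULL)
		fp->flp = &default_layout;	/* i.e., the default flo */
	return (fp->flp);
}
```

So LAYOUTGET never has to know whether the policy engine ran; it only sees whether a layout is already attached to the file.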


So understanding what fake_spe does will help me understand what the real spe will have to do:

   8236 	int key = 1;
   8241 	*flp = NULL;
   8243 	rw_enter(&instp->mds_layout_lock, RW_READER);
   8244 	lp = (mds_layout_t *)rfs4_dbsearch(instp->mds_layout_idx,
   8245 	    (void *)(uintptr_t)key, &create, NULL, RFS4_DBS_VALID);
   8246 	rw_exit(&instp->mds_layout_lock);
   8248 	if (lp == NULL)
   8249 		lp = mds_gen_default_layout(instp, mds_max_lo_devs);
   8251 	if (lp != NULL)
   8252 		*flp = lp;

The current code only ever has 1 layout in memory. Hence, the key is 1. We'll need to see how that layout is generated. And that occurs in mds_gen_default_layout. Note how simplistic this code is - if for any reason the layout is deleted from the table, it is simply added back in here. Right now, the only reason the layout would be deleted is if a DS reboots (look at ds_exchange in ds_srv.c).


This is the code that builds up the layout and stuffs it into memory:

Note that we have switched into nfs41_state.c.

   1046 int mds_default_stripe = 32;
   1047 int mds_max_lo_devs = 20;
   1046 int mds_default_stripe = 32;
   1047 int mds_max_lo_devs = 20;
   1052 	struct mds_gather_args args;
   1053 	mds_layout_t *lop;
   1055 	bzero(&args, sizeof (args));
   1057 	args.max_devs_needed = MIN(max_devs_needed,
   1058 	    MIN(mds_max_lo_devs, 99));
   1060 	rw_enter(&instp->ds_addr_lock, RW_READER);
   1061 	rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
   1062 	rw_exit(&instp->ds_addr_lock);
   1064 	/*
   1065 	 * if we didn't find any devices then we do no service
   1066 	 */
   1067 	if (args.dex == 0)
   1068 		return (NULL);
   1070 	args.lo_arg.loid = 1;
   1071 	args.lo_arg.lo_stripe_unit = mds_default_stripe * 1024;
   1073 	rw_enter(&instp->mds_layout_lock, RW_WRITER);
   1074 	lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
   1075 	    (void *)&args);
   1076 	rw_exit(&instp->mds_layout_lock);

We first walk across the instp->ds_addr_tab and look for effectively 20 entries. Note that max_devs_needed is always 20 for this code and so will be args.max_devs_needed.

I think the check on 1067 is incorrect and a result of the current implementation normally being run on a community with 1 DS. It should be the case that args.dex is greater than or equal to max_devs_needed. Actually, we need to be passing in both how many devices we will have, D (the ones assigned to a policy), and how many we need to use, S, with S <= D. Then args.dex will have to be >= S.
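
For the record, the stricter check I have in mind looks something like this - a sketch with invented names, where D is the device count assigned to the policy and S is how many of them we intend to use:

```c
/* Sketch only: did the walk gather enough devices for this policy?
 * D = devices assigned to the policy, S = devices to use, S <= D. */
int
enough_devs(int gathered, int S, int D)
{
	if (S > D)
		return (0);		/* malformed policy */
	return (gathered >= S);
}
```

Contrast that with the 1067 check, which is effectively enough_devs(args.dex, 1, D).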

Note that on 1070, we assign it the only layout id which will ever be generated. And if we play things right, we could store this layout id back in the policy and avoid regenerating the layout if at all possible.

Finally we stuff the newly created layout into the table.


So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:

    974 	if (gap->dex < gap->max_devs_needed) {
    975 		gap->lo_arg.lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
    976 		gap->dev_ptr[gap->dex] = dp;
    977 		gap->dex++;
    978 	}

So we keep on reading ds_addr_t data structures until we have enough.

Now, how is that table populated? You can look for ds_addr_idx over in usr/src/uts/common/fs/nfs/ds_srv.c, but basically, for each data store that a DS registers, one of these is created.

The upshot of all this is that if a pNFS community has N data stores, then the layout generated for the current implementation will have a stripe count of N.

Back to mds_fetch_layout

Note that we have switched back into nfs41_srv.c.

Okay, we've generated the layout and start to generate the otw (over the wire) layout:

   8333 	mds_set_deviceid(lp->dev_id, &otw_flo.nfl_deviceid);

Crap, it is sending the device id across the wire! I'm going to have to rethink my approach. Instead of storing a policy as a device list and picking which devices I want out of that list (i.e., a Round Robin (RR) scheduler), I'm going to have to store each generated set as a new device list.
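
The Round Robin idea, for reference, is simple enough to sketch (names invented; the realization above is that each picked set would have to be stored as its own device list with its own device id):

```c
/* Round-robin selection of S slots out of a policy's D-entry device
 * list. Each call picks the next S entries and advances the cursor. */
void
rr_pick(int *out, int S, int D, int *cursor)
{
	int	i;

	for (i = 0; i < S; i++)
		out[i] = (*cursor + i) % D;
	*cursor = (*cursor + S) % D;
}
```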

I don't understand the process like I thought I did.

Going back to mds_gather_devs, it is not stuffing data stores into a table as I thought. Instead, it is stuffing DS network addresses into a table.

Missing link

What I'm missing is how the ds_addr entries map back to data stores. Okay, this code in mds_gen_default_layout does it:

   1073 	rw_enter(&instp->mds_layout_lock, RW_WRITER);
   1074 	lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
   1075 	    (void *)&args);
   1076 	rw_exit(&instp->mds_layout_lock);

We have just gotten the device list via the walk over mds_gather_devs. And now we effectively call mds_layout_create on 1074.


   1104 	ds_addr_t *dp;
   1105 	struct mds_gather_args *gap = (struct mds_gather_args *)arg;
   1106 	struct mds_addlo_args *alop = &gap->lo_arg;
   1119 	lp->layout_type = LAYOUT4_NFSV4_1_FILES;
   1120 	lp->stripe_unit = alop->lo_stripe_unit;
   1122 	for (i = 0; alop->lo_devs[i] && i < 100; i++) {
   1123 		lp->devs[i] = alop->lo_devs[i];
   1124 		dp = mds_find_ds_addr(instp, alop->lo_devs[i]);
   1125 		/* lets hope this doesn't occur */
   1126 		if (dp == NULL)
   1127 			return (FALSE);
   1128 		gap->dev_ptr[i] = dp;
   1129 	}

Okay, alop->lo_devs is the array we built in mds_gather_devs. Yes, yes, that is true.

I just figured out where all of my confusion is coming from - the code has struct ds_addr and ds_addr_t. In the xdr code, struct ds_addr is just an address (usr/src/head/rpcsvc/ds_prot.x):

    338 /*
    339  * ds_addr -
    340  *
    341  * A structure that is used to specify an address and
    342  * its usage.
    343  *
    344  *    addr:
    345  *
    346  *    The specific address on the DS.
    347  *
    348  *    validuse:
    349  *
    350  *    Bitmap associating the netaddr defined in "addr"
    351  *    to the protocols that are valid for that interface.
    352  */
    353 struct ds_addr {
    354 	struct netaddr4     addr;
    355 	ds_addruse          validuse;
    356 };

But in the code I've been looking at, ds_addr_t is a different structure (see usr/src/uts/common/nfs/mds_state.h):

    133 /*
    134  * ds_addr:
    135  *
    136  * This list is updated via the control-protocol
    137  * message DS_REPORTAVAIL.
    138  *
    139  * FOR NOW: We scan this list to automatically build the default
    140  * layout and the multipath device struct (mds_mpd)
    141  */
    142 typedef struct {
    143 	rfs4_dbe_t		*dbe;
    144 	netaddr4		dev_addr;
    145 	struct knetconfig	*dev_knc;
    146 	struct netbuf		*dev_nb;
    147 	uint_t			dev_flags;
    148 	ds_owner_t		*ds_owner;
    149 	list_node_t		ds_addr_next;
    150 } ds_addr_t;

This is pure evil because we typically expect foo_t to be a typedef for struct foo. As you can see, I've been fighting that assumption in the above analysis.
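
To spell out the trap with a minimal (invented) example:

```c
/* The convention most readers assume: foo_t names struct foo. */
struct foo {
	int	x;
};
typedef struct foo foo_t;

/* The shape of what the NFSv4.1 code does instead: ds_addr and
 * ds_addr_t are two unrelated types that merely share a name.
 * (The fields here are placeholders, not the real ones.) */
struct ds_addr {
	int	addr;		/* stands in for the XDR wire address */
};

typedef struct {
	int	dev_flags;	/* stands in for the in-kernel entry */
} ds_addr_t;
```

Both compile happily side by side, which is exactly why the confusion goes unnoticed for so long.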

I'm going to file an issue on this naming convention and leave the analysis here. I'll come back to it and rewrite it as if I knew all along that I was using a ds_addr_t and not a struct ds_addr.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Friday Oct 03, 2008

New gate and closed bins build a working pNFS community

So I have a successful community up and running. I'll push the closed binaries out later tonight. Life impinges...

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Thursday Oct 02, 2008

Setting up a NFS41 gate

We just opened up a new Mercurial gate of NFSv41. Eventually it will automatically push changes as they occur to our gate. I also need to figure out a way to automatically update the closed-bins.

The hardest part was figuring out the naming convention. Some links of interest are Some work on libMicro; Mercurial transition notes and finally How to Use Mercurial (hg) Repositories. Look for For Project Leads: How to set up a Mercurial repository.

Update: Also, SCMVolunteers, look for Setting up a new (Mercurial) Project repository.

In any event, you can grab a copy of the source at:

hg clone ssh://

Note the lack of a double '/' after the FQDN - normally I would take that as a sign of a bug with Mercurial.

Note that while this compiles, you can't run it without a corresponding closed-bins.

Eventually, you should be able to browse the source via Cross Reference: nfs41-gate.

And a big thanks to David Marker for providing the help necessary to getting this to go live!

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

NetApp announces pNFS server

I wanted to blog about this I Left Out One Detail from the September 2008 pNFS Bake-A-Thon Report because of the hard work of a friend of mine - Pranoop Erasani. I couldn't because of the NDA in place.

I actually don't know any details about their server implementation, but I do know he was quite proud of the work he did leading up to the BakeAThon - I saw him later in the Austin airport and he was beaming with pride.

I'll pick on Mike now, he states:

Congratulations to NetApp's Pranoop Erasani, who is leading our Data ONTAP pNFS server project and the rest of the Data ONTAP NFS team.

The first time I read this, I thought Mike was just pimping out Pranoop, the leader of both the pNFS server project and the Data ONTAP NFS team. It took me a couple of tries to realize that Pranoop hadn't been promoted again, and instead Mike was pimping out both Pranoop and the rest of the Data ONTAP NFS team.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Saturday Sep 27, 2008

Working towards a vbox image for distribution

One of the difficulties with pushing out a release for OpenSolaris Project: NFS version 4.1 pNFS is that we could only release source and BFU. We could not release a live image.

To complicate matters, part of the NFS code is in the closed repository. The impact of which was we had to also release a special closed-bins.

The difficulty lay in two areas:

  1. We weren't allowed to take the DVD image, install our bits, and send that back out. Note, if you search for kanigix on my blog, you'll see I provide recipes for making your own customized DVD, but I don't distribute DVDs.
  2. People, even ex-Sun employees, didn't want to install a stock system and BFU the updates.

We started to get requests for VMWare images. And we still weren't allowed to hand those out.

But OpenSolaris is adaptive to pressures in the community. I just asked again and was pointed towards the Hadoop project and especially this one: OpenSolaris Project: Hadoop Live CD.

My understanding is that we aren't trying to make a distribution, we aren't trying to steal thunder, instead we are trying to get systems out there to enable interoperability testing.

So now I'm working on a framework to get OpenSolaris + pNFS on a VirtualBox image.

Stay tuned as I go down the wrong path several times, but emerge with a working process.

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Wednesday Feb 06, 2008

Getting together a pNFS storage community

We decided to get some Mac Minis as test machines for pNFS. We needed small and quiet machines.

The first problem was that the minis do not have a serial port for a console. I bought an IOGear kvm that supported 4 machines - right now I have 3. It fits right in the mini stack. The front view:

Not shown

And the side view:

Not shown

I know I can install Nevada 79 (the pNFS codebase is currently on top of 79) on them - Trebor has done it - but he won't blog about it. So off I go. The first issue is that during boot, the USB keyboard is not seen, so I have to take the default install of the developer edition. Note that I like a finer-grained install which allows you to set up multiple slices. In fact, I will need that for the DS machines - they need ZFS.

Anyway, onward. Note that there are some resources for doing this with bootcamp. I want the whole disk, so I'm trying to treat this like any other x86 box.

The next hurdle is when partitioning: it craps out right away with a warning that the '/' slice extends beyond HBA cylinder 1023. The nice thing about the installer is that I do not have to reboot right here. Hmm, trying a 32G root partition does the same thing. I wonder if I'm hitting the bug Paul reports in How to Dual Partition a MacBook Pro with MacOS and Solaris, e.g., CR6413235?

Nope, I bet it is the EFI issue. Okay, I asked Trebor (dark side of Robert Gordon ) if he hit this and the answer was yes. His suggestion was to let the install work until it failed, and then in the resulting terminal do fdisk -E blah, where blah for me was either 'c1d1' or 'c1d1p0'. Both gave the same output for me.

I still selected a 40G partition because I want to have a ZFS partition on the rest of the disk. And now the installation is just proceeding. And it dies trying to install the boot blocks. This was after installing everything.

I'm trying Ubuntu right now to get a different perspective. It also had an issue with the keyboard at first. Note that I can reconfigure the nevada ISO to boot grub up into the regular install if needed.

Anyway, the Ubuntu install is chugging away. It is done and appears to start to reboot. Very slow compared to OS X. And it is networked!

Now I want to see what happens when I install Nevada (snv 79) on top of Ubuntu. The first difference is that I do not get that annoying '/' 1023 cylinder message. Umm! And that works!

Okay, I want to avoid the booting Ubuntu step in the future. I'm pretty sure the problems I am seeing are with the disk label. I guess I really need to pay attention to steps 6-9 of Alan Perry's Setting up a Mac Mini for dual booting Solaris and MacOS X - note that neither Alan nor Paul had problems with their keyboards. Paul wouldn't on a laptop. And Alan may have been doing this before the Developer installation was an option.

To get networking going, you need a Marvell Yukon driver, which you can get from this entry: Solaris 10 U3 on Gateway MX6453 - be sure to follow the advice to get the 64 bit version. And note, it downloaded as skgesol_x64v8.19.1.3.tar.Z.tar for me, so rename it to get rid of the trailing '.tar'.

Reading that blog entry again, I needed to add this to my '/etc/driver_aliases':

skge "pci11ab,4362"

Okay, I'll continue this later...

Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Saturday Dec 22, 2007

A stand alone policy rule engine rule verifier

We want admins to be able to debug policy engine rulesets - they need to be able to determine beforehand which rule will apply and they need to be able to see afterwards which rule applied.

Some background, a policy rule states that if an expression of attributes evaluates to true, then a file create under pNFS will be given a certain layout. And a layout is basically a stripe count and width. The count is the number of DS to stripe the file across and the width is how large of a chunk of data to send to each DS.

A client can generate a layout hint that it can send to the server. And the server is free to reject it, especially if the server already has a rule. The way to think of this is that it allows an admin on the client, who does not have administration rights on the server, to define new policies in the middle of the night. I.e., no need to wake the server admin up.

The client's hint lacks the final necessary information: the set of DSes to be used. So even if the server accepts the hint, it needs to be instantiated with the actual DS hosts. The server policy engine will determine that set by looking at usage information for the DSes - or it might just pick them in some round robin fashion. This is a classic scheduling problem from AI.

To enable the admin to debug, we need to allow access to both the client and server policy rulesets. But we should start simple and get some code which works on a ruleset.

I'm going to skip how rules are loaded to the sped (Simple Policy Engine Daemon) and how we get our hands on them from it. Instead, I'm going to create a tool which handles a flat file. Furthermore, that format may not be what I end up using - right now this debug tool is also a design tool.

I could write it in Perl, which is perfect for what is basically string processing, but I think I will be stealing major chunks of the code for sped. So, I'll write it in C.

The very first thing I want to look at are the parameters to the program. I need to get at attributes such as the path, the extension, the UID, the GID, and the IP. I was going to grab the time and day from the system, but I just realized I could be doing postmortem debugging, so I need to accept those as parameters too. Okay, I also need to be able to get the policy rulesets read in from a file.

I'm going to present a chunk of code, of which I'll end up throwing some away. I want to look at option handling and make sure it works before I do anything else:

#include <stdio.h>
#include <stdarg.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
        int     iFlags = 0;
        int     ch;

        int     iFoundSome = 0;

        while ((ch = getopt(argc, argv, "?vr:p:u:g:i:h:d:")) != -1) {
                iFoundSome = 1;

                switch (ch) {
                case 'v' :
                        fprintf(stdout, "Oh, be chatty!\n");
                        break;
                case 'r' :
                        fprintf(stdout, "The rules are in %s!\n", optarg);
                        break;
                case 'h' :
                        fprintf(stdout, "The hour is %s!\n", optarg);
                        break;
                case 'd' :
                        fprintf(stdout, "The day is %s!\n", optarg);
                        break;
                case 'p' :
                        fprintf(stdout, "with the %s!\n", optarg);
                        break;
                case 'u' :
                        fprintf(stdout, "It was %s,\n", optarg);
                        break;
                case 'g' :
                        fprintf(stdout, "The group is %s!\n", optarg);
                        break;
                case 'i' :
                        fprintf(stdout, "in the %s,\n", optarg);
                        break;
                case '?' :
                default :
                        goto usage;
                }
        }

        if (!iFoundSome)
                goto usage;

        argc -= optind;
        argv += optind;

        return (0);

usage:
        fprintf(stderr,
                "speadm explain -r rules-file [-v]"
                " [-p proposed-filename] [-u uid] [-g gid] [-i ip]"
                " [-h hour] [-d day]\n");
        return (1);
}
Okay, as an aside, OSX Leopard cut and paste can rock! Seeing the text being dragged from the Terminal to Firefox was amazing.

The first thing to notice is that I've used -h for hour and not help. Next notice that the rule file is not optional. But I've used a flag for it. I did this to allow it to appear anywhere in the argument list. I will have to eventually add some code to make sure it is present.

A short test run shows us some neat things that getopt() does for us:

stealth:spe tdh$ gcc main.c 
stealth:spe tdh$ ./a.out 
speadm explain -r rules-file [-v] [-p proposed-filename] [-u uid] [-g gid] [-i ip] [-h hour] [-d day]
stealth:spe tdh$ ./a.out -r tests/simple.txt 
The rules are in tests/simple.txt!
stealth:spe tdh$ ./a.out -r 
./a.out: option requires an argument -- r
speadm explain -r rules-file [-v] [-p proposed-filename] [-u uid] [-g gid] [-i ip] [-h hour] [-d day]
stealth:spe tdh$

I didn't have to explicitly enter in error handling for detecting when an option was missing. But wait, does it work like I want:

stealth:spe tdh$ ./a.out -r -v
The rules are in -v!

In the next entry, I'll do the sanity checking for the arguments. This will include setting the default values. And it will also have to consider if an argument is allowed to begin with a '-'...

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Sunday Dec 02, 2007

The Interconnect between the MDS and DSs

In a traditional server, reliability is achieved by removing single points of failure. There is more than 1 network connection, there is more than 1 interconnect to the drives, etc. And, there is more than 1 machine:

Not shown

Both partners in this simple cluster are connected to all of the disk drives. And each connection actually represents two different loops to the drives.

The green line represents the interconnect link between the partners and at a bare minimum is a heartbeat by which they can determine if the partner is down and if a machine should take over the disks. Note that we haven't stated which disks physically belong to which system.

The easiest problem to spot with this system is what happens if all that breaks is the interconnect? Both machines are up and ready to serve data. If we have more than 2 boxes, we could use a quorum. Since we don't, we can in effect reserve an area on the drives to act as a secondary heartbeat. I'm not interested in these mechanics, just that the problem exists.

The deep, dark secret of pnfs is that it is in effect a cluster. I'm not stating that a standby MDS is in place or that a DS can have a partner, but that I should have really said that pnfs is a distributed file system. :-). And what I really mean is that to be effective, the MDS needs to communicate with the DSs. But, as designed, the DSs do not need to communicate with each other.

The question arises though as to how the MDS communicates with the DSs. Do we have an interconnect between it and each DS? Are they strung out in a chain? The answer is that no, such approaches get unwieldy, complicated, and cause the ship date to slip.

But if we think about it, we already have a physical interface in place to communicate between arbitrary nodes in the distributed file system: the network. The problem then becomes that we need a protocol to allow the MDS to talk to the DSs. We need heartbeats - "Are you still alive?". We need to know how bogged down a DS is - "Thank you sir, may I have another!".

But if you look at the NFSv4.1 RFC (which is the authoritative source for all things pnfs), you'll realize that the communication between the MDS and DSs is an implementation detail. If you look closely enough, you'll realize that nothing states that the MDS and DS(s) have to be on different physical machines.

The reason why this approach was taken was to allow each vendor to leverage their existing technologies. (For example, an OS might already have existing protocols to migrate or replicate data between boxes.)

The other reason here is that there was no desire to mix implementations in the DSs. I.e., NetApp and EMC do not have to worry about both being a DS for a Sun MDS. An interesting twist to this is that nothing prevents vendors from doing this on their own. With the OpenSolaris code available for download and the Linux reference available for download, it would be easy for a 3rd party to learn to talk the talk.

The protocol used to communicate is called control, although do not hold me to that name - it could change before final putback. I'll quote from Spencer's NFSv4.1's pNFS for Solaris (which is not as quick and dirty of an overview as my A quick and dirty overview of pnfs),

Coordination of the pNFS Community

To this point, we have a Solaris pNFS client interacting with a pNFS server over a flexible network configuration. The meta-data server is using ZFS as the underlying filesystem for the "regular" filesystem information (names, directories, attributes). The data server is using the ZFS pool to organize the data for the various layouts the meta-data server is handing out to the clients. What is coordinating all of the pNFS community members?

pNFS Control Protocol

The control protocol is the piece of the pNFS server solution that is left to the various implementations to define. Since the Solaris pNFS solution is taking a fairly straightforward approach to the construction of the pNFS community, this allows for the use of ZFS' special combination of features to organize the attached storage devices. This will allow for the control protocol to focus on higher level control of the pNFS community members.

Some of the highlights of the control protocol are:

  • Meta-data and data server reboot / network partition indication
  • Filehandle, file state, and layout validation
  • Reporting of data server resources
  • Inter data server data movement
  • Meta-data server proxy I/O
  • Data server state invalidation

I'm building up a background for understanding what I am currently working on...

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Saturday Dec 01, 2007

A quick and dirty overview of pnfs

I've started working on the pnfs (parallel NFS) project, which can be found as NFS version 4.1 pNFS. The intent is to parallelize NFSv4 traffic to a "server" - which is basically N+1 different storage units. One of the first things I am working on is the simple policy engine - which is used to decide how to parallelize the access. And the thing I am working on right now is gathering statistics on how the pnfs server is working.

We can consider the classical NFS client and server relationship in this diagram:

Not shown

The client connects to a server via a network and the server has one or more disks either inside or attached to it. To get the most out of the box, we would add RAM, NVRAM, solid state disk, faster drives, etc. The reason that we don't have just one drive in there is that we can stripe the data access across multiple drives. Using RAID technology and volumes (i.e., making several physical disks appear as a logical unit), we can parallelize data access to a set of disks.

What this means is that conceptually as one disk is busy doing I/O for us, we can be sending more work to the next disk. A very simplistic view would be to incorporate all of the disks into one large volume. You could then create project directories underneath the mount point. But, how do you enforce how much space a project gets? How do you keep people from looking where they should not? How do you get a very intensive application tied to the fastest disks?

These questions are ones of policy - how do we want to divide the resources? In a traditional server, we would create volumes dedicated to a policy. Below, we see we have /home, /builds, and /engineering. And we assume that the system (/, /usr, etc) is 'internal' to the server:

Not shown

Perhaps the disks for /builds are faster. Perhaps the disks for /home are larger. All that matters here is that we made a static policy decision and changing it can be very difficult.

It is easy to see that eventually the limiting factor to data transfer is the bottleneck caused by the network connection. We went from 10Mbs to 100Mbs to 1G networks (and are working on making 10G a reality) in part because the server subsystems were faster than the network. But workloads have increased or client farms have scaled way out of proportion. Replace that single client with 1000 of them all trying to access the /home volume at once.

A solution is to parallelize the server by adding more boxes to it. A very simplistic approach would be the following:

Not shown

We want the server to be advertised with a single IP address, so we put a box in front to essentially route all traffic to the correct storage box. Think of the problems this presents:

  • The routing box has to open up the packet enough to figure out where to re-route it.
    • We really want zero copies of the network packet.
    • Ideally it would go straight to the target storage box.
  • Haven't we just recreated the bottleneck of one machine processing every packet?

We want to leave the routing to network hardware. What does our router do? It decides which of the data servers to send the data to. Why does that decision need to be repeated for every packet? What if the very first time we started file access we decided which machine the file was going to go to?

The problem with this is that traditional NFSv3 and NFSv4 clients expect that the file will be on the same machine that was queried for the open. (This is not entirely accurate; NFSv4 can handle migration of a file.) And if we wanted to split that file across multiple machines, well, that is definitely outside the scope of these protocols.

The solution that pNFS takes is to provide a router, which is called the MDS (metadata server), and to push the actual routing of the data back to the client:

[Diagram not shown]

Just as with RAID, we want to stripe the I/O. So the client queries the MDS for the layout of the file. The layout is basically a list of DSs (data servers) and the stripe size.

[Diagram not shown]

At a simplistic level, the client will access the first stripe size chunk on the first DS, the second chunk on the second machine, and so on. Once it has done that for all machines, it will wrap back to the first. If the client decides to start at a position other than the start of the file, it can do some simple calculations to determine which DS has the data.

A write operation can be visualized in the following:

[Diagram not shown]

To summarize, the pNFS client (for all intents and purposes an NFSv4.1 client) talks to the MDS to get metadata (attributes) and the set of DSs which hold the file. The client then talks directly to each of them for file access.

It is then the job of the spe to determine which set of machines should be used. Instead of the static decision made when the disks were attached to a single server, we now make a dynamic decision. If the file access is for reading, we don't need to consult the spe - the decision was already made when the file was created. For writes, the spe will need to look at the specified rules (policies) and the current state of the DSs to determine the best place to stripe the file. Note that nothing we have said (nor the above picture) dictates that the DSs have adjacent IPs.

I'll talk more about the spe in a future entry.

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily

Tuesday Jun 26, 2007

OpenSolaris Project Models and pNFS

I believe that the majority of OpenSolaris development occurs within Sun Microsystems Engineering. As much as we would like for it to snowball in the wild, that has not happened. I'm saying this from my biased view, I know some projects have been proposed externally from Sun, e.g., the i18n port of the closed library. I also acknowledge the work that Dennis Clark is leading for the PPC port. There are more and I am not trying to take away from them. I am relating my experience with trying to get projects off the ground on OpenSolaris - see for example OpenSolaris Project: NFS Server in non-Global Zones.

So what does happen is that a new project gets started and there is no external indication of forward progress. People might start asking for code drops, and the reality is that because of the huge internal pressure towards quality in Sun Engineering, that is not going to happen until the code has baked a bit. It gets to the point that a prime question on new project proposals is whether code will be released. Again, there isn't some hidden agenda within Sun to withhold the code - we are just new to this model and we want things to be perfect, not just good enough.

Look back at the discussion that went on for Project Proposal -- Honeycomb Information and dev tools and the lack of a code drop. The OpenSolaris Project: HoneyComb Fixed Content Storage already shows a binary drop and plans for a code drop in the Fall of 2007. Some valid reasons for a group to not drop code right away are that they do not understand the process (they need someone to help them) and they need to clear a legal hurdle to make sure that they are not violating the rights of either an individual or a company. I've seen both occur internally. The good news is that we have internal people ready and willing to help development groups.

What I find really exciting are projects that have a significant external presence. And sometimes that external presence doesn't contribute directly to the code work. In NFSv4 and NFSv4.1, the external collaboration takes place through the IETF and Connectathon. Both companies and open source developers come together to design and implement future NFS protocol extensions. Interoperability across multiple OS platforms is ensured via the yearly meetings at Connectathon. And with the UMICH CITI developers working on Projects: NFS Version 4 Open Source Reference Implementation (which mainly targets Linux but serves as a reference for BSD directly and OS X indirectly), and Sun working on OpenSolaris, it is possible for vendors to do compatibility testing all year long.

Take for example NetApp, which provides only an NFS server. They are able to test new NFSv4.1 features against Linux and OpenSolaris clients. Admittedly this isn't new; NetApp was able to use the Solaris 10 beta code to test NFSv4. And the companies in question all sign NDAs and exchange hardware and engineering drops of binaries for testing.

So there is almost no work being driven from OpenSolaris into this open design project. There is an OpenSolaris Project: NFS version 4.1 pNFS, but it is mainly a portal to the Sun NFS team's work. A question that they asked themselves was whether they were going to do binary drops, code drops, or any drop at all. It wasn't a legal issue; the design is done in the open and all of the coding is new development. It wasn't a fear of the unknown; they had already shared binaries in the past. No, rather it was a concern about the impact of providing a drop on the development schedule. Would the overhead of publishing code and/or binaries kill the final deliverable?

Another OpenSolaris reality is that Sun expects to make money. I know that is an evil concept to some open source developers, but we bet the company on being able to deliver quality and sell service along with the source. So making the deadline for the pNFS deliverable is a major concern for the group.

I'm happy that the group decided that they could both deliver on time and make code and binary drops. Lisa just announced for the group the latest drop in FYI: pNFS Code and BFU Archives posted. You can check out the b66 implementation by downloading it. The code is rough in the sense that you wouldn't want to put it in production, but it gives other developers a chance to see what is going on and allows them to test their own implementations. Remember, this code has not been putback into Nevada - it lives in a group workspace. Before OpenSolaris, it would have only been shared under NDA and the expectation that the person installing the code assumed responsibility for any problems.

Project development in OpenSolaris is different than that occurring in other open source communities. There are different hurdles to jump, but there are different expectations as well. Internal developers are proud of the quality that they demand of the code and want to keep that bar high. That in turn makes early code drops hard for them to deliver. It is something they are learning to do. And the pNFS team is leading the way.

Originally posted on Kool Aid Served Daily
Copyright (C) 2007, Kool Aid Served Daily


