Friday Nov 14, 2008


Traditionally, NFS has shared ordinary file systems, and thus has always dealt with vnodes. With the advent of Parallel NFS (pNFS), which stripes data across multiple servers, this is no longer the case. The initial implementation of pNFS uses the DMU API provided by ZFS to store its stripe data. The pNFS protocol also requires that a server community support proxy i/o; that is, a client must be able to perform all i/o against one server, and if the requested data is on another node, then that server must perform the i/o by proxy. Neither the DMU nor proxy i/o is accessed via vnodes.

Another change brought by pNFS is the distributed nature of the server. What has always been confined to a single server is now distributed over multiple servers. This necessitates a control protocol, for communication among the server nodes, and implies that some server tasks may have longer latencies than before. In pathological cases, e.g. the reboot of one particular server, the latencies may be very high. This will likely require changes to how the NFS server is implemented, and it will become advantageous for the NFS server code to use new APIs with different design goals from vnodes. Asynchronous methods become more desirable, to deal with the new latencies involved in processing client requests.

Enter nnodes. nnodes can be thought of as vnodes, but customized for the needs of NFS (especially pNFS), and with three distinct ops-vector/private-data sections. The figure below shows an NFS server implementation interacting with an nnode that is backed by a vnode.

The fact that the nnode has the three distinct sections for data, metadata, and state, makes it easy to mix and match commonly needed implementations. Here are some examples:

Configuration                   Data       Metadata        State
Traditional shared file system  vnode      vnode           rfs4_db
pNFS metadata server            proxy i/o  vnode           rfs4_db
pNFS data server                DMU        not applicable  proxy to MDS

But this is just the beginning. There will doubtless be more constructions in the future.

nnodes also serve as a place to cache or store information relevant to a file-like object. For example, in the pNFS data server, we can cache stateids that are known to be valid. Thus, the data server will not need to contact the metadata server on every i/o operation.

Today I have been writing a more verbose comment header for the header file "nnode.h". Look for it in our repository soon.

Sunday May 25, 2008

goodbye MANPATH

I don't usually blog about every config file change I make, but here's one I'm particularly happy with, as it's removing a kludge. This is a change made to my zsh configuration that only runs on Solaris:

-# Sun's annoying man command...
-for dir in $path
-do
-    mdir=${dir%/*}/man
-    test -d $mdir && manpath=($manpath $mdir)
-done

I was happy to play a very small role in this putback to Nevada.

Thanks Mike!
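For anyone curious what the removed loop was doing, here is a rough POSIX sh equivalent. It is only a sketch: the original is zsh, which keeps "path" and "manpath" as arrays, and the function name derive_manpath is made up for illustration. It swaps the last component of each PATH entry for "man" and keeps the directories that actually exist.

```shell
#!/bin/sh
# derive_manpath LIST: LIST is a colon-separated list of directories.
# Hypothetical helper, not part of any real configuration.
derive_manpath() {
    result=
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        # Strip the last path component (e.g. "bin") and append "man".
        mdir=${dir%/*}/man
        if [ -d "$mdir" ]; then
            result=${result:+$result:}$mdir
        fi
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

derive_manpath "$PATH"
```

The "${dir%/*}" expansion removes the shortest trailing match of "/*", which is why /usr/local/bin turns into /usr/local before "/man" is appended.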

Wednesday Apr 02, 2008

smcup/rmcup: hate

Some terminals have the capability to save and restore themselves. Some programs take advantage of this, so that when you exit the program, your screen is restored to its previous state. In terms of terminfo, these capabilities are known as smcup and rmcup.

I hate this. Hate.

Let's say you want to run a command, but want to look at its man page first. The man command sends its output to your $PAGER, which is less. The less command saves/restores your screen. So, you scroll to exactly the example you want: perfect! You hit 'q' to quit... and the example is erased. Hate. Many more examples are possible, but you get the idea.

Here's how I eradicate this in my world. I use this on Solaris 10 and 11, and MacOS 10.4 and 10.5. Your mileage will probably vary, but feel free to give it a try. It's in my .zshrc file, so it's using zsh's builtin "[[" and "]]" operators, as well as "$(command)". If this fails for you, you can probably just replace "[[" with "[", "]]" with "]", and "$(command)" with "`command`".

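TERMINFO="/tmp/$(id -un)-terminfo-$(uname -s)"
if [[ ! -d $TERMINFO ]]; then
    mkdir -p $TERMINFO
    # Dump the current entry, minus smcup/rmcup and any lines left blank.
    infocmp | sed -e 's/smcup.*,//' -e 's/rmcup.*,//' -e '/^[ \t]*$/d' \
        > $TERMINFO/fixed
    # Line 2 holds the terminal names; make sure $TERM is among them.
    sed -e '1d' -e '3,$d' < $TERMINFO/fixed | grep -w $TERM >/dev/null 2>&1
    if [[ $? -ne 0 ]]; then
        mv $TERMINFO/fixed $TERMINFO/broken
        sed -e "2s/^/$TERM|/" < $TERMINFO/broken > $TERMINFO/fixed
    fi
    # tic compiles into the directory named by $TERMINFO.
    TERMINFO=$TERMINFO tic $TERMINFO/fixed
fi
# Assumed final step: export so that programs actually use the new entry.
export TERMINFO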
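To see what the sed expressions do to an entry, here is a small sketch that runs them over a made-up, abbreviated terminal description. The entry and the function names sample_entry and strip_cup are hypothetical; "[[:space:]]" stands in for "[ \t]" only so that the blank-line cleanup behaves the same in every sed.

```shell
#!/bin/sh
# A made-up, abbreviated entry shaped like infocmp output (not a real terminal).
sample_entry() {
    cat <<'EOF'
# hypothetical sample entry
demo|demo terminal,
    am, cols#80,
    rmcup=\E[?1049l, rmir=\E[4l,
    smcup=\E[?1049h, smir=\E[4h,
EOF
}

# The same two substitutions as in the snippet above, followed by removal of
# any line that is left with nothing but whitespace.
strip_cup() {
    sed -e 's/smcup.*,//' -e 's/rmcup.*,//' -e '/^[[:space:]]*$/d'
}

sample_entry | strip_cup
```

Note that ".*," is greedy, so everything that follows smcup or rmcup on the same line is removed as well (rmir and smir in this sample). If that matters to you, "infocmp -1" prints one capability per line, which would leave the neighbors alone.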

Thursday Mar 13, 2008

Using SMF to make DTrace a better printf replacement

When DTrace was introduced, life got much easier for kernel developers. For example, it became very easy to check if and when a certain function was being called. Before DTrace, if a debugger breakpoint was too heavy, we would often use printf()s. With DTrace, no more recompile/reboot!

Sure enough, one thing often discussed in the hallways was, "Yea! No more printf()s!" Yet I still find myself using printf()s. Not as often, certainly, and not in as many situations, but they haven't disappeared.

I think the reason I still use printf()s is that printf()s are "always on". With a printf() in the kernel, all output goes to /var/adm/messages. For example, if I'm writing bleeding edge code, I'd like to know if a certain routine fails. The routine might work at first, then fail a week later. With a printf(), I can always go back and look to see what happened. With DTrace, it's only there if I was running my script.

If only there were a way that I could use DTrace instead. If DTrace were always running on my development box, I wouldn't miss those unexpected things, and I wouldn't find myself wishing I had had a printf() at some point. If only there were some facility to start a D script, so that you wouldn't forget to launch it every day when you start working... Duh. Use SMF!

An SMF service to run a D script

You can find an SMF manifest for running a D script as a service here. You write a script, point this service at the script, enable via smf, and the script is always running! Output goes to /var/adm/messages via syslog.

Here's how to use it (you can also read these steps in the manifest itself):

  1. Download the file and save it in /var/svc/manifest/application/dtsyslog.xml
  2. # svccfg import /var/svc/manifest/application/dtsyslog.xml
  3. Write a script that you would like to run all the time. Put it somewhere, e.g. in /var/tmp.
  4. # svccfg
  5. svccfg> select dtsyslog
  6. svccfg> editprop
  7. Look for the line that says
    # setprop script/path = astring: /path/to/D/script
    Uncomment this line, and change "/path/to/D/script" to point to the path of your script.
  8. svccfg> quit
  9. # svcadm refresh dtsyslog
  10. # svcadm enable dtsyslog
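If you would rather not go through the interactive editor, the steps above can probably be scripted. The sketch below assumes that the manifest defines the script/path property exactly as shown in step 7, and that setting it with "svccfg -s" is equivalent to uncommenting the line in editprop. The function name setup_dtsyslog and the RUN variable are made up; by default the commands are only printed.

```shell
#!/bin/sh
# RUN defaults to "echo", so the commands are only printed.  Set RUN to the
# empty string (RUN= sh thisfile) to really run them, as root, on Solaris.
RUN=${RUN-echo}

# setup_dtsyslog SCRIPT: import the manifest, point it at SCRIPT, enable it.
setup_dtsyslog() {
    script=$1
    $RUN svccfg import /var/svc/manifest/application/dtsyslog.xml
    $RUN svccfg -s dtsyslog setprop script/path = astring: "$script"
    $RUN svcadm refresh dtsyslog
    $RUN svcadm enable dtsyslog
}

setup_dtsyslog /var/tmp/logger.d
```

Printing first makes it easy to check the property name against what editprop shows before changing anything.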


Let's try an example. Suppose I want to know when nfs4_setattr() is being called. Let's make a script:

#! /usr/sbin/dtrace -s

#pragma D option quiet

fbt::nfs4_setattr:entry
{
	printf("setattr called on %s\n", stringof(args[0]->v_path));
}

Note the "#pragma" line: the SMF properties could have been used to pass a "-q" option to dtrace, but I prefer the pragma, as it makes the script more self-contained. So, we store this file as /var/tmp/logger.d, use svccfg to set the script/path property, enable the service, and we're good to go!

burr/3/* echo hello > foo
burr/4/* chmod 666 foo
burr/5/* tail /var/adm/messages
Mar 13 14:02:50 burr pseudo: [ID 129642] pseudo-device: fbt0
Mar 13 14:02:50 burr genunix: [ID 936769] fbt0 is /pseudo/fbt@0
Mar 13 14:02:50 burr pseudo: [ID 129642] pseudo-device: systrace0
Mar 13 14:02:50 burr genunix: [ID 936769] systrace0 is /pseudo/systrace@0
Mar 13 14:02:50 burr pseudo: [ID 129642] pseudo-device: profile0
Mar 13 14:02:50 burr genunix: [ID 936769] profile0 is /pseudo/profile@0
Mar 13 14:02:50 burr pseudo: [ID 129642] pseudo-device: lockstat0
Mar 13 14:02:50 burr genunix: [ID 936769] lockstat0 is /pseudo/lockstat@0
Mar 13 14:21:38 burr dtsyslog: [ID 702911 daemon.notice] setattr called on /home/samf/foo

Other Possibilities

You will likely want to edit your D script from time to time. When you do, just edit the script, and invoke "svcadm restart dtsyslog" to pick up the changes.

If you wish to have more than one script, you can easily create multiple instances of the dtsyslog service. Look at the man page for svccfg. Just edit the script/path property for each instance.

You can control the output via /etc/syslog.conf. The messages are logged as daemon.notice by default, but you can also configure that via svccfg.

I hope you find this useful!

Wednesday May 23, 2007

nfs4trace: a New Direction

My previous DTrace provider for NFS has been floundering, due to a needed design change, and my priorities with pNFS. Fortunately, a new design is being worked on by two engineers recently assigned to the project.


Thursday Jun 08, 2006

nfs4trace release: all ops covered!

I have just pushed a new release of the DTrace provider for NFSv4. This is in fact the second release I've made -- the first release didn't get a blog entry. You can be sure to see all releases by watching the announcements section of the nfs4trace project. An RSS feed can be had here:

This release is based on Nevada build 41. I mention this so that you will know what bugs and bug fixes are present in this release. The release is in the form of bfu archives, so follow the usual procedure to install them.

New Features

This release of nfs4trace provides a set of probes for every NFSv4 operation. There is an overarching probe for the compound op, called "op-compound". Each op within a compound has its own probe, e.g. "op-lookup".

For every probe, args[0] is a pointer to a nfs4_dtrace_info_t. This has the following fields:

dt_xid     RPC transaction ID. You can use this to correlate "start" with "done".
dt_cr      Pointer to the credential structure used for this operation. Use args[0]->dt_cr->cr_uid to get the user ID.
dt_tag     NFSv4 tag field associated with this over-the-wire call. This is entirely implementation defined, and is meant simply to be human readable.
dt_status  The status of the entire compound operation.
dt_addr    The IP address of the "other end" of the request. For a client, the "other end" is the server; for a server, the "other end" is the client.
dt_netid   The network ID of the "other end" of the request. The network ID will be either "tcp" for IPv4 or "tcp6" for IPv6.

Besides the normal compound operations, there is a probe for all callback-related operations. The probe "cb-compound" is analogous to "op-compound", but covers the callback channel. Each operation within a callback, e.g. OP_CB_GETATTR, also has a probe -- for this example, "cb-getattr".

To see a complete list of all probes, run dtrace -P nfs4c -l for client probes, and dtrace -P nfs4s -l for server probes.

What's Left?

Eventually, there will be probes for every nontrivial attribute, which will fire regardless of which operation is using them. But not all of these attributes are accounted for yet. Stay tuned.

What else is left? Your feedback will help decide this! Please try this, and let me know if you have any input. Thanks!

Thursday Feb 09, 2006

SMF Manifest for Perforce

If you want to run a Perforce server from Solaris 10 or greater, you should be using SMF instead of /etc/init.d scripts or inetd. I haven't seen an SMF manifest for a Perforce server as of yet; so, I created one. It handles p4d (the main server) and p4p (the proxy server).

You can grab it from here:

The manifest itself has a quick "cheat sheet" to run either p4d (the "main" server) or p4p (the proxy server). But let's look at the more correct way to configure and run the services, and then we'll look at how to run multiple instances of the services.

Importing the manifest and configuring a default instance

First, let's take the downloaded "perforcexml.txt" file and put it in its proper place; then import it into SMF. Assume you've saved perforcexml.txt in /tmp.

# cd /var/svc/manifest/application
# mv /tmp/perforcexml.txt ./perforce.xml
# chown root:sys perforce.xml
# chmod 644 perforce.xml
# svccfg import perforce.xml

Notice that we didn't edit perforce.xml. Now, let's configure it from within SMF. We'll make the following assumptions:

  • The "p4d" and "p4" executables are in /usr/local/perforce/bin.
  • The perforce user with administrative privileges is named "bob". This is a perforce user -- not the same thing as a login in /etc/passwd.
  • The root of the perforce repository is /var/perforce.
  • We'll take the default for all other parameters, e.g. the tcp port number.

# svccfg
svc:> select p4d
svc:/application/perforce/p4d> editprop

editprop throws us into $EDITOR, your favorite editor. From here, we make the following changes:

# Property group "executables"
# delprop executables
# addpg executables application
setprop executables/client = astring:"/usr/local/perforce/bin/p4"
setprop executables/server = astring:"/usr/local/perforce/bin/p4d"

# Property group "options"
# delprop options
# addpg options application
# setprop options/journal = astring: (journal)
# setprop options/port = astring: (1666)
setprop options/adminuser = astring:"bob"
setprop options/root = astring:"/var/perforce"

The lines that we changed are the lines that are now uncommented. Save and quit your editor.

svc:/application/perforce/p4d> quit
# svcadm refresh p4d

We're ready! Make sure that /var/perforce is owned by daemon (the login that will actually be running the p4d daemon), and kick it off.

# chown daemon /var/perforce
# svcadm enable p4d
# svcs -x

Woohoo! No problems! If there had been a problem, "svcs -x" would have shown it. Test a perforce client against the server. If there are any problems, check "svcs -x" again.

The usual caveats apply: if you kill p4d, or run "p4 admin stop", SMF will immediately restart p4d. To stop the p4d server correctly, run "svcadm disable p4d". This will correctly shut down p4d (via "p4 admin stop").

Starting a second instance on the same machine

We can use SMF to start another Perforce repository on the same machine. The two instances can be administered separately (e.g. one can be taken down while the other one is active).

For the second instance, we'll assume the following:

  • The root directory of its Perforce repository is /var/altperforce.
  • The administrative user is "carl".
  • We will use tcp port 1667 (since 1666 was already taken by the first server).
  • In SMF terms, the second instance will be called "alt". The first one was called "default". The name of the first doesn't matter as much until you create a second.

When creating the second instance, we will want to borrow some of the properties from the default instance. The "editprop" command comes in handy here. If your favorite editor allows you to save a range of lines, then you can save the relevant section to a temporary file.

# svccfg
svc:> select p4d
svc:/application/perforce/p4d> editprop

Now we save the following lines into a temporary file:

# Property group "options"
# delprop options
# addpg options application
# setprop options/journal = astring: (journal)
# setprop options/port = astring: (1666)
# setprop options/adminuser = astring: (bob)
# setprop options/root = astring: (/var/perforce)

Now quit your editor, and we'll create the new instance.

svc:/application/perforce/p4d> add alt
svc:/application/perforce/p4d> select alt
svc:/application/perforce/p4d:alt> editprop

We will see the following:

select svc:/application/perforce/p4d:alt

Now read the temporary file into your editor, and edit it to look like this:

select svc:/application/perforce/p4d:alt

# Property group "options"
# delprop options
addpg options application
# setprop options/journal = astring: (journal)
setprop options/port = astring:"1667"
setprop options/adminuser = astring:"carl"
setprop options/root = astring:"/var/altperforce"

Save and quit your editor. We're almost done!

svc:/application/perforce/p4d:alt> quit
# chown daemon /var/altperforce
# svcadm refresh p4d
svcadm: Pattern 'p4d' matches multiple instances:

Oops, things really have changed! Let's do it right:

# svcadm refresh p4d:alt
# svcadm enable p4d:alt
# svcs -x

Success. The same technique can be applied for p4p (the proxy server). The configurable options for p4p are pretty much the same as p4d.


I hope you find this useful. This was my first learning experience for SMF, and it was really fun. But it's just a small sample of what SMF can do.

If I make any more changes to this manifest, either from your feedback or from changes to Perforce itself, I will blog about it here.


Friday Dec 23, 2005

A DTrace Provider for NFS

As a kernel developer working on NFSv4, I love DTrace. The fbt and syscall providers are my constant companions.

But last June, at the NFSv4 Bakeathon, I had a more difficult problem -- an infinite loop involving volatile filehandles. The problem was reproducible (good), but it took a couple of minutes running a test suite to get to that point (bad). Thousands of over-the-wire operations occurred before the more interesting things began to happen.

Using snoop, which "knows" the NFSv4 protocol, I knew what Solaris NFSv4 was doing, but I didn't know why it was doing it. DTrace could tell me the "why" part, but I couldn't think of a script that would get me just the right information, without drowning me in uninteresting information.

Sleep deprived, I began whining about how DTrace should have a new provider; one that "knows" the NFSv4 protocol, and lets you write protocol related events into your scripts. A provider that lets you do things like grab filehandles, and stuff them into variables, to be used later in conditionals.

Upon getting home, I began writing such a provider. In fact, I wrote one that could solve the problem discussed above. As it grew, there were some parts of the code that were getting messy, so now I'm in the midst of refactoring the provider.

A New Provider

Here is how the NFSv4 provider is starting to look. As usual for DTrace probes, we have:

provider:module:function:name

Provider is either "nfs" or "nfs-server". Plain old "nfs" is the NFS client -- the thing that lets you mount remote file systems. Yes, this means that technically there are really two providers, but they will be nearly the same structure. Learning one provider will be sufficient to learn both.

Module is either "nfs" or "nfssrv", again for client and server. I would recommend just leaving this part blank, since it doesn't buy you any more than the provider slot. If you do not leave it blank, make sure that provider and module match. If they don't match, you won't get any probes. For example, "nfs:nfssrv:<something>:<something>" will not match any probes.

Function is an operation or an attribute. More on that below. :-)

Name is either "start" or "done". For a probe on a client operation, "start" would fire when the client sent the operation over-the-wire to the server, and "done" would fire when it received its response from the server. For the server, "start" would fire when the server received a request, and "done" would fire when the server answered the request. Callback functions, where the server makes a request of the client, are just what you would expect -- client "start" fires when the client first receives a request from the server, etc.

The Probes

As mentioned above, the function slot is an operation or an attribute. Operations are of the form "op-<operation-name>", where <operation-name> is an operation defined in the NFSv4 protocol. Examples:

nfs::op-read:start /* client did a read operation */

nfs-server::op-read:start /* server got a read request */

nfs::op-compound:done /* client's compound op finished */

For attribute probes, we use "attr-<attribute-name>", where <attribute-name> is a type defined in the NFSv4 protocol. It can be either an argument to an operation (examples:)

nfs::attr-seqid4:start /* client sent a sequence ID with some op */

nfs::attr-filehandle:done /* client received a filehandle */

or an attribute of a file. Examples:

nfs::attr-owner:start /* client is sending an owner file attribute */

nfs::attr-filehandle:done /* client received a filehandle */

Above, notice that attr-filehandle makes sense as an argument to an operation, or an attribute of a file. The nice thing about the attr-* probes is that you can trace these attributes going over the wire, without caring about what operation is sending them, or how it's being sent.

Arguments to the Probes: args[0] and args[1]

For all probes, there are two arguments. args[0] is the same for all probes: it is a structure that holds things such as the tag and transaction id (xid) of the operation. It will likely hold the network address of the machine that it's talking with; more things may be added in the future.

xid   Transaction ID
tag   Descriptive tag for the request (see section 14.2 of the NFSv4 spec).
addr  Address of the remote host (in a format yet to be determined)

More may be added; stay tuned.

The second argument, args[1], is different for every probe. For op-* probes, it's the structure that is being sent or received. For attr-* probes, it's just the attribute itself. Examples:

ProbeType for args[1]
Types for attributes may tend to be more simple types, but this is still to be decided. Here are some possibilities:

Any working examples yet?

The provider is still being implemented, and it doesn't yet handle very many of the details of the NFSv4 protocol. But the framework is there, and it does handle "compound", which is the fundamental over-the-wire operation in NFSv4.

As its name implies, a compound operation can do more than one thing. For example, it can create a new file and get the new file's attributes, all in one over-the-wire operation. As mentioned before, compounds have a field called a "tag", which is a descriptive string attached that very briefly describes the purpose of the request. The tag for our hypothetical create-file-and-get-its-attributes operation might simply be "create".

Here is a simple D script that uses the op-compound probe to show which over-the-wire requests the client spends the most time waiting upon. The requests are collated by the tag. You could use such a script to help get an idea of what compounds are taking the most time, and maybe start thinking about where to do further performance analysis.

Following the example script is example output. While running this example script, I used "tar" to copy a bunch of files within an NFS mounted directory, and then used "rm -rf" to delete them. Don't treat this as a serious benchmark or anything -- it's just to give an idea of what's easily possible with the new provider.

#! /usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Record the time when a request is sent to the server.
 */
nfs::op-compound:start
{
    cstart[args[0]->xid] = timestamp;
}

/*
 * We received a response from the server.  Below, all times are converted
 * from nanoseconds to microseconds (dividing by 1000); this was more readable
 * in my particular setup.
 */
nfs::op-compound:done
/cstart[args[0]->xid] != 0/
{
    /* Stuff the descriptive "tag" into a convenience variable */
    this->tag = stringof(args[0]->tag);

    /* Compute total time spent processing a given tag */
    @ttime[this->tag] =
        sum((timestamp - cstart[args[0]->xid]) / 1000);

    /* Free up the cstart[] slot */
    cstart[args[0]->xid] = 0;
}

/*
 * We're done, probably via the user hitting ^C.  Print the results.
 */
END
{
    printf("\ntotal time spent in calls\n");
}


total time spent in calls

  renew                                                           276
  commit                                                        11054
  link                                                          21647
  rmdir                                                        367802
  mkdir                                                        422512
  access                                                       599114
  readdir                                                      753105
  lookup valid                                                 822106
  lookup                                                       862495
  read                                                        1026893
  close                                                       1095138
  getattr                                                     1841042
  write                                                       6674532
  remove                                                      6802618
  setattr                                                     7101811
  open                                                        7539189
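If you want relative numbers rather than raw microseconds, output like the above is easy to post-process. Here is a rough sketch; the function name summarize is made up, and it assumes each line ends in a number and that everything before the number is the tag (tags such as "lookup valid" contain spaces).

```shell
#!/bin/sh
# summarize: read "tag ... microseconds" lines and print each tag's share of
# the total, followed by the total itself.  Lines that do not end in a
# number are ignored.
summarize() {
    awk '
        NF >= 2 && $NF ~ /^[0-9]+$/ {
            usec = $NF
            $NF = ""            # drop the number; $0 is rebuilt from the tag words
            sub(/ +$/, "")      # trim the separator left behind
            total += usec
            spent[$0] += usec
        }
        END {
            for (tag in spent)
                printf "%-14s %5.1f%%\n", tag, 100 * spent[tag] / total
            printf "total %d us\n", total
        }'
}

# Three lines copied from the output above.
summarize <<'EOF'
  lookup valid                                                 822106
  setattr                                                     7101811
  open                                                        7539189
EOF
```

The per-tag lines come out in whatever order awk iterates over the array, so pipe the result through sort if the order matters.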


Monday Nov 21, 2005

IETF, ZFS and DTrace

What have I been up to?

I recently traveled to beautiful Vancouver for the 64th IETF. There, Lisa and I attended many meetings, most notably the NFSv4 working group meeting, and presented our recent Internet Draft.

The Internet Draft concerns NFSv4 ACLs. It attempts to clear up ambiguities in the NFSv4 spec, RFC3530. It also proposes a way for ACLs and UNIX-style modes to live together in harmony. Hopefully, it will end up as one or more RFCs, and NFSv4 clients and servers can have a truly useful ACL model.

Meanwhile, as has been mentioned in many other places, ZFS has been released. Check out blog entries from Lisa and Mark on how ACLs work in ZFS.

I also gave a presentation on DTrace at a joint meeting of the Front Range OpenSolaris User Group (FROSUG) and the Front Range Unix User Group (FRUUG).

Why was I, a humble NFS engineer, giving a presentation on DTrace? Well, for one thing, I use DTrace quite a bit. But I'll be blogging more about DTrace and NFS a bit later.


Tuesday Jun 14, 2005

ACLs Everywhere

I was tempted to call this posting ACLs are my resume, but that's a bit extreme. Still, Access Control Lists (ACLs) seem to follow me around. Here are some of the places where I have worked on ACLs.

  • I hacked ufsdump/ufsrestore to deal with UFS ACLs
  • I enabled caching of ACLs in CacheFS
  • I made a modification of disconnected CacheFS to allow ACLs to work while disconnected
  • I worked on a very different sort of application-level ACLs at a failed startup
  • I now work on ACLs for NFS version 4.

Solaris 10

Lisa and I managed to integrate support for NFSv4 ACLs into Solaris 10. The effort to add ACL support began late in the Solaris 10 release cycle. Some of the problems we hit (outlined below) weren't even thought about when we began. We couldn't do much testing against other vendors until very late in the release cycle. There were a few show stopper bugs we had to fix before Solaris 10 could ship. This was one of the most intense but rewarding projects I've worked on!

NFSv4 ACL support breaks down into two big pieces: support for the over-the-wire operations involving ACLs, and translation between the various ACL models (more on them below). The over-the-wire pieces of ACL handling are scattered throughout NFSv4. nfs4_vnops.c has the usual vnode ops. See nfs4_getsecattr() and nfs4_setsecattr() for the front end (as far as the file system is concerned) to ACLs. Other pieces are in nfs4_client.c and nfs4_xdr.c. The translators are contained in nfs4_acl.c. For this article, I will focus on the translators.


Solaris has had support for ACLs for a long time. The ACL model supported before Solaris 10 is called POSIX-draft. This was supposed to become a POSIX standard, but the effort was abandoned. The latest draft is what was implemented for Solaris. For on-disk file systems, the Solaris UFS filesystem implements POSIX-draft ACLs. For versions two and three of the Network File System (NFS), an undocumented side-band protocol enables users to manipulate ACLs on the server. To the best of my knowledge, the only implementation of this protocol outside of Solaris is for Linux.

NFSv4 introduces a powerful new ACL model. It's powerful enough that every POSIX-draft ACL can be translated into an NFSv4 ACL. But NFSv4 ACLs can go beyond POSIX-draft semantics; thus, not all NFSv4 ACLs can be translated into POSIX-draft ACLs.

The presence of two ACL models makes it desirable to seamlessly translate between the two, so we implemented ACL translation in the kernel for NFSv4. Translation gives us many benefits:

  • POSIX-draft ACLs can be sent over-the-wire using the NFSv4 protocol. Thus, we don't have to bolt on another undocumented side-band ACL protocol for NFS version 4.
  • Users on Solaris clients can manipulate POSIX-draft ACLs (or perhaps the semantically equivalent NFSv4 ACLs) on non-Solaris servers.
  • Utilities that restore POSIX-draft ACLs (e.g. tar, cpio, ufsrestore) can restore semantically equivalent ACLs onto an NFSv4 filesystem, even if the server's filesystem is not Solaris UFS.
  • Non-Solaris NFSv4 clients can manipulate ACLs on a present-day Solaris server (using UFS), provided they stay within the confines of ACLs that can be translated into POSIX-draft.

At Connectathon 2005, we gave a presentation on implementing NFSv4 ACLs in Solaris 10. However, now that Solaris is open, I would like to talk about the translators at the code level.

Translating POSIX-draft ACLs into NFSv4 ACLs

In the Solaris kernel, ACLs are passed around inside of vsecattr_t structures. The main entry point for translating POSIX-draft ACLs into NFSv4 ACLs is vs_aent_to_ace4(). Here is what the call stack looks like when a POSIX-draft ACL is translated into an NFSv4 ACL.

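vs_aent_to_ace4()
  ln_aent_to_ace4() (once for the regular ACL, once for the default ACL)
    ln_aent_preprocess()
    for every ACE:
      mode_to_ace4_access()
        access_mask_set()
      ace4_make_deny()
        access_mask_set()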
  1. vs_aent_to_ace4() sets up new vsecattr_t's to hold the new ACLs. It calls ln_aent_to_ace4() on each of the POSIX-draft ACL parts: the ACL, and the "default" (inheritable) ACL.
  2. ln_aent_to_ace4() does the main part of the conversion. First it calls ln_aent_preprocess(), to scan the incoming ACL to find various statistics (e.g. how many group entries).
  3. For every ACL entry, ln_aent_to_ace4() determines what type of ACEs to create. It calls mode_to_ace4_access() which creates an access mask to be used in an ALLOW ACE.
  4. ln_aent_to_ace4() also calls ace4_make_deny(), to make DENY ACEs that complement the ALLOW ACEs.
  5. Both mode_to_ace4_access() and ace4_make_deny() call access_mask_set(). This function determines the values of several access mask bits, beyond the usual read/write/execute. Its intent is to do so in a generic manner. In order to understand its operation, see the global variables nfs4_acl_{client,server}_produce.
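
To make the ALLOW/DENY idea concrete, here is a deliberately simplified illustration. It is not the code in nfs4_acl.c: the function names allow_mask and deny_mask are made up, only the read/write/execute-related bits are handled, and the mapping of write permission to both WRITE_DATA and APPEND_DATA is an assumption. The real functions, via access_mask_set(), also decide several other bits. The bit values are the ones defined in RFC 3530.

```shell
#!/bin/sh
# Access mask bits as defined in RFC 3530.
ACE4_READ_DATA=$((0x00000001))
ACE4_WRITE_DATA=$((0x00000002))
ACE4_APPEND_DATA=$((0x00000004))
ACE4_EXECUTE=$((0x00000020))
RWX_BITS=$((ACE4_READ_DATA | ACE4_WRITE_DATA | ACE4_APPEND_DATA | ACE4_EXECUTE))

# allow_mask PERM: PERM is one rwx digit (r=4, w=2, x=1), as found in a
# POSIX-draft ACL entry.  Prints the mask as a decimal number.
allow_mask() {
    perm=$1
    mask=0
    if [ $((perm & 4)) -ne 0 ]; then mask=$((mask | ACE4_READ_DATA)); fi
    # Assumption: write permission covers both overwriting and appending.
    if [ $((perm & 2)) -ne 0 ]; then mask=$((mask | ACE4_WRITE_DATA | ACE4_APPEND_DATA)); fi
    if [ $((perm & 1)) -ne 0 ]; then mask=$((mask | ACE4_EXECUTE)); fi
    echo "$mask"
}

# deny_mask PERM: the rwx-related bits that the matching ALLOW ACE leaves out.
deny_mask() {
    echo $(( RWX_BITS & ~$(allow_mask "$1") ))
}

for perm in 7 5 4; do
    printf 'perm %d: allow 0x%08x deny 0x%08x\n' \
        "$perm" "$(allow_mask "$perm")" "$(deny_mask "$perm")"
done
```

For an r-x entry (5), the sketch allows READ_DATA and EXECUTE and denies WRITE_DATA and APPEND_DATA, which is the sense in which the DENY ACE complements the ALLOW ACE.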

Translating NFSv4 ACLs into POSIX-draft ACLs

The entry point for translating NFSv4 ACLs into POSIX-draft ACLs is vs_ace4_to_aent(). Here is what the call stack looks like when making this translation.


  1. vs_ace4_to_aent() is a very thin front-end to ln_ace4_to_aent(). It doesn't do much more than call this function, and pass on errors.
  2. The first thing ln_ace4_to_aent() does is to call ace4_list_init() twice: once to hold information for the POSIX-draft ACL, and once for the POSIX-draft "default" (inheritable) ACL. The ace4_list_t holds all the information collected from an NFSv4 ACL needed in order to transform it into a POSIX-draft ACL.
  3. ace4_to_aent_legal() is called to look for common problems in the NFSv4 ACL. Some of these problems could come from a bad NFSv4 ACL, but most often, the problem is an NFSv4 ACL that is not translatable into a POSIX-draft ACL.
  4. Notice that ace4_to_aent_legal() calls access_mask_check(). This function is the counterpart to access_mask_set(), used when translating in the other direction. In a Solaris-only environment, you can think of access_mask_check() as a function to check the work done by access_mask_set().
  5. If ace4_to_aent_legal() does not find any problems, ln_ace4_to_aent() processes each ACE in the NFSv4 ACL. There are still plenty of places where translation can fail. But if all goes well, ace4vals_find() is called to find a place to store information from the ACE, which will be used in producing the POSIX-draft ACL.
  6. After all of the ACEs are processed, ace4_list_to_aent() converts the ace4_list_t into a POSIX-draft ACL. This is called once for the "normal" ACL, and once for the "default" ACL.
  7. Finally, ace4_list_free() frees the memory allocated to hold the translations in progress.

To see when some things go wrong, you can turn on nfs4_acl_debug. The debugging code isn't very complete, and it might be replaced by static DTrace probes some day. But for now, you can do this:

# mdb -kw
> nfs4_acl_debug/W 1

Future ACL support in Solaris

As you can see from reading the code, and perhaps from all of the debugging prints, there are lots of things that can go wrong when translating ACLs from NFSv4 into POSIX-draft format. Besides that, wouldn't it be nice to be able to set an arbitrary NFSv4 ACL, and use the new ACL model?

Things are getting much better with the arrival of ZFS. The goal of ZFS's ACL implementation is to implement NFSv4 ACLs in a way that is compatible with Solaris. The ZFS ACL model is still in flux, but it is rapidly solidifying.

We will be releasing an Internet Draft in the near future, in which we will propose a way for UNIX and UNIX-like systems to support NFSv4 ACLs. If all goes well, this will be the ACL model used by ZFS.


Friday May 13, 2005

Friday the 13th

I must be pretty brave to be making my first post on Friday the 13th.

I'm Sam Falkner, and I work in Solaris, specifically focusing on NFS. I've been at Sun since 1992, with a roughly two year hiatus in 1997 and 1998. I've worked on ufsdump/ufsrestore, Backup Copilot, "Online: Backup 2.0", CacheFS, UFS, and finally NFS.

I don't know how often I'll be posting, since things are pretty hectic right now. If you haven't taken the plunge into RSS yet, I would highly recommend it! I recently started using Sage under Firefox. It's easy and it makes infrequent blog posters less annoying. At least I hope so.



