Friday Nov 14, 2008


Traditionally, NFS has always been able to share an ordinary file system, and thus has always dealt with vnodes. With the advent of Parallel NFS (pNFS), which has stripes of data distributed across multiple servers, this is no longer the case. The initial implementation of pNFS uses the DMU API provided by ZFS to store its stripe data. The pNFS protocol also requires that a server community support proxy i/o; that is, a client must be able to perform all i/o against one server, and if the data requested is on another node, then the server must perform the i/o by proxy. Neither the DMU nor proxy i/o is accessed via vnodes.

Another change brought by pNFS is the distributed nature of the server. What has always been confined to a single server is now distributed over multiple servers. This necessitates a control protocol for communication among the server nodes, and implies that some server tasks may have longer latencies than before. In pathological cases, e.g. a reboot of one particular server, the latencies may be very high. This will likely require new ways of implementing the NFS server, and it will become advantageous for the NFS server code to use new APIs with different design goals from vnodes. Asynchronous methods become more desirable, to deal with the new latencies involved in processing client requests.

Enter nnodes. nnodes can be thought of as vnodes, but customized for the needs of NFS (especially pNFS), and with three distinct ops-vector/private-data sections. The figure below shows an NFS server implementation interacting with an nnode that is backed by a vnode.

The fact that the nnode has three distinct sections for data, metadata, and state makes it easy to mix and match commonly needed implementations. Here are some examples:

Use                             Data        Metadata        State
Traditional shared file system  vnode       vnode           rfs4_db
pNFS metadata server            proxy i/o   vnode           rfs4_db
pNFS data server                DMU         not applicable  proxy to MDS

But this is just the beginning. There will doubtless be more constructions in the future.
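To make the three-section idea concrete, here is a minimal sketch in C. It is not the real nnode.h: every type, field, and function name below is hypothetical, and each ops vector is cut down to a single operation. It only illustrates how each section pairs its own ops vector with its own private data, and how a section can be left empty when it is "not applicable".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops vectors, one per nnode section (illustrative only). */
typedef struct nnode_data_ops {
	int (*ndo_read)(void *priv, char *buf, size_t len, size_t off);
} nnode_data_ops_t;

typedef struct nnode_metadata_ops {
	int (*nmo_getsize)(void *priv, size_t *sizep);
} nnode_metadata_ops_t;

typedef struct nnode_state_ops {
	int (*nso_check_stateid)(void *priv, unsigned int stateid);
} nnode_state_ops_t;

/* Each section pairs an ops vector with its own private data. */
typedef struct nnode {
	const nnode_data_ops_t *nn_data_ops;
	void *nn_data;
	const nnode_metadata_ops_t *nn_md_ops;
	void *nn_md;
	const nnode_state_ops_t *nn_state_ops;
	void *nn_state;
} nnode_t;

/* Dispatch a read through the data section; -1 means "not applicable". */
int
nnode_read(nnode_t *nn, char *buf, size_t len, size_t off)
{
	assert(nn != NULL);
	if (nn->nn_data_ops == NULL || nn->nn_data_ops->ndo_read == NULL)
		return (-1);
	return (nn->nn_data_ops->ndo_read(nn->nn_data, buf, len, off));
}

/* Dispatch a size query through the metadata section. */
int
nnode_getsize(nnode_t *nn, size_t *sizep)
{
	assert(nn != NULL);
	if (nn->nn_md_ops == NULL || nn->nn_md_ops->nmo_getsize == NULL)
		return (-1);
	return (nn->nn_md_ops->nmo_getsize(nn->nn_md, sizep));
}

/* Toy data backing: a string standing in for a vnode or the DMU. */
int
membuf_read(void *priv, char *buf, size_t len, size_t off)
{
	const char *src = priv;
	size_t avail = strlen(src);

	if (off >= avail)
		return (0);
	if (len > avail - off)
		len = avail - off;
	memcpy(buf, src + off, len);
	return ((int)len);
}

const nnode_data_ops_t membuf_data_ops = { membuf_read };
```

An nnode built with membuf_data_ops in its data section and NULL in its metadata section behaves like the "pNFS data server" row: reads are served, while nnode_getsize() reports -1.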

nnodes also serve as a place to cache or store information relevant to a file-like object. For example, in the pNFS data server, we can cache stateids that are known to be valid. Thus, the data server will not need to contact the metadata server on every i/o operation.
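As a rough illustration of the stateid caching idea, the C sketch below remembers stateids that the metadata server has already confirmed, so a repeated i/o with the same stateid costs no extra round trip. Everything here is an assumption made for the example: the names are hypothetical, a stateid is simplified to a 64-bit integer (a real NFSv4 stateid is larger), the callback stands in for the control protocol, and invalidation of revoked stateids is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	STATEID_CACHE_SLOTS	8

/* Hypothetical per-nnode cache of stateids known to be valid. */
typedef struct stateid_cache {
	uint64_t sc_ids[STATEID_CACHE_SLOTS];
	bool sc_used[STATEID_CACHE_SLOTS];
	unsigned int sc_mds_calls;		/* round trips to the MDS */
	bool (*sc_ask_mds)(uint64_t stateid);	/* stand-in for the control protocol */
} stateid_cache_t;

/* Return true if the stateid is valid, asking the MDS only on a cache miss. */
bool
stateid_is_valid(stateid_cache_t *sc, uint64_t stateid)
{
	unsigned int slot = (unsigned int)(stateid % STATEID_CACHE_SLOTS);

	assert(sc != NULL && sc->sc_ask_mds != NULL);
	if (sc->sc_used[slot] && sc->sc_ids[slot] == stateid)
		return (true);		/* cache hit: no MDS traffic */

	sc->sc_mds_calls++;
	if (!sc->sc_ask_mds(stateid))
		return (false);		/* only valid stateids are cached */

	sc->sc_ids[slot] = stateid;	/* may evict an older entry in this slot */
	sc->sc_used[slot] = true;
	return (true);
}

/* Fake metadata server for demonstration: even stateids are valid. */
bool
fake_mds_check(uint64_t stateid)
{
	return ((stateid % 2) == 0);
}
```

With fake_mds_check() plugged in, checking stateid 42 twice results in a single simulated MDS call. Invalid stateids are not cached in this sketch, so each check of one goes back to the MDS.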

Today I have been writing a more verbose comment header for the header file "nnode.h". Look for it in our repository soon.

Wednesday May 23, 2007

nfs4trace: a New Direction

My previous DTrace provider for NFS has been floundering, due to a needed design change and my priorities with pNFS. Fortunately, a new design is being worked on by two engineers recently assigned to the project.

Thursday Jun 08, 2006

nfs4trace release: all ops covered!

I have just pushed a new release of the DTrace provider for NFSv4. This is in fact the second release I've made -- the first release didn't get a blog entry. You can be sure to see all releases by watching the announcements section of the nfs4trace project. An RSS feed can be had here:

This release is based on Nevada build 41. I mention this so that you will know what bugs and bug fixes are present in this release. The release is in the form of bfu archives, so follow the usual procedure to install them.

New Features

This release of nfs4trace provides a set of probes for every NFSv4 operation. There is an overarching probe for the compound op, called "op-compound". Each op within a compound has its own probe, e.g. "op-lookup".

For every probe, args[0] is a pointer to an nfs4_dtrace_info_t. This has the following fields:

dt_xid     RPC transaction ID. You can use this to correlate "start" with "done".
dt_cr      Pointer to the credential structure used for this operation. Use args[0]->dt_cr->cr_uid to get the user ID.
dt_tag     NFSv4 tag field associated with this over-the-wire call. This is entirely implementation defined, and is meant simply to be human readable.
dt_status  The status of the entire compound operation.
dt_addr    The IP address of the "other end" of the request. For a client, the "other end" is the server; for a server, the "other end" is the client.
dt_netid   The network ID of the "other end" of the request. The network ID will be either "tcp" for IPv4 or "tcp6" for IPv6.

Besides the normal compound operations, there is a probe for all callback-related operations. The probe "cb-compound" is analogous to "op-compound", but covers the callback channel. Each operation within a callback, e.g. OP_CB_GETATTR, also has a probe -- for this example, "cb-getattr".

To see a complete list of all probes, run dtrace -P nfs4c -l for client probes, and dtrace -P nfs4s -l for server probes.

What's Left?

Eventually, there will be probes for every nontrivial attribute, which will fire regardless of which operation is using them. But not all of these attributes are accounted for yet. Stay tuned.

What else is left? Your feedback will help decide this! Please try this, and let me know if you have any input. Thanks!

Monday Nov 21, 2005

IETF, ZFS and DTrace

What have I been up to?

I recently traveled to beautiful Vancouver for the 64th IETF. There, Lisa and I attended many meetings, most notably the NFSv4 working group meeting, and presented our recent Internet Draft.

The Internet Draft concerns NFSv4 ACLs. It attempts to clear up ambiguities in the NFSv4 spec, RFC 3530. It also proposes a way for ACLs and UNIX-style modes to live together in harmony. Hopefully, it will end up as one or more RFCs, and NFSv4 clients and servers can have a truly useful ACL model.

Meanwhile, as has been mentioned in many other places, ZFS has been released. Check out blog entries from Lisa and Mark on how ACLs work in ZFS.

I also gave a presentation on DTrace at a joint meeting of the Front Range OpenSolaris User Group (FROSUG) and the Front Range Unix User Group (FRUUG).

Why was I, a humble NFS engineer, giving a presentation on DTrace? Well, for one thing, I use DTrace quite a bit. But I'll be blogging more about DTrace and NFS a bit later.

Tuesday Jun 14, 2005

ACLs Everywhere

I was tempted to call this posting ACLs are my resume, but that's a bit extreme. Still, Access Control Lists (ACLs) seem to follow me around. Here are some of the places where I have worked on ACLs.

  • I hacked ufsdump/ufsrestore to deal with UFS ACLs
  • I enabled caching of ACLs in CacheFS
  • I made a modification of disconnected CacheFS to allow ACLs to work while disconnected
  • I worked on a very different sort of application-level ACLs at a failed startup
  • I now work on ACLs for NFS version 4.

Solaris 10

Lisa and I managed to integrate support for NFSv4 ACLs into Solaris 10. The effort to add ACL support began late in the Solaris 10 release cycle. Some of the problems we hit (outlined below) weren't even thought about when we began. We couldn't do much testing against other vendors until very late in the release cycle. There were a few showstopper bugs we had to fix before Solaris 10 could ship. This was one of the most intense but rewarding projects I've worked on!

NFSv4 ACL support breaks down into two big pieces: support for the over-the-wire operations involving ACLs, and translation between the various ACL models (more on them below). The over-the-wire pieces of ACL handling are scattered throughout NFSv4. nfs4_vnops.c has the usual vnode ops. See nfs4_getsecattr() and nfs4_setsecattr() for the front end (as far as the file system is concerned) to ACLs. Other pieces are in nfs4_client.c and nfs4_xdr.c. The translators are contained in nfs4_acl.c. For this article, I will focus on the translators.


Solaris has had support for ACLs for a long time. The ACL model supported before Solaris 10 is called POSIX-draft. This was supposed to become a POSIX standard, but the effort was abandoned. The latest draft is what was implemented for Solaris. For on-disk file systems, the Solaris UFS filesystem implements POSIX-draft ACLs. For versions two and three of the Network File System (NFS), an undocumented side-band protocol enables users to manipulate ACLs on the server. To the best of my knowledge, the only implementation of this protocol outside of Solaris is for Linux.

NFSv4 introduces a powerful new ACL model. It's powerful enough that every POSIX-draft ACL can be translated into an NFSv4 ACL. But NFSv4 ACLs can go beyond POSIX-draft semantics; thus, not all NFSv4 ACLs can be translated into POSIX-draft ACLs.

The presence of two ACL models makes it desirable to seamlessly translate between the two, so we implemented ACL translation in the kernel for NFSv4. Translation gives us many benefits:

  • POSIX-draft ACLs can be sent over-the-wire using the NFSv4 protocol. Thus, we don't have to bolt on another undocumented side-band ACL protocol for NFS version 4.
  • Users on Solaris clients can manipulate POSIX-draft ACLs (or perhaps the semantically equivalent NFSv4 ACLs) on non-Solaris servers.
  • Utilities that restore POSIX-draft ACLs (e.g. tar, cpio, ufsrestore) can restore semantically equivalent ACLs onto an NFSv4 filesystem, even if the server's filesystem is not Solaris UFS.
  • Non-Solaris NFSv4 clients can manipulate ACLs on a present-day Solaris server (using UFS), provided they stay within the confines of ACLs that can be translated into POSIX-draft.

At Connectathon 2005, we gave a presentation on implementing NFSv4 ACLs in Solaris 10. However, now that Solaris is open, I would like to talk about the translators at the code level.

Translating POSIX-draft ACLs into NFSv4 ACLs

In the Solaris kernel, ACLs are passed around inside of vsecattr_t structures. The main entry point for translating POSIX-draft ACLs into NFSv4 ACLs is vs_aent_to_ace4(). Here is what the call stack looks like when a POSIX-draft ACL is translated into an NFSv4 ACL.

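vs_aent_to_ace4()
  ln_aent_to_ace4() (once for the regular ACL, once for the default ACL)
    ln_aent_preprocess()
    for every ACE:
      mode_to_ace4_access()
        access_mask_set()
      ace4_make_deny()
        access_mask_set()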

  1. vs_aent_to_ace4() sets up new vsecattr_t's to hold the new ACLs. It calls ln_aent_to_ace4() on each of the POSIX-draft ACL parts: the ACL, and the "default" (inheritable) ACL.
  2. ln_aent_to_ace4() does the main part of the conversion. First it calls ln_aent_preprocess(), to scan the incoming ACL to find various statistics (e.g. how many group entries).
  3. For every ACL entry, ln_aent_to_ace4() determines what type of ACEs to create. It calls mode_to_ace4_access() which creates an access mask to be used in an ALLOW ACE.
  4. ln_aent_to_ace4() also calls ace4_make_deny(), to make DENY ACEs that complement the ALLOW ACEs.
  5. Both mode_to_ace4_access() and ace4_make_deny() call access_mask_set(). This function determines the values of several access mask bits, beyond the usual read/write/execute. Its intent is to do so in a generic manner. In order to understand its operation, see the global variables nfs4_acl_{client,server}_produce.
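
The core idea behind steps 3 and 4 can be sketched in a few lines of C. This is a simplified illustration, not the code in nfs4_acl.c: the sketch_* names are hypothetical, the "who" field of an ACE is omitted, and only the read/write/execute-related mask bits are considered, whereas the real access_mask_set() also decides several other bits. The ACE4_* values are the ones defined in RFC 3530. Mapping "w" to both WRITE_DATA and APPEND_DATA is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Access mask bits and ACE types, with the values defined in RFC 3530. */
#define	ACE4_READ_DATA			0x00000001U
#define	ACE4_WRITE_DATA			0x00000002U
#define	ACE4_APPEND_DATA		0x00000004U
#define	ACE4_EXECUTE			0x00000020U
#define	ACE4_ACCESS_ALLOWED_ACE_TYPE	0U
#define	ACE4_ACCESS_DENIED_ACE_TYPE	1U

/* The only bits this sketch reasons about (an assumption of the sketch). */
#define	SKETCH_RWX_BITS \
	(ACE4_READ_DATA | ACE4_WRITE_DATA | ACE4_APPEND_DATA | ACE4_EXECUTE)

/* Hypothetical, stripped-down ACE: type and access mask only. */
typedef struct sketch_ace {
	uint32_t sa_type;
	uint32_t sa_mask;
} sketch_ace_t;

/* Map POSIX-draft permission bits (4 = r, 2 = w, 1 = x) to an access mask. */
uint32_t
sketch_perm_to_mask(unsigned int perm)
{
	uint32_t mask = 0;

	if (perm & 4)
		mask |= ACE4_READ_DATA;
	if (perm & 2)
		mask |= ACE4_WRITE_DATA | ACE4_APPEND_DATA;
	if (perm & 1)
		mask |= ACE4_EXECUTE;
	return (mask);
}

/* Build an ALLOW ACE plus the DENY ACE covering whatever was not allowed. */
void
sketch_perm_to_aces(unsigned int perm, sketch_ace_t *allow, sketch_ace_t *deny)
{
	assert(allow != NULL && deny != NULL);
	allow->sa_type = ACE4_ACCESS_ALLOWED_ACE_TYPE;
	allow->sa_mask = sketch_perm_to_mask(perm);
	deny->sa_type = ACE4_ACCESS_DENIED_ACE_TYPE;
	deny->sa_mask = SKETCH_RWX_BITS & ~allow->sa_mask;
}
```

For an "r-x" entry (perm 5), the ALLOW mask is READ_DATA plus EXECUTE and the DENY mask is WRITE_DATA plus APPEND_DATA. The two masks never overlap, and together they cover every bit the sketch knows about.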

Translating NFSv4 ACLs into POSIX-draft ACLs

The entry point for translating NFSv4 ACLs into POSIX-draft ACLs is vs_ace4_to_aent(). Here is what the call stack looks like when making this translation.


  1. vs_ace4_to_aent() is a very thin front-end to ln_ace4_to_aent(). It doesn't do much more than call this function, and pass on errors.
  2. The first thing ln_ace4_to_aent() does is to call ace4_list_init() twice: once to hold information for the POSIX-draft ACL, and once for the POSIX-draft "default" (inheritable) ACL. The ace4_list_t holds all the information collected from an NFSv4 ACL needed in order to transform it into a POSIX-draft ACL.
  3. ace4_to_aent_legal() is called to look for common problems in the NFSv4 ACL. Some of these problems could come from a bad NFSv4 ACL, but most often, the problem is an NFSv4 ACL that is not translatable into a POSIX-draft ACL.
  4. Notice that ace4_to_aent_legal() calls access_mask_check(). This function is the counterpart to access_mask_set(), used when translating in the other direction. In a Solaris-only environment, you can think of access_mask_check() as a function to check the work done by access_mask_set().
  5. If ace4_to_aent_legal() does not find any problems, ln_ace4_to_aent() processes each ACE in the NFSv4 ACL. There are still plenty of places where translation can fail. But if all goes well, ace4vals_find() is called to find a place to store information from the ACE, which will be used in producing the POSIX-draft ACL.
  6. After all of the ACEs are processed, ace4_list_to_aent() converts the ace4_list_t into a POSIX-draft ACL. This is called once for the "normal" ACL, and once for the "default" ACL.
  7. Finally, ace4_list_free() frees the memory allocated to hold the translations in progress.
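
To illustrate why this direction can fail, here is a small C sketch of the kind of check an access mask has to pass before it can become POSIX-draft permission bits. It is not the real access_mask_check(): the function name and the rules are assumptions for the example, which accepts only the read/write/execute-related bits and requires WRITE_DATA and APPEND_DATA to be set together. The ACE4_* values are the ones defined in RFC 3530.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Access mask bits, with the values defined in RFC 3530. */
#define	ACE4_READ_DATA		0x00000001U
#define	ACE4_WRITE_DATA		0x00000002U
#define	ACE4_APPEND_DATA	0x00000004U
#define	ACE4_EXECUTE		0x00000020U

/* The only bits this sketch is willing to translate (an assumption). */
#define	SKETCH_RWX_BITS \
	(ACE4_READ_DATA | ACE4_WRITE_DATA | ACE4_APPEND_DATA | ACE4_EXECUTE)

/*
 * Try to turn an NFSv4 access mask into POSIX-draft permission bits
 * (4 = r, 2 = w, 1 = x). Returns false if the mask has no POSIX-draft
 * equivalent under the rules of this sketch.
 */
bool
sketch_mask_to_perm(uint32_t mask, unsigned int *permp)
{
	bool wr = (mask & ACE4_WRITE_DATA) != 0;
	bool ap = (mask & ACE4_APPEND_DATA) != 0;
	unsigned int perm = 0;

	assert(permp != NULL);
	if (mask & ~SKETCH_RWX_BITS)
		return (false);	/* bits beyond read/write/execute */
	if (wr != ap)
		return (false);	/* "write without append" has no rwx form */

	if (mask & ACE4_READ_DATA)
		perm |= 4;
	if (wr)
		perm |= 2;
	if (mask & ACE4_EXECUTE)
		perm |= 1;
	*permp = perm;
	return (true);
}
```

A mask of READ_DATA plus EXECUTE translates to "r-x" (5), while WRITE_DATA on its own is rejected. That is the flavor of NFSv4 ACL that is perfectly legal but not expressible as a POSIX-draft ACL.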

To see where things go wrong, you can turn on nfs4_acl_debug. The debugging code isn't very complete, and it might be replaced by static DTrace probes some day. But for now, you can do this:

# mdb -kw
> nfs4_acl_debug/W 1

Future ACL support in Solaris

As you can see from reading the code, and perhaps from all of the debugging prints, there are lots of things that can go wrong when translating ACLs from NFSv4 into POSIX-draft format. Besides that, wouldn't it be nice to be able to set an arbitrary NFSv4 ACL, and use the new ACL model?

Things are getting much better with the arrival of ZFS. The goal of ZFS's ACL implementation is to implement NFSv4 ACLs in a way that is compatible with Solaris. The ZFS ACL model is still in flux, but it is rapidly solidifying.

We will be releasing an Internet Draft in the near future, in which we will propose a way for UNIX and UNIX-like systems to support NFSv4 ACLs. If all goes well, this will be the ACL model used by ZFS.
