Wednesday Jun 27, 2007

NFSv4.1's pNFS for Solaris

I have been involved in the IETF NFSv4 working group from almost the very beginning. One of the things that I have really enjoyed about the experience is the NFS community that has come together to move NFS forward in the form of NFSv4 and now NFSv4.1. I suppose the level of cooperation has grown out of the 20+ years of Connectathon testing; a place where the engineers get together to find and fix the interoperability bugs. That tradition continued with bakeathon events -- they started when NFSv4 was being built; they proved to be very useful in working on interpretation and interoperability bugs while the protocol was still being defined. Mike describes the most recent bakeathon we hosted in Austin. Very successful; as they always are.

Mike does a good job of reviewing what NFSv4.1's pNFS is and can accomplish. I also agree with him about the narrow definition of HPC and the need for pNFS in many areas that may not be apparent at first blush.

In any case, Mike goes on to describe what Netapp is doing with pNFS. All good stuff. There has been increasing press about pNFS lately. That is a good thing but I hope that as we move forward that we maintain the NFS community's shared ownership and responsibility to building a protocol that is useful and implementations that are even more so.

I have included my brief summary of NFSv4.1 and pNFS and a little about the approach we are taking in building pNFS for Solaris. More details will be showing up on the OpenSolaris pNFS project page. In fact, the most recent news is that we have provided binaries and source matching what we tested at the bakeathon.

Have fun parallelizing your data!

Introduction to Parallel NFS (pNFS)

The IETF NFSv4 working group has been building the first minor version of NFSv4 (NFSv4.1). The current draft of the proposed NFSv4.1 protocol has made significant progress in recent months and is now considered "functionally complete". To assist in bringing the Internet Draft to closure (working group last call), the document editors have been hosting a set of formal reviews that will continue through the summer of 2007. It is expected that the Working Group will complete the document around the end of 2007.

NFSv4's protocol definition defined a set of rules for future minor version development. It was well understood that there would be a need to modify or update NFSv4.0 at some point; hence the minor version rules. Given the ability to update the protocol, the following functional enhancements or additions are currently defined for NFSv4.1; note, this is not an exhaustive list but captures the main set of changes.

  • Sessions
  • Provides a framework for client and server such that "exactly once semantics" can be achieved at the NFS application level. Sessions also assists in mandating the availability of a callback channel (backchannel) from server to client.

  • Directory Delegations and Notifications
  • Allows for effective caching of directory contents and delegation management.

  • Multi-server Namespace (Namespace extensions)
  • Allow a namespace to extend beyond the boundaries of a single server.

  • ACL Inheritance
  • Allows propagation of access permissions and restrictions down a directory tree as file system objects are created.

  • Retention Attributes
  • File object can be placed in an immutable, undeletable, unrenamable state for a fixed or infinite duration of time.

  • Security Negotiation
  • Corrects the NFSv4.0 security negotiation mechanism.

  • Parallel NFS (pNFS)
  • See below.

Functionality introduced or modified in a minor version must be optional to implement. This is true for all of the changes in NFSv4.1 with the exception of Sessions. For NFSv4.1, Sessions is a mandatory-to-implement feature. The basic functionality that Sessions provides has become an integral part of the other features of NFSv4.1; the consensus was that it was undesirable to have a version of each piece of NFSv4.1 functionality defined both with and without Sessions support. Therefore, Sessions is mandatory to implement for NFSv4.1.
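The "exactly once semantics" that Sessions provides can be pictured as a per-slot reply cache. Here is a toy sketch (a single slot is assumed, and all names are hypothetical, not from the protocol) of how a retransmitted request is answered from the cache rather than re-executed:

```shell
# Toy sketch of Sessions "exactly once semantics" with one slot:
# a request carrying the same sequence id as the last one seen is a
# retransmission and is answered from the reply cache, not re-executed.
last_seqid=0
last_reply=""
exec_count=0

request() {
    seqid=$1
    if [ "$seqid" -eq "$last_seqid" ]; then
        reply=$last_reply                  # replay: return cached reply
    else
        exec_count=$((exec_count + 1))     # only new requests execute
        reply="result-for-seq-$seqid"
        last_seqid=$seqid
        last_reply=$reply
    fi
}

request 1; echo "$reply"    # executed
request 1; echo "$reply"    # retransmission: same reply, no re-execution
request 2; echo "$reply"    # new request
echo "executions: $exec_count"
```

The point of the sketch is the counter: three requests arrive but only two executions occur, which is exactly the at-most-once guarantee the real reply cache gives to non-idempotent operations.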

The Parallel NFS or pNFS functionality, as its name implies, is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the meta-data (names and attributes) of a filesystem from the location of the file data; it goes beyond simple name/data separation to define a method of striping the data amongst a set of data servers. This is obviously very different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server. Multi-node NFS server products do exist, but they allow only limited participation by the client in the separation of meta-data and data. The NFSv4.1 client can be a direct participant in locating file data and can thus avoid sole interaction with a single NFS server when moving data.

The NFSv4.1 pNFS server is now a collection or community of server resources or components; these community members are assumed to be controlled by the meta-data server.

The pNFS client still accesses a single meta-data server for traversal or interaction with the namespace; when the client moves data to and from the server it may be directly interacting with the set of data servers belonging to the pNFS server community.

     | pNFS Server                                    |
     |                                                |
     | .--------------    .--------------             |
     | |data-server  |    |data-server  |             |
     | |             |    |             |             |
     | `-.------------    `-----.--------             |
     |   |     .--------------  |  .--------------    |
     |   |     |data-server  |  |  |data-server  |    |
     |   |     |             |  |  |             |    |
     |   |     `------.-------  |  `------.-------    |
     |  _|____________|_________|_________|__________ |
     |                       |                        |
     |             ,---------'-----------             |
     |             | meta-data server   |             |
     |             |____________________|             |
         |      |        |             |       |
           |               |                     |
           |               |                     |
     .-----+-----.   .-----+-----.         .-----+-----.
     |           |   |           |         |           |
     |pNFS Client|   |pNFS Client|  ....   |pNFS Client|
     |           |   |           |         |           |
     `-----------'   `-----------'         `-----------'

As mentioned, the user's view of the "pNFS Server" continues to appear on the network as a regular NFS server even though there are multiple, distinct and addressable components of the server. There is a single server from which a filesystem is mounted and accessed. The administrator knows the details of the "pNFS Server" or community. The pNFS Client implementation will know the details of the pNFS server through NFSv4.1 protocol interaction. When it comes to things like mount points and automount maps, the look and feel of the NFS server is the same as it has been: single server name and its self-contained namespace.

The pNFS enabled client determines the location of file data by directly querying the pNFS server. In pNFS nomenclature, the client requests a file "layout". When a file is opened, the pNFS client will ask the meta-data server for the file's layout. If it is available, the server will give the layout to the client. When the client moves data, the layout is consulted to determine the data-server(s) upon which the data resides; once the offset and range are matched to the appropriate data-server(s), the data movement is accomplished with read and write requests. The pNFS protocol's layout definition provides for straightforward striping of data only. There is one twist to the striping -- the location may be specified by two paths, thus allowing for a simple multi-pathing feature.
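The offset-and-range matching that the layout enables amounts to simple stripe arithmetic. A small sketch (the stripe unit and server count below are assumptions for illustration; a real layout carries its own values):

```shell
# Sketch: mapping a byte offset to a data server under simple striping.
# stripe_unit and num_ds are hypothetical values, not protocol constants.
stripe_unit=65536      # 64 KB stripe unit (assumption)
num_ds=4               # four data servers in the community (assumption)

ds_for_offset() {
    # Which data server holds the stripe unit containing byte offset $1?
    echo $(( ($1 / stripe_unit) % num_ds ))
}

ds_for_offset 0        # stripe unit 0 -> data server 0
ds_for_offset 65536    # stripe unit 1 -> data server 1
ds_for_offset 262144   # stripe unit 4 wraps around -> data server 0
```

With this mapping in hand, a read or write spanning several stripe units naturally decomposes into requests to several data servers, which is the source of the parallelism.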

With the layout in hand, the pNFS client is then enabled to generate read/write requests in parallel or by a method of its own choice. The layout is thus a simple enablement for the pNFS client to increase its overall data throughput. The pNFS server is also a beneficiary, by the nature of horizontal scaling for data access along with the reduction of read/write operations directly serviced by the meta-data server. Obviously, the main purpose or intent of the NFSv4.1 pNFS feature is to significantly improve the data throughput capabilities of NFS servers. The NFSv4.1 protocol requires that the meta-data server always be able to service read/write requests itself. This requirement allows for NFSv4.1 clients that are not enabled for pNFS, or for cases in which the available parallelism is not required.

The NFSv4.1 protocol defines interaction between client and server. There is no specification for the interaction between components of the pNFS server. This interaction or coordination of the pNFS server community members is left as a pNFS server implementation detail. Given the lack of an open protocol definition, pNFS server components will be homogeneous in their implementation. This isn't necessarily a bad thing, since there is already a variety of server filesystem architectures present in the NFS server market. The lack of protocol definition allows for the most effective reuse of existing filesystem and server technology. Obviously, there is a well-defined set of requirements or expectations of the meta-data and data servers in the form of the NFSv4.1 protocol.

Maintaining the theme of inclusiveness, the pNFS protocol allows for a variety of data movement or transfer methods between the client and pNFS server. The NFSv4.1 layout mechanism defines layout "types". The types are then defined as a particular data movement or transport protocol. The layout mechanism also allows for inclusion of newly defined types such that the NFSv4.1 protocol can adapt to future requirements or ideas.

There are three types of layouts currently being defined for pNFS; they are generically referred to as: files, objects, blocks. The "files" layout type uses the NFSv4.1 read/write operations to the data-server. The "files" type is being defined within the NFSv4.1 Internet Draft. The "objects" layout type refers to the OSD storage protocol as defined by the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The "blocks" layout refers to the use of SCSI (in its many forms). The pNFS OSD and block layout definitions are defined in separate Internet Drafts.

For additional detail, the current Internet Drafts for the items mentioned above are:

NFSv4 Minor Version 1

Object-based pNFS Operations

pNFS Block/Volume Layout

Solaris pNFS

The initial instantiation of NFSv4.1 for Solaris will deliver the Sessions and pNFS functionality using the files layout type.

The Transports

With the introduction of NFSv4.1, there will be no change in the network transports available to the client or server. The kernel RPC interfaces will continue to provide TCP and RDMA (in the form of InfiniBand) network connectivity. The current RPCSEC_GSS mechanisms will continue to be supported as well.

The Client

The Solaris NFSv4.1 client will be a straightforward implementation of the Sessions and pNFS capabilities. The administrative model for the client will remain the same as it is today. As mentioned in the introduction, the client will continue to mount from a single server and provide a path (e.g. server.domain:/path/name).

Since NFSv4.1 constitutes a new version of the protocol, the client will negotiate the use of NFSv4.1 with the server as it has done in the past. The client will have a preference for the highest version of NFS offered by the server. As the client accesses filesystems it will query the server for the availability of the pNFS functionality. Then, on OPENing a file, the client attempts to obtain a layout and then uses the layout information when READing/WRITEing and COMMITing data to the "server" to provide data access parallelism.

The Server

As mentioned in the introduction, the NFSv4.1 protocol does not define the architecture of the pNFS server; only the outward facing protocol and behavior is defined. Given this flexibility, a multitude of architectures can fit the model of pNFS service. For the Solaris pNFS server, a straightforward model will be used. Each member of the pNFS server community (meta-data server and data servers) is to be thought of as a self-contained storage unit.

Meta-data server

For the Solaris pNFS server, there is one meta-data server. The meta-data server may be configured with a high-availability component that allows for active/passive availability. The Solaris pNFS meta-data server looks and feels like a regular NFS server. There are extensions to the management model to integrate the use of the data servers. The meta-data server is in full control of the pNFS server in the sense that it decides which data servers to utilize (initial inclusion and allocations for new file layouts).

The meta-data server will require the use of ZFS as its underlying filesystem for storage of the filesystem namespace and attributes. Architecturally, other filesystems may be used but they must provide like-functionality for use by the pNFS meta-data server; the most important features are NFSv4 ACL support and system extended attributes.

Data server

The pNFS data servers do not need a traditional filesystem namespace for associating names to data given that the meta-data server provides this service by definition. The data server will then be free to associate the file objects that are to be stored upon them with their filehandle (the identifier shared between the data server and meta-data server) in any fashion appropriate. One may think that it is a requirement to have the data server mimic the filesystem namespace of the meta-data server but this is untrue. In fact, the mimicking of the meta-data namespace would prove to be cumbersome for regular NFS server operation. The Solaris pNFS data server will have an architecture that will allow for the direct use of ZFS pools for file data storage.
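The name-free association of filehandles with stored objects can be sketched as a flat object store: data is filed away under the filehandle identifier alone, with no directory tree. (The filehandle strings and helper names below are hypothetical, purely for illustration of the idea.)

```shell
# Sketch: a data server storing file data keyed only by filehandle,
# with no directory namespace -- a flat object store.
# Names and filehandle values here are hypothetical.
store=$(mktemp -d)

put_object() {    # put_object <filehandle> <data>
    printf '%s' "$2" > "$store/$1"
}

get_object() {    # get_object <filehandle>
    cat "$store/$1"
}

put_object fh-0x1a2b "file data for some object"
get_object fh-0x1a2b
```

Because the meta-data server owns the names, the data server needs nothing more than this kind of identifier-to-data mapping; in the Solaris design that mapping sits directly on top of ZFS pools.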

Network Requirement

The pNFS diagram above implies that there is a need for two separate networks; one internal to the pNFS community and one for client access to the pNFS server. This is but a possibility -- a logical representation, not a requirement. The only requirement with respect to network configuration is that each component within the diagram above be addressable or routable to each other. Therefore, there can be a variety of networking technologies or topologies employed for the pNFS server. The choice of topology or interconnect will be based on the workload being served by the pNFS server.

Coordination of the pNFS Community

To this point, we have a Solaris pNFS client interacting with a pNFS server over a flexible network configuration. The meta-data server is using ZFS as the underlying filesystem for the "regular" filesystem information (names, directories, attributes). The data server is using the ZFS pool to organize the data for the various layouts the meta-data server is handing out to the clients. What is coordinating all of the pNFS community members?

pNFS Control Protocol

The control protocol is the piece of the pNFS server solution that is left to the various implementations to define. Since the Solaris pNFS solution is taking a fairly straightforward approach to the construction of the pNFS community, this allows for the use of ZFS' special combination of features to organize the attached storage devices. This will allow for the control protocol to focus on higher level control of the pNFS community members.

Some of the highlights of the control protocol are:

  • Meta-data and data server reboot / network partition indication
  • Filehandle, file state, and layout validation
  • Reporting of data server resources
  • Inter data server data movement
  • Meta-data server proxy I/O
  • Data server state invalidation

In Summary

NFSv4.1 will deliver a range of new features. The two initially addressed for Solaris will be Sessions and pNFS. The Solaris pNFS client will be a simple implementation of the protocol. The Solaris pNFS server will layer on top of existing Solaris technologies to offer a feature rich solution.

Tuesday May 08, 2007

The FAB NFS/ZFS Appliance

I recently served as photog, and as occasional assembly assist, for a friend's shiny new Ultra 40. He put together a nice little package for an in-home NFS/ZFS Appliance. Check it out; great use of Comic Life and of the Ultra 40.

Monday Apr 30, 2007

NFSv4.1, pNFS and a demo (oh, my!)

One of my day time jobs is that of co-chair of the IETF NFSv4 working group. The working group has been busy for "awhile" in building the first minor version of NFS version 4 (NFSv4.1).

Well, the working group has made very good progress as of late and my co-editors (Mike Eisler and David Noveck) and I have been hosting a series of formal reviews of the Internet Draft in an attempt to flush out subtleties in the protocol that have yet to be noticed. We are making very good progress with the help of many people within the working group (Thanks everyone!). It is a lot of work to consume the NFSv4.1 Internet Draft and to make considered, helpful comments on how to improve it. I very much appreciate it.

In any case, NFSv4.1 adds many needed features. The areas of special note are:

  • Sessions
  • Directory Delegations
  • ACL inheritance
  • Namespace Extensions
  • pNFS (parallel NFS)

I will return another day to describe more about the various things on this list but today's focus is the pNFS (parallel NFS) feature.

A brief description of pNFS goes something like this:

pNFS (parallel NFS) is a distributed, parallel file system which provides a highly scalable data management solution through:

  1. Parallel data transfers across many NFSv4.1 data servers
  2. A single, unified namespace for all objects in the pNFS file system

Sounds very formal, doesn't it; heh. In any case, the ideas behind the pNFS functionality are not new to filesystem technology; however, this is the first embodiment in an openly defined protocol.

The OpenSolaris pNFS team has put together a quick demo of our pNFS prototype. Check it out and then follow along over at the NFSv41 OpenSolaris project page.

Wednesday Mar 21, 2007

YANFS is the new WebNFS (and is opensource as well)

I would like to announce that what was once called WebNFS has been renamed and released as YANFS and is now opensource.

So, why the change in name? The WebNFS name has been used in the description of two sets of work. The first use of WebNFS was in the context of a set of NFS protocol interpretations.

The second use of WebNFS has been by Sun to describe a Java implementation of the XDR, RPC, and NFSv2/NFSv3 (client) protocols. It was my opinion that the WebNFS name really should remain as a reference to the protocol interpretations and not a particular implementation of those and other protocols.

Hence the new name: YANFS (Yet Another NFS).

The other question that is likely to arise is "why now?". Why, after all these years, would Sun opensource the Java NFS implementation? Well, with the upcoming NFSv4.1 protocol there has been additional interest in its various features, and in particular the pNFS feature. I had received a couple of queries about our Java NFS implementation from non-Sun developers who wanted to take advantage of the work that had been done within Sun. So, I have been able to work through the internal process issues to release a bit of source to help kick-start a Java implementation of an NFSv4.1 client and server.

I have created a new project for YANFS. The original set of source is just the XDR, RPC, and NFSv2/NFSv3 client classes. It also includes the XFile APIs. First thing on my list is to do a little cleanup of the original code and classes. Once that is complete, I will be sorting through some of the prototype code that is around that implements a Java NFS server so there is a client and server base to work from. Once that is complete, I should be moving on to the NFSv4.1 implementation.

Given the original queries about our WebNFS implementation, I know there is an interest in working on NFSv4.1 in Java. If that is you, join the YANFS project and help out.

There were many people involved in the creation of the original WebNFS implementation but one name is completely obvious if you followed the RFC links above: Brent Callaghan. He drove the WebNFS development work for a long time within Sun. Thanks, Brent!

I would also like to take an opportunity to recognize Mike Eisler for pushing the XDR protocol documents forward within the IETF process; it is now an Internet Standard. Thanks, Mike!

Friday May 27, 2005

Port 623 or the mount from the NFS client hangs

Where-'o-where have my ACKs gone. I am using port 623 and my ACKs are nowhere to be found. My NFS client is attempting to mount and the server is not responding. The server is there and it likes other applications; why not my friendly NFS client? The ACKs have been pulled into a black hole; they are on the network but not in my networking stack.

What is going on?

Someone reported a problem, intermittent of course, where the NFS client would be happy for awhile and then attempt to mount from a server and it would just hang. Nothing from the client; no message, no error, nothing. A little network tracing with snoop showed that the client was in the middle of establishing a TCP connection with the server. However, the server never responded. Must be a problem with the NFS server. Go over to the server and do a little snooping there and WHAT! The server is receiving the SYN and dutifully responding with its ACK. Well, where is it going? Back to the client?

Snoop at the client should be showing me everything that is coming in; I used a little DTrace to poke around to double-check that packets were not being dropped further down the stack.... nothing. Someone happened to have a port set up on the switch to which the client was attached and was able to capture the network traffic going to the client's port, and the ACKs were in fact making it to the client. How weird.

Along the way, connectivity to the server was tested via other means. Ping, telnet, ssh. Everything else was working except this silly NFS client mount.

Well, along the way, after my mind had blown a fuse as to why the client wasn't seeing a response, I had asked for help from others. Someone more perceptive than I took notice of the port number the NFS client had chosen for its connection attempt. Port 623. A quick google search for "NFS port 623" and this blog entry pops up. Interesting. Intel network interfaces and port 623. Well, this is the source of the problem I have been looking at.

Turns out the NFS client is in fact a Sun Fire V65x, which uses an Intel system board and network interface. If one looks closely at the spec sheet for this system, it says "Remote Management: IPMI and DMI compliant Management Service Processor". Well, that's interesting.

Let's go look for port 623 over at IANA and see what they have to say about this port. The assignment says:

asf-rmcp        623/tcp    ASF Remote Management and Control Protocol
asf-rmcp        623/udp    ASF Remote Management and Control Protocol

So it seems that ASF expands to "Alert Standard Format" and, while I don't know the lineage, ASF has transformed itself into IPMI or "Intelligent Platform Management Interface". In any case, port 623 has been assigned for this purpose and there must be something going on.

A quick check with the user that reported the problem and sure enough, the IPMI feature is configured on the NFS client in question and a quick check with the ipmiadm tool confirms it. So, the ACKs being sent from the server are being sucked into the black hole of the client's IPMI service processor and the client is none the wiser.

So now that we know that the client has been configured to use IPMI and that anything destined for that port on the client will never make it into the client's networking stack, what can be done?

The first thing that can be done is to disable the IPMI feature on the system. In this case, that was not possible; the feature was being used for remote power management. Second would be to convince the client to avoid using that port number. There are two ways of doing that in the Solaris client. The first is to tell the NFS client to avoid using "reserved" or low port numbers. This is done by setting a kernel variable in /etc/system with an entry something like this:

set rpcmod:clnt_cots_do_bindresvport=0x0 

Upon reboot, the NFS client will not use reserved ports. Of course, there may be NFS servers that require reserved port usage (how lame), so this is not always a good workaround.
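The tunable matters because 623 falls inside the "reserved" range the client binds from by default. A trivial range check shows why (1024 is the conventional reserved-port boundary):

```shell
# Reserved ("privileged") ports are those below 1024; with
# clnt_cots_do_bindresvport enabled, the client binds in that range,
# which is how its connection attempt can land on port 623.
is_reserved() {
    if [ "$1" -lt 1024 ]; then echo yes; else echo no; fi
}

is_reserved 623     # within reach of the default binding policy
is_reserved 3049    # an ordinary non-reserved port
```

Clearing the tunable moves the client out of the sub-1024 range entirely, at the cost of upsetting servers that insist on reserved source ports.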

The second way to avoid using port 623 is to set up a dummy inetd service that will hold onto the port, thereby preventing the NFS client from using it. This can be done by adding the following to /etc/services:

rmcp            623/udp
rmcp            623/tcp

and then to place the following in /etc/inetd.conf.

rmcp    dgram   udp     wait    nobody  /bin/false
rmcp    stream  tcp     wait    nobody  /bin/false

After this is done and inetd is restarted/refreshed, port 623 on the client will be bound and the NFS client will be able to function once again.

Are there other conditions under which this can happen? Yes, one of the other scenarios is if there is an IPSEC configuration where a port has been configured for restricted access; while the IPSEC configuration is still in place, the application is no longer configured. This allows the NFS client to bind to the IPSEC restricted port and the same behavior ensues.

Thursday Apr 07, 2005

NFSv4 at Usenix '05


Presentation from Usenix now available here

So next week, April 10-15 to be exact, is Usenix '05 in Anaheim, CA. The program committee was generous and asked me for an invited talk on NFS version 4. If you have plans on attending Usenix, stop by. My plan is to cover the internals of the protocol itself with some of the motivation behind the feature set. I will cover where NFSv4 is available today and finish with a list of future enhancements that are being discussed in the IETF Working Group. Also, look for the Solaris BOFs. I should be at the BOFs and will certainly be available after the presentation for questions.

Monday Feb 07, 2005

Tunneling NFS traffic via ssh

Having received a couple of questions lately about Solaris support for ssh tunneling of NFS traffic, here is a short description of how it can be done and what happens to NFS when ssh tunneling is used.

First thing to do is to set up the tunnel on the NFS client:

# ssh -fN -L "3049:servername:2049" servername

This sets up the ssh tunnel with the client's local port of 3049 and connects to the NFS server named 'servername' on its port 2049 (the well-known NFS server port).

Remember that the server's sshd needs to be configured to allow port forwarding. If it is not in place, update /etc/ssh/sshd_config to ensure that the config file looks like this for the port-forwarding entry:

# Port forwarding
AllowTcpForwarding yes

To restart the sshd in Solaris 10, use the following after the config file change:

svcadm restart network/ssh

Once the tunnel is in place, NFS mount syntax looks like this:

# mount -o port=3049 nfs://localhost/serverpath /localpath

Note that the mount syntax is different than usual. The usual mount to the server would look like this:

# mount servername:/serverpath /localpath

Obviously, since the ssh tunnel is the preferred transport, the mount syntax might look like this:

# mount -o port=3049 localhost:/serverpath /localpath

The problem with this is that the traditional NFS version 3 client will use the MOUNT protocol to obtain the mapping between /serverpath and a filehandle. However, in Solaris, a method to redirect the use of the MOUNT protocol to an alternate port does not exist. Therefore, with the above syntax, the mount command will attempt to contact the server localhost for the MOUNT program, and this is done with regular RPC mechanisms (contact the rpcbind daemon, etc.). That doesn't work. Hence the use of the NFS URL.

The NFS URL usage directs the mount command to use only the NFS protocol to map the /serverpath to the filehandle. This will be done with the port specified in the port=3049 option. The unfortunate side effect of using the NFS URL is that NFS file locking will not be used. The reason for this is that the NFS version 2 and 3 locking support uses another RPC program; two actually: NLM and NSM. As with the MOUNT protocol usage, the locking requests won't be tunneled since there isn't an option to direct the requests via the ssh tunnel.

NFS version 4 is better

NFS version 4 may not be better for everything, but in this case, it clearly is.

With NFSv4, all of the various protocols (MOUNT, NSM, NLM, ACL) were folded into a single protocol with the use of a single port. This means that if something like ssh tunneling is used, all of the NFSv4 traffic will travel via the port specified in the mount options. So no worries about loss of file locking support or issues with the MOUNT protocol. The regular mount syntax

# mount -o port=3049 localhost:/serverpath /localpath

will work via the ssh tunnel on port 3049.

For compatibility with clients or servers that may not yet support NFSv4, the NFS URL syntax will work for NFSv4.

Note that there is one piece of functionality that will be lost for NFSv4 when using ssh tunnels: delegation support. NFSv4's file-level delegation support requires the use of a callback RPC program. As with the problems of MOUNT and NLM use in NFSv2/v3, NFSv4 file delegation will not work over ssh tunnels. This is an area that is being addressed in the IETF NFSv4 Working Group with the use of the SESSIONS extension that has been proposed for an NFSv4 minor version.


Restricting access to the server only?

One other thing that is nice about this approach: since the ssh tunnel re-writes the source address of incoming NFS requests, the NFS server will see the requests as if they were coming from the server itself (as a client). This means that the shared directories can be locked down to provide access to the local server only, and only via the ssh tunnels. For example, the /etc/dfs/dfstab could have an entry like:

share -F nfs -o rw=servername /export

and the server will only allow NFS access to /export for clients connected from ssh tunnels.



