By shepler on Jun 27, 2007
I have been involved in the IETF NFSv4 working group from almost the very beginning. One of the things that I have really enjoyed about the experience is the NFS community that has come together to move NFS forward in the form of NFSv4 and now NFSv4.1. I suppose the level of cooperation has grown out of the 20+ years of Connectathon testing; a place where the engineers get together to find and fix the interoperability bugs. That tradition continued with bakeathon events -- they started when NFSv4 was being built; they proved to be very useful in working on interpretation and interoperability bugs while the protocol was still being defined. Mike describes the most recent bakeathon we hosted in Austin. Very successful; as they always are.
Mike does a good job of reviewing what NFSv4.1's pNFS is and can accomplish. I also agree with him about the narrow definition of HPC and the need for pNFS in many areas that may not be apparent at first blush.
In any case, Mike goes on to describe what NetApp is doing with pNFS. All good stuff. There has been increasing press about pNFS lately. That is a good thing, but I hope that as we move forward we maintain the NFS community's shared ownership of, and responsibility for, building a protocol that is useful and implementations that are even more so.
I have included my brief summary of NFSv4.1 and pNFS and a little about the approach we are taking in building pNFS for Solaris. More details will be showing up on the OpenSolaris pNFS project page. In fact, the most recent news is that we have provided binaries and source matching what we tested at the bakeathon.
Have fun parallelizing your data!
Introduction to Parallel NFS (pNFS)
The IETF NFSv4 working group has been building the first minor version of NFSv4 (NFSv4.1). The current draft of the proposed NFSv4.1 protocol has made significant progress in recent months and is now considered "functionally complete". To assist in bringing the Internet Draft to closure (working group last call), the document editors have been hosting a set of formal reviews that will continue through the summer of 2007. It is expected that the Working Group will complete the document around the end of 2007.
NFSv4's protocol definition defined a set of rules for future minor version development. It was well understood that there would be a need to modify or update NFSv4.0 at some point; hence the minor version rules. Given the ability to update the protocol, the following functional enhancements or additions are currently defined for NFSv4.1; note, this is not an exhaustive list but captures the main set of changes.
- Sessions: Provides a framework for client and server such that "exactly once semantics" can be achieved at the NFS application level. Sessions also assists in mandating the availability of a
- Directory Delegations and Notifications: Allows for effective caching of directory contents and delegation management.
- Multi-server Namespace (Namespace extensions): Allows a namespace to extend beyond the boundaries of a single server.
- ACL, SACL, DACL: ACL inheritance allows propagation of access permissions and restrictions down a directory tree as file system objects are created.
- Retention Attributes: File objects can be placed in an immutable, undeletable, unrenamable state for a fixed or infinite duration of time.
- Corrects the NFSv4.0 security negotiation mechanism.
- Parallel NFS (pNFS)
Functionality introduced or modified in a minor version must be optional to implement. This is true for all of the changes in NFSv4.1 with the exception of Sessions. For NFSv4.1, Sessions is a mandatory-to-implement feature. The basic functionality that Sessions provides has become an integral part of the other features of NFSv4.1; the consensus was that it was undesirable to define each piece of NFSv4.1 functionality twice, once with and once without Sessions support.
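To make "exactly once semantics" a little more concrete, here is a minimal sketch of a reply cache keyed by slot and sequence number. The class name, the slot/sequence scheme as written here, and the string replies are assumptions for illustration; this is not the wire protocol or any real implementation. The point is only that a retransmitted request gets the cached reply back instead of being executed a second time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SessionReplyCache:
    """Hypothetical per-session reply cache; names are illustrative only."""

    # slot number -> (sequence number of the last request, its cached reply)
    slots: Dict[int, Tuple[int, str]] = field(default_factory=dict)

    def execute(self, slot: int, seq: int, operation: Callable[[], str]) -> str:
        last = self.slots.get(slot)
        if last is not None and seq == last[0]:
            # Retransmission of the previous request: replay the cached
            # reply instead of running the operation a second time.
            return last[1]
        expected = 1 if last is None else last[0] + 1
        if seq != expected:
            raise ValueError(f"slot {slot}: expected seq {expected}, got {seq}")
        reply = operation()
        self.slots[slot] = (seq, reply)
        return reply


# Usage: a non-idempotent operation retried after a lost reply runs once.
created: List[str] = []


def create_file() -> str:
    created.append("newfile")
    return "created newfile"


cache = SessionReplyCache()
first = cache.execute(slot=0, seq=1, operation=create_file)
retry = cache.execute(slot=0, seq=1, operation=create_file)
```

After the retry, `created` still holds a single entry, which is the behavior the other NFSv4.1 features can then rely on.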
The Parallel NFS or pNFS functionality, as its name implies, is a method of introducing data access parallelism. The NFSv4.1 protocol defines a method of separating the meta-data (names and attributes) of a filesystem from the location of the file data; it goes beyond simple name/data separation to define a method of striping the data amongst a set of data servers. This is obviously very different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server. Multi-node NFS server products exist, but they give the client only limited participation in the separation of meta-data and data. An NFSv4.1 client can be enabled to learn the exact location of file data and so avoid interacting solely with the single NFS server when moving data.
The NFSv4.1 pNFS server is now a collection or community of server resources or components; these community members are assumed to be controlled by the meta-data server.
The pNFS client still accesses a single meta-data server for traversal or interaction with the namespace; when the client moves data to and from the server it may be directly interacting with the set of data servers belonging to the pNFS server community.
pNFS
    --------------------------------------------------
    |                  pNFS Server                   |
    |                                                |
    | .--------------    .--------------             |
    | |data-server  |    |data-server  |             |
    | |             |    |             |             |
    | `-.------------    `-----.--------             |
    |   |     .--------------  |  .--------------    |
    |   |     |data-server  |  |  |data-server  |    |
    |   |     |             |  |  |             |    |
    |   |     `------.-------  |  `------.-------    |
    |  _|____________|_________|_________|__________ |
    |                       |                        |
    |             ,---------'-----------             |
    |             |  meta-data server  |             |
    |             |____________________|             |
    `---.------.--------.-------------.-------.-------
        |      |        |             |       |
    ____|______|________|_____________|_______|________
           |               |                    |
           |               |                    |
     .-----+-----.   .-----+-----.        .-----+-----.
     |           |   |           |        |           |
     |pNFS Client|   |pNFS Client|  ....  |pNFS Client|
     |           |   |           |        |           |
     `-----------'   `-----------'        `-----------'
As mentioned, the user's view of the "pNFS Server" continues to appear on the network as a regular NFS server even though there are multiple, distinct and addressable components of the server. There is a single server from which a filesystem is mounted and accessed. The administrator knows the details of the "pNFS Server" or community. The pNFS Client implementation will know the details of the pNFS server through NFSv4.1 protocol interaction. When it comes to things like mount points and automount maps, the look and feel of the NFS server is the same as it has been: single server name and its self-contained namespace.
The pNFS-enabled client determines the location of file data by directly querying the pNFS server. In pNFS nomenclature, the client requests a file "layout". When a file is opened, the pNFS client will ask the meta-data server for the file's layout. If it is available, the server will give the layout to the client. When the client moves data, the layout is consulted as to the data-server(s) upon which the data resides; once the offset and range are matched to the appropriate data-server(s), the data movement is completed with read and write requests. The pNFS protocol's layout definition provides for straightforward striping of data only. There is one twist to the striping -- the location may be specified by two paths, thus allowing for a simple multi-pathing feature.
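As a rough illustration of matching an offset and range to data servers, here is a minimal sketch assuming simple round-robin striping with a fixed stripe unit. The `Layout` class, its field names, and the server names are hypothetical; the real layout encoding is defined by the NFSv4.1 draft and is richer than this.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layout:
    """Hypothetical, simplified layout: fixed stripe unit, round-robin servers."""

    stripe_unit: int          # bytes per stripe unit (assumed parameter)
    data_servers: List[str]   # data server names, in striping order


def map_range(layout: Layout, offset: int, length: int) -> List[Tuple[str, int, int]]:
    """Split [offset, offset + length) into (server, file offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe_index = offset // layout.stripe_unit
        server = layout.data_servers[stripe_index % len(layout.data_servers)]
        # A piece never crosses a stripe-unit boundary.
        piece_end = min(end, (stripe_index + 1) * layout.stripe_unit)
        pieces.append((server, offset, piece_end - offset))
        offset = piece_end
    return pieces


# Usage: an 8-byte request starting at offset 2, stripe unit 4, two servers.
layout = Layout(stripe_unit=4, data_servers=["ds1", "ds2"])
pieces = map_range(layout, offset=2, length=8)
# pieces == [("ds1", 2, 2), ("ds2", 4, 4), ("ds1", 8, 2)]
```

Each piece can then be turned into an ordinary read or write request addressed to the named data server.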
With the layout in hand, the pNFS client is then enabled to generate read/write requests in parallel or by a method of its own choice. The layout is thus a simple enablement for the pNFS client to increase its overall data throughput. The pNFS server is also a beneficiary, through horizontal scaling of data access and a reduction in the read/write operations serviced directly by the meta-data server. Obviously, the main purpose or intent of the NFSv4.1 pNFS feature is to significantly improve the data throughput capabilities of NFS servers. The NFSv4.1 protocol requires that the meta-data server always be able to service read/write requests itself. This requirement allows for NFSv4.1 clients that are not enabled for pNFS and for cases in which the available parallelism is not required.
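The following sketch shows the shape of that choice: issue the per-server reads in parallel when a layout is in hand, and fall back to the meta-data server otherwise. The `read_from` callback, the `"mds"` name, and the in-memory "file" are stand-ins I made up for illustration; a real client does this in the kernel over RPC.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Piece = Tuple[str, int, int]                 # (server name, file offset, length)
ReadFn = Callable[[str, int, int], bytes]    # hypothetical transport callback


def read_file(pieces: Optional[List[Piece]], offset: int, length: int,
              read_from: ReadFn, mds: str = "mds") -> bytes:
    """Read a range in parallel via a layout, or via the meta-data server."""
    if not pieces:
        # No layout: the meta-data server must be able to service the read.
        return read_from(mds, offset, length)
    with ThreadPoolExecutor(max_workers=len(pieces)) as pool:
        # map() yields results in input order, so pieces sorted by offset
        # reassemble into the original byte range.
        chunks = pool.map(lambda piece: read_from(*piece), pieces)
        return b"".join(chunks)


# Usage with a fake transport that serves slices of one in-memory file.
FILE_DATA = b"abcdefghij"
contacted: List[str] = []


def fake_read(server: str, off: int, n: int) -> bytes:
    contacted.append(server)
    return FILE_DATA[off:off + n]


parallel = read_file([("ds1", 0, 4), ("ds2", 4, 4), ("ds1", 8, 2)], 0, 10, fake_read)
fallback = read_file(None, 0, 10, fake_read)
```

Both calls return the same bytes; only the set of servers contacted differs.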
The NFSv4.1 protocol defines interaction between client and server. There is no specification for the interaction between components of the pNFS server. This interaction or coordination of the pNFS server community members is left as a pNFS server implementation detail. Given the lack of an open protocol definition, pNFS server components will be homogeneous in their implementation. This isn't necessarily a bad thing since there is a variety of server filesystem architectures already present in the NFS server market. The lack of protocol definition allows for the most effective reuse of existing filesystem and server technology. Obviously there is a well-defined set of requirements or expectations of the meta-data and data servers in the form of the NFSv4.1 protocol.
Maintaining the theme of inclusiveness, the pNFS protocol allows for a variety of data movement or transfer methods between the client and pNFS server. The NFSv4.1 layout mechanism defines layout "types". The types are then defined as a particular data movement or transport protocol. The layout mechanism also allows for inclusion of newly defined types such that the NFSv4.1 protocol can adapt to future requirements or ideas.
There are three types of layouts currently being defined for pNFS; they are generically referred to as: files, objects, blocks. The "files" layout type uses the NFSv4.1 read/write operations to the data-server. The "files" type is being defined within the NFSv4.1 Internet Draft. The "objects" layout type refers to the OSD storage protocol as defined by the T10 "SCSI Object-Based Storage Device Commands" protocol draft. The "blocks" layout refers to the use of SCSI (in its many forms). The pNFS OSD and block layout definitions are defined in separate Internet Drafts.
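One way to picture layout types is as a small, extensible set from which the client picks one it supports. The sketch below is an assumption-laden illustration: the enum, the string values, and the `choose_layout_type` helper are mine, not identifiers from the Internet Drafts.

```python
from enum import Enum
from typing import Iterable, Optional


class LayoutType(Enum):
    """The three layout types named above; values are illustrative labels."""

    FILES = "files"
    OBJECTS = "objects"
    BLOCKS = "blocks"


# Data movement method associated with each type, as described in the text.
TRANSFER_METHOD = {
    LayoutType.FILES: "NFSv4.1 read/write operations to the data server",
    LayoutType.OBJECTS: "T10 OSD storage protocol",
    LayoutType.BLOCKS: "SCSI (in its many forms)",
}


def choose_layout_type(offered: Iterable[LayoutType],
                       supported: Iterable[LayoutType]) -> Optional[LayoutType]:
    """Return the first server-offered type the client supports, if any."""
    supported_set = set(supported)
    for layout_type in offered:
        if layout_type in supported_set:
            return layout_type
    # No common type: the client simply does its I/O through the
    # meta-data server.
    return None


chosen = choose_layout_type([LayoutType.OBJECTS, LayoutType.FILES], [LayoutType.FILES])
```

A newly defined layout type would be one more enum member and one more table entry, which is roughly the kind of extensibility the layout mechanism is after.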
For additional detail, the current Internet Drafts for the items mentioned above are:
The initial instantiation of NFSv4.1 for Solaris will deliver the Sessions and pNFS functionality using the files layout type.
With the introduction of NFSv4.1, there will be no change in the network transports available to the client or server. The kernel RPC interfaces will continue to provide TCP and RDMA (in the form of InfiniBand) network connectivity. The current RPCSEC_GSS mechanisms will continue to be supported as well.
The Solaris NFSv4.1 client will be a straightforward implementation of the Sessions and pNFS capabilities. The administrative model for the client will remain the same as it is today. As mentioned in the introduction, the client will continue to mount from a single server and provide a path (e.g. server.domain:/path/name).
Since NFSv4.1 constitutes a new version of the protocol, the client will negotiate the use of NFSv4.1 with the server as it has done in the past. The client will have a preference for the highest version of NFS offered by the server. As the client accesses filesystems it will query the server for the availability of the pNFS functionality. Then, on OPENing a file, the client attempts to obtain a layout and then uses the layout information when READing/WRITEing and COMMITing data to the "server" to provide data access parallelism.
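The decision sequence in that paragraph can be sketched in a few lines. This is a simplification under stated assumptions: versions are modeled as `(major, minor)` tuples, and `negotiate_version` and `data_path` are hypothetical helpers, not Solaris interfaces.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (4, 1) for NFSv4.1


def negotiate_version(client: Iterable[Version], server: Iterable[Version]) -> Version:
    """Pick the highest NFS version that both sides support."""
    common = set(client) & set(server)
    if not common:
        raise ValueError("no NFS version in common")
    return max(common)


def data_path(version: Version, server_offers_pnfs: bool, layout_granted: bool) -> str:
    """Decide where READ/WRITE/COMMIT go for a file that was just opened."""
    if version >= (4, 1) and server_offers_pnfs and layout_granted:
        return "data-servers"
    # Older version, no pNFS on this filesystem, or no layout available:
    # all I/O goes through the single server as it always has.
    return "meta-data-server"


# Usage: a client that speaks v3, v4.0 and v4.1 against two different servers.
client_versions = [(3, 0), (4, 0), (4, 1)]
new_server = negotiate_version(client_versions, [(3, 0), (4, 0), (4, 1)])
old_server = negotiate_version(client_versions, [(3, 0), (4, 0)])
```

Against the older server the client ends up on NFSv4.0 and never asks for a layout at all.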
As mentioned in the introduction, the NFSv4.1 protocol does not define the architecture of the pNFS server; only the outward-facing protocol and behavior are defined. Given this flexibility, a multitude of architectures can fit the model of pNFS service. For the Solaris pNFS server, a straightforward model will be used. Each member of the pNFS server community (meta-data server and data servers) is to be thought of as a self-contained storage unit.
For the Solaris pNFS server, there is one meta-data server. The meta-data server may be configured with a high availability component that allows for active/passive availability. The Solaris pNFS meta-data server looks and feels like a regular NFS server. There are extensions to the management model to integrate the use of the data servers. The meta-data server is in full control of the pNFS server in the sense that it decides which data servers to utilize (initial inclusion and allocations for new file layouts).
The meta-data server will require the use of ZFS as its underlying filesystem for storage of the filesystem namespace and attributes. Architecturally, other filesystems may be used but they must provide like-functionality for use by the pNFS meta-data server; the most important features are NFSv4 ACL support and system extended attributes.
The pNFS data servers do not need a traditional filesystem namespace for associating names to data given that the meta-data server provides this service by definition. The data server will then be free to associate the file objects that are to be stored upon them with their filehandle (the identifier shared between the data server and meta-data server) in any fashion appropriate. One may think that it is a requirement to have the data server mimic the filesystem namespace of the meta-data server but this is untrue. In fact, the mimicking of the meta-data namespace would prove to be cumbersome for regular NFS server operation. The Solaris pNFS data server will have an architecture that will allow for the direct use of ZFS pools for file data storage.
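To show why no namespace is needed on the data server, here is a toy object store keyed directly by filehandle. It is an in-memory stand-in with made-up names; the actual Solaris data server is described above as using ZFS pools, and nothing here reflects that design.

```python
from typing import Dict


class DataServerStore:
    """Toy data server: file objects are found by filehandle, not by name."""

    def __init__(self) -> None:
        # filehandle (opaque bytes shared with the meta-data server) -> data
        self._objects: Dict[bytes, bytearray] = {}

    def write(self, filehandle: bytes, offset: int, data: bytes) -> None:
        buf = self._objects.setdefault(filehandle, bytearray())
        if len(buf) < offset:
            # Writing past the end leaves a zero-filled hole.
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, filehandle: bytes, offset: int, length: int) -> bytes:
        buf = self._objects.get(filehandle, bytearray())
        return bytes(buf[offset:offset + length])


# Usage: no directories or file names are involved at any point.
store = DataServerStore()
store.write(b"fh-0001", 2, b"xy")
store.write(b"fh-0001", 0, b"ab")
contents = store.read(b"fh-0001", 0, 4)
```

The lookup is a single step from identifier to data, which is the freedom the paragraph above describes.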
The pNFS diagram above implies that there is a need for two separate networks: one internal to the pNFS community and one for client access to the pNFS server. This is but a possibility -- a logical representation, not a requirement. The only requirement with respect to network configuration is that each component within the diagram above be addressable or routable to each other. Therefore, there can be a variety of networking technologies or topologies employed for the pNFS server. The choice of topology or interconnect will be based on the workload being served by the pNFS server.
Coordination of the pNFS Community
To this point, we have a Solaris pNFS client interacting with a pNFS server over a flexible network configuration. The meta-data server is using ZFS as the underlying filesystem for the "regular" filesystem information (names, directories, attributes). The data server is using the ZFS pool to organize the data for the various layouts the meta-data server is handing out to the clients. What is coordinating all of the pNFS community members?
pNFS Control Protocol
The control protocol is the piece of the pNFS server solution that is left to the various implementations to define. Since the Solaris pNFS solution is taking a fairly straightforward approach to the construction of the pNFS community, this allows for the use of ZFS' special combination of features to organize the attached storage devices. This will allow for the control protocol to focus on higher level control of the pNFS community members.
Some of the highlights of the control protocol are:
- Meta-data and data server reboot / network partition indication
- Filehandle, file state, and layout validation
- Reporting of data server resources
- Inter data server data movement
- Meta-data server proxy I/O
- Data server state invalidation
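As one illustration of the first highlight (reboot / network partition indication), here is a sketch of how a meta-data server might notice either condition from periodic heartbeats. Everything in it is an assumption made for the example: the heartbeat message, the boot identifier, and the timeout are invented and do not describe the Solaris control protocol.

```python
from typing import Dict, List, Tuple


class CommunityMonitor:
    """Hypothetical tracker of data server liveness and reboots."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        # server name -> (boot identifier, time of last heartbeat)
        self._seen: Dict[str, Tuple[int, float]] = {}

    def heartbeat(self, server: str, boot_id: int, now: float) -> str:
        previous = self._seen.get(server)
        self._seen[server] = (boot_id, now)
        if previous is None:
            return "joined"
        if previous[0] != boot_id:
            # A new boot identifier means state held for the old
            # instance can no longer be trusted.
            return "rebooted"
        return "ok"

    def unreachable(self, now: float) -> List[str]:
        """Servers silent for longer than the timeout (possible partition)."""
        return sorted(name for name, (_, last) in self._seen.items()
                      if now - last > self.timeout)


# Usage with explicit timestamps so the outcome is deterministic.
monitor = CommunityMonitor(timeout=10.0)
events = [
    monitor.heartbeat("ds1", boot_id=1, now=0.0),
    monitor.heartbeat("ds2", boot_id=1, now=0.0),
    monitor.heartbeat("ds1", boot_id=1, now=5.0),
    monitor.heartbeat("ds1", boot_id=2, now=6.0),
]
silent = monitor.unreachable(now=12.0)
```

A "rebooted" result is where something like the state invalidation item in the list above would come into play.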
NFSv4.1 will deliver a range of new features. The two initially addressed for Solaris will be Sessions and pNFS. The Solaris pNFS client will be a simple implementation of the protocol. The Solaris pNFS server will layer on top of existing Solaris technologies to offer a feature-rich solution.