I have been involved in the IETF NFSv4
working group from almost the very beginning. One of the things that I have really
enjoyed about the experience is the NFS community that has come together
to move NFS forward in the form of NFSv4 and now NFSv4.1.
I suppose the level of cooperation has grown out of the 20+ years of Connectathon testing, a place where engineers get together to find and fix interoperability bugs. That tradition
continued with the bakeathon events, which started when NFSv4 was being built and proved very useful for working out interpretation and interoperability bugs while the protocol was still being defined. Mike describes the most recent bakeathon, which we hosted in Austin. It was very successful, as they always are.
Mike does a good job of reviewing what NFSv4.1's pNFS is and can accomplish.
I also agree with him about the narrow definition of HPC and the need for
pNFS in many areas that may not be apparent at first blush.
In any case, Mike goes on to describe what NetApp is doing with pNFS. All good
stuff. There has been increasing press about pNFS lately. That is a good thing,
but I hope that as we move forward we maintain the NFS community's shared
ownership of, and responsibility for, building a protocol that is useful and implementations
that are even more so.
I have included my brief summary of NFSv4.1 and pNFS and a little
about the approach we are taking in building pNFS for Solaris.
More details will be showing up on the
OpenSolaris pNFS project page.
In fact, the most recent news is that we have provided binaries and source matching
what we tested at the bakeathon.
Have fun parallelizing your data!
Introduction to Parallel NFS (pNFS)
The IETF NFSv4 working group has been building the first minor version
of NFSv4 (NFSv4.1). The current draft of the proposed NFSv4.1
protocol has made significant progress in recent months and is now
considered "functionally complete". To assist in bringing the
Internet Draft to closure (working group last call), the document
editors have been hosting a set of formal reviews that will continue
through the summer of 2007. It is expected that the Working Group
will complete the document around the end of 2007.
NFSv4's protocol definition defined a set of rules for future minor
version development. It was well understood that there would be a
need to modify or update NFSv4.0 at some point; hence the minor
version rules. Given the ability to update the protocol, the
following functional enhancements or additions are currently defined
for NFSv4.1; note, this is not an exhaustive list but captures the
main set of changes.
- Sessions
Provides a framework for client and server such that "exactly once
semantics" can be achieved at the NFS application level. Sessions
also assists in mandating the availability of a
- Directory Delegations and Notifications
Allows for effective caching of directory contents and delegation
- Multi-server Namespace (Namespace extensions)
Allows a namespace to extend beyond the boundaries of a single
- ACL, SACL, DACL
ACL inheritance allows propagation of access permissions and
restriction down a directory tree as file system objects are
- Retention Attributes
File object can be placed in an immutable, undeletable,
unrenamable state for a fixed or infinite duration of time.
Corrects the NFSv4.0 security negotiation mechanism.
- Parallel NFS (pNFS)
Functionality introduced or modified in a minor version must be
optional to implement. This is true for all of the changes in NFSv4.1
with the exception of Sessions. For NFSv4.1, Sessions is a
mandatory-to-implement feature. The basic functionality that Sessions provides
has become an integral part of the other features of NFSv4.1; the
consensus was that it was undesirable to define each piece
of NFSv4.1 functionality twice, once with and once without Sessions
support. Therefore, Sessions is mandatory to implement for NFSv4.1.
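To make the "exactly once semantics" idea a little more concrete, here is a minimal sketch of a per-slot reply cache of the kind Sessions relies on. It is a simplification, not the protocol's wire behavior: the names `Slot`, `Session`, and `handle` are hypothetical, and the status strings stand in for real protocol results. The idea shown is that a retransmitted request gets the cached reply back instead of being executed a second time.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class Slot:
    """One slot of a session's slot table (hypothetical, simplified)."""
    last_seq: int = 0          # sequence id of the last request executed
    cached_reply: Any = None   # reply to resend if that request is retried


@dataclass
class Session:
    slots: Dict[int, Slot] = field(default_factory=dict)

    def handle(self, slot_id: int, seq: int,
               op: Callable[[], Any]) -> Tuple[str, Any]:
        """Run op at most once for a given (slot, sequence id) pair."""
        slot = self.slots.setdefault(slot_id, Slot())
        if seq != 0 and seq == slot.last_seq:
            # Retransmission: answer from the cache, do not re-execute.
            return ("replayed", slot.cached_reply)
        if seq != slot.last_seq + 1:
            # Neither a retry nor the next request in order.
            return ("misordered", None)
        slot.cached_reply = op()
        slot.last_seq = seq
        return ("executed", slot.cached_reply)
```

A non-idempotent operation (a rename or remove, say) sent twice with the same slot and sequence id is therefore applied once, and the client still gets its answer.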
The Parallel NFS or pNFS functionality, as its name implies, is a
method of introducing data access parallelism. The NFSv4.1 protocol
defines a method of separating the meta-data (names and attributes) of
a filesystem from the location of the file data; it goes beyond simple
name/data separation to define a method of striping the data amongst a
set of data servers. This is obviously very different from the
traditional NFS server which holds the names of files and their data
under the single umbrella of the server. There are multi-node NFS
server products in existence, but they give the client little part in
the separation of meta-data and data.
The NFSv4.1 client can be enabled to participate directly in
locating file data, so it no longer has to move all data through a
single NFS server.
The NFSv4.1 pNFS server is now a collection or community of server
resources or components; these community members are assumed to be
controlled by the meta-data server.
The pNFS client still accesses a single meta-data server for traversal
or interaction with the namespace; when the client moves data to and
from the server it may be directly interacting with the set of data
servers belonging to the pNFS server community.
| pNFS Server |
| .-------------- .-------------- |
| |data-server | |data-server | |
| | | | | |
| `-.------------ `-----.-------- |
| | .-------------- | .-------------- |
| | |data-server | | |data-server | |
| | | | | | | |
| | `------.------- | `------.------- |
| _|____________|_________|_________|__________ |
| | |
| ,---------'----------- |
| | meta-data server | |
| |____________________| |
| | | | |
| | |
| | |
.-----+-----. .-----+-----. .-----+-----.
| | | | | |
|pNFS Client| |pNFS Client| .... |pNFS Client|
| | | | | |
`-----------' `-----------' `-----------'
As mentioned, the user's view of the "pNFS Server" continues to appear
on the network as a regular NFS server even though there are multiple,
distinct and addressable components of the server. There is a single
server from which a filesystem is mounted and accessed. The
administrator knows the details of the "pNFS Server" or community.
The pNFS Client implementation will know the details of the pNFS
server through NFSv4.1 protocol interaction. When it comes to things
like mount points and automount maps, the look and feel of the NFS
server is the same as it has been: single server name and its
The pNFS enabled client determines the location of file data by
directly querying the pNFS server. In pNFS nomenclature, the client
requests a file "layout". When a file is opened, the pNFS client will
ask the meta-data server for the file's layout. If it is
available, the server will give the layout to the client. When the
client moves data, the layout is consulted as to the data-server(s)
upon which the data resides; once the offset and range are matched to
the appropriate data-server(s), the data movement is completed with
read and write requests. The pNFS protocol's layout definition
provides for straightforward striping of data only. There is one
twist to the striping -- the location may be specified by two paths
thus allowing for a simple multi-pathing feature.
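As an illustration of matching an offset and range to data servers, the sketch below splits a byte range across a round-robin stripe. It assumes a fixed stripe unit and a simple ordered list of data servers; the function name `split_io` and its parameters are hypothetical, and the real files layout carries more detail than this (including the second path mentioned above).

```python
from typing import List, Tuple


def split_io(offset: int, length: int, stripe_unit: int,
             data_servers: List[str]) -> List[Tuple[str, int, int]]:
    """Split a byte range into (data_server, file_offset, length) pieces.

    Assumes plain round-robin striping: stripe unit N of the file lives on
    data_servers[N % len(data_servers)].
    """
    if stripe_unit <= 0 or not data_servers:
        raise ValueError("need a positive stripe unit and at least one server")
    pieces = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_unit
        server = data_servers[stripe_index % len(data_servers)]
        # A piece never crosses a stripe unit boundary.
        piece_end = min(end, (stripe_index + 1) * stripe_unit)
        pieces.append((server, offset, piece_end - offset))
        offset = piece_end
    return pieces
```

For example, a 10-byte request at offset 0 with a 4-byte stripe unit and two data servers becomes three pieces, alternating between the two servers.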
With the layout in hand, the pNFS client is then enabled to generate
read/write requests in parallel or by a method of its own choice. The
layout is thus a simple enablement for the pNFS client to increase its
overall data throughput. The pNFS server benefits as well: data
access scales horizontally across the data servers, and fewer
read/write operations are serviced directly by the meta-data server.
Obviously, the main purpose or intent of the NFSv4.1 pNFS feature is
to significantly improve the data throughput capabilities of NFS
servers. The NFSv4.1 protocol requires that the meta-data server
always be able to service read/write requests itself. This requirement
allows for NFSv4.1 clients that are not enabled for pNFS and for cases
where the available parallelism is not required.
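A rough sketch of how a client might use a layout when it has one, and fall back to the meta-data server when it does not, is shown here. Everything in it is an assumption made for illustration: data servers and the meta-data server are modeled as plain Python callables, and `read_range` is a hypothetical helper, not part of any NFS client API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

# A reader takes (offset, length) and returns the bytes at that range.
ReadFn = Callable[[int, int], bytes]


def read_range(pieces: Optional[List[Tuple[ReadFn, int, int]]],
               mds_read: ReadFn, offset: int, length: int) -> bytes:
    """Read [offset, offset + length) in parallel if a layout is in hand.

    pieces is the layout applied to this range: one (reader, offset, length)
    entry per data server piece, in file order. With no pieces, the read
    goes to the meta-data server, which must always be able to service it.
    """
    if not pieces:
        return mds_read(offset, length)
    with ThreadPoolExecutor(max_workers=len(pieces)) as pool:
        futures = [pool.submit(fn, off, n) for fn, off, n in pieces]
        # Collect in submission order so the bytes come back in file order.
        return b"".join(f.result() for f in futures)
```

The fallback branch is the point of the requirement above: a client without pNFS support, or without a layout, still works against the same server.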
The NFSv4.1 protocol defines interaction between client and server.
There is no specification for the interaction between components of
the pNFS server. This interaction or coordination of the pNFS server
community members is left as a pNFS server implementation detail.
Given the lack of an open protocol definition, pNFS server components
will be homogeneous in their implementation. This isn't necessarily a
bad thing since there is a variety of server filesystem architectures
already present in the NFS server market. The lack of protocol
definition allows for the most effective reuse of existing filesystem
and server technology. Obviously there is a well-defined set of
requirements or expectations of the meta-data and data servers in the
form of the NFSv4.1 protocol.
Maintaining the theme of inclusiveness, the pNFS protocol allows for a
variety of data movement or transfer methods between the client and
pNFS server. The NFSv4.1 layout mechanism defines layout "types".
The types are then defined as a particular data movement or transport
protocol. The layout mechanism also allows for inclusion of newly
defined types such that the NFSv4.1 protocol can adapt to future
requirements or ideas.
There are three types of layouts currently being defined for pNFS;
they are generically referred to as: files, objects, blocks. The
"files" layout type uses the NFSv4.1 read/write operations to the
data-server. The "files" type is being defined within the NFSv4.1
Internet Draft. The "objects" layout type refers to the OSD storage
protocol as defined by the T10 "SCSI Object-Based Storage Device
Commands" protocol draft. The "blocks" layout refers to the use of
SCSI (in its many forms). The pNFS OSD and block layouts
are specified in separate Internet Drafts.
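One way to picture layout types is as a small set of tags, each naming the data path a client uses once it holds a layout of that type. The sketch below is only that picture: `LayoutType`, `DATA_PATH`, and `choose_layout_type` are hypothetical names, and the string values are labels rather than protocol constants. Returning `None` stands for doing I/O through the meta-data server instead.

```python
from enum import Enum
from typing import Iterable, Optional, Set


class LayoutType(Enum):
    FILES = "files"
    OBJECTS = "objects"
    BLOCKS = "blocks"


# Data path implied by each layout type, as described in the text.
DATA_PATH = {
    LayoutType.FILES: "NFSv4.1 read/write operations to the data server",
    LayoutType.OBJECTS: "T10 OSD commands",
    LayoutType.BLOCKS: "SCSI block commands",
}


def choose_layout_type(offered: Iterable[LayoutType],
                       supported: Set[LayoutType]) -> Optional[LayoutType]:
    """Pick the first layout type the server offers that the client supports."""
    for layout_type in offered:
        if layout_type in supported:
            return layout_type
    return None  # no usable layout type: go through the meta-data server
```

Adding a newly defined type would mean adding one enum member and one data path entry, which is the extensibility the layout mechanism is meant to allow.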
For additional detail, the current Internet Drafts for the items
mentioned above are:
- NFSv4 Minor Version 1
- Object-based pNFS Operations
- pNFS Block/Volume Layout
The initial instantiation of NFSv4.1 for Solaris will deliver the
Sessions and pNFS functionality using the files layout type.
With the introduction of NFSv4.1, there will be no change in the
network transports available to the client or server. The kernel RPC
interfaces will continue to provide TCP and RDMA (in the form of
InfiniBand) network connectivity. The current RPCSEC_GSS mechanisms
will continue to be supported as well.
The Solaris NFSv4.1 client will be a straightforward implementation of
the Sessions and pNFS capabilities. The administrative model for the
client will remain the same as it is today. As mentioned in the
introduction, the client will continue to mount from a single server
and provide a path (e.g. server.domain:/path/name).
Since NFSv4.1 constitutes a new version of the protocol, the client
will negotiate the use of NFSv4.1 with the server as it has done in
the past. The client will have a preference for the highest version
of NFS offered by the server. As the client accesses filesystems it
will query the server for the availability of the pNFS functionality.
Then, on OPENing a file, the client attempts to obtain a layout and
then uses the layout information when READing/WRITEing and COMMITing
data to the "server" to provide data access parallelism.
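The client-side decisions described above can be summarized in a few lines. This is a deliberately simplified sketch: versions are modeled as (major, minor) tuples, negotiation is reduced to "highest version both sides support", and `negotiate_version` and `plan_io` are hypothetical names rather than Solaris interfaces.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (4, 1) for NFSv4.1


def negotiate_version(client: Iterable[Version],
                      server: Iterable[Version]) -> Optional[Version]:
    """Return the highest NFS version both sides support, if any."""
    common = set(client) & set(server)
    return max(common) if common else None


def plan_io(version: Optional[Version], fs_has_pnfs: bool,
            layout_granted: bool) -> str:
    """Decide how READ/WRITE/COMMIT traffic for an opened file will flow."""
    if version is None:
        return "no common version"
    if version >= (4, 1) and fs_has_pnfs and layout_granted:
        return "parallel I/O to data servers"
    # Older version, no pNFS on this filesystem, or no layout for this file.
    return "I/O through the single server"
```

Each condition maps to a step in the text: the version check to negotiation, `fs_has_pnfs` to the per-filesystem query, and `layout_granted` to the attempt to obtain a layout at OPEN time.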
As mentioned in the introduction, the NFSv4.1 protocol does not define
the architecture of the pNFS server; only the outward facing protocol
and behavior is defined. Given this flexibility, a multitude of
architectures can fit the model of pNFS service. For the Solaris pNFS
server, a straightforward model will be used. Each member of the pNFS
server community (meta-data server and data servers) is to be
thought of as a self-contained storage unit.
For the Solaris pNFS server, there is one meta-data server. The
meta-data server may be configured with a high availability component
that allows for active/passive availability. The Solaris pNFS meta-data
server looks and feels like a regular NFS server. There are
extensions to the management model to integrate the use of the data
servers. The meta-data server is in full control of the pNFS server
in the sense that it decides which data server to utilize (initial
inclusion and allocations for new file layouts).
The meta-data server will require the use of ZFS as its underlying
filesystem for storage of the filesystem namespace and attributes.
Architecturally, other filesystems may be used but they must provide
like-functionality for use by the pNFS meta-data server; the most
important features are NFSv4 ACL support and system extended
The pNFS data servers do not need a traditional filesystem namespace
for associating names to data given that the meta-data server provides
this service by definition. The data server will then be free to
associate the file objects that are to be stored upon them with their
filehandle (the identifier shared between the data server and meta-data
server) in any fashion appropriate. One may think that it is a
requirement to have the data server mimic the filesystem namespace of
the meta-data server but this is untrue. In fact, the mimicking of the
meta-data namespace would prove to be cumbersome for regular NFS
server operation. The Solaris pNFS data server will have an
architecture that will allow for the direct use of ZFS pools for file
The pNFS diagram above implies that there is a need for two separate
networks; one internal to the pNFS community and one for client access
to the pNFS server. This is but a possibility -- a logical
representation; not a requirement. The only requirement with respect
to network configuration is that each component within the diagram
above be addressable or routable to each other. Therefore, there can
be a variety of networking technologies or topologies employed for the
pNFS server. The choice of topology or interconnect will be based on
the workload being served by the pNFS server.
Coordination of the pNFS Community
To this point, we have a Solaris pNFS client interacting with a pNFS
server over a flexible network configuration. The meta-data server is
using ZFS as the underlying filesystem for the "regular" filesystem
information (names, directories, attributes). The data server is
using the ZFS pool to organize the data for the various layouts the
meta-data server is handing out to the clients. What is coordinating
all of the pNFS community members?
pNFS Control Protocol
The control protocol is the piece of the pNFS server solution that is
left to the various implementations to define. Since the Solaris pNFS
solution is taking a fairly straightforward approach to the
construction of the pNFS community, this allows for the use of ZFS'
special combination of features to organize the attached storage
devices. This will allow for the control protocol to focus on higher
level control of the pNFS community members.
Some of the highlights of the control protocol are:
- Meta-data and data server reboot / network partition indication
- Filehandle, file state, and layout validation
- Reporting of data server resources
- Inter data server data movement
- Meta-data server proxy I/O
- Data server state invalidation
NFSv4.1 will deliver a range of new features. The two initially
addressed for Solaris will be Sessions and pNFS. The Solaris pNFS
client will be a simple implementation of the protocol. The Solaris
pNFS server will layer on top of existing Solaris technologies to
offer a feature-rich solution.