Wednesday Jun 18, 2014

NFS Over RDMA Community Development

Recently, Chuck Lever and Shirley Ma of the Oracle Mainline Linux kernel team have been working with the community on bringing Linux NFS over RDMA (remote direct memory access) up to full production quality.

At the 2014 OpenFabrics International Developer Workshop in March/April, they presented an overview of NFSoRDMA for Linux, outlining a rationale for investing resources in the project, as well as identifying what needs to be done to bring the implementation up to production quality.

Slides from their presentation may be downloaded here, while a video of Shirley and Chuck's presentation may be viewed here.

Shirley Ma wrote the following report on the workshop:

This year's OpenFabrics International Workshop was dedicated to the development and improvement of OpenFabrics Software. The workshop covered topics ranging from exascale systems I/O and enterprise applications to distributed computing, storage, data access and data analysis applications.

In this workshop our goal (Chuck Lever's and mine) was to bring more interested parties to NFSoRDMA and to work together as a community to improve NFSoRDMA's functionality, reliability and efficiency, as well as to add more NFSoRDMA test coverage to the OFED validation tests. The upstream NFSoRDMA code for both the Linux client and server has been sitting there for years, since 2007. The lack of upstream maintenance and support keeps customers away. Linux NFS on an InfiniBand fabric currently runs over IPoIB, which consumes more resources than running over RDMA (high CPU utilization, and contiguous memory reservation to achieve better bandwidth). We briefly evaluated NFSoRDMA vs. NFS/IPoIB using the direct I/O performance benchmark IOZone. The results showed that NFSoRDMA had better bandwidth at every record size from 1KB to 16MB, and better CPU efficiency for record sizes greater than 4KB. As expected, the NFSoRDMA RPC read and write round trip time is much shorter than that of NFS/IPoIB. There will be more demand for NFSoRDMA as storage and memory merge and I/O latency is reduced.

After our talk, Jeff Becker (NASA) and Susan Coulter (LANL) said they would like to join the NFSoRDMA effort. They have large-scale computing nodes, which gives them a decent-sized validation environment. Tom Talpey (original author of the NFSoRDMA client) agreed with our proposal for future work: splitting the send/recv completion queue, creating multiple QPs for scalability... He also gave advice on NFSoRDMA performance measurement based upon his SMB work and Dan Lovinger's performance measurement work on SMB 3.0 (http://www.snia.org/sites/default/files2/SDC2013/presentations/Revisions/DanLovinger-Scaled-RDMA-Revision.pdf).

Sayantan Sur (Intel) is currently using NFS/IPoIB in their IB cluster. We suggested some NFS/IPoIB tuning methods to him, and he was happy to get 100 times better bandwidth than before for small I/O sizes, going from 2MB/s to 200MB/s. He is thinking of moving to NFSoRDMA once it's stable. When we talked about a Wireshark NFSoRDMA dissector, Doug Oucharek (Intel) mentioned that he had implemented a Lustre RDMA packet dissector for Wireshark which is not upstream yet. We discussed it with him to see whether we can borrow some code for dissecting NFSoRDMA IB packets. Chuck and I also talked with OFILG interoperability tester Edward Mossman (IOL) about adding more NFSoRDMA coverage to their test suites.

Since last year, the OFA workshop has moved from being hardware-vendor driven to being software driven. Most of the attendees were OpenFabrics software and application developers. Intel had the most attendees: more than 20 people came from its HW, OpenFabrics stack, HPC and other application departments.

Topics that could be related to NFSoRDMA in the future:
A new working group (OpenFabrics Interface, OFI WG) has been created. Its goal is to minimize interface complexity and API overhead. A new framework was proposed that provides different fabric interfaces while hiding the differences between fabric providers' implementations. The OFI WG hosts weekly telecons every Tuesday, and everyone is welcome. Sean Hefty (Intel) analyzed the current stack's API overhead and cache memory footprint, and presented the interface framework in a little detail. See his presentation:
https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Monday/pdf/09.30_2014%20_OFA_Workshop_ofa-sfi.pdf

VMware is working on virtualization support for host and guest services over RDMA. In the guest, it implements a paravirtual vRDMA device that supports Verbs. The device is emulated in the ESXi hypervisor. Guest physical memory regions are mapped to ESXi and passed down to the physical RDMA HCA, which DMAs directly from/to guest physical memory. Latency between guests on the same host is about 20us.

Liran Liss from Mellanox gave an update on RDMA on-demand paging, which is intended to address the challenges of RDMA memory registration: cost, size, locking and synchronization. He proposed non-pinned memory regions, which require changes to OS page table (PTE) handling. More details are here: https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Tuesday/pdf/09.30_2014_OFA_Workshop_ODP_update_final.pdf
He also presented an approach to RDMA bonding at the transport level. A pseudo vHCA (vQP, vPD, vCQ, vMRs ...) is created and used for bonding (failover and aggregation), so the bonding will be hardware independent. The details of the proposal are at the link below. I don't know how feasible it is, or what the performance outcome would be. The pseudo HCA driver idea is similar to the VMware vRDMA driver.
https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Tuesday/pdf/16.30_2014_OFA_Workshop_rdma_bonding_final.pdf

Mellanox gave a RoCE (RDMA over Converged Ethernet) v2 update covering the IP-routable packet format. RoCEv2 encapsulates the IB packet in a UDP packet, and was presented to the IETF in Nov. 2013. This might introduce more challenges for fabric congestion control.
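
As a rough illustration of the UDP encapsulation mentioned above, the sketch below prepends a UDP header to an opaque blob of IB transport bytes. It is not taken from the talk: the function name, the default source port, and the placeholder payload are assumptions made for illustration. The destination port 4791 is the IANA-registered RoCEv2 port, and the checksum is simply left at zero rather than computed.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-registered UDP destination port for RoCEv2


def encapsulate_in_udp(ib_packet: bytes, src_port: int = 49152) -> bytes:
    """Prepend an 8-byte UDP header to an opaque IB packet (illustrative only).

    `src_port` is a hypothetical default; `ib_packet` is treated as raw bytes
    and is not parsed or validated here.
    """
    length = 8 + len(ib_packet)  # UDP length covers the header plus the payload
    checksum = 0                 # left as zero: no checksum is computed in this sketch
    header = struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, length, checksum)
    return header + ib_packet


# Placeholder payload standing in for the IB transport header and data.
datagram = encapsulate_in_udp(b"\x00" * 12)
```

The resulting datagram would still need an IP header in front of it, which is what makes the packet routable.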

Developers are still complaining about usability (different vendors have different implementations) and RDMA scalability in the areas of RDMA-CM, the subnet manager, QP resources, memory registration... RDMA sockets are still under discussion... Was any of this news to me after many years away from RDMA? :)

There are lots of other interesting application topics which I don't cover here. If you are interested, here is the link to all of the presentations:
https://www.openfabrics.org/index.php/press-room/2014-international-developer-workshop.html

Since the workshop, a bi-weekly conference call has been established, with developers from many companies and organizations participating.  Minutes from these calls are posted to the linux-rdma and linux-nfs mailing lists.   Minutes so far:

Code stability has been significantly improved, with increased testing by developers and bugfixes being merged.  Anna Schumaker of NetApp is now maintaining a git tree for NFSoRDMA, feeding up to the core NFS maintainers.

For folks wishing to get involved in development, see the NFSoRDMA client wiki page for more information.







Monday Sep 02, 2013

IETF 87 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

The 87th meeting of the IETF was held July 28 - August 2 in Berlin, Germany.

I was in Berlin for the week to attend the NFSv4 Working Group meeting and hold informal discussions related to NFS standardization with other attendees. The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet. Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world. Sponsorship for each event varies. DENIC, the central registry for domain names under .de, was the primary sponsor for this event. Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing. The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions. More on the Working Group charter can be found here: http://datatracker.ietf.org/wg/nfsv4/charter/

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS. I'm the editor of two of the IETF FedFS protocol specifications, and a co-author of an Internet-Draft that addresses protocol issues affecting NFSv4 migration. Other representatives at this meeting include Microsoft, EMC, NetApp, IBM, Oracle, Tonian, and others. Topics include progress updates on Internet-Drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work. See: https://datatracker.ietf.org/meeting/87/materials.html#nfsv4

Meeting agenda, presentation materials, and minutes are available at this location.

Drill down:

Working Group editor Tom Haynes (NetApp) reported on several areas where progress appears to be stalled. In general we face challenges completing our deliverables because the IETF is a volunteer organization, and the tasks at hand are large. The largest item is RFC 3530bis, which is holding up FedFS and NFSv4.2. RFC 3530bis was rejected during IESG review mainly due to the new chapter that attempts to bridge the gap between existing i18n implementations in NFS, and how we'd like i18n to work.

The problem is nobody has implemented i18n for NFSv4, and the IETF has revised i18n since 3530 was ratified. The consensus was to move the offending section to a separate Internet-Draft where the correct language can be hammered out without holding up RFC 3530bis. NFSv4.2 is held up by a lack of enthusiasm for finishing a new revision of RPCSEC_GSS. The GSS I-D has languished without an author or editor for many months, and two items in NFSv4.2 depend on its completion: labeled NFS and server-to-server copy. A rough consensus was not achieved, but Tom and Andy Adamson (NetApp) will investigate options, including removing the parts of server copy and labeled NFS that depend on GSSv3, and report back.

Benny Halevy (Tonian) has submitted a fresh draft proposing "Flexible File Layouts" which is a new pNFS layout type that improves upon the existing pNFS file layout defined in RFC 5661. Motivation for a new layout scheme includes: algorithmic data striping to support load balancing, life-cycle management, and other advanced administrative features; support for using legacy NFS servers as pNFS data servers; and direct pNFS support for existing cluster filesystems such as Ceph and GlusterFS.

Chuck Lever (Oracle) described recent progress to address security concerns in the FedFS documents waiting in the RFC Editor queue. He continued by walking through a group of possible future work items, including more modern LDAP security modes, additional administrative operations, and better mechanisms for clients to choose working fileset locations. Does the working group have the energy to consider a new revision of these documents? Or should we continue to focus on making small changes? This was left unresolved.

Sorin Faibish (EMC) discussed the need for a new layout enabling pNFS clients to access Lustre data servers directly. After a lot of discussion, the issue appears to be that the NFS protocol on high performance transports is not performant enough. The proposed solution was to use LNET over RDMA. It was suggested that it would be more interesting to the Working Group if we focused on fixing the performance issues in our RDMA specifications instead.

Marc Eshel (IBM) wanted to restart the age-old conversation on tightening NFS's data cache coherency. The immediate question is whether POSIX semantics are interesting given today's compute workloads and network environment. Implementing POSIX data coherency among multiple networked systems is still a challenge. The consensus was that a callback-based solution, where network traffic is proportional to the level of inter-client sharing, was most appropriate. Such a solution (byte-range delegation) was proposed by Trond Myklebust in 2006. It was recommended to start with that work.

Chuck Lever (Oracle) proposed an experimental extension to NFS that enables NFS clients and servers to convey end-to-end data integrity metadata. A new I-D has been submitted that describes the protocol changes. No prototype is available yet; the I-D is meant to coordinate discussion of technical details, and enable interoperable prototype implementations.

David Noveck (EMC) elaborated on the need to allow protocol changes outside of the NFS minor version process. He described the limitations of batching unrelated features together and waiting for a full pass through the IETF review process. There was some interest in allowing innovation outside of the minor version process. The Area Director and Working Group chair felt that there is currently not enough energy behind work already planned for delivery.

Matt Benjamin (Linux Box) is restarting work on a feature proposed several years ago by Mike Eisler that allows directories to be striped across pNFS data servers, just like file data is today. An Internet-Draft is available, and a prototype is underway.

-- Chuck Lever

Thursday Mar 21, 2013

IETF 86 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.


Executive summary:


On Monday (11th March) I attended the IETF NFSv4 Working Group meeting at IETF 86 in Orlando, Florida.

The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet.  Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world.  Sponsorship for each event varies.  This event was sponsored by Comcast and NBCUniversal.  Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing.  The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions.  More on the Working Group charter can be found here:

http://datatracker.ietf.org/wg/nfsv4/charter/

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS.  I'm the editor of two of the IETF FedFS protocol specifications, and a co-author on a draft that discusses experience implementing NFSv4 migration.


Other representatives at this meeting include Microsoft, EMC, NetApp, Oracle, Panasas, and others.  Topics include progress updates on drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work.  See:

https://datatracker.ietf.org/meeting/86/materials.html#nfsv4

Meeting slides are available now at this location.  Minutes are coming soon.


Drill down:


Tom Haynes (NetApp) reported on progress with RFC 3530bis, a refreshed specification for NFSv4.0.  This document has passed the Area Director check, and is ready for IESG review.  This document is a top priority because other unfinished documents which cite this document are held up waiting for its completion.

Labeled NFS, a part of the forthcoming NFSv4.2 protocol, has a Linux prototype that was demonstrated at Connectathon last month.

The RPCSEC_GSSv3 standard has not made progress, but an editor (Dros Adamson) was assigned during IETF 85.  This document is blocking progress on NFSv4.2.

The NFSv4.2 draft is in WG last call, which ends today (Monday, March 11).  No new issues were raised, so the Working Group chair will move this forward.

Tom Haynes presented a brief set of slides on how an NFSv3 client should interpret the presence of AUTH_NONE in the list of security flavors a server supports.  There was never a formal standard describing this, and now we need an interoperability document.  As we explore this issue we may discover some real problems.  A fresh draft was requested.

Dave Noveck (EMC) discussed progress on the draft co-authored with Bill Baker, Piyush Shivam, and myself on NFSv4 migration issues.  As part of the discussion, we visited the issue of how to prevent client progress when a server freezes open and lock state before a migration.  Adding a new error code was mentioned, but that is against the minor version rules and would cause interoperability problems with clients that don't recognize the new error code.  Otherwise we have the procedural issues taken care of to advance this document to become an informational RFC.

A draft covering NFSv4.1 migration issues would probably not be needed, as the changes are small and could be covered in an RFC 5661bis, when it is opened.  There doesn't seem to be urgency here.

Chuck Lever (Oracle) described implementation experience with the recommendations of Dave's migration draft.  The experience arises from the Linux Uniform Client String changes Chuck has done, and a number of items discovered by the Solaris NFS team.

Chuck Lever reported on progress with the FedFS draft standards.  Short story: They are in the RFC Editor queue awaiting completion of RFC 3530bis.

Trond Myklebust (NetApp) presented an issue with NFSv4.1 session slot table management that he had also reported at Connectathon.  It was agreed that an erratum to RFC 5661 would be produced that describes how implementations will add the missing behavior.  No on-the-wire protocol changes are needed.

Matt Benjamin (The Linux Box) requested a revisit of a 2008 proposal by Mike Eisler to stripe POSIX directories across multiple data servers.  An algorithm would generate an offset into a table of device IDs, which would indicate on which data server to find a directory entry.  Matt claimed there would be changes to the proposal to deal with Ceph and CohortFS.  The chair requested a draft, which Matt is to deliver soon.
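
To make the idea of "an offset into a table of device IDs" concrete, here is a minimal sketch. It is not the algorithm from the proposal, which is not described in this report: the use of a CRC32 hash of the entry name, the function name, and the example device IDs are all assumptions chosen for illustration.

```python
import zlib


def data_server_for(entry_name: str, device_ids: list[str]) -> str:
    """Pick the device ID that would hold a directory entry (illustrative only).

    A stable hash of the entry name is reduced modulo the table size to give
    an offset into the device ID table. CRC32 is an assumed stand-in for
    whatever function a real layout would specify.
    """
    if not device_ids:
        raise ValueError("device ID table is empty")
    offset = zlib.crc32(entry_name.encode("utf-8")) % len(device_ids)
    return device_ids[offset]


# Hypothetical device ID table for three data servers.
table = ["dev-0", "dev-1", "dev-2"]
server = data_server_for("report.txt", table)
```

Because the offset depends only on the entry name and the table, any client holding the same table would compute the same data server for a given entry without asking the metadata server.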

Chuck Lever asked if we still need an NFS-specific mechanism for provisioning NFSv4 ID domain names.  The feeling is that this domain name is determined by the system's authentication service, not by NFS, so NFS should not have its own way to set this.  The consensus was that there is some work to do here, and it should be done under the umbrella of the ongoing multi-domain work.

Spencer Shepler (Microsoft) closed the meeting with a house-keeping item.  There is some desire to reduce travel by moving more work to the mailing list.  The plan is to ask about agenda items for a meeting before requesting a meeting slot at the next IETF.

Several folks wanted to discuss Bill Baker's micro-versioning proposal.  Dave Noveck stated the problem this way: NFSv4.1 is a heavyweight minor version with a bunch of features, so fixes for 4.0 aren't possible with our spiffy minor versioning scheme.  Spencer felt we should visit this only when we encounter a problem we must address with major protocol changes.  The room was divided; some felt waiting was best so that a problem statement can be formulated; others were concerned that it was almost certain we would need to alter the NFSv4.0 XDR at some point, and we should start working this out now.

In the near term, protocol issues should come to the mailing list sooner rather than later so we can work them out together.

-- Chuck Lever

About

The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.


Our team is globally distributed and includes leading core kernel developers and industry veterans.


This blog is maintained by James Morris <james.l.morris@oracle.com>
