By Jamesmorris-Oracle on Jun 18, 2014
Recently, Chuck Lever and Shirley Ma of the Oracle Mainline Linux kernel team have been working with the community on bringing Linux NFS over RDMA (remote direct memory access) up to full production quality.
At the 2014 OpenFabrics International Developer Workshop in March/April, they presented an overview of NFSoRDMA for Linux, outlining a rationale for investing resources in the project, as well as identifying what needs to be done to bring the implementation up to production quality.
Shirley Ma wrote the following report on the workshop:
This year's OpenFabrics International Workshop was dedicated to the development and improvement of OpenFabrics Software. The workshop covered topics ranging from Exascale systems I/O and enterprise applications to distributed computing, storage, data access and data analysis applications.
Our goal at this workshop (Chuck Lever's and mine) was to bring more interested parties to NFSoRDMA, and to work together as a community to improve NFSoRDMA's functionality, reliability and efficiency, as well as to add more NFSoRDMA test coverage to the OFED validation tests. The upstream NFSoRDMA code for both the Linux client and server has been sitting there for years, since 2007. The lack of upstream maintenance and support keeps customers away. Linux NFS on an InfiniBand fabric currently runs over IPoIB, which consumes more resources than running over RDMA (high CPU utilization, and contiguous memory reservation to achieve better bandwidth). We briefly evaluated NFSoRDMA against NFS/IPoIB using the IOZone benchmark in direct I/O mode. The results showed that NFSoRDMA had better bandwidth at every record size from 1KB to 16MB, and better CPU efficiency for record sizes greater than 4KB. As expected, the RPC read and write round trip times for NFSoRDMA are much shorter than for NFS/IPoIB. Demand for NFSoRDMA should grow as storage and memory merge and I/O latency is reduced.
After our talk, Jeff Becker (NASA) and Susan Coulter (LANL) said they would like to join the NFSoRDMA effort. They have large-scale computing nodes, which make for a decently sized validation environment. Tom Talpey (original author of the NFSoRDMA client) agreed with our proposal for future work: splitting the send/recv completion queue, creating multiple QPs for scalability... He also gave advice on NFSoRDMA performance measurement, based on his SMB work and Dan Lovinger's performance measurement work on SMB 3.0 (http://www.snia.org/sites/default/files2/SDC2013/presentations/Revisions/DanLovinger-Scaled-RDMA-Revision.pdf).
Sayantan Sur (Intel) is currently using NFS/IPoIB in their IB cluster. We suggested some NFS/IPoIB tuning methods, and he was happy to see bandwidth for small I/O sizes improve 100-fold, from 2MB/s to 200MB/s. He is thinking of moving to NFSoRDMA once it is stable. When we talked about a Wireshark NFSoRDMA dissector, Doug Oucharek (Intel) mentioned that he had implemented a Lustre RDMA packet dissector for Wireshark, which is not upstream yet. We discussed it with him to see whether we can borrow some of the code to dissect NFSoRDMA IB packets. Chuck and I also talked with OFILG interoperability tester Edward Mossman (IOL) about adding more NFSoRDMA coverage to their test suites.
Since last year, the OFA workshop has moved from being hardware vendor driven to being software driven. Most of the attendees were OpenFabrics software and application developers. Intel had the most attendees: more than 20 people came from its HW, OpenFabrics Stack, HPC and other application departments.
Topics that could be relevant to NFSoRDMA in the future:
A new working group (OpenFabrics Interface, OFI WG) has been created, with the goal of minimizing interface complexity and API overhead. The proposed framework provides a set of fabric interfaces that hide the implementation details of different fabric providers. The OFI WG hosts weekly telecons every Tuesday, and everyone is welcome. Sean Hefty (Intel) analyzed the API overhead and cache memory footprint of the current stack, and presented the interface framework in some detail; see his presentation:
VMware is working on virtualization support for host and guest services over RDMA. In the guest, it implements a paravirtual vRDMA device that supports Verbs. The device is emulated in the ESXi hypervisor. Guest physical memory regions are mapped to ESXi and passed down to the physical RDMA HCA, which DMAs directly to and from guest physical memory. Latency between guests on the same host is about 20us.
Liran Liss from Mellanox gave an update on RDMA on-demand paging, which is intended to address the challenges of RDMA memory registration: cost, size, locking and synchronization. He proposed a non-pinned memory region, which requires changes to the OS page tables. More details are here: https://www.openfabrics.org/images/Workshops_2014/DevWorkshop/presos/Tuesday/pdf/09.30_2014_OFA_Workshop_ODP_update_final.pdf
He also presented an approach to RDMA bonding at the transport level. A pseudo vHCA (vQP, vPD, vCQ, vMRs ...) is created and used for bonding (failover and aggregation), so the bonding will be hardware independent. The details of the proposal are below; I don't know how feasible it is, or what the performance outcome will be. The pseudo HCA driver idea is similar to the VMware vRDMA driver.
Mellanox gave an update on RoCE (RDMA over Converged Ethernet) v2, an IP-routable packet format. RoCEv2 encapsulates the IB packet in a UDP packet, and was presented to the IETF in Nov. 2013. This might introduce more challenges for fabric congestion control.
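To make the encapsulation a little more concrete, here is a rough Python sketch of how an IB transport packet could be laid out inside a UDP datagram. It is only an illustration and was not part of the talk: the function name and the sample QP/PSN values are made up, the ICRC is a zeroed placeholder rather than a real CRC, and the UDP checksum is simply left at zero. The destination port 4791 is the one registered for RoCEv2, and the 12-byte Base Transport Header (BTH) layout follows the InfiniBand specification as I understand it.

```python
import struct

ROCEV2_UDP_DPORT = 4791  # UDP destination port registered for RoCEv2


def build_rocev2_udp(src_port, opcode, dest_qp, psn, payload, pkey=0xFFFF):
    """Hypothetical helper: wrap an IB BTH + payload in a UDP header.

    Layout produced: UDP header (8) | BTH (12) | padded payload | ICRC (4)
    """
    # IB payloads are padded to a 4-byte boundary; BTH records the pad count.
    pad_count = (-len(payload)) % 4
    flags = pad_count << 4  # SE=0, M=0, PadCnt in bits 5-4, TVer=0
    bth = struct.pack(
        "!BBHII",
        opcode,
        flags,
        pkey,
        dest_qp & 0xFFFFFF,  # 8 reserved bits + 24-bit destination QP
        psn & 0xFFFFFF,      # AckReq=0, 7 reserved bits + 24-bit PSN
    )
    icrc = b"\x00" * 4  # placeholder only; a real sender computes the ICRC
    body = bth + payload + b"\x00" * pad_count + icrc
    # UDP header: source port, destination port, length, checksum (left 0)
    udp_header = struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, 8 + len(body), 0)
    return udp_header + body


# Example values are arbitrary; 0x04 is assumed to be the RC "SEND Only" opcode.
datagram = build_rocev2_udp(49152, 0x04, dest_qp=0x12, psn=7, payload=b"hello")
print(len(datagram), datagram.hex())
```

Because everything above the UDP header is an ordinary IB transport packet, the datagram can be carried by a normal IP header and routed across subnets, which is the point of the v2 format.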
Developers are still complaining about usability (different vendors have different implementations) and about RDMA scalability in the areas of RDMA-CM, the subnet manager, QP resources, memory registration... RDMA sockets are still under discussion... Was any of this news to me after many years away from RDMA? :)
There were lots of other interesting application topics which I don't cover here. If you are interested, here is the link to the full set of presentations:
Since the workshop, a bi-weekly conference call has been established, with developers from many companies and organizations participating. Minutes from these calls are posted to the linux-rdma and linux-nfs mailing lists. Minutes so far:
Code stability has been significantly improved, with increased testing by developers and bugfixes being merged. Anna Schumaker of NetApp is now maintaining a git tree for NFSoRDMA, feeding up to the core NFS maintainers.
For folks wishing to get involved in development, see the NFSoRDMA client wiki page for more information.