Wednesday Jun 18, 2014

NFS Over RDMA Community Development

Recently, Chuck Lever and Shirley Ma of the Oracle Mainline Linux kernel team have been working with the community on bringing Linux NFS over RDMA (remote direct memory access) up to full production quality.

At the 2014 OpenFabrics International Developer Workshop in March/April, they presented an overview of NFSoRDMA for Linux, outlining a rationale for investing resources in the project, as well as identifying what needs to be done to bring the implementation up to production quality.

Slides from their presentation may be downloaded here, while a video of Shirley and Chuck's presentation may be viewed here.

Shirley Ma wrote the following report on the workshop:

This year OpenFabrics international workshop is dedicated to the development and improvement of OpenFabrics Software. The workshop covers topics from Exascale systems I/O, Enterprise applications to distributed computing, storage, data access and data analysis applications.

In this workshop our goal (Chuck Lever and myself) was to bring more interest parties to NFSoRDMA, work together as a community to make NFSoRDMA better on functionality, reliability and efficiency as well as adding more NFSoRDMA test coverage in OFED validation test.  NFSoRDMA both Linux client and server upstream codes have been lying there for years from 2007. Lacking of upstream maintenance and support keeps customers away. Linux NFS is over IPoIB in InfiniBand Fabric, which consumes more resources than over RDMA (high CPU utilization, contiguous memory reservation to achieve better bandwidth). We briefly evaluated NFSoRDMA vs. NFS/IPoIB using direct I/O performance benchmark IOZone. The results showed that NFSoRDMA had better bandwidth in all different record size among 1KB to 16MB, and better CPU efficiency for record size greater than 4KB. As expected NFSoRDMA RPC read, write round trip time is much shorter than NFS/IPoIB. There will be more desire for NFSoRDMA when storage and memory are merged, I/O latency reduced.

Jeff Becker(NASA) and Susan Coulter (LANL) would like to join NFSoRDMA efforts after our talk. They have large scale computing nodes, a decent scaled validation environment. Tom Talpey (NFSoRDMA client original author) agreed with our proposal of future work: splitting send/recv completion queue, creating multiple QPs for scalability... He also gave advises on NFSoRDMA performance measurement based upon his SMB work and Don Lovinger's performance measurement work on SMB 3.0. (

Sayantan Sur (Intel) right now is using NFS/IPoIB in their IB cluster. We advised him some tuning method on NFS/IPoIB, he was happy to get 100 times better bandwidth than before for small I/O size from 2MB to 200MB/s. He is thinking to move to NFSoRDMA once it's stable. When we talked about wireshark NFSoRDMA dissector, Doug Oucharek (Intel) mentioned that he had implemented some luster RDMA packet dissector for wireshark which is not upstream yet, discussed with him about luster RDMA packet dissector to see whether we can borrow some codes for dissect NFSoRDMA IB packets. Chuck and I also discussed with OFILG interoperability tester Edward Mossman (IOL) regarding adding more NFSoRDMA coverage into their test suites.

The OFA has moved from hardware vendor driven workshop to software driven since last year. Most of the attendees were and OpenFabrics software and application developers. Intel has the most attendees, more than 20 people came from HW, OpenFabrics Stack, HPC and other applications departments.

Topics could be related to NFSoRDMA in the future:
A new working group (OpenFabrics Interface OFI WG) is created, the goal is to minimize interfaces complexity and APIs overhead. The new framework was proposed to provide different fabric interfaces to hide different fabrics providers implementation. The OFI WG hosts weekly telecons every Tuesday, everyone is welcome. Sean Hefty (Intel) analyzed current stack APIs overhead and cache memory footprint, presented the interfaces framework in little bit detail, check his presentation:

VMware is working on virtualization support for host and guest service over RDMA. On guest it implements paravirtual vRDMA device support Verbs. Device is emulated in ESXi hypervisor. Guest physical memory regions are mapped to ESXi and passed down to physical RDMA HCA, DMA directly from/to guest physical memory. Guest on same host latency is about 20us.

Liran Liss from Mellanox gave a talk about RDMA on demand paging update, which intends to address RDMA memory registration challenge for the cost, the size, lock, sync. He proposed non-pinned memory region which requires OS PTE table changes. More details is here:
He also presented RDMA bonding approach from transport level. A sudo vHCA (vQP, vPD, vCQ, vMRs ...) is created to use for bonding (failure over and aggregation). So the bonding will be hardware independent. The detail of the proposal is as below, don't know how feasible to do it, and the outcome of performance. The sudo HCA driver idea is similar to VMware vRDMA driver.

Mellanox gave RoCE(RDMA over Converged Ethernet) v2 update -- IP routable packet format. RoCEv2 encapsulates IB packet to UDP packet, which has presented to IETF in Nov. 2013. This might introduce more challenge for Fabrics congestion control.

Developers are still complaining about usability (different vendors have different implementations) and RDMA scalability in the area of RDMA-CM, subnet manager, QP resources, memory registration... RDMA socket is still under discussion... Were they news to me after many years absent from RDMA :)

There are lots of other interesting application topics which I don't cover here. If you are interested, here is the link to the whole presentations:

Since the workshop, a bi-weekly conference call has been established, with developers from many companies and organizations participating.  Minutes from these calls are posted to the linux-rdma and linux-nfs mailing lists.   Minutes so far:

Code stability has been significantly improved, with increased testing by developers and bugfixes being merged.  Anna Schumaker of NetApp is now maintaining a git tree for NFSoRDMA, feeding up to the core NFS maintainers.

For folks wishing to get involved in development, see the NFSoRDMA client wiki page for more information.

Wednesday May 08, 2013

Ext4 filesystem workshop report by Mingming Cao

This is a contributed post from Mingming Cao, lead ext4 developer for the Oracle mainline Linux kernel team.

I attended the second ext4 workshop hosted at the third day of Linux Collaboration Summit 2013.  Participants included Google, RedHat, SuSE, Taobao, and Lustre. We had about 2-4 hours of good discussion about the roadmap of ext4 for next year.

Ext4 write stall issue

A write Stall issue was reported by MM folks found during page claim testing over ext4. There is lock contention in JBD2 between journal commit and new transaction, resulting blocking IOs waiting for locks. More precisely it is caused by do_get_write_access() will block at lock_buffer(). The problem is nothing new should be visible in ext3 too. But new kernel becomes more visitable. Ted has proposed two fixes 1) avoid calling lock_buffer() during do_get_write_acess() 2) adjust jbd2 to manage buffer_head itself to reduce latency. Fixing in JBD2 would be a big effort. Propose 1) sounds more reasonable to work with.  The first action is to mark metadata update with RED_* to avoid the priority disorder meanwhile looking at the block IO layer and see if there is a way to move blocking IOs to a separate queue.

DIO lock contention issue

Another topic brought up is the Direct IO locking contention issue.  On DIO read side there is already no lock hold, but only for pagesize=blocksize case. There is not a fundamental issue why the no lock for direct IO read is not possible for blocksize <Pagesize -- agree we should remove this limit. On the Direct IO write side, two proposals about concurrent direct IO writes. One is based on in memory extent status tree, similar to xfs does, which allows dio write to different range of file possible. Another proposal is the general VFS solution which lock the pages in range during direct IO write. This would benefit all filesystems, but has challenge of sorting out multiple locks orders.  Jan Kara had a LSF session for this in more details. Looks like this approach is more promising.

Extent tree disk layout

There is discussion about support true 64 bit ext4 filesystem (64bit inode number and 64 bit block number -- currently 32 bit inode number and 48 bit blocknumber) in order to scales well. The ext4 on disk extent structure could be extended to support larger file, such as 64-bits physical block, bigger logical block, and using cluster-size as unit in extent.  This is easy to handle in e2fsprogs, but change on disk extent tree format is quite tricky to play well with punch hole, truncate etc., which depends on extent tree format. One solution is to add an layer of extent tree abstraction in memory, but this considered a big effort.

This was not entirely impossible.Jan Kara is working on extent tree code clean up, trying to factor out some common code first and teach the block allocation related code doesn't have to reply on on disk extent format. This is not a high priority for now.

Fallocate performance issue

A performance problem has been reported with fallocate really large file. Ext4 multile block allocator code(mballoc) currently limits how large a chunk of blocks could be allocated at a time. Should able to hack mballoc at lest 16MB at a time, instead of 2MB a time.

This brought out another related mballoc limitation. At present the mballoc normalize the request size to the nearest power of 2, up to 1MB.  The original requirement for this is for raid alignment.  If we lift up this limitation, with non normalized request size, fallocate could be 3 times faster.  Most likely we will address this quickly.

Buddy bitmap flush from mem too quickly

Under some memory pressure test, the buddy bitmap used to guide ext4 block allocation was been pushed out from memory too quickly, even though mark page dirty doesn't strong enough -- talk to mm people about interface mark page access() interface alternate, which ended with agreement to use fadvise to mark the pages as metadata.

data=guarded journaling mode

Back to ext3 time when there is no delayed allocation, the fsync() performance is badly hurt by the data=ordered mode, which forces flush out the data first (might be entire filesystem dirty data) before commit a metadata update. There is proposal of data=guarded mode which protect data inconsistency issue upon power failure, but would result in much better fsync result. The basic idea is the isize update wont be updated until the data has flushed to disk. This would drop of difference between data=writeback mode and data=ordered mode.

At the meeting this journalling mode was brought up again to see if we need this for ext4. Given ext4 implemented delayed allocation, the fsync performance was much improved (no need to flush unrelated file data), due to the benefit of delayed allocation, so performance benefit is not so obvious. But the benefit of this new journalling mode would great help 1) unwritten extent conversion issue, so that we could have full dio read no lock implementation, 2) also get ride of extra journalling mode.

ext4 filesystem mount options

There is discussion of ext4 testing cost due to many many different combination of ext4 mount options (total 70). Part of the reason is distro is trying to just maintain ext4 filesystem for all three filesystem (ext2.3.4) there is effort to test and valid the ext4 module still work as expected when mounted as ext3 with different mount options.  A few important mount options which need special care/investigate including support for indirect-based/extent-based files; support for Asynchronous journal commit; data=journal and delayed allocation exclusive issue.

So short summary of next year ext4 development is to mostly focus on reduce latency, improve performance and code reorganization.

-- Mingming Cao

Sunday Mar 17, 2013

Welcome to the mainline Linux kernel blog

I'd like to welcome everyone to this new blog, where I'll be discussing what's happening with the Oracle mainline Linux kernel development team.   I'm James Morris, manager of the team.  I report to Wim Coekaerts.  I'm also the maintainer of the Linux kernel security subsystem, which I blog about separately here.

The Oracle mainline Kernel team works as part of the Linux community to develop new features and maintain existing kernel code.   Our work is contributed directly to the community under the GPL.  Several members of the team are established leaders of the kernel community.  The team has a strong history of Linux community contributions.

We also work with Oracle business stakeholders to ensure that Oracle Linux continues to be the best platform for their needs.   Oracle Linux users include:

  • Oracle Global IT, with 42,000 Linux servers, 84,000 internal users, and four million external users
  • Oracle software engineering, with more than 20,000 developers using Oracle Linux
  • Engineered systems teams at Oracle such as those behind the Exadata database machine
  • Commercial Oracle Linux customers

In this blog I'll be discussing new and interesting happenings with the team.  There'll be some technical deep dives and contributed posts from team members.  The primary audiences are technical Oracle Linux users and the wider Linux community.

The mainline team has been expanding significantly over the past year, and continues to do so.  We have several exciting initiatives under way, and when possible, I'll be covering those. 

Stay tuned :)


The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.

Our team is globally distributed and includes leading core kernel developers and industry veterans.

This blog is edited by James Morris <>


« July 2016