Tuesday Apr 01, 2014

LSF/MM 2014 and ext4 Summit Notes by Darrick Wong

This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.

The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.

  • Discussed the draft DIX passthrough interface. Based on Zach Brown's suggestions last week, I rolled out a version of the patch with a statically defined io extensions struct, and Martin Petersen said he'd try porting some existing asmlib clients to use the new interface, with a few field-enlarging tweaks. For the most part nobody objected; Al Viro said he had no problems "yet" -- but I couldn't tell if he had no idea what I was talking about, or if he was on board with the API. It was also suggested that I seek the opinion of Michael Kerrisk (the manpages maintainer) about the API. As for the actual implementation, there are plenty of holes in it that I intend to fix this week. The NFS/CIFS developers I spoke to were generally happy to hear that the storage side was finally starting to happen, and that they could get to working on the net-fs side of things now. Nicholas Bellinger noted that targetcli can create DIF disks even with the fileio backend, so he suggested I play with that over scsi_debug.

  • A large part of LSF was taken up with the discussion of how to handle the brave new world of weird storage devices. To recap: in the beginning, software had to deal with the mechanical aspects of a rotating disk; addressing had to be done in terms of cylinders, heads, and sectors (CHS). This made it difficult to innovate drive mechanics, as it was impossible to express things like variable zone density to existing software. SCSI eliminated this pain by abstracting a disk into a big tub of consecutive sectors, which simplified software quite a bit, though at some cost to performance. But most programs weren't trying to wring the last iota of performance out of disks and didn't care. So long as some attention was paid to data locality, disks performed adequately. Fast forward to 2014: now we have several different storage device classes: Flash, which has no seek penalty but prefers large writeouts; SMR drives with hard-disk seek penalties but requirements that all writes within a ~256MB zone be written in linear order; RAIDs, which by virtue of stripe geometries violate a few of the classic hard disk thinking; and NVMe devices which implement atomic read and write operations. Dave Chinner suggests that rather than retrofitting each filesystem to deal with each of these devices, it might be worth shoving all the block allocation and mapping operation down to a device mapper (dm) shim layer that can abstract away different types of storage, leaving FSes to manage namespace information. This suggestion is very attractive on a few levels: Benefits include the ability to emulate atomic read/writes with journalling, more flexible software-defined FTLs for flash and SMR, and improved communication with cloud storage systems -- Mike Snitzer had a session about dm-thinp and the proper way for FSes to communicate allocation hints to the underlying storage; this would certainly seem to fit the bill. I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to implement with dm shims. Unfortunately, the devil is in the details -- when will we see code? For that reason, Ted Ts'o was openly skeptical.

  • The postgresql developers showed up to complain about stable pages and to ask for a less heavyweight fsync() -- currently, when fsync is called, it assumes that the caller wants all dirty data written out NOW, so it writes dirty pages with WRITE_SYNC, which starves reads. For postgresql this is suboptimal since fsync is typically called by the checkpointing code, which doesn't need to be fast and doesn't care if fsync writeback is not fast. There was an interlock scheduled for Thursday afternoon, but I was unable to attend. See LWN for more detailed coverage of the postgresql (and FB) sessions.

  • At the ext4 summit, we discussed a few cleanups, such as removing the use of buffer_heads and the impending removal of the ext2/3 drivers. Removing buffer_heads in the data path has the potential benefit that it'll make the transition to supporting block/sector size > page size easier, as well as reducing memory requirements (buffer heads are a heavyweight structure now). There was also the feeling that once most enterprise distros move to ext4, it will be a lot easier to remove ext3 upstream because there will be a lot more testing of the use of ext4.ko to handle ext2/3 filesystems. There was a discussion of removing ext2 as well, though that stalled on concerns that Christoph Hellwig (hch) would like to see ext2 remain as a "sample" filesystem, though Jan Kara could be heard muttering that nobody wants a bitrotten example.

  • The other major new ext4 feature discussed at the ext4 summit is per-data block metadata. This got started when Lukas Czerner (lukas) proposed adding data block checksums to the filesystem. I quickly chimed in that for e2fsck it would be helpful to have per-block back references to ease reconstruction of the filesystem, at which point the group started thinking that rather than a huge static array of block data, the complexity of a b-tree with variable key size might well be worth the effort. Then again, with all the proposed filesystem/block layer changes, Ted said that he might be open to a quick v1 implementation because the block shim layer discussed in the SMR forum could very well obviate the need for a lot of ext4 features. Time will tell; Ted and I were not terribly optimistic that any of that software is coming soon. In any case, lukas went home to refine his proposal. The biggest problem is ext4's current lack of a btree implementation; this would have to be written or borrowed, and then tested. I mentioned to him that this could be the cornerstone of reimplementing a lot of ext4 features with btrees instead of static arrays, which could be a good thing if RH is willing to spend a lot of engineering time on ext4.

  • Michael Halcrow, speaking at the ext4 summit, discussed implementing a lightweight encrypted filesystem subtree feature. This sounds a lot like ecryptfs, but hopefully less troublesome than the weird shim fs that is ecryptfs. For the most part he seemed to need (a) the ability to inject his code into the read/write path and some ability to store a small amount of per-inode encryption data. His use-case is Chrome OS, which apparently needs the ability for cache management programs to erase parts of a(nother) user's cache files without having the ability to access the file. The discussion concluded that it wouldn't be too difficult for him to start an initial implementation with ext4, but that much of this ought to be in the VFS layer.

 -- Darrick

[Ed: see also the LWN coverage of LSF/MM]

Monday Sep 02, 2013

IETF 87 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

The 87th meeting of the IETF was held July 28 - August 2 in Berlin, Germany.

I was in Berlin for the week to attend the NFSv4 Working Group meeting and hold informal discussions related to NFS standardization with other attendees. The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet. Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world. Sponsorship for each event varies. DENIC, the central registry for domain names under .de, was the primary sponsor for this event. Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing. The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions. More on the Working Group charter can be found here: http://datatracker.ietf.org/wg/nfsv4/charter/

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS. I'm the editor of two of the IETF FedFS protocol specifications, and a co-author of an Internet-Draft that addresses protocol issues affecting NFSv4 migration. Other representatives at this meeting include Microsoft, EMC, NetApp, IBM, Oracle, Tonian, and others. Topics include progress updates on Internet-Drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work. See: https://datatracker.ietf.org/meeting/87/materials.html#nfsv4

Meeting agenda, presentation materials, and minutes are available at this location.

Drill down:

Working Group editor Tom Haynes (NetApp) reported on several areas where progress appears to be stalled. In general we face challenges completing our deliverables because the IETF is a volunteer organization, and the tasks at hand are large. The largest item is RFC 3530bis, which is holding up FedFS and NFSv4.2. RFC 3530bis was rejected during IESG review mainly due to the new chapter that attempts to bridge the gap between existing i18n implementations in NFS, and how we'd like i18n to work.

The problem is nobody has implemented i18n for NFSv4, and the IETF has revised i18n since 3530 was ratified. The consensus was to move the offending section to a separate Internet-Draft where the correct language can be hammered out without holding up RFC 3530bis. NFSv4.2 is held up by a lack of enthusiasm for finishing a new revision of RPCSEC GSS. The GSS I-D has languished without an author or editor for many months, and two items in NFSv4.2 depend on its completion: labeled NFS and server-to-server copy. A rough consensus was not achieved, but Tom and Andy Adamson (NetApp) will investigate options, including removing the parts of server copy and labeled NFS that depend on GSSv3, and report back.

Benny Halevy (Tonian) has submitted a fresh draft proposing "Flexible File Layouts" which is a new pNFS layout type that improves upon the existing pNFS file layout defined in RFC 5661. Motivation for a new layout scheme includes: algorithmic data striping to support load balancing, life-cycle management, and other advanced administrative features; support for using legacy NFS servers as pNFS data servers; and direct pNFS support for existing cluster filesystems such as Ceph and GlusterFS.

Chuck Lever (Oracle) described recent progress to address security concerns in the FedFS documents waiting in the RFC Editor queue. He continued by walking through a group of possible future work items, including more modern LDAP security modes, additional administrative operations, and better mechanisms for clients to choose working fileset locations. Does the working group have the energy to consider a new revision of these documents? Or should we continue to focus on making small changes? This was left unresolved.

Sorin Faibish (EMC) discussed the need for a new layout enabling pNFS clients to access Lustre data servers directly. After a lot of discussion, the issue appears to be that the NFS protocol on high performance transports is not performant enough. The proposed solution was to use LNET over RDMA. It was suggested that it would be more interesting to the Working Group if we focused on fixing the performance issues in our RDMA specifications instead.

Marc Eshel (IBM) wanted to restart the age-old conversation on tightening NFS's data cache coherency. The immediate question is whether POSIX semantics are interesting given today's compute workloads and network environment. Implementing POSIX data coherency among multiple networked systems is still a challenge. Consensus that a callback-based solution, where network traffic is proportional to the level of inter-client sharing, was most appropriate. Such a solution (byte-range delegation) was proposed by Trond Myklebust in 2006. It was recommended to start with that work.

Chuck Lever (Oracle) proposed an experimental extension to NFS that enables NFS client and servers to convey end-to-end data integrity metadata. A new I-D has been submitted that describes the protocol changes. No prototype is available yet; the I-D is meant to coordinate discussion of technical details, and enable interoperable prototype implementations.

David Noveck (EMC) elaborated on the need to allow protocol changes outside of the NFS minor version process. He described the limitations of batching unrelated features together and waiting for a full pass through the IETF review process. There was some interest in allowing innovation outside of the minor version process. The Area Directory and Working Group chair felt that there is currently not enough energy behind work already planned for delivery.

Matt Benjamin (Linux Box) is restarting work on a feature proposed several years ago by Mike Eisler that allows directories to be striped across pNFS data servers, just like file data is today. An Internet-Draft is available, and a prototype is underway.

-- Chuck Lever

Wednesday May 22, 2013

Oracle Linux Kernel Developers Speaking at LinuxCon Japan 2013

LinuxCon Japan 2013 is happening next week (May 29-31) in Tokyo, and several members of the Oracle mainline Linux kernel team are speaking:

For those attending, feel free to come and chat with us during the conference.

We'll post links to slides once they're available.

Wednesday May 08, 2013

Ext4 filesystem workshop report by Mingming Cao

This is a contributed post from Mingming Cao, lead ext4 developer for the Oracle mainline Linux kernel team.

I attended the second ext4 workshop hosted at the third day of Linux Collaboration Summit 2013.  Participants included Google, RedHat, SuSE, Taobao, and Lustre. We had about 2-4 hours of good discussion about the roadmap of ext4 for next year.

Ext4 write stall issue

A write Stall issue was reported by MM folks found during page claim testing over ext4. There is lock contention in JBD2 between journal commit and new transaction, resulting blocking IOs waiting for locks. More precisely it is caused by do_get_write_access() will block at lock_buffer(). The problem is nothing new should be visible in ext3 too. But new kernel becomes more visitable. Ted has proposed two fixes 1) avoid calling lock_buffer() during do_get_write_acess() 2) adjust jbd2 to manage buffer_head itself to reduce latency. Fixing in JBD2 would be a big effort. Propose 1) sounds more reasonable to work with.  The first action is to mark metadata update with RED_* to avoid the priority disorder meanwhile looking at the block IO layer and see if there is a way to move blocking IOs to a separate queue.

DIO lock contention issue

Another topic brought up is the Direct IO locking contention issue.  On DIO read side there is already no lock hold, but only for pagesize=blocksize case. There is not a fundamental issue why the no lock for direct IO read is not possible for blocksize <Pagesize -- agree we should remove this limit. On the Direct IO write side, two proposals about concurrent direct IO writes. One is based on in memory extent status tree, similar to xfs does, which allows dio write to different range of file possible. Another proposal is the general VFS solution which lock the pages in range during direct IO write. This would benefit all filesystems, but has challenge of sorting out multiple locks orders.  Jan Kara had a LSF session for this in more details. Looks like this approach is more promising.

Extent tree disk layout

There is discussion about support true 64 bit ext4 filesystem (64bit inode number and 64 bit block number -- currently 32 bit inode number and 48 bit blocknumber) in order to scales well. The ext4 on disk extent structure could be extended to support larger file, such as 64-bits physical block, bigger logical block, and using cluster-size as unit in extent.  This is easy to handle in e2fsprogs, but change on disk extent tree format is quite tricky to play well with punch hole, truncate etc., which depends on extent tree format. One solution is to add an layer of extent tree abstraction in memory, but this considered a big effort.

This was not entirely impossible.Jan Kara is working on extent tree code clean up, trying to factor out some common code first and teach the block allocation related code doesn't have to reply on on disk extent format. This is not a high priority for now.

Fallocate performance issue

A performance problem has been reported with fallocate really large file. Ext4 multile block allocator code(mballoc) currently limits how large a chunk of blocks could be allocated at a time. Should able to hack mballoc at lest 16MB at a time, instead of 2MB a time.

This brought out another related mballoc limitation. At present the mballoc normalize the request size to the nearest power of 2, up to 1MB.  The original requirement for this is for raid alignment.  If we lift up this limitation, with non normalized request size, fallocate could be 3 times faster.  Most likely we will address this quickly.

Buddy bitmap flush from mem too quickly

Under some memory pressure test, the buddy bitmap used to guide ext4 block allocation was been pushed out from memory too quickly, even though mark page dirty doesn't strong enough -- talk to mm people about interface mark page access() interface alternate, which ended with agreement to use fadvise to mark the pages as metadata.

data=guarded journaling mode

Back to ext3 time when there is no delayed allocation, the fsync() performance is badly hurt by the data=ordered mode, which forces flush out the data first (might be entire filesystem dirty data) before commit a metadata update. There is proposal of data=guarded mode which protect data inconsistency issue upon power failure, but would result in much better fsync result. The basic idea is the isize update wont be updated until the data has flushed to disk. This would drop of difference between data=writeback mode and data=ordered mode.

At the meeting this journalling mode was brought up again to see if we need this for ext4. Given ext4 implemented delayed allocation, the fsync performance was much improved (no need to flush unrelated file data), due to the benefit of delayed allocation, so performance benefit is not so obvious. But the benefit of this new journalling mode would great help 1) unwritten extent conversion issue, so that we could have full dio read no lock implementation, 2) also get ride of extra journalling mode.

ext4 filesystem mount options

There is discussion of ext4 testing cost due to many many different combination of ext4 mount options (total 70). Part of the reason is distro is trying to just maintain ext4 filesystem for all three filesystem (ext2.3.4) there is effort to test and valid the ext4 module still work as expected when mounted as ext3 with different mount options.  A few important mount options which need special care/investigate including support for indirect-based/extent-based files; support for Asynchronous journal commit; data=journal and delayed allocation exclusive issue.

So short summary of next year ext4 development is to mostly focus on reduce latency, improve performance and code reorganization.

-- Mingming Cao

Thursday Mar 21, 2013

IETF 86 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

On Monday (11th March) I attended the IETF NFSv4 Working Group meeting at IETF 86 in Orlando, Florida.

The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet.  Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world.  Sponsorship for each event varies.  This event was sponsored by Comcast and NBCUniversal.  Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing.  The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions.  More on the Working Group charter can be found here:


I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS.  I'm the editor of two of the IETF FedFS protocol specifications, and a co-author on a draft that discusses experience implementing NFSv4 migration.

Other representatives at this meeting include Microsoft, EMC, NetApp, Oracle, Panasas, and others.  Topics include progress updates on drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work.  See:


Meeting slides are available now at this location.  Minutes are coming soon.

Drill down:

Tom Haynes (NetApp) reported on progress with RFC 3530bis, a refreshed specification for NFSv4.0.  This document has passed the Area Director check, and is ready for IESG review.  This document is a top priority because other unfinished documents which cite this document are held up waiting for its completion.

Labeled NFS, a part of the forthcoming NFSv4.2 protocol, has a Linux prototype that was demonstrated at Connectathon last month.

The RPCSEC_GSSv3 standard has not made progress, but an editor (Dros Adamson) was assigned during IETF 85.  This document is blocking progress on NFSv4.2.

The NFSv4.2 draft is in WG last call, which ends today (Monday, March 11).  No new issues were raised, so the Working Group chair will move this forward.

Tom Haynes presented a brief set of slides on how NFSv3 client should interpret the presence of AUTH_NONE in the list of security flavors a server supports.  There was never a formal standard describing this, and now we need an interoperability document.  As we explore this issue we may discover some real problems.  A fresh draft was requested.

Dave Noveck (EMC) discussed progress on the draft co-authored with Bill Baker, Piyush Shivam, and myself on NFSv4 migration issues.  As part of the discussion, we visited the issue of how to prevent client progress when a server freezes open and lock state before a migration.  Adding a new error code was mentioned, but that is against the minor version rules and would cause interoperability problems with clients that don't recognize the new error code.  Otherwise we have the procedural issues taken care of to advance this document to become an informational RFC.

A draft covering NFSv4.1 migration issues would probably not be needed, as the changes are small and could be covered in an RFC 5661bis, when it is opened.  There doesn't seem to be urgency here.

Chuck Lever (Oracle) described implementation experience with the recommendations of Dave's migration draft.  The experience arises from the Linux Uniform Client String changes Chuck has done, and a number of items discovered by the Solaris NFS team.

Chuck Lever reported on progress with the FedFS draft standards.  Short story: They are in the RFC Editor queue awaiting completion of RFC 3530bis.

Trond Myklebust (NetApp) presented an issue with NFSv4.1 session slot table management that he also has reported at Connectathon.  It was agreed that an errata to RFC 5661 would be produced that describes how implementations will add missing behavior.  No on-the-wire protocol changes.

Matt Benjamin (The Linux Box) requested a revisit of a 2008 proposal by Mike Eisler to stripe POSIX directories across multiple data servers.  An algorithm would generate an offset into a table of device IDs, which would indicate on which data server to find a directory entry.  Matt claimed there would be changes to the proposal to deal with Ceph and CohortFS.  Chair requested a draft, Matt to deliver soon.

Chuck Lever asked if we still need an NFS-specific mechanism for provisioning NFSv4 ID domain names.  The feeling is that this domain name is determined by the system's authentication service, not by NFS, so NFS should not have its own way to set this.  Consensus that there is some work to do here, and it should be done under umbrella of the ongoing multi-domain work.

Spencer Shepler (Microsoft) closed the meeting with a house-keeping item.  There is some desire to reduce travel by moving more work to the mailing list.  The plan is to ask about agenda items for a meeting before requesting a meeting slot at the next IETF.

Several folks wanted to discuss Bill Baker's micro-versioning proposal.  Dave Noveck stated the problem this way: NFSv4.1 is a heavyweight minor version with a bunch of features, so fixes for 4.0 aren't possible with our spiffy minor versioning scheme.  Spencer felt we should visit this only when we encounter a problem we must address with major protocol changes.  The room was divided; some felt waiting was best so that a problem statement can be formulated; others were concerned that it was almost certain we would need to alter the NFSv4.0 XDR at some point, and we should start working this out now.

In the near term, protocol issues should come to the mailing list sooner rather than later so we can work them out together.

-- Chuck Lever

Sunday Mar 17, 2013

Welcome to the mainline Linux kernel blog

I'd like to welcome everyone to this new blog, where I'll be discussing what's happening with the Oracle mainline Linux kernel development team.   I'm James Morris, manager of the team.  I report to Wim Coekaerts.  I'm also the maintainer of the Linux kernel security subsystem, which I blog about separately here.

The Oracle mainline Kernel team works as part of the Linux community to develop new features and maintain existing kernel code.   Our work is contributed directly to the community under the GPL.  Several members of the team are established leaders of the kernel community.  The team has a strong history of Linux community contributions.

We also work with Oracle business stakeholders to ensure that Oracle Linux continues to be the best platform for their needs.   Oracle Linux users include:

  • Oracle Global IT, with 42,000 Linux servers, 84,000 internal users, and four million external users
  • Oracle software engineering, with more than 20,000 developers using Oracle Linux
  • Engineered systems teams at Oracle such as those behind the Exadata database machine
  • Commercial Oracle Linux customers

In this blog I'll be discussing new and interesting happenings with the team.  There'll be some technical deep dives and contributed posts from team members.  The primary audiences are technical Oracle Linux users and the wider Linux community.

The mainline team has been expanding significantly over the past year, and continues to do so.  We have several exciting initiatives under way, and when possible, I'll be covering those. 

Stay tuned :)


The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.

Our team is globally distributed and includes leading core kernel developers and industry veterans.

This blog is maintained by James Morris <james.l.morris@oracle.com>


« April 2014