
LSF/MM 2014 and ext4 Summit Notes by Darrick Wong

This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.

The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.

  • Discussed the draft DIX passthrough interface. Based on Zach Brown's
    suggestions last week, I rolled out a version of the patch with a statically
    defined io extensions struct, and Martin Petersen said he'd try porting some
    existing asmlib clients to use the new interface, with a few field-enlarging
    tweaks. For the most part nobody objected; Al Viro said he had no problems
    "yet" -- but I couldn't tell if he had no idea what I was talking about, or
    if he was on board with the API. It was also suggested that I seek the
    opinion of Michael Kerrisk (the manpages maintainer) about the API.
    As for the actual implementation, there are plenty of holes in it that I
    intend to fix this week. The NFS/CIFS developers I spoke to were generally
    happy to hear that the storage side was finally starting to happen, and
    that they could get to working on the net-fs side of things now.
    Nicholas Bellinger noted that targetcli can create DIF disks even with the
    fileio backend, so he suggested I play with that over scsi_debug.
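
    For readers unfamiliar with what such an interface carries: T10 DIF
    attaches an 8-byte protection tuple to each 512-byte sector (a 16-bit guard
    CRC, a 16-bit application tag, and a 32-bit reference tag). The sketch
    below only illustrates how a userspace program might compute those tuples.
    It assumes Type 1 protection (reference tag = low 32 bits of the LBA), and
    the function names are hypothetical. The draft passthrough interface itself
    is not shown here.

```python
import struct

SECTOR_SIZE = 512        # assumed logical block size
T10_DIF_POLY = 0x8BB7    # generator polynomial of the T10 DIF guard CRC


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10 DIF polynomial (init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protection_tuples(data: bytes, start_lba: int, app_tag: int = 0) -> list:
    """Return one 8-byte big-endian (guard, app tag, ref tag) tuple per sector.

    Hypothetical helper: assumes Type 1 protection, where the reference tag
    is the low 32 bits of the sector's LBA.
    """
    if len(data) % SECTOR_SIZE:
        raise ValueError("data must be a whole number of sectors")
    tuples = []
    for offset in range(0, len(data), SECTOR_SIZE):
        sector = data[offset:offset + SECTOR_SIZE]
        lba = start_lba + offset // SECTOR_SIZE
        tuples.append(struct.pack(">HHI", crc16_t10dif(sector), app_tag,
                                  lba & 0xFFFFFFFF))
    return tuples
```

    A caller would hand the data buffer and the concatenated tuples to the
    kernel together, so that the HBA or disk can reject a write whose guard or
    reference tag does not match.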

  • A large part of LSF was taken up with the discussion of how to handle the
    brave new world of weird storage devices. To recap: in the beginning,
    software had to deal with the mechanical aspects of a rotating disk;
    addressing had to be done in terms of cylinders, heads, and sectors (CHS).
    This made it difficult to innovate drive mechanics, as it was impossible to
    express things like variable zone density to existing software. SCSI
    eliminated this pain by abstracting a disk into a big tub of consecutive
    sectors, which simplified software quite a bit, though at some cost to
    performance. But most programs weren't trying to wring the last iota of
    performance out of disks and didn't care. So long as some attention was
    paid to data locality, disks performed adequately.
    Fast forward to 2014: now we have several different storage device classes:
    Flash, which has no seek penalty but prefers large writeouts; SMR drives,
    which keep hard-disk seek penalties and also require that all writes within
    a ~256MB zone be issued in linear order; RAIDs, whose stripe geometries
    violate some classic hard-disk assumptions; and NVMe devices, which
    implement atomic read and write operations. Dave Chinner suggests
    that rather than retrofitting each filesystem to deal with each of these
    devices, it might be worth shoving all the block allocation and mapping
    operation down to a device mapper (dm) shim layer that can abstract away
    different types of storage, leaving FSes to manage namespace information.
    This suggestion is attractive on a few levels. Benefits include the
    ability to emulate atomic reads and writes with journalling, more flexible
    software-defined FTLs for flash and SMR, and improved communication with
    cloud storage systems. Mike Snitzer had a session about dm-thinp and the
    proper way for FSes to communicate allocation hints to the underlying
    storage; this would certainly seem to fit the bill.
    I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to
    implement with dm shims.
    Unfortunately, the devil is in the details -- when will we see code? For
    that reason, Ted Ts'o was openly skeptical.
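
    To make the shim idea concrete, here is a toy model. It is not code that
    anyone presented, and every class and method name is made up for
    illustration. The zone accepts writes only at its write pointer, as an SMR
    zone would; the shim keeps a logical-to-physical map, so a filesystem
    above it can overwrite any logical block while the device only ever sees
    appends. Because every write already lands in a fresh location, a
    reflink-style clone reduces to copying one map entry.

```python
class SequentialZone:
    """Toy SMR zone: a write must land exactly at the write pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = []          # len(self.blocks) acts as the write pointer

    def write(self, offset, data):
        if offset != len(self.blocks):
            raise ValueError("writes must land at the zone's write pointer")
        if offset >= self.capacity:
            raise ValueError("zone is full")
        self.blocks.append(data)


class RemapShim:
    """Hypothetical shim: presents random-write blocks on append-only zones."""

    def __init__(self, zone_count, zone_capacity):
        self.zones = [SequentialZone(zone_capacity) for _ in range(zone_count)]
        self.open_zone = 0
        self.mapping = {}         # logical block -> (zone index, offset)

    def write(self, logical, data):
        zone = self.zones[self.open_zone]
        if len(zone.blocks) == zone.capacity:
            if self.open_zone + 1 == len(self.zones):
                # A real shim would garbage-collect stale blocks here.
                raise RuntimeError("out of zones")
            self.open_zone += 1
            zone = self.zones[self.open_zone]
        offset = len(zone.blocks)
        zone.write(offset, data)
        # Any older copy of this logical block simply becomes stale.
        self.mapping[logical] = (self.open_zone, offset)

    def read(self, logical):
        zone_index, offset = self.mapping[logical]
        return self.zones[zone_index].blocks[offset]

    def clone(self, src, dst):
        """Reflink-style copy: both logical blocks share one physical block."""
        self.mapping[dst] = self.mapping[src]
```

    The model leaves out everything that makes the real problem hard: garbage
    collection of stale blocks, persisting the map, and crash consistency.
    That is where the "devil is in the details" concern applies.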

  • The postgresql developers showed up to complain about stable pages and to
    ask for a less heavyweight fsync(). Currently, the kernel assumes that an
    fsync caller wants all dirty data written out NOW, so it writes dirty pages
    with WRITE_SYNC, which starves reads. For postgresql this is suboptimal,
    since fsync is typically called by the checkpointing code, which doesn't
    care whether fsync writeback is fast.
    There was an interlock scheduled for Thursday afternoon, but I was unable to
    attend. See LWN for more detailed coverage of the postgresql (and FB)
    sessions.

  • At the ext4 summit, we discussed a few cleanups, such as removing the use of
    buffer_heads and the impending removal of the ext2/3 drivers. Removing
    buffer_heads in the data path has the potential benefit that it'll make the
    transition to supporting block/sector size > page size easier, as well as
    reducing memory requirements (buffer heads are a heavyweight structure now).
    There was also the feeling that once most enterprise distros move to ext4,
    it will be a lot easier to remove ext3 upstream because there will be a lot
    more testing of the use of ext4.ko to handle ext2/3 filesystems. There was
    a discussion of removing ext2 as well, but that stalled on concerns that
    Christoph Hellwig (hch) would like to see ext2 remain as a "sample"
    filesystem. Jan Kara could be heard muttering that nobody wants a
    bitrotten example.

  • The other major new ext4 feature discussed at the ext4 summit is per-data
    block metadata. This got started when Lukas Czerner (lukas) proposed adding
    data block checksums to the filesystem. I quickly chimed in that for e2fsck
    it would be helpful to have per-block back references to ease reconstruction
    of the filesystem, at which point the group started thinking that rather
    than a huge static array of block data, the complexity of a b-tree with
    variable key size might well be worth the effort. Then again, with all the
    proposed filesystem/block layer changes, Ted said that he might be open to a
    quick v1 implementation because the block shim layer discussed in the SMR
    forum could very well obviate the need for a lot of ext4 features. Time
    will tell; Ted and I were not terribly optimistic that any of that software
    is coming soon.
    In any case, lukas went home to refine his proposal. The biggest problem
    is ext4's current lack of a btree implementation; this would have to be
    written or borrowed, and then tested. I mentioned to him that this could
    be the cornerstone of reimplementing a lot of ext4 features with btrees
    instead of static arrays, which could be a good thing if RH is willing to
    spend a lot of engineering time on ext4.
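
    As a rough picture of what per-block metadata could hold, the sketch below
    keeps one record per physical block: a checksum plus a back reference
    (owning inode and logical offset). This is my illustration, not lukas's
    proposal. The record layout and all names are hypothetical, a sorted list
    searched with bisect stands in for the on-disk btree, and zlib.crc32
    stands in for whatever checksum ext4 would actually use.

```python
import bisect
import zlib
from collections import namedtuple

# Hypothetical per-block record: checksum plus a back reference to the owner.
BlockRecord = namedtuple("BlockRecord",
                         "block checksum owner_inode logical_offset")


class BlockMetadataIndex:
    """Ordered index keyed by physical block number (stand-in for a btree)."""

    def __init__(self):
        self._keys = []      # sorted physical block numbers
        self._records = []   # records, parallel to self._keys

    def insert(self, block, data, owner_inode, logical_offset):
        record = BlockRecord(block, zlib.crc32(data), owner_inode,
                             logical_offset)
        i = bisect.bisect_left(self._keys, block)
        if i < len(self._keys) and self._keys[i] == block:
            self._records[i] = record        # block was rewritten
        else:
            self._keys.insert(i, block)
            self._records.insert(i, record)

    def lookup(self, block):
        i = bisect.bisect_left(self._keys, block)
        if i < len(self._keys) and self._keys[i] == block:
            return self._records[i]
        return None

    def verify(self, block, data):
        """Data block checksum check: does data match the stored checksum?"""
        record = self.lookup(block)
        return record is not None and record.checksum == zlib.crc32(data)

    def rebuild_file_map(self, inode):
        """What a fsck could do with back references alone: recover the
        (logical offset, physical block) pairs that belong to one inode."""
        return sorted((r.logical_offset, r.block)
                      for r in self._records if r.owner_inode == inode)
```

    The rebuild_file_map() step is the part that matters for e2fsck: if an
    inode's extent tree is destroyed, the back references alone are enough to
    reconstruct which blocks belonged to it and in what order.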

  • Michael Halcrow, speaking at the ext4 summit, discussed implementing a
    lightweight encrypted filesystem subtree feature. This sounds a lot like
    ecryptfs, but hopefully less troublesome than the weird shim fs that
    is ecryptfs. For the most part he seemed to need (a) the ability to inject
    his code into the read/write path and (b) some ability to store a small
    amount of per-inode encryption data. His use-case is Chrome OS, which
    apparently needs the ability for cache management programs to erase parts
    of a(nother) user's cache files without having the ability to access the
    files. The discussion concluded that it wouldn't be too difficult for him
    to start an initial implementation with ext4, but that much of this ought
    to be in the VFS layer.

 -- Darrick

[Ed: see also the LWN coverage of LSF/MM]
