By jamesmorris on Apr 01, 2014
This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.
The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.
- Discussed the draft DIX passthrough interface. Based on Zach Brown's
suggestions last week, I rolled out a version of the patch with a statically
defined io extensions struct, and Martin Petersen said he'd try porting some
existing asmlib clients to use the new interface, with a few field-enlarging
tweaks. For the most part nobody objected; Al Viro said he had no problems
"yet" -- but I couldn't tell if he had no idea what I was talking about, or
if he was on board with the API. It was also suggested that I seek the
opinion of Michael Kerrisk (the manpages maintainer) about the API.
As for the actual implementation, there are plenty of holes in it that I
intend to fix this week. The NFS/CIFS developers I spoke to were generally
happy to hear that the storage side was finally starting to happen, and
that they could get to working on the net-fs side of things now.
Nicholas Bellinger noted that targetcli can create DIF disks even with the
fileio backend, so he suggested I play with that over scsi_debug.
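For readers unfamiliar with what DIX actually carries: each logical block
travels with an 8-byte T10 DIF tuple made of a 16-bit guard CRC, a 16-bit
application tag, and a 32-bit reference tag. The sketch below packs and checks
one such tuple. It illustrates the tuple only; it is not the io extensions
struct from the draft patch, and the function names are made up for this
example.

```python
import struct

T10_DIF_POLY = 0x8BB7  # generator polynomial of the T10 DIF guard CRC


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10 DIF polynomial (init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_dif_tuple(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Pack guard tag, application tag and reference tag, big-endian."""
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF  # Type 1 protection: low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref_tag)


def verify_dif_tuple(block: bytes, lba: int, pi: bytes) -> bool:
    """Recompute the guard and reference tags and compare with the tuple."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(block) and ref_tag == (lba & 0xFFFFFFFF)
```

A passthrough interface lets the application supply or inspect these tuples
itself instead of leaving them to the HBA and the disk.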
- A large part of LSF was taken up with the discussion of how to handle the
brave new world of weird storage devices. To recap: in the beginning,
software had to deal with the mechanical aspects of a rotating disk;
addressing had to be done in terms of cylinders, heads, and sectors (CHS).
This made it difficult to innovate drive mechanics, as it was impossible to
express things like variable zone density to existing software. SCSI
eliminated this pain by abstracting a disk into a big tub of consecutive
sectors, which simplified software quite a bit, though at some cost to
performance. But most programs weren't trying to wring the last iota of
performance out of disks and didn't care. So long as some attention was
paid to data locality, disks performed adequately.
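As a concrete reminder of the two addressing models, here is the conventional
mapping between a CHS triple and a linear sector number. The geometry values
passed in are whatever the drive reports; the function names are only
illustrative.

```python
def chs_to_lba(cyl, head, sector, heads_per_cyl, sectors_per_track):
    """Convert a CHS address to a linear block address (sectors count from 1)."""
    if cyl < 0 or not 0 <= head < heads_per_cyl:
        raise ValueError("cylinder or head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range")
    return (cyl * heads_per_cyl + head) * sectors_per_track + (sector - 1)


def lba_to_chs(lba, heads_per_cyl, sectors_per_track):
    """Inverse mapping: split a linear block address back into (C, H, S)."""
    cyl, rem = divmod(lba, heads_per_cyl * sectors_per_track)
    head, sector0 = divmod(rem, sectors_per_track)
    return cyl, head, sector0 + 1
```

Note that sectors_per_track is a single constant for the whole disk, which is
exactly why this scheme cannot describe variable zone density.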
Fast forward to 2014: now we have several different storage device classes:
Flash, which has no seek penalty but prefers large writeouts; SMR drives
with hard-disk seek penalties but requirements that all writes within a
~256MB zone be written in linear order; RAIDs, which by virtue of stripe
geometries violate some classic hard-disk assumptions; and NVMe devices
which implement atomic read and write operations. Dave Chinner suggests
that rather than retrofitting each filesystem to deal with each of these
devices, it might be worth shoving all the block allocation and mapping
operation down to a device mapper (dm) shim layer that can abstract away
different types of storage, leaving FSes to manage namespace information.
This suggestion is very attractive on a few levels: it would allow emulating
atomic reads/writes with
journalling, more flexible software-defined FTLs for flash and SMR, and
improved communication with cloud storage systems -- Mike Snitzer had a
session about dm-thinp and the proper way for FSes to communicate allocation
hints to the underlying storage; this would certainly seem to fit the bill.
I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to
implement with dm shims.
Unfortunately, the devil is in the details -- when will we see code? For
that reason, Ted Ts'o was openly skeptical.
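To make the shim idea more tangible, here is a toy model of it. It assumes a
device made of zones that only accept linear writes (as on SMR), and a shim
that keeps a logical-to-physical map so that callers can write anywhere. All
class and method names are hypothetical, nothing here corresponds to an
existing dm target, and garbage collection of stale blocks is left out.

```python
class Zone:
    """One SMR-style zone: blocks may only be written in linear order."""

    def __init__(self, first_block, num_blocks):
        self.first_block = first_block
        self.num_blocks = num_blocks
        self.write_ptr = 0  # offset of the next writable block in this zone

    def is_full(self):
        return self.write_ptr >= self.num_blocks

    def write(self, offset, data, media):
        if offset != self.write_ptr:
            raise OSError("non-sequential write within zone")
        if self.is_full():
            raise OSError("zone full")
        pblock = self.first_block + offset
        media[pblock] = data
        self.write_ptr += 1
        return pblock


class ShimDevice:
    """Hypothetical shim: owns block allocation and the logical->physical map."""

    def __init__(self, num_zones, blocks_per_zone):
        self.zones = [Zone(i * blocks_per_zone, blocks_per_zone)
                      for i in range(num_zones)]
        self.media = {}     # physical block -> data
        self.l2p = {}       # logical block -> physical block
        self.refcount = {}  # physical block -> number of logical users

    def _append(self, data):
        # Every write lands at the write pointer of the first non-full zone.
        for zone in self.zones:
            if not zone.is_full():
                return zone.write(zone.write_ptr, data, self.media)
        raise OSError("no free zone (this sketch has no garbage collection)")

    def _unmap(self, lblock):
        pblock = self.l2p.pop(lblock, None)
        if pblock is not None:
            self.refcount[pblock] -= 1

    def write(self, lblock, data):
        pblock = self._append(data)
        self._unmap(lblock)
        self.l2p[lblock] = pblock
        self.refcount[pblock] = 1

    def read(self, lblock):
        return self.media[self.l2p[lblock]]

    def reflink(self, src, dst):
        # Sharing is just a second map entry pointing at the same block.
        pblock = self.l2p[src]
        self._unmap(dst)
        self.l2p[dst] = pblock
        self.refcount[pblock] += 1
```

Because writes never overwrite in place, a later write to a reflinked block
simply remaps that one logical block and leaves the shared copy alone, which
is why reflink falls out of such a design almost for free.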
- The postgresql developers showed up to complain about stable pages and to
ask for a less heavyweight fsync() -- currently, when fsync is called, it
assumes that the caller wants all dirty data written out NOW, so it writes
dirty pages with WRITE_SYNC, which starves reads. For postgresql this is
suboptimal since fsync is typically called by the checkpointing code, which
is not latency-sensitive and doesn't care if fsync writeback is slow.
There was an interlock scheduled for Thursday afternoon, but I was unable to
attend. See LWN for more detailed coverage of the
postgresql (and FB) sessions.
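The access pattern at issue looks roughly like the following: a
checkpointer-style routine writes out a batch of dirty pages and then calls
fsync once at the end. This is only an illustration of the pattern, not
postgresql code; the function name is made up, and the 8192-byte page size is
assumed here because it is the postgresql default.

```python
import os


def checkpoint(path, dirty_pages, page_size=8192):
    """Write each dirty page at its offset, then flush everything with one fsync."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        for page_no, data in sorted(dirty_pages.items()):
            # Pad short pages with zeros so every write is a full page.
            os.pwrite(fd, data.ljust(page_size, b"\0"), page_no * page_size)
        # This is the call that currently forces urgent writeback, even though
        # the caller here would be happy with a slower, lower-priority flush.
        os.fsync(fd)
    finally:
        os.close(fd)
```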
- At the ext4 summit, we discussed a few cleanups, such as removing the use of
buffer_heads and the impending removal of the ext2/3 drivers. Removing
buffer_heads in the data path has the potential benefit that it'll make the
transition to supporting block/sector size > page size easier, as well as
reducing memory requirements (buffer heads are a heavyweight structure now).
There was also the feeling that once most enterprise distros move to ext4,
it will be a lot easier to remove ext3 upstream because there will be a lot
more testing of the use of ext4.ko to handle ext2/3 filesystems. There was
a discussion of removing ext2 as well, but that stalled because
Christoph Hellwig (hch) would like to see ext2 remain as a "sample"
filesystem, though Jan Kara could be heard muttering that nobody wants a
- The other major new ext4 feature discussed at the ext4 summit is
per-data-block metadata. This got started when Lukas Czerner (lukas) proposed adding
data block checksums to the filesystem. I quickly chimed in that for e2fsck
it would be helpful to have per-block back references to ease reconstruction
of the filesystem, at which point the group started thinking that rather
than a huge static array of block data, the complexity of a b-tree with
variable key size might well be worth the effort. Then again, with all the
proposed filesystem/block layer changes, Ted said that he might be open to a
quick v1 implementation because the block shim layer discussed in the SMR
forum could very well obviate the need for a lot of ext4 features. Time
will tell; Ted and I were not terribly optimistic that any of that software
is coming soon.
In any case, lukas went home to refine his proposal. The biggest problem
is ext4's current lack of a btree implementation; this would have to be
written or borrowed, and then tested. I mentioned to him that this could
be the cornerstone of reimplementing a lot of ext4 features with btrees
instead of static arrays, which could be a good thing if RH is willing to
spend a lot of engineering time on ext4.
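A rough sketch of what per-block metadata could look like, assuming each
record holds a data checksum plus a back reference (owning inode and logical
block). The record layout and all names are hypothetical and not taken from
lukas's proposal. A sorted key list stands in for the b-tree, and zlib's
CRC-32 stands in for whatever checksum would really be used.

```python
import bisect
import zlib
from collections import namedtuple

# Hypothetical per-block record: data checksum plus a back reference.
BlockMeta = namedtuple("BlockMeta", "checksum owner_ino logical_blk")


class BlockMetaIndex:
    """Sorted-key index standing in for a b-tree keyed by physical block."""

    def __init__(self):
        self._keys = []  # physical block numbers, kept sorted
        self._vals = []  # BlockMeta records, parallel to _keys

    def _find(self, pblk):
        i = bisect.bisect_left(self._keys, pblk)
        found = i < len(self._keys) and self._keys[i] == pblk
        return i, found

    def put(self, pblk, data, owner_ino, logical_blk):
        meta = BlockMeta(zlib.crc32(data), owner_ino, logical_blk)
        i, found = self._find(pblk)
        if found:
            self._vals[i] = meta
        else:
            self._keys.insert(i, pblk)
            self._vals.insert(i, meta)

    def get(self, pblk):
        i, found = self._find(pblk)
        return self._vals[i] if found else None

    def verify(self, pblk, data):
        """Check block contents against the stored checksum."""
        meta = self.get(pblk)
        return meta is not None and meta.checksum == zlib.crc32(data)

    def rebuild_file_maps(self):
        """Use the back references to recover inode -> {logical: physical}."""
        files = {}
        for pblk, meta in zip(self._keys, self._vals):
            files.setdefault(meta.owner_ino, {})[meta.logical_blk] = pblk
        return files
```

Unlike a static array sized for every block in the filesystem, an index like
this only stores records for blocks that are in use, and rebuild_file_maps
shows the kind of reconstruction that back references would give e2fsck.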
- Michael Halcrow, speaking at the ext4 summit, discussed implementing a
lightweight encrypted filesystem subtree feature. This sounds a lot like
ecryptfs, but hopefully less troublesome than the weird shim fs that
is ecryptfs. For the most part he seemed to need (a) the ability to inject
his code into the read/write path and (b) some ability to store a small amount
of per-inode encryption data. His use-case is Chrome OS, which apparently
needs the ability for cache management programs to erase parts of a(nother)
user's cache files without having the ability to access the file. The
discussion concluded that it wouldn't be too difficult for him to start an
initial implementation with ext4, but that much of this ought to be in the
[Ed: see also the LWN coverage of LSF/MM]