Ext4 filesystem workshop report by Mingming Cao
By Jamesmorris-Oracle on May 08, 2013
This is a contributed post from Mingming Cao, lead ext4 developer for the Oracle mainline Linux kernel team.
I attended the second ext4 workshop, held on the third day of the Linux Collaboration Summit 2013. Participants included Google, RedHat, SuSE, Taobao, and Lustre. We had about 2-4 hours of good discussion about the ext4 roadmap for the next year.
Ext4 write stall issue
A write stall issue was reported by the MM folks, found during page reclaim
testing over ext4. There is lock contention in JBD2 between journal
commit and new transactions, resulting in blocking IOs waiting for locks.
More precisely, it is caused by do_get_write_access() blocking at
lock_buffer(). The problem is nothing new and should be visible in ext3
too, but it has become more visible with newer kernels. Ted has proposed two fixes:
1) avoid calling lock_buffer() during do_get_write_access(); 2) adjust
jbd2 to manage buffer_head itself to reduce latency. Fixing this in JBD2
would be a big effort, so proposal 1) sounds more reasonable to work with.
The first action is to mark metadata updates with REQ_* flags to avoid the priority
inversion, while also looking at the block IO layer to see if there is a
way to move blocking IOs to a separate queue.
DIO lock contention issue
Another topic brought up was the direct IO (DIO) lock contention issue. On
the DIO read side there is already no lock held, but only for the
blocksize = pagesize case. There is no fundamental reason why lockless
direct IO reads are not possible for blocksize < pagesize, and we agreed that we
should remove this limit. On the direct IO write side, there are two proposals
for concurrent direct IO writes. One is based on the in-memory extent
status tree, similar to what xfs does, which allows DIO writes to different
ranges of a file to proceed concurrently. The other proposal is a general VFS solution
which locks the pages in the range during a direct IO write. This would benefit
all filesystems, but has the challenge of sorting out the ordering of multiple locks.
Jan Kara had an LSF session covering this in more detail. It looks like this approach is more promising.
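To make the goal of both proposals concrete, here is a minimal user-space sketch of range locking, where writers to disjoint ranges of a file do not block each other. It is only a toy model: the class name RangeLock and its methods are made up for illustration, and it is neither the extent status tree approach nor the VFS page-locking approach discussed at the workshop.

```python
import threading


class RangeLock:
    """Toy byte-range lock: writers to disjoint ranges proceed concurrently.

    Hypothetical illustration only; not kernel code.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # half-open (start, end) ranges currently locked

    def _overlaps(self, start, end):
        # Two half-open ranges overlap when each starts before the other ends.
        return any(s < end and start < e for s, e in self._held)

    def try_lock(self, start, end):
        """Lock [start, end) without blocking; return True on success."""
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def lock(self, start, end):
        """Block until [start, end) overlaps no held range, then lock it."""
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()
            self._held.append((start, end))

    def unlock(self, start, end):
        """Release a range previously locked with the same bounds."""
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()
```

With this model, two writers covering blocks [0, 4096) and [4096, 8192) both acquire their ranges immediately, while a third writer covering [2048, 6144) has to wait for both. A single per-file lock would serialize all three.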
Extent tree disk layout
There was discussion about supporting a true 64-bit ext4 filesystem (64-bit
inode numbers and 64-bit block numbers; currently inode numbers are 32 bits and
block numbers are 48 bits) in order to scale well. The ext4 on-disk extent
structure could be extended to support larger files, with 64-bit
physical block numbers, bigger logical block numbers, and the cluster size as the unit in
extents. This is easy to handle in e2fsprogs, but changing the on-disk extent
tree format is quite tricky to get right with punch hole, truncate, etc.,
which depend on the extent tree format. One solution is to add a layer of
extent tree abstraction in memory, but this is considered a big effort.
It is not entirely impossible, though. Jan Kara is working on extent tree code cleanup, trying to factor out some common code first and teach the block allocation related code not to rely on the on-disk extent format. This is not a high priority for now.
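For readers who want to see where the 48-bit limit comes from, the following sketch packs and unpacks an extent record in the current 12-byte layout. The field names follow the kernel's struct ext4_extent as I understand it; the helper names and the 4 KiB block size used for the size limits are assumptions for illustration.

```python
import struct

# Current on-disk extent record (little-endian, 12 bytes):
#   ee_block    u32  first logical block covered by the extent
#   ee_len      u16  number of blocks covered
#   ee_start_hi u16  high 16 bits of the physical block number
#   ee_start_lo u32  low 32 bits of the physical block number
EXTENT_FORMAT = "<IHHI"

BLOCK_SIZE = 4096  # assumed block size for the limits below


def pack_extent(logical, length, physical):
    """Pack one extent; the physical block number must fit in 48 bits."""
    if not 0 <= physical < (1 << 48):
        raise ValueError("physical block number does not fit in 48 bits")
    return struct.pack(EXTENT_FORMAT, logical, length,
                       physical >> 32, physical & 0xFFFFFFFF)


def unpack_extent(raw):
    """Return (logical, length, physical) from a 12-byte extent record."""
    logical, length, start_hi, start_lo = struct.unpack(EXTENT_FORMAT, raw)
    return logical, length, (start_hi << 32) | start_lo


# Limits implied by the field widths, under the assumed block size.
MAX_FS_BYTES = (1 << 48) * BLOCK_SIZE    # 2**60 bytes, i.e. 1 EiB
MAX_FILE_BYTES = (1 << 32) * BLOCK_SIZE  # 2**44 bytes, i.e. 16 TiB
```

The 32-bit logical block field is what bounds the file size, and the split 16+32 bit physical block field is what bounds the filesystem size, so widening either one changes the record layout that punch hole and truncate currently depend on.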
Fallocate performance issue
A performance problem has been reported when fallocating really large
files. The ext4 multiple block allocator code (mballoc) currently limits how
large a chunk of blocks can be allocated at a time. We should be able to
hack mballoc to allocate at least 16MB at a time, instead of 2MB at a time.
This brought up another related mballoc limitation. At present mballoc normalizes the request size to the nearest power of 2, up to 1MB. The original reason for this was RAID alignment. If we lift this limitation and allow non-normalized request sizes, fallocate could be 3 times faster. Most likely we will address this quickly.
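The two limits above can be illustrated with a little arithmetic. This is a rough sketch, not the mballoc code: the function names are hypothetical, and it assumes that normalization rounds the request up to the next power of 2 before applying the 1MB cap.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def normalize_request(size_bytes, cap_bytes=MIB):
    """Round a request up to the next power of two, capped at cap_bytes.

    Assumption: rounding is upward; the real allocator may differ.
    """
    if size_bytes <= 0:
        raise ValueError("request size must be positive")
    rounded = 1 << (size_bytes - 1).bit_length()
    return min(rounded, cap_bytes)


def allocation_rounds(file_bytes, chunk_bytes):
    """Number of allocator calls needed when each call yields one chunk."""
    return -(-file_bytes // chunk_bytes)  # ceiling division


# Preallocating a 10 GiB file at 2 MiB versus 16 MiB per allocation.
rounds_2m = allocation_rounds(10 * GIB, 2 * MIB)    # 5120 calls
rounds_16m = allocation_rounds(10 * GIB, 16 * MIB)  # 640 calls
```

Under these assumptions a 300 KiB request is normalized to 512 KiB, and a 3 MiB request is cut down to the 1 MiB cap, which is why a large fallocate ends up making many small allocations.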
Buddy bitmap flushed from memory too quickly
Under some memory pressure tests, the buddy bitmap used to guide ext4
block allocation was being pushed out of memory too quickly; marking the
page dirty is not strong enough to keep it in memory. We talked to the MM
people about an alternative to the mark_page_accessed() interface, which ended with an agreement to use fadvise to mark the pages.
data=guarded journaling mode
Back in the ext3 days, when there was no delayed allocation, fsync() performance was badly hurt by data=ordered mode, which forces the data to be flushed out first (possibly the dirty data of the entire filesystem) before committing a metadata update. There was a proposal for a data=guarded mode, which protects against data inconsistency upon power failure but would result in much better fsync performance. The basic idea is that i_size is not updated until the data has been flushed to disk. This would reduce the difference between data=writeback mode and data=ordered mode.
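The basic idea can be modelled in a few lines. This is a toy simulation under my own assumptions, not the proposed implementation: the class GuardedFile and its methods are hypothetical, and the model only tracks file sizes, not journal transactions.

```python
class GuardedFile:
    """Toy model of data=guarded: on-disk size trails unflushed data."""

    def __init__(self):
        self.in_memory_size = 0  # size seen by running processes
        self.on_disk_size = 0    # i_size as recorded on disk
        self._unflushed = []     # (offset, length) of writes not yet on disk

    def write(self, offset, length):
        """Buffer a write; only the in-memory size grows for now."""
        self._unflushed.append((offset, length))
        self.in_memory_size = max(self.in_memory_size, offset + length)

    def flush_data(self):
        """Data reaches disk; only now may the on-disk size advance."""
        for offset, length in self._unflushed:
            self.on_disk_size = max(self.on_disk_size, offset + length)
        self._unflushed.clear()

    def size_after_crash(self):
        """After a power failure, only the on-disk size survives."""
        return self.on_disk_size
```

In this model a crash between write() and flush_data() leaves the file at its old size, so the newly allocated blocks, which may still hold stale contents, are never exposed. That is the guarantee data=ordered provides, but without forcing a data flush before every metadata commit.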
At the meeting this journalling mode was brought up again to see if we need it for ext4. Since ext4 implements delayed allocation, fsync performance is already much improved (there is no need to flush unrelated file data), so the performance benefit is not so obvious. But this new journalling mode would greatly help with 1) the unwritten extent conversion issue, so that we could have a full lockless DIO read implementation, and 2) getting rid of an extra journalling mode.
ext4 filesystem mount options
There was discussion of the cost of ext4 testing due to the many different
combinations of ext4 mount options (70 in total). Part of the reason is that
distros are trying to maintain just the ext4 filesystem code for all three
filesystems (ext2/3/4), so there is an effort to test and validate that the ext4 module
still works as expected when mounted as ext3 with different mount
options. A few important mount options need special care or investigation, including support for indirect-based/extent-based files, support for asynchronous journal commit, and the mutual exclusion of data=journal and delayed
allocation.
So the short summary of next year's ext4 development is to focus mostly on reducing latency, improving performance, and code reorganization.
-- Mingming Cao