This is a contributed post by Oracle mainline Linux kernel developer, Liu Bo, who recently presented at the 2014 China Linux Storage and File System (CLSF) Workshop. This event was held on October 21st in Beijing.
Xue Jiufei from Huawei showed us their work on OCFS2 over the last year,
including bug fixes like "Packet loss when reconnect" and a few features
like "Range lock based on DLM" and "Self-heal when fault recover".
She also mentioned that Huawei already plans to use OCFS2 in production.
Highlights of "range lock based on DLM": it uses an internal red-black
locking tree in which writes have higher priority, but read ranges and write ranges
can merge together, and it does delayed unlock which is expected to improve
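To make the range-lock idea a bit more concrete, here is a minimal Python sketch. It is not the OCFS2 or DLM code: the class and method names are hypothetical, a plain list stands in for the red-black tree, and write priority, range merging and delayed unlock are all left out. It only shows the usual conflict rule, assumed here, that two overlapping ranges are compatible only when both are reads.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Range:
    start: int   # first byte covered (inclusive)
    end: int     # last byte covered (inclusive)
    write: bool  # True for a write lock, False for a read lock

    def overlaps(self, other):
        return self.start <= other.end and other.start <= self.end


@dataclass
class RangeLockTable:
    # A real implementation would keep these in a balanced tree;
    # a list keeps the sketch short (hypothetical simplification).
    held: list = field(default_factory=list)

    def try_lock(self, start, end, write):
        """Return the granted Range, or None if it conflicts."""
        new = Range(start, end, write)
        for r in self.held:
            # Conflict: the ranges overlap and at least one is a write.
            if r.overlaps(new) and (r.write or new.write):
                return None
        self.held.append(new)
        return new

    def unlock(self, granted):
        self.held.remove(granted)
```

With this table, two readers can hold overlapping ranges at the same time, while a writer is refused until the overlapping ranges are unlocked.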
Yu Chao from Samsung gave an F2FS update. F2FS is designed to be a
flash-friendly filesystem, and it overcomes some problems of older flash filesystems,
for example, the snowball effect of the wandering tree.
Most importantly, F2FS is much faster than other flash
filesystems; perhaps that's why it was merged into mainline so fast.
He talked about some details of F2FS, including its disk layout and core data
structures, and his work mainly focuses on bug fixes.
I held this slot and talked about updates in the last year, such as
"NO-HOLE", async metadata reclaim and the infamous bugs.
Several people were very interested in how the bug was nailed down
when I said that it was related to workqueue and was very difficult to reproduce.
An engineer from Fujitsu who has worked a lot on workqueue
thought that it should also be a workqueue bug, so we had a discussion on the
details behind the bug, and after he figured it out he said he would talk to
Memblaze is a company which focuses on flash storage. One of their
engineers shared with us their product, an AFA (All Flash Array).
He mainly talked about trends in current SSD-oriented file systems.
They're using NVDIMM in the AFA, but there are some challenges in using it on Linux,
because Linux has a heavy block layer and scalability problems. The bottlenecks are Linux's IO
latencies, context switch cost and interrupt handling: flash is so fast that
plenty of interrupts are sent to Linux.
Peng Tao from Primarydata talked about NFS updates in the last year. He
covered new features of NFSv4.2 and talked about pNFS's flex file layout and
nfsd's per-bucket spinlock.
Huawei's Hu Jianyang held this UBIFS (Unsorted Block Image File System) slot. He talked about UBIFS's infrastructure and how it differs from other UBI
upper layers. It acts similarly to an FTL: for example, it can read, write and erase. It can also map and unmap.
UBIFS has features like static wear-leveling, transparent compression
(lz4hc supported), and writeback, and it supports flash of up to 32G in size.
He also gave more details of UBIFS's FASTMAP feature. A normal UBIFS
mount needs a time-consuming probe when the flash media is fairly large;
fastmap addresses this scalability issue by scanning only a fixed number of blocks.
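The difference between the two mount paths can be pictured with a toy model in Python. This is only a sketch under my own assumptions, not the UBI/UBIFS code: flash is modeled as a list of dicts, the function names and the window size are made up for illustration, and the "fastmap" entry simply holds a saved copy of the block mapping.

```python
FASTMAP_WINDOW = 64  # illustrative value: how many blocks the fast path may scan


def full_scan(blocks):
    """Read every block header to rebuild the logical -> physical map."""
    mapping, reads = {}, 0
    for pnum, blk in enumerate(blocks):
        reads += 1
        if blk.get("lnum") is not None:
            mapping[blk["lnum"]] = pnum
    return mapping, reads


def fastmap_attach(blocks, window=FASTMAP_WINDOW):
    """Scan only the first `window` blocks looking for a saved map."""
    reads = 0
    for blk in blocks[:window]:
        reads += 1
        if "fastmap" in blk:
            # Found the saved map: no need to touch the rest of the flash.
            return dict(blk["fastmap"]), reads
    # No saved map in the window: fall back to the slow full scan.
    mapping, more = full_scan(blocks)
    return mapping, reads + more
```

In this model the cost of the fast path is bounded by the window size no matter how many blocks the flash has, while the full scan grows linearly with the media size.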
Fengguang Wu from Intel held this slot. He is the author of the 0day test system.
He said that this system runs thousands of VM test machines and tests
over 400 Linux kernel git trees.
Initially this test system could only do compile/build tests. When
errors occur, it automatically tries to git bisect to the buggy commit
and notifies the commit author and the maintainer of the subsystem.
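The idea behind bisecting to a buggy commit is a binary search over the history. Here is a minimal Python sketch of that search. It is not the 0day system's code or git's implementation: the function name and the `is_bad` callback are hypothetical, and it assumes a linear history with a single good-to-bad transition, where the first commit is known good and the last is known bad.

```python
def bisect_first_bad(commits, is_bad):
    """Return the first bad commit.

    commits: list ordered oldest to newest; commits[0] is known good
             and commits[-1] is known bad (assumption of this sketch).
    is_bad:  callback that builds/tests one commit and returns True
             when the error shows up.
    """
    lo, hi = 0, len(commits) - 1  # invariant: lo is good, hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return commits[hi]
```

Each step halves the remaining range, so only about log2(N) builds are needed for N commits, which is what makes automatic bisection practical for a test robot.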
The test system is also flexible: testcases are easy to add, and it now
supports performance tests.
For filesystems, it runs popular tests like xfstests, fsmark, etc.
Compared to other test systems, like the Open POSIX Test Suite, this test
system has much less code.
However, there are some issues. For example, testing random kernel
configs is not feasible because the number of combinations is huge, and a similar case
is handling filesystems' mount options and mkfs options.
Fengguang said that these need developers' help to filter out which
option combinations are needed.
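A small Python sketch shows why the number of combinations blows up and how a developer-supplied filter cuts it down. The option names and values below are placeholders I made up for illustration, not real mkfs or mount flags, and the helper functions are hypothetical rather than part of the 0day system.

```python
from itertools import product


def all_combinations(options):
    """Yield every combination of option values as a dict."""
    names = sorted(options)
    for values in product(*(options[n] for n in names)):
        yield dict(zip(names, values))


def filtered(options, keep):
    """Keep only the combinations a developer says are worth testing."""
    return [c for c in all_combinations(options) if keep(c)]


# Placeholder options: the count is the product of the list lengths,
# so every extra option multiplies the number of test runs.
OPTIONS = {
    "block_size": [1024, 2048, 4096],
    "journal": ["on", "off"],
    "compression": ["none", "lz4"],
}
```

Three small options already give 3 * 2 * 2 = 12 runs; a filter such as `lambda c: c["block_size"] == 4096` brings that down to 4.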
Xie Liang from Xiaomi mainly talked about their issues with ext4 and the
Linux block layer. They tried HBase and others to build storage for
their cloud services, like Micloud.
They found that local filesystem + bio has some problems. One is that ext4's
buffered IO latency is not good enough; it always fluctuates with the journal
enabled. ext4 developers suggested that they use no journal or an async journal,
and said that having a journal only benefits ext4's fsck speed. Another problem
is IO priority: the current IO schedulers
in Linux didn't perform well on their systems, whether cfq or
deadline. But a Taobao engineer suggested using their newly written
IO schedulers: one that mixes cfq and deadline, and another new
scheduler based on IOPS called TPPS.
Taobao's Liu Zheng also gave an update on ext4. Frankly, there were no new
kernel features in the last year, but some are planned. He talked
about encryption support in ext4, project quota, data block
checksums and reflink (from Oracle's Mingming Cao).
Someone asked why ext4 needs encryption: we already have
eCryptfs, so why not use it instead? Zheng answered, "Perhaps Google wants to use ext4's encryption
support for the Chromium OS user cache". [ED: see here]
Besides, e2fsprogs has a new compat feature, "sparse_super2", added by
the maintainer; it will be used on SMR disks.
e2fsprogs now supports metadata prefetch.
Asias He from OSv took this slot, and this was a very interesting topic.
He introduced OSv's infrastructure. It uses the BSD license, and it is meant to be
an OS for virtual machines in the cloud. It runs in a hypervisor, and the big difference is that it has a single
address space. There is no kernel space or user space; there are no processes,
only threads; there are no spinlocks, only lock-free mutexes. It also
supports zero-copy. With all of those features, it's very fast according
to benchmarks running memcached, Cassandra and Redis.
After that we had a long discussion of "container based docker" vs
"OSv"; the conclusion was that it's well-designed, promising and more secure.
-- Liu Bo