Monday Nov 17, 2014

China Linux Storage and File System (CLSF) Workshop 2014

This is a contributed post by Oracle mainline Linux kernel developer, Liu Bo, who recently presented at the 2014 China Linux Storage and File System (CLSF) Workshop.  This event was held on October 21st in Beijing. 


Xue Jiufei from Huawei presented their OCFS2 work from the last year, including bug fixes such as "Packet loss when reconnect" and a few features such as "Range lock based on DLM" and "Self-heal when fault recover".

She also mentioned that Huawei already plans to use OCFS2 in production.

Highlights of "Range lock based on DLM": locked ranges are tracked internally in a red-black tree; writes have higher priority, but read ranges and write ranges can be merged; and unlocking is delayed, which is expected to improve unlock performance.


Yu Chao from Samsung gave an F2FS update. F2FS is designed to be a flash-friendly filesystem, and it overcomes some problems of older flash filesystems, for example the snowball effect of the wandering tree. Most importantly, F2FS is much faster than other flash filesystems; perhaps that's why it was merged into mainline so quickly.

He talked about some details of F2FS, including its disk layout and core data structures; his own work mainly focuses on bug fixes.


I held the Btrfs slot and talked about updates from the last year, such as "NO-HOLE", async metadata reclaim, and the infamous bugs. Several people were very interested in how one bug was nailed down when I said that it is related to workqueues and very difficult to reproduce.

An engineer from Fujitsu who has worked a lot on workqueues thought that it should also be treated as a workqueue bug. We discussed the details behind the bug, and once he had figured it out he said he would talk to the workqueue maintainer.


Memblaze is a company that focuses on flash storage. One of their engineers presented their product, an AFA (All Flash Array). He mainly talked about trends in SSD-oriented file systems. They use NVDIMM in the AFA, but using it on Linux poses some challenges because of the heavy block layer and scalability problems: the bottlenecks are Linux's I/O latency, context-switch cost, and interrupt handling, since flash is so fast that it sends a large number of interrupts to the host.


Peng Tao from Primarydata gave an update on NFS over the last year. He covered the new features of NFSv4.2, the pNFS flex file layout, and nfsd's per-bucket spinlocks.


Huawei's Hu Jianyang held the UBIFS (Unsorted Block Image filesystem) slot.  He talked about UBIFS's infrastructure and how it differs from other layers on top of UBI. It acts much like an FTL; for example, it can read, write, and erase, and it can also map and unmap.

UBIFS has features such as static wear-leveling, transparent compression (lz4hc supported), and writeback, and it supports flash of up to 32G.

He also gave more details on UBIFS's fastmap feature. A normal UBIFS mount needs a time-consuming scan when the flash medium is fairly large; fastmap addresses this scalability issue by scanning only a fixed number of blocks.

Linux Kernel Performance and 0day Testing

Fengguang Wu from Intel held this slot. He is the author of the 0day test system, which he said runs thousands of virtual test machines and tests over 400 Linux kernel git trees.

Initially the test system could only run compile/build tests. When errors occur, it automatically uses git bisect to find the buggy commit and notifies the commit author and the subsystem maintainer. The system is also flexible: test cases are easy to add, and it now supports performance tests.
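
The automatic bisect step can be pictured as a binary search over an ordered commit list. The sketch below is only an illustration of that idea, not the 0day implementation: `first_bad_commit` and the `is_bad` callback are hypothetical names, and it assumes the oldest commit is known good, the newest is known bad, and there is a single good-to-bad transition.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is good and commits[-1] is bad (hypothetical setup);
    is_bad() stands in for "build/boot/test this commit and report failure".
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Ten fake commits where the regression was introduced by "c6".
history = [f"c{i}" for i in range(10)]
culprit = first_bad_commit(history, lambda c: int(c[1:]) >= 6)
print(culprit)
```

Each round halves the candidate range, so only about log2(N) builds are needed to locate the culprit among N commits.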

For filesystems, it runs popular tests like xfstests, fsmark, etc.

Compared with other test systems, such as the Open POSIX Test Suite, this test system has much less code.

However, there are some issues. For example, it cannot test random kernel configs because the number of combinations is huge, and filesystem mount options and mkfs options pose a similar problem. Fengguang said that developers' help is needed to filter out which option combinations are worth testing.


Xie Liang from Xiaomi mainly talked about the problems they have hit with ext4 and the Linux block layer. They tried HBase and other systems to build storage for their cloud services, such as Micloud.

They found that the local filesystem plus bio has some problems. One is that ext4's buffered I/O latency is not good enough: it always fluctuates with the journal enabled. The ext4 developers suggested using no journal or an async journal, saying that the journal only benefits ext4's fsck speed. Another problem is I/O priority: the current I/O schedulers in Linux, whether cfq or deadline, did not perform well on their systems. A Taobao engineer suggested trying their newly written I/O schedulers: one that mixes cfq and deadline, and another, IOPS-based scheduler called TPPS.

Taobao's Liu Zheng also gave an update on ext4.  Frankly, there were no new kernel features in the last year, but some are planned: he talked about encryption support for ext4, project quota, and data block checksums and reflink (from Oracle's Mingming Cao).

Someone asked why ext4 needs encryption when we already have ecryptfs: why not use that instead? Zheng answered, "Perhaps Google wants to use ext4's encryption support for chromium os user cache". [ED: see here]

Besides, e2fsprogs has a new compat feature, "sparse_super2", added by the maintainer, which will be used on SMR disks. e2fsprogs also now supports metadata prefetch.

ext4 also plans to remove the old 'buffer head' code, support larger files (from 16TB to 1EB), and add more optimizations for SMR and flash devices.


Asias He from OSv took this slot, and this was a very interesting topic IMO. He introduced OSv's infrastructure. OSv uses the BSD licence and is meant to be an OS for virtual machines in the cloud.  It runs in a hypervisor, and the big difference is that it has a single address space: there is no kernel space or user space, there are no processes, only threads, and there are no spinlocks, only lock-free mutexes. It also supports zero-copy. With all of those features, it is very fast according to benchmarks running memcached, Cassandra, and Redis. After that we had a long discussion of Docker-style containers vs. OSv; the conclusion was that OSv is well designed, promising, and more secure.

-- Liu Bo

Note: our coverage of the 2013 event is here.  Robin Dong has also published a write-up of the 2014 event, here.

Wednesday Jun 18, 2014

NFS Over RDMA Community Development

Recently, Chuck Lever and Shirley Ma of the Oracle Mainline Linux kernel team have been working with the community on bringing Linux NFS over RDMA (remote direct memory access) up to full production quality.

At the 2014 OpenFabrics International Developer Workshop in March/April, they presented an overview of NFSoRDMA for Linux, outlining a rationale for investing resources in the project, as well as identifying what needs to be done to bring the implementation up to production quality.

Slides from their presentation may be downloaded here, while a video of Shirley and Chuck's presentation may be viewed here.

Shirley Ma wrote the following report on the workshop:

This year's OpenFabrics international workshop was dedicated to the development and improvement of OpenFabrics Software. The workshop covered topics ranging from exascale systems I/O and enterprise applications to distributed computing, storage, data access, and data analysis applications.

Our goal in this workshop (Chuck Lever's and mine) was to bring more interested parties to NFSoRDMA and to work together as a community to improve NFSoRDMA's functionality, reliability, and efficiency, as well as to add more NFSoRDMA test coverage to the OFED validation tests.  The upstream NFSoRDMA code for both the Linux client and server has been sitting there since 2007. The lack of upstream maintenance and support keeps customers away. On InfiniBand fabrics, Linux NFS runs over IPoIB, which consumes more resources than running over RDMA (high CPU utilization, and contiguous memory reservation to achieve better bandwidth). We briefly evaluated NFSoRDMA vs. NFS/IPoIB using the direct I/O performance benchmark IOzone. The results showed that NFSoRDMA had better bandwidth at all record sizes from 1KB to 16MB, and better CPU efficiency for record sizes greater than 4KB. As expected, NFSoRDMA RPC read and write round-trip times are much shorter than with NFS/IPoIB. Demand for NFSoRDMA will grow as storage and memory merge and I/O latency drops.

Jeff Becker (NASA) and Susan Coulter (LANL) said after our talk that they would like to join the NFSoRDMA effort. They have large-scale computing nodes and a decent-sized validation environment. Tom Talpey (original author of the NFSoRDMA client) agreed with our proposals for future work: splitting the send/recv completion queues, creating multiple QPs for scalability... He also gave advice on NFSoRDMA performance measurement based on his SMB work and Don Lovinger's performance measurement work on SMB 3.0.

Sayantan Sur (Intel) is currently using NFS/IPoIB in their IB cluster. We suggested some tuning methods for NFS/IPoIB, and he was happy to get 100 times better bandwidth than before for small I/O sizes, from 2MB/s to 200MB/s. He is thinking of moving to NFSoRDMA once it is stable. When we talked about a Wireshark NFSoRDMA dissector, Doug Oucharek (Intel) mentioned that he had implemented a Lustre RDMA packet dissector for Wireshark which is not upstream yet; we discussed it with him to see whether we can borrow some code for dissecting NFSoRDMA IB packets. Chuck and I also talked with the OFILG interoperability tester Edward Mossman (IOL) about adding more NFSoRDMA coverage to their test suites.

Since last year, the OFA workshop has moved from being hardware-vendor driven to software driven. Most of the attendees were OpenFabrics software and application developers. Intel had the most attendees: more than 20 people came from its HW, OpenFabrics stack, HPC, and other application departments.

Topics that could be related to NFSoRDMA in the future:
A new working group (OpenFabrics Interface OFI WG) has been created with the goal of minimizing interface complexity and API overhead. The proposed framework provides different fabric interfaces that hide the differences between fabric providers' implementations. The OFI WG hosts a telecon every Tuesday, and everyone is welcome. Sean Hefty (Intel) analyzed the API overhead and cache memory footprint of the current stack and presented the interface framework in a little detail; check his presentation:

VMware is working on virtualization support for host and guest services over RDMA. In the guest it implements a paravirtual vRDMA device that supports Verbs. The device is emulated in the ESXi hypervisor. Guest physical memory regions are mapped into ESXi and passed down to the physical RDMA HCA, which DMAs directly from/to guest physical memory. Latency between guests on the same host is about 20us.

Liran Liss from Mellanox gave an update on RDMA on-demand paging, which is intended to address the challenges of RDMA memory registration: cost, size, locking, and synchronization. He proposed non-pinned memory regions, which require OS PTE table changes. More details are here:
He also presented an approach to RDMA bonding at the transport level. A pseudo vHCA (vQP, vPD, vCQ, vMRs ...) is created and used for bonding (failover and aggregation), so the bonding is hardware independent. The details of the proposal are below; I don't know how feasible it is or how it will perform. The pseudo HCA driver idea is similar to the VMware vRDMA driver.

Mellanox gave a RoCE (RDMA over Converged Ethernet) v2 update: an IP-routable packet format. RoCEv2 encapsulates the IB packet in a UDP packet, and it was presented to the IETF in Nov. 2013. This might introduce more challenges for fabric congestion control.

Developers are still complaining about usability (different vendors have different implementations) and about RDMA scalability in the areas of RDMA-CM, the subnet manager, QP resources, memory registration... RDMA sockets are still under discussion... Was any of this news to me after many years away from RDMA? :)

There are lots of other interesting application topics which I don't cover here. If you are interested, here is the link to all of the presentations:

Since the workshop, a bi-weekly conference call has been established, with developers from many companies and organizations participating.  Minutes from these calls are posted to the linux-rdma and linux-nfs mailing lists.   Minutes so far:

Code stability has been significantly improved, with increased testing by developers and bugfixes being merged.  Anna Schumaker of NetApp is now maintaining a git tree for NFSoRDMA, feeding up to the core NFS maintainers.

For folks wishing to get involved in development, see the NFSoRDMA client wiki page for more information.

Tuesday Apr 01, 2014

LSF/MM 2014 and ext4 Summit Notes by Darrick Wong

This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.

The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.

  • Discussed the draft DIX passthrough interface. Based on Zach Brown's suggestions last week, I rolled out a version of the patch with a statically defined io extensions struct, and Martin Petersen said he'd try porting some existing asmlib clients to use the new interface, with a few field-enlarging tweaks. For the most part nobody objected; Al Viro said he had no problems "yet" -- but I couldn't tell if he had no idea what I was talking about, or if he was on board with the API. It was also suggested that I seek the opinion of Michael Kerrisk (the manpages maintainer) about the API. As for the actual implementation, there are plenty of holes in it that I intend to fix this week. The NFS/CIFS developers I spoke to were generally happy to hear that the storage side was finally starting to happen, and that they could get to working on the net-fs side of things now. Nicholas Bellinger noted that targetcli can create DIF disks even with the fileio backend, so he suggested I play with that over scsi_debug.

  • A large part of LSF was taken up with the discussion of how to handle the brave new world of weird storage devices. To recap: in the beginning, software had to deal with the mechanical aspects of a rotating disk; addressing had to be done in terms of cylinders, heads, and sectors (CHS). This made it difficult to innovate drive mechanics, as it was impossible to express things like variable zone density to existing software. SCSI eliminated this pain by abstracting a disk into a big tub of consecutive sectors, which simplified software quite a bit, though at some cost to performance. But most programs weren't trying to wring the last iota of performance out of disks and didn't care. So long as some attention was paid to data locality, disks performed adequately. Fast forward to 2014: now we have several different storage device classes: Flash, which has no seek penalty but prefers large writeouts; SMR drives with hard-disk seek penalties but requirements that all writes within a ~256MB zone be written in linear order; RAIDs, which by virtue of stripe geometries violate a few classic hard-disk assumptions; and NVMe devices which implement atomic read and write operations. Dave Chinner suggests that rather than retrofitting each filesystem to deal with each of these devices, it might be worth shoving all the block allocation and mapping operations down to a device mapper (dm) shim layer that can abstract away different types of storage, leaving FSes to manage namespace information. This suggestion is very attractive on a few levels: Benefits include the ability to emulate atomic read/writes with journalling, more flexible software-defined FTLs for flash and SMR, and improved communication with cloud storage systems -- Mike Snitzer had a session about dm-thinp and the proper way for FSes to communicate allocation hints to the underlying storage; this would certainly seem to fit the bill. 
I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to implement with dm shims. Unfortunately, the devil is in the details -- when will we see code? For that reason, Ted Ts'o was openly skeptical.
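
To make the SMR constraint concrete, here is a minimal sketch of the "writes within a zone must be in linear order" rule that a filesystem or a dm shim would have to respect. The `Zone` class, its fields, and the error handling are hypothetical illustrations and do not correspond to any kernel or drive interface; only the ~256MB zone size comes from the notes above.

```python
ZONE_SIZE = 256 * 1024 * 1024  # ~256MB zone, as quoted in the notes above


class Zone:
    """Toy model of one SMR zone with a write pointer (hypothetical API)."""

    def __init__(self, start: int):
        self.start = start
        self.write_pointer = start  # next byte that may be written

    def write(self, offset: int, length: int) -> None:
        # Writes are only accepted exactly at the write pointer.
        if offset != self.write_pointer:
            raise ValueError("out-of-order write within zone")
        if offset + length > self.start + ZONE_SIZE:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += length

    def reset(self) -> None:
        # Rewriting earlier data requires resetting the whole zone first.
        self.write_pointer = self.start


zone = Zone(start=0)
zone.write(0, 4096)
zone.write(4096, 4096)      # sequential: accepted
try:
    zone.write(0, 4096)     # rewrite in place: rejected
    rewrite_rejected = False
except ValueError:
    rewrite_rejected = True
print(zone.write_pointer, rewrite_rejected)
```

An allocator that hands out blocks in write-pointer order, whether it lives in the filesystem or in a shim layer, is what turns arbitrary filesystem writes into this pattern.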

  • The postgresql developers showed up to complain about stable pages and to ask for a less heavyweight fsync() -- currently, when fsync is called, it assumes that the caller wants all dirty data written out NOW, so it writes dirty pages with WRITE_SYNC, which starves reads. For postgresql this is suboptimal since fsync is typically called by the checkpointing code, which doesn't need to be fast and doesn't care if fsync writeback is not fast. There was an interlock scheduled for Thursday afternoon, but I was unable to attend. See LWN for more detailed coverage of the postgresql (and FB) sessions.

  • At the ext4 summit, we discussed a few cleanups, such as removing the use of buffer_heads and the impending removal of the ext2/3 drivers. Removing buffer_heads in the data path has the potential benefit that it'll make the transition to supporting block/sector size > page size easier, as well as reducing memory requirements (buffer heads are a heavyweight structure now). There was also the feeling that once most enterprise distros move to ext4, it will be a lot easier to remove ext3 upstream because there will be a lot more testing of the use of ext4.ko to handle ext2/3 filesystems. There was a discussion of removing ext2 as well, though that stalled on concerns that Christoph Hellwig (hch) would like to see ext2 remain as a "sample" filesystem, though Jan Kara could be heard muttering that nobody wants a bitrotten example.

  • The other major new ext4 feature discussed at the ext4 summit is per-data block metadata. This got started when Lukas Czerner (lukas) proposed adding data block checksums to the filesystem. I quickly chimed in that for e2fsck it would be helpful to have per-block back references to ease reconstruction of the filesystem, at which point the group started thinking that rather than a huge static array of block data, the complexity of a b-tree with variable key size might well be worth the effort. Then again, with all the proposed filesystem/block layer changes, Ted said that he might be open to a quick v1 implementation because the block shim layer discussed in the SMR forum could very well obviate the need for a lot of ext4 features. Time will tell; Ted and I were not terribly optimistic that any of that software is coming soon. In any case, lukas went home to refine his proposal. The biggest problem is ext4's current lack of a btree implementation; this would have to be written or borrowed, and then tested. I mentioned to him that this could be the cornerstone of reimplementing a lot of ext4 features with btrees instead of static arrays, which could be a good thing if RH is willing to spend a lot of engineering time on ext4.

  • Michael Halcrow, speaking at the ext4 summit, discussed implementing a lightweight encrypted filesystem subtree feature. This sounds a lot like ecryptfs, but hopefully less troublesome than the weird shim fs that is ecryptfs. For the most part he seemed to need (a) the ability to inject his code into the read/write path and (b) some ability to store a small amount of per-inode encryption data. His use-case is Chrome OS, which apparently needs the ability for cache management programs to erase parts of a(nother) user's cache files without having the ability to access the file. The discussion concluded that it wouldn't be too difficult for him to start an initial implementation with ext4, but that much of this ought to be in the VFS layer.

 -- Darrick

[Ed: see also the LWN coverage of LSF/MM]

Tuesday Nov 05, 2013

CLSF & CLK 2013 Trip Report by Jeff Liu and Liu Bo

This is a contributed post from Jeff Liu, lead XFS developer for the Oracle mainline Linux kernel team, with contributions from Liu Bo, our lead BTRFS developer.

Recently, we attended the China Linux Storage and Filesystem workshop (CLSF), and the China Linux Kernel conference (CLK), which were held in Shanghai.

Here are the highlights for both events.

CLSF - 17th October

XFS update (led by Jeff Liu)

XFS continues to make rapid progress, with a lot of changes focused on infrastructure and performance improvements as well as new feature development.  This is reflected in some sample statistics comparing XFS, Ext4+JBD2, and Btrfs, gathered via:

# git diff --stat --minimal -C -M v3.7..v3.12-rc4 -- fs/xfs|fs/ext4+fs/jbd2|fs/btrfs

XFS:       141 files changed, 27598 insertions(+), 19113 deletions(-)
Ext4+JBD2: 39 files changed,  10487 insertions(+), 5454 deletions(-)
Btrfs:     70 files changed,  19875 insertions(+), 8130 deletions(-)
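
For a quick comparison, the three summary lines above can be reduced to total churn and net growth per filesystem. This is just a small parsing sketch over the numbers quoted in this post; the `summarize` helper and its output layout are made up for illustration.

```python
import re

# The three summary lines quoted above, copied verbatim.
STAT_LINES = """\
XFS:       141 files changed, 27598 insertions(+), 19113 deletions(-)
Ext4+JBD2: 39 files changed,  10487 insertions(+), 5454 deletions(-)
Btrfs:     70 files changed,  19875 insertions(+), 8130 deletions(-)
"""

PATTERN = re.compile(
    r"^(?P<name>[^:]+):\s+(?P<files>\d+) files changed,\s+"
    r"(?P<ins>\d+) insertions\(\+\),\s+(?P<dels>\d+) deletions\(-\)$"
)


def summarize(text: str) -> dict:
    """Map each label to its churn (ins + dels) and net growth (ins - dels)."""
    summary = {}
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match is None:
            continue
        ins, dels = int(match["ins"]), int(match["dels"])
        summary[match["name"]] = {"churn": ins + dels, "net": ins - dels}
    return summary


summary = summarize(STAT_LINES)
for name, stats in summary.items():
    print(f"{name}: churn={stats['churn']} net={stats['net']:+d}")
```

By this rough measure XFS saw the most churn of the three over the period, though line counts say nothing about the quality or risk of the changes.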

  • What made up those changes in XFS?
    • Self-describing metadata (CRC32c). This is a new feature that accounts for about 70% of the code changes; it can be enabled via `mkfs.xfs -m crc=1 /dev/xxx`, which uses the v5 superblock.
    • Transaction log space reservation improvements. With this change, we can calculate the log space reservation at mount time rather than at runtime, which reduces the CPU overhead.
    • User namespace support. Both XFS and USERNS can be enabled in the kernel configuration beginning with Linux 3.10. Thanks to Dwight Engen for his efforts on this.
    • Split project/group quota inodes. Originally, project quota could not be enabled at the same time as group quota because they shared the same quota file inode; now it works, but only for the v5 superblock, i.e., with CRC enabled.
    • CONFIG_XFS_WARN, a new lightweight runtime debugger which can be deployed in production environments.
    • Readahead log object recovery. This change can speed up log replay significantly.
    • Speculative preallocation inode tracking, clearing and throttling. The main purpose is to deal with inodes that have post-EOF space due to speculative preallocation, and to support improved quota management that frees up a significant amount of unwritten space when at or near EDQUOT. It supports background scanning, which occurs at a longish interval (5 mins by default, tunable), and on-demand scanning/trimming via ioctl(2).
  • Bitter arguments ensued from this session, especially over the comparison with Ext4 and Btrfs in different areas; I had to spend the whole morning of the 1st day answering those questions. We basically agreed that XFS is the best choice on Linux nowadays because:
    • Stable. XFS has had a good stability record over the past 10 years. Fengguang Wu, who leads the 0-day kernel test project, also said that he has observed fewer errors in XFS than in other filesystems over the past 1+ years. I attribute this to the upstream XFS code reviewers, who always perform serious code review as well as testing.
    • Good performance for both large and small files. The claim that XFS does not work well with small files has been out of date for years.
    • Best choice (maybe) for distributed PB-scale filesystems, e.g., Ceph recommends deploying the OSD daemon on XFS because Ext4 has a limited xattr size.
    • Best choice for large storage (>16TB). Ext4 does not support a single file larger than around 15.95TB.
    • Scalability. Any objection to XFS being the best on this point? :)
    • XFS deals with transaction concurrency better than Ext4. Why? The maximum log size is 2038MB in XFS, compared to 128MB in Ext4.
  • Misc. Ext4 is widely used and has proved fast and stable under various loads and scenarios, XFS just needs more customers, and Btrfs still has some way to go before it reaches maturity.
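
The two size comparisons in the list can be checked with some quick arithmetic. The sketch below assumes 4 KiB filesystem blocks and 32-bit logical block numbers in ext4's extent format; both are assumptions of this example rather than figures stated in the session, and the log sizes are simply the ones quoted above.

```python
# Assumptions for this sketch (not stated in the session itself):
BLOCK_SIZE = 4096          # bytes per filesystem block (4 KiB)
LOGICAL_BLOCK_BITS = 32    # width of a logical block number in an ext4 extent

TIB = 2 ** 40

# Largest file addressable with 32-bit logical block numbers.
max_file_bytes = (2 ** LOGICAL_BLOCK_BITS) * BLOCK_SIZE
max_file_tib = max_file_bytes / TIB

# Maximum log sizes quoted in the list above, in MB.
XFS_MAX_LOG_MB = 2038
EXT4_MAX_LOG_MB = 128
log_ratio = XFS_MAX_LOG_MB / EXT4_MAX_LOG_MB

print(f"ext4 per-file limit under these assumptions: {max_file_tib:.0f} TiB")
print(f"XFS max log is about {log_ratio:.1f}x the ext4 max journal")
```

Under these assumptions the per-file ceiling lands at roughly 16 TiB, in line with the limit mentioned above, and the XFS log can be about 16 times larger than the ext4 journal.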

Ceph Introduction (Led by Li Wang)

This is a hot topic.  Li gave us a nice introduction to the design as well as their current work. The Ceph client has been included in the Linux kernel since 2.6.34 and supported by OpenStack since Folsom, but it seems that it has not yet been widely deployed in production environments.

Their major work focuses on inline data support, which aims to reduce file access time. Because metadata and data are stored separately, a file access needs two rounds of communication: fetching the metadata from the MDS and then getting the data from the OSD. Small file access is therefore limited by network latency.

The solution is to store the data of small files with the metadata, so that when a small file is accessed, the metadata server can push both metadata and data to the client at the same time. In this way, they reduce the overhead of calculating the data offset and save the communication with the OSD.
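
The effect of inlining can be sketched with a toy model that counts round trips. Everything here is hypothetical: the `Cluster` and `Inode` classes, the 4096-byte `INLINE_LIMIT`, and the dictionary-based "servers" are stand-ins for illustration and say nothing about how Ceph actually implements the feature.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

INLINE_LIMIT = 4096  # hypothetical threshold for "small" files


@dataclass
class Inode:
    size: int
    inline_data: Optional[bytes] = None  # only set for small files


class Cluster:
    """Toy MDS + OSD pair; dictionaries stand in for the real servers."""

    def __init__(self) -> None:
        self.mds: Dict[str, Inode] = {}
        self.osd: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        if len(data) <= INLINE_LIMIT:
            # Small file: keep the data next to the metadata.
            self.mds[path] = Inode(len(data), inline_data=data)
        else:
            self.mds[path] = Inode(len(data))
            self.osd[path] = data

    def read(self, path: str) -> Tuple[bytes, int]:
        """Return (data, number of round trips needed)."""
        inode = self.mds[path]              # round trip 1: metadata from MDS
        if inode.inline_data is not None:
            return inode.inline_data, 1     # data arrived with the metadata
        return self.osd[path], 2            # round trip 2: data from OSD


cluster = Cluster()
cluster.write("small.txt", b"x" * 1024)
cluster.write("big.bin", b"y" * 8192)
small_data, small_trips = cluster.read("small.txt")
big_data, big_trips = cluster.read("big.bin")
print(small_trips, big_trips)
```

For the small file the second network hop disappears entirely, which is where the latency saving described above comes from.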

For this feature, they have only run some small-scale tests, but they saw noticeable improvements. Test environment: Intel 2 CPU 12 Core, 64GB RAM, Ubuntu 12.04, Ceph 0.56.6 with 200GB SATA disk, 15 OSD, 1 MDS, 1 MON. The sequential read performance for 1K files improved by about 50%.

I asked Li and Zheng Yan (a core Ceph developer, who has also worked on Btrfs) whether Ceph is really stable and can be deployed in production for large-scale, PB-level storage, but they could not give a positive answer; it looks like Ceph is not even widely deployed at Dreamhost (subject to confirmation). According to Li, they have only deployed Ceph for small-scale storage (32 nodes), although they'd like to try 6000 nodes in the future.

Improve Linux swap for Flash storage (led by Shaohua Li)

Because of high density, low power and low price, flash storage (SSD) is a good candidate to partially replace DRAM. A quick answer for this is using SSD as swap. But Linux swap is designed for slow hard disk storage, so there are a lot of challenges to efficiently use SSD for swap.

    • swap_map scan
      swap_map is the in-memory data structure that tracks swap disk usage, but finding free space in it requires a slow linear scan. This becomes a bottleneck on SSD when many adjacent pages have to be found. Shaohua Li changed it to a list of clusters (128K each), resulting in an O(1) algorithm. However, this approach needs restrictive cluster alignment and is only enabled for SSD.
    • IO pattern
      In most cases, swap I/O is interleaved, because there are multiple reclaimers or because a free cluster is shared by all reclaimers. Although the block layer can merge interleaved I/O to some extent, we cannot count on it completely. Hence a per-cpu cluster was added on top of the previous change; it helps each reclaimer do sequential I/O and makes it easier for the block layer to merge I/O.
    • TLB flush
      If we're reclaiming an active page, we first move the page from the active LRU list to the inactive LRU list, and then reclaim the page from the inactive LRU to swap it out. During this process, we need to clear bits in the PTE twice: first the 'A' (ACCESSED) bit, then the 'P' (PRESENT) bit. Processors need to send lots of IPIs, which makes the TLB flush really expensive. Some work has been done to improve this, including reworking smp_call_function_many() and removing the first TLB flush on x86, but there is still some debate here and only part of the work has been pushed to mainline.
    • Page fault does iodepth=1 sync I/O, which is a bit wasteful if only one page's worth of I/O is issued. The obvious solution is swap readahead. But the current in-kernel swap readahead is arbitrary (always 8 pages), and it does not perform well for either random or sequential access workloads. Shaohua introduced a new flag for madvise (MADV_WILLNEED) to do swap prefetch, so the changes happen in the userspace API and leave the in-kernel readahead unchanged (but I think some improvement can also be done here).
  • SWAP discard
    • As we know, discard is important for SSD write throughput, but the current swap discard implementation is synchronous. He changed it to async discard, which allows discard and write to run at the same time. Meanwhile, the unit of discard is also optimized to a cluster.
  • Misc: lock contention
    • With many concurrent swapouts and swapins, contention on locks such as anon_vma or swap_lock is high, so he changed swap_lock to a per-swap lock. But there is still some lock contention on very fast SSDs because of the swapcache address_space lock.
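
The difference between the linear swap_map scan and the cluster list can be sketched as follows. This is a simplified model, not the kernel code: `ClusterAllocator` and `linear_scan_alloc` are hypothetical names, freeing slots is not modelled, and the 32 slots per cluster come from assuming 4K pages in a 128K cluster.

```python
from collections import deque
from typing import List, Optional

PAGES_PER_CLUSTER = 32  # 128K cluster / 4K page; the 4K page size is an assumption


def linear_scan_alloc(swap_map: List[int]) -> Optional[int]:
    """Old style: walk the usage map until a free slot turns up, O(n)."""
    for slot, use_count in enumerate(swap_map):
        if use_count == 0:
            swap_map[slot] = 1
            return slot
    return None


class ClusterAllocator:
    """New style: keep a list of free clusters and take the head in O(1)."""

    def __init__(self, nr_clusters: int) -> None:
        self.free_clusters = deque(range(nr_clusters))
        self.current: Optional[int] = None  # cluster currently being filled
        self.next_offset = 0                # next free slot inside it

    def alloc(self) -> Optional[int]:
        if self.current is None or self.next_offset == PAGES_PER_CLUSTER:
            if not self.free_clusters:
                return None                 # no free cluster left
            self.current = self.free_clusters.popleft()
            self.next_offset = 0
        slot = self.current * PAGES_PER_CLUSTER + self.next_offset
        self.next_offset += 1
        return slot


allocator = ClusterAllocator(nr_clusters=2)
slots = [allocator.alloc() for _ in range(2 * PAGES_PER_CLUSTER)]
print(slots[:4], slots[-1], allocator.alloc())
```

Giving each CPU its own `current` cluster, as the per-cpu cluster change does, is what keeps every reclaimer's slots adjacent and therefore its I/O sequential.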

Zproject (led by Bob Liu)

Bob gave us a very nice introduction to the current status of memory compression. There are now 3 projects (zswap/zram/zcache), all of which aim to smooth out swap I/O storms and improve performance, but each has its own pros and cons.
  • ZSWAP
    • It is implemented on top of the frontswap API and uses a dynamic allocator named zbud to allocate free pages. Zbud means that pairs of zpages are "buddied": it can store at most two compressed pages in one page frame, so the best achievable compression ratio is 50%. Page frames are linked on an LRU list and can be shrunk under memory pressure. If the compressed memory pool reaches its limit, shrinking or reclaim happens: it decompresses the page frame into two newly allocated pages and then writes them to the real swap device, but this can fail when allocating the two pages.
  • ZRAM
    • Acts as a compressed ramdisk used as a swap device. It uses zsmalloc as its allocator, which has high density but may have fragmentation issues. Besides, page reclaim is hard, since it needs more pages to uncompress and frees just one page. ZRAM is preferred by embedded systems, which may not have any real swap device. Both ZRAM and ZSWAP are currently in the drivers/staging tree, and in the mm community there is some discussion of merging ZRAM into ZSWAP or vice versa, but no agreement yet.
  • ZCACHE
    • Handles file page compression, but it was recently removed from staging.
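
The "at most two compressed pages per frame" rule behind zbud can be illustrated with a toy pool. This is only a sketch of the pairing idea: `ZbudPool` is a made-up class, zlib stands in for whatever compressor is configured, and the first-fit search is a simplification of the real allocator.

```python
import zlib

PAGE_SIZE = 4096


class ZbudPool:
    """Toy buddy pool: each frame holds at most two compressed pages."""

    def __init__(self) -> None:
        self.frames = []  # each frame is a list of one or two compressed blobs

    def store(self, page: bytes):
        """Compress a page and return a (frame index, slot) handle."""
        blob = zlib.compress(page)
        if len(blob) > PAGE_SIZE:
            raise ValueError("page does not compress below one frame")
        # First fit: look for a frame with a single occupant and enough room.
        for index, frame in enumerate(self.frames):
            if len(frame) == 1 and len(frame[0]) + len(blob) <= PAGE_SIZE:
                frame.append(blob)
                return index, 1
        self.frames.append([blob])
        return len(self.frames) - 1, 0

    def load(self, index: int, slot: int) -> bytes:
        return zlib.decompress(self.frames[index][slot])


pool = ZbudPool()
pages = [bytes([i]) * PAGE_SIZE for i in range(4)]  # four highly compressible pages
handles = [pool.store(page) for page in pages]
print(handles, len(pool.frames))
```

Four pages end up in two frames, so the footprint is halved at best no matter how well the pages compress, which is the 50% limit mentioned above.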

From industry (led by Tang Jie, LSI)

An LSI engineer introduced several new products to us. The first was a line of RAID5/6 cards that use full-stripe writes to improve performance.

The 2nd was the SandForce flash controller, which can understand data file types (data entropy) to reduce write amplification (WA) for nearly all writes. The feature is called DuraWrite, and the typical WA is 0.5. What's more, if its Dynamic Logical Capacity function module is enabled, the controller can do data compression that is transparent to the upper layers. LSI testing shows that with this virtual capacity enabled, a 1x TB drive can support up to 2x TB of capacity, but the application must monitor free flash space to maintain optimal performance and to guard against free flash space exhaustion. He said the most useful application is databases.
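
Write amplification is the ratio of bytes physically written to flash to bytes written by the host, so a WA below 1 means the controller writes less than it was given, for example thanks to compression. The helper names and the 100 GB workload below are illustrative only; the 0.5 figure is the one quoted in the talk.

```python
GB = 10 ** 9


def write_amplification(flash_bytes: float, host_bytes: float) -> float:
    """WA = bytes written to flash / bytes written by the host."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return flash_bytes / host_bytes


def flash_bytes_written(host_bytes: float, wa: float) -> float:
    """Bytes that reach the flash for a given host workload and WA."""
    return host_bytes * wa


host_writes = 100 * GB                        # hypothetical workload
flash_writes = flash_bytes_written(host_writes, 0.5)
print(flash_writes / GB, write_amplification(flash_writes, host_writes))
```

At a WA of 0.5, a 100 GB host workload costs only 50 GB of flash writes, which is also why the logical capacity can appear larger than the physical one as long as the data stays compressible.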

Another thing I think is worth mentioning is the NV-DRAM memory in NMR/Raptor, which is directly exposed to the host system. Applications can access the NV-DRAM directly via a memory address, using the standard mmap() system call. He said that it is already very useful for database logging. This kind of NVM product has begun to appear in recent years, and it is said that Samsung is building a research center in China for related products. IMHO, NVM will have an effect on the current OS layers, especially file systems; e.g., journaling may need to be redesigned to fully utilize this nonvolatile memory.
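
The access pattern described here, writing log records through a memory mapping instead of through write(2), looks roughly like the sketch below. A temporary file stands in for the NV-DRAM region, because the actual device node and its size are not given in the talk; `append_log_record` and `REGION_SIZE` are names invented for this example.

```python
import mmap
import tempfile

REGION_SIZE = 4096  # hypothetical size of the mapped region


def append_log_record(region: mmap.mmap, offset: int, record: bytes) -> int:
    """Store a record at offset via plain memory writes; return the new offset."""
    end = offset + len(record)
    region[offset:end] = record   # a memory store, no write(2) call
    region.flush()                # ask the OS to push the mapping to its backing store
    return end


# A temporary file stands in for the NV-DRAM device in this sketch.
with tempfile.TemporaryFile() as backing:
    backing.truncate(REGION_SIZE)
    region = mmap.mmap(backing.fileno(), REGION_SIZE)
    end = append_log_record(region, 0, b"txn 1 commit\n")
    readback = region[:end]
    region.close()

print(readback)
```

With real nonvolatile memory behind the mapping, the store itself is what makes the record durable, which is why such devices are attractive for database logging.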

OCFS2 (led by Canquan Shen)

Without a doubt, Huawei has been the biggest contributor to OCFS2 in the past two years. They have posted 46 patches upstream, of which 39 have been merged. Their current project is based on 32/64-node clusters, but they have also tried 128 nodes at the experimental stage. Their major ongoing work is support for ATS (atomic test and set), which can work alongside the DLM at the same time. It looks like this idea was inspired by VMware's VMFS locking, i.e,

EXT4 (led by Zheng Liu)

Zheng Liu says ext4 keeps to its stable style, so the major part is bug fixes and cleanups while the minor part is new features and improvements. He first talked about the AIO write performance gain on ext4, which makes use of the extent status cache. The problem was that they found the AIO path waiting on get_block_t(), resulting in unacceptable latencies; the solution is to batch get_block_t() with "fiemap(2) + FIEMAP_FLAG_CACHE" and "ioctl(2) + EXT4_IOC_PRECACHE_EXTENTS".

IOW, this just hands off latency from the kernel to the userspace.

BTRFS (led by Liu Bo)

I (Liu Bo) held the session and mainly talked about new features in the last year (2013). People are happy to see more features being developed in btrfs, but at the same time they are confused about what btrfs wants to be -- generally speaking, as a 5-year-old FS, btrfs should try to be stable first.

CLK - 18th October 2013

Improving Linux Development with Better Tools (Andi Kleen)

This talk focused on how to find and fix bugs as the complexity of Linux grows. Generally, we can do this with the following kinds of tools:

  • Static code checkers, e.g., sparse, smatch, coccinelle, clang checker, checkpatch, gcc -W/LTO, stanse. These can help check a lot of things, from simple mistakes to complex problems, but there are challenges: some are very slow, they produce false positives, and getting the false positives down may need a concentrated effort. In particular, no static checker I found can follow indirect calls (“OO in C”, common in the kernel):
    struct foo_ops {
            int (*do_foo)(struct foo *obj);
    };
  • Dynamic runtime checkers, e.g., thread checkers, kmemcheck, lockdep. Ideally all kernel code would come with a test suite, so that someone could run all the dynamic checkers.
  • Fuzzers/test suites, e.g., Trinity is a great tool that finds many bugs, but it needs a manual model for each syscall. Modern fuzzers that use automatic feedback are around, but not for the kernel yet.
  • Debuggers/tracers to understand code, e.g., ftrace can dump on events/oops/custom triggers, but in many cases it still has too much overhead to keep running all the time during debugging.
  • Tools to read/understand source, e.g., grep/cscope work great for many cases, but they do not understand indirect pointers (the OO in C model used in the kernel), so they cannot give us all “do_foo” instances:
    struct foo_ops {
          int (*do_foo)(struct foo *obj);
    } my_ops = { .do_foo = my_foo };  /* instance name is illustrative */
    It would be great to have a cscope-like tool that understands this based on types/initializers.
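
To show why such indirect calls are hard for grep-style tools, here is a small self-contained C sketch of the "OO in C" pattern. Apart from foo_ops, do_foo and my_foo, which appear in the snippets above, every name (other_foo, my_ops, other_ops, foo_run, the value field) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct foo;

/* Method table: plays the role of a class in the "OO in C" model. */
struct foo_ops {
    int (*do_foo)(struct foo *obj);
};

struct foo {
    const struct foo_ops *ops;  /* which implementation this object uses */
    int value;
};

/* Two hypothetical implementations of the same operation. */
static int my_foo(struct foo *obj)    { return obj->value + 1; }
static int other_foo(struct foo *obj) { return obj->value * 2; }

static const struct foo_ops my_ops    = { .do_foo = my_foo };
static const struct foo_ops other_ops = { .do_foo = other_foo };

/*
 * The indirect call site. Searching the source for "do_foo" finds this line
 * and the initializers, but nothing here says which function actually runs;
 * that depends on the ops pointer stored in the object at run time.
 */
static int foo_run(struct foo *obj)
{
    assert(obj != NULL && obj->ops != NULL && obj->ops->do_foo != NULL);
    return obj->ops->do_foo(obj);
}
```

A type-aware tool would have to connect the call in foo_run() to every initializer of a struct foo_ops to list the possible targets.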

XFS: The High Performance Enterprise File System (Jeff Liu)


I gave a talk introducing the disk layout, the unique features, and the recent changes. The slides include some charts comparing the performance of XFS, Btrfs, and Ext4 for small files.

About a dozen users raised their hands when I asked who had experience with XFS. I remember that when I asked the same question at LinuxCon Japan, only 3 people raised their hands, but they were Chris Mason, Ric Wheeler, and another attendee.
The attendee questions were mainly focused on stability, and comparison with other file systems.

Linux Containers (Feng Gao)

The speaker introduced the purpose of the various kinds of namespaces, including mount/UTS/IPC/Network/PID/User, as well as the system API/ABI. For the userspace tools, he mainly focused on Libvirt LXC rather than us (LXC). Libvirt LXC is another userspace container management tool, implemented as one type of libvirt driver; it can manage containers, create namespaces, create a private filesystem layout for a container, create devices for a container, and set up resource controllers via cgroups.
In this talk, Feng also mentioned two other possible new namespaces in the future. The first is audit, but it is not clear whether it should be assigned to the user namespace or not. The other is syslog, but the question is: do we really need it?

In-memory Compression (Bob Liu)

Same as CLSF, a nice introduction that I have already mentioned above.

0-day Linux Kernel Performance Test (Yuanhan Liu)

Based on Fengguang Wu's 0-day autotest framework, Yuanhan Liu's 0-day performance test integrates with the existing test tools and generates both ASCII and graphic results from test numbers. But it's not yet open sourced and is Intel-internal only, and the developers say that it's a bit difficult to open it up, because:

  1. it's not easy to set up the whole test suite; a lot of effort is involved
  2. it needs many powerful machines, on which a great number of VMs will be installed.

Although it's not open, the framework does find bugs in various parts of the kernel, including btrfs, which is good for me :) [LB]


There were some other talks related to ACPI-based memory hotplug, smart wake-affinity in the scheduler, etc., but my head is not big enough to record all those things.

-- Jeff Liu & Liu Bo

Monday Sep 02, 2013

IETF 87 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

The 87th meeting of the IETF was held July 28 - August 2 in Berlin, Germany.

I was in Berlin for the week to attend the NFSv4 Working Group meeting and hold informal discussions related to NFS standardization with other attendees. The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet. Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world. Sponsorship for each event varies. DENIC, the central registry for domain names under .de, was the primary sponsor for this event. Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing. The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions. More on the Working Group charter can be found here:

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS. I'm the editor of two of the IETF FedFS protocol specifications, and a co-author of an Internet-Draft that addresses protocol issues affecting NFSv4 migration. Other representatives at this meeting include Microsoft, EMC, NetApp, IBM, Oracle, Tonian, and others. Topics include progress updates on Internet-Drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work. See:

Meeting agenda, presentation materials, and minutes are available at this location.

Drill down:

Working Group editor Tom Haynes (NetApp) reported on several areas where progress appears to be stalled. In general we face challenges completing our deliverables because the IETF is a volunteer organization, and the tasks at hand are large. The largest item is RFC 3530bis, which is holding up FedFS and NFSv4.2. RFC 3530bis was rejected during IESG review mainly due to the new chapter that attempts to bridge the gap between existing i18n implementations in NFS, and how we'd like i18n to work.

The problem is nobody has implemented i18n for NFSv4, and the IETF has revised i18n since 3530 was ratified. The consensus was to move the offending section to a separate Internet-Draft where the correct language can be hammered out without holding up RFC 3530bis.

NFSv4.2 is held up by a lack of enthusiasm for finishing a new revision of RPCSEC_GSS. The GSS I-D has languished without an author or editor for many months, and two items in NFSv4.2 depend on its completion: labeled NFS and server-to-server copy. A rough consensus was not achieved, but Tom and Andy Adamson (NetApp) will investigate options, including removing the parts of server copy and labeled NFS that depend on GSSv3, and report back.

Benny Halevy (Tonian) has submitted a fresh draft proposing "Flexible File Layouts" which is a new pNFS layout type that improves upon the existing pNFS file layout defined in RFC 5661. Motivation for a new layout scheme includes: algorithmic data striping to support load balancing, life-cycle management, and other advanced administrative features; support for using legacy NFS servers as pNFS data servers; and direct pNFS support for existing cluster filesystems such as Ceph and GlusterFS.

Chuck Lever (Oracle) described recent progress to address security concerns in the FedFS documents waiting in the RFC Editor queue. He continued by walking through a group of possible future work items, including more modern LDAP security modes, additional administrative operations, and better mechanisms for clients to choose working fileset locations. Does the working group have the energy to consider a new revision of these documents? Or should we continue to focus on making small changes? This was left unresolved.

Sorin Faibish (EMC) discussed the need for a new layout enabling pNFS clients to access Lustre data servers directly. After a lot of discussion, the issue appears to be that the NFS protocol on high performance transports is not performant enough. The proposed solution was to use LNET over RDMA. It was suggested that it would be more interesting to the Working Group if we focused on fixing the performance issues in our RDMA specifications instead.

Marc Eshel (IBM) wanted to restart the age-old conversation on tightening NFS's data cache coherency. The immediate question is whether POSIX semantics are interesting given today's compute workloads and network environment. Implementing POSIX data coherency among multiple networked systems is still a challenge. The consensus was that a callback-based solution, where network traffic is proportional to the level of inter-client sharing, was most appropriate. Such a solution (byte-range delegation) was proposed by Trond Myklebust in 2006. It was recommended to start with that work.

Chuck Lever (Oracle) proposed an experimental extension to NFS that enables NFS client and servers to convey end-to-end data integrity metadata. A new I-D has been submitted that describes the protocol changes. No prototype is available yet; the I-D is meant to coordinate discussion of technical details, and enable interoperable prototype implementations.

David Noveck (EMC) elaborated on the need to allow protocol changes outside of the NFS minor version process. He described the limitations of batching unrelated features together and waiting for a full pass through the IETF review process. There was some interest in allowing innovation outside of the minor version process. The Area Director and Working Group chair felt that there is currently not enough energy behind work already planned for delivery.

Matt Benjamin (Linux Box) is restarting work on a feature proposed several years ago by Mike Eisler that allows directories to be striped across pNFS data servers, just like file data is today. An Internet-Draft is available, and a prototype is underway.

-- Chuck Lever

Thursday Mar 21, 2013

IETF 86 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

On Monday (11th March) I attended the IETF NFSv4 Working Group meeting at IETF 86 in Orlando, Florida.

The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet.  Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world.  Sponsorship for each event varies.  This event was sponsored by Comcast and NBCUniversal.  Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing.  The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions.  More on the Working Group charter can be found here:

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS.  I'm the editor of two of the IETF FedFS protocol specifications, and a co-author on a draft that discusses experience implementing NFSv4 migration.

Other representatives at this meeting include Microsoft, EMC, NetApp, Oracle, Panasas, and others.  Topics include progress updates on drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work.  See:

Meeting slides are available now at this location.  Minutes are coming soon.

Drill down:

Tom Haynes (NetApp) reported on progress with RFC 3530bis, a refreshed specification for NFSv4.0.  This document has passed the Area Director check, and is ready for IESG review.  This document is a top priority because other unfinished documents which cite this document are held up waiting for its completion.

Labeled NFS, a part of the forthcoming NFSv4.2 protocol, has a Linux prototype that was demonstrated at Connectathon last month.

The RPCSEC_GSSv3 standard has not made progress, but an editor (Dros Adamson) was assigned during IETF 85.  This document is blocking progress on NFSv4.2.

The NFSv4.2 draft is in WG last call, which ends today (Monday, March 11).  No new issues were raised, so the Working Group chair will move this forward.

Tom Haynes presented a brief set of slides on how NFSv3 client should interpret the presence of AUTH_NONE in the list of security flavors a server supports.  There was never a formal standard describing this, and now we need an interoperability document.  As we explore this issue we may discover some real problems.  A fresh draft was requested.

Dave Noveck (EMC) discussed progress on the draft co-authored with Bill Baker, Piyush Shivam, and myself on NFSv4 migration issues.  As part of the discussion, we visited the issue of how to prevent client progress when a server freezes open and lock state before a migration.  Adding a new error code was mentioned, but that is against the minor version rules and would cause interoperability problems with clients that don't recognize the new error code.  Otherwise we have the procedural issues taken care of to advance this document to become an informational RFC.

A draft covering NFSv4.1 migration issues would probably not be needed, as the changes are small and could be covered in an RFC 5661bis, when it is opened.  There doesn't seem to be urgency here.

Chuck Lever (Oracle) described implementation experience with the recommendations of Dave's migration draft.  The experience arises from the Linux Uniform Client String changes Chuck has done, and a number of items discovered by the Solaris NFS team.

Chuck Lever reported on progress with the FedFS draft standards.  Short story: They are in the RFC Editor queue awaiting completion of RFC 3530bis.

Trond Myklebust (NetApp) presented an issue with NFSv4.1 session slot table management that he had also reported at Connectathon.  It was agreed that an erratum to RFC 5661 would be produced that describes how implementations will add the missing behavior.  No on-the-wire protocol changes are needed.

Matt Benjamin (The Linux Box) requested a revisit of a 2008 proposal by Mike Eisler to stripe POSIX directories across multiple data servers.  An algorithm would generate an offset into a table of device IDs, which would indicate on which data server to find a directory entry.  Matt claimed there would be changes to the proposal to deal with Ceph and CohortFS.  Chair requested a draft, Matt to deliver soon.
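
The lookup step described above can be sketched in a few lines of C. This is only an illustration of mapping a name to a slot in a device ID table: the proposal's actual algorithm was not presented, so the FNV-1a hash used here is a stand-in, and the names stripe_table, name_hash and entry_to_device are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash of a directory entry name (stand-in algorithm). */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 2166136261u;       /* FNV offset basis */
    while (*name) {
        h ^= (uint8_t)*name++;
        h *= 16777619u;             /* FNV prime */
    }
    return h;
}

/* Hypothetical table of pNFS device IDs, one per data server. */
struct stripe_table {
    const uint32_t *device_ids;
    size_t          count;
};

/* Turn an entry name into an offset into the table, then into a device ID. */
static uint32_t entry_to_device(const struct stripe_table *t, const char *name)
{
    assert(t != NULL && t->count > 0);
    size_t offset = name_hash(name) % t->count;
    return t->device_ids[offset];
}
```

Because the mapping depends only on the name and the table, any client holding the same table would compute the same data server for a given entry without asking the metadata server.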

Chuck Lever asked if we still need an NFS-specific mechanism for provisioning NFSv4 ID domain names.  The feeling is that this domain name is determined by the system's authentication service, not by NFS, so NFS should not have its own way to set this.  The consensus was that there is some work to do here, and it should be done under the umbrella of the ongoing multi-domain work.

Spencer Shepler (Microsoft) closed the meeting with a house-keeping item.  There is some desire to reduce travel by moving more work to the mailing list.  The plan is to ask about agenda items for a meeting before requesting a meeting slot at the next IETF.

Several folks wanted to discuss Bill Baker's micro-versioning proposal.  Dave Noveck stated the problem this way: NFSv4.1 is a heavyweight minor version with a bunch of features, so fixes for 4.0 aren't possible with our spiffy minor versioning scheme.  Spencer felt we should visit this only when we encounter a problem we must address with major protocol changes.  The room was divided; some felt waiting was best so that a problem statement can be formulated; others were concerned that it was almost certain we would need to alter the NFSv4.0 XDR at some point, and we should start working this out now.

In the near term, protocol issues should come to the mailing list sooner rather than later so we can work them out together.

-- Chuck Lever


The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.

Our team is globally distributed and includes leading core kernel developers and industry veterans.

This blog is edited by James Morris

