Thursday Dec 10, 2015

Smatch Static Analysis Tool Overview, by Dan Carpenter

This is a posting by Oracle Linux Kernel Engineer, Dan Carpenter.  In this, he provides an overview of Smatch, the C static analysis tool which he developed, and which he uses to test the mainline Linux kernel code for security bugs.

My job at Oracle is focused on finding security bugs in the Linux kernel. My favorite type of bug is off by one bugs where the code says:

	if (x > ARRAY_SIZE(table))
		return -EINVAL;

The problem is that > should be changed to >= so it says:

	if (x >= ARRAY_SIZE(table))
		return -EINVAL;

These are a one-line fix and an easy way for me to boost my patch count. I have made over a hundred of these changes. In fact, I probably have some kind of record for fixing the most off by one bugs! My record breaking secret is that I use an open source, static analysis tool which I wrote called Smatch.
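To make the off by one concrete, here is a minimal sketch. The four-entry table and the two helper functions are hypothetical and are not taken from any kernel driver. The buggy check accepts an index equal to the array size, which is one past the last valid element.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical lookup table with four entries: valid indexes are 0..3. */
static const int table[4] = { 10, 20, 30, 40 };

static_assert(ARRAY_SIZE(table) == 4, "valid indexes are 0..3");

/* Buggy check: x == ARRAY_SIZE(table) slips through. */
static int index_ok_buggy(size_t x)
{
	if (x > ARRAY_SIZE(table))
		return -EINVAL;
	return 0;
}

/* Fixed check: every index past the last element is rejected. */
static int index_ok_fixed(size_t x)
{
	if (x >= ARRAY_SIZE(table))
		return -EINVAL;
	return 0;
}
```

With these helpers, index 4 passes the buggy check but is rejected by the fixed one, while index 3 is accepted by both.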

Maybe the easiest way to understand how Smatch works is to download it and play with it a bit. Here are the instructions:

You will first need to install some dependencies such as the sqlite development packages for C, BASH, Perl and Python. Also I would recommend installing the libXML, gtk+2.0 and LLVM development packages as well but those are not required. Then run the following commands:

	git clone git://
	cd smatch

Now let's create a small test.c file:

#include "check_debug.h"

int var;

void function(void)
{
	if (var < 0 || var >= 10)
		return;
	__smatch_implied(var);
}

The "check_debug.h" file can be included into any .c file. It is used to display internal Smatch information which helps with debugging. If you run `./smatch test.c`, then that prints the value of the "var" variable.

test.c:9 function() implied: var = '0-9'

Smatch also tries to track some relationships between variables so let's change our test.c file to look like this:

#include "check_debug.h"

int a, valid;

void function(void)
{
	valid = 0;
	if (a >= 0 && a < 10)
		valid = 1;

	if (a == -1)
		__smatch_implied(valid);
}

With this code, since -1 is outside the 0-9 range, that means "valid" is zero.

test.c:12 function() implied: valid = '0'

We could move the limit check into a separate function if we wanted:

#include "check_debug.h"

int is_valid_month(int month)
{
	if (month < 1 || month > 12)
		return 0;
	return 1;
}

int var;

void function(void)
{
	if (is_valid_month(var))
		__smatch_implied(var);
}

It prints that valid values are 1-12 as expected.

Basically we're tracking the values of all the variables. The math behind this is called flow analysis and the core part of Smatch is a flow analysis engine. The flow analysis engine lets you track more abstract things as well, such as whether a pointer has been freed or whether it has been dereferenced. It is easy to hook into the Smatch flow analysis engine and add more checks.
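As a toy illustration of value tracking, the sketch below narrows a range of possible values each time a condition is known to be false. This is not how Smatch is implemented internally; the struct and function names are made up for the example, and it only mirrors the first test.c case.

```c
#include <assert.h>
#include <limits.h>

/* Closed interval of values a variable may hold at some point in the code. */
struct range {
	long long min;
	long long max;
};

/* Range left after the test "var < limit" was false, so var >= limit. */
static struct range narrow_not_less(struct range r, long long limit)
{
	if (r.min < limit)
		r.min = limit;
	return r;
}

/* Range left after the test "var >= limit" was false, so var < limit. */
static struct range narrow_not_ge(struct range r, long long limit)
{
	assert(limit > LLONG_MIN);	/* limit - 1 must not overflow */
	if (r.max > limit - 1)
		r.max = limit - 1;
	return r;
}

/*
 * What is known about var once
 * "if (var < 0 || var >= 10) return;" has fallen through?
 */
static struct range after_limit_check(void)
{
	struct range var = { INT_MIN, INT_MAX };	/* nothing known yet */

	var = narrow_not_less(var, 0);
	var = narrow_not_ge(var, 10);
	return var;
}
```

Calling after_limit_check() yields the range 0 to 9, the same answer Smatch printed for the first example.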

Since 2009, there have been around 3000 kernel bugs patched because of Smatch warnings. Most are minor: for example, an off by one bug that would only crash the computer when someone installs 256 graphics cards. In that situation the programmer made a real mistake and we will fix it, but it has no real world impact. Other times even minor mistakes like returning a wrong error code can be serious. For example, in 6d97e55f7172 ('vhost: fix return code for log_access_ok()') we were supposed to return zero on failure but instead we returned -EINVAL. Since -EINVAL is non-zero, that meant access was granted when it was supposed to be denied.
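The return code problem can be sketched as follows. This is not the vhost code; the function names and the log_valid parameter are hypothetical and only show why a negative error code is dangerous when the caller expects "non-zero means allowed".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical check using the "non-zero means allowed" convention. */
static int access_ok_buggy(int log_valid)
{
	if (!log_valid)
		return -EINVAL;	/* bug: non-zero, so callers read it as "allowed" */
	return 1;
}

static int access_ok_fixed(int log_valid)
{
	if (!log_valid)
		return 0;	/* zero means "denied" under this convention */
	return 1;
}

/* Caller that treats any non-zero result as permission granted. */
static int access_granted(int (*check)(int), int log_valid)
{
	assert(check != NULL);
	return check(log_valid) ? 1 : 0;
}
```

With an invalid log, access_granted() reports 1 for the buggy check and 0 for the fixed one.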

The main complaint about every static analysis tool is that the rate of false positives is too high. The problem in the Linux kernel is that the developers fix all the real bugs and so only 100% false positives remain. It's better to focus on new warnings from newly added code because those are often real bugs. I try to discourage people from changing the kernel code just to silence false positives. Changing the code can be a good thing if it makes the code easier to understand but I always tell people that Smatch is still improving so, hopefully, there will be a way to silence the false positive by changing Smatch instead.

I always run Smatch on my patches before sending them to the kernel maintainers and it saves me from embarrassing mistakes. The command to do that is:

~/path/to/smatch/smatch_scripts/kchecker --spammy drivers/modified_file.c

Earlier I showed that Smatch can do cross function analysis. It does analyze short functions inline, as you have seen, but to get the full benefit, you have to build the cross function database. It takes around three hours. The command to do that is:


Running that command creates a smatch_db.sqlite file. Then re-run the kchecker script and it will use the new cross function database. Or if you want to run Smatch over the whole kernel the command is:


If you have any issues or suggestions feel free to email the list at

-- Dan

Tuesday Oct 27, 2015

China Linux Storage and File System (CLSF) Workshop 2015 Report

This is a contributed post by Oracle mainline Linux kernel developer, Liu Bo, who recently presented at the 2015 China Linux Storage and File System (CLSF) Workshop.  CLSF is an invite-only event for Linux kernel developers in China. This event was held on October 15th and 16th in Nanjing.


Haomai Wang from XSKY introduced some basic Ceph concepts, e.g. MDS, Monitor, and OSD. Ceph caches all metadata (the cluster map) on the monitor nodes, so clients can fetch data with just one network hop. If we use CephFS, which adds the MDS, it still needs only one hop, because the MDS doesn't store the filesystem's data or metadata but only the context of distributed locks (range lock? not sure about this). Ceph also supports Samba 2.0/3.0 now.

In Linux, it is recommended to use iSCSI to access a Ceph storage cluster, because using the rbd/libceph kernel modules means the client kernels have to be updated. Ceph uses a pipeline model for message processing, so it suits hard disks but not SSDs. In the future, developers will use an async framework (such as Seastar) to refactor Ceph. Robin (from Alibaba) asked whether the client can write three copies concurrently in a three-replica setup. Haomai answered that it won't: the IO goes to the primary OSD, which then issues two replicated IOs to the other two OSDs and waits until both complete before reporting success to the client.

The future development plan for Ceph is de-duplication at the pool level. Coly Li (Suse) said that de-duplication is better implemented at the business level than at the block level, because the duplicated information has already been split up at the block level.

Ceph Use Cases

Jiaju Zhang from Red Hat led the topic on enterprise use cases of Ceph. Ceph has become the most famous open source storage software in the world and is used by Red Hat, Intel, SanDisk (low-end storage arrays), Samsung and Suse. At the end, Jiaju posed a question: "Can Ceph replace traditional storage (NAS, SAN) in the future?"

Xen Block PV Driver

Bob Liu from Oracle shared the work he'd done on the Xen block PV driver. The patch aims to improve Xen's performance by converting the Xen block PV driver to use the block-mq API and multiple ring buffers. The patch is located at

ScyllaDB and Seastar

Asias He from OSv led this topic. ScyllaDB is a distributed key/value store engine written in C++14 and fully compatible with Cassandra. It can also run CQL (Cassandra Query Language). The slides showed ScyllaDB to be 40 times faster than Cassandra. The asynchronous development framework in ScyllaDB is called Seastar. The magic in ScyllaDB is that it shards requests to every CPU core and runs with no locks and no threads. Data is zero-copy, and bidirectional queues are used to transfer messages between cores. The test results are based on the kernel TCP/IP network stack, but they will use their own network stack in the future.

Yanhai Zhu (from Alibaba) doubted that the comparison was fair, because ScyllaDB is designed to run on multiple cores and Cassandra is not, so it would be fairer to compare ScyllaDB with 24 Cassandra instances. Asias replied that ScyllaDB uses message queues to transfer messages between CPU cores, so it avoids the cost of atomic operations and locks. Also, Cassandra is written in Java, which means performance drops when the JVM does garbage collection. ScyllaDB is written entirely in C++, so its performance is much steadier. Both projects are led by the KVM creator, Avi Kivity.


Fengguang Wu from Intel told me that they're using btrfs a bit on their autotest farm but often experience latency issues. We talked about using btrfs as Docker's storage backend. Everything seems perfect, except that each Docker instance is an individual namespace, so instances cannot share the page cache for the same content. This limits btrfs's use when users need to run a large number of instances, since memory becomes the biggest issue. Besides that, Zheng Liu from Alibaba shared that overlayfs also has latency issues in real production use, i.e. if you just touch a large file, the file will be COPIED from the lower layer to the upper layer. So we all agreed that something should happen in this area.


Xu Wang shared what Hyper is, and why they developed it. The goal is to "run VM like a container"; more precisely, Hyper allows you to run Docker images on any hypervisor. This is fantastic because you can reuse your virtual infrastructure while keeping the advantages of containers, for example small image size, sub-second boot time, and portability.

This shows one thing: the problem with traditional VMs is that they all aim to simulate a bare metal machine in order to run a normal OS, but that's not what we want. One more thing: Hyper is a VM, so it can provide the security that all production use cases want.

Filesystem Deduplication

Gang He from Suse shared with us how VxFS implements its deduplication. It provides online dedup and a series of commands to control dedup behaviour, e.g. we can schedule a dedup run and control the dedup task's CPU usage, memory usage and priority. It can even do a 'dryrun' to find how many blocks could be deduped without actually performing the dedup.

Code Storage

Robin Dong from Alibaba shared how they developed a distributed storage system based on a small open-source project called "sheepdog", which they modified heavily to improve data recovery performance and to make sure it could run on low-end but high-density storage servers. He talked about how they came up with the idea, and about the system design and deployment. The difficult part is not the design process but taking care of every detail in the deployment of the cluster, e.g. finding a proper place and power management for the machines. It's a good example of taking advantage of open source and contributing back to it.


F2FS Update

Chao Yu from Samsung led a topic about F2FS. He listed what happened in the F2FS community in the last year. It looks like F2FS is becoming more generic than just a flash friendly filesystem, as implied by the fact that F2FS now supports larger volumes and larger sectors, and has an in-memory extent cache and a global shrinker. Besides that, F2FS has also improved its performance, including flush, mixed data write and multi-thread performance. Developers also optimized F2FS a bit for SMR drives by allowing the user to choose an over-provision ratio lower than 1% in mkfs.f2fs and by tuning GC. In the future F2FS is planning to support online defrag, transparent compression and data deduplication.

Bcache in the Cloud

Yanhai Zhu from Alibaba led a topic about caching in virtual machine environments. Alibaba chose Bcache as the code base for developing new cache software. Yanhai explained why he didn't choose flashcache. flashcache was his first choice, but it has some drawbacks which cannot be worked around: flashcache uses a hash data structure to distribute IO requests at the start, which splits up the cached data in a multi-tenant environment and makes flashcache unfriendly to sequential writes, so it doesn't fit Alibaba's requirements. They then turned to bcache, which uses a B-tree instead of a hash table to store data. For the strategy, they chose a radical writeback strategy so that the cache sequentializes write IOs and helps the backend absorb peak load.


Zheng Liu from Alibaba gave an update on ext4 over the last year. The biggest change is 'Remove ext3 filesystem driver'. Others are lazytime support, filesystem-level encryption, orphan file handling (by Jan Kara) and project quota. Besides that, Seagate developers worked on ext4's SMR support. In the future ext4 is likely to have data block checksumming and btree (by Mingming Cao of Oracle).


Zhongjie Wu works at Memblaze, a famous Chinese startup in flash storage technology. Zhongjie showed us one of their products built on top of NVDIMM. An NVDIMM is not expensive; it is only a DDR DIMM with a capacitor. Memblaze has developed a new 1U storage server with an NVDIMM (as a write cache) and many flash cards (as the backend storage). It runs an OS they developed themselves and can use Fibre Channel or Ethernet to connect to clients. The main purpose of the NVDIMM is to reduce latency, and they use a write-back strategy. Zhongjie also mentioned that the NVDIMM's write performance is considerably better than its read performance, so they in fact use shadow memory to increase read performance.

The big problem they face with NVDIMM is that the CPU can’t flush data in its L1 cache to the NVDIMM when the whole server powers down. To solve this problem, Memblaze uses write-combining across the CPU cores. It hurts performance a little but ultimately avoids data loss.

NVDIMM Support in Linux

Bob Liu of Oracle talked about NVDIMM support in Linux. There are three options:

  1. Use it as a block device -- an NVDIMM driver is needed.
  2. Filesystem support for persistent memory. There are already patches called DAX for ext4 and xfs. Intel developed PMFS, but Bob said there is still room for improvement.
  3. Use it as main memory -- but this needs struct page, and how to allocate memory remains a problem.

-- Liu Bo

Ed: note there is also coverage by another invitee, Robin Dong of Alibaba: day one, day two.

Tuesday Oct 06, 2015

Linux Kernel hugetlbfs Enhancements, by Mike Kravetz

The following is a write-up by Oracle Mainline Linux Kernel Engineer, Mike Kravetz, on his recent upstream work on enhancing hugetlbfs support in the Linux kernel.


Linux huge page support has been present in the Linux kernel since 2003. When first introduced, the only way to take advantage of huge pages was via hugetlbfs. This often involved modifications to application code and explicit action by system administrators to set up and reserve pools of huge pages. As a result, the use of huge pages was mostly limited to applications such as large databases which wanted the very best performance possible and had skilled developers who could modify and tune their code.

More recently, Linux kernel development has been focused on Transparent Huge Page (THP) support. THP is a system wide feature that enables the use of huge pages by any application without source code modification or system administrator intervention. The creation, management and use of huge pages is managed transparently by the Linux kernel.

THP works well for most applications today. However, some application developers want to achieve the best performance possible. To achieve this, they are willing to modify their application to use the original hugetlbfs interfaces. Of course, this also requires the application to have intimate knowledge of its interaction with system resources. To meet the evolving needs of these applications, two new enhancements were made to hugetlbfs.

Reserving Huge Pages

Users of hugetlbfs typically reserve huge pages at system boot time. This pool of reserved pages is then used as the applications map and fault in huge pages. Since memory reserved for huge pages is not available for other uses, it is important not to reserve an excessive number of pages. However, if too few pages are reserved the applications may receive out of memory errors when the reserved pool is exhausted. Therefore, users attempt to make an accurate estimate of their huge page needs and have their applications make use of all the reserved huge pages.

One concern in this area is that the pool of reserved pages is global. Therefore, it is possible for any user/application on the system with sufficient privilege to use huge pages in the reserved pool. This could cause problems for an application that expects a certain number of huge pages.

An application would like some reasonable assurance that allocations will not fail due to a lack of huge pages. At application start-up time, the application would like to configure itself to use a specific number of huge pages. Before starting, the application can check to make sure that enough huge pages exist in the system global pools. However, there are no guarantees that those pages will be available when needed by the application. The application really wants exclusive use of a subset of huge pages.

A new hugetlbfs mount option 'min_size=' was developed to indicate the number of huge pages guaranteed to be available for use by the filesystem. At mount time, this number of huge pages will be reserved for exclusive use of the filesystem. If there are not a sufficient number of free pages, the mount will fail. As applications allocate and free huge pages from the filesystem, the number of reserved pages is adjusted so that the specified minimum is maintained. In this way, the application is assured the specified number of huge pages will be available for their use.
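A minimal sketch of mounting hugetlbfs with a guaranteed minimum from C follows. The helper names are hypothetical, and the mount point passed in is whatever directory the caller chooses. The sizes are written here as plain byte counts, which is an assumption to verify against the hugetlbfs documentation for the kernel in use. The mount itself needs root privileges and a kernel with min_size support.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Format the hugetlbfs mount options into buf.
 * Returns the length of the option string, or -1 if buf is too small.
 */
static int format_hugetlbfs_opts(char *buf, size_t len,
				 unsigned long long pagesize,
				 unsigned long long min_size)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "pagesize=%llu,min_size=%llu",
		     pagesize, min_size);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}

/*
 * Mount hugetlbfs at dir with min_size reserved for this filesystem.
 * Returns the mount(2) result: 0 on success, -1 with errno set on
 * failure (for example when there are not enough free huge pages).
 */
static int mount_hugetlbfs_reserved(const char *dir,
				    unsigned long long pagesize,
				    unsigned long long min_size)
{
	char opts[128];

	if (format_hugetlbfs_opts(opts, sizeof(opts), pagesize, min_size) < 0)
		return -1;
	return mount("none", dir, "hugetlbfs", 0, opts);
}
```

For 2 MB huge pages and a 1 GB minimum, the option string comes out as "pagesize=2097152,min_size=1073741824".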

The min_size mount option for hugetlbfs was added to the 4.1 version of Linux kernel.

Punching Holes in hugetlbfs files 

As mentioned above, applications which make use of huge pages via hugetlbfs often have intimate knowledge of their system resource needs. In addition, these applications may use files within hugetlbfs as huge page backed shared memory. Within the application, many processes will be simultaneously mapping these files. Some of the data in these files is long lived, and is used throughout the life of the application. Other data may only be used for a period of time and then never accessed again. When the application knows that data within these files is no longer needed, it would like to release the huge pages associated with the data so that they can be used for other purposes.

Punching holes within files is accomplished with the fallocate() system call in traditional filesystems. In Linux, the tmpfs filesystem also supports fallocate hole punch. Adding this support to hugetlbfs provides the requested functionality to release huge pages within files. Hole punching in hugetlbfs is actually simpler than for other filesystems. This is because hugetlbfs is a memory only filesystem, therefore there is no disk or swap space to be concerned with.
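A hedged sketch of how an application might punch a hole with fallocate() is shown below. The helper names are made up. Since only whole huge pages can be released, this sketch shrinks the requested range inward to huge page boundaries itself before making the call; the huge page size is passed in by the caller and is assumed to be a power of two.

```c
#define _GNU_SOURCE		/* for fallocate() */
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>	/* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/types.h>

/* Round x up to a multiple of the huge page size (a power of two). */
static off_t round_up_hpage(off_t x, off_t hpage)
{
	return (x + hpage - 1) & ~(hpage - 1);
}

/* Round x down to a multiple of the huge page size (a power of two). */
static off_t round_down_hpage(off_t x, off_t hpage)
{
	return x & ~(hpage - 1);
}

/*
 * Release the huge pages fully covered by [offset, offset + len) in the
 * file behind fd. Returns 0 without calling the kernel if no huge page
 * is fully covered, otherwise returns the fallocate() result.
 */
static int punch_huge_pages(int fd, off_t offset, off_t len, off_t hpage)
{
	off_t start, end;

	assert(hpage > 0 && (hpage & (hpage - 1)) == 0);
	start = round_up_hpage(offset, hpage);
	end = round_down_hpage(offset + len, hpage);
	if (end <= start)
		return 0;
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 start, end - start);
}
```

FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE, so the file size is unchanged and only the backing pages are freed.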

In addition to hole punch, fallocate pre-allocation support was also added for hugetlbfs. This allows one to allocate multiple huge pages with a single system call. Without pre-allocation, each huge page would be allocated at page fault time.

hugetlbfs fallocate support is part of the 4.3 release candidate series of the Linux kernel.

Future enhancements

The 4.3 Linux kernel release candidate series contains support for userfaultfd, by Andrea Arcangeli. This new functionality allows for the handling of page faults in user space. An application can monitor a range of virtual addresses. When a page fault happens within this range, the application is notified and can take various actions.

The initial version of userfaultfd only supports anonymous VMA mappings. Applications using hugetlbfs may also like to use userfaultfd. One identified use case is the monitoring of address ranges that were hole punched with fallocate. Access to these areas may be considered an error by the application. Therefore, the application would like to be notified of such accesses.

The addition of userfaultfd support for hugetlbfs is being considered for a future Linux kernel release.

-- Mike Kravetz

Tuesday Aug 11, 2015

Upcoming LinuxCon Participation by Oracle Kernel Team & Related Folk

Next week in Seattle is a big one for Linux, with LinuxCon North America on from 17-19 August, as well as a series of co-located events, including CloudOpen, the Linux Plumbers Conference, ContainerCon, the Xen Project Developer Summit, and the Linux Security Summit.  There's a whole bunch more, including a 5km fun run at some point, for the truly enthusiastic.

Several folks from the Oracle Mainline Linux Kernel team will be participating:

Folks presenting from other, related groups at Oracle include:

Additionally, Wim Coekaerts, the SVP of our group, will be giving a keynote, How the Cloud Revolution is Changing the Role of the Operating System.

There'll be other folks from these teams in attendance.  If you're interested in coming to work on the mainline Linux kernel at Oracle, please feel free to talk to myself or any others you see with an Oracle badge, or drop by our booth.

Friday May 29, 2015

On Self Describing Filesystem Metadata, by Darrick Wong

The following is a write-up by Oracle mainline Linux kernel engineer, Darrick Wong, providing some background on his work on Linux FS Metadata Checksumming, which, after many years of work, will be turned on by default in the upcoming e2fsprogs 1.43 and xfsprogs 3.2.3.

One of the bigger problems facing filesystems today is the problem of online verification of the integrity of the metadata. Even though storage bandwidth has increased considerably and (in some cases) seek times have dropped to nearly zero, the forensic work required to square a filesystem back to sense increases at least as quickly as metadata size, which scales up about as quickly as total storage capacity. Furthermore, the threat of random bit corruption in a critical piece of metadata causing unrecoverable filesystem damage remains as true as it ever was -- the author has encountered scenarios where corruption in the block usage data structure results in the block allocator crosslinking file data with metadata, which multiplies the resulting damage.

Self-describing metadata helps both the kernel and the repair tools to decide if a block actually contains the data the filesystem is trying to read. In most cases, this involves tagging each metadata block with a tuple describing the type of the block, the block number where the block lives, a unique identifier tying the block to the filesystem (typically the FS UUID), the checksum of the data in the block, and some sort of pointer to the metadata object that points to the block (the owner). For a transactional filesystem, it is useful also to record the transaction ID to facilitate analyzing where in time a corruption happened. Storing the FS UUID is useful in deciding whether an arbitrary metadata block actually belongs to this filesystem, or if it belongs to something else -- a previous filesystem or perhaps an FS image stored inside the filesystem. Given a theoretical mental model of an FS as a forest of trees all reachable by a single root, owner pointers theoretically enable a repair effort to reconstruct missing parts of the tree.

The checksum, while neither fool-proof nor tamper-proof, is usually a fast method to detect random bit corruption. While it is possible to choose stronger schemes such as sha256 (or even cryptographically signed hashes), these come with high performance and management overhead, which is why most systems choose a checksum of some sort. Both XFS and ext4 chose CRC32c, primarily for its ability to detect bit flips and the presence of hardware acceleration on a number of platforms.

One area that neither XFS nor ext4 has touched on is data checksumming. While it is technically possible to record the same self-description tuple for data blocks (btrfs stores at least the checksum), this was deliberately left out of the design for both XFS and ext4, for three reasons. (There will be more to say about data block back-references later.) First, requiring a metadata update (and log transaction) for every write of every block will have a sharply negative impact on rewrite performance. Second, some applications ensure that their internal file formats already provide the integrity data that the application requires; for them, the filesystem overhead is unnecessary. Migration of the data and its integrity information is easier when both are encapsulated in a single file. Third, performing file data integrity in userspace has the advantage that the integrity profiles can be customised for each program -- some may deem bitflip detection via CRC to be sufficient; others might want sha256 to take advantage of the reduced probability of collisions; and still more might go all the way to verification through digital signatures. There does not seem to be a pressing need to provide data block integrity specifically through the filesystem, unlike metadata, which is accessible only through the filesystem.

In XFS, self-describing metadata was introduced with a new (v5) on-disk format. All existing v4 structures were enlarged to store (type, blocknr, fsuuid, owner, lsn); this allowed XFS to deploy a set of block verifiers to decide quickly if a block being read in matches what the reader expects. These verifiers also perform a quick check of the block's metadata at read and write time to detect bad metadata resulting from coding bugs. Unfortunately, it is necessary to reformat the filesystem to accommodate the resized metadata headers. The kernel and the repair tool are still quick to discard broken metadata; however, as we will see, this new metadata format extension opens the door to enhanced recovery efforts.
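The idea of a block verifier can be sketched as below. This is not the XFS on-disk layout or the XFS verifier code; the struct, the enum and the function name are hypothetical, and only the five fields named in the text are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical self-describing header; not the XFS on-disk layout. */
struct meta_hdr {
	uint32_t type;		/* what kind of metadata block this is */
	uint64_t blocknr;	/* where the block believes it lives */
	uint8_t  fsuuid[16];	/* filesystem the block belongs to */
	uint64_t owner;		/* metadata object that points here */
	uint64_t lsn;		/* transaction that last wrote the block */
};

enum verify_result { VERIFY_OK, BAD_TYPE, BAD_LOCATION, BAD_UUID, BAD_OWNER };

/* Decide whether the block just read is the block the reader expected. */
static enum verify_result verify_block(const struct meta_hdr *hdr,
				       uint32_t want_type, uint64_t read_from,
				       const uint8_t want_uuid[16],
				       uint64_t want_owner)
{
	assert(hdr != NULL && want_uuid != NULL);
	if (hdr->type != want_type)
		return BAD_TYPE;
	if (hdr->blocknr != read_from)
		return BAD_LOCATION;	/* misdirected write or stale copy */
	if (memcmp(hdr->fsuuid, want_uuid, 16) != 0)
		return BAD_UUID;	/* belongs to some other filesystem */
	if (hdr->owner != want_owner)
		return BAD_OWNER;
	return VERIFY_OK;
}
```

A block that was written to the wrong location, or that survives from a previous filesystem, fails one of these comparisons even if its contents are internally consistent.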

For ext4, it was discovered that every metadata structure had sufficient room to squeeze in an extra four or two byte field to store checksum data while leaving the structure size and layout otherwise intact. This meant making a few compromises in the design -- instead of adding the 5 attributes to each block, a single 32-bit checksum is calculated over the type, blocknr, fsuuid, owner and block data; this value is then plugged into the checksum field. This scheme allows ext4 to decide if a block's contents match what we thought we were reading, but it will not enable us to reconstruct missing parts of the FS metadata object hierarchy. However, existing ext2/3/4 filesystems can be upgraded easily via tune2fs.

In the near future, XFS could grow a few new features to enable an even greater level of self-directed integrity checking and repair. Inodes may soon grow parent directory pointers, which enable XFS to reconstruct directories by scanning all non-free (link count > 0) inodes in the filesystem. Similarly, a proposed block reverse-mapping btree makes it possible for XFS to rebuild a file by iterating all the rmaps looking for extent data. These two operations can even be performed online, which means that the filesystem can evolve towards self-healing abilities. Major factors blocking this development are (a) the inability to close an open file and (b) the need to shut down the allocators while we repair per-AG data. These improvements will be harder or impossible to implement for ext4, unfortunately.
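The ext4 approach described above, where one checksum covers the block's identity as well as its contents, can be sketched as follows. The field order, the choice of fields and the helper names are assumptions made for illustration and do not reproduce ext4's actual checksum layout; the CRC32c routine is a plain bitwise version without hardware acceleration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	size_t i;
	int bit;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		crc ^= p[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Fold the identity of the block (filesystem UUID and block number)
 * into the same checksum as the data, so that a valid block read from
 * the wrong place or the wrong filesystem fails verification.
 * blocknr is hashed in host byte order here; a real on-disk format
 * would fix the endianness.
 */
static uint32_t block_csum(const uint8_t fsuuid[16], uint64_t blocknr,
			   const void *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	crc = crc32c_update(crc, fsuuid, 16);
	crc = crc32c_update(crc, &blocknr, sizeof(blocknr));
	crc = crc32c_update(crc, data, len);
	return ~crc;
}
```

Because only the 32-bit result is stored, a mismatch says that something is wrong but not which attribute was wrong, which is the compromise the text describes.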

The metadata checksumming features as described will be enabled by default in the respective mkfs tools as part of the next releases of e2fsprogs (1.43) and xfsprogs (3.2.3). Existing filesystems must be upgraded (ext4) or reformatted and reloaded (xfs) manually.

-- D

Further reading:

Tuesday Dec 02, 2014

Improving Sunvnet Performance on Linux for SPARC, by Sowmini Varadhan

The following is a write-up by Oracle mainline Linux kernel engineer, Sowmini Varadhan, detailing her recent work on improving the performance of the Sunvnet driver on Linux for SPARC.


In the typical device-driver, the Producer (I/O device) notifies the Consumer (device-driver) that data is available for consumption by triggering a hardware interrupt at a fixed Interrupt Priority Level (IPL). In the purely interrupt-driven model, the Consumer then masks off any additional Rx interrupts from the driver, and drains the read-buffers in hardware-interrupt context. A network device-driver would then enqueue packets for the TCP/IP stack where they would typically be processed in software interrupt (softirq) context.

Dispatching an interrupt is an expensive operation, thus network device drivers should attempt to batch interrupts, i.e., process as many packets as possible within the context of one interrupt. Also, hardware interrupts preempt all tasks running at a lower IPL. Thus the amount of time spent in hardware interrupt context should be kept to a minimum. As pointed out in Mogul [1], "If the event rate is high enough to cause the system to spend all of its time responding to interrupts, then nothing else will happen, and the system throughput will drop to zero". This condition is called receive-livelock, and all purely interrupt-driven systems are susceptible to it.

We will now talk about the various improvements made to the sunvnet driver on Linux to convert it from being a purely interrupt-driven network device driver to one that implements all of the above prescriptions using Linux's most current device-driver infrastructure.

What is Sunvnet?

In a virtualized environment such as LDoms, the guest Operating Systems (DomU) communicate with each other using a virtual link-layer abstraction called Logical Domain Channel (LDC) on SPARC. The LDC provides point-to-point communication channels between the guests, or between the domU and an external entity such as a service processor or the Hypervisor itself. The LDC provides an encapsulation protocol for other upper-layer protocols such as TCP/IP and Ethernet.

Sunvnet is the device driver that implements this virtual link-layer on Linux.

Batching Interrupts

In its simplest mode of operation, when the LDC Producer wishes to send an IP packet to the consumer, it needs to do two things:

  1. Copy the data packet to a descriptor buffer. In the "TxDring" mode, this buffer is a shared-memory region that is "owned" by the Producer.
  2. After the packet has been successfully copied, the Producer needs to signal to the Consumer that data is available. This is achieved by sending a "start" message over the LDC. A "start" message is a 64-byte message sent over the LDC in the format specified by the VIO protocol. The start message has a subtype of VIO_SUBTYPE_DATA, and specifies the index of the descriptor buffer at which data is available.

The transmission of the LDC "start" message is processed at the Hypervisor, and will result in a hard interrupt at the Consumer, which will invoke the ldc_rx() interrupt handler. The Consumer then processes the interrupt in hardirq context, and when it is done, if the Producer had requested a "stopped" ack for the packet, the Consumer sends back a "stopped" message over LDC. Just like the "start" message, the "stopped" message is specified by the VIO protocol. It has a subtype of VIO_SUBTYPE_ACK (0x2) and allows the Consumer to specify the index at which data was last read.

Note that the VIO protocol does not mandate a "stopped" LDC message for every descriptor read/write: the Consumer is required to send back an LDC "stopped" message if, and only if:

  • the Producer has requested it for the descriptor; or,
  • the Consumer has read a full burst of ready data in descriptors, and there are no more ready descriptors.
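The two conditions above can be written down as a small decision function. This is a sketch with hypothetical names, not the sunvnet code; it also counts how many "stopped" messages a burst of ready descriptors would cost under these rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * After reading one descriptor, is a "stopped" message owed?
 * ack_requested: the Producer asked for an ACK on this descriptor.
 * more_ready:    another ready descriptor follows in the ring.
 */
static bool must_send_stopped(bool ack_requested, bool more_ready)
{
	if (ack_requested)
		return true;
	return !more_ready;	/* end of a full burst of ready data */
}

/* Count the "stopped" messages sent while draining n ready descriptors. */
static unsigned int stopped_msgs_for_burst(const bool *ack_requested,
					   unsigned int n)
{
	unsigned int i, sent = 0;

	assert(ack_requested != NULL || n == 0);
	for (i = 0; i < n; i++) {
		bool more_ready = (i + 1 < n);

		if (must_send_stopped(ack_requested[i], more_ready))
			sent++;
	}
	return sent;
}
```

A burst of four descriptors with no ACK requests costs a single "stopped" message, while the same burst with an ACK requested on every descriptor costs four.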

LDC messaging is expensive: it requires a slot in the LDC ring, in addition to triggering a hardware interrupt at the receiver. Thus the first step to improving sunvnet performance was to reduce the number of LDC messages sent and received, and to batch packets as much as possible.

We achieved this with the following patches:

These, along with some other bug fixes, brought sunvnet to a more stable performance level: we observed fewer dev_watchdog hangs (previously seen due to flow-control assertions caused by a full LDC channel) and fewer soft-lockups. It also gave a 25% bump to performance. In iperf tests on a T5-2 using 16 VCPUs and 16 iperf threads, we were now able to handle approximately 100k pps, whereas we were only able to handle a maximum of 80k pps prior to the fixes. (See diagram below).

But all packets were still being received in hard-interrupt context. And as Mogul1 established in the 1990s, that is toxic to performance.


Linux implements the concepts described in Mogul1 through a common device-driver infrastructure called NAPI. The NAPI framework allows a driver to defer reception of packet-bursts from hardware-interrupt context to a polling-mechanism that is invoked in softirq context. In addition to the benefits of interrupt mitigation and avoidance of receive live-lock, this also has other ramifications:

  • Since packet transmission via NET_TX_ACTION is already done in softirq context, moving Rx processing to softirq context now allows Tx reclaim, and recovery from link-congestion, to be done more efficiently.
  • The locking model is simplified, eliminating a number of spin_[un]lock_irqsave[restore] invocations, and improving system performance in general.
  • Moving the Rx processing to softirq context allows the driver to use the vastly more efficient netif_receive_skb() to pass the packet up to the network-stack, instead of being constrained to defer to netif_rx(), which is invoked in the less-desirable process context.
  • We also get the benefits of ksoftirqd to schedule softirq under scheduler control. Otherwise, everything would get processed on the CPU that receives the hardware interrupt, and you would have to configure RPS to distribute those hardirqs (can be done, but requires extra administration).

We'll now walk through the changes made to NAPIfy sunvnet, to examine each of these items.

The details...

The sunvnet driver has a `struct vnet_port' data-structure for each connected peer. At the minimum, there is one such structure for the vswitch peer in Dom0. In addition, if the Dom0 ldm property `inter-vnet-link' has been set to `on' (the default), DomU's on the same physical host will have a virtual point-to-point channel over LDC. Each such channel is represented by a unique `struct vnet_port' and has its own LDC ring and Rx descriptor buffers.

As part of the device probe callback, sunvnet allocates one `struct napi_struct' instance for each `struct vnet_port'.

    struct vnet_port {
            /* ... */
            struct napi_struct      napi;
            /* ... */
    };

The next NAPI requirement is to modify the driver's Rx interrupt handler. When a new packet becomes available, the driver must disable any additional Rx interrupts (LDC Rx interrupts in this case), and arrange for polling by invoking napi_schedule. This is achieved as follows:

Both sunvnet and the VDC (virtual disk driver) infrastructure share a common set of routines for processing the VIO messages and LDC interrupts. Thus the Rx interrupt handler (`ldc_rx()') is common to both modules, which hands off packets destined to sunvnet by invoking the `vnet_event()' callback that is registered by sunvnet. In `vnet_event()', we defer packet processing to the NAPI poll callback by recording the events (which may include both LDC control events such as UP/DOWN notifications, as well as notification about incoming data), disabling hardware interrupts, and scheduling a NAPI callback for the poll handler.

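static void vnet_event(void *arg, int event)
{
        struct vnet_port *port = arg;
        struct vio_driver_state *vio = &port->vio;

        /* record the event, mask further LDC Rx interrupts, schedule the poll */
        port->rx_event |= event;
        vio_set_intr(vio->vdev->rx_ino, HV_INTR_DISABLED);
        napi_schedule(&port->napi);
}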

We now need to set up the poll handler itself. We do this in `vnet_poll()' which has the signature:

        static int vnet_poll(struct napi_struct *napi, int budget);

Thus vnet_poll will be called with a pointer to the NAPI instance, so that the `struct vnet_port' can be obtained as

        struct vnet_port *port = container_of(napi, struct vnet_port, napi);
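
For readers unfamiliar with the container_of() idiom, here is a minimal userspace sketch of how the enclosing structure is recovered from a pointer to an embedded member. The two structures are hypothetical, cut-down stand-ins (the real `struct napi_struct' and `struct vnet_port' have many more fields), the macro is a simplified version of the kernel's, and napi_to_port() is a helper name invented for this illustration.

```c
#include <stddef.h>

/* Simplified userspace version of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the real kernel structures. */
struct napi_struct {
	int weight;
};

struct vnet_port {
	int port_id;			/* illustrative field, not from sunvnet */
	struct napi_struct napi;	/* embedded, as in the driver */
};

/* Recover the enclosing port from the embedded NAPI instance. */
static struct vnet_port *napi_to_port(struct napi_struct *napi)
{
	return container_of(napi, struct vnet_port, napi);
}
```

Because the NAPI instance is embedded in the port rather than pointed to, no extra lookup table is needed to get from the poll callback's argument back to the port.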

The `budget' parameter is an upper-bound on the number of packets that can be processed in any single ->poll invocation. The intention of the `budget' parameter is to ensure fair-scheduling across drivers, and avoid starvation when a single driver gets flooded with packet burst. The ->poll() callback, i.e., vnet_poll(), must return the number of packets processed. A return value that is less than the budget can be taken to indicate that we are at the end of a packet burst, i.e., hard-interrupts can be re-enabled. We do this in `vnet_poll()' as

        if (processed < budget) {
                /* end of burst: leave polling mode and unmask the LDC Rx interrupt */
                napi_complete(napi);
                port->rx_event &= ~LDC_EVENT_DATA_READY;
                vio_set_intr(vio->vdev->rx_ino, HV_INTR_ENABLED);
        }

Here the value of processed is obtained by calling

        int processed = vnet_event_napi(port, budget);

where vnet_event_napi examines and processes the `rx_event' bits available on the `vnet_port'. If data is available on the port, `vnet_event_napi()' will read the LDC channel for information about the starting descriptor index, and process a batch of descriptors in softirq mode, passing up the received packets to the network stack using `napi_gro_receive()'. The batch processing of descriptors is constrained to at most `budget' descriptors per vnet_event_napi() invocation.
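
The interplay between the interrupt handler and the poll routine can be modelled in a few lines of userspace C. This is only a sketch under simplifying assumptions: `struct sim_port', sim_event(), sim_poll() and EVENT_DATA_READY are names made up for the illustration, the "ring" is just a counter of pending packets, and interrupts and scheduling are modelled as boolean flags rather than real hypervisor or NAPI calls.

```c
#include <stdbool.h>

#define EVENT_DATA_READY 0x1u	/* stand-in for LDC_EVENT_DATA_READY */

/* Hypothetical model of the per-port state used in this sketch. */
struct sim_port {
	unsigned int rx_event;	/* pending event bits */
	bool intr_enabled;	/* models the LDC Rx interrupt state */
	bool poll_scheduled;	/* models the poll routine being scheduled */
	int pending;		/* packets waiting in the ring */
};

/* Models the interrupt side: record event, mask interrupt, schedule poll. */
static void sim_event(struct sim_port *p, unsigned int event)
{
	p->rx_event |= event;
	p->intr_enabled = false;
	p->poll_scheduled = true;
}

/*
 * Models the poll side: consume at most `budget' packets.  Processing
 * fewer than `budget' means the burst has ended, so the event bit is
 * cleared and the interrupt is unmasked again.
 */
static int sim_poll(struct sim_port *p, int budget)
{
	int processed = 0;

	while (p->pending > 0 && processed < budget) {
		p->pending--;
		processed++;
	}
	if (processed < budget) {
		p->rx_event &= ~EVENT_DATA_READY;
		p->intr_enabled = true;
		p->poll_scheduled = false;
	}
	return processed;
}
```

With five packets pending and a budget of four, the first sim_poll() call returns 4 and leaves the interrupt masked; the second returns 1 and unmasks it, so a single interrupt covers the whole burst.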

The final step is to inform the NAPI infrastructure that `vnet_poll()' is the poll callback. We do this in the vnet_port_probe() routine

        netif_napi_add(port->vp->dev, &port->napi, vnet_poll, NAPI_POLL_WEIGHT);

and actually enable NAPI before marking the port up:

        napi_enable(&port->napi);

Some caveats specific to sunvnet/LDC

The `budget' parameter passed by the NAPI infra to `vnet_poll()' places an upper-bound on the number of packets that may be processed in a single ->poll callback. While this ensures fair-scheduling across drivers, we should be careful not to unnecessarily send LDC stop/start messages at each `budget' boundary when the packet burst size is larger than the `budget'.

This entails tracking additional state in the `vnet_port' to remember (a) when packet processing is truncated prematurely due to `budget' constraints, (b) the last index processed, when (a) occurs.

Both of these items are tracked in the `vnet_port' as

        bool                    napi_resume;
        u32                     napi_stop_idx;
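
To make the resume logic concrete, here is a small userspace sketch of a budget-bounded descriptor walk. Only the names napi_resume and napi_stop_idx come from the driver; `struct port_state', poll_burst() and the stopped_acks counter are hypothetical, and the sketch assumes descriptor indices that do not wrap (real ring indices do).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-port state for this illustration. */
struct port_state {
	bool     napi_resume;	/* last poll was cut short by the budget */
	uint32_t napi_stop_idx;	/* last descriptor index processed */
	uint32_t stopped_acks;	/* illustrative count of "stopped" messages */
};

/*
 * Process descriptors start..end (inclusive), at most `budget' per call.
 * If the previous call was truncated, continue after napi_stop_idx rather
 * than from `start'.  A "stopped" message is counted only once the whole
 * burst has been consumed.  Returns the number of descriptors processed.
 */
static int poll_burst(struct port_state *ps, uint32_t start, uint32_t end,
		      int budget)
{
	uint32_t idx = ps->napi_resume ? ps->napi_stop_idx + 1 : start;
	int processed = 0;

	if (budget <= 0)
		return 0;

	while (idx <= end && processed < budget) {
		/* a real driver would pass the packet up the stack here */
		ps->napi_stop_idx = idx++;
		processed++;
	}

	if (idx <= end) {
		ps->napi_resume = true;		/* truncated by the budget */
	} else {
		ps->napi_resume = false;	/* burst fully consumed */
		ps->stopped_acks++;
	}
	return processed;
}
```

For a 10-descriptor burst and a budget of 4, three calls process 4, 4 and 2 descriptors, and only the last one accounts for a "stopped" message, rather than one at every budget boundary.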

Benefits of NAPI

The most obvious benefit of NAPIfication is interrupt mitigation. The ability to process packets in softirq context and pass up packets using napi_gro_receive() by itself results in a significant increase in packet processing rate. On a T5-2 with 16 VCPUs, iperf tests using 16 threads result in 230k pps (compared to the newer baseline of 100k pps!). This is a further 130% increase in performance.

In addition, conforming to the NAPI infrastructure automatically provides access to all the newest features and enhancements in the Linux driver infra, such as enhanced RPS.

But there are other benefits as well. With both Tx and Rx packets now being processed in softirq context, the irq save/restore locking done in sunvnet at the port level is eliminated, resulting in lock-less processing. The netif_tx_lock() can instead be used to synchronize access in the critical sections such as Tx reclaim which can now be inlined from the ->poll routine without any pre-emption concerns with dev_watchdog().

Multiple Tx queues

We've mostly talked about Rx side handling here, but on Tx side, when inter-vnet-link is on, we have a virtual point-to-point link between guests on the same physical host. As mentioned earlier, each such point-to-point link is represented as its own data-structure (`struct vnet_port') and has its own LDC ring and Rx descriptor buffers. Thus a flow-controlled path due to bursty traffic between peers A and B should not impact traffic between peers A and C. The Linux driver infrastructure makes this possible through the support for multiple Tx queues.

Briefly, these were the steps to set up multiple Tx queues:

  1. Queue allocation: invoke alloc_etherdev_mqs(), to set up VNET_MAX_TXQS queues when creating the `struct net_device'.
  2. As each port is added, assign a queue index to the port in a round-robin fashion. The assigned index is tracked in the `vnet_port' structure.
  3. Supply a ->ndo_select_queue callback that returns the selected queue to dev_queue_xmit() when it calls netdev_pick_tx(). In the case of sunvnet, the vnet_select_queue() should simply return the index assigned to the vnet_port that would be selected for the outgoing packet.
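
Steps 2 and 3 can be sketched in userspace C as follows. This is an illustration rather than the driver code: `struct port', `struct vnet', port_add() and select_queue() are hypothetical names, and the value given to VNET_MAX_TXQS here is an assumption made for the example.

```c
#include <stdint.h>

#define VNET_MAX_TXQS 16	/* value assumed for illustration */

struct port {			/* hypothetical stand-in for struct vnet_port */
	uint16_t q_index;	/* Tx queue assigned when the port is added */
};

struct vnet {			/* hypothetical per-device state */
	unsigned int nports;	/* number of ports added so far */
};

/* Step 2: assign queue indices round-robin as ports are added. */
static void port_add(struct vnet *vp, struct port *port)
{
	port->q_index = (uint16_t)(vp->nports++ % VNET_MAX_TXQS);
}

/* Step 3: the queue for a packet is the queue of its destination port. */
static uint16_t select_queue(const struct port *dest)
{
	return dest->q_index;
}
```

With more ports than queues, indices wrap around and some ports share a queue; with fewer, each peer gets a queue of its own, which is what keeps congestion towards one peer from stalling traffic to the others.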

After the integration of multiple Tx queues, we can do even better at recovering from flow-control.

Flow control is asserted on the Tx side when we exhaust the descriptor rings for data and/or run out of resources to send LDC messages. After the batched LDC processing optimizations, it is uncommon to run out of resources for LDC messages. Thus flow-control is typically asserted when the Producer generates data much faster than the Consumer can consume it, at which point netif_tx_stop_queue() is called, blocking the Tx queue for a specific peer.

The flow-control can thus be released when we get back an LDC stopped ACK from the blocked peer (neatly identified by the LDC message, and by the specific vnet_port and Tx queue!).

Conclusions and Future Work

In addition to NAPI, Linux offers other alternatives to drivers for deferring work away from hard-interrupt context, such as bottom-half (BH) handlers and tasklets.

A BH Rx handler will eliminate the problems of the interrupt context, and packets can now be received in process context, which speeds up things somewhat. But it still cannot call netif_receive_skb(), since that can deadlock on socket locks with the softirq-based tasklets that do TCP timers, packet rexmit, etc. So the BH handler is constrained to use netif_rx_ni(), which is still less efficient than the straight-through call to pass up the packet via netif_receive_skb().

Both NAPI and tasklet based implementations offer softirq context, which allows the driver to safely invoke netif_receive_skb() to deliver the packet to the IP stack. NAPI, which seamlessly allows softirq context for both Tx and Rx processing, and already has the infrastructure to handle bursts of packets with fair-scheduling, proved to be the best option for sunvnet.

In the near future, we will be adding support for Jumbo Frames and TCP Segmentation Offload, to further leverage hardware support by offloading features where possible. Another feature that offers potential for improving performance is the "RxDring" model, where the Consumer owns the shared-memory buffer for receiving data that the Producer then populates. In the RxDring model, the buffer can then be part of the sk_buff itself, thereby saving one memcpy for the Consumer.


1Mogul - Eliminating receive livelock in an interrupt-driven kernel

-- Sowmini Varadhan

Monday Nov 17, 2014

China Linux Storage and File System (CLSF) Workshop 2014

This is a contributed post by Oracle mainline Linux kernel developer, Liu Bo, who recently presented at the 2014 China Linux Storage and File System (CLSF) Workshop.  This event was held on October 21st in Beijing. 


Xue Jiufei from Huawei showed us their work on OCFS2 of the last year, including bug fixes like "Packet loss when reconnect" and a few features like "Range lock based on DLM" and "Self-heal when fault recover".

She also mentioned that Huawei already plans to use OCFS2 in production.

Highlights of "range lock based on DLM": it uses an internal red-black tree for locking; writes have higher priority, but read ranges and write ranges can be merged together; and it does delayed unlock, which is expected to improve unlock performance.


Yu Chao from Samsung gave an update on F2FS. F2FS is designed to be a flash-friendly filesystem, and it can overcome some problems of older flash filesystems, for example, the snowball effect of the wandering tree. Most importantly, F2FS is much faster than other flash filesystems; perhaps that's why it was merged into mainline so fast.

He talked about some details of F2FS including disk layout and core data structure, and his work mainly focuses on bug fixes.


I held this slot and talked about updates in the last year, such as "NO-HOLE", async metadata reclaim and the infamous bugs. Several people were very interested in how the bug was nailed down when I said that it's related to workqueue and is very difficult to reproduce.

An engineer from Fujitsu who has worked a lot on workqueue thought that the bug should also be a workqueue bug. We had a discussion on the details behind the bug, and after he figured it out he said he would talk to workqueue's maintainer.


Memblaze is a company which focuses on flash storage. One of their engineers shared with us their product, AFA (All Flash Array). He mainly talked about the trend of current SSD-oriented file systems. They're using NVDIMM on AFA, but there are some challenges in using it on Linux, because Linux has a heavy block layer and scalability problems. The bottlenecks are Linux's IO latencies, context switch cost and interrupt handling: flash is very fast, so plenty of interrupts are sent to Linux.


Peng Tao from Primarydata talked about NFS updates in the last year. He covered new features of NFSv4.2 and talked about pNFS's Flex file layout and nfsd's per-bucket spinlock.


Huawei's Hu Jianyang held this UBIFS (Unsorted Block Image filesystem) slot.  He talked about UBIFS's infrastructure and how it differs from other UBI upper layers. It acts similarly to an FTL: for example, it can read/write/erase, and it can also do map/unmap.

UBIFS has features like static wear-leveling, transparent compression (lz4hc supported), writeback, and it supports flash of up to 32G size.

He also gave more details of UBIFS's FASTMAP feature. A normal UBIFS mount needs a time-consuming probe when the flash media is fairly large; fastmap addresses this scalability issue by scanning only a fixed number of blocks.

Linux Kernel Performance and 0day Testing

Fengguang Wu from Intel held this slot. He is the author of the 0day test system. He said that this system runs thousands of VM test machines and tests over 400 Linux kernel git trees.

Initially this test system could only do compile/build tests; when errors occur, it automatically tries to git bisect to the buggy commit and notifies the commit author and the maintainer of the subsystem. The test system is flexible: testcases can easily be added, and it now supports performance tests.

For filesystems, it runs popular tests like xfstests, fsmark, etc.

Compared to other test systems, like the Open POSIX Test Suite, this test system has much less code.

However, there are some issues. For example, random kernel config testing is not feasible because the number of combinations is huge, and the same applies to filesystem mount options and mkfs options. Fengguang said that these need developers' help to filter out which option combinations are needed.


Xie Liang from Xiao Mi mainly talked about their issues of using ext4 and linux block layer. They tried HBASE and others to build storage for their cloud service, like Micloud.

They found that local filesystem + bio has some problems. One is that ext4's buffered IO latency is not good enough: it always fluctuates with the journal enabled. The ext4 developers suggested that they use no journal or async journal, and said that having a journal only benefits ext4's fsck speed. Another problem is IO priority: the current IO schedulers in Linux, whether cfq or deadline, didn't perform well on their systems. Taobao's engineer suggested using their newly written IO schedulers: one that mixes cfq and deadline, and another new IOPS-based scheduler called TPPS.

Taobao's Liu Zheng also gave an update on ext4.  Frankly, there have been no new kernel features in the last year, but they're planning some. He talked about encryption support in ext4, project quota, data block checksums and reflink (from Oracle's Mingming Cao).

Someone asked why ext4 needs encryption when we already have eCryptfs; why not use that instead? Zheng answered, "Perhaps Google wants to use ext4's encryption support for chromium os user cache". [ED: see here]

Besides, e2fsprogs has a new compat feature, "sparse_super2", added by the maintainer, which will be used on SMR disks. e2fsprogs also now supports metadata prefetch.

And ext4 is planning to remove the old 'buffer head' code, support larger files (16TB to 1EB), and add more optimizations for SMR/flash devices.


Asias He from OSv took this slot, and this is a very interesting topic IMO. He introduced OSv's infrastructure. It uses the BSD licence, and its purpose is to be an OS for virtual machines in the cloud.  It runs in a hypervisor, and the big difference is that it has a single address space. There is no kernel space or user space; there are no processes, only threads; there are no spinlocks, only lock-free mutexes. It also supports zero-copy. With all of those features, it's very fast according to benchmarks running memcached, Cassandra and redis. After that we had a long discussion of "container based docker" vs "OSv"; the conclusion was that it's well-designed, promising and more secure.

-- Liu Bo

Note: our coverage of the 2013 event is here.  Robin Dong has also published a write-up of the 2014 event, here.

Wednesday Nov 05, 2014

Upcoming BTRFS Presentation by Liu Bo at the Korea Linux Forum

Oracle Linux kernel developer, Liu Bo, will be giving a presentation on BTRFS Integrity at the Korea Linux Forum on 11th Nov 2014.

In VerifyFS in Btrfs Style, Liu Bo will discuss his current work on extending BTRFS to include integrity verification:

There are different scenarios where we need to check whether we can trust an FS image that was handled by other untrusted parties, that means we need to have every part of the FS unchanged. Security is indeed important at any time. So we look forward to how to verify filesystem integrity efficiently. In this presentation, a btrfs way to verify filesystem integrity is introduced, and the talk will discuss the progress and the performance of the work, as well as challenges it faces and how it addresses them.

The conference schedule includes several other interesting kernel talks, including an update from Greg KH on the past year in mainline kernel development.

ETA: The slides from Liu Bo's talk are now available here.

Monday Aug 11, 2014

Improving the Performance of Transparent Huge Pages in Linux, by Khalid Aziz

The following is a write-up by Oracle mainline Linux kernel engineer, Khalid Aziz, detailing his and others' work on improving the performance of Transparent Huge Pages in the Linux kernel.


The Linux kernel uses a small page size (4K on x86) to allow for efficient sharing of physical memory among processes. Even though this can maximize utilization of physical memory, it results in large numbers of pages associated with each process, and each page requires an entry in the Translation Look-aside Buffer (TLB) to be able to associate a virtual address with the physical memory page it represents. The TLB is a finite resource, and the large number of entries required for each process forces the kernel to constantly swap entries out of the TLB. There is a performance impact any time the TLB entry for a virtual address is missing. This impact is especially large for data intensive applications like large databases.

To alleviate this, the Linux kernel added support for Huge Pages, which can support significantly larger page sizes for specific uses. This larger page size is variable and depends upon the architecture (a few megabytes to gigabytes). Huge Pages can be used for shared memory or for memory mapping. Huge Pages reduce the number of TLB entries required for a process's data by a factor of hundreds and thus reduce the number of TLB misses for the process significantly.
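
As a rough worked example of where the reduction comes from, the small helper below counts the pages (and hence the translations) needed to map a region. The function name pages_needed() is made up for this illustration, and the 2M huge page size used in the example is the common x86-64 one; other architectures use other sizes.

```c
#include <stdint.h>

/*
 * Number of pages, and hence TLB entries, needed to map `bytes' of
 * memory with a given page size, rounding up to a whole page.
 */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}
```

Mapping 1 GB with 4K pages takes pages_needed(1ULL << 30, 4096) = 262144 entries, while 2M pages need only pages_needed(1ULL << 30, 2ULL << 20) = 512, a factor of 512 fewer.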

Huge Pages are statically allocated and need to be used through a hugetlbfs API, which requires changing applications at source level to take advantage of this feature. The Linux kernel added a Transparent Huge Pages (THP) feature that coalesces multiple contiguous pages in use by a process to create a Huge Page transparently without the process needing to even know about it. This makes the benefits of Huge Pages available to every application without having to rewrite it.

Unfortunately, adding THP caused side-effects for performance. We will explore these performance impacts in more detail in this article and how those issues have been addressed.

The Problem

When Huge Pages were introduced in the kernel, they were meant to be statically allocated in physical memory and never swapped out. This made for simple accounting through use of refcounts for these hugepages. Transparent hugepages on the other hand need to be swappable so a process could take advantage of performance improvements through hugepages and yet not tie up the physical memory for these transparent hugepages. Since the swap subsystem only deals with base page size, it can not swap out larger hugepages. The kernel breaks the hugepages up into base page sizes before swapping transparent huge pages out.

A page is identified as a hugepage via page flags, and each hugepage is composed of one head page and a number of tail pages. Each tail page has a pointer, first_page, that points back to the head page. The kernel can break the transparent hugepages up any time there is memory pressure and pages need to be swapped out. This creates a race between the code that breaks hugepages up and the code managing free and busy hugepages. When marking a hugepage busy or free, the code needs to ensure a hugepage is not broken up underneath it. This requires taking a reference to the page multiple times, locking the page to ensure it is not broken up, and executing memory barriers a few times to ensure any updates to the page flags get flushed out to memory so we retain consistency.

Before THP was introduced into the kernel in 2.6.38, the code to release a page was fairly straightforward. A call to put_page() was made, and the first thing put_page() did was determine whether it was dealing with a hugepage (also known as a compound page) or a base page:

void put_page(struct page *page)
{
	if (unlikely(PageCompound(page)))
		put_compound_page(page);	/* base-page path elided */
}

If the page being released is a hugepage, put_compound_page() verifies reference count is 0 and then calls the free routine for compound page which walks the head page and tail pages and frees them all up:

static void put_compound_page(struct page *page)
{
	page = compound_head(page);
	if (put_page_testzero(page)) {
		compound_page_dtor *dtor;

		/* refcount hit zero: call the compound page's free routine */
		dtor = get_compound_page_dtor(page);
		(*dtor)(page);
	}
}

This is fairly straightforward code and has virtually no impact on performance of page release code. After THP was introduced, additional checks, locks, page references and memory barriers were added to ensure correctness. The new put_compound_page() in 2.6.38 looks like:

static void put_compound_page(struct page *page)
{
	if (unlikely(PageTail(page))) {
		/* __split_huge_page_refcount can run under us */
		struct page *page_head = page->first_page;
		smp_rmb();
		/*
		 * If PageTail is still set after smp_rmb() we can be sure
		 * that the page->first_page we read wasn't a dangling pointer.
		 * See __split_huge_page_refcount() smp_wmb().
		 */
		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
			unsigned long flags;
			/*
			 * Verify that our page_head wasn't converted
			 * to a a regular page before we got a
			 * reference on it.
			 */
			if (unlikely(!PageHead(page_head))) {
				/* PageHead is cleared after PageTail */
				smp_rmb();
				VM_BUG_ON(PageTail(page));
				goto out_put_head;
			}
			/*
			 * Only run compound_lock on a valid PageHead,
			 * after having it pinned with
			 * get_page_unless_zero() above.
			 */
			smp_mb();
			/* page_head wasn't a dangling pointer */
			flags = compound_lock_irqsave(page_head);
			if (unlikely(!PageTail(page))) {
				/* __split_huge_page_refcount run before us */
				compound_unlock_irqrestore(page_head, flags);
				VM_BUG_ON(PageHead(page_head));
			out_put_head:
				if (put_page_testzero(page_head))
					__put_single_page(page_head);
			out_put_single:
				if (put_page_testzero(page))
					__put_single_page(page);
				return;
			}
			VM_BUG_ON(page_head != page->first_page);
			/*
			 * We can release the refcount taken by
			 * get_page_unless_zero now that
			 * split_huge_page_refcount is blocked on the
			 * compound_lock.
			 */
			if (put_page_testzero(page_head))
				VM_BUG_ON(1);
			/* __split_huge_page_refcount will wait now */
			VM_BUG_ON(atomic_read(&page->_count) <= 0);
			atomic_dec(&page->_count);
			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
			compound_unlock_irqrestore(page_head, flags);
			if (put_page_testzero(page_head)) {
				if (PageHead(page_head))
					__put_compound_page(page_head);
				else
					__put_single_page(page_head);
			}
		} else {
			/* page_head is a dangling pointer */
			VM_BUG_ON(PageTail(page));
			goto out_put_single;
		}
	} else if (put_page_testzero(page)) {
		if (PageHead(page))
			__put_compound_page(page);
		else
			__put_single_page(page);
	}
}

The level of complexity of code went up significantly. This complexity guaranteed correctness but sacrificed performance.

Large database applications read large chunks of the database into memory using AIO. When databases started using hugepages for these reads into memory, performance went up significantly due to the much lower number of TLB misses and the significantly smaller amount of memory used by page tables, which results in lower swapping activity. When a database application reads data from disk into memory using AIO, pages from the hugepages pool are allocated for the read; the block I/O subsystem grabs references to these pages for the read and later releases them when the read is done. This causes traversal of the code referenced above, starting with the call to put_page(). With the newly introduced THP code, the additional overhead added up to a significant performance penalty.

Over the next several kernel releases, the THP code was refined and optimized, which helped slightly in some cases while performance got worse in others. Subsequent refinements to the THP code to do accurate accounting of tail pages introduced the routine __get_page_tail(), which is called by get_page() to grab tail pages for the hugepage. This added further performance impact to AIO into hugetlbfs pages. All of this code stays in the code path as long as the kernel is compiled with CONFIG_TRANSPARENT_HUGEPAGE. Running echo never > /sys/kernel/mm/transparent_hugepage/enabled does not bypass this additional code to support THP. Results from a database performance benchmark run using two common read sizes used by databases show this performance degradation clearly:

              2.6.32 (pre-THP)    2.6.39 (with THP)    3.11-rc5 (with THP)

1M read       8384 MB/s           5629 MB/s            6501 MB/s

64K read      7867 MB/s           4576 MB/s            4251 MB/s

This amounts to 22% degradation for 1M read and 45% degradation for 64K read! perf top during benchmark runs showed CPU spending more than 40% of cycles in __get_page_tail() and put_compound_page().
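
These percentages compare 3.11-rc5 against the 2.6.32 baseline, and the arithmetic is easy to reproduce. The helper below is a throwaway illustration; the name degradation_pct() is made up for this example.

```c
/* Percentage drop of `now' relative to `before'. */
static double degradation_pct(double before, double now)
{
	return (before - now) / before * 100.0;
}
```

degradation_pct(8384, 6501) comes to roughly 22.5 for the 1M reads, and degradation_pct(7867, 4251) to just under 46 for the 64K reads.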

The Solution

An immediate solution to the performance degradation comes from the fact that hugetlbfs pages can never be split, and hence all the overhead added for THP can be bypassed. I added code to __get_page_tail() and put_compound_page() to check for a hugetlbfs page up front and bypass all the additional checks for those pages:

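static void put_compound_page(struct page *page)
{
      /* hugetlbfs pages cannot be split, so skip the THP checks */
      if (PageHuge(page)) {
              page = compound_head(page);
              if (put_page_testzero(page))
                      __put_compound_page(page);
              return;
      }
      ...
}

bool __get_page_tail(struct page *page)
{
      ...
      if (PageHuge(page)) {
              page_head = compound_head(page);
              atomic_inc(&page_head->_count);
              got = true;
      } else {
              ...
      }
      ...
}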

This resulted in immediate performance gain. Running the same benchmark as before with THP enabled, the new performance numbers for aio reads are below:



              2.6.32 (pre-THP)    3.11-rc5 (with THP)    3.11-rc5 + patch

1M read       8384 MB/s           6501 MB/s              8371 MB/s

64K read      7867 MB/s           4251 MB/s              6510 MB/s

This patch was sent to linux-mm and linux kernel mailing lists in August 2013 [link] and was subsequently integrated into kernel version 3.12. This is a significant performance boost for database applications.

Further review of the original patch by Andrea Arcangeli during integration of this patch into stable kernels exposed issues with refcounting of pages and revealed that this patch had introduced a subtle bug where a page pointer could become a dangling link under certain circumstances. Andrea Arcangeli and the author worked to address these issues and revised the code in __get_page_tail() and put_compound_page() to eliminate extraneous locks and memory barriers, fix incorrect refcounting of tail pages and eliminate some of the inefficiencies in the code.

Andrea sent out an initial series of patches to address all of these issues [link].

Further discussions and refinements led to the final version of these patches which were integrated into kernel version 3.13 [link],[link].

AIO performance has gone up significantly with these patches but it is still not at the same level as it used to be for smaller block sizes before THP was introduced to the kernel. THP and hugetlbfs code in the kernel is better at guaranteeing correctness but it still comes at the cost of performance, so there is room for improvement.

-- Khalid Aziz.

Wednesday Jun 18, 2014

NFS Over RDMA Community Development

Recently, Chuck Lever and Shirley Ma of the Oracle Mainline Linux kernel team have been working with the community on bringing Linux NFS over RDMA (remote direct memory access) up to full production quality.

At the 2014 OpenFabrics International Developer Workshop in March/April, they presented an overview of NFSoRDMA for Linux, outlining a rationale for investing resources in the project, as well as identifying what needs to be done to bring the implementation up to production quality.

Slides from their presentation may be downloaded here, while a video of Shirley and Chuck's presentation may be viewed here.

Shirley Ma wrote the following report on the workshop:

This year's OpenFabrics international workshop was dedicated to the development and improvement of OpenFabrics Software. The workshop covered topics from Exascale systems I/O and Enterprise applications to distributed computing, storage, data access and data analysis applications.

In this workshop our goal (Chuck Lever's and mine) was to bring more interested parties to NFSoRDMA and to work together as a community to make NFSoRDMA better in functionality, reliability and efficiency, as well as adding more NFSoRDMA test coverage to the OFED validation tests.  The upstream NFSoRDMA code for both the Linux client and server has been lying there since 2007. The lack of upstream maintenance and support keeps customers away. Linux NFS currently runs over IPoIB in InfiniBand fabrics, which consumes more resources than RDMA (high CPU utilization, contiguous memory reservation to achieve better bandwidth). We briefly evaluated NFSoRDMA vs. NFS/IPoIB using the direct I/O performance benchmark IOZone. The results showed that NFSoRDMA had better bandwidth at all record sizes from 1KB to 16MB, and better CPU efficiency for record sizes greater than 4KB. As expected, NFSoRDMA RPC read and write round trip times are much shorter than with NFS/IPoIB. There will be more demand for NFSoRDMA when storage and memory are merged and I/O latency is reduced.

Jeff Becker (NASA) and Susan Coulter (LANL) said after our talk that they would like to join the NFSoRDMA effort. They have large-scale computing nodes and a decent-sized validation environment. Tom Talpey (the original author of the NFSoRDMA client) agreed with our proposal for future work: splitting the send/recv completion queue, creating multiple QPs for scalability... He also gave advice on NFSoRDMA performance measurement based upon his SMB work and Don Lovinger's performance measurement work on SMB 3.0.

Sayantan Sur (Intel) is currently using NFS/IPoIB in their IB cluster. We suggested some tuning methods for NFS/IPoIB, and he was happy to get 100 times better bandwidth for small I/O sizes, from 2MB/s to 200MB/s. He is thinking of moving to NFSoRDMA once it's stable. When we talked about a wireshark NFSoRDMA dissector, Doug Oucharek (Intel) mentioned that he had implemented a Lustre RDMA packet dissector for wireshark which is not upstream yet; we discussed it with him to see whether we can borrow some code for dissecting NFSoRDMA IB packets. Chuck and I also talked with OFILG interoperability tester Edward Mossman (IOL) about adding more NFSoRDMA coverage to their test suites.

The OFA workshop has moved from being hardware-vendor driven to software driven since last year. Most of the attendees were OpenFabrics software and application developers. Intel had the most attendees: more than 20 people came from its HW, OpenFabrics Stack, HPC and other application departments.

Topics that could be related to NFSoRDMA in the future:
A new working group (OpenFabrics Interface, OFI WG) has been created; the goal is to minimize interface complexity and API overhead. A new framework was proposed that provides different fabric interfaces and hides the implementations of the different fabric providers. The OFI WG hosts a telecon every Tuesday, and everyone is welcome. Sean Hefty (Intel) analyzed the API overhead and cache memory footprint of the current stack and presented the interface framework in a little detail; check his presentation:

VMware is working on virtualization support for host and guest services over RDMA. In the guest it implements a paravirtual vRDMA device that supports Verbs. The device is emulated in the ESXi hypervisor. Guest physical memory regions are mapped to ESXi and passed down to the physical RDMA HCA, which does DMA directly from/to guest physical memory. Latency between guests on the same host is about 20us.

Liran Liss from Mellanox gave an update on RDMA on-demand paging, which is intended to address the challenges of RDMA memory registration: cost, size, locking and synchronization. He proposed non-pinned memory regions, which require OS PTE table changes. More details are here:
He also presented an approach to RDMA bonding at the transport level. A pseudo vHCA (vQP, vPD, vCQ, vMRs ...) is created and used for bonding (failover and aggregation), so the bonding is hardware independent. The details of the proposal are below; I don't know how feasible it is or how it will perform. The pseudo HCA driver idea is similar to the VMware vRDMA driver.

Mellanox gave an update on RoCE (RDMA over Converged Ethernet) v2, an IP-routable packet format. RoCEv2 encapsulates the IB packet in a UDP packet; it was presented to the IETF in Nov. 2013. This might introduce more challenges for fabric congestion control.

Developers are still complaining about usability (different vendors have different implementations) and about RDMA scalability in the areas of RDMA-CM, the subnet manager, QP resources, memory registration... RDMA sockets are still under discussion... Is any of this news to me after many years away from RDMA? :)

There were lots of other interesting application topics which I don't cover here. If you are interested, here is the link to all of the presentations:

Since the workshop, a bi-weekly conference call has been established, with developers from many companies and organizations participating.  Minutes from these calls are posted to the linux-rdma and linux-nfs mailing lists.   Minutes so far:

Code stability has been significantly improved, with increased testing by developers and bugfixes being merged.  Anna Schumaker of NetApp is now maintaining a git tree for NFSoRDMA, feeding up to the core NFS maintainers.

For folks wishing to get involved in development, see the NFSoRDMA client wiki page for more information.

Tuesday Apr 01, 2014

LSF/MM 2014 and ext4 Summit Notes by Darrick Wong

This is a contributed post from Darrick Wong, storage engineer on the Oracle mainline Linux kernel team.

The following are my notes from LSF/MM 2014 and the ext4 summit, held last week in Napa Valley, CA.

  • Discussed the draft DIX passthrough interface. Based on Zach Brown's suggestions last week, I rolled out a version of the patch with a statically defined io extensions struct, and Martin Petersen said he'd try porting some existing asmlib clients to use the new interface, with a few field-enlarging tweaks. For the most part nobody objected; Al Viro said he had no problems "yet" -- but I couldn't tell if he had no idea what I was talking about, or if he was on board with the API. It was also suggested that I seek the opinion of Michael Kerrisk (the manpages maintainer) about the API. As for the actual implementation, there are plenty of holes in it that I intend to fix this week. The NFS/CIFS developers I spoke to were generally happy to hear that the storage side was finally starting to happen, and that they could get to working on the net-fs side of things now. Nicholas Bellinger noted that targetcli can create DIF disks even with the fileio backend, so he suggested I play with that over scsi_debug.

  • A large part of LSF was taken up with the discussion of how to handle the brave new world of weird storage devices. To recap: in the beginning, software had to deal with the mechanical aspects of a rotating disk; addressing had to be done in terms of cylinders, heads, and sectors (CHS). This made it difficult to innovate drive mechanics, as it was impossible to express things like variable zone density to existing software. SCSI eliminated this pain by abstracting a disk into a big tub of consecutive sectors, which simplified software quite a bit, though at some cost to performance. But most programs weren't trying to wring the last iota of performance out of disks and didn't care. So long as some attention was paid to data locality, disks performed adequately. Fast forward to 2014: now we have several different storage device classes: Flash, which has no seek penalty but prefers large writeouts; SMR drives with hard-disk seek penalties but requirements that all writes within a ~256MB zone be written in linear order; RAIDs, which by virtue of stripe geometries violate a few of the classic hard disk thinking; and NVMe devices which implement atomic read and write operations. Dave Chinner suggests that rather than retrofitting each filesystem to deal with each of these devices, it might be worth shoving all the block allocation and mapping operation down to a device mapper (dm) shim layer that can abstract away different types of storage, leaving FSes to manage namespace information. This suggestion is very attractive on a few levels: Benefits include the ability to emulate atomic read/writes with journalling, more flexible software-defined FTLs for flash and SMR, and improved communication with cloud storage systems -- Mike Snitzer had a session about dm-thinp and the proper way for FSes to communicate allocation hints to the underlying storage; this would certainly seem to fit the bill. 
I mentioned that Oracle's plans for cheap ext4 reflink would be trivial to implement with dm shims. Unfortunately, the devil is in the details -- when will we see code? For that reason, Ted Ts'o was openly skeptical.

  • The postgresql developers showed up to complain about stable pages and to ask for a less heavyweight fsync() -- currently, when fsync is called, it assumes that the caller wants all dirty data written out NOW, so it writes dirty pages with WRITE_SYNC, which starves reads. For postgresql this is suboptimal since fsync is typically called by the checkpointing code, which doesn't need to be fast and doesn't care if fsync writeback is not fast. There was an interlock scheduled for Thursday afternoon, but I was unable to attend. See LWN for more detailed coverage of the postgresql (and FB) sessions.

  • At the ext4 summit, we discussed a few cleanups, such as removing the use of buffer_heads and the impending removal of the ext2/3 drivers. Removing buffer_heads in the data path has the potential benefit that it'll make the transition to supporting block/sector size > page size easier, as well as reducing memory requirements (buffer heads are a heavyweight structure now). There was also the feeling that once most enterprise distros move to ext4, it will be a lot easier to remove ext3 upstream because there will be a lot more testing of the use of ext4.ko to handle ext2/3 filesystems. There was a discussion of removing ext2 as well, though that stalled on concerns that Christoph Hellwig (hch) would like to see ext2 remain as a "sample" filesystem, though Jan Kara could be heard muttering that nobody wants a bitrotten example.

  • The other major new ext4 feature discussed at the ext4 summit is per-data block metadata. This got started when Lukas Czerner (lukas) proposed adding data block checksums to the filesystem. I quickly chimed in that for e2fsck it would be helpful to have per-block back references to ease reconstruction of the filesystem, at which point the group started thinking that rather than a huge static array of block data, the complexity of a b-tree with variable key size might well be worth the effort. Then again, with all the proposed filesystem/block layer changes, Ted said that he might be open to a quick v1 implementation because the block shim layer discussed in the SMR forum could very well obviate the need for a lot of ext4 features. Time will tell; Ted and I were not terribly optimistic that any of that software is coming soon. In any case, lukas went home to refine his proposal. The biggest problem is ext4's current lack of a btree implementation; this would have to be written or borrowed, and then tested. I mentioned to him that this could be the cornerstone of reimplementing a lot of ext4 features with btrees instead of static arrays, which could be a good thing if RH is willing to spend a lot of engineering time on ext4.

  • Michael Halcrow, speaking at the ext4 summit, discussed implementing a lightweight encrypted filesystem subtree feature. This sounds a lot like ecryptfs, but hopefully less troublesome than the weird shim fs that is ecryptfs. For the most part he seemed to need (a) the ability to inject his code into the read/write path and (b) some ability to store a small amount of per-inode encryption data. His use-case is Chrome OS, which apparently needs the ability for cache management programs to erase parts of a(nother) user's cache files without having the ability to access the file. The discussion concluded that it wouldn't be too difficult for him to start an initial implementation with ext4, but that much of this ought to be in the VFS layer.

 -- Darrick

[Ed: see also the LWN coverage of LSF/MM]

Tuesday Nov 05, 2013

CLSF & CLK 2013 Trip Report by Jeff Liu and Liu Bo

This is a contributed post from Jeff Liu, lead XFS developer for the Oracle mainline Linux kernel team, with contributions from Liu Bo, our lead BTRFS developer.

Recently, we attended the China Linux Storage and Filesystem workshop (CLSF), and the China Linux Kernel conference (CLK), which were held in Shanghai.

Here are the highlights for both events.

CLSF - 17th October

XFS update (led by Jeff Liu)

XFS keeps making rapid progress with a lot of changes, focused especially on infrastructure/performance improvements as well as new feature development.  This is reflected in some sample statistics comparing XFS/Ext4+JBD2/Btrfs via:

# git diff --stat --minimal -C -M v3.7..v3.12-rc4 -- fs/xfs|fs/ext4+fs/jbd2|fs/btrfs

XFS:       141 files changed, 27598 insertions(+), 19113 deletions(-)
Ext4+JBD2: 39 files changed,  10487 insertions(+), 5454 deletions(-)
Btrfs:     70 files changed,  19875 insertions(+), 8130 deletions(-)

  • What made up those changes in XFS?
    • Self-describing metadata (CRC32c). This is a new feature and it contributed about 70% of the code changes. It can be enabled via `mkfs.xfs -m crc=1 /dev/xxx` for the v5 superblock.
    • Transaction log space reservation improvements. With this change, we can calculate the log space reservation at mount time rather than at runtime to reduce the CPU overhead.
    • User namespace support. Both XFS and USERNS can be enabled in the kernel configuration beginning with Linux 3.10. Thanks to Dwight Engen for his efforts on this.
    • Split project/group quota inodes. Originally, project quota could not be enabled at the same time as group quota because they shared the same quota file inode. Now it works, but only for the v5 superblock, i.e., with CRC enabled.
    • CONFIG_XFS_WARN, a new lightweight runtime debugger which can be deployed in production environments.
    • Readahead for log object recovery. This change can speed up log replay significantly.
    • Speculative preallocation inode tracking, clearing and throttling. The main purpose is to deal with inodes that have post-EOF space due to speculative preallocation, and to support improved quota management by freeing up a significant amount of unwritten space when at or near EDQUOT. It supports background scanning, which occurs on a longish interval (5 minutes by default, tunable), and on-demand scanning/trimming via ioctl(2).
  • Bitter arguments ensued from this session, especially over the comparison with Ext4 and Btrfs in different areas; I had to spend the whole morning of the 1st day answering those questions. We basically agreed that XFS is the best choice in Linux nowadays because:
    • Stability. XFS has had a good stability record over the past 10 years. Fengguang Wu, who leads the 0-day kernel test project, also said that he has observed fewer errors in XFS than in other filesystems over the past 1+ years. I owe it to the XFS upstream code reviewers, who always perform serious code review as well as testing.
    • Good performance for both large and small files. The claim that XFS does not work well for small files has been an old story for years.
    • Best choice (maybe) for distributed PB filesystems, e.g., Ceph recommends deploying the OSD daemon on XFS because Ext4 has a limited xattr size.
    • Best choice for large storage (>16TB). Ext4 does not support a single file larger than about 15.95TB.
    • Scalability. Any objection that XFS is best on this point? :)
    • XFS deals with transaction concurrency better than Ext4. Why? The maximum size of the log in XFS is 2038MB, compared to 128MB in Ext4.
  • Misc. Ext4 is widely used and has proved fast and stable under various loads and scenarios, XFS just needs more customers, and Btrfs is still on the road to maturity.

Ceph Introduction (Led by Li Wang)

This is a hot topic.  Li gave us a nice introduction to the design as well as their current work. The Ceph client has been included in the Linux kernel since 2.6.34 and supported by OpenStack since Folsom, but it seems that it has not yet been widely deployed in production environments.

Their major work is focused on inline data support, which aims to reduce file access time. Because metadata and data are stored separately, a file access needs two round trips: fetching the metadata from the MDS and then getting the data from the OSD. Small file access is therefore limited by network latency.

The solution is to store the data of small files together with the metadata, so that when a client accesses a small file, the metadata server can push both metadata and data to the client at the same time. In this way, they can reduce the overhead of calculating the data offset and save the communication with the OSD.

For this feature, they have only run some small-scale tests, but they did see noticeable improvements. Test environment: Intel 2 CPU 12 Core, 64GB RAM, Ubuntu 12.04, Ceph 0.56.6 with 200GB SATA disk, 15 OSD, 1 MDS, 1 MON. Sequential read performance for 1K files improved by about 50%.

I asked Li and Zheng Yan (the core developer of Ceph, who also worked on Btrfs) whether Ceph is really stable and can be deployed in production for large-scale, PB-level storage, but they could not give a positive answer. It looks like Ceph is not even widely deployed across Dreamhost (subject to confirmation). According to Li, they have only deployed Ceph for small-scale storage (32 nodes), although they'd like to try 6000 nodes in the future.

Improve Linux swap for Flash storage (led by Shaohua Li)

Because of high density, low power and low price, flash storage (SSD) is a good candidate to partially replace DRAM. A quick answer for this is using SSD as swap. But Linux swap is designed for slow hard disk storage, so there are a lot of challenges to efficiently use SSD for swap.

    • swap_map scan
      swap_map is the in-memory data structure that tracks swap disk usage, but finding free space in it is a slow linear scan. It becomes a bottleneck when looking for many adjacent pages on an SSD. Shaohua Li has changed it to a list of clusters (128K), resulting in an O(1) algorithm. However, this approach needs restrictive cluster alignment and is only enabled for SSD.
    • IO pattern
      In most cases, swap IO is interleaved, because there are multiple reclaimers or because a free cluster is shared by all reclaimers. Even though the block layer can merge interleaved IO to some extent, we cannot count on it completely. Hence a per-cpu cluster was added on top of the previous change; it helps each reclaimer do sequential IO and makes it easier for the block layer to merge IO.
    • TLB flush:
      If we're reclaiming an active page, we first move the page from the active lru list to the inactive lru list, and then reclaim the page from the inactive lru to swap it out. During this process, we need to clear PTE bits twice: first the 'A' (ACCESS) bit, then the 'P' (PRESENT) bit. Processors need to send lots of IPIs, which makes the TLB flush really expensive. Some work has been done to improve this, including reworking smp_call_function_many() or removing the first TLB flush on x86, but there is still some debate here and only part of the work has been pushed to mainline.
    • Page fault does iodepth=1 sync IO, and it is a bit of a waste to issue only a page-sized IO. The obvious solution is swap readahead. But the current in-kernel swap readahead is arbitrary (always 8 pages), and it performs poorly for both random and sequential access workloads. Shaohua introduced a new flag for madvise (MADV_WILLNEED) to do swap prefetch, so the change happens in the userspace API and leaves the in-kernel readahead unchanged (but I think some improvement can also be done here).
  • SWAP discard
    • As we know, discard is important for SSD write throughput, but the current swap discard implementation is synchronous. He changed it to async discard, which allows discard and write to run at the same time. Meanwhile, the unit of discard is also optimized to a cluster.
  • Misc: lock contention
    • With many concurrent swapouts and swapins, contention on locks such as anon_vma or swap_lock is high, so he changed swap_lock to a per-swap lock. But there is still some lock contention on very high speed SSDs because of the swapcache address_space lock.
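
To make the userspace side of the swap prefetch point a bit more concrete, here is a minimal sketch of how an application might hint that a region will be needed soon. It assumes the hint goes through the ordinary madvise(2) call with MADV_WILLNEED, as described above; the helper names prefetch_region() and alloc_anon() are made up for illustration, and whether swapped-out pages are really read ahead depends on the kernel in use.

```c
#define _DEFAULT_SOURCE        /* for madvise() and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hint to the kernel that [addr, addr + len) will be accessed soon.
 * addr must be page-aligned. Returns 0 on success, -1 on error.
 * (Hypothetical helper name; it only wraps madvise(2).)
 */
static int prefetch_region(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	return madvise(addr, len, MADV_WILLNEED);
}

/* Allocate an anonymous private mapping of len bytes; NULL on failure. */
static void *alloc_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

A program that knows it is about to walk a large, possibly swapped-out buffer would call prefetch_region() on it first and then touch the pages as usual; the call is only a hint, so the code has to work the same way if the kernel ignores it.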

Zproject (led by Bob Liu)

Bob gave us a very nice introduction to the current status of memory compression. There are now 3 projects (zswap/zram/zcache), which all aim to smooth out swap IO storms and improve performance, but each has its own pros and cons.
  • ZSWAP
    • It is implemented on top of the frontswap API and uses a dynamic allocator named Zbud to allocate free pages. Zbud means that pairs of zpages are "buddied": it can store at most two compressed pages in one page frame, so the best achievable compression ratio is 50%. Each page frame is lru-linked and can be shrunk under memory pressure. If the compressed memory pool reaches its limit, shrinking or reclaim happens: it decompresses the page frame into two newly allocated pages and then writes them to the real swap device, but it can fail when allocating the two pages.
  • ZRAM
    • Acts as a compressed ramdisk used as a swap device. It uses zsmalloc as its allocator, which has high density but may have fragmentation issues. Besides, page reclaim is hard since it needs more pages to uncompress in order to free just one page. ZRAM is preferred by embedded systems, which may not have any real swap device. Both ZRAM and ZSWAP are now in the driver/staging tree, and in the mm community there are some discussions about merging ZRAM into ZSWAP or vice versa, but no agreement yet.
  • ZCACHE
    • Handles file page compression, but it was removed from staging recently.
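
The "at most two compressed pages per page frame" rule is easy to picture with a toy model. The sketch below is not the kernel's zbud code; the struct, the function name and the 4096-byte frame size are all assumptions made for illustration. It only shows why a frame can never hold a third buddy, no matter how small the compressed pages are.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_SIZE 4096u  /* assumed page-frame size for this sketch */

/* One page frame holding at most two compressed pages ("buddies"). */
struct zframe {
	size_t first_len;  /* bytes used by the first buddy, 0 if empty */
	size_t last_len;   /* bytes used by the last buddy, 0 if empty */
};

/* Try to store a compressed page of len bytes; returns 1 if stored, else 0. */
static int zframe_store(struct zframe *f, size_t len)
{
	assert(f != NULL);
	if (len == 0 || len > FRAME_SIZE)
		return 0;
	if (f->first_len + f->last_len + len > FRAME_SIZE)
		return 0;  /* not enough room left in the frame */
	if (f->first_len == 0) {
		f->first_len = len;
		return 1;
	}
	if (f->last_len == 0) {
		f->last_len = len;
		return 1;
	}
	return 0;  /* both buddies are taken, even if bytes remain free */
}
```

With two 1000-byte pages stored, a 100-byte page is still rejected although more than 2000 bytes are free; that wasted space is the price of the simple two-buddy layout, and it is why the density is lower than what zsmalloc can reach.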

From industry (led by Tang Jie, LSI)

An LSI engineer introduced several new products to us. The first is a RAID5/6 card that uses full stripe writes to improve performance.

The 2nd one he introduced is the SandForce flash controller, which can understand data file types (data entropy) to reduce write amplification (WA) for nearly all writes. This is called DuraWrite, and the typical WA is 0.5. What's more, if its Dynamic Logical Capacity function module is enabled, the controller can do data compression that is transparent to the upper layers. LSI testing shows that with this virtual capacity a 1x TB drive can support up to 2x TB of capacity, but the application must monitor free flash space to maintain optimal performance and to guard against free flash space exhaustion. He said the most useful application is databases.

Another thing I think is worth mentioning is the NV-DRAM memory in NMR/Raptor, which is directly exposed to the host system. Applications can directly access the NV-DRAM via a memory address, using the standard system call mmap(). He said that it is very useful for database logging now. This kind of NVM product has begun to appear in recent years, and it is said that Samsung is building a research center in China for related products. IMHO, NVM will have an effect on the current OS layers, especially on file systems, e.g. journaling may need to be redesigned to fully utilize this nonvolatile memory.
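
As a rough illustration of what "access via mmap()" could look like for a logging application, here is a minimal sketch. Nothing in it is specific to the LSI product: the path is whatever file or device node the caller passes in, the helper names map_shared() and log_append() are invented for this example, and the msync() call is an assumption about how a cautious application would push a record out.

```c
#define _DEFAULT_SOURCE        /* harmless on glibc; keeps strict modes happy */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of the file or device at path for shared read/write access.
 * Returns the mapping, or NULL on failure. No real device name is assumed.
 */
static void *map_shared(const char *path, size_t len)
{
	int fd;
	void *p;

	assert(path != NULL && len > 0);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);  /* the mapping stays valid after the descriptor is closed */
	return p == MAP_FAILED ? NULL : p;
}

/* Copy one log record into the mapping at off, then flush synchronously. */
static int log_append(char *base, size_t map_len, size_t off,
		      const void *rec, size_t rec_len)
{
	if (off > map_len || rec_len > map_len - off)
		return -1;  /* record would run past the end of the mapping */
	memcpy(base + off, rec, rec_len);
	return msync(base, map_len, MS_SYNC);
}
```

The point of the design is that the log write becomes a memcpy() into mapped memory instead of a write(2) through the block layer, which is where the latency win for database logging would come from.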

OCFS2 (led by Canquan Shen)

Without a doubt, HuaWei has been the biggest contributor to OCFS2 in the past two years. They have posted 46 upstream patches, and 39 of them have been merged. Their current project is based on 32/64 node clusters, but they have also tried 128 nodes at the experimental stage. Their major work is to support ATS (atomic test and set), which can work with DLM at the same time. It looks like this idea is inspired by the vmware VMFS locking, i.e,

EXT4 (led by Zheng Liu)

Zheng Liu says ext4 keeps its stable style, so the major part of the work is bug fixes and cleanups, while the minor part is new features and improvements. He first talked about the AIO write performance gain on ext4, which makes use of the extent status cache. The problem is that they found the AIO path waiting on get_block_t(), which ends up causing unacceptable latencies; the solution is to batch get_block_t() with "fiemap(2) + FIEMAP_FLAG_CACHE" and "ioctl(2) + EXT4_IOC_PRECACHE_EXTENTS".
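
For readers who have not used these interfaces, the two precaching calls mentioned above might be issued from userspace roughly as follows. This is a sketch under assumptions: the fallback values for FIEMAP_FLAG_CACHE and EXT4_IOC_PRECACHE_EXTENTS are what I believe the kernel headers define and should be checked against your kernel, the function names are made up, and on filesystems other than ext4 either call may simply fail.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_MAX_OFFSET */

/* Fallbacks in case the installed headers lack these (assumed values). */
#ifndef FIEMAP_FLAG_CACHE
#define FIEMAP_FLAG_CACHE 0x00000004
#endif
#ifndef EXT4_IOC_PRECACHE_EXTENTS
#define EXT4_IOC_PRECACHE_EXTENTS _IO('f', 18)
#endif

/* Ask the filesystem to cache the extents of fd through fiemap. */
static int precache_via_fiemap(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;  /* whole file */
	fm.fm_flags = FIEMAP_FLAG_CACHE;
	fm.fm_extent_count = 0;            /* no extent records wanted back */
	return ioctl(fd, FS_IOC_FIEMAP, &fm);
}

/* Same request through the ext4-specific ioctl. */
static int precache_via_ioctl(int fd)
{
	return ioctl(fd, EXT4_IOC_PRECACHE_EXTENTS);
}

/* Open path read-only and precache its extents; 0 on success, -1 on error. */
static int precache_file(const char *path)
{
	int fd, rc;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	rc = precache_via_fiemap(fd);
	close(fd);
	return rc;
}
```

An application would call precache_file() (or precache_via_ioctl() on an open descriptor) once before submitting AIO, so that the extent lookups are paid up front rather than in the AIO submission path.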

IOW, this just hands off latency from the kernel to the userspace.

BTRFS (led by Liu Bo)

I (Liu Bo) held the session and mainly talked about new features in the last year (2013). People are happy to see that more features are developed in btrfs, but are meanwhile confused about what btrfs wants to be -- generally speaking, as a 5-year-old FS, btrfs should try to be stable firstly anyway.

CLK - 18th October 2013

Improving Linux Development with Better Tools (Andi Kleen)

This talk focused on how to find/solve bugs as Linux's complexity keeps growing. Generally, we can do this with the following kinds of tools:

  • Static code checker tools, e.g., sparse, smatch, coccinelle, clang checker, checkpatch, gcc -W/LTO, stanse. These can help check a lot of things, from simple mistakes to complex problems, but there are challenges: some are very slow, they produce false positives, and it may take a concentrated effort to get the false positives down. In particular, no static checker I found can follow indirect calls (“OO in C”, common in kernel):
    struct foo_ops {
            int (*do_foo)(struct foo *obj);
    };
  • Dynamic runtime checkers, e.g, thread checkers, kmemcheck, lockdep. Ideally all kernel code would come with a test suite, then someone could run all the dynamic checkers.
  • Fuzzers/test suites, e.g., Trinity is a great tool; it finds many bugs, but needs a manual model for each syscall. Modern fuzzers use automatic feedback, but not for the kernel yet:
  • Debuggers/Tracers to understand code, e.g, ftrace, can dump on events/oops/custom triggers, but still too much overhead in many cases to run always during debug.
  • Tools to read/understand source, e.g., grep/cscope work great for many cases, but do not understand indirect pointers (the OO in C model used in the kernel) well enough to give us all “do_foo” instances:
    struct foo_ops {
          int (*do_foo)(struct foo *obj);
    } ops = { .do_foo = my_foo };   /* "ops" is a placeholder variable name */
    It would be great to have a cscope-like tool that understands this based on types/initializers.
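
To show why this pattern is hard for both checkers and source browsers, here is a small self-contained version of the "OO in C" idiom. Only struct foo_ops, do_foo and my_foo come from the slide; the struct foo layout, other_foo, the two ops tables and call_do_foo() are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct foo;

/* Method table, as on the slide. */
struct foo_ops {
	int (*do_foo)(struct foo *obj);
};

/* Hypothetical object layout: a value plus a pointer to its method table. */
struct foo {
	int value;
	const struct foo_ops *ops;
};

static int my_foo(struct foo *obj)    { return obj->value * 2; }
static int other_foo(struct foo *obj) { return obj->value + 1; }

static const struct foo_ops my_ops    = { .do_foo = my_foo };
static const struct foo_ops other_ops = { .do_foo = other_foo };

/*
 * Indirect call: which function runs is decided by obj->ops at run time,
 * so a text search for "do_foo" finds neither my_foo nor other_foo.
 */
static int call_do_foo(struct foo *obj)
{
	assert(obj != NULL && obj->ops != NULL && obj->ops->do_foo != NULL);
	return obj->ops->do_foo(obj);
}
```

A tool that wants to resolve the call in call_do_foo() has to collect every initializer of a struct foo_ops and treat each assigned function as a possible target, which is exactly the type/initializer-based lookup wished for above.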

XFS: The High Performance Enterprise File System (Jeff Liu)


I gave a talk introducing the disk layout, unique features, and recent changes.   The slides include some charts comparing the performance of XFS/Btrfs/Ext4 for small files.

About a dozen users raised their hands when I asked who had experience with XFS. I remember that when I asked the same question at LinuxCon Japan, only 3 people raised their hands, and they were Chris Mason, Ric Wheeler, and another attendee.
The attendee questions were mainly focused on stability, and comparison with other file systems.

Linux Containers (Feng Gao)

The speaker introduced the purpose of the various kinds of namespaces, including mount/UTS/IPC/Network/Pid/User, as well as the system API/ABI. For the userspace tools, he mainly focused on Libvirt LXC rather than us (LXC). Libvirt LXC is another userspace container management tool, implemented as one type of libvirt driver; it can manage containers, create namespaces, create a private filesystem layout for a container, create devices for a container, and set up resource controllers via cgroups.
In this talk, Feng also mentioned two other possible new namespaces for the future. The first is audit, but it is not clear whether it should be assigned to the user namespace or not. The other is syslog, but the question is: do we really need it?

In-memory Compression (Bob Liu)

Same as CLSF, a nice introduction that I have already mentioned above.

0-day Linux Kernel Performance Test (Yuanhan Liu)

Based on Fengguang Wu's 0-day autotest framework, Yuanhan Liu's 0-day performance test integrates with existing test tools and generates both ASCII and graphical results from the test numbers. It is not yet open sourced, only Intel internal, and the developers say that it's a bit difficult to open it up, because:

  1. it's not easy to set up the whole test suite; a lot of effort is involved
  2. it needs many powerful machines on which a great number of VMs are installed.

 Despite not being open, the framework does find bugs in various parts of the kernel, including btrfs, good for me :) [LB]


There were some other talks related to ACPI based memory hotplug, smart wake-affinity in scheduler etc., but my head is not big enough to record all those things.

-- Jeff Liu & Liu Bo

Monday Sep 02, 2013

IETF 87 NFSv4 Working Group meeting report by Chuck Lever

This is a contributed post from Chuck Lever, who heads up NFS development for the mainline Linux kernel team.

Executive summary:

The 87th meeting of the IETF was held July 28 - August 2 in Berlin, Germany.

I was in Berlin for the week to attend the NFSv4 Working Group meeting and hold informal discussions related to NFS standardization with other attendees. The Internet Engineering Task Force (IETF) produces high quality technical documents that influence the way people design, use and manage the Internet. Essentially, this is the body that regulates the protocols computers use to communicate on the Internet, for the purpose of improving interoperability.

An IETF meeting is held every four months in venues around the world. Sponsorship for each event varies. DENIC, the central registry for domain names under .de, was the primary sponsor for this event. Participation is open to anyone, but a registration fee is required to attend.

NFS version 4 is the IETF standard for file sharing. The charter of the Working Group is to maintain NFS specifications and introduce new NFS features through NFSv4 minor versions. More on the Working Group charter can be found here:

I attend each NFSv4 Working Group meeting to represent Oracle's interest in various current and new NFS-related features, including pNFS, NFSv4.2, and FedFS. I'm the editor of two of the IETF FedFS protocol specifications, and a co-author of an Internet-Draft that addresses protocol issues affecting NFSv4 migration. Other representatives at this meeting include Microsoft, EMC, NetApp, IBM, Oracle, Tonian, and others. Topics include progress updates on Internet-Drafts on their way to become standards, reports on implementation experience, and requests to start new work or restart old work. See:

Meeting agenda, presentation materials, and minutes are available at this location.

Drill down:

Working Group editor Tom Haynes (NetApp) reported on several areas where progress appears to be stalled. In general we face challenges completing our deliverables because the IETF is a volunteer organization, and the tasks at hand are large. The largest item is RFC 3530bis, which is holding up FedFS and NFSv4.2. RFC 3530bis was rejected during IESG review mainly due to the new chapter that attempts to bridge the gap between existing i18n implementations in NFS, and how we'd like i18n to work.

The problem is nobody has implemented i18n for NFSv4, and the IETF has revised i18n since 3530 was ratified. The consensus was to move the offending section to a separate Internet-Draft where the correct language can be hammered out without holding up RFC 3530bis. NFSv4.2 is held up by a lack of enthusiasm for finishing a new revision of RPCSEC GSS. The GSS I-D has languished without an author or editor for many months, and two items in NFSv4.2 depend on its completion: labeled NFS and server-to-server copy. A rough consensus was not achieved, but Tom and Andy Adamson (NetApp) will investigate options, including removing the parts of server copy and labeled NFS that depend on GSSv3, and report back.

Benny Halevy (Tonian) has submitted a fresh draft proposing "Flexible File Layouts" which is a new pNFS layout type that improves upon the existing pNFS file layout defined in RFC 5661. Motivation for a new layout scheme includes: algorithmic data striping to support load balancing, life-cycle management, and other advanced administrative features; support for using legacy NFS servers as pNFS data servers; and direct pNFS support for existing cluster filesystems such as Ceph and GlusterFS.

Chuck Lever (Oracle) described recent progress to address security concerns in the FedFS documents waiting in the RFC Editor queue. He continued by walking through a group of possible future work items, including more modern LDAP security modes, additional administrative operations, and better mechanisms for clients to choose working fileset locations. Does the working group have the energy to consider a new revision of these documents? Or should we continue to focus on making small changes? This was left unresolved.

Sorin Faibish (EMC) discussed the need for a new layout enabling pNFS clients to access Lustre data servers directly. After a lot of discussion, the issue appears to be that the NFS protocol on high performance transports is not performant enough. The proposed solution was to use LNET over RDMA. It was suggested that it would be more interesting to the Working Group if we focused on fixing the performance issues in our RDMA specifications instead.

Marc Eshel (IBM) wanted to restart the age-old conversation on tightening NFS's data cache coherency. The immediate question is whether POSIX semantics are interesting given today's compute workloads and network environment. Implementing POSIX data coherency among multiple networked systems is still a challenge. The consensus was that a callback-based solution, where network traffic is proportional to the level of inter-client sharing, was most appropriate. Such a solution (byte-range delegation) was proposed by Trond Myklebust in 2006. It was recommended to start with that work.

Chuck Lever (Oracle) proposed an experimental extension to NFS that enables NFS client and servers to convey end-to-end data integrity metadata. A new I-D has been submitted that describes the protocol changes. No prototype is available yet; the I-D is meant to coordinate discussion of technical details, and enable interoperable prototype implementations.

David Noveck (EMC) elaborated on the need to allow protocol changes outside of the NFS minor version process. He described the limitations of batching unrelated features together and waiting for a full pass through the IETF review process. There was some interest in allowing innovation outside of the minor version process. The Area Director and Working Group chair felt that there is currently not enough energy behind work already planned for delivery.

Matt Benjamin (Linux Box) is restarting work on a feature proposed several years ago by Mike Eisler that allows directories to be striped across pNFS data servers, just like file data is today. An Internet-Draft is available, and a prototype is underway.

-- Chuck Lever

Wednesday May 22, 2013

Oracle Linux Kernel Developers Speaking at LinuxCon Japan 2013

LinuxCon Japan 2013 is happening next week (May 29-31) in Tokyo, and several members of the Oracle mainline Linux kernel team are speaking:

For those attending, feel free to come and chat with us during the conference.

We'll post links to slides once they're available.

Wednesday May 08, 2013

Ext4 filesystem workshop report by Mingming Cao

This is a contributed post from Mingming Cao, lead ext4 developer for the Oracle mainline Linux kernel team.

I attended the second ext4 workshop, held on the third day of the Linux Collaboration Summit 2013.  Participants included Google, Red Hat, SuSE, Taobao, and Lustre. We had about 2-4 hours of good discussion about the ext4 roadmap for the next year.

Ext4 write stall issue

A write stall issue was reported by the MM folks, found during page reclaim testing on ext4. There is lock contention in JBD2 between the journal commit and new transactions, which leaves blocking IOs waiting for locks. More precisely, do_get_write_access() blocks in lock_buffer(). The problem is nothing new and should be visible in ext3 too, but it has become more visible with newer kernels. Ted has proposed two fixes: 1) avoid calling lock_buffer() during do_get_write_access(), or 2) adjust JBD2 to manage the buffer_head itself to reduce latency. Fixing this in JBD2 would be a big effort, so proposal 1) sounds more reasonable to work with.  The first action is to mark metadata updates with REQ_* flags to avoid the priority inversion, while also looking at the block IO layer to see if there is a way to move blocking IOs to a separate queue.

DIO lock contention issue

Another topic brought up was the direct IO (DIO) lock contention issue.  On the DIO read side no lock is held already, but only for the page size = block size case. There is no fundamental reason why lockless direct IO reads are not possible for block size < page size, and it was agreed that we should remove this limit. On the direct IO write side, there are two proposals for concurrent direct IO writes. One is based on the in-memory extent status tree, similar to what XFS does, which makes DIO writes to different ranges of a file possible. The other is a general VFS solution which locks the pages in the range during a direct IO write. This would benefit all filesystems, but has the challenge of sorting out the ordering of multiple locks.  Jan Kara had an LSF session covering this in more detail. This approach looks more promising.
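Both write-side proposals rest on the same observation: two direct IO writes only need to be serialized when their byte ranges overlap. The following is a minimal sketch of that check. The struct and function names are hypothetical and are not taken from the ext4 or VFS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical byte range [start, start + len) touched by one DIO write. */
struct dio_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Two half-open ranges conflict when each one starts before the other
 * ends.  Writes whose ranges do not conflict could run concurrently.
 */
static bool dio_ranges_conflict(struct dio_range a, struct dio_range b)
{
	assert(a.len > 0 && b.len > 0);
	return a.start < b.start + b.len && b.start < a.start + a.len;
}
```

With this check, writes to [0, 4096) and [4096, 8192) do not conflict, while a write to [0, 8192) conflicts with both of them.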

Extent tree disk layout

There was discussion about supporting a true 64-bit ext4 filesystem (64-bit inode numbers and 64-bit block numbers; currently these are 32-bit inode numbers and 48-bit block numbers) in order to scale well. The ext4 on-disk extent structure could be extended to support larger files, for example with 64-bit physical blocks, bigger logical blocks, and using the cluster size as the unit in extents.  This is easy to handle in e2fsprogs, but changing the on-disk extent tree format is quite tricky to get to play well with punch hole, truncate, etc., which depend on the extent tree format. One solution is to add a layer of extent tree abstraction in memory, but this is considered a big effort.
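To see where the 48-bit block number limit comes from, here is a simplified sketch of the current 12-byte on-disk extent. The field names follow the kernel's struct ext4_extent, but the struct name and helper functions are hypothetical, and the fields are assumed to be already converted to CPU byte order (on disk they are little-endian).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified copy of the current on-disk extent, in CPU byte order. */
struct extent_sketch {
	uint32_t ee_block;    /* first logical block the extent covers */
	uint16_t ee_len;      /* number of blocks covered by the extent */
	uint16_t ee_start_hi; /* high 16 bits of the physical block */
	uint32_t ee_start_lo; /* low 32 bits of the physical block */
};

/* Assemble the 48-bit physical block number from its two halves. */
static uint64_t extent_pblock(const struct extent_sketch *ex)
{
	return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Split a physical block number; anything above 48 bits does not fit. */
static void extent_store_pblock(struct extent_sketch *ex, uint64_t pblock)
{
	assert(pblock < (1ULL << 48));
	ex->ee_start_lo = (uint32_t)pblock;
	ex->ee_start_hi = (uint16_t)(pblock >> 32);
}
```

The logical block is 32 bits and the length is 16 bits, so widening any of these fields means changing the on-disk record size, which is why the format change touches punch hole, truncate and the rest of the extent code.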

This is not entirely impossible. Jan Kara is working on extent tree code cleanup, trying to factor out some common code first and to teach the block allocation related code that it does not have to rely on the on-disk extent format. This is not a high priority for now.

Fallocate performance issue

A performance problem has been reported with fallocate on really large files. The ext4 multiple block allocator code (mballoc) currently limits how large a chunk of blocks can be allocated at a time. We should be able to hack mballoc to allocate at least 16MB at a time, instead of 2MB at a time.
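As a rough illustration of why the chunk size matters, the sketch below counts how many passes through the allocator a single fallocate request needs for a given chunk limit. The function name is hypothetical, and the model ignores everything else the allocator does.

```c
#include <assert.h>
#include <stdint.h>

/* Number of allocator passes needed to cover request_bytes, rounded up. */
static uint64_t alloc_rounds(uint64_t request_bytes, uint64_t chunk_bytes)
{
	assert(chunk_bytes > 0);
	return (request_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

For a 1TB fallocate, a 2MB limit means 524288 passes, while a 16MB limit brings that down to 65536.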

This brought up another related mballoc limitation. At present mballoc normalizes the request size to the nearest power of 2, up to 1MB.  The original reason for this was RAID alignment.  If we lift this limitation and allow non-normalized request sizes, fallocate could be 3 times faster.  Most likely we will address this quickly.
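The normalization described above can be sketched as follows. This is not the real mballoc code: the function name is hypothetical, and it assumes that "nearest" means rounding up to the next power of 2, with 1MB as the cap.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a request up to the next power of 2, capped at 1MB.
 * Assumption: rounding is upwards; this only models the size arithmetic.
 */
static uint64_t normalize_request(uint64_t bytes)
{
	const uint64_t cap = 1024 * 1024;
	uint64_t size = 1;

	assert(bytes > 0);
	while (size < bytes && size < cap)
		size <<= 1;
	return size;
}
```

Under these assumptions a 520KB request is normalized to the full 1MB, a 4KB request stays at 4KB, and anything larger than 1MB is handled in 1MB pieces.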

Buddy bitmap flush from mem too quickly

Under some memory pressure tests, the buddy bitmap used to guide ext4 block allocation was being pushed out of memory too quickly; even marking the pages dirty is not strong enough to keep them in memory. We talked to the MM people about an alternative to the mark_page_accessed() interface, which ended with an agreement to use fadvise to mark the pages as metadata.

data=guarded journaling mode

Back in the ext3 days, when there was no delayed allocation, fsync() performance was badly hurt by the data=ordered mode, which forces the data to be flushed out first (possibly all of the dirty data in the filesystem) before committing a metadata update. There was a proposal for a data=guarded mode, which protects against data inconsistency upon power failure but would give much better fsync results. The basic idea is that the on-disk isize won't be updated until the data has been flushed to disk. This would remove most of the difference between data=writeback mode and data=ordered mode.
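The data=guarded idea can be modelled in a few lines. This is a toy sketch only: the struct and function names are hypothetical, and it tracks just the sizes, not the journal or the page cache.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one file under a data=guarded style policy. */
struct guarded_inode {
	uint64_t mem_size;     /* i_size as seen by applications */
	uint64_t disk_size;    /* size recorded in the on-disk metadata */
	uint64_t flushed_upto; /* bytes [0, flushed_upto) are on disk */
};

/* An appending write grows the in-memory size immediately. */
static void guarded_append(struct guarded_inode *in, uint64_t bytes)
{
	in->mem_size += bytes;
}

/* Record that data writeback has completed up to the given offset. */
static void guarded_data_flushed(struct guarded_inode *in, uint64_t upto)
{
	assert(upto <= in->mem_size);
	if (upto > in->flushed_upto)
		in->flushed_upto = upto;
}

/*
 * A metadata commit never waits for data, but it also never exposes
 * bytes that are not on disk yet: the on-disk size only advances to
 * the flushed offset.
 */
static void guarded_commit(struct guarded_inode *in)
{
	if (in->flushed_upto > in->disk_size)
		in->disk_size = in->flushed_upto;
}
```

In this model a commit right after an 8KB append leaves the on-disk size at 0, so a crash at that point cannot expose stale blocks, and the size catches up once the data has been flushed.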

At the meeting this journalling mode was brought up again to see if we need it for ext4. Since ext4 implements delayed allocation, fsync performance is already much improved (there is no need to flush unrelated file data), so the performance benefit is not so obvious. But this new journalling mode would greatly help with 1) the unwritten extent conversion issue, so that we could have a full lockless DIO read implementation, and 2) getting rid of an extra journalling mode.

ext4 filesystem mount options

There was discussion of the cost of testing ext4 due to the many different combinations of ext4 mount options (70 in total). Part of the reason is that distros are trying to maintain just the ext4 code for all three filesystems (ext2/3/4), so there is an effort to test and validate that the ext4 module still works as expected when mounted as ext3 with different mount options.  A few important mount options need special care and investigation, including support for indirect-based/extent-based files, support for asynchronous journal commit, and the mutual exclusion of data=journal and delayed allocation.

So, in short, ext4 development over the next year will mostly focus on reducing latency, improving performance, and code reorganization.

-- Mingming Cao


The Oracle mainline Linux kernel team works as part of the Linux kernel community to develop new features and maintain existing code.

Our team is globally distributed and includes leading core kernel developers and industry veterans.

This blog is edited by James Morris.

