Saturday Jan 28, 2012

The latest bits around ocfs2

It's been a while since we last posted something about ocfs2 on our Oracle blogs, but that doesn't mean the filesystem hasn't evolved. I had to write a summary for a customer, so I figured it would be a good idea to add a blog entry and document it here as well.

OCFS2 is a native Linux cluster filesystem that has been around for quite a few years now. It was developed at Oracle and has also had a ton of contributions from the folks at SuSE over several years. The filesystem was officially merged into the mainline kernel in 2.6.16, and all changes since then have gone into mainline first and then trickled down into the versions we build for Linux distributions. So ocfs2 versions 1.2, 1.4 and 1.6 (1.8 for Oracle VM 3) are specific snapshots of the filesystem code, released for specific kernels such as 2.6.18 or 2.6.32.

SLES has a version of ocfs2 that SuSE builds. Other vendors decided not to compile in the filesystem, so for Oracle Linux we of course make sure we have current versions available as well. We also provide support for the filesystem as part of Oracle Linux support. You do not need to buy extra clustering or filesystem add-on options: the code is part of Oracle Linux and the support is part of our regular Oracle Linux support subscriptions.

Many ocfs2 users use the filesystem as an alternative to nfs. When I read the posts on the public ocfs2 mailing lists, this is a comment that comes back frequently, so there must be some truth to it as these are all unsolicited 3rd party comments :)... One nice thing about ocfs2 is that it's very easy to set up: a simple text file (config file) on each node with the list of hostnames and ip addresses, and you're basically good to go. You do need shared storage, as it's a real cluster filesystem. This shared storage can be iscsi, san/fc or shared scsi, and we highly recommend a private network so that you can isolate the cluster traffic.

The main problem reports we get tend to be due to overloaded servers. In a cluster filesystem you have to know what is really going on with all servers, otherwise there is the potential for data corruption. This means that if a node gets into trouble, for example an overloaded network or running out of memory, the cluster stack will likely end up halting or rebooting that node so that the other servers can happily continue. A large percentage of customer reports turn out to be related to misconfigured networks (sharing the interconnect / cluster traffic with everything else) or to bad/slow disk subsystems that get so overloaded that the heartbeat IOs cannot make it to the device in time.
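To make the "simple text file" a bit more concrete, here is a minimal Python sketch that renders such a file for a two-node cluster. The stanza layout (a `cluster:` block followed by one `node:` block per server, with tab-indented `key = value` lines) is how I remember `/etc/ocfs2/cluster.conf` from ocfs2-tools, so treat it as an assumption and check it against the documentation for your version. The cluster name, hostnames and addresses are made up for illustration, and 7777 is used as the conventional o2cb port.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str            # hypothetical hostname of the node
    ip_address: str      # address on the private interconnect
    number: int          # unique node number within the cluster
    ip_port: int = 7777  # assumed default port for cluster traffic


def render_cluster_conf(cluster: str, nodes: List[Node]) -> str:
    """Render the cluster stanza followed by one stanza per node."""
    lines = [
        "cluster:",
        f"\tnode_count = {len(nodes)}",
        f"\tname = {cluster}",
        "",
    ]
    for node in nodes:
        lines += [
            "node:",
            f"\tip_port = {node.ip_port}",
            f"\tip_address = {node.ip_address}",
            f"\tnumber = {node.number}",
            f"\tname = {node.name}",
            f"\tcluster = {cluster}",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; the same file would be copied to every node.
    print(render_cluster_conf("demo", [
        Node("node1", "192.168.10.1", 0),
        Node("node2", "192.168.10.2", 1),
    ]))
```

The point of generating the file rather than typing it by hand is simply that every node ends up with an identical copy, which is what the cluster stack expects.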

One of the reasons ocfs2 is so trivial to configure is that the entire ecosystem is integrated. It comes with its own embedded clustering stack, o2cb. This small, specialized cluster stack provides node membership, heartbeat services and a distributed lock manager. It is not designed to be a general purpose userspace cluster stack; it is tailored to the basic requirements of our filesystem. Another really cool feature, I'd say it's in my top 3 cool features list for ocfs2, is dlmfs. dlmfs is a virtual filesystem that exposes a few simple lock types: shared read, exclusive and trylock. There's a library, libo2dlm, to use this in applications, or you can simply use your shell to create a domain and locks just by doing a mkdir and touching files. Someone with some time on their hands could, with some shell magic, hack together a little userspace cluster daemon that monitors applications or nodes and handles start, stop and restart. It's on my todo list but I haven't had time :) Anyway, it's a very nifty feature.
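Here is a rough Python sketch of what taking a dlmfs lock with plain file operations could look like. It assumes dlmfs is mounted at `/dlm`, and it relies on the flag mapping as I recall it from the kernel's dlmfs documentation: `O_RDONLY` requests a shared read lock, `O_RDWR` an exclusive lock, `O_NONBLOCK` turns the request into a trylock, and a failed trylock comes back as `ETXTBSY`. The domain and lock names are hypothetical, so verify the details on your own system before building on this.

```python
import errno
import os

DLMFS_ROOT = "/dlm"  # assumption: the usual dlmfs mount point


def open_flags(mode: str, trylock: bool = False) -> int:
    """Translate a lock type into the open(2) flags dlmfs looks at."""
    flags = {"shared": os.O_RDONLY, "exclusive": os.O_RDWR}[mode]
    flags |= os.O_CREAT  # make sure the lock inode exists
    if trylock:
        flags |= os.O_NONBLOCK  # do not wait for the lock
    return flags


def acquire(domain, lock, mode="exclusive", trylock=False, root=DLMFS_ROOT):
    """Return an fd holding the lock, or None if a trylock was refused."""
    domain_dir = os.path.join(root, domain)
    os.makedirs(domain_dir, exist_ok=True)  # mkdir creates/joins the domain
    path = os.path.join(domain_dir, lock)
    try:
        # On dlmfs, open() returns only once the lock has been granted.
        return os.open(path, open_flags(mode, trylock), 0o600)
    except OSError as exc:
        if trylock and exc.errno == errno.ETXTBSY:
            return None  # somebody else holds a conflicting lock
        raise


def release(fd: int) -> None:
    """Closing the file descriptor drops the lock."""
    os.close(fd)
```

A monitoring daemon of the kind described above could, for instance, have each node try for an exclusive lock named after a service and start the service only on the node whose `acquire(..., trylock=True)` returns a file descriptor.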

Anyway, I digress... One of the customer questions I had recently was about what's going on with ocfs2 and whether there has been any development effort. I decided to go through the Linux kernels since 2.6.27 and collect the list of checkins that have gone in since then. These features are also in our latest ocfs2 as part of Oracle VM 3.0 and, for the most part, also in our kernel (Unbreakable Enterprise Kernel). Here is the list, I think it's pretty impressive:

    Remove JBD compatibility layer 
    Add POSIX ACLs
    Add security xattr support (extended attributes for SELinux)
    Implement quota recovery 
    Periodic quota syncing 
    Implementation of local and global quota file handling
    Enable quota accounting on mount, disable on umount 
    Add a name indexed b-tree to directory inodes 
    Optimize inode allocation by remembering last group
    Optimize inode group allocation by recording last used group.
    Expose the file system state via debugfs 
    Add statistics for the checksum and ecc operations.
    Add CoW support (reflink provides an unlimited number of inode-based, i.e. file-based, writeable snapshots; very useful for virtualization)
    Add ioctl for reflink. 
    Enable refcount tree support. 
    Always include ACL support 
    Implement allocation reservations, which reduce fragmentation significantly 
    Optimize the hole-punching code, which significantly speeds up some rare operations 
    Discontiguous block groups, necessary to improve some kinds of allocations. This feature sets an incompatible feature bit, so kernels that do not know about it will refuse to mount such a volume
    Make nointr ("don't allow file operations to be interrupted") a default mount option 
    Allow huge (> 16 TiB) volumes to mount
    Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes.
    Add new OCFS2_IOC_INFO ioctl: gives the non-privileged end user a way to gather filesystem information 
    Add support for heartbeat=global mount option (instead of having a heartbeat per filesystem you can now have a single heartbeat)
    SSD trimming support 
    Support for moving extents (preparation for defragmentation)

There are a number of external articles written about ocfs2; one that I found is here.

Have fun...


Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

You can follow him on Twitter at @wimcoekaerts

