By Wcoekaer-Oracle on Jul 04, 2012
Work on a cluster filesystem at Oracle started many years ago, in the early 2000's when the Oracle Database Cluster development team wrote a cluster filesystem for Windows that was primarily focused on providing an alternative to raw disk devices and help customers with the deployment of Oracle Real Application Cluster (RAC). Oracle RAC is a cluster technology that lets us make a cluster of Oracle Database servers look like one big database. The RDBMS runs on many nodes and they all work on the same data. It's a Shared Disk database design. There are many advantages doing this but I will not go into detail as that is not the purpose of my write up. Suffice it to say that Oracle RAC expects all the database data to be visible in a consistent, coherent way, across all the nodes in the cluster. To do that, there were/are a few options : 1) use raw disk devices that are shared, through SCSI, FC, or iSCSI 2) use a network filesystem (NFS) 3) use a cluster filesystem(CFS) which basically gives you a filesystem that's coherent across all nodes using shared disks. It is sort of (but not quite) combining option 1 and 2 except that you don't do network access to the files, the files are effectively locally visible as if it was a local filesystem.
So OCFS (Oracle Cluster FileSystem) on Windows was born. Since Linux was becoming a very important and popular platform, we decided that we would also make this available on Linux and thus the porting of OCFS/Windows started. The first version of OCFS was really primarily focused on replacing the use of Raw devices with a simple filesystem that lets you create files and provide direct IO to these files to get basically native raw disk performance. The filesystem was not designed to be fully POSIX compliant and it did not have any where near good/decent performance for regular file create/delete/access operations. Cache coherency was easy since it was basically always direct IO down to the disk device and this ensured that any time one issues a write() command it would go directly down to the disk, and not return until the write() was completed. Same for read() any sort of read from a datafile would be a read() operation that went all the way to disk and return. We did not cache any data when it came down to Oracle data files.
So while OCFS worked well for that, since it did not have much of a normal filesystem feel, it was not something that could be submitted to the kernel mail list for inclusion into Linux as another native linux filesystem (setting aside the Windows porting code ...) it did its job well, it was very easy to configure, node membership was simple, locking was disk based (so very slow but it existed), you could create regular files and do regular filesystem operations to a certain extent but anything that was not database data file related was just not very useful in general. Logfiles ok, standard filesystem use, not so much. Up to this point, all the work was done, at Oracle, by Oracle developers.
Once OCFS (1) was out for a while and there was a lot of use in the database RAC world, many customers wanted to do more and were asking for features that you'd expect in a normal native filesystem, a real "general purposes cluster filesystem". So the team sat down and basically started from scratch to implement what's now known as OCFS2 (Oracle Cluster FileSystem release 2). Some basic criteria were :
Needless to say, this was a huge development effort that took many years to complete. A few big milestones happened along the way...
The release model has always been that new feature development happened in the mainline kernel and we then built consistent, well tested, snapshots that had versions, 1.2, 1.4, 1.6, 1.8. But these releases were effectively just snapshots in time that were tested for stability and release quality.
OCFS2 is very easy to use, there's a simple text file that contains the node information (hostname, node number, cluster name) and a file that contains the cluster heartbeat timeouts. It is very small, and very efficient. As Sunil Mushran wrote in the manual :
Because the RHEL kernel did not contain OCFS2 as a kernel module (it is in the source tree but it is not built by the vendor in kernel module form) we stopped adding the extra packages to Oracle Linux and its RHEL compatible kernel and for RHEL. Oracle Linux customers/users obviously get OCFS2 included as part of the Unbreakable Enterprise Kernel, SuSE customers get it by SuSE distributed with SLES and Red Hat can decide to distribute OCFS2 to their customers if they chose to as it's just a matter of compiling the module and making it available.
OCFS2 today, in the mainline kernel is pretty much feature complete in terms of integration with every filesystem feature Linux offers and it is still actively maintained with Joel Becker being the primary maintainer. Since we use OCFS2 as part of Oracle VM, we continue to look at interesting new functionality to add, REFLINK was a good example, and as such we continue to enhance the filesystem where it makes sense. Bugfixes and any sort of code that goes into the mainline Linux kernel that affects filesystems, automatically also modifies OCFS2 so it's in kernel, actively maintained but not a lot of new development happening at this time. We continue to fully support OCFS2 as part of Oracle Linux and the Unbreakable Enterprise Kernel and other vendors make their own decisions on support as it's really a Linux cluster filesystem now more than something that we provide to customers. It really just is part of Linux like EXT3 or BTRFS etc, the OS distribution vendors decide.
Do not confuse OCFS2 with ACFS (ASM cluster Filesystem) also known as Oracle Cloud Filesystem. ACFS is a filesystem that's provided by Oracle on various OS platforms and really integrates into Oracle ASM (Automatic Storage Management). It's a very powerful Cluster Filesystem but it's not distributed as part of the Operating System, it's distributed with the Oracle Database product and installs with and lives inside Oracle ASM. ACFS obviously is fully supported on Linux (Oracle Linux, Red Hat Enterprise Linux) but OCFS2 independently as a native Linux filesystem is also, and continues to also be supported. ACFS is very much tied into the Oracle RDBMS, OCFS2 is just a standard native Linux filesystem with no ties into Oracle products. Customers running the Oracle database and ASM really should consider using ACFS as it also provides storage/clustered volume management. Customers wanting to use a simple, easy to use generic Linux cluster filesystem should consider using OCFS2.
To learn more about OCFS2 in detail, you can find good documentation on http://oss.oracle.com/projects/ocfs2 in the Documentation area, or get the latest mainline kernel from http://kernel.org and read the source.
One final, unrelated note - since I am not always able to publicly answer or respond to comments, I do not want to selectively publish comments from readers. Sometimes I forget to publish comments, sometime I publish them and sometimes I would publish them but if for some reason I cannot publicly comment on them, it becomes a very one-sided stream. So for now I am going to not publish comments from anyone, to be fair to all sides. You are always welcome to email me and I will do my best to respond to technical questions, questions about strategy or direction are sometimes not possible to answer for obvious reasons.