Oracle Linux, virtualization, Enterprise and Cloud Management Cloud technology musings

  • May 4, 2009

OCFS2 reflink

It has been a while since I last wrote something about OCFS2. For those who don’t know what it is, OCFS2 is a feature-rich, standard Linux cluster filesystem. Linus took OCFS2 into mainline in the 2.6.16 time frame, and it is being actively maintained. The majority of the work has always been done at Oracle; however, folks from Novell have provided many contributions, as have individuals like Christoph Hellwig.

OCFS2 is a really nice filesystem that is used by many people. If we track the ocfs2-users and ocfs2-devel mailing lists, it is clear that many of them use it for their own applications.

We provide OCFS2 RPMs for Oracle Enterprise Linux (OEL) and Red Hat Enterprise Linux (RHEL) on oracle.com, and we make the RPMs available on ULN for Oracle Unbreakable Linux support customers. Even though the code is in the 2.6.18 kernel used in RHEL5, Red Hat decided not to compile the modules, so we build them outside of the kernel. (We do not modify the kernel config to build them in, because that would be considered a change.)

For people who want to use OCFS2 or play with it, it is included in OEL as an extra (it does not modify any existing RHEL code). You can get the RPMs for RHEL from the Oracle Technology Network (OTN). It is all free to download and use. If you need support, you can purchase an Unbreakable Linux support subscription, which includes support for the filesystem.

You can find tons of information on our oss.oracle.com website: http://oss.oracle.com/projects/ocfs2/. Some of the new features in the mainline Linux release of the filesystem are listed below. The most notable one is reflink, which I will cover in more detail. All OCFS2 development is public, and every change is immediately published in our git repositories on oss.oracle.com.

- extended attributes. In fact, the value of each extended attribute can be as large as a regular file, which is larger than even ext3 allows.

- POSIX ACL support

- support for userspace cluster stacks. If needed, it is possible to use OCFS2 with cman and pacemaker.

- jbd2 support. This gives us 64-bit block numbers, so we can theoretically support 4 PB filesystems. With jbd1 the limit is/was 16 TB per filesystem.

- quota support

- metadata checksums and ECC. All metadata blocks in OCFS2 now have a checksum field. If the checksum fails, there is an ECC field that can recover a single-bit error. If the error is unrecoverable, OCFS2 makes that single inode unreadable, but this does not affect the rest of the filesystem. In most filesystems this would take the entire filesystem into read-only mode.

- improved inode allocation. This will help with filesystems that hold a huge number of files.

- indexed directories. This will improve performance of lookups of a single name.

- reflink which creates a target inode that shares the data extents of the source inode in a copy-on-write fashion.
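
The 16 TB and 4 PB figures in the jbd2 item above can be reproduced with simple arithmetic. The sketch below is one plausible reading, not something the post spells out: it assumes 4 KiB filesystem blocks addressed by 32-bit journal block numbers for the old limit, and 1 MiB clusters addressed by 32-bit cluster numbers for the new one. Both unit sizes and the helper name `max_bytes` are assumptions made for illustration.

```python
# Back-of-the-envelope check of the size limits quoted in the feature list.
# ASSUMPTIONS (not stated in the post): 4 KiB blocks, 1 MiB clusters,
# and 32-bit counters for both.
KIB = 1024
TIB = 1024 ** 4
PIB = 1024 ** 5


def max_bytes(unit_size: int, counter_bits: int) -> int:
    """Largest volume addressable with counter_bits-bit unit numbers."""
    return (2 ** counter_bits) * unit_size


jbd1_limit = max_bytes(4 * KIB, 32)        # 32-bit journal block numbers
cluster_limit = max_bytes(1024 * KIB, 32)  # 32-bit cluster numbers

print(jbd1_limit // TIB, "TiB with 32-bit block numbers")
print(cluster_limit // PIB, "PiB with 32-bit cluster numbers")
```

Under those assumptions the two limits come out at 16 TiB and 4 PiB, which matches the numbers quoted above.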


Now, about reflink. The reason we implemented reflink is for Oracle VM. As you know, a virtual machine/guest owns one or more virtual disks. These virtual disks are represented as files on a filesystem hosted by the hypervisor. In the case of Oracle VM, if you have SAN or iSCSI storage, we put an OCFS2 filesystem on top of this, managed by the management domain (dom0). The virtual disks live on top of this OCFS2 volume.

These virtual disks can become very large; they are usually many GB in size. When a user wants to clone a virtual machine or create one from an existing template, we copy the content of the original virtual disks to a new set of virtual disks. By default this doubles the amount of storage used.

For example, you have VM1 with a 40 GB virtual disk (vm1/system.img) and you want to copy it to create VM2 based on the same virtual disk image (vm2/system.img).

The reflink feature in OCFS2, which was published to fs-devel and ocfs2-devel a while back, supports this operation by effectively creating hard links with copy-on-write semantics (basically a point-in-time data hard link).

Today, we copy the file vm1/system.img to vm2/system.img. Tomorrow, we run reflink vm1/system.img vm2/system.img. At creation time no additional space is used and no actual copying is done; reflink just creates a totally new inode/file that shares the data extents. As soon as a write is done to either side, 1 MB chunks are copied where the writes occur.

This allows us to create instant copies of files (or in the case of Oracle VM, virtual disk images).
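
To make the sharing and copy-on-write behavior concrete, here is a toy in-memory model. It is only a sketch of the idea described above, not OCFS2's implementation: the `Volume` and `File` classes, the reference counting, and the simplification that every write replaces one whole 1 MiB chunk are all assumptions made for illustration.

```python
CHUNK = 1024 * 1024  # 1 MiB copy granularity, the figure given in the post


class Volume:
    """Toy chunk store with reference counts (hypothetical, not OCFS2 code)."""

    def __init__(self):
        self.chunks = {}   # chunk id -> bytes
        self.refs = {}     # chunk id -> number of files sharing the chunk
        self.next_id = 0

    def alloc(self, data):
        cid = self.next_id
        self.next_id += 1
        self.chunks[cid] = data
        self.refs[cid] = 1
        return cid

    def used(self):
        # Space consumed on the volume: each stored chunk counts once.
        return len(self.chunks) * CHUNK


class File:
    """A file is just an ordered list of chunk ids (its 'extents')."""

    def __init__(self, vol, extents):
        self.vol = vol
        self.extents = extents

    @classmethod
    def create(cls, vol, n_chunks, fill):
        return cls(vol, [vol.alloc(fill * CHUNK) for _ in range(n_chunks)])

    def reflink(self):
        # New file object, same chunks: nothing is copied, only refs go up.
        for cid in self.extents:
            self.vol.refs[cid] += 1
        return File(self.vol, list(self.extents))

    def write_chunk(self, index, data):
        # Simplification: a write always replaces one whole chunk.
        cid = self.extents[index]
        if self.vol.refs[cid] > 1:
            # Shared chunk: leave it to the other file, allocate a private one.
            self.vol.refs[cid] -= 1
            self.extents[index] = self.vol.alloc(data)
        else:
            self.vol.chunks[cid] = data

    def read(self):
        return b"".join(self.vol.chunks[cid] for cid in self.extents)


vol = Volume()
src = File.create(vol, 4, b"\x01")
before = vol.used()

clone = src.reflink()
same_after_reflink = vol.used() == before   # reflink itself uses no extra space

clone.write_chunk(1, b"\x00" * CHUNK)       # first write to a shared chunk
extra = vol.used() - before

print(same_after_reflink, extra, src.read() == clone.read())
# prints: True 1048576 False
```

In this model the clone costs nothing until the first write, and that write costs exactly one chunk while the source file stays byte-for-byte unchanged.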

Some of the advantages of reflink are:

- Each “hard link”, or point-in-time copy, is a regular file to the OS and to applications, so no changes are needed to applications or backup software. It is totally transparent: there is no container around these files, unlike vmdk and vhd, where the snapshots live inside the container.

- It is fully cluster-safe: the link and the COW work on any node of an OCFS2 cluster, even if the file is open and in use on another node. In the Oracle VM case, this lets us create snapshots and run the new VMs on a different node than the one the original VM is running on.

- This is a generic feature just like symlink. It is available to any user or application.

- It is open source (part of the OCFS2 code) and free for anyone to use.
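
Because the feature is available to any application, a program can also ask for a clone itself rather than shelling out to a command. The sketch below uses the generic Linux `FICLONE` ioctl, which is a later kernel interface and not the `reflink` tool shown in this post. Whether a given OCFS2 build honors that ioctl is an assumption here, so the hypothetical helper `clone_or_copy` falls back to an ordinary copy when the kernel or filesystem refuses the request. The file names in the demo are placeholders.

```python
import fcntl
import os
import shutil
import tempfile

# _IOW(0x94, 9, int) as encoded on x86 and ARM Linux. Other architectures may
# encode ioctl numbers differently, so treat this constant as an assumption.
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Return True if dst was created as a clone sharing src's extents,
    False if the clone request failed and a full copy was made instead."""
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            pass  # not supported here: fall through to a plain copy
    shutil.copyfile(src, dst)
    return False


# Small demo in a temporary directory (placeholder file names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "system.img")
dst_path = os.path.join(workdir, "system1.img")
with open(src_path, "wb") as f:
    f.write(b"x" * 4096)

cloned = clone_or_copy(src_path, dst_path)
with open(dst_path, "rb") as f:
    identical = f.read() == b"x" * 4096

print("cloned" if cloned else "copied", identical)
```

Either way the destination ends up with the same content. The return value only tells the caller whether the extents are shared or the data was physically duplicated.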


Below is an example of reflink. It shows the disk space usage and the time each command takes to complete, followed by a simple modification made with dd to one file and what that does to each file.

ls -l
total 1771896
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img

===============================================================

df -h .
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sde1    50G  3.9G    47G    8%  /ocfs2

===============================================================

reflink el4.5-system.img el4.5-system1.img
real 0m0.030s
user 0m0.000s
sys 0m0.000s

===============================================================

ls -l
total 1771896
-rw-r--r-- 1 root root 1814420898 May 1 12:59 el4.5-system1.img
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img

===============================================================

df -h .
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sde1    50G  3.9G    47G    8%  /ocfs2

===============================================================

md5sum el4.5-system.img el4.5-system1.img
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system.img
real 0m31.094s
user 0m7.420s
sys 0m10.530s
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system1.img
real 0m34.553s
user 0m7.500s
sys 0m10.140s

===============================================================

dd if=/dev/zero of=el4.5-system1.img bs=1M count=1000 seek=500 conv=notrunc
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 104.889 seconds, 10.0 MB/s

===============================================================

ls -l
total 3543792
-rw-r--r-- 1 root root 1814420898 May 1 13:02 el4.5-system1.img
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img

===============================================================

df -h .
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sde1    50G  4.9G    46G   10%  /ocfs2

===============================================================

md5sum el4.5-system.img el4.5-system1.img
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system.img
real 0m32.430s
user 0m7.920s
sys 0m11.340s
b67b39c3c86a4110cb795f516bc7f86b el4.5-system1.img
real 0m32.069s
user 0m7.920s
sys 0m10.350s
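
The numbers in the session can be cross-checked with a little arithmetic. The sketch below uses only figures printed in the output above: the dd byte count and elapsed time, the dd seek offset, and the file size from ls. The variable names are made up for illustration.

```python
MIB = 1024 * 1024

file_size = 1814420898     # bytes, from ls -l
written = 1000 * MIB       # dd: bs=1M count=1000
start_offset = 500 * MIB   # dd: seek=500
elapsed = 104.889          # seconds, from dd

end_offset = start_offset + written
rate_mb_s = written / elapsed / 1e6   # dd reports decimal megabytes
grown_gib = written / 2 ** 30         # df -h reports binary units

# The write ends inside the existing file, and conv=notrunc keeps the tail,
# which is why ls shows the same size before and after.
stays_same_size = end_offset < file_size

print(f"{written} bytes, {rate_mb_s:.1f} MB/s, "
      f"{grown_gib:.2f} GiB newly allocated, size unchanged: {stays_same_size}")
# prints: 1048576000 bytes, 10.0 MB/s, 0.98 GiB newly allocated, size unchanged: True
```

That accounts for df growing from 3.9G to 4.9G: only the roughly 1 GiB region overwritten in the clone had to be copied, while the rest of the two files remains shared.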

enjoy.

Comments ( 8 )
  • Joe Hoot Tuesday, May 5, 2009
    awesome... now we just need CLVM2 on top of OCFS2 :)
  • Roland Slee Tuesday, May 5, 2009
    Wim,
    This is very cool. I was not aware that OCFS2 was being enhanced specifically to improve the performance of Oracle Virtual Machine environments, but that makes perfect sense. I was also pleased to hear that OCFS2 has had contributions from Novell and others now. Thanks for the update.
    Roland.
  • Marcelo Ochoa Wednesday, May 6, 2009
    Hi Wim:
    Is there any compiled version (preferred .rpm) of reflink for Oracle VM 2.1.2?
    Or, Should I install latest development version?
    Best regards, Marcelo.
  • Wim Coekaerts Wednesday, May 6, 2009
    Joe : yeah it's on the todo list. just not trivial.
    Roland : yes Novell's done a lot of work. It's nice indeed.
    Marcelo : not yet - that will take a while to get out but we will put those test rpms out there as soon as we can.
  • Leo Thursday, May 28, 2009
    Is there a performance penalty when writing to a reflinked file vs. writing to a normal file?
  • Marcelo Ochoa Monday, December 14, 2009
    Hi Wim:
    I just upgraded to latest Oracle VM server 2.2.0 but I can't find reflink command.
    Packages seem to be latest version:
    # modinfo ocfs2
    filename: /lib/modules/2.6.18-128.2.1.4.9.el5xen/kernel/fs/ocfs2/ocfs2.ko
    license: GPL
    author: Oracle
    version: 1.4.4
    description: OCFS2 1.4.4
    srcversion: 478454745EAD2E534E8743D
    depends: ocfs2_dlm,jbd,ocfs2_nodemanager
    vermagic: 2.6.18-128.2.1.4.9.el5xen SMP mod_unload Xen 686 REGPARM 4KSTACKS gcc-4.1
    module_sig: 883f3504acf89b5c14d82d4d57f659711257930a0d690917461eb4e1cb4c3c271856d764072d7dd950a08da6caae88d6bdfab3a67fd2a833dcc5af845c
    # rpm -qa|grep ocfs
    ocfs2-tools-1.4.3-4.el5
    Where can I find reflink command?
    Do I need to install another package?
    Best regards, Marcelo.
  • wim.coekaerts Monday, December 14, 2009
    Hi Marcelo,
    it's not in the product yet - I was doing an install of testcode to show how it will work. it's not in the product. it's in mainline linux/ocfs2 right now.
  • Jose Mantilla Sunday, December 18, 2016

    Hello everyone, I'd like to know the best practice for backing up an Oracle VM guest.

    The answer to this is usually snapshots… Either the underlying storage is capable of providing those, or you will have to go with OCFS2's reflink option to create "snapshots" of your running guest's vdisks, which you can save via rsync, tar, dd, you name it.

    What would be the steps? As I understand it, a reflink is like a hard link, so if I can't power off the VM, the reflink would keep changing too and the backup would end up bad. Am I right? Could anyone help me with my question? I need to back up several OVM virtual machines, but I only have the repositories mounted over NFS, and some people told me rsync would give a wrong result because the VM is running and writing.

    What can I do?

