OCFS2 reflink

It has been a while since I last wrote something about OCFS2. For those who don’t know what this is, OCFS2 is a feature-rich, standard Linux cluster filesystem. Linus took OCFS2 into mainline in the 2.6.16 time frame and it is being actively maintained. The majority of the work has always been done at Oracle; however, folks from Novell have provided many contributions, as have individuals like Christoph Hellwig.

OCFS2 is a really nice filesystem that is used by many people out there. If you follow the ocfs2-users and ocfs2-devel mailing lists, it is clear that many of them use it for their own applications.

We provide OCFS2 RPMs for Oracle Enterprise Linux (OEL) and Red Hat Enterprise Linux (RHEL) on oracle.com, and we make the RPMs available on ULN for Oracle Unbreakable Linux support customers. Even though the code is in the 2.6.18 kernel that is used in RHEL5, they decided not to compile the modules, so we build them outside of the kernel. (We do not modify the kernel config and build them in, because that would be considered a change.)

For people who want to use OCFS2 or play with it, it’s included in OEL as an extra (not a modification of existing RHEL code). You can get the RPMs for RHEL from the Oracle Technology Network (OTN). It is all free to use and download. If you need support, you can purchase an Unbreakable Linux support subscription, which includes support for the filesystem.

You can find tons of information on our oss.oracle.com website: http://oss.oracle.com/projects/ocfs2/. Some of the new features that are in the mainline Linux release of the filesystem are listed below. The most notable one is reflink, which I will cover in more detail. All OCFS2 development is public and every change is immediately published on oss.oracle.com in our git repositories.

- extended attributes. The value of each extended attribute can be as large as a regular file, which is larger than even ext3 allows.

- POSIX ACL support

- support for userspace cluster stacks. If needed it is possible to use OCFS2 with cman and pacemaker

- jbd2 support. This gives us 64-bit block numbers, and we can theoretically support 4PB filesystems. With jbd1 the limit was 16TB per filesystem.

- quota support

- metadata checksums and ECC. All metadata blocks in OCFS2 now have a checksum field. If the checksum fails, there is an ECC field that can recover a single-bit error. If the error is unrecoverable, OCFS2 makes only the affected inode unreadable; the rest of the filesystem is not affected. Most filesystems would switch the entire filesystem to read-only mode in this case.

- improved inode allocation. This will help with filesystems that hold a huge number of files.

- indexed directories. This will improve performance of lookups of a single name.

- reflink which creates a target inode that shares the data extents of the source inode in a copy-on-write fashion.
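
The 16TB and 4PB figures in the jbd2 item can be reproduced with a little arithmetic. This is only a sketch: the 4 KB block size, the 1 MB cluster size and the 32-bit counters are assumptions of mine that this post does not state, and the names are illustrative.

```python
# Back-of-the-envelope filesystem size limits.
# Assumptions (not stated in the post): 4 KB blocks, 1 MB clusters,
# and 32-bit counters for blocks (jbd1) and clusters.
KIB = 1024
TIB = 1024 ** 4
PIB = 1024 ** 5

BLOCK_SIZE = 4 * KIB         # assumed filesystem block size
CLUSTER_SIZE = 1024 * KIB    # assumed allocation cluster size
ADDRESS_BITS = 32            # assumed counter width


def max_bytes(unit_size, bits=ADDRESS_BITS):
    """Largest volume addressable with `bits`-bit unit numbers."""
    return unit_size * (2 ** bits)


jbd1_limit = max_bytes(BLOCK_SIZE)       # 32-bit block numbers
cluster_limit = max_bytes(CLUSTER_SIZE)  # 32-bit cluster numbers

print(jbd1_limit // TIB, "TiB")     # 16 TiB
print(cluster_limit // PIB, "PiB")  # 4 PiB
```

Under these assumptions the journal's 32-bit block numbers are what capped a volume at 16TB, and moving to 64-bit block numbers lifts that cap up to the cluster-count limit.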


Now, about reflink. The reason we implemented reflink is for Oracle VM. As you know, a virtual machine/guest owns one or more virtual disks. These virtual disks are represented as files on a filesystem hosted by the hypervisor. In the case of Oracle VM, if you have SAN or iSCSI storage, we put an OCFS2 filesystem on top of this, managed by the management domain (dom0). The virtual disks live on top of this OCFS2 volume.

These virtual disks can become very large; they are usually many GBs in size. So when a user wants to create a clone of a virtual machine, or create a virtual machine based on an existing template, we copy the content of the original virtual disks to a new set of virtual disks. By default this doubles the amount of storage used.

For example, you have VM1 with a 40GB virtual disk (vm1/system.img) and you want to copy that to create VM2 based on the same virtual disk image (vm2/system.img).

The reflink feature in OCFS2, which was published to fs-devel and ocfs2-devel a while back, supports this operation by effectively creating hard links with copy-on-write semantics (basically a point-in-time data hard link).

Today, we copy the file vm1/system.img to vm2/system.img. Tomorrow, we run reflink vm1/system.img vm2/system.img. At creation time no additional space is used and no actual copying is done; reflink just creates a totally new inode/file that shares the data extents of the original. As soon as a write is done to one side or the other, 1MB chunks are copied where the writes occur.
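
To make the sharing and copy-on-write behaviour concrete, here is a toy model in Python. It is a sketch of the idea only, not the OCFS2 implementation: the Volume and File classes, the reference counts and the whole-chunk writes are all simplifications I made up for illustration.

```python
# Toy model of reflink: files are lists of chunk ids, chunks are
# reference counted, and a write to a shared chunk goes to a private copy.
# All names here are hypothetical; this is not how OCFS2 is coded.

class Volume:
    def __init__(self):
        self.data = {}       # chunk id -> payload
        self.refcount = {}   # chunk id -> number of files sharing it
        self._next = 0

    def allocate(self, payload):
        cid = self._next
        self._next += 1
        self.data[cid] = payload
        self.refcount[cid] = 1
        return cid

    def chunks_in_use(self):
        return len(self.data)


class File:
    def __init__(self, volume, chunk_ids):
        self.volume = volume
        self.chunk_ids = list(chunk_ids)  # each file owns its own list

    def read(self, index):
        return self.volume.data[self.chunk_ids[index]]

    def write(self, index, payload):
        vol = self.volume
        cid = self.chunk_ids[index]
        if vol.refcount[cid] > 1:
            # Shared chunk: drop our reference and write to a new chunk.
            vol.refcount[cid] -= 1
            self.chunk_ids[index] = vol.allocate(payload)
        else:
            # Private chunk: overwrite in place.
            vol.data[cid] = payload


def reflink(source):
    """Create a new file that shares every chunk of `source`."""
    vol = source.volume
    for cid in source.chunk_ids:
        vol.refcount[cid] += 1
    return File(vol, source.chunk_ids)


vol = Volume()
original = File(vol, [vol.allocate(b"base") for _ in range(40)])
clone = reflink(original)
used_after_reflink = vol.chunks_in_use()   # still 40: nothing was copied

clone.write(3, b"changed")
used_after_write = vol.chunks_in_use()     # 41: one chunk is now private

print(used_after_reflink, used_after_write)
print(original.read(3), clone.read(3))
```

Only the chunk that was written gets its own storage; every other chunk stays shared between the two files, and the original never sees the change.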

This allows us to create instant copies of files (or in the case of Oracle VM, virtual disk images).

Some of the advantages of reflink are:

- Each “hard link”, or point-in-time copy, is a regular file for the OS, for an application etc., so no changes are needed to applications or backup software. This is totally transparent; there is no container around these files, unlike vmdk and vhd, where the snapshots live inside the containers.

- It is fully cluster safe. In an OCFS2 filesystem cluster, the link and the COW work on any node, even if the file is in use and open on another node. In the Oracle VM case this allows us to create snapshots and run the new VMs on a different node than the one the original VM is running on.

- This is a generic feature just like symlink. It is available to any user or application.

- It is open source (part of the OCFS2 code) and free to use for anyone.
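
As an illustration of the "available to any application" point, here is a rough Python sketch of how a program might ask the kernel for a reflink and fall back to a plain copy. Note the assumptions: it uses the generic Linux FICLONE ioctl, which is a later kernel interface and not the reflink command shown in this post, and whether a given kernel's OCFS2 honours it is something you would have to check. The clone_or_copy helper and the file names are made up for this example.

```python
import fcntl
import os
import shutil
import tempfile

# FICLONE is _IOW(0x94, 9, int) in <linux/fs.h>. Assumption: the kernel and
# the filesystem support it; otherwise the ioctl fails with an OSError.
FICLONE = 0x40049409


def clone_or_copy(src, dst):
    """Try to make dst share src's extents; fall back to a byte copy.

    Returns "reflink" or "copy" to say which path was taken.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return "reflink"
        except OSError:
            pass  # not supported here, copy the data instead
    shutil.copyfile(src, dst)
    return "copy"


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "system.img")
    dst = os.path.join(workdir, "system-clone.img")
    with open(src, "wb") as f:
        f.write(b"virtual disk contents" * 1000)

    method = clone_or_copy(src, dst)
    with open(src, "rb") as a, open(dst, "rb") as b:
        same = a.read() == b.read()

print(method, same)
```

Either way the caller ends up with a second regular file holding the same data; the only difference is whether the extents are shared or duplicated.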


Below is an example of reflink. It shows the disk space usage and the time it takes to complete the commands, followed by a simple modification made with dd to one file and how that affects both files.

ls -l
total 1771896
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img
===============================================================

df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 50G 3.9G 47G 8% /ocfs2
===============================================================

reflink el4.5-system.img el4.5-system1.img

real 0m0.030s
user 0m0.000s
sys 0m0.000s
===============================================================
ls -l
total 1771896
-rw-r--r-- 1 root root 1814420898 May 1 12:59 el4.5-system1.img
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img
===============================================================
df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 50G 3.9G 47G 8% /ocfs2
===============================================================
md5sum el4.5-system.img el4.5-system1.img
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system.img

real 0m31.094s
user 0m7.420s
sys 0m10.530s
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system1.img

real 0m34.553s
user 0m7.500s
sys 0m10.140s

===============================================================
dd if=/dev/zero of=el4.5-system1.img bs=1M count=1000 seek=500 conv=notrunc
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 104.889 seconds, 10.0 MB/s
===============================================================
ls -l
total 3543792
-rw-r--r-- 1 root root 1814420898 May 1 13:02 el4.5-system1.img
-rw-r--r-- 1 root root 1814420898 May 1 12:58 el4.5-system.img
===============================================================
df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sde1 50G 4.9G 46G 10% /ocfs2
===============================================================
md5sum el4.5-system.img el4.5-system1.img
c41b670c59e8a4446ad07e9fb0f98b6d el4.5-system.img

real 0m32.430s
user 0m7.920s
sys 0m11.340s
b67b39c3c86a4110cb795f516bc7f86b el4.5-system1.img

real 0m32.069s
user 0m7.920s
sys 0m10.350s
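
The numbers in the transcript can be cross-checked with a few lines of Python. All the inputs are copied from the output above; the variable names are mine and this is only a sanity check.

```python
# Cross-check of the dd step, using values from the transcript above.
MIB = 1024 * 1024

file_size = 1814420898              # bytes, from ls -l
bs, count, seek = 1 * MIB, 1000, 500  # dd arguments
seconds = 104.889                   # elapsed time reported by dd

bytes_written = bs * count
write_end = bs * seek + bytes_written
stays_inside_file = write_end <= file_size  # so the file size does not change
throughput_mb_s = bytes_written / seconds / 1e6
written_gib = bytes_written / 1024 ** 3

print(bytes_written)               # 1048576000, as dd reports
print(stays_inside_file)           # True
print(round(throughput_mb_s, 1))   # 10.0
print(round(written_gib, 2))       # 0.98
```

The write lands entirely inside the existing file, which is why ls still shows the same size for both files, and the roughly 1G written matches the jump in df from 3.9G to 4.9G used.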

enjoy.

Comments:

awsome... now we just need CLVM2 on top of OCFS2 :)

Posted by Joe Hoot on May 05, 2009 at 03:01 AM PDT #

Wim, This is very cool. I was not aware that OCFS2 was being enhanced specifically to improve the performance of Oracle Virtual Machine environments, but that makes perfect sense. I was also pleased to hear that OCFS2 has had contributions from Novell and others now. Thanks for the update. Roland.

Posted by Roland Slee on May 05, 2009 at 07:38 AM PDT #

Hi Wim: Is there any compiled version (preferred .rpm) of reflink for Oracle VM 2.1.2? Or, Should I install latest development version? Best regards, Marcelo.

Posted by Marcelo Ochoa on May 06, 2009 at 01:35 AM PDT #

Joe: yeah it's on the todo list. just not trivial.
Roland: yes Novell's done a lot of work. It's nice indeed.
Marcelo: not yet - that will take a while to get out but we will put those test rpms out there as soon as we can.

Posted by Wim Coekaerts on May 06, 2009 at 02:29 PM PDT #

Is there a performance penalty when writing to a reflinked file vs. writing to a normal file?

Posted by Leo on May 28, 2009 at 05:17 AM PDT #

Hi Wim: I just upgraded to latest Oracle VM server 2.2.0 but I can't find reflink command. Packages seem to be latest version:

# modinfo ocfs2
filename: /lib/modules/2.6.18-128.2.1.4.9.el5xen/kernel/fs/ocfs2/ocfs2.ko
license: GPL
author: Oracle
version: 1.4.4
description: OCFS2 1.4.4
srcversion: 478454745EAD2E534E8743D
depends: ocfs2_dlm,jbd,ocfs2_nodemanager
vermagic: 2.6.18-128.2.1.4.9.el5xen SMP mod_unload Xen 686 REGPARM 4KSTACKS gcc-4.1
module_sig: 883f3504acf89b5c14d82d4d57f659711257930a0d690917461eb4e1cb4c3c271856d764072d7dd950a08da6caae88d6bdfab3a67fd2a833dcc5af845c

# rpm -qa|grep ocfs
ocfs2-tools-1.4.3-4.el5

Where can I find reflink command? Do I need to install another package? Best regards, Marcelo.

Posted by Marcelo Ochoa on December 14, 2009 at 03:18 AM PST #

Hi Marcelo, it's not in the product yet - I was doing an install of testcode to show how it will work. it's not in the product. it's in mainline linux/ocfs2 right now.

Posted by wim.coekaerts on December 14, 2009 at 03:21 AM PST #

About

Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

You can follow him on Twitter at @wimcoekaerts
