Friday Jan 04, 2013

dlmfs

dlmfs is a really nifty feature that is part of OCFS2. Basically, it's a virtual filesystem that lets a user or program use the DLM (distributed lock manager) through simple filesystem operations. Without having to write programs that link with cluster libraries or do anything complex, you can literally write a few lines of Python, Java or C code to create locks across a number of servers. We use this feature in Oracle VM to coordinate the master server and the locking of VMs across multiple nodes in a cluster. It allows us to make sure that a VM cannot start on more than one server at once. Every VM is backed by a DLM lock, but thanks to dlmfs, that lock is simply a file in the dlmfs filesystem.
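
To give you an idea of how little code is involved, here is a minimal sketch of the core idea (distilled from the full dlm.py script further down; the domain and lock names are just examples): a lock domain is a directory under /dlm, a lock is a file in that directory, and opening the lock file with O_RDWR | O_NONBLOCK either grabs the exclusive lock or fails with ETXTBSY when another node holds it.

import errno
import os

DOMAIN = "/dlm/mycluster"        # a lock domain is just a directory under /dlm
LOCK = DOMAIN + "/mylock"        # a lock is just a file inside that domain

if not os.path.isdir(DOMAIN):
    os.mkdir(DOMAIN)             # join (or create) the domain

try:
    if not os.path.exists(LOCK):
        fd = os.open(LOCK, os.O_CREAT | os.O_NONBLOCK)   # create the lock resource
        os.close(fd)
    fd = os.open(LOCK, os.O_RDWR | os.O_NONBLOCK)        # trylock, exclusive
    os.close(fd)                 # closing does not drop the lock, this node keeps it
    print "got the exclusive lock"
except OSError, e:
    if e.errno == errno.ETXTBSY:
        print "another node holds the lock"
    else:
        raise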

To show you how easy and powerful this is, I took some of the Oracle VM agent Python code and turned it into a very simple example that creates a lock domain and a lock, and tells you whether or not you got the lock. The focus here is just a master lock, which you could use for an agent that is responsible for a virtual IP, or for some executable that you want to run on only one server at a time, but the calls to create any kind of lock are in the code. Anyone who wants to experiment with this can add their own bits in a matter of minutes.

The prerequisite is simple: take a number of servers, configure an ocfs2 volume and an ocfs2 cluster (see previous blog entries) and run the script. You do not even have to set up an ocfs2 volume if you do not want to; you can just set up the cluster domain without actually mounting a filesystem (see the global heartbeat blog entry below). So practically, this can be done with a very small, simple setup.
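
If you go the no-volume route, the o2cb cluster stack still has to be configured and online, and the ocfs2_dlmfs filesystem has to be mounted. Assuming a cluster configured as in the global heartbeat entry, something along these lines should be enough (the last command is only needed if the init script did not already mount dlmfs):

# /etc/init.d/o2cb enable
# mount | grep dlmfs
# mount -t ocfs2_dlmfs ocfs2_dlmfs /dlm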

My example has two nodes: wcoekaer-emgc1 and wcoekaer-emgc2 are two Oracle Linux 6 servers, configured with a shared disk and an ocfs2 filesystem mounted. This setup ensures that the dlmfs kernel module is loaded and the cluster is online. Take the Python code listed below and just execute it on both nodes.

[root@wcoekaer-emgc2 ~]# lsmod |grep ocfs
ocfs2                1092529  1 
ocfs2_dlmfs            20160  1 
ocfs2_stack_o2cb        4103  1 
ocfs2_dlm             228380  1 ocfs2_stack_o2cb
ocfs2_nodemanager     219951  12 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
ocfs2_stackglue        11896  3 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb
configfs               29244  2 ocfs2_nodemanager
jbd2                   93114  2 ocfs2,ext4
You see that the ocfs2_dlmfs kernel module is loaded.

[root@wcoekaer-emgc2 ~]# mount |grep dlmfs
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
The dlmfs virtual filesystem is mounted on /dlm.

I now execute dlm.py on both nodes and show some of the output. After a while I kill the script (ctrl-c) on the master node and you see the other node take over the lock. I then restart the dlm.py script and reboot the other node, and you see the lock move back.

[root@wcoekaer-emgc1 ~]# ./dlm.py 
Checking DLM
DLM Ready - joining domain : mycluster
Starting main loop...
i am master of the multiverse
i am master of the multiverse
i am master of the multiverse
i am master of the multiverse
^Ccleaned up master lock file
[root@wcoekaer-emgc1 ~]# ./dlm.py 
Checking DLM
DLM Ready - joining domain : mycluster
Starting main loop...
i am not the master
i am not the master
i am not the master
i am not the master
i am master of the multiverse
This shows that this node started as master; when I hit ctrl-c it drops the lock and the other node takes it over. After I restart the script here and reboot the other node, this node becomes master again.

[root@wcoekaer-emgc2 ~]# ./dlm.py
Checking DLM
DLM Ready - joining domain : mycluster
Starting main loop...
i am not the master
i am not the master
i am not the master
i am not the master
i am master of the multiverse
i am master of the multiverse
i am master of the multiverse
^Z
[1]+  Stopped                 ./dlm.py
[root@wcoekaer-emgc2 ~]# bg
[1]+ ./dlm.py &
[root@wcoekaer-emgc2 ~]# reboot -f
Here you see that this node started without being master, became master the moment I hit ctrl-c on the other node, and then, after its own forced reboot, the lock automatically gets released so the other node can take it.

And here is the code; just copy it to your servers and execute it...

#!/usr/bin/python
# Copyright (C) 2006-2012 Oracle. All rights reserved.
#
# This program is free software; you can redistribute it and/or modify it under
# the terms of the GNU General Public License as published by the Free Software
# Foundation, version 2.  This program is distributed in the hope that it will
# be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
# Public License for more details.  You should have received a copy of the GNU
# General Public License along with this program; if not, write to the Free
# Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
# 02111-1307, USA.

import sys
import subprocess
import stat
import time
import os
import re
import socket
from time import sleep

from os.path import join, isdir, exists

# defines
# dlmfs is where the dlmfs filesystem is mounted
# the default, normal place for ocfs2 setups is /dlm
# ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
DLMFS = "/dlm"

# we need a domain name which really is just a subdir in dlmfs
# default to "mycluster" so then it creates /dlm/mycluster
# locks are created inside this directory/domain
DLM_DOMAIN_NAME = "mycluster"
DLM_DOMAIN_PATH = DLMFS + "/" + DLM_DOMAIN_NAME

# the main lock to use for being the owner of a lock
# this can be any name, the filename is just the lockname
DLM_LOCK_MASTER = DLM_DOMAIN_PATH + "/" + "master"

# just a timeout
SLEEP_ON_ERR = 60

def run_cmd(cmd, success_return_code=(0,)):
    if not isinstance(cmd, list):
        raise Exception("Only accepts list!")
    cmd = [str(x) for x in cmd]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, close_fds=True)
    (stdoutdata, stderrdata) = proc.communicate()
    if proc.returncode not in success_return_code:
        raise RuntimeError('Command: %s failed (%s): stderr: %s stdout: %s'
                           % (cmd, proc.returncode, stderrdata, stdoutdata))
    return str(stdoutdata)


def dlm_ready():
    """
    Indicate if the DLM is ready or not.

    With dlmfs, the DLM is ready once the DLM filesystem is mounted
    under /dlm.

    @return: C{True} if the DLM is ready, C{False} otherwise.
    @rtype: C{bool}

    """
    return os.path.ismount(DLMFS)


# just do a mkdir, if it already exists, we're good, if not just create it
def dlm_join_domain(domain=DLM_DOMAIN_NAME):
    _dir = join(DLMFS, domain)
    if not isdir(_dir):
        os.mkdir(_dir)
    # else: already joined

# leaving a domain is basically removing the directory.
def dlm_leave_domain(domain=DLM_DOMAIN_NAME, force=True):
    _dir = join(DLMFS, domain)
    if force:
        cmd = ["rm", "-fr", _dir]
    else:
        cmd = ["rmdir", _dir]
    run_cmd(cmd)

# acquire a lock
def dlm_acquire_lock(lock):
    dlm_join_domain()

    # a lock is a filename in the domain directory
    lock_path = join(DLM_DOMAIN_PATH, lock)

    try:
        if not exists(lock_path):
            fd = os.open(lock_path, os.O_CREAT | os.O_NONBLOCK)
            os.close(fd)
        # create the EX lock
        # creating a file with O_RDWR causes an EX lock
        fd = os.open(lock_path, os.O_RDWR | os.O_NONBLOCK)
        # once the file is created in this mode, you can close it
        # and you still keep the lock
        os.close(fd)
    except Exception, e:
        if exists(lock_path):
            os.remove(lock_path)
        raise e


def dlm_release_lock(lock):
    # releasing a lock is as easy as just removing the file
    lock_path = join(DLM_DOMAIN_PATH, lock)
    if exists(lock_path):
        os.remove(lock_path)

def acquire_master_dlm_lock():
    ETXTBUSY = 26

    dlm_join_domain()

    # close() does not downconvert the lock level nor does it drop the lock. The
    # holder still owns the lock at that level after close.
    # close() allows any downconvert request to succeed.
    # However, a downconvert request is only generated for queued requests. And
    # O_NONBLOCK is specifically a noqueue dlm request.

    # 1) O_CREAT | O_NONBLOCK will create a lock file if it does not exist, whether
    #    we are the lock holder or not.
    # 2) if we hold O_RDWR lock, and we close but not delete it, we still hold it.
    #    afterward, O_RDWR will succeed, but O_RDWR | O_NONBLOCK will not.
    # 3) but if we do not hold the lock, O_RDWR will hang there waiting,
    #    which is not desirable -- any uninterruptible hang is undesirable.
    # 4) if nobody else holds the lock either, but the lock file exists as a side effect
    #    of 1), then with O_NONBLOCK it may result in ETXTBUSY

    # a) we need O_NONBLOCK to avoid scenario (3)
    # b) we need to delete it ourselves to avoid (2)
    #   *) if we do not succeed with (1), remove the lock file to avoid (4)
    #   *) if everything is good, we drop it and we remove it
    #   *) if killed by a program, this program should remove the file
    #   *) if crashed, but not rebooted, something needs to remove the file
    #   *) on reboot/reset the lock is released to the other node(s)

    try:
        if not exists(DLM_LOCK_MASTER):
            fd = os.open(DLM_LOCK_MASTER, os.O_CREAT | os.O_NONBLOCK)
            os.close(fd)

        master_lock = os.open(DLM_LOCK_MASTER, os.O_RDWR | os.O_NONBLOCK)

        os.close(master_lock)
        print "i am master of the multiverse"
        # at this point, I know I am the master and I can add code to do
        # things that only a master can do, such as, consider setting
        # a virtual IP or, if I am master, I start a program
        # and if not, then I make sure I don't run that program (or VIP)
        # so the magic starts here...
        return True

    except OSError, e:
        if e.errno == ETXTBUSY:
            print "i am not the master"
            # if we are not master and the file exists, remove it or
            # we will never succeed
            if exists(DLM_LOCK_MASTER):
                os.remove(DLM_LOCK_MASTER)
        else:
            raise e


def release_master_dlm_lock():
    if exists(DLM_LOCK_MASTER):
        os.remove(DLM_LOCK_MASTER)


def run_forever():
    # set socket default timeout for all connections
    print "Checking DLM"
    if dlm_ready():
       print "DLM Ready - joining domain : " + DLM_DOMAIN_NAME
    else:
       print "DLM not ready - bailing, fix the cluster stack"
       sys.exit(1)

    dlm_join_domain()

    print "Starting main loop..."

    socket.setdefaulttimeout(30)
    while True:
        try:
            acquire_master_dlm_lock()
            sleep(20)
        except Exception, e:
            sleep(SLEEP_ON_ERR)
        except (KeyboardInterrupt, SystemExit):
            if exists(DLM_LOCK_MASTER):
               os.remove(DLM_LOCK_MASTER)
            # if you control-c out of this, then you lose the lock!
            # delete it on exit for release
            print "cleaned up master lock file"
            sys.exit(1)


if __name__ == '__main__':
    run_forever()

Thursday Jan 03, 2013

OCFS2 global heartbeat

A cool but often missed feature in Oracle Linux is the inclusion of OCFS2. OCFS2 is a native Linux cluster filesystem that was written many years ago at Oracle (hence the name Oracle Cluster Filesystem) and was included in the mainline Linux kernel with 2.6.16, back in 2006. The filesystem is widely used and has a number of really cool features.

  • simplicity: it's incredibly easy to configure the filesystem and cluster stack. There is literally one small text-based config file.
  • complete: ocfs2 contains all the components needed: a node manager, a heartbeat, a distributed lock manager and the actual cluster filesystem.
  • small: the size of the filesystem and the needed tools is incredibly small. It consists of a few kernel modules and a small set of userspace tools. All the kernel modules together add up to about 2.5MB in size and the userspace package is a mere 800KB.
  • integrated: it's a native Linux filesystem, so it makes use of all the normal kernel infrastructure. There is no duplication of structures or caches; it fits right into the standard Linux filesystem framework.
  • part of Oracle Linux/UEK: ocfs2, like other Linux filesystems, is built as kernel modules. When customers use Oracle Linux's UEK or UEK2, we automatically compile the kernel modules for the filesystem. Other distributions like SLES have done the same. We fully support OCFS2 as part of Oracle Linux as a general purpose cluster filesystem.
  • feature rich:
    OCFS2 is POSIX-compliant
    Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
    REFLINKs (inode-based writeable snapshots)
    Indexed Directories
    Metadata Checksums
    Extended Attributes (unlimited number of attributes per inode)
    Advanced Security (POSIX ACLs and SELinux)
    User and Group Quotas
    Variable Block and Cluster sizes
    Journaling (Ordered and Writeback data journaling modes)
    Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
    Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    In-built Clusterstack with a Distributed Lock Manager
    Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
  • One of the main features added most recently is global heartbeat. OCFS2 as a filesystem was typically used with what's called local heartbeat: basically, for every filesystem you mounted, it would start its own local heartbeat and membership mechanism. The disk heartbeat means a disk IO every 1 or 2 seconds for every node in the cluster, for every device. This was never a problem when the number of mounted volumes was relatively small, but once customers were using 20+ volumes the overhead of the multiple disk heartbeats became significant and at times became a stability issue.

    Global heartbeat was written to provide a solution to these multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread; you can then mount many other volumes that do not have their own, and the heartbeat is shared amongst that one, or those few, threads, significantly reducing disk IO overhead.

    I was playing with this a little bit the other day and noticed that it wasn't very well documented, so why not write it up here and share it with everyone. Getting started with OCFS2 is really easy, and within just a few minutes it is possible to have a complete installation.

    I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, this shared storage device should typically be either a shared SAN disk or an iSCSI device, but it is also possible with Oracle Linux and UEK2 to create a shared virtual device on an NFS server and use this device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based. I just wrote a blog entry about how to do that here.

    While it is technically possible to create a working ocfs2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real world use. In any cluster environment it's highly recommended to have a private network for cluster traffic. The biggest reason for instability in a clustering environment is a bad/unreliable network and/or storage. Many times the environment has an overloaded network that causes network heartbeats to fail, or disks where failover takes longer than the configured timeouts, and the only alternative we have at that point is to reboot the node(s).

    Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to get the latest packages and the latest kernel version installed, and then do a reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course all this is freely accessible from http://public-yum.oracle.com.

    Depending on the type of installation you did (basic, minimal, etc...) you may or may not have to add RPMs. Do a simple check, rpm -q ocfs2-tools, to see if the tools are installed; if not, just run yum install ocfs2-tools. And that's it: all required software is now installed. The kernel modules are already part of the UEK2 kernel and the required tools (mkfs, fsck, o2cb, ...) are part of the ocfs2-tools RPM.

    Next up: create the filesystem on the shared disk device and configure the cluster.

    One requirement for using global heartbeat is that the heartbeat device needs to be a NON-partitioned disk. Other OCFS2 volumes you want to create and mount can be on partitioned disks, but a device for the heartbeat needs to be on an empty disk. Let's assume /dev/sdb in this example.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 \
    --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb
    This creates a filesystem with a 4K block size (the normal value) and a 4K cluster size (if you have many small files, this is a good value; if you have few, large files, go up towards 1M).

    The journal size is 4M; if you have a large filesystem with a lot of metadata changes you might want to increase this. I did not add an option for 32-bit or 64-bit journals; if you want to create huge filesystems, use the block64 journal option, which uses JBD2.

    The filesystem is created for 4 nodes (-N 4); if your cluster needs to grow larger, you can always change this later with tunefs.ocfs2.

    The label is ocfs2vol1; this is a disk label you can later use to mount the filesystem by label.
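
    For example, mounting by label would then be something along the lines of (the mount point is just an example):

    # mount -L ocfs2vol1 /mountpoint1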

    cluster-name=ocfs2: this is the default name, but if you want your own name for your cluster you can put a different value here. Remember it, because you will need to configure the cluster stack with the same cluster name later.

    cluster-stack=o2cb: it is possible to use different cluster stacks, such as pacemaker or cman; here we use the in-kernel o2cb stack.

    global-heartbeat: makes sure the filesystem is prepared and built to support global heartbeat.

    /dev/sdb: the device to use for the filesystem.

    
    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 \
    --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
    mkfs.ocfs2 1.8.0
    Cluster stack: o2cb
    Cluster name: ocfs2
    Stack Flags: 0x1
    NOTE: Feature extended slot map may be enabled
    Overwriting existing ocfs2 partition.
    WARNING: Cluster check disabled.
    Proceed (y/N): y
    Label: ocfs2vol1
    Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
    Block size: 4096 (12 bits)
    Cluster size: 4096 (12 bits)
    Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
    Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
    Extent allocator size: 4194304 (1 groups)
    Journal size: 4194304
    Node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 2 block(s)
    Formatting Journals: done
    Growing extent allocator: done
    Formatting slot map: done
    Formatting quota files: done
    Writing lost+found: done
    mkfs.ocfs2 successful
    

    Now, we just have to configure the o2cb stack and we're done.

  • add the cluster: o2cb add-cluster ocfs2
  • add the nodes:
    o2cb add-node --ip 192.168.199.1 --number 0 ocfs2 host1
    o2cb add-node --ip 192.168.199.2 --number 1 ocfs2 host2
  • it is very important to use the hostname of the server (the name you get when typing hostname) for each node!
  • add the heartbeat device:
    run the following command (mounted.ocfs2 -d) and take the UUID value of the filesystem/device you want to use for heartbeat:
    # mounted.ocfs2 -d
    Device      Stack  Cluster  F  UUID                              Label
    /dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1
    
    o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
  • enable global heartbeat: o2cb heartbeat-mode ocfs2 global
  • start the cluster stack: /etc/init.d/o2cb enable
  • verify that the stack is up and running: o2cb cluster-status
  • That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat timeout values and also whether or not to start the cluster stack at boot time.
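
    For reference, after these steps the generated /etc/ocfs2/cluster.conf on this two-node example should look roughly like this (the exact layout and ordering can differ a bit between ocfs2-tools versions):

    cluster:
            heartbeat_mode = global
            node_count = 2
            name = ocfs2

    node:
            number = 0
            cluster = ocfs2
            ip_port = 7777
            ip_address = 192.168.199.1
            name = host1

    node:
            number = 1
            cluster = ocfs2
            ip_port = 7777
            ip_address = 192.168.199.2
            name = host2

    heartbeat:
            cluster = ocfs2
            region = 244A6AAAE77F4053803734530FC4E0B7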

    Now that the first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes; you just need an exact copy everywhere. You also do not need to redo the above commands, except: 1) make sure ocfs2-tools is installed everywhere, and 2) if you want to start at boot time, re-run /etc/init.d/o2cb configure on the other nodes as well. From here on, you can just mount your filesystems:

    mount /dev/sdb /mountpoint1 on each node.
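
    If you also want the volumes mounted automatically at boot (with o2cb configured to start at boot time as described above), the usual approach is an /etc/fstab entry along these lines; the _netdev option makes sure the mount is only attempted once the network and cluster stack are up (the mount point is just an example):

    /dev/sdb   /mountpoint1   ocfs2   _netdev,defaults   0 0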

    If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat you will still have just one (or a few) heartbeat threads going.

    have fun...

    Here is some vmstat output. The first run shows a single global heartbeat with 8 mounted filesystems; the second run shows the same 8 mounted filesystems, each with its own local heartbeat. Even though the IO volume is low, it shows that about 8x more IOs are happening (from 1 every other second to about 4 every second). As these are small IOs that constantly move the disk head to a specific place, they will interrupt performance if you have one per device. Hopefully this shows the benefits of global heartbeat.

    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
     0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
     0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
     0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
     0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
     0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
     0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0
    
    
    
    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
     0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
     0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
     0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0
    
    

    Wednesday Jul 04, 2012

    What's up with OCFS2?

    On Linux there are many filesystem choices, and even from Oracle we provide a number of filesystems, all with their own advantages and use cases. Customers often confuse ACFS with OCFS or OCFS2, which leads to assumptions such as one replacing the other, etc... I thought it would be good to write up a summary of how OCFS2 got to where it is, what we're still up to, how it is different from other options, and why this really is a cool native Linux cluster filesystem that we worked on for many years and that is still widely used.

    Work on a cluster filesystem at Oracle started many years ago, in the early 2000s, when the Oracle Database Cluster development team wrote a cluster filesystem for Windows that was primarily focused on providing an alternative to raw disk devices and helping customers with the deployment of Oracle Real Application Clusters (RAC). Oracle RAC is a cluster technology that lets us make a cluster of Oracle Database servers look like one big database. The RDBMS runs on many nodes and they all work on the same data. It's a shared-disk database design. There are many advantages to doing this, but I will not go into detail as that is not the purpose of my write-up. Suffice it to say that Oracle RAC expects all the database data to be visible in a consistent, coherent way across all the nodes in the cluster. To do that, there were/are a few options: 1) use raw disk devices that are shared through SCSI, FC or iSCSI; 2) use a network filesystem (NFS); 3) use a cluster filesystem (CFS), which basically gives you a filesystem that's coherent across all nodes using shared disks. It is sort of (but not quite) a combination of options 1 and 2, except that you don't do network access to the files; the files are effectively locally visible as if it were a local filesystem.

    So OCFS (Oracle Cluster FileSystem) on Windows was born. Since Linux was becoming a very important and popular platform, we decided that we would also make this available on Linux, and thus the porting of OCFS/Windows started. The first version of OCFS was primarily focused on replacing the use of raw devices with a simple filesystem that lets you create files and provide direct IO to these files to get basically native raw disk performance. The filesystem was not designed to be fully POSIX compliant and it did not have anywhere near decent performance for regular file create/delete/access operations. Cache coherency was easy since it was basically always direct IO down to the disk device; this ensured that any time one issued a write() it would go directly down to the disk and not return until the write() was completed. Same for read(): any read from a datafile went all the way to disk and back. We did not cache any data when it came to Oracle data files.

    So while OCFS worked well for that, it did not have much of a normal filesystem feel, and it was not something that could be submitted to the kernel mailing list for inclusion into Linux as another native Linux filesystem (setting aside the Windows porting code...). It did its job well: it was very easy to configure, node membership was simple, locking was disk based (so very slow, but it existed), and you could create regular files and do regular filesystem operations to a certain extent, but anything that was not related to database data files was just not very useful in general. Log files, ok; standard filesystem use, not so much. Up to this point, all the work was done at Oracle, by Oracle developers.

    Once OCFS (1) had been out for a while and there was a lot of use in the database RAC world, many customers wanted to do more and were asking for features that you'd expect in a normal native filesystem, a real "general purpose cluster filesystem". So the team sat down and basically started from scratch to implement what's now known as OCFS2 (Oracle Cluster FileSystem release 2). Some basic criteria were:

  • Design it with a real Distributed Lock Manager and use the network for lock negotiation instead of the disk
  • Make it a native Linux filesystem instead of a portable core with a native shim layer
  • Support standard POSIX compliance and be fully cache coherent for all operations
  • Support all the filesystem features Linux offers (ACL, extended Attributes, quotas, sparse files,...)
  • Be modern: support large files, 32/64-bit, journaling, ordered-data journaling, endian neutrality (we can mount across endianness and architectures), ...

    Needless to say, this was a huge development effort that took many years to complete. A few big milestones happened along the way...

  • OCFS2 was developed in the open; we did not have a private tree that we worked on without external code review. Linux filesystem maintainers and great folks like Christoph Hellwig reviewed the code regularly to make sure we were not doing anything out of line, and we submitted the code for review on lkml a number of times to see if we were getting close to having it included into the mainline kernel. Using this development model is standard practice for anyone who wants to write code that goes into the kernel and have any chance of doing so without a complete rewrite or... shall I say, flamefest, when submitted. It saved us a tremendous amount of time by not having to re-fit code to get it into a Linus-acceptable state. Some other filesystems that tried to get into the kernel without following an open development model had a much harder time and faced a lot harsher criticism.
  • In March 2006, when Linus released 2.6.16, OCFS2 officially became part of the mainline kernel (it had been accepted a little earlier, in the release candidates). It was the first cluster filesystem to make it into the mainline kernel tree. Our hope was that it would then get picked up by the distribution vendors, making it easy for everyone to have access to a CFS. Today the source code for OCFS2 is approximately 85,000 lines of code.
  • We made OCFS2 production-ready, with full support for customers that ran the Oracle database on Linux; no extra or separate support contract needed. OCFS2 1.0.0 started being built for RHEL4 for x86, x86-64, ppc, s390x and ia64. For RHEL5 we started with OCFS2 1.2.
  • SuSE was very interested in high availability and clustering and decided to build and include OCFS2 with SLES9 for their customers and was, next to Oracle, the main contributor to the filesystem for both new features and bug fixes.
  • Source code was always available, even prior to inclusion into mainline, and as of 2.6.16 the source code is just part of a Linux kernel download from kernel.org, which it still is today. So the latest OCFS2 code is always in the upstream mainline Linux kernel.
  • OCFS2 is the cluster filesystem used in Oracle VM 2 and Oracle VM 3 as the virtual disk repository filesystem.
  • Since the filesystem is in the Linux kernel, it's released under the GPL v2.
  • The release model has always been that new feature development happened in the mainline kernel, and we then built consistent, well-tested snapshots that had versions: 1.2, 1.4, 1.6, 1.8. But these releases were effectively just snapshots in time that were tested for stability and release quality.

    OCFS2 is very easy to use: there's a simple text file that contains the node information (hostname, node number, cluster name) and a file that contains the cluster heartbeat timeouts. It is very small and very efficient. As Sunil Mushran wrote in the manual:

  • OCFS2 is an efficient, easily configured, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system.
    Here is a list of some of the important features that are included:

  • Variable Block and Cluster sizes: Supports block sizes ranging from 512 bytes to 4 KB and cluster sizes ranging from 4 KB to 1 MB (increments in power of 2).
  • Extent-based Allocations: Tracks the allocated space in ranges of clusters, making it especially efficient for storing very large files.
  • Optimized Allocations: Supports sparse files, inline-data, unwritten extents, hole punching and allocation reservation for higher performance and efficient storage.
  • File Cloning/snapshots: REFLINK is a feature which introduces copy-on-write clones of files in a cluster coherent way.
  • Indexed Directories: Allows efficient access to millions of objects in a directory.
  • Metadata Checksums: Detects silent corruption in inodes and directories.
  • Extended Attributes: Supports attaching an unlimited number of name:value pairs to the file system objects like regular files, directories, symbolic links, etc.
  • Advanced Security: Supports POSIX ACLs and SELinux in addition to the traditional file access permission model.
  • Quotas: Supports user and group quotas.
  • Journaling: Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.
  • Endian and Architecture neutral: Supports a cluster of nodes with mixed architectures. Allows concurrent mounts on nodes running 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
  • In-built Cluster-stack with DLM: Includes an easy to configure, in-kernel cluster-stack with a distributed lock manager.
  • Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os: Supports all modes of I/Os for maximum flexibility and performance.
  • Comprehensive Tools Support: Provides a familiar EXT3-style tool-set that uses similar parameters for ease-of-use.
  • The filesystem was distributed for Linux distributions in separate RPM form, and this had to be built for every single kernel errata release or every updated kernel provided by the vendor. We provided builds from Oracle for Oracle Linux and all kernels released by Oracle, and for Red Hat Enterprise Linux. SuSE provided the modules directly for every kernel they shipped. With the introduction of the Unbreakable Enterprise Kernel for Oracle Linux and our interest in reducing the overhead of building filesystem modules for every minor release, we decided to make OCFS2 available as part of UEK. There was no more need for separate kernel modules; everything was built in, and a kernel upgrade automatically updated the filesystem, as it should. UEK allowed us to avoid backporting new upstream filesystem code into an older kernel version; backporting features into older versions introduces risk and requires extra testing because the code is basically partially rewritten. The UEK model works really well for continuing to provide OCFS2 without that extra overhead.

    Because the RHEL kernel did not contain OCFS2 as a kernel module (it is in the source tree but it is not built by the vendor in kernel module form), we stopped adding the extra packages to Oracle Linux with its RHEL-compatible kernel, and for RHEL. Oracle Linux customers/users obviously get OCFS2 included as part of the Unbreakable Enterprise Kernel, SuSE customers get it from SuSE distributed with SLES, and Red Hat can decide to distribute OCFS2 to their customers if they choose to, as it's just a matter of compiling the module and making it available.

    OCFS2 today, in the mainline kernel, is pretty much feature complete in terms of integration with every filesystem feature Linux offers, and it is still actively maintained, with Joel Becker being the primary maintainer. Since we use OCFS2 as part of Oracle VM, we continue to look at interesting new functionality to add (REFLINK was a good example), and as such we continue to enhance the filesystem where it makes sense. Bug fixes and any code that goes into the mainline Linux kernel and affects filesystems automatically modify OCFS2 as well, so it is in the kernel and actively maintained, but not a lot of new development is happening at this time. We continue to fully support OCFS2 as part of Oracle Linux and the Unbreakable Enterprise Kernel, and other vendors make their own decisions on support, as it's really a Linux cluster filesystem now more than something that we provide to customers. It really is just part of Linux, like EXT3 or BTRFS etc.; the OS distribution vendors decide.

    Do not confuse OCFS2 with ACFS (ASM Cluster Filesystem), also known as Oracle Cloud Filesystem. ACFS is a filesystem that's provided by Oracle on various OS platforms and really integrates into Oracle ASM (Automatic Storage Management). It's a very powerful cluster filesystem, but it's not distributed as part of the operating system; it's distributed with the Oracle Database product and installs with and lives inside Oracle ASM. ACFS obviously is fully supported on Linux (Oracle Linux, Red Hat Enterprise Linux), but OCFS2, independently, as a native Linux filesystem, also is and continues to be supported. ACFS is very much tied into the Oracle RDBMS; OCFS2 is just a standard native Linux filesystem with no ties into Oracle products. Customers running the Oracle database and ASM really should consider using ACFS, as it also provides storage/clustered volume management. Customers wanting a simple, easy to use, generic Linux cluster filesystem should consider using OCFS2.

    To learn more about OCFS2 in detail, you can find good documentation on http://oss.oracle.com/projects/ocfs2 in the Documentation area, or get the latest mainline kernel from http://kernel.org and read the source.

    One final, unrelated note - since I am not always able to publicly answer or respond to comments, I do not want to selectively publish comments from readers. Sometimes I forget to publish comments, sometime I publish them and sometimes I would publish them but if for some reason I cannot publicly comment on them, it becomes a very one-sided stream. So for now I am going to not publish comments from anyone, to be fair to all sides. You are always welcome to email me and I will do my best to respond to technical questions, questions about strategy or direction are sometimes not possible to answer for obvious reasons.

    Saturday Jan 28, 2012

    The latest bits around ocfs2

    It's been a while since we last posted something about ocfs2 on our Oracle blogs but that doesn't mean the filesystem hasn't evolved. I had to write a bit of a summary for a customer so I figured it would be a good idea to just add a blog entry and document it here as well.

    OCFS2 is a native Linux cluster filesystem that has been around for quite a few years now; it was developed at Oracle and has also had a ton of contributions from the folks at SuSE over several years. The filesystem got officially merged into 2.6.16, and all the changes since have been going into mainline first and then trickled down into the versions we build for Linux distributions. So we have ocfs2 versions 1.2, 1.4, 1.6 (1.8 for Oracle VM 3), which are specific snapshots of the filesystem code, released for specific kernels like 2.6.18 or 2.6.32.

    SLES has a version of ocfs2 that they build; other vendors decided not to compile in the filesystem, so for Oracle Linux we of course make sure we have current versions available as well. We also provide support for the filesystem as part of Oracle Linux support. You do not need to buy extra clustering or filesystem add-on options; the code is part of Oracle Linux and the support is part of our regular Oracle Linux support subscriptions.

    Many ocfs2 users use the filesystem as an alternative to NFS; when I read the posts on the ocfs2 public mailing lists, this is a comment that comes back frequently, so there must be some truth to it as these are all unsolicited third-party comments :)... One nice thing with ocfs2 is that it's very easy to set up: just a simple text config file on each node with the list of hostnames and IP addresses, and you're basically good to go. One does need shared storage, as it's a real cluster filesystem. This shared storage can be iSCSI, SAN/FC or shared SCSI, and we highly recommend a private network so that you can isolate the cluster traffic. The main problem reports we get tend to be due to overloaded servers. In a cluster filesystem you have to ensure that you really know what is going on with all servers; otherwise there is the potential for data corruption. This means that if a node gets in trouble, say an overloaded network or running out of memory, the cluster stack will likely end up halting or rebooting that node so that the other servers can happily continue. A large percentage of customer reports tend to be related to misconfigured networks (sharing the interconnect/cluster traffic with everything else) or bad/slow disk subsystems that get overloaded so that the heartbeat IOs cannot make it to the device.

    One of the reasons ocfs2 is so trivial to configure is that the entire ecosystem is integrated. It comes with its own embedded clustering stack, o2cb. This mini, specialized cluster stack provides node membership, heartbeat services and a distributed lock manager. It is not designed to be a general purpose userspace cluster stack but is really tailored to the basic requirements of our filesystem. Another really cool feature, I'd say it's in my top 3 cool features list for ocfs2, is dlmfs. dlmfs is a virtual filesystem that exposes a few simple lock types: shared read, exclusive and trylock. There's a libo2dlm to use this in applications, or you can simply use your shell to create a domain and locks just by doing mkdir and touch of files. Someone with some time on their hands could theoretically, with some shell magic, hack together a little userspace cluster daemon that monitors applications or nodes and handles start, stop and restart. It's on my todo list but I haven't had time :) anyway, it's a very nifty feature.
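
    As a quick sketch of the shell approach (names made up; this assumes the o2cb cluster is online and ocfs2_dlmfs is mounted on /dlm):

        # mkdir /dlm/testdomain
        # touch /dlm/testdomain/lock1
        # rm /dlm/testdomain/lock1
        # rmdir /dlm/testdomain

    The lock level taken on open depends on the open mode, just as in the dlm.py example in the dlmfs entry earlier on this page.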

    Anyway, I digress... so one of the customer questions I had recently was about what's going on with ocfs2 and whether there has been any development effort. I decided to go look at the Linux kernels since 2.6.27 and collect the list of changes that have gone in since. These features are also in our latest ocfs2 as part of Oracle VM 3.0 and, for the most part, also in our kernel (the Unbreakable Enterprise Kernel). Here is the list; I think it's pretty impressive:

    
        Remove JBD compatibility layer 
        Add POSIX ACLs
        Add security xattr support (extended attributes for SELinux)
        Implement quota recovery 
        Periodic quota syncing 
        Implementation of local and global quota file handling
        Enable quota accounting on mount, disable on umount 
        Add a name indexed b-tree to directory inodes 
        Optimize inode allocation by remembering last group
        Optimize inode group allocation by recording last used group.
        Expose the file system state via debugfs 
        Add statistics for the checksum and ecc operations.
        Add CoW support. (reflink is unlimited inode-based (file based) writeable snapshots - very very useful for virtualization)
        Add ioctl for reflink. 
        Enable refcount tree support. 
        Always include ACL support 
        Implement allocation reservations, which reduces fragmentation significantly 
        Optimize punching-hole code, speeds up significantly some rare operations 
        Discontiguous block groups, necessary to improve some kind of allocations. It is a feature that marks an incompatible bit, ie, it makes a forward-compatible change
        Make nointr ("don't allow file operations to be interrupted") a default mount option 
        Allow huge (> 16 TiB) volumes to mount   (support for huge volumes)
        Add a mount option "coherency=*" to handle cluster coherency for O_DIRECT writes.
        Add new OCFS2_IOC_INFO ioctl: offers the none-privileged end-user a possibility to get filesys info gathering 
        Add support for heartbeat=global mount option (instead of having a heartbeat per filesystem you can now have a single heartbeat)
        SSD trimming support 
        Support for moving extents (preparation for defragmentation)
    
    

    There are a number of external articles written about ocfs2; one that I found is here.

    Have fun...

    About

    Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

    You can follow him on Twitter at @wimcoekaerts
