« dm nfs | Main | dlmfs »

OCFS2 global heartbeat

A cool, but often missed feature in Oracle Linux is the inclusion of OCFS2. OCFS2 is a native Linux clusterfilesystem which was written many years ago at Oracle (hence the name Oracle Cluster Filesystem) and which got included in the mainline Linux kernel around 2.6.16 somewhere back in 2005. The filesystem is widely used and has a number of really cool features.

  • simplicity : it's incredibly easy to configure the filesystem and clusterstack. There is literally one small text-based config file.
  • complete : ocfs2 contains all the components needed : a nodemanager, a heartbeat, a distributed lock manager and the actual cluster filesystem
  • small : the size of the filesystem and the needed tools is incredibly small. It consists of a few kernel modules and a small set of userspace tools. All the kernel modules together add up to about 2.5Mb in size and the userspace package is a mere 800Kb.
  • integrated : it's a native Linux filesystem so it makes use of all the normal kernel infrastructure. There is no duplication of structures caches, it fits right into the standard Linux filesystem structure.
  • part of Oracle Linux/UEK : ocfs2, like other linux filesystems, is built as kernel modules. When customers use Oracle Linux's UEK or UEK2, we automatically compile the kernel modules for the filesystem. Other distributions like SLES have done the same. We fully support OCFS2 as part of Oracle Linux as a general purpose cluster filesystem.
  • feature rich :
    OCFS2 is POSIX-compliant
    Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
    REFLINKs (inode-based writeable snapshots)
    Indexed Directories
    Metadata Checksums
    Extended Attributes (unlimited number of attributes per inode)
    Advanced Security (POSIX ACLs and SELinux)
    User and Group Quotas
    Variable Block and Cluster sizes
    Journaling (Ordered and Writeback data journaling modes)
    Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
    Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    In-built Clusterstack with a Distributed Lock Manager
    Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
  • One of the main features added most recently is Global Heartbeat. OCFS2 as a filesystem typically was used with what's called local heartbeat. Basically for every filesystem you mounted, it would start its own local heartbeat, membership mechanism. The disk heartbeat means a disk io every 1 or 2 seconds for every node in the cluster, for every device. It was never a problem when the number of mounted volumes was relatively small but once customers were using 20+ volumes the overhead of the multiple disk heartbeats became significant and at times became a stability issue.

    global heartbeat was written to provide a solution to the multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread and then you can mount many other volumes that do not have their own and the heartbeat is shared amongst those one, or few threads and as such significantly reducing disk IO overhead.

    I was playing with this a little bit the other day and noticed that this wasn't very well documented so why not write it up here and share it with everyone. Getting started with OCFS2 is just really easy and withing just a few minutes it is possible to have a complete installation.

    I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, typically this shared storage device should be either a shared SAN disk or an iscsi device but it is also possible with Oracle Linux and UEK2 to create a shared virtual device on an nfs server and use this device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based.I just wrote a blog entry about how to do that here.

    While it is technically possible to create a working ocfs2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real world use. In any cluster environment it's highly recommended to have a private network for cluster traffic.The biggest reason for instability in a clustering environment is a bad/unreliable network and/or storage. Many times the environment has an overloaded network which causes network heartbeats to fail or disks where failover takes longer than the default configuration and the only alternative we have at that point, is to reboot the node(s).

    Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to have the latest packages and also latest kernel version installed and then do a reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course all this is freely accessibly from http://public-yum.oracle.com.

    Depending on the type of installation you did (basic, minimal, etc...) you may or may not have to add RPMs. Do a simple check rpm -q ocfs2-tools to see if the tools are installed, if not, just run yum install ocfs2-tools. And that's it. All required software is now installed. The kernel modules are already part of the uek2 kernel and the required tools (mkfs, fsck, o2cb,..) are part of the ocfs2-tools RPM.

    Next up: create the filesystem on the shared disk device and configure the cluster.

    One requirement for using global heartbeat is that the heartbeat device needs to be a NON-partitioned disk. Other OCFS2 volumes you want to create and mount can be on partitioned disks, but a device for the heartbeat needs to be on an empty disk. Let's assume /dev/sdb in this example.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 \
    --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb
    This creates a filesystem with a 4K blocksize (normal value), clustersize of 4K (if you have many small files, this is a good value, if you have few large files, go to 1M).

    Journalsize of 4M if you have a large filesystem with a lot of metadata changes you might want to increase this. I did not add an option for 32bit or 64bit journals. if you want to create huge filesystems then use block64 which uses jbd2.

    The filesystem is created for 4 nodes (-N 4) this can be modified if your cluster needs to grow larger so you can always tune this with tunefs.ocfs2.

    Label ocfs2vol1, this is a disklabel you can later use to mount by label a filesystem.

    clustername=ocfs2, this is the default name but if you want to have your own name for your cluster you can put a different value here, remember it because you will need to configure the clusterstack with the clustername later.

    cluster-stack=o2cb : it is possible to have different cluster-stacks used such as pacemaker or cman.

    global-heartbeat : make sure that the filesystem is prepared and built to support global heartbeat

    /dev/sdb : the device to use for the filesystem.

    
    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 \
    --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
    mkfs.ocfs2 1.8.0
    Cluster stack: o2cb
    Cluster name: ocfs2
    Stack Flags: 0x1
    NOTE: Feature extended slot map may be enabled
    Overwriting existing ocfs2 partition.
    WARNING: Cluster check disabled.
    Proceed (y/N): y
    Label: ocfs2vol1
    Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
    Block size: 4096 (12 bits)
    Cluster size: 4096 (12 bits)
    Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
    Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
    Extent allocator size: 4194304 (1 groups)
    Journal size: 4194304
    Node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 2 block(s)
    Formatting Journals: done
    Growing extent allocator: done
    Formatting slot map: done
    Formatting quota files: done
    Writing lost+found: done
    mkfs.ocfs2 successful
    

    Now, we just have to configure the o2cb stack and we're done.

  • add the cluster : o2cb add-cluster ocfs2
  • add the nodes :
    o2cb add-node --ip 192.168.199.1 --number 0 ocfs2 host1
    o2cb add-node --ip 192.168.199.2 --number 1 ocfs2 host2
  • it is very important to use the hostname of the server (the name you get when typing hostname) for each node!
  • add the heartbeat device:
    run the following command and take the UUID value of the filesystem/device you want to use for heartbeat mounted.ocfs2 -d
    # mounted.ocfs2 -d
    Device      Stack  Cluster  F  UUID                              Label
    /dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1
    
    o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
  • enable global heartbeat o2cb heartbeat-mode ocfs2 global
  • start the clusterstack : /etc/init.d/o2cb enable
  • verify that the stack is up and running : o2cb cluster-status
  • That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat time out values and also whether or not to start the clusterstack at boot time.

    Now that a first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes, you just need to have an exact copy everywhere. You also do not need to redo the above commands, except 1) make sure ocfs2-tools is installed everywhere and if you want to start at boot time, re-run the /etc/init.d/o2cb configure on the other nodes as well. From here on, you can just mount your filesystems :

    mount /dev/sdb /mountpoint1 on each node.

    If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat, you will just have one (or a few) hb's going on.

    have fun...

    Here is vmstat output, the first output shows a single heartbeat and 8 mounted filesystems, the second vmstat output shows 8 mounted filesystems with their own local heartbeat. Even though the IO amount is low, it shows that there are about 8x more IOs happening (from 1 every other second to 4 every second). As these are small IOs, they will move the diskhead to a specific place all the time and interrupt performance if you have it on each device. Hopefully this shows the benefits of global heartbeat.

    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
     0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
     0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
     0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
     0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
     0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
     0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0
    
    
    
    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
     0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
     0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
     0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0
    
    
    « dm nfs | Main | dlmfs »
    Comments:

    Just tried this and it won't work.
    I'm using oracle Linux 6.4 with the stock uek kernel.

    Some googling suggest that it might have something to do with symlinks in /sys/block and the fact that I'm running this in a xen VM.

    Posted by guest on June 08, 2013 at 12:23 AM PDT #

    a new version of ocfs2-tools should work. make sure you run with the very latest ocfs2-tools update

    Posted by guest on June 13, 2013 at 09:34 AM PDT #

    Is there any place I can get this updated version of ocfs2-tools package? The connection speed to the official repos are way too slow to practically apply any changes automatically.

    Posted by guest on June 13, 2013 at 06:18 PM PDT #

    The requested URL could not be retrieved

    --------------------------------------------------------------------------------

    While trying to retrieve the URL: http://public-yum-qa.oracle.com/repo/OracleLinux/OL6/latest/x86_64/ocfs2-tools-1.8.0-11.el6.x86_64.rpm

    The following error was encountered:

    Unable to determine IP address from host name for public-yum-qa.oracle.com

    The dnsserver returned:

    Name Error: The domain name does not exist.

    This means that:
    The cache was not able to resolve the hostname presented in the URL.
    Check if the address is correct.

    Your cache administrator is webmaster.

    --------------------------------------------------------------------------------

    Generated Fri, 14 Jun 2013 01:25:47 GMT by google.com (squid/2.7.STABLE6)

    Posted by guest on June 13, 2013 at 06:27 PM PDT #

    argh sorry!! drop the -qa

    Posted by guest on June 13, 2013 at 06:40 PM PDT #

    Post a Comment:
    • HTML Syntax: NOT allowed
    About

    Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

    You can follow him on Twitter at @wimcoekaerts

    Search

    Categories
    Archives
    « April 2014
    SunMonTueWedThuFriSat
      
    1
    2
    3
    4
    5
    6
    7
    9
    10
    11
    12
    13
    14
    15
    16
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
       
           
    Today