Thursday Jan 03, 2013

OCFS2 global heartbeat

A cool, but often missed feature in Oracle Linux is the inclusion of OCFS2. OCFS2 is a native Linux cluster filesystem which was written many years ago at Oracle (hence the name Oracle Cluster Filesystem) and which was merged into the mainline Linux kernel in version 2.6.16, back in early 2006. The filesystem is widely used and has a number of really cool features.

  • simplicity: it's incredibly easy to configure the filesystem and cluster stack. There is literally one small text-based config file.
  • complete: ocfs2 contains all the components needed: a node manager, a heartbeat, a distributed lock manager and the actual cluster filesystem
  • small: the size of the filesystem and the needed tools is incredibly small. It consists of a few kernel modules and a small set of userspace tools. All the kernel modules together add up to about 2.5 MB in size and the userspace package is a mere 800 KB.
  • integrated: it's a native Linux filesystem so it makes use of all the normal kernel infrastructure. There is no duplication of structures or caches; it fits right into the standard Linux filesystem structure.
  • part of Oracle Linux/UEK: ocfs2, like other Linux filesystems, is built as kernel modules. When customers use Oracle Linux's UEK or UEK2, we automatically compile the kernel modules for the filesystem. Other distributions like SLES have done the same. We fully support OCFS2 as part of Oracle Linux as a general purpose cluster filesystem.
  • feature rich:
    OCFS2 is POSIX-compliant
    Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
    REFLINKs (inode-based writeable snapshots)
    Indexed Directories
    Metadata Checksums
    Extended Attributes (unlimited number of attributes per inode)
    Advanced Security (POSIX ACLs and SELinux)
    User and Group Quotas
    Variable Block and Cluster sizes
    Journaling (Ordered and Writeback data journaling modes)
    Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
    Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    In-built Clusterstack with a Distributed Lock Manager
    Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
  • One of the main features added most recently is Global Heartbeat. OCFS2 as a filesystem was typically used with what's called local heartbeat: basically, for every filesystem you mounted, it would start its own local heartbeat/membership mechanism. The disk heartbeat means a disk I/O every one or two seconds for every node in the cluster, for every device. That was never a problem when the number of mounted volumes was relatively small, but once customers were using 20+ volumes the overhead of the multiple disk heartbeats became significant and at times turned into a stability issue.

    Global heartbeat was written to provide a solution to these multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread; you can then mount many other volumes that do not have their own, and the heartbeat is shared amongst that one thread (or those few threads), significantly reducing disk I/O overhead.

    I was playing with this a little bit the other day and noticed that it wasn't very well documented, so why not write it up here and share it with everyone. Getting started with OCFS2 is really easy, and within just a few minutes it is possible to have a complete installation.

    I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, this shared storage device should typically be either a shared SAN disk or an iSCSI device, but with Oracle Linux and UEK2 it is also possible to create a shared virtual device on an NFS server and use this device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based. I just wrote a blog entry about how to do that here.

    While it is technically possible to create a working ocfs2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real-world use. In any cluster environment it's highly recommended to have a private network for cluster traffic. The biggest reason for instability in a clustering environment is a bad/unreliable network and/or storage. Many times the environment has an overloaded network, which causes network heartbeats to fail, or disks where failover takes longer than the default timeouts, and the only alternative we have at that point is to reboot the node(s).

    Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to get the latest packages and the latest kernel version installed, and then do a reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course all this is freely accessible from http://public-yum.oracle.com.
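
    For reference, on each node that boils down to something like the following (the kernel version shown is simply the one current at the time of writing):

    # yum update -y      # pull in the latest packages and UEK2 kernel
    # reboot             # boot into the newly installed kernel
    # uname -r           # verify the running kernel version
    2.6.39-300.17.3.el6uek.x86_64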

    Depending on the type of installation you did (basic, minimal, etc...) you may or may not have to add RPMs. Do a simple check with rpm -q ocfs2-tools to see if the tools are installed; if not, just run yum install ocfs2-tools. And that's it. All required software is now installed. The kernel modules are already part of the UEK2 kernel and the required tools (mkfs, fsck, o2cb, ...) are part of the ocfs2-tools RPM.
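
    In other words, the whole software setup is just:

    # rpm -q ocfs2-tools        # check whether the tools are already installed
    # yum install ocfs2-tools   # install them if the query comes back empty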

    Next up: create the filesystem on the shared disk device and configure the cluster.

    One requirement for using global heartbeat is that the heartbeat device needs to be a non-partitioned disk. Other OCFS2 volumes you want to create and mount can live on partitioned disks, but the device for the heartbeat needs to be a whole, empty disk. Let's assume /dev/sdb in this example.
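
    A quick way to double-check that the disk you picked (in this example /dev/sdb) really is unpartitioned:

    # fdisk -lu /dev/sdb    # a blank disk reports "doesn't contain a valid partition table"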

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 \
    --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb
    This creates a filesystem with a 4K block size (a normal value) and a cluster size of 4K (a good value if you have many small files; if you have few, large files, go up to 1M).

    Journal size of 4M; if you have a large filesystem with a lot of metadata changes you might want to increase this. I did not add an option for 32-bit or 64-bit journals; if you want to create huge filesystems, use block64, which uses JBD2.

    The filesystem is created for 4 nodes (-N 4); this can be modified if your cluster needs to grow larger, so you can always tune it later with tunefs.ocfs2.
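
    For example, if the cluster later grows to 8 nodes, something along these lines would bump the number of node slots:

    # tunefs.ocfs2 -N 8 /dev/sdb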

    Label ocfs2vol1: this is a disk label you can later use to mount the filesystem by label.
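
    For instance, instead of referring to the device name, you could mount it with:

    # mount -L ocfs2vol1 /mountpoint1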

    cluster-name=ocfs2: this is the default name, but if you want your own name for your cluster you can put a different value here. Remember it, because you will need to configure the cluster stack with this cluster name later.

    cluster-stack=o2cb: it is possible to use different cluster stacks, such as Pacemaker or CMAN.

    global-heartbeat: makes sure that the filesystem is prepared and built to support global heartbeat.

    /dev/sdb: the device to use for the filesystem.

    
    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 \
    --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
    mkfs.ocfs2 1.8.0
    Cluster stack: o2cb
    Cluster name: ocfs2
    Stack Flags: 0x1
    NOTE: Feature extended slot map may be enabled
    Overwriting existing ocfs2 partition.
    WARNING: Cluster check disabled.
    Proceed (y/N): y
    Label: ocfs2vol1
    Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
    Block size: 4096 (12 bits)
    Cluster size: 4096 (12 bits)
    Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
    Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
    Extent allocator size: 4194304 (1 groups)
    Journal size: 4194304
    Node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 2 block(s)
    Formatting Journals: done
    Growing extent allocator: done
    Formatting slot map: done
    Formatting quota files: done
    Writing lost+found: done
    mkfs.ocfs2 successful
    

    Now, we just have to configure the o2cb stack and we're done.

  • add the cluster: o2cb add-cluster ocfs2
  • add the nodes:
    o2cb add-node --ip 192.168.199.1 --number 0 ocfs2 host1
    o2cb add-node --ip 192.168.199.2 --number 1 ocfs2 host2
  • it is very important to use the hostname of the server (the name you get when typing hostname) for each node!
  • add the heartbeat device:
    run mounted.ocfs2 -d and take the UUID value of the filesystem/device you want to use for the heartbeat:
    # mounted.ocfs2 -d
    Device      Stack  Cluster  F  UUID                              Label
    /dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1
    
    o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
  • enable global heartbeat: o2cb heartbeat-mode ocfs2 global
  • start the cluster stack: /etc/init.d/o2cb enable
  • verify that the stack is up and running: o2cb cluster-status
  • That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat timeout values and also whether or not to start the cluster stack at boot time.

    Now that the first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes; you just need an exact copy everywhere. You also do not need to redo the above commands, except that you should make sure ocfs2-tools is installed everywhere and, if you want to start at boot time, re-run /etc/init.d/o2cb configure on the other nodes as well. From here on, you can just mount your filesystems:
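
    For the two-node example above, the resulting /etc/ocfs2/cluster.conf ends up looking roughly like this (the node names, IP addresses and heartbeat region UUID are the example values used earlier):

    cluster:
            name = ocfs2
            heartbeat_mode = global
            node_count = 2

    node:
            cluster = ocfs2
            number = 0
            ip_port = 7777
            ip_address = 192.168.199.1
            name = host1

    node:
            cluster = ocfs2
            number = 1
            ip_port = 7777
            ip_address = 192.168.199.2
            name = host2

    heartbeat:
            cluster = ocfs2
            region = 244A6AAAE77F4053803734530FC4E0B7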

    mount /dev/sdb /mountpoint1 on each node.

    If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat, you will still have just one (or a few) heartbeats going on.
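
    If you also want the volumes mounted automatically at boot (with o2cb configured to start at boot as described above), an /etc/fstab entry along these lines does the trick, mounting by the label we set earlier; the _netdev option makes sure the mount waits until the network and cluster stack are up:

    LABEL=ocfs2vol1   /mountpoint1   ocfs2   _netdev,defaults   0 0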

    have fun...

    Here is some vmstat output. The first output shows a single global heartbeat with 8 mounted filesystems; the second shows 8 mounted filesystems each with their own local heartbeat. Even though the IO amount is low, it shows that there are about 8x more IOs happening (from 1 every other second to 4 every second). As these are small IOs, they constantly move the disk head to a specific place and interrupt performance if you have one on each device. Hopefully this shows the benefits of global heartbeat.

    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
     0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
     0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
     0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
     0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
     0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
     0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0
    
    
    
    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
     0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
     0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
     0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0
    
    

    dm nfs

    A little known feature that we make good use of in Oracle VM is called dm nfs: basically, the ability to create a device mapper device directly on top of an NFS-based file/filesystem. We use this in Oracle VM 3 if your shared storage for the cluster is NFS-based.

    Oracle VM clustering relies on the OCFS2 cluster stack/filesystem that is native in the kernel (UEK2/2.6.39-x). When we create an HA-enabled pool, we create what we call a pool filesystem. That filesystem contains an ocfs2 volume so that we can store cluster-wide data. In particular we store shared database files that are needed by the Oracle VM agents on the nodes for HA. It contains info on pool membership, which VMs are in HA mode, what the pool IP is, etc...

    When the user provides an nfs filesystem for the pool, we do the following:

  • mount the nfs volume in /nfsmnt/
  • create a 10GB sized file ovspoolfs.img
  • create a dm nfs volume (/dev/mapper/ovspoolfs) on this ovspoolfs.img file
  • create an ocfs2 volume on this dm nfs device
  • mount the ocfs2 volume on /poolfsmnt/
    If someone wants to try out something that relies on block-based shared storage devices, such as ocfs2, but does not have iSCSI or SAN storage, using nfs is an alternative and dm nfs just makes it really easy.

    To do this yourself, the following commands will do it for you:

  • to find out if any such devices exist just type dmsetup table --target nfs
  • to create your own device, do something like this:
  • mount mynfsserver:/mountpoint /mnt
    dd if=/dev/zero of=/mnt/myvolume.img bs=1M count=2000 
    dmsetup create myvolume --table "0 4096000 nfs /mnt/myvolume.img 0"
    
    So: mount the nfs volume, create a file which will be the container for the block device (in this case a 2 GB file), and then create the dm device. The values for the dmsetup command are the following:

    myvolume = the name of the /dev/mapper device. Here we end up with /dev/mapper/myvolume

    table = start (normally always 0), followed by the length in 512-byte sectors (two sectors per KB, so the 2000 MB file above comes out to 4096000 sectors), then nfs since this is on nfs, the filename of the nfs-based file, and finally the offset (normally always 0).

    So now you have /dev/mapper/myvolume; it acts like a normal block device. If you do this on multiple servers, you can actually create an ocfs2 filesystem on this block device and it will be consistent across the servers.
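
    As a small, hypothetical continuation of that example: assuming the o2cb cluster stack from the first half of this post is already configured and running on each server (and using a made-up label dmnfsvol and mount point /ocfs2test), you could format and use the dm device just like a SAN or iSCSI LUN, and tear the mapping down again when you are done:

    # dmsetup table --target nfs                # confirm the nfs-backed mapping exists
    # mkfs.ocfs2 -b 4K -C 4K -N 4 -L dmnfsvol \
      --cluster-name=ocfs2 --cluster-stack=o2cb /dev/mapper/myvolume
    # mkdir -p /ocfs2test
    # mount /dev/mapper/myvolume /ocfs2test     # behaves like any other ocfs2 mount
    # umount /ocfs2test
    # dmsetup remove myvolume                   # remove the device mapper mapping again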

    Credits go to Chuck Lever for writing dm nfs in the first place, thanks Chuck :) The code for dm nfs is here.
