Saturday Jan 05, 2013

Using Oracle VM messages to configure a Virtual Machine.

In the previous blog entry, I walked through the steps to set up a VM with the packages needed to enable Oracle VM template configuration. The template configuration scripts are add-ons one can install inside a VM running in an Oracle VM 3 environment. Once they are installed, it is possible to enable the configuration scripts and shut down the VM so that after cloning or reboot, we go through an initial setup dialog.

At startup time, if ovmd is enabled, it will start executing configuration scripts that need input to configure and continue. It is possible to send this configuration data through the virtual console of the VM or through the Oracle VM API. To use the Oracle VM API to send configuration messages, you have two options :

(1) use the Oracle VM CLI. As of Oracle VM 3.1, we include an Oracle VM CLI server by default when installing Oracle VM Manager. This process starts on port 10000 on the Oracle VM Manager node and acts as an ssh server. You can log into this cli using the admin username/password and then execute cli commands.

# ssh admin@localhost -p 10000
admin@localhost's password: 
OVM> sendVmMessage Vm name=ol6u3apitest key=foo message=bar log=no
Command: sendVmMessage Vm name=ol6u3apitest key=foo message=bar log=no
Status: Success
Time: 2012-12-27 09:04:29,890 PST

The cli command for sending a message is sendVmMessage Vm name=[vmname] key=[key] message=[value]

If you do not want to log the output of the command, add log=no
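When scripting against the CLI it can help to build the command line in one place. The following is a minimal sketch, not part of the Oracle VM tooling: the function name is made up for illustration, and it assumes the simple name=/key=/message= form shown above with values that contain no whitespace, since the CLI's quoting rules are not covered here.

```python
def send_vm_message_command(vm_name, key, value, log=True):
    """Return a sendVmMessage command line for the Oracle VM CLI.

    Hypothetical helper: it only formats the command shown in the text.
    """
    for field in (vm_name, key, value):
        # Assumption: quoting rules are unknown, so refuse whitespace.
        if any(ch.isspace() for ch in str(field)):
            raise ValueError("whitespace is not handled in this sketch: %r" % (field,))
    parts = ["sendVmMessage", "Vm",
             "name=%s" % vm_name, "key=%s" % key, "message=%s" % value]
    if not log:
        # log=no keeps the command out of the CLI log
        parts.append("log=no")
    return " ".join(parts)


print(send_vm_message_command("ol6u3apitest", "foo", "bar", log=False))
```

The resulting string can be typed at the OVM> prompt or passed as the remote command of an ssh session to port 10000.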

(2) use the Oracle VM utilities. If you install the Oracle VM Utilities, see here to get started, then :

# ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest -k foo -V bar
Oracle VM VM Message utility 0.5.2.
VM : 'ol6u3apitest' has status :  Running.
Sending message.
Message sent successfully.

The ovm_vmmessage command connects to Oracle VM Manager and sends a key/value pair to the VM you select.

ovm_vmmessage -u [adminuser] -p [adminpassword] -h [managernode] -v [vmname] -k [key] -V [value]

These two commands basically allow the admin user to send simple key - value pair messages to a given VM. This is the basic mechanism we rely on to remotely configure a VM using the Oracle VM template config scripts.

For the template configuration we provide, and depending on the scripts you installed, there is a well-defined set of variables (keys) that you can set, listed below. In our scripts one variable is required, and it has to be sent last: the root password. Everything else is optional. Sending the root password variable triggers the reconfiguration. As an example, if you install the ovm-template-config-selinux package, then part of the configuration can be to set the SELinux mode. The variable is and the values can be enforcing, permissive or disabled. So to set the SELinux mode, you send a message with key and value enforcing (or one of the other values).

# ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest \
        -k -V enforcing

Do this for every variable you want to define and at the end send the root password.

# ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest \
        -k -V "mypassword"
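The ordering rule above (optional variables first, root password last) is easy to get wrong in a script, so here is a rough sketch of one way to enforce it. The function and the key names "example.selinux.mode" and "example.root-password" are placeholders invented for illustration; substitute the real keys reported by ovm-template-config on your VM. Only the ovm_vmmessage options shown above are used.

```python
def build_message_commands(settings, root_password_key, root_password,
                           vm_name, admin_user="admin", admin_password="",
                           manager="localhost", tool="./ovm_vmmessage"):
    """Return one argv list per message, with the root password sent last."""
    def argv(key, value):
        return [tool, "-u", admin_user, "-p", admin_password,
                "-h", manager, "-v", vm_name, "-k", key, "-V", str(value)]

    # Optional settings go first, in the order they were given.
    commands = [argv(k, v) for k, v in settings.items()
                if k != root_password_key]
    # The root password triggers the reconfiguration, so it goes last.
    commands.append(argv(root_password_key, root_password))
    return commands


ROOT_KEY = "example.root-password"                 # hypothetical key name
settings = {"example.selinux.mode": "enforcing"}   # hypothetical key name

for cmd in build_message_commands(settings, ROOT_KEY, "mypassword",
                                  "ol6u3apitest"):
    # Print only the key of each message, not the passwords.
    print(cmd[cmd.index("-k") + 1])
```

Each returned list can be handed to subprocess.run(cmd, check=True) on the machine where the Oracle VM Utilities are installed.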

Once the above message gets sent, the ovm-template-config scripts will set up all the values and the VM will end up in a configured state. You can use this to send ssh keys, set up extra users, configure the virtual network devices, etc. To get the list of configuration variables, just run # ovm-template-config --human-readable --enumerate configure and it will list the variables with a description, like the listing below.

It is also possible to selectively enable and disable scripts. This works much like chkconfig. # ovm-chkconfig --list will show which scripts/modules are registered and whether they are enabled to run at configure time and/or cleanup time. At this point, the other targets (suspend, resume, ...) are not implemented. If you have installed datetime but do not want it to run or to be offered as an option, then a simple # ovm-chkconfig --target configure datetime off will disable it. This allows you, for each VM or template, to selectively enable or disable configuration options. If you disable a module, the output of ovm-template-config will reflect that change.

The next blog entry will talk about how to make generic use of the VM message API and possibly extend the ovm-template-config modules for your own applications.

  [{u'description': u'SELinux mode: enforcing, permissive or disabled.',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'Whether to enable network firewall: True or False.',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'System date and time in format year-month-day-hour-minute-second, e.g., "2011-4-7-9-2-42".',
    u'hidden': True,
    u'key': u''},
   {u'description': u'System time zone, e.g., "America/New_York".',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Whether to keep hardware clock in UTC: True or False.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Whether to enable NTP service: True or False.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'NTP servers separated by comma, e.g., ",".',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Whether to enable NTP local time source: True or False.',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'System host name, e.g., "localhost.localdomain".',
    u'key': u''},
   {u'description': u'Hostname entry for /etc/hosts, e.g., " localhost.localdomain localhost".',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Network device to configure, e.g., "eth0".',
    u'key': u''},
   {u'depends': u'',
    u'description': u'Network device hardware address, e.g., "00:16:3E:28:0F:4E".',
    u'hidden': True,
    u'key': u''},
   {u'depends': u'',
    u'description': u'Network device MTU, e.g., "1500".',
    u'hidden': True,
    u'key': u''},
   {u'choices': [u'yes', u'no'],
    u'depends': u'',
    u'description': u'Activate interface on system boot: yes or no.',
    u'key': u''},
   {u'choices': [u'dhcp', u'static'],
    u'depends': u'',
    u'description': u'Boot protocol: dhcp or static.',
    u'key': u''},
   {u'depends': u'',
    u'description': u'IP address of the interface.',
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'Netmask of the interface.',
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'Gateway IP address.',
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'DNS servers separated by comma, e.g., ",".',
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'description': u'DNS search domains separated by comma, e.g., ",".',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Network device to configure, e.g., "eth0".',
    u'hidden': True,
    u'key': u''},
   {u'depends': u'',
    u'description': u'Network device hardware address, e.g., "00:16:3E:28:0F:4E".',
    u'hidden': True,
    u'key': u''},
   {u'depends': u'',
    u'description': u'Network device MTU, e.g., "1500".',
    u'hidden': True,
    u'key': u''},
   {u'choices': [u'yes', u'no'],
    u'depends': u'',
    u'description': u'Activate interface on system boot: yes or no.',
    u'hidden': True,
    u'key': u''},
   {u'choices': [u'dhcp', u'static'],
    u'depends': u'',
    u'description': u'Boot protocol: dhcp or static.',
    u'hidden': True,
    u'key': u''},
   {u'depends': u'',
    u'description': u'IP address of the interface.',
    u'hidden': True,
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'Netmask of the interface.',
    u'hidden': True,
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'Gateway IP address.',
    u'hidden': True,
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'depends': u'',
    u'description': u'DNS servers separated by comma, e.g., ",".',
    u'hidden': True,
    u'key': u'',
    u'requires': [u'',
                  [u'static', u'none', None]]},
   {u'description': u'DNS search domains separated by comma, e.g., ",".',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'Name of the user on which to perform operation.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Action to perform on the user: add, del or mod.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'User ID.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'User initial login group.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Supplementary groups separated by comma.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'User password.',
    u'hidden': True,
    u'key': u'',
    u'password': True},
   {u'description': u'New name of the user.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Name of the group on which to perform operation.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Action to perform on the group: add, del or mod.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Group ID.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'New name of the group.',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'Host private rsa1 key for protocol version 1.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Host public rsa1 key for protocol version 1.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Host private rsa key.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Host public rsa key.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Host private dsa key.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Host public dsa key.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Name of the user to add a key.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Authorized public keys.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Private key for authentication.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Private key type: rsa, dsa or rsa1.',
    u'hidden': True,
    u'key': u''},
   {u'description': u'Known hosts.',
    u'hidden': True,
    u'key': u''}]),
  [{u'description': u'System root password.',
    u'key': u'',
    u'password': True,
    u'required': True}])]
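The listing above is a set of descriptor dictionaries with fields such as description, hidden, required, choices and requires. As a rough illustration of how a script might consume such descriptors, here is a sketch that picks out the visible and required entries. The key names in SAMPLE are placeholders (the real key names are not reproduced in the listing), and the reading of requires as "only applies when another key has one of the listed values" is my interpretation of the output, not documented behaviour.

```python
def visible(descriptors):
    """Descriptors that are not marked hidden."""
    return [d for d in descriptors if not d.get("hidden", False)]


def required(descriptors):
    """Keys of descriptors that are marked required."""
    return [d["key"] for d in descriptors if d.get("required", False)]


def applicable(descriptor, values):
    """Assumed meaning of 'requires': [other_key, [allowed values]]."""
    rule = descriptor.get("requires")
    if not rule:
        return True
    other_key, allowed = rule
    # An unset key yields None, which the listing includes as an allowed value.
    return values.get(other_key) in allowed


# Hypothetical sample shaped like the listing; key names are placeholders.
SAMPLE = [
    {"key": "example.bootproto", "choices": ["dhcp", "static"],
     "description": "Boot protocol: dhcp or static."},
    {"key": "example.ipaddr", "description": "IP address of the interface.",
     "requires": ["example.bootproto", ["static", "none", None]]},
    {"key": "example.mtu", "hidden": True,
     "description": "Network device MTU."},
    {"key": "example.root-password", "password": True, "required": True,
     "description": "System root password."},
]

print(required(SAMPLE))
print([d["key"] for d in visible(SAMPLE)])
print(applicable(SAMPLE[1], {"example.bootproto": "dhcp"}))
```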

Configure Oracle Linux 6.3 as an Oracle VM template

I have been asked a few times how one can make use of the Oracle VM API to configure an Oracle Linux VM running on top of Oracle VM 3. In the next few blog entries we will go through the various steps. This one will start at the beginning and get you to a completely prepared VM.

  • Create a VM with a default installation of Oracle Linux 6 update 3
  • You can freely download Oracle Linux installation images from Choose any type of installation you want, basic, desktop, server, minimal...

    Oracle Linux 6.3 comes with kernel 2.6.39-200.24.1 (UEK2)

    # uname -a
    Linux ol6u3 2.6.39-200.24.1.el6uek.x86_64 #1 SMP Sat Jun 23 02:39:07 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

  • Update the VM to the latest version of UEK and in general as a best practice update to the latest patches and reboot the VM
  • Oracle Linux updates are freely available on our public-yum site and the default install of Oracle Linux 6.3 already points to this location for updates.

    # yum update 
    # reboot
    # uname -a
    Linux ol6u3 2.6.39-300.17.3.el6uek.x86_64 #1 SMP Wed Dec 19 06:28:03 PST 2012 x86_64 x86_64 x86_64 GNU/Linux

    An extra kernel module is required for the Oracle VM API to work. The ovmapi kernel module provides the ability to pass messages back and forth between the host and the VM, and as such between Oracle VM Manager, through the VM API, to the VM and back. We included this kernel module in the 2.6.39-300 kernel to make it easy: there is no need to install extra kernel modules or to keep kernel modules up to date when we release a new update. The source code for this kernel module is of course part of the UEK2 source tree.

  • Enable the Oracle Linux add-on channel
  • After reboot, download the latest public-yum repo file from public-yum which contains more repositories and enable the add-on channel which contains the Oracle VM API packages:

    inside the VM :

    # cd /etc/yum.repos.d
    # rm public-yum-ol6.repo    <- (replace the original version with this newer version)
    # wget

  • Edit the public-yum-ol6.repo file to enable the ol6_addons channel.
  • Find the ol6_addons section and change enabled=0 to enabled=1.

    name=Oracle Linux $releasever Add ons ($basearch)

    Save the file.

  • Install the Oracle VM API packages
  • # yum install ovmd xenstoreprovider python-simplejson ovm-template-config

    This installs the basic packages needed on Oracle Linux 6 to support the Oracle VM API. xenstoreprovider is the library which communicates with the ovmapi kernel infrastructure. ovmd is a daemon that handles configuration and re-configuration events and provides a mechanism to send/receive messages between the VM and the Oracle VM Manager.

  • Add additional configuration packages you want
  • In order to be able to create a VM template that includes basic OS configuration system scripts, you can decide to install any or all of the following :

    ovm-template-config-authentication : Oracle VM template authentication configuration script.
    ovm-template-config-datetime       : Oracle VM template datetime configuration script.
    ovm-template-config-firewall       : Oracle VM template firewall configuration script.
    ovm-template-config-network        : Oracle VM template network configuration script.
    ovm-template-config-selinux        : Oracle VM template selinux configuration script.
    ovm-template-config-ssh            : Oracle VM template ssh configuration script.
    ovm-template-config-system         : Oracle VM template system configuration script.
    ovm-template-config-user           : Oracle VM template user configuration script.

    Simply type # yum install ovm-template-config-... to install whichever you want.

  • Enable ovmd
  • To enable ovmd (recommended) do :

    # chkconfig ovmd on 
    # /etc/init.d/ovmd start
  • Prepare your VM for first boot configuration
  • If you want to shut down this VM and enable the first boot configuration as a template, execute :

    # ovmd -s cleanup
    # service ovmd enable-initial-config
    # shutdown -h now

    After cloning this VM or starting it, it will act as a first time boot VM and it will require configuration input through the VM API or on the virtual VM console.

    My next blog will go into detail on how to send messages through the Oracle VM API for remote configuration and also how to extend the scripts.
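The repo-file edit described in the steps above (changing enabled=0 to enabled=1 in the ol6_addons section) can also be scripted. The following is only a sketch using Python's standard configparser module. The function name is made up, and the baseurl in SAMPLE is a placeholder rather than the real repository URL. Note that configparser drops comments when it rewrites a file, so keep a backup of the original.

```python
import configparser
import io


def enable_repo(repo_text, section="ol6_addons"):
    """Return repo_text with enabled=1 set in the given section."""
    # interpolation=None leaves yum variables like $releasever untouched
    cp = configparser.ConfigParser(interpolation=None)
    cp.optionxform = str          # keep option names exactly as written
    cp.read_string(repo_text)
    if not cp.has_section(section):
        raise KeyError("no [%s] section found" % section)
    cp.set(section, "enabled", "1")
    out = io.StringIO()
    cp.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical excerpt; the baseurl is a placeholder, not the real location.
SAMPLE = """\
[ol6_addons]
name=Oracle Linux $releasever Add ons ($basearch)
baseurl=http://repo.example.invalid/ol6/addons/$basearch/
gpgcheck=1
enabled=0
"""

print(enable_repo(SAMPLE))
```

To apply it to the real file, read /etc/yum.repos.d/public-yum-ol6.repo, pass its contents through enable_repo, and write the result back as root.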

    Friday Jan 04, 2013


    dlmfs is a really cool nifty feature as part of OCFS2. Basically, it's a virtual filesystem that allows a user/program to use the DLM through simple filesystem commands/manipulation. Without having to write programs that link with cluster libraries or do complex things, you can literally write a few lines of Python, Java or C code that let you create locks across a number of servers. We use this feature in Oracle VM to coordinate the master server and the locking of VMs across multiple nodes in a cluster. It allows us to make sure that a VM cannot start on multiple servers at once. Every VM is backed by a DLM lock, but by using dlmfs, this is simply a file in the dlmfs filesystem.

    To show you how easy and powerful this is, I took some of the Oracle VM agent Python code. It is a very simple example of how to create a lock domain and a lock, and how to tell whether you got the lock or not. The focus here is just a master lock, which you could use for an agent that is responsible for a virtual IP or for some executable that you want to run on a given server, but the calls to create any kind of lock are in the code. Anyone who wants to experiment with this can add their own bits in a matter of minutes.

    The prerequisite is simple : take a number of servers, configure an ocfs2 volume and an ocfs2 cluster (see previous blog entries) and run the script. You do not have to set up an ocfs2 volume if you do not want to, you could just set up the domain without actually mounting the filesystem. (See the global heartbeat blog). So practically this can be done with a very small simple setup.

    My example has two nodes, wcoekaer-emgc1 and wcoekaer-emgc2 are the two Oracle Linux 6 nodes, configured with a shared disk and an ocfs2 filesystem mounted. This setup ensures that the dlmfs kernel module is loaded and the cluster is online. Take the python code listed here and just execute it on both nodes.

    [root@wcoekaer-emgc2 ~]# lsmod |grep ocfs
    ocfs2                1092529  1 
    ocfs2_dlmfs            20160  1 
    ocfs2_stack_o2cb        4103  1 
    ocfs2_dlm             228380  1 ocfs2_stack_o2cb
    ocfs2_nodemanager     219951  12 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
    ocfs2_stackglue        11896  3 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb
    configfs               29244  2 ocfs2_nodemanager
    jbd2                   93114  2 ocfs2,ext4
    You see that the ocfs2_dlmfs kernel module is loaded.

    [root@wcoekaer-emgc2 ~]# mount |grep dlmfs
    ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
    The dlmfs virtual filesystem is mounted on /dlm.

    I now execute on both nodes and show some output, after a while I kill (control-c) the script on the master node and you see the other node take over the lock. I then restart the script and reboot the other node and you see the same.

    [root@wcoekaer-emgc1 ~]# ./ 
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    ^Ccleaned up master lock file
    [root@wcoekaer-emgc1 ~]# ./ 
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am not the master
    i am not the master
    i am not the master
    i am not the master
    i am master of the multiverse
    This shows that I started as master, then hit ctrl-c and dropped the lock, and the other node took the lock. Then I rebooted the other node and took the lock again.

    [root@wcoekaer-emgc2 ~]# ./
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am not the master
    i am not the master
    i am not the master
    i am not the master
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    [1]+  Stopped                 ./
    [root@wcoekaer-emgc2 ~]# bg
    [1]+ ./ &
    [root@wcoekaer-emgc2 ~]# reboot -f
    Here you see that this node started without being master, became master when I hit ctrl-c on the other node, and then, after a forced reboot, its lock was released automatically.

    And here is the code, just copy it to your servers and execute it...

    # Copyright (C) 2006-2012 Oracle. All rights reserved.
    # This program is free software; you can redistribute it and/or modify it under
    # the terms of the GNU General Public License as published by the Free Software
    # Foundation, version 2.  This program is distributed in the hope that it will
    # be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
    # Public License for more details.  You should have received a copy of the GNU
    # General Public License along with this program; if not, write to the Free
    # Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
    # 02111-1307, USA.
    import sys
    import subprocess
    import stat
    import time
    import os
    import re
    import socket
    from time import sleep
    from os.path import join, isdir, exists
    # defines
    # dlmfs is where the dlmfs filesystem is mounted
    # the default, normal place for ocfs2 setups is /dlm
    # ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
    DLMFS = "/dlm"
    # we need a domain name which really is just a subdir in dlmfs
    # default to "mycluster" so then it creates /dlm/mycluster
    # locks are created inside this directory/domain
    DLM_DOMAIN_NAME = "mycluster"
    # full path of the domain directory inside dlmfs
    DLM_DOMAIN_PATH = DLMFS + "/" + DLM_DOMAIN_NAME
    # the main lock to use for being the owner of a lock
    # this can be any name, the filename is just the lockname
    DLM_LOCK_MASTER = DLM_DOMAIN_PATH + "/" + "master"
    # just a timeout
    SLEEP_ON_ERR = 60
    def run_cmd(cmd, success_return_code=(0,)):
        if not isinstance(cmd, list):
            raise Exception("Only accepts list!")
        cmd = [str(x) for x in cmd]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, close_fds=True)
        (stdoutdata, stderrdata) = proc.communicate()
        if proc.returncode not in success_return_code:
            raise RuntimeError('Command: %s failed (%s): stderr: %s stdout: %s'
                               % (cmd, proc.returncode, stderrdata, stdoutdata))
        return str(stdoutdata)
    def dlm_ready():
        """
        Indicate if the DLM is ready or not.
        With dlmfs, the DLM is ready once the DLM filesystem is mounted
        under /dlm.
        @return: C{True} if the DLM is ready, C{False} otherwise.
        @rtype: C{bool}
        """
        return os.path.ismount(DLMFS)
    # just do a mkdir, if it already exists, we're good, if not just create it
    def dlm_join_domain(domain=DLM_DOMAIN_NAME):
        _dir = join(DLMFS, domain)
        if not isdir(_dir):
            os.mkdir(_dir)
        # else: already joined
    # leaving a domain is basically removing the directory.
    def dlm_leave_domain(domain=DLM_DOMAIN_NAME, force=True):
        _dir = join(DLMFS, domain)
        if force:
            cmd = ["rm", "-fr", _dir]
        else:
            cmd = ["rmdir", _dir]
        run_cmd(cmd)
    # acquire a lock
    def dlm_acquire_lock(lock):
        # a lock is a filename in the domain directory
        lock_path = join(DLM_DOMAIN_PATH, lock)
        try:
            if not exists(lock_path):
                fd = os.open(lock_path, os.O_CREAT | os.O_NONBLOCK)
                os.close(fd)
            # create the EX lock
            # creating a file with O_RDWR causes an EX lock
            fd = os.open(lock_path, os.O_RDWR | os.O_NONBLOCK)
            # once the file is created in this mode, you can close it
            # and you still keep the lock
            os.close(fd)
        except Exception:
            # we did not get the lock: remove the file and re-raise
            if exists(lock_path):
                os.remove(lock_path)
            raise
    def dlm_release_lock(lock):
        # releasing a lock is as easy as just removing the file
        lock_path = join(DLM_DOMAIN_PATH, lock)
        if exists(lock_path):
            os.remove(lock_path)
    def acquire_master_dlm_lock():
        ETXTBUSY = 26
        # close() does not downconvert the lock level nor does it drop the lock. The
        # holder still owns the lock at that level after close.
        # close() allows any downconvert request to succeed.
        # However, a downconvert request is only generated for queued requests. And
        # O_NONBLOCK is specifically a noqueue dlm request.
        # 1) O_CREAT | O_NONBLOCK will create a lock file if it does not exist, whether
        #    we are the lock holder or not.
        # 2) if we hold O_RDWR lock, and we close but not delete it, we still hold it.
        #    afterward, O_RDWR will succeed, but O_RDWR | O_NONBLOCK will not.
        # 3) but if we do not hold the lock, O_RDWR will hang there waiting,
        #    which is not desirable -- any uninterruptible hang is undesirable.
        # 4) if nobody else holds the lock either, but the lock file exists as side effect
        #    of 1), with O_NONBLOCK, it may result in ETXTBUSY
        # a) we need O_NONBLOCK to avoid scenario (3)
        # b) we need to delete it ourself to avoid (2)
        #   *) if we do not succeed with (1), remove the lock file to avoid (4)
        #   *) if everything is good, we drop it and we remove it
        #   *) if killed by a program, this program should remove the file
        #   *) if crashed, but not rebooted, something needs to remove the file
        #   *) on reboot/reset the lock is released to the other node(s)
        try:
            if not exists(DLM_LOCK_MASTER):
                fd = os.open(DLM_LOCK_MASTER, os.O_CREAT | os.O_NONBLOCK)
                os.close(fd)
            master_lock = os.open(DLM_LOCK_MASTER, os.O_RDWR | os.O_NONBLOCK)
            os.close(master_lock)
            print("i am master of the multiverse")
            # at this point, I know I am the master and I can add code to do
            # things that only a master can do, such as, consider setting
            # a virtual IP or, if I am master, I start a program
            # and if not, then I make sure I don't run that program (or VIP)
            # so the magic starts here...
            return True
        except OSError as e:
            if e.errno == ETXTBUSY:
                print("i am not the master")
                # if we are not master and the file exists, remove it or
                # we will never succeed
                if exists(DLM_LOCK_MASTER):
                    os.remove(DLM_LOCK_MASTER)
                return False
            else:
                raise
    def release_master_dlm_lock():
        if exists(DLM_LOCK_MASTER):
            os.remove(DLM_LOCK_MASTER)
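    def run_forever():
        # set socket default timeout for all connections
        # (assumption: the original value is not shown; 30 seconds is a placeholder)
        socket.setdefaulttimeout(30)
        print("Checking DLM")
        if dlm_ready():
            print("DLM Ready - joining domain : " + DLM_DOMAIN_NAME)
            dlm_join_domain()
        else:
            print("DLM not ready - bailing, fix the cluster stack")
            sys.exit(1)
        print("Starting main loop...")
        while True:
            try:
                acquire_master_dlm_lock()
                # assumption: the original polling interval is not shown;
                # 20 seconds is a placeholder
                sleep(20)
            except Exception as e:
                # something went wrong, wait a bit and try again
                sleep(SLEEP_ON_ERR)
            except (KeyboardInterrupt, SystemExit):
                if exists(DLM_LOCK_MASTER):
                    os.remove(DLM_LOCK_MASTER)
                # if you control-c out of this, then you lose the lock!
                # delete it on exit for release
                print("cleaned up master lock file")
                sys.exit(1)
    if __name__ == '__main__':
        run_forever()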
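
The script above only takes an exclusive lock. As I understand the kernel's dlmfs documentation, opening a lock file read-only requests a shared lock, while O_RDWR requests an exclusive one, and O_NONBLOCK turns the request into a try-lock. The sketch below wraps that idea in two small helpers. The names try_lock and unlock are made up for illustration, and the demo at the bottom runs against an ordinary temporary directory, so it only exercises the file handling and not real cluster locking. Point domain_dir at a directory under the dlmfs mount to use it for real.

```python
import errno
import os
import tempfile


def try_lock(domain_dir, name, exclusive=True):
    """Try to take a lock without blocking; True if acquired, False if busy."""
    path = os.path.join(domain_dir, name)
    # Assumption (from the dlmfs docs): O_RDWR = exclusive, O_RDONLY = shared.
    level = os.O_RDWR if exclusive else os.O_RDONLY
    try:
        if not os.path.exists(path):
            os.close(os.open(path, os.O_CREAT | os.O_NONBLOCK))
        # On dlmfs the lock stays held after close, until the file is removed.
        os.close(os.open(path, level | os.O_NONBLOCK))
    except OSError as e:
        if e.errno == errno.ETXTBSY:
            # Same cleanup as the script above: drop the stale local file.
            if os.path.exists(path):
                os.remove(path)
            return False
        raise
    return True


def unlock(domain_dir, name):
    """Release a lock by removing its file."""
    path = os.path.join(domain_dir, name)
    if os.path.exists(path):
        os.remove(path)


# Demo on a scratch directory (no real DLM involved).
demo_dir = tempfile.mkdtemp()
print(try_lock(demo_dir, "demo"))
unlock(demo_dir, "demo")
os.rmdir(demo_dir)
```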

    Thursday Jan 03, 2013

    OCFS2 global heartbeat

    A cool, but often missed feature in Oracle Linux is the inclusion of OCFS2. OCFS2 is a native Linux cluster filesystem which was written many years ago at Oracle (hence the name Oracle Cluster Filesystem) and which got included in the mainline Linux kernel in 2.6.16, released in early 2006. The filesystem is widely used and has a number of really cool features.

  • simplicity : it's incredibly easy to configure the filesystem and clusterstack. There is literally one small text-based config file.
  • complete : ocfs2 contains all the components needed : a nodemanager, a heartbeat, a distributed lock manager and the actual cluster filesystem
  • small : the size of the filesystem and the needed tools is incredibly small. It consists of a few kernel modules and a small set of userspace tools. All the kernel modules together add up to about 2.5 MB in size and the userspace package is a mere 800 KB.
  • integrated : it's a native Linux filesystem so it makes use of all the normal kernel infrastructure. There is no duplication of structures or caches; it fits right into the standard Linux filesystem structure.
  • part of Oracle Linux/UEK : ocfs2, like other linux filesystems, is built as kernel modules. When customers use Oracle Linux's UEK or UEK2, we automatically compile the kernel modules for the filesystem. Other distributions like SLES have done the same. We fully support OCFS2 as part of Oracle Linux as a general purpose cluster filesystem.
  • feature rich :
    OCFS2 is POSIX-compliant
    Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
    REFLINKs (inode-based writeable snapshots)
    Indexed Directories
    Metadata Checksums
    Extended Attributes (unlimited number of attributes per inode)
    Advanced Security (POSIX ACLs and SELinux)
    User and Group Quotas
    Variable Block and Cluster sizes
    Journaling (Ordered and Writeback data journaling modes)
    Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
    Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    In-built Clusterstack with a Distributed Lock Manager
    Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
  • One of the main features added most recently is Global Heartbeat. OCFS2 as a filesystem typically was used with what's called local heartbeat. Basically for every filesystem you mounted, it would start its own local heartbeat, membership mechanism. The disk heartbeat means a disk io every 1 or 2 seconds for every node in the cluster, for every device. It was never a problem when the number of mounted volumes was relatively small but once customers were using 20+ volumes the overhead of the multiple disk heartbeats became significant and at times became a stability issue.

    Global heartbeat was written to provide a solution to the multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread, and then you can mount many other volumes that do not have their own. The heartbeat is shared among those one or few threads, which significantly reduces disk IO overhead.
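
To get a feel for the difference, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model taken from the description above (one heartbeat IO per node, per heartbeat device, per interval) and ignores everything else the cluster stack does. The function name and the example numbers are made up for illustration.

```python
def heartbeat_ios_per_second(nodes, heartbeat_devices, interval_seconds=2.0):
    """Cluster-wide heartbeat IOs per second under the simplified model."""
    return nodes * heartbeat_devices / interval_seconds


# Local heartbeat: every mounted volume is its own heartbeat device.
local = heartbeat_ios_per_second(nodes=4, heartbeat_devices=20)
# Global heartbeat: a single dedicated heartbeat device for all volumes.
shared = heartbeat_ios_per_second(nodes=4, heartbeat_devices=1)

print(local, shared)
```

With 4 nodes and 20 volumes, the model gives 40 heartbeat IOs per second for local heartbeat against 2 for a single global heartbeat device.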

    I was playing with this a little bit the other day and noticed that it wasn't very well documented, so why not write it up here and share it with everyone. Getting started with OCFS2 is really easy, and within just a few minutes it is possible to have a complete installation.

    I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, this shared storage device should typically be either a shared SAN disk or an iSCSI device, but with Oracle Linux and UEK2 it is also possible to create a shared virtual device on an NFS server and use this device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based. I just wrote a blog entry about how to do that here.

    While it is technically possible to create a working ocfs2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real world use. In any cluster environment it's highly recommended to have a private network for cluster traffic. The biggest reason for instability in a clustering environment is a bad or unreliable network and/or storage. Many times the environment has an overloaded network which causes network heartbeats to fail, or disks where failover takes longer than the default configuration allows, and the only alternative we have at that point is to reboot the node(s).

    Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to have the latest packages and also the latest kernel version installed, and then do a reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course all this is freely accessible from

    Depending on the type of installation you did (basic, minimal, etc...) you may or may not have to add RPMs. Do a simple check rpm -q ocfs2-tools to see if the tools are installed, if not, just run yum install ocfs2-tools. And that's it. All required software is now installed. The kernel modules are already part of the uek2 kernel and the required tools (mkfs, fsck, o2cb,..) are part of the ocfs2-tools RPM.

    Next up: create the filesystem on the shared disk device and configure the cluster.

    One requirement for using global heartbeat is that the heartbeat device needs to be a NON-partitioned disk. Other OCFS2 volumes you want to create and mount can be on partitioned disks, but a device for the heartbeat needs to be on an empty disk. Let's assume /dev/sdb in this example.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 \
    --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb
    This creates a filesystem with a 4K blocksize (normal value) and a clustersize of 4K (if you have many small files, this is a good value; if you have few large files, go to 1M).

    Journalsize of 4M: if you have a large filesystem with a lot of metadata changes you might want to increase this. I did not add an option for 32bit or 64bit journals. If you want to create huge filesystems then use block64, which uses jbd2.

    The filesystem is created for 4 nodes (-N 4). If your cluster needs to grow larger, you can always change this later with tunefs.ocfs2.

    Label ocfs2vol1: this is a disklabel you can later use to mount the filesystem by label.

    cluster-name=ocfs2 : this is the default name, but if you want to have your own name for your cluster you can put a different value here. Remember it, because you will need to configure the clusterstack with the clustername later.

    cluster-stack=o2cb : it is possible to use different cluster-stacks, such as pacemaker or cman.

    global-heartbeat : make sure that the filesystem is prepared and built to support global heartbeat.

    /dev/sdb : the device to use for the filesystem.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 \
    --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
    mkfs.ocfs2 1.8.0
    Cluster stack: o2cb
    Cluster name: ocfs2
    Stack Flags: 0x1
    NOTE: Feature extended slot map may be enabled
    Overwriting existing ocfs2 partition.
    WARNING: Cluster check disabled.
    Proceed (y/N): y
    Label: ocfs2vol1
    Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
    Block size: 4096 (12 bits)
    Cluster size: 4096 (12 bits)
    Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
    Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
    Extent allocator size: 4194304 (1 groups)
    Journal size: 4194304
    Node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 2 block(s)
    Formatting Journals: done
    Growing extent allocator: done
    Formatting slot map: done
    Formatting quota files: done
    Writing lost+found: done
    mkfs.ocfs2 successful
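
    The numbers in the mkfs.ocfs2 output above are easy to cross-check by hand. As a small sketch (the values are copied from the output, the variable names are just for illustration, and nothing here touches a disk): the volume size divided by the cluster size gives the cluster count, and the 82 cluster groups (81 full ones plus the tail) add up to the same number.

```shell
#!/bin/sh
# Sanity-check the numbers printed by mkfs.ocfs2 above.
# All values are copied from that output; nothing here touches a disk.

volume_bytes=10725765120
cluster_bytes=4096

# Number of clusters = volume size / cluster size.
clusters=$((volume_bytes / cluster_bytes))
echo "clusters: $clusters"

# 82 cluster groups: 81 full groups of 32256 clusters plus a tail of 5859.
groups_total=$((81 * 32256 + 5859))
echo "clusters covered by groups: $groups_total"
```

    Both lines print 2618595, which matches the "Volume size" line of the output.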

    Now, we just have to configure the o2cb stack and we're done.

  • add the cluster : o2cb add-cluster ocfs2
  • add the nodes :
    o2cb add-node --ip --number 0 ocfs2 host1
    o2cb add-node --ip --number 1 ocfs2 host2
  • it is very important to use the hostname of the server (the name you get when typing hostname) for each node!
  • add the heartbeat device:
    run the following command and take the UUID value of the filesystem/device you want to use for heartbeat : mounted.ocfs2 -d
    # mounted.ocfs2 -d
    Device      Stack  Cluster  F  UUID                              Label
    /dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1
    o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
  • enable global heartbeat : o2cb heartbeat-mode ocfs2 global
  • start the clusterstack : /etc/init.d/o2cb enable
  • verify that the stack is up and running : o2cb cluster-status

    That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat timeout values and also whether or not to start the clusterstack at boot time.
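
    As a rough sketch, the o2cb steps above can be collected in a small script. This one only prints the commands (a dry run), so it is safe to try anywhere. The function name o2cb_plan is made up for this illustration, and the node IP addresses are hypothetical placeholders from the documentation range, not values from the setup described here; substitute the private network addresses of your own nodes.

```shell
#!/bin/sh
# Dry run: print the o2cb commands for a cluster instead of executing them.
# Usage: o2cb_plan CLUSTER HEARTBEAT_UUID NODE=IP [NODE=IP ...]
# NODE must be the hostname of the server; IP is its private address
# (the addresses used below are hypothetical placeholders).
o2cb_plan() {
    cluster=$1
    uuid=$2
    shift 2
    echo "o2cb add-cluster $cluster"
    number=0
    for entry in "$@"; do
        node=${entry%%=*}   # text before the first '='
        ip=${entry#*=}      # text after the first '='
        echo "o2cb add-node --ip $ip --number $number $cluster $node"
        number=$((number + 1))
    done
    echo "o2cb add-heartbeat $cluster $uuid"
    echo "o2cb heartbeat-mode $cluster global"
}

o2cb_plan ocfs2 244A6AAAE77F4053803734530FC4E0B7 \
    host1=192.0.2.11 host2=192.0.2.12
```

    Review the printed commands, run them on the first node, and then start the stack as described above.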

    Now that a first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes, you just need to have an exact copy everywhere. You also do not need to redo the above commands; just make sure ocfs2-tools is installed everywhere and, if you want the stack to start at boot time, re-run /etc/init.d/o2cb configure on the other nodes as well. From here on, you can just mount your filesystems :

    mount /dev/sdb /mountpoint1 on each node.
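
    The copy step can be sketched the same way. This is only a dry run that prints one copy command per remaining node; using scp as root is an assumption on my part (any way of getting an identical file onto each node works), and the function name distribute_plan is hypothetical.

```shell
#!/bin/sh
# Dry run: print the commands that copy cluster.conf from the first node
# to the remaining nodes. Host names are passed as arguments.
# Assumption: scp as root is available between the nodes.
distribute_plan() {
    for host in "$@"; do
        echo "scp /etc/ocfs2/cluster.conf root@$host:/etc/ocfs2/cluster.conf"
    done
}

distribute_plan host2
```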

    If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat, you will just have one (or a few) heartbeats going on.

    have fun...

    Here is vmstat output. The first output shows a single global heartbeat with 8 mounted filesystems; the second shows 8 mounted filesystems, each with their own local heartbeat. Even though the IO amount is low, it shows that there are about 8x more IOs happening (from 1 every other second to 4 every second). As these are small IOs, they move the disk head to a specific place all the time and interrupt performance if you have a heartbeat on each device. Hopefully this shows the benefits of global heartbeat.

    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
     0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
     0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
     0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
     0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
     0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
     0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0
    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
     0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
     0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
     0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0
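
    If you want to compare the two runs without eyeballing the columns, a small awk filter can average the bi column. This is just a sketch (the function name avg_bi is made up); it assumes the standard vmstat layout shown above, where bi is the 9th field, and it skips the header lines because their first field is not a number.

```shell
#!/bin/sh
# Print the average of the "bi" column (field 9) from vmstat output on stdin.
# Header lines are skipped because their first field is not a number.
avg_bi() {
    awk '$1 ~ /^[0-9]+$/ { sum += $9; n++ }
         END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Four samples in the shape of the first run above (bi = 1, 0, 1, 0).
printf '%s\n' \
    ' r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st' \
    ' 0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0' \
    ' 0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0' \
    ' 0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0' \
    ' 0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0' \
    | avg_bi
```

    For these four samples it prints 0.50. On a live system you could feed it something like vmstat 1 20 instead; with a local heartbeat per volume, as in the second run, every sample has bi = 4.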

    dm nfs

    A little known feature that we make good use of in Oracle VM is called dm nfs: basically the ability to create a device mapper device directly on an NFS-based file/filesystem. We use this in Oracle VM 3 if your shared storage for the cluster is NFS-based.

    Oracle VM clustering relies on the OCFS2 clusterstack/filesystem that is native in the kernel (uek2/2.6.39-x). When we create an HA-enabled pool, we create, what we call, a pool filesystem. That filesystem contains an ocfs2 volume so that we can store cluster-wide data. In particular we store shared database files that are needed by the Oracle VM agents on the nodes for HA. It contains info on pool membership, which VMs are in HA mode, what the pool IP is etc...

    When the user provides an nfs filesystem for the pool, we do the following :

  • mount the nfs volume in /nfsmnt/
  • create a 10GB sized file ovspoolfs.img
  • create a dm nfs volume (/dev/mapper/ovspoolfs) on this ovspoolfs.img file
  • create an ocfs2 volume on this dm nfs device
  • mount the ocfs2 volume on /poolfsmnt/

    If someone wants to try out something that relies on block-based shared storage devices, such as ocfs2, but does not have iSCSI or SAN storage, using NFS is an alternative and dm nfs just makes it really easy.

    To do this yourself, the following commands will do it for you :

  • to find out if any such devices exist just type dmsetup table --target nfs
  • to create your own device, do something like this:
    mount mynfsserver:/mountpoint /mnt
    dd if=/dev/zero of=/mnt/myvolume.img bs=1M count=2000
    dmsetup create myvolume --table "0 4096000 nfs /mnt/myvolume.img 0"

    So mount the NFS volume, create a file which will be the container of the blockdevice (in this case a 2GB file) and then create the dm device. The values for the dmsetup command are the following:

    myvolume = the name of the /dev/mapper device. Here we end up with /dev/mapper/myvolume

    table = start (normally always 0), number of blocks/length (this is in 512-byte blocks, so you double the size in KB: 2000 x 1024 x 2 = 4096000), nfs since this is on NFS, filename of the NFS-based file, offset (normally always 0)
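
    To avoid getting the length wrong, the table line can be generated from the file size. A minimal sketch, assuming the size is given in MiB (as with the dd command above); the helper name nfs_table is hypothetical and it only prints the table string, it does not call dmsetup.

```shell
#!/bin/sh
# Build the dmsetup table line for a file-backed dm nfs device.
# Usage: nfs_table FILE SIZE_MIB
# The length field is in 512-byte sectors, so 1 MiB = 2048 sectors.
nfs_table() {
    file=$1
    size_mib=$2
    sectors=$((size_mib * 2048))
    echo "0 $sectors nfs $file 0"
}

nfs_table /mnt/myvolume.img 2000
# prints: 0 4096000 nfs /mnt/myvolume.img 0
```

    The printed string is what goes into the --table argument of dmsetup create.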

    So now you have /dev/mapper/myvolume, and it acts like a normal block device. If you do this on multiple servers, you can actually create an ocfs2 filesystem on this block device and it will be consistent across the servers.

    Credits go to Chuck Lever for writing dm nfs in the first place, thanks Chuck :) The code for dm nfs is here.

    Tuesday Nov 27, 2012

    Introducing the Oracle Linux Playground yum repo

    We just introduced a new yum repository/channel on our public yum repo called the playground channel. What we started doing is the following:

    When a new stable mainline kernel is released by Linus or GregKH, we internally build RPMs to test it and do some QA work around it to keep track of what's going on with the latest development kernels. It helps us understand how performance moves up or down and if there are issues, we try to help look into them and of course send that stuff back upstream. Many Linux users out there are interested in trying out the latest features but there are some potential barriers to do this.

    (1) in general, you are looking at an upstream development distribution, which means that everything changes both in userspace (random applications) and kernel. Projects like Fedora are very useful, and for someone who just wants to see how the entire distribution evolves with all the changes, this is a great way to be current. A drawback here, though, is that if you have applications that are not part of the distribution, there's a lot of manual work involved or they might just not work because the changes are too drastic. The introduction of systemd is a good example.

    (2) when you look at many of our customers, that are interested in our database products or applications, the starting point of having a supported/certified userspace/distribution, like Oracle Linux, is a much easier way to get your feet wet in seeing what new/future Linux kernel enhancements could do.

    This is where the playground channel comes into play. When you install Oracle Linux 6 (which anyone can download and use), grab the latest public yum repository file, put it in /etc/yum.repos.d and enable the playground repo :

    name=Latest mainline stable kernel for Oracle Linux 6 ($basearch) - Unsupported 
    Now, all you need to do is type yum update and you will be downloading the latest stable kernel, which will install cleanly on Oracle Linux 6. Thus you end up with a stable Linux distribution where you can install all your software, and then download the latest stable kernel (at time of writing this is 3.6.7) without having to recompile a kernel and without having to jump through hoops.

    There is of course a big, very important disclaimer: this is NOT for PRODUCTION use.

    We want to try and help people that are interested, from a user perspective, in where the Linux kernel is going, and make it easy to install and use it and play around with new features, without having to learn how to compile a kernel and without necessarily having to install a complete new distribution with all the changes top to bottom.

    So we don't and won't introduce any new userspace changes. This project really is about making it easy to try out the latest upstream Linux kernels on an environment that's stable and that you can keep current, since all the latest errata for Oracle Linux 6 are published on the public yum repo as well. So there is one repository location for all your current changes and the upstream kernels. We hope that this will get more users to try out the latest kernel and report their findings. We are always interested in understanding stability and performance characteristics.

    As new features are going into the mainline kernel, that could potentially be interesting or useful for various products, we will try to point them out on our blogs and give an example on how something can be used so you can try it out for yourselves.

    Anyway, I hope people will find this useful and that it will help increase interest in upstream development beyond reading lkml by some of the more non-kernel-developer types.

    Wednesday Jul 11, 2012

    Oracle VM VirtualBox virtual appliance images for Oracle VM 3.1.1 server and Manager

    I updated the Oracle VM VirtualBox appliances for Oracle VM Manager. It now contains the latest release+patch, Oracle VM Manager 3.1.1 build 365. Alongside the Manager VM I also created a preconfigured server setup: Oracle VM Server 3.1.1 build 365. The nice thing with this combination is that you can effectively run a smaller server pool on your desktop or laptop if you have a decent amount of RAM. I managed to create a 2 node server pool, basically running the Manager VM + 2 server VMs on one 8GB MacBook. Of course it wasn't terribly fast or useful to run anything serious but it was good enough to test HA and show off the functionality.

    The VMs can be downloaded here. There are a few important things :

  • You can only run ParaVirtualized guests in the server VMs
  • I precreated nfs directories and started the nfs server bits in the Manager VM so you can use the Manager as an nfs repository for shared storage to the servers
  • I created a yum directory and updated httpd.conf on the Manager VM so that you can use and test the yum update facilities
  • You can add extra virtual disks to the server and they will show up as local storage for local repositories
  • iscsi target modules are installed on the Manager VM so you can also test out iscsi if you want to
  • It is highly recommended that you first run the Manager VM with 4GB of RAM; after that you can, if you want, drop it to about 3GB
  • The servers can run in as little as 700MB-1GB
  • Use STATIC ip addresses for these VMs
  • Since this is build 365, the Oracle VM CLI (ssh) server is also installed in the Manager VM!
  • note : when creating the VM in VirtualBox, go into the network settings of the VM and make sure that your virtual network is associated with the correct physical network. These VMs were exported on a Linux server with the virtual networks bound to eth0 as a device. On a Mac or a Windows PC this is likely a different name, so to be safe just modify this before starting the VM

    The Manager VM is like the previous version 3.0. The VM starts and does an auto-login into Xwindows and there is a readme file that opens up in firefox and a login button that starts firefox to the local management port for Oracle VM.

    The server root password and ovs-agent password (to discover the server) is ovsroot.

    Thursday Jul 05, 2012


    (updated 3/18/13 to fix disknaming error) Oracle ASMLib on Linux has been a topic of discussion a number of times since it was released way back when in 2004. There is a lot of confusion around it and certainly a lot of misinformation out there for no good reason. Let me try to give a bit of history around Oracle ASMLib.

    Oracle ASMLib was introduced at the time Oracle released Oracle Database 10g R1. 10gR1 introduced a very cool, important new feature called Oracle ASM (Automatic Storage Management). A very simplistic description would be that this is a very sophisticated volume manager for Oracle data. Give your devices directly to the ASM instance and we manage the storage for you: clustered, highly available, redundant, performant, etc, etc... We recommend using Oracle ASM for all database deployments, single instance or clustered (RAC).

    The ASM instance manages the storage and every Oracle server process opens and operates on the storage devices like it would open and operate on regular datafiles or raw devices. So by default since 10gR1 up to today, we do not interact differently with ASM managed block devices than we did before with a datafile being mapped to a raw device. All of this is without ASMLib, so ignore that one for now. Standard Oracle on any platform that we support (Linux, Windows, Solaris, AIX, ...) does it the exact same way. You start an ASM instance, it handles storage management, all the database instances use and open that storage and read/write from/to it. There are no extra pieces of software needed, including on Linux. ASM is fully functional and self-contained without any other components.

    In order for the admin to provide a raw device to ASM or to the database, it has to have persistent device naming. If you booted up a server where a raw disk was named /dev/sdf and you gave it to ASM (or even just created a tablespace without ASM on that device with datafile '/dev/sdf') and the next time you boot up that device is now /dev/sdg, you end up with an error. Just like you can't just change datafile names, you can't change device filenames without telling the database, or ASM. Persistent device naming on Linux, especially back in those days, was, to say it bluntly, a nightmare. In fact there were a number of issues (dating back to 2004) :

    Correction to the above: ASM can handle device name changes across reboots. With the correct ASM_DISKSTRING in the init.ora, it will be able to find the disks even if they changed. However, part of device naming and device metadata on reboot is the correct ownership (oracle:dba ....). With ASMLib in place, this is not an issue and it will take care of ownership and permissions of the ASM disk devices.

  • Linux async IO wasn't pretty
  • persistent device naming including permissions (had to be owned by oracle and the dba group) was very, very difficult to manage
  • system resource usage in terms of open file descriptors

    So given the above, we tried to find a way to make this easier on the admins, in many ways similar to why we started working on OCFS a few years earlier -> how can we make life easier for the admins on Linux.

    A feature of Oracle ASM is the ability for third parties to write an extension using what's called ASMLib. It is possible for any third party OS or storage vendor to write a library using a specific Oracle defined interface that gets used by the ASM instance and by the database instance when available. This interface offered 2 components :

  • Define an IO interface - allow any IO to the devices to go through ASMLib
  • Define device discovery - implement an external way of discovering, labeling devices to provide to ASM and the Oracle database instance

    This is similar to a library that a number of companies have implemented over many years called libODM (Oracle Disk Manager). ODM was specified many years before we introduced ASM and allowed third party vendors to implement their own IO routines so that the database would use this library if installed and make use of the library open/read/write/close,.. routines instead of the standard OS interfaces. PolyServe back in the day used this to optimize their storage solution, Veritas used (and I believe still uses) this for their filesystem. It basically allowed, in particular, filesystem vendors to write libraries that could optimize access to their storage or filesystem. So ASMLib was not something new, it was basically based on the same model. You have libodm for just database access, you have libasm for asm/database access.

    Since this library interface existed, we decided to do a reference implementation on Linux. We wrote an ASMLib for Linux that could be used on any Linux platform and other vendors could see how this worked and potentially implement their own solution. As I mentioned earlier, ASMLib and ODMLib are libraries for third party extensions. ASMLib for Linux, since it was a reference implementation, implemented both interfaces: the storage discovery part and the IO part. There are 2 components :

  • Oracle ASMLib - the userspace library with config tools (a shared object and some scripts)
  • oracleasm.ko - a kernel module that implements the asm device for /dev/oracleasm/*

    The userspace library is a binary-only module since it links with and contains Oracle header files, but it is generic: we only have one asm library for the various Linux platforms. This library is opened by Oracle ASM and by Oracle database processes and this library interacts with the OS through the asm device (/dev/oracleasm). It can install on Oracle Linux, on SuSE SLES, on Red Hat RHEL,.. The library itself doesn't actually care much about the OS version; the kernel module and device do. The support tools are simple scripts that allow the admin to label devices and scan for disks and devices. This way you can say: create an ASM disk label foo on what is currently /dev/sdf... So if /dev/sdf disappears and next time is /dev/sdg, we just scan for the label foo and we discover it as /dev/sdg and life goes on without any worry. Also, when the database needs access to the device, we don't have to worry about file permissions or anything; it will be taken care of. So it's a convenience thing.

    Correction: the extra advantage with ASMLib here being the fact that it will take care of the file permissions and ownership of the device.
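
    To make the label idea concrete, here is a toy illustration. It is not how ASMLib is implemented and it does not use the ASMLib tools; it just shows why looking a disk up by label survives a device rename. The input format ("device label" pairs) and the function name find_by_label are made up for this sketch.

```shell
#!/bin/sh
# Toy illustration of label-based discovery (not ASMLib's actual code).
# Reads "device label" pairs on stdin and prints the device carrying LABEL.
find_by_label() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Before a reboot the disk labeled foo is /dev/sdf ...
printf '%s\n' '/dev/sde bar' '/dev/sdf foo' | find_by_label foo
# ... afterwards the same disk shows up as /dev/sdg, and the lookup still works.
printf '%s\n' '/dev/sde bar' '/dev/sdg foo' | find_by_label foo
```

    The first call prints /dev/sdf and the second /dev/sdg: whoever consumes the disk only ever asks for foo.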

    The kernel module oracleasm.ko is a Linux kernel module/device driver. It implements a device /dev/oracleasm/* and any and all IO goes through ASMLib -> /dev/oracleasm. This kernel module is obviously a very specific Oracle related device driver but it was released under the GPL v2 so anyone could easily build it for their Linux distribution kernels.

    Advantages for using ASMLib :

  • A good async IO interface for the database, the entire IO interface is based on an optimal ASYNC model for performance
  • A single file descriptor per Oracle process, not one per device or datafile per process reducing # of open filehandles overhead
  • Device scanning and labeling built-in so you do not have to worry about messing with udev or devlabel, permissions or the likes which can be very complex and error prone.

    Just like with OCFS and OCFS2, each kernel version (major or minor) has to get a new version of the device drivers. We started out building the oracleasm kernel module rpms for many distributions, SLES (in fact in the early days still even for this thing called United Linux) and RHEL. The driver didn't make sense to get pushed into upstream Linux because it's unique and specific to the Oracle database.

    As it takes a huge effort in terms of build infrastructure, QA and release management to build kernel modules for every architecture, every Linux distribution and every major and minor version, we worked with the vendors to get them to add this tiny kernel module (a 60k source code file) to their infrastructure. The folks at SuSE understood this was good for them and their customers and us and added it to SLES. So every build coming from SuSE for SLES contains the oracleasm.ko module. We weren't as successful with other vendors, so for quite some time we continued to build it for RHEL and of course, as we introduced Oracle Linux at the end of 2006, also for Oracle Linux. With Oracle Linux it became easy for us because we just added the code to our build system, and as we churned out Oracle Linux kernels, whether it was for a public release or for customers that needed a one off fix where they also used asmlib, we didn't have to do any extra work; it was just all nicely integrated.

    With the introduction of Oracle Linux's Unbreakable Enterprise Kernel and our interest in being able to exploit ASMLib more, we started working on a very exciting project called Data Integrity. Oracle (Martin Petersen in particular) worked for many years with the T10 standards committee and storage vendors and implemented Linux kernel support for DIF/DIX, data protection in the Linux kernel (note to those that wonder: yes, it's all in mainline Linux and under the GPL). This basically gave us all the features in the Linux kernel to checksum a data block and send it to the storage adapter, which can then validate that block and checksum in firmware before it sends it over the wire to the storage array, which can then do another check before passing it to the actual disk, which does a final validation before writing the block to the physical media. So what was missing was the ability for a userspace application (read: Oracle RDBMS) to write a block which then has a checksum and validation all the way down to the disk: application to disk.

    Because we have ASMLib we had an entry into the Linux kernel and Martin added support in ASMLib (kernel driver + userspace) for this functionality. Now, this is all based on relatively current Linux kernels, the oracleasm kernel module depends on the main kernel to have support for it so we can make use of it. Thanks to UEK and us having the ability to ship a more modern, current version of the Linux kernel we were able to introduce this feature into ASMLib for Linux from Oracle. This combined with the fact that we build the asm kernel module when we build every single UEK kernel allowed us to continue improving ASMLib and provide it to our customers.

    So today, we (Oracle) provide Oracle ASMLib for Oracle Linux and in particular on the Unbreakable Enterprise Kernel. We did the build/testing/delivery of ASMLib for RHEL until RHEL5, but with RHEL6 we decided that it was too much effort for us to also maintain all the build and test environments for RHEL. We also did not have the ability to use the latest kernel features to introduce the Data Integrity features, and we didn't want to end up with multiple versions of asmlib as maintained by us. SuSE SLES still builds and comes with the oracleasm module and they do all the work, and RHAT is certainly welcome to do the same. They don't have to rebuild the userspace library, it's really about the kernel module.

    And finally to re-iterate a few important things :

  • Oracle ASM does not in any way require ASMLib to function completely. ASMLib is a small set of extensions, in particular to make device management easier, but there are no extra features exposed through Oracle ASM with ASMLib enabled or disabled. Often customers confuse ASMLib with ASM. Again, ASM exists on every Oracle supported OS and on every supported Linux OS (SLES, RHEL, OL) without ASMLib
  • Oracle ASMLib userspace is available from OTN and the kernel module is shipped along with OL/UEK for every build and by SuSE for SLES for every one of their builds
  • ASMLib kernel module was built by us for RHEL4 and RHEL5 but we do not build it for RHEL6, nor for the OL6 RHCK kernel. Only for UEK
  • ASMLib for Linux is/was a reference implementation for any third party vendor to be able to offer, if they want to, their own version for their own OS or storage
  • ASMLib as provided by Oracle for Linux continues to be enhanced and evolve and for the kernel module we use UEK as the base OS kernel

    hope this helps.

    Wednesday Jul 04, 2012

    What's up with OCFS2?

    On Linux there are many filesystem choices and even from Oracle we provide a number of filesystems, all with their own advantages and use cases. Customers often confuse ACFS with OCFS or OCFS2 which then causes assumptions to be made such as one replacing the other etc... I thought it would be good to write up a summary of how OCFS2 got to where it is, what we're up to still, how it is different from other options and how this really is a cool native Linux cluster filesystem that we worked on for many years and is still widely used.

    Work on a cluster filesystem at Oracle started many years ago, in the early 2000's, when the Oracle Database Cluster development team wrote a cluster filesystem for Windows that was primarily focused on providing an alternative to raw disk devices and helping customers with the deployment of Oracle Real Application Clusters (RAC). Oracle RAC is a cluster technology that lets us make a cluster of Oracle Database servers look like one big database. The RDBMS runs on many nodes and they all work on the same data. It's a Shared Disk database design. There are many advantages to doing this but I will not go into detail as that is not the purpose of my write up. Suffice it to say that Oracle RAC expects all the database data to be visible in a consistent, coherent way, across all the nodes in the cluster. To do that, there were/are a few options : 1) use raw disk devices that are shared, through SCSI, FC, or iSCSI 2) use a network filesystem (NFS) 3) use a cluster filesystem (CFS) which basically gives you a filesystem that's coherent across all nodes using shared disks. It is sort of (but not quite) combining option 1 and 2 except that you don't do network access to the files; the files are effectively locally visible as if it was a local filesystem.

    So OCFS (Oracle Cluster FileSystem) on Windows was born. Since Linux was becoming a very important and popular platform, we decided that we would also make this available on Linux and thus the porting of OCFS/Windows started. The first version of OCFS was really primarily focused on replacing the use of raw devices with a simple filesystem that lets you create files and provide direct IO to these files to get basically native raw disk performance. The filesystem was not designed to be fully POSIX compliant and it did not have anywhere near good/decent performance for regular file create/delete/access operations. Cache coherency was easy since it was basically always direct IO down to the disk device, and this ensured that any time one issues a write() command it would go directly down to the disk, and not return until the write() was completed. Same for read(): any sort of read from a datafile would be a read() operation that went all the way to disk and returned. We did not cache any data when it came down to Oracle data files.

    So OCFS worked well for that, but since it did not have much of a normal filesystem feel, it was not something that could be submitted to the kernel mailing list for inclusion into Linux as another native Linux filesystem (setting aside the Windows porting code ...). It did its job well: it was very easy to configure, node membership was simple, and locking was disk based (so very slow, but it existed). You could create regular files and do regular filesystem operations to a certain extent, but anything that was not database data file related was just not very useful in general. Logfiles OK; standard filesystem use, not so much. Up to this point, all the work was done at Oracle, by Oracle developers.

    Once OCFS (1) had been out for a while and saw a lot of use in the database RAC world, many customers wanted to do more and were asking for features that you'd expect in a normal native filesystem, a real "general purpose cluster filesystem". So the team sat down and basically started from scratch to implement what's now known as OCFS2 (Oracle Cluster FileSystem release 2). Some basic criteria were :

  • Design it with a real Distributed Lock Manager and use the network for lock negotiation instead of the disk
  • Make it a Linux native filesystem instead of a native shim layer and a portable core
  • Support standard POSIX compliance and be fully cache coherent with all operations
  • Support all the filesystem features Linux offers (ACLs, extended attributes, quotas, sparse files, ...)
  • Be modern: support large files, 32/64-bit, journaling, data ordered journaling, and be endian neutral so that we can mount across endianness and architectures, ...

    Needless to say, this was a huge development effort that took many years to complete. A few big milestones happened along the way...

  • OCFS2 was developed in the open. We did not have a private tree that we worked on without external code review from the Linux filesystem maintainers. Great folks like Christoph Hellwig reviewed the code regularly to make sure we were not doing anything out of line, and we submitted the code for review on lkml a number of times to see if we were getting close to inclusion in the mainline kernel. This development model is standard practice for anyone who wants to write code that goes into the kernel and have any chance of doing so without a complete rewrite or... shall I say, a flamefest when submitted. It saved us a tremendous amount of time by not having to re-fit code into a Linus-acceptable state. Some other filesystems that were trying to get into the kernel without following an open development model had a much harder time and drew a lot harsher criticism.
  • In March 2006, when Linus released 2.6.16, OCFS2 officially became part of the mainline Linux kernel tree as one of the many filesystems. It had been accepted a little earlier, in the release candidates, but 2.6.16 made it official. It was the first cluster filesystem to make it into the kernel tree. Our hope was that it would then get picked up by the distribution vendors to make it easy for everyone to have access to a CFS. Today the source code for OCFS2 is approximately 85000 lines of code.
  • We made OCFS2 production with full support for customers that ran Oracle database on Linux, no extra or separate support contract needed. OCFS2 1.0.0 started being built for RHEL4 for x86, x86-64, ppc, s390x and ia64; builds for RHEL5 started with OCFS2 1.2.
  • SuSE was very interested in high availability and clustering and decided to build and include OCFS2 with SLES9 for their customers and was, next to Oracle, the main contributor to the filesystem for both new features and bug fixes.
  • Source code was always available, even prior to inclusion into mainline. As of 2.6.16, the source code is simply part of the Linux kernel download, which it still is today. So the latest OCFS2 code is always in the upstream mainline Linux kernel.
  • OCFS2 is the cluster filesystem used in Oracle VM 2 and Oracle VM 3 as the virtual disk repository filesystem.
  • Since the filesystem is in the Linux kernel, it's released under the GPL v2.
  • The release model has always been that new feature development happened in the mainline kernel and we then built consistent, well-tested snapshots with version numbers: 1.2, 1.4, 1.6, 1.8. These releases were effectively just snapshots in time that were tested for stability and release quality.

    OCFS2 is very easy to use: there's a simple text file that contains the node information (hostname, node number, cluster name) and a file that contains the cluster heartbeat timeouts. It is very small and very efficient. As Sunil Mushran wrote in the manual :

  • OCFS2 is an efficient, easily configured, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system.
  • Here is a list of some of the important features that are included :

  • Variable Block and Cluster sizes: Supports block sizes ranging from 512 bytes to 4 KB and cluster sizes ranging from 4 KB to 1 MB (increments in power of 2).
  • Extent-based Allocations: Tracks the allocated space in ranges of clusters making it especially efficient for storing very large files.
  • Optimized Allocations: Supports sparse files, inline-data, unwritten extents, hole punching and allocation reservation for higher performance and efficient storage.
  • File Cloning/snapshots: REFLINK is a feature which introduces copy-on-write clones of files in a cluster coherent way.
  • Indexed Directories: Allows efficient access to millions of objects in a directory.
  • Metadata Checksums: Detects silent corruption in inodes and directories.
  • Extended Attributes: Supports attaching an unlimited number of name:value pairs to the file system objects like regular files, directories, symbolic links, etc.
  • Advanced Security: Supports POSIX ACLs and SELinux in addition to the traditional file access permission model.
  • Quotas: Supports user and group quotas.
  • Journaling: Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.
  • Endian and Architecture neutral: Supports a cluster of nodes with mixed architectures. Allows concurrent mounts on nodes running 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
  • In-built Cluster-stack with DLM: Includes an easy to configure, in-kernel cluster-stack with a distributed lock manager.
  • Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os: Supports all modes of I/Os for maximum flexibility and performance.
  • Comprehensive Tools Support: Provides a familiar EXT3-style tool-set that uses similar parameters for ease-of-use.

    The filesystem was distributed for Linux distributions in separate RPM form, and this had to be built for every single kernel errata release or every updated kernel provided by the vendor. We provided builds from Oracle for Oracle Linux and all kernels released by Oracle, and for Red Hat Enterprise Linux. SuSE provided the modules directly for every kernel they shipped. With the introduction of the Unbreakable Enterprise Kernel for Oracle Linux and our interest in reducing the overhead of building filesystem modules for every minor release, we decided to make OCFS2 available as part of UEK. There was no more need for separate kernel modules; everything was built in and a kernel upgrade automatically updated the filesystem, as it should. UEK allowed us to avoid backporting new upstream filesystem code into an older kernel version. Backporting features into older versions introduces risk and requires extra testing because the code is basically partially rewritten. The UEK model works really well for continuing to provide OCFS2 without that extra overhead.
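
    To make the node configuration file mentioned earlier (hostname, node number, cluster name) a bit more concrete, here is a minimal Python sketch that renders such a file. The stanza layout follows what I would expect an o2cb-style cluster.conf to look like, and the IP addresses, the port 7777 and the names render_cluster_conf and Node are assumptions for illustration only; check the OCFS2 manual for the authoritative format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    name: str             # hostname of the node
    number: int           # unique node number within the cluster
    ip_address: str       # interconnect address (hypothetical values in the example)
    ip_port: int = 7777   # assumed port; adjust to your own setup


def render_cluster_conf(cluster_name: str, nodes: list[Node]) -> str:
    """Render a cluster stanza followed by one node stanza per node.

    The layout is an assumption modelled on an o2cb-style cluster.conf.
    """
    numbers = [n.number for n in nodes]
    if len(set(numbers)) != len(numbers):
        raise ValueError("node numbers must be unique")

    lines = [
        "cluster:",
        f"\tnode_count = {len(nodes)}",
        f"\tname = {cluster_name}",
        "",
    ]
    # Emit nodes in node-number order so the file is stable across runs.
    for node in sorted(nodes, key=lambda n: n.number):
        lines += [
            "node:",
            f"\tip_port = {node.ip_port}",
            f"\tip_address = {node.ip_address}",
            f"\tnumber = {node.number}",
            f"\tname = {node.name}",
            f"\tcluster = {cluster_name}",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_cluster_conf("democluster", [
        Node("node1", 0, "192.168.1.1"),
        Node("node2", 1, "192.168.1.2"),
    ]))
```

    The point of the sketch is only that the whole cluster definition fits in a few lines of plain text per node, which is what makes the setup so easy.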

    Because the RHEL kernel did not contain OCFS2 as a kernel module (it is in the source tree but the vendor does not build it in kernel module form), we stopped adding the extra packages to Oracle Linux with its RHEL compatible kernel, and for RHEL. Oracle Linux customers/users obviously get OCFS2 included as part of the Unbreakable Enterprise Kernel, SuSE customers get it distributed by SuSE with SLES, and Red Hat can decide to distribute OCFS2 to their customers if they choose to, as it's just a matter of compiling the module and making it available.

    OCFS2 today, in the mainline kernel, is pretty much feature complete in terms of integration with every filesystem feature Linux offers, and it is still actively maintained, with Joel Becker being the primary maintainer. Since we use OCFS2 as part of Oracle VM, we continue to look at interesting new functionality to add (REFLINK was a good example), and as such we continue to enhance the filesystem where it makes sense. Bugfixes and any other code going into the mainline Linux kernel that affects filesystems automatically also update OCFS2. So it's in the kernel and actively maintained, but there is not a lot of new development happening at this time. We continue to fully support OCFS2 as part of Oracle Linux and the Unbreakable Enterprise Kernel. Other vendors make their own decisions on support, as it's really a Linux cluster filesystem now more than something that we provide to customers. It really just is part of Linux like EXT3 or BTRFS etc.; the OS distribution vendors decide.

    Do not confuse OCFS2 with ACFS (ASM Cluster Filesystem), also known as Oracle Cloud Filesystem. ACFS is a filesystem that's provided by Oracle on various OS platforms and really integrates into Oracle ASM (Automatic Storage Management). It's a very powerful cluster filesystem, but it's not distributed as part of the operating system; it's distributed with the Oracle Database product and installs with and lives inside Oracle ASM. ACFS obviously is fully supported on Linux (Oracle Linux, Red Hat Enterprise Linux), but OCFS2, independently, as a native Linux filesystem, is and continues to be supported as well. ACFS is very much tied into the Oracle RDBMS; OCFS2 is just a standard native Linux filesystem with no ties into Oracle products. Customers running the Oracle database and ASM really should consider using ACFS, as it also provides storage/clustered volume management. Customers wanting a simple, easy to use, generic Linux cluster filesystem should consider using OCFS2.

    To learn more about OCFS2 in detail, you can find good documentation in the Documentation area, or get the latest mainline kernel and read the source.

    One final, unrelated note: since I am not always able to publicly answer or respond to comments, I do not want to selectively publish comments from readers. Sometimes I forget to publish comments, sometimes I publish them, and sometimes I would publish them but for some reason cannot publicly comment on them, and then it becomes a very one-sided stream. So for now I am not going to publish comments from anyone, to be fair to all sides. You are always welcome to email me and I will do my best to respond to technical questions; questions about strategy or direction are sometimes not possible to answer, for obvious reasons.

    Friday Jun 29, 2012

    My own personal use of Oracle Linux

    It always is easier to explain something with examples... Many people still don't seem to understand some of the convenient things around using Oracle Linux and since I personally (surprise!) use it at home, let me give you an idea.

    I have quite a few servers at home and I also have 2 hosted servers with a hosting provider. The servers at home I use mostly to play with random Linux related things, or with Oracle VM, or just to try out various new Oracle products to learn more. I like the technology; it's like a hobby really. To have a good installation experience, use an officially certified Linux distribution and not waste time trying to find the right libraries, I, of course, use Oracle Linux. Now, I can get a copy of Oracle Linux for free (even if I was not working for Oracle) and I can use it on as many servers as I like at home (or at my company if I worked elsewhere) for testing, development and production. I just download the version(s) I want and off I go.

    Now, I also have the right (and not because I am an employee) to take those images, put them on my own server and give them to someone else. In fact, I just recently set up my own mirror on my own hosted server. I don't have to remove oracle-logos, I don't have to rebuild the ISO images, I don't have to recompile anything; I can just put the whole binary distribution on my own server without a contract. Perfectly free to do so. Of course the source code of all of this is there too. I have a copy of the UEK code at home, just cloned from the public git repository, and with it comes the entire changelog: checkins, merges from Linus's tree, a complete overview of everything that got changed from kernel to kernel, from patch to patch, errata to errata. No obfuscating, no tarballs and spending time with diff, no reading bug reports to find out what changed (seems silly to me).

    Some of my servers are on the external network and I need to be current with security errata. But guess what, no problem: my servers are hooked up to the yum repository, which is open, free, and completely up to date, in a consistent, reliable way, with any errata, security or bugfix. So I have nothing to worry about. Also not because I am an employee. Anyone can. And with this, I also can, and have, set up my own mirror site that hosts these RPMs, both binary and source RPMs, because I am free to get them and distribute them. I am quite capable of supporting my servers on my own, so I don't need to rely on the support organization and I don't need a support subscription :-). So I don't need to pay. Neither would you, at least not with Oracle Linux.

    Another cool thing. The hosted servers came (unfortunately) with CentOS installed. While CentOS works just fine as is, I prefer to be current with my security errata (reliably) and I prefer to maintain just one yum repository instead of 2, so I converted them over to Oracle Linux as well (in place) and they happily receive and use the exact same RPMs. Since Oracle Linux is exactly the same as RHEL from a user/application point of view, including files like /etc/redhat-release and with no changes from .el. to .centos., I know I have nothing to worry about when installing one of the RHEL applications. So, OL everywhere makes my life a lot easier and why not...

    Next! I run Oracle VM and I have -tons- of VMs on my machines; in some cases, on my big WOPR box, I have 15-20 VMs running. Well, no problem, OL is free and I don't have to worry about counting the number of VMs, whether it's 1, or 4, or more than 10 ... like some other alternatives started doing...

    And finally :) I like to try out new stuff, not 3 year old stuff. So with UEK2 as part of OL6 (and 6.3 in particular) I can play with a 3.0.x based kernel, and it just installs and runs perfectly clean with OL6. So I get quite current stuff in an environment that I know works, with no need to toy around with an unsupported pre-alpha upstream distribution with libraries and versions that are not compatible with production software (I have nothing against Ubuntu or Fedora or openSUSE... they are just not what I can rely on or use for what I need, and I don't need a desktop).

    Pretty compelling, I say... And again, it doesn't matter that I work for Oracle; if I was working elsewhere, or not at all, all of the above would still apply. Student, teacher, developer, whatever. Contrast this with $349 per year for 2 sockets, one guest and self-support to even just get the software bits.

    Wednesday Jun 27, 2012

    Oracle Linux 6 update 3

    Oracle Linux 6.3 channels are now available online

  • repositories. Both base channels and latest channels are available (for free for everyone)
  • repositories. Behind our customer portal but effectively the same content.
  • Source RPMs (.srpm) are being uploaded to

    OL6.3 contains UEK2 kernel-uek-2.6.39-200.24.1. The source rpm is in the above location and our public GIT repository will be synced up shortly as well. Unlike some others, of course: complete source, complete changelog, complete checkin history, both mainline and our own, all available. No need to go assemble things from a website manually.

    Another cool thing coming up is a boot iso for OL6.3 that boots up uek (2.6.39-200.24.1) as install kernel and uses btrfs as the default filesystem for installation. So latest and greatest direct access to btrfs, a modern well-tested, current kernel, freely available. Enjoy.

    Since it takes a few days for our ISOs to be on, I mirrored them on my own server :

    Sunday Jun 24, 2012

    Oracle VM 3.1.1 build 365 released

    A few days ago we released a patch update for Oracle VM 3.1.1 (build 365).

    Oracle VM Manager 3.1.1 Build 365 is now available from My Oracle Support patch ID 14227416

    Oracle VM Server 3.1.1 errata updates are, as usual, released on ULN in the ovm3_3.1.1_x86_64_patch channel.

    Just a reminder: when we publish errata for Oracle VM, the notifications are sent through the oraclevm-errata mailing list. You can sign up here.

    Some of the bugfixes in 3.1.1 :

    14054162 - Removes unnecessary locks when creating VNICs in a multi-threaded operation.
    14111234 - Fixes the issue when discovering a virtual machine that has disks in an undiscovered repository or has undiscovered physical disks.
    14054133 - Fixes an "object not found" bug where vdisks are left stale in certain multi-threaded operations.
    14176607 - Fixes the issue where Oracle VM Manager would hang after a restart due to various tasks running jobs in the global context.
    14136410 - Fixes the stale lock issue on a multi-threaded server where an "object not found" error happens in some rare situations.
    14186058 - Fixes the issue where Oracle VM Manager fails to discover the server or start the server after the server hardware configuration (i.e. BIOS) was modified.
    14198734 - Fixes the issue where HTTP cannot be disabled.
    14065401 - Fixes Oracle VM Manager UI time-out issue where the default value was not long enough for storage repository creation.
    14163755 - Fixes the issue where, when migrating a virtual machine, the list of target servers (and "other servers") was not ordered by name.
    14163762 - Fixes the size of the "Edit Vlan Group" window to display all information correctly.
    14197783 - Fixes the issue where the navigation tree (servers) was not ordered by name.

    I strongly suggest that everyone use this latest build and also update the server to the latest version.
    Have at it.

    Sunday Jun 03, 2012

    Oracle VM RAC template - what it took

    In my previous posting I introduced the latest Oracle Real Application Cluster / Oracle VM template. I mentioned how easy it is to deploy a complete Oracle RAC cluster with Oracle VM. In fact, you don't need any prior knowledge at all to get a complete production-ready setup going.

    Here is an example... I built a 4 node RAC cluster, completely configured in just over 40 minutes, starting from importing the template into Oracle VM and creating the VMs, to a fully up and running Oracle RAC. And what was needed? One text file with some hostnames and IP addresses, and the deploycluster script.

    The setup is a 4 node cluster where each VM has 8GB of RAM and 4 vCPUs. The shared ASM storage in this case is 100GB, 5 x 20GB volumes. The VM names are racovm.0-racovm.3. The deploycluster script starts the VMs, verifies the configuration and sends the database cluster configuration info through Oracle VM Manager to the 4 node VMs. Once the VMs are up and running, the first VM starts the actual Oracle RAC setup inside and talks to the 3 other VMs. I did not log into any VM until after everything was completed. In fact, I connected to the database remotely before logging in at all.
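
    The VM names follow a simple prefix.number pattern, so the argument list for a cluster of any size can be generated rather than typed. Here is a small Python sketch of that idea. The flags (-u, -H, --vms, --netconfig) and the DEPLOYCLUSTER_MGR_PASSWORD environment variable are the ones that show up in this post; the helper names and the script path placeholder are hypothetical, and nothing here actually runs the tool.

```python
import os


def vm_names(prefix, count):
    """Return names like racovm.0 .. racovm.(count-1)."""
    return [f"{prefix}.{i}" for i in range(count)]


def build_deploy_command(script, user, manager_host, vms, netconfig):
    """Assemble the argument list; 'script' is a placeholder the caller supplies."""
    return [
        script,
        "-u", user,
        "-H", manager_host,
        "--vms", ",".join(vms),
        "--netconfig", netconfig,
    ]


def build_env(password):
    """Copy the current environment and add the manager password variable."""
    env = dict(os.environ)
    env["DEPLOYCLUSTER_MGR_PASSWORD"] = password
    return env


if __name__ == "__main__":
    # "./path-to-deploy-script" is a hypothetical placeholder, not a real file name.
    cmd = build_deploy_command(
        "./path-to-deploy-script", "admin", "localhost",
        vm_names("racovm", 4), "./netconfig.ini",
    )
    print(" ".join(cmd))
```

    Passing the resulting list and environment to something like subprocess.run would be the obvious next step, which I left out on purpose since it depends on where the script lives on your manager node.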

    # ./ -u admin -H localhost --vms racovm.0,racovm.1,racovm.2,racovm.3 --netconfig ./netconfig.ini

    Oracle RAC OneCommand (v1.1.0) for Oracle VM - deploy cluster - (c) 2011-2012 Oracle Corporation
    (com: 26700:v1.1.0, lib: 126247:v1.1.0, var: 1100:v1.1.0) - v2.4.3 - (x86_64)
    Invoked as root at Sat Jun 2 17:31:29 2012 (size: 37500, mtime: Wed May 16 00:13:19 2012)
    Using: ./ -u admin -H localhost --vms racovm.0,racovm.1,racovm.2,racovm.3 --netconfig ./netconfig.ini

    INFO: Login password to Oracle VM Manager not supplied on command line or environment (DEPLOYCLUSTER_MGR_PASSWORD), prompting...
    Password:
    INFO: Attempting to connect to Oracle VM Manager...
    INFO: Oracle VM Client ( protocol (1.8) CONNECTED (tcp) to Oracle VM Manager ( protocol (1.8) IP ( UUID (0004fb0000010000cbce8a3181569a3e)
    INFO: Inspecting /root/rac/deploycluster/netconfig.ini for number of nodes defined...
    INFO: Detected 4 nodes in: /root/rac/deploycluster/netconfig.ini
    INFO: Located a total of (4) VMs; 4 VMs with a simple name of: ['racovm.0', 'racovm.1', 'racovm.2', 'racovm.3']
    INFO: Verifying all (4) VMs are in Running state
    INFO: VM with a simple name of "racovm.0" is in a Stopped state, attempting to start it... OK.
    INFO: VM with a simple name of "racovm.1" is in a Stopped state, attempting to start it... OK.
    INFO: VM with a simple name of "racovm.2" is in a Stopped state, attempting to start it... OK.
    INFO: VM with a simple name of "racovm.3" is in a Stopped state, attempting to start it... OK.
    INFO: Detected that all (4) VMs specified on command have (5) common shared disks between them (ASM_MIN_DISKS=5)
    INFO: The (4) VMs passed basic sanity checks and in Running state, sending cluster details as follows:
        netconfig.ini (Network setup): /root/rac/deploycluster/netconfig.ini
        buildcluster: yes
    INFO: Starting to send cluster details to all (4) VM(s).......
    INFO: Sending to VM with a simple name of "racovm.0"....
    INFO: Sending to VM with a simple name of "racovm.1".....
    INFO: Sending to VM with a simple name of "racovm.2".....
    INFO: Sending to VM with a simple name of "racovm.3"......
    INFO: Cluster details sent to (4) VMs...
    Check log (default location /u01/racovm/buildcluster.log) on build VM (racovm.0)...
    INFO: completed successfully at 17:32:02 in 33.2 seconds (00m:33s)
    Logfile at: /root/rac/deploycluster/deploycluster2.log

    My netconfig.ini :
    # Node specific information
    # Common data
    # Device used to transfer network information to second node
    # in interview mode
    # 11gR2 specific data

    Last few lines of the in-VM log file :

    2012-06-02 14:01:40:[clusterstate:Time :db11rac1] Completed successfully in 2 seconds (0h:00m:02s)
    2012-06-02 14:01:40:[buildcluster:Done :db11rac1] Build 11gR2 RAC Cluster
    2012-06-02 14:01:40:[buildcluster:Time :db11rac1] Completed successfully in 1779 seconds

    From start_vm to completely configured : 29m:39s. The other 10 minutes went to importing the template and creating the 4 VMs from it, along with the shared storage configuration.
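
    The 29m:39s figure is just the 1779 seconds from the last log line converted to minutes and seconds. As a quick check, here is a small Python sketch that pulls the number out of a line in that format and converts it. The function names are made up for this example, and the regular expression only assumes the "Completed successfully in N seconds" wording seen above.

```python
import re

# Matches the elapsed-time wording used by the build log lines shown above.
_DURATION = re.compile(r"Completed successfully in (\d+) seconds")


def elapsed_seconds(log_line):
    """Extract the number of seconds from a 'Completed successfully' line."""
    match = _DURATION.search(log_line)
    if match is None:
        raise ValueError("no elapsed time found in log line")
    return int(match.group(1))


def format_duration(seconds):
    """Convert seconds into the NNm:SSs form used in this post."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m:{secs:02d}s"


if __name__ == "__main__":
    line = ("2012-06-02 14:01:40:[buildcluster:Time :db11rac1] "
            "Completed successfully in 1779 seconds")
    print(format_duration(elapsed_seconds(line)))  # prints 29m:39s
```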

    This consists of a complete Oracle 11gR2 RAC database with ASM, CRS and the RDBMS up and running on all 4 nodes. Simply connect and use. Production ready.

    Oracle on Oracle.

    Tuesday May 29, 2012

    New Oracle VM RAC template with support for Oracle VM 3 built-in

    The RAC team did it again (thanks Saar!) - another awesome set of Oracle VM templates published and uploaded to My Oracle Support.

    You can find the main page here.

    What's special about the latest version of DeployCluster is that it integrates tightly with Oracle VM 3 manager. It basically is an Oracle VM frontend that helps start VMs and pass arguments down automatically, so there is absolutely no need to log into the Oracle VM servers or the guests. Once it completes, you have an entire Oracle RAC database setup ready to go.

    Here's a short summary of the steps :

  • Set up an Oracle VM 3 server pool
  • Download the Oracle VM RAC template from
  • Import the template into Oracle VM using Oracle VM Manager repository -> import
  • Create a public and private network in Oracle VM Manager in the network tab
  • Configure the template with the right public and private virtual networks
  • Create a set of shared disks (physical or virtual) to assign to the VMs you want to create (for ASM, at least 5)
  • Clone a set of VMs from the template (as many RAC nodes as you plan to configure)
  • With Oracle VM 3.1 you can clone with a count, so a single clone command for, say, 8 VMs is easy.
  • Assign the shared devices/disks to the cloned VMs
  • Create a netconfig.ini file on your manager node or a client where you plan to run DeployCluster
  • This little text file just contains the IP addresses, hostnames etc. for your cluster. It is a very simple, small text file.
  • Run DeployCluster with the VM names as arguments
  • Done.

    At this point, the tool will connect to Oracle VM Manager, start the VMs and configure each one :

  • Configure the OS (Oracle Linux)
  • Configure the disks with ASM
  • Configure the clusterware (CRS)
  • Configure ASM
  • Create database instances on each node.

    Now you are ready to log in and use your x-node database cluster. No need to download various products from various websites, click on trial licenses for the OS, or go to a Virtual Machine store with sample and test versions only; this is production ready and supported.

    Software. Complete.

    example netconfig.ini :

    # Node specific information
    # Common data
    # Device used to transfer network information to second node
    # in interview mode
    # 11gR2 specific data

    Saturday May 26, 2012

    Example of transcendent memory and Oracle databases

    I did some tests with tmem using an Oracle Database 11gR2 and swingbench setup. You can see a graph below. Let me try to explain what this means.

    Using Oracle VM 3 with some changes to booting dom0 (additional parameters at the boot prompt) and with UEK2 as a guest kernel in my VM, I can make use of autoballooning. What you see in the graph below is very simple : it's a timeline (horizontal) of how much actual memory the VM is using/needing (vertical). I created 3 16GB VMs that I wanted to run on a 36GB Oracle VM server (so more VM memory than we have physically available in the server). When I start a 16GB VM, the Linux guest immediately balloons down to about 700MB in size. It automatically releases pages to the hypervisor that are not needed; that memory would otherwise just be free/idle. Then I start a database with a 4GB SGA and, as you can see, the second I start the DB, the VM grows to just over 4GB in size. Then I start swingbench runs with 25, 50, 100, 500 and 1000 users. Every time such a run starts, you can see memory use/grab go up; when swingbench stops, it goes down. In the end, after the last run with 1000 users, I also shut down the database instance and memory drops all the way back to 700MB.

    I ran 3 guests with swingbench and the database in each and through dynamic ballooning and the guests cooperatively working with the hypervisor, I was able to start all 3 16GB VMs and there was no performance impact. When there was free memory in the hypervisor, cleancache kicked in and guests made use of those pages, including deduping and compression of the pages.
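
    To put numbers on the overcommit, here is a small back-of-the-envelope sketch in Python using only the figures from this post: 3 VMs configured at 16GB each on a 36GB server, roughly 700MB per idle guest, and roughly 4GB per guest once the database is up (the post says "just over 4GB", so 4GB is a lower bound I picked for the arithmetic). The function names are made up for this example, and peak usage during the swingbench runs is not modelled because I did not list those numbers.

```python
MB_PER_GB = 1024


def overcommit_ratio(configured_mb, host_mb):
    """Configured guest memory divided by physical host memory."""
    return sum(configured_mb) / host_mb


def fits(footprints_mb, host_mb):
    """True if the combined actual footprints fit into physical memory."""
    return sum(footprints_mb) <= host_mb


host = 36 * MB_PER_GB
configured = [16 * MB_PER_GB] * 3   # what the 3 VMs are sized at
idle = [700] * 3                    # ballooned-down guests, about 700MB each
with_db = [4 * MB_PER_GB] * 3       # lower bound: 4GB SGA per guest

if __name__ == "__main__":
    print(f"overcommit ratio: {overcommit_ratio(configured, host):.2f}")
    print("configured sizes fit:", fits(configured, host))
    print("idle footprints fit:", fits(idle, host))
    print("footprints with DB started fit:", fits(with_db, host))
```

    The configured sizes add up to 48GB and do not fit in 36GB, while the actual footprints fit with plenty of room, which is the whole point of letting the guests balloon down on their own.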

    If you want to play with this yourself, you can run this command in dom0 to get decent statistics out of the setup : xm tmem-list --long --all | /usr/sbin/xen-tmem-list-parse. It will show you the compression ratio, the cache ratios etc. I used those statistics to generate the chart below. This is yet another example of how, when one can control both the hypervisor and the guest operating system and have things work together, you get better and more interesting results than with just black box VM management.



    Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

    You can follow him on Twitter at @wimcoekaerts

