Tuesday Jun 11, 2013

ovm_utils 0.6.5

Finally found some time to play with ovm_utils again and added another little tool to the package.

ovm_utils is a collection of little tools I wrote over the last year or two. They can help make command line use a little easier. Of course we have since introduced a real ovm_cli in Oracle VM Manager 3.1, which is officially part of the product and officially supported. ovm_utils is provided as-is, for fun. If you find the tools useful, great, if not, oh well :-)

ovm_logger (there's also a man page as part of the utilities man/man8/...) is a little tool that you can run as a daemon or just as a log dump tool. Oracle VM Manager runs most of its tasks as jobs and handles most responses as events, so we have a joblog and an eventlog in the Oracle VM Manager database. When an action occurs from the UI, or if an error gets reported from an agent, these things create jobs and events. If you run ovm_logger with -d, it will just start up, open the joblog and eventlog and dump the history to stdout, complete with the timestamp of when each entry occurred. You probably want to redirect that output to a file because it can be a lot of data.

If you run ovm_logger by itself (without -d), it starts logging events and jobs as of the time you start the tool. Any new job or event that occurs from then on will be displayed until you stop the tool, either by killing it or with ctrl-c.

Examples :

./ovm_logger -u admin -p MyPassword -h localhost -X -d > /tmp/logoutput

./ovm_logger -u admin -p MyPassword -h localhost -X

# ./ovm_logger -u admin -p Manager1 -h localhost -X 
Oracle VM Log utility 0.6.4.
Connecting with a secure connection.
Tue Jun 11 03:48:34 PDT 2013  Oracle VM Log
Tue Jun 11 03:48:34 PDT 2013  Oracle VM Manager Version :
Tue Jun 11 03:48:34 PDT 2013  Oracle VM Manager IP      :
Tue Jun 11 03:48:34 PDT 2013  Oracle VM Manager UUID    : 0004fb0000010000b66b471827b0b09d
Tue Jun 11 03:49:04 PDT 2013  Job - Rediscover Server wcoekaer-srv1
Tue Jun 11 03:49:29 PDT 2013  Job - Refresh File Server srv4nfs
Tue Jun 11 03:49:39 PDT 2013  Job - Start Virtual Machine ol6u3apitest
Tue Jun 11 03:49:54 PDT 2013  Event - Job Aborted
Tue Jun 11 03:49:54 PDT 2013  (06/11/2013 03:49:51:970 AM)
Due to Abort by user: admin
Tue Jun 11 03:49:54 PDT 2013  Job - Discover Server thisonedoesntexist
Tue Jun 11 03:49:54 PDT 2013  []
Tue Jun 11 03:50:29 PDT 2013  Event - Job Internal Error (Operation)
Tue Jun 11 03:50:29 PDT 2013  (06/11/2013 03:50:26:420 AM)
OVMAPI_4010E Attempt to send command: get_api_version to server: failed. OVMAPI_4004E Server Failed Command: get_api_version , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: Connection refused [Tue Jun 11 03:50:26 PDT 2013] [Tue Jun 11 03:50:26 PDT 2013]
Tue Jun 11 03:50:29 PDT 2013  Job - Discover Server wcoekaer-srv3
< Tue Jun 11 03:50:29 PDT 2013  [{OPERATION_NAME=Discover Manager Server Discover, JOB_STEP=Commit, SERVER_NAME=Unknown, EXIT_STATUS=Failed:OVMAPI_4010E Attempt to send command: get_api_version to server: failed. OVMAPI_4004E Server Failed Command: get_api_version , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: Connection refused [Tue Jun 11 03:50:26 PDT 2013] [Tue Jun 11 03:50:26 PDT 2013], MANAGED_OBJECT_NAME=OVM Foundry : Discover Manager<235>}, {OPERATION_NAME=Discover Manager Server Discover, JOB_STEP=Rollback, SERVER_NAME=Unknown, EXIT_STATUS=DONE, MANAGED_OBJECT_NAME=OVM Foundry : Discover Manager<235>}]

Anyway it's simple but it helps to easily do some form of audit on operations that happened and highlights errors in red.
have fun...
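
If you redirect the dump to a file, a few lines of Python are enough to get a quick summary out of it. This is only a sketch: it assumes the line layout shown in the sample output above (timestamp, two spaces, then "Job - ..." or "Event - ..."), and the names parse_log and summarize are made up for this example, they are not part of ovm_utils.

```python
import re
from collections import Counter

# Matches lines such as:
#   Tue Jun 11 03:49:04 PDT 2013  Job - Rediscover Server wcoekaer-srv1
# (layout assumed from the sample output above)
LINE_RE = re.compile(
    r'^(?P<stamp>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \w+ \d{4})\s+'
    r'(?P<kind>Job|Event) - (?P<text>.*)$'
)

def parse_log(lines):
    """Yield (timestamp, kind, text) for every Job/Event line; skip the rest."""
    for line in lines:
        match = LINE_RE.match(line.rstrip('\n'))
        if match:
            yield match.group('stamp'), match.group('kind'), match.group('text')

def summarize(lines):
    """Count how many jobs and events appear in the dump."""
    return Counter(kind for _, kind, _ in parse_log(lines))

sample = [
    'Tue Jun 11 03:49:04 PDT 2013  Job - Rediscover Server wcoekaer-srv1',
    'Tue Jun 11 03:49:54 PDT 2013  Event - Job Aborted',
    'Tue Jun 11 03:49:54 PDT 2013  (06/11/2013 03:49:51:970 AM)',
    'Due to Abort by user: admin',
]
print(summarize(sample))
```

To run it against a real dump, pass an open file instead of the sample list, for instance summarize(open('/tmp/logoutput')).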

Thursday May 16, 2013

ksplice and how it really helps with 0day stuff

So a nasty bug report came out the other day on Linux, a serious exploit. Everyone scrambled to get a kernel built, tested and released. Then of course there's the effort of bringing down applications (multi-tiered environments are way more complex, since bringing down multiple systems has to be orchestrated), installing the updated kernel, rebooting, and bringing everything back up in an orderly fashion.

Of course, our customers that use ksplice and enjoy the cool zero downtime patching might not even have noticed if they ran ksplice in automated mode, as many do. Others just had to issue one single, very simple command and they were done. No applications to bring down, no systems to reboot... and still safe, secure, patched, current.

some more specifics on the ksplice blog here.

There's also time to release. The ksplice patch was available on Tuesday (5/14) while the RPM for the kernel was released on Thursday (5/16) by us and the other similar distributions. No hassle...

Tuesday Jan 22, 2013

oracle vm 3.2.1 released!

Pleased to announce the release of Oracle VM 3.2.1

The press release is here. The documentation library can be found here.

The release notes in the documentation show what's new and also a list of bugs fixed. Here's the summary of what's new :

The new features and enhancements in Oracle VM Release 3.2.1 include:

Performance, Scalability and Security

Support for Oracle VM Server for SPARC: Oracle VM Manager can now be used to discover SPARC servers running Oracle VM Server for SPARC, and perform virtual machine management tasks.

New Dom0 Kernel in Oracle VM Server for x86: The Dom0 kernel in Oracle VM Server for x86 has been updated so that it is now the same Oracle Unbreakable Enterprise Kernel 2 (UEK2) as used in Oracle Linux, for complete binary compatibility with drivers supported in Oracle Linux. Due to the specialized nature of the Oracle VM Dom0 environment (as opposed to the more general purpose Oracle Linux environment) some Linux drivers may not be appropriate to support in the context of Oracle VM, even if the driver is fully compatible with the UEK2 kernel in Oracle Linux. Do not install any additional drivers unless directed to do so by Oracle Support Services.


MySQL Database Support: MySQL Database is used as the bundled database for the Oracle VM Manager management repository for simple installations. Support for an existing Oracle SE/EE Database is still included within the installer so that you can perform a custom installation to take advantage of your existing infrastructure. Simple installation using the bundled MySQL Database is fully supported within production environments.

Discontinued inclusion of Oracle XE Databases: Oracle VM Manager no longer bundles the Oracle XE database as a backend database. If you are currently running Oracle VM Manager using Oracle XE and you intend to upgrade you must first migrate your database to Oracle SE or Oracle EE.

Oracle VM Server Support Tools: A meta-package is provided on the Oracle VM Server ISO enabling you to install packages to assist with support. These packages are not installed automatically as Oracle VM Server does not depend on them. Installation of the meta-package and its dependencies may assist with the resolution of support queries and can be installed at your own discretion. Note that the sudo package was previously installed as a dependency for Oracle VM Server, but that this package has now been made a dependency of the ovs-support-tools meta-package. If you require sudo on your Oracle VM Server installations, you should install the ovs-support-tools meta-package.

Improved Usability

Oracle VM Command Line Interface (CLI): The new Oracle VM Command Line Interface can be used to perform the same functions as the Oracle VM Manager Web Interface, such as managing all your server pools, servers and guests. The CLI commands can be scripted and run in conjunction with the Web Interface, thus bringing more flexibility to help you deploy and manage an Oracle VM environment. The CLI supports public-key authentication, allowing users to write scripts without embedding passwords, to facilitate secure remote login to Oracle VM Manager. The CLI also includes a full audit log for all commands executed using the facility. See the Oracle VM Command Line Interface User's Guide for information on using the CLI.

Accessibility options: Options to display the UI in a more accessible way for screen readers, improve the contrast, or increase the font size. See Oracle VM Manager user interface Accessibility Features for more information.

Health tab: Monitor the overall health and status of your virtualization environment and view historical statistics such as memory and CPU usage. See Health Tab for information on using the Health tab.

Multi-select of objects: Select one or more objects to perform an action on multiple objects, for example, upgrading multiple Oracle VM Servers in one step, rather than upgrading them individually. See Multi-Select Functionality for information on using the multi-select feature.

Search for objects: In many of the tab management panes and in some of the dialog boxes you can search for objects. This is of particular benefit to large deployments with many objects such as virtual machines or Oracle VM Servers. See Name Filters for information on using the search feature.

Tagging of objects: It is now possible to tag virtual machines, servers and server pool objects within Oracle VM Manager to create logical groupings of items, making it easier to search for objects by tag.

Alphabetized tables and other UI listings: Items listed in tables and other UI listings are now sorted alphabetically within Oracle VM Manager by default, to make it easier to find objects in larger deployments.

Present repository to server pools: In addition to presenting a storage repository to individual Oracle VM Servers, you can now present a repository to all Oracle VM Servers in one or more server pools. See Presenting or Unpresenting a Storage Repository for more information.

OCFS2 timeout configuration: An additional attribute has been added to allow you to determine the timeout in seconds for a cluster when configuring a clustered server pool within Oracle VM Manager.

NFS refresh servers and access lists for non-uniform exports: For NFS configurations where different server pools are exposed to different exports, it is now possible to configure non-uniform exports and access lists to control how server pool refreshes are performed. For more information on this feature, please see NFS Access Groups for Non-uniform Exports.

Configure multiple iSCSI access hosts: You can now configure multiple access hosts for iSCSI storage devices.

Sizes of disks, ISOs and vdisks: Oracle VM Manager now shows the sizes of disks, ISOs and vdisks within the virtual machine edit dialog, to make it easier to select a disk.

Automated backups and easy restore: Oracle VM Manager installations taking advantage of the bundled MySQL Enterprise Edition Database include fully automated database backups and a quick restore tool that can help with easy database restoration.

Serial console access: A serial console java applet has been included within Oracle VM Manager to allow serial console access to virtual machines running on both SPARC and x86 hardware. This facility complements the existing VNC-based console access to virtual machines running on x86 hardware.

Set preferences for recurring jobs: Facilities have been provided within Oracle VM Manager to control the preferences for recurring jobs. These include the ability to enable, disable or set the interval for tasks such as refreshing repositories and file systems; and to control the Yum Update checking task.

Processor Compatibility Groups: Since virtual machines can only be migrated between servers that use compatible processor types, Oracle VM Manager now provides the ability to define Processor Compatibility Groups to enable you to pick which servers a virtual machine can be migrated between.

Configure additional Utility and Virtual Machine roles: New roles are now supported on Oracle VM Servers to control the type of functionality that the server will be responsible for. The Virtual Machine role is required in order for an Oracle VM Server to run a virtual machine. Oracle VM Servers configured with the Utility role are favoured for performing operations such as file cloning, importing of templates, the creation of repositories, and other operations not directly related to running a virtual machine.

Directly import a virtual machine: It is now possible to directly import a virtual machine using Oracle VM Manager, no longer requiring that you first import to a template and then clone.

Virtual machine start policy: You can now specify a start policy for a virtual machine, determining whether to always start the virtual machine on the server on which it has been placed, or to start the virtual machine on the best possible server in the server pool.

Hot-add a VNIC to a virtual machine: It is now possible to add a VNIC directly to a running virtual machine from within Oracle VM Manager.

Send messages to a virtual machine: Facilities have been provided within Oracle VM Manager to send messages directly to a virtual machine in the form of key-value pairs.

NTP configuration: Ensuring that time is synchronized across all servers is important. Oracle VM Manager now provides a facility to bulk configure NTP across all servers.

My personal favorites are (1) MySQL as a repository database (2) adding support for SPARC servers running Oracle VM for SPARC in Oracle VM Manager (3) the CLI server (4) Server Utility versus VM server roles (5) cluster timeout configuration (and a better default) (6) direct VM import and (7) serial console for a VM.

have fun

Monday Jan 21, 2013

oracle linux playground channel sample

If you have a system with Oracle Linux 6 installed but you are not using public-yum, and you want to play with our mainline kernel builds from the playground channel, then you need to create a simple, small yum repo file and you are all set.

Some reasons could be that your system is configured for a local yum repository for updates, or you are registered directly with ULN.

Either way, a very simple example file can be found here. Just put the file in /etc/yum.repos.d.

# cat /etc/yum.repos.d/playground.repo 
name=Oracle Linux mainline kernel playground $releasever ($basearch)

Once this file exists, you can use yum to install the new kernels. At time of writing, this is kernel-3.7.2-3.7.y.20130115.ol6.x86_64. Just go look in the directory to see which kernels have been published and pick the one you want to install. As you can see source, binary, devel, debug, headers, firmware and doc versions of the packages are there.

# yum install kernel-3.7.2-3.7.y.20130115.ol6.x86_64
Loaded plugins: refresh-packagekit, rhnplugin, security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:3.7.2-3.7.y.20130115.ol6 will be installed
--> Processing Dependency: kernel-firmware = 3.7.2-3.7.y.20130115.ol6 
      for package: kernel-3.7.2-3.7.y.20130115.ol6.x86_64
--> Running transaction check
---> Package kernel-firmware.noarch 0:2.6.32-279.19.1.el6 will be updated
---> Package kernel-firmware.noarch 0:3.7.2-3.7.y.20130115.ol6 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

 Package           Arch     Version                      Repository        Size
 kernel            x86_64   3.7.2-3.7.y.20130115.ol6     ol6_playground    24 M
Updating for dependencies:
 kernel-firmware   noarch   3.7.2-3.7.y.20130115.ol6     ol6_playground   997 k

Transaction Summary
Install       1 Package(s)
Upgrade       1 Package(s)

Total download size: 25 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): kernel-3.7.2-3.7.y.20130115.ol6.x86_64.rpm  
                                      |  24 MB     00:18     
(2/2): kernel-firmware-3.7.2-3.7.y.20130115.ol6.noarch.rpm   
                                      | 997 kB     00:00     
                             1.3 MB/s |  25 MB     00:19     
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
  Updating   : kernel-firmware-3.7.2-3.7.y.20130115.ol6.noarch            
  Installing : kernel-3.7.2-3.7.y.20130115.ol6.x86_64                      
  Cleanup    : kernel-firmware-2.6.32-279.19.1.el6.noarch                   
  Verifying  : kernel-firmware-3.7.2-3.7.y.20130115.ol6.noarch                     
  Verifying  : kernel-3.7.2-3.7.y.20130115.ol6.x86_64                             
  Verifying  : kernel-firmware-2.6.32-279.19.1.el6.noarch                          

  kernel.x86_64 0:3.7.2-3.7.y.20130115.ol6                                                                    

Dependency Updated:
  kernel-firmware.noarch 0:3.7.2-3.7.y.20130115.ol6                                                           

Now just a simple reboot and you are all set.

Wednesday Jan 16, 2013

Oracle Linux 5.9

Oracle Linux 5.9 was uploaded yesterday to http://linux.oracle.com (ULN) and to http://public-yum.oracle.com. The _latest channels are current and the 5.9_base channels contain the core.

ISO images will be available shortly from http://edelivery.oracle.com. If there is an urgent need to get the ISOs through My Oracle Support, simply file a service request.

Release notes are here.

Sunday Jan 06, 2013

oracle vm template config script example

The programmatic way to extend Oracle VM Template Configure is to build your own module.

To write your own module, you have to build an RPM that contains a configure script in a specific format, let's go through the steps to do this.

Oracle VM template configure works very similarly to the init.d and chkconfig script model. For template config we have the /etc/template.d directory, and all the scripts go into /etc/template.d/scripts. Symlinks are then made in other subdirectories based on the type of target the scripts provide. At this point we handle configure and cleanup. When a script/module gets added using ovm-chkconfig, the header of the script is read to verify the name, priority and targets, and then a symlink is made in the corresponding subdirectories under /etc/template.d.

As an example, you have /etc/init.d/sshd which is the main sshd initscript and when sshd is enabled you will find a symlink in /etc/rc3.d/S55sshd to /etc/init.d/sshd. These symlinks are created by chkconfig when you enable or disable a service. The same thing goes for Oracle VM template config and the content of /etc/template.d/scripts. You will see /etc/template.d/scripts/ssh and since ssh (on my system) is enabled for the configure target, I have a symlink to /etc/template.d/configure.d/70ssh.

Like init.d, the digit in front of the script name specifies the priority at which it should be run.
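
To make the ordering concrete, here is a small sketch that sorts symlink names by their numeric prefix. It assumes lower numbers run first, as with init.d. The function name run_order is made up for this example, and the 41 used for firewall is an invented priority; only 70ssh and the network priority of 50 come from this post.

```python
import re

def run_order(link_names):
    """Sort symlink names such as '70ssh' by their numeric priority prefix."""
    def key(name):
        match = re.match(r'(\d+)(.*)', name)
        if not match:
            raise ValueError('no priority prefix in %r' % name)
        # Compare the priority as a number, then the script name as a tie-breaker.
        return int(match.group(1)), match.group(2)
    return sorted(link_names, key=key)

# '41firewall' uses an invented priority, purely for illustration.
print(run_order(['70ssh', '50network', '41firewall']))
```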

The most important and complex part is writing your own script for your own application. Our scripts are in python. Theoretically you could write yours in a different language, as long as the input, output and argument handling remain the same. The examples here will all be in python. Each script has two main parts: (1) the script header, which contains information like script name, targets, priorities and description, and (2) the actual script, which has to handle a small set of parameters. You can take a look at the existing scripts for examples.

(1) script header
Aside from a copyright header that suits your needs, the script headers require a very specific comment block, here is an example :

# name: network
# configure: 50
# cleanup: 50
# description: Script to configure template network.

You have to use the exact same format. Provide your own script name, which will be used when calling ovm-chkconfig, the targets (right now we implement configure and cleanup) and the priority for your script. The priority will specify in what order the scripts get executed. You do not have to implement all targets, if you have a configure target but not cleanup, that is OK, same goes for cleanup versus configure. It is up to you. The configure target gets called when a first boot/initial start of the VM happens, cleanup happens when you manually initiate a cleanup in your VM or when you want to restore the VM to its original state.

# name: [script name]
# [target]: [priority]
# [target]: [priority]
# description: a description and can
#   cross multiple lines.
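
If you want to sanity-check your header before packaging, a small parser like the one below can help. It is a sketch based only on the format shown above. It is not how ovm-chkconfig itself reads the header, and parse_header is a made-up name.

```python
import re

# "# key: value" lines, e.g. "# configure: 50"
HEADER_RE = re.compile(r'^#\s*([a-z]+):\s*(.*)$')

def parse_header(text):
    """Parse the comment block into name, targets and description."""
    info = {'name': None, 'targets': {}, 'description': ''}
    last = None
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            key, value = match.group(1), match.group(2).strip()
            if key == 'name':
                info['name'] = value
            elif key == 'description':
                info['description'] = value
            elif value.isdigit():
                # Treated as "[target]: [priority]".
                info['targets'][key] = int(value)
            last = key
        elif line.startswith('#'):
            if last == 'description':
                # Continuation line of a multi-line description.
                info['description'] += ' ' + line.lstrip('#').strip()
        elif line.strip():
            break  # first line of real code ends the header
        else:
            last = None  # a blank line ends a multi-line description
    return info

sample = """# name: network
# configure: 50
# cleanup: 50
# description: Script to configure
#   template network.
"""
print(parse_header(sample))
```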

Now for the body of the script. The main requirement is that it accepts a [target] parameter. Let's say we have a script called foo that needs to be run at configure time; then the script (/etc/template.d/scripts/foo) has to accept and handle the parameter configure. If you also want to call it for cleanup, then it has to handle cleanup. Your script can handle any other arguments as well; this is totally up to you, and they are optional for our purposes. There is one optional parameter which is useful to implement, and this is -e or --enumerate. ovm-template-config uses this to enumerate the parameters for a target for your script.

Here is the firewall example:

# ovm-template-config --human-readable --enumerate configure --script firewall
  [{u'description': u'Whether to enable network firewall: True or False.',
    u'hidden': True,
    u'key': u'com.oracle.linux.network.firewall'}])]
and if you run the script manually :

# ./firewall configure -e
[{"hidden": true, "description": "Whether to enable network firewall: True or False.", "key": "com.oracle.linux.network.firewall"}]

In other words, the firewall script lists the parameters it expects when run as a configure target.
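
Since the manual run prints plain JSON, it is easy to consume from your own tooling. The sketch below takes the output line shown above and turns it into a key/description mapping; expected_keys is a made-up helper name, not something shipped with ovm-template-config.

```python
import json

# Output of "./firewall configure -e", copied from the example above.
enumerate_output = (
    '[{"hidden": true, '
    '"description": "Whether to enable network firewall: True or False.", '
    '"key": "com.oracle.linux.network.firewall"}]'
)

def expected_keys(output):
    """Return {key: description} for the parameters a script enumerates."""
    return dict((item['key'], item['description']) for item in json.loads(output))

for key, description in expected_keys(enumerate_output).items():
    print('%s: %s' % (key, description))
```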

Now here is an example of the script body, in python. It implements the configure and cleanup target and handles the enumerate argument. Part of the magic is handled in templateconfig.cli.

try:
    import json
except ImportError:
    # Older python versions do not ship the json module.
    import simplejson as json
from templateconfig.cli import main

def do_enumerate(target):
    param = []
    if target == 'configure':
        param += []
    elif target == 'cleanup':
        param += []
    return json.dumps(param)

def do_configure(param):
    param = json.loads(param)
    return json.dumps(param)

def do_unconfigure(param):
    param = json.loads(param)
    return json.dumps(param)

def do_cleanup(param):
    param = json.loads(param)
    return json.dumps(param)

if __name__ == '__main__':
    main(do_enumerate, {'configure': do_configure, 'cleanup': do_cleanup})

So now you can fill this out with your own parameters and code. Again taking the firewall script as an example, to add expected keys :

def do_enumerate(target):
    param = []
    if target == 'configure':
        param += [{'key': 'com.oracle.linux.network.firewall',
                   'description': 'Whether to enable network firewall: True or False.',
                   'hidden': True}]
    return json.dumps(param)

The above shows that this script expects the key com.oracle.linux.network.firewall to be set, along with a description and whether the value is hidden. Add an entry like this for each key/value pair that your script expects. Afterwards it is easy to see what the input to your script needs to be, again by running ovm-template-config.

To execute actions at configure time, based on values set, here's a do_configure() example:

def do_configure(param):
    param = json.loads(param)
    firewall = param.get('com.oracle.linux.network.firewall')
    if firewall == 'True':
        shell_cmd('service iptables start')
        shell_cmd('service ip6tables start')
        shell_cmd('chkconfig --level 2345 iptables on')
        shell_cmd('chkconfig --level 2345 ip6tables on')
    elif firewall == 'False':
        shell_cmd('service iptables stop')
        shell_cmd('service ip6tables stop')
        shell_cmd('chkconfig --level 2345 iptables off')
        shell_cmd('chkconfig --level 2345 ip6tables off')
    return json.dumps(param)

When the script is called, you can use param.get() to retrieve key/value variables and then just make use of it. Just like in the firewall example, you can do whatever you want, call out other commands, add more python code, it's up to you...
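
If you want to test the decision logic without touching a real system, one option is to split "which commands" from "run them". The sketch below is a dry-run variant of the firewall example: it returns the command strings instead of passing them to shell_cmd (whose definition is not shown in this post). firewall_commands is a made-up name for illustration.

```python
import json

def firewall_commands(param_json):
    """Return the shell commands the firewall example would run, in order."""
    param = json.loads(param_json)
    firewall = param.get('com.oracle.linux.network.firewall')
    if firewall == 'True':
        action, state = 'start', 'on'
    elif firewall == 'False':
        action, state = 'stop', 'off'
    else:
        return []  # key missing or unexpected value: do nothing
    services = ('iptables', 'ip6tables')
    commands = ['service %s %s' % (svc, action) for svc in services]
    commands += ['chkconfig --level 2345 %s %s' % (svc, state) for svc in services]
    return commands

for cmd in firewall_commands('{"com.oracle.linux.network.firewall": "True"}'):
    print(cmd)
```

Note that the values arrive as the strings 'True' and 'False', not as JSON booleans, which is why the comparison is against strings.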

It is also possible to alter keys or add new keys which then get sent back. So if you want your script to communicate values back which can be retrieved later through the manager API, for instance with ovm_vmmessage -q, you can simply do this :

param['key'] = 'some value'

Key can be an existing key, or a new one.
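
Put together, a do_configure() that reports something back could look like the sketch below. The key com.example.myapp.status is invented for this example; pick whatever key name fits your application.

```python
import json

def do_configure(param):
    param = json.loads(param)
    # ... your own configuration work goes here ...
    # Hypothetical key: report back that configuration finished.
    param['com.example.myapp.status'] = 'configured'
    return json.dumps(param)

result = json.loads(do_configure('{"com.oracle.linux.network.firewall": "True"}'))
print(sorted(result))
```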

And that's really it... for the script. Next up is packaging.

In order to install and configure these template configure scripts, they have to be packaged in an RPM, with a specific naming convention. Package the script(s), there can be more than one, as ovm-template-config-[scriptname]. Ideally in the post install of the RPM you want to add the script automatically. Execute # /usr/sbin/ovm-chkconfig --add [scriptname]. When de-installing a script/RPM, remove it at un-install time, # /usr/sbin/ovm-chkconfig --del [scriptname].

Here is an example of an RPM spec file that can be used:

Name: ovm-template-config-example
Version: 3.0
Release: 1%{?dist}
Summary: Oracle VM template example configuration script.
Group: Applications/System
License: GPL
URL: http://www.oracle.com/virtualization
Source0: %{name}-%{version}.tar.gz
BuildRoot: %(mktemp -ud %{_tmppath}/%{name}-%{version}-%{release}-XXXXXX)
BuildArch: noarch
Requires: ovm-template-config

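%description
Oracle VM template example configuration script.

%prep
%setup -q

%install
rm -rf $RPM_BUILD_ROOT
make install DESTDIR=$RPM_BUILD_ROOT

%clean
rm -rf $RPM_BUILD_ROOT

%post
if [ $1 = 1 ]; then
    /usr/sbin/ovm-chkconfig --add example
fi

%preun
if [ $1 = 0 ]; then
    /usr/sbin/ovm-chkconfig --del example
fi

%files
%defattr(-,root,root,-)
/etc/template.d/scripts/example

%changelog
* Tue Mar 22 2011 Zhigang Wang  - 3.0-1
- Initial build.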

Modify the content to your liking, change the name example to your script name, and add whatever other dependencies you might have or whatever files need to be bundled along with this. If you want to bundle executables or scripts that live in other locations, that's allowed. As you can see from the spec file, it automatically calls ovm-chkconfig --add and --del at post-install and pre-uninstall time of the RPM.

In order to create RPMs, you have to install rpmbuild, # yum install rpm-build.

To make it easy, here's a Makefile you can use and help automate all of this :


# PACKAGE and VERSION have to match Name and Version in the spec file.
DESTDIR=
PACKAGE=ovm-template-config-example
VERSION=3.0

help:
	@echo 'Commonly used make targets:'
	@echo '  install    - install program'
	@echo '  dist       - create a source tarball'
	@echo '  rpm        - build RPM packages'
	@echo '  clean      - remove files created by other targets'

dist: clean
	mkdir $(PACKAGE)-$(VERSION)
	tar -cSp --to-stdout --exclude .svn --exclude .hg --exclude .hgignore \
	    --exclude $(PACKAGE)-$(VERSION) * | tar -x -C $(PACKAGE)-$(VERSION)
	tar -czSpf $(PACKAGE)-$(VERSION).tar.gz $(PACKAGE)-$(VERSION)
	rm -rf $(PACKAGE)-$(VERSION)

install:
	install -D example $(DESTDIR)/etc/template.d/scripts/example

rpm: dist
	rpmbuild -ta $(PACKAGE)-$(VERSION).tar.gz

clean:
	rm -fr $(PACKAGE)-$(VERSION)
	find . -name '*.py[cdo]' -exec rm -f '{}' ';'
	rm -f *.tar.gz

.PHONY: help dist install rpm clean

Create a directory, copy over your script, the spec file and this Makefile. Run # make dist, to create a src tarball of your code and then # make rpm. This will generate an RPM in the RPMS/noarch directory. For instance: /root/rpmbuild/RPMS/noarch/ovm-template-config-test-3.0-1.el6.noarch.rpm

Next you can take this RPM and install it on a target system.

# rpm -ivh  /root/rpmbuild/RPMS/noarch/ovm-template-config-test-3.0-1.el6.noarch.rpm
Preparing...                ########################################### [100%]
   1:ovm-template-config-tes########################################### [100%]

And as you can see, it's added to the ovm-chkconfig list :

# ovm-chkconfig --list|grep test
test                 on:75       off         off         on:25       off         off         off         off

One point of caution: the configure scripts get executed very early in the boot stage. ovmd is executed as S00ovmd, well before many other services are (1) configured and (2) running. So if your product requires services like network connectivity or others to be up and running, then you have to split up the configuration into two parts. First, use the above to gather configuration data remotely and store it in a way that you can use later. Then add your own /etc/init.d scripts which pick up this data afterwards. That way your own init scripts get executed at a late stage, when the services you depend on are available.
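
Here is a minimal sketch of that two-part approach, assuming you simply store the received key/value pairs as a JSON file. The helper names save_config and load_config are made up, and the file location is up to you; nothing here is part of ovm-template-config.

```python
import json
import os
import tempfile

def save_config(param_json, path):
    """Early stage (inside do_configure): persist the received key/value pairs."""
    param = json.loads(param_json)
    directory = os.path.dirname(path) or '.'
    # Write to a temp file first so a reader never sees a half-written file.
    # mkstemp creates the file readable and writable by the owner only.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, 'w') as handle:
        json.dump(param, handle)
    os.rename(tmp, path)
    return json.dumps(param)

def load_config(path):
    """Late stage: called from your own init script once services are up."""
    with open(path) as handle:
        return json.load(handle)
```

In do_configure() you would call save_config(param, '/some/path/of/your/choice') and return its result. Your late init script then calls load_config() with the same path.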

That's really all there is to it. Thanks to Zhigang for example code I have used here.

Saturday Jan 05, 2013

Oracle Linux Ksplice offline client

We just uploaded the Ksplice uptrack Offline edition client to ULN. Until recently, in order to be able to rely on Ksplice zero downtime patches, you know, the ability to apply security updates and bugfixes on Oracle Linux without the need for a reboot, each server made a direct connection to our server. It was required for each server on the intranet to have a direct connection to linux.oracle.com.

By introducing the offline client, customers with Oracle Linux Premier or Oracle Linux Premier Limited support can create a local intranet yum repository that mirrors the ULN ksplice channel and just use the yum update command to install the latest ksplice updates. This allows customers to have just one server connected to the Oracle server, while every other system only needs a local connection.

Here is an example on how to get this going, and then setting up a local yum repository is an exercise up to the reader :)

  • Register a server with ULN and have premier Oracle Linux support. (using # uln_register)
  • Configure rebootless updates with # uln_register
  • Log into ULN and modify the channel subscriptions for the server to include the channel Ksplice for Oracle Linux 6
  • Verify that you are subscribed by running # yum repolist on your server and you should see the ol6_x86_64_ksplice channel included
  • Install the offline edition of uptrack, the Ksplice update client # yum install uptrack-offline
  • If there are ksplice updates available for the kernel you run, you can install the uptrack-updates-[version] rpm. # yum install uptrack-updates-`uname -r`.(i386|x86_64)

    Now, as we release new zero downtime updates, there will be a newer version of the uptrack-updates RPM for your kernel. A simple # yum update will pick this up, (without the need for a reboot) and the update will apply new ksplice updates. You can see the version of the uptrack-updates by doing # rpm -qa | grep uptrack-updates it will show the date appended at the end. And of course the usual commands # uptrack-show or # uptrack-uname -r.

    The offline client allows you to create a local yum repository for your servers that are covered under Oracle Linux Premier and Oracle Linux Premier Limited support subscriptions so that the servers just point to your local yum repository and all you have to do is keep this repository up-to-date for the ksplice channel and install the above RPMs on each server and run a yum update. In other words, the offline client doesn't require each server to be connected to the Oracle Unbreakable Linux network.

    To set up a local yum repository, follow these instructions.

    have fun!

    Using Oracle VM messages to configure a Virtual Machine

    In the previous blog entry, I walked through the steps on how to set up a VM with the necessary packages to enable Oracle VM template configuration. The template configuration scripts are add-ons one can install inside a VM running in an Oracle VM 3 environment. Once installed, it is possible to enable the configuration scripts and shutdown the VM so that after cloning or reboot, we go through an initial setup dialog.

    At startup time, if ovmd is enabled, it will start executing configuration scripts that need input to configure and continue. It is possible to send this configuration data through the virtual console of the VM or through the Oracle VM API. To use the Oracle VM API to send configuration messages, you have two options :

    (1) use the Oracle VM CLI. As of Oracle VM 3.1, we include an Oracle VM CLI server by default when installing Oracle VM Manager. This process starts on port 10000 on the Oracle VM Manager node and acts as an ssh server. You can log into this cli using the admin username/password and then execute cli commands.

    # ssh admin@localhost -p 10000
    admin@localhost's password: 
    OVM> sendVmMessage Vm name=ol6u3apitest key=foo message=bar log=no
    Command: sendVmMessage Vm name=ol6u3apitest key=foo message=bar log=no
    Status: Success
    Time: 2012-12-27 09:04:29,890 PST

    The cli command for sending a message is sendVmMessage Vm name=[vmname] key=[key] message=[value]

    If you do not want to log the output of the command, add log=no.
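
    If you script the CLI, the command line above is easy to assemble from its parts. The following is only a minimal Python 3 sketch: the helper name send_vm_message_command is made up for this example, and it does not attempt any quoting of values that contain spaces, since the CLI's quoting rules are not covered here.

```python
def send_vm_message_command(vm_name, key, value, log=True):
    """Return the CLI line that sends one key/value message to a VM.

    Hypothetical helper: it only formats the string shown in the text,
    it does not talk to Oracle VM Manager.
    """
    parts = [
        "sendVmMessage", "Vm",
        "name=%s" % vm_name,
        "key=%s" % key,
        "message=%s" % value,
    ]
    if not log:
        # log=no keeps the command output out of the log
        parts.append("log=no")
    return " ".join(parts)


print(send_vm_message_command("ol6u3apitest", "foo", "bar", log=False))
```

    The printed line is the same command that was typed at the OVM> prompt in the session above.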

    (2) use the Oracle VM utilities. If you install the Oracle VM Utilities, see here to get started, then :

    # ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest -k foo -V bar
    Oracle VM VM Message utility 0.5.2.
    VM : 'ol6u3apitest' has status :  Running.
    Sending message.
    Message sent successfully.

    The ovm_vmmessage command connects to Oracle VM Manager and sends a key/value pair to the VM you select.

    ovm_vmmessage -u [adminuser] -p [adminpassword] -h [managernode] -v [vmname] -k [key] -V [value]

    These two commands basically allow the admin user to send simple key - value pair messages to a given VM. This is the basic mechanism we rely on to remotely configure a VM using the Oracle VM template config scripts.

    For the template configuration we provide, and depending on the scripts you installed, there is a well-defined set of variables (keys) that you can set, listed below. In our scripts one variable is required, and it has to be set/sent at the end of the configuration: the root password. Everything else is optional. Sending the root password variable triggers the reconfiguration to execute. As an example, if you install the ovm-template-config-selinux package, then part of the configuration can be to set the SELinux mode. The variable is com.oracle.linux.selinux.mode and the values can be enforcing, permissive or disabled. So to set the SELinux mode, you send a message with key com.oracle.linux.selinux.mode and value enforcing (or one of the other values).

    # ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest \
            -k com.oracle.linux.selinux.mode -V enforcing

    Do this for every variable you want to define and at the end send the root password.

    # ./ovm_vmmessage -u admin -p ######## -h localhost -v ol6u3apitest \
            -k com.oracle.linux.root-password -V "mypassword"

    Once the above message gets sent, the ovm-template-config scripts will set up all the values and the VM will end up in a configured state. You can use this to send ssh keys, set up extra users, configure the virtual network devices, etc. To get the list of configuration variables, just run # ovm-template-config --human-readable --enumerate configure and it will list the variables with a description, as in the listing at the end of this entry.
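
    The only ordering rule is that the root password goes last, because it triggers the configuration run. Here is a small Python 3 sketch of that rule. The function name build_vmmessage_commands and its arguments are made up for this example; it only builds the ovm_vmmessage command lines, using the -u/-p/-h/-v/-k/-V options shown above, and leaves running them to you.

```python
ROOT_PASSWORD_KEY = "com.oracle.linux.root-password"


def build_vmmessage_commands(vm_name, settings, user, password,
                             manager="localhost", tool="./ovm_vmmessage"):
    """Return one ovm_vmmessage argument list per key, root password last.

    Hypothetical helper: 'settings' is a dict of key -> value. The other
    keys keep the order in which the caller supplied them.
    """
    if ROOT_PASSWORD_KEY not in settings:
        raise ValueError("the root password key is required and must be sent")
    ordered = [key for key in settings if key != ROOT_PASSWORD_KEY]
    ordered.append(ROOT_PASSWORD_KEY)  # sending this one starts the configuration
    return [
        [tool, "-u", user, "-p", password, "-h", manager,
         "-v", vm_name, "-k", key, "-V", str(settings[key])]
        for key in ordered
    ]


if __name__ == "__main__":
    commands = build_vmmessage_commands(
        "ol6u3apitest",
        {ROOT_PASSWORD_KEY: "mypassword",
         "com.oracle.linux.selinux.mode": "enforcing"},
        user="admin", password="CHANGE_ME")
    for argv in commands:
        print(" ".join(argv))
```

    Each list can be handed to subprocess.check_call() as-is, which avoids any shell quoting issues with the values.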

    It is also possible to selectively enable and disable scripts. This works very similarly to chkconfig. # ovm-chkconfig --list will show which scripts/modules are registered and whether they are enabled to run at configure time and/or cleanup time. At this point, the other options are not implemented (suspend/resume/..). If you have installed datetime but do not want to have it run or be an option, then a simple # ovm-chkconfig --target configure datetime off will disable it. This allows you, for each VM or template, to selectively enable or disable configuration options. If you disable a module then the output of ovm-template-config will reflect those changes.

    The next blog entry will talk about how to make generic use of the VM message API and possibly extend the ovm-template-config modules for your own applications.

      [{u'description': u'SELinux mode: enforcing, permissive or disabled.',
        u'hidden': True,
        u'key': u'com.oracle.linux.selinux.mode'}]),
      [{u'description': u'Whether to enable network firewall: True or False.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.firewall'}]),
      [{u'description': u'System date and time in format year-month-day-hour-minute-second, e.g., "2011-4-7-9-2-42".',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.datetime'},
       {u'description': u'System time zone, e.g., "America/New_York".',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.timezone'},
       {u'description': u'Whether to keep hardware clock in UTC: True or False.',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.utc'},
       {u'description': u'Whether to enable NTP service: True or False.',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.ntp'},
       {u'description': u'NTP servers separated by comma, e.g., "time.example.com,0.example.pool.ntp.org".',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.ntp-servers'},
       {u'description': u'Whether to enable NTP local time source: True or False.',
        u'hidden': True,
        u'key': u'com.oracle.linux.datetime.ntp-local-time-source'}]),
      [{u'description': u'System host name, e.g., "localhost.localdomain".',
        u'key': u'com.oracle.linux.network.hostname'},
       {u'description': u'Hostname entry for /etc/hosts, e.g., " localhost.localdomain localhost".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.host.0'},
       {u'description': u'Network device to configure, e.g., "eth0".',
        u'key': u'com.oracle.linux.network.device.0'},
       {u'depends': u'com.oracle.linux.network.device.0',
        u'description': u'Network device hardware address, e.g., "00:16:3E:28:0F:4E".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.hwaddr.0'},
       {u'depends': u'com.oracle.linux.network.device.0',
        u'description': u'Network device MTU, e.g., "1500".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.mtu.0'},
       {u'choices': [u'yes', u'no'],
        u'depends': u'com.oracle.linux.network.device.0',
        u'description': u'Activate interface on system boot: yes or no.',
        u'key': u'com.oracle.linux.network.onboot.0'},
       {u'choices': [u'dhcp', u'static'],
        u'depends': u'com.oracle.linux.network.device.0',
        u'description': u'Boot protocol: dhcp or static.',
        u'key': u'com.oracle.linux.network.bootproto.0'},
       {u'depends': u'com.oracle.linux.network.bootproto.0',
        u'description': u'IP address of the interface.',
        u'key': u'com.oracle.linux.network.ipaddr.0',
        u'requires': [u'com.oracle.linux.network.bootproto.0',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.0',
        u'description': u'Netmask of the interface.',
        u'key': u'com.oracle.linux.network.netmask.0',
        u'requires': [u'com.oracle.linux.network.bootproto.0',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.0',
        u'description': u'Gateway IP address.',
        u'key': u'com.oracle.linux.network.gateway.0',
        u'requires': [u'com.oracle.linux.network.bootproto.0',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.0',
        u'description': u'DNS servers separated by comma, e.g., ",".',
        u'key': u'com.oracle.linux.network.dns-servers.0',
        u'requires': [u'com.oracle.linux.network.bootproto.0',
                      [u'static', u'none', None]]},
       {u'description': u'DNS search domains separated by comma, e.g., "us.example.com,cn.example.com".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.dns-search-domains.0'},
       {u'description': u'Network device to configure, e.g., "eth0".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.device.1'},
       {u'depends': u'com.oracle.linux.network.device.1',
        u'description': u'Network device hardware address, e.g., "00:16:3E:28:0F:4E".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.hwaddr.1'},
       {u'depends': u'com.oracle.linux.network.device.1',
        u'description': u'Network device MTU, e.g., "1500".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.mtu.1'},
       {u'choices': [u'yes', u'no'],
        u'depends': u'com.oracle.linux.network.device.1',
        u'description': u'Activate interface on system boot: yes or no.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.onboot.1'},
       {u'choices': [u'dhcp', u'static'],
        u'depends': u'com.oracle.linux.network.device.1',
        u'description': u'Boot protocol: dhcp or static.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.bootproto.1'},
       {u'depends': u'com.oracle.linux.network.bootproto.1',
        u'description': u'IP address of the interface.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.ipaddr.1',
        u'requires': [u'com.oracle.linux.network.bootproto.1',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.1',
        u'description': u'Netmask of the interface.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.netmask.1',
        u'requires': [u'com.oracle.linux.network.bootproto.1',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.1',
        u'description': u'Gateway IP address.',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.gateway.1',
        u'requires': [u'com.oracle.linux.network.bootproto.1',
                      [u'static', u'none', None]]},
       {u'depends': u'com.oracle.linux.network.bootproto.1',
        u'description': u'DNS servers separated by comma, e.g., ",".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.dns-servers.1',
        u'requires': [u'com.oracle.linux.network.bootproto.1',
                      [u'static', u'none', None]]},
       {u'description': u'DNS search domains separated by comma, e.g., "us.example.com,cn.example.com".',
        u'hidden': True,
        u'key': u'com.oracle.linux.network.dns-search-domains.1'}]),
      [{u'description': u'Name of the user on which to perform operation.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.name.0'},
       {u'description': u'Action to perform on the user: add, del or mod.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.action.0'},
       {u'description': u'User ID.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.uid.0'},
       {u'description': u'User initial login group.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.group.0'},
       {u'description': u'Supplementary groups separated by comma.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.groups.0'},
       {u'description': u'User password.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.password.0',
        u'password': True},
       {u'description': u'New name of the user.',
        u'hidden': True,
        u'key': u'com.oracle.linux.user.new-name.0'},
       {u'description': u'Name of the group on which to perform operation.',
        u'hidden': True,
        u'key': u'com.oracle.linux.group.name.0'},
       {u'description': u'Action to perform on the group: add, del or mod.',
        u'hidden': True,
        u'key': u'com.oracle.linux.group.action.0'},
       {u'description': u'Group ID.',
        u'hidden': True,
        u'key': u'com.oracle.linux.group.gid.0'},
       {u'description': u'New name of the group.',
        u'hidden': True,
        u'key': u'com.oracle.linux.group.new-name.0'}]),
      [{u'description': u'Host private rsa1 key for protocol version 1.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-key'},
       {u'description': u'Host public rsa1 key for protocol version 1.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-key-pub'},
       {u'description': u'Host private rsa key.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-rsa-key'},
       {u'description': u'Host public rsa key.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-rsa-key-pub'},
       {u'description': u'Host private dsa key.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-dsa-key'},
       {u'description': u'Host public dsa key.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.host-dsa-key-pub'},
       {u'description': u'Name of the user to add a key.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.user.0'},
       {u'description': u'Authorized public keys.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.authorized-keys.0'},
       {u'description': u'Private key for authentication.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.private-key.0'},
       {u'description': u'Private key type: rsa, dsa or rsa1.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.private-key-type.0'},
       {u'description': u'Known hosts.',
        u'hidden': True,
        u'key': u'com.oracle.linux.ssh.known-hosts.0'}]),
      [{u'description': u'System root password.',
        u'key': u'com.oracle.linux.root-password',
        u'password': True,
        u'required': True}])]
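
    Since the listing is plain Python data, it is easy to pick out the keys you care about. The sketch below (Python 3) assumes you have the entries as one flat list of dictionaries; the three sample entries are copied from the listing above, and the helper names required_keys and unhidden_keys are made up for this example.

```python
# Three entries copied from the listing above, as a flat list (assumed shape).
variables = [
    {'description': 'SELinux mode: enforcing, permissive or disabled.',
     'hidden': True,
     'key': 'com.oracle.linux.selinux.mode'},
    {'description': 'System host name, e.g., "localhost.localdomain".',
     'key': 'com.oracle.linux.network.hostname'},
    {'description': 'System root password.',
     'key': 'com.oracle.linux.root-password',
     'password': True,
     'required': True},
]


def required_keys(entries):
    """Keys whose entry carries required=True."""
    return [entry['key'] for entry in entries if entry.get('required')]


def unhidden_keys(entries):
    """Keys whose entry is not flagged hidden."""
    return [entry['key'] for entry in entries if not entry.get('hidden')]


print(required_keys(variables))
print(unhidden_keys(variables))
```

    With the sample entries, the only required key is com.oracle.linux.root-password, which matches the rule that the root password is the one message you always have to send.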

    Configure Oracle Linux 6.3 as an Oracle VM template

    I have been asked a few times how one can make use of the Oracle VM API to configure an Oracle Linux VM running on top of Oracle VM 3. In the next few blog entries we will go through the various steps. This one will start at the beginning and get you to a completely prepared VM.

  • Create a VM with a default installation of Oracle Linux 6 update 3
  • You can freely download Oracle Linux installation images from http://edelivery.oracle.com/linux. Choose any type of installation you want, basic, desktop, server, minimal...

    Oracle Linux 6.3 comes with kernel 2.6.39-200.24.1 (UEK2)

    # uname -a
    Linux ol6u3 2.6.39-200.24.1.el6uek.x86_64 #1 SMP Sat Jun 23 02:39:07 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

  • Update the VM to the latest version of UEK and in general as a best practice update to the latest patches and reboot the VM
  • Oracle Linux updates are freely available on our public-yum site and the default install of Oracle Linux 6.3 already points to this location for updates.

    # yum update 
    # reboot
    # uname -a
    Linux ol6u3 2.6.39-300.17.3.el6uek.x86_64 #1 SMP Wed Dec 19 06:28:03 PST 2012 x86_64 x86_64 x86_64 GNU/Linux

    There is an extra kernel module required for the Oracle VM API to work. The ovmapi kernel module provides the ability to communicate messages back and forth between the host and the VM, and as such between Oracle VM Manager, through the VM API, to the VM and back. We included this kernel module in the 2.6.39-300 kernel to make it easy. There is no need to install extra kernel modules or keep kernel modules up to date when or if we have a new update. The source code for this kernel module is of course part of the UEK2 source tree.

  • Enable the Oracle Linux add-on channel
  • After reboot, download the latest public-yum repo file from public-yum which contains more repositories and enable the add-on channel which contains the Oracle VM API packages:

    inside the VM :

    # cd /etc/yum.repos.d
    # rm public-yum-ol6.repo    <- (replace the original version with this newer version)
    # wget http://public-yum.oracle.com/public-yum-ol6.repo

  • Edit the public-yum-ol6.repo file to enable the ol6_addons channel.
  • Find the ol6_addons section and change enabled=0 to enabled=1.

    name=Oracle Linux $releasever Add ons ($basearch)

    Save the file.
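
    If you prepare many VMs, the same edit can be scripted. This is only a sketch in Python 3, using the standard configparser module (a .repo file is an INI-style file); the function name enable_repo is made up for this example. One thing to be aware of: configparser drops any comments in the file when it writes it back.

```python
import configparser


def enable_repo(repo_path, section="ol6_addons"):
    """Set enabled=1 in one section of a yum .repo file (hypothetical helper)."""
    # interpolation=None leaves values such as $releasever and any '%' untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    with open(repo_path) as handle:
        parser.read_file(handle)
    if not parser.has_section(section):
        raise KeyError("no [%s] section in %s" % (section, repo_path))
    parser.set(section, "enabled", "1")
    with open(repo_path, "w") as handle:
        # write key=value without spaces, like the original file
        parser.write(handle, space_around_delimiters=False)
```

    Inside the VM you would call enable_repo("/etc/yum.repos.d/public-yum-ol6.repo") as root, after downloading the newer repo file.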

  • Install the Oracle VM API packages
  • # yum install ovmd xenstoreprovider python-simplejson ovm-template-config

    This installs the basic necessary packages on Oracle Linux 6 to support the Oracle VM API. xenstoreprovider is the library which communicates with the ovmapi kernel infrastructure. ovmd is a daemon that handles configuration and re-configuration events and provides a mechanism to send/receive messages between the VM and the Oracle VM Manager.

  • Add additional configuration packages you want
  • In order to be able to create a VM template that includes basic OS configuration system scripts, you can decide to install any or all of the following :

    ovm-template-config-authentication : Oracle VM template authentication configuration script.
    ovm-template-config-datetime       : Oracle VM template datetime configuration script.
    ovm-template-config-firewall       : Oracle VM template firewall configuration script.
    ovm-template-config-network        : Oracle VM template network configuration script.
    ovm-template-config-selinux        : Oracle VM template selinux configuration script.
    ovm-template-config-ssh            : Oracle VM template ssh configuration script.
    ovm-template-config-system         : Oracle VM template system configuration script.
    ovm-template-config-user           : Oracle VM template user configuration script.

    Simply type # yum install ovm-template-config-... to install whichever you want.

  • Enable ovmd
  • To enable ovmd (recommended) do :

    # chkconfig ovmd on 
    # /etc/init.d/ovmd start
  • Prepare your VM for first boot configuration
  • If you want to shut down this VM and enable the first boot configuration as a template, execute :

    # ovmd -s cleanup
    # service ovmd enable-initial-config
    # shutdown -h now

    After cloning this VM or starting it, it will act as a first time boot VM and it will require configuration input through the VM API or on the virtual VM console.

    My next blog will go into detail on how to send messages through the Oracle VM API for remote configuration and also how to extend the scripts.

    Friday Jan 04, 2013


    dlmfs is a really cool nifty feature as part of OCFS2. Basically, it's a virtual filesystem that allows a user/program to use the DLM through simple filesystem commands/manipulation. Without having to write programs that link with cluster libraries or do complex things, you can literally write a few lines of Python, Java or C code that let you create locks across a number of servers. We use this feature in Oracle VM to coordinate the master server and the locking of VMs across multiple nodes in a cluster. It allows us to make sure that a VM cannot start on multiple servers at once. Every VM is backed by a DLM lock, but by using dlmfs, this is simply a file in the dlmfs filesystem.
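
    To give an idea of what "a few lines of Python" means, here is a minimal Python 3 sketch of the try-lock step on its own. It assumes the lock file already exists in a dlmfs domain directory and that a busy lock is reported as ETXTBSY. The function name try_master_lock and the os_open/os_close parameters are made up for this example; the parameters are only there so the logic can be tried out on a machine without a cluster.

```python
import errno
import os


def try_master_lock(lock_path, os_open=os.open, os_close=os.close):
    """Try to take an exclusive lock on a dlmfs lock file without blocking.

    Returns True if we got the lock and False if another node holds it.
    Hypothetical helper; any error other than ETXTBSY is re-raised.
    """
    try:
        # opening read-write asks for an exclusive lock,
        # O_NONBLOCK makes it a try-lock instead of a wait
        fd = os_open(lock_path, os.O_RDWR | os.O_NONBLOCK)
    except OSError as e:
        if e.errno == errno.ETXTBSY:
            return False
        raise
    # closing the file descriptor does not drop the lock
    os_close(fd)
    return True


if __name__ == "__main__":
    # dry run without a cluster: a fake open() that reports a busy lock
    def busy_open(path, flags):
        raise OSError(errno.ETXTBSY, "Text file busy")

    print(try_master_lock("/dlm/mycluster/master", os_open=busy_open))
```

    On a real cluster node you would call try_master_lock("/dlm/mycluster/master") with the defaults and remove the file to release the lock.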

    To show you how easy and powerful this is, I took some of the Oracle VM agent Python code. It is a very simple example of how to create a lock domain and a lock, and how to tell whether you got the lock or not. The focus here is just a master lock, which you could use for an agent that is responsible for a virtual IP or some executable that you want to locate on a given server, but the calls to create any kind of lock are in the code. Anyone who wants to experiment with this can add their own bits in a matter of minutes.

    The prerequisite is simple : take a number of servers, configure an ocfs2 volume and an ocfs2 cluster (see previous blog entries) and run the script. You do not have to set up an ocfs2 volume if you do not want to; you could just set up the domain without actually mounting the filesystem (see the global heartbeat blog). So practically this can be done with a very small, simple setup.

    My example has two Oracle Linux 6 nodes, wcoekaer-emgc1 and wcoekaer-emgc2, configured with a shared disk and an ocfs2 filesystem mounted. This setup ensures that the dlmfs kernel module is loaded and the cluster is online. Take the python code listed here and just execute it on both nodes.

    [root@wcoekaer-emgc2 ~]# lsmod |grep ocfs
    ocfs2                1092529  1 
    ocfs2_dlmfs            20160  1 
    ocfs2_stack_o2cb        4103  1 
    ocfs2_dlm             228380  1 ocfs2_stack_o2cb
    ocfs2_nodemanager     219951  12 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
    ocfs2_stackglue        11896  3 ocfs2,ocfs2_dlmfs,ocfs2_stack_o2cb
    configfs               29244  2 ocfs2_nodemanager
    jbd2                   93114  2 ocfs2,ext4
    You see that the ocfs2_dlmfs kernel module is loaded.

    [root@wcoekaer-emgc2 ~]# mount |grep dlmfs
    ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
    The dlmfs virtual filesystem is mounted on /dlm.

    I now execute dlm.py on both nodes and show some output. After a while I kill (control-c) the script on the master node and you see the other node take over the lock. I then restart the dlm.py script and reboot the other node and you see the same.

    [root@wcoekaer-emgc1 ~]# ./dlm.py 
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    ^Ccleaned up master lock file
    [root@wcoekaer-emgc1 ~]# ./dlm.py 
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am not the master
    i am not the master
    i am not the master
    i am not the master
    i am master of the multiverse
    This shows that I started as master, then hit ctrl-c and dropped the lock, at which point the other node took it. After I restarted the script I was not the master, until I rebooted the other node and took the lock again.

    [root@wcoekaer-emgc2 ~]# ./dlm.py
    Checking DLM
    DLM Ready - joining domain : mycluster
    Starting main loop...
    i am not the master
    i am not the master
    i am not the master
    i am not the master
    i am master of the multiverse
    i am master of the multiverse
    i am master of the multiverse
    [1]+  Stopped                 ./dlm.py
    [root@wcoekaer-emgc2 ~]# bg
    [1]+ ./dlm.py &
    [root@wcoekaer-emgc2 ~]# reboot -f
    Here you see that this node started without being master, became master when I hit ctrl-c on the other node, and that after a forced reboot the lock automatically gets released.

    And here is the code, just copy it to your servers and execute it...

    #!/usr/bin/python
    # Copyright (C) 2006-2012 Oracle. All rights reserved.
    # This program is free software; you can redistribute it and/or modify it under
    # the terms of the GNU General Public License as published by the Free Software
    # Foundation, version 2.  This program is distributed in the hope that it will
    # be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General
    # Public License for more details.  You should have received a copy of the GNU
    # General Public License along with this program; if not, write to the Free
    # Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
    # 02111-1307, USA.
    import sys
    import subprocess
    import os
    from time import sleep
    from os.path import join, isdir, exists

    # defines
    # dlmfs is where the dlmfs filesystem is mounted
    # the default, normal place for ocfs2 setups is /dlm
    # ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
    DLMFS = "/dlm"
    # we need a domain name which really is just a subdir in dlmfs
    # default to "mycluster" so then it creates /dlm/mycluster
    # locks are created inside this directory/domain
    DLM_DOMAIN_NAME = "mycluster"
    DLM_DOMAIN_PATH = DLMFS + "/" + DLM_DOMAIN_NAME
    # the main lock to use for being the owner of a lock
    # this can be any name, the filename is just the lockname
    DLM_LOCK_MASTER = DLM_DOMAIN_PATH + "/" + "master"
    # just a timeout
    SLEEP_ON_ERR = 60
    # seconds between attempts to take the master lock (assumed value)
    SLEEP_BETWEEN_CHECKS = 5


    def run_cmd(cmd, success_return_code=(0,)):
        # run an external command given as a list and return its stdout
        if not isinstance(cmd, list):
            raise Exception("Only accepts list!")
        cmd = [str(x) for x in cmd]
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, close_fds=True,
                                universal_newlines=True)
        (stdoutdata, stderrdata) = proc.communicate()
        if proc.returncode not in success_return_code:
            raise RuntimeError('Command: %s failed (%s): stderr: %s stdout: %s'
                               % (cmd, proc.returncode, stderrdata, stdoutdata))
        return str(stdoutdata)


    def dlm_ready():
        """
        Indicate if the DLM is ready or not.

        With dlmfs, the DLM is ready once the DLM filesystem is mounted
        under /dlm.

        @return: C{True} if the DLM is ready, C{False} otherwise.
        @rtype: C{bool}
        """
        return os.path.ismount(DLMFS)


    # just do a mkdir, if it already exists, we're good, if not just create it
    def dlm_join_domain(domain=DLM_DOMAIN_NAME):
        _dir = join(DLMFS, domain)
        if not isdir(_dir):
            os.mkdir(_dir)
        # else: already joined


    # leaving a domain is basically removing the directory.
    def dlm_leave_domain(domain=DLM_DOMAIN_NAME, force=True):
        _dir = join(DLMFS, domain)
        if force:
            cmd = ["rm", "-fr", _dir]
        else:
            cmd = ["rmdir", _dir]
        run_cmd(cmd)


    # acquire a lock
    def dlm_acquire_lock(lock):
        # a lock is a filename in the domain directory
        lock_path = join(DLM_DOMAIN_PATH, lock)
        try:
            if not exists(lock_path):
                fd = os.open(lock_path, os.O_CREAT | os.O_NONBLOCK)
                os.close(fd)
            # create the EX lock
            # creating a file with O_RDWR causes an EX lock
            fd = os.open(lock_path, os.O_RDWR | os.O_NONBLOCK)
            # once the file is created in this mode, you can close it
            # and you still keep the lock
            os.close(fd)
        except Exception:
            # do not leave a stale lock file behind on failure
            if exists(lock_path):
                os.remove(lock_path)
            raise


    def dlm_release_lock(lock):
        # releasing a lock is as easy as just removing the file
        lock_path = join(DLM_DOMAIN_PATH, lock)
        if exists(lock_path):
            os.remove(lock_path)


    def acquire_master_dlm_lock():
        ETXTBUSY = 26
        # close() does not downconvert the lock level nor does it drop the lock. The
        # holder still owns the lock at that level after close.
        # close() allows any downconvert request to succeed.
        # However, a downconvert request is only generated for queued requests. And
        # O_NONBLOCK is specifically a noqueue dlm request.
        # 1) O_CREAT | O_NONBLOCK will create a lock file if it does not exist, whether
        #    we are the lock holder or not.
        # 2) if we hold O_RDWR lock, and we close but not delete it, we still hold it.
        #    afterward, O_RDWR will succeed, but O_RDWR | O_NONBLOCK will not.
        # 3) but if we do not hold the lock, O_RDWR will hang there waiting,
        #    which is not desirable -- any uninterruptible hang is undesirable.
        # 4) if nobody else holds the lock either, but the lock file exists as side effect
        #    of 1), with O_NONBLOCK, it may result in ETXTBUSY
        # a) we need O_NONBLOCK to avoid scenario (3)
        # b) we need to delete it ourself to avoid (2)
        #   *) if we do not succeed with (1), remove the lock file to avoid (4)
        #   *) if everything is good, we drop it and we remove it
        #   *) if killed by a program, this program should remove the file
        #   *) if crashed, but not rebooted, something needs to remove the file
        #   *) on reboot/reset the lock is released to the other node(s)
        try:
            if not exists(DLM_LOCK_MASTER):
                fd = os.open(DLM_LOCK_MASTER, os.O_CREAT | os.O_NONBLOCK)
                os.close(fd)
            master_lock = os.open(DLM_LOCK_MASTER, os.O_RDWR | os.O_NONBLOCK)
            os.close(master_lock)
            print("i am master of the multiverse")
            # at this point, I know I am the master and I can add code to do
            # things that only a master can do, such as, consider setting
            # a virtual IP or, if I am master, I start a program
            # and if not, then I make sure I don't run that program (or VIP)
            # so the magic starts here...
            return True
        except OSError as e:
            if e.errno == ETXTBUSY:
                print("i am not the master")
                # if we are not master and the file exists, remove it or
                # we will never succeed
                if exists(DLM_LOCK_MASTER):
                    os.remove(DLM_LOCK_MASTER)
                return False
            else:
                raise


    def release_master_dlm_lock():
        if exists(DLM_LOCK_MASTER):
            os.remove(DLM_LOCK_MASTER)


    def run_forever():
        print("Checking DLM")
        if dlm_ready():
            print("DLM Ready - joining domain : " + DLM_DOMAIN_NAME)
            dlm_join_domain()
        else:
            print("DLM not ready - bailing, fix the cluster stack")
            sys.exit(1)
        print("Starting main loop...")
        while True:
            try:
                acquire_master_dlm_lock()
                sleep(SLEEP_BETWEEN_CHECKS)
            except Exception:
                sleep(SLEEP_ON_ERR)
            except (KeyboardInterrupt, SystemExit):
                if exists(DLM_LOCK_MASTER):
                    os.remove(DLM_LOCK_MASTER)
                # if you control-c out of this, then you lose the lock!
                # delete it on exit for release
                print("cleaned up master lock file")
                sys.exit(0)


    if __name__ == '__main__':
        run_forever()

    Thursday Jan 03, 2013

    OCFS2 global heartbeat

    A cool, but often missed feature in Oracle Linux is the inclusion of OCFS2. OCFS2 is a native Linux cluster filesystem which was written many years ago at Oracle (hence the name Oracle Cluster Filesystem) and which got included in the mainline Linux kernel in 2.6.16, in early 2006. The filesystem is widely used and has a number of really cool features.

  • simplicity : it's incredibly easy to configure the filesystem and clusterstack. There is literally one small text-based config file.
  • complete : ocfs2 contains all the components needed : a nodemanager, a heartbeat, a distributed lock manager and the actual cluster filesystem
  • small : the size of the filesystem and the needed tools is incredibly small. It consists of a few kernel modules and a small set of userspace tools. All the kernel modules together add up to about 2.5MB in size and the userspace package is a mere 800KB.
  • integrated : it's a native Linux filesystem so it makes use of all the normal kernel infrastructure. There is no duplication of structures or caches; it fits right into the standard Linux filesystem structure.
  • part of Oracle Linux/UEK : ocfs2, like other linux filesystems, is built as kernel modules. When customers use Oracle Linux's UEK or UEK2, we automatically compile the kernel modules for the filesystem. Other distributions like SLES have done the same. We fully support OCFS2 as part of Oracle Linux as a general purpose cluster filesystem.
  • feature rich :
    OCFS2 is POSIX-compliant
    Optimized Allocations (extents, reservations, sparse, unwritten extents, punch holes)
    REFLINKs (inode-based writeable snapshots)
    Indexed Directories
    Metadata Checksums
    Extended Attributes (unlimited number of attributes per inode)
    Advanced Security (POSIX ACLs and SELinux)
    User and Group Quotas
    Variable Block and Cluster sizes
    Journaling (Ordered and Writeback data journaling modes)
    Endian and Architecture Neutral (x86, x86_64, ia64 and ppc64) - yes, you can mount the filesystem in a heterogeneous cluster.
    Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os
    In-built Clusterstack with a Distributed Lock Manager
    Cluster-aware Tools (mkfs, fsck, tunefs, etc.)
  • One of the main features added most recently is Global Heartbeat. OCFS2 as a filesystem typically was used with what's called local heartbeat. Basically for every filesystem you mounted, it would start its own local heartbeat and membership mechanism. The disk heartbeat means a disk io every 1 or 2 seconds for every node in the cluster, for every device. It was never a problem when the number of mounted volumes was relatively small, but once customers were using 20+ volumes the overhead of the multiple disk heartbeats became significant and at times became a stability issue.

    Global heartbeat was written to provide a solution to the multiple heartbeats. It is now possible to specify on which device(s) you want a heartbeat thread, and then you can mount many other volumes that do not have their own. The heartbeat is shared amongst those one or few threads, which significantly reduces disk IO overhead.
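
    A bit of back-of-the-envelope arithmetic shows the difference. The Python 3 sketch below uses a deliberately simplified model that is my assumption, not a measurement: every node does one heartbeat io per heartbeat device per interval. The function name and the example numbers (4 nodes, 20 volumes, 2 second interval) are made up for illustration.

```python
def heartbeat_ios_per_second(nodes, heartbeat_devices, interval_seconds=2.0):
    """Rough heartbeat io rate for the whole cluster.

    Simplified model (assumption): one io per node, per heartbeat device,
    per interval.
    """
    return nodes * heartbeat_devices / interval_seconds


# local heartbeat: every one of the 20 mounted volumes has its own heartbeat
local_rate = heartbeat_ios_per_second(nodes=4, heartbeat_devices=20)
# global heartbeat: a single heartbeat device shared by all 20 volumes
global_rate = heartbeat_ios_per_second(nodes=4, heartbeat_devices=1)

print(local_rate, global_rate)  # prints: 40.0 2.0
```

    In this model the heartbeat io no longer grows with the number of mounted volumes, only with the number of nodes and heartbeat devices.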

    I was playing with this a little bit the other day and noticed that this wasn't very well documented, so why not write it up here and share it with everyone. Getting started with OCFS2 is really easy, and within just a few minutes it is possible to have a complete installation.

    I started with two servers installed with Oracle Linux 6.3. Each server has 2 network interfaces, one public and one private. The servers have a local disk and a shared storage device. For cluster filesystems, this shared storage device should typically be either a shared SAN disk or an iSCSI device, but with Oracle Linux and UEK2 it is also possible to create a shared virtual device on an nfs server and use this device for the cluster filesystem. This technique is used with Oracle VM where the shared storage is NAS-based. I just wrote a blog entry about how to do that here.

    While it is technically possible to create a working ocfs2 configuration using just one network and a single IP per server, it is certainly not ideal and not a recommended configuration for real world use. In any cluster environment it's highly recommended to have a private network for cluster traffic. The biggest reason for instability in a clustering environment is a bad/unreliable network and/or storage. Many times the environment has an overloaded network, which causes network heartbeats to fail, or disks where failover takes longer than the default configuration allows, and the only alternative we have at that point is to reboot the node(s).

    Typically when I do a test like this, I make sure I use the latest versions of the OS release. So after an installation of Oracle Linux 6.3, I just do a yum update on all my nodes to have the latest packages and also the latest kernel version installed, and then do a reboot. That gets me to 2.6.39-300.17.3.el6uek.x86_64 at the time of writing. Of course all this is freely accessible from http://public-yum.oracle.com.

    Depending on the type of installation you did (basic, minimal, etc...) you may or may not have to add RPMs. Do a simple check rpm -q ocfs2-tools to see if the tools are installed, if not, just run yum install ocfs2-tools. And that's it. All required software is now installed. The kernel modules are already part of the uek2 kernel and the required tools (mkfs, fsck, o2cb,..) are part of the ocfs2-tools RPM.

    Next up: create the filesystem on the shared disk device and configure the cluster.

    One requirement for using global heartbeat is that the heartbeat device needs to be a NON-partitioned disk. Other OCFS2 volumes you want to create and mount can be on partitioned disks, but a device for the heartbeat needs to be on an empty disk. Let's assume /dev/sdb in this example.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 \
    --cluster-name=ocfs2 --cluster-stack=o2cb --global-heartbeat /dev/sdb
    This creates a filesystem with a 4K blocksize (the normal value) and a clustersize of 4K (a good value if you have many small files; if you have few large files, go to 1M).

    -J size=4M : a journal size of 4M. If you have a large filesystem with a lot of metadata changes you might want to increase this. I did not add an option for 32bit or 64bit journals. If you want to create huge filesystems, use block64, which uses jbd2.

    -N 4 : the filesystem is created for 4 nodes. If your cluster needs to grow larger later, you can always change this with tunefs.ocfs2.

    -L ocfs2vol1 : a disk label you can later use to mount the filesystem by label.

    --cluster-name=ocfs2 : this is the default name, but if you want your own name for your cluster you can put a different value here. Remember it, because you will need to configure the clusterstack with the same cluster name later.

    --cluster-stack=o2cb : it is possible to use different cluster stacks, such as pacemaker or cman.

    --global-heartbeat : makes sure that the filesystem is prepared and built to support global heartbeat.

    /dev/sdb : the device to use for the filesystem.

    # mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=ocfs2 \
    --cluster-stack=o2cb --force --global-heartbeat /dev/sdb
    mkfs.ocfs2 1.8.0
    Cluster stack: o2cb
    Cluster name: ocfs2
    Stack Flags: 0x1
    NOTE: Feature extended slot map may be enabled
    Overwriting existing ocfs2 partition.
    WARNING: Cluster check disabled.
    Proceed (y/N): y
    Label: ocfs2vol1
    Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
    Block size: 4096 (12 bits)
    Cluster size: 4096 (12 bits)
    Volume size: 10725765120 (2618595 clusters) (2618595 blocks)
    Cluster groups: 82 (tail covers 5859 clusters, rest cover 32256 clusters)
    Extent allocator size: 4194304 (1 groups)
    Journal size: 4194304
    Node slots: 4
    Creating bitmaps: done
    Initializing superblock: done
    Writing system files: done
    Writing superblock: done
    Writing backup superblock: 2 block(s)
    Formatting Journals: done
    Growing extent allocator: done
    Formatting slot map: done
    Formatting quota files: done
    Writing lost+found: done
    mkfs.ocfs2 successful
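As a sanity check on the output above, the cluster count can be derived two ways from the numbers mkfs.ocfs2 printed. This is just arithmetic on the values shown; the function names are made up for the example.

```shell
#!/bin/sh
# Numbers copied from the mkfs.ocfs2 output above.
volume_bytes=10725765120
cluster_bytes=4096
groups=82
clusters_per_group=32256
tail_clusters=5859

# Volume size divided by cluster size gives the cluster count.
clusters_from_size() {
    echo $(( $1 / $2 ))
}

# All groups but the last are full; the tail group covers the remainder.
clusters_from_groups() {
    echo $(( ($1 - 1) * $2 + $3 ))
}

echo "clusters by size  : $(clusters_from_size $volume_bytes $cluster_bytes)"
echo "clusters by groups: $(clusters_from_groups $groups $clusters_per_group $tail_clusters)"
```

Both lines come out at 2618595, the cluster count reported in the "Volume size" line. With a 1M clustersize the same volume would have far fewer, larger clusters, which is why it suits few large files better.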

    Now, we just have to configure the o2cb stack and we're done.

  • add the cluster : o2cb add-cluster ocfs2
  • add the nodes :
    o2cb add-node --ip --number 0 ocfs2 host1
    o2cb add-node --ip --number 1 ocfs2 host2
  • it is very important to use the hostname of the server (the name you get when typing hostname) for each node!
  • add the heartbeat device:
    run the following command and note the UUID of the filesystem/device you want to use for heartbeat:
    # mounted.ocfs2 -d
    Device      Stack  Cluster  F  UUID                              Label
    /dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1
    o2cb add-heartbeat ocfs2 244A6AAAE77F4053803734530FC4E0B7
  • enable global heartbeat : o2cb heartbeat-mode ocfs2 global
  • start the clusterstack : /etc/init.d/o2cb enable
  • verify that the stack is up and running : o2cb cluster-status
  • That's it. If you want to enable this at boot time, you can configure o2cb to start automatically by running /etc/init.d/o2cb configure. This allows you to set different heartbeat timeout values and also whether or not to start the clusterstack at boot time.
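The heartbeat step above means copying a UUID by hand from the mounted.ocfs2 -d output. A small sketch of pulling it out by label instead; it assumes the six-column layout shown above (with the F column filled in), and uuid_for_label is a made-up helper name.

```shell
#!/bin/sh
# Print the UUID column for the volume whose Label matches $1.
# Reads `mounted.ocfs2 -d` style output on stdin; assumes the column layout
# shown above: Device Stack Cluster F UUID Label.
uuid_for_label() {
    awk -v label="$1" 'NR > 1 && $6 == label { print $5 }'
}

# Sample taken from the output above, so this runs without a cluster.
sample='Device      Stack  Cluster  F  UUID                              Label
/dev/sdb   o2cb   ocfs2    G  244A6AAAE77F4053803734530FC4E0B7  ocfs2vol1'

printf '%s\n' "$sample" | uuid_for_label ocfs2vol1
```

On a real node you would pipe mounted.ocfs2 -d into the function and hand the result to o2cb add-heartbeat. If a row has an empty column the field positions shift, so check the output by eye first.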

    Now that a first node is configured, all you have to do is copy the file /etc/ocfs2/cluster.conf to all the other nodes in your cluster. You do not have to edit it on the other nodes; you just need an exact copy everywhere. You also do not need to redo the above commands. Just make sure ocfs2-tools is installed everywhere and, if you want to start at boot time, re-run /etc/init.d/o2cb configure on the other nodes as well. From here on, you can just mount your filesystems :

    mount /dev/sdb /mountpoint1 on each node.
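For the copy of cluster.conf described above, a dry-run sketch that only prints the commands, so you can review them before running anything. Using scp as root and the node name host2 are assumptions for illustration; use whatever copy method you normally do.

```shell
#!/bin/sh
# Dry run: print (do not execute) the copy command for each remaining node.
# scp as root and the node names are assumptions for illustration.
CONF=/etc/ocfs2/cluster.conf

print_copy_commands() {
    for node in "$@"; do
        echo "scp $CONF root@$node:$CONF"
    done
}

print_copy_commands host2
```

Pass every node except the one you configured first. The file has to be identical everywhere, so copy it rather than retyping it.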

    If you create more OCFS2 volumes you can just keep mounting them all, and with global heartbeat, you will just have one (or a few) hb's going on.

    have fun...

    Here is vmstat output, the first output shows a single heartbeat and 8 mounted filesystems, the second vmstat output shows 8 mounted filesystems with their own local heartbeat. Even though the IO amount is low, it shows that there are about 8x more IOs happening (from 1 every other second to 4 every second). As these are small IOs, they will move the diskhead to a specific place all the time and interrupt performance if you have it on each device. Hopefully this shows the benefits of global heartbeat.

    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
     0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
     0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0
     0  0      0 789752  26228  97620    0    0     1     0   28   26  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   30   30  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     1     1   26   20  0  0 100  0  0
     0  0      0 789760  26228  97620    0    0     0     0   54   37  0  1 100  0  0
     0  0      0 789760  26228  97620    0    0     1     0   29   28  0  0 100  0  0
     0  0      0 789760  26236  97612    0    0     0    16   43   48  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     1   48   28  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   42   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     1     0   26   30  0  0 100  0  0
     0  0      0 789760  26236  97620    0    0     0     0   35   24  0  0 100  0  0
     0  1      0 789760  26240  97616    0    0     1    21   29   27  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     4   51   44  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     0   31   24  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   25   28  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     1     1   30   20  0  0 100  0  0
     0  0      0 789760  26244  97620    0    0     0     0   41   30  0  0 100  0  0
     0  0      0 789760  26252  97616    0    0     1    16   56   44  0  0 100  0  0
    # vmstat 1
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   56   44  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   46   47  0  0 100  0  0
     0  0      0 784364  28732  98628    0    0     4     2   65   54  0  0 100  0  0
     0  0      0 784388  28740  98620    0    0     4    14   65   55  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   46   48  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   52   42  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   51   58  0  0 100  0  0
     0  0      0 784388  28740  98628    0    0     4     2   36   43  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   39   47  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   52   54  0  0 100  0  0
     0  0      0 784396  28740  98628    0    0     4     2   42   48  0  0 100  0  0
     0  0      0 784404  28748  98620    0    0     4    14   52   63  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   32   42  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   50   40  0  0 100  0  0
     0  0      0 784404  28748  98628    0    0     4     2   58   56  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   39   46  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   45   50  0  0 100  0  0
     0  0      0 784412  28748  98628    0    0     4     2   43   42  0  0 100  0  0
     0  0      0 784288  28748  98628    0    0     4     6   48   52  0  0 100  0  0
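To put a number on the two runs above, here is a small sketch that averages the bi column of vmstat output. It only uses the first four samples of each run, copied from the output above; avg_bi is a made-up helper name.

```shell
#!/bin/sh
# Average the "bi" (blocks in) column of `vmstat 1` output read from stdin.
# Header lines are skipped by only accepting rows whose first field is numeric.
avg_bi() {
    awk '$1 ~ /^[0-9]+$/ { sum += $9; n++ }
         END { if (n) printf "%.1f\n", sum / n }'
}

# First four samples of the global heartbeat run above.
global_hb=' 0  0      0 789752  26220  97620    0    0     1     0   41   34  0  0 100  0  0
 0  0      0 789752  26220  97620    0    0     0     0   46   22  0  0 100  0  0
 0  0      0 789752  26220  97620    0    0     1     1   38   29  0  0 100  0  0
 0  0      0 789752  26228  97620    0    0     0    52   52   41  0  0 100  1  0'

# First four samples of the local heartbeat run above.
local_hb=' 0  0      0 784364  28732  98620    0    0     4    46   54   64  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   60   48  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   51   53  0  0 100  0  0
 0  0      0 784364  28732  98628    0    0     4     2   58   50  0  0 100  0  0'

echo "global heartbeat avg bi: $(printf '%s\n' "$global_hb" | avg_bi)"
echo "local heartbeat avg bi : $(printf '%s\n' "$local_hb" | avg_bi)"
```

On these samples the averages are 0.5 and 4.0, which is the roughly 8x difference mentioned above. On a live system you could pipe vmstat 1 30 straight into avg_bi, keeping in mind that the first row vmstat prints is an average since boot.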

    dm nfs

    A little known feature that we make good use of in Oracle VM is called dm nfs. Basically the ability to create a device mapper device directly on an nfs-based file/filesystem. We use this in Oracle VM 3 if your shared storage for the cluster is nfs based.

    Oracle VM clustering relies on the OCFS2 clusterstack/filesystem that is native in the kernel (uek2/2.6.39-x). When we create an HA-enabled pool, we create, what we call, a pool filesystem. That filesystem contains an ocfs2 volume so that we can store cluster-wide data. In particular we store shared database files that are needed by the Oracle VM agents on the nodes for HA. It contains info on pool membership, which VMs are in HA mode, what the pool IP is etc...

    When the user provides an nfs filesystem for the pool, we do the following :

  • mount the nfs volume in /nfsmnt/
  • create a 10GB sized file ovspoolfs.img
  • create a dm nfs volume (/dev/mapper/ovspoolfs) on this ovspoolfs.img file
  • create an ocfs2 volume on this dm nfs device
  • mount the ocfs2 volume on /poolfsmnt/
  • If someone wants to try out something that relies on block-based shared storage devices, such as ocfs2, but does not have iSCSI or SAN storage, using nfs is an alternative and dm nfs just makes it really easy.

    To do this yourself, the following commands will do it for you :

  • to find out if any such devices exist just type dmsetup table --target nfs
  • to create your own device, do something like this:
  • mount mynfsserver:/mountpoint /mnt
    dd if=/dev/zero of=/mnt/myvolume.img bs=1M count=2000 
    dmsetup create myvolume --table "0 4096000 nfs /mnt/myvolume.img 0"
    So mount the nfs volume, create a file which will be the container of the blockdevice, in this case a 2GB file and then create the dm device. The values for the dmsetup command are the following:

    myvolume = the name of the /dev/mapper device. Here we end up with /dev/mapper/myvolume

    table = start (normally always 0), length in 512-byte blocks (so twice the size in KB: 2000MB = 2048000KB = 4096000 blocks), nfs since this is on nfs, filename of the nfs based file, offset (normally always 0)

    So now you have /dev/mapper/myvolume, it acts like a normal block device. If you do this on multiple servers, you can actually create an ocfs2 filesystem on this block device and it will be consistent across the servers.
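Since the length in the table has to match the size of the backing file, here is a small sketch that builds the table line from a size in MB. nfs_table is a made-up helper name; it only prints the string you would pass to dmsetup create --table.

```shell
#!/bin/sh
# Build the dmsetup table line for an nfs-backed file of a given size in MB
# (as created with dd bs=1M). 1 MB here = 2048 blocks of 512 bytes.
nfs_table() {
    size_mb=$1
    file=$2
    echo "0 $(( size_mb * 2048 )) nfs $file 0"
}

nfs_table 2000 /mnt/myvolume.img
```

For the 2000 x 1M file created with dd above this gives "0 4096000 nfs /mnt/myvolume.img 0", the same table used in the dmsetup command.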

    Credits go to Chuck Lever for writing dm nfs in the first place, thanks Chuck :) The code for dm nfs is here.

    Tuesday Nov 27, 2012

    Introducing the Oracle Linux Playground yum repo

    We just introduced a new yum repository/channel on http://public-yum.oracle.com called the playground channel. What we started doing is the following:

    When a new stable mainline kernel is released by Linus or GregKH, we internally build RPMs to test it and do some QA work around it to keep track of what's going on with the latest development kernels. It helps us understand how performance moves up or down and if there are issues, we try to help look into them and of course send that stuff back upstream. Many Linux users out there are interested in trying out the latest features but there are some potential barriers to do this.

    (1) in general, you are looking at an upstream development distribution, which means that everything changes, both in userspace (random applications) and kernel. Projects like Fedora are very useful, and for someone that just wants to see how the entire distribution evolves with all the changes, this is a great way to be current. A drawback here, though, is that if you have applications that are not part of the distribution, there's a lot of manual work involved or they might just not work because the changes are too drastic. The introduction of systemd is a good example.

    (2) when you look at many of our customers, that are interested in our database products or applications, the starting point of having a supported/certified userspace/distribution, like Oracle Linux, is a much easier way to get your feet wet in seeing what new/future Linux kernel enhancements could do.

    This is where the playground channel comes into play. When you install Oracle Linux 6 (which anyone can download and use from http://edelivery.oracle.com/linux), grab the latest public yum repository file http://public-yum.oracle.com/public-yum-ol6.repo, put it in /etc/yum.repos.d and enable the playground repo :

    name=Latest mainline stable kernel for Oracle Linux 6 ($basearch) - Unsupported 
    Now, all you need to do : type yum update and you will be downloading the latest stable kernel which will install cleanly on Oracle Linux 6. Thus you end up with a stable Linux distribution where you can install all your software, and then download the latest stable kernel (at time of writing this is 3.6.7) without having to recompile a kernel, without having to jump through hoops.

    There is of course a big, very important disclaimer: this is NOT for PRODUCTION use.

    We want to help people that are interested, from a user perspective, in where the Linux kernel is going, and make it easy to install it, use it and play around with new features, without having to learn how to compile a kernel and without necessarily having to install a complete new distribution with all the changes top to bottom.

    So we don't or won't introduce any new userspace changes, this project really is around making it easy to try out the latest upstream Linux kernels in a very easy way on an environment that's stable and you can keep current, since all the latest errata for Oracle Linux 6 are published on the public yum repo as well. So one repository location for all your current changes and the upstream kernels. We hope that this will get more users to try out the latest kernel and report their findings. We are always interested in understanding stability and performance characteristics.

    As new features are going into the mainline kernel, that could potentially be interesting or useful for various products, we will try to point them out on our blogs and give an example on how something can be used so you can try it out for yourselves.

    Anyway, I hope people will find this useful and that it will help increase interest in upstream development beyond reading lkml by some of the more non-kernel-developer types.

    Thursday Jul 05, 2012


    (updated 3/18/13 to fix disknaming error) Oracle ASMlib on Linux has been a topic of discussion a number of times since it was released way back when in 2004. There is a lot of confusion around it and certainly a lot of misinformation out there for no good reason. Let me try to give a bit of history around Oracle ASMLib.

    Oracle ASMLib was introduced at the time Oracle released Oracle Database 10g R1. 10gR1 introduced a very cool, important new feature called Oracle ASM (Automatic Storage Management). A very simplistic description would be that this is a very sophisticated volume manager for Oracle data. Give your devices directly to the ASM instance and we manage the storage for you, clustered, highly available, redundant, performance, etc, etc... We recommend using Oracle ASM for all database deployments, single instance or clustered (RAC).

    The ASM instance manages the storage and every Oracle server process opens and operates on the storage devices like it would open and operate on regular datafiles or raw devices. So by default since 10gR1 up to today, we do not interact differently with ASM managed block devices than we did before with a datafile being mapped to a raw device. All of this is without ASMLib, so ignore that one for now. Standard Oracle on any platform that we support (Linux, Windows, Solaris, AIX, ...) does it the exact same way. You start an ASM instance, it handles storage management, all the database instances use and open that storage and read/write from/to it. There are no extra pieces of software needed, including on Linux. ASM is fully functional and selfcontained without any other components.

    In order for the admin to provide a raw device to ASM or to the database, it has to have persistent device naming. If you booted up a server where a raw disk was named /dev/sdf and you give it to ASM (or even just create a tablespace without ASM on that device with datafile '/dev/sdf') and the next time you boot up that device is now /dev/sdg, you end up with an error. Just like you can't just change datafile names, you can't change device filenames without telling the database, or ASM. Persistent device naming on Linux, especially back in those days, was, to say it bluntly, a nightmare. In fact there were a number of issues (dating back to 2004) :

    Correction to the above: ASM can handle device name changes across reboots with the correct ASM_DISKSTRING in the init.ora, it will be able to find the disks even if they changed, however part of device naming and device metadata on reboot is the correct ownership (oracle:dba ....). With ASMLib in place, this is not an issue and it will take care of ownership and permissions of the ASM disk devices.

  • Linux async IO wasn't pretty
  • persistent device naming including permissions (had to be owned by oracle and the dba group) was very, very difficult to manage
  • system resource usage in terms of open file descriptors
  • So given the above, we tried to find a way to make this easier on the admins, in many ways, similar to why we started working on OCFS a few years earlier -> how can we make life easier for the admins on Linux.

    A feature of Oracle ASM is the ability for third parties to write an extension using what's called ASMLib. It is possible for any third party OS or storage vendor to write a library using a specific Oracle defined interface that gets used by the ASM instance and by the database instance when available. This interface offered 2 components :

  • Define an IO interface - allow any IO to the devices to go through ASMLib
  • Define device discovery - implement an external way of discovering, labeling devices to provide to ASM and the Oracle database instance
  • This is similar to a library that a number of companies have implemented over many years called libODM (Oracle Disk Manager). ODM was specified many years before we introduced ASM and allowed third party vendors to implement their own IO routines so that the database would use this library if installed and make use of the library open/read/write/close,.. routines instead of the standard OS interfaces. PolyServe back in the day used this to optimize their storage solution, Veritas used (and I believe still uses) this for their filesystem. It basically allowed, in particular, filesystem vendors to write libraries that could optimize access to their storage or filesystem. So ASMLib was not something new, it was basically based on the same model. You have libodm for just database access, you have libasm for asm/database access.

    Since this library interface existed, we decided to do a reference implementation on Linux. We wrote an ASMLib for Linux that could be used on any Linux platform and other vendors could see how this worked and potentially implement their own solution. As I mentioned earlier, ASMLib and ODMLib are libraries for third party extensions. ASMLib for Linux, since it was a reference implementation implemented both interfaces, the storage discovery part and the IO part. There are 2 components :

  • Oracle ASMLib - the userspace library with config tools (a shared object and some scripts)
  • oracleasm.ko - a kernel module that implements the asm device for /dev/oracleasm/*
  • The userspace library is a binary-only module since it links with and contains Oracle header files, but it is generic: we only have one asm library for the various Linux platforms. This library is opened by Oracle ASM and by Oracle database processes, and it interacts with the OS through the asm device (/dev/oracleasm). It can install on Oracle Linux, on SuSE SLES, on Red Hat RHEL,.. The library itself doesn't actually care much about the OS version; the kernel module and device do. The support tools are simple scripts that allow the admin to label devices and scan for disks and devices. This way you can, say, create an ASM disk label foo on what is currently /dev/sdf. If /dev/sdf disappears and next time it is /dev/sdg, we just scan for the label foo, discover it as /dev/sdg, and life goes on without any worry. Also, when the database needs access to the device, we don't have to worry about file permissions or anything; it will be taken care of. So it's a convenience thing.

    Correction: the extra advantage with ASMLib here being the fact that it will take care of the file permissions and ownership of the device.

    The kernel module oracleasm.ko is a Linux kernel module/device driver. It implements a device /dev/oracleasm/* and any and all IO goes through ASMLib -> /dev/oracleasm. This kernel module is obviously a very specific Oracle related device driver but it was released under the GPL v2 so anyone could easily build it for their Linux distribution kernels.

    Advantages for using ASMLib :

  • A good async IO interface for the database, the entire IO interface is based on an optimal ASYNC model for performance
  • A single file descriptor per Oracle process, not one per device or datafile per process reducing # of open filehandles overhead
  • Device scanning and labeling built-in so you do not have to worry about messing with udev or devlabel, permissions or the likes which can be very complex and error prone.
  • Just like with OCFS and OCFS2, each kernel version (major or minor) has to get a new version of the device drivers. We started out building the oracleasm kernel module rpms for many distributions, SLES (in fact in the early days still even for this thing called United Linux) and RHEL. The driver didn't make sense to get pushed into upstream Linux because it's unique and specific to the Oracle database.

    As it takes a huge effort in terms of build infrastructure and QA and release management to build kernel modules for every architecture, every linux distribution and every major and minor version we worked with the vendors to get them to add this tiny kernel module to their infrastructure. (60k source code file). The folks at SuSE understood this was good for them and their customers and us and added it to SLES. So every build coming from SuSE for SLES contains the oracleasm.ko module. We weren't as successful with other vendors so for quite some time we continued to build it for RHEL and of course as we introduced Oracle Linux end of 2006 also for Oracle Linux. With Oracle Linux it became easy for us because we just added the code to our build system and as we churned out Oracle Linux kernels whether it was for a public release or for customers that needed a one off fix where they also used asmlib, we didn't have to do any extra work it was just all nicely integrated.

    With the introduction of Oracle Linux's Unbreakable Enterprise Kernel and our interest in being able to exploit ASMLib more, we started working on a very exciting project called Data Integrity. Oracle (Martin Petersen in particular) worked for many years with the T10 standards committee and storage vendors and implemented Linux kernel support for DIF/DIX, data protection in the Linux kernel (note to those that wonder: yes, it's all in mainline Linux and under the GPL). This basically gave us all the features in the Linux kernel to checksum a data block, send it to the storage adapter, which can then validate that block and checksum in firmware before it sends it over the wire to the storage array, which can then do another checksum, and on to the actual DISK, which does a final validation before writing the block to the physical media. So what was missing was the ability for a userspace application (read: Oracle RDBMS) to write a block which then has a checksum and validation all the way down to the disk: application to disk.

    Because we have ASMLib we had an entry into the Linux kernel and Martin added support in ASMLib (kernel driver + userspace) for this functionality. Now, this is all based on relatively current Linux kernels, the oracleasm kernel module depends on the main kernel to have support for it so we can make use of it. Thanks to UEK and us having the ability to ship a more modern, current version of the Linux kernel we were able to introduce this feature into ASMLib for Linux from Oracle. This combined with the fact that we build the asm kernel module when we build every single UEK kernel allowed us to continue improving ASMLib and provide it to our customers.

    So today, we (Oracle) provide Oracle ASMLib for Oracle Linux and in particular on the Unbreakable Enterprise Kernel. We did the build/testing/delivery of ASMLib for RHEL until RHEL5, but as of RHEL6 we decided that it was too much effort for us to also maintain all the build and test environments for RHEL. We also did not have the ability to use the latest kernel features to introduce the Data Integrity features, and we didn't want to end up with multiple versions of asmlib as maintained by us. SuSE SLES still builds and comes with the oracleasm module and they do all the work, and RHAT is certainly welcome to do the same. They don't have to rebuild the userspace library, it's really about the kernel module.

    And finally to re-iterate a few important things :

  • Oracle ASM does not in any way require ASMLib to function completely. ASMLib is a small set of extensions, in particular to make device management easier, but there are no extra features exposed through Oracle ASM with ASMLib enabled or disabled. Often customers confuse ASMLib with ASM. Again, ASM exists on every Oracle supported OS and on every supported Linux OS (SLES, RHEL, OL) without ASMLib
  • Oracle ASMLib userspace is available from OTN and the kernel module is shipped along with OL/UEK for every build and by SuSE for SLES for every one of their builds
  • ASMLib kernel module was built by us for RHEL4 and RHEL5 but we do not build it for RHEL6, nor for the OL6 RHCK kernel. Only for UEK
  • ASMLib for Linux is/was a reference implementation for any third party vendor to be able to offer, if they want to, their own version for their own OS or storage
  • ASMLib as provided by Oracle for Linux continues to be enhanced and evolve and for the kernel module we use UEK as the base OS kernel
  • hope this helps.

    Wednesday Jul 04, 2012

    What's up with OCFS2?

    On Linux there are many filesystem choices and even from Oracle we provide a number of filesystems, all with their own advantages and use cases. Customers often confuse ACFS with OCFS or OCFS2 which then causes assumptions to be made such as one replacing the other etc... I thought it would be good to write up a summary of how OCFS2 got to where it is, what we're up to still, how it is different from other options and how this really is a cool native Linux cluster filesystem that we worked on for many years and is still widely used.

    Work on a cluster filesystem at Oracle started many years ago, in the early 2000's when the Oracle Database Cluster development team wrote a cluster filesystem for Windows that was primarily focused on providing an alternative to raw disk devices and help customers with the deployment of Oracle Real Application Cluster (RAC). Oracle RAC is a cluster technology that lets us make a cluster of Oracle Database servers look like one big database. The RDBMS runs on many nodes and they all work on the same data. It's a Shared Disk database design. There are many advantages doing this but I will not go into detail as that is not the purpose of my write up. Suffice it to say that Oracle RAC expects all the database data to be visible in a consistent, coherent way, across all the nodes in the cluster. To do that, there were/are a few options : 1) use raw disk devices that are shared, through SCSI, FC, or iSCSI 2) use a network filesystem (NFS) 3) use a cluster filesystem(CFS) which basically gives you a filesystem that's coherent across all nodes using shared disks. It is sort of (but not quite) combining option 1 and 2 except that you don't do network access to the files, the files are effectively locally visible as if it was a local filesystem.

    So OCFS (Oracle Cluster FileSystem) on Windows was born. Since Linux was becoming a very important and popular platform, we decided that we would also make this available on Linux, and thus the porting of OCFS/Windows started. The first version of OCFS was really primarily focused on replacing the use of raw devices with a simple filesystem that lets you create files and provides direct IO to these files to get basically native raw disk performance. The filesystem was not designed to be fully POSIX compliant and it did not have anywhere near decent performance for regular file create/delete/access operations. Cache coherency was easy since it was basically always direct IO down to the disk device: any time one issued a write() call it would go directly down to the disk and not return until the write was completed. The same went for read(): any read from a datafile would go all the way to disk and return. We did not cache any data when it came to Oracle data files.
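
    To make the "write() does not return until the data is on disk" behaviour a bit more concrete, here is a minimal user-space sketch in Python. It is not OCFS code, just an illustration of the same semantics using the standard open(2) flags. The helper name write_uncached is made up for this example, and the sketch assumes a POSIX system: it tries O_DIRECT (bypass the page cache) and falls back to O_SYNC on filesystems that reject O_DIRECT.

```python
import mmap
import os
import tempfile


def write_uncached(path, payload):
    """Write payload so that os.write() blocks until the data has been sent to the device.

    Tries O_DIRECT first (bypasses the page cache) and falls back to O_SYNC
    (goes through the cache, but still blocks until the data is written out).
    Returns the name of the mode that worked. Hypothetical helper, for illustration.
    """
    block = mmap.PAGESIZE
    # O_DIRECT wants an aligned buffer and a length that is a multiple of the
    # block size; an anonymous mmap is page aligned, so pad up to whole pages.
    padded_len = max(block, -(-len(payload) // block) * block)
    buf = mmap.mmap(-1, padded_len)
    buf[: len(payload)] = payload

    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    modes = []
    if hasattr(os, "O_DIRECT"):  # not available on every platform
        modes.append(("O_DIRECT", os.O_DIRECT))
    modes.append(("O_SYNC", os.O_SYNC))

    try:
        for name, flag in modes:
            try:
                fd = os.open(path, base | flag, 0o600)
            except OSError:
                continue  # e.g. a filesystem that rejects O_DIRECT at open time
            try:
                os.write(fd, buf)
                return name
            except OSError:
                continue  # the write was rejected in this mode, try the next one
            finally:
                os.close(fd)
        raise OSError("neither O_DIRECT nor O_SYNC write succeeded")
    finally:
        buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "datafile")
        print("mode used:", write_uncached(target, b"one database block"))
```

    Note that the file ends up padded with zero bytes to a whole page, because direct IO works in whole blocks. A database does the same thing naturally, since its data files are made of fixed-size blocks.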

    So while OCFS worked well for that, it did not have much of a normal filesystem feel, so it was not something that could be submitted to the kernel mailing list for inclusion into Linux as another native Linux filesystem (setting aside the Windows porting code ...). It did its job well: it was very easy to configure, node membership was simple, and locking was disk based (so very slow, but it existed). You could create regular files and do regular filesystem operations to a certain extent, but anything that was not database data file related was just not very useful in general. Logfiles OK, standard filesystem use, not so much. Up to this point, all the work was done at Oracle, by Oracle developers.

    Once OCFS (1) had been out for a while and there was a lot of use in the database RAC world, many customers wanted to do more and were asking for features that you'd expect in a normal native filesystem, a real "general purpose cluster filesystem". So the team sat down and basically started from scratch to implement what's now known as OCFS2 (Oracle Cluster FileSystem release 2). Some basic criteria were:

  • Design it with a real Distributed Lock Manager and use the network for lock negotiation instead of the disk
  • Make it a Linux native filesystem instead of a native shim layer and a portable core
  • Be POSIX compliant and fully cache coherent for all operations
  • Support all the filesystem features Linux offers (ACLs, extended attributes, quotas, sparse files, ...)
  • Be modern: support large files, 32/64-bit, journaling and data ordered journaling, and be endian neutral, so we can mount on both endians and across architectures, ...

    Needless to say, this was a huge development effort that took many years to complete. A few big milestones happened along the way...

  • OCFS2 was developed in the open. We did not have a private tree that we worked on without external code review from the Linux filesystem maintainers. Great folks like Christoph Hellwig reviewed the code regularly to make sure we were not doing anything out of line, and we submitted the code for review on lkml a number of times to see if we were getting close to having it included in the mainline kernel. Using this development model is standard practice for anyone who wants to write code that goes into the kernel and have any chance of doing so without a complete rewrite or, shall I say, a flamefest when submitted. It saved us a tremendous amount of time by not having to re-fit code for it to be in a Linus-acceptable state. Some other filesystems that were trying to get into the kernel and didn't follow an open development model had a much harder time and got a lot harsher criticism.
  • In March 2006, when Linus released 2.6.16, OCFS2 officially became part of the mainline Linux kernel tree as one of the many filesystems. It had been accepted a little earlier, in the release candidates, but 2.6.16 made it official. It was the first cluster filesystem to make it into the kernel tree. Our hope was that it would then end up getting picked up by the distribution vendors to make it easy for everyone to have access to a CFS. Today the source code for OCFS2 is approximately 85,000 lines of code.
  • We released OCFS2 for production use, with full support for customers that ran Oracle Database on Linux, no extra or separate support contract needed. OCFS2 1.0.0 started being built for RHEL4 for x86, x86-64, ppc, s390x and ia64. Builds for RHEL5 started with OCFS2 1.2.
  • SuSE was very interested in high availability and clustering and decided to build and include OCFS2 with SLES9 for their customers and was, next to Oracle, the main contributor to the filesystem for both new features and bug fixes.
  • Source code was always available, even prior to inclusion into mainline, and as of 2.6.16 the source code was just part of a Linux kernel download from kernel.org, which it still is today. So the latest OCFS2 code is always in the upstream mainline Linux kernel.
  • OCFS2 is the cluster filesystem used in Oracle VM 2 and Oracle VM 3 as the virtual disk repository filesystem.
  • Since the filesystem is in the Linux kernel, it's released under the GPL v2.
  • The release model has always been that new feature development happened in the mainline kernel and we then built consistent, well tested snapshots that had version numbers: 1.2, 1.4, 1.6, 1.8. But these releases were effectively just snapshots in time that were tested for stability and release quality.

    OCFS2 is very easy to use: there's a simple text file that contains the node information (hostname, node number, cluster name) and a file that contains the cluster heartbeat timeouts. It is very small, and very efficient. As Sunil Mushran wrote in the manual:

  • OCFS2 is an efficient, easily configured, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system.

    Here is a list of some of the important features that are included:

  • Variable Block and Cluster sizes: Supports block sizes ranging from 512 bytes to 4 KB and cluster sizes ranging from 4 KB to 1 MB (increments in power of 2).
  • Extent-based Allocations: Tracks the allocated space in ranges of clusters making it especially efficient for storing very large files.
  • Optimized Allocations: Supports sparse files, inline-data, unwritten extents, hole punching and allocation reservation for higher performance and efficient storage.
  • File Cloning/snapshots: REFLINK is a feature which introduces copy-on-write clones of files in a cluster coherent way.
  • Indexed Directories: Allows efficient access to millions of objects in a directory.
  • Metadata Checksums: Detects silent corruption in inodes and directories.
  • Extended Attributes: Supports attaching an unlimited number of name:value pairs to the file system objects like regular files, directories, symbolic links, etc.
  • Advanced Security: Supports POSIX ACLs and SELinux in addition to the traditional file access permission model.
  • Quotas: Supports user and group quotas.
  • Journaling: Supports both ordered and writeback data journaling modes to provide file system consistency in the event of power failure or system crash.
  • Endian and Architecture neutral: Supports a cluster of nodes with mixed architectures. Allows concurrent mounts on nodes running 32-bit and 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64) architectures.
  • In-built Cluster-stack with DLM: Includes an easy to configure, in-kernel cluster-stack with a distributed lock manager.
  • Buffered, Direct, Asynchronous, Splice and Memory Mapped I/Os: Supports all modes of I/Os for maximum flexibility and performance.
  • Comprehensive Tools Support: Provides a familiar EXT3-style tool-set that uses similar parameters for ease-of-use.

    The filesystem was distributed for Linux distributions in separate RPM form and this had to be built for every single kernel errata release or every updated kernel provided by the vendor. We provided builds from Oracle for Oracle Linux and all kernels released by Oracle, and for Red Hat Enterprise Linux. SuSE provided the modules directly for every kernel they shipped. With the introduction of the Unbreakable Enterprise Kernel for Oracle Linux and our interest in reducing the overhead of building filesystem modules for every minor release, we decided to make OCFS2 available as part of UEK. There was no more need for separate kernel modules; everything was built in and a kernel upgrade automatically updated the filesystem, as it should. UEK meant we no longer had to backport new upstream filesystem code into an older kernel version. Backporting features into older versions introduces risk and requires extra testing because the code is basically partially rewritten. The UEK model works really well for continuing to provide OCFS2 without that extra overhead.
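
    Going back to the feature list above for a moment, the block and cluster size ranges are easy to spell out. The short Python sketch below just enumerates the sizes the manual describes (512 bytes to 4 KB for blocks, 4 KB to 1 MB for clusters, in powers of 2) and shows how many clusters a file of a given size would occupy. The function names are made up for this example, and treating block sizes as powers of 2 as well is an assumption based on how the quoted sentence reads.

```python
KB = 1024


def powers_of_two(low, high):
    """Return every size from low to high inclusive, doubling each step."""
    sizes = []
    size = low
    while size <= high:
        sizes.append(size)
        size *= 2
    return sizes


# Ranges as quoted from the manual in the feature list.
BLOCK_SIZES = powers_of_two(512, 4 * KB)          # 512 bytes .. 4 KB
CLUSTER_SIZES = powers_of_two(4 * KB, 1024 * KB)  # 4 KB .. 1 MB


def is_valid_geometry(block_size, cluster_size):
    """True if both sizes fall in the documented ranges (hypothetical helper)."""
    return block_size in BLOCK_SIZES and cluster_size in CLUSTER_SIZES


def clusters_needed(file_size, cluster_size):
    """Number of clusters needed to hold file_size bytes, rounding up."""
    return -(-file_size // cluster_size)


if __name__ == "__main__":
    print("block sizes:  ", BLOCK_SIZES)
    print("cluster sizes:", CLUSTER_SIZES)
    one_gb = 1024 * 1024 * KB
    for cluster_size in (4 * KB, 1024 * KB):
        print(cluster_size, "byte clusters for a 1 GB file:",
              clusters_needed(one_gb, cluster_size))
```

    The last loop is the reason larger cluster sizes suit very large files such as database files or virtual disk images: the same file is covered by far fewer allocation units, so there is less to track.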

    Because the RHEL kernel did not contain OCFS2 as a kernel module (it is in the source tree but it is not built by the vendor in kernel module form), we stopped adding the extra packages to Oracle Linux and its RHEL compatible kernel, and for RHEL. Oracle Linux customers/users obviously get OCFS2 included as part of the Unbreakable Enterprise Kernel, SuSE customers get it from SuSE as part of SLES, and Red Hat can decide to distribute OCFS2 to their customers if they choose to, as it's just a matter of compiling the module and making it available.

    OCFS2 today, in the mainline kernel, is pretty much feature complete in terms of integration with every filesystem feature Linux offers, and it is still actively maintained, with Joel Becker being the primary maintainer. Since we use OCFS2 as part of Oracle VM, we continue to look at interesting new functionality to add (REFLINK was a good example), and as such we continue to enhance the filesystem where it makes sense. Bug fixes and any sort of code that goes into the mainline Linux kernel and affects filesystems automatically also apply to OCFS2. So it's in the kernel and actively maintained, but there is not a lot of new development happening at this time. We continue to fully support OCFS2 as part of Oracle Linux and the Unbreakable Enterprise Kernel, and other vendors make their own decisions on support, as it's really a Linux cluster filesystem now more than something that we provide to customers. It really just is part of Linux, like EXT3 or BTRFS etc.; the OS distribution vendors decide.

    Do not confuse OCFS2 with ACFS (ASM Cluster Filesystem), also known as Oracle Cloud Filesystem. ACFS is a filesystem that's provided by Oracle on various OS platforms and really integrates into Oracle ASM (Automatic Storage Management). It's a very powerful cluster filesystem but it's not distributed as part of the operating system; it's distributed with the Oracle Database product and installs with and lives inside Oracle ASM. ACFS obviously is fully supported on Linux (Oracle Linux, Red Hat Enterprise Linux), but OCFS2, independently, as a native Linux filesystem, is also supported and will continue to be. ACFS is very much tied into the Oracle RDBMS; OCFS2 is just a standard native Linux filesystem with no ties into Oracle products. Customers running the Oracle database and ASM really should consider using ACFS as it also provides storage/clustered volume management. Customers wanting a simple, easy to use, generic Linux cluster filesystem should consider using OCFS2.

    To learn more about OCFS2 in detail, you can find good documentation on http://oss.oracle.com/projects/ocfs2 in the Documentation area, or get the latest mainline kernel from http://kernel.org and read the source.

    One final, unrelated note: since I am not always able to publicly answer or respond to comments, I do not want to selectively publish comments from readers. Sometimes I forget to publish comments, sometimes I publish them, and sometimes I would publish them but for some reason cannot publicly comment on them, and then it becomes a very one-sided stream. So for now I am going to not publish comments from anyone, to be fair to all sides. You are always welcome to email me and I will do my best to respond to technical questions. Questions about strategy or direction are sometimes not possible to answer, for obvious reasons.


    Wim Coekaerts is the Senior Vice President of Linux and Virtualization Engineering for Oracle. He is responsible for Oracle's complete desktop to data center virtualization product line and the Oracle Linux support program.

    You can follow him on Twitter at @wimcoekaerts

