
An Oracle blog about Exadata

  • March 11, 2009

Provisioning your GRID with Oracle VM Templates

Rene Kundersma
Software Engineer
Introduction (Chapter 1)



Linux node installation and configuration (virtualized or not) for an Oracle Grid environment can be done in various ways. Of course, one could do all of this manually, but for larger environments that quickly becomes undoable.




Also, you want to make sure each installation has the same specifications, and you want to reduce the human errors that may occur during installation to a minimum.




This blog entry is divided into chapters that describe all the details of an automated Oracle VM cloning process.




The setup described below is used to prepare education environments. It will also work for proof of concept environments, and most parts of it may even be usable in your own Grid deployment strategy.




The setup described allows you to build Grid environments that students can use to learn, for instance, how to install RAC, configure Data Guard, or work with Enterprise Manager Grid Control. It can also be used to teach students how to work with Swingbench or FCF, all within their own infrastructure.




This virtualized solution helps to quickly set up, repair, catch up, restore and adapt the environment.
It will save your IT department costs on hardware and storage, and it will save you lots of time.




The pictures on this page are best viewed with Firefox.




Bare metal provisioning




Within the Oracle Grid, Oracle Enterprise Manager Grid Control release 10.2.0.4 with kickstart and PXE-boot is used more often these days as a way to do a so-called "bare metal" installation of the OS:
kix.gif




After this bare metal installation, "post configuration scripts" took care of the node-specific settings.




Even with the use of Oracle Virtual Machines on top of such a node, the kickstart procedure can still be used; without too much effort a PXE-boot configuration for virtualized guests can be set up.




This way of "bare metal installation" or better "virtual metal installation" by PXE-boot for VM Guests is a nice solution, which I will describe one day. But why would one do a complete installation for each VM while each VM differs only on a couple of configuration files ?




This blog entry explains how to use an Oracle VM template to provision Virtual Guest Operating Systems for use in a Grid situation.




For educational purposes, where classes with a lot of students each have to work with their own Grid environment, a procedure was worked out to provision a blade system with operating systems and software, Grid ready, all based on Oracle templates.




As said, more options are possible; this is how my solution works, and it may work for you as well.
1. An example OS configuration is provided (node-specific configuration files). From those template files a VM Guest-specific configuration is generated automatically. This configuration describes settings such as hostname, IP numbers, etc.

2. A VM template (image) is provided.




By automating the two steps above, one can easily and quickly set up virtualized Oracle Linux nodes, ready for RAC.




The next chapter will be about the configuration templates and the cloning process.




The process (Chapter 2)




With the configuration templates described earlier, "configuration clones" can be made.
In this example I am using HP blade technology. On each blade six VMs will be running.
For each blade, and for each VM running on top of it, the configuration files are generated.




It makes sense to define configuration templates.
With the use of scripts you can take these templates and generate configuration files for each specific VM.
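A minimal sketch of such a generation step, assuming a simple @PLACEHOLDER@ convention in the template files (the placeholder names and the template file name are mine, not the actual scripts used here; the target path and values follow the conventions shown later in Chapter 6):

# sketch: generate a VM-specific ifcfg-eth0 from a template with sed
# (@IPADDR@/@HWADDR@ placeholders and ifcfg-eth0.tmpl are assumptions)
sed -e "s/@IPADDR@/192.168.200.173/" \
    -e "s/@HWADDR@/00:16:3E:7D:03:01/" \
    ifcfg-eth0.tmpl > /OVS_shared_large/conf/nlhpblade07/GRIDNODE03/ifcfg-eth0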




With a VM template in one hand and an automatically generated set of configuration files in the other, you can quickly build, or rebuild, the infrastructure over and over again.




Even if you need to make changes that affect all VMs, they can be rolled out quite quickly.




As said, this solution is extremely useful for educational purposes, or for situations where you have to provide lots of VM guests that are ready to be used instantly. Other possible uses are proof of concept environments.




In short, the workflow of the cloning process looks like this:
1. A default virtual machine image is copied over.

2. Configuration files for the VM are generated, based upon the blade number, the VM number and the purpose of the VM.

3. The VM image is "mounted" and configuration files are overwritten with the generated configuration files. Binaries (other programs) are also put in place.

4. The VM image is unmounted and, if needed, "file based shared storage" is created.

5. The VM boots for the first time, ready to use immediately, fully pre-configured.





The concept itself can of course also be used for the Linux provisioning of your virtualized infrastructure as an alternative to bare metal provisioning.




The next chapter will describe the hardware used and the chosen storage solutions for this example.




Hardware used (Chapter 3)




As discussed in the previous chapter, this project is built on HP blade technology.




The solution described is of course independent of the hardware chosen.




However, in order to describe the complete setup, this chapter covers the hardware used.




blade01.JPG




This blade enclosure (C3000) has eight blades; each blade has:

- two NICs (Broadcom)

- two HBAs (QLogic)

- 16 GB of RAM

- two quad-core Intel Xeon processors





Storage is made available to the blades via NFS and Fibre Channel.




The NFS share is used to provide the VM template that will be used as source.




The same NFS share is also available to the VM guests in order to provide the guests the option to install software from a shared location.




The SAN storage comes from an HP MSA. The MSA devices are used for OCFS2; this is where the VM image files will be placed.




Each blade is reachable via a public network interface.




Also, a private network is set up as the OCFS2 interconnect network between the blades.




For each blade the architecture is equal to the diagram below.




blade02.jpg




VM distribution (Chapter 4)




As said in an earlier chapter, each blade has 16 GB of RAM, which is enough to run at least six VMs of 2 GB RAM each.




The purpose is to have:

- 3 VMs for Real Application Clusters (RAC) (11.1.0.7 CRS/ASM/RDBMS)

- 1 VM for Data Guard (11.1.0.7 ASM/RDBMS)

- 1 VM to run Swingbench and demo applications

- 1 VM to run Enterprise Manager Grid Control (EMGC).





This will look like this:

blade03.jpg




As each blade has 146 GB of local storage, there is room to keep some VMs on local disks. Since there is no intention to live-migrate these nodes, they can be put on a non-shared location.




VM number six (EMGC) is too big to fit next to the other VMs on local storage.
For this reason a shared OCFS2 mount is made.




Each VM uses the location Oracle VM provides for VMs (/OVS/running_pool).
With symbolic links the storage for the EMGC VM is redirected to the OCFS2 shared disk: GRIDNODE09 -> /OVS_shared_large/running_pool/oemgc/nlhpblade07
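For illustration, such a link could be created like this (a sketch; the actual commands used are not shown in this post):

cd /OVS/running_pool
# point the VM directory at its real location on the shared OCFS2 volume
ln -s /OVS_shared_large/running_pool/oemgc/nlhpblade07 GRIDNODE09
ls -ld GRIDNODE09    # GRIDNODE09 -> /OVS_shared_large/running_pool/oemgc/nlhpblade07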




By default OCFS2 allows four nodes to mount an OCFS2 filesystem concurrently. In order to mount the OCFS2 filesystem on all blades at the same time you have to specify the -N X argument when executing mkfs, where X is the maximum number of nodes that will ever mount the OCFS2 filesystem concurrently.



mkfs.ocfs2 -b 4K -C 32K -N 8 -L ovmsdisk /dev/sdb1
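Before the filesystem can be mounted on all blades, each blade needs its own node: stanza in /etc/ocfs2/cluster.conf and the o2cb cluster stack must be online. A minimal sketch of the remaining steps on one blade (the mount point is an assumption, not taken from this post):

service o2cb online                          # bring the OCFS2 cluster stack online
mkdir -p /OVS_shared_large                   # assumed mount point for the shared volume
mount -t ocfs2 /dev/sdb1 /OVS_shared_large
df -h /OVS_shared_large                      # verify the shared volume is mounted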





PV Templates (Chapter 5)




Before doing any specific VM changes, first a template is chosen, in this case Oracle Enterprise Linux 5 update 2 (OEL5U2).




This is an Oracle VM template downloaded from OTN.




Our template is a para-virtualized template, based on a 32-bit architecture.




To remind you, this is how the para-virtualized architecture looks:
blade04.jpg




By the way, para-virtualized kernels often work faster than hardware-virtualized guests.




Please see this link for more information on hardware- vs. para-virtualized guests.




As part of the procedure described, the template will be copied over to each blade six times. In order to use the VMs on a specific blade for a specific purpose, configuration files must be created. The next chapter describes how this works.




VM Specific files and clone procedure (Chapter 6)




Each virtualized guest has a small set of configuration files that are specific to that OS.
Typically these files exist both outside the guest (vm.cfg) and inside the guest.




Specific files inside the vm:

- /etc/sysconfig/network-scripts/ifcfg-eth*

- /etc/sysconfig/network

- ssh configuration files



Specific files outside the vm:

- vm.cfg



For VMs running on the same blade (and being part of the same 'grid') there are also files in common:

- nsswitch.conf

- resolv.conf

- sudoers

- sysctl.conf

- hosts



The files mentioned above need to be changed, because each machine needs its own NICs with specific MAC addresses and its own IP numbers.




Of course, within a grid (on a blade) each VM has to have a unique name.




In order to make sure unique MAC addresses will be generated, one has to set up standards.




For the MAC addresses, the following formula is used: 00:16:3E:XD:0Y:0Z, where:

X: the number of the blade

Y: the number of the VM

Z: the number of the NIC within that VM.



Host names will be used multiple times (but not within the same grid); the only things that need to change are the corresponding IP numbers, which must be unique across the grids.


For example, the MAC address for the second NIC on the third VM on blade 7 would look like: HWADDR=00:16:3E:7D:03:02


The same strategy is used to determine the IP numbers to be used:

- For the public network 192.168.200.1XY is used.

- For the internal network 10.0.0.1XY is used.

- For the VIP 192.168.200.XY is used.



Where:

X: the number of the Blade

Y: the number of the VM



For example:

- the public IP number of node 3 on blade 7 would be: 192.168.200.173

- the private IP number of node 3 on blade 7 would be: 10.0.0.173

- the virtual IP number of node 3 on blade 7 would be: 192.168.200.73
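Putting the MAC and IP conventions together, the addressing for any blade/VM combination can be derived with a few lines of shell (a sketch; the variable names are mine):

# sketch: derive addresses from the blade, VM and NIC numbers
BLADE=7   # X: the number of the blade
VM=3      # Y: the number of the VM
NIC=2     # Z: the number of the NIC within that VM

MAC="00:16:3E:${BLADE}D:0${VM}:0${NIC}"     # -> 00:16:3E:7D:03:02
PUB_IP="192.168.200.1${BLADE}${VM}"         # -> 192.168.200.173
PRIV_IP="10.0.0.1${BLADE}${VM}"             # -> 10.0.0.173
VIP="192.168.200.${BLADE}${VM}"             # -> 192.168.200.73
echo "$MAC $PUB_IP $PRIV_IP $VIP"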



So, from here, as long as you know for which blade and for which VM you will be generating the configuration, you can script that:

[root@nlhpblade07 tools]# ./clone_conf.sh nlhpblade01
Copying config files from /OVS_shared_large/conf/nlhpblade07 to /OVS_shared_large/conf/nlhpblade01...
Performing config changes specific to the blade and the VM...
# nlhpblade01 - GRIDNODE01 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE01/ifcfg-eth0
# nlhpblade01 - GRIDNODE01 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE01/ifcfg-eth1
# nlhpblade01 - GRIDNODE01 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE01/network
# nlhpblade01 - GRIDNODE01 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE01/vm.cfg
# nlhpblade01 - GRIDNODE02 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE02/ifcfg-eth0
# nlhpblade01 - GRIDNODE02 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE02/ifcfg-eth1
# nlhpblade01 - GRIDNODE02 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE02/network
# nlhpblade01 - GRIDNODE02 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE02/vm.cfg
# nlhpblade01 - GRIDNODE03 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE03/ifcfg-eth0
# nlhpblade01 - GRIDNODE03 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE03/ifcfg-eth1
# nlhpblade01 - GRIDNODE03 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE03/network
# nlhpblade01 - GRIDNODE03 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE03/vm.cfg
# nlhpblade01 - GRIDNODE04 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE04/ifcfg-eth0
# nlhpblade01 - GRIDNODE04 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE04/ifcfg-eth1
# nlhpblade01 - GRIDNODE04 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE04/network
# nlhpblade01 - GRIDNODE04 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE04/vm.cfg
# nlhpblade01 - GRIDNODE05 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE05/ifcfg-eth0
# nlhpblade01 - GRIDNODE05 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE05/ifcfg-eth1
# nlhpblade01 - GRIDNODE05 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE05/network
# nlhpblade01 - GRIDNODE05 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE05/vm.cfg
# nlhpblade01 - GRIDNODE09 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE09/ifcfg-eth0
# nlhpblade01 - GRIDNODE09 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE09/network
# nlhpblade01 - GRIDNODE09 - /OVS_shared_large/conf/nlhpblade01/GRIDNODE09/vm.cfg
Performing node common changes for the configuration files...
# nlhpblade01 - GRIDNODE09 - /OVS_shared_large/conf/nlhpblade01/common/cluster.conf
# nlhpblade01 - GRIDNODE09 - /OVS_shared_large/conf/nlhpblade01/common/hosts
[root@nlhpblade07 tools]#
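Following the addressing scheme above, a generated ifcfg-eth0 for GRIDNODE03 on nlhpblade01 would then look roughly like this (a reconstruction; the netmask and the exact set of keys are assumptions, since the generated files themselves are not shown in this post):

# /OVS_shared_large/conf/nlhpblade01/GRIDNODE03/ifcfg-eth0 (sketch)
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
HWADDR=00:16:3E:1D:03:01
IPADDR=192.168.200.113
NETMASK=255.255.255.0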

'mounting a vm' (Chapter 7)




Now that we have generated the node-specific configuration files and copied the base template, we are ready to modify the OS before even booting it. After 'mounting' the VM image file, the generated configuration will be copied into the VM.




As said, at this moment the VM is an image file, for example /OVS/running_pool/GRIDNODE01/system.img. Xen will set up a loop device in order to boot the OS from that image.




We do much the same in order to change the OS before we boot it:




First, the losetup command is used to associate a loop device with the file.
A loop device is a pseudo-device that makes a file accessible as a block device.
[root@nlhpblade07 GRIDNODE03]#  losetup /dev/loop9 system.img

Now that we have mapped the image file to a block device, we want to see the partitions on it.
For this we use the command kpartx, which creates device maps from partition tables.
kpartx is part of device-mapper-multipath.
[root@nlhpblade07 GRIDNODE03]# kpartx -a /dev/loop9

So, let's see what partitions device-mapper has for us:
[root@nlhpblade07 GRIDNODE03]# ls /dev/mapper/loop9*
/dev/mapper/loop9p1 /dev/mapper/loop9p2 /dev/mapper/loop9p3

kpartx found three partitions and told device-mapper there are three partitions available.
Let's see if we can identify their types:
[root@nlhpblade07 GRIDNODE03]# file -s /dev/mapper/loop9p1
/dev/mapper/loop9p1: Linux rev 1.0 ext3 filesystem data

This is probably the /boot partition of the vm.
[root@nlhpblade07 GRIDNODE03]# file -s /dev/mapper/loop9p2
/dev/mapper/loop9p2: LVM2 (Linux Logical Volume Manager) , UUID: t2SAm03KoxfUcCOS3OYmsXf9ubqcy9q

This may be the root or the swap partition.
[root@nlhpblade07 GRIDNODE03]# file -s /dev/mapper/loop9p3
/dev/mapper/loop9p3: LVM2 (Linux Logical Volume Manager) , UUID: j2U7KUWen1ePjDvm4hTclZvA5YJyvl9
This may also be the root or the swap partition.




So, in order to make a better guess in finding the root partition, let's see what the sizes are:
[root@nlhpblade07 GRIDNODE03]# fdisk -l /dev/mapper/loop9p2
Disk /dev/mapper/loop9p2: 13.8 GB, 13851371520 bytes
255 heads, 63 sectors/track, 1684 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mapper/loop9p2 doesn't contain a valid partition table
[root@nlhpblade07 GRIDNODE03]# fdisk -l /dev/mapper/loop9p3
Disk /dev/mapper/loop9p3: 5362 MB, 5362882560 bytes
255 heads, 63 sectors/track, 652 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/mapper/loop9p3 doesn't contain a valid partition table

As we can see, one partition is 5 GB and the other is 13 GB.
The best guess would be that the 5 GB partition is the swap and the 13 GB partition holds the OS.




With the command vgscan we can scan the newly 'discovered' 'disks' and search for volume groups on them:
[root@nlhpblade07 GRIDNODE03]# vgscan
Reading all physical volumes. This may take a while...
Found volume group "VolGroup00" using metadata type lvm2

vgdisplay says we have one volume group (VolGroup00):
[root@nlhpblade07 GRIDNODE03]# vgdisplay
--- Volume group ---
VG Name VolGroup00
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 5
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 0
Max PV 0
Cur PV 2
Act PV 2
VG Size 17.84 GB
PE Size 32.00 MB
Total PE 571
Alloc PE / Size 571 / 17.84 GB
Free PE / Size 0 / 0
VG UUID kmhYBm-Mpbv-usx2-vDur-rEVb-uP4i-kcP4fc

With the command vgchange -a y we can make the logical volumes available to the kernel.
[root@nlhpblade07 GRIDNODE03]# vgchange -a y VolGroup00
2 logical volume(s) in volume group "VolGroup00" now active

lvdisplay can be used to see the attributes of a logical volume:
[root@nlhpblade07 GRIDNODE03]# lvdisplay
--- Logical volume ---
LV Name /dev/VolGroup00/LogVol00
VG Name VolGroup00
LV UUID B13hk3-f5qY-3gDY-Ackt-13gK-DZDc-cTWx3V
LV Write Access read/write
LV Status available
# open 0
LV Size 14.72 GB
Current LE 471
Segments 2
Allocation inherit
Read ahead sectors 0
Block device 253:3
--- Logical volume ---
LV Name /dev/VolGroup00/LogVol01
VG Name VolGroup00
LV UUID iEO4oG-XPMU-syWF-qupo-811i-G6Gg-QZEw5f
LV Write Access read/write
LV Status available
# open 0
LV Size 3.12 GB
Current LE 100
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:4

So now we have found (and made available as a logical volume) the root filesystem the VM lives on. Now we can mount it:
[root@nlhpblade07 GRIDNODE03]# mkdir guest_local_LogVol00; 
[root@nlhpblade07 GRIDNODE03]# mount /dev/VolGroup00/LogVol00 guest_local_LogVol00

See the contents of the filesystem:
[root@nlhpblade07 GRIDNODE03]# cd guest_local_LogVol00/
[root@nlhpblade07 guest_local_LogVol00]# ls -la
total 224
drwxr-xr-x 26 root root 4096 Jan 14 2009 .
drwxr-xr-x 3 root root 4096 Oct 22 22:30 ..
-rw-r--r-- 1 root root 0 Jul 24 05:02 .autorelabel
drwxr-xr-x 2 root root 4096 Dec 20 2008 bin
drwxr-xr-x 2 root root 4096 Jun 6 11:26 boot
drwxr-xr-x 4 root root 4096 Jun 6 11:26 dev
drwxr-xr-x 94 root root 12288 Jan 14 2009 etc
drwxr-xr-x 3 root root 4096 Jun 6 11:50 home
drwxr-xr-x 14 root root 4096 Dec 20 2008 lib
drwx------ 2 root root 16384 Jun 6 11:26 lost+found
drwxr-xr-x 2 root root 4096 Apr 21 2008 media
drwxr-xr-x 2 root root 4096 May 22 09:51 misc
drwxr-xr-x 3 root root 4096 Dec 20 2008 mnt
dr-xr-xr-x 2 root root 4096 Jun 10 11:11 net
drwxr-xr-x 3 root root 4096 Aug 21 04:11 opt
-rw-r--r-- 1 root root 0 Jan 14 2009 poweroff
drwxr-xr-x 2 root root 4096 Jun 6 11:26 proc
drwxr-x--- 17 root root 4096 Jan 13 2009 root
drwxr-xr-x 2 root root 12288 Dec 20 2008 sbin
drwxr-xr-x 4 500 500 4096 Jan 14 2009 scratch
drwxr-xr-x 2 root root 4096 Jun 6 11:26 selinux
drwxr-xr-x 2 root root 4096 Apr 21 2008 srv
drwxr-xr-x 2 root root 4096 Jun 6 11:26 sys
drwxr-xr-x 3 root root 4096 Jun 6 11:33 tftpboot
drwxrwxrwt 9 root root 4096 Jan 14 2009 tmp
drwxr-xr-x 3 root root 4096 Dec 20 2008 u01
drwxr-xr-x 14 root root 4096 Jun 6 11:31 usr
drwxr-xr-x 21 root root 4096 Jun 6 11:37 var

Although this seems a rather easy way to mount a VM image file, it is still not something you want to do by hand for 40 VM images.




For this reason, the described solution is scripted and called mount_vm.sh.
This is how it works:
[root@nlhpblade07 GRIDNODE05]# mount_vm.sh GRIDNODE05
Starting mount...
contents /etc/sysconfig/network -file of mounted node:
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=gridnode05.nl.oracle.com
Generating unmount script
To unmount your image run /tmp/umount_GRIDNODE05.30992.sh as root
Mounting finished...

As you can see, the image is mounted by a script, and a script to unmount it is automatically generated. In order to verify that the right image file is mounted, the contents of the file /etc/sysconfig/network are shown.
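mount_vm.sh itself is not published in this post, but a minimal sketch that wraps the manual steps from this chapter could look like the following (the loop device, volume group name and paths follow the example above and are assumptions for any other image):

#!/bin/bash
# sketch of a mount_vm.sh-style wrapper, not the author's actual script
VM=$1                                        # e.g. GRIDNODE05
IMG=/OVS/running_pool/$VM/system.img
LOOP=/dev/loop9
MNT=/OVS/running_pool/$VM/guest_local_LogVol00

echo "Starting mount..."
losetup $LOOP $IMG                           # expose the image file as a block device
kpartx -a $LOOP                              # create /dev/mapper/loop9p* partitions
vgscan > /dev/null                           # discover the guest's volume group
vgchange -a y VolGroup00                     # activate its logical volumes
mkdir -p $MNT
mount /dev/VolGroup00/LogVol00 $MNT          # mount the guest root filesystem
echo "contents /etc/sysconfig/network -file of mounted node:"
cat $MNT/etc/sysconfig/network               # show which node was actually mounted

echo "Generating unmount script"
UMOUNT=/tmp/umount_$VM.$$.sh
cat > $UMOUNT <<EOF
#!/bin/bash
umount $MNT
vgchange -a n VolGroup00
kpartx -d $LOOP
losetup -d $LOOP
echo "Unmount finished"
EOF
chmod +x $UMOUNT
echo "To unmount your image run $UMOUNT as root"
echo "Mounting finished..."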




'changing a vm' (Chapter 8)




Now that the VM image is mounted on the filesystem, we can go back to the generated config files.
From here it is easy to copy all the specific configuration files into the VM.
Even better is to have a script available to do this, and that is done for this solution:
[root@nlhpblade07 GRIDNODE05]# change_vm.sh GRIDNODE05
If you are sure, hit Y or y to continue
Y
Continuing...
Starting config change for VM GRIDNODE05 on nlhpblade07...
Copying swingbench...
Changing ownership of swingbench files...
Copying FCF-Java Demo...
Changing ownership of FCF-Java Demo files...
This vm requires pre-build file /OVS/sharedDisk/rdbms_home_11r1_01_ocfs.img as shared Oracle RDBMS HOME
Finished changing config...
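change_vm.sh itself is not published here either; a minimal sketch of the copy step it could perform (the paths follow the conventions shown in the earlier chapters and are otherwise assumptions):

# sketch: copy the generated configuration into the mounted guest
VM=$1                                              # e.g. GRIDNODE05
BLADE=$(hostname -s)                               # e.g. nlhpblade07
CONF=/OVS_shared_large/conf/$BLADE/$VM
MNT=/OVS/running_pool/$VM/guest_local_LogVol00     # root fs mounted by mount_vm.sh

cp $CONF/ifcfg-eth* $MNT/etc/sysconfig/network-scripts/
cp $CONF/network $MNT/etc/sysconfig/network
cp $CONF/../common/hosts $MNT/etc/hosts
cp $CONF/vm.cfg /OVS/running_pool/$VM/vm.cfg       # vm.cfg lives outside the guest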

Now that the VM is modified internally but still mounted, it has to be unmounted.
This can be done by running the unmount script that was generated during the mount.
[root@nlhpblade07 GRIDNODE05]# /tmp/umount_GRIDNODE05.30992.sh
Unmount finished
If the unmount succeeded, you can remove this file
rm: remove regular file `/tmp/umount_GRIDNODE05.30992.sh'? y





All Together (Chapter 9)




In essence, the procedure described above should be repeated for each VM you want to clone and change, on each blade. This already saves you hours of work and reduces the chance of mistakes, but it may still seem like a lot of steps. Repeating this for each blade and for each VM is, from here, just a matter of scripting.




So, you could make a script that, for each blade, does the following:



Pseudo:
for each blade in blade list
do
    stop all VMs first
    while VMs still running
    do
        wait 10 seconds
    done
    restore all machines (from NFS)
    clone conf
    mount and change all VMs
    start all VMs
done
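A bash sketch of that loop, under the assumption that the VMs are managed with the xm command and that the restore and change steps are wrapped in helper scripts (restore_vms.sh and change_all_vms.sh are hypothetical names, not scripts from this post):

#!/bin/bash
# sketch only: blade list, VM names and the helper scripts are assumptions
BLADES="nlhpblade01 nlhpblade02 nlhpblade03 nlhpblade04 nlhpblade05 nlhpblade06 nlhpblade07"
VMS="GRIDNODE01 GRIDNODE02 GRIDNODE03 GRIDNODE04 GRIDNODE05 GRIDNODE09"

for BLADE in $BLADES
do
  ssh root@$BLADE "for VM in $VMS; do xm shutdown \$VM; done"   # stop all VMs first
  while ssh root@$BLADE "xm list | grep -q GRIDNODE"            # wait until they are down
  do
    sleep 10
  done
  ssh root@$BLADE "restore_vms.sh"          # hypothetical: copy fresh images back from NFS
  ./clone_conf.sh $BLADE                    # generate blade- and VM-specific config files
  ssh root@$BLADE "change_all_vms.sh"       # hypothetical: mount, change and unmount each VM
  ssh root@$BLADE "for VM in $VMS; do xm create /OVS/running_pool/\$VM/vm.cfg; done"
done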

For all 42 VM images, the implemented version of this procedure runs for about four hours. After that, a complete 42-node education environment is set up.


NLHPBLADE TA3-1.jpg


Extra Options (Chapter 10)




Besides changing configuration settings on a VM, as described in the 'changing a vm' chapter, other activities can also be performed.




For this solution the following options are also implemented:

- configure VNC

- configure OCFS2 within the guest and use a shared Oracle Home to save space

- copy and configure software (like an Oracle Home or Swingbench)

- create ASM disk files (see the sketch after this list)

- create OCR and voting disks

- configure sudoers

- provide software for EM Agent deployment

- configure ssh
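As an example of the ASM disk file option: a file-based disk can be created on the shared storage and attached to a guest through its vm.cfg (a sketch; the size, file name and device name are assumptions):

# sketch: create a 2 GB file-based ASM disk and attach it to a guest
dd if=/dev/zero of=/OVS/sharedDisk/asm_disk01.img bs=1M count=2048
# then add an entry such as the following to the 'disk' list in the guest's vm.cfg:
#   'file:/OVS/sharedDisk/asm_disk01.img,xvdb,w!'
# inside the guest the disk appears as /dev/xvdb and can be presented to ASM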




Rene Kundersma
Oracle Expert Services, The Netherlands


Comments ( 2 )
  • Shawn Zeng Thursday, March 12, 2009
    Hi Rene,
    This is a great blog entry, thanks for your practice sharing.
    I found 2 typos in chapter 6
    a. |For the MAC addresses, the following formula is used: 0:16:3E:XD:0Y:0Z, where:| i guess the
    macaddr should begin with '00'.
    b. |- the private ip-number of node 3 on blade7 would be: 10.0.0.171|, the ipaddr should be 10.0.0.173
  • Rene Friday, March 13, 2009
    Shawn,
    Thanks for your compliment.
    I will fix the two typo soon !
    Rene